Temporal Mining of Text Documents

 

This NSF-funded project studies methods for extracting temporal information from text documents. The project abstract can be found here.

 

Principal Investigator: Prof. Parvathi Chundi

                                     Computer Science Dept, Univ. of Nebraska at Omaha

 

Current Members:

 

 

Past Members:

 

 

Collaborators:

 

 

Project Goals and Challenges:

 

Our main goal is to discover temporal information from text documents. We assume that each document carries a publication or creation date that can be used to time stamp its contents. Identifying the keywords/topics that are relevant to a particular time period is a challenging task. We use the concept of a measure function to let the user attach a significance value to keywords/topics during a time period. However, measure functions add an extra layer of computation that must be carefully managed so that large document sets can be processed. A rudimentary prototype of the project is available here.
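
As a rough illustration of the measure-function idea, the sketch below assigns each keyword a significance value for a time interval. The measure used here (the fraction of documents in the interval that contain the keyword), the toy corpus, and all function names are assumptions chosen for illustration; they are not the measure functions defined in the project's publications.

```python
from collections import Counter
from datetime import date

# Hypothetical toy corpus: (publication date, set of keywords in the document).
DOCS = [
    (date(2001, 1, 5), {"merger", "energy"}),
    (date(2001, 1, 20), {"energy", "trading"}),
    (date(2001, 2, 3), {"audit", "energy"}),
    (date(2001, 2, 17), {"audit", "bankruptcy"}),
]


def document_frequency_measure(docs, start, end):
    """Return {keyword: fraction of documents dated in [start, end] containing it}.

    This is one assumed example of a measure function, not the project's own.
    """
    in_interval = [keywords for stamp, keywords in docs if start <= stamp <= end]
    if not in_interval:
        return {}
    # Keywords are sets, so each keyword is counted at most once per document.
    counts = Counter(kw for keywords in in_interval for kw in keywords)
    return {kw: n / len(in_interval) for kw, n in counts.items()}


def significant_keywords(docs, start, end, threshold=0.5):
    """Keywords whose measure value in the interval reaches the threshold."""
    scores = document_frequency_measure(docs, start, end)
    return sorted(kw for kw, score in scores.items() if score >= threshold)
```

Calling `significant_keywords(DOCS, date(2001, 1, 1), date(2001, 1, 31), threshold=0.75)` keeps only the keywords present in at least three quarters of the January documents. Because the measure is recomputed for every interval considered, its cost grows with both the number of intervals and the size of the document set, which is the extra computation the paragraph above refers to.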

 

We also face the challenge of interpreting the usefulness of what we compute. We are currently studying the document set to identify which information is pertinent during which time interval, and comparing those findings with the output of our system.

Current Results:

  1. We developed algorithms for constructing time decompositions of a document set that highlight the temporally significant topics/keywords. The algorithms are being incorporated into a research prototype, Chronoscope. Some screenshots of Chronoscope are available here.
  2. We showed how to use the temporal scan statistic to identify time periods with a burst of activity related to a topic. We developed a simple and efficient method to compute a hot spot of a given keyword. A hot spot of a keyword is a time interval with an unusually high presence of the keyword. We conducted several experiments using the titles of papers published in SIGMOD and VLDB conferences between 1975 and 2007. The hot spots obtained with the temporal scan statistic are the same as or coarser than those obtained by earlier methods. To the best of our knowledge, this is the first attempt to use the temporal scan statistic for the identification of hot spots from text documents.

 

  3. The item-set time series notion developed toward the end of the first year of the project turned out to be a powerful data abstraction. It helped us to model many types of data as a time series. We extended the traditional time series segmentation framework to include the item-set time series data abstraction by separating the segment difference computation from the segmentation construction computation. We developed efficient methods to compute the segment difference values required for the construction of an optimal segmentation. We conducted several experiments to study the effectiveness of the proposed methods using the Enron data set, a stock market data set, and a synthetic data set.

 

 

  4. The notion of item-set time series was also used to analyze version control repositories of several open source projects to identify time-varying patterns of developer activity. The experimental results show that the segmentation algorithm produces segments that capture meaningful information, and that their information content is superior to that obtained by arbitrarily segmenting software history into regular time intervals.

 

  5. We also developed a novel approach for temporally analyzing the communication patterns embedded in email data based on time series segmentation. The approach computes egocentric communication patterns of a single user, as well as sociocentric communication patterns involving multiple users, which we refer to as clique patterns. Time series segmentation is used to uncover patterns that may span multiple time points and to study how these patterns change over time. To find egocentric patterns, the email communication of a user is represented as an item-set time series. An optimal segmentation of the item-set time series is constructed, from which patterns are extracted. To find clique patterns involving multiple users, the email data is represented as an item-set group time series, which is then segmented.
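
The segmentation results above rest on two separate steps: computing a difference value for every candidate segment, and then constructing an optimal segmentation from those values. The sketch below illustrates that separation with a standard dynamic program. The segment difference used here (represent a segment by the union of its item sets and count the items each time point lacks) and all names are assumptions for illustration; the project's publications define their own measure functions and more efficient algorithms.

```python
def segment_differences(series):
    """Return diff, where diff[i][j] is the cost of merging points i..j (inclusive).

    Assumed cost: the segment is represented by the union of its item sets,
    and each time point contributes the number of union items it is missing.
    """
    n = len(series)
    diff = [[0] * n for _ in range(n)]
    for i in range(n):
        union = set()
        total = 0  # sum of item-set sizes for points i..j
        for j in range(i, n):
            union |= series[j]
            total += len(series[j])
            # Every item set is a subset of the union, so the missing-item
            # count summed over the segment is |union| * length - total.
            diff[i][j] = len(union) * (j - i + 1) - total
    return diff


def optimal_segmentation(series, k):
    """Split series into k contiguous segments with minimum total difference."""
    n = len(series)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(series)")
    diff = segment_differences(series)
    inf = float("inf")
    # best[s][j]: minimum cost of covering points 0..j with s + 1 segments.
    best = [[inf] * n for _ in range(k)]
    back = [[-1] * n for _ in range(k)]
    for j in range(n):
        best[0][j] = diff[0][j]
    for s in range(1, k):
        for j in range(s, n):
            for i in range(s, j + 1):  # the last segment covers points i..j
                cost = best[s - 1][i - 1] + diff[i][j]
                if cost < best[s][j]:
                    best[s][j] = cost
                    back[s][j] = i
    # Walk the back-pointers to recover the segment boundaries.
    segments = []
    j = n - 1
    for s in range(k - 1, 0, -1):
        i = back[s][j]
        segments.append((i, j))
        j = i - 1
    segments.append((0, j))
    segments.reverse()
    return best[k - 1][n - 1], segments
```

For an egocentric email analysis, each element of `series` could be the set of correspondents a user exchanged mail with during one time point; that mapping is an assumed reading of the description above. Because `segment_differences` is isolated from the dynamic program, a different difference measure can be swapped in without touching `optimal_segmentation`.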

 
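
The hot-spot computation described under Current Results can be sketched as follows. This is a minimal illustration assuming a Poisson-model likelihood ratio in the style of the scan statistic, evaluated by brute force over every contiguous interval; the function names and the toy counts are hypothetical, and the project's own method is more efficient than this exhaustive search.

```python
import math


def log_likelihood_ratio(observed, expected, total):
    """Poisson-model log likelihood ratio for one candidate interval.

    Returns 0.0 unless the interval holds more occurrences than expected.
    """
    if observed <= expected:
        return 0.0
    llr = observed * math.log(observed / expected)
    if total > observed:
        llr += (total - observed) * math.log((total - observed) / (total - expected))
    return llr


def hot_spot(keyword_counts, doc_counts):
    """Return (start, end, llr) for the interval with the highest ratio.

    keyword_counts[t]: documents at time point t that contain the keyword.
    doc_counts[t]: all documents at time point t (assumed >= keyword_counts[t]).
    Returns None when no interval exceeds its expected count.
    """
    total_kw = sum(keyword_counts)
    total_docs = sum(doc_counts)
    best = None
    for start in range(len(keyword_counts)):
        observed = 0
        docs = 0
        for end in range(start, len(keyword_counts)):
            observed += keyword_counts[end]
            docs += doc_counts[end]
            # Expected occurrences if the keyword were spread evenly over documents.
            expected = total_kw * docs / total_docs
            llr = log_likelihood_ratio(observed, expected, total_kw)
            if llr > 0 and (best is None or llr > best[2]):
                best = (start, end, llr)
    return best
```

With hypothetical yearly counts such as `hot_spot([1, 1, 8, 9, 1, 1], [20] * 6)`, the interval covering the two busiest years is returned. In a full scan-statistic analysis the significance of the best interval is usually assessed by Monte Carlo replication under the null hypothesis, a step omitted here.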

Publications Related to the Project:

·        W. Chen and P. Chundi, Extracting Hot Spots of Topics from Time Stamped Documents, Submitted to a Journal.

·        W. Chen and P. Chundi, Extracting Hot Spots in Spatial Data: A Study on Grid Based Heuristic Scan Algorithms, UNO Tech Report CST-2010-3.

·        P. Chundi, A. Mills, and W. Chen, Evaluating Time Decompositions of Time Stamped Documents, UNO Tech Report CST-2010-2.

·        W. Chen and P. Chundi, Trend Analysis of Topics Based on Segmentation, 2009 International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2009).

·        P. Chundi, M. Subramaniam, and D. K. Vasireddy, An Approach to Analyze the Enron Email Data Based on Segmentation, Accepted for publication in Elsevier’s Data and Knowledge Engineering Journal.

·        W. Chen and P. Chundi, Extracting Hot Spots of Multi-Keyword Topics from Time Stamped Documents, IEEE Conference on Computational Intelligence and Data Mining, 2009.

·        Harvey P. Siy, Parvathi Chundi, Mahadevan Subramaniam, Summarizing developer work history using time series segmentation: challenge report, 2008 ICSE Workshop on Mining Software Repositories (MSR 2008).

·        W. Chen and P. Chundi, An Approach for Discovering the Hot Spots of Topics from Time Stamped Documents, 2008 SIAM Workshop on Text Mining.

·        P. Chundi and D. J. Rosenkrantz, Efficient Algorithms for Item-Set Time Series, Journal of Data Mining and Knowledge Discovery 2008.

·        P. Chundi, D. J. Rosenkrantz, H. Siy, and M. Subramaniam, A Segmentation-Based Approach for Temporal Analysis of Software Version Repositories, Journal of Software Maintenance and Evolution, Vol 20(3).

·        P. Chundi and D. J. Rosenkrantz, Segmentation of Time Series Data, Encyclopedia of Data Warehousing and Mining (2nd Edition), Edited by Prof. John Wang.

·        H. Siy, P. Chundi, D. J. Rosenkrantz, and M. Subramaniam, Discovering Dynamic Developer Relationships from Software Version Repositories Using Time Series Segmentation, 23rd IEEE International Conference on Software Maintenance 2007.

·        P. Chundi and D. J. Rosenkrantz, Information Preserving Time Decompositions for Time Stamped Documents, Journal of Data Mining and Knowledge Discovery, Vol. 13(1). 2006.

·        R. Zhang and P. Chundi, Using Time Decompositions to Analyze PubMed Abstracts, International Conference on Computer Based Medical Systems, Jun 2006.

·        P. Chundi, R. Zhang, and M. Castellanos, Entropy Based Measure Functions for Analyzing Time Stamped Documents, 2006 Text Mining Workshop, SIAM International Conference on Data Mining.

·        P. Chundi, R. Zhang, and D. J. Rosenkrantz, Efficient Algorithms for Computing Time Decompositions for Time Stamped Documents, International Conference on Database and Expert Systems Applications, Sept 2005.

·        P. Chundi and D. J. Rosenkrantz, On Lossy Time Decompositions of Time Stamped Documents, ACM Conference on Information and Knowledge Management, 2004.

·        P. Chundi and D. J. Rosenkrantz, Constructing Time Decompositions of Time Stamped Documents, SIAM Data Mining Conference, 2004.

 

Point of Contact: Parvathi Chundi (pchundi@mail.unomaha.edu)

Date of Last Update: Oct 11th, 2010.