Research Ph.D. Theses
Sequential Patterns and Temporal Patterns for Text Mining
By Apirak Hoonlor
In the age of the Internet, text mining has become a key research topic for online information retrieval and information extraction. One interesting text mining problem is role identification: recognizing a group of authors who communicate in a specific role within an Internet community. One challenge is to recognize the possibly different roles of authors within a community from each individual exchange in their electronic communications. Another challenge is the temporal nature of social roles: a person can be associated with many roles over a lifetime, ranging from leader of an armed group or prisoner to Nobel Prize winner or president of a country. Ordinary word features and other traditional vector space models are not optimized for these problems. In this thesis, we present frameworks for sequential pattern mining and temporal pattern mining that overcome these challenges. Although our frameworks were initially designed for role identification, they can be applied to other text mining tasks.

First, we present Recursive Data Mining (RDM), a sequential pattern mining framework for text data. RDM allows a certain degree of approximation in pattern matching, which is necessary to capture non-trivial features in text datasets. RDM mines patterns recursively and hierarchically, at varying degrees of abstraction.

For temporal pattern mining, we propose two complementary burstiness frameworks that extract temporally correlated patterns from text streams. One framework mines the bursty patterns that occur within the bursty period of a given pattern. The other extracts the temporally correlated patterns at each time step. We also propose the Bursty Distance Measurement (BDM), which assigns a distance between two documents using the temporal patterns. We applied BDM to a text clustering task. The experiments showed a substantial improvement in event-related clustering of online news articles with our framework.