Using domain knowledge for text mining
Abstract
Text mining is the automated analysis of textual data in order to store, retrieve, organize and extract useful information from it. Text mining systems usually rely on document collections or training data prepared for a particular application. In practice, however, additional knowledge is often available from external databases, reference books, web pages and many other sources. This thesis studies how such external knowledge can be used to improve the performance of text mining systems. First, we focus on using domain knowledge for clustering and text retrieval in bioinformatics. We describe a method for clustering biological data that exploits its inter-linked structure. By constructing a network of biological sequences, structures and literature connected by pairwise relationships, we infer clusters of related articles, sequences and structures by graph partitioning. The resulting clusters exhibit strong topicality, as measured by both quantitative and qualitative manual evaluation on several biological domains. We also present an application of this approach to the problem of finding scientific papers that describe the functions of particular genes. Finally, we study the incorporation of domain knowledge into text classification. We propose combining domain knowledge with training examples in a Bayesian framework. Domain knowledge is used to specify a prior distribution for the parameters of a logistic regression model, and labeled training data is used to find the mode of the posterior distribution. We show experimentally on three text categorization data sets that this approach can produce effective classifiers, particularly when little training data is available.
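The Bayesian approach to text classification described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis. It assumes independent Gaussian priors on the logistic regression weights, a toy three-word vocabulary, and plain gradient ascent to find the posterior mode. The function names, the prior means and variances, the learning rate and the example documents are all hypothetical choices made for illustration, and no intercept term is included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_map_logistic(X, y, prior_mean, prior_var, lr=0.05, n_iter=5000):
    """Gradient ascent on the log posterior of logistic regression weights.

    Assumes an independent Gaussian prior N(prior_mean[j], prior_var[j])
    on each weight. The small fixed step size suits this toy example only.
    """
    w = list(prior_mean)  # start the search at the prior mode
    for _ in range(n_iter):
        # Gradient of the log prior pulls each weight toward its prior mean.
        grad = [-(w[j] - prior_mean[j]) / prior_var[j] for j in range(len(w))]
        # Gradient of the log likelihood comes from the labeled examples.
        for x_i, y_i in zip(X, y):
            err = y_i - sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)))
            for j, xj in enumerate(x_i):
                grad[j] += err * xj
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical vocabulary: ["gene", "protein", "football"].
# Only two labeled documents; "protein" never occurs in the training data.
X_train = [[1, 0, 0], [0, 0, 1]]
y_train = [1, 0]  # 1 = biology, 0 = other

# Assumed domain knowledge: "gene" and "protein" indicate the biology class.
informed_mean = [2.0, 2.0, 0.0]
flat_mean = [0.0, 0.0, 0.0]
prior_var = [1.0, 1.0, 1.0]

w_informed = fit_map_logistic(X_train, y_train, informed_mean, prior_var)
w_flat = fit_map_logistic(X_train, y_train, flat_mean, prior_var)

# A test document that contains only the unseen word "protein".
p_informed = predict_proba(w_informed, [0, 1, 0])
p_flat = predict_proba(w_flat, [0, 1, 0])
```

In this sketch the training data carries no evidence about "protein", so its weight stays at the prior mean. The classifier with the informed prior therefore assigns the test document to the biology class, while the classifier with the zero-mean prior remains undecided. This is the kind of situation in which a knowledge-based prior is expected to help most, namely when little training data is available.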