Automated Generation of Metadata for Mining Image and Text Data
Recent years have witnessed an explosion in the amount of digitally stored data, the rate at which data are generated, and the diversity of disciplines relying on the availability of stored data. Massive datasets are increasingly important in a wide range of applications, including the observational sciences, product marketing, and the monitoring and operation of large systems. Such datasets are collected routinely in astrophysics, particle physics, genetic sequencing, geographic information systems, weather prediction, medical applications, telecommunications, sensor networks, government databases, and credit card transactions. Mining massive datasets poses a serious challenge: datasets on the scale of terabytes or more preclude any serious attempt by individual humans to manually examine and characterize the data objects. My research addresses the challenges of autonomous discovery and triage of contextually relevant information in massive and complex datasets. The aim is to extract feature vectors from the datasets, which serve as digital objects and thereby effectively reduce the volume of data to be examined. I have developed an automated metadata system that scans the database for statistically appropriate feature vectors, records them as digital objects, and augments the metadata with those objects. As a result, the data miner can run a Boolean search on the augmented metadata and quickly reduce the number of objects to be scanned to a much smaller dataset. Two datasets are considered in my research: the first is text data and the second is remote sensing data. The text data are documents from the Topic Detection and Tracking (TDT) Pilot Corpus collected by the Linguistic Data Consortium, Philadelphia, PA, and taken directly from CNN and Reuters.
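The Boolean search over augmented metadata can be sketched as an inverted index from feature tags to object identifiers, with queries answered by set intersection and difference. This is a minimal illustration of the idea only; the index structure and tag names below are hypothetical and do not reflect the actual system's schema.

```python
def build_index(metadata):
    """Invert {object_id: set of feature tags} into {tag: set of object_ids}.

    `metadata` stands in for the augmented metadata produced by the
    automated system; the tag vocabulary here is purely illustrative.
    """
    index = {}
    for obj_id, tags in metadata.items():
        for tag in tags:
            index.setdefault(tag, set()).add(obj_id)
    return index


def boolean_search(index, require=(), exclude=()):
    """Return ids of objects carrying ALL required tags and NO excluded tags."""
    if not require:
        return set()
    result = set.intersection(*(index.get(t, set()) for t in require))
    for t in exclude:
        result -= index.get(t, set())
    return result


# Hypothetical usage: three documents tagged with feature-derived metadata.
meta = {
    "doc1": {"topic:weather", "verb:rain"},
    "doc2": {"topic:sports"},
    "doc3": {"topic:weather"},
}
idx = build_index(meta)
hits = boolean_search(idx, require=("topic:weather",), exclude=("verb:rain",))
```

A query such as the one above narrows thousands of candidate objects down to the few whose metadata satisfies the Boolean condition, which is the volume reduction described in the text.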
The TDT corpus comprises 15,863 documents spanning the period from July 1, 1994 to June 30, 1995. Four features are extracted from the text dataset: a topics feature, a discriminating-words feature, a verbs feature, and a bigrams feature. These four features are attached to each document in the dataset as digital objects, which help in retrieving the information related to each document. The remote sensing images used in my research consist of 50 gigabytes of data from the Multi-angle Imaging SpectroRadiometer (MISR) instrument, delivered by the Langley DAAC with the help of the MISR team at the Jet Propulsion Laboratory (JPL). The MISR instrument aboard NASA's Terra satellite provides an excellent prototype database for demonstrating feasibility. The instrument captures radiance measurements, which can be converted to georectified images. In my research I developed a set of features, part of which is based on the Gray Level Co-occurrence Matrix (GLCM). Adjacent pairs of pixels (assuming 256 gray levels) are used to create a 256-by-256 matrix in which all possible pairs of gray levels are reflected. Images with similar GLCMs are expected to be visually similar. Several of these features are constructed from the GLCM, such as Homogeneity, Contrast, Dissimilarity, Entropy, Angular Second Moment (ASM), and Energy. Other computed features, including Histogram-based Contrast, the Alternate Vegetation Index (AVI), and the Normalized Difference Vegetation Index (NDVI), are also taken into consideration as part of the features extracted in this research.
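The GLCM construction and the texture features derived from it can be sketched in NumPy as follows. This is a minimal sketch, not the actual implementation: it assumes horizontally adjacent pixel pairs (a single offset) and uses the standard textbook definitions of the features named in the text; the NDVI formula is the conventional (NIR - red)/(NIR + red) ratio.

```python
import numpy as np


def glcm(image, levels=256):
    """Build a normalized gray-level co-occurrence matrix from
    horizontally adjacent pixel pairs (offset (0, 1))."""
    m = np.zeros((levels, levels), dtype=np.float64)
    left = image[:, :-1].ravel()
    right = image[:, 1:].ravel()
    np.add.at(m, (left, right), 1.0)  # count each gray-level pair
    return m / m.sum()                # normalize to joint probabilities


def glcm_features(p):
    """Texture features computed from a normalized GLCM p."""
    levels = p.shape[0]
    i, j = np.indices((levels, levels))
    d = np.abs(i - j)                 # gray-level difference per cell
    asm = np.sum(p ** 2)              # Angular Second Moment
    return {
        "homogeneity": np.sum(p / (1.0 + d ** 2)),
        "contrast": np.sum(p * d ** 2),
        "dissimilarity": np.sum(p * d),
        "entropy": -np.sum(p[p > 0] * np.log(p[p > 0])),
        "asm": asm,
        "energy": np.sqrt(asm),
    }


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red bands."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + 1e-12)  # epsilon avoids division by zero
```

Because the GLCM is normalized to a joint probability distribution, the resulting feature values are comparable across images of different sizes, which is what makes them usable as digital objects in the metadata.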