Predicting gene structure in eukaryotic genomes
Obtaining the complete set of proteins for each eukaryotic organism is an important step in the quest to understand how life evolves and functions. The complex physiology of eukaryotic cells, however, makes direct observation of proteins and their parent genes difficult. An organism's genome contains the instructions for generating its complete set of proteins, and so offers the potential to obtain a complete list of proteins without relying exclusively on direct observation in the cell. Computational gene prediction systems therefore play an important role in compiling sets of putative proteins for each sequenced genome. This dissertation addresses the problem of computational gene prediction in eukaryotic genomes, presenting a framework for predicting precise single-isoform protein-coding genes in long contiguous stretches of DNA. The framework is extended to predict overlapping alternatively spliced exons in known protein-coding regions. The main contributions of this work are the first application of classifier stacking with sequential inference to the gene finding problem and the development of a phylogenetic generalized hidden Markov model for the alternative splice site prediction problem. First, a linear weighting scheme is developed and then extended to a statistical prediction model. The statistical model is then transformed into a new sequential inference model to predict alternatively spliced exons. The prediction accuracy of the single-isoform gene prediction methods is tested on three eukaryotic genomes: Arabidopsis thaliana, Oryza sativa, and human. Application of the gene prediction methods to other eukaryotic genomes is also examined. The alternatively spliced exon prediction model is tested in four Drosophila species under a variety of input conditions.
Incorporating multiple sources of gene structure evidence is shown to substantially improve single-isoform gene prediction accuracy, with performance beginning to rival the accuracy of expert human annotators. Results from the alternative exon prediction experiments demonstrate the potential to reliably predict new alternatively spliced forms of known genes. The use of cross-species sequence conservation information is shown to enhance the precision of alternatively spliced exon prediction.