Files
Abstract
High-throughput sequencing technologies have generated huge amounts of genomic data.This wealth of genomic data provides computational biologists unprecedented opportunitiesto unveil the biological machinery encoded in genomes. Characterizing the structure ofgenomes is an important and challenging task; it is an essential step towards deciphering thenetworks and pathways in a biological system.The characterization of microbial genomic structures includes: (1) identifying neighboringgenes that are co-transcribed (also known as operons); (2) identifying groups of operons withevolutionary relationships (also known as uber-operons); and, (3) elucidating higher levelstructures that share common regulatory controls, including protein-DNA binding eventsand cis-regulatory elements among operons (also known as regulons). The primary goal ofthis thesis is to develop computational methods for elucidating the above three categories ofgenomic structures in prokaryotes.UNIPOP, a maximum bipartite matching-based algorithm, is designed and implementedto predict operon structures of any prokaryotic genome, without relying on experimentaldata or training data. The prediction accuracy of UNIPOP is shown to be superior to mostother operon predictors when evaluating two well-studied organisms.The evolutionary relationships among operons are elucidated by using comparativegenomic data and a maximum matching-based algorithm. The comparative study of uberoperonsand regulons has shown that they are highly related, indicating the effectiveness ofusing uber-operons for predicting regulons.With the availability of predicted operons, we propose an approach, phylogenetic footprintingfor prokaryotes, to study cis regulatory motifs in the promoter regions of operons. Byintegrating the motif data with uber-operon data, and formulating it as a graph partitioningproblem, we predicted regulons in Escherichia coli K12. Different sources of validation haveshown that our predicted regulons were consistent with the data of known regulons, functionalrelatedness and expression data. More importantly, we have also derived some novelregulons which were biologically meaningful.In summary, we predict different levels of genomic structures by developing novel graphtheoreticbased algorithms and using comparative genomic analysis. Our methods are universallyapplicable to all sequenced microbial genomes, and outperform most of the otherpublished methods in terms of prediction accuracy. Our prediction tools can provide assistancein understanding the machinery of gene regulation, biological networks and pathways.