Abstract
Identifying orthologous genes is an early and essential step in genome analysis, yet it remains a challenging problem. Although synteny (conservation of gene order) has been used both on its own and in combination with other methods to identify orthologs, its application to ortholog identification has yet to be automated in a user-friendly manner. This need for automation and ease of use led me to develop OrthoRefine, a standalone program that uses synteny to improve ortholog identification. OrthoRefine implements a look-around window approach to detect synteny, which it uses to distinguish orthologs from paralogs in cases where other methods cannot separate them reliably. OrthoRefine is applied as a postprocessing step to results obtained with other methods; I tested it in tandem with OrthoFinder, one of the most widely used programs for ortholog identification in recent years, and with OMA, an online database of orthologous genes. I evaluated the improvements provided by OrthoRefine on several datasets composed of bacterial, eukaryotic, and archaeal genomes. OrthoRefine efficiently eliminates paralogs from the orthologous groups detected by OrthoFinder and from those obtained from OMA. Using synteny increased specificity and improved the identification of functional orthologs; analyses of BLAST e-values, phylogenetics, and operon occurrence further supported the use of synteny for ortholog identification. A comparison of several window sizes suggested that smaller windows (eight genes) were generally the most suitable for identifying orthologs via synteny, whereas larger windows (30 genes) performed better on datasets containing less closely related genomes. A typical run of OrthoRefine on ~10 bacterial genomes completes in a few minutes on a regular desktop PC.
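The look-around window idea can be illustrated with a minimal Python sketch. This is not OrthoRefine's code, only one plausible reading of the approach under stated assumptions: each genome is an ordered list of gene IDs, every gene already has an orthologous-group label from an upstream tool, and a candidate pair is scored by the fraction of window positions whose groups occur on both sides. The function names, the `group_of` mapping, the scoring ratio, and the 0.5 cutoff are all hypothetical choices made for illustration.

```python
def neighbors(genome, index, window):
    """Genes within window // 2 positions on each side of index.

    Assumption: the genome is treated as linear (no wrap-around for
    circular chromosomes) and the gene at `index` itself is excluded.
    """
    half = window // 2
    return genome[max(0, index - half):index] + genome[index + 1:index + 1 + half]


def synteny_ratio(genome_a, gene_a, genome_b, gene_b, group_of, window=8):
    """Fraction of the window whose orthologous groups flank both genes.

    `group_of` maps gene ID -> orthologous-group ID (hypothetical input,
    e.g. derived from an upstream ortholog-calling step).
    """
    near_a = neighbors(genome_a, genome_a.index(gene_a), window)
    near_b = neighbors(genome_b, genome_b.index(gene_b), window)
    groups_a = {group_of[g] for g in near_a if g in group_of}
    groups_b = {group_of[g] for g in near_b if g in group_of}
    return len(groups_a & groups_b) / window


def pick_syntenic(genome_a, gene_a, genome_b, candidates, group_of,
                  window=8, min_ratio=0.5):
    """Return the candidate in genome_b with the best-supported neighborhood,
    or None if no candidate reaches min_ratio (an assumed cutoff)."""
    best = max(candidates, key=lambda c: synteny_ratio(
        genome_a, gene_a, genome_b, c, group_of, window))
    ratio = synteny_ratio(genome_a, gene_a, genome_b, best, group_of, window)
    return best if ratio >= min_ratio else None


# Toy data: b3 and b3p are both in group G3 with a3, but only b3
# sits in a conserved neighborhood.
genome_a = ["a1", "a2", "a3", "a4", "a5"]
genome_b = ["b1", "b2", "b3", "b4", "b5", "x1", "x2", "b3p", "x3", "x4"]
group_of = {
    "a1": "G1", "a2": "G2", "a3": "G3", "a4": "G4", "a5": "G5",
    "b1": "G1", "b2": "G2", "b3": "G3", "b4": "G4", "b5": "G5",
    "b3p": "G3", "x1": "G6", "x2": "G7", "x3": "G8", "x4": "G9",
}

chosen = pick_syntenic(genome_a, "a3", genome_b, ["b3", "b3p"], group_of, window=4)
print(chosen)  # b3
```

In this toy case the two candidates cannot be separated by group membership alone, since both belong to G3; only the gene neighborhood tells them apart. A wider window tolerates more rearrangement between distantly related genomes at the cost of weaker local evidence, which is the trade-off behind the window-size comparison.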