Abstract
Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. Although GTs are prevalent across the tree of life, the evolutionary basis for their complex and diverse modes of catalytic function remains enigmatic. This is mainly due to the extensive structural and functional diversification of GTs, which presents a major challenge in mapping the relationships connecting sequence, structure, fold and function. In this dissertation, I develop and apply a combination of established and novel tools for large-scale sequence-based comparisons of glycosyltransferases across the tree of life. Using well-curated structure-based sequence alignment profiles, I first align over half a million GT sequences adopting the GT-A fold to identify the conserved GT-A core and define the minimal active site and hydrophobic components required for GT-A function. Based on this conserved core, I build a phylogenetic framework connecting diverse GT-A families and propose a new evolutionary-constraint-based classification of GT-A sequences into evolutionarily related groups. Next, I use advances in deep learning to develop a GT fold classification and prediction model that extends the analysis from GT-A to other known and novel folds. I build this highly interpretable model to identify the core conserved features of all three major GT folds and predict GT families that are likely to adopt novel folds. Finally, I compile all the diverse datasets generated during these studies into an interactive data analytics platform that can be used to infer novel hypotheses about GT-A fold enzymes.