Abstract
The analysis of protein sequence information is an important part of bioinformatics, used for high-throughput predictions of protein structure, function, and evolution. While traditional analytical methods rely on sequence alignments, recent advances in representation learning enable alternative, alignment-independent strategies. In this work, I develop and apply both alignment-based and alignment-independent approaches to analyze the protein kinase superfamily, a biomedically relevant and highly conserved class of signaling enzymes. Using a large curated sequence alignment, I characterize sequence variations of the αC-β4 loop across diverse protein kinases and identify the region as a major kinase regulatory hotspot. Using a more focused alignment, I characterize the functional evolution of tyrosine kinase families across diverse holozoan taxa and propose a new representative phylogeny. Finally, I infer the evolutionary relationships that connect the protein kinase superfamily to structurally divergent lipid and small-molecule kinases using an alignment-independent approach based on sequence embeddings learned by Transformer protein language models. My work provides new insights into the functional evolution of the protein kinase superfamily using a combination of traditional approaches and novel ones inspired by unsupervised analytical techniques from representation learning. The broad applicability of my sequence embedding-based framework is further demonstrated in pilot analyses of phosphatase enzymes and the radical S-adenosyl-L-methionine (SAM) superfamily.