Abstract

The analysis of protein sequence information is an important part of bioinformatics, used for high-throughput predictions of protein structure, function, and evolution. While traditional analytical methods rely on sequence alignments, recent advances in representation learning enable alternative, alignment-independent strategies. In this work, I develop and apply both alignment-based and alignment-independent approaches to analyze the protein kinase superfamily, a biomedically relevant and highly conserved class of signaling enzymes. Using a large curated sequence alignment, I characterize sequence variations of the αC-β4 loop across diverse protein kinases and identify the region as a major kinase regulatory hotspot. Using a more focused alignment, I characterize the functional evolution of tyrosine kinase families across diverse holozoan taxa and propose a new representative phylogeny. Finally, I infer the evolutionary relationships that connect the protein kinase superfamily to structurally divergent lipid and small-molecule kinases using an alignment-independent approach based on sequence embeddings learned from Transformer protein language models. My work provides new insights into the functional evolution of the protein kinase superfamily by combining traditional alignment-based methods with novel approaches inspired by unsupervised analytical techniques from representation learning. The broad applicability of my sequence embedding-based framework is further demonstrated in pilot analyses of phosphatase enzymes and of the radical S-adenosyl-L-methionine (SAM) superfamily.
