Abstract
Building effective representations of protein sequences is a longstanding challenge in computational biology. This thesis builds on recent advances in protein language models (PLMs), which, inspired by breakthroughs in natural language processing, have emerged as powerful tools for addressing intricate biological questions. The research presented here focuses on two areas: kinase-substrate phosphorylation prediction and protein sequence conservation. We introduce Phosformer, a deep learning model that sets a new benchmark in accuracy for predicting kinase-specific phosphosites across the entire kinome. Phosformer not only enhances the understanding of kinase-peptide interactions but also brings much-needed transparency and generalizability to these predictions. For protein sequence conservation, this thesis proposes an alignment-free method that uses PLMs, a departure from traditional alignment-based approaches, to accurately identify conserved functional sites in complex protein structures. Overall, this work contributes models and methodologies that advance our understanding of kinase-substrate interactions and protein sequence conservation, with implications that extend beyond biological research to therapeutic applications, and it illustrates the potential of PLMs for deciphering the language of proteins.