Abstract
Machine learning (ML) methods are increasingly employed in genetics and have shown promise in characterizing genetic mutations. Mutations can have a significant impact on the activity of the human Epidermal Growth Factor Receptor (EGFR), a protein instrumental in cell proliferation, and over-activation of EGFR is a major cause of tumor growth. Although many computational methods have been proposed to identify disease-causing mutations, they are not designed to predict the impact of a mutation on protein activity. We explored feature selection strategies suited to the small, complex data sets in this domain and tested a variety of machine learning algorithms. We generated a model, a Support Vector Machine with a Gaussian radial basis function kernel using a set of 6 features, that achieved 85.9% accuracy and an F-Measure of 0.70. Combining this classifier with others through weighted probability voting achieved an area under the ROC curve of 0.83.
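
As an illustration of weighted probability voting, the following minimal sketch averages the class probabilities reported by several classifiers, scaling each classifier's contribution by a weight. The abstract does not state how the weights were chosen or which classifiers were combined, so the function names, the example probabilities, and the weights below are hypothetical and serve only to show the general technique.

```python
from typing import List, Sequence


def weighted_probability_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted average of per-classifier class probabilities.

    probabilities[i][k] is the probability classifier i assigns to class k.
    weights[i] is the (hypothetical) weight given to classifier i.
    """
    if not probabilities:
        raise ValueError("at least one classifier is required")
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per classifier")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(probabilities[0])
    combined = [0.0] * n_classes
    for probs, weight in zip(probabilities, weights):
        if len(probs) != n_classes:
            raise ValueError("all classifiers must report the same classes")
        for k, p in enumerate(probs):
            combined[k] += weight * p
    # Normalize so the combined scores sum to 1 when the inputs do.
    return [score / total_weight for score in combined]


def predict_class(
    probabilities: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> int:
    """Return the index of the class with the highest combined probability."""
    combined = weighted_probability_vote(probabilities, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical example: three classifiers scoring one mutation on two classes.
# The first classifier is given twice the weight of the others (an assumption).
example_probabilities = [[0.40, 0.60], [0.70, 0.30], [0.55, 0.45]]
example_weights = [2.0, 1.0, 1.0]
combined_scores = weighted_probability_vote(example_probabilities, example_weights)
predicted = predict_class(example_probabilities, example_weights)
```

Because the combined output is itself a probability, it can be thresholded at different levels, which is what makes an area under the ROC curve a natural measure for an ensemble of this kind.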