BIOLOGICALLY INFORMED DATA AUGMENTATION FOR IMPROVING AI-DRIVEN ENZYME FUNCTION PREDICTION

Patel, Shreyash Dinesh

2025

Abstract

Earth harbors vast microbial genetic diversity, yet AI-driven functional prediction remains challenging because much of that diversity is underrepresented in functional reference databases and because of severe class imbalance among 2,200 Enzyme Commission (EC) classes. This project tests three data augmentation methods for expanding underrepresented EC classes: (1) reverse complementation (doubling 150,000 training samples), (2) synonymous codon substitution (generating 600,000 sequences with 25–70% replacement probability), and (3) conditional generative adversarial network (GAN) generation conditioned on GC content and codon frequency. We created class-balanced training datasets and trained a classifier that pairs a pretrained DNA encoder, LookingGlass, with a 1D convolutional neural network (CNN) decoder. Model performance was evaluated using micro- and macro-averaged F1 scores. Experiments revealed that codon substitution significantly improved macro-F1 (from 0.15 to 0.23) and rare-class recall (from 33.42% to 38%), while reverse complementation degraded performance by introducing label noise. GAN-based augmentation yielded marginal gains without filtering. This work develops a complete training system, evaluation framework, and benchmark datasets to enhance AI-driven functional annotation of DNA sequences across Earth’s diverse microbial communities.
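
As an illustration of the first two augmentation methods named in the abstract, the following is a minimal Python sketch and not the thesis's actual implementation. It assumes uppercase, in-frame coding sequences containing only A, C, G, and T, the standard genetic code, and a replacement probability applied independently to each codon; the function names and the example sequence are hypothetical.

```python
import random

# Standard genetic code in the conventional TCAG ordering (assumption: the
# thesis may use a different translation table or codon-usage weighting).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons in frame 0; trailing bases are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, usable, 3))


def synonymous_substitute(seq, prob, rng):
    """Swap each codon for a different synonymous codon with probability prob.

    Codons with no synonym (ATG, TGG) are always left unchanged, and any
    trailing partial codon is copied through as-is.
    """
    usable = len(seq) - len(seq) % 3
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        alternatives = [c for c in SYNONYMS[CODON_TO_AA[codon]] if c != codon]
        if alternatives and rng.random() < prob:
            codon = rng.choice(alternatives)
        out.append(codon)
    out.append(seq[usable:])
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed for reproducibility
    original = "ATGGCTCTGAAAGGTTGG"  # hypothetical in-frame fragment
    augmented = synonymous_substitute(original, prob=0.5, rng=rng)
    print(original, translate(original))
    print(augmented, translate(augmented))
    print(reverse_complement(original))
```

Because every swap stays within a synonymous codon group, the augmented sequence translates to the same protein as the original, which is the property that lets it keep the original EC label. The reverse complement, by contrast, no longer reads as the same coding sequence in frame 0.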

Details

Record ID

27270

Record Created

2025-09-23

Title

BIOLOGICALLY INFORMED DATA AUGMENTATION FOR IMPROVING AI-DRIVEN ENZYME FUNCTION PREDICTION

Author

Patel, Shreyash Dinesh

Contributor

Hoarfrost, Adrienne, Advisor (University of Georgia)
Maier, Frederick, Committee Member (University of Georgia)
Rasheed, Khaled, Committee Member (University of Georgia)

College or School

Franklin College of Arts and Sciences

Department

Marine Sciences

Date

2025-08

Content Type

Thesis

Pagination

61

File Format

pdf

Language

English

Degree Type

Master of Science (MS)

Name of Granting Institution

University of Georgia

Year Degree Granted

2025-08

Keywords

Class imbalance; Data augmentation; Enzyme Commission (EC) classes; Microbial dark matter; Rare‐class recall; Synonymous codon substitution

Record Appears in

College, School, or Unit > Franklin College of Arts and Sciences > Marine Sciences
Electronic Theses and Dissertations > Graduate Thesis
All Resources

System Control Number

https://www.proquest.com/LegacyDocView/DISSNUM/31848539
