Abstract

Earth harbors vast microbial genetic diversity, yet AI-driven functional prediction remains challenging because this diversity is underrepresented in functional reference databases and because the 2,200 Enzyme Commission (EC) classes are severely imbalanced. This project tests three data augmentation methods for increasing the representation of rare EC classes: (1) reverse complementation (doubling the 150,000 training samples), (2) synonymous codon substitution (generating 600,000 sequences with a 25–70% replacement probability), and (3) conditional GAN generation conditioned on GC content and codon frequency. We created class-balanced training datasets and trained a classifier that pairs a pretrained DNA encoder, LookingGlass, with a 1D convolutional neural network (CNN) decoder. Model performance was evaluated using micro- and macro-averaged F1 scores. Experiments showed that codon substitution significantly improved macro-F1 (from 0.15 to 0.23) and rare-class recall (from 33.42% to 38%), while reverse complementation degraded performance by introducing label noise. Without filtering, GAN-based augmentation yielded only marginal gains. This work develops a complete training system, evaluation framework, and benchmark datasets to enhance AI-driven functional annotation of DNA sequences across Earth’s diverse microbial communities.
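As an illustration of the first two augmentation methods, the following is a minimal Python sketch and not the project's actual pipeline. It assumes uppercase A/C/G/T sequences that are already in reading frame and uses the standard genetic code; the function names (`reverse_complement`, `substitute_synonymous`) and the example sequence are hypothetical. Each codon is swapped for a different codon encoding the same amino acid with probability `p`, which corresponds to the 25–70% replacement probability mentioned above.

```python
import random

# Standard genetic code in the conventional TCAG ordering (assumed here;
# the abstract does not state which translation table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons, ignoring any trailing partial codon."""
    end = len(seq) - len(seq) % 3
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, end, 3))


def substitute_synonymous(seq, p, rng):
    """Replace each codon with a different synonymous codon with probability p.

    Assumes seq starts in frame; a trailing partial codon is kept unchanged.
    """
    end = len(seq) - len(seq) % 3
    out = []
    for i in range(0, end, 3):
        codon = seq[i:i + 3]
        alternatives = [c for c in SYNONYMS[CODON_TO_AA[codon]] if c != codon]
        # Codons without synonyms (ATG, TGG) are always left as they are.
        if alternatives and rng.random() < p:
            codon = rng.choice(alternatives)
        out.append(codon)
    out.append(seq[end:])
    return "".join(out)


# Hypothetical example sequence, not taken from the project's data.
example = "ATGGCTCTGAAAGGTTGGTAA"
augmented = substitute_synonymous(example, 0.5, random.Random(0))
```

In this sketch the augmented sequence always translates to the same protein as the original, so the EC label carries over by construction. No such guarantee holds for `reverse_complement`, whose output is read from the opposite strand.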
