Sample size determination in multi-class classification and prediction based on single-nucleotide polymorphisms

Liu, Xinyu

2013

Abstract

Single-nucleotide polymorphisms (SNPs), believed to determine human differences, are widely used to predict risk of diseases and class membership of subjects. In the literature, several supervised machine learning methods, such as support vector machines, neural networks, and logistic regression, are available for classification. Typically, however, samples for training a machine are limited and/or the sampling cost is high. Thus, it is essential to determine the minimum sample size needed to construct a classifier based on SNP data. Such a classifier would facilitate correct classification while keeping the sample size to a minimum, thereby making the studies cost-effective.

In this dissertation, for coded SNP data from two classes, an optimal classifier and an approximation to its probability of correct classification (PCC) are derived. A linear classifier is constructed and an approximation to its PCC is also derived. These approximations are validated through a variety of Monte Carlo simulations. A sample size determination algorithm is given, based on the criterion that the difference between the two approximate PCCs is below a threshold. For the HapMap data on Chinese and Japanese populations, a linear classifier is built using 51 independent SNPs, and the required total sample sizes are determined using our algorithm.

For coded SNP data from D (>= 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the PCC for each classifier. These approximations are used to evaluate the associated Areas Under the Receiver Operating Characteristic (ROC) Curve (AUCs) or Volumes Under the ROC hyper-Surface (VUSs). We give an algorithm for sample size determination, which ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of this algorithm is also illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs, and the required total sample sizes are determined. We also illustrate the usefulness of our sample size determination algorithm in a prediction problem using the Heterogeneous Stock Mice data.
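
The two-class criterion described in the abstract (choose the smallest sample size at which the trained classifier's PCC is within a threshold of the optimal classifier's PCC) can be illustrated with a rough simulation sketch. This is not the dissertation's algorithm: the dissertation uses analytic approximations to the PCC, whereas the sketch below swaps in brute-force Monte Carlo estimates. It also assumes independent SNPs coded 0/1/2 as Binomial(2, p) allele counts, equal class priors, and a plug-in log-likelihood-ratio rule as the linear classifier. All function names and parameters are hypothetical.

```python
import numpy as np


def simulate_genotypes(freqs, n, rng):
    """Draw n subjects; each SNP is coded 0/1/2 as a Binomial(2, freq) count."""
    freqs = np.asarray(freqs, dtype=float)
    return rng.binomial(2, freqs, size=(n, freqs.size))


def estimate_freqs(genotypes):
    """Estimate allele frequencies, shrunk slightly away from 0 and 1."""
    n = genotypes.shape[0]
    return (genotypes.sum(axis=0) + 0.5) / (2 * n + 1)


def llr_rule(x, p0, p1):
    """Assign class 1 when the binomial log-likelihood ratio is positive."""
    p0, p1 = np.asarray(p0, dtype=float), np.asarray(p1, dtype=float)
    log_ref = np.log((1 - p1) / (1 - p0))
    weights = np.log(p1 / p0) - log_ref   # coefficient of each coded SNP
    intercept = 2 * log_ref.sum()         # constant term of the ratio
    return (np.asarray(x) @ weights + intercept > 0).astype(int)


def monte_carlo_pcc(p0_used, p1_used, p0_true, p1_true, n_test, rng):
    """Estimate the PCC of the rule built from (p0_used, p1_used)."""
    x0 = simulate_genotypes(p0_true, n_test, rng)
    x1 = simulate_genotypes(p1_true, n_test, rng)
    hit0 = np.mean(llr_rule(x0, p0_used, p1_used) == 0)
    hit1 = np.mean(llr_rule(x1, p0_used, p1_used) == 1)
    return 0.5 * (hit0 + hit1)            # equal priors (an assumption)


def min_sample_size(p0, p1, threshold, grid, n_rep=50, n_test=2000, seed=0):
    """Smallest per-class n in grid with PCC gap below threshold, else None."""
    rng = np.random.default_rng(seed)
    # Optimal rule: uses the true allele frequencies.
    pcc_opt = monte_carlo_pcc(p0, p1, p0, p1, n_test, rng)
    for n in grid:
        pccs = []
        for _ in range(n_rep):
            # Plug-in rule: frequencies estimated from n subjects per class.
            p0_hat = estimate_freqs(simulate_genotypes(p0, n, rng))
            p1_hat = estimate_freqs(simulate_genotypes(p1, n, rng))
            pccs.append(monte_carlo_pcc(p0_hat, p1_hat, p0, p1, n_test, rng))
        if pcc_opt - np.mean(pccs) < threshold:
            return n
    return None
```

A call such as min_sample_size(p0, p1, threshold=0.01, grid=range(10, 201, 10)) would return a per-class training size, given hypothetical true allele frequency vectors p0 and p1 with entries strictly between 0 and 1. In practice those frequencies are unknown and would have to come from pilot data, which is one reason an analytic approximation of the kind the abstract describes is attractive.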

Details

Record ID

19544

Record Created

2024-12-05

Title

Sample size determination in multi-class classification and prediction based on single-nucleotide polymorphisms

Author

Liu, Xinyu

Contributor

Sriram, T. N. Advisor
McCormick, William P. Committee Member
Reeves, Jaxk Committee Member
Wang, Lily Committee Member
Yin, Xiangrong Committee Member

College or School

Franklin College of Arts and Sciences

Department

Statistics

Date

2013

Publisher

University of Georgia

Content Type

Dissertation

Language

English

Dissertation/Thesis Note

Doctoral

Degree Type

Doctor of Philosophy (PhD)

Name of Granting Institution

University of Georgia, Summer 2013

Year Degree Granted

2013

Keywords

Area Under the Receiver Operating Characteristic Curve; Classification; HapMap data; Heterogeneous Stock Mice data; Probability of correct classification; Receiver Operating Characteristic; Sample Size Determination; Single-nucleotide polymorphisms; Volume Under the Receiver Operating Characteristic hyper-Surface; Wald test

System Control Number

9949333454602959
