Statistical Methods on Variable Selection in Structured Longitudinal Data with Missing Information

Son, Heejung

2025

Abstract

Despite substantial declines in cardiovascular disease (CVD) mortality across counties in the United States from 2009 to 2018, notable racial/ethnic, socioeconomic, and regional disparities persist. These disparities are closely linked to social determinants of health (SDOH), so addressing SDOH domains through targeted strategies is vital for reducing disparities and improving CVD outcomes. Longitudinal SDOH data pose several challenges. Observations from the same subject are correlated and response patterns may vary over time, so it is crucial to use statistical models that account for within-subject correlation and the time-dependent effects of covariates. Models providing population-averaged effects or individual-specific estimates have been developed to address these challenges. Missing data often arise in longitudinal studies and are generally assumed to be missing at random conditional on relevant observed information. Modern longitudinal studies also operate within a high-dimensional framework. Variable selection and regularization methods address the related challenges in SDOH data, as they shrink coefficients to prevent overfitting and select variables within groups. The Exclusive Lasso manages grouped variables, ensuring that at least one predictor from each predefined group is selected. Given the high-dimensional and longitudinal nature of county-level SDOH data, advanced clustering methods are necessary to reveal variations in the longitudinal relationship between SDOH domains and CVD mortality. Different subpopulations can behave differently over time, which calls for clustering techniques that identify more homogeneous groups.

In this dissertation, I develop a novel approach that integrates the Exclusive Lasso into penalized weighted generalized estimating equations to facilitate domain-specific variable selection under missing at random. Furthermore, I propose a model-based clustering extension for high-dimensional longitudinal data that uses the Exclusive Lasso to identify subpopulations of counties influenced by distinct covariates within each domain. Finally, I apply the model-based clustering method with the Exclusive Lasso to refine the understanding of county-level variations within each state; integrating an additional algorithm that considers these variations allows counties to be categorized based on their unique characteristics.
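
The abstract's description of the Exclusive Lasso can be made concrete with a small numerical sketch. The block below is illustrative only and is not taken from the dissertation: it computes one common form of the Exclusive Lasso penalty, the sum over groups of the squared within-group L1 norm (scaling conventions such as a leading 1/2 vary between authors). The function name `exclusive_lasso_penalty` and the toy group labels are hypothetical, and treating each group as an SDOH domain is an assumption made for illustration.

```python
import numpy as np

def exclusive_lasso_penalty(beta, groups):
    """Sum over groups of (sum of |beta_j| within the group) squared.

    Hypothetical helper for illustration; not the dissertation's code.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = 0.0
    for g in np.unique(groups):
        # L1 norm of the coefficients that belong to group g
        l1_within_group = np.abs(beta[groups == g]).sum()
        total += l1_within_group ** 2
    return float(total)

# Toy example: four coefficients in two assumed domains (0 and 1).
groups = [0, 0, 1, 1]

# Two nonzero coefficients in the same group: (1 + 1)^2 + 0^2
same_group = exclusive_lasso_penalty([1.0, 1.0, 0.0, 0.0], groups)

# One nonzero coefficient in each group: 1^2 + 1^2
spread_out = exclusive_lasso_penalty([1.0, 0.0, 1.0, 0.0], groups)

print(same_group, spread_out)
```

In this toy case, placing both nonzero coefficients in one group costs more than placing one in each group, even though the overall L1 norm is identical. That within-group competition is what pushes the penalty toward sparse selection inside every group rather than dropping whole groups.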

Details

Record ID

27030

Record Created

2025-09-23

Title

Statistical Methods on Variable Selection in Structured Longitudinal Data with Missing Information

Author

Son, Heejung

Contributor

Shen, Ye Advisor (University of Georgia)
Zhang, Donglan Advisor (University of Georgia)
Chen, Zhuo Committee Member (University of Georgia)
Dobbin, Kevin K Committee Member (University of Georgia)
Rathbun, Stephen L Committee Member (University of Georgia)

College or School

College of Public Health

Department

Biostatistics

Date

2025-05

Content Type

Dissertation

Pagination

143

File Format

pdf

Language

English

Degree Type

Doctor of Philosophy (PHD)

Name of Granting Institution

University of Georgia

Year Degree Granted

2025-05

Keywords

Cardiovascular Disease; Exclusive Lasso; High-Dimensional Longitudinal Data; Missing Data; Social Determinant of Health; Variable Selection

System Control Number

https://www.proquest.com/LegacyDocView/DISSNUM/31846413
