Abstract
High- and ultrahigh-dimensional data sets with group structure arise in a wide range of scientific research and applications. Such grouped data, however, may be sparse. In that case, the primary goal is to select the important groups, that is, those significantly associated with the outcome. Grouped variable selection plays a critical role here, both in selecting groups and in estimating the nonzero coefficients of the covariates within the important groups. Nevertheless, when the data are ultrahigh-dimensional and consist of grouped variables, many grouped variable selection algorithms may fail to converge or may yield nonsensical results, and even when an algorithm does run, it carries a heavy computational load. In this dissertation, we propose a two-stage procedure, grouped variable screening followed by grouped variable selection, to address these issues. In the first stage, grouped variable screening reduces the dimensionality of the data by filtering out the unimportant groups that do not contribute to the outcome. A sure screening property is established, which ensures that, under suitable conditions, all important groups are retained after screening with overwhelming probability. This work focuses mainly on four grouped variable screening criteria. In the second stage, because the dimensionality has been reduced from ultrahigh to moderate, or even below the sample size, grouped variable selection methods can select the important groups effectively and estimate the nonzero coefficients accurately. The running time and computational complexity of grouped variable selection are also reduced dramatically. The performance of the proposed two-stage procedure is evaluated on various simulated examples and on a real data set from genetic analysis. An R package, grpss, is developed to make the two-stage procedure available for real applications.
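The two-stage idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`screen_groups`, `group_lasso`, `two_stage`) are hypothetical, the screening score (mean squared marginal correlation within a group) is a simple stand-in rather than one of the four criteria studied in the dissertation, the second stage uses a plain group lasso solved by proximal gradient descent, and none of it reflects the interface of the grpss package.

```python
import numpy as np


def screen_groups(X, y, groups, n_keep):
    """Stage 1 (illustrative criterion): rank groups by the mean squared
    marginal correlation of their columns with y and keep the top n_keep."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    corr = Xs.T @ ys / len(y)  # marginal correlation of each column with y
    labels = np.unique(groups)
    score = np.array([np.mean(corr[groups == g] ** 2) for g in labels])
    return labels[np.argsort(score)[::-1][:n_keep]]


def group_lasso(X, y, groups, lam, n_iter=500):
    """Stage 2 (illustrative method): group lasso by proximal gradient descent,
    minimizing (1/2n)||y - Xb||^2 + lam * sum_g sqrt(p_g) * ||b_g||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    labels = np.unique(groups)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)  # gradient step
        for g in labels:  # group-wise soft-thresholding
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            thresh = step * lam * np.sqrt(idx.sum())
            z[idx] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * z[idx]
        beta = z
    return beta


def two_stage(X, y, groups, n_keep, lam):
    """Screen down to n_keep groups, then run group lasso on the survivors."""
    kept = screen_groups(X, y, groups, n_keep)
    mask = np.isin(groups, kept)
    X_red, g_red = X[:, mask], groups[mask]
    X_red = X_red - X_red.mean(axis=0)  # centering stands in for an intercept
    beta = group_lasso(X_red, y - y.mean(), g_red, lam)
    selected = [int(g) for g in kept if np.linalg.norm(beta[g_red == g]) > 0]
    return kept, selected


# Simulated example: 100 groups of 5 covariates (p = 500 > n = 200);
# only groups 0 and 1 affect the outcome.
rng = np.random.default_rng(0)
n, n_groups, size = 200, 100, 5
groups = np.repeat(np.arange(n_groups), size)
X = rng.standard_normal((n, n_groups * size))
beta_true = np.zeros(n_groups * size)
beta_true[: 2 * size] = 2.0
y = X @ beta_true + rng.standard_normal(n)

kept, selected = two_stage(X, y, groups, n_keep=10, lam=0.1)
print("groups kept after screening:", sorted(int(g) for g in kept))
print("groups selected by group lasso:", sorted(selected))
```

In this sketch the screening stage touches each column only once, so its cost grows linearly with the number of covariates, while the iterative selection stage runs on the much smaller set of retained groups. The values of `n_keep` and `lam` are arbitrary here; in practice they would be chosen by a data-driven rule such as cross-validation.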