Abstract
Finite mixture models provide a flexible way to model data from a population consisting of a finite number of homogeneous subpopulations. These models are particularly useful for identifying clusters or subgroups within data. Model selection is a crucial step in every statistical data analysis, and especially so for data from an unknown number of subpopulations. In this thesis, we focus on determining parsimonious finite mixture models using a model selection criterion based on L2 distance. In many applications, the scientific information available may not be sufficient to determine the number of components in a finite mixture model; hence, it is important to find the mixture with the fewest components that provides a satisfactory fit to the data. This number of components is known as the mixture complexity. Estimation of mixture complexity is a fundamental yet challenging problem that has received enormous attention in the past few decades. In this thesis, we treat the estimation of mixture complexity as a model selection problem and construct an estimator of mixture complexity as a by-product of minimizing an information criterion based on L2 distance, for both count and continuous data. The estimator is shown to be consistent when the component densities are unknown but are postulated to be members of some parametric family. Simulations also show that the estimator is robust against model misspecification. When the model is correctly specified, Monte Carlo simulations for a wide variety of normal and Poisson mixtures show that our estimator is very competitive with several others in the literature in correctly identifying the true mixture complexity. The performance of the method is illustrated on several simulated and well-known real datasets.