Abstract
Contemporary data sets can be too large or complex for traditional statistical methods to handle. One approach is to use symbolic data, first introduced by Diday (1987). We study model-based clustering for symbolic data, in particular distribution-valued data, where each observation is a distribution rather than a single numerical point value. We describe symbolic data and the considerable differences between symbolic and classical data. For multivariate data (p > 1), only the marginal distributions are available, so the dependence between the random variables is unknown. One approach to modeling this dependence is that of Vrac et al. (2012), in which a copula function is used to describe the joint cumulative distribution function of the random variables in a mixture model. We further develop this algorithm from several perspectives. The model-based clustering algorithm is also implemented in R and applied to simulated data.
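
As a sketch of the kind of model referred to above, a copula-based mixture for the joint cumulative distribution function can be written as follows. The notation is assumed here for illustration and is not taken from the source: K is the number of clusters, \pi_k are the mixing proportions, C_k is a copula with parameter \theta_k, and F_j^{(k)} is the marginal distribution function of the j-th variable in cluster k.

```latex
% Assumed notation (not from the source): K, \pi_k, C_k, \theta_k, F_j^{(k)}
F(x_1, \dots, x_p)
  = \sum_{k=1}^{K} \pi_k \,
    C_k\!\left( F_1^{(k)}(x_1), \dots, F_p^{(k)}(x_p) ; \theta_k \right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .
```

By Sklar's theorem, each component's joint distribution can be expressed through its marginals and a copula, so the dependence between the p variables within a cluster is carried by C_k, while the marginals are modeled separately.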