Abstract
In this era of Big Data, large-scale data storage provides the motivation for statisticians to analyze new types of data. Standard statistical techniques based on the Euclidean metric are typically not designed to handle these new types of data. Because extracting useful information from vast amounts of complex data is critical in modern research, there is a strong need to develop new approaches with non-Euclidean metrics for analyzing highly structured data. Among the complex data emerging in various fields of science, our research focuses on the analysis of data objects.
The first topic focuses on classification problems in which predictors are observed as, or aggregated into, histograms. Because conventional classification methods take vectors as input, a natural approach converts histograms into vector-valued data using summary values, such as the mean or median. However, this approach forgoes the distributional information available in histograms. To address this issue, we propose a margin-based classifier for histogram-valued data called the support histogram machine (SHM). We adopt the support vector machine framework and use the Wasserstein-Kantorovich metric to measure distances between histograms. The proposed optimization problem is solved by a dual approach. We then test the proposed SHM on simulated and real examples and demonstrate that it outperforms summary-value-based methods.
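As a rough illustration of how a Wasserstein-type distance compares two histograms, the sketch below computes the order-1 Wasserstein distance between histograms defined on the same bins. It is not the SHM implementation: the function name `wasserstein1`, the choice of order 1, and the simplification that each bin's mass sits at its center are all assumptions made for this example.

```python
from itertools import accumulate


def wasserstein1(centers, freq_p, freq_q):
    """Order-1 Wasserstein distance between two histograms on shared bins.

    Assumptions (illustrative only): `centers` is sorted in ascending order,
    and each bin's mass is treated as a point mass at its center.
    """
    if not (len(centers) == len(freq_p) == len(freq_q)):
        raise ValueError("centers and both frequency lists must have equal length")
    total_p, total_q = sum(freq_p), sum(freq_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each histogram needs positive total mass")

    # Cumulative relative frequencies (empirical CDFs at the bin centers).
    cdf_p = list(accumulate(f / total_p for f in freq_p))
    cdf_q = list(accumulate(f / total_q for f in freq_q))

    # In one dimension, W1 is the area between the two CDFs.
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    return sum(abs(fp - fq) * gap for fp, fq, gap in zip(cdf_p, cdf_q, gaps))


centers = [0.0, 1.0, 2.0]
shifted = wasserstein1(centers, [1, 1, 0], [0, 1, 1])   # same shape, moved one bin
opposite = wasserstein1(centers, [1, 0, 0], [0, 0, 1])  # all mass moved two bins
```

Unlike a comparison of means or medians alone, this distance depends on the whole shape of each histogram, which is the distributional information that summary values discard.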
In the second topic, we propose two clustering methods for longitudinal data based on the Fréchet distance: Multivariate Fréchet K-means (MFKmL) and Sparse Fréchet K-medoids (SFKmL). The Fréchet distance is a useful tool for measuring the similarity between trajectories based on their shapes. The MFKmL method follows the standard K-means algorithm with the Fréchet distance in multiple dimensions, and the SFKmL method introduces sparsity through the variable-wise Fréchet distance in the K-medoids algorithm. A simulation study suggests that SFKmL outperforms MFKmL and an existing clustering method. Moreover, the real data analysis using SFKmL provides a clustering result that is interpretable from a clinical perspective.
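To make the shape-based comparison of trajectories concrete, here is a minimal sketch of the discrete Fréchet distance between two sampled trajectories, computed by dynamic programming. Whether the proposed methods use this discrete variant is an assumption; the function name `discrete_frechet` and the representation of a trajectory as a list of (time, value) points are also chosen only for this example.

```python
import math


def discrete_frechet(curve_a, curve_b):
    """Discrete Fréchet distance between two trajectories.

    Each trajectory is a non-empty list of points (tuples of equal dimension),
    for example (time, value) pairs. Illustrative sketch only.
    """
    n, m = len(curve_a), len(curve_b)
    if n == 0 or m == 0:
        raise ValueError("trajectories must be non-empty")

    # coupling[i][j] = smallest achievable maximum pointwise distance when
    # walking both curves in order up to curve_a[i] and curve_b[j].
    coupling = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(curve_a[i], curve_b[j])
            if i == 0 and j == 0:
                coupling[i][j] = d
            elif i == 0:
                coupling[i][j] = max(coupling[0][j - 1], d)
            elif j == 0:
                coupling[i][j] = max(coupling[i - 1][0], d)
            else:
                best_previous = min(
                    coupling[i - 1][j], coupling[i - 1][j - 1], coupling[i][j - 1]
                )
                coupling[i][j] = max(best_previous, d)
    return coupling[n - 1][m - 1]


flat = [(0, 0.0), (1, 0.0), (2, 0.0)]
raised = [(0, 1.0), (1, 1.0), (2, 1.0)]
gap = discrete_frechet(flat, raised)  # parallel trajectories one unit apart
```

A K-means or K-medoids procedure could then use such a function in place of the Euclidean distance when assigning trajectories to clusters.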
Lastly, we propose a conditional distribution estimator for a regression setting with a histogram-valued target variable and vector-valued covariates. By partitioning the support of the target variable, we obtain the cumulative relative frequencies associated with the histogram bins and embed them into the output layer of a neural network. We explore the prediction performance of the proposed method in various simulation settings.
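The cumulative relative frequencies mentioned above can be illustrated with a small sketch that turns the bin counts of one histogram-valued observation into a nondecreasing target vector ending at 1, and recovers the bin probabilities from such a vector. The function names are hypothetical, and how these values enter the output layer is not specified here, so the network itself is left out.

```python
from itertools import accumulate


def cumulative_relative_frequencies(counts):
    """Cumulative relative frequencies of a histogram, one value per bin.

    `counts` holds the non-negative bin counts over a fixed partition of the
    target variable's support (hypothetical helper for illustration).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram needs positive total count")
    return [running / total for running in accumulate(counts)]


def bin_probabilities(cum_freqs):
    """Recover per-bin relative frequencies from cumulative ones."""
    previous = [0.0] + list(cum_freqs[:-1])
    return [current - prev for prev, current in zip(previous, cum_freqs)]


target = cumulative_relative_frequencies([2, 3, 5])
recovered = bin_probabilities(target)
```

Working with cumulative values means a predicted vector can be checked for validity by confirming that it is nondecreasing and ends at 1.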