Abstract

In this era of Big Data, large-scale data storage motivates statisticians to analyze new types of data. Standard statistical techniques based on the Euclidean metric are typically not designed to handle these data types. Because extracting useful information from vast amounts of complex data is critical in modern research, there is a strong need for new approaches that use non-Euclidean metrics to analyze highly structured data. Among the complex data emerging in various fields of science, our research focuses on the analysis of data objects.

The first topic focuses on classification problems in which predictors are observed as, or aggregated into, histograms. Because conventional classification methods take vectors as input, a natural approach converts histograms into vector-valued data using summary values, such as the mean or median. However, this approach forgoes the distributional information available in histograms. To address this issue, we propose a margin-based classifier for histogram-valued data called the support histogram machine (SHM). We adopt the support vector machine framework and use the Wasserstein-Kantorovich metric to measure distances between histograms. The proposed optimization problem is solved by a dual approach. We then test the proposed SHM on simulated and real examples and demonstrate that it outperforms summary-value-based methods.

In the second topic, we propose two clustering methods for longitudinal data based on the Fréchet distance: Multivariate Fréchet K-means (MFKmL) and Sparse Fréchet K-medoids (SFKmL). The Fréchet distance is a useful tool for measuring the similarity between trajectories based on their shapes. The MFKmL method follows the standard K-means algorithm with the Fréchet distance in multiple dimensions, and the SFKmL method introduces sparsity through the variable-wise Fréchet distance in the K-medoids algorithm.
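To make the shape-based comparison of trajectories concrete, the following is a minimal sketch of the discrete Fréchet distance between two trajectories, computed by dynamic programming. It is an illustration under stated assumptions, not the implementation used in MFKmL or SFKmL: the function name `discrete_frechet`, the representation of a trajectory as a list of (time, value) points, and the choice of the discrete variant with Euclidean point distances are all assumptions that do not come from the abstract.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two trajectories.

    Each trajectory is a non-empty list of points (tuples of equal length).
    Hypothetical helper for illustration only.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the discrete Fréchet distance between p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance between two points
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever keeps
                # the largest pointwise gap seen so far as small as possible.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Two flat trajectories that differ by a constant vertical shift of 1.
traj_a = [(0, 0), (1, 0), (2, 0)]
traj_b = [(0, 1), (1, 1), (2, 1)]
shift_distance = discrete_frechet(traj_a, traj_b)
```

Because the distance depends on the order in which points are visited, two trajectories with the same set of values but different shapes over time can be far apart, which is what makes this kind of distance suitable for shape-based clustering.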
A simulation study suggests that SFKmL outperforms MFKmL and an existing clustering method. Moreover, the real data analysis using SFKmL provides a clustering result that is interpretable from a clinical perspective.

Lastly, we propose a conditional distribution estimator in a regression setting with a histogram-valued target variable and vector-valued covariates. By partitioning the support of the target variable, the cumulative relative frequencies associated with the histogram bins are obtained and embedded into the output layer of a neural network. We explore the prediction performance of the proposed method in various simulation settings.
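As a small illustration of the target encoding described above, the sketch below turns the bin counts of one histogram into cumulative relative frequencies, one value per bin. It is only a sketch: the function name `cumulative_relative_frequencies` and the input format (a list of non-negative bin counts over a fixed partition of the support) are assumptions, and the abstract does not specify how these values enter the network's output layer.

```python
from itertools import accumulate


def cumulative_relative_frequencies(counts):
    """Cumulative relative frequencies for a list of histogram bin counts.

    Hypothetical helper for illustration only; the bins are assumed to be
    ordered along the partitioned support of the target variable.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must have a positive total count")
    # Running sum of counts divided by the total gives a non-decreasing
    # sequence that ends at 1.0.
    return [running / total for running in accumulate(counts)]


# A histogram with three bins holding 2, 3, and 5 observations.
target = cumulative_relative_frequencies([2, 3, 5])
```

The resulting vector is non-decreasing and ends at 1, so it behaves like a discretized cumulative distribution function evaluated at the bin edges.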
