Abstract

Modern machine learning (ML) pipelines, especially deep learning (DL) pipelines, tend to be constrained by a lack of labeled data, whereas raw unlabeled data is relatively abundant. Labeling data requires experts to apply domain knowledge to assign potentially arbitrary labels to samples. This process is inaccessible to many because it requires finding a domain expert and bearing the financial cost of employing one. There is also room for error through accidental mislabeling by the expert. In emerging problems, there may not even be a fundamental set of labels agreed upon by domain experts. Moreover, labels may be encoded within a hierarchical label schema at different levels of fidelity, leading to semantic ambiguity. For example, while all shirts are tops, not all tops are shirts. Thus the space of clothing may include labels such as "shirts" and "tops", but which one is more appropriate depends on the problem. Sometimes greater specificity leads to more complex and confounding models with lower efficacy, whereas labels that are too general may lead to coarser models that do not encode sufficient complexity to model patterns in the data. We develop a novel workflow and pipeline to mitigate these problems, built around HDBSCAN, the current state-of-the-art (SOTA) unsupervised hierarchical clustering algorithm. We start by modifying HDBSCAN into Path-Constrained HDBSCAN (PCH), a semi-supervised algorithm that enables hierarchical clustering driven by expert semantics. PCH serves to quickly create an initial label schema based on the experts' semantics, amplifying and encoding their personal domain knowledge. We also provide a novel sampling method built specifically for PCH that allows for useful expert queries. We then train a deep representation network designed to produce a rich representation space while also learning representative samples from the data. Finally, we define a workflow for introspecting the learned samples to gain insights that generalize back to the dataset as a whole.
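
The "shirts" versus "tops" ambiguity can be made concrete with a small sketch. The following is purely illustrative and is not the PCH algorithm or any part of the pipeline described above: the `PARENT` mapping, the extra labels ("sweater", "jeans", "bottom", "clothing"), and the helper names `ancestors`, `is_a`, and `coarsen` are all hypothetical, chosen only to show how one sample can carry valid labels at several levels of fidelity.

```python
# Hypothetical label hierarchy (child -> parent); not taken from the thesis.
PARENT = {
    "shirt": "top",
    "sweater": "top",
    "top": "clothing",
    "jeans": "bottom",
    "bottom": "clothing",
    "clothing": None,  # root of the schema
}


def ancestors(label):
    """Return the label followed by every coarser label above it, root last."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def is_a(label, candidate):
    """True if `candidate` is `label` itself or one of its coarser ancestors."""
    return candidate in ancestors(label)


def coarsen(label, depth):
    """Relabel at a fixed depth below the root (0 = root).

    Labels that already sit at or above `depth` are returned unchanged.
    """
    root_first = ancestors(label)[::-1]
    return root_first[min(depth, len(root_first) - 1)]


if __name__ == "__main__":
    print(is_a("shirt", "top"))   # every shirt is a top
    print(is_a("top", "shirt"))   # but not every top is a shirt
    print([coarsen(x, 1) for x in ["shirt", "sweater", "jeans"]])
```

Under this toy schema, choosing `depth` is exactly the problem-dependent decision the abstract describes: a shallow depth merges "shirt" and "sweater" into one coarse class, while a deeper one keeps them apart at the cost of more classes to model.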
