Interactive Hierarchical Label Discovery

Zain, Meekail

2025

Modern machine learning (ML) pipelines, especially deep learning (DL) pipelines, tend to be constrained by a lack of labeled data, whereas raw unlabeled data is relatively abundant. Labeling data requires experts to apply domain knowledge to assign potentially arbitrary labels to samples. This process is inaccessible to many because it requires finding a domain expert and bearing the financial cost of employing one. There is also room for error through accidental mislabeling by the expert. In emerging problems, there may not even be a fundamental set of labels agreed upon by domain experts. Moreover, labels may be encoded within a hierarchical label schema at different levels of fidelity, leading to semantic ambiguity. For example, while all shirts are tops, not all tops are shirts. Thus the space of clothing labels may include both "shirts" and "tops", and which of the two is more appropriate depends on the problem. Greater specificity can lead to more complex and confounded models with lower efficacy, whereas excessive generality can lead to coarser models that do not encode sufficient complexity to model patterns in the data. We develop a novel workflow and pipeline to mitigate these problems, built around HDBSCAN, the current state-of-the-art (SOTA) unsupervised hierarchical clustering algorithm. We start by modifying HDBSCAN into Path-Constrained HDBSCAN (PCH), a semi-supervised algorithm that allows for expert-sentiment-driven hierarchical clustering. PCH serves to quickly create an initial label schema based on the experts' semantics, amplifying and encoding their personal domain knowledge. We also provide a novel sampling method built specifically for PCH that allows for useful expert queries. We then train a deep representation network designed to produce a rich representation space while also learning representative samples from the data. Finally, we define a workflow for introspecting the learned samples to gain insights that generalize back to the dataset as a whole.

Record ID

27194

Record Created

2025-09-23

Title

Interactive Hierarchical Label Discovery

Author

Zain, Meekail

Contributor

Quinn, Shannon, Advisor (University of Georgia)
Quinn, Shannon, Committee Member (University of Georgia)
Bai, Ray, Committee Member (University of Georgia)
Bhandarkar, Suchendra M, Committee Member (University of Georgia)

College or School

Franklin College of Arts and Sciences

Department

School of Computer Science

Date

2025-05

Content Type

Dissertation

Pagination

115

File Format

pdf

Language

English

Degree Type

Doctor of Philosophy (PHD)

Name of Granting Institution

University of Georgia

Year Degree Granted

2025-05

Keywords

Generative Modeling; Machine Learning; Manifold Theory; Representation Theory; Semi-Supervised Clustering; Unsupervised Clustering

System Control Number

https://www.proquest.com/LegacyDocView/DISSNUM/31848160
