Files
Abstract
Clustering graph-based data is a core problem in contemporary data science. In particular, as data sets grow in size and dimension, efficient algorithms are needed that can go beyond the O(n^2) run time of classical algorithms such as spectral clustering. In this dissertation we propose a new paradigm that rephrases the problem of finding clusters in graphs as a compressive sensing problem. This enables us to use fast algorithms originally developed for sparse recovery (in particular, greedy algorithms such as Orthogonal Matching Pursuit or Subspace Pursuit) for the clustering problem. We propose two new algorithms, and several variations thereof, in this paradigm, which we deem Cluster Pursuit algorithms. In particular, SingleClusterPursuit takes a small set of seed vertices and returns a good cluster containing them in O(dmax*n*log(n)) time, where dmax is the maximum vertex degree of the graph, while DynamicClusterPursuit efficiently updates an existing cluster in an evolving network. We further prove that SingleClusterPursuit is able to recover a large fraction of a given cluster for graphs drawn from a well-known probabilistic model of graphs with communities, namely the stochastic block model.In an additional chapter, we study the related problem of turning Euclidean data into graph data, so that graph-based clustering algorithms such as those discussed above can be used. We analyze the use of power weighted shortest path distances to measure the distance between such data points, and show that this can lead to significant improvements in classification accuracy on both real and synthetic data sets.