Publications

Note: abc denotes that the authors are listed in alphabetical order, per the tradition of (theoretical) computer science.

Robust Online Correlation Clustering

abc Authors: Silvio Lattanzi, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang and Rudy Zhou

Published at: NeurIPS, 2021

In correlation clustering we are given a set of points along with recommendations on whether each pair of points should be placed in the same cluster or in separate clusters. The goal is to cluster the points so as to minimize disagreements with the recommendations. We study the correlation clustering problem in the online setting, where points arrive one at a time, and upon arrival the algorithm must make an irrevocable cluster assignment decision. The online version is known to have a polynomial lower bound on the competitive ratio of any algorithm. In this work we go beyond worst-case analysis and show that the celebrated Pivot algorithm performs well when given access to a small number of random samples from the input. Moreover, we prove that in this setting Pivot is robust to additional adversarial perturbations of the sample set. We conclude with an empirical analysis validating our theoretical findings.
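For context, the Pivot algorithm analyzed here is the classic procedure of Ailon, Charikar and Newman: repeatedly pick a random unclustered point as a pivot and cluster it together with every unclustered point recommended to join it. A minimal sketch in Python (function and variable names are illustrative, not from the paper):

    import random

    def pivot(points, same_cluster):
        """Classic Pivot for correlation clustering.

        points: iterable of point ids.
        same_cluster(u, v): True if the pair (u, v) is recommended
        to share a cluster, False otherwise.
        Returns a list of clusters (lists of point ids).
        """
        unclustered = set(points)
        clusters = []
        while unclustered:
            p = random.choice(tuple(unclustered))  # random pivot
            cluster = [p] + [q for q in unclustered
                             if q != p and same_cluster(p, q)]
            clusters.append(cluster)
            unclustered.difference_update(cluster)
        return clusters

Roughly speaking, the online algorithm runs Pivot on the random sample and assigns arriving points relative to the sampled pivots; see the paper for the precise scheme.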

Relational Algorithms for k-means Clustering

abc Authors: Benjamin Moseley, Kirk Pruhs, Alireza Samadian and Yuyan Wang

Published at: ICALP, 2021

Given a relational database, we give a constant-factor approximation algorithm for k-means that operates directly on the relational input, without converting the database to the conventional sample-feature matrix format. Most data encountered by data scientists is stored in relational databases, yet running k-means on one usually requires first converting it into a set of data points, where each data point is a collection of values for different features (attributes). The size of this set can be exponential in the size of the relational database. Our algorithm takes advantage of the highly compact relational representation and achieves a running time that can be sublinear in the number of points the database represents.

Link: here
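As a toy illustration of the size blow-up the relational algorithm avoids (the data below is hypothetical): joining even two tiny relations already represents far more points than the database stores.

    from itertools import product

    # Two relations with no shared attributes; their cross product
    # represents n * m data points from only n + m stored values.
    ages = [23.0, 31.0, 47.0]            # relation R1
    incomes = [40e3, 55e3, 70e3, 90e3]   # relation R2

    # The conventional route materializes the full design matrix:
    points = [(a, b) for a, b in product(ages, incomes)]
    print(len(points))  # 12 points from 7 stored values

    # With k such relations of size n each, materialization can cost
    # n**k rows; the relational algorithm never builds this matrix.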

Faster Hierarchical Clustering in General Metric Spaces using Approximate Nearest Neighbors

abc Authors: Benjamin Moseley, Sergei Vassilvitskii and Yuyan Wang

Published at: AISTATS, 2021

We use approximate nearest neighbor (ANN) queries to improve the running time of single-linkage and average-linkage, two popular methods for Hierarchical Clustering, on inputs from general metric spaces. We first propose a method that works in any metric space where an ANN structure exists, and then extend it to the more common scenario where ANN is only available for a proxy metric. We complement our theoretical analysis with an empirical evaluation.
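To make the bottleneck concrete, here is a brute-force single-linkage baseline (illustrative code, not the paper's algorithm); the paper's speed-up comes from replacing the exhaustive closest-pair search below with approximate nearest-neighbor queries.

    import math

    def single_linkage(points, num_clusters, dist=math.dist):
        """Merge the closest pair of clusters until num_clusters
        remain; the closest pair is found by exhaustive search."""
        clusters = [[p] for p in points]
        while len(clusters) > num_clusters:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(dist(a, b)
                            for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] += clusters.pop(j)  # j > i, so index i is safe
        return clusters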

Scaling Average-Linkage via Sparse Cluster Embeddings

abc Authors: Thomas Lavastida, Kefu Lu, Benjamin Moseley and Yuyan Wang

Published at: ACML, 2021

Average-linkage, one of the most popular hierarchical clustering algorithms, does not scale to large data sets: the fastest known implementation has running time quadratic in the number of data points. This paper presents a technique we call cluster embedding. It maps clusters to points in a Euclidean space of slightly higher dimension such that the distances between the mapped points closely approximate the average distances between the corresponding clusters. Coupled with nearest-neighbor techniques, this embedding lets us efficiently find close clusters to merge and breaks the quadratic running time bound. The speed-up is demonstrated both theoretically and empirically.
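A toy instance of the embedding idea, shown for squared Euclidean distances where a sparse embedding is exact (the paper's construction targets average Euclidean distance and is approximate; this sketch only conveys the intuition):

    import numpy as np

    def sparse_embed(clusters):
        """Map each cluster to its centroid plus sqrt(variance)
        placed in a coordinate private to that cluster. Each
        embedded point has only d + 1 nonzero entries, and
            mean over a in A, b in B of ||a - b||^2
              = ||mu_A - mu_B||^2 + Var(A) + Var(B)
              = ||phi(A) - phi(B)||^2 exactly.
        """
        d = clusters[0].shape[1]
        emb = np.zeros((len(clusters), d + len(clusters)))
        for i, c in enumerate(clusters):
            mu = c.mean(axis=0)
            emb[i, :d] = mu
            emb[i, d + i] = np.sqrt(((c - mu) ** 2).sum(axis=1).mean())
        return emb

    # Sanity check on random clusters:
    rng = np.random.default_rng(0)
    A, B = rng.random((5, 3)), rng.random((8, 3))
    E = sparse_embed([A, B])
    avg_sq = np.mean([((a - b) ** 2).sum() for a in A for b in B])
    assert np.isclose(avg_sq, ((E[0] - E[1]) ** 2).sum())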

Fair Hierarchical Clustering

abc Authors: Sara Ahmadian, Alessandro Epasto, Marina Knittel, Ravi Kumar, Mohammad Mahdian, Benjamin Moseley, Philip Pham, Sergei Vassilvitskii and Yuyan Wang

Published at: NeurIPS, 2020

We define fairness in the context of Hierarchical Clustering (HC) and design algorithms that efficiently produce fair HC solutions under existing HC objective functions. The proposed notion of fairness is a hard constraint on HC trees: data points are assigned colors according to the value of a chosen protected feature, and in every cluster of the HC tree (apart from the leaves) no color may be overly dominant.

Link: here
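To make the constraint concrete, a small checker for it (a hypothetical interface, not the paper's code): it verifies that in every non-leaf cluster of an HC tree, no color exceeds a balance cap alpha.

    from collections import Counter

    def is_fair(tree, colors, alpha):
        """tree: nested lists; leaves are point ids.
        colors: dict mapping point id -> protected-feature color.
        alpha: max allowed fraction of any one color per cluster.
        Returns True iff every internal cluster is alpha-balanced."""
        def leaves(node):
            if not isinstance(node, list):      # a leaf point
                return [node]
            return [p for child in node for p in leaves(child)]

        def check(node):
            if not isinstance(node, list):
                return True                     # leaves are exempt
            counts = Counter(colors[p] for p in leaves(node))
            if max(counts.values()) > alpha * sum(counts.values()):
                return False
            return all(check(child) for child in node)

        return check(tree)

    colors = {1: 'red', 2: 'blue', 3: 'red', 4: 'blue'}
    print(is_fair([[1, 2], [3, 4]], colors, alpha=0.5))  # True
    print(is_fair([[1, 3], [2, 4]], colors, alpha=0.5))  # False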

An Objective for Hierarchical Clustering in Euclidean Space and Its Connection to Bisecting K-means

abc Authors: Benjamin Moseley and Yuyan Wang

Published at: AAAI, 2020

We design a new global objective function to measure the quality of Hierarchical Clustering solutions for inputs in Euclidean space and discuss its theoretical connection to ground-truth inputs and to algorithms used in practice. The objective captures the criterion that has motivated the use of divisive clustering algorithms: when a split happens, points in the same cluster should be more similar than points in different clusters. The objective gives reasonable results on ground-truth inputs for hierarchical clustering, and we establish a theoretical connection between the objective and the bisecting k-means algorithm.

Link: here
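Bisecting k-means, the divisive algorithm the objective is connected to, is standard: recursively split each cluster into two with 2-means. A minimal sketch using scikit-learn (illustrative, not the paper's code):

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, depth):
        """Build a binary HC tree over the rows of X by recursively
        splitting with 2-means; returns nested lists of row indices."""
        def split(idx, d):
            if d == 0 or len(idx) < 2:
                return list(idx)
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
            left, right = idx[labels == 0], idx[labels == 1]
            if len(left) == 0 or len(right) == 0:  # degenerate split
                return list(idx)
            return [split(left, d - 1), split(right, d - 1)]
        return split(np.arange(len(X)), depth)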

Zebra regression: estimating motion of periodically-emitted particles from unlabeled sightings

Authors: Yuyan Wang, Senaka Buthpitiya and Alex Fabrikant

In submission.

We introduce Zebra regression, a novel streaming regression task. A stream of periodically-created particles travels along a finite, 1-dimensional trajectory, and we observe a stream of sporadic, noisy reports that estimate the location of some particle, but without exact information on which particle is being observed. We seek a streaming clustering algorithm that maintains accurate estimates of the locations of the particles, where a cluster of data reports corresponds to a moving particle.
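For concreteness, a toy generator for this data model (all parameter names and values are hypothetical; the paper's model may differ in its details):

    import random

    def zebra_stream(period=10.0, speed=1.0, length=100.0,
                     report_rate=0.3, noise=2.0, horizon=200.0):
        """Particles are emitted every `period` time units and move
        at `speed` along a track of the given `length`; sporadic
        reports give a noisy position of a uniformly chosen live
        particle, with no particle identity attached."""
        t = 0.0
        while True:
            t += random.expovariate(report_rate)
            if t >= horizon:
                return
            live = [k * period for k in range(int(t // period) + 1)
                    if (t - k * period) * speed <= length]
            if not live:
                continue
            birth = random.choice(live)  # which particle: unobserved
            pos = (t - birth) * speed + random.gauss(0.0, noise)
            yield t, pos                 # (timestamp, noisy location)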