# Clusteranalysis

A clustering is a partition of a set of objects into groups (clusters), such that objects in the same cluster are similar to one another, while objects in different clusters differ substantially. As a first step, we have to define a notion of similarity.

As an example, consider a map. If we interpret locations marked on the map as a set of objects, then we could say the two objects are similar if the distance between the two is small. A clustering of the locations would then correspond to a division of the map into conurbations.

Algorithms solving such a clustering problem try to compute a good clustering. That is, a clustering with small cost. So, in addition to defining similarity, we also need to define an objective function, which signifies the quality of the computed clustering.

## k-means

One of the most famous clustering problems is the task of partitioning the input data set into a given number of clusters. Additionally, each cluster is assigned some representative. The cost of the clustering is then defined as the sum of the distances of all data points to their respective representative. We often used the squared Euclidean distance, or more generally Bregman divergences, to compute the distance between data points.

The k-means algorithm (Lloyd, 1982) is a technique which iteratively improves some initial solution, to find a clustering which lies at a local minimum of the objective function. A downside of the k-means algorithm is that the quality of the solution strongly depends on the initially provided solution. The k-means++ initialization (Arthur und Vassilvitskii, 2007) finds such an initial solution, whose expected cost is only a logarithmic factor worse than the optimal cost.

## Fuzzy k-means

One generalization of the k-means problem consists of relaxing the binary assignment of points to clusters. Instead, each data point is assigned to each cluster to some degree of membership. Similar to the k-means algorithm, there is an algorithm for the fuzzy k-means problem, which finds a local minimum of the objective function.

## Clustering with Gaussian Mixtures

In this flavor of clustering problems, we combine the partial assignment of points with Gaussian distributions as representatives. This problem is well-known as the Maximum Likelihood Estimation (MLE) problem for Gaussian mixtures. The Expectation Maximization (EM) algorithm (Dempster, 1977) is a generalization of the k-means algorithms. However, the EM algorithm does not necesarrily find a local minimum of the objective function. Thus, it is even more important to find good initial solutions.

## Our Results

- A new theoretical and empirical analysis of a randomized variant of the EM algorithm for Gaussian mixtures, the Stochastic Expectation Maximization (SEM) algorithm. Assuming certain conditions, the EM and SEM algorithm behave similarly, however, the SEM algorithm is almost twice as fast as the EM algorithm.
- An empirical comparison of different initialization techniques for the EM algorithm for Gaussian mixtures.
- Non-representability of optimal fuzzy k-means solutions.
- An algorithm with guaranteed approximation ratio for the fuzzy k-means problem.
- A Coreset construction for the fuzzy k-means problem.

## Open Questions

- Is there a polynomial-time algorithm with a provable approximation ratio for the fuzzy k-means problem?
- Is there a classification of the fuzzy k-means problem into the classical complexity classes?
- Are there initialization techniques with provable guarantees for the EM algorithm with Gaussian mixtures?

### Implementation and Datasets

## Publications

Open list in Research Information System

## 2019

J. Blömer, S. Brauer, K. Bujna, D. Kuntze, Advances in Data Analysis and Classification (2019)

S. Brauer, 2019

S. Brauer, Theoretical Computer Science (2019), 754, pp. 88-106

## 2018

J. Blömer, S. Brauer, K. Bujna, in: 29th International Symposium on Algorithms and Computation (ISAAC 2018), Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2018, pp. 46:1--46:12

## 2017

S. Brauer, in: Lecture Notes in Computer Science, Springer International Publishing, 2017, pp. 116-127

K. Bujna, Universität Paderborn, 2017

## 2016

J. Blömer, S. Brauer, K. Bujna, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 805-810

J. Blömer, K. Bujna, in: Advances in Knowledge Discovery and Data Mining, Springer International Publishing, 2016, pp. 296-308

J. Blömer, S. Brauer, K. Bujna, 2016

## 2014

S. Brauer, Master's thesis, 2014

M.R. Ackermann, J. Blömer, D. Kuntze, C. Sohler, Algorithmica (2014), 69

J. Blömer, K. Bujna, D. Kuntze, in: 2014 22nd International Conference on Pattern Recognition, IEEE, 2014

## 2013

D. Kuntze, Universität Paderborn, 2013

## 2012

M.R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, C. Sohler, ACM, 2012

## 2011

M.R. Ackermann, J. Blömer, C. Scholz, 2011

## 2010

M.R. Ackermann, J. Blömer, in: SWAT 2010, Springer Berlin Heidelberg, 2010, pp. 212-223

M.R. Ackermann, J. Blömer, C. Sohler, ACM Trans. Algorithms (2010)(4), pp. 59:1--59:26

A. Krueger, V. Leutnant, R. Haeb-Umbach, M. Ackermann, J. Blömer, Proc. of ITG Fachtagung Sprachkommunikation. ITG, Bochum, Germany (2010)

## 2009

M.R. Ackermann, J. Blömer, in: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2009, pp. 1088-1097

M.R. Ackermann, Universität Paderborn, 2009