Cluster analysis

Clustering is the division of a set of objects into groups (clusters) so that objects within a cluster are similar to each other, while objects in different clusters differ from each other. To do this, however, it is first necessary to define what is meant by similarity.
Let's look at a map, for example. If we interpret the marked locations as our set of objects, then we can agree that the similarity of objects is high if the distance between the locations is small. Clustering then corresponds to dividing the map into urban centres.
Clustering algorithms attempt to compute the best possible clusterings, i.e. clusterings with the lowest possible cost. To this end, a cost function must also be chosen that evaluates the quality of a computed clustering.

k-means

One of the best-known clustering problems is the partitioning of data points into a fixed number of clusters. Each cluster has a clearly defined representative. The cost of a clustering is then the sum of the distances of all data points to their respective representatives. As the distance measure, one usually uses the squared Euclidean distance or, more generally, a Bregman divergence.
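As an illustrative sketch (not code from this project), this cost function can be computed as follows; the function name `kmeans_cost` and the toy data are made up for this example, and the squared Euclidean distance is used as the distance measure:

```python
import numpy as np

def kmeans_cost(points, centers):
    """k-means cost: each data point contributes the squared Euclidean
    distance to its nearest representative (center)."""
    # Pairwise squared distances, shape (n_points, n_centers).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point is charged to its nearest center.
    return d2.min(axis=1).sum()

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
print(kmeans_cost(points, centers))  # 1 + 1 + 0 = 2.0
```

For a Bregman divergence, the inner squared-distance computation would be replaced by the corresponding divergence.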
The k-means algorithm (Lloyd, 1982) iteratively improves a given solution and thereby determines a clustering that is a local minimum of the cost function. One problem with the k-means algorithm is that the quality of the solution it finds depends strongly on the initial solution. The k-means++ initialisation (Arthur and Vassilvitskii, 2007) can be used to compute an initial solution whose expected cost is within a logarithmic factor of the optimal cost.
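A minimal sketch of both ideas, using only NumPy; the function names `kmeans_pp_init` and `lloyd` and the toy data are illustrative, not taken from this project:

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers, dtype=float)

def lloyd(points, centers, iters=20):
    """Lloyd's iteration: alternately assign each point to its nearest
    center and move each center to the mean of its assigned points."""
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                 rng.normal(5.0, 0.1, (20, 2))])
centers, labels = lloyd(pts, kmeans_pp_init(pts, 2, rng))
```

On two well-separated point clouds like these, the k-means++ seeding almost surely places one initial center in each cloud, after which Lloyd's iteration converges to the natural clustering.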

Fuzzy k-means

A generalisation of the k-means problem is obtained by softening the fixed assignment of the data points to the clusters: each data point is assigned proportionally to every cluster. Analogous to the k-means algorithm, there is an algorithm for the fuzzy k-means problem that determines a local minimum of the cost function.
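The alternating-optimisation scheme for this problem (standard fuzzy c-means updates) can be sketched as follows; the function name, the fuzzifier value m = 2, and the starting centers are assumptions made for this example:

```python
import numpy as np

def fuzzy_kmeans(points, centers, m=2.0, iters=50):
    """Alternating optimisation for the fuzzy k-means problem: membership
    and center updates are repeated in turn, converging to a local minimum
    of the fuzzy cost function."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership of point i in cluster j (first-order optimality condition).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Each center becomes the membership-weighted mean of all points.
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
    return u, centers

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (15, 2)),
                 rng.normal(5.0, 0.1, (15, 2))])
init = np.array([[1.0, 1.0], [4.0, 4.0]])  # hypothetical starting centers
u, centers = fuzzy_kmeans(pts, init)
```

Each row of `u` sums to one, i.e. every data point is distributed proportionally over all clusters.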

Clustering with Gaussian mixtures

This research topic combines the proportional assignment of data points to all clusters with the use of Gaussian distributions as representatives. The problem is also known as the maximum likelihood estimation (MLE) problem for Gaussian mixtures. The Expectation Maximisation (EM) algorithm (Dempster et al., 1977) is a generalisation of the k-means algorithm which, however, does not necessarily find a local optimum. This makes it all the more important to initialise the algorithm with a good solution.
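One EM iteration can be sketched as follows, under the simplifying assumption of spherical Gaussian components (general covariance matrices work analogously); the function name and the toy setup are made up for this example:

```python
import numpy as np

def em_step(points, weights, means, variances):
    """One EM iteration for a mixture of spherical Gaussians.
    E-step: posterior responsibilities; M-step: weighted MLE updates."""
    n, d = points.shape
    d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Log-density of each point under each weighted component.
    log_dens = (np.log(weights)[None, :]
                - 0.5 * d * np.log(2.0 * np.pi * variances)[None, :]
                - d2 / (2.0 * variances)[None, :])
    log_dens -= log_dens.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_dens)
    resp /= resp.sum(axis=1, keepdims=True)  # E-step: responsibilities
    # M-step: maximum-likelihood updates weighted by the responsibilities.
    nk = resp.sum(axis=0)
    weights = nk / n
    means = (resp.T @ points) / nk[:, None]
    d2_new = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    variances = (resp * d2_new).sum(axis=0) / (d * nk)
    return weights, means, variances

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                 rng.normal(5.0, 0.3, (50, 2))])
w, mu, var = np.array([0.5, 0.5]), np.array([[1.0, 1.0], [4.0, 4.0]]), np.array([1.0, 1.0])
for _ in range(30):
    w, mu, var = em_step(pts, w, mu, var)
```

Replacing the soft responsibilities by a hard nearest-center assignment and fixing the variances recovers exactly the k-means update, which is the sense in which EM generalises k-means.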

Our results

  • A new theoretical and experimental analysis of a randomised version of the EM algorithm for Gaussian mixtures, the Stochastic Expectation Maximisation (SEM) algorithm. Under certain conditions, the EM and SEM algorithms behave almost identically, although the SEM algorithm is almost twice as fast.
  • An experimental comparison of different initialisation methods for the EM algorithm for Gaussian mixtures.
  • Non-representability of optimal fuzzy k-means solutions.
  • An algorithm with guaranteed approximation quality for the fuzzy k-means problem.
  • A coreset construction for the fuzzy k-means problem.
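The stochastic E-step that distinguishes SEM from EM can be sketched as follows (again for spherical Gaussians, with illustrative names): instead of EM's soft responsibilities, each point samples a single component from its posterior, and the M-step reduces to per-cluster maximum-likelihood estimates, which is one reason an SEM iteration can be cheaper than an EM iteration.

```python
import numpy as np

def sem_step(points, weights, means, variances, rng):
    """One SEM iteration for a mixture of spherical Gaussians: each point
    samples a single component from its posterior (stochastic E-step);
    the M-step then uses simple per-cluster MLE on the hard assignment."""
    weights, means, variances = weights.copy(), means.copy(), variances.copy()
    n, d = points.shape
    d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_post = (np.log(weights)[None, :]
                - 0.5 * d * np.log(variances)[None, :]
                - d2 / (2.0 * variances)[None, :])
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Stochastic E-step: draw one component per point.
    z = np.array([rng.choice(len(weights), p=p) for p in post])
    # M-step on the sampled hard assignment.
    for j in range(len(weights)):
        mask = z == j
        if mask.any():
            weights[j] = mask.mean()
            means[j] = points[mask].mean(axis=0)
            variances[j] = ((points[mask] - means[j]) ** 2).sum() / (d * mask.sum())
    return weights, means, variances, z

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                 rng.normal(5.0, 0.3, (50, 2))])
w, mu, var = np.array([0.5, 0.5]), np.array([[1.0, 1.0], [4.0, 4.0]]), np.array([1.0, 1.0])
for _ in range(20):
    w, mu, var, z = sem_step(pts, w, mu, var, rng)
```

On well-separated data like this, the posteriors are nearly hard, so the sampled assignments coincide with EM's soft assignments almost everywhere, illustrating the regime in which EM and SEM behave almost identically.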

Open questions

  • Is there a polynomial-time algorithm with guaranteed performance for the fuzzy k-means problem?
  • Can the fuzzy k-means problem be classified into one of the classical complexity classes?
  • Are there provably good initialisation methods for the EM algorithm for Gaussian mixtures?

Implementations and data sets

Publications

A Complexity Theoretical Study of Fuzzy K-Means

J. Blömer, S. Brauer, K. Bujna, ACM Transactions on Algorithms 16 (2020) 1–25.


How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models

J. Blömer, S. Brauer, K. Bujna, D. Kuntze, Advances in Data Analysis and Classification 14 (2020) 147–173.



Classification and Approximation of Geometric Location Problems

S. Brauer, Classification and Approximation of Geometric Location Problems, Universität Paderborn, 2019.


Coresets for Fuzzy K-Means with Applications

J. Blömer, S. Brauer, K. Bujna, in: 29th International Symposium on Algorithms and Computation (ISAAC 2018), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018, pp. 46:1–46:12.


Soft Clustering Algorithms - Theoretical and Practical Improvements

K. Bujna, Soft Clustering Algorithms - Theoretical and Practical Improvements, Universität Paderborn, 2017.


Complexity of Single-Swap Heuristics for Metric Facility Location and Related Problems

S. Brauer, in: D. Fotakis, A. Pagourtzis, V.T. Paschos (Eds.), Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 116–127.


Hard-Clustering with Gaussian Mixture Models

J. Blömer, S. Brauer, K. Bujna, (2016).


Adaptive Seeding for Gaussian Mixture Models

J. Blömer, K. Bujna, in: Advances in Knowledge Discovery and Data Mining, Springer International Publishing, Cham, 2016, pp. 296–308.


A Theoretical Analysis of the Fuzzy K-Means Problem

J. Blömer, S. Brauer, K. Bujna, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 805–810.


A Probabilistic Expectation Maximization Algorithm for Multivariate Laplacian Mixtures

S. Brauer, A Probabilistic Expectation Maximization Algorithm for Multivariate Laplacian Mixtures, 2014.


Analysis of Agglomerative Clustering

M.R. Ackermann, J. Blömer, D. Kuntze, C. Sohler, Algorithmica 69 (2014).


A Theoretical and Experimental Comparison of the EM and SEM Algorithm

J. Blömer, K. Bujna, D. Kuntze, in: 2014 22nd International Conference on Pattern Recognition, IEEE, 2014.


Practical algorithms for clustering and modeling large data sets - Analysis and improvements

D. Kuntze, Practical Algorithms for Clustering and Modeling Large Data Sets - Analysis and Improvements, Universität Paderborn, 2013.


StreamKM++: A clustering algorithm for data streams

M.R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, C. Sohler, ACM Journal of Experimental Algorithmics 17 (2012).


Hardness and Non-Approximability of Bregman Clustering Problems

M.R. Ackermann, J. Blömer, C. Scholz, (2011).


Bregman Clustering for Separable Instances

M.R. Ackermann, J. Blömer, in: SWAT 2010, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 212–223.


Clustering for Metric and Nonmetric Distance Measures

M.R. Ackermann, J. Blömer, C. Sohler, ACM Transactions on Algorithms (2010) 59:1–59:26.


On the initialization of dynamic models for speech features

A. Krueger, V. Leutnant, R. Haeb-Umbach, M. Ackermann, J. Blömer, in: Proc. of ITG Fachtagung Sprachkommunikation, ITG, Bochum, Germany, 2010.


Algorithms for the Bregman k-Median Problem

M.R. Ackermann, Algorithms for the Bregman K-Median Problem, Universität Paderborn, 2009.


Coresets and Approximate Clustering for Bregman Divergences

M.R. Ackermann, J. Blömer, in: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2009, pp. 1088–1097.

