
AG Codes and Cryptography

Cluster Analysis

A clustering is a partition of a set of objects into groups (clusters), such that objects in the same cluster are similar to one another, while objects in different clusters differ substantially. As a first step, we have to define a notion of similarity.
As an example, consider a map. If we interpret locations marked on the map as a set of objects, then we could say that two objects are similar if the distance between them is small. A clustering of the locations would then correspond to a division of the map into conurbations.
Algorithms solving such a clustering problem try to compute a good clustering, that is, a clustering with small cost. So, in addition to defining similarity, we also need to define an objective function that quantifies the quality of a computed clustering.

k-means

One of the most famous clustering problems is the task of partitioning the input data set into a given number of clusters. Additionally, each cluster is assigned some representative. The cost of the clustering is then defined as the sum of the distances of all data points to their respective representative. We often use the squared Euclidean distance, or more generally a Bregman divergence, to measure the distance between data points.
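
To make this precise: for a finite point set $X \subset \mathbb{R}^d$ and a set $C$ of $k$ representatives, the k-means cost under the squared Euclidean distance is

$$\operatorname{cost}(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2 .$$

For a Bregman divergence generated by a strictly convex, differentiable function $\varphi$, the squared Euclidean term is replaced by $d_\varphi(x, c) = \varphi(x) - \varphi(c) - \langle \nabla\varphi(c),\, x - c \rangle$; the squared Euclidean distance is the special case $\varphi(x) = \lVert x \rVert^2$.
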
The k-means algorithm (Lloyd, 1982) iteratively improves an initial solution until it reaches a clustering that is a local minimum of the objective function. A downside of the k-means algorithm is that the quality of the solution strongly depends on the initial solution it is given. The k-means++ initialization (Arthur and Vassilvitskii, 2007) computes an initial solution whose expected cost is only a logarithmic factor worse than the optimal cost.
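
As an illustration, here is a minimal NumPy sketch of Lloyd's algorithm with k-means++ seeding. The function names and parameters are our own choices for this sketch, not part of any particular library.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: pick the first center uniformly at random, then
    each further center with probability proportional to the squared
    distance to the nearest center chosen so far."""
    centers = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(X.shape[0], p=d2 / d2.sum())])
    return np.asarray(centers)

def lloyd(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and moving each center to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    C = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):  # reached a local minimum of the cost
            break
        C = new_C
    return C, labels
```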

Fuzzy k-means

One generalization of the k-means problem consists of relaxing the binary assignment of points to clusters. Instead, each data point is assigned to each cluster with some degree of membership. Similar to the k-means algorithm, there is an algorithm for the fuzzy k-means problem that finds a local minimum of the objective function.
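
A common formalization (the fuzzy c-means objective) minimizes $\sum_{i,j} u_{ij}^m \lVert x_i - c_j \rVert^2$ over memberships $u_{ij} \ge 0$ with $\sum_j u_{ij} = 1$, where the fuzzifier $m > 1$ controls how soft the assignments are. The following is a minimal sketch of the alternating update scheme, assuming NumPy; names and defaults are our own choices.

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0, eps=1e-12):
    """Fuzzy k-means (fuzzy c-means): alternately update memberships U and
    centers C to locally minimize sum_{i,j} U[i,j]**m * ||x_i - c_j||^2."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(iters):
        # Membership update: U[i,j] proportional to d(x_i, c_j)^(-2/(m-1)).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1) + eps
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)
        # Center update: weighted mean of all points, with weights U[i,j]**m.
        Um = U ** m
        new_C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, U
```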

Clustering with Gaussian Mixtures

In this flavor of clustering problems, we combine the partial assignment of points with Gaussian distributions as representatives. This problem is well known as the Maximum Likelihood Estimation (MLE) problem for Gaussian mixtures. The Expectation Maximization (EM) algorithm (Dempster et al., 1977) is a generalization of the k-means algorithm. However, the EM algorithm does not necessarily find a local minimum of the objective function. Thus, it is even more important to find good initial solutions.
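
For illustration, a minimal EM sketch for Gaussian mixtures, assuming NumPy and SciPy; the small regularization added to the covariances is our own choice for numerical stability, not part of the algorithm itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    """EM for a Gaussian mixture: the E-step computes soft responsibilities,
    the M-step re-estimates weights, means, and covariances from them."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: R[i,j] = posterior probability that point i came from component j.
        dens = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                                for j in range(k)])
        R = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood re-estimation given the responsibilities.
        Nj = R.sum(axis=0)
        weights = Nj / n
        means = (R.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return weights, means, covs
```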

Our Results

  • A new theoretical and empirical analysis of a randomized variant of the EM algorithm for Gaussian mixtures, the Stochastic Expectation Maximization (SEM) algorithm. Under certain conditions, the EM and SEM algorithms behave similarly; however, the SEM algorithm is almost twice as fast as the EM algorithm (see the sketch after this list).
  • An empirical comparison of different initialization techniques for the EM algorithm for Gaussian mixtures.
  • Non-representability of optimal fuzzy k-means solutions: optimal solutions cannot, in general, be expressed by radicals over the input points.
  • An algorithm with guaranteed approximation ratio for the fuzzy k-means problem.
  • A coreset construction for the fuzzy k-means problem.
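
Regarding the first result: the SEM algorithm replaces EM's deterministic E-step by sampling, for each point, a single component from its responsibility distribution, and then runs the M-step on the resulting hard assignment. Below is a minimal sketch of this stochastic step, with hypothetical names; the M-step can be reused unchanged on the one-hot responsibilities.

```python
import numpy as np

def sem_e_step(R, rng):
    """Stochastic E-step: draw one component z_i ~ Categorical(R[i, :]) per
    point instead of keeping the full soft responsibility matrix R."""
    n, k = R.shape
    cum = R.cumsum(axis=1)
    # Inverse-CDF sampling, vectorized over all n points.
    z = (rng.random((n, 1)) < cum).argmax(axis=1)
    R_hard = np.zeros_like(R)
    R_hard[np.arange(n), z] = 1.0  # one-hot hard assignment
    return R_hard
```

Since each point then contributes to only one component, an implementation can exploit this sparsity in the M-step, which is consistent with the speed-up over EM mentioned above.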

Open Questions

  • Is there a polynomial-time algorithm with a provable approximation ratio for the fuzzy k-means problem?
  • Can the fuzzy k-means problem be classified into the classical complexity classes?
  • Are there initialization techniques with provable guarantees for the EM algorithm with Gaussian mixtures?

Publications



2019

How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models

J. Blömer, S. Brauer, K. Bujna, D. Kuntze, Advances in Data Analysis and Classification (2019)




Complexity of single-swap heuristics for metric facility location and related problems

S. Brauer, Theoretical Computer Science (2019), 754, pp. 88-106



2018

Coresets for Fuzzy K-Means with Applications

J. Blömer, S. Brauer, K. Bujna, in: 29th International Symposium on Algorithms and Computation (ISAAC 2018), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018, pp. 46:1-46:12



2017

Complexity of Single-Swap Heuristics for Metric Facility Location and Related Problems

S. Brauer, in: Lecture Notes in Computer Science, Springer International Publishing, 2017, pp. 116-127

Metric facility location and K-means are well-known problems of combinatorial optimization. Both admit a fairly simple heuristic called single-swap, which adds, drops or swaps open facilities until it reaches a local optimum. For both problems, it is known that this algorithm produces a solution that is at most a constant factor worse than the respective global optimum. In this paper, we show that single-swap applied to the weighted metric uncapacitated facility location and weighted discrete K-means problem is tightly PLS-complete and hence has exponential worst-case running time.



2016

A Theoretical Analysis of the Fuzzy K-Means Problem

J. Blömer, S. Brauer, K. Bujna, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 805-810

One of the most popular fuzzy clustering techniques is the fuzzy K-means algorithm (also known as fuzzy-c-means or FCM algorithm). In contrast to the K-means and K-median problem, the underlying fuzzy K-means problem has not been studied from a theoretical point of view. In particular, there are no algorithms with approximation guarantees similar to the famous K-means++ algorithm known for the fuzzy K-means problem. This work initiates the study of the fuzzy K-means problem from an algorithmic and complexity theoretic perspective. We show that optimal solutions for the fuzzy K-means problem cannot, in general, be expressed by radicals over the input points. Surprisingly, this already holds for simple inputs in one-dimensional space. Hence, one cannot expect to compute optimal solutions exactly. We give the first (1+eps)-approximation algorithms for the fuzzy K-means problem. First, we present a deterministic approximation algorithm whose runtime is polynomial in N and linear in the dimension D of the input set, given that K is constant, i.e. a polynomial time approximation scheme (PTAS) for fixed K. We achieve this result by showing that for each soft clustering there exists a hard clustering with similar properties. Second, by using techniques known from coreset constructions for the K-means problem, we develop a deterministic approximation algorithm that runs in time almost linear in N but exponential in the dimension D. We complement these results with a randomized algorithm which imposes some natural restrictions on the sought solution and whose runtime is comparable to some of the most efficient approximation algorithms for K-means, i.e. linear in the number of points and the dimension, but exponential in the number of clusters.


Adaptive Seeding for Gaussian Mixture Models

J. Blömer, K. Bujna, in: Advances in Knowledge Discovery and Data Mining, Springer International Publishing, 2016, pp. 296-308




2014


Analysis of Agglomerative Clustering

M.R. Ackermann, J. Blömer, D. Kuntze, C. Sohler, Algorithmica (2014), 69



A Theoretical and Experimental Comparison of the EM and SEM Algorithm

J. Blömer, K. Bujna, D. Kuntze, in: 2014 22nd International Conference on Pattern Recognition, IEEE, 2014




2012

StreamKM++: A clustering algorithm for data streams

M.R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, C. Sohler, ACM, 2012



2011

Hardness and Non-Approximability of Bregman Clustering Problems

M.R. Ackermann, J. Blömer, C. Scholz, 2011


2010

Bregman Clustering for Separable Instances

M.R. Ackermann, J. Blömer, in: SWAT 2010, Springer Berlin Heidelberg, 2010, pp. 212-223



Clustering for Metric and Nonmetric Distance Measures

M.R. Ackermann, J. Blömer, C. Sohler, ACM Trans. Algorithms (2010), 6(4), pp. 59:1-59:26



On the initialization of dynamic models for speech features

A. Krueger, V. Leutnant, R. Haeb-Umbach, M. Ackermann, J. Blömer, in: Proc. of ITG Fachtagung Sprachkommunikation, ITG, Bochum, Germany, 2010


2009

Coresets and Approximate Clustering for Bregman Divergences

M.R. Ackermann, J. Blömer, in: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2009, pp. 1088-1097



Algorithms for the Bregman k-Median Problem

M.R. Ackermann, Universität Paderborn, 2009


