# Clustering Algorithms

# Organisation

The course material, exercises, and news are distributed via PANDA.

# Topics

In this course we are going to study an important tool to analyze collected data: clustering. Clustering is the process of dividing data into useful or sensible groups. A sensible division should resemble the data's natural structure. Sometimes the goal is that each cluster should contain as many items of a similar kind as possible ( for example in data compression). Clustering is a very natural way to analyze and structure data. Especially in natural sciences we are working with data whose structure is unknown to us. An example is the human DNA, that humankind is trying to decode. Clustering can be a very powerful tool in such cases.

### Preliminary List of Topics

- Distances
- Hierarchical Clustering
- k-Means
- k-Means++ Initialization
- Constant Factor Approximation of k-Means
- Constant Factor Approximation of k-Medians
- Coresets
- Dimension Reduction Techniques
- BIRCH
- DBSCAN

# Dates and Times

### Lecture

- Fridays, 08:15 - 10:45 in F0.530

### Tutorial

- Mondays, 14:15 - 15:45 in F1.110

The first lecture will be held on 11/10, the first tutorial on 14/10.

Due to scheduling difficulties, the lecture and tutorial will sometimes switch time slots. Please follow the respective announcements in PANDA.

# Exercises

There are two types of exercises: Type 1 and type 2. Both are submitted and graded independent of one another.

### Type 1 Exercises

Type 1 exercises deal with one specific lecture and are designed for repetition and comprehension of its content. They can be solved within a short timeframe, assuming you've already put some work towards understanding the lecture content.

Type 1 exercises are released every week and are due one week later. They are mandatory, i.e. you need to achieve a certain percentage of points in order to take the exam. Type 1 exercises must be solved by you individually, they are not group exercises.

### Type 2 Exercises

Type 2 exercises are designed to deepen your understanding by asking advanced questions (potentially spanning multiple lectures). These are more time-intensive and may require you to come up with the right ideas. For this, you'll have more time.

Type 2 exercises are released irregularly and you will have 2-3 weeks to solve them. Type 2 exercises are not mandatory, but solving them grants a bonus for your final grade. Type 2 exercises can be solved in groups of up to 2 students.

# Exam

The exam will be an oral block exam. For this, we will (later in the semester) announce two blocks where you can take your oral exam.

### Requirements

As a Studienleistung, you need to pass at least 70% of the weekly type 1 exercises. A weekly exercise is considered "passed" if you reach at least 50% of its points. You necessarily need the Studienleistung in order to participate in the exam.

### Bonus System

If you achieved at least 75% of all points in type 2 exercises, and you pass the exam, then your grade will improve by one grade step.

# Literature

Bishop, C. M. (2006). *Pattern Recognition and Machine Learning.* Springer-Verlag New York Inc. ISBN 978-0-387-31073-2

MacKay, D. (2003) *Information Theory, Inference, and Learning Algorithms. *Cambridge University Press. ISBN 978-0-521-64298-9

Arthur, D. and Vassilvitskii, S. (2007). k-Means++: The Advantages of Careful Seeding. In *Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.*

Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y. (2004). A Local Search Approximation Algorithm for k-Means Clustering. *Computational Geometry*, 28(2-3)* *DOI 10.1145/513400.513402

Charikar M., Guha S., Tardos É., and Shmoys D. B. (2002). A Constant-Factor Approximation Algorithm for the k-Median Problem. *Journal of Computer and System Sciences, *65(1) DOI 10.1006/jcss.2002.1882