This lab is dedicated to the k-means clustering method. In words, k-means takes as input a collection of points in \(\mathbb{R}^d\) (a numerical dataset) and a positive integer \(k\). It returns a collection of \(k\) points (the centers) from \(\mathbb{R}^d\). The centers define a Voronoï tesselation/partition/diagran of \(\mathbb{R}^d\). The Voronoï cells define a clustering of the original dataset.
In the next chunk, we generate a Voronoï diagram on \(\mathbb{R}^2\) with \(100\) cells defined from \(100\) random points drawn from the uniform distribution on a square. Function stat_voronoi() comes from ggvoronoi
Two adjacent Voronoï cells are separated by a (possibly semi-infinite) line segment
Let the so-called centers be denoted by \(c_1, \ldots, c_n\). They form the codebook\(\mathcal{C}\).
The Voronoï cell with center\(c_i\) is defined by \[\left\{x : x \in \mathbb{R}^d, \qquad \| x- c_i \|_2 = \min_{j \leq n} \|x -c_j\|_2\right\}\]
The center of a Voronoï cell is usually not its barycenter
k-means objective function
The \(k\)-means algorithm aims at building a codebook\(\mathcal{C}\) that minimizes \[\mathcal{C} \mapsto \sum_{i=1}^n \min_{c \in \mathcal{C}} \Vert X_i - c\Vert_2^2\] over all codebooks with given cardinality
If \(c \in \mathcal{C}\) is the closest centroid to \(X \in \mathbb{R}^p\), \[\|c - X\|^2\] is the quantization/reconstruction error suffered when using codebook \(\mathcal{C}\) to approximate \(X\)
If there are no restrictions on the dimension of the input space, on the number of centroids, or on sample size, computing an optimal codebook is a \(\mathsf{NP}\) -hard problem
kmeans() is a wrapper for a collection of Algorithms that look like the Lloyd algorithm
Initialize by sampling from the data
k-Mean++ try to take them as separated as possible.
Iterate the two phases until ?
No guarantee to converge to a global optimum!
Proceeed by trial and error.
Repeat and keep the best result.
Iris data
Run kmeans() on the projection of the Iris dataset on the Sepal plane. We look for a partition into three cells.
Use broom::augment() and broom::tidy() to prepare two dataframes that will allow you to overlay a scatterplot of the dataset and a Voronoï diagram defined by the centers outpu by kmeans().