name: inter-slide class: left, middle, inverse {{content}} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Clustering and k-means ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true template: inter-slide # Exploratory Data Analysis VIII k-Means ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: middle, inverse ##
### [Clustering problem](#clusterPb) ### [Kleinberg's Theorem](#kleinberg) ### [Flavors of clustering](#flavors) ### [_k_-Means](#kmeans) ### [_k_-Means and Quantization](#quantization) --- name:clusterPb template: inter-slide ## Clustering problems --- ### In words Clustering consists in _partitioning_ a collection of points from some metric space in such a way that - points within the same group are close enough while - points from different groups are distant ??? In the background: some notion of distance/similarity --- ### Clustering in ML applications Clustering shows up in many Machine Learning applications, for example: -
__Marketing__: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records -
__Biology__: classification of plants and animals given their features -
__Bookshops__: book ordering (recommendation) -
__Insurance__: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds -
__City-planning__: identifying groups of houses according to their type, value and geographical location -
__Internet__: document classification; clustering weblog data to discover groups of similar access patterns; topic modeling ??? Many distinct goals: clustering is often just one step in a data analysis pipeline For recommendation systems, marketing, objects that fit into the same group call for the same action Some clustering should be hierarchical (taxonomy in life sciences), others can just be flat --- A clustering application relies on the elaboration of a _metric/dissimilarity_ over some input space This task is entangled with _feature engineering_ Focus on one specific context: _market segmentation_
-
__Data__: Base of customer data containing their properties and past buying records -
__Goal__: Use the customers' *similarities* to find groups - __Possible directions:__ + Dimension reduction (PCA, CA, MCA, ...) + __Clustering__ `\(\approx\)` _non-supervised classification_ ??? Are the directions complementary or not? Clustering may be done before dimension reduction or the other way around --- ### 
Dimension reduction Dimension reduction technologies start from: - Training data `\(\mathcal{D}=\{\vec{X}_1,\ldots,\vec{X}_n\} \in \mathcal{X}^n\)` (i.i.d. `\(\sim \Pr\)`) - Space `\(\mathcal{X}\)` of possibly high dimension. and elaborate a _Dimension Reduction Map_ Dimension reduction technologies construct a map `\(\Phi\)` from the space `\(\mathcal{X}\)` into a space `\(\mathcal{X}'\)` of __smaller dimension__ --- ###
Clustering techniques Clustering techniques start from _training data_: `$$\mathcal{D}=\{\vec{X}_1,\ldots,\vec{X}_n\} \in \mathcal{X}^n$$` assuming `\(\vec{X}_i \sim_{\text{i.i.d.}} \Pr\)`, and partition the data into (latent?) groups. Clustering techniques construct a map `\(f\)` from `\(\mathcal{D}\)` to `\(\{1,\ldots,K\}\)`, where `\(K\)` is the number of classes, to be fixed: `\(f: \quad \vec{X}_i \mapsto k_i\)` --- ### Dimension reduction and clustering may be combined For example, it is commonplace to first perform PCA, project the data on the leading principal components, and then perform `\(k\)`-means clustering on the projected data Clustering tasks may be motivated along different directions: - The search for an interpretation of groups - Use of groups in further processing (prediction, ...) ??? This is especially true as many clustering approaches suffer from the curse of dimensionality --- ### Good clustering We need to define the __quality of a cluster__
Unfortunately, no obvious quality measure exists!
Clustering quality may be assessed by scrutinizing - _Inner homogeneity_: samples in the same group should be similar - _Outer inhomogeneity_: samples in two different groups should be different. --- ### Shades of similarity There are many possible definitions of _similar_ and _different_ Often, they are based on the distance between the samples Examples based on the (squared) Euclidean distance: - Inner homogeneity `\(\approx\)` intra class variance/inertia, - Outer inhomogeneity `\(\approx\)` inter class variance/inertia. Remember that, in flat clustering, the choice of the number `\(K\)` of clusters is often delicate --- template: inter-slide name: kleinberg ## Kleinberg's theorem --- ###
- Clustering is not a single method - Clustering methods address a large range of problems. ??? Turning this informal statement into a formal definition proves challenging. --- ### Definition Clustering function Define a _clustering function_ `\(F\)` as a function that - takes as input any finite domain `\(\mathcal{X}\)` with a dissimilarity function `\(d\)` over its pairs and - returns a partition of `\(\mathcal{X}\)` --- ### Desirable properties A clustering function should ideally satisfy the next three properties 1. _Scale Invariance_. For any domain set `\(\mathcal{X}\)`, dissimilarity function `\(d\)`, and any `\(\alpha>0\)`, the following should hold: `\(F(\mathcal{X},d) = F(\mathcal{X},\alpha d)\)`. 2. _Richness_ For any finite `\(\mathcal{X}\)` and every partition `\(C = (C_1,\ldots,C_k)\)` of `\(\mathcal{X}\)` (into nonempty subsets) there exists some dissimilarity function `\(d\)` over `\(\mathcal{X}\)` such that `\(F(\mathcal{X},d)=C\)`. 3. _Consistency_ If `\(d\)` and `\(d'\)` are dissimilarity functions over `\(\mathcal{X}\)`, such that for all `\(x, y \in \mathcal{X}\)`, + if `\(x,y\)` belong to the same cluster in `\(F(\mathcal{X},d)\)` then `\(d'(x,y) \leq d(x,y)\)`, + if `\(x,y\)` belong to different clusters in `\(F(\mathcal{X},d)\)` then `\(d'(x,y) \geq d(x,y)\)`, then `\(F(\mathcal{X},d) = F(\mathcal{X},d')\)`. ---
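To make the first property concrete, here is a minimal sketch (an illustration added for this write-up, not part of Kleinberg's argument): for `\(k\)`-means with a fixed `\(k\)` and the Euclidean distance, rescaling the data by `\(\alpha > 0\)` should leave the partition unchanged when the random initializations are matched.

```r
# Minimal sketch: empirical check of Scale Invariance for k-means with fixed k.
# With the same seed, both runs draw the same initial centers (up to scaling),
# so the resulting partitions are expected to coincide (barring floating-point ties).
set.seed(2021)
km_raw <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
set.seed(2021)
km_scaled <- kmeans(10 * iris[, 1:4], centers = 3, nstart = 10)
identical(km_raw$cluster, km_scaled$cluster)  # expected: TRUE
```

Note that a clustering function with a fixed number of classes cannot satisfy _Richness_: it can only output partitions with exactly `\(k\)` classes.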
Designing clustering functions meeting simultaneously _any two_ of the _three_ properties is doable but
The three reasonable goals are _conflicting_ ### Kleinberg's impossibility theorem .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ **No** clustering function `\(F\)` satisfies simultaneously all three properties: - _Scale Invariance_, - _Richness_, and - _Consistency_ ] --- template: inter-slide name: flavors ## Flavors of clustering --- ### Flat/Hierarchical and ... A wide variety of clustering methods have been used in Statistics and Machine Learning. - __Flat clustering (for example `\(k\)`-means)__ partitions the sample into a fixed number of classes (usually denoted by `\(k\)`). The partition is determined by some algorithm .f6[The ultimate objective is to optimize some cost function. Whether the objective is achieved or even approximately achieved using a reasonable amount of computational resources is not settled] - __Model based clustering__ is based on a generative model: data are assumed to be sampled from a specific model (usually finite mixtures of Gaussians; the model may or may not be parametric) .f6[Clustering consists in fitting such a mixture model and then assigning sample points to mixture components] - _Hierarchical clustering_ is the topic of the next lesson --- ### Carte du tendre .fl.w-30.f6[In Machine Learning, `\(k\)`-means and hierarchical clustering belong to a range of tasks called _non-supervised learning_ This contrasts with regression which belongs to the realm of _supervised learning_ ] .fl.w-70[ ![](https://scikit-learn.org/stable/_static/ml_map.png) ] --- template: inter-slide name: kmeans ## _k_-means --- The `\(k\)`-means algorithm is an iterative method that constructs a sequence of Voronoï partitions A Voronoï diagram draws the nearest neighbor regions around a set of points. ### Definition: Voronoï partitions Assume: - sample `\(X_1, \ldots, X_n\)` from `\(\mathbb{R}^p\)` - `\(\mathbb{R}^p\)` is endowed with a metric `\(d\)`, usually `\(\ell_2\)`, sometimes a weighted `\(\ell_2\)` distance or `\(\ell_1\)` Each cluster is defined by a _centroid_ The collection of centroids is (sometimes) called the _codebook_ `\(\mathcal{C}=\{c_1, \ldots, c_k\}\)` Each centroid `\(c_j\)` defines a class: `$$C_j = \bigg\{ X_i : d(X_i, c_j) = \min_{j' \leq k} d(X_i, c_{j'})\bigg\}$$` and more generally a _Voronoï cell_ in `\(\mathbb{R}^p\)` `$$C_j = \bigg\{ x : x \in \mathbb{R}^p, d(x, c_j) = \min_{j' \leq k} d(x, c_{j'})\bigg\}$$` --- ### A Voronoï tessellation <!-- .middle.center[![](./img/voronoi.png)] --> .fl.w-70.pa2[ <img src="cm-8-EDA_files/figure-html/unnamed-chunk-4-1.png" width="504" /> ] .fl.w-30.pa2.f6[ #### Euclidean distance, dimension 2 A Voronoï tessellation generated by `\(100\)` points picked at random on the grid `\(\{1,\ldots, 200\}^2\)` Note that cell boundaries are line segments Note that centroids may lie close to boundaries The position of the centroid of a Voronoï cell depends on the positions of the centroids of the neighboring cells ] --- ### A Voronoï partition for the projected Iris dataset .fl.w-30.pa2[ The black points marked with a cross define three centroids. The straight lines delimit the Voronoï cells defined by the three centroids. The colored points come from the Iris dataset: each point is colored according to the cell it belongs to.
] .fl.w-70.pa2[ .panelset[ .panel[.panel-name[Code] ```r data(iris) pacman::p_load(ggvoronoi) kms <- kmeans(iris[,1:2], 3) df_centers <- as.data.frame(kms$centers) %>% tibble::rownames_to_column(var=".cluster") broom::augment(kms, iris) %>% ggplot() + aes(x=Sepal.Length, y=Sepal.Width, colour=.cluster) + geom_point(aes()) + * stat_voronoi(data = df_centers, geom="path", outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5)) ) + * geom_point(data = df_centers, colour = "black", shape="+", size=5) + coord_fixed() + labs(col="Voronoï cells") + ggtitle("Kmeans over Iris dataset, k=3") ``` ] .panel[.panel-name[Plot] ![](cm-8-EDA_files/figure-html/voronoi-1.png) ] ] ] --- exclude: true ```r geom_sugar <- function(df_centers, species=TRUE){ list( if (species) geom_point(aes(shape=Species, color=.cluster)) else geom_point(aes(color=.cluster)), stat_voronoi(data = df_centers, geom="path", outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5)) ), * geom_point(data = df_centers, colour = "black", shape="+", size=5), coord_fixed(), labs(col="Voronoï cells") )} broom::augment(kms, iris) %>% ggplot(aes(x=Sepal.Length, y=Sepal.Width)) + geom_sugar(df_centers, species=FALSE) ``` <img src="cm-8-EDA_files/figure-html/unnamed-chunk-5-1.png" width="504" /> --- ### _k_-means objective function The `\(k\)`-means algorithm aims at building a _codebook_ `\(\mathcal{C}\)` that minimizes `$$\mathcal{C} \mapsto \sum_{i=1}^n \min_{c \in \mathcal{C}} \Vert X_i - c\Vert_2^2$$` over all codebooks with given cardinality If `\(c \in \mathcal{C}\)` is the closest centroid to `\(X \in \mathbb{R}^p\)`, `$$\|c - X\|^2$$` is the _quantization/reconstruction error_ suffered when using codebook `\(\mathcal{C}\)` to approximate `\(X\)`
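As a minimal sketch (assuming the squared Euclidean distance above; the helper `kmeans_cost()` is hypothetical, not part of any package), the objective can be evaluated directly for any codebook:

```r
# Empirical k-means cost of a codebook: sum over sample points of the squared
# distance to the closest centroid (centers is a k x p matrix, X an n x p table)
kmeans_cost <- function(X, centers) {
  X <- as.matrix(X)
  d2 <- sapply(seq_len(nrow(centers)),
               function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  sum(apply(d2, 1, min))
}

kms <- kmeans(iris[, 1:2], centers = 3, nstart = 20)
kmeans_cost(iris[, 1:2], kms$centers)  # at convergence, should match kms$tot.withinss
```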
If there are no restrictions on the dimension of the input space, on the number of centroids, or on sample size, computing an optimal codebook is a `\(\mathsf{NP}\)` -hard problem ??? `\(k\)`-means has a lot to do with rate-distortion coding --- ### `\(k\)`-means at work .fl.w-30.pa2[ We may figure out what an optimized Voronoï partition looks like on the Iris dataset `kmeans` with `\(k=3\)` on the Iris dataset Function `kmeans` is run with default arguments We chose the `Sepal` plane for clustering and visualization This is arbitrary. We could have chosen a `Petal` plane, a `Width` plane, or a plane defined by principal axes. ] .fl.w-70.pa2[ .panelset[ .panel[.panel-name[Code] ```r kms <- kmeans(select(iris, Sepal.Length, Sepal.Width), 3) broom::augment(kms, iris) %>% ggplot() + geom_point(aes(x=Sepal.Length, y=Sepal.Width, shape=Species, col=.cluster)) + *geom_point(data=data.frame(kms$centers), aes(x=Sepal.Length, y=Sepal.Width), shape='+', size=5) + *stat_voronoi(data = as.data.frame(kms$centers), aes(x=Sepal.Length,y=Sepal.Width), geom="path", outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5))) -> p p + ggtitle("K-means with k=3", "Iris data") + labs(col="Clusters") ``` ] .panel[.panel-name[Plot] ![](cm-8-EDA_files/figure-html/iriskmeans3-1.png) ]]] --- ### A `\(k\)`-means clustering is completely characterized by the `\(k\)` centroids Once centroids are known, clusters can be recovered by searching the closest centroid for each sample point (that is by delimiting the Voronoï cells). - How can we assess the _quality_ of a `\(k\)`-means clustering? - Can we compare the clusterings achieved by picking different values of `\(k\)`? There is no obvious assessment criterion! ??? The _quality_ of a clustering can be appreciated according to a wide variety of performance indicators - Distortion: this is the `\(k\)`-means cost - Shape of clusters - Relevance of clusters - Stability: does clustering depend on few points? --- ### Caveat When visualizing `\(k\)`-means clustering on `Iris` data, we are cheating.
We have a gold standard classification delivered by botanists The botanists' classification can be challenged: we can compare a classification originating from _phenotypes_ (appearance) with a classification based on _phylogeny_ (comparing DNAs)
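A minimal sketch (added here, not in the original slides) of how this gold standard can be exploited: cross-tabulating the cluster labels against `Species` shows how far a `\(k\)`-means partition agrees with the botanists' classification.

```r
# Hypothetical check: compare k-means clusters with the botanists' labels
set.seed(42)
kms <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3, nstart = 20)
table(cluster = kms$cluster, species = iris$Species)
# setosa is typically recovered cleanly; versicolor and virginica overlap in the Sepal plane
```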
--- ### Summarising a `\(k\)`-means clustering .fl.w-50.pa2[ A clustering can be summarized and illustrated. In R, a meaningful summary is provided by the generic function `summary()`, and a `tidy` summary is provided by `broom::tidy(...)` ```r select(iris, Sepal.Length, Sepal.Width) %>% kmeans(centers = 3) %>% * broom::tidy() %>% knitr::kable(format = "markdown", digits = 2) -> t ``` ] .fl.w-50.pa2[ .f6[ | Sepal.Length| Sepal.Width| size| withinss|cluster | |------------:|-----------:|----:|--------:|:-------| | 6.81| 3.07| 47| 12.62|1 | | 5.01| 3.43| 50| 13.13|2 | | 5.77| 2.69| 53| 11.30|3 | The concise summary tells us the number of points that are assigned to each cluster, and the Within Sum of Squares (WNSS). It says something about inner homogeneity and (apparently) nothing about outer homogeneity ] ] --- ###
How should we pick `\(k\)`? .fl.w-30.pa2.f6[ Even if we could compute a provably optimal codebook for each `\(k\)`, choosing `\(k\)` would not be obvious A common recipe consists of plotting within clusters sum of squares (`WNSS`) against `\(k\)` Within clusters sum of squares (WNSS) decreases sharply between `\(k=2\)` and `\(k=3\)` For larger values of `\(k\)`, the decay is much smaller. The _elbow_ rule of thumb suggests to choose `\(k=3\)`. ] .fl.w-70.pa2[ .panelset[ .panel[.panel-name[Code] ```r require(foreach) foreach (k=2:10, .combine = rbind) %do% { select(iris, Sepal.Length, Sepal.Width) %>% kmeans(centers = k, nstart=32L) %>% * broom::glance() %>% force() } %>% rownames_to_column(var = "k") %>% mutate(k=as.integer(k)+1) -> tmp ``` .f6[ We have run `kmeans` over the Iris dataset, for `\(k\)` in range `\(2, \ldots, 10\)`. For each value of `\(k\)`, we performed `\(32\)` randomized initializations, and chose the partition that minimizes within clusters sum of squares ] ] .panel[.panel-name[Numerical summary] | k| totss| tot.withinss| betweenss| iter| |--:|------:|------------:|---------:|----:| | 2| 130.48| 58.20| 72.27| 1| | 3| 130.48| 37.05| 93.42| 2| | 4| 130.48| 27.97| 102.51| 3| | 5| 130.48| 20.96| 109.52| 3| | 6| 130.48| 17.33| 113.14| 2| | 7| 130.48| 14.76| 115.72| 2| | 8| 130.48| 12.81| 117.67| 3| | 9| 130.48| 11.07| 119.40| 3| | 10| 130.48| 9.77| 120.70| 4| ] .panel[.panel-name[Elbow plot] <img src="cm-8-EDA_files/figure-html/iriswithinss-1.png" width="504" /> ] ]] --- ### Incentive to choose `\(k=4\)`? .fl.w-30.pa2[ Depending on initialization, taking `\(k=4\)` creates a cluster at the boundary between `versicolor` and `virginica` or it may split the `setosa` cluster ] .fl.w-70.pa2[ <div class="figure"> <img src="cm-8-EDA_files/figure-html/iriskmeans4-1.png" alt="(ref:iriskmeans4)" width="504" /> <p class="caption">(ref:iriskmeans4)</p> </div> ] --- ```r broom::tidy(kmeans(select(iris, Sepal.Length, Sepal.Width), 4)) %>% knitr::kable(format = "markdown", digits = 2) ``` | Sepal.Length| Sepal.Width| size| withinss|cluster | |------------:|-----------:|----:|--------:|:-------| | 5.92| 2.75| 53| 8.25|1 | | 5.19| 3.64| 32| 4.63|2 | | 6.88| 3.10| 41| 10.63|3 | | 4.77| 2.89| 24| 4.45|4 | --- ### Initialization matters! .fl.w-40.pa2[ - Initialize by samples. - `k-Mean++` try to take them as separated as possible. - No guarantee to converge to a global optimum! - Trial and error. - Repeat and keep the best result. ] .fl.w-60.pa2[ ```r kmeans(x, # data centers, # initial centroids or number of clusters iter.max = 10, nstart = 1, # number of trials algorithm = c("Hartigan-Wong", # default "Lloyd", #<< old one "Forgy", "MacQueen"), trace=FALSE) ``` ] ??? TODO: one dimensional example with animation (plotly) --- exclude: true ```r kms <- kmeans(select(iris, Sepal.Length, Sepal.Width), centers = 3, iter.max = 100, nstart= 1, trace = 10) ``` ``` ## KMNS(*, k=3): iter= 1, indx=2 ## QTRAN(): istep=150, icoun=33, NCP[1:3]= 245 267 267 ## QTRAN(): istep=300, icoun=34, NCP[1:3]= 361 416 416 ## QTRAN(): istep=450, icoun=12, NCP[1:3]= 361 588 588 ## QTRAN(): istep=600, icoun=1, NCP[1:3]= 699 749 749 ## QTRAN(): istep=750, icoun=17, NCP[1:3]= 835 883 883 ## QTRAN(): istep=900, icoun=38, NCP[1:3]= 1007 1012 1012 ## QTRAN(): istep=1050, icoun=46, NCP[1:3]= 1007 1154 1154 ## KMNS(*, k=3): iter= 2, indx=150 ``` --- ### Lloyd's Algorithm (detailed) for fixed _k_ (naive _k_-means) 1. Initialize Choose `\(k\)` centroids 2. Iterations: Two phases 1. 
(Movement) Assign each sample point to the closest _centroid_, that is, to a class in the Voronoï partition defined by the centroids 1. (Update) For each class in the current Voronoï partition, update the _centroid_ so as to minimize the within-cluster sum of squared distances. ??? From `scikit-learn` documentation > The k-means problem is solved using either Lloyd’s or Elkan’s algorithm. > The average complexity is given by `\(O(k \times n \times T)\)`, where `\(n\)` is the number of samples and `\(T\)` is the number of iterations. > The worst case complexity is given by `\(O(n^{k+2/p})\)` with `\(n = n_{\text{samples}}\)`, `\(p = n_{\text{features}}\)`. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006) > In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in local minima. That’s why it can be useful to restart it several times. > If the algorithm stops before fully converging (because of `tol` or `max_iter`), `labels_` and `cluster_centers_` will not be consistent, i.e. the `cluster_centers_` will not be the means of the points in each cluster. Also, the estimator will reassign `labels_` after the last iteration to make `labels_` consistent with predict on the training set. --- ### Lloyd's iterations <div class="figure"> <img src="cm-8-EDA_files/figure-html/lloyd1-1.png" alt="After 1 step" width="864" /> <p class="caption">After 1 step</p> </div> --- ### Lloyd's iterations (continued) <div class="figure"> <img src="cm-8-EDA_files/figure-html/lloyd5-1.png" alt="After 2 steps" width="864" /> <p class="caption">After 2 steps</p> </div> --- ### Lloyd's iterations (continued) <div class="figure"> <img src="cm-8-EDA_files/figure-html/lloyd00-1.png" alt="After 4 steps" width="864" /> <p class="caption">After 4 steps</p> </div> --- ### Analysis Given - codebook `\(\mathcal{C} =\big\{c_1, \ldots, c_k\big\}\)` and - clusters `\(C_1, \ldots, C_k\)`, the _within-clusters sum of squares_ is defined as `$$\sum_{j=1}^k \sum_{i: X_i \in C_j} \bigg\Vert c_j - X_i \bigg\Vert^2$$`
This is also the `\(k\)`-means cost ### Lemma .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ At each stage, the _within-clusters sum of squares_ does not increase ] ??? There is no guarantee that the algorithm will converge in a few iterations Iterations are carried out in a brute-force manner --- ### Proof Let `\(\mathcal{C}^{(t)} =\big\{ c^{(t)}_1, \ldots, c_k^{(t)}\big\}\)` be the codebook after `\(t\)` steps Let `\(\big({C}^{(t)}_j\big)_{j \leq k}\)` be the clusters after `\(t\)` steps - Centroids at step `\(t+1\)` are the barycenters of clusters `\(\big({C}^{(t)}_j\big)_{j \leq k}\)` `$$c^{(t+1)}_j = \frac{1}{|C_j^{(t)}|} \sum_{X_i \in C^{(t)}_j} X_i$$` - Clusters `\(C^{(t+1)}_j\)` are defined by `$$C^{(t+1)}_j = \bigg\{ X_i : \Vert X_i - c^{(t+1)}_j\Vert = \min_{c \in \mathcal{C}^{(t+1)}} \Vert X_i - c\Vert \bigg\}$$` Each sample point is assigned to the closest centroid --- ### Proof (continued) `$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t)}_j - X_i\bigg\Vert^2 \geq \sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t+1)}_j - X_i\bigg\Vert^2$$` since for each `\(j\)`, the mean `\(c^{(t+1)}_j\)` minimizes the average squared distance to points in `\(C^{(t)}_j\)` `$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t+1)}_j - X_i\bigg\Vert^2 \geq \sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \min_{c \in \mathcal{C}^{(t+1)}}\bigg\Vert c - X_i\bigg\Vert^2$$` `$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \min_{c \in \mathcal{C}^{(t+1)}}\bigg\Vert c - X_i\bigg\Vert^2 = \sum_{j=1}^k \sum_{X_i \in C^{(t+1)}_j} \bigg\Vert c^{(t+1)}_j - X_i\bigg\Vert^2$$`
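A minimal sketch of naive Lloyd iterations (an illustration under simplifying assumptions: Euclidean distance, initial centroids drawn from the sample, empty clusters ignored; this is not the implementation behind `kmeans`). Printing the cost after each movement step lets one watch the monotone decrease claimed by the lemma.

```r
# Naive Lloyd iterations: the printed within-clusters sums of squares
# should form a non-increasing sequence
lloyd_demo <- function(X, k, n_iter = 8, seed = 1) {
  X <- as.matrix(X)
  set.seed(seed)
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (t in seq_len(n_iter)) {
    # movement step: assign each sample point to its closest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cl <- max.col(-d2)
    cat("iteration", t, "cost:", sum(d2[cbind(seq_len(nrow(X)), cl)]), "\n")
    # update step: each centroid becomes the barycenter of its cluster
    centers <- t(sapply(seq_len(k), function(j) colMeans(X[cl == j, , drop = FALSE])))
  }
  invisible(list(centers = centers, cluster = cl))
}

lloyd_demo(iris[, 1:2], k = 3)
```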
--- ###
Variants of _k_-means Implementations of `\(k\)`-means vary with respect to - Initialization + `k-means++` + Forgy: pick initial centroids at random from the dataset + Random partition: pick a random partition of the dataset and initialize centroids by computing means in each class + ... - Movement/assignment + Naive `\(k\)`-means uses brute-force search for the closest centroid. Each step requires `\(\Omega(n \times k)\)` operations + Elkan (used by
`scikit-learn`) + Hartigan-Wong
(the default in R's `kmeans`) + ... ??? > Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naïve implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm. .fr.f6[Wikipedia] In base R
, `kmeans` is a wrapper for different but related algorithms. Lloyd's algorithm is the first and simplest version of a series of heuristic methods designed to minimize the k-means cost - `MacQueen` modifies the mean each time a sample is assigned to a new cluster - `Hartigan-Wong` is the _default_ method. It modifies the mean by removing the considered sample point, assigning it to the nearest center, and recomputing the new mean after assignment. - `Forgy` --- template: inter-slide name: pcaandkmeans ## Combining PCA and `\(k\)`-means --- ###
The result of a clustering procedure like `kmeans` can be visualized by projecting the dataset on a pair of native variables and using some aesthetics to emphasize the clusters This is not always the best way. First, choosing a pair of native variables may not be straightforward. The projected pairwise distances may not faithfully reflect the pairwise distances that serve for clustering. It makes sense to project the dataset on the `\(2\)`-dimensional subspace that maximizes the projected inertia, that is, on the space generated by the first two principal components --- ### PCA, projection, `\(k\)`-means .fl.w-30.pa2.f6[ The kmeans clustering of the Iris dataset is projected on the first two principal components: `prcomp` is used to perform PCA with neither centering nor scaling `kmeans` is applied to the rotated data The straight lines are not the projections of the boundaries of the (4-dimensional) Voronoï cells defined by the cluster centroids, but the boundaries of the 2-dimensional Voronoï cells defined by the projections of the cluster centroids ] .fl.w-70.pa2[ .panelset[ .panel[.panel-name[Code] ```r iris_a <- broom::augment(prcomp(x = iris[, -5], center = FALSE, scale.=FALSE, rank. = 4), iris) km3 <- iris_a %>% select(starts_with(".fitted")) %>% kmeans(3, nstart = 20) iris_a <- broom::augment(km3, iris_a) ``` ] .panel[.panel-name[Plot cooking] ```r ggplot(data=iris_a, aes(x=.fittedPC1, y=.fittedPC2)) + coord_fixed(ratio=1) + * stat_voronoi(data=data.frame(km3$centers), geom="path", outline = data.frame(x=c(-12, -4, -4 , -12), y=c(-3, -3, 4, 4))) + geom_point(aes(shape=Species, col=.cluster)) + * geom_point(data=data.frame(km3$centers), aes(x=.fittedPC1, y=.fittedPC2), shape='+', size=5) + xlab("PC1") + ylab("PC2") + labs(col="Cluster") + ggtitle('Kmeans clustering of Iris dataset projected on first principal components.') ``` ] .panel[.panel-name[Plot] ![](cm-8-EDA_files/figure-html/pcakmeans-1.png) ] ]] ??? --- ###
Questions around _k_-means - Choosing `\(k\)` - Assessing clustering quality - Scaling or not scaling? - Choosing a distance - Initialization methods - Movement/assignment update - Stopping rules --- template: inter-slide name: quantization ## `\(k\)`-means and _quantization_ --- Quantization plays an important role in signal processing and information theory (lossy coding with quadratic distortion) Given a probability distribution `\(P\)` over a metric space `\((\mathcal{X},d)\)`, a `\(k\)`-quantizer is defined by a `\(k\)`-element subset of `\(\mathcal{X}\)`, `\(\mathbf{c} := \{x_1,\ldots, x_k\}\)` called a codebook. The codebook defines a quantization by mapping every `\(x \in \mathcal{X}\)` to its nearest neighbor in codebook `\(\mathbf{c}\)` The quality of a codebook is assessed by its mean distortion measured as the mean quadratic distance to the nearest neighbor: `$$\mathsf{R}(\mathbf{c}) := \mathbb{E}\left[\min_{x \in \mathbf{c}}d(X,x)^2\right]$$` where `\(X \sim P\)` --- When `\(P\)` is known, designing an optimal codebook may be a difficult optimization problem When `\(P\)` is unknown, if the statistician is left with an i.i.d. sample `\(X_1,\ldots,X_n \sim P\)`, the first reasonable thing to do is to minimize the empirical distortion, the `\(k\)`-means cost: `$$\mathsf{R}_n(\mathbf{c}) := \frac{1}{n} \sum_{i=1}^n \min_{x \in \mathbf{c}} d(X_i,x)^2$$` ### Theorem: NP-hardness of k-means cost minimization .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Minimizing the `\(k\)`-means cost (sum of within clusters sum of squares) is `\(\mathsf{NP}\)`-hard. ] ??? Sanjoy Dasgupta proved that if `\(\mathsf{P} \neq \mathsf{NP}\)`, minimizing the `\(k\)`-means cost (sum of within clusters sum of squares) is __computationally intractable__ This is one result in a long series of negative results going back to the late seventies --- ### Statistical/information-theoretical issues Even though minimizing the `\(k\)`-means cost is hard, one may investigate the _statistical_ problem raised by minimizing the `\(k\)`-means cost. `\(k\)`-means served as a showcase for empirical process theory during the early 1980s Significant progress during recent years The `\(k\)`-means cost provides a concrete illustration of a recurrent situation. If the sampling distribution is square integrable and has a density with respect to Lebesgue measure, the mapping `\(\mathbf{c} \mapsto \mathsf{R}(\mathbf{c})\)` is differentiable, and its gradient can be explicitly computed ??? `Cite(myBib, "Pol84")` `Cite(myBib, "MR3080408")`, `Cite(myBib, "MR3316191")` --- ### Smoothness of _k_-means cost Whereas the local behavior of the `\(k\)`-means cost is simple, the global behavior remains elusive. Bounding the number of global minima, local minima, local extrema and saddle points is difficult. Observation: under fairly general assumptions, the `\(k\)`-means cost function is twice differentiable in the neighborhood of optimal codebooks Even though the `\(k\)`-means cost function is not convex, recent advances tell us that, as the sample size tends to infinity, the empirical cost function will also share local minima, local maxima, and saddle points with the theoretical population cost function ??? `Cite(myBib, "Pol84")` `Cite(myBib, "MR3851754")` --- ### Pollard's regularity conditions - The sampling distribution is absolutely continuous with respect to Lebesgue measure on `\(\mathbb{R}^p\)`.
- The Hessian matrix of the mapping `\(\mathbf{c} \mapsto \mathsf{R}(\mathbf{c})\)` is positive definite for all optimal codebooks -- Under Pollard's regularity conditions, let `\(\mathbf{c}^*\)` denote the optimal codebook, and `\(\widehat{\mathbf{c}}_n\)` denote the optimal empirical codebook Large sample behavior of the empirically optimal codebook: - `\(\sqrt{n}\,(\widehat{\mathbf{c}}_n - \mathbf{c}^*)\)` is asymptotically normal and - `\(n \left( \mathsf{R}(\widehat{\mathbf{c}}_n) - \mathsf{R}(\mathbf{c}^*)\right)\)` is stochastically bounded --- ### Key observation Pollard's condition entails that for some constant `\(\kappa_0>0\)`, `$$\mathsf{R}(\mathbf{c}) - \mathsf{R}(\mathbf{c}^*) \geq \kappa_0 \|\mathbf{c}- \mathbf{c}^*\|^2$$` --- ###
Conclusion - Euclidean distance is used as a metric and inertia is used as a measure of cluster scatter - The number of clusters `\(k\)` is an input parameter - Convergence to a local minimum may produce counterintuitive ("wrong") results ??? Squared Euclidean distance is very sensitive to outliers An inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set. --- exclude: true ### References .f6[ NULL ] --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End