EDA VIII: k-means

---
name: layout-general
layout: true
class: left, middle

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./img/Universite_Paris_logo_horizontal.jpg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/annee/m1-mi/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Exploratory Data Analysis : Clustering and k-means

### 2021-12-10

#### [Master I MIDS & MFA]()

#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
exclude: true
template: inter-slide

# Exploratory Data Analysis VIII k-Means

### 2021-12-10

#### [Master I MIDS & MFA]()

#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: middle, inverse

## <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M0 117.66v346.32c0 11.32 11.43 19.06 21.94 14.86L160 416V32L20.12 87.95A32.006 32.006 0 0 0 0 117.66zM192 416l192 64V96L192 32v384zM554.06 33.16L416 96v384l139.88-55.95A31.996 31.996 0 0 0 576 394.34V48.02c0-11.32-11.43-19.06-21.94-14.86z"/></svg>

### [Clustering problem](#clusterPb)

### [Kleinberg's Theorem](#kleinberg)

### [Flavors of clustering](#flavors)

### [_k_-Means](#kmeans)

### [_k_-Means and Quantization](#quantization)

---
name:clusterPb
template: inter-slide

## Clustering problems

---
### In words

Clustering consists in _partitioning_  points collections
from some metric space

in such a way that

- points within the same group are close enough

while

- points from different groups are distant

???

In the background: some notion of distance/similarity

---

### Clustering in ML applications

Clustering shows up in many Machine Learning applications, for example:

-   <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M157.52 272h36.96L176 218.78 157.52 272zM352 256c-13.23 0-24 10.77-24 24s10.77 24 24 24 24-10.77 24-24-10.77-24-24-24zM464 64H48C21.5 64 0 85.5 0 112v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V112c0-26.5-21.5-48-48-48zM250.58 352h-16.94c-6.81 0-12.88-4.32-15.12-10.75L211.15 320h-70.29l-7.38 21.25A16 16 0 0 1 118.36 352h-16.94c-11.01 0-18.73-10.85-15.12-21.25L140 176.12A23.995 23.995 0 0 1 162.67 160h26.66A23.99 23.99 0 0 1 212 176.13l53.69 154.62c3.61 10.4-4.11 21.25-15.11 21.25zM424 336c0 8.84-7.16 16-16 16h-16c-4.85 0-9.04-2.27-11.98-5.68-8.62 3.66-18.09 5.68-28.02 5.68-39.7 0-72-32.3-72-72s32.3-72 72-72c8.46 0 16.46 1.73 24 4.42V176c0-8.84 7.16-16 16-16h16c8.84 0 16 7.16 16 16v160z"/></svg> __Marketing__: finding groups of customers with similar
    behavior given a large database of customer data containing their
    properties and past buying records

-   <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M448 360V24c0-13.3-10.7-24-24-24H96C43 0 0 43 0 96v320c0 53 43 96 96 96h328c13.3 0 24-10.7 24-24v-16c0-7.5-3.5-14.3-8.9-18.7-4.2-15.4-4.2-59.3 0-74.7 5.4-4.3 8.9-11.1 8.9-18.6zM128 134c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm0 64c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm253.4 250H96c-17.7 0-32-14.3-32-32 0-17.6 14.4-32 32-32h285.4c-1.9 17.1-1.9 46.9 0 64z"/></svg> __Bookshops__: book ordering (recommendation)

-   <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M143.25 220.81l-12.42 46.37c-3.01 11.25-3.63 22.89-2.41 34.39l-35.2 28.98c-6.57 5.41-16.31-.43-14.62-8.77l15.44-76.68c1.06-5.26-2.66-10.28-8-10.79l-77.86-7.55c-8.47-.82-11.23-11.83-4.14-16.54l65.15-43.3c4.46-2.97 5.38-9.15 1.98-13.29L21.46 93.22c-5.41-6.57.43-16.3 8.78-14.62l76.68 15.44c5.26 1.06 10.28-2.66 10.8-8l7.55-77.86c.82-8.48 11.83-11.23 16.55-4.14l43.3 65.14c2.97 4.46 9.15 5.38 13.29 1.98l60.4-49.71c6.57-5.41 16.3.43 14.62 8.77L262.1 86.38c-2.71 3.05-5.43 6.09-7.91 9.4l-32.15 42.97-10.71 14.32c-32.73 8.76-59.18 34.53-68.08 67.74zm494.57 132.51l-12.42 46.36c-3.13 11.68-9.38 21.61-17.55 29.36a66.876 66.876 0 0 1-8.76 7l-13.99 52.23c-1.14 4.27-3.1 8.1-5.65 11.38-7.67 9.84-20.74 14.68-33.54 11.25L515 502.62c-17.07-4.57-27.2-22.12-22.63-39.19l8.28-30.91-247.28-66.26-8.28 30.91c-4.57 17.07-22.12 27.2-39.19 22.63l-30.91-8.28c-12.8-3.43-21.7-14.16-23.42-26.51-.57-4.12-.35-8.42.79-12.68l13.99-52.23a66.62 66.62 0 0 1-4.09-10.45c-3.2-10.79-3.65-22.52-.52-34.2l12.42-46.37c5.31-19.8 19.36-34.83 36.89-42.21a64.336 64.336 0 0 1 18.49-4.72l18.13-24.23 32.15-42.97c3.45-4.61 7.19-8.9 11.2-12.84 8-7.89 17.03-14.44 26.74-19.51 4.86-2.54 9.89-4.71 15.05-6.49 10.33-3.58 21.19-5.63 32.24-6.04 11.05-.41 22.31.82 33.43 3.8l122.68 32.87c11.12 2.98 21.48 7.54 30.85 13.43a111.11 111.11 0 0 1 34.69 34.5c8.82 13.88 14.64 29.84 16.68 46.99l6.36 53.29 3.59 30.05a64.49 64.49 0 0 1 22.74 29.93c4.39 11.88 5.29 25.19 1.75 38.39zM255.58 234.34c-18.55-4.97-34.21 4.04-39.17 22.53-4.96 18.49 4.11 34.12 22.65 39.09 18.55 4.97 45.54 15.51 50.49-2.98 4.96-18.49-15.43-53.67-33.97-58.64zm290.61 28.17l-6.36-53.29c-.58-4.87-1.89-9.53-3.82-13.86-5.8-12.99-17.2-23.01-31.42-26.82l-122.68-32.87a48.008 48.008 0 0 0-50.86 17.61l-32.15 42.97 172 46.08 75.29 20.18zm18.49 54.65c-18.55-4.97-53.8 15.31-58.75 33.79-4.95 18.49 23.69 22.86 42.24 27.83 18.55 4.97 34.21-4.04 39.17-22.53 4.95-18.48-4.11-34.12-22.66-39.09z"/></svg> __Insurance__: identifying groups of motor insurance policy
    holders with a high average claim cost; identifying frauds

-  <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M616 192H480V24c0-13.26-10.74-24-24-24H312c-13.26 0-24 10.74-24 24v72h-64V16c0-8.84-7.16-16-16-16h-16c-8.84 0-16 7.16-16 16v80h-64V16c0-8.84-7.16-16-16-16H80c-8.84 0-16 7.16-16 16v80H24c-13.26 0-24 10.74-24 24v360c0 17.67 14.33 32 32 32h576c17.67 0 32-14.33 32-32V216c0-13.26-10.75-24-24-24zM128 404c0 6.63-5.37 12-12 12H76c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12H76c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12H76c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm128 192c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm160 96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12V76c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm160 288c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40zm0-96c0 6.63-5.37 12-12 12h-40c-6.63 0-12-5.37-12-12v-40c0-6.63 5.37-12 12-12h40c6.63 0 12 5.37 12 12v40z"/></svg> __City-planning__: identifying groups of houses according to
    their type, value and geographical location

???

Many distinct goals: clustering is often just one step in a data analysis pipeline

For recommendation systems, marketing, objects that fit into the same group call for the same action

Some clustering should be hierarchical (toxonomy in life sciences) others can just be flat

---

A clustering application relies on  the elaboration of
a _metric/dissimilarity_ over some input space

This tasks is entangled  with _feature engineering_

Focus on one specific context: _market segmentation_  <svg aria-hidden="true" role="img" viewBox="0 0 544 512" style="height:1em;width:1.06em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M527.79 288H290.5l158.03 158.03c6.04 6.04 15.98 6.53 22.19.68 38.7-36.46 65.32-85.61 73.13-140.86 1.34-9.46-6.51-17.85-16.06-17.85zm-15.83-64.8C503.72 103.74 408.26 8.28 288.8.04 279.68-.59 272 7.1 272 16.24V240h223.77c9.14 0 16.82-7.68 16.19-16.8zM224 288V50.71c0-9.55-8.39-17.4-17.84-16.06C86.99 51.49-4.1 155.6.14 280.37 4.5 408.51 114.83 513.59 243.03 511.98c50.4-.63 96.97-16.87 135.26-44.03 7.9-5.6 8.42-17.23 1.57-24.08L224 288z"/></svg>

-   <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M448 73.143v45.714C448 159.143 347.667 192 224 192S0 159.143 0 118.857V73.143C0 32.857 100.333 0 224 0s224 32.857 224 73.143zM448 176v102.857C448 319.143 347.667 352 224 352S0 319.143 0 278.857V176c48.125 33.143 136.208 48.572 224 48.572S399.874 209.143 448 176zm0 160v102.857C448 479.143 347.667 512 224 512S0 479.143 0 438.857V336c48.125 33.143 136.208 48.572 224 48.572S399.874 369.143 448 336z"/></svg> __Data__: Base of customer data containing their properties
    and past buying records

-   <svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M248 8C111.03 8 0 119.03 0 256s111.03 248 248 248 248-111.03 248-248S384.97 8 248 8zm0 432c-101.69 0-184-82.29-184-184 0-101.69 82.29-184 184-184 101.69 0 184 82.29 184 184 0 101.69-82.29 184-184 184zm0-312c-70.69 0-128 57.31-128 128s57.31 128 128 128 128-57.31 128-128-57.31-128-128-128zm0 192c-35.29 0-64-28.71-64-64s28.71-64 64-64 64 28.71 64 64-28.71 64-64 64z"/></svg> __Goal__: Use the customers *similarities* to find groups

-   __Possible directions:__

+ Dimension reduction (PCA, CA, MCA, ...)

+ __Clustering__ `$\approx$` _non-supervised classification_

???

Are the directions complementary? or not?

Clustering may be done before dimension reduction or the other way

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M416 48c0-8.84-7.16-16-16-16h-64c-8.84 0-16 7.16-16 16v48h96V48zM63.91 159.99C61.4 253.84 3.46 274.22 0 404v44c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32V288h32V128H95.84c-17.63 0-31.45 14.37-31.93 31.99zm384.18 0c-.48-17.62-14.3-31.99-31.93-31.99H320v160h32v160c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32v-44c-3.46-129.78-61.4-150.16-63.91-244.01zM176 32h-64c-8.84 0-16 7.16-16 16v48h96V48c0-8.84-7.16-16-16-16zm48 256h64V128h-64v160z"/></svg> Dimension reduction

Dimension reduction technologies start form:

-   Training data
    `$\mathcal{D}=\{\vec{X}_1,\ldots,\vec{X}_n\} \in \mathcal{X}^n$` (i.i.d.
    `$\sim \Pr$`)

-   Space `$\mathcal{X}$` of possibly high dimension.

and elaborate a  _Dimension Reduction Map_

Dimension reduction technologies construct a map `$\Phi$` from the space `$\mathcal{X}$`
into a space `$\mathcal{X}'$` of __smaller dimension__

---

###  <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M416 48c0-8.84-7.16-16-16-16h-64c-8.84 0-16 7.16-16 16v48h96V48zM63.91 159.99C61.4 253.84 3.46 274.22 0 404v44c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32V288h32V128H95.84c-17.63 0-31.45 14.37-31.93 31.99zm384.18 0c-.48-17.62-14.3-31.99-31.93-31.99H320v160h32v160c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32v-44c-3.46-129.78-61.4-150.16-63.91-244.01zM176 32h-64c-8.84 0-16 7.16-16 16v48h96V48c0-8.84-7.16-16-16-16zm48 256h64V128h-64v160z"/></svg> Clustering techniques

Clustering techniques start from  _training data_:

`$$\mathcal{D}=\{\vec{X}_1,\ldots,\vec{X}_n\} \in \mathcal{X}^n$$`

assuming `$\vec{X}_i \sim_{\text{i.i.d.}} \Pr$`,  and partition the data into (latent?) groups,

Clustering techniques construct a map `$f$` from `$\mathcal{D}$` to `$\{1,\ldots,K\}$` where `$K$`
is a number of classes to be fixed: `$f: \quad \vec{X}_i \mapsto k_i$`

---

### Dimension reduction and clustering may be combined

For example, it is
commonplace to first perform PCA, project the data on the leading principal components
and then perform `$k$`-means clustering on the projected data

Clustering tasks may be motivated along different directions:

-   The search for an interpretation of groups

-   Use of groups in further processing (prediction, ...)

???

This is especially true as many clustering approaches suffer from the curse of dimensionality

---

### Good clustering

We need to define the __quality of a cluster__  <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M223.75 130.75L154.62 15.54A31.997 31.997 0 0 0 127.18 0H16.03C3.08 0-4.5 14.57 2.92 25.18l111.27 158.96c29.72-27.77 67.52-46.83 109.56-53.39zM495.97 0H384.82c-11.24 0-21.66 5.9-27.44 15.54l-69.13 115.21c42.04 6.56 79.84 25.62 109.56 53.38L509.08 25.18C516.5 14.57 508.92 0 495.97 0zM256 160c-97.2 0-176 78.8-176 176s78.8 176 176 176 176-78.8 176-176-78.8-176-176-176zm92.52 157.26l-37.93 36.96 8.97 52.22c1.6 9.36-8.26 16.51-16.65 12.09L256 393.88l-46.9 24.65c-8.4 4.45-18.25-2.74-16.65-12.09l8.97-52.22-37.93-36.96c-6.82-6.64-3.05-18.23 6.35-19.59l52.43-7.64 23.43-47.52c2.11-4.28 6.19-6.39 10.28-6.39 4.11 0 8.22 2.14 10.33 6.39l23.43 47.52 52.43 7.64c9.4 1.36 13.17 12.95 6.35 19.59z"/></svg>

Unfortunately, no obvious quality measure exists! <svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M248 8C111 8 0 119 0 256s111 248 248 248 248-111 248-248S385 8 248 8zm80 168c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm-160 0c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm170.2 218.2C315.8 367.4 282.9 352 248 352s-67.8 15.4-90.2 42.2c-13.5 16.3-38.1-4.2-24.6-20.5C161.7 339.6 203.6 320 248 320s86.3 19.6 114.7 53.8c13.6 16.2-11 36.7-24.5 20.4z"/></svg>

Clustering quality may be assessed by scrutinizing

-   _Inner homogeneity_: samples in the same group should be similar

-   _Outer inhomogeneity_: samples in two different groups should be
    different.

---

### Shades of similarity

There are many possible definitions of _similar_ and _different_

Often, they are  based on the distance between the samples

Examples based on the (squared) Euclidean distance:

-   Inner homogeneity `$\approx$` intra class variance/inertia,

-   Outer inhomogeneity `$\approx$` inter class variance/inertia.

Remember that, in flat clustering,
the choice of the number `$K$` of clusters is often delicate

---
template: inter-slide
name: kleinberg

## Kleinberg's theorem

---

###  <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/></svg>

- Clustering is not a single method

- Clustering methods address a large range of   problems.

???

Turning this informal statement into a formal definition proves challenging.

---

### Definition Clustering function

Define a _clustering function_ `$F$` as a function that

- takes as input any finite domain `$\mathcal{X}$` with a dissimilarity function `$d$` over its pairs

and

- returns a partition of `$\mathcal{X}$`

---

### Desirable properties

A clustering function should ideally satisfy the next three properties

1. _Scale Invariance_. For any domain set `$\mathcal{X}$`, dissimilarity function `$d$`, and any
`$\alpha>0$`, the following should hold: `$F(\mathcal{X},d) = F(\mathcal{X},\alpha d)$`.

2. _Richness_ For any finite `$\mathcal{X}$` and every partition `$C = (C_1,\ldots,C_k)$` of `$\mathcal{X}$` (into
nonempty subsets) there exists some dissimilarity function `$d$` over `$\mathcal{X}$` such that
`$F(\mathcal{X},d)=C$`.

3.  _Consistency_ If `$d$` and `$d'$` are dissimilarity functions over `$\mathcal{X}$`, such that
for all `$x, y \in \mathcal{X}$`,
    +  if `$x,y$` belong to the same cluster in `$F(\mathcal{X},d)$` then `$d'(x,y) \leq d(x,y)$`,
    + if `$x,y$` belong to different clusters in `$F(\mathcal{X},d)$` then `$d'(x,y) \geq d(x,y)$`,

then `$F(\mathcal{X},d) = F(\mathcal{X},d')$`.

---

but

### Kleinberg's impossibility theorem

.bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[

**No** clustering function `$F$` satisfies simultaneously all  three properties:

- _Scale Invariance_,

- _Richness_, and

- _Consistency_

]

---
template: inter-slide
name: flavors

## Flavors of clustering

---

### Flat/Hierarchical and ...

A wide variety of clustering methods have been used in Statistics and Machine Learning.

- __Flat clustering (for example `$k$`-means)__ partitions  sample into a fixed
number of classes (usually denoted by `$k$`). The partition is determined by
some algorithm

.f6[The ultimate objective is to optimize some cost function.
Whether the objective is achieved or even approximately achieved using
a reasonable amount of computational resources is not settled]

- __Model based clustering__ is based on a generative model: data are assumed
to be sampled from a specific model (usually finite mixtures of Gaussians, the model may or may not be parametric)

.f6[Clustering consists in fitting such a mixture model and then assigning sample points to mixture components]

- _Hierarchical clustering_ is the topic of next lesson

---

### Carte du tendre

.fl.w-30.f6[In Machine Learning, `$k$`-means and hierarchical clustering   belong to a range of tasks called _non-supervised learning_

This contrasts with regression which belongs to the realm of _supervised learning_
]

.fl.w-70[

![](https://scikit-learn.org/stable/_static/ml_map.png)

]

---
template: inter-slide
name:  kmeans

## _k_-means

---

The `$k$`-means algorithm is an iterative method that constructs a sequence
of Voronoï partitions

A Voronoï diagram draws the nearest neighbor regions around a set of points.

### Definition: Voronoï partitions

Assume:

- sample `$X_1, \ldots, X_n$` from `$\mathbb{R}^p$`
- `$\mathbb{R}^p$` is endowed with a metric `$d$`, usually `$\ell_2$`, sometimes
a weighted `$\ell_2$` distance or `$\ell_1$`

Each cluster is defined by a _centroid_

The collection of centroids is (sometimes)  called the _codebook_ `$\mathcal{C}=c_1, \ldots, c_k$`

Each centroid `$c_j$` defines a class:

`$$C_j = \bigg\{ X_i : d(X_i, c_j) = \min_{j' \leq k} d(X_i, c_{j'})\bigg\}$$`

and more generally a _Voronoï cell_ in `$\mathbb{R}^p$`

`$$C_j = \bigg\{ x :  x \in \mathbb{R}^p, d(x, c_j) = \min_{j' \leq k} d(x, c_{j'})\bigg\}$$`

---

### A Voronoï tesselation

.fl.w-70.pa2[
<img src="cm-8-EDA_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

.fl.w-30.pa2.f6[

#### Euclidean distance, dimension 2

A voronoi tesselation generated by `$100$`
points picked at random on the gred `$\{1,\ldots, 200\}^2$`

Note that cell boundaries are line segments

Note that centroids may lie close to boundaries

The position of the centroid of a Voronoi cell depends on the positions
of the centroids of the neighboring cells

]

---

### A Voronoi partition for projected Iris dataset

.fl.w-30.pa2[

The black points marked with a cross define three centroids.

The straight lines delimit the Voronoï cells defined by the three centroids.

The colored points come from the Iris dataset:
each point is colored according to the the cell it belongs to.

]

.fl.w-70.pa2[

```r
data(iris)
pacman::p_load(ggvoronoi)

kms <- kmeans(iris[,1:2], 3)

df_centers <- as.data.frame(kms$centers) %>%
  tibble::rownames_to_column(var=".cluster")

broom::augment(kms, iris) %>%
  ggplot() +
  aes(x=Sepal.Length, y=Sepal.Width, colour=.cluster) +
  geom_point(aes()) +
* stat_voronoi(data = df_centers,
               geom="path",
               outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5))
               ) +
* geom_point(data = df_centers,
             colour = "black",
             shape="+",
             size=5) +
  coord_fixed() +
  labs(col="Voronoï cells") +
  ggtitle("Kmeans over Iris dataset, k=3")
```
]

![](cm-8-EDA_files/figure-html/voronoi-1.png)

]
]
]
---
exclude: true

```r
geom_sugar <- function(df_centers, species=TRUE){
  list(
    if (species)  geom_point(aes(shape=Species, color=.cluster))
    else
    geom_point(aes(color=.cluster)),
    stat_voronoi(data = df_centers,
                 geom="path",
                 outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5))
                 ),
*   geom_point(data = df_centers,
               colour = "black",
               shape="+",
               size=5),
    coord_fixed(),
    labs(col="Voronoï cells")
)}

broom::augment(kms, iris) %>%
  ggplot(aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_sugar(df_centers, species=FALSE)
```

---
### _k_-means objective function

The `$k$`-means algorithm aims at building a _codebook_ `$\mathcal{C}$` that minimizes
`$$\mathcal{C} \mapsto \sum_{i=1}^n \min_{c \in \mathcal{C}}  \Vert X_i - c\Vert_2^2$$`
over all codebooks with given cardinality

If `$c \in \mathcal{C}$` is the closest centroid to `$X \in \mathbb{R}^p$`,
`$$\|c - X\|^2$$` is the _quantization/reconstruction error_ suffered when using codebook `$\mathcal{C}$` to approximate `$X$`

???

`$k$`-means has a lot to do with rate-distortion coding

---

### `$k$`-means at work

.fl.w-30.pa2[

We may figure out what an optimized Voronoï partition  looks like on the Iris dataset

`kmeans` with `$k=3$` on the Iris dataset

Function `kmeans` is run with default arguments

We chose the `Sepal` plane for clustering and visualization

This is arbitrary. We could have chosen a `Petal`  plane, a `Width` plane, or a plane defined by principal axes.
]

.fl.w-70.pa2[

```r
kms <- kmeans(select(iris, Sepal.Length, Sepal.Width), 3)

broom::augment(kms, iris) %>%
 ggplot() +
 geom_point(aes(x=Sepal.Length, y=Sepal.Width,
            shape=Species, col=.cluster)) +
*geom_point(data=data.frame(kms$centers),
            aes(x=Sepal.Length, y=Sepal.Width),
            shape='+',
            size=5) +
*stat_voronoi(data = as.data.frame(kms$centers),
               aes(x=Sepal.Length,y=Sepal.Width),
               geom="path",
               outline=data.frame(x=c(4, 8, 8, 4), y=c(2, 2, 4.5, 4.5))) -> p

p +
  ggtitle("K-means with k=3", "Iris data") +
  labs(col="Clusters")
```
]

![](cm-8-EDA_files/figure-html/iriskmeans3-1.png)

]]]

---

### A `$k$`-means clustering is completely characterized by the `$k$` centroids

Once centroids are known, clusters can be recovered by searching the closest
centroid for each sample point (that is by delimiting the Voronoï cells).

- How can we assess the _quality_ of a `$k$`-means clustering?

- Can we compare the clusterings achieved by picking different values of `$k$`?

There is no obvious assessment criterion!

???

The _quality_ of a clustering can be appreciated according to a wide variety of performance indicators

- Distortion: this is the `$k$`-means cost
- Shape of clusters
- Relevance of clusters
- Stability: does clustering depend on few points?

---

### Caveat

When visualizing `$k$`-means clustering on `Iris` data, we are cheating.  <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M206.86 245.15c-35.88 10.45-59.95 41.2-57.53 74.1 11.4-12.72 28.81-23.7 49.9-30.92l7.63-43.18zM95.81 295L64.08 115.49c-.29-1.62.28-2.62.24-2.65 57.76-32.06 123.12-49.01 189.01-49.01 1.61 0 3.23.17 4.85.19 13.95-13.47 31.73-22.83 51.59-26 18.89-3.02 38.05-4.55 57.18-5.32-9.99-13.95-24.48-24.23-41.77-27C301.27 1.89 277.24 0 253.32 0 176.66 0 101.02 19.42 33.2 57.06 9.03 70.48-3.92 98.48 1.05 126.58l31.73 179.51c14.23 80.52 136.33 142.08 204.45 142.08 3.59 0 6.75-.46 10.01-.8-13.52-17.08-28.94-40.48-39.5-67.58-47.61-12.98-106.06-51.62-111.93-84.79zm97.55-137.46c-.73-4.12-2.23-7.87-4.07-11.4-8.25 8.91-20.67 15.75-35.32 18.32-14.65 2.58-28.67.4-39.48-5.17-.52 3.94-.64 7.98.09 12.1 3.84 21.7 24.58 36.19 46.34 32.37 21.75-3.82 36.28-24.52 32.44-46.22zM606.8 120.9c-88.98-49.38-191.43-67.41-291.98-51.35-27.31 4.36-49.08 26.26-54.04 54.36l-31.73 179.51c-15.39 87.05 95.28 196.27 158.31 207.35 63.03 11.09 204.47-53.79 219.86-140.84l31.73-179.51c4.97-28.11-7.98-56.11-32.15-69.52zm-273.24 96.8c3.84-21.7 24.58-36.19 46.34-32.36 21.76 3.83 36.28 24.52 32.45 46.22-.73 4.12-2.23 7.87-4.07 11.4-8.25-8.91-20.67-15.75-35.32-18.32-14.65-2.58-28.67-.4-39.48 5.17-.53-3.95-.65-7.99.08-12.11zm70.47 198.76c-55.68-9.79-93.52-59.27-89.04-112.9 20.6 25.54 56.21 46.17 99.49 53.78 43.28 7.61 83.82.37 111.93-16.6-14.18 51.94-66.71 85.51-122.38 75.72zm130.3-151.34c-8.25-8.91-20.68-15.75-35.33-18.32-14.65-2.58-28.67-.4-39.48 5.17-.52-3.94-.64-7.98.09-12.1 3.84-21.7 24.58-36.19 46.34-32.37 21.75 3.83 36.28 24.52 32.45 46.22-.73 4.13-2.23 7.88-4.07 11.4z"/></svg>

We have a gold standard classification delivered by botanists

The botanists classification can be challenged

We can compare classification originating from _phenotypes_ (appearance) and classification based on _phylogeny_ (comparing DNAs)  <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M.1 494.1c-1.1 9.5 6.3 17.8 15.9 17.8l32.3.1c8.1 0 14.9-5.9 16-13.9.7-4.9 1.8-11.1 3.4-18.1H380c1.6 6.9 2.9 13.2 3.5 18.1 1.1 8 7.9 14 16 13.9l32.3-.1c9.6 0 17.1-8.3 15.9-17.8-4.6-37.9-25.6-129-118.9-207.7-17.6 12.4-37.1 24.2-58.5 35.4 6.2 4.6 11.4 9.4 17 14.2H159.7c21.3-18.1 47-35.6 78.7-51.4C410.5 199.1 442.1 65.8 447.9 17.9 449 8.4 441.6.1 432 .1L399.6 0c-8.1 0-14.9 5.9-16 13.9-.7 4.9-1.8 11.1-3.4 18.1H67.8c-1.6-7-2.7-13.1-3.4-18.1-1.1-8-7.9-14-16-13.9L16.1.1C6.5.1-1 8.4.1 17.9 5.3 60.8 31.4 171.8 160 256 31.5 340.2 5.3 451.2.1 494.1zM224 219.6c-25.1-13.7-46.4-28.4-64.3-43.6h128.5c-17.8 15.2-39.1 30-64.2 43.6zM355.1 96c-5.8 10.4-12.8 21.1-21 32H114c-8.3-10.9-15.3-21.6-21-32h262.1zM92.9 416c5.8-10.4 12.8-21.1 21-32h219.4c8.3 10.9 15.4 21.6 21.2 32H92.9z"/></svg>

---

### Summarising a `$k$`-means clustering

.fl.w-50.pa2[

A clustering can be summarized and illustrated.

In <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> A meaningful summary is provided
by the generic function `summary()`, or a `tidy` summary is providede by
`broom::tidy(...)`

```r
select(iris, Sepal.Length, Sepal.Width) %>%
  kmeans(centers = 3) %>%
* broom::tidy() %>%
  knitr::kable(format = "markdown",
               digits = 2) -> t
```

]

.fl.w-50.pa2[

| Sepal.Length| Sepal.Width| size| withinss|cluster |
|------------:|-----------:|----:|--------:|:-------|
|         6.81|        3.07|   47|    12.62|1       |
|         5.01|        3.43|   50|    13.13|2       |
|         5.77|        2.69|   53|    11.30|3       |

The concise summary  tells us the number of points that are assigned to each cluster, and the Within Sum of Squares (WNSS). It says something
about inner homogeneity and (apparently) nothing about outer homogeneity

]

---

### `$k$`-means with `$k=2$`

.fl.w-30.pa2[
We pursue the exploration of `kmeans` by building another clustering for Iris dataset.

This times  with `$k=2$`.

]

.fl.w-70.pa2[

```r
kms <- kmeans(select(iris, Sepal.Length, Sepal.Width), 2)
iris2 <- broom::augment(kms, iris)

![](cm-8-EDA_files/figure-html/iriskmeans2-1.png)

]

| Sepal.Length| Sepal.Width| size| withinss|cluster |
|------------:|-----------:|----:|--------:|:-------|
|         5.22|        3.13|   83|    35.09|1       |
|         6.61|        2.97|   67|    23.11|2       |

]

]]

---

### <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M202.021 0C122.202 0 70.503 32.703 29.914 91.026c-7.363 10.58-5.093 25.086 5.178 32.874l43.138 32.709c10.373 7.865 25.132 6.026 33.253-4.148 25.049-31.381 43.63-49.449 82.757-49.449 30.764 0 68.816 19.799 68.816 49.631 0 22.552-18.617 34.134-48.993 51.164-35.423 19.86-82.299 44.576-82.299 106.405V320c0 13.255 10.745 24 24 24h72.471c13.255 0 24-10.745 24-24v-5.773c0-42.86 125.268-44.645 125.268-160.627C377.504 66.256 286.902 0 202.021 0zM192 373.459c-38.196 0-69.271 31.075-69.271 69.271 0 38.195 31.075 69.27 69.271 69.27s69.271-31.075 69.271-69.271-31.075-69.27-69.271-69.27z"/></svg> How should we pick `$k$`?

.fl.w-30.pa2.f6[

Even if we could compute a provably  optimal codebook for each `$k$`,
choosing `$k$` would not be obvious

A common recipe consists of
plotting within clusters sum of squares (`WNSS`) against `$k$`

Within clusters sum of squares (WNSS) decreases sharply between `$k=2$` and `$k=3$`

For larger values of `$k$`, the decay is much smaller.

The _elbow_ rule of thumb suggests to choose `$k=3$`.
]

.fl.w-70.pa2[

```r
require(foreach)

foreach (k=2:10, .combine = rbind) %do% {
  select(iris, Sepal.Length, Sepal.Width) %>%
  kmeans(centers = k,   nstart=32L) %>%
* broom::glance() %>%
  force()
} %>%
  rownames_to_column(var = "k") %>%
  mutate(k=as.integer(k)+1) -> tmp
```

.f6[
We have run  `kmeans` over the Iris dataset, for `$k$` in range `$2, \ldots, 10$`. For each value of `$k$`,  we performed `$32$` randomized initializations, and chose the partition that minimizes within clusters sum of squares

]

|  k|  totss| tot.withinss| betweenss| iter|
|--:|------:|------------:|---------:|----:|
|  2| 130.48|        58.20|     72.27|    1|
|  3| 130.48|        37.05|     93.42|    2|
|  4| 130.48|        27.97|    102.51|    3|
|  5| 130.48|        20.96|    109.52|    3|
|  6| 130.48|        17.33|    113.14|    2|
|  7| 130.48|        14.76|    115.72|    2|
|  8| 130.48|        12.81|    117.67|    3|
|  9| 130.48|        11.07|    119.40|    3|
| 10| 130.48|         9.77|    120.70|    4|

]

]
]]

---

### Incentive to choose `$k=4$`?

.fl.w-30.pa2[
Depending on initialization, taking `$k=4$` creates a cluster at the boundary between `versicolor` and `virginica` or it may split the `setosa` cluster
]

.fl.w-70.pa2[
<div class="figure">
<img src="cm-8-EDA_files/figure-html/iriskmeans4-1.png" alt="(ref:iriskmeans4)" width="504" />
<p class="caption">(ref:iriskmeans4)</p>
</div>
]
---

```r
broom::tidy(kmeans(select(iris, Sepal.Length, Sepal.Width), 4)) %>%
  knitr::kable(format = "markdown", digits = 2)
```

| Sepal.Length| Sepal.Width| size| withinss|cluster |
|------------:|-----------:|----:|--------:|:-------|
|         5.92|        2.75|   53|     8.25|1       |
|         5.19|        3.64|   32|     4.63|2       |
|         6.88|        3.10|   41|    10.63|3       |
|         4.77|        2.89|   24|     4.45|4       |

---

### Initialization matters!

.fl.w-40.pa2[
- Initialize by samples.

- `k-Mean++`  try to take them as separated as possible.

- No guarantee to converge to a global optimum!

- Trial and error.

- Repeat and keep the best result.
]

.fl.w-60.pa2[

```r
kmeans(x,       # data
       centers, # initial centroids or number of clusters
       iter.max = 10,
       nstart = 1,  # number of trials
       algorithm = c("Hartigan-Wong", # default
                     "Lloyd",         #<< old one
                     "Forgy",
                     "MacQueen"),
       trace=FALSE)
```

]

???

TODO:
  one dimensional example with animation (plotly)

---
exclude: true

```r
kms <- kmeans(select(iris, Sepal.Length, Sepal.Width),
              centers = 3,
              iter.max = 100,
              nstart= 1,
              trace = 10)
```

```
## KMNS(*, k=3): iter=  1, indx=2
##  QTRAN(): istep=150, icoun=33, NCP[1:3]= 245 267 267
##  QTRAN(): istep=300, icoun=34, NCP[1:3]= 361 416 416
##  QTRAN(): istep=450, icoun=12, NCP[1:3]= 361 588 588
##  QTRAN(): istep=600, icoun=1, NCP[1:3]= 699 749 749
##  QTRAN(): istep=750, icoun=17, NCP[1:3]= 835 883 883
##  QTRAN(): istep=900, icoun=38, NCP[1:3]= 1007 1012 1012
##  QTRAN(): istep=1050, icoun=46, NCP[1:3]= 1007 1154 1154
## KMNS(*, k=3): iter=  2, indx=150
```

---

### Lloyd's Algorithm (detailed) for fixed _k_ (naive _k_-means)

1. Initialize
  Choose `$k$` centroids

2. Iterations: Two phases
   1. (Movement) Assign each sample point to the closest _centroid_
   Assign each sample point to a class in the Voronoi partition defined by the centroids
   1. (Update) For each class in the current Voronoi partition, update teh  _centroid_ so as to minimize the Within Cluster Sum of Squared distances.

???

From `scikit-learn` documentation

> The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

> The average complexity is given by `$O(k \times n \times T)$`, were `$n$` is the number of samples and `$T$` is the number of iterations.

> The worst case complexity is given by `$O(n^(k+2/p))$` with `$n = n_{\text{samples}}$`, `$p = n_{\text{features}}$`. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)

> In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in local minima. That’s why it can be useful to restart it several times.

> If the algorithm stops before fully converging (because of `tol` or `max_iter`), `labels_` and `cluster_centers_` will not be consistent, i.e. the `cluster_centers_` will not be the means of the points in each cluster. Also, the estimator will reassign `labels_` after the last iteration to make `labels_` consistent with predict on the training set.

---

### Lloyd's iterations

<div class="figure">
<img src="cm-8-EDA_files/figure-html/lloyd1-1.png" alt="After 1 step" width="864" />
<p class="caption">After 1 step</p>
</div>

---

### Lloyd's iterations  (continued)

<div class="figure">
<img src="cm-8-EDA_files/figure-html/lloyd5-1.png" alt="After 2 steps" width="864" />
<p class="caption">After 2 steps</p>
</div>

---

### Lloyd's iterations (continued)

<div class="figure">
<img src="cm-8-EDA_files/figure-html/lloyd00-1.png" alt="After 4 steps" width="864" />
<p class="caption">After 4 steps</p>
</div>

---

### Analysis

Given

- codebook `$\mathcal{C} =\big\{c_1, \ldots, c_k\big\}$` and
- clusters `$C_1, \ldots C_k$`,

the  _within-clusters sum of squares_ is defined as
`$$\sum_{j=1}^k  \sum_{i: X_i \in C_j} \bigg\Vert c_j - X_i \bigg\Vert^2$$`

### Lemma

.bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[

At each stage, the _within clusters sums of squares_ does not increase

]

???

There is no guarantee that the algorithm will converge in few
iterations

Iterations are carried out in a brute force manner

---

### Proof

Let `$\mathcal{C}^{(t)} =\big\{ c^{(t)}_1, \ldots, c_k^{(t)}\big\}$` be the codebook after `$t$` steps

Let `$\big({C}^{(t)}_j\big)_{j \leq k}$` be the clusters after `$t$` steps

- Centroids at step `$t+1$` are the barycenters of clusters `$\big({C}^{(t)}_j\big)_{j \leq k}$`

`$$c^{(t+1)}_j = \frac{1}{|C_j^{(t)}|} \sum_{X_i \in C^{(t)}_j} X_i$$`

- Clusters `$C^{(t+1)}_j$` are defined by

`$$C^{(t+1)}_j = \bigg\{ X_i : \Vert X_i -  c^{(t+1)}_j\Vert = \min_{c \in \mathcal{C}^{(t+1)}} \Vert X_i -  c\Vert \bigg\}$$`

Each sample point is assigned to the closest centroid

---

### Proof (continued)

`$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t)}_j  - X_i\bigg\Vert^2  \geq \sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t+1)}_j  - X_i\bigg\Vert^2$$`

since for each `$j$`, the mean `$c^{(t+1)}_j$` minimizes the average square distance to points in `$C^{(t)}_j$`

`$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \bigg\Vert c^{(t+1)}_j  - X_i\bigg\Vert^2 \geq \sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \min_{c \in \mathcal{C}^{(t+1)}}\bigg\Vert c  - X_i\bigg\Vert^2$$`

`$$\sum_{j=1}^k \sum_{X_i \in C^{(t)}_j} \min_{c \in \mathcal{C}^{(t+1)}}\bigg\Vert c  - X_i\bigg\Vert^2 = \sum_{j=1}^k \sum_{X_i \in C^{(t+1)}_j} \bigg\Vert c^{(t+1)}_j  - X_i\bigg\Vert^2$$`

---

### <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M384 144c0-44.2-35.8-80-80-80s-80 35.8-80 80c0 36.4 24.3 67.1 57.5 76.8-.6 16.1-4.2 28.5-11 36.9-15.4 19.2-49.3 22.4-85.2 25.7-28.2 2.6-57.4 5.4-81.3 16.9v-144c32.5-10.2 56-40.5 56-76.3 0-44.2-35.8-80-80-80S0 35.8 0 80c0 35.8 23.5 66.1 56 76.3v199.3C23.5 365.9 0 396.2 0 432c0 44.2 35.8 80 80 80s80-35.8 80-80c0-34-21.2-63.1-51.2-74.6 3.1-5.2 7.8-9.8 14.9-13.4 16.2-8.2 40.4-10.4 66.1-12.8 42.2-3.9 90-8.4 118.2-43.4 14-17.4 21.1-39.8 21.6-67.9 31.6-10.8 54.4-40.7 54.4-75.9zM80 64c8.8 0 16 7.2 16 16s-7.2 16-16 16-16-7.2-16-16 7.2-16 16-16zm0 384c-8.8 0-16-7.2-16-16s7.2-16 16-16 16 7.2 16 16-7.2 16-16 16zm224-320c8.8 0 16 7.2 16 16s-7.2 16-16 16-16-7.2-16-16 7.2-16 16-16z"/></svg> Variants of _k_-means

Implementations of  `$k$`-means vary with respect to

- Initialization
  + `k-means++`
  + Forgy : pick initial centroids at random from the dataset
  + Random partition : pick a random partition of the dataset and initialize centroids by computing means in each class
  + ...

- Movement/assignment
  + Naive `$k$` means uses  brute-force search for closest centroid. Each step requires `$\Omega(n \times k)$` operations
  + Elkan (used by <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M439.8 200.5c-7.7-30.9-22.3-54.2-53.4-54.2h-40.1v47.4c0 36.8-31.2 67.8-66.8 67.8H172.7c-29.2 0-53.4 25-53.4 54.3v101.8c0 29 25.2 46 53.4 54.3 33.8 9.9 66.3 11.7 106.8 0 26.9-7.8 53.4-23.5 53.4-54.3v-40.7H226.2v-13.6h160.2c31.1 0 42.6-21.7 53.4-54.2 11.2-33.5 10.7-65.7 0-108.6zM286.2 404c11.1 0 20.1 9.1 20.1 20.3 0 11.3-9 20.4-20.1 20.4-11 0-20.1-9.2-20.1-20.4.1-11.3 9.1-20.3 20.1-20.3zM167.8 248.1h106.8c29.7 0 53.4-24.5 53.4-54.3V91.9c0-29-24.4-50.7-53.4-55.6-35.8-5.9-74.7-5.6-106.8.1-45.2 8-53.4 24.7-53.4 55.6v40.7h106.9v13.6h-147c-31.1 0-58.3 18.7-66.8 54.2-9.8 40.7-10.2 66.1 0 108.6 7.6 31.6 25.7 54.2 56.8 54.2H101v-48.8c0-35.3 30.5-66.4 66.8-66.4zm-6.7-142.6c-11.1 0-20.1-9.1-20.1-20.3.1-11.3 9-20.4 20.1-20.4 11 0 20.1 9.2 20.1 20.4s-9 20.3-20.1 20.3z"/></svg> `scikit-learn`)
  + Hartigan-Wong <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> default
  + ...

???

> Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naïve implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm. .fr.f6[Wikipedia]

In base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>, `kmeans` is a wrapper for different but related algorithms.

Lloyd's algorithm is the first and simplest versions of a series of heuristic methods designed to minimize the k-means cost

- `MacQueen`  modify the mean each time a sample is assigned to a new cluster

- `Hartigan-Wong` is the _default_ method. It modifies the mean by removing the considered
sample point, assign it to the nearby center and recompute the new mean after assignment.

- `Forgy`

---
template: inter-slide
name: pcaandkmeans

## Combining PCA and `$k$`-means

---

The result of a clustering procedure like `kmeans` can be visualized by projecting the dataset on a pair of native variables and  using some aesthetics to emphasize the clusters

This is not always the best way.

First choosing a pair of native variables may not be straightforward. The projected pairwise distances may not faithfully reflect the pairwise distances that serve for clustering.

It makes sense to project the dataset
of the `$2$`-dimensional subspace that maximizes the projected inertia, that is on the space generated
by the first two principal components

---

### PCA, projection, `$k$`-means

.fl.w-30.pa2.f6[
The kmeans clustering of the Iris dataset is projected on the first two principal components: `prcomp` is used to perform PCA with neither centering nor scaling

`kmeans` is applied to the rotated data

The straight lines are the not the projections of the boundaries of the (4-dimensional) Voronoï cells defined by the clusters centroids, but the boundaries of the 2-dimensional Voronoï celles defined by the projections of the cluster centroids

]
.fl.w-70.pa2[

```r
iris_a <- broom::augment(prcomp(x = iris[, -5],
                         center = FALSE,
                         scale.=FALSE,
                         rank. = 4), iris)

km3 <- iris_a %>%
  select(starts_with(".fitted")) %>%
  kmeans(3, nstart = 20)

iris_a <- broom::augment(km3, iris_a)
```

]

```r
ggplot(data=iris_a,
       aes(x=.fittedPC1, y=.fittedPC2)) +
    coord_fixed(ratio=1) +
*   stat_voronoi(data=data.frame(km3$centers),
                     geom="path",
               outline = data.frame(x=c(-12, -4, -4 , -12),
                                    y=c(-3, -3, 4, 4))) +
  geom_point(aes(shape=Species, col=.cluster)) +
* geom_point(data=data.frame(km3$centers),
             aes(x=.fittedPC1, y=.fittedPC2),
             shape='+', size=5) +
  xlab("PC1") +
  ylab("PC2") +
  labs(col="Cluster") +
  ggtitle('Kmeans clustering of Iris dataset projected on first principal components.')
```
]

![](cm-8-EDA_files/figure-html/pcakmeans-1.png)
]
]]

???

---

- Choosing `$k$`

- Assessing clustering quality

- Scaling or not scaling ?

- Choosing a distance

- Initialization methods

- Movement/assignment update

- Stopping rules

---
template: inter-slide
name: quantization

## `$k$`-means and _quantization_

---

Quantization plays an important role in signal processing and information theory (lossy coding with quadratic distortion)

Given a probability distribution `$P$` over
a metric space `$(\mathcal{X},d)$`, a `$k$`-quantizer is defined by a `$k$`-element subset of
`$\mathcal{X}$`, `$\mathbf{c} :=  \{x_1,\ldots, x_k\}$` called a codebook.

The codebook defines a quantization by mapping every `$x \in \mathcal{X}$` to its nearest neighbor in codebook `$\mathbf{c}$`

The quality of a codebook is assessed by its mean distortion measured as the mean quadratic distance to the nearest neighbor:

`$$\mathsf{R}(\mathbf{c}) := \mathbb{E}\left[\min_{x \in \mathbf{c}}d(X,x)^2\right]$$`

where `$X \sim P$`

---

When the  `$P$` is known,  designing an optimal codebook  may  be a difficult optimization problem

When `$P$` is unknown, if the statistician is left with an i.i.d. sample `$X_1,\ldots,X_n \sim P$`, the first reasonable thing to do is minimizing the empirical distortion, the `$k$`-means cost:

`$$\mathsf{R}_n(\mathbf{c}) :=  \frac{1}{n} \sum_{i=1}^n \min_{x \in \mathbf{c}} d(X_i,x)^2$$`

### Theorem NP-hardness of k-means cost minimization

.bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[

Mininizing the `$k$`-means cost (sum of within clusters sum of squares) is `$\mathsf{NP}$` -hard.

]

???

Sanjoy Dasgupta proved that if `$\mathsf{P} \neq \mathsf{NP}$`, minimizing
the `$k$`-means cost (sum of within clusters sum of squares) is __computationally intractable__

This is one result in a long series of negative results going back
to the late seventies

---

### Statistical/information-theoretical issues

Even though minimizing the `$k$`-means cost is hard,
one may investigate the _statistical_ problem raised by minimizing the `$k$`-means cost.

`$k$`-means served as a showcase for empirical process theory during the early 1980's

Significant progress during recent years
The `$k$`-means cost provides a concrete illustration of a recurrent situation.

If the sampling distribution is square integrable and has a density with respect to Lebesgue measure, the mapping  `$\mathsf{R}(\mathbf{c})$`
is differentiable, its gradient can be explicitly computed

???

`Cite(myBib, "Pol84")`  `Cite(myBib, "MR3080408")`, `Cite(myBib, "MR3316191")`

---

### Smoothness of _k_-means cost

Whereas the local behavior of the `$k$`-means cost is simple, the global behavior remains elusive.

Bounding the number of global minima, local minima, local extrema and saddlepoints   is difficult.

Observation: under fairly general assumptions,
the `$k$`-means cost function is twice differentiable in the neighborhood of optimal codebooks

Even though the `$k$`-means cost function is not convex, recent advances tell us that if sample size tends to infinity,  an empirical cost functions will also share local minima, local maxima, and local saddlepoints with the  theoretical population cost function

???

`Cite(myBib, "Pol84")`

`Cite(myBib, "MR3851754")`

---

### Pollard's regularity conditions

- The sampling distribution is absolutely continuous with respect to Lebesgue measure on `$\mathbb{R}^p$`.

- The Hessian matrix of the mapping `$\mathbf{c} \mapsto \mathsf{R}(\mathbf{c})$` is positive definite for all optimal codebooks

Under Pollard's regularity conditions, let `$\mathbf{c}^*$` denote the optimal codebook, and `$\widehat{\mathbf{c}}_n$` denote the optimal empirical codebook

Large sample  behavior of the empirically optimal codebook:

- `$\sqrt{n}\|\widehat{\mathbf{c}}_n - \mathbf{c}^*\|$` is asymptotically normal

and

- `$n \left( \mathsf{R}(\widehat{\mathbf{c}}_n) - \mathsf{R}(\mathbf{c}^*)\right)$` is stochastically bounded

---

### Key observation

Pollard's condition entails

that

for some constant `$\kappa_0>0$`,

`$$\mathsf{R}(\mathbf{c}) - \mathsf{R}(\mathbf{c}^*) \geq \kappa_0 \|\mathbf{c}- \mathbf{c}^*\|^2$$`

---

### <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M576 240c0-23.63-12.95-44.04-32-55.12V32.01C544 23.26 537.02 0 512 0c-7.12 0-14.19 2.38-19.98 7.02l-85.03 68.03C364.28 109.19 310.66 128 256 128H64c-35.35 0-64 28.65-64 64v96c0 35.35 28.65 64 64 64h33.7c-1.39 10.48-2.18 21.14-2.18 32 0 39.77 9.26 77.35 25.56 110.94 5.19 10.69 16.52 17.06 28.4 17.06h74.28c26.05 0 41.69-29.84 25.9-50.56-16.4-21.52-26.15-48.36-26.15-77.44 0-11.11 1.62-21.79 4.41-32H256c54.66 0 108.28 18.81 150.98 52.95l85.03 68.03a32.023 32.023 0 0 0 19.98 7.02c24.92 0 32-22.78 32-32V295.13C563.05 284.04 576 263.63 576 240zm-96 141.42l-33.05-26.44C392.95 311.78 325.12 288 256 288v-96c69.12 0 136.95-23.78 190.95-66.98L480 98.58v282.84z"/></svg> Conclusion

- Euclidean distance is used as a metric and inertia is used as a measure of cluster scatter

- The number of clusters `$k$` is an input parameter

- Convergence to a local minimum may produce counterintuitive ("wrong") results

???

Squared Euclidean distance is very sensitive to outliers

An inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

---
exclude: true

### References

NULL

]

---

background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End