name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Model-based Clustering ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true class: middle, left, inverse # Exploratory Data Analysis X Model-Based Clustering ### 2021-12-10 #### EDA Master I MIDS et MFA #### [EDA Master I MIDS et MFA](http://stephane-v-boucheron.fr/courses/eda) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: middle, inverse ##
### [Model Based Clustering: the idea](#bigpic) ### [Gaussian Mixture Models](#gmm) ### [EM Algorithm](#em) ### [Mclust](#mclust) ??? ### [DBscan](#dbscan) --- template: inter-slide name: bigpic ## Model based clustering: the idea --- ### Mixture models Many natural statistics, such as the distribution of people’s heights, can be modeled as a _mixture of Gaussians_ The _components of the mixture_ represent the parts of the distribution coming from different subpopulations If we don’t know about the subpopulations in advance, can we figure out what they are and estimate their parameters? Can we classify samples based on which subpopulation they are likely to have come from? ??? This method now comes with provable guarantees; building on them, the high-dimensional learning problem can be solved too. Along the way, we will develop insights about systems of polynomial equations and how they can be used for parameter learning. --- The one-dimensional case was introduced by Karl Pearson Karl Pearson introduced - The _method of moments_ for estimating the parameters of a distribution - _Mixture models_ for modeling the presence of subpopulations .f6[ > Pearson, K. (1894). _Contributions to the mathematical theory of evolution_. Phil. Trans. Roy. Soc. London A 185, 71-110. ] --- ### Histogram for Naples crabs data .fl.w-40.f6.pa2[ ```r data("pearson") p <- pearson %>% mutate(dens=250*freq/sum(freq)) %>% ggplot() + aes(x=c(.58, ratio[1:28]), y=dens) + geom_step(direction = 'hv') + xlab("") + ylab("density") + ggtitle("Naples crabs") p ``` > Mr and Mrs Weldon collected 1,000 Naples crabs and measured 23 different physical attributes of each of them. ] .fl.w-60.pa2[ ![](cm-10-EDA_files/figure-html/histo_crab-1.png) .f6[ > All but one of these statistics was approximately Gaussian. > So why weren’t they all Gaussian? ] ] ??? --- ### Crabs modeling Heuristic 1
: Naples crabs do not form a single population but mixture of two subpopulations Heuristic 2
: Each subpopulation is approximately Gaussian `$$\pi_1 \times \frac{1}{\sigma_1}\phi\left(\frac{\cdot - \mu_1}{\sigma_1}\right) + (1- \pi_1) \times \frac{1}{\sigma_2}\phi\left(\frac{\cdot - \mu_2}{\sigma_2}\right)$$` `\(\theta= (\pi_1, \mu_1, \sigma_1, \mu_2, \sigma_2) \in [0,1] \times \left(\mathbb{R} \times ]0, \infty)\right)^2\)`
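A minimal sketch of this two-component density in R (the parameter values below are purely illustrative, not Pearson's estimates):

```r
# density of a two-component univariate Gaussian mixture
# theta = (pi1, mu1, sigma1, mu2, sigma2)
dmix2 <- function(x, pi1, mu1, sigma1, mu2, sigma2) {
  pi1 * dnorm(x, mean = mu1, sd = sigma1) +
    (1 - pi1) * dnorm(x, mean = mu2, sd = sigma2)
}

# evaluate and plot with illustrative parameter values
curve(dmix2(x, pi1 = .5, mu1 = .63, sigma1 = .02, mu2 = .66, sigma2 = .015),
      from = .57, to = .71, xlab = "ratio", ylab = "density")
```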
With this parametrization, the mixture model is not _identifiable_: swapping the labels of the two components leaves the distribution unchanged ??? .f6[ > Although not as good a fit as a mixture of two normals, a single Weibull component is an acceptable fit at the 1% level of significance. Since there was no independent biological evidence that the population was a mixture, the fact that a mixture of normals fits well does not prove that there are two species of crabs. .fr[ [http://ms.mcmaster.ca/peter/mix/demex/excrabs.html](http://ms.mcmaster.ca/peter/mix/demex/excrabs.html)] ] --- ### A system of polynomial equations If `\(X \sim \mathcal{N}(0,1)\)`, `$$M_k := \mathbb{E} X^k = \begin{cases} 0 & \text{if } k\equiv 1 \mod 2\\ \frac{k!}{2^{k/2} (k/2)!}&\text{otherwise}\end{cases}$$` `\(Y \sim \mathcal{N}(\mu,\sigma^2)\)` is distributed like `\(\mu + \sigma X\)`, hence `$$\mathbb{E} Y^k = \sum_{r=0}^k \binom{k}{r} \mu^{k-r} \sigma^r M_r$$` Moments of mixtures of Gaussians are convex combinations of Gaussian moments Matching empirical moments `\(\widehat{M}_k\)` with Gaussian mixture moments leads to a collection of polynomial equations `$$\widehat{M}_k = \pi_1 \sum_{r=0}^k \binom{k}{r} \mu_1^{k-r} \sigma_1^r M_r +(1-\pi_1)\sum_{r=0}^k \binom{k}{r} \mu_2^{k-r} \sigma_2^r M_r \qquad\text{for } k \leq 5$$` --- ###
The birth of the method of moments Pearson solved the system of polynomial equations by hand, and he found a number of candidate solutions. Each solution corresponds to a way to set all five unknown parameters so that the _moments of the mixture_ match the _empirical moments_ Some of the solutions were clearly not right; some had negative values for the variance, or a value for the mixing weight `\(\pi_1\)` that was not between zero and one After eliminating these solutions, Pearson was still left with more than one candidate solution. His approach was to choose the candidate whose prediction was closest to the empirical sixth moment. This is called the _sixth moment test_ --- ### Naples crabs and Gaussian mixtures .fl.w-70.pa2[ <img src="cm-10-EDA_files/figure-html/crabs-gmm-1.png" width="504" /> ] .fl.w-30.pa2.f6[ - `\(\pi_1 =.5\)` - `\(\mu_1 = .6343\)` - `\(\sigma_1=.0190\)` - `\(\mu_2 = .0190\)` - `\(\sigma_2 = .0121\)`
This is not the Pearson fit ] ??? Pearson (1894) analysed this histogram by the method of moments. The calculation was formidable and done without the aid of computing machinery of any kind. He found two solutions, one with 41.45% of the population in the first component and 58.55% in the second, the other with 53.28% in the first component and 46.72% in the second. He preferred the first solution on the basis of agreement with the sixth moment. MIX does not converge to a unique solution if all parameters are unconstrained. The iterations wander between a 6:4 and 4:6 ratio for the two components, with no fit being significantly better than any other. The standard errors of the proportions are quite large. The fit shown here resolves this uncertainty by constraining the proportions to be equal. The presence of two components was interpreted by Pearson as evidence that there were two species of crabs. I know of no biological justification for constraining the proportions to be equal, but the fit obtained is excellent. Constraining the standard deviations to be equal does not give an acceptable fit. [From mixdist](http://ms.mcmaster.ca/peter/mix/demex/excrabs.html) --- ###
Mixture distributions In model-based clustering, one assumes that observed data `\(x_1, \ldots, x_n\)` are i.i.d. from some _mixture_ distribution Assume we have a parametric model, that is `\(\Theta \subseteq \mathbb{R}^p\)`, and a collection `\((P_\theta)_{\theta \in \Theta}\)` of probability distributions over the space of observations The model is assumed to be dominated and equipped with a density `\(p( \cdot \mid \theta)\)` for each `\(P_\theta\)` We also have a finite latent state space `\(\mathcal{Z}\)` ( `\(|\mathcal{Z}|=K\)` ) We denote by `\(\mathcal{M}(\mathcal{Z})\)` the set of probability distributions over `\(\mathcal{Z}\)` (multinomial distributions). A `\(K\)`-mixture distribution is defined by a _mixing_ distribution `\(\pi \in \mathcal{M}(\mathcal{Z})\)` and `\(K\)` parameters `\((\theta_z)_{z\in \mathcal{Z}}\)`, each belonging to `\(\Theta\)` The mixture distribution over _observation space_ has density `$$p(\cdot \mid \pi, \theta) =\sum_{z \in \mathcal{Z}} \pi(z) \times p(\cdot | \theta_z )$$` The parameter space for mixture distributions is `\(\mathcal{M}(\mathcal{Z}) \times \Theta^{|\mathcal{Z}|}\)`
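For a Gaussian parametric model, this density is easy to evaluate, and sampling from it follows the two-step mechanism detailed on the next slide. A minimal sketch (the mixing weights and component parameters below are illustrative):

```r
set.seed(42)

pi_z    <- c(.3, .5, .2)   # mixing distribution over K = 3 latent states
mu_z    <- c(-2, 0, 3)     # component means (illustrative)
sigma_z <- c(.5, 1, .8)    # component standard deviations (illustrative)

# mixture density: sum over z of pi(z) * p(x | theta_z)
dmix <- function(x) {
  vapply(x, function(xi) sum(pi_z * dnorm(xi, mu_z, sigma_z)), numeric(1))
}

# two-step sampling: draw the latent Z, then X given Z
rmix <- function(n) {
  z <- sample(seq_along(pi_z), size = n, replace = TRUE, prob = pi_z)
  x <- rnorm(n, mean = mu_z[z], sd = sigma_z[z])
  data.frame(z = z, x = x)   # the complete data (Z, X)
}

head(rmix(5))
```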
with these conventions, mixture distributions are not _identifiable_ --- ### Sampling from a mixture distribution Sampling from `\(\sum_{z \in \mathcal{Z}} \pi(z) \times p(\cdot | \theta_z )\)` can be done in two steps 1. Generate the _latent_ variable `\(Z\)` from the mixing distribution `\(\pi\)` 2. Generate the _observed_ variable `\(X\)` according to `\(p(\cdot | \theta_Z)\)` The couple `\((Z, X)\)` defines the _complete_ datum The mixture distribution defines a joint distribution for `\((Z, X)\)` with density `\(\pi(z)\times p(x \mid \theta_z)\)` --- ### Conditional distributions `\(p(\cdot | \theta_Z)\)`, also denoted `\(p(\cdot \mid \theta, Z)\)`, may be viewed as the conditional density of the observed variable `\(X\)` given the latent variable `\(Z\)` (and, in a Bayesian perspective, as the conditional density of `\(X\)` given `\(Z\)` and `\(\theta\)`) Analogously, `$$p(z \mid x, \pi, \theta) = \frac{\pi(z)\times p(x \mid \theta_z)}{\sum_{z' \in \mathcal{Z}} \pi(z') \times p(x \mid \theta_{z'} )}$$` is the posterior density/probability of latent state/variable `\(z\)` given observation `\(x\)` --- exclude: true With each sample point comes a latent or hidden variable that tells which mixture component the sample point comes from Clusters ideally correspond to _mixture components_ Distributional assumptions on the sampling mechanism allow us to analyze the performance of clustering methods ??? This is another interplay between algorithm analysis and the concentration of measure phenomenon It backs the traditional justification that heuristic methods work in practice --- ###
(Mixture) Model based clustering - Assume the observations `\(x_1, \ldots, x_n\)` come from i.i.d. sampling from some mixture distribution (from some mixture model) - Estimate the mixture distribution parameters - `\(\widehat{\pi} \in \mathcal{M}(\mathcal{Z})\)` - `\(\widehat{\theta} = (\widehat{\theta}_z)_{z \in \mathcal{Z}} \in \Theta^{|\mathcal{Z}|}\)` - Classify the observations using the estimated mixture distribution parameters `$$\widehat{z}_i = \arg\max_{z \in \mathcal{Z}} \widehat{\pi}(z) \times p(x_i \mid \widehat{\theta}_z) = \arg\max_{z \in \mathcal{Z}} p(z \mid x_i, \widehat{\pi}, \widehat{\theta})$$` Parameter estimation relies on a likelihood method: - Maximum Likelihood estimation - Bayesian estimation - _Others_ if ML is not computationally feasible ??? - Use a _generative model_ of the data: `$$\Pr{\vec{X}} = \sum_{k=1}^{K} \pi_k \Pr_{\theta_k}\bigg\{\vec{X}| k\bigg\}$$` where `\(\pi_k\)` are proportions and `\(\Pr_{\theta_k}\bigg\{\vec{X}|k\bigg\}\)` are parametric probability models. - Estimate parameters (often using a _Maximum Likelihood_ principle) - Assign each observation to the class maximizing the _posterior probability_ (obtained by _Bayes formula_) `$$\frac{\widehat{\pi_k} \Pr_{\widehat{\theta_k}}\bigg\{\vec{X}| k\bigg\}}{\sum_{k'=1}^K \widehat{\pi_{k'}} \Pr_{\widehat{\theta_{k'}}} \bigg\{\vec{X}| k'\bigg\}}$$` --- exclude: true ### A two classes example - A mixture `\(\pi_1 f_1(\vec{X}) + \pi_2 f_2(\vec{X})\)` - and the posterior probability `\(\pi_i f_i(\vec{X})/(\pi_1 f_1(\vec{X}) + \pi_2 f_2(\vec{X}))\)` - Natural class assignment! - A mixture `\(\pi_1 f_1(\vec{X}) + \pi_2 f_2(\vec{X})\)` - Two populations with a parametric distribution `\(f_i\)`. ??? - Most classical choice: Gaussian distribution (GMM) --- exclude: true ### A two classes example: Gaussian setting again - `\(\vec{X}_1,\ldots,\vec{X}_n\)` i.i.d. - `\(\vec{X}_i \sim \mathcal{N}(\mu_1,\sigma_1^2)\)` with probability `\(\pi_1\)` or `\(\vec{X}_i \sim \mathcal{N}(\mu_2,\sigma_2^2)\)` with probability `\(\pi_2\)` - We don't know the parameters `\(\mu_i\)`, `\(\sigma_i\)`, `\(\pi_i\)` - We don't know from which distribution each `\(\vec{X}_i\)` has been drawn ??? Convolution viewpoint when we deal with a location mixture --- ###
Maximum Likelihood Log-likelihood with respect to observed data `$$\ell(\pi, \theta) = \sum_{i=1}^n \log \left( \sum_{z \in \mathcal{Z}} \pi(z) \times p(x_i | \theta_z )\right)$$`
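Evaluating this criterion at a given parameter is simple; a minimal sketch for a univariate Gaussian mixture (names and values below are illustrative):

```r
# observed-data log-likelihood of a univariate Gaussian mixture
# pi_z, mu_z, sigma_z: candidate mixing weights, means, standard deviations
loglik_obs <- function(x, pi_z, mu_z, sigma_z) {
  sum(vapply(x,
             function(xi) log(sum(pi_z * dnorm(xi, mu_z, sigma_z))),
             numeric(1)))
}

x <- c(rnorm(60, -1, .5), rnorm(40, 2, 1))
loglik_obs(x, pi_z = c(.6, .4), mu_z = c(-1, 2), sigma_z = c(.5, 1))
```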
No straightforward way to optimize the parameters! --- ###
What if we knew the latent states? Log-likelihood with respect to complete data `$$\begin{array}{rl}\ell_c(\theta) & = \sum_{i=1}^n \log \left( \prod_{z \in \mathcal{Z}} \left(\pi(z) \times p(x_i | \theta_z )\right)^{\mathbb{I}_{z=z_i}}\right)\\ & = \sum_{z\in \mathcal{Z}} \sum_{i=1}^n\mathbb{I}_{z=z_i} \left(\log \pi(z) + \log p(x_i | \theta_z )\right)\end{array}$$` Optimizing Log-likelihood with respect to complete data is as easy as maximizing the likelihood in model `\((P_\theta)_{\theta \in \Theta}\)` ??? - Density: `$$\pi_1 \Phi(\vec{X},\mu_1,\sigma_1^2) + \pi_2 \Phi(\vec{X},\mu_2,\sigma_2^2)$$` - `\(\log\)`-likelihood: `$$\mathcal{L}(\theta) = \sum_{i=1}^n \log\left( \pi_1 \Phi(\vec{X}_i,\mu_1,\sigma_1^2) + \pi_2 \Phi(\vec{X}_i,\mu_2,\sigma_2^2) \right)$$` - No straightforward way to optimize the parameters! --- exclude: true ###
What if? - Assume we know from which distribution each point has been sampled from (we know the latent variables): `\(Z_i=1\)` if from `\(f_1\)` and `\(Z_i=0\)` otherwise - Complete `\(\log\)`-likelihood: `$$\sum_{i=1}^n Z_i \log \phi(\vec{X}_i -\mu_1,\sigma_1^2) + (1-Z_i) \log \phi(\vec{X}_i -\mu_2,\sigma_2^2)$$` - Easy optimization (at least for Gaussian mixtures) -
But `\(Z_i\)`'s are unknown... --- ###
Bootstrapping Idea - Replace each `\(Z_i\)` by its _expectation_ given current estimate of `\(\pi, \theta\)` - `\(\mathbb{E}_{\pi, \theta}[\mathbb{I}_{Z_i=z} \mid x_i]=\Pr\{Z_i=z\mid x_i, \pi, \theta\}\)` (posterior probability) - Re-estimate `\(\theta\)` using a proxy for the complete log-likelihood - and iterate... - Can be proved to be a good idea! ??? --- template: inter-slide name: em ## EM algorithm --- ### EM Algorithm Dempster–Laird–Rubin (1977) after several others. - (Random) initialization: `\(\pi^0\)`, `\(\theta^0_z, z \in \mathcal{Z}\)`, - Repeat (until ?): - __Expectation__ (with respect to current posterior probability): + Compute _responsibilities_: `\(r^i_z = \Pr\{Z_i=z\mid x_i, \pi^t, \theta^t\}\)` + This defines `$$(\pi, \theta) \mapsto \sum_{z\in \mathcal{Z}} \sum_{i=1}^n r^i_z \left(\log \pi(z) + \log p(x_i | \theta_z )\right)$$` - __Maximization__ with respect to `\(\pi, \theta\)` of `$$\sum_{z\in \mathcal{Z}} \sum_{i=1}^n r^i_z \left(\log \pi(z) + \log p(x_i | \theta_z )\right)$$` See [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) ??? `\(\sum_{z\in \mathcal{Z}} \sum_{i=1}^n r^i_z \left(\log \pi(z) + \log p(x_i | \theta_z )\right)\)` --- ### Why EM? Like _k_-means, EM is a heuristic attempt to solve a difficult computational problem, so it needs to be motivated We will just check that, as for _k_-means, each iteration of EM increases the log-likelihood EM converges towards a local maximum (or a saddle point) of the log-likelihood This result is proved without too many assumptions on the component model As for _k_-means, initialization matters Establishing that, under further assumptions, EM converges with high probability towards a global maximum and provides a consistent estimation procedure has attracted a lot of work during the last 15 years --- exclude: true ### Comparison with _k_-means Soft assignment (whereas `\(k\)`-means proceeds by hard assignment) --- ### About the proxy for log-likelihood maximization - Log-likelihood with respect to observed data `$$\ell_n(\pi, \theta) = \sum_{i=1}^n \log \left(\sum_{z \in \mathcal{Z}} p(x_i, z \mid \pi, \theta)\right)$$` Latent variables `\((z_i)_{i \leq n}\)` make computations expensive. In some mixture models, computing `\(\ell_n(\theta)\)` for a specific `\(\theta\)` may be costly. 
Optimizing `\(\ell_n(\theta)\)` with respect to `\(\theta\)` is usually costly - Log-likelihood with respect to _complete data_ `$$\ell_c(\theta) = \sum_{i=1}^n \log \left( p(x_i, z_i \mid \pi, \theta)\right)$$` may be easy to compute and even to optimize (for example for Gaussian mixtures) --- ### A lower bound for log-likelihood with respect to observed data For each `\(i\leq n\)`, let `\(q_i\)` be any positive probability mass function over `\(\mathcal{Z}\)` `$$\begin{array}{rl}\ell(\pi, \theta) & = \sum_{i=1}^n \log \left(\sum_{z_i \in \mathcal{Z}} p(x_i, z_i \mid \pi, \theta)\right)\\ & = \sum_{i=1}^n \log \left(\sum_{z_i \in \mathcal{Z}} q_i(z_i) \frac{p(x_i, z_i \mid \pi, \theta)}{q_i(z_i)}\right)\\ & \geq \sum_{i=1}^n \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( \frac{p(x_i, z_i \mid \pi, \theta)}{q_i(z_i)}\right)\end{array}$$` The last step follows from Jensen's inequality (concavity of the logarithm) Let us denote the right-hand side of the inequality by `\(Q(\pi, \theta, q)\)` where `\(q\)` stands for the sequence `\(q_1, \ldots, q_n\)` --- ### Reshaping the lower bound `$$\begin{array}{rl} Q(\pi, \theta, q) & = \sum_{i=1}^n \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( \frac{p(x_i, z_i \mid \pi,\theta)}{q_i(z_i)}\right)\\ & = \sum_{i=1}^n \left[\mathbb{E}_{q_i} \log \left( p(x_i, Z \mid \pi,\theta)\right) + H(q_i)\right]\end{array}$$` where `\(H(q_i) = - \sum_{z \in \mathcal{Z}} q_i(z) \log q_i(z)\)` is the Shannon entropy of the probability distribution defined by probability mass function `\(q_i\)`.
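A sketch of the lower bound `\(Q(\pi, \theta, q)\)` for a univariate Gaussian mixture and arbitrary pmfs `\(q_i\)` (all names below are illustrative; `q` is an `n x K` matrix whose rows are the `\(q_i\)`):

```r
# lower bound Q(pi, theta, q): expected complete-data log-likelihood + entropies
Q_lower_bound <- function(x, q, pi_z, mu_z, sigma_z) {
  K <- length(pi_z)
  # joint densities p(x_i, z | pi, theta) = pi(z) * p(x_i | theta_z), an n x K matrix
  joint <- sapply(seq_len(K), function(z) pi_z[z] * dnorm(x, mu_z[z], sigma_z[z]))
  # sum_i E_{q_i} log p(x_i, Z | pi, theta)  +  sum_i H(q_i)
  sum(q * log(joint)) + sum(-q * log(q), na.rm = TRUE)
}
```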
tuning the pmfs `\(q_i\)` leaves room for optimization --- ### Picking a good `\(q_i\)` `$$\begin{array}{rl}\sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( \frac{p(x_i, z_i \mid \pi,\theta)}{q_i(z_i)}\right) & = \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( \frac{p(z_i \mid x_i, \pi, \theta)p(x_i\mid \pi, \theta)}{q_i(z_i)}\right)\\ & = \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( \frac{p(z_i \mid x_i, \pi, \theta)}{q_i(z_i)}\right)+ \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( p(x_i\mid \pi,\theta)\right)\\ & = - D\left( q_i\Vert p(\cdot\mid x_i, \pi,\theta)\right) + \sum_{z_i \in \mathcal{Z}} q_i(z_i) \log \left( p(x_i\mid \pi,\theta)\right)\\ & = - D\left( q_i\Vert p(\cdot\mid x_i, \pi,\theta)\right) + \log \left( p(x_i\mid \pi,\theta)\right)\end{array}$$` where `\(D(P \Vert Q)\)` denotes the relative entropy or Kullback-Leibler divergence between probability distributions `\(P\)` and `\(Q\)` The expression is maximized by choosing `\(q_i(\cdot) = p(\cdot\mid x_i, \pi,\theta)\)`, which is also called the posterior distribution of the latent state given the observed datum `\(x_i\)` and the parameters `\(\pi, \theta\)`
This is the _E_-step --- ### Picking a good `\(q_i\)` (continued) Given current estimate `\(\pi^t, \theta^t\)` for the _true value_ of the parameter, for each `\(i \leq n\)`, we choose `$$q^t_i = p(\cdot\mid x_i, \pi^t,\theta^t)$$` Let `\(q^t\)` denote the sequence `\(q^t_1, \ldots, q^t_n\)` The lower bound on the log-likelihood with respect to observed data turns to `$$Q(\pi,\theta, q^t) = \sum_{i=1}^n \mathbb{E}_{q^t_i} \log \left( p(x_i, Z \mid \pi, \theta)\right) + \sum_{i=1}^n H(q^t_i)$$`
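Putting the E-step (posterior responsibilities) and the M-step (weighted maximum likelihood updates) together for a univariate Gaussian mixture gives the following minimal sketch (an illustration of the generic scheme, not the `mclust` implementation):

```r
# one EM iteration for a univariate K-component Gaussian mixture
em_step <- function(x, pi_z, mu_z, sigma_z) {
  K <- length(pi_z)
  # E-step: responsibilities q^t_i(z) = p(z | x_i, pi^t, theta^t)
  joint <- sapply(seq_len(K), function(z) pi_z[z] * dnorm(x, mu_z[z], sigma_z[z]))
  resp  <- joint / rowSums(joint)
  # M-step: weighted maximum likelihood updates
  n_z       <- colSums(resp)
  pi_new    <- n_z / length(x)
  mu_new    <- colSums(resp * x) / n_z
  sigma_new <- sqrt(sapply(seq_len(K),
                           function(z) sum(resp[, z] * (x - mu_new[z])^2) / n_z[z]))
  list(pi = pi_new, mu = mu_new, sigma = sigma_new)
}

# iterate from a rough initialization (illustrative data)
x   <- c(rnorm(60, -1, .5), rnorm(40, 2, 1))
fit <- list(pi = c(.5, .5), mu = c(-2, 3), sigma = c(1, 1))
for (t in 1:50) fit <- em_step(x, fit$pi, fit$mu, fit$sigma)
fit
```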
The first term is an expected complete data log-likelihood
The second term does not depend on `\(\pi,\theta\)` --- ### Revisiting the _M_-step `$$(\pi^{t+1}, \theta^{t+1}) = \arg\max_{\pi, \theta} \sum_{i=1}^n \mathbb{E}_{q^t_i} \log \left( p(x_i, Z \mid \pi, \theta)\right)= \arg\max_{\pi, \theta} Q(\pi, \theta, q^t)$$` --- ###
Because of the peculiar choice of `\(q_i^t\)`, `$$Q(\pi^t,\theta^t, q^t) = \sum_{i=1}^n \log p(x_i \mid \pi^t,\theta^t) = \ell(\pi^t, \theta^t)$$` --- ### Proposition The EM iteration monotonically increases the likelihood with respect to observed data: `$$\ell(\pi^{t+1},\theta^{t+1}) \geq Q(\pi^{t+1}, \theta^{t+1}, q^t) \geq Q(\pi^{t},\theta^{t}, q^t) = \ell(\pi^{t}, \theta^t)$$` for `\(t \geq 0\)` --- exclude: true Expected complete data Log-Likelihood `$$Q(θ, θ^{t-1}) = \mathbb{E}[\ell_c(θ) \mid\text{Data}, θ^{t-1}]$$` `\(Q\)` is called the _auxiliary function_ The E step consists in computing `\(Q(θ, θ^{t-1})\)` The terms which define the auxiliary function are known as the _expected sufficient statistics_ (ESS). The E step optimizes _responsibilities_ The M step consists in optimizing `\(Q\)` with respect to `\(\theta\)` `$$θ^t = \arg \max_{θ} Q(θ, θ^{t-1})$$` This can be tweaked in order to perform MAP estimation. --- exclude: true ### Properties of EM algorithm The log-likelihood of the sequence of estimates delivered by the EM algorithm is non-decreasing ??? --- exclude: true ### Proof
??? --- exclude: true ### Importance of initialization --- ### About EM Besides convergence toward a _local extremum_ of the likelihood function, and rates of convergence in the vicinity of such an extremum, it is hard to come up with guarantees about the behavior of the EM algorithm --- With high probability, a variant of the EM algorithm clusters correctly a sample of a mixture of _well-separated_ high-dimensional Gaussian distributions. Key observation: _the Gaussian concentration phenomenon_ > if we consider two independent points drawn from two (possibly identical) high-dimensional Gaussian distributions, their mutual distance (the Euclidean norm of their difference) is highly concentrated around its mean value. Reading mutual distances is almost enough to perform correct clustering ??? [@MR3290441], [@MR2276533], [@MR3150710]. --- exclude: true template: inter-slide name: gmm ## Gaussian mixture models --- exclude: true ### Gaussian Mixture Model - Use `$$\Pr_{\theta_k}\{\vec{X} \mid k\} \sim \mathcal{N}(\mu_k, \Sigma_k)$$` with `\(\mathcal{N}(\mu,\Sigma)\)` the Gaussian distribution with mean `\(\mu\)` and covariance matrix `\(\Sigma\)`. - Efficient iterative optimization algorithm (EM) - Frequent constraints on the covariance matrices - Connection with `\(k\)`-means when covariance matrices are assumed to be the same multiple of the identity. --- exclude: true ### Probabilistic latent semantic analysis (PLSA) - Documents described by their word counts `\(w\)` - Model: `$$\Pr\{w\} = \sum_{k=1}^K \Pr\{k\} \Pr_{\theta_k}\{w \mid k\}$$` with `\(k\)` the (hidden) topic, `\(\Pr\{k\}\)` a topic probability and `\(\Pr_{\theta_k}\{w \mid k\}\)` a multinomial law for a given topic. - Clustering according to `$$\Pr\{k | w\} = \frac{\widehat{\Pr\{k\}} \Pr_{\widehat{\theta_k}} \{w\mid k\}}{ \sum_{k'}\widehat{\Pr\{k'\}} \Pr_{\widehat{\theta_{k'}}} \{w \mid k'\}}$$` --- template: inter-slide name: mclust ## mclust package --- ### Iris data with `Mclust` .panelset[ .panel[ .panel-name[Code] ```r data(iris) ; require(mclust) *xem <- mclust::Mclust(iris[,-5], G = 1:9, * modelNames = NULL) ``` Creates an object of class `Mclust` - `G`: the numbers of mixture components to compare (here 1 to 9) - `modelNames`: the covariance parameterizations to fit (`NULL` keeps all available models) ] .panel[ .panel-name[Summaries] ```r xem %>% broom::glance() ``` ``` ## # A tibble: 1 × 7 ## model G BIC logLik df hypvol nobs ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int> ## 1 VEV 2 -562. -216. 26 NA 150 ``` ```r xem %>% broom::tidy() ``` ``` ## # A tibble: 2 × 7 ## component size proportion mean.Sepal.Length mean.Sepal.Width mean.Petal.Leng… ## <int> <int> <dbl> <dbl> <dbl> <dbl> ## 1 1 50 0.333 5.01 3.43 1.46 ## 2 2 100 0.667 6.26 2.87 4.91 ## # … with 1 more variable: mean.Petal.Width <dbl> ``` ] .panel[ .panel-name[Augment] ```r broom::augment(xem, iris) %>% ggplot() + aes(x=Petal.Length, y= Petal.Width, color=.class, shape=Species) + geom_point() ``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-6-1.png" width="504" /> ] ] --- ### With `fviz_mclust` from `factoextra` `Mclust` trains many models with different numbers of clusters Different models are compared using a penalized criterion (default is BIC) `Mclust` selects model VEV with two clusters (setosa is separated from the two other species) ```r *fviz_mclust(xem, * what = "BIC") ``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-7-1.png" width="504" /> --- ### Plotting "uncertainty" - Uncertainty concerning ... - In the first factorial plane ```r *fviz_mclust(xem, * what="uncertainty") ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = ## "none")` instead. 
``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-8-1.png" width="504" /> --- ### Plotting classification ```r *fviz_mclust(xem, * what="classification") ``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-9-1.png" width="504" /> --- exclude:true ### Model-based clustering and dimension reduction: `MclustDR` --- exclude:true template: inter-slide name: rmixmod ## Rmixmod package --- exclude:true ```r require(Rmixmod) ``` > Rmixmod is a package based on the existing MIXMOD software. MIXMOD is a tool for fitting a mixture model of multivariate Gaussian or multinomial components to a given data set with either a clustering, a density estimation or a discriminant analysis point of view ??? --- exclude:true Estimation of the mixture parameters is performed either through maximum likelihood via the EM (Expectation Maximization, Dempster et al. 1977), the SEM (Stochastic EM, Celeux and Diebolt 1985) algorithm or through classification maximum likelihood via the CEM algorithm (Clustering EM, Celeux and Govaert 1992). --- exclude: true These three algorithms can be chained to obtain original fitting strategies (e.g. CEM then EM with results of CEM) to take advantage of each of them in the estimation process. As mixture problems usually have multiple local maxima, the program will produce different results, depending on the initial estimates supplied by the user. If the user does not input his own initial estimates, some initialization procedures are proposed (random centers for instance). --- exclude:true In the Gaussian case, fourteen models are implemented. The models may depend on constraints on the covariance matrix The models and the number of clusters can be selected by different criteria: - BIC (Bayesian Information Criterion), - ICL (Integrated Completed Likelihood, a classification version of BIC), - NEC (Entropy Criterion), or - Cross-Validation (CV). 
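---

### Predicting cluster membership for new data

A fitted `Mclust` object can also classify observations that were not used for estimation, through the posterior probabilities `\(p(z \mid x, \widehat{\pi}, \widehat{\theta})\)`. A minimal sketch, reusing the `xem` object fitted above (the held-out rows are illustrative):

```r
new_obs <- iris[c(1, 51, 101), -5]   # three rows treated as "new" observations

pred <- predict(xem, newdata = new_obs)

pred$classification   # MAP cluster labels
round(pred$z, 3)      # posterior probabilities of each cluster
```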
--- ### Geyser data .panelset[ .panel[.panel-name[Mixture modelling] .fl.w-50.pa2[ ```r data(geyser) *xem1<-mixmodCluster(geyser,3) geyser %>% ggplot() + aes(x=Duration, y=Waiting.Time) + geom_point() + ggtitle("Geyser data") ``` ] .fl.w-50.pa2[ ![](cm-10-EDA_files/figure-html/geyser-scatter-1.png) ] ] .panel[.panel-name[Summary] .f6[ ``` ## ************************************************************** ## * Number of samples = 272 ## * Problem dimension = 2 ## ************************************************************** ## * Number of cluster = 3 ## * Model Type = Gaussian_pk_Lk_C ## * Criterion = BIC(2321.9458) ## * Parameters = list by cluster ## * Cluster 1 : ## Proportion = 0.3558 ## Means = 2.0363 54.4793 ## Variances = | 0.0742 0.4846 | ## | 0.4846 32.1347 | ## * Cluster 2 : ## Proportion = 0.4046 ## Means = 4.5138 80.9044 ## Variances = | 0.0677 0.4420 | ## | 0.4420 29.3092 | ## * Cluster 3 : ## Proportion = 0.2396 ## Means = 3.9110 78.3816 ## Variances = | 0.1055 0.6887 | ## | 0.6887 45.6687 | ## * Log-likelihood = -1124.5352 ## ************************************************************** ``` ] ] .panel[.panel-name[Plot] ``` ## [1] 1 ``` ``` ## [1] 2 ``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] ] --- exclude: true ### Iris dataset .panelset[ .panel[ .panel-name[] ```r data(iris) *xem <- mixmodCluster(iris[,-5], 3) ``` Creates an object of class MixmodCluster ] .panel[ .panel-name[] ] .panel[ .panel-name[] ```r plot(xem) ``` ``` ## [1] 1 ``` ``` ## [1] 2 ``` ``` ## [1] 3 ``` ``` ## [1] 4 ``` <img src="cm-10-EDA_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] ] --- exclude: true --- exclude:true ### Diving deeper ```r class(xem) ``` ``` ## [1] "MixmodCluster" ## attr(,"package") ## [1] "Rmixmod" ``` --- exclude:true --- exclude:true template: inter-slide name: dbscan ## dbscan package --- exclude: true ### References .f6[ NULL ] --- exclude: true - [An introduction on GMM](https://www.r-bloggers.com/2017/02/an-intro-to-gaussian-mixture-modeling/) - [Towards Data Science](https://towardsdatascience.com/mixture-modelling-from-scratch-in-r-5ab7bfc83eef) - [RmixMod](https://www.rdocumentation.org/packages/Rmixmod/versions/2.1.5) --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End