
Exploratory Data Analysis : Hierarchical Clustering

2021-12-10

Master I MIDS & MFA

Analyse Exploratoire de Données

Stéphane Boucheron

Hierarchical Clustering: the idea

Inside hclust

Comparing dendrograms

Hierarchical clustering

  • Dendrogram : what is it?

  • From dendrogram to clusterings: cutting a dendrogram

  • Grow a dendrogram

  • Displaying a dendrogram

  • Validation and assessment

Hierarchical clustering [...] is a method of cluster analysis which seeks to build a hierarchy of clusters

Wikipedia

Recall that a clustering is a partition of some dataset

A partition D′ of X is a refinement of another partition D if every class of D′ is a subset of a class of D. Partitions D and D′ are then said to be nested

A hierarchical clustering of X is a sequence of |X| − 1 nested partitions of X, starting from the trivial partition into |X| singletons and ending with the trivial partition into a single class (X itself)

A hierarchical clustering consists of |X| − 1 nested flat clusterings

We will explore agglomerative or bottom-up methods to build hierarchical clusterings

Hierarchical clustering and dendrograms

The result of hierarchical clustering is a tree whose leaves are labelled by sample points and whose internal nodes correspond to merging operations

The tree conveys more information than any single flat clustering: if the tree is properly decorated, it is possible to reconstruct the successive merging steps and to know which rule was applied when each merging operation was performed

The tree is called a dendrogram

Dendrograms and trees show up in several areas.

Classification and Regression trees play an important role in Machine Learning.

ggdendro and dendextend may also be used to manipulate regression trees

  • Cutting a dendrogram: getting a flat clustering

  • Building a dendrogram: inside hclust

  • Displaying, reporting dendrograms

Cutting a dendrogram: Iris illustration

The famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

[From ?iris]

The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning.

The Setosa species are distinctly different from Versicolor and Virginica (they have lower petal length and width). But Versicolor and Virginica cannot easily be separated based on measurements of their sepal and petal width/length.

[From dendextend package]

library(tidyverse)   # dplyr (select, where), stringr (str_c), ggplot2
library(ggdendro)    # dendro_data(), ggdendrogram(), theme_dendro()

iris_num <- select(iris, where(is.numeric))
rownames(iris_num) <- str_c(iris$Species, 1:150, sep = "_")

iris_num %>%
  dist() %>%
  hclust() -> iris_hclust

iris_dendro <- dendro_data(iris_hclust)

ggdendrogram(iris_dendro,
             leaf_labels = TRUE,
             rotate = TRUE) +
  ggtitle("Dendrogram for Iris") +
  theme_dendro() +
  # coord_flip() +
  scale_y_reverse(expand = c(0.2, 0))

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

  • as.matrix(dist(iris[, 1:4])) returns the matrix of pairwise distances (see the sketch below)
  • the default distance is the Euclidean distance
  • broom::augment cannot be used here: there is no augment method for objects of class hclust
  • Plot comments
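
A minimal sketch illustrating the two points above about the distance matrix (variable names are ours, for illustration only):

# Pairwise distances between the 150 numeric iris rows
d_iris <- dist(iris[, 1:4])     # default: method = "euclidean"
d_mat  <- as.matrix(d_iris)     # 150 x 150 symmetric matrix, zero diagonal

dim(d_mat)                      # 150 150
d_mat[1:3, 1:3]                 # distances between the first three flowers

attr(d_iris, "method")          # "euclidean": the distance fed to hclust() above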

Cutting a dendrogram: Iris illustration (continued)

p <- iris %>%
  ggplot() +
  aes(x = Petal.Length, y = Petal.Width)

p +
  geom_point(aes(shape = Species)) +
  labs(shape = "Species") +
  ggtitle(label = "Iris data",
          subtitle = "Species in Petal plane")

Does the flat clustering obtained by cutting the dendrogram at some height reflect the partition into species?

Cutting a dendrogram: Iris illustration (continued)

iris_3 <- cutree(iris_hclust, 3)

p +
  geom_point(aes(shape = Species,
                 colour = factor(iris_3))) +
  labs(colour = "Clusters") +
  ggtitle("Iris data",
          "Hierarchical clustering, 3 classes")

Cutting a dendrogram: Iris illustration (continued)

p +
  geom_point(aes(shape = Species,
                 colour = factor(iris_3))) +
  labs(colour = "Clusters") +
  labs(shape = "Species") +
  ggtitle("Iris data",
          "Hierarchical clustering and Species")

Cutting a dendrogram: Iris illustration (continued)

iris_2 <- cutree(iris_hclust, 2)

p +
  geom_point(aes(shape = Species,
                 colour = factor(iris_2))) +
  labs(colour = "Clusters") +
  labs(shape = "Species") +
  ggtitle("Iris data",
          "Clustering in 2 classes and Species")

Cutting a dendrogram: Iris illustration (continued)

iris_4 <- cutree(iris_hclust, 4)

p +
  geom_point(aes(shape = Species,
                 colour = factor(iris_4))) +
  labs(colour = "Clusters") +
  labs(shape = "Species") +
  ggtitle("Iris data",
          "Clustering in 4 classes and Species")

Cutting a dendrogram: better Iris illustration (continued)

The dendextend package offers a set of functions for extending dendrogram objects in R, letting you

  • visualize and
  • compare trees of hierarchical clusterings,

Features:

  • Adjust a tree’s graphical parameters - the color, size, type, etc of its branches, nodes and labels
  • Visually and statistically compare different dendrograms to one another
require(dendextend)
require(colorspace)

species_col <- rev(rainbow_hcl(3))[as.numeric(iris$Species)]
dend <- as.dendrogram(iris_hclust)
# order it the closest we can to the order of the observations:
dend <- rotate(dend, 1:150)
# Color the branches based on the clusters:
dend <- color_branches(dend, k = 3) # , groupLabels = iris_species)
# Manually match the labels, as much as possible, to the real classification of the flowers:
labels_colors(dend) <-
  rainbow_hcl(3)[sort_levels_values(
    as.numeric(iris[, 5])[order.dendrogram(dend)]
  )]
# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height = 0.1)
# reduce the size of the labels:
dend <- assign_values_to_leaves_nodePar(dend, 0.5, "lab.cex")
# dend <- set(dend, "labels_cex", 0.5)
# And plot:
par(mar = c(3, 3, 3, 7))
plot(dend,
     main = "Clustered Iris data set
(the labels give the true flower species)",
     horiz = TRUE, nodePar = list(cex = .007))

Inside hclust

About class hclust

Results from function hclust() are objects of class hclust :

iris_hclust is an object of class hclust

Function cutree() returns a flat clustering of the dataset

What does height stand for?

What does merge stand for?

What does order stand for?

How different are the different methods?
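
The components of the hclust object answer most of these questions; here is a quick inspection sketch, reusing iris_hclust from the Iris slides:

str(iris_hclust, max.level = 1)   # list with merge, height, order, labels, method, ...

# height: the dissimilarity at which each of the 149 successive mergers occurred
head(iris_hclust$height)

# merge: a (n - 1) x 2 matrix; negative entries are singletons (row indices),
# positive entries are clusters created at earlier steps
head(iris_hclust$merge)

# order: a permutation of 1:150 placing the leaves so that branches do not cross
head(iris_hclust$order)

# method: the agglomeration rule that was used (default: "complete")
iris_hclust$method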

Hierarchical clustering of USArrests

data("USArrests")
USArrests %>% glimpse()
## Rows: 50
## Columns: 4
## $ Murder <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.…
## $ Assault <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 24…
## $ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 6…
## $ Rape <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 2…
hc <- hclust(dist(scale(USArrests[, -3])), "ave")
hcdata <- dendro_data(hc, type = "rectangle")

ggplot() +
  geom_segment(data = segment(hcdata),
               aes(x = x, y = y,
                   xend = xend, yend = yend)) +
  geom_text(data = label(hcdata),
            aes(x = x, y = y,
                label = label, hjust = 0),
            size = 3) +
  coord_flip() +
  scale_y_reverse(expand = c(0.2, 0))

About dendrograms

iris_dendro is an object of class dendro

An object of class dendro is a list of 4 objects:

  • segments
  • labels
  • leaf_labels
  • class
iris_dendro <- dendro_data(iris_hclust)
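
A quick look at these four components (assuming ggdendro is attached, as in the earlier chunks):

names(iris_dendro)                 # "segments" "labels" "leaf_labels" "class"

segment(iris_dendro) %>% head()    # x, y, xend, yend: coordinates of the branches
label(iris_dendro) %>% head()      # x, y, label: positions of the leaf labels

iris_dendro$class                  # "hclust": the class of the object given to dendro_data()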

dendrogram <-> ultrametric dissimilarities

Questions

  • How to build the dendrogram?

  • How to choose the cut?

Bird's-eye view of hierarchical agglomerative clustering methods

All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm.

  1. Identify the two closest clusters (treating existing clusters as points too) and combine them into a single cluster

  2. Recompute the distances between the new cluster and the remaining clusters with the Lance-Williams dissimilarity update formula, according to the particular clustering method being used

  3. If more than one cluster remains, return to step 1 (a naive R sketch of this loop follows)
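
Below is a naive R illustration of this loop for single linkage. It is not the hclust implementation: cluster distances are recomputed from scratch at each step instead of using the Lance-Williams update, and the function name naive_single_linkage is ours.

# Naive agglomerative clustering with single linkage, for illustration only
naive_single_linkage <- function(X) {
  d <- as.matrix(dist(X))                 # pairwise Euclidean distances
  clusters <- as.list(seq_len(nrow(X)))   # start from the singletons
  merges <- list()
  while (length(clusters) > 1) {
    # cluster distance = minimum pairwise distance (single linkage)
    best <- c(NA, NA); best_d <- Inf
    for (i in seq_along(clusters)) {
      for (j in seq_len(i - 1)) {
        d_ij <- min(d[clusters[[i]], clusters[[j]]])
        if (d_ij < best_d) { best_d <- d_ij; best <- c(j, i) }
      }
    }
    # merge the two closest clusters and record the merging height
    merges[[length(merges) + 1]] <- list(pair = best, height = best_d)
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  merges
}

set.seed(42)
X <- matrix(rnorm(10), ncol = 2)              # 5 points in the plane
str(naive_single_linkage(X), max.level = 1)   # 4 successive merges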

Greed is good!

  • Hierarchical agglomerative clustering methods are examples of greedy algorithms

  • Greedy algorithms sometimes compute optimal solutions

    • Huffman coding (Information Theory)

    • Minimum spanning tree (Graph algorithms)

  • Greedy algorithms sometimes compute sub-optimal solutions

    • Set cover (NP-hard problem)

    • ...

  • Efficient greedy algorithms rely on ad hoc data structures

    • Priority queues

    • Union-Find

Some HACMs are just implementations of Minimum Spanning Tree algorithms

Algorithm (detailed)

  • Start with (\mathcal{C}^{(0)}_i) = (\{\vec{X}_i\}), the collection of all singletons.

  • At step s, we have n - s clusters (\mathcal{C}^{(s)}_i):

    • Find the two most similar clusters according to a criterion \Delta: (i, i') = \operatorname{argmin}_{(j, j')} \Delta(\mathcal{C}^{(s)}_j, \mathcal{C}^{(s)}_{j'})

    • Merge \mathcal{C}^{(s)}_i and \mathcal{C}^{(s)}_{i'} into \mathcal{C}^{(s+1)}_i

    • Keep the n - s - 2 other clusters unchanged

  • Repeat until there is only one cluster left

Analysis

  • Complexity: O(n^3) in general.

  • Can be reduced to O(n^2) (sometimes to O(n \log n))

    • if the number of possible mergers for a given cluster is bounded.

    • for the most classical distances by maintaining a nearest neighbors list.

Merging criteria based on distances between points

Minimum linkage (single linkage):

\Delta(\mathcal{C}_i, \mathcal{C}_j) =\min_{\vec{X}_i \in \mathcal{C}_i} \min_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)

Maximum linkage (complete linkage):

\Delta(\mathcal{C}_i, \mathcal{C}_j) = \max_{\vec{X}_i \in \mathcal{C}_i} \max_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)

Average linkage:

\Delta(\mathcal{C}_i, \mathcal{C}_j) =\frac{1}{|\mathcal{C}_i||\mathcal{C}_j|} \sum_{\vec{X}_i \in \mathcal{C}_i}\sum_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)
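
A direct R translation of these three criteria, on two small groups of iris rows (the indices are arbitrary, chosen for illustration only):

# Minimum, maximum and average linkage between two small clusters
D  <- as.matrix(dist(iris[, 1:4]))
C1 <- 1:5       # five setosa rows
C2 <- 51:55     # five versicolor rows

delta_min <- min(D[C1, C2])     # single linkage
delta_max <- max(D[C1, C2])     # complete linkage
delta_avg <- mean(D[C1, C2])    # average linkage

c(single = delta_min, complete = delta_max, average = delta_avg)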

Ward's criterion : minimum variance/inertia criterion

\Delta(\mathcal{C}_i, \mathcal{C}_j) = \sum_{\vec{X}_i \in \mathcal{C}_i} \left( d^2(\vec{X}_i, \mu_{\mathcal{C}_i \cup \mathcal{C}_j} ) - d^2(\vec{X}_i, \mu_{\mathcal{C}_i}) \right) + \sum_{\vec{X}_j \in \mathcal{C}_j} \left( d^2(\vec{X}_j, \mu_{\mathcal{C}_i \cup \mathcal{C}_j} ) - d^2(\vec{X}_j, \mu_{\mathcal{C}_j}) \right)

If d is the Euclidean distance, then

\Delta(\mathcal{C}_i, \mathcal{C}_j) = \frac{ |\mathcal{C}_i||\mathcal{C}_j|}{|\mathcal{C}_i|+ |\mathcal{C}_j|} d^2(\mu_{\mathcal{C}_i}, \mu_{\mathcal{C}_j})
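
A quick numerical check, on two arbitrary groups of iris rows, that the inertia-increase definition and the closed form above agree when d is the Euclidean distance:

# Two small clusters and their means
A <- as.matrix(iris[1:5, 1:4]); B <- as.matrix(iris[51:55, 1:4])
mA <- colMeans(A); mB <- colMeans(B); mAB <- colMeans(rbind(A, B))

ssq <- function(M, mu) sum(sweep(M, 2, mu)^2)   # sum of squared distances to a center

# increase of the within-cluster sum of squares caused by the merge
delta_inertia <- ssq(rbind(A, B), mAB) - (ssq(A, mA) + ssq(B, mB))

# closed form: weighted squared distance between the two cluster means
nA <- nrow(A); nB <- nrow(B)
delta_closed <- (nA * nB) / (nA + nB) * sum((mA - mB)^2)

all.equal(delta_inertia, delta_closed)   # TRUE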

Lance-Williams update formulae

Suppose that clusters C_{i} and C_{j} were next to be merged

At this point, all of the current pairwise cluster distances are known

The recursive formula gives the updated cluster distances following the pending merge of clusters C_{i} and C_{j}.

Let

  • d_{ij}, d_{ik}, and d_{jk} be shorthands for the pairwise "distances" between clusters C_{i}, C_{j}, and C_{k}
  • d_{(ij)k} be shorthand for the "distance" between the new cluster C_{i}\cup C_{j} and C_{k} ( k\not\in \{i,j\} )

An algorithm belongs to the Lance-Williams family if the updated cluster "distance" d_{{(ij)k}} can be computed recursively by

d_{(ij)k}=\alpha _{i}d_{ik}+\alpha _{j}d_{jk}+\beta d_{ij}+\gamma |d_{ik}-d_{jk}|

where \alpha _{i},\alpha _{j},\beta , and \gamma are parameters, which may depend on cluster sizes, that together with the cluster "distance" function d_{ij} determine the clustering algorithm.

Clustering algorithms such as single linkage, complete linkage, and group average method have a recursive formula of the above type
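
For instance, single linkage corresponds to \alpha_i = \alpha_j = 1/2, \beta = 0, \gamma = -1/2. A small sketch (an illustration, not hclust's internal code) checking this update on three arbitrary groups of iris rows:

D  <- as.matrix(dist(iris[, 1:4]))
Ci <- 1:3; Cj <- 4:6; Ck <- 51:53     # three small clusters (arbitrary rows)

d_ik <- min(D[Ci, Ck]); d_jk <- min(D[Cj, Ck]); d_ij <- min(D[Ci, Cj])

# Lance-Williams update for single linkage ...
d_update <- 0.5 * d_ik + 0.5 * d_jk + 0 * d_ij - 0.5 * abs(d_ik - d_jk)

# ... agrees with the single-linkage distance recomputed from scratch
d_direct <- min(D[c(Ci, Cj), Ck])
all.equal(d_update, d_direct)         # TRUE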

Lance-Williams update formula for Ward's criterion

\begin{array}{rl}d\left(C_i \cup C_j, C_k\right) & = \frac{n_i+n_k}{n_i+n_j+n_k}d\left(C_i, C_k\right) +\frac{n_j+n_k}{n_i+n_j+n_k}d\left(C_j, C_k\right) \\ & \phantom{==}- \frac{n_k}{n_i+n_j+n_k} d\left(C_i, C_j\right)\end{array}

\alpha_i = \frac{n_i+n_k}{n_i+n_j+n_k} \qquad \alpha_j = \frac{n_j+n_k}{n_i+n_j+n_k}\qquad \beta = \frac{- n_k}{n_i+n_j+n_k}

An unfair quotation

Ward's minimum variance criterion minimizes the total within-cluster variance Wikipedia

  • Is that correct?

  • If corrected, what does it mean?

If we understand the statement as:

for any k, the flat clustering obtained by cutting the dendrogram so as to obtain a k-clusters partition minimizes the total within-cluster variance/inertia amongst all k-clusterings

then the statement is not proved: if it were true, it would imply \mathsf{P}=\mathsf{NP}

The total within-cluster variance/inertia is the objective function in the k-means problem.

The statement is misleading

What happens in Ward's method?

At each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging Wikipedia

This increase is a weighted squared distance between cluster centers Wikipedia

At the initial step, all clusters are singletons (clusters containing a single point). To apply a recursive algorithm under this objective function, the initial distance between individual objects must be (proportional to) squared Euclidean distance.

Views on Inertia:

I = \frac{1}{n} \sum_{i=1}^n \|\vec{X}_i - \vec{m} \|^2

where \vec{m} = \sum_{i=1}^n \frac{1}{n}\vec{X}_i

I = \frac{1}{2n^2} \sum_{i,j} \|\vec{X}_i - \vec{X}_j\|^2

Twice the mean squared distance to the mean equals the mean squared distance between sample points

Recall that for a real random variable Z with mean \mu

\operatorname{var}(Z) = \mathbb{E}(Z - \mu)^2 = \inf_a \mathbb{E}(Z - a)^2

and

\operatorname{var}(Z) = \frac{1}{2} \mathbb{E}(Z -Z')^2

where Z' is an independent copy of Z

The different formulae for inertia just mirror the different views of variance

The inertia is the trace of an empirical covariance matrix.
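
A numerical check of these three views of inertia on the numeric columns of iris (the covariance matrix is renormalized by 1/n to match the formulas above):

X <- as.matrix(iris[, 1:4])
n <- nrow(X)
m <- colMeans(X)

I1 <- sum(sweep(X, 2, m)^2) / n                  # mean squared distance to the mean
I2 <- sum(as.matrix(dist(X))^2) / (2 * n^2)      # half the mean squared pairwise distance
I3 <- sum(diag(cov(X) * (n - 1) / n))            # trace of the 1/n empirical covariance

c(I1, I2, I3)                                    # three (numerically) equal values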

Decompositions of inertia (Huyghens formula)

  • Sample X_1,\ldots, X_{n+m} with mean \bar{X}_{n+m} and variance V

  • Partition \{1,\ldots,n+m\} = A \cup B with |A| = n, |B| = m, A \cap B = \emptyset

  • Let \bar{X}_n = \frac{1}{n}\sum_{i \in A} X_i and \bar{X}_m = \frac{1}{m}\sum_{i \in B} X_i, so that \bar{X}_{n+m} = \frac{n}{n+m} \bar{X}_{n} + \frac{m}{n+m} \bar{X}_{m}

  • Let V_A be the variance of (X_i)_{i\in A} and V_B be the variance of (X_i)_{i\in B}

Decompositions of inertia (Huyghens formula)

  • Let V_{\text{between}} be the variance of a ghost sample made of n copies of \bar{X}_n and m copies of \bar{X}_m: V_{\text{between}} = \frac{n}{n+m} (\bar{X}_n - \bar{X}_{n+m})^2 + \frac{m}{n+m} (\bar{X}_m - \bar{X}_{n+m})^2

  • Let V_{\text{within}} be the weighted mean of the variances within classes A and B: V_{\text{within}} = \frac{n}{n+m} V_A + \frac{m}{n+m} V_B

Decompositions of inertia

Proposition: Huyghens formula I

V = V_{\text{within}} + V_{\text{between}}

Huyghens formula can be extended to any number of classes

Proposition: Huyghens (II)

  • Sample \vec{x}_1, \ldots,\vec{x}_n from \mathbb{R}^p with mean \bar{X}_n, inertia I.

  • Let A_1, A_2\ldots, A_k be a partition of \{1,\ldots,n\}.

  • Let I_\ell (resp. \bar{X}^\ell) be the inertia (resp. the mean) of sub-sample \vec{x}_i, i\in A_\ell

  • Let I_{\text{between}} be the inertia of the ghost sample, formed by |A_1| copies of \bar{X}^1, |A_2| copies of \bar{X}^2, ... |A_k| copies of \bar{X}^k

  • Let I_{\text{within}} = \sum_{\ell=1}^k \frac{|A_\ell|}{n} I_\ell

I = I_{\text{within}} + I_{\text{between}}
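
A numerical check of Huyghens' decomposition on iris, taking the three species as the classes A_1, A_2, A_3 (helper names are ours):

X <- as.matrix(iris[, 1:4])
n <- nrow(X)
inertia <- function(M) sum(sweep(M, 2, colMeans(M))^2) / nrow(M)

classes <- split(as.data.frame(X), iris$Species)
weights <- sapply(classes, nrow) / n

I_total  <- inertia(X)
I_within <- sum(weights * sapply(classes, function(A) inertia(as.matrix(A))))

# ghost sample: each point is replaced by the mean of its class
ghost <- t(sapply(as.character(iris$Species),
                  function(s) colMeans(as.matrix(classes[[s]]))))
I_between <- inertia(ghost)

all.equal(I_total, I_within + I_between)   # TRUE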

Comparing dendrograms

Cophenetic dissimilarity

Given a dendrogram, the cophenetic dissimilarity between two sample points x and x' is the intergroup dissimilarity at which observations x and x' are first joined.

Proposition

A cophenetic dissimilarity has the ultrametric property

All triangles are isosceles, and the unequal side is no longer than the two equal sides
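
This can be checked numerically with stats::cophenetic(), here on iris_hclust from the earlier slides (the triple-sampling check is ours):

coph <- as.matrix(cophenetic(iris_hclust))

# ultrametric inequality d(x, z) <= max(d(x, y), d(y, z)) on random triples
set.seed(1)
triples <- replicate(1000, sample(nrow(coph), 3))
ok <- apply(triples, 2, function(t) {
  coph[t[1], t[3]] <= max(coph[t[1], t[2]], coph[t[2], t[3]]) + 1e-12
})
all(ok)   # TRUE: the inequality holds on every sampled triple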

Cophenetic correlation coefficient

The cophenetic correlation coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points wikipedia

Computing cophenetic correlation coefficient

In R, use the dendextend package

dd <- select(iris, where(is.numeric)) %>% dist()

methods <- c("single", "complete", "average", "mcquitty",
             "ward.D", "centroid", "median", "ward.D2")

cor_coph <- purrr::map_dbl(methods,
                           ~ cor_cophenetic(hclust(dd, method = .), dd))
names(cor_coph) <- methods

as_tibble_row(cor_coph) %>% knitr::kable(digits = 2)
single  complete  average  mcquitty  ward.D  centroid  median  ward.D2
  0.86      0.73     0.88      0.87    0.86      0.87    0.86     0.87

Packages

The End

