Dendrogram: what is it?
From dendrogram to clusterings: cutting a dendrogram
Grow a dendrogram
Displaying a dendrogram
Validation and assessment
Hierarchical clustering [...] is a method of cluster analysis which seeks to build a hierarchy of clusters
Wikipedia
Recall that a clustering is a partition of some dataset
A partition D of X is a refinement of another partition D′ if every class in D is a subset of a class in D′. Partitions D and D′ are said to be nested
A hierarchical clustering of \mathcal{X} is a sequence of |\mathcal{X}|-1 nested partitions of \mathcal{X}, starting from the trivial partition into |\mathcal{X}| singletons and ending with the trivial partition into a single subset (\mathcal{X} itself)
A hierarchical clustering consists of |\mathcal{X}|-1 nested flat clusterings
We will explore agglomerative or bottom-up methods to build hierarchical clusterings
The result of hierarchical clustering is a tree whose leaves are labelled by sample points and whose internal nodes correspond to merging operations
The tree conveys more information: if the tree is properly decorated, it is possible to reconstruct the different merging steps and to know which rule was applied when some merging operation was performed
The tree is called a dendrogram
Dendrograms and trees show up in several areas.
Classification and Regression trees play an important role in Machine Learning.
ggdendro and dendextend may also be used to manipulate regression trees
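As a hedged side note, here is a minimal sketch of that use: dendro_data() also has methods for rpart trees (the formula and variable choices below are only an illustration, not taken from the slides).
library(rpart)
library(ggdendro)
# Fit a small classification tree, then extract its plot data with ggdendro
fit <- rpart(Species ~ ., data = iris)
tree_data <- dendro_data(fit)      # same accessor functions as for hclust objects
head(segment(tree_data))           # line segments that ggdendrogram() would draw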
The famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
[From ?iris]
The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning.
The Setosa species are distinctly different from Versicolor and Virginica (they have lower petal length and width). But Versicolor and Virginica cannot easily be separated based on measurements of their sepal and petal width/length.
[From the dendextend package]
library(tidyverse)   # select(), str_c(), %>%, ggplot2
library(ggdendro)    # dendro_data(), ggdendrogram(), theme_dendro()

iris_num <- select(iris, where(is.numeric))
rownames(iris_num) <- str_c(iris$Species, 1:150, sep = "_")
iris_num %>% dist() %>% hclust() -> iris_hclust
iris_dendro <- dendro_data(iris_hclust)
ggdendrogram(iris_dendro, leaf_labels = TRUE, rotate = TRUE) +
  ggtitle("Dendrogram for Iris") +
  theme_dendro() +
# coord_flip() +
  scale_y_reverse(expand = c(0.2, 0))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
as.matrix(dist(iris[,1:4])) returns the matrix of pairwise distances
broom::augment has no method for objects of class hclust (??augment.hclust turns up nothing)
Does the flat clustering obtained by cutting the dendrogram at some height reflect the partition into species?
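One quick, hedged way to check (a minimal sketch reusing iris_hclust from the chunk above; cutree() is the stats function that cuts an hclust tree):
# Cut the dendrogram into k = 3 clusters and cross-tabulate with the species
clusters_k3 <- cutree(iris_hclust, k = 3)
table(cluster = clusters_k3, species = iris$Species)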
The dendextend package offers a set of functions for extending dendrogram objects in R, letting you visualize and compare trees of hierarchical clusterings
require(dendextend)
require(colorspace)
species_col <- rev(rainbow_hcl(3))[as.numeric(iris$Species)]
dend <- as.dendrogram(iris_hclust)
# order it the closest we can to the order of the observations:
dend <- rotate(dend, 1:150)
# Color the branches based on the clusters:
dend <- color_branches(dend, k = 3) #, groupLabels=iris_species)
# Manually match the labels, as much as possible, to the real classification of the flowers:
labels_colors(dend) <-
  rainbow_hcl(3)[sort_levels_values(
    as.numeric(iris[, 5])[order.dendrogram(dend)]
  )]
# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height = 0.1)
# reduce the size of the labels:
dend <- assign_values_to_leaves_nodePar(dend, 0.5, "lab.cex")
# dend <- set(dend, "labels_cex", 0.5)
# And plot:
par(mar = c(3, 3, 3, 7))
plot(dend,
     main = "Clustered Iris data set (the labels give the true flower species)",
     horiz = TRUE, nodePar = list(cex = .007))
What is the class of dend?
What does height stand for?
What does merge stand for?
What does order stand for?
How different are the results for the different method choices?
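A hedged sketch for exploring these questions (it assumes iris_hclust and dend from the previous chunks; the comments state what the hclust components contain):
class(dend)                      # "dendrogram"
str(iris_hclust, max.level = 1)  # an hclust object is a plain list
head(iris_hclust$merge)          # merge history: negative entries are observations, positive ones earlier merges
head(iris_hclust$height)         # dissimilarity value at which each merge happens
head(iris_hclust$order)          # leaf permutation used to plot the tree without crossings
iris_hclust$method               # agglomeration method ("complete" by default)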
USArrests
data("USArrests")USArrests %>% glimpse()
## Rows: 50
## Columns: 4
## $ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.…
## $ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 24…
## $ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 6…
## $ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 2…
hc <- hclust(dist(scale(USArrests[, -3])), "ave")
hcdata <- dendro_data(hc, type = "rectangle")
ggplot() +
  geom_segment(data = segment(hcdata),
               aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = label(hcdata),
            aes(x = x, y = y, label = label, hjust = 0),
            size = 3) +
  coord_flip() +
  scale_y_reverse(expand = c(0.2, 0))
Dendrogram <-> ultrametric dissimilarities
At each stage distances between clusters are recomputed by the Lance-Williams dissimilarity update formula according to the particular clustering method being used.
Identify the 2 closest points and combine them into a cluster (treating existing clusters as points too)
If more than one cluster remains, return to step 1.
Hierarchical agglomerative clustering methods are examples of greedy algorithms
Greedy algorithms sometimes compute optimal solutions
Huffman coding (Information Theory)
Minimum spanning tree (Graph algorithms)
Greedy algorithms sometimes compute sub-optimal solutions
Set cover (NP-hard problem)
...
Efficient greedy algorithms rely on ad hoc data structures
Priority queues
Union-Find
Some HACMs are just implementations of Minimum Spanning Tree algorithms (single linkage, for instance)
Start with (\mathcal{C}_{i}^{(0)}) = (\{ \vec{X}_i \}), the collection of all singletons.
At step s, we have n-s clusters (\mathcal{C}_{i}^{(s)}):
Find the two most similar clusters according to a criterion \Delta: (i,i') = \operatorname{argmin}_{(j,j')} \Delta(\mathcal{C}_{j}^{(s)},\mathcal{C}_{j'}^{(s)})
Merge \mathcal{C}_{i}^{(s)} and \mathcal{C}_{i'}^{(s)} into \mathcal{C}_{i}^{(s+1)}
Keep the n-s-2 other clusters \mathcal{C}_{i''}^{(s+1)} = \mathcal{C}_{i''}^{(s)}
Repeat until there is only one cluster left
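To make the loop above concrete, here is a minimal from-scratch sketch in R (the function name naive_single_linkage is hypothetical, the pair search is a plain quadratic scan, and the criterion \Delta is hard-coded to single linkage; real implementations use the Lance-Williams updates and smarter data structures):
naive_single_linkage <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  clusters <- as.list(seq_len(n))   # start from the n singletons
  merges <- list()
  while (length(clusters) > 1) {
    # find the pair of clusters minimizing Delta (here: single linkage)
    best <- c(NA, NA); best_d <- Inf
    for (i in seq_along(clusters)) {
      for (j in seq_along(clusters)) {
        if (i < j) {
          d_ij <- min(D[clusters[[i]], clusters[[j]]])
          if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
        }
      }
    }
    i <- best[1]; j <- best[2]
    # record the merge, then replace the two clusters by their union
    merges[[length(merges) + 1]] <- list(members = c(clusters[[i]], clusters[[j]]),
                                         height  = best_d)
    clusters[[i]] <- c(clusters[[i]], clusters[[j]])
    clusters[[j]] <- NULL
  }
  merges
}
# The recorded heights should match hclust(..., method = "single")$height:
# all.equal(sapply(naive_single_linkage(dist(iris[, 1:4])), `[[`, "height"),
#           hclust(dist(iris[, 1:4]), method = "single")$height)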
Single linkage: \Delta(\mathcal{C}_i, \mathcal{C}_j) =\min_{\vec{X}_i \in \mathcal{C}_i} \min_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)
Complete linkage: \Delta(\mathcal{C}_i, \mathcal{C}_j) = \max_{\vec{X}_i \in \mathcal{C}_i} \max_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)
Average linkage: \Delta(\mathcal{C}_i, \mathcal{C}_j) =\frac{1}{|\mathcal{C}_i||\mathcal{C}_j|} \sum_{\vec{X}_i \in \mathcal{C}_i}\sum_{\vec{X}_j \in \mathcal{C}_j} d(\vec{X}_i, \vec{X}_j)
Ward's criterion (increase of the within-cluster sum of squares): \Delta(\mathcal{C}_i, \mathcal{C}_j) = \sum_{\vec{X}_i \in \mathcal{C}_i} \left( d^2(\vec{X}_i, \mu_{\mathcal{C}_i \cup \mathcal{C}_j} ) - d^2(\vec{X}_i, \mu_{\mathcal{C}_i}) \right) + \sum_{\vec{X}_j \in \mathcal{C}_j} \left( d^2(\vec{X}_j, \mu_{\mathcal{C}_i \cup \mathcal{C}_j} ) - d^2(\vec{X}_j, \mu_{\mathcal{C}_j}) \right)
Ward's criterion, equivalently, as a weighted squared distance between cluster means: \Delta(\mathcal{C}_i, \mathcal{C}_j) = \frac{ |\mathcal{C}_i||\mathcal{C}_j|}{|\mathcal{C}_i|+ |\mathcal{C}_j|} d^2(\mu_{\mathcal{C}_i}, \mu_{\mathcal{C}_j})
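A small, hedged illustration of the first three criteria, computed directly from a distance matrix (the clusters Ci and Cj below are arbitrary index sets chosen for the example, not taken from the slides):
D  <- as.matrix(dist(iris[, 1:4]))   # pairwise Euclidean distances
Ci <- 1:5                            # a first, arbitrary cluster of observations
Cj <- 51:55                          # a second one
c(single   = min(D[Ci, Cj]),         # nearest pair across the two clusters
  complete = max(D[Ci, Cj]),         # farthest pair
  average  = mean(D[Ci, Cj]))        # mean over all cross-pairs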
Suppose that clusters C_{i} and C_{j} are next to be merged
At this point, all of the current pairwise cluster distances are known
The recursive formula gives the updated cluster distances following the pending merge of clusters C_{i} and C_{j}.
Let d_{ij}, d_{ik}, d_{jk} denote the current pairwise distances between clusters C_i, C_j, C_k, and let d_{(ij)k} denote the distance between the new cluster C_i \cup C_j and C_k
An algorithm belongs to the Lance-Williams family if the updated cluster "distance" d_{(ij)k} can be computed recursively by
d_{(ij)k}=\alpha _{i}d_{ik}+\alpha _{j}d_{jk}+\beta d_{ij}+\gamma |d_{ik}-d_{jk}|
where \alpha _{i},\alpha _{j},\beta , and \gamma are parameters, which may depend on cluster sizes, that together with the cluster "distance" function d_{ij} determine the clustering algorithm.
Clustering algorithms such as single linkage, complete linkage, and group average method have a recursive formula of the above type
For Ward's method, the Lance-Williams update reads
\begin{array}{rl}d\left(C_i \cup C_j, C_k\right) & = \frac{n_i+n_k}{n_i+n_j+n_k}d\left(C_i, C_k\right) +\frac{n_j+n_k}{n_i+n_j+n_k}d\left(C_j, C_k\right) \\ & \phantom{==}- \frac{n_k}{n_i+n_j+n_k} d\left(C_i, C_j\right)\end{array}
\alpha_i = \frac{n_i+n_k}{n_i+n_j+n_k} \qquad \alpha_j = \frac{n_j+n_k}{n_i+n_j+n_k}\qquad \beta = \frac{- n_k}{n_i+n_j+n_k}
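As a hedged sketch (the helper name lw_update_ward is hypothetical), one Lance-Williams update with the Ward coefficients above looks like this in R:
# One Lance-Williams update with Ward's coefficients (gamma = 0).
# d_ik, d_jk, d_ij: current cluster "distances"; n_i, n_j, n_k: cluster sizes.
lw_update_ward <- function(d_ik, d_jk, d_ij, n_i, n_j, n_k) {
  n_tot   <- n_i + n_j + n_k
  alpha_i <- (n_i + n_k) / n_tot
  alpha_j <- (n_j + n_k) / n_tot
  beta    <- -n_k / n_tot
  gamma   <- 0
  alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)
}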
If we understand the claim "Ward's method minimizes the total within-cluster variance" as:
for any k, the flat clustering obtained by cutting the dendrogram so as to obtain a k-clusters partition minimizes the total within-cluster variance/inertia amongst all k-clusterings
then the statement is unproven, and should not be expected to hold: Ward's greedy procedure runs in polynomial time, so if it always delivered an optimal k-clustering it would solve the k-means problem and imply \mathsf{P}=\mathsf{NP}
The total within-cluster variance/inertia is the objective function in the k-means problem.
The statement is misleading
At each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging Wikipedia
This increase is a weighted squared distance between cluster centers Wikipedia
At the initial step, all clusters are singletons (clusters containing a single point). To apply a recursive algorithm under this objective function, the initial distance between individual objects must be (proportional to) squared Euclidean distance.
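A hedged check of the last point with stats::hclust (it relies on the relationship documented by Murtagh and Legendre (2014): ward.D applied to squared distances should agree with ward.D2 applied to the distances themselves, with squared heights):
d  <- dist(scale(USArrests))
h1 <- hclust(d^2, method = "ward.D")    # Ward on squared Euclidean distances
h2 <- hclust(d,   method = "ward.D2")   # Ward's criterion on the distances themselves
identical(h1$merge, h2$merge)           # expected TRUE: same merge structure
all.equal(h1$height, h2$height^2)       # expected TRUE: heights related by squaring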
Recall that for a real random variable Z with mean \mu
\operatorname{var}(Z) = \mathbb{E}(Z -\mu)^2 = \inf_a \mathbb{E}(Z -a)^2
and
\operatorname{var}(Z) = \frac{1}{2} \mathbb{E}(Z -Z')^2
where Z' is an independent copy of Z
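A quick Monte Carlo sanity check of this identity (the distribution and sample size are arbitrary choices for the illustration):
set.seed(42)
z      <- rnorm(1e5, mean = 2, sd = 3)
z_copy <- rnorm(1e5, mean = 2, sd = 3)   # an independent copy of Z
c(var_z = var(z),
  half_mean_sq_diff = 0.5 * mean((z - z_copy)^2))   # both close to 9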
The different formulae for inertia just mirror these different views of the variance
The inertia is the trace of an empirical covariance matrix.
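A hedged numerical check on iris (the inertia here is normalized by 1/n, and the trace of cov() is rescaled accordingly):
X <- as.matrix(iris[, 1:4])
n <- nrow(X)
centroid <- colMeans(X)
inertia_direct <- mean(rowSums(sweep(X, 2, centroid)^2))   # mean squared distance to the centroid
inertia_trace  <- sum(diag(cov(X) * (n - 1) / n))          # trace of the 1/n-normalized covariance
all.equal(inertia_direct, inertia_trace)                   # TRUE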
Sample X_1,\ldots, X_{n+m} with mean \bar{X}_{n+m} and variance V
Partition \{1,\ldots,n+m\} = A \cup B with |A|=n, |B|=m, A \cap B =\emptyset
Let \bar{X}_n = \frac{1}{n}\sum_{i \in A} X_i and \bar{X}_m=\frac{1}{m}\sum_{i \in B}X_i, so that \bar{X}_{n+m} = \frac{n}{n+m} \bar{X}_{n} +\frac{m}{n+m} \bar{X}_{m}
Let V_A be the variance of (X_i)_{i\in A}, V_B be the variance of (X_i)_{i\in B}
Let V_{\text{between}} be the variance of a ghost sample with n copies of \bar{X}_n and m copies of \bar{X}_m: V_{\text{between}} = \frac{n}{n+m} (\bar{X}_n -\bar{X}_{n+m})^2 + \frac{m}{n+m} (\bar{X}_m -\bar{X}_{n+m})^2
Let V_{\text{within}} be the weighted mean of variances within classes A and B: V_{\text{within}} = \frac{n}{n+m} V_A + \frac{m}{n+m} V_B
Huygens' formula states that V = V_{\text{within}} + V_{\text{between}}; it extends to any number of classes
Sample \vec{x}_1, \ldots,\vec{x}_n from \mathbb{R}^p with mean \bar{X}_n, inertia I.
Let A_1, A_2\ldots, A_k be a partition of \{1,\ldots,n\}.
Let I_\ell (resp. \bar{X}^\ell) be the inertia (resp. the mean) of sub-sample \vec{x}_i, i\in A_\ell
Let I_{\text{between}} be the inertia of the ghost sample, formed by |A_1| copies of \bar{X}^1, |A_2| copies of \bar{X}^2, ... |A_k| copies of \bar{X}^k
Let I_{\text{within}} = \sum_{\ell=1}^k \frac{|A_\ell|}{n} I_\ell
I = I_{\text{within}} + I_{\text{between}}
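A hedged numerical check of this decomposition on iris, using the species as the partition (the inertia() helper is hypothetical and uses the 1/n normalization):
X <- as.matrix(iris[, 1:4])
n <- nrow(X)
inertia <- function(M) mean(rowSums(sweep(M, 2, colMeans(M))^2))
I_total  <- inertia(X)
groups   <- split(as.data.frame(X), iris$Species)
I_within <- sum(sapply(groups, function(g) nrow(g) / n * inertia(as.matrix(g))))
# ghost sample: each observation replaced by its class mean
centroids <- t(sapply(groups, function(g) colMeans(as.matrix(g))))
ghost     <- centroids[rep(seq_along(groups), times = sapply(groups, nrow)), ]
I_between <- inertia(ghost)
all.equal(I_total, I_within + I_between)   # TRUE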
Given a dendrogram, the cophenetic dissimilarity between two sample points x, x' is the intergroup dissimilarity at which observations x and x' are first joined.
A cophenetic dissimilarity has the ultrametric property
All triangles are isosceles, and the odd side is no longer than the two equal sides
The cophenetic correlation coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points (Wikipedia)
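A hedged spot check of the ultrametric inequality d(x, z) \leq \max(d(x, y), d(y, z)) on the cophenetic dissimilarities of iris_hclust (built in an earlier chunk; stats::cophenetic does the extraction):
CP  <- as.matrix(cophenetic(iris_hclust))
idx <- sample(nrow(CP), 3)             # a random triple of observations
x <- idx[1]; y <- idx[2]; z <- idx[3]
CP[x, z] <= max(CP[x, y], CP[y, z])    # TRUE for every triple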
In R, use the dendextend package
dd <- select(iris, where(is.numeric)) %>% dist()
methods <- c("single", "complete", "average", "mcquitty",
             "ward.D", "centroid", "median", "ward.D2")
cor_coph <- purrr::map_dbl(methods, ~ cor_cophenetic(hclust(dd, method = .), dd))
names(cor_coph) <- methods
as_tibble_row(cor_coph) %>% knitr::kable(digits = 2)
single | complete | average | mcquitty | ward.D | centroid | median | ward.D2 |
---|---|---|---|---|---|---|---|
0.86 | 0.73 | 0.88 | 0.87 | 0.86 | 0.87 | 0.86 | 0.87 |