9  Clustering

9.1 Lecture Slides

View slides in full screen

9.2 Lab

9.2.1 Are red wines of the Langhe region different?

The data consists of 177 samples of red wine produced in the municipalities of (and bordering) Alba, Morra and Barolo. In fact, the true labels are known, because the variable vintages identifies the grape variety. Let us suppose that we don’t know the variety. Our goal is to study which groups get “detected” by the clustering algorithms.

The 13 organoleptic characteristics available in the dataset are:

  • alcohol percentage of alcohol in the wine
  • malic.acid concentration of malic acid
  • ash mineral content remaining after burning the sample
  • ash.alkalinity alkalinity level of the ash
  • magnesium magnesium content
  • tot..phenols amount of phenolic compounds
  • flavonoids a specific class of phenolic compounds affecting taste and color
  • non.flav..phenols phenols that are not flavonoids
  • proanth compounds contributing to bitterness and astringency
  • col..int. strength of wine color
  • col..hue shade of the wine color
  • OD.ratio ratio measuring phenolic concentration using spectrophotometry
  • proline amino acid concentration related to grape ripeness

Load the data and glimpse at it

library(dplyr)
library(ggplot2)
library(factoextra)
library(ade4)
library(pheatmap)
library(RColorBrewer)
library(GGally)

load("slides/data/vini.rda")
glimpse(vini)
Rows: 177
Columns: 14
$ vintages          <fct> Barolo, Barolo, Barolo, Barolo, Barolo, Barolo, Baro…
$ alcohol           <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14.06, 14.…
$ malic.acid        <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1.64, 1.35…
$ ash               <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2.17, 2.27…
$ ash.alkalinity    <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 14.0, 16.0…
$ magnesium         <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, 105, 95, 8…
$ tot..phenols      <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2.80, 2.98…
$ flavonoids        <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2.98, 3.15…
$ non.flav..phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0.29, 0.22…
$ proanth           <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1.98, 1.85…
$ col..int.         <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5.20, 7.22…
$ col..hue          <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1.08, 1.01…
$ OD.ratio          <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2.85, 3.55…
$ proline           <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 1045, 1045,…

9.2.1.1 Data wrangling and exploration

Display the matrix of all correlations between the measurements in a heatmap. What do you notice?

pheatmap(cor(vini[,-1]), cellwidth = 10, cellheight = 10)

Produce all pairwise scatterplots for the vini data (except vintages)

ggpairs(vini[, -1], axisLabels = "none")

Exercise

Does the plot tell you anything?

Try to plot the data in the principal component space

pca_vini <- prcomp(vini[,-1], scale. = TRUE)
print(summary(pca_vini))
Importance of components:
                          PC1    PC2    PC3     PC4     PC5    PC6     PC7
Standard deviation     2.1628 1.5816 1.2055 0.96148 0.92830 0.8030 0.74295
Proportion of Variance 0.3598 0.1924 0.1118 0.07111 0.06629 0.0496 0.04246
Cumulative Proportion  0.3598 0.5522 0.6640 0.73515 0.80144 0.8510 0.89350
                           PC8     PC9    PC10    PC11    PC12   PC13
Standard deviation     0.59223 0.53775 0.49680 0.47481 0.41034 0.3224
Proportion of Variance 0.02698 0.02224 0.01899 0.01734 0.01295 0.0080
Cumulative Proportion  0.92048 0.94272 0.96171 0.97905 0.99200 1.0000
fviz_pca_biplot(pca_vini, label = "var") +
  ggtitle("")

Exercise

What do you notice from the biplot? How many groups should we choose?

Now you can check your answer with the true grape variety

my_colors <- brewer.pal(n = 3, name = "Set2")
fviz_pca_biplot(pca_vini, 
                label = "var", 
                habillage = vini$vintages) +
  scale_color_manual(values = my_colors)+
  ggtitle("")

9.2.1.2 Clustering

To avoid the final outcome depending on the measurement scales of the data, distances should be calculated on “standardized variables”.

scaledvini <- scale(vini[,-1])
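As a minimal sanity check of what scale() does, here is a sketch on a toy matrix (not the wine data): each column is centered to mean zero and rescaled to unit standard deviation, so no variable dominates the Euclidean distances simply because of its units.

```r
# Toy check: after scale(), every column has mean ~0 and sd 1
x <- matrix(c(1, 2, 3, 100, 200, 300), ncol = 2)
xs <- scale(x)
round(colMeans(xs), 10)  # 0 0
apply(xs, 2, sd)         # 1 1
```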

Compute the Euclidean distance

dist_E <- dist(scaledvini)
9.2.1.2.1 Hierarchical clustering

Perform hierarchical clustering on the scaled data using Single linkage, Complete linkage, and Ward method. What do you notice?

hclust(dist_E, method = "single") |> plot(main = "Single linkage", cex=0.5)

hclust(dist_E, method = "complete") |> plot(main = "Complete linkage", cex=0.5)

hc_W <- hclust(dist_E, method = "ward.D2") 
plot(hc_W ,main = "Ward method", cex=0.5)

Exercise
  1. Which method should we choose?
  2. How many groups should we choose?

Decision rule: choose the method that:

  • Shows largest vertical gaps before merging
  • Produces compact, balanced clusters
  • Has a clear number of clusters
  • Matches your real-world understanding
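One quantitative version of this decision rule is to compare average silhouette widths across candidate numbers of clusters. A sketch, using the built-in USArrests data as a stand-in for any scaled data matrix:

```r
library(cluster)

x  <- scale(USArrests)                 # built-in data, standardized
d  <- dist(x)
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for k = 2, ..., 6 clusters
avg_sil <- sapply(2:6, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil   # pick the k with the largest average silhouette width
```

The same loop applied to scaledvini and hc_W gives a numeric counterpart to eyeballing the dendrogram.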

From the dendrograms, we can see that the Ward method has a large jump before the clusters merge and suggests a clear number of clusters. Moreover, the true labels are known, so we can check whether hierarchical clustering discovers the true clusters, i.e., the variable vintages. We add colors by using the myplclust function from the rafalib package.

library(rafalib)
vintages <- as.character(vini$vintages)
myplclust(hc_W, labels=vintages, lab.col=as.fumeric(vintages), cex=0.5)

It does seem that hierarchical clustering using the Ward method has discovered the variety.

Hierarchical clustering does not directly define specific clusters. Instead, the dendrogram shows the distance between groups: the height at which two branches split indicates how dissimilar those groups are. To obtain clusters, we need to cut the dendrogram at a chosen distance threshold, and all samples connected below that level are grouped into the same cluster. For example, if we cut the dendrogram at 27 we have three clusters, and if we cut it at 12 we have five clusters.

myplclust(hc_W, labels=vintages, lab.col=as.fumeric(vintages), cex=0.5)
abline(h=27, col="red")
abline(h=12, col="orange")

In this example, the dendrogram clearly suggests three clusters because there is a clear separation before and after the clusters merge. We can also examine how the clusters overlap with the actual vintages:

hclusters <- cutree(hc_W, h=27)
table(true=vintages, cluster=hclusters)
            cluster
true          1  2  3
  Barbera     0  0 48
  Barolo     58  0  0
  Grignolino  5 58  8

We can also cut the dendrogram to obtain a specific number of clusters by using the k argument of the cutree function.

hclusters <- cutree(hc_W, k=3)
table(hclusters)
hclusters
 1  2  3 
63 58 56 
Exercise

Does this result match the actual vintages?

table(hclusters,vini$vintages) |> fisher.test()

    Fisher's Exact Test for Count Data

data:  table(hclusters, vini$vintages)
p-value < 2.2e-16
alternative hypothesis: two.sided

Calculate the Adjusted Rand Index (ARI)

library(mclust)
adjustedRandIndex(vini$vintages, hclusters)
[1] 0.7882419
Exercise

What does the Adjusted Rand Index tell you?
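As a quick sanity check of how the ARI behaves (on toy vectors, not the wine data): identical partitions score 1 regardless of the label names used, while two unrelated labelings score close to 0 on average.

```r
library(mclust)

# Same partition under different label names: ARI = 1
adjustedRandIndex(c(1, 1, 2, 2, 3, 3), c("a", "a", "b", "b", "c", "c"))

# Two independent random labelings: ARI close to 0
set.seed(1)
adjustedRandIndex(sample(1:3, 300, replace = TRUE),
                  sample(1:3, 300, replace = TRUE))
```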

If we do not have the true clusters, we can also evaluate the clustering result by calculating the silhouette width.

library(cluster)
sil <- silhouette(hclusters, dist_E)
plot(sil, col = my_colors, border = NA)

Exercise

What does the silhouette width tell you?
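For intuition, a sketch on simulated data: silhouette widths lie in [-1, 1]. Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies on a boundary between clusters, and negative values suggest it may be assigned to the wrong cluster.

```r
library(cluster)

set.seed(10)
# Two well-separated 2-d blobs of 25 points each
x  <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
            matrix(rnorm(50, mean = 5), ncol = 2))
cl <- rep(1:2, each = 25)

sil <- silhouette(cl, dist(x))
summary(sil)$avg.width   # high, since the blobs barely overlap
```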

fviz_pca_biplot(pca_vini, 
                label = "var", 
                habillage = hclusters) +
  scale_color_manual(values = my_colors)+
  ggtitle("")

Exercise

How do these groups differ from each other?

9.2.1.2.2 K-means

Perform the k-means algorithm

set.seed(123)
# Note: kmeans() coerces the dist object to the full distance matrix, so
# samples are clustered by their profiles of distances to all other samples;
# k-means is more commonly run on the scaled data matrix itself.
km <- kmeans(dist_E, centers = 3, nstart = 10)

Calculate the Adjusted Rand Index and the silhouette width. Do they tell you the same story as Hierarchical clustering?

adjustedRandIndex(vini$vintages, km$cluster)
[1] 0.7679186
table(vini$vintages, km$cluster) |> fisher.test()

    Fisher's Exact Test for Count Data

data:  table(vini$vintages, km$cluster)
p-value < 2.2e-16
alternative hypothesis: two.sided
adjustedRandIndex(hclusters, km$cluster)
[1] 0.7648521
table(hclusters, km$cluster) |> fisher.test()

    Fisher's Exact Test for Count Data

data:  table(hclusters, km$cluster)
p-value < 2.2e-16
alternative hypothesis: two.sided
sil <- silhouette(km$cluster, dist_E)
plot(sil, col = my_colors, border = NA)
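A note on nstart, illustrated on simulated data: k-means depends on the random initial centers, and nstart reruns the algorithm that many times, keeping the solution with the smallest total within-cluster sum of squares.

```r
set.seed(42)
# Two well-separated 2-d blobs of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

km1  <- kmeans(x, centers = 2, nstart = 1)   # single random start
km25 <- kmeans(x, centers = 2, nstart = 25)  # best of 25 starts

# The multi-start solution is at least as good here
c(km1$tot.withinss, km25$tot.withinss)
```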

9.2.2 RNA gene expression¹

The data represent RNA expression measurements for 22,215 genes from 189 samples. Each sample is labeled with a tissue of origin.

Tissue Description
kidney Renal tissue, involved in filtering blood and urine production
hippocampus Brain region important for memory formation
cerebellum Brain region controlling coordination and balance
colon Part of the large intestine
liver Metabolic organ; detoxification, protein synthesis, bile production
endometrium Lining of the uterus
placenta Fetal-maternal organ during pregnancy

The data is in the tissuesGeneExpression package available on GitHub. To install it we use the command

library(devtools)
install_github("genomicsclass/tissuesGeneExpression")

Load the data and glimpse at it

library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e)
[1] 22215   189
table(tissue)
tissue
 cerebellum       colon endometrium hippocampus      kidney       liver 
         38          34          15          31          39          26 
   placenta 
          6 

Plot the heatmap of the 50 most variable genes, i.e., the 50 genes with the highest variance across all samples, using the heatmap.2 function from the gplots package.

# select the top 50 most variable genes
library(genefilter)
rv <- rowVars(e)
idx <- order(rv,decreasing = TRUE)
exp_50 <- e[idx[1:50],]
# generate color for heatmap
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
library(gplots)
heatmap.2(exp_50, labCol=tissue,
          trace="none",
          ColSideColors=as.character(as.fumeric(tissue)+1),
          col=hmcol)

We did not use the tissue information to create this heatmap, yet with just the 50 most variable genes we obtain a good separation across tissues.

Calculate the range of the mean and standard deviation of the data

range(apply(e,1,mean))
[1]  4.328881 14.036989
range(apply(e,1,sd))
[1] 0.05987416 3.30351280
Exercise

What do you notice?

9.2.2.1 Clustering

Compute the Euclidean distance

dist_E <- dist(t(e))
9.2.2.1.1 Hierarchical clustering

Perform hierarchical clustering using Single linkage, Complete linkage, and Ward method. What do you notice?

hclust(dist_E, method = "single") |> plot(main = "Single linkage", cex=0.5)

hclust(dist_E, method = "complete") |> plot(main = "Complete linkage", cex=0.5)

hc_W <- hclust(dist_E, method = "ward.D2") 
plot(hc_W ,main = "Ward method", cex=0.5)

Exercise
  1. Which method should we choose?
  2. How many groups should we choose?
myplclust(hc_W, labels=tissue, lab.col=as.fumeric(tissue), cex=0.5)

It does seem that hierarchical clustering using the Ward method has discovered the tissues.

Now, let us cut the dendrogram at a chosen height. We use 350 as an example:

myplclust(hc_W, labels=tissue, lab.col=as.fumeric(tissue), cex=0.5)
abline(h=350, col="red")

hclusters <- cutree(hc_W, h=350)
table(true=tissue, cluster=hclusters)
             cluster
true           1  2  3  4  5  6  7  8
  cerebellum   0  5 31  0  0  2  0  0
  colon        0  0  0 34  0  0  0  0
  endometrium  0  0  0  0  0  0 15  0
  hippocampus  0 31  0  0  0  0  0  0
  kidney      37  0  0  0  0  2  0  0
  liver        0  0  0  0 24  2  0  0
  placenta     0  0  0  0  0  0  0  6
Exercise

Does this result match the actual tissue?

Most tissues are clearly associated with a single cluster; however, in some cases, a tissue is split across two clusters. This is typically a consequence of overestimating the number of clusters. Selecting the right number of clusters remains a difficult and actively studied problem.
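One principled option for estimating the number of clusters is the gap statistic, implemented in cluster::clusGap. A sketch on simulated data rather than the expression matrix (running it on 22,215 genes would be slow):

```r
library(cluster)

set.seed(2)
# Three well-separated 2-d blobs of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

gap <- clusGap(x, FUNcluster = kmeans, nstart = 10, K.max = 6, B = 50)

# Suggested number of clusters by the "first SE max" rule
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
```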

Calculate the Adjusted Rand Index (ARI)

adjustedRandIndex(tissue, hclusters)
[1] 0.8913948
Exercise

What does the Adjusted Rand Index tell you?

Evaluate the clustering result by calculating the silhouette width

my_colors <- brewer.pal(n = 8, name = "Set2")
sil <- silhouette(hclusters, dist_E)
plot(sil, col = my_colors, border = NA)

Exercise

What does the silhouette width tell you?

9.2.2.1.2 K-means

Perform the k-means algorithm

set.seed(123)
# Note: as before, kmeans() coerces the dist object to the full distance
# matrix, so samples are clustered by their distance profiles.
km <- kmeans(dist_E, centers = 7, nstart = 10)

Calculate the Adjusted Rand Index. Does it tell you the same story as hierarchical clustering?

adjustedRandIndex(tissue, km$cluster)
[1] 0.8747183
adjustedRandIndex(hclusters, km$cluster)
[1] 0.9674547
9.2.2.1.3 Repeat everything on the subset with the 50 most variable genes. What do you notice? Do these results agree with the results on the full data?
dist_E50 <- dist(t(exp_50))

# Hierarchical clustering
hc_W50 <- hclust(dist_E50, method = "ward.D2") 
myplclust(hc_W50, labels=tissue, lab.col=as.fumeric(tissue), cex=0.5)

hclusters50 <- cutree(hc_W50, k=8)
table(true=tissue, cluster=hclusters50)
             cluster
true           1  2  3  4  5  6  7  8
  cerebellum   0  0  0 33  0  0  0  5
  colon        0  0  0  0 29  5  0  0
  endometrium  0  0  0  0  0 15  0  0
  hippocampus  0  0 31  0  0  0  0  0
  kidney      24 15  0  0  0  0  0  0
  liver        0  0  0  0  0  0 26  0
  placenta     0  0  0  0  0  6  0  0
adjustedRandIndex(tissue, hclusters50)
[1] 0.8100697
adjustedRandIndex(hclusters, hclusters50)
[1] 0.7759082
pca_exp <- prcomp(t(exp_50), scale. = TRUE)
fviz_pca_ind(pca_exp, geom = "point", habillage = tissue) +
  ggtitle("")

9.3 Homework

Exercise 1. Repeat all the analysis that we did for Example 2 on the scaled data. What do you notice? Do the new results agree with the results from Example 2?

Exercise 2. The fiumi.rda dataset includes measurements of chemical concentrations for 37 rivers or creeks of a particular geographical area. The variables available are:

  • NO3: nitrate \(NO_3^-\);
  • SO4: sulfate \(SO_{4}^{2-}\);
  • CL: chloride \(Cl^-\);
  • CA: calcium \(Ca^{2+}\);
  1. Check if the data is already standardized.

  2. Repeat the previous analysis (as we did in Example 1) using both hierarchical clustering and K-means. How many groups are there? Do hierarchical clustering and K-means agree? What are the differences among the groups? Could the differences among the detected groups be due to the different sources of pollution in the catchment areas of the rivers?

9.4 Further reading

  • S. Holmes and W. Huber. Modern Statistics for Modern Biology. Chapter 5.

References


  1. Example 2, about gene expression, comes from Data Analysis for the Life Sciences.↩︎