# Dunn index

Updated 2023-09-02
There are several ways to measure the robustness of a clustering algorithm. Commonly used metrics include the [Dunn index](#dunn-index), the *Davies-Bouldin index*, the *Silhouette index* and the [Calinski-Harabasz index](#calinski-harabasz-index).

But before we start, let's introduce some concepts.

We are interested in clustering algorithms for a dataset $\mathcal{D}$ with $N$ elements in a $p$-dimensional real space, that is:

$$
\mathcal{D} = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^p
$$

The clustering algorithm will create a set $C$ of $K$ distinct, disjoint groups from $\mathcal{D}$, $C=\{c_1, c_2, \ldots, c_K\}$, such that:

$$
\bigcup_{c_k\in C}c_k=\mathcal{D}, \qquad c_k \cap c_l = \emptyset \quad \forall\, k\neq l
$$
Each group (or cluster) $c_k$ will have a *centroid*, $\bar{c}_k$, which is the mean vector of its elements:

$$
\bar{c}_k=\frac{1}{|c_k|}\sum_{x_i \in c_k}x_i
$$

We will also make use of the dataset's mean vector, $\bar{\mathcal{D}}$, defined as:

$$
\bar{\mathcal{D}}=\frac{1}{N}\sum_{x_i \in \mathcal{D}}x_i
$$

## Dunn index

The *Dunn index* aims at quantifying the compactness and separation of the clustering. A cluster is considered *compact* if there is small variance between members of the cluster. This can be calculated using $\Delta(c_k)$, where

$$
\Delta(c_k) = \max_{x_i, x_j \in c_k}{d_e(x_i, x_j)}
$$

and $d_e$ is the [Euclidean distance](https://ruivieira.dev/distance-metrics.html#euclidean-distance-%28l2%29) defined as:

$$
d_e(x_i, x_k)=\sqrt{\sum_{j=1}^p (x_{ij}-x_{kj})^2}.
$$

A clustering is considered *well separated* if the clusters are far apart. This can be quantified using

$$
\delta(c_k, c_l) = \min_{x_i \in c_k}\min_{x_j\in c_l}{d_e(x_i, x_j)}.
$$

Given these quantities, the *Dunn index* for a set of clusters $C$, $DI(C)$, is then defined by:

$$
DI(C)=\frac{\min_{c_k, c_l \in C,\, k \neq l}{\delta(c_k, c_l)}}{\max_{c_k\in C}\Delta(c_k)}
$$

A higher *Dunn index* indicates compact, well-separated clusters, while a lower index indicates less compact or less well-separated ones.

We can now calculate this metric in practice. Let's simulate some data and apply the Dunn index from scratch. First, we will create a compact and well-separated dataset using the [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) method in `scikit-learn`. We will create a dataset of $\mathbb{R}^2$ data (for easier plotting), with three clusters.

```python
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000,
                  centers=3,
                  n_features=2,
                  random_state=23)
```
```python
import pandas as pd
from plotnine import *
from plotnine.data import *
from plotutils import *

# collect the simulated points and their true labels in a DataFrame
data = pd.DataFrame(X, columns=["x1", "x2"])
data["y"] = y
data["y"] = data.y.astype('category')
```

![Simulated dataset with three well-separated clusters](https://ruivieira.dev/Dunn%20index_files/figure-gfm/cell-4-output-1.png)

We now cluster the data and, as expected, we obtain three distinct clusters, plotted below.

```python
from sklearn import cluster

# k-means with the (known) number of clusters
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(data)
y_pred = k_means.predict(data)

prediction = pd.concat([data, pd.DataFrame(y_pred, columns=['pred'])], axis=1)

# split the points by predicted cluster; note that `.values` keeps all
# columns (x1, x2, y and pred), so the distances below are computed
# over these four columns
clus0 = prediction.loc[prediction.pred == 0]
clus1 = prediction.loc[prediction.pred == 1]
clus2 = prediction.loc[prediction.pred == 2]
k_list = [clus0.values, clus1.values, clus2.values]
```
Let's focus now on two of these clusters; let's call them $c_k$ and $c_l$.

```python
ck = k_list[0]
cl = k_list[1]
```

We need to calculate the distance between every point in $c_k$ and every point in $c_l$. Since `len(ck) == len(cl) == 333`, we create a 333×333 matrix to hold the pairwise distances:

```python
import numpy as np

# placeholder matrix for the pairwise distances
values = np.ones([len(ck), len(cl)])
values
```
```
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
```
For each pair of points, we then take the norm of $x_i-x_j$. For instance, for $i=0$ in $c_k$ and $j=1$ in $c_l$, we would have:

```python
values[0, 1] = np.linalg.norm(ck[0] - cl[1])
print(ck[0], cl[1])
print(values[0, 1])
```
```
[-5.37039106 3.47555168 2. 0. ] [ 5.46312794 -3.08938807 1. 1. ]
12.746119711608184
```
The calculation of $\delta(c_k, c_l)$ between two clusters $c_k$ and $c_l$ can then be defined as follows:

```python
def δ(ck, cl):
    # pairwise distances between all points in ck and all points in cl
    values = np.ones([len(ck), len(cl)])
    for i in range(0, len(ck)):
        for j in range(0, len(cl)):
            values[i, j] = np.linalg.norm(ck[i] - cl[j])
    # the separation is the smallest inter-cluster distance
    return np.min(values)
```
So, for our two clusters above, $\delta(c_k, c_l)$ will be:

```python
δ(ck, cl)
```

```
8.13474311744193
```
Within a single cluster $c_k$, we can calculate $\Delta(c_k)$ similarly:

```python
def Δ(ci):
    # pairwise distances between all points within the cluster
    values = np.zeros([len(ci), len(ci)])
    for i in range(0, len(ci)):
        for j in range(0, len(ci)):
            values[i, j] = np.linalg.norm(ci[i] - ci[j])
    # the diameter is the largest intra-cluster distance
    return np.max(values)
```
So, for instance, for our $c_k$ and $c_l$ we would have:

```python
print(Δ(ck))
print(Δ(cl))
```

```
6.726025773561468
6.173844284636552
```
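As a side note, the same two quantities can be computed without explicit Python loops. Here is a minimal vectorised sketch using SciPy's `cdist` and `pdist` (the helper names `δ_fast` and `Δ_fast` are ours, not part of the original code):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def δ_fast(ck, cl):
    # all pairwise Euclidean distances between the two clusters at once
    return cdist(np.asarray(ck, dtype=float), np.asarray(cl, dtype=float)).min()

def Δ_fast(ci):
    # pdist returns the condensed vector of within-cluster pairwise distances
    return pdist(np.asarray(ci, dtype=float)).max()
```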
We can now define the *Dunn index* as

```python
def dunn(k_list):
    # δs[k, l] will hold the separation between clusters k and l
    # (caution: the matrix is initialised with ones and the diagonal is
    # never overwritten, so it can become the minimum whenever all true
    # separations exceed 1; initialising with np.inf would avoid this)
    δs = np.ones([len(k_list), len(k_list)])
    # Δs[k] will hold the diameter of cluster k
    Δs = np.zeros([len(k_list), 1])
    l_range = list(range(0, len(k_list)))
    for k in l_range:
        for l in (l_range[0:k] + l_range[k+1:]):
            δs[k, l] = δ(k_list[k], k_list[l])
        Δs[k] = Δ(k_list[k])
    di = np.min(δs) / np.max(Δs)
    return di
```
and calculate the *Dunn index* for our clustered values list as

```python
dunn(k_list)
```

```
0.14867620697065728
```
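An equivalently compact version can be sketched with the vectorised helpers above (again our own sketch rather than the post's code); taking the minimum only over distinct cluster pairs also sidesteps the initialisation caveat flagged in the comment:

```python
def dunn_fast(k_list):
    # smallest separation over all distinct cluster pairs
    δ_min = min(δ_fast(k_list[k], k_list[l])
                for k in range(len(k_list))
                for l in range(len(k_list)) if k != l)
    # largest diameter over the clusters
    Δ_max = max(Δ_fast(c) for c in k_list)
    return δ_min / Δ_max
```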
Intuitively, we can expect a dataset with less well-defined clusters to have a lower *Dunn index*. Let's try it. We first generate a new, noisier dataset.

```python
# generate three heavily overlapping blobs
X2, y2 = make_blobs(n_samples=1000,
                    centers=3,
                    n_features=2,
                    cluster_std=10.0,
                    random_state=24)

df = pd.DataFrame(X2, columns=['A', 'B'])

# K-means training
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df)
y_pred = k_means.predict(df)

prediction = pd.concat([df, pd.DataFrame(y_pred, columns=['pred'])], axis=1)
prediction["pred"] = prediction.pred.astype('category')
```
![The noisier dataset, clustered with k-means](https://ruivieira.dev/Dunn%20index_files/figure-gfm/cell-16-output-1.png)

```python
clus0 = prediction.loc[prediction.pred == 0]
clus1 = prediction.loc[prediction.pred == 1]
clus2 = prediction.loc[prediction.pred == 2]
k_list = [clus0.values, clus1.values, clus2.values]
```
And, indeed, the index is considerably lower:

```python
dunn(k_list)
```

```
0.019563892388205984
```
## Calinski-Harabasz index

The Calinski-Harabasz index[^1] (also known as the variance ratio criterion) is a measure of the quality of a clustering algorithm. It is commonly used to evaluate the results of a clustering technique and to compare the performance of different clustering algorithms.

The index is calculated by dividing the between-cluster variance by the within-cluster variance. A higher Calinski-Harabasz index indicates a better separation of the clusters and a better overall clustering result.
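For reference (this formula block is an addition, following the standard definition of the criterion), with the notation introduced earlier the index for a clustering of $N$ points into $K$ groups can be written as:

$$
CH(C) = \frac{\mathrm{tr}(B_K)}{\mathrm{tr}(W_K)} \cdot \frac{N-K}{K-1}
$$

where

$$
B_K = \sum_{c_k \in C} |c_k| (\bar{c}_k - \bar{\mathcal{D}})(\bar{c}_k - \bar{\mathcal{D}})^T, \qquad
W_K = \sum_{c_k \in C} \sum_{x_i \in c_k} (x_i - \bar{c}_k)(x_i - \bar{c}_k)^T
$$

are the between-group and within-group dispersion matrices, and $\mathrm{tr}$ denotes the trace.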
For the following example we will use [Scikit-learn](https://ruivieira.dev/scikit-learn.html)'s implementation[^2] of the Calinski-Harabasz index.

If we apply it to the previous well-defined cluster data, `X`, `y`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Calculate the Calinski-Harabasz index
score = calinski_harabasz_score(X, y)

print("Calinski-Harabasz index:", score)
```

```
Calinski-Harabasz index: 12403.892218876248
```
While applying it to the less well-defined clusters will return a lower index:

```python
score = calinski_harabasz_score(X2, y2)

print("Calinski-Harabasz index:", score)
```

```
Calinski-Harabasz index: 135.29288069299935
```
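For completeness, here is a from-scratch sketch of the same score, following the dispersion formulas above (our own illustration; `calinski_harabasz` is a hypothetical helper, not from the original post):

```python
import numpy as np

def calinski_harabasz(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, K = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)

    between, within = 0.0, 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        # tr(B_K): size-weighted squared distance of each centroid
        # to the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # tr(W_K): squared distances of the members to their centroid
        within += np.sum((members - centroid) ** 2)

    return (between / within) * (N - K) / (K - 1)
```

Applied to `X` and `y`, this should agree with `calinski_harabasz_score(X, y)` up to floating-point error.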
[^1]: Caliński, Tadeusz, and Jerzy Harabasz. "A dendrite method for cluster analysis." *Communications in Statistics - Theory and Methods* 3, no. 1 (1974): 1-27.

[^2]: <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html>