# Dunn index

Updated 2023-09-02
There are several ways to measure the robustness of a clustering algorithm. Commonly used metrics include the [Dunn index](#dunn-index), the *Davies-Bouldin index*, the *Silhouette index* and the [Calinski-Harabasz index](#calinski-harabasz-index).

But before we start, let's introduce some concepts.

We are interested in clustering algorithms for a dataset $\mathcal{D}$ with $N$ elements in a $p$-dimensional real space, that is:

$$
\mathcal{D} = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^p
$$

The clustering algorithm will create a set $C$ of $K$ distinct, disjoint groups from $\mathcal{D}$, $C=\{c_1, c_2, \ldots, c_K\}$, such that:

$$
\bigcup_{c_k\in C}c_k=\mathcal{D}, \qquad c_k \cap c_l = \emptyset \quad \forall\, k\neq l
$$
Each group (or cluster) $c_k$ will have a *centroid*, $\bar{c}_k$, which is the mean vector of its elements:

$$
\bar{c}_k=\frac{1}{|c_k|}\sum_{x_i \in c_k}x_i
$$

We will also make use of the dataset's mean vector, $\bar{\mathcal{D}}$, defined as:

$$
\bar{\mathcal{D}}=\frac{1}{N}\sum_{x_i \in \mathcal{D}}x_i
$$

## Dunn index

The *Dunn index* aims at quantifying the compactness and separation of the clustering. A cluster is considered *compact* if there is small variance between members of the cluster. This can be calculated using $\Delta(c_k)$, where

$$
\Delta(c_k) = \max_{x_i, x_j \in c_k}{d_e(x_i, x_j)}
$$

and $d_e$ is the [Euclidean distance](https://ruivieira.dev/distance-metrics.html#euclidean-distance-%28l2%29) defined as:

$$
d_e(x_i, x_k)=\sqrt{\sum_{j=1}^p (x_{ij}-x_{kj})^2}.
$$

A clustering is considered *well separated* if the clusters are far apart. This can be quantified using

$$
\delta(c_k, c_l) = \min_{x_i \in c_k}\min_{x_j\in c_l}{d_e(x_i, x_j)}.
$$

Given these quantities, the *Dunn index* for a set of clusters $C$, $DI(C)$, is then defined by:

$$
DI(C)=\frac{\min_{c_k, c_l \in C,\, k \neq l}{\delta(c_k, c_l)}}{\max_{c_k\in C}\Delta(c_k)}
$$

A higher *Dunn index* indicates compact, well-separated clusters, while a lower index indicates less compact or less well-separated ones.

We can now calculate this metric in practice. Let's simulate some data and apply the Dunn index from scratch. First, we will create a compact and well-separated dataset using the [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) method in `scikit-learn`. We will create a dataset of $\mathbb{R}^2$ data (for easier plotting), with three clusters.

```python
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000,
                  centers=3,
                  n_features=2,
                  random_state=23)
```
```python
import pandas as pd
from plotnine import *
from plotnine.data import *
from plotutils import *

# collect the simulated points and their true labels in a DataFrame
data = pd.DataFrame(X, columns=["x1", "x2"])
data["y"] = y
data["y"] = data.y.astype('category')
```

![Simulated dataset with three well-separated clusters](https://ruivieira.dev/Dunn%20index_files/figure-gfm/cell-4-output-1.png)

We now cluster the data and, as expected, we obtain three distinct clusters, plotted below.

```python
from sklearn import cluster

# k-means with the (known) number of clusters
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(data)
y_pred = k_means.predict(data)

prediction = pd.concat([data, pd.DataFrame(y_pred, columns=['pred'])], axis=1)

# split the points by predicted cluster; note that `.values` keeps all
# columns (x1, x2, y and pred), so the distances below are computed
# over these four columns
clus0 = prediction.loc[prediction.pred == 0]
clus1 = prediction.loc[prediction.pred == 1]
clus2 = prediction.loc[prediction.pred == 2]
k_list = [clus0.values, clus1.values, clus2.values]
```
Let's focus now on two of these clusters; let's call them $c_k$ and $c_l$.

```python
ck = k_list[0]
cl = k_list[1]
```

We need to calculate the distance between every point in $c_k$ and every point in $c_l$. Since `len(ck) == len(cl) == 333`, we create a 333×333 matrix to hold the pairwise distances:

```python
import numpy as np

# placeholder matrix for the pairwise distances
values = np.ones([len(ck), len(cl)])
values
```
```
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
```
For each pair of points, we then take the norm of $x_i-x_j$. For instance, for $i=0$ in $c_k$ and $j=1$ in $c_l$, we would have:

```python
values[0, 1] = np.linalg.norm(ck[0] - cl[1])
print(ck[0], cl[1])
print(values[0, 1])
```
```
[-5.37039106 3.47555168 2. 0. ] [ 5.46312794 -3.08938807 1. 1. ]
12.746119711608184
```
The calculation of $\delta(c_k, c_l)$ between two clusters $c_k$ and $c_l$ can then be defined as follows:

```python
def δ(ck, cl):
    # pairwise distances between all points in ck and all points in cl
    values = np.ones([len(ck), len(cl)])
    for i in range(0, len(ck)):
        for j in range(0, len(cl)):
            values[i, j] = np.linalg.norm(ck[i] - cl[j])
    # the separation is the smallest inter-cluster distance
    return np.min(values)
```
So, for our two clusters above, $\delta(c_k, c_l)$ will be:

```python
δ(ck, cl)
```

```
8.13474311744193
```
Within a single cluster $c_k$, we can calculate $\Delta(c_k)$ similarly:

```python
def Δ(ci):
    # pairwise distances between all points within the cluster
    values = np.zeros([len(ci), len(ci)])
    for i in range(0, len(ci)):
        for j in range(0, len(ci)):
            values[i, j] = np.linalg.norm(ci[i] - ci[j])
    # the diameter is the largest intra-cluster distance
    return np.max(values)
```
So, for instance, for our $c_k$ and $c_l$ we would have:

```python
print(Δ(ck))
print(Δ(cl))
```

```
6.726025773561468
6.173844284636552
```
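As a side note, the same two quantities can be computed without explicit Python loops. Here is a minimal vectorised sketch using SciPy's `cdist` and `pdist` (the helper names `δ_fast` and `Δ_fast` are ours, not part of the original code):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def δ_fast(ck, cl):
    # all pairwise Euclidean distances between the two clusters at once
    return cdist(np.asarray(ck, dtype=float), np.asarray(cl, dtype=float)).min()

def Δ_fast(ci):
    # pdist returns the condensed vector of within-cluster pairwise distances
    return pdist(np.asarray(ci, dtype=float)).max()
```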
We can now define the *Dunn index* as

```python
def dunn(k_list):
    # δs[k, l] will hold the separation between clusters k and l
    # (caution: the matrix is initialised with ones and the diagonal is
    # never overwritten, so it can become the minimum whenever all true
    # separations exceed 1; initialising with np.inf would avoid this)
    δs = np.ones([len(k_list), len(k_list)])
    # Δs[k] will hold the diameter of cluster k
    Δs = np.zeros([len(k_list), 1])
    l_range = list(range(0, len(k_list)))
    for k in l_range:
        for l in (l_range[0:k] + l_range[k+1:]):
            δs[k, l] = δ(k_list[k], k_list[l])
        Δs[k] = Δ(k_list[k])
    di = np.min(δs) / np.max(Δs)
    return di
```
and calculate the *Dunn index* for our clustered values list as

```python
dunn(k_list)
```

```
0.14867620697065728
```
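An equivalently compact version can be sketched with the vectorised helpers above (again our own sketch rather than the post's code); taking the minimum only over distinct cluster pairs also sidesteps the initialisation caveat flagged in the comment:

```python
def dunn_fast(k_list):
    # smallest separation over all distinct cluster pairs
    δ_min = min(δ_fast(k_list[k], k_list[l])
                for k in range(len(k_list))
                for l in range(len(k_list)) if k != l)
    # largest diameter over the clusters
    Δ_max = max(Δ_fast(c) for c in k_list)
    return δ_min / Δ_max
```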
Intuitively, we can expect a dataset with less well-defined clusters to have a lower *Dunn index*. Let's try it. We first generate a new, noisier dataset.

```python
# generate three heavily overlapping blobs
X2, y2 = make_blobs(n_samples=1000,
                    centers=3,
                    n_features=2,
                    cluster_std=10.0,
                    random_state=24)

df = pd.DataFrame(X2, columns=['A', 'B'])

# K-means training
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df)
y_pred = k_means.predict(df)

prediction = pd.concat([df, pd.DataFrame(y_pred, columns=['pred'])], axis=1)
prediction["pred"] = prediction.pred.astype('category')
```
![The noisier dataset, clustered with k-means](https://ruivieira.dev/Dunn%20index_files/figure-gfm/cell-16-output-1.png)

```python
clus0 = prediction.loc[prediction.pred == 0]
clus1 = prediction.loc[prediction.pred == 1]
clus2 = prediction.loc[prediction.pred == 2]
k_list = [clus0.values, clus1.values, clus2.values]
```
And, indeed, the index is considerably lower:

```python
dunn(k_list)
```

```
0.019563892388205984
```
## Calinski-Harabasz index

The Calinski-Harabasz index[^1] (also known as the variance ratio criterion) is a measure of the quality of a clustering algorithm. It is commonly used to evaluate the results of a clustering technique and to compare the performance of different clustering algorithms.

The index is calculated by dividing the between-cluster variance by the within-cluster variance. A higher Calinski-Harabasz index indicates a better separation of the clusters and a better overall clustering result.
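For reference (this formula block is an addition, following the standard definition of the criterion), with the notation introduced earlier the index for a clustering of $N$ points into $K$ groups can be written as:

$$
CH(C) = \frac{\mathrm{tr}(B_K)}{\mathrm{tr}(W_K)} \cdot \frac{N-K}{K-1}
$$

where

$$
B_K = \sum_{c_k \in C} |c_k| (\bar{c}_k - \bar{\mathcal{D}})(\bar{c}_k - \bar{\mathcal{D}})^T, \qquad
W_K = \sum_{c_k \in C} \sum_{x_i \in c_k} (x_i - \bar{c}_k)(x_i - \bar{c}_k)^T
$$

are the between-group and within-group dispersion matrices, and $\mathrm{tr}$ denotes the trace.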
For the following example we will use [Scikit-learn](https://ruivieira.dev/scikit-learn.html)'s implementation[^2] of the Calinski-Harabasz index.

If we apply it to the previous well-defined cluster data, `X`, `y`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Calculate the Calinski-Harabasz index
score = calinski_harabasz_score(X, y)

print("Calinski-Harabasz index:", score)
```

```
Calinski-Harabasz index: 12403.892218876248
```
While applying it to the less well-defined clusters will return a lower index:

```python
score = calinski_harabasz_score(X2, y2)

print("Calinski-Harabasz index:", score)
```

```
Calinski-Harabasz index: 135.29288069299935
```
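For completeness, here is a from-scratch sketch of the same score, following the dispersion formulas above (our own illustration; `calinski_harabasz` is a hypothetical helper, not from the original post):

```python
import numpy as np

def calinski_harabasz(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, K = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)

    between, within = 0.0, 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        # tr(B_K): size-weighted squared distance of each centroid
        # to the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # tr(W_K): squared distances of the members to their centroid
        within += np.sum((members - centroid) ** 2)

    return (between / within) * (N - K) / (K - 1)
```

Applied to `X` and `y`, this should agree with `calinski_harabasz_score(X, y)` up to floating-point error.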
[^1]: Caliński, Tadeusz, and Jerzy Harabasz. "A dendrite method for cluster analysis." *Communications in Statistics - Theory and Methods* 3, no. 1 (1974): 1-27.

[^2]: <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html>