---
title: "Unsupervised Learning Lab II (L09)"
author: "Data Science III (STAT 301-3)"
date: "April 30th, 2018"
output: html_document
---
# Overview
The goal of this lab is to continue practicing the application of unsupervised learning techniques.
# Datasets
We will use the `USArrests` data (*USArrests.csv*), which is contained in the **data** subdirectory. Students can access the appropriate codebook using `?USArrests`. We will also use the `college_reshaped.csv` dataset, which contains both categorical and numerical data (also found in the **data** subdirectory). The dataset was formed from the `College` dataset in the `ISLR` package.
# Exercises
Please complete the following exercises. The document should be neatly formatted.
#### Load Packages
```{r, message=FALSE}
# Loading package(s)
```
<br>
#### Exercise 1 (Ex. 9 Section 10.7 pg 416)
Consider the `USArrests` data. Perform hierarchical clustering on the states.
a. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
<br>
b. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters? *Challenge: Consider plotting a map of the states and filling by cluster membership.*
<br>
c. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
<br>
d. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
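
The steps in parts (a) through (d) can be sketched roughly as follows. This is a minimal illustration, not a model answer: it assumes the `USArrests` data frame that ships with R matches *USArrests.csv*, and the object names (`arrests`, `hc_raw`, `hc_scaled`, `clusters_raw`, `clusters_scaled`) are made up for this example.

```{r}
# Assumption: the built-in USArrests data frame matches data/USArrests.csv
arrests <- USArrests

# (a) Complete linkage on Euclidean distances between states
hc_raw <- hclust(dist(arrests, method = "euclidean"), method = "complete")

# (b) Cut the tree so that exactly three clusters result,
#     then list the states in each cluster
clusters_raw <- cutree(hc_raw, k = 3)
split(names(clusters_raw), clusters_raw)

# (c) Same procedure after scaling each variable to standard deviation one
hc_scaled <- hclust(dist(scale(arrests), method = "euclidean"), method = "complete")
clusters_scaled <- cutree(hc_scaled, k = 3)

# (d) Cross-tabulate the two assignments to see how scaling moves states around
table(raw = clusters_raw, scaled = clusters_scaled)
```

Calling `plot(hc_raw)` and `plot(hc_scaled)` draws the two dendrograms, which may help when judging the effect of scaling.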
<br><br>
#### Exercise 2
Consider the `college_reshaped.csv` dataset. Scale the numerical features so that they have a standard deviation of one.
a. Run $K$-means on the data.
Try different numbers of clusters $K$.
Does a specific value of $K$ tend to produce better or more distinct clusters?
<br>
b. Run hierarchical clustering. Try different numbers of clusters, using Euclidean distance as the dissimilarity measure and complete linkage to merge clusters.
Be sure that the number of clusters you use in this exercise is similar to the number of clusters you tried in part (a).
What sort of clusters result?
<br>
c. Run spectral clustering using the radial kernel. Set the number of clusters for the algorithm equal to the number of clusters you found useful in parts (a-b). Do you obtain different clusters than with those algorithms?
<br>
d. Use the `cluster` package (specifically `daisy()` and `pam()`) to perform clustering. Again, use the same number of clusters you used in part (a). Do you obtain different clusters?
<br>
e. Discuss how similar cluster membership is for parts (a-d). What are some reasons that clusters are similar? Why would they be different? In your opinion, do clusters from any one algorithm seem better or more intuitive for this data?
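
One way to compare cluster memberships across algorithms, as parts (a), (d), and (e) ask, is sketched below. The example deliberately uses the built-in `iris` data as a stand-in for `college_reshaped.csv`, because it also mixes numerical and categorical columns. The choice of `k = 3`, the seed, the Gower metric, and all object names are assumptions made for illustration only.

```{r}
library(cluster)

# Stand-in data (assumption): iris has four numeric columns and one factor
mixed <- iris
num_cols <- sapply(mixed, is.numeric)

# Scale the numerical features to standard deviation one
mixed[num_cols] <- lapply(mixed[num_cols], function(x) as.numeric(scale(x)))

set.seed(301)  # K-means uses random starts, so fix the seed
k <- 3         # hypothetical number of clusters

# K-means uses the numerical features only
km <- kmeans(mixed[num_cols], centers = k, nstart = 20)

# daisy() with the Gower metric handles mixed column types;
# pam() then clusters directly on that dissimilarity object
gower <- daisy(mixed, metric = "gower")
pm <- pam(gower, k = k, diss = TRUE)

# Cross-tabulate memberships; cluster labels are arbitrary, so look for
# rows and columns that concentrate in a single cell
table(kmeans = km$cluster, pam = pm$clustering)
```

The same cross-tabulation can be repeated for any pair of algorithms. A table with most counts in one cell per row suggests the two methods largely agree.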