
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />


<title>Classification</title>

<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<link href="site_libs/highlightjs-9.12.0/default.css" rel="stylesheet" />
<script src="site_libs/highlightjs-9.12.0/highlight.js"></script>
<link href="site_libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="site_libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />

<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
  pre:not([class]) {
    background-color: white;
  }
</style>
<script type="text/javascript">
if (window.hljs) {
  hljs.configure({languages: []});
  hljs.initHighlightingOnLoad();
  if (document.readyState && document.readyState === "complete") {
    window.setTimeout(function() { hljs.initHighlighting(); }, 0);
  }
}
</script>


<style type="text/css">
h1 {
  font-size: 34px;
}
h1.title {
  font-size: 38px;
}
h2 {
  font-size: 30px;
}
h3 {
  font-size: 24px;
}
h4 {
  font-size: 18px;
}
h5 {
  font-size: 16px;
}
h6 {
  font-size: 12px;
}
.table th:not([align]) {
  text-align: left;
}
</style>


<style type = "text/css">
.main-container {
  max-width: 940px;
  margin-left: auto;
  margin-right: auto;
}
code {
  color: inherit;
  background-color: rgba(0, 0, 0, 0.04);
}
img {
  max-width:100%;
}
.tabbed-pane {
  padding-top: 12px;
}
.html-widget {
  margin-bottom: 20px;
}
button.code-folding-btn:focus {
  outline: none;
}
summary {
  display: list-item;
}
</style>


<style type="text/css">
/* padding for bootstrap navbar */
body {
  padding-top: 51px;
  padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar)  */
.section h1 {
  padding-top: 56px;
  margin-top: -56px;
}
.section h2 {
  padding-top: 56px;
  margin-top: -56px;
}
.section h3 {
  padding-top: 56px;
  margin-top: -56px;
}
.section h4 {
  padding-top: 56px;
  margin-top: -56px;
}
.section h5 {
  padding-top: 56px;
  margin-top: -56px;
}
.section h6 {
  padding-top: 56px;
  margin-top: -56px;
}
.dropdown-submenu {
  position: relative;
}
.dropdown-submenu>.dropdown-menu {
  top: 0;
  left: 100%;
  margin-top: -6px;
  margin-left: -1px;
  border-radius: 0 6px 6px 6px;
}
.dropdown-submenu:hover>.dropdown-menu {
  display: block;
}
.dropdown-submenu>a:after {
  display: block;
  content: " ";
  float: right;
  width: 0;
  height: 0;
  border-color: transparent;
  border-style: solid;
  border-width: 5px 0 5px 5px;
  border-left-color: #cccccc;
  margin-top: 5px;
  margin-right: -10px;
}
.dropdown-submenu:hover>a:after {
  border-left-color: #ffffff;
}
.dropdown-submenu.pull-left {
  float: none;
}
.dropdown-submenu.pull-left>.dropdown-menu {
  left: -100%;
  margin-left: 10px;
  border-radius: 6px 0 6px 6px;
}
</style>

<script>
// manage active state of menu based on current page
$(document).ready(function () {
  // active menu anchor
  href = window.location.pathname
  href = href.substr(href.lastIndexOf('/') + 1)
  if (href === "")
    href = "index.html";
  var menuAnchor = $('a[href="' + href + '"]');

  // mark it active
  menuAnchor.parent().addClass('active');

  // if it's got a parent navbar menu mark it active as well
  menuAnchor.closest('li.dropdown').addClass('active');
});
</script>

<!-- tabsets -->

<style type="text/css">
.tabset-dropdown > .nav-tabs {
  display: inline-table;
  max-height: 500px;
  min-height: 44px;
  overflow-y: auto;
  background: white;
  border: 1px solid #ddd;
  border-radius: 4px;
}

.tabset-dropdown > .nav-tabs > li.active:before {
  content: "";
  font-family: 'Glyphicons Halflings';
  display: inline-block;
  padding: 10px;
  border-right: 1px solid #ddd;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
  content: "&#xe258;";
  border: none;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open:before {
  content: "";
  font-family: 'Glyphicons Halflings';
  display: inline-block;
  padding: 10px;
  border-right: 1px solid #ddd;
}

.tabset-dropdown > .nav-tabs > li.active {
  display: block;
}

.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
  border: none;
  display: inline-block;
  border-radius: 4px;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
  display: block;
  float: none;
}

.tabset-dropdown > .nav-tabs > li {
  display: none;
}
</style>

<!-- code folding -->


</head>

<body>


<div class="container-fluid main-container">


<div class="navbar navbar-default  navbar-fixed-top" role="navigation">
  <div class="container">
    <div class="navbar-header">
      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
      </button>
      <a class="navbar-brand" href="index.html">IS709 - Introduction to Data Science</a>
    </div>
    <div id="navbar" class="navbar-collapse collapse">
      <ul class="nav navbar-nav">
        <li>
  <a href="index.html">Home</a>
</li>
<li class="dropdown">
  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
    Tutorials
     
    <span class="caret"></span>
  </a>
  <ul class="dropdown-menu" role="menu">
    <li>
      <a href="installation.html">R/RStudio Installation</a>
    </li>
    <li>
      <a href="azure_notebooks.html">Azure Notebooks</a>
    </li>
    <li>
      <a href="intro_to_r.html">Introduction to R</a>
    </li>
    <li>
      <a href="amelia.html">Missing Data Imputation</a>
    </li>
    <li>
      <a href="discretization.html">Dimensionality Reduction and Discretization</a>
    </li>
    <li>
      <a href="classification.html">Classification</a>
    </li>
    <li>
      <a href="clustering.html">Clustering</a>
    </li>
  </ul>
</li>
      </ul>
      <ul class="nav navbar-nav navbar-right">
        <li>
  <a href="https://mehmetaliakyol.com/">
    <span class="fa fa-question fa-lg"></span>
     
  </a>
</li>
      </ul>
    </div><!--/.nav-collapse -->
  </div><!--/.container -->
</div><!--/.navbar -->

<div class="fluid-row" id="header">


<h1 class="title toc-ignore">Classification</h1>

</div>


<p><strong>Objectives</strong>:</p>
<p>The objective of this document is to give a brief introduction to classification methods and model evaluation. After completing this tutorial you will be able to:</p>
<ul>
<li>Generate classification models <!--* Calculate accuracy, sensitivity and specificity values--></li>
<li>Evaluate model performance <!--* Calculate AUC--></li>
</ul>
<p>Let’s load the data:</p>
<pre class="r"><code>data(iris) 
head(iris)</code></pre>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa</code></pre>
<pre class="r"><code>shuffleIris &lt;- iris[sample(nrow(iris)),] #Shuffle the dataset 
trainIris &lt;- shuffleIris[1:100,] #Subset the training set 
testIris &lt;- shuffleIris[101:150,-5] #Subset the test set without the class column 
testClass &lt;- shuffleIris[101:150,5] #Get test classes into a separate vector</code></pre>
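<p>Because <code>sample()</code> draws a different permutation on every run, the split above, and every result that depends on it, changes each time the code is executed. A minimal sketch of a reproducible split, assuming you fix the random seed first. The seed value <code>709</code> and the names <code>shuffled</code>, <code>trainSet</code> and <code>testSet</code> are illustrative choices and do not reproduce the outputs shown in this tutorial:</p>
<pre class="r"><code>set.seed(709) # hypothetical seed; any fixed integer gives a repeatable shuffle
data(iris)
shuffled &lt;- iris[sample(nrow(iris)), ] # shuffle the rows
trainIdx &lt;- seq_len(100) # positions of the first 100 shuffled rows
trainSet &lt;- shuffled[trainIdx, ] # training set (100 rows)
testSet &lt;- shuffled[-trainIdx, ] # test set (remaining 50 rows)
# Count how many rows of each species landed in each subset
table(trainSet$Species)
table(testSet$Species)</code></pre>
<p>Running the block twice gives the same two tables, which makes it easier to compare models on an identical split.</p>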
<div id="k-nearest-neighbors-classification" class="section level2">
<h2>k-Nearest Neighbors Classification</h2>
<pre class="r"><code>require(class)
predClass &lt;- knn(trainIris[,-5],testIris, trainIris[,5], k = 5) #knn(trainvariables, testvariables, trainclasses, k)
require(caret)
confusionMatrix(testClass, predClass)</code></pre>
<pre><code>## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         18         2
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.94            
##                  95% CI : (0.8345, 0.9875)
##     No Information Rate : 0.38            
##     P-Value [Acc &gt; NIR] : &lt; 2.2e-16       
##                                           
##                   Kappa : 0.9094          
##                                           
##  Mcnemar&#39;s Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                    1.0            0.9474           0.8750
## Specificity                    1.0            0.9355           0.9706
## Pos Pred Value                 1.0            0.9000           0.9333
## Neg Pred Value                 1.0            0.9667           0.9429
## Prevalence                     0.3            0.3800           0.3200
## Detection Rate                 0.3            0.3600           0.2800
## Detection Prevalence           0.3            0.4000           0.3000
## Balanced Accuracy              1.0            0.9414           0.9228</code></pre>
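<p>With 94% accuracy on the test set, the k-NN model performs well. This is partly due to how clean our data is. Your own numbers may differ slightly because the data is shuffled randomly.</p>
<p>The confusion matrix is a table that cross-tabulates the true classes against the predicted classes. <code>confusionMatrix()</code> from <code>caret</code> expects the predictions as its first argument and the true classes as its second. Because the call above passes <code>testClass</code> first, the rows (labelled <code>Prediction</code>) hold the true classes and the columns (labelled <code>Reference</code>) hold the predicted ones. Take a look at the <code>virginica</code> row in the confusion matrix: one instance of data that is actually <code>virginica</code> is classified as <code>versicolor</code>. The off-diagonal cells tell us which classes the model confuses with each other.</p>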
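<p>To see what such a table contains, here is a minimal sketch that builds a confusion matrix with base R's <code>table()</code> and derives accuracy from it. The six labels are made-up values used only for illustration, and the layout follows the usual convention of predictions in rows and true classes in columns:</p>
<pre class="r"><code># Hypothetical true and predicted labels for six flowers (made-up values)
actual &lt;- factor(c(&quot;setosa&quot;, &quot;setosa&quot;, &quot;versicolor&quot;,
                   &quot;versicolor&quot;, &quot;virginica&quot;, &quot;virginica&quot;))
predicted &lt;- factor(c(&quot;setosa&quot;, &quot;setosa&quot;, &quot;versicolor&quot;,
                      &quot;virginica&quot;, &quot;virginica&quot;, &quot;versicolor&quot;),
                    levels = levels(actual))
# Rows are the predicted classes, columns are the true classes
cm &lt;- table(Prediction = predicted, Reference = actual)
print(cm)
# Accuracy is the share of cases on the diagonal (prediction equals truth)
accuracy &lt;- sum(diag(cm)) / sum(cm)
accuracy</code></pre>
<p>Four of the six labels match, so the accuracy is 4/6. The two off-diagonal counts are the cases where <code>versicolor</code> and <code>virginica</code> were mistaken for each other.</p>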
</div>
<div id="naive-bayes-classification" class="section level2">
<h2>Naive Bayes Classification</h2>
<pre class="r"><code>require(e1071)
naiveModel &lt;- naiveBayes(Species~., data = trainIris) 
predClass &lt;- predict(naiveModel, testIris) 
confusionMatrix(testClass, predClass)</code></pre>
<pre><code>## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         18         2
##   virginica       0          2        13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.92            
##                  95% CI : (0.8077, 0.9778)
##     No Information Rate : 0.4             
##     P-Value [Acc &gt; NIR] : 1.565e-14       
##                                           
##                   Kappa : 0.8788          
##                                           
##  Mcnemar&#39;s Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                    1.0            0.9000           0.8667
## Specificity                    1.0            0.9333           0.9429
## Pos Pred Value                 1.0            0.9000           0.8667
## Neg Pred Value                 1.0            0.9333           0.9429
## Prevalence                     0.3            0.4000           0.3000
## Detection Rate                 0.3            0.3600           0.2600
## Detection Prevalence           0.3            0.4000           0.3000
## Balanced Accuracy              1.0            0.9167           0.9048</code></pre>
<p>Naive Bayes also achieves high accuracy here (92%), slightly below the k-NN result on the same test set.</p>
</div>
<div id="decision-trees" class="section level2">
<h2>Decision Trees</h2>
<pre class="r"><code>require(rpart)
decisionModel &lt;- rpart(Species~., data = trainIris) 
predClass &lt;- predict(decisionModel, testIris, type = &quot;class&quot;) 
confusionMatrix(testClass, predClass)</code></pre>
<pre><code>## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         18         2
##   virginica       0          3        12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7819, 0.9667)
##     No Information Rate : 0.42            
##     P-Value [Acc &gt; NIR] : 1.676e-12       
##                                           
##                   Kappa : 0.848           
##                                           
##  Mcnemar&#39;s Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                    1.0            0.8571           0.8571
## Specificity                    1.0            0.9310           0.9167
## Pos Pred Value                 1.0            0.9000           0.8000
## Neg Pred Value                 1.0            0.9000           0.9429
## Prevalence                     0.3            0.4200           0.2800
## Detection Rate                 0.3            0.3600           0.2400
## Detection Prevalence           0.3            0.4000           0.3000
## Balanced Accuracy              1.0            0.8941           0.8869</code></pre>
<pre class="r"><code>require(rpart.plot)
prp(decisionModel) </code></pre>
<p><img src="classification_files/figure-html/unnamed-chunk-6-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Keep in mind that, because the data is shuffled randomly, your decision tree may look different from this one.</p>
<p>The decision tree also has high accuracy (90%); however, accuracy by itself is not sufficient for evaluating models. We should also consider the sensitivity and specificity values. Suppose the classes are imbalanced: your test data has 1000 instances of <code>class A</code> and 10 instances of <code>class B</code>. If you label all the test data as <code>class A</code>, you will have high accuracy (1000/1010, about 99%), yet you did not detect a single instance of <code>class B</code>, so your model is not very good. Both sensitivity and specificity need to be high for your model to be good.</p>
<!--## Logistic Regression

Logistic regression is a type of regression used to fit a binary classification model, that is, a model whose output variable has 2 classes. This doesn't mean that data with more than two classes (such as the iris data) cannot be modeled using logistic regression. The idea is that you model each class vs. all others individually.

Let's load the data: 


```r
data <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data"), sep=",", header = F) 
names(data) <- c("w.age", "w.ed", "h.ed", "child", "rel","w.occ", "h.occ", "ind", "med", "outcome") 
data$w.ed <- as.factor(data$w.ed) 
data$h.ed <- as.factor(data$h.ed) 
data$rel<-as.factor(data$rel) 
data$w.occ <- as.factor(data$w.occ) 
data$h.occ <- as.factor(data$h.occ) 
data$ind <- as.factor(data$ind) 
data$med <- as.factor(data$med) 
data$outcome <- as.factor(data$outcome) 
summary(data)
```

```
##      w.age       w.ed    h.ed        child        rel      w.occ   
##  Min.   :16.00   1:152   1: 44   Min.   : 0.000   0: 220   0: 369  
##  1st Qu.:26.00   2:334   2:178   1st Qu.: 1.000   1:1253   1:1104  
##  Median :32.00   3:410   3:352   Median : 3.000                    
##  Mean   :32.54   4:577   4:899   Mean   : 3.261                    
##  3rd Qu.:39.00                   3rd Qu.: 4.000                    
##  Max.   :49.00                   Max.   :16.000                    
##  h.occ   ind     med      outcome
##  1:436   1:129   0:1364   1:629  
##  2:425   2:229   1: 109   2:333  
##  3:585   3:431            3:511  
##  4: 27   4:684                   
##                                  
## 
```

As you can see, we have 3 classes in `outcome` variable, which means that we have to generate 3 different models to perform logistic regression on this data for each class.

Let's subset the data into training and testing sets:


```r
data <- data[sample(nrow(data)),] #Shuffles the data by sampling nrow(data) observations from the data without replacement 
trainInd <- round(nrow(data)*0.7) #Take 70% of data as training 
train <- data[1:trainInd,] #Subset training data 
test.outcome <- data[-(1:trainInd),10] #Separate the outcome values of test 
test <- data[-(1:trainInd),-10] #Subset test data and remove outcome variable
```

If you like, you can split the training set further into training and validation sets to see if your model is working properly.

In R, we can train a logistic regression with a single line of code. The `glm` function fits a logistic regression when given the `family = binomial("logit")` parameter. This means that our output variable is treated as binomial, taking the values `1` and `0`. If you want to classify more than two outcomes, you will need to fit one binary model per outcome (one vs. all). This is what we will try to do.


```r
iris2 <-iris 
iris2$Species<-as.numeric(iris2$Species)
#Create dataset for setosa 
iris2.setosa <-iris2 
iris2.setosa$Species <- as.factor(iris2.setosa$Species==1)
#Create dataset for versicolor 
iris2.versicolor <-iris2 
iris2.versicolor$Species <- as.factor(iris2.versicolor$Species==2) 
#Create dataset for virginica 
iris2.virginica <-iris2 
iris2.virginica$Species <- as.factor(iris2.virginica$Species==3)
logit.setosa <- glm(Species~., data = iris2.setosa, family = binomial)
summary(logit.setosa)
```

```
## 
## Call:
## glm(formula = Species ~ ., family = binomial, data = iris2.setosa)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.185e-05  -2.100e-08  -2.100e-08   2.100e-08   3.173e-05  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -16.946 457457.097       0        1
## Sepal.Length     11.759 130504.042       0        1
## Sepal.Width       7.842  59415.385       0        1
## Petal.Length    -20.088 107724.594       0        1
## Petal.Width     -21.608 154350.616       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.9095e+02  on 149  degrees of freedom
## Residual deviance: 3.2940e-09  on 145  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25
```


```r
class1.train <- train 
class1.train$outcome <- class1.train$outcome==1 #Get true for class = 1, false for otherwise 
class1.model <- glm(outcome~., data = class1.train, family = binomial("logit")) 
summary(class1.model)
```

```
## 
## Call:
## glm(formula = outcome ~ ., family = binomial("logit"), data = class1.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0911  -0.9717  -0.6790   1.0836   2.1438  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.26247    0.71339  -0.368  0.71293    
## w.age        0.07609    0.01135   6.702 2.05e-11 ***
## w.ed2       -0.20797    0.28959  -0.718  0.47267    
## w.ed3       -0.74086    0.29989  -2.470  0.01349 *  
## w.ed4       -1.61910    0.32398  -4.997 5.81e-07 ***
## h.ed2       -0.20531    0.48059  -0.427  0.66924    
## h.ed3       -0.35962    0.47882  -0.751  0.45263    
## h.ed4       -0.15781    0.48451  -0.326  0.74464    
## child       -0.33612    0.04057  -8.285  < 2e-16 ***
## rel1         0.33336    0.20247   1.647  0.09966 .  
## w.occ1       0.02297    0.16094   0.143  0.88653    
## h.occ2       0.13086    0.19662   0.666  0.50570    
## h.occ3      -0.19388    0.19546  -0.992  0.32124    
## h.occ4      -0.29294    0.56256  -0.521  0.60256    
## ind2        -0.46456    0.28827  -1.612  0.10706    
## ind3        -0.54456    0.27573  -1.975  0.04827 *  
## ind4        -0.74194    0.28060  -2.644  0.00819 ** 
## med1         0.36701    0.30398   1.207  0.22730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1412.1  on 1030  degrees of freedom
## Residual deviance: 1250.3  on 1013  degrees of freedom
## AIC: 1286.3
## 
## Number of Fisher Scoring iterations: 4
```

There are some irrelevant features in this model, so we can use stepwise removal to retain only relevant ones. There are other methods for variable selection which we will not cover in this tutorial.


```r
class1.model2 <- step(class1.model, direction="backward", trace=0) 
summary(class1.model2)
```

```
## 
## Call:
## glm(formula = outcome ~ w.age + w.ed + child + rel + ind, family = binomial("logit"), 
##     data = class1.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0759  -0.9703  -0.6868   1.1044   2.0550  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.42542    0.48513  -0.877  0.38053    
## w.age        0.07728    0.01115   6.929 4.23e-12 ***
## w.ed2       -0.32916    0.27165  -1.212  0.22563    
## w.ed3       -0.82844    0.26792  -3.092  0.00199 ** 
## w.ed4       -1.62007    0.27456  -5.901 3.62e-09 ***
## child       -0.32664    0.03950  -8.270  < 2e-16 ***
## rel1         0.30412    0.19930   1.526  0.12704    
## ind2        -0.50638    0.28349  -1.786  0.07406 .  
## ind3        -0.59421    0.26565  -2.237  0.02530 *  
## ind4        -0.77171    0.26796  -2.880  0.00398 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1412.1  on 1030  degrees of freedom
## Residual deviance: 1257.5  on 1021  degrees of freedom
## AIC: 1277.5
## 
## Number of Fisher Scoring iterations: 4
```

After generating the model for outcome = 1, we now have to generate models for the other outcome values.

For outcome = 2:


```r
class2.train <- train 
class2.train$outcome <- class2.train$outcome==2 #Get TRUE for class = 2, FALSE otherwise
class2.model <- glm(outcome~., data = class2.train, family = binomial("logit")) 
summary(class2.model)
```

```
## 
## Call:
## glm(formula = outcome ~ ., family = binomial("logit"), data = class2.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8400  -0.7549  -0.4556  -0.1559   2.6567  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.775818   0.894198  -1.986 0.047041 *  
## w.age       -0.004322   0.013439  -0.322 0.747743    
## w.ed2        1.247278   0.590575   2.112 0.034689 *  
## w.ed3        1.912480   0.588397   3.250 0.001153 ** 
## w.ed4        2.554835   0.604439   4.227 2.37e-05 ***
## h.ed2       -2.189328   0.626913  -3.492 0.000479 ***
## h.ed3       -2.006864   0.582312  -3.446 0.000568 ***
## h.ed4       -1.778930   0.581853  -3.057 0.002233 ** 
## child        0.247304   0.046422   5.327 9.97e-08 ***
## rel1        -0.330308   0.215107  -1.536 0.124648    
## w.occ1      -0.183878   0.191018  -0.963 0.335737    
## h.occ2      -0.497848   0.220888  -2.254 0.024206 *  
## h.occ3      -0.303870   0.217421  -1.398 0.162230    
## h.occ4      -0.351975   0.835968  -0.421 0.673727    
## ind2        -0.111082   0.483385  -0.230 0.818247    
## ind3         0.265234   0.442322   0.600 0.548747    
## ind4         0.473479   0.445545   1.063 0.287920    
## med1        -1.081801   0.594288  -1.820 0.068709 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1089.5  on 1030  degrees of freedom
## Residual deviance:  929.6  on 1013  degrees of freedom
## AIC: 965.6
## 
## Number of Fisher Scoring iterations: 6
```


```r
class2.model2 <- step(class2.model, direction="backward", trace=0) 
summary(class2.model2)
```

```
## 
## Call:
## glm(formula = outcome ~ w.ed + h.ed + child + rel + h.occ + med, 
##     family = binomial("logit"), data = class2.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8125  -0.7596  -0.4597  -0.1555   2.6564  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.79231    0.70944  -2.526 0.011525 *  
## w.ed2        1.31279    0.58569   2.241 0.024997 *  
## w.ed3        1.95373    0.58303   3.351 0.000805 ***
## w.ed4        2.66821    0.59712   4.468 7.88e-06 ***
## h.ed2       -2.13925    0.62038  -3.448 0.000564 ***
## h.ed3       -1.96525    0.57308  -3.429 0.000605 ***
## h.ed4       -1.67933    0.56827  -2.955 0.003125 ** 
## child        0.23775    0.03745   6.349 2.17e-10 ***
## rel1        -0.38430    0.20722  -1.855 0.063662 .  
## h.occ2      -0.52965    0.21708  -2.440 0.014693 *  
## h.occ3      -0.37129    0.21056  -1.763 0.077840 .  
## h.occ4      -0.34077    0.83296  -0.409 0.682460    
## med1        -1.23386    0.58042  -2.126 0.033520 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1089.46  on 1030  degrees of freedom
## Residual deviance:  935.11  on 1018  degrees of freedom
## AIC: 961.11
## 
## Number of Fisher Scoring iterations: 6
```

For outcome = 3:


```r
class3.train <- train 
class3.train$outcome <- class3.train$outcome==3 #Get TRUE for class = 3, FALSE otherwise
class3.model <- glm(outcome~., data = class3.train, family = binomial("logit")) 
summary(class3.model)
```

```
## 
## Call:
## glm(formula = outcome ~ ., family = binomial("logit"), data = class3.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5992  -0.9440  -0.6897   1.2214   2.1727  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.284635   0.938359  -1.369  0.17099    
## w.age       -0.083695   0.012221  -6.849 7.46e-12 ***
## w.ed2       -0.185547   0.294106  -0.631  0.52812    
## w.ed3        0.008821   0.302722   0.029  0.97675    
## w.ed4        0.392387   0.323545   1.213  0.22522    
## h.ed2        1.954096   0.772067   2.531  0.01137 *  
## h.ed3        2.064856   0.771327   2.677  0.00743 ** 
## h.ed4        1.788520   0.774563   2.309  0.02094 *  
## child        0.195245   0.039190   4.982 6.29e-07 ***
## rel1        -0.135300   0.204771  -0.661  0.50878    
## w.occ1       0.068860   0.163761   0.420  0.67413    
## h.occ2       0.305853   0.198607   1.540  0.12356    
## h.occ3       0.487544   0.195268   2.497  0.01253 *  
## h.occ4       0.641800   0.562154   1.142  0.25359    
## ind2         0.566382   0.297006   1.907  0.05652 .  
## ind3         0.495880   0.284065   1.746  0.08087 .  
## ind4         0.488650   0.288724   1.692  0.09056 .  
## med1         0.028589   0.309017   0.093  0.92629    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1326.4  on 1030  degrees of freedom
## Residual deviance: 1236.0  on 1013  degrees of freedom
## AIC: 1272
## 
## Number of Fisher Scoring iterations: 5
```


```r
class3.model2 <- step(class3.model, direction="backward", trace=0) 
summary(class3.model2)
```

```
## 
## Call:
## glm(formula = outcome ~ w.age + w.ed + h.ed + child + h.occ, 
##     family = binomial("logit"), data = class3.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5697  -0.9569  -0.6975   1.2272   2.2372  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.13822    0.85275  -1.335  0.18195    
## w.age       -0.08058    0.01169  -6.893 5.46e-12 ***
## w.ed2       -0.16585    0.28293  -0.586  0.55776    
## w.ed3        0.04851    0.29008   0.167  0.86719    
## w.ed4        0.45192    0.30646   1.475  0.14031    
## h.ed2        2.02944    0.77020   2.635  0.00841 ** 
## h.ed3        2.18770    0.76706   2.852  0.00434 ** 
## h.ed4        1.92459    0.76946   2.501  0.01238 *  
## child        0.19402    0.03816   5.084 3.70e-07 ***
## h.occ2       0.31335    0.19521   1.605  0.10846    
## h.occ3       0.47837    0.19221   2.489  0.01282 *  
## h.occ4       0.63700    0.55636   1.145  0.25223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1326.4  on 1030  degrees of freedom
## Residual deviance: 1240.6  on 1019  degrees of freedom
## AIC: 1264.6
## 
## Number of Fisher Scoring iterations: 5
```

In these models, the p-values show the statistical significance of each variable. The null deviance measures how well an intercept-only model fits the data, and the residual deviance measures how well the fitted model does; the lower the residual deviance is relative to the null deviance, the better. Each coefficient gives the effect of that variable on the log-odds of the outcome.

Now that we have generated our models, we can perform classification with the test set we have set aside:


```r
class1.test <- predict(class1.model2, test, type = "response") #Predicts probability of belonging to that class 
class2.test <- predict(class2.model2, test, type = "response") 
class3.test <- predict(class3.model2, test, type = "response") 
classProbs <- cbind(class1.test, class2.test, class3.test) 
classProbs <- classProbs/rowSums(classProbs) 
tclassProbs <- data.frame(t(classProbs)) 
classes <- as.factor(sapply(tclassProbs, which.max)) 
confusionMatrix(classes, test.outcome)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 126  39  72
##          2  13  32  19
##          3  41  34  66
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5068          
##                  95% CI : (0.4591, 0.5543)
##     No Information Rate : 0.4072          
##     P-Value [Acc > NIR] : 1.489e-05       
##                                           
##                   Kappa : 0.222           
##                                           
##  Mcnemar's Test P-Value : 1.076e-05       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.7000   0.3048   0.4204
## Specificity            0.5763   0.9050   0.7368
## Pos Pred Value         0.5316   0.5000   0.4681
## Neg Pred Value         0.7366   0.8069   0.6977
## Prevalence             0.4072   0.2376   0.3552
## Detection Rate         0.2851   0.0724   0.1493
## Detection Prevalence   0.5362   0.1448   0.3190
## Balanced Accuracy      0.6382   0.6049   0.5786
```

This model does not perform well. It predicts the true class only about 50% of the time. That is better than the roughly 33% expected from guessing uniformly among 3 classes, but it is still not good. The process also becomes tedious with more classes. One-vs-all needs one model per class, while the pairwise (one-vs-one) alternative needs a model for every pair of classes: 6 models for 4 classes, 10 models for 5 classes and 45 models for 10 classes. There are packages that do this for you, but they won't be covered in this tutorial.

To evaluate a single logistic regression model, we can use the following code to get the p-value associated with it. Assume we want to test if class3.model2 is a viable model:


```r
with(class3.model2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```

```
## [1] 1.139793e-13
```

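The p-value is far below 0.05, so the model fits the data significantly better than the null model. This result alone does not show that the model predicts well; it only shows that the predictors carry some information about the class.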
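
As a cross-check on the calculation above, here is a minimal sketch that fits a small logistic regression on R's built-in `mtcars` data (a stand-in chosen for illustration, not the tutorial's data) and compares the manual deviance calculation with an explicit comparison against an intercept-only model via `anova()`. The names `fit`, `null_fit`, `p_manual` and `p_anova` are hypothetical.

```r
# Stand-in model on built-in data: transmission type (am) as a function of weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Manual likelihood ratio test: drop in deviance relative to the null model
p_manual <- with(fit, pchisq(null.deviance - deviance,
                             df.null - df.residual,
                             lower.tail = FALSE))

# Same test via an explicit comparison with the intercept-only model
null_fit <- update(fit, . ~ 1)
p_anova <- anova(null_fit, fit, test = "Chisq")[2, "Pr(>Chi)"]

# The two p-values should agree up to numerical tolerance
all.equal(p_manual, p_anova)
```

Both routes test the same hypothesis, so either can be used; the `anova()` form extends more naturally to comparing two nested models that both contain predictors.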

Suppose we want to see how the predicted probability of belonging to that class, together with its confidence interval, varies with a variable in the dataset. For `class3.model2`, we first get the predictions on the link scale along with their standard errors, convert them to probabilities, and then plot them against the desired variable:


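```r
# Predict on the link (log-odds) scale and keep the standard errors
newdata <- cbind(test, predict(class3.model2, newdata = test, type = "link", se.fit = TRUE))
newdata <- within(newdata, {
  PredictedProb <- plogis(fit)        # back-transform to a probability
  LL <- plogis(fit - (1.96 * se.fit)) # lower bound of the approximate 95% interval
  UL <- plogis(fit + (1.96 * se.fit)) # upper bound of the approximate 95% interval
})
```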

To see how the predicted probabilities change with `w.age`, the following code visualizes that relationship:


```r
ggplot(newdata, aes(x = w.age, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL), alpha = 0.2) +
  geom_line(size = 1)
```

<img src="classification_files/figure-html/unnamed-chunk-19-1.png" width="672" style="display: block; margin: auto;" />
-->
</div>
<div id="evaluation-measures" class="section level2">
<h2>Evaluation Measures</h2>
<p>The outputs of the models generated above report several evaluation measures. The three most important are ‘<em>Sensitivity</em>’, ‘<em>Specificity</em>’ and ‘<em>Accuracy</em>’. Accuracy is the proportion of the data that you classified correctly. However, it is not a good measure when the classes are imbalanced. For instance, suppose you have 100 data points, of which 95 are <code>class a</code> and 5 are <code>class b</code>. You can classify all 100 points as <code>class a</code> and still reach 95% accuracy, even though you failed to find any of the data that belong to <code>class b</code>. This is why we need sensitivity and specificity. Treating <code>class a</code> as the positive class, sensitivity is the proportion of the data points that truly belong to <code>class a</code> that we classified as <code>class a</code>. In this example it is 1, because every data point that belonged to <code>class a</code> was classified as <code>class a</code>. Specificity is the proportion of the data points that truly belong to <code>class b</code> that we classified as <code>class b</code>. In this example it is zero, because none of the <code>class b</code> data points were classified as such. Ideally, we want both of these values to be close to one. If accuracy, sensitivity and specificity are all close to one, then you have a good model.</p>
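<p>As a minimal sketch of the example above, the following base R code builds the 95/5 data set by hand and computes the three measures from the resulting confusion table. The vectors <code>actual</code> and <code>predicted</code> are made up for illustration and do not come from the tutorial's data, and <code>class a</code> is assumed to be the positive class.</p>
<pre class="r"><code># Hypothetical data matching the example: 95 points of class a, 5 of class b
actual &lt;- factor(c(rep("a", 95), rep("b", 5)), levels = c("a", "b"))
# A classifier that labels every point as class a
predicted &lt;- factor(rep("a", 100), levels = c("a", "b"))

# Rows are predictions, columns are the true classes
cm &lt;- table(Prediction = predicted, Reference = actual)

accuracy    &lt;- sum(diag(cm)) / sum(cm)        # share of all points classified correctly
sensitivity &lt;- cm["a", "a"] / sum(cm[, "a"])  # share of true class a labelled as a
specificity &lt;- cm["b", "b"] / sum(cm[, "b"])  # share of true class b labelled as b

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)</code></pre>
<p>With these made-up labels, accuracy is 0.95, sensitivity is 1 and specificity is 0, which matches the numbers discussed above.</p>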
<!--### Receiver Operating Characteristic (ROC) Curve and Area Under Curve (AUC)

In a Receiver Operating Characteristic (ROC) curve, the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100% - Specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap between the score distributions of the two classes) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the better the test discriminates between the two classes.

Let's plot the ROC curve of the `class3.model2` that we fit with logistic regression:


```r
# install.packages("pROC")
library(pROC)
# Fitted probabilities on the data that was used to fit the model
class3probs <- predict(class3.model2, type = "response")
roccurve <- roc(class3.model2$y, class3probs)
plot(roccurve)
```

<img src="classification_files/figure-html/unnamed-chunk-20-1.png" width="672" style="display: block; margin: auto;" />


```r
auc(roccurve)
```

```
## Area under the curve: 0.6647
```

We want the curve to lie far above the diagonal line, which corresponds to random guessing and has an AUC of 0.5. Ideally, we want the area under the curve to be as high as possible. We want to make almost no mistakes while identifying all the positives, which means we want to see an AUC value close to 1.
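
To make the link between cut-off points, the ROC curve and the AUC concrete, here is a minimal base R sketch that does not rely on `pROC`. The `labels` and `scores` vectors are made up for illustration and are unrelated to `class3.model2`. The AUC is computed in two ways: as the share of positive/negative pairs in which the positive case gets the higher score, and as the trapezoidal area under the ROC points.

```r
# Hypothetical labels (1 = positive, 0 = negative) and predicted scores
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

# Sensitivity and false positive rate at each cut-off (predict positive when score >= cut-off)
cutoffs <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(cutoffs, function(k) mean(scores[labels == 1] >= k))
fpr <- sapply(cutoffs, function(k) mean(scores[labels == 0] >= k))

# AUC as the share of positive/negative pairs ranked correctly (ties count as one half)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc_rank <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# The same area via the trapezoidal rule over the ROC points, starting from (0, 0)
x <- c(0, fpr)
y <- c(0, tpr)
auc_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

c(auc_rank = auc_rank, auc_trap = auc_trap)
```

Both calculations give 0.75 for these made-up scores: 12 of the 16 positive/negative pairs are ranked correctly.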

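As can be seen, the AUC for the logistic regression model of class 3 is 0.66. This is better than random guessing (0.5) but indicates only modest discrimination, so the model leaves considerable room for improvement. Note also that this AUC was computed on the data used to fit the model, so it is likely to be optimistic. -->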
</div>
<div id="useful-links" class="section level2">
<h2>Useful Links</h2>
<ul>
<li>Caret package documentation: <a href="http://www.jstatsoft.org/v28/i05/paper" class="uri">http://www.jstatsoft.org/v28/i05/paper</a>
<ul>
<li>This paper provides examples and advanced methods for generating both classification and regression models using the <code>caret</code> package.</li>
</ul></li>
<li>Accurately determining prediction error: <a href="http://scott.fortmann-roe.com/docs/MeasuringError.html" class="uri">http://scott.fortmann-roe.com/docs/MeasuringError.html</a>
<ul>
<li>This document explains the details of error measurements.</li>
</ul></li>
<li>PCA revisited: using principal components for classification of faces: <a href="https://www.r-bloggers.com/pca-revisited-using-principal-components-for-classification-of-faces/" class="uri">https://www.r-bloggers.com/pca-revisited-using-principal-components-for-classification-of-faces/</a>
<ul>
<li>A tutorial on how you can apply PCA to a classification problem. For more details on PCA, you can refer to our exercise on <a href="https://mehmetaliakyol.com/datascience/discretization.html">Dimensionality Reduction</a>.</li>
</ul></li>
</ul>
</div>


</div>

<script>

// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
  $('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
  bootstrapStylePandocTables();
});


</script>

<!-- tabsets -->

<script>
$(document).ready(function () {
  window.buildTabsets("TOC");
});

$(document).ready(function () {
  $('.tabset-dropdown > .nav-tabs > li').click(function () {
    $(this).parent().toggleClass('nav-tabs-open')
  });
});
</script>

<!-- code folding -->


<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>