Statistical Learning with R
========================================================
author: Guy Freeman <[email protected]>
date: Monday 20th October 2014
Source: rpubs.com/slygent/statlearn
Statistical learning with R
========================================================
width: 3200
height: 1800
```{r echo=FALSE}
opts_chunk$set(fig.align='center')
```
- Why learning?
- Why statistical learning?
- Why R?
- Why me?
- I use R to carry out statistical learning almost every day.
![](knn1.png)
R, the statistical programming language
========================================================
I know that Mart has been using Python with you, but I want to show you an alternative.
I've been using R for 15 years now, and I want to show you the advantages (and maybe a few disadvantages) of using a programming language designed purely for statistical analysis. Then you can use different tools for different situations.
The New York Times approves of R! http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
And most recently, Bayesian statistics, the foundation of statistical learning: http://www.nytimes.com/2014/09/30/science/the-odds-continually-updated.html
Statistical learning -- it's like machine learning, but better
========================================================
These are my subjective definitions:
*Machine learning* is running algorithms to understand patterns in data and make predictions from inputs.
_Statistical learning_ is machine learning where you understand why the algorithms work and when they work.
I highly recommend the book "An Introduction to Statistical Learning: With Applications in R" by Gareth James et al for an overview. I heavily leaned on this book for today's lesson :)
An Introduction to Statistical Learning: With Applications in R
====
![](cover.jpg)
Introduction to R
===
"R is a language and environment for statistical computing and graphics" -- the official description of R at its homepage www.r-project.org
It also happens to be Free Software (just like Python). You can use it without paying money, and you can copy it and modify it and distribute it.
![](Rlogo.jpg)
You can download it from http://cran.r-project.org/
I highly recommend you also install RStudio, an Integrated Development Environment (IDE) for R that works on Windows, Mac and Linux, from http://www.rstudio.com/products/rstudio/download/
Functions in R
===
R uses *functions* to carry out procedures on (optional) inputs (aka *arguments*).
For example, the `sum` function calculates the sum of its inputs (so original!).
```{r}
sum(3,5,500,pi)
```
Vectors in R
===
To create a vector of numbers we use the `c` function [short for "concatenate"]. We can save this vector to a variable using `<-`.
```{r}
x <- c(1,8,7,15)
x
```
Many functions in R are "vectorized", which means they know what to do with vectors. Adding two vectors, for example, works how you'd expect:
```{r}
y <- c(20,8,4,1)
y + x
```
Objects in R
===
What objects have we created so far? Use `ls()` to find out:
```{r}
ls()
```
Delete anything you don't want any more with `rm()`:
```{r}
rm(y)
ls()
```
Matrices in R (with some help)
===
The `matrix` function in R creates matrices. How? Let's use R's help system.
```{r}
?matrix
matrix(data=c(1,2,3,4), nrow=2, ncol=2)
matrix(c(1,2,3,4), 2, 2)
```
Random variables in R
===
We can simulate a sample from common probability distributions and calculate summary statistics easily.
```{r}
set.seed(3) # to make sure our random numbers are the same!
y <- rnorm(100) # 100 random normal variables with mean 0 and s.d. 1
```
```{r}
mean(y)
var(y)
```
Random variables in R
===
Standard deviation is the square root of the variance.
```{r}
sqrt(var(y))
sd(y)
```
Graphics in R
===
```{r out.height="550px"}
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
```
Annotations in R
===
```{r out.height="550px"}
plot(x,y,xlab="this is the x-axis",ylab="this is the y-axis",
main="Plot of X vs Y")
```
Contour plots
===
```{r out.height="450px"}
x <- seq(-pi, pi, length=50) # 50 equally-spaced values from -pi to pi
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))  # evaluate f on the x-y grid
contour(x,y,f)
```
Indexing data
===
```{r}
A <- matrix(1:16,4,4)
A
A[2,3] # 2nd row, 3rd column
A[c(1,3),c(2,4)]
```
Indexing data
===
```{r}
A[,2:1] # 2nd then 1st columns, all rows
A[-c(1,3),] # ignore 1st and 3rd rows
```
Loading data into data frame
===
```{r}
auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
colnames(auto)
head(auto$mpg) # access column
```
Exercises (to make sure you're awake)
===
1. Use `read.csv` to store the data from http://www-bcf.usc.edu/~gareth/ISL/College.csv
2. Look at the first six rows using the `head` function.
3. What's going on with the first column? Use it to name the rows of the data using `row.names`.
4. Now delete the redundant first column.
5. Use the `summary` function on the data frame.
6. Use the `pairs` function on the first ten rows of the data frame.
7. Use the `hist` function to produce some histograms of the data. Try different bin widths.
Now we can start statistical learning!
===
Statistical learning means using $input$ variables to predict $output$ variables. That's it. But that's easier said than done...
```{r}
Advertising <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
head(Advertising, n=3)
```
How can we best predict Sales from the TV, Radio and Newspaper advertising budgets? (Why would we want to?)
Statistical learning in one equation
===
Let $Y$ be the output variable (e.g. sales), and $X$ the vector of input variables $X_1, X_2, X_3, ...$
Then
$$Y = f(X) + \epsilon $$
*We want to work out what $f$ is*. $\epsilon$ is unavoidable noise that is independent of $X$.
How do we *estimate* $f$ from the data, and how do we evaluate our estimate? That is statistical learning.
Prediction and its limits
===
Once we have an estimate $\hat{f}$ for $f$, we can predict unavailable values of $Y$ for known values of $X$: $$ \hat{Y} = \hat{f}(X) $$
How good an estimate of $Y$ is $\hat{Y}$? The difference between the two values can be partitioned into *reducible* and *irreducible* errors:
$$ E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \sigma_\epsilon^2 $$
where $[f(X) - \hat{f}(X)]^2$ is the reducible error.
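To make the decomposition concrete, here is a minimal simulation sketch (the true $f$, the sample size and the noise level are all invented for illustration):
```{r eval=FALSE}
set.seed(1)
n <- 1000
x <- runif(n, -2, 2)
f <- function(x) sin(2 * x)          # the true f, normally unknown to us
y <- f(x) + rnorm(n, sd = 0.5)       # epsilon has variance 0.25
fhat <- lm(y ~ x)                    # a deliberately too-simple estimate of f
mean((y - predict(fhat))^2)          # total squared error
mean((f(x) - predict(fhat))^2)       # reducible part: [f(X) - fhat(X)]^2
0.5^2                                # irreducible part: Var(epsilon)
```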
How to estimate f
===
Two main approaches:
1) Parametric
An assumption is made about the form of $f$. For example, the $linear$ model states that $$\hat{f}(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p $$ Then we use the *training data* to choose the values of $\beta_0, \beta_1, ..., \beta_p$, the *parameters*.
Advantage: Much easier to estimate parameters than whole function.
Disadvantage: Our choice of the form of $f$ might be wrong, or even very wrong.
How to estimate f
===
We can try to make our parametric form more *flexible* in order to reduce the risk of choosing the wrong $f$, but this also makes $\hat{f}$ more complex, so it may follow the noise too closely and thereby *overfit*.
2) Non-parametric
Just get $f$ as close as possible to the data points, subject to not being too wiggly or too unsmooth.
Advantage: More likely to get $f$ right, especially if $f$ is weird.
Disadvantage: Far more data is needed to obtain a good estimate for $f$.
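To make the contrast concrete, here is a small sketch on simulated data comparing a parametric fit (`lm`, a straight line) with a non-parametric smoother (`loess`):
```{r eval=FALSE}
set.seed(2)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)    # a "weird" (non-linear) true f
parametric <- lm(y ~ x)               # assumes a straight line: badly wrong here
nonparametric <- loess(y ~ x)         # just follows the data
plot(x, y)
abline(parametric, col = "red")
lines(sort(x), predict(nonparametric)[order(x)], col = "blue")
```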
Supervised vs unsupervised learning
===
What if we are only given input variables and no outputs? Then our learning will be *unsupervised*; we are blind.
What can we do? We can try to understand the relationship between the input variables or between the observations. One example is to *cluster* observations or variables together into groups.
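As a tiny (simulated) example of unsupervised learning, `kmeans` groups observations into clusters without ever seeing an output variable:
```{r eval=FALSE}
set.seed(4)
inputs <- rbind(matrix(rnorm(100), ncol = 2),           # one group of observations...
                matrix(rnorm(100, mean = 3), ncol = 2)) # ...and a second, shifted group
clusters <- kmeans(inputs, centers = 2)
plot(inputs, col = clusters$cluster)                    # colour by discovered cluster
```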
We will focus on *supervised* learning with outputs today.
Regression vs classification
===
Variables can be either *quantitative* or *qualitative*. (Anyone who went to my Introduction to Data Science lecture will remember I am obsessed with this distinction).
We care about this because the type of the output variable determines what kind of values we can predict, and therefore the possible nature of $f$, as well as how to measure the size of an error from a wrong prediction.
When the output variable is quantitative, prediction is called *regression*.
When the output variable is qualitative, prediction is *classification*.
Assessing model accuracy
===
How good is our $\hat{f}$? This is equivalent to asking how close each $\hat{f}(x)$ was to $y$. The most commonly-used score for regression is the *Mean Squared Error* (MSE): $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$
i.e. just sum up the squares of the differences between the predictions $\hat{f}(x_i)$ and the actual outputs $y_i$.
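For example, the training MSE of a linear model for `Sales` can be computed directly (a sketch, reusing the `Advertising` data frame loaded above):
```{r eval=FALSE}
fit <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
mean((Advertising$Sales - predict(fit))^2)   # MSE on the data used for fitting
```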
But actually we are not very interested in how well $\hat{f}$ predicts the $y$ values we already know; we want it to predict future $y$ values we haven't seen yet.
But we haven't seen the future $y$ yet...
Training MSE vs test MSE
===
If we only try to minimize the MSE for old (*training*) data, we might *overfit* and have terrible MSE for new (*test*) data.
![](mse.png)
Increasing flexibility improves accuracy at first, but after a while we are fitting the training data TOO accurately!
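A rough sketch of this effect on simulated data, using polynomial degree as a stand-in for flexibility (all numbers invented for illustration):
```{r eval=FALSE}
set.seed(5)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)                    # half for training, half held out as "test"
for (degree in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, degree), subset = train)
  train.mse <- mean((y[train] - predict(fit, data.frame(x = x[train])))^2)
  test.mse  <- mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)
  cat("degree", degree, ": train MSE", round(train.mse, 3),
      ", test MSE", round(test.mse, 3), "\n")
}
```
The training MSE can only fall as the degree increases, while the test MSE typically starts to rise again.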
Bias vs variance
===
Why does the test MSE go down and then up? Because
$$ E(y - \hat{f}(x))^2 = Var(\hat{f}(x)) + [Bias(\hat{f}(x))]^2 + Var(\epsilon) $$
What is the *variance* of $\hat{f}$? It is the amount $\hat{f}(x)$ would change if we had different data to train with from the same source population. Ideally, $\hat{f}$ shouldn't change for different data that is fundamentally the same. **More flexible models have higher variance.**
The *bias* of $\hat{f}$ is the error due to using $\hat{f}$ instead of the true (but unknown) $f$. **Simpler models have higher bias.**
**We are trying to decrease the bias without increasing the variance more**.
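A quick sketch of the variance part: fit a simple and a flexible model to many training sets drawn from the same source, and see how much their prediction at a single point varies (everything here is simulated for illustration):
```{r eval=FALSE}
set.seed(6)
preds.linear <- preds.flexible <- numeric(100)
for (i in 1:100) {                                    # 100 training sets from the same source
  x <- runif(50)
  y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)
  preds.linear[i]   <- predict(lm(y ~ x), data.frame(x = 0.25))
  preds.flexible[i] <- predict(lm(y ~ poly(x, 10)), data.frame(x = 0.25))
}
var(preds.linear)    # low variance, but biased (the true value at x = 0.25 is 1)
var(preds.flexible)  # higher variance, lower bias
```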
Exercises
===
For which of these situations would a flexible model/method be better than an inflexible one?
1. Large sample size with a small number of predictors.
2. Small sample size with a large number of predictors.
3. The relationship between the input and output variables is highly non-linear.
4. The variance of the irreducible error is very high.
Linear regression
===
Back to `Advertising` at last. Are the various media budgets *linearly* associated with sales?
```{r}
library(ggplot2)
library(reshape2)
advmelt <- melt(Advertising, id.vars = c("X", "Sales"))
```
Hmm.
***
```{r}
ggplot(advmelt, aes(x = value, y = Sales)) + geom_point() + facet_wrap(~ variable, scales="free_x")
```
Linear regression in R
===
```{r}
summary(lm(Sales ~ TV + Radio + Newspaper, data = Advertising))
```
Linear regression
===
Why does newspaper advertising seem to have no effect on sales, despite the graph earlier seeming to show otherwise?
Consider the relationship between newspaper and radio budgets:
```{r echo=FALSE,out.height="350px"}
plot(Advertising$Radio, Advertising$Newspaper)
```
Higher newspaper and radio budgets go together. So even if only radio affects sales, there will still be a (spurious) association between newspaper advertising budgets and sales.
Model fit
===
How good is our linear model? One way to ask this: How well does our model fit the data?
One measure is $R^2$, the square of the correlation between the predicted and actual sales values. For this model that value is 0.8972. Excluding newspapers, $R^2$ is... 0.8972. This suggests newspaper budgets are indeed unnecessary for predicting sales once TV and radio are taken into account.
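The $R^2$ values quoted above can be extracted from the fitted models like this (a sketch, reusing the `Advertising` data):
```{r eval=FALSE}
summary(lm(Sales ~ TV + Radio + Newspaper, data = Advertising))$r.squared
summary(lm(Sales ~ TV + Radio, data = Advertising))$r.squared
```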
Linear in reality?
===
But how linear is the data? Below is a graph of the prediction errors for different values of TV advertising.
```{r echo = FALSE,out.height="350px"}
Advertising$pred <- predict(lm(Sales ~ Radio + TV, data = Advertising))
ggplot(Advertising, aes(x = TV, y = pred - Sales)) + geom_point() + stat_smooth()
```
For low and high values of TV advertising the model over-predicts sales, while under-predicting in the middle. This systematic error indicates that the true relationship is not linear.
Extending the linear model
===
One way to extend the linear model is to include *interactions*.
$$ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2 + \dots $$
Now the effect of $X_1$ on $\hat{Y}$ depends on the value of $X_2$, and vice versa.
Is there an interaction between TV and Radio advertising budgets? It turns out there is.
Interactions in R
===
<small>
```{r echo=FALSE}
attach(Advertising)
summary(lm(Sales ~ TV + Radio + TV:Radio))
detach(Advertising)
```
</small>
Non-linear linear models
===
I mean *polynomial regression*, e.g.
$$ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_{2} X_1^2 + \dots $$
<small>
```{r echo=FALSE}
attach(Advertising)
summary(lm(Sales ~ TV + I(TV^2)))
detach(Advertising)
```
</small>
Example
===
```{r}
library(MASS)
library(ISLR)
# fix(Boston)
names(Boston)
```
Example
===
```{r, out.height="450px"}
bostonlm <- lm(medv~lstat,data=Boston) # median home value vs low economic status
plot(Boston$lstat, Boston$medv)
abline(bostonlm)
```
Is this linear?
Example
===
Let's look at the behaviour of the _residuals_ (prediction errors):
```{r}
plot(predict(bostonlm), residuals(bostonlm))
```
Example -- multiple linear regression
===
<small>
```{r}
summary(lm(medv~lstat+age,data=Boston))
```
</small>
Example -- one for all....
===
```{r echo=FALSE}
summary(lm(medv~.,data=Boston))
```
Example -- non-linear transformation
===
```{r echo=FALSE}
bostonlm2 <- lm(medv~lstat + I(lstat^2),data=Boston)
summary(bostonlm2)
```
Exercises
===
Using the `Auto` dataset:
1. Carry out linear regression of `mpg` against `horsepower`. Use `summary` to print the results.
- Is there a relationship?
- How strong is the relationship?
- Is the relationship positive?
- What is the predicted `mpg` associated with a `horsepower` of 98?
2. Plot the response against the predictor, and the line of best fit with `abline`.
Exercises
===
3. Use the `lm` function to regress `mpg` against all other variables (except `name`).
- Is there a relationship?
- Which predictors are statistically significant?
- What does the coefficient for `year` suggest?
4. Use * and : to fit regression models with interactions. Are the interactions statistically significant?
Classification
===
Linear regression was a fine first step for quantitative responses. What about qualitative (categorical) responses?
We don't want to make predictions on an unrestricted continuous scale, so linear regression can't be used directly. But it turns out we can predict probabilities of belonging to a class (i.e. continuous numbers between 0 and 1) and then classify using those probabilities...
Motivating example
===
Can we predict if someone will default on their credit card payment on the basis of their annual income and credit card balance? Let's find out using the `Default` dataset in the `ISLR` package.
```{r eval=FALSE}
library(ISLR)
ggplot(Default, aes(x = balance, y = income)) + geom_point(aes(colour = default), shape = 3)
```
Motivating example
===
Can we predict if someone will default on their credit card payment on the basis of their annual income and credit card balance? Let's find out using the `Default` dataset in the `ISLR` package.
```{r echo=FALSE}
library(ISLR)
ggplot(Default, aes(x = balance, y = income)) + geom_point(aes(colour = default), shape = 3)
```
Why not linear regression?
===
If the output is "default" or "not default", how can we run a regression?
You could argue that we can code the response as 0 and 1. But then the order and the scale of the coding are both arbitrary. With only two classes the situation is not too bad, but...
... we are still predicting a continuous value. We can use a threshold to classify the continuous prediction, but with linear regression the prediction could fall outside the [0,1] range. How do we restrict it to [0,1]?
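To see the problem concretely, here is a sketch of a linear regression on a 0/1 coding of `default` from the `Default` data used above; some of the predicted "probabilities" fall below zero:
```{r eval=FALSE}
library(ISLR)
default01 <- as.numeric(Default$default == "Yes")   # code "Yes" as 1, "No" as 0
linprob <- lm(default01 ~ balance, data = Default)
range(predict(linprob))                              # the lower end is negative
```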
Logistic regression
===
![](log.png)
Rather than modelling default directly, we can model the *probability* of default $Pr(Y = 1|X) = p(X)$. How?
Logistic regression
===
Instead of
$$p(X) = \beta_0 + \beta_1 X $$
try
$$ p(X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} $$
This will ensure the prediction is between 0 and 1. Re-arranging:
$$ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X $$
Linear again! But in a transformed variable, the *log-odds*.
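In R the transform and its inverse are available as `qlogis` (the log-odds) and `plogis` (the logistic function), so a fitted model's prediction can be reconstructed by hand (a sketch; the `glm` calls on the next slides do all of this for you):
```{r eval=FALSE}
fit <- glm(default ~ balance, family = binomial, data = Default)
coef(fit)
# probability of default at a balance of 2000, pushed through the logistic function by hand:
plogis(coef(fit)[1] + coef(fit)[2] * 2000)
# the same thing via predict:
predict(fit, data.frame(balance = 2000), type = "response")
```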
Logistic regression
===
```{r}
summary(glm(default~balance,family = binomial,data=Default))
```
Logistic regression
===
```{r}
summary(glm(default~balance+income+student,family = binomial,data=Default))
```
Model goodness
===
A binary classifier can make two kinds of error: false positives and false negatives. Predicting with the model from the last slide and a threshold of $p(X) = 0.5$, the *confusion matrix* for the `Default` data is as follows:
```{r echo=FALSE}
defaultglm <- glm(default~balance+income+student,family = binomial,data=Default)
Default$preds <- predict(defaultglm, type = "response")
table(Default$default,Default$preds > 0.5,dnn=c("Actual default", "Predicted default"))
```
So although the overall error rate is low, the false negative rate (those we mistakenly said won't default) is 228/(228+105) = `r round(228/(228+105), 2)`. That's not the best error rate ever... But the false positive rate is only 40/(9627+40) = `r round(40/(9627+40), 3)`...
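These rates can also be computed programmatically from the confusion matrix (a sketch, reusing the predictions `Default$preds` created in the hidden chunk above):
```{r eval=FALSE}
cm <- table(Default$default, Default$preds > 0.5)
cm[2, 1] / sum(cm[2, ])   # false negative rate: defaulters we predicted would not default
cm[1, 2] / sum(cm[1, ])   # false positive rate: non-defaulters we predicted would default
```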
ROC curves
===
Why use 0.5 as the threshold? Would another value be better? What if we choose 0.2?
```{r echo=FALSE}
defaultglm <- glm(default~balance+income+student,family = binomial,data=Default)
Default$preds <- predict(defaultglm, type = "response")
table(Default$default,Default$preds > 0.2,dnn=c("Actual default", "Predicted default"))
```
Now the false negative error rate is `r round(130/(130+203), 2)`, although the false positive rate has increased to `r round(277/(277+9390), 2)`. Changing the threshold always results in this trade-off.
ROC curves
===
What are the error rates across thresholds? The *ROC curve* can show this.
```{r}
library(ROCR)
pred <- prediction(Default$preds, Default$default)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
```
ROC curves
===
```{r echo=FALSE}
library(ROCR)
pred <- prediction(Default$preds, Default$default)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
```
Since we want the curve to hug the top-left corner, we can use the *area under the curve* (AUC) to summarise the quality of the classifier. This time the AUC is `r round(performance(pred, measure = "auc")@"y.values"[[1]], 2)`.
Exercises
===
Use the `Weekly` data from the `ISLR` package.
1. Produce some summaries of the data. Are there any patterns?
2. Fit a logistic regression model to explain `Direction` using the `Lag`s and `Volume`. Which factors are significant?
3. Make a confusion matrix for this model using `table`.
4. Fit a logistic model and a KNN model with K=1 using only `Lag2` as a predictor. (Check out the `knn` function in the `class` package.) Which model is better?
Advanced R features -- data.table
===
(Inspired by http://adv-r.had.co.nz/)
It turns out that `data.frame`s, the base form of data structure in R, have an annoying syntax. Enter `data.table`s.
```{r}
library(data.table)
defaultdt <- as.data.table(Default)
defaultdt
```
Advanced R features -- data.table
===
It turns out that `data.frame`s, the base form of data structure in R, have an annoying syntax. Enter `data.table`s.
```{r}
library(data.table)
defaultdt <- as.data.table(Default)
defaultdt[student=="No",list(meanbalance = mean(balance), sdbalance = sd(balance), meanincome = mean(income)),by=default]
```
Advanced R features -- Grammar of Graphics (ggplot2)
===
This builds graphs using a grammar to describe how aesthetic (geometrical) elements relate to the data/statistics, rather than detailing where every point goes. This allows for a concise and coherent description of complex graphs. For example, http://stats.stackexchange.com/questions/87132/what-is-the-proper-name-for-a-river-plot-visualisation shows how to re-create the famous Minard graph of the fate of Napoleon's Grand Army in the 1812 Russian campaign.
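As a minimal sketch of the grammar (reusing the `Default` data from earlier): the data and the aesthetic mappings are declared once, and geometric layers and facets are added on top.
```{r eval=FALSE}
library(ggplot2)
ggplot(Default, aes(x = balance, y = income, colour = default)) +  # data + aesthetic mappings
  geom_point(alpha = 0.3) +                                        # a geometric layer
  facet_wrap(~ student)                                            # one panel per student status
```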
Advanced R features -- Grammar of Graphics (ggplot2)
===
![](http://www.datavis.ca/gallery/minard/orig.gif)
![](http://www.datavis.ca/gallery/minard/ggplot2/temps.jpg)
Advanced R features -- closures
===
Closures are functions that return functions, e.g.
```{r}
power <- function(exponent) {
  function(x) {      # return a new function that "remembers" exponent
    x ^ exponent
  }
}
square <- power(2)
square(2)
square(4)
```
Advanced R features -- functionals
===
Functionals take functions as arguments, e.g.
```{r}
randomise <- function(f) f(runif(1e3))  # apply f to 1000 uniform random numbers
randomise(mean)
randomise(mean)
randomise(sum)
```
Advanced R features -- domain-specific languages
===
Formulas (`y ~ x`) and ggplot2 provide new syntaxes within R. Here's another example, converting R code to SQL:
```{r}
library(dplyr)
translate_sql(sin(x) + tan(y))
translate_sql(x < 5 & !(y >= 5))
```
Advanced R features -- Use C++ within R
====
Sometimes you want to use C++ code to do something, probably because it's much faster than R or it has a ready-made implementation of a complicated data structure or algorithm that R lacks.
```{r}
library(Rcpp)
cppFunction('int add(int x, int y, int z) {
  int sum = x + y + z;
  return sum;
}')
add(1,2,3)
```
Advanced R features -- Use C++ within R
====
```{r}
cppFunction('double meanC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total / n;
}')
library(microbenchmark)
x <- runif(1e5)
microbenchmark(
mean(x), # internal mean
meanC(x) # new mean function
)
```