---
output: html_document
editor_options:
chunk_output_type: console
---
# Fitting a bivariate model
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("ggplot2"))
```
---------------------------------
Bivariate data consist of datasets where *two* variables are measured for each observation. For instance, both wind speed *and* temperature might be recorded at each location. This contrasts with univariate data, where only a *single* variable, such as temperature, is measured for each observation (possibly at two different locations).
Understanding how to model and analyze bivariate data is critical for uncovering relationships and trends between the variables. In this chapter, we will delve into common fitting strategies that are specifically tailored for bivariate datasets.
## Scatter plot
A scatter plot is a widely used visualization tool for comparing values between two variables. In some cases, one variable is considered *dependent* on the other, which is then referred to as the *independent* variable. The dependent variable, also known as the *response* variable, is typically plotted on the y-axis, while the independent variable, also called the *factor* (not to be confused with the factor data type in R, used as a grouping variable), is plotted on the x-axis.
In other situations, the focus may not be on a dependent-independent relationship. Instead, the goal might simply be to explore and understand the overall relationship between the two variables, without assigning a hierarchical dependency.
The following figure shows a scatter plot of a vehicle's miles-per-gallon (mpg) consumption as a function of horsepower (hp). In this exercise, we'll seek to understand how a car's miles-per-gallon (the dependent variable) can be related to its horsepower (the independent variable).
```{r class.source="eda", fig.height = 3, fig.width = 3, echo = FALSE}
library(tukeyedar)
eda_lm(mtcars, hp, mpg, lm = FALSE, sd= FALSE, show.par = FALSE, mean.l = FALSE)
```
## Fitting a model to the data
In univariate data analysis, we aim to reduce the data to manageable parameters, providing a clearer "handle" on the dataset. Similarly, when working with bivariate data, we can start by fitting the simplest possible model. For the variable `mpg`, a straightforward approach is to use a measure of location, such as the mean.
```{r class.source="eda", fig.height = 3, fig.width = 3, echo = FALSE}
eda_lm(mtcars, hp, mpg, poly=0, sd= FALSE, show.par = FALSE, mean.l = FALSE)
```
The red line represents the fitted model. Mathematically, this model can be expressed as:
$$
mpg = a + b(hp)^0 + \epsilon = a + b(1) + \epsilon = 20.09 + \epsilon
$$
This equation is a 0^th^ order polynomial model. The polynomial order is determined by the highest power to which the independent variable is raised (0 in this case). For now, note that regardless of `hp`'s observed value, its contribution to `mpg` remains constant. The terms $a$ and $b$ are the model's **intercept** and **slope**; we will delve into their meanings later in this section. $\epsilon$ is the residual: the difference between each observation and the fitted model.
If this were a univariate dataset, modeling the data using the mean would be a reasonable choice. But, given that this is a bivariate dataset, this approach fails to utilize the variation in `hp` to explain `mpg`.
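To make this concrete, a zeroth-order fit can be reproduced with an intercept-only model, a minimal sketch using the built-in `mtcars` dataset (the `lm()` function is introduced more formally later in this chapter):

```r
# A zeroth-order polynomial is an intercept-only model:
# the fitted value is simply the mean of mpg.
M0 <- lm(mpg ~ 1, mtcars)
coef(M0)          # (Intercept) = 20.09...
mean(mtcars$mpg)  # identical value
```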
The pattern observed in the scatter plot suggests that as `hp` increases, `mpg` decreases. We can model this by fitting a line that captures the pattern generated by the points. The fitting strategy can be as simple as "eyeballing" the fit, or it can rely on more formal statistical methods such as the least-squares method.
The following plot is an example of a line fitted to the data using the least-squares method.
```{r class.source="eda", fig.height = 3, fig.width = 3, echo = FALSE}
eda_lm(mtcars, hp, mpg, poly=1, sd= FALSE, show.par = FALSE, mean.l = FALSE)
```
The modeled line can be expressed mathematically as:
$$
mpg = a + b(hp)^1 + \epsilon = 30.1 - 0.068(hp) + \epsilon
$$
This is a 1^st^ order polynomial where `hp` is raised to the power of 1. Here, the intercept $a$ is the value `mpg` would take if `hp` were zero. The term $b$ is the slope of the line and is interpreted as the change in $y$ for every unit change in $x$. In our example, for every one-horsepower increase, fuel efficiency drops by 0.068 miles-per-gallon.
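For a simple linear model, the least-squares coefficients can also be computed from first principles: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept follows from the means. A quick sketch:

```r
# Least-squares slope and intercept computed from first principles
b <- cov(mtcars$hp, mtcars$mpg) / var(mtcars$hp)
a <- mean(mtcars$mpg) - b * mean(mtcars$hp)
round(c(intercept = a, slope = b), 3)  # 30.099 and -0.068
```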
Not all relationships will necessarily follow a perfectly straight line; relationships can be curvilinear. Polynomial functions can take on many different non-linear shapes, and the power to which the $x$ variable is raised defines the nature of that shape. For example, to fit a quadratic line to the data, one can define the following model:
$$
mpg = a + b(hp) + c(hp)^2 + \epsilon
$$
The model retains the *slope* term, $b$, and adds the *curvature coefficient*, $c$.
Fitting the model to the data using the least-squares method results in the following coefficients:
$$
mpg = 40.4 - 0.2(hp) + 0.0004(hp)^2 + \epsilon
$$
The modeled line looks like:
```{r class.source="eda", fig.height = 3, fig.width = 3, echo = FALSE}
eda_lm(mtcars, hp, mpg, poly=2, sd= FALSE, show.par = FALSE, mean.l = FALSE)
```
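Using the rounded coefficients above, the fitted value at any horsepower can be computed directly. For example, at `hp = 200`:

```r
# Evaluate the quadratic fit (rounded coefficients) at hp = 200
# a, b, c from the fitted model (c renamed cc to avoid masking R's c())
a <- 40.4; b <- -0.2; cc <- 0.0004
a + b * 200 + cc * 200^2  # 40.4 - 40 + 16 = 16.4 mpg
```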
## Generating bivariate plots with base R
In R, a scatter plot can be easily created using the base plotting environment. Below is an example using the built-in `mtcars` dataset.
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
plot(mpg ~ hp, mtcars)
```
The formula `mpg ~ hp` can be interpreted as *...mpg as a function of hp...*, where `mpg` (miles per gallon) is plotted on the y-axis and `hp` (horsepower) on the x-axis.
To overlay a regression line on the scatter plot, you first need to define a linear model using the `lm()` function.
```{r}
M1 <- lm(mpg ~ hp, mtcars)
```
Here, a first-order polynomial (linear) model is fitted to the data. The formula `mpg ~ hp` carries the same interpretation as in the `plot()` function. The model coefficients (intercept and slope) can be extracted from the model `M1` as a numeric vector using the `coef()` function.
```{r}
coef(M1)
```
Once the model is defined, you can add the regression line to the plot by passing the `M1` model to `abline()` as an argument.
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
plot(mpg ~ hp, mtcars)
abline(M1, col = "red")
```
In this example, the regression line is displayed in red.
To fit a second-order polynomial (quadratic) model, extend the formula by adding a second-order term using the `I()` function:
```{r}
M2 <- lm(mpg ~ hp + I(hp^2), mtcars)
```
When raising an independent variable $x$ to a power using the caret (`^`) operator, you must wrap the term with `I()`. Without it, `lm()` may misinterpret the formula, leading to unintended behavior.
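You can check this behavior directly: in R's formula language, `^` denotes formula crossing rather than exponentiation, so without `I()` the `hp^2` term silently reduces to `hp`. Comparing the number of coefficients each formula produces makes the difference visible:

```r
# Without I(), ^ is formula crossing, so hp^2 reduces to hp:
length(coef(lm(mpg ~ hp + hp^2, mtcars)))     # 2 coefficients (no quadratic term)
# With I(), hp^2 is genuine exponentiation:
length(coef(lm(mpg ~ hp + I(hp^2), mtcars)))  # 3 coefficients
```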
Adding the quadratic regression line (or any higher-order polynomial model) to the plot requires generating predicted values using the `predict()` function. These predictions are then plotted using the `lines()` function:
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
plot(mpg ~ hp, mtcars)
# Create a sequence of hp values spanning the range of the data
x.pred <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 50))
# Predict mpg values based on the quadratic model
y.pred <- predict(M2, x.pred)
# Add the quadratic regression line to the plot
lines(x.pred$hp, y.pred, col = "red")
```
In this code, we:
+ Create a sequence of `hp` values covering the range of the dataset.
+ Predict the corresponding `mpg` values using the quadratic model `M2`.
+ Add the predicted curve to the plot with `lines()`.
## Generating bivariate plots with `ggplot`
Using the `ggplot2` package in R, a scatter plot can be created with minimal effort. Here's an example using the built-in `mtcars` dataset:
```{r fig.width=2.5, fig.height=2.5}
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
```
In this code, `ggplot()` initializes the plot and specifies the dataset (`mtcars`) and aesthetics (`aes()`), where `x = hp` and `y = mpg`. `geom_point()` adds points to create the scatter plot.
To add a linear regression line to the scatter plot, use the `stat_smooth()` function:
```{r fig.width=2.5, fig.height=2.5}
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() +
stat_smooth(method ="lm", se = FALSE)
```
Here, the `method = "lm"` argument specifies that a linear model (`lm`) is used for the regression line. The `se = FALSE` argument prevents the addition of a confidence interval around the regression line. (Confidence intervals are not covered in this course). Note that there’s no need to create a separate model outside the ggplot pipeline, as the `stat_smooth()` function handles it automatically.
The `stat_smooth()` function can also fit higher-order polynomial models by specifying the desired formula. For instance, to fit a quadratic model:
```{r fig.width=2.5, fig.height=2.5}
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() +
stat_smooth(method ="lm", se = FALSE, formula = y ~ x + I(x^2) )
```
In this case, the `formula` argument defines the model form as $y \sim x + x^2$, where $x$ and $y$ refer to the variables mapped to the x- and y-axes. The `I()` function ensures that the quadratic term (`x^2`) is treated as exponentiation in the formula.
It’s worth noting that the formula directly references `x` and `y` instead of specific column names. This generality allows the formula to adapt to the axis mappings defined in the `aes()` function.
## Non-parametric fit
Polynomial models used to fit lines to data are classified as **parametric models**. These models require defining a specific functional form (e.g., linear, quadratic) *a priori*, which is then fitted to the data. By imposing this predefined structure, parametric models make strong assumptions about the underlying relationship between variables.
In contrast, **non-parametric models** belong to a class of fitting strategies that do not assume a specific structure for the data. Instead, these models are designed to adapt flexibly, allowing the data to reveal its inherent patterns and relationships. One such method used in this course is the **loess** fit, a locally weighted regression approach.
#### Loess
A flexible curve-fitting method is the **loess** curve (short for **lo**cal regr**ess**ion, also known as *local weighted regression*). This technique fits small segments of a regression line across the range of x-values, then links the midpoints of these segments to generate a *smooth* curve. The range of x-values contributing to each localized regression line is controlled by the **span** parameter, $\alpha$, which typically ranges from 0.2 to 1 (though it can exceed 1 for smaller datasets). A larger $\alpha$ value results in a smoother curve. Another key parameter, $\lambda$, specifies the **polynomial order** of the localized regression lines. This is usually set to 1, although in `ggplot2`, the loess function defaults to a 2^nd^ order polynomial.
```{r echo=FALSE}
library(dplyr)
library(purrr)
df <- mtcars
alpha <- 0.5
strip.x <- nrow(df) * alpha # Number of points within band
f.plot <- function(start, line = FALSE, point = FALSE, bnds = FALSE,
                   w = FALSE, plot = TRUE, pts = FALSE, title = NULL){
  # Find points closest to starting point
  subset <- df %>% 
    mutate(dst = abs(hp - start)) %>% 
    arrange(dst) %>% 
    mutate(j = row_number()) %>% 
    filter(j <= strip.x) %>% 
    mutate(wt = dst / max(dst) * 3)
  # Assign weights
  wts <- dnorm(subset$wt) / 0.3989423
  # Regress with weights
  M <- lm(mpg ~ hp, subset, weights = wts)
  x.l <- coef(M)[1] + coef(M)[2] * start
  # Plot by option
  if(plot == TRUE){
    plot(mpg ~ hp, df, yaxt = 'n', main = title,
         axes = FALSE, pch = 16, col = "grey90", cex = 1.6)
    axis(side = 1, at = c(seq(10, 350, 30)))
    abline(v = start, lty = 2)
    if(bnds == TRUE){
      abline(v = c(min(subset$hp), max(subset$hp)), lty = 3, col = "grey")
      rect(min(subset$hp), 0, max(subset$hp), 35, col = rgb(0, 0, 1, 0.1),
           border = rgb(0, 0, 1, 0.1))
    }
    if(w == TRUE){
      points(x = subset$hp, y = subset$mpg, col = rgb(0, 0, 1, wts),
             pch = 16, cex = 1.6)
    }
    if(line == TRUE){
      clip(min(subset$hp), max(subset$hp),
           min(df$mpg), max(df$mpg))
      abline(M, col = "orange", lwd = 1.8)
    }
    if(point == TRUE){
      points(x = start, y = x.l, pch = 16, col = "red", cex = 1.8)
    }
  }
  if(pts == TRUE){
    return(c(start, x.l))
  }
}
```
#### How a loess is constructed
Behind the scenes, each point ($x_i$, $y_i$) that defines the loess curve is constructed as follows:
a) A subset of the data points closest to $x_i$ is identified ($x_{130}$ is used as an example in the figure below). The number of points in the subset is determined by multiplying the span $\alpha$ by the total number of observations. In our current example, $\alpha$ is set to `r alpha`, so the subset consists of 0.5 * 32 = 16 points. The region encompassing these points is displayed in light blue.
b) The points in the subset are assigned weights. Greater weight is assigned to points closest to $x_i$; the weights define each point's influence on the fitted line. Different weighting techniques can be implemented in a loess, with the `gaussian` weight being the most common. Another weighting strategy we will explore later in this course is the `symmetric` weight.
c) A regression line is fit to the subset of points. Points with smaller weights will have less leverage on the fitted line than points with larger weights. The fitted line can be either a first order polynomial fit or a second order polynomial fit.
d) Next, the value $y_i$ from the regression line is computed (red point in panel (d)). This is one of the points that will define the shape of the loess.
```{r fig.height = 2.7, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
f.plot(start = 130, bnds = TRUE, title = "(a) Subset points")
f.plot(start = 130, bnds = TRUE, w = TRUE, title = "(b) Assign weights")
f.plot(start = 130, bnds = TRUE, w = TRUE, line = TRUE, title = "(c) Fit line")
f.plot(start = 130, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = "(d) Draw point")
par(OP)
```
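Steps (a) through (d) can be sketched for a single point in a few lines of base R. This mirrors the construction illustrated above, assuming gaussian-style weights and a 1^st^ order local fit; it is a didactic sketch, not the exact algorithm used by `loess()`:

```r
# Compute one loess point at x = 130 (span = 0.5, so 16 of the 32 points)
x0  <- 130
sub <- mtcars[order(abs(mtcars$hp - x0)), ][1:16, ]  # (a) nearest subset
wts <- dnorm(abs(sub$hp - x0) / max(abs(sub$hp - x0)) * 3) / dnorm(0)  # (b) weights
M   <- lm(mpg ~ hp, sub, weights = wts)              # (c) weighted local fit
predict(M, data.frame(hp = x0))                      # (d) one point on the curve
```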
The above steps are repeated for as many $x_i$ values as is practical. Note that when $x_i$ approaches an upper or lower limit, the subset of points is no longer centered on $x_i$. For example, when estimating $x_{52}$, the sixteen closest points to the *right* of $x_{52}$ are selected. Likewise, for the upper bound $x_{335}$, the sixteen closest points to the *left* of $x_{335}$ are selected.
```{r fig.height = 2.5, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
f.plot(start = 52, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = expression(Left-most ~~ x[i] ))
f.plot(start = 335, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = expression(Right-most ~~ x[i]) )
par(OP)
```
In the following example, about twenty loess points are computed at equal intervals. Together, these points define the shape of the loess.
```{r fig.height = 2.3, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
l.pts <- seq(50,330,(330-50)/20) %>% map(function(x) f.plot(start = x, plot = FALSE, pts = TRUE)) %>%
do.call(rbind, .) %>% as.data.frame()
plot(mpg ~ hp, df, yaxt='n', main = NULL,
axes = FALSE, pch=16, col = "grey90", cex = 1.6)
axis(side=1, at=c(seq(50,335,30)))
points(l.pts$V1, l.pts$`(Intercept)`, pch = 16, col = "red",
cex=1)
par(OP)
```
It's more conventional to connect the loess points with line segments than to plot the points themselves.
```{r fig.height = 2.3, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
l.pts <- seq(50,335,(335-50)/20) %>% map(function(x) f.plot(start = x, plot = FALSE, pts = TRUE)) %>%
do.call(rbind, .) %>% as.data.frame()
plot(mpg ~ hp, df, yaxt='n', main = NULL,
axes = FALSE, pch=16, col = "grey90", cex = 1.6)
axis(side=1, at=c(seq(50,335,30)))
lines(l.pts$V1, l.pts$`(Intercept)`, col = "red")
par(OP)
```
<br>
## Generating a loess model in base R
The loess fit can be computed in R using the `loess()` function. It takes as arguments `span` ($\alpha$) and `degree` ($\lambda$).
```{r}
# Fit loess function
lo <- loess(mpg ~ hp, mtcars, span = 0.5, degree = 1)
# Predict loess values for a range of x-values
lo.x <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 50)
lo.y <- predict(lo, lo.x)
```
The modeled loess curve can be added to the scatter plot using the `lines` function.
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
plot(mpg ~ hp, mtcars)
lines(lo.x, lo.y, col = "red")
```
## Generating a loess model in `ggplot`
In `ggplot2`, simply pass `method = "loess"` to the `stat_smooth()` function.
```{r eval=FALSE}
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 0.5)
```
`ggplot` defaults to a second-degree loess (i.e. the small regression line elements that define the loess are modeled using a 2^nd^ order polynomial and not a 1^st^ order polynomial). If a first-order polynomial is desired, add the argument `method.args = list(degree = 1)` to the `stat_smooth()` function.
```{r fig.width=2.5, fig.height=2.5}
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 0.5,
method.args = list(degree = 1) )
```