This repository has been archived by the owner on Jul 24, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathREADME.Rmd
executable file
·276 lines (225 loc) · 9.61 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
library(knitr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(gravity)
library(eflm)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Efficient Fitting of Linear Models
<!-- badges: start -->
[![Project Status: Active – The project has reached a stable, usable
state and is being actively
developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://www.tidyverse.org/lifecycle/#stable)
[![CRAN
status](https://www.r-pkg.org/badges/version/gravity)](https://cran.r-project.org/package=eflm)
[![codecov](https://codecov.io/gh/pachadotdev/eflm/branch/main/graph/badge.svg?token=XI59cmGd15)](https://codecov.io/gh/pachadotdev/eflm)
[![R-CMD-check](https://github.com/pachadotdev/eflm/workflows/R-CMD-check/badge.svg)](https://github.com/pachadotdev/eflm/actions)
<!-- badges: end -->
## Scope
`eflm` package reduces the design matrix from $N\times P$ into
$P \times P$ for reduced fitting time, and delivers functions that are drop-in
replacements for `glm` and `lm`, like:
```{r, eval = FALSE}
# just append and 'e' to glm
eglm(mpg ~ wt, data = mtcars)
```
The best computational performance is obtained when R is linked against OpenBLAS,
Intel MKL or other optimized BLAS library. This implementation aims at being
compatible with 'broom' and 'sandwich' packages for summary statistics and
clustering by providing S3 methods.
This package takes ideas from glm2, speedglm, fastglm, speedglm and fixest
packages, but the implementations here shall keep the functions and outputs as
closely as possible to the stats package, therefore making the functions
provided here compatible with packages such as sandwich for robust estimation,
even if that means to attenuate the speed gains.
The greatest strength of this package is testing. With more than 1600
(and counting) tests, we try to do exactly the same as lm/glm, even in edge
cases, but faster.
The ultimate aim of the project is to produce a package that:
* Does exactly the same as lm and glm in less time
* Is equally numerically stable as lm and glm
* Depends only on base R, with no Rcpp or other calls
* Uses R's internal C code such as the `Cdqrls` function that the stats package uses for model fitting
* Can be used in Shiny dashboard and contexts where you need fast model fitting
* Is useful for memory consuming models
* Allows model fitting in cases demanding more memory than free RAM (PENDING)
## Installation
You can install the released version of eflm from CRAN with:
```r
install.packages("eflm")
```
And the development version with:
``` r
remotes::install_github("pachadotdev/eflm")
```
## Progress list
### Stats compatibility
- [x] cooks.distance
### Sandwich compatibility
- [x] estfun
- [x] bread
- [x] vcovCL
- [x] meatCL
- [x] vcovCL
- [x] vcovBS
- [ ] vcovHC
- [ ] meatHC
- [ ] vcovPC
- [ ] meatPC
- [ ] vcovPL
- [ ] meatPL
### Broom compatibility
- [x] augment
- [x] tidy
- [x] glance
### Lmtest compatibility
- [x] resettest
## Benchmarking
The dataset for this benchmark was taken from Yotov et al. (2016) and consists
in a 28,152 x 8 data frame with 6 numeric and 2 categorical columns of the form:
|Year ($t$) |Trade ($X$) |DIST |CNTG |LANG |CLNY |Exp Year ($\pi$) |Imp Year ($\chi$) |
|-----------|------------|-----|-----|-----|-----|-----------------|------------------|
|1986 |27.8 |12045|0 |0 |0 |ARG1986 |AUS1986 |
|1986 |3.56 |11751|0 |0 |0 |ARG1986 |AUT1986 |
|1986 |96.1 |11305|0 |0 |0 |ARG1986 |BEL1986 |
This data can be found in the `tradepolicy` package.
The variables are:
* `year`: time of export/import flow
* `trade`: bilateral trade
* `log_dist`: log of distance
* `cntg`: contiguity (0/1)
* `lang`: common language (0/1)
* `clny`: colonial relation (0/1)
* `exp_year`/`imp_year`: exporter/importer time fixed effects
For benchmarking I'll fit a PPML model, as it's a computationally expensive model.
```r
ch1_application1 <- tradepolicy::agtpa_applications %>%
select(exporter, importer, pair_id, year, trade, dist, cntg, lang, clny) %>%
filter(year %in% seq(1986, 2006, 4))
formula <- trade ~ log(dist) + cntg + lang + clny + exp_year + imp_year
eglm(formula, quasipoisson, ch1_application1)
```
To compare `glm`, the proposed `eglm` and Stata's `ppml`, I conducted a test
with 500 repetitions locally, and reported the median of the realizations as
the fitting time. The plots on the right report the fitting times and used
memory by running regressions with cumulative subset of the data for
1986, ..., 2006 (e.g. regress for 1986, then 1986 and 1990, ...,
then 1986 to 2006), we obtain the next fitting times and memory allocation
depending on the design matrix dimensions:
```{r echo=FALSE, fig.height=3, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align="center"}
g <- readRDS("benchmarks/06-stata-vs-r.rds")
g2 <- readRDS("benchmarks/05-benchmark-subsets.rds")
g + (g2[[1]] / g2[[2]])
```
Yotov et al. (2016) features complex both partial and general equilibrium
models. Some partial equilibrium models are particularly slow to fit because of
the allocated memory and the number of fixed effects, such as the Regional Trade
Agreements (RTAs) model.
In the next table, TG means 'Traditional Gravity' (e.g. vanilla PPML),
DP means 'Distance Puzzle' and GB stands for 'Globalization', which are
refinements of the simple PPML model and include dummy variables such as
specific country pair fixed effects and lagged RTAs.
```{r echo=FALSE}
load("benchmarks/04-glm-vs-eglm.RData")
t1 <- bind_rows(
model_matrix_traditional_gravity %>%
filter(fun %in% c("eglm"), model_matrix_ncol == max(model_matrix_ncol)),
model_matrix_distance_puzzle %>%
filter(fun %in% c("eglm"), model_matrix_ncol == max(model_matrix_ncol)),
model_matrix_rtas %>%
filter(fun %in% c("eglm"), model_matrix_ncol == max(model_matrix_ncol))
) %>%
select(-fun)
colnames(t1) <- c("Model", "Rows in design matrix", "Cols in design matrix")
kable(t1)
```
```{r echo=FALSE, fig.height=2, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align="center"}
include <- t1$Model
g2_time <- ggplot(bench_models %>% filter(fun %in% c("eglm", "glm"), model %in% include)) +
geom_col(aes(x = model, y = median_seconds, fill = `Function`), position = "dodge2") +
scale_fill_manual(values = c("#3B9AB2", "#E1AF00")) +
theme_minimal(base_size = 8) +
# theme(legend.position = "none") +
labs(x = "Model", y = "Median Time (Seconds)",
title = "Time Benchmark")
g2_memory <- ggplot(bench_models %>% filter(fun %in% c("eglm", "glm"), model %in% include)) +
geom_col(aes(x = model, y = mem_alloc_mb, fill = `Function`), position = "dodge2") +
scale_fill_manual(values = c("#3B9AB2", "#E1AF00")) +
theme_minimal(base_size = 8) +
# theme(legend.position = "none") +
labs(x = "Model", y = "Required Memory (MB)",
title = "Memory Benchmark")
g2_time + g2_memory
```
The results for the RTA model show that the speedups can be scaled, and we can
show both time reduction and required memory increases.
```{r echo=FALSE}
t2 <- bench_models %>%
filter(fun %in% c("eglm", "glm"), model %in% include) %>%
select(model, fun, median_seconds) %>%
spread(fun, median_seconds) %>%
select(model, glm, eglm) %>%
mutate(time_gain = paste0(round(100 * (glm - eglm) / glm, 2), "%"))
colnames(t2) <- c("Model", "GLM Time (s)", "EGLM Time (s)", "Time Gain (%)")
kable(t2)
```
Is it important to mention that the increase in memory results in reduced object
size for the stored model.
```{r echo=FALSE}
t4 <- bind_rows(
object_size_traditional_gravity %>%
filter(fun %in% c("glm", "eglm"), model %in% include),
object_size_distance_puzzle %>%
filter(fun %in% c("glm", "eglm"), model %in% include),
object_size_rtas %>%
filter(fun %in% c("glm", "eglm"), model %in% include)
) %>%
mutate(fun = ifelse(fun == "glm", "GLM", "EGLM"))
t4 <- t4 %>%
select(model, fun, size) %>%
mutate(size = round(size, 2)) %>%
spread(fun, size) %>%
select(model, GLM, EGLM) %>%
mutate(object_gain = paste0(round(100 * (GLM - EGLM) / GLM, 2), "%"))
colnames(t4) <- c("Model", "GLM Size (MB)", "EGLM Size (MB)", "Memory Savings (%)")
kable(t4)
```
To conclude my benchmarks, I fitted the PPML model again on DigitalOcean
droplets, leading to consistent times across scaled hardware. The results can be
seen in the next plot:
```{r echo=FALSE, fig.height=2, fig.width=6, message=FALSE, warning=FALSE, dpi=150, fig.align="center"}
g_do <- readRDS("benchmarks/07-r-digitalocean.rds")
g_do
```
## Edge cases
An elementary example that breaks `eflm` even with QR decomposition can be
found in Golub et al. (2013), which consists in passing an ill conditioned
matrix:
```{r include=FALSE}
set.seed(200100); n <- 10000
df <- data.frame(x1 = 1:n, x2 = c(0, 2:n))
df$y <- with(df, 1.999 + 2.000 * x1 + 2.001 * x2 + rnorm(n))
reg1 <- lm(y ~ ., df); reg2 <- elm(y ~ ., df)
```
| Model | (Intercept) | $x_1$ | $x_2$ |
|-------|-------------|-------|-------|
|REG 1 | 1.98 | 2.98 | 1.02 |
|REG 2 | 1.98 | 4.00 | NA |
# References
Golub, Gene H, and Charles F Van Loan. 2013. *Matrix Computations*. Vol. 3. JHU press.
Yotov, Yoto V, Roberta Piermartini, José-Antonio Monteiro, and Mario Larch. 2016.
*An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model*. World Trade
Organization Geneva.