R for Data Science by Garrett Grolemund and Hadley Wickham


Model assessment

  • Some discussion of p-values.
  • Bootstrapping to understand uncertainty in parameters (see the sketch after this list).
  • Cross-validation to understand predictive quality.
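
A rough sketch of the bootstrap idea, using lm(mpg ~ wt, data = mtcars) as a stand-in example (this isn’t fleshed out in the draft): resample the rows with replacement, refit the model to each resample, and use the spread of the coefficients as a measure of their uncertainty.

library(purrr)

# resample the rows of mtcars with replacement, then refit the model to each resample
boots <- rerun(1000, mtcars[sample(nrow(mtcars), replace = TRUE), ])
boot_coefs <- boots %>% map(~lm(mpg ~ wt, data = .x)) %>% map(coef)

# the standard deviation across resamples approximates the standard error of the slope
boot_coefs %>% map_dbl("wt") %>% sd()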

Purrr + dplyr

i.e. how dplyr and purrr intersect.

  • Why use a data frame?
  • List columns in a data frame
  • Mutate & filter.
  • Creating list columns with group_by() and do() (sketched below).
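
A minimal sketch of that last bullet (cyl is just a convenient grouping variable): do() with a named argument stores one object per group in a list column, here an lm fit per number of cylinders.

library(dplyr)

# one row per value of cyl; mod is a list column holding one lm object per group
models_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ wt, data = .))
models_by_cyl$mod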

Multiple models

A natural application of map2() is handling test-training pairs when doing model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to, because doing so will make you overconfident. Instead, it’s better to divide the data up, use one piece to fit the model, and use the other piece to evaluate it. A popular family of techniques for this is cross-validation; in the variant we’ll use here, you randomly hold out x% of the data as a test set and fit the model to the rest, repeating the process a number of times to average over the random variation.

Why you should store related vectors (even if they’re lists!) in a data frame. Need example that has some covariates so you can (e.g.) select all models for females, or under 30s, …
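
As a stand-in for that missing example (the people data below is invented purely for illustration): once per-group models live in a data frame next to their covariates, "all models for females" or "all models for the under 30s" is just a filter().

library(dplyr)

# invented data: two covariates plus a predictor and response
people <- data.frame(
  sex = sample(c("female", "male"), 200, replace = TRUE),
  age = sample(18:80, 200, replace = TRUE),
  x   = runif(200)
)
people$y <- 2 * people$x + rnorm(200)

# one model per sex/age-band combination, stored in a list column
models <- people %>%
  mutate(under_30 = age < 30) %>%
  group_by(sex, under_30) %>%
  do(mod = lm(y ~ x, data = .))

# because the models sit next to their covariates, subsetting is just filter()
models %>% filter(sex == "female")
models %>% filter(under_30)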

Let’s start by writing a function that partitions a dataset into test and training:

partition <- function(df, p) {
  n <- nrow(df)
  n_test <- round(n * p)
  # one TRUE per test row and one FALSE per training row, in random order
  groups <- rep(c(TRUE, FALSE), c(n_test, n - n_test))
  sample(groups)
}
partition(mtcars, 0.1)
##  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

We’ll generate 200 random test-training splits, and then create lists of test and training datasets:

library(purrr)

partitions <- rerun(200, partition(mtcars, 0.25))

tst <- partitions %>% map(~mtcars[.x, , drop = FALSE])
trn <- partitions %>% map(~mtcars[!.x, , drop = FALSE])
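
As a quick sanity check (not spelled out in the draft), every test set should contain 8 of mtcars’ 32 rows and every training set the remaining 24:

table(map_int(tst, nrow))
table(map_int(trn, nrow))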

Then we fit a model to each training dataset:

mod <- trn %>% map(~lm(mpg ~ wt, data = .))

If we wanted, we could extract the coefficients with broom, combine them into a single data frame with map_df(), and then visualise the distributions with ggplot2:

coef <- mod %>% 
  map_df(broom::tidy, .id = "i")
coef
## Source: local data frame [400 x 6]
## 
##        i        term estimate std.error statistic  p.value
##    (chr)       (chr)    (dbl)     (dbl)     (dbl)    (dbl)
## 1      1 (Intercept)    37.62     2.229     16.88 4.49e-14
## 2      1          wt    -5.45     0.643     -8.49 2.19e-08
## 3      2 (Intercept)    37.72     2.025     18.63 5.83e-15
## 4      2          wt    -5.56     0.606     -9.18 5.62e-09
## 5      3 (Intercept)    37.97     2.325     16.33 8.75e-14
## 6      3          wt    -5.46     0.660     -8.28 3.34e-08
## 7      4 (Intercept)    37.63     2.038     18.46 7.05e-15
## 8      4          wt    -5.53     0.620     -8.93 9.12e-09
## 9      5 (Intercept)    37.40     2.434     15.37 3.02e-13
## 10     5          wt    -5.38     0.741     -7.26 2.86e-07
## ..   ...         ...      ...       ...       ...      ...
library(ggplot2)
ggplot(coef, aes(estimate)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~term, scales = "free_x")

But we’re most interested in the quality of the models, so we make predictions for each test dataset and compute the root mean squared difference between the predicted and actual values:

pred <- map2(mod, tst, predict)
actl <- map(tst, "mpg")

# root mean squared difference between two numeric vectors
msd <- function(x, y) sqrt(mean((x - y) ^ 2))
mse <- map2_dbl(pred, actl, msd)
mean(mse)
## [1] 3.07
# for comparison, the in-sample error of a single model fit to all of mtcars
base_mod <- lm(mpg ~ wt, data = mtcars)
base_mse <- msd(mtcars$mpg, predict(base_mod))
base_mse
## [1] 2.95
ggplot(data.frame(mse), aes(mse)) + 
  geom_histogram(binwidth = 0.25) + 
  geom_vline(xintercept = base_mse, colour = "red")