19 Model

The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true “signals” (i.e. patterns generated by the phenomenon of interest), and ignore “noise” (i.e. random variation that you’re not interested in).

A model is a tool for making predictions. Goal of a model is to be simple and useful.

This is a hard problem because any fitted dataset is just the “best” (closest) model from a family of models. Just because it’s the best model doesn’t make it good. And it certainly doesn’t imply that the model is true. But a model doesn’t need to be true to be useful. You’ve probably heard George Box’s famous aphorism:

All models are worng, but some are useful.

But you might not have read the fuller content.

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.

For such a model there is no need to ask the question “Is the model true?”. If “truth” is to be the “whole truth” the answer must be “No”. The only question of interest is “Is the model illuminating and useful?”.

In this chapter, we’ll explore the basics of model fitting.

We are going to focus on predictive models, how you can use simple fitted models to help better understand your data. Many of the models will be motivated by plots: you’ll use a model captures to strong signals in the data so you can focus on what remains. This is a different motivation from most introductions to modelling, but if you go on to more traditional coverage, you can apply these same ideas to help you understand what’s going on.

19.0.1 Prerequisites

To access the functions and datasets that we will use in the chapter, load the following packages:

# Modelling functions
library(modelr)
library(broom)

# Modelling requires plently of visualisation and data manipulation
library(ggplot2)
library(dplyr)
library(tidyr)

I also recommend setting the following options. These make base models a little less surprising.

# Options that make your life easier
options(
  contrasts = c("contr.treatment", "contr.treatment"),
  na.option = na.exclude
)

19.1 Basics

I can be easier to learn about modelling in a simulated environment where we know the truth. For example, imagine we have this data:

df <- tibble(
  x = rep(1:10, each = 3),
  y = 5 + x * 2 + rnorm(length(x), sd = 2)
)

ggplot(df, aes(x, y)) + 
  geom_point()

You already know how to fit a straight line to that dataset:

ggplot(df, aes(x, y)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

But what is actually happening here?

Model class: linear. Model family: y = a + b * x.

There are lots of possible models in that family. Here are a few:

models <- tibble(
  a = runif(250, -20, 80),
  b = runif(250, -5, 5)
)

ggplot(df, aes(x, y)) + 
  geom_abline(aes(intercept = a, slope = b), data = models, alpha = 1/4) +
  geom_point() 

There are an infinite number of possible models. But some of them are obviously bad. How can we judge a good model? It seems like it must have something to do with distance: a good model should be close to the data points.

One common measurement of distance is Euclidean distance. For each, we could compute the Euclidean distance between each data point and the closest point on the model. We have multiple points so we could take the average.

distance <- function(a, b) {
  dist_one <- function(a, b) sqrt(mean((df$y - (a + b * df$x)) ^ 2))
  purrr::map2_dbl(a, b, dist_one)
}
models <- models %>% 
  mutate(dist = distance(a, b)) %>% 
  arrange(dist)

ggplot(df, aes(x, y)) + 
  geom_abline(aes(intercept = a, slope = b, alpha = dist), data = filter(models, dist < 10)) +
  geom_point() + 
  scale_alpha_continuous(trans = "reverse")


ggplot(models, aes(a, b)) +
  geom_point(aes(colour = dist)) + 
  scale_colour_continuous(trans = "reverse")

Obviously trying lots of random models is unlikely to find us a good model quickly. Instead, lets try to be systematic:

mu <- c(5, 2)
sd <- c(0.9, .14)

grid <- expand.grid(
  a = mu[1] + seq(-5, 5, length = 25) * sd[1],
  b = mu[2] + seq(-5, 5, length = 25) * sd[2]
  ) %>% 
  mutate(dist = distance(a, b)) 

best_10 <- grid %>% filter(rank(dist) < 10)
best_10
#>      a    b dist
#> 1 4.63 2.06 1.77
#> 2 5.00 2.06 1.75
#> 3 4.25 2.12 1.76
#> 4 4.63 2.12 1.73
#> 5 3.88 2.17 1.76
#> 6 4.25 2.17 1.73
#> 7 4.63 2.17 1.77
#> 8 3.88 2.23 1.74
#> 9 4.25 2.23 1.77

grid %>% 
  ggplot(aes(a, b, fill = dist)) +
  geom_raster() + 
  geom_point(data = best_10) +
  scale_fill_continuous(trans = "reverse") 


ggplot(df, aes(x, y)) + 
  geom_abline(aes(intercept = a, slope = b), data = best_10, alpha = 1/3) +
  geom_point()

That’s a called a grid search. Or we could do something even more complex: a Newton-Rhapson search. Here you effectively ski down the steepest slope and until you hit the bottom.

nlm(function(x) distance(x[1], x[2]), c(0, 0))
#> $minimum
#> [1] 1.72
#> 
#> $estimate
#> [1] 4.37 2.15
#> 
#> $gradient
#> [1] 9.39e-07 6.07e-06
#> 
#> $code
#> [1] 2
#> 
#> $iterations
#> [1] 7

However, generally, rather than doing it yourself, you’ll rely on a model function to do so. Here we’re using a linear model, so we can do:

summary(lm(y ~ x, data = df))
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -2.875 -1.538 -0.212  0.857  4.118 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    4.367      0.704     6.2  1.1e-06 ***
#> x              2.153      0.113    19.0  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.78 on 28 degrees of freedom
#> Multiple R-squared:  0.928,  Adjusted R-squared:  0.925 
#> F-statistic:  360 on 1 and 28 DF,  p-value: <2e-16

Generally, model functions will do something more sophisticated than a gradient search: for linear models based on linear algebra. Using connection between geometry, calculus, and linear algebra: the closest linear model to the data can always be found by constructing and then inverting a matrix.

19.2 Heights data

Have you heard that a relationship exists between your height and your income? It sounds far-fetched—and maybe it is—but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?

Luckily, it is easy to measure someone’s height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS National Longitudinal Surveys (NLS) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that’s just a lucky accident.

A small sample of the full dataset is included in modelr:

heights
#> # A tibble: 7,006 x 8
#>   income height weight   age  marital    sex education  afqt
#>    <int>  <dbl>  <int> <int>   <fctr> <fctr>     <int> <dbl>
#> 1  19000     60    155    53  married female        13  6.84
#> 2  35000     70    156    51  married female        10 49.44
#> 3 105000     65    195    52  married   male        16 99.39
#> 4  40000     63    197    54  married female        14 44.02
#> 5  75000     66    190    49  married   male        14 59.68
#> 6 102000     68    200    49 divorced female        18 98.80
#> # ... with 7,000 more rows

As well as height and income there are some other variables that might affect someone’s income: age, sex, race, years of education, and their score on the afqt (Armed Forces Qualification Test).

Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?

ggplot(heights, aes(height, income)) +
  geom_point()

First, let’s address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That’s not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.

n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
#> [1] 0.967

I’m going to record the original number of observations in n. We’ll come back to this every now and then to make sure that we haven’t throw out too much of our data.

Also, you can see that heights have been rounded to the nearest inch so using boxplots will make it easier to see the pattern. We’ll also remove the very tall and very short people so we can focus on the most typically heights:

heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n
#> [1] 0.962

ggplot(heights, aes(height, income, group = height)) +
  geom_boxplot()

(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you’ve got the first simple model working, you really should come back and all look at the full dataset. Is removing the data still a good idea?)

You can see there seems to be a fairly weak relationship: as height increase the median wage also seems to increase. But how could we summarise that more quantitiatively?

19.3 Linear models

One way is to use a linear model. A linear model is a very broad family of models: it encompasses all models that are a weighted sum of variables.

The formula specifies a family of models: for example, income ~ height describes the family of models specified by x1 * income + x0, where x0 and x1 are real numbers.

income ~ height
#> income ~ height

We fit the model by supplying the family of models (the formula), and the data, to a model fitting function, lm(). lm() finds the single model in the family of models that is closest to the data:

h <- lm(income ~ height, data = heights)
h 
#> 
#> Call:
#> lm(formula = income ~ height, data = heights)
#> 
#> Coefficients:
#> (Intercept)       height  
#>      -72289         1575

We can extract the coefficients of this fitted model and write down the model it specifies:

coef(h)
#> (Intercept)      height 
#>      -72289        1575

This tells says the model is \(-7.229\times 10^{4} + 1575.125 * height\). In other words, one inch increase of height associated with an increase of $937 in income.

The definition that lm() uses for closeness is that it looks for a model that minimises the “root mean squared error”.

lm() fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.

ggplot(heights, aes(height, income)) +
  geom_boxplot(aes(group = height)) +
  geom_smooth(method = lm, se = FALSE)

lm() treats the variable(s) on the right-hand side of the formula as explanatory variables that partially determine the value of the variable on the left-hand side of the formula, which is known as the response variable. In other words, it acts as if the response variable is determined by a function of the explanatory variables. Linear regression is linear because it finds the linear combination of the explanatory variables that best predict the response.

19.3.1 Exercises

  1. What variables in heights do you expect to be most highly correlated with income? Use cor() plus purrr::map_dbl() to check your guesses.

  2. Correlation only summarises the linear relationship between two continuous variables. There are some famous drawbacks to the correlation. What are they? Hint: google for Anscombe’s quartet, read https://xkcd.com/552/.

19.4 Understanding the model

For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you’ll spend a lot of time doing that. Here, however, we’re going to take a different tack. In this book, we’re going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand simple linear models or complex random forrests. We’ll see that advantage later on when we explore some other families of models.

  1. Make predictions across a uniform grid to understand the “shape” of the model.

  2. Subtract predictions from the actual values to look at the residuals and to learn what the model misses.

  3. Create succinct numerical summaries when you need to compare many models.

19.4.1 Predictions

To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use tidyr::expand(). It’s first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:

grid <- heights %>% expand(height) 
grid
#> # A tibble: 20 x 1
#>   height
#>    <dbl>
#> 1     59
#> 2     60
#> 3     61
#> 4     62
#> 5     63
#> 6     64
#> # ... with 14 more rows

(This will get more interesting when we start to add more variables to our model.)

Next we add predicitons. We’ll use modelr::add_predictions() which works in exactly the same way as add_residuals(), but just compute predictions (so doesn’t need a data frame that contains the response variable:)

grid <- grid %>% add_predictions(h, "income") 
grid
#> # A tibble: 20 x 2
#>   height income
#>    <dbl>  <dbl>
#> 1     59  20643
#> 2     60  22218
#> 3     61  23793
#> 4     62  25368
#> 5     63  26943
#> 6     64  28519
#> # ... with 14 more rows

And then we plot the predictions. Here the choice of plots is pretty obvious because we only havce two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you’ll often need to try a few alternatives before you get the most useful plot. For more complex models, you’re likely to need multiple plots, not just one.

ggplot(heights, aes(height, income)) +
  geom_boxplot(aes(group = height)) +
  geom_line(data = grid, colour = "red", size = 1)

19.4.2 Residuals

The flip-side of predictions are residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. We can compute residuals with add_residuals(). Note that we computing residuals, you’ll use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?

heights <- heights %>% add_residuals(h)

There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:

ggplot(heights, aes(resid)) + 
  geom_freqpoly(binwidth = 2000)

(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)

Here you can see that the range of the residuals is quite large. (Note that by the formal definiton of the linear model, the mean of the residuals will always be zero).

For many problems, the sign of the residual (i.e. whether the prediction is too high or too low) isn’t important, and you might just want to focus on the magnitude of the residuals. You can do that by plotting the absolute value:

ggplot(heights, aes(abs(resid))) + 
  geom_freqpoly(binwidth = 2000, boundary = 0)

You can also explore how the residuals vary with other variables in the data:

ggplot(heights, aes(height, resid)) + geom_point()

Iterative plotting the residuals instead of the original response leads to a natual way of building up a complex model in simple steps, which we’ll explore in detail in the next chapter.

19.4.3 Exercises

  1. In the plot of the absolute residuals, there’s a bin smaller than zero. What does this bin represent, and why is it necessary?

  2. Explore how the distribution of residuals varies with the sex and race. What makes understanding the residuals split up by race particularly challenging?

  3. P-values are an important part of model interpretation, particularly in science, that we’re not going to talk much about. https://xkcd.com/882/

  4. It’s often useful to recreate your initial EDA plots using residuals instead of the original missing values. How does visualising resid instead of height change your understanding of the heights data?

19.5 Multiple predictors

In most cases, like this one, a single variable is not enough to generate a useful model. Instead we need multiple variables in the model, which you can include with either + or * in the modelling formula.

The distinction between + and * is important:

  • x + y adds the variables independently: the model will estimate the effect of x and the effect of y completely separately.

  • x * y will estimate the effect of the model together.

We’ll explore what this means in the sections below, as we add more categorical and continuous variables to the dataset.

19.5.1 Categorical

Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.

ggplot(heights, aes(height, colour = sex)) + 
  geom_freqpoly(binwidth = 1)

ggplot(heights, aes(income, colour = sex)) + 
  geom_freqpoly(binwidth = 5000)

What happens if we also include sex in the model?

h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>% 
  expand(height, sex) %>% 
  add_predictions(h2, "income")

ggplot(heights, aes(height, income)) + 
  geom_point() + 
  geom_line(data = grid) +
  facet_wrap(~sex)

Need to commment about predictions for tall women and short men - there is not a lot of data there. Need to be particularly sceptical.

* vs +.

h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>% 
  expand(height, sex) %>% 
  gather_predictions(h2, h3)

ggplot(grid, aes(height, pred, colour = sex)) + 
  geom_line() +
  facet_wrap(~model)

19.5.1.1 Factors

R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.

A factor is an integer vector with a levels attribute. You can make a factor with factor().

fac <- factor(c("c", "a", "b"), 
  levels = c("a", "b", "c"), 
  labels = c("blond", "brunette", "red"))
fac
#> [1] red      blond    brunette
#> Levels: blond brunette red
unclass(fac)
#> [1] 3 1 2
#> attr(,"levels")
#> [1] "blond"    "brunette" "red"

Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.

If you use factors outside of a model, you will notice some limiting behavior:

  • You cannot add values to a factor that do not appear in its levels.
  • Factors retain all of their levels when you subset them. To avoid this use drop = TRUE.

    fac[1]
    #> [1] red
    #> Levels: blond brunette red
    fac[1, drop = TRUE]
    #> [1] red
    #> Levels: red
  • If you coerce a factor to a number with as.numeric(), R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.

    num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
    num_fac
    #> [1] 100 200 300
    #> Levels: 100 200 300
    as.numeric(num_fac)
    #> [1] 1 2 3

    To coerce the labels that you see to a new data type, first coerce the factor to a character string with as.character()

as.numeric(as.character(num_fac))
#> [1] 100 200 300

19.5.1.2 Interpretation

Add categorical variables to a model in the same way that you would add continuous variables.

s <- lm(income ~ sex, data = heights)
tidy(s)
#>          term estimate std.error statistic  p.value
#> 1 (Intercept)    39518       563      70.2 0.00e+00
#> 2   sexfemale   -11845       776     -15.3 1.02e-51

Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.

To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.

heights$sex <- factor(heights$sex, levels = c("male", "female"))
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
#>          term estimate std.error statistic   p.value
#> 1 (Intercept)   -53147      8915     -5.96  2.63e-09
#> 2      height      394       128      3.08  2.10e-03
#> 3   education     5067       143     35.54 1.12e-253
#> 4   sexfemale   -12150      1030    -11.79  8.55e-32
heights %>% 
  group_by(sex)  %>% 
  do(glance(lm(income ~ height, data = .)))
#> Source: local data frame [2 x 12]
#> Groups: sex [2]
#> 
#>      sex r.squared adj.r.squared sigma statistic  p.value    df logLik
#>   <fctr>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl>
#> 1   male   0.01163       0.01132 34992      37.6 9.74e-10     2 -37985
#> 2 female   0.00551       0.00523 28346      19.6 9.76e-06     2 -41315
#> # ... with 4 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#> #   df.residual <int>
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
#>                  term estimate std.error statistic   p.value
#> 1         (Intercept)   -62429      9084     -6.87  6.89e-12
#> 2              height      378       128      2.96  3.13e-03
#> 3           education     5878       214     27.50 6.66e-158
#> 4           sexfemale     6611      3830      1.73  8.44e-02
#> 5 education:sexfemale    -1443       284     -5.09  3.77e-07

19.5.2 Nested variables

Another case that occassionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:

students <- tibble::frame_data(
  ~student_id, ~school_id,
  1, 1,
  2, 1,
  1, 2,
  1, 3,
  2, 3,
  3, 3
)

The student id only makes sense in the context of the school: it doesn’t make sense to generate every combination of student and school. You can use nesting() for this case:

students %>% expand(nesting(school_id, student_id))
#> # A tibble: 6 x 2
#>   school_id student_id
#>       <dbl>      <dbl>
#> 1         1          1
#> 2         1          2
#> 3         2          1
#> 4         3          1
#> 5         3          2
#> 6         3          3

19.5.3 Continuous

There appears to be a relationship between a person’s education and how poorly the model predicts their income. If we graph the model residuals against education above, we see that the more a person is educated, the worse the model underestimates their income:

But before we add a variable to our model, we need to do a little EDA + cleaning:

ggplot(heights, aes(education)) + geom_bar()
#> Warning: Removed 9 rows containing non-finite values (stat_count).

heights_ed <- heights %>% filter(education >= 12)
nrow(heights) / n
#> [1] 0.962

We could improve the model by adding education:

he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)

How can we visualise the results of this model? One way to think about it as a surface: we have a 2d grid of height and education, and point on that grid gets a predicted income.

grid <- heights_ed %>% 
  expand(height, education) %>% 
  gather_predictions(he1, he2)

ggplot(grid, aes(height, education, fill = pred)) + 
  geom_raster() +
  facet_wrap(~model)

It’s easier to see what’s going on in a line plot:

ggplot(grid, aes(height, pred, group = education)) + 
  geom_line() +
  facet_wrap(~model)

ggplot(grid, aes(education, pred, group = height)) + 
  geom_line() +
  facet_wrap(~model)

One of the big advantages to + instead of * is that because the terms are independent we display them using two simple plots instead of one complex plot:

heights_ed %>% 
  expand(
    height = seq_range(height, 10), 
    education = mean(education, na.rm = TRUE)
  ) %>% 
  add_predictions(he1, "income") %>% 
  ggplot(aes(height, income)) + 
    geom_line()


heights_ed %>% 
  expand(
    height = mean(height, na.rm = TRUE), 
    education = seq_range(education, 10)
  ) %>% 
  add_predictions(he1, "income") %>% 
  ggplot(aes(education, income)) + 
    geom_line()

The full interaction suggests that height matters less as education increases. But which model is “better”? We’ll come back to that question later.

What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?

You can imagine that if you had a model with four continuous predictions all interacting, that it would be pretty complicated to understand what’s going in the model! And certainly you don’t have to - it’s totally fine to use a model simply as a tool for predicting new values, and in the next chapters you’ll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understand of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to undertand every last nuance of the model, but instead to understand more than what you did previously.

condvis.

19.5.4 Splines

But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:

ggplot(heights_ed, aes(education, income)) + 
  geom_boxplot(aes(group = education)) +
  geom_smooth(se = FALSE)
#> Warning: Computation failed in `stat_smooth()`:
#> x has insufficient unique values to support 10 knots: reduce k.

One way to introduce non-linearity into our model is to use transformed variants of the predictors.

mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)

heights_ed %>% 
  expand(education) %>% 
  gather_predictions(mod_e1, mod_e2) %>% 
  ggplot(aes(education, pred, colour = model)) +
    geom_point() + 
    geom_line()

This is a bit clunky because we have to surround each transformation with I(). This is because the rules of model algebra are a little different to usual algebra. x ^ 2 is equivalent to x * x which in the modelling algebra is equivalent to x + x + x:x which is the same as x. This is useful because (x + y + z)^2 fit all all major terms and second order interactions of x, y, and z.

mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)

heights_ed %>% 
  expand(education) %>% 
  gather_predictions(mod_e1, mod_e2, mod_e3) %>% 
  ggplot(aes(education, pred, colour = model)) +
    geom_point() + 
    geom_line()

However: there’s one major problem with using poly(): outside the range of the data, polynomials are going to rapidly shoot off to positive or negative infinity.

tibble(education = seq(5, 25)) %>% 
  gather_predictions(mod_e1, mod_e2, mod_e3) %>% 
  ggplot(aes(education, pred, colour = model)) +
    geom_line()

Splines avoid this problem by linearly interpolating outside the range of the data. This isn’t great either, but it’s a safer default when you don’t know for sure what’s going to happen.

library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)

tibble(education = seq(5, 25)) %>% 
  gather_predictions(mod_e1, mod_e2, mod_e3) %>% 
  ggplot(aes(education, pred, colour = model)) +
    geom_line()
#> Warning in predict.lm(model, data): prediction from a rank-deficient fit
#> may be misleading

Other useful arguments to seq_range():

  • pretty = TRUE will generate a “pretty” sequence, i.e. something that looks nice to the human eye. This is useful if you want to produce tables of output:

    seq_range(c(0.0123, 0.923423), n = 5)
    #> [1] 0.0123 0.2401 0.4679 0.6956 0.9234
    seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
    #> [1] 0.0 0.2 0.4 0.6 0.8 1.0
  • trim = 0.1 will trim off 10% of the tail values. This is useful if the variables has an long tailed distribution and you want to focus on generating values near the center:

    x <- rcauchy(100)
    seq_range(x, n = 5)
    #> [1] -22.09 -13.61  -5.13   3.34  11.82
    seq_range(x, n = 5, trim = 0.10)
    #> [1] -5.343 -2.880 -0.417  2.046  4.510
    seq_range(x, n = 5, trim = 0.25)
    #> [1] -2.285 -1.296 -0.306  0.684  1.674
    seq_range(x, n = 5, trim = 0.50)
    #> [1] -1.5363 -1.0361 -0.5360 -0.0358  0.4643

19.5.5 Interpolation vs extrapolation

One danger with prediction plots is that it’s easy to make predictions that are far away from the original data. This is dangerous because it’s quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.

As the number of variables in your model grows … “the curse of dimensionality”: as the number of variables increases the average distance between points increases. That means most of the space is very sparse, and you have to rely on strong assumptions.

To help avoid this problem, it’s good practice to include “nearby” observed data points in any prediction plot. These help you see if you’re interpolating, making prediction “in between” existing data points, or extrapolating, making predictions about preivously unobserved slices of the data.

One way to do this is to use condvis::visualweight().

https://cran.rstudio.com/web/packages/condvis/

19.6 Response variations

19.6.1 Transformations

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point()

lm(log(price) ~ log(carat), data = diamonds)
#> 
#> Call:
#> lm(formula = log(price) ~ log(carat), data = diamonds)
#> 
#> Coefficients:
#> (Intercept)   log(carat)  
#>        8.45         1.68
# visualize model line

19.6.2 Genearlised linear models

So far the y variable of our models has been a continuous variable, income. You can use linear regression to model a categorical y variable by transforming y into a continuous variable with a link function. Then model fit a model to the results of the link function and use the link function to back transform and interpret the results.

The most common link function is the logit function, which transforms a bivariate y variable into a continuous range.

Use glm() to perform logistic regression in R.

she <- glm(sex ~ height + education, family = binomial(link = "logit"), data = heights)
tidy(she)
#>          term estimate std.error statistic  p.value
#> 1 (Intercept)   44.966    1.0865      41.4 0.00e+00
#> 2      height   -0.707    0.0166     -42.6 0.00e+00
#> 3   education    0.196    0.0154      12.7 8.76e-37

19.7 Other model families

19.7.1 Robust models

Iteratively re-fit the model down-weighting outlying points (points with high residuals).

19.7.2 Additive models

library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.
gam(income ~ s(education), data = heights)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> income ~ s(education)
#> 
#> Estimated degrees of freedom:
#> 6.83  total = 7.83 
#> 
#> GCV score: 8.91e+08

ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = gam, formula = y ~ s(x))
#> Warning: Removed 9 rows containing non-finite values (stat_smooth).
#> Warning: Removed 9 rows containing missing values (geom_point).

# Linear z
gam(y ~ s(x) + z, data = df)

# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)

# Smooth surface of x and z 
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)

19.7.3 Random forrests

19.8 Summary

We’ve avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.

There are other types of modeling algorithms; each provides a valid description of the data.

Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?