20 Robust code

(This is an advanced topic. You shouldn’t worry too much about it when you first start writing functions. Instead you should focus on getting a function that works right for the easiest 80% of the problem. Then in time, you’ll learn how to get to 99% with minimal extra effort. The defaults in this book should steer you in the right direction: we avoid teaching you functions with major suprises.)

In this section you’ll learn an important principle that lends itself to reliable and readable code: favour code that can be understood with a minimum of context. On one extreme, take this code:

baz <- foo(bar, qux)

What does it do? You can glean only a little from the context: foo() is a function that takes (at least) two arguments, and it returns a result we store in baz. But apart from that, you have no idea. To understand what this function does, you need to read the definitions of foo(), bar, and qux. Using better variable names helps a lot:

df2 <- arrange(df, qux)

It’s now much easier to see what’s going on! Function and variable names are important because they tell you about (or at least jog your memory of) what the code does. That helps you understand code in isolation, even if you don’t completely understand all the details. Unfortunately naming things is hard, and its hard to give concrete advice apart from giving objects short but evocative names. As autocomplete in RStudio has gotten better, I’ve tended to use longer names that are more descriptive. Short names are faster to type, but you write code relatively infrequently compared to the number of times that you read it.

The idea of minimising the context needed to understand your code goes beyond just good naming. You also want to favour functions with predictable behaviour and few surprises. If a function does radically different things when its inputs differ slightly, you’ll need to carefully read the surrounding context in order to predict what it will do. The goal of this section is to educate you about the most common ways R functions can be surprising and to provide you with unsurprising alternatives.

There are three common classes of surprises in R:

  1. Unstable types: What will df[, x] return? You can assume that df is a data frame and x is a vector because of their names. But you don’t know whether this code will return a data frame or a vector because the behaviour of [ depends on the length of x.

  2. Non-standard evaluation: What will filter(df, x == y) do? It depends on whether x or y or both are variable in df or variables in the current environment.

  3. Hidden arguments: What sort of variable will data.frame(x = "a") create? It will be either a character vector or a factor depending on the value of the global stringsAsFactors option.

Avoiding these three types of functions helps you to write code that you is easily understand and fails obviously with unexpected input. If this behaviour is so important, why do any functions behave differently? It’s because R is not just a programming language, but it’s also an environment for interactive data analysis. Some things make sense for interactive use (where you quickly check the output and guessing what you want is ok) but don’t make sense for programming (where you want errors to arise as quickly as possible). You might notice that these issues revolve around data frames. That’s unfortunate because data frames are the data structure you’ll use most commonly. It’s ironic, the most frustrating things about programming in R are features that were originally designed to make your data analysis easier! Data frames try very hard to be helpful:

df <- data.frame(xy = c("x", "y"))
# Character vectors were hard to work with for a long time, so R
# helpfully converts to a factor for you:
class(df$xy)
#> [1] "factor"

# If you're only selecting a single column, R tries to be helpful
# and give you that column, rather than giving you a single column
# data frame
class(df[, "xy"])
#> [1] "factor"

# If you have long variable names, R is "helpful" and lets you select
# them with a unique prefix
df$x
#> [1] x y
#> Levels: x y

These features all made sense at the time they were added to R, but computing environments have changed a lot, and these features now tend to cause a lot of problems. dplyr disables them for you:

df <- dplyr::data_frame(xy = c("x", "y"))
class(df$xy)
#> [1] "character"
class(df[, "xy"])
#> [1] "tbl_df"     "data.frame"
df$x
#> [1] "x" "y"

20.0.1 Unpredictable types

One of the most frustrating for programming is they way [ returns a vector if the result has a single column, and returns a data frame otherwise. In other words, if you see code like df[x, ] you can’t predict what it will return without knowing the value of x. This can trip you up in surprising ways. For example, imagine you’ve written this function to return the last row of a data frame:

last_row <- function(df) {
  df[nrow(df), ]
}

It’s not always going to return a row! If you give it a single column data frame, it will return a single number:

df <- data.frame(x = 1:3)
last_row(df)
#> [1] 3

There are two ways to avoid this problem:

  • Use drop = FALSE: df[x, , drop = FALSE].
  • Subset the data frame like a list: df[x].

Using one of those techniques for last_row() makes it more predictable: you know it will always return a data frame.

last_row <- function(df) {
  df[nrow(df), , drop = FALSE]
}
last_row(df)
#>   x
#> 3 3

Another common cause of problems is the sapply() function. If you’ve never heard of it before, feel free to skip this bit: just remember to avoid it! The problem with sapply() is that it tries to guess what the simplest form of output is, and it always succeeds.

The following code shows how sapply() can produce three different types of data depending on the input.

df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)


df[1:4] %>% sapply(class) %>% str()
#> List of 4
#>  $ a: chr "integer"
#>  $ b: chr "numeric"
#>  $ y: chr [1:2] "POSIXct" "POSIXt"
#>  $ z: chr [1:2] "ordered" "factor"
df[1:2] %>% sapply(class) %>% str()
#>  Named chr [1:2] "integer" "numeric"
#>  - attr(*, "names")= chr [1:2] "a" "b"
df[3:4] %>% sapply(class) %>% str()
#>  chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "y" "z"

In the next chapter, you’ll learn about the purrr package which provides a variety of alternatives. In this case, you could use map_chr() which always returns a character vector: if it can’t, it will throw an error. Another option is the base vapply() function which takes a third argument indicating what the output should look like.

This doesn’t make sapply() bad and vapply() and map_chr() good. sapply() is nice because you can use it interactively without having to think about what f will return. 95% of the time it will do the right thing, and if it doesn’t you can quickly fix it. map_chr() is more important when your programming because a clear error message is more valuable when an operation is buried deep inside a tree of function calls. At this point its worth thinking more about

20.0.2 Non-standard evaluation

You’ve learned a number of functions that implement special lookup rules:

ggplot(mpg, aes(displ, cty)) + geom_point()
filter(mpg, displ > 10)

These are called “non-standard evaluation”, or NSE for short, because the usual lookup rules don’t apply. In both cases above neither displ nor cty are present in the global environment. Instead both ggplot2 and dplyr look for them first in a data frame. This is great for interactive use, but can cause problems inside a function because they’ll fall back to the global environment if the variable isn’t found.

[Talk a little bit about the standard scoping rules]

For example, take this function:

big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}

There are two ways in which this function can fail:

  1. df$x might not exist. There are two potential failure modes:

    big_x(mtcars, 10)
    #> Error in eval(expr, envir, enclos): object 'x' not found
    
    x <- 1
    big_x(mtcars, 10)
    #>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
    #> <0 rows> (or 0-length row.names)

    The second failure mode is particularly pernicious because it doesn’t throw an error, but instead silently returns an incorrect result. It works because by design filter() looks in both the data frame and the parent environment.

    It is unlikely that the variable you care about will both be missing where you expect it, and present where you don’t expect it. But I think it’s worth weighing heavily in your analysis of potential failure modes because it’s a failure that’s easy to miss (since it just silently gives a bad result), and hard to track down (since you need to read a lot of context).

  2. df$threshold might exist:

    df <- dplyr::data_frame(x = 1:10, threshold = 100)
    big_x(df, 5)
    #> Source: local data frame [0 x 2]
    #> 
    #> Variables not shown: x (int), threshold (dbl)

    Again, this is bad because it silently gives an unexpected result.

How can you avoid this problem? Currently, you need to do this:

big_x <- function(df, threshold) {
  if (!"x" %in% names(df)) 
    stop("`df` must contain variable called `x`.", call. = FALSE)
  
  if ("threshold" %in% names(df))
    stop("`df` must not contain variable called `threshold`.", call. = FALSE)
  
  dplyr::filter(df, x > threshold)
}

Because dplyr currently has no way to force a name to be interpreted as either a local or parent variable, as I’ve only just realised that’s really you should avoid NSE. In a future version you should be able to do:

big_x <- function(df, threshold) {
  dplyr::filter(df, local(x) > parent(threshold))
}

Another option is to implement it yourself using base subsetting:

big_x <- function(df, threshold) {
  rows <- df$x > threshold
  df[!is.na(rows) & rows, , drop = FALSE]
}

The challenge is remembering that filter() also drops missing values, and you also need to remember to use drop = FALSE!

20.0.3 Relying on global options

Functions are easiest to reason about if they have two properties:

  1. Their output only depends on their inputs.
  2. They don’t affect the outside world except through their return value.

The first property is particularly important. If a function has hidden additional inputs, it’s very difficult to even know where the important context is!

The biggest breaker of this rule in base R are functions that create data frames. Most of these functions have a stringsAsFactors argument that defaults to getOption("stringsAsFactors"). This means that a global option affects the operation of a very large number of functions, and you need to be aware that depending on an external state a function might produce either a character vector or a factor. In this book, we steer you away from that problem by recommnding functions like readr::read_csv() and dplyr::data_frame() that don’t rely on this option. But be aware of it! Generally if a function is affected by a global option, you should avoid setting it.

Only use options() to control side-effects of a function. The value of an option should never affect the return value of a function. There are only three violations of this rule in base R: stringsAsFactors, encoding, na.action. For example, base R lets you control the number of digits printed in default displays with (e.g.) options(digits = 3). This is a good use of an option because it’s something that people frequently want control over, but doesn’t affect the computation of a result, just its display. Follow this principle with your own use of options.

20.0.4 Trying too hard

Another class of problems is functions that try really really hard to always return a useful result. Unfortunately they try so hard that they never throw error messages so you never find out if the input is really really weird.

20.0.5 Exercises

  1. Look at the encoding argument to file(), url(), gzfile() etc. What is the default value? Why should you avoid setting the default value on a global level?