17 Data structures

Often, when you write a function it will work with a single vector (or a handful of vectors), rather than a data frame. So far we’ve focussed on tools, like dplyr, that work with data frames, and have talked little about vectors. Now it’s time to dive deep and learn how you can work with vectors to build your own functions to automate common problems.

There are two types of vectors:

  1. Atomic vectors, which are further broken down into six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors.

  2. Lists, which are sometimes called recursive vectors, because lists can contain other lists. This is the chief difference between atomic vectors and lists.

There’s a somewhat related object: NULL. It’s often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
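As a small sketch of that distinction, using only base R: NULL has length 0 and disappears when combined with other values, whereas NA occupies a position in the vector.

```r
# NULL is the absence of a vector; NA is a missing value inside a vector
typeof(NULL)
#> [1] "NULL"
length(NULL)
#> [1] 0
length(NA)
#> [1] 1

# NULL vanishes when combined; NA is kept as an element
length(c(1, NULL, 3))
#> [1] 2
length(c(1, NA, 3))
#> [1] 3
```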

The structure of the vector types is summarised in the following diagram:

Every vector has two key properties:

  1. Its type, which you can determine with typeof().

    typeof(letters)
    #> [1] "character"
    typeof(1:10)
    #> [1] "integer"

  2. Its length, which you can determine with length().

    x <- list("a", "b", 1:10)
    length(x)
    #> [1] 3

Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors which build on additional behaviour. There are four important types of augmented vector:

  • Factors are built on top of integers.
  • Dates and date times (POSIXct) are built on top of numeric vectors, usually doubles.
  • Data frames and tibbles are built on top of lists.

This chapter will introduce you to these important vectors from simplest to most complicated. You’ll start with atomic vectors, then build up to lists, and finally learn about augmented vectors.

17.1 Types of atomic vector

The four most important types of atomic vector are logical, integer, double, and character. Raw and complex are rarely used during a data analysis, so I don’t discuss them here.

Each type of atomic vector has its own missing value:

NA            # logical
#> [1] NA
NA_integer_   # integer
#> [1] NA
NA_real_      # double
#> [1] NA
NA_character_ # character
#> [1] NA

Normally, you don’t need to know about these different types because you can always use NA and it will be converted to the correct type. However, there are some functions that are strict about their inputs, so it’s useful to have this knowledge sitting in your back pocket so you can use a specific type of missing value when needed.
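A quick way to see both points is to inspect the types directly. This is only an illustrative sketch in base R:

```r
# Each missing value has its own type
typeof(NA)
#> [1] "logical"
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

# A plain NA is coerced to match the rest of the vector
typeof(c(NA, 1L))
#> [1] "integer"
typeof(c(NA, "a"))
#> [1] "character"
```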

Note that R does not have “scalars”. In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That’s why, for example, this code works:

1:10 + 2:11
#>  [1]  3  5  7  9 11 13 15 17 19 21

In R, basic mathematical operations work with vectors, not scalars like in most programming languages. This means that you should never need to write an explicit for loop when performing simple computations on vectors.
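To make the comparison concrete, here is a sketch that computes the same sum twice: once with a vectorised expression and once with an explicit loop. The loop is shown only for contrast; the variable names are mine.

```r
x <- 1:10
y <- 2:11

# Vectorised: one expression operates on every element
vectorised <- x + y

# The equivalent explicit loop, shown only for comparison
looped <- integer(length(x))
for (i in seq_along(x)) {
  looped[[i]] <- x[[i]] + y[[i]]
}

identical(vectorised, looped)
#> [1] TRUE
```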

17.1.1 Logical

Logical vectors are the simplest type of atomic vector because they can take only three possible values: FALSE, TRUE, and NA. Logical vectors are usually constructed with comparison operators, as described in comparisons. You can also create them by hand with c():

c(TRUE, TRUE, FALSE, NA)
#> [1]  TRUE  TRUE FALSE    NA

17.1.2 Numeric

Integer and double vectors are known collectively as numeric vectors and most of the time the distinction is not important, so we’ll discuss them together.

In R, numbers are doubles by default. To make an integer, place a L after the number:

typeof(1)
#> [1] "double"
typeof(1L)
#> [1] "integer"

There are two important differences between integers and doubles: doubles are approximations, and they have three extra special values.

Doubles represent floating point numbers that cannot always be precisely represented with a fixed amount of memory. This means that you should consider all doubles to be approximations, and you should never test them for exact equality. For example, what is the square of the square root of two?

x <- sqrt(2) ^ 2
x
#> [1] 2

It certainly looks like we get what we expect: 2. But things are not exactly as they seem:

x == 2
#> [1] FALSE
x - 2
#> [1] 4.44e-16

This behaviour is common when working with floating point numbers: most calculations include some approximation error. Instead of comparing floating point numbers using ==, you should use dplyr::near() which allows for some numerical tolerance.

dplyr::near(x, 2)
#> [1] TRUE
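The general idea behind a tolerance-based comparison can be sketched in base R. Note that approx_equal() is a hypothetical helper written for this illustration, and its default tolerance is an assumption; it is not dplyr's implementation.

```r
# Hypothetical helper: TRUE where x and y differ by less than `tol`
approx_equal <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

x <- sqrt(2) ^ 2
x == 2
#> [1] FALSE
approx_equal(x, 2)
#> [1] TRUE

# Base R alternative: all.equal() also compares with a tolerance
isTRUE(all.equal(x, 2))
#> [1] TRUE
```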

Doubles also have three special values in addition to NA:

c(-1, 0, 1) / 0
#> [1] -Inf  NaN  Inf

Avoid using == to check for these other special values. Instead use the helper functions is.finite(), is.infinite(), and is.nan():

                0    Inf   NA    NaN
is.finite()     x
is.infinite()        x
is.na()                    x     x
is.nan()                         x

Note that is.finite(x) is not the same as !is.infinite(x).
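You can check this for yourself by applying each helper to a vector that holds one value of each kind:

```r
x <- c(0, Inf, NA, NaN)
is.finite(x)
#> [1]  TRUE FALSE FALSE FALSE
is.infinite(x)
#> [1] FALSE  TRUE FALSE FALSE
is.na(x)
#> [1] FALSE FALSE  TRUE  TRUE
is.nan(x)
#> [1] FALSE FALSE FALSE  TRUE

# NA and NaN are neither finite nor infinite, so the two tests disagree
!is.infinite(x)
#> [1]  TRUE FALSE  TRUE  TRUE
```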

17.1.3 Character

Character vectors are the most complex of atomic vectors, because each element of a character vector is a string, and a string can contain an arbitrary amount of data. Strings are such an important data type, they have their own chapter: strings.

Here I want to mention one important feature of the underlying string implementation: it uses a global string pool. This means that each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings.

You can see this behaviour in practice by using pryr::object_size():

x <- "This is a reasonably long string."
pryr::object_size(x)
#> 136 B

y <- rep(x, 1000)
pryr::object_size(y)
#> 8.13 kB

y doesn’t take up 1,000x as much memory as x, because each element of y is just a pointer to that same string. A pointer is 8 bytes, so 1000 pointers to a 136 B string is 8 * 1000 + 136 = 8.13 kB.

17.1.4 Exercises

  1. Read the source code for dplyr::near(). How does it work?

  2. A logical vector can take 3 possible values. How many possible values can an integer vector take?

  3. List four functions that allow you to convert a double to an integer. How do they differ?

  4. What functions from the readr package allow you to turn a string into a logical, integer, or double vector?

17.2 Using atomic vectors

Now that you understand the different types of atomic vector, it’s useful to review some of the important tools for working with them:

  1. The coercion rules
  2. Testing if an input is of a given type
  3. How to create named vectors.
  4. Subsetting a vector to pull out elements of interest.

17.2.1 Coercion

There are two ways to convert, or coerce, one type of vector to another:

  1. Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.

  2. Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.
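For reference, here is a brief sketch of what the explicit coercion functions do with a few typical inputs. The inputs are arbitrary examples of mine:

```r
as.integer("12")
#> [1] 12
as.integer(3.9)      # truncates towards zero rather than rounding
#> [1] 3
as.logical(c("TRUE", "false", "yes"))
#> [1]  TRUE FALSE    NA
as.character(1.5)
#> [1] "1.5"

# Values that cannot be converted become NA (with a warning)
suppressWarnings(as.double("abc"))
#> [1] NA
```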

Because explicit coercion is used relatively rarely, it’s more important to understand implicit coercion. The most important implicit coercion is logical to numeric. When used in a numeric context, TRUE is converted to 1 and FALSE is converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues.

x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)  # how many are greater than 10?
#> [1] 45
mean(y) # what proportion are greater than 10?
#> [1] 0.45

It’s also important to understand what happens when you try and create a vector containing multiple types with c(): the most complex type always wins. The type is a property of the complete vector, not the individual elements, so there’s no way to have an atomic vector which is a mix of different types. If you need to mix multiple types in the same vector, you should use a list, which you’ll learn about shortly.

str(c(TRUE, 1L))
#>  int [1:2] 1 1
str(c(1L, 1.5))
#>  num [1:2] 1 1.5
str(c(1.5, "a"))
#>  chr [1:2] "1.5" "a"

17.2.2 Test functions

It’s also useful to be able to test what type of thing you have in an unknown object. Base R provides many functions like is.vector() and is.atomic(), but they often don’t do what you expect. Instead, it’s safer to use the is_* functions provided by purrr, which are summarised in the table below.

                lgl  int  dbl  chr  list
is_logical()    x
is_integer()         x
is_double()               x
is_numeric()         x    x
is_character()                 x
is_atomic()     x    x    x    x
is_list()                           x
is_vector()     x    x    x    x    x

Each predicate also comes with a “scalar” version, which checks that the length is 1. This is useful if you want to check (for example) that the inputs to your function are as you expect.
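The same kind of input check can be sketched in base R. Here check_scalar_number() is a hypothetical helper written for illustration, not one of purrr's scalar predicates; it combines a type test with a length test.

```r
# Hypothetical helper: stop early unless `x` is a single, non-missing number
check_scalar_number <- function(x) {
  if (!is.numeric(x) || length(x) != 1 || is.na(x)) {
    stop("`x` must be a single number", call. = FALSE)
  }
  invisible(x)
}

check_scalar_number(3.5)      # passes silently
inherits(try(check_scalar_number(1:3), silent = TRUE), "try-error")
#> [1] TRUE
```

Putting a check like this at the top of a function means a bad input fails immediately, with a clear message, rather than somewhere deep inside the computation.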

17.2.3 Naming vectors

All types of vectors can be named. You can either name them during creation with c():

c(x = 1, y = 2, z = 4)
#> x y z 
#> 1 2 4

Or after the fact with purrr::set_names():

1:3 %>% set_names(c("a", "b", "c"))
#> a b c 
#> 1 2 3

Named vectors are most useful for subsetting, described next.
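If you prefer to stay within base R, the following sketch shows two equivalent ways to name an existing vector: assigning to names(), or calling setNames() from the stats package.

```r
x <- 1:3
names(x) <- c("a", "b", "c")   # set names on an existing vector
x
#> a b c 
#> 1 2 3
names(x)
#> [1] "a" "b" "c"

# setNames() from the stats package does the same in a single call
identical(setNames(1:3, c("a", "b", "c")), x)
#> [1] TRUE
```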

17.2.4 Subsetting

Before we continue on to a richer data structure, the list, we need to take a brief detour to talk about subsetting vectors. So far, we’ve focussed on data frames, which are most easily subset with dplyr::filter(). filter(), however, does not work with vectors, so we need to learn a new tool: [.

[ is the subsetting function, and is called like x[a]. We’re not going to cover data structures that are 2d or higher in detail, but the idea generalised to x[a, b], x[a, b, c] and so on. When working with individual vectors, it’s important to understand how [ works and how you can use it to extract elements of interest.

There are four types of thing you can use to subset a vector:

  1. The simplest type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but it is useful when subsetting matrices (and other high dimensional structures) because it lets you select all the rows or all the columns, by leaving that index blank.

  2. A numeric vector. If you subset with a numeric vector, it must either be all positive, all negative, or zero.

    Subsetting with a positive vector keeps the elements at those positions:

    x <- c("one", "two", "three", "four", "five")
    x[c(3, 2, 5)]
    #> [1] "three" "two"   "five"

    By repeating a position, you can actually make a longer output than input:

    x[c(1, 1, 5, 5, 5, 2)]
    #> [1] "one"  "one"  "five" "five" "five" "two"

    Negative values drop the elements at the specified positions:

    x[c(-1, -3, -5)]
    #> [1] "two"  "four"

    It’s an error to mix positive and negative values:

    x[c(1, -1)]
    #> Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

    The error message mentions subsetting with zero, which returns no values:

    x[0]
    #> character(0)

    This is not generally useful, but can be helpful if you want to create unusual data structures with which to test your functions.

  3. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with a function that creates a logical vector.

    x <- c(10, 3, NA, 5, 8, 1, NA)

    # All non-missing values of x
    x[!is.na(x)]
    #> [1] 10  3  5  8  1

    # All even (or missing!) values of x
    x[x %% 2 == 0]
    #> [1] 10 NA  8 NA

  4. If you have a named vector, you can subset it with a character vector.

    x <- c(abc = 1, def = 2, xyz = 5)
    x[c("xyz", "def")]
    #> xyz def 
    #>   5   2

    Like with positive integers, you can also use a character vector to duplicate individual entries.
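For instance, repeating a name in the index repeats the corresponding entry in the result:

```r
x <- c(abc = 1, def = 2, xyz = 5)
x[c("abc", "abc", "xyz")]
#> abc abc xyz 
#>   1   1   5
```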

I’d recommend reading http://adv-r.had.co.nz/Subsetting.html#applications to learn more about how you can use subsetting to achieve various goals. If you are working with data frames, you can typically use a dplyr function to achieve these goals, but the techniques are useful to know about when you are writing your own functions.

There is an important variation of [ called [[. [[ only ever extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting one thing, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.
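Here is a small sketch of the difference on a named atomic vector, followed by the for loop use mentioned above. The variable names are mine.

```r
x <- c(a = 1, b = 2, c = 3)
x[2]      # keeps the name
#> b 
#> 2
x[[2]]    # a single element, name dropped
#> [1] 2

# [[ makes it explicit that each iteration works with one element
total <- 0
for (i in seq_along(x)) {
  total <- total + x[[i]]
}
total
#> [1] 6
```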

17.2.5 Exercises

  1. Carefully read the documentation of is.vector(). What does it actually test for?

  2. Create functions that take a vector as input and return:

    1. The last value. Should you use [ or [[?

    2. The elements at even numbered positions.

    3. Every element except the last value.

  3. Why is x[-which(x > 0)] not the same as x[x <= 0]?

  4. What happens when you subset with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

17.3 Recursive vectors (lists)

Lists are fundamentally richer than atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():

x <- list(1, 2, 3)
str(x)
#> List of 3
#>  $ : num 1
#>  $ : num 2
#>  $ : num 3

x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
#> List of 3
#>  $ a: num 1
#>  $ b: num 2
#>  $ c: num 3

Unlike atomic vectors, list() can contain a mix of objects:

y <- list("a", 1L, 1.5, TRUE)
str(y)
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE

Lists can even contain other lists!

z <- list(list(1, 2), list(3, 4))
str(z)
#> List of 2
#>  $ :List of 2
#>   ..$ : num 1
#>   ..$ : num 2
#>  $ :List of 2
#>   ..$ : num 3
#>   ..$ : num 4

str() is very helpful when looking at lists because it focusses on the structure, not the contents.

17.3.1 Visualising lists

To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists. For example, take these three lists:

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

I’ll draw them as follows:

  • Lists are rounded rectangles that contain their children.

  • I draw each child a little darker than its parent to make it easier to see the hierarchy.

  • The orientation of the children (i.e. rows or columns) isn’t important, so I’ll pick a row or column orientation to either save space or illustrate an important property in the example.

17.3.2 Subsetting

There are three ways to subset a list, which I’ll illustrate with a:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

  • [ extracts a sub-list. The result will always be a list.

    str(a[1:2])
    #> List of 2
    #>  $ a: int [1:3] 1 2 3
    #>  $ b: chr "a string"
    str(a[4])
    #> List of 1
    #>  $ d:List of 2
    #>   ..$ : num -1
    #>   ..$ : num -5

    Like with vectors, you can subset with a logical, integer, or character vector.

  • [[ extracts a single component from a list. It removes a level of hierarchy from the list.

    str(a[[1]])
    #>  int [1:3] 1 2 3
    str(a[[4]])
    #> List of 2
    #>  $ : num -1
    #>  $ : num -5

  • $ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.

    a$a
    #> [1] 1 2 3
    a[["b"]]
    #> [1] "a string"

The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list. Compare the code and output above with the visual representation below.
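A compact way to see the difference in code is to compare what each operator returns for the fourth component of a. The list is redefined here so that the sketch is self-contained.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

typeof(a[4])     # still a list, holding the single component d
#> [1] "list"
length(a[4])
#> [1] 1
length(a[[4]])   # the component itself: a list of two numbers
#> [1] 2
a[[4]][[1]]      # drilling down one more level reaches a number
#> [1] -1
a[["d"]][[2]]
#> [1] -5
```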

17.3.3 Lists of condiments

It’s easy to get confused between [ and [[, but it’s important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:

If this pepper shaker is your list x, then, x[1] is a pepper shaker containing a single pepper packet:

x[2] would look the same, but would contain the second packet. x[1:2] would be a pepper shaker containing two pepper packets.

x[[1]] is:

If you wanted to get the content of the pepper packet, you’d need x[[1]][[1]]:

17.3.4 Exercises

  1. Draw the following lists as nested sets.

  2. Generate the lists corresponding to these nested set diagrams.

  3. What happens if you subset a data frame as if you’re subsetting a list? What are the key differences between a list and a data frame?

17.4 Augmented vectors

There are four important types of vector that are built on top of atomic vectors and lists: factors, dates, date times, and data frames. I call these augmented vectors, because they are vectors with additional attributes. Attributes are a way of adding arbitrary additional metadata to a vector. You can think of the attributes as a named list of values attached to the object. You can get and set individual attribute values with attr() or see them all at once with attributes().

x <- 1:10
attr(x, "greeting")
#> NULL
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
#> $greeting
#> [1] "Hi!"
#> 
#> $farewell
#> [1] "Bye!"

There are three very important attributes that are used to implement fundamental parts of R:

  • “names” are used to name the elements of a vector.
  • “dim” (short for dimensions) makes a vector behave like a matrix or array.
  • “class” is used to implement the S3 object oriented system.
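The first two of these are easy to try out. In this sketch, setting the dimension attribute turns a plain integer vector into a matrix, and naming a vector simply stores a names attribute:

```r
x <- 1:6
dim(x) <- c(2, 3)          # same as attr(x, "dim") <- c(2, 3)
is.matrix(x)
#> [1] TRUE
x
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

y <- c(a = 1, b = 2)
attributes(y)
#> $names
#> [1] "a" "b"
```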

Class is particularly important because it changes what generic functions do with the object. Generic functions are key to OO in R. Here’s what a typical generic function looks like:

as.Date
#> function (x, ...) 
#> UseMethod("as.Date")
#> <bytecode: 0x3c58c50>
#> <environment: namespace:base>

The call to “UseMethod” means that this is a generic function, and it will call a specific method, based on the class of the first argument. You can list all the methods for a generic with methods():

methods("as.Date")
#> [1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
#> [5] as.Date.factor    as.Date.numeric   as.Date.POSIXct   as.Date.POSIXlt  
#> see '?methods' for accessing help and source code

And you can see the specific implementation of a method with getS3method():

getS3method("as.Date", "default")
#> function (x, ...) 
#> {
#>     if (inherits(x, "Date")) 
#>         return(x)
#>     if (is.logical(x) && all(is.na(x))) 
#>         return(structure(as.numeric(x), class = "Date"))
#>     stop(gettextf("do not know how to convert '%s' to class %s", 
#>         deparse(substitute(x)), dQuote("Date")), domain = NA)
#> }
#> <bytecode: 0x3390fb0>
#> <environment: namespace:base>
getS3method("as.Date", "numeric")
#> function (x, origin, ...) 
#> {
#>     if (missing(origin)) 
#>         stop("'origin' must be supplied")
#>     as.Date(origin, ...) + x
#> }
#> <bytecode: 0x33613f0>
#> <environment: namespace:base>

The most important S3 generic is print(): it controls how the object is printed when you type its name on the console. Other important generics are the subsetting functions [, [[, and $.
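To see the dispatch mechanism at work, you can write a tiny generic of your own. In this sketch, describe() and its methods are hypothetical names invented for the example; they are not part of base R.

```r
# Hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) {
  UseMethod("describe")
}
describe.default <- function(x, ...) {
  paste("An object of type", typeof(x))
}
describe.factor <- function(x, ...) {
  paste("A factor with", nlevels(x), "levels")
}

describe(1:3)
#> [1] "An object of type integer"
describe(factor(c("a", "b")))
#> [1] "A factor with 2 levels"
```

An integer vector has no describe method of its own, so it falls through to describe.default(), while the factor is routed to describe.factor() because of its class attribute.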

A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at http://adv-r.had.co.nz/OO-essentials.html#s3.

17.4.1 Factors

Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "ab" "cd" "ef"
#> 
#> $class
#> [1] "factor"

Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dreaded stringsAsFactors argument). To get more historical context, you might want to read stringsAsFactors: An unauthorized biography by Roger Peng or stringsAsFactors = <sigh> by Thomas Lumley. The motivation for factors is the modelling context. If you’re going to fit a model to categorical data, you need to know in advance all the possible values. There’s no way to make a prediction for “green” if all you’ve ever seen is “red”, “blue”, and “yellow”.

The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a stringsAsFactors argument that you can set to FALSE. Otherwise, you can apply as.character() to the column to explicitly turn it back into a character vector.

x <- factor(letters[1:5])
is.factor(x)
#> [1] TRUE
as.factor(letters[1:5])
#> [1] a b c d e
#> Levels: a b c d e
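Converting a factor back to a character vector looks like the following sketch. It also shows a common trap: because a factor is built on integers, as.integer() returns the underlying codes rather than the labels.

```r
f <- factor(c("10", "20", "10"))
as.character(f)
#> [1] "10" "20" "10"

# as.integer() returns the underlying integer codes, not the labels
as.integer(f)
#> [1] 1 2 1

# To recover numbers stored as factor labels, go via character
as.numeric(as.character(f))
#> [1] 10 20 10
```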

17.4.2 Dates and date times

Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970.

x <- as.Date("1971-01-01")
unclass(x)
#> [1] 365

typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "Date"
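Because a date is just a number of days with a class attribute, you can move between the two representations and do arithmetic in days. A brief sketch:

```r
x <- as.Date("1971-01-01")
as.numeric(x)                        # days since 1970-01-01
#> [1] 365
as.Date(365, origin = "1970-01-01")  # and back again
#> [1] "1971-01-01"
x + 30                               # arithmetic works in days
#> [1] "1971-01-31"
```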

Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:

x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"

typeof(x)
#> [1] "double"
attributes(x)
#> $tzone
#> [1] "UTC"
#> 
#> $class
#> [1] "POSIXct" "POSIXt"

The tzone is optional, and only controls the way the date time is printed, not what it means.
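You can check this by changing the tzone attribute and comparing the underlying numbers. This sketch uses base R rather than lubridate, and assumes the "Asia/Tokyo" time zone name is available on your system:

```r
x <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
y <- x
attr(y, "tzone") <- "Asia/Tokyo"   # assumes this time zone name is available

# Same instant: the underlying number of seconds is unchanged
as.numeric(x) == as.numeric(y)
#> [1] TRUE

# Only the printed representation differs
format(x, usetz = TRUE)
format(y, usetz = TRUE)
```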

There is another type of date time called POSIXlt. These are built on top of named lists.

y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
#> $names
#> [1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"  "yday"  "isdst"
#> 
#> $class
#> [1] "POSIXlt" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

If you use the packages outlined in this book, you should never encounter a POSIXlt. They do crop up in base R, because they are used to extract specific components of a date (like the year or month). However, lubridate provides helpers for you to do this instead. POSIXct’s are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a POSIXct with as.POSIXct().
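The following base R sketch shows both sides: pulling components out of a POSIXlt, and converting it back to a POSIXct without changing the instant it represents.

```r
x <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
y <- as.POSIXlt(x)

y$hour           # components are stored as list elements
#> [1] 1
y$year + 1900    # years are counted from 1900
#> [1] 1970

z <- as.POSIXct(y)   # convert back to the simpler representation
typeof(z)
#> [1] "double"
identical(as.numeric(z), as.numeric(x))
#> [1] TRUE
```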

17.4.3 Data frames and tibbles

Data frames are augmented lists: they have class “data.frame”, and names (column) and row.names attributes:

df1 <- data.frame(x = 1:5, y = 5:1)
typeof(df1)
#> [1] "list"
attributes(df1)
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#> [1] 1 2 3 4 5
#> 
#> $class
#> [1] "data.frame"

The difference between a data frame and a list is that all the elements of a data frame must be the same length. All functions that work with data frames enforce this constraint.
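Both halves of that statement can be checked directly. In this sketch, list tools work on a data frame because it is a list of columns, and data.frame() refuses columns whose lengths are incompatible:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# A data frame is a list of columns, so list tools work on it
length(df)       # number of columns
#> [1] 2
df[["x"]]
#> [1] 1 2 3

# Columns of incompatible lengths are rejected
result <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(result, "try-error")
#> [1] TRUE
```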

In this book, we use tibbles, rather than data frames. Tibbles are identical to data frames, except that they have two additional components in the class:

df2 <- tibble::tibble(x = 1:5, y = 5:1)
typeof(df2)
#> [1] "list"
attributes(df2)
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#> [1] 1 2 3 4 5
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"

These extra components give tibbles the helpful behaviours described in the tibbles chapter.