R for data science by Garrett Grolemund and Hadley Wickham


Data import

Overview

You can’t apply any of the tools you’ve applied so far to your own work, unless you can get your own data into R. In this chapter, you’ll learn how to import:

  • Flat files (like csv) with readr.
  • Database queries with DBI.
  • Data from web APIs with httr.
  • Binary file formats (like excel or sas), with haven and readxl.

The common link between all these packages is they all aim to take your data and turn it into a data frame in R, so you can tidy it and then analyse it.

Flat files

There are many ways to read flat files into R. If you’ve be using R for a while, you might be familiar with read.csv(), read.fwf() and friends. We’re not going to use these base functions. Instead we’re going to use read_csv(), read_fwf(), and friends from the readr package. Because:

  • These functions are typically much faster (~10x) than the base equivalents. Long run running jobs also have a progress bar, so you can see what’s happening. (If you’re looking for raw speed, try data.table::fread(), it’s slightly less flexible than readr, but can be twice as fast.)

  • They have more flexible parsers: they can read in dates, times, currencies, percentages, and more.

  • They fail to do some annoying things like converting character vectors to factors, munging the column headers to make sure they’re valid R variable names, and using row names.

  • They return objects with class tbl_df. As you saw in the dplyr chapter, this provides a nicer printing method, so it’s easier to work with large datasets.

  • They’re designed to be as reproducible as possible - this means that you sometimes need to supply a few more arguments when using them the first time, but they’ll definitely work on other peoples computers. The base R functions take a number of settings from system defaults, which means that code that works on your computer might not work on someone else’s.

Make sure you have the readr package (install.packages("readr")).

Most of readr’s functions are concerned with turning flat files into data frames:

  • read_csv() reads comma delimited files, read_csv2() reads semi-colon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with a user supplied delimiter.

  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.

  • read_log() reads Apache style logs. (But also check out webreadr which is built on top of read_log(), but provides many more helpful tools.)

readr also provides a number of functions for reading files off disk into simpler data structures:

  • read_file() reads an entire file into a single string.

  • read_lines() reads a file into a character vector with one element per line.

These might be useful for other programming tasks.

As well as reading data from disk, readr also provides tools for working with data frames and character vectors in R:

  • type_convert() applies the same parsing heuristics to the character columns in a data frame. You can override its choices using col_types.

For the rest of this chapter we’ll focus on read_csv(). If you understand how to use this function, it will be straightforward to your knowledge to all the other functions in readr.

Basics

The first two arguments of read_csv() are:

  • file: path (or URL) to the file you want to load. Readr can automatically decompress files ending in .zip, .gz, .bz2, and .xz. This can also be a literal csv file, which is useful for experimenting and creating reproducible examples.

  • col_names: column names. There are three options:

    • TRUE (the default), which reads column names from the first row of the file

    • FALSE numbers columns sequentially from X1 to Xn.

    • A character vector, used as column names. If these don’t match up with the columns in the data, you’ll get a warning message.

EXAMPLE

Column types

Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. This is fast, and fairly robust. If readr detects the wrong type of data, you’ll get warning messages. Readr prints out the first five, and you can access them all with problems():

EXAMPLE

Typically, you’ll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there are a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it’s easy to fix these problems using the col_type argument.

(Note that if you have a very large file, you might want to set n_max to 10,000 or 100,000. That will speed up iterations while you’re finding common problems)

Specifying the col_type looks like this:

read_csv("mypath.csv", col_types = col(
  x = col_integer(),
  treatment = col_character()
))

You can use the following types of columns

  • col_integer() (i) and col_double() (d) specify integer and doubles. col_logical() (l) parses TRUE, T, FALSE and F into a logical vector. col_character() (c) leaves strings as is.

  • col_number() (n) is a more flexible parsed for numbers embedded in other strings. It will look for the first number in a string, ignoring non-numeric prefixes and suffixes. It will also ignore the grouping mark specified by the locale (see below for more details).

  • col_factor() (f) allows you to load data directly into a factor if you know what the levels are.

  • col_skip() (_, -) completely ignores a column.

  • col_date() (D), col_datetime() (T) and col_time() (t) parse into dates, date times, and times as described below.

You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you’re happy with the default arguments):

read_csv("mypath.csv", col_types = cols(
  x = "i",
  treatment = "c"
))

(If you just have a few columns you supply a single string giving the type for each column: i__dc. See the documentation for more details. It’s not as easy to understand as the cols() specification, so I’m not going to describe it further here.)

By default, any column not mentioned in cols will be guessed. If you’d rather those columns are simply not read in, use cols_only(). In that case, you can use col_guess() (?) if you want to guess the type of a column.

Each col_XYZ() function also has a corresponding parse_XYZ() that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.

parse_integer(c("1", "2", "3"))
## [1] 1 2 3
parse_logical(c("TRUE", "FALSE", "NA"))
## [1]  TRUE FALSE    NA
parse_number(c("$1000", "20%", "3,000"))
## [1] 1000   20 3000
parse_number(c("$1000", "20%", "3,000"))
## [1] 1000   20 3000

Parsing occurs after leading and trailing whitespace has been removed (if not overridden with trim_ws = FALSE) and missing values listed in na have been removed:

parse_logical(c("TRUE ", " ."), na = ".")
## [1] TRUE   NA

Datetimes

Readr provides three options depending on where you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:

  • Date times: an ISO8601 date time.
  • Date: a year, optional separator, month, optional separator, day.
  • Time: an hour, optional colon, hour, optional colon, minute, optional colon, optional seconds, optional am/pm.
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_time("20:10:01")
## [1] 20:10:01

If these defaults don’t work for your data you can supply your own date time formats, built up of the following pieces:

  • Year: %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

  • Month: %m (2 digits), %b (abbreviated name), %B (full name).

  • Day: %d (2 digits), %e (optional leading space).

  • Hour: %H.

  • Minutes: %M.

  • Seconds: %S (integer seconds), %OS (partial seconds).

  • Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800). If you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is Eastern Standard Time!

  • AM/PM indicator: %p.

  • Non-digits: %. skips one non-digit character, %* skips any number of non-digits.

The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:

parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"

Then when you read in the data with read_csv() you can easily translate to the col_date() format.

International data

The goal of readr’s locales is to encapsulate the common options that vary between languages and different regions of the world. This includes:

  • Names of months and days, used when parsing dates.
  • The default time zones, used when parsing date times.
  • The character encoding, used when reading non-ASCII strings.
  • Default date and time formats, used when guessing column types.
  • The decimal and grouping marks, used when reading numbers.

Readr is designed to be independent of your current locale settings. This makes a bit more hassle in the short term, but makes it much much easier to share your code with others: if your readr code works locally, it will also work for everyone else in the world. The same is not true for base R code, since it often inherits defaults from your system settings. Just because data ingest code works for you doesn’t mean that it will work for someone else in another country.

The settings you are most like to need to change are:

  • The names of days and months:

    locale("fr")
    ## <locale>
    ## Numbers:  123,456.78
    ## Formats:  %Y%.%m%.%d / %H:%M
    ## Timezone: UTC
    ## Encoding: UTF-8
    ## <date_names>
    ## Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.),
    ##         jeudi (jeu.), vendredi (ven.), samedi (sam.)
    ## Months: janvier (janv.), février (févr.), mars (mars), avril (avr.), mai
    ##         (mai), juin (juin), juillet (juil.), août (août),
    ##         septembre (sept.), octobre (oct.), novembre (nov.),
    ##         décembre (déc.)
    ## AM/PM:  AM/PM
    locale("fr", asciify = TRUE)
    ## <locale>
    ## Numbers:  123,456.78
    ## Formats:  %Y%.%m%.%d / %H:%M
    ## Timezone: UTC
    ## Encoding: UTF-8
    ## <date_names>
    ## Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.),
    ##         jeudi (jeu.), vendredi (ven.), samedi (sam.)
    ## Months: janvier (janv.), fevrier (fevr.), mars (mars), avril (avr.), mai
    ##         (mai), juin (juin), juillet (juil.), aout (aout),
    ##         septembre (sept.), octobre (oct.), novembre (nov.),
    ##         decembre (dec.)
    ## AM/PM:  AM/PM
  • The character encoding used in the file. If you don’t know the encoding you can use guess_encoding(). It’s not perfect, but if you have a decent sample of text, it’s likely to be able to figure it out.

    Readr converts all strings into UTF-8 as this is safest to work with across platforms. (It’s also what every stringr operation does.)

Exercises

  • Parse these dates (incl. non-English examples).
  • Parse these example files.
  • Parse this fixed width file.

Databases

Web APIs

Binary files

Needs to discuss how data types in different languages are converted to R. Similarly for missing values.

Tibble diffs

data_frame() is a nice way to create data frames. It encapsulates best practices for data frames:

  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!).

    data.frame(x = letters) %>% sapply(class)
    ##        x 
    ## "factor"
    data_frame(x = letters) %>% sapply(class)
    ##           x 
    ## "character"

    This makes it easier to use with list-columns:

    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
    ## Source: local data frame [3 x 2]
    ## 
    ##       x         y
    ##   (int)     (chr)
    ## 1     1  <int[5]>
    ## 2     2 <int[10]>
    ## 3     3 <int[20]>

    List-columns are most commonly created by do(), but they can be useful to create by hand.

  • It never adjusts the names of variables:

    data.frame(`crazy name` = 1) %>% names()
    ## [1] "crazy.name"
    data_frame(`crazy name` = 1) %>% names()
    ## [1] "crazy name"
  • It evaluates its arguments lazily and sequentially:

    data_frame(x = 1:5, y = x ^ 2)
    ## Source: local data frame [5 x 2]
    ## 
    ##       x     y
    ##   (int) (dbl)
    ## 1     1     1
    ## 2     2     4
    ## 3     3     9
    ## 4     4    16
    ## 5     5    25
  • It adds the tbl_df() class to the output so that if you accidentally print a large data frame you only get the first few rows.

    data_frame(x = 1:5) %>% class()
    ## [1] "tbl_df"     "tbl"        "data.frame"
  • It changes the behaviour of [ to always return the same type of object: subsetting using [ always returns a tbl_df() object; subsetting using [[ always returns a column.

    You should be aware of one case where subsetting a tbl_df() object will produce a different result than a data.frame() object:

    df <- data.frame(a = 1:2, b = 1:2)
    str(df[, "a"])
    ##  int [1:2] 1 2
    tbldf <- tbl_df(df)
    str(tbldf[, "a"])
    ## Classes 'tbl_df' and 'data.frame':   2 obs. of  1 variable:
    ##  $ a: int  1 2
  • It never uses row.names(). The whole point of tidy data is to store variables in a consistent way. So it never stores a variable as special attribute.

  • It only recycles vectors of length 1. This is because recycling vectors of greater lengths is a frequent source of bugs.

Coercion

To complement data_frame(), dplyr provides as_data_frame() to coerce lists into data frames. It does two things:

  • It checks that the input list is valid for a data frame, i.e. that each element is named, is a 1d atomic vector or list, and all elements have the same length.

  • It sets the class and attributes of the list to make it behave like a data frame. This modification does not require a deep copy of the input list, so it’s very fast.

This is much simpler than as.data.frame(). It’s hard to explain precisely what as.data.frame() does, but it’s similar to do.call(cbind, lapply(x, data.frame)) - i.e. it coerces each component to a data frame and then cbinds() them all together. Consequently as_data_frame() is much faster than as.data.frame():

l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
## Unit: microseconds
##               expr      min       lq     mean    median       uq      max
##  as_data_frame(l2)  111.201  121.337  131.083  127.4815  133.550  177.518
##  as.data.frame(l2) 1525.587 1567.716 1694.285 1602.3225 1668.065 2896.099
##  neval
##    100
##    100

The speed of as.data.frame() is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.

tbl_dfs vs data.frames

There are three key differences between tbl_dfs and data.frames:

  • When you print a tbl_df, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type:

    data_frame(x = 1:1000)
    ## Source: local data frame [1,000 x 1]
    ## 
    ##        x
    ##    (int)
    ## 1      1
    ## 2      2
    ## 3      3
    ## 4      4
    ## 5      5
    ## 6      6
    ## 7      7
    ## 8      8
    ## 9      9
    ## 10    10
    ## ..   ...

    You can control the default appearance with options:

    • options(dplyr.print_max = n, dplyr.print_min = m): if more than m rows print m rows. Use options(dplyr.print_max = Inf) to always show all rows.

    • options(dplyr.width = Inf) will always print all columns, regardless of the width of the screen.

  • When you subset a tbl_df with [, it always returns another tbl_df. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a single column:

    df1 <- data.frame(x = 1:3, y = 3:1)
    class(df1[, 1:2])
    ## [1] "data.frame"
    class(df1[, 1])
    ## [1] "integer"
    df2 <- data_frame(x = 1:3, y = 3:1)
    class(df2[, 1:2])
    ## [1] "tbl_df"     "data.frame"
    class(df2[, 1])
    ## [1] "tbl_df"     "data.frame"

    To extract a single column it’s use [[ or $:

    class(df2[[1]])
    ## [1] "integer"
    class(df2$x)
    ## [1] "integer"
  • When you extract a variable with $, tbl_dfs never do partial matching. They’ll throw an error if the column doesn’t exist:

    df <- data.frame(abc = 1)
    df$a
    ## [1] 1
    df2 <- data_frame(abc = 1)
    df2$a
    ## [1] 1