Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of “R for Data Science” is to introduce you to the most important in R tools that you need to do data science. After reading this book, you’ll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
What you will learn
Data science is a huge field, and there’s no way you can master it by reading a single book. The goal of this book is to give you a solid foundation with the most important tools. Our model of the tools needed in a typical data science project looks something like this:
First you must import your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!
Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a standard form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Working with tidy data is important because the consistency lets you spend your time struggling with your questions, not fighting to get data into the right form for different functions.
Once you have tidy data, a common first step is to transform it. You may zero in on a subset of data, add new variables that are functions of existing variables, calculate a set of summary statistics, or sort your data according to values.
There are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times. For example, you might see a scatterplot that inspires you to fit a linear model. Then you transform the data to add a column of residuals from the model, and look at another scatterplot, this time of the residuals.
Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions of the data. A good visualisation might also hint that you’re asking the wrong question and you need to refine your thinking. In short, visualisations can surprise you. However, visualisations don’t scale particularly well.
Models are the complementary tools to visualisation. Models are a fundamentally mathematical or computation tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model can not fundamentally surprise you.
The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well models and visualisation have led you to understand the data, unless you can commmunicate your results to other people.
There’s one important toolset that’s not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you automate common tasks, and solve new problems with greater ease.
You’ll use these tools in every data science project, but for most projects they’re not enough. There’s a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools that we’ll teach you, but you’ll need more to tackle the remaining 20%. Throughout this book we’ll point you to resources where you can learn more.
How you will learn
The above description of the tools of data science was organised roughly around the order in which you use them in analysis (although of course you’ll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s horrendously frustrating. Instead, we’ll start with visualisation and transformation on data that’s already been imported and tidied. That way, when you ingest and tidy your own data, you’ll be able to keep your motivation high because you know the pain is worth it because of what you can accomplish once its done.
Some topics are best explained with other tools. For example, we believe that it’s easier to understand how models work as a tool for data science if you already know about visualisation, data transformation, and tidy data.
Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems. We’ll give you a selection of programming tools in the middle of the book, and then finish off by showing how they combine with the key data science tools to tackle interesting problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. It’s tempting to skip these, but there’s no better way to learn than practicing.
What you won’t learn
There are some important topics that this book doesn’t cover. We believe it’s important to stay ruthlessly focussed on the essentials so you can get up and running as quickly as possible. That means this book can’t cover every important topic.
This book proudly focusses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you’re routinely working larger data (10-100 Gb, say), you should learn more about data.table. We don’t teach data.table here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you’re working with large data, the performance payoff is worth a little extra effort to learn it.
Many big data problems are often small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It’s often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you’re interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We’ll touch on this idea in transform.
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarassingly parallel), so you just need a system (like hadoop) that allows you to send different datasets to different computers for processing. Once you’ve figured out to how answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
In this book, you won’t learn anything about Python, Julia, or any other programming language useful for data science. This isn’t because we think these tools are bad. They’re not! And in practice, most data science teams use a mix of languages, often at least R and Python.
However, we strongly believe that it’s best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn’t mean you should only know one thing, just that you’ll generally learn faster if you stick to one thing at a time.
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There are lots of data sets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they’re a great place to start your data analysis journey.
Formal Statistics and Machine Learning
This book focusses on practical tools for understanding your data: visualization, modelling, and transformation. You can develop your understanding further by learning probability theory, statistical hypothesis testing, and machine learning methods; but we won’t teach you those things here. There are many books that cover these topics, but few that integrate the other parts of the data science process. When you are ready, you can and should read books devoted to each of these topics. We recommend Statistical Modeling: A Fresh Approach by Danny Kaplan; An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani; and Applied Predictive Modeling by Kuhn and Johnson.
We’ve made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some programming experience already. If you’ve never programmed before, you might find Hands on Programming with R by Garrett to be a useful adjunct to this book.
To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
- Download R and install R, https://www.r-project.org/alt-home/.
- Download and install RStudio, http://www.rstudio.com/download.
- Install needed packages (see below).
RStudio is an integated development environment, or IDE, for R programming. There are three key regions:
You run R code in the console pane. Textual output appears inline, and graphical output appears in the output pane. You write more complex R scripts in the editor pane.
There are three keyboard shortcuts for the RStudio IDE that we strongly encourage that you learn because they’ll save you so much time:
Cmd + Enter: sends current line (or current selection) from the editor to the console and runs it. (Ctrl + Enter on a PC)
Tab: suggest possible completions for the text you’ve typed.
Cmd + ↑: in the console, searches all commands you’ve typed that start with those characters. (Ctrl + ↑ on a PC)
If you want to see a list of all keyboard shortcuts, use the meta keyboard shortcut Alt + Shift + K: that’s the keyboard shortcut to show all the other keyboard shortcuts.
We strongly recommend making two changes to the default RStudio options:
This ensures that every time you restart RStudio you get a completely clean slate. This is good pratice because it encourages you to capture all important interactions in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of important calculation in your workspace, not the calculation itself in your code. During a project, it’s good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
You’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install all the packages used in this book open RStudio and run:
pkgs <- c( "bookdown", "broom", "dplyr", "ggplot2", "jpeg", "jsonlite", "knitr", "microbenchmark", "png", "pryr", "purrr", "readr", "stringr", "tidyr" ) install.packages(pkgs)
R will download the packages from CRAN and install them in your system library. If you have problems installing, make sure that you are connected to the internet, and that you haven’t blocked http://cran.r-project.org in your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it with
library(). After you have downloaded the packages, you can load any of the packages into your current R session with the
library() command, e.g.
You will need to reload the package every time you start a new R session.
Google. Always a great place to start! Adding “R” to a query is usually enough to filter it down. If you ever hit an error message that you don’t know how to handle, it is a great idea to google it.
If your operating system defaults to another language, you can use
Sys.setenv(LANGUAGE = "en")to tell R to use english. That’s likely to get you to common solutions more quickly.
Twitter. #rstats hashtag is very welcoming. Great way to keep up with what’s happening in the community.
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.
This book was built with:
devtools::session_info(pkgs) #> Session info -------------------------------------------------------------- #> setting value #> version R version 3.2.3 (2015-12-10) #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> tz <NA> #> date 2016-02-16 #> Packages ------------------------------------------------------------------ #> package * version date source #> assertthat 0.1 2013-12-06 CRAN (R 3.2.3) #> BH 1.60.0-1 2015-12-28 CRAN (R 3.2.3) #> bitops 1.0-6 2013-08-17 CRAN (R 3.2.3) #> bookdown * 0.1 2016-02-12 Github (hadley/bookdown@12ed348) #> broom 0.4.0 2015-11-30 CRAN (R 3.2.3) #> caTools 1.17.1 2014-09-10 CRAN (R 3.2.3) #> codetools 0.2-14 2015-07-15 CRAN (R 3.2.3) #> colorspace 1.2-6 2015-03-11 CRAN (R 3.2.3) #> curl 0.9.5 2016-01-23 CRAN (R 3.2.3) #> DBI 0.3.1 2014-09-24 CRAN (R 3.2.3) #> dichromat 2.0-0 2013-01-24 CRAN (R 3.2.3) #> digest 0.6.9 2016-01-08 CRAN (R 3.2.3) #> dplyr 0.4.3 2015-09-01 CRAN (R 3.2.3) #> evaluate 0.8 2015-09-18 CRAN (R 3.2.3) #> formatR 1.2.1 2015-09-18 CRAN (R 3.2.3) #> ggplot2 184.108.40.20601 2016-02-12 Github (hadley/ggplot2@025d198) #> gtable 0.1.2.9000 2016-02-12 Github (hadley/gtable@6c7b22c) #> highr 0.5.1 2015-09-18 CRAN (R 3.2.3) #> htmltools 0.3 2015-12-29 CRAN (R 3.2.3) #> jpeg 0.1-8 2014-01-23 CRAN (R 3.2.3) #> jsonlite 0.9.19 2015-11-28 CRAN (R 3.2.3) #> knitr 1.12.8 2016-02-12 Github (yihui/knitr@f95518b) #> labeling 0.3 2014-08-23 CRAN (R 3.2.3) #> lattice 0.20-33 2015-07-14 CRAN (R 3.2.3) #> lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.3) #> magrittr 1.5 2014-11-22 CRAN (R 3.2.3) #> markdown 0.7.7 2015-04-22 CRAN (R 3.2.3) #> MASS 7.3-45 2015-11-10 CRAN (R 3.2.3) #> microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.2.3) #> mime 0.4 2015-09-03 CRAN (R 3.2.3) #> mnormt 1.5-3 2015-05-25 CRAN (R 3.2.3) #> munsell 0.4.3 2016-02-13 CRAN (R 3.2.3) #> nlme 3.1-122 2015-08-19 CRAN (R 3.2.3) #> plyr 1.8.3 2015-06-12 CRAN (R 3.2.3) #> png 0.1-7 2013-12-03 CRAN (R 3.2.3) #> pryr 0.1.2 2015-06-20 CRAN (R 3.2.3) #> psych 1.5.8 2015-08-30 CRAN (R 3.2.3) #> purrr 0.2.1 2016-02-12 Github (hadley/purrr@da96161) #> R6 2.1.2 2016-01-26 CRAN (R 3.2.3) #> RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.3) #> Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3) #> readr 0.2.2 2015-10-22 CRAN (R 3.2.3) #> reshape2 1.4.1 2014-12-06 CRAN (R 3.2.3) #> RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.3) #> rmarkdown * 0.9.2 2016-01-01 CRAN (R 3.2.3) #> scales 0.3.0.9000 2016-02-12 Github (hadley/scales@ad60fbe) #> stringi 1.0-1 2015-10-22 CRAN (R 3.2.3) #> stringr 220.127.116.1100 2016-02-12 Github (hadley/stringr@a67f8f0) #> tidyr 0.4.1 2016-02-05 CRAN (R 3.2.3) #> yaml 2.1.13 2014-06-12 CRAN (R 3.2.3)