Welcome back! This week we’ll get into some more advanced fundamentals to help you import and tidy/manage data, define and run your own functions, generate distributions and samples, and manage dates. Some of these topics are more useful for future problem sets (e.g., the stuff about distributions and dates), but I’ve included them here to start introducing the ideas.

1 Importing (yet more) data

So far, we have imported datasets that come installed with R packages with the data() function as well as the load() function to read R data files (.RData, .rda, etc.). For better and worse, you should be able to import data in other formats too.

Before I get any further, I want to note that my approach here is to use R commands directly that you can type in RMarkdown scripts or run at the console yourself. RStudio also provides a number of handy data import tools through the graphical interface and drop-down menus. This how-to article introduces some of these resources.

Tabular (rows and columns) data files formatted as plain text with “comma-separated values” (“.csv’s”) are quite common, so we’ll look at those. R comes with a handy read.csv() command that does exactly what you’d expect. Here’s an example using a csv file I created from one of R’s built-in datasets called mtcars, which has old data about fuel consumption cars. Run help(mtcars) to learn more about where it comes from and to read the variable descriptions. Since it’s built-in, you can import it using data(mtcars), but I also posted it to the course data repository so we can use the url() command to point read.csv() to download it:

data.url <- url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv")

my.mtcars <- read.csv(data.url)

head(my.mtcars)
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

As always, take a look at the help documentation for read.csv() to learn more about some of the arguments that you can use. Because data comes in so many (weird) formats, there are many possible arguments!

You might notice that the documentation for read.csv() is actually part of the documentation for another command called read.delim(). Turns out read.delim() is just a more general-purpose way to read in tabular data and that read.csv() is short-hand for read.delim() with some default values that make sense for csv files. Here is a command that produces identical output to the previous one

more.cars <- read.delim(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/mtcars.csv"), sep=",")

head(more.cars)
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
table(more.cars == my.mtcars)
## 
## TRUE 
##  384

When find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you may want to use read.delim().

Note that the Tidyverse has some related functions, namely read_csv() that can also import csv’s very efficiently and with helpful defaults to try and guess what kinds of variables you’re working with. The guesses are usually pretty good! Here’s an example using read_csv()

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
yet.more.cars <- read_csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/mtcars.csv"))
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   mpg = col_double(),
##   cyl = col_double(),
##   disp = col_double(),
##   hp = col_double(),
##   drat = col_double(),
##   wt = col_double(),
##   qsec = col_double(),
##   vs = col_double(),
##   am = col_double(),
##   gear = col_double(),
##   carb = col_double()
## )
yet.more.cars
## # A tibble: 32 x 12
##    X1            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
table(yet.more.cars == my.mtcars)
## 
## TRUE 
##  384

How will you know what to use? The best practice is always to get to know your data first! Seriously, try opening it the file (or at least opening up part of it) using a text editor and/or spreadsheet software. Looking at the “raw” plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.

For example, you might notice that my import of the mtcars.csv file introduces an important difference from the original mtcars dataset. In the original mtcars, the car model names are row.names attributes of the dataframe instead of a variable.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
row.names(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

Since it’s the first column of the raw data, you can fix this with an additional argument to read.csv (I’m sure there’s also a way to do this with read_csv() but I didn’t look it up in time to include in this tutorial):

my.mtcars <- read.csv(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv"), 
                      row.names=1)
head(my.mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
table(my.mtcars == mtcars)
## 
## TRUE 
##  352

This illustrates a common issue that relates back to variable types (classes). Most of the commands in R that import data try to “guess” what class is appropriate for each column of your dataset. Surprise, surprise, these guesses are sometimes not so great and often quite different from what you might guess. As a result, it’s a great idea to inspect the classes of every column of a dataset after you import it (review last week’s R lecture materials for more on this).

1.1 Importing proprietary data format