From 3f31579d7ac0cbade8c9fc684ea0130682160cd5 Mon Sep 17 00:00:00 2001
From: aaronshaw
Date: Mon, 28 Sep 2020 12:06:22 -0500
Subject: [PATCH] week 04 tutorial materials

---
 r_tutorials/w04-R_tutorial.html | 2016 +++++++++++++++++++++++++++++++
 r_tutorials/w04-R_tutorial.pdf  | Bin 0 -> 327534 bytes
 r_tutorials/w04-R_tutorial.rmd  |  422 +++++++
 3 files changed, 2438 insertions(+)
 create mode 100644 r_tutorials/w04-R_tutorial.html
 create mode 100644 r_tutorials/w04-R_tutorial.pdf
 create mode 100644 r_tutorials/w04-R_tutorial.rmd

diff --git a/r_tutorials/w04-R_tutorial.html b/r_tutorials/w04-R_tutorial.html
new file mode 100644
index 0000000..9a5cb2e
--- /dev/null
+++ b/r_tutorials/w04-R_tutorial.html
@@ -0,0 +1,2016 @@
+Week 4 R tutorial
+ + + +
+
+
+
+
+ +
+ + + + + + + +

Welcome back! This week we’ll get into some more advanced fundamentals to help you import and tidy/manage data, define and run your own functions, generate distributions and samples, and manage dates. Some of these topics are more useful for future problem sets (e.g., the stuff about distributions and dates), but I’ve included them here to start introducing the ideas.

+
+

1 Importing (yet more) data

+

So far, we have used the data() function to import datasets that come installed with R packages and the load() function to read R data files (.RData, .rda, etc.). For better and worse, you should be able to import data in other formats too.

+

Before I get any further, I want to note that my approach here is to use R commands that you can type directly into RMarkdown scripts or run at the console yourself. RStudio also provides a number of handy data import tools through the graphical interface and drop-down menus. This how-to article introduces some of these resources.

+

Tabular (rows and columns) data files formatted as plain text with “comma-separated values” (“.csv” files) are quite common, so we’ll look at those. R comes with a handy read.csv() command that does exactly what you’d expect. Here’s an example using a csv file I created from one of R’s built-in datasets called mtcars, which has old data about the fuel consumption of cars. Run help(mtcars) to learn more about where it comes from and to read the variable descriptions. Since it’s built-in, you can import it using data(mtcars), but I also posted it to the course data repository so we can use the url() command to point read.csv() to download it:

+
data.url <- url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv")
+
+my.mtcars <- read.csv(data.url)
+
+head(my.mtcars)
+
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
+## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+

As always, take a look at the help documentation for read.csv() to learn more about some of the arguments that you can use. Because data comes in so many (weird) formats, there are many possible arguments!
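To make a couple of those arguments concrete, here is a minimal sketch that reads a small made-up csv supplied inline through the `text` argument. The toy data and the object names `toy.csv` and `toy` are assumptions for illustration only, not part of the course data. It uses `na.strings` to mark a custom missing-value code and `stringsAsFactors` to keep text columns as character strings:

```r
# A tiny, made-up csv supplied as a character string instead of a file
toy.csv <- "name,score,group
alice,10,a
bob,-99,b
carol,7,a"

# na.strings: treat -99 as missing; stringsAsFactors: keep text as character
toy <- read.csv(text = toy.csv, na.strings = "-99", stringsAsFactors = FALSE)

toy
sapply(toy, class)
```

Supplying the data through `text` is just a convenience for small examples; with a real file you would pass the file path or URL as the first argument instead.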

+

You might notice that the documentation for read.csv() is actually part of the documentation for another command called read.table(), which also covers read.delim(). Turns out read.table() is a more general-purpose way to read in tabular data, and read.csv() and read.delim() are both short-hand for it with default values that make sense for comma-separated and tab-separated files, respectively. Here is a command that uses read.delim() with sep="," to produce identical output to the previous one:

+
more.cars <- read.delim(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/mtcars.csv"), sep=",")
+
+head(more.cars)
+
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
+## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
table(more.cars == my.mtcars)
+
## 
+## TRUE 
+##  384
+

When you find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you should use read.delim().
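As a rough sketch of the tab-separated case, the example below writes the built-in mtcars data to a temporary tab-separated file and reads it back. The temporary file and the object names `tsv.file` and `tab.cars` are assumptions made so the example runs on its own:

```r
# Write the built-in mtcars data to a temporary tab-separated file
tsv.file <- tempfile(fileext = ".tsv")
write.table(mtcars, file = tsv.file, sep = "\t", quote = FALSE)

# read.delim() expects tabs by default, so no sep argument is needed.
# The header written by write.table() has one fewer field than the data
# rows, so the first column is used for the row names.
tab.cars <- read.delim(tsv.file)
head(tab.cars)
```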

+

How will you know what to use? Get to know your data first! Seriously, try opening the file (or at least part of it) using a text editor and/or spreadsheet software. Looking at the “raw” plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.
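You can also peek at the raw text from inside R. The sketch below is one way to do that with readLines(); it writes a small csv to a temporary file first (an assumption made so the example is self-contained, along with the names `csv.file` and `raw.lines`) and then prints the first few lines without parsing them:

```r
# Write a small csv to a temporary file so the example is self-contained
csv.file <- tempfile(fileext = ".csv")
write.csv(head(mtcars), file = csv.file)

# Print the first three lines exactly as they appear in the file
raw.lines <- readLines(csv.file, n = 3)
raw.lines
```

Seeing the separators, quotes, and header row this way makes it easier to choose arguments such as `sep`, `header`, and `row.names`.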

+

For example, you might notice that my import of the mtcars.csv file introduces an important difference from the original mtcars dataset. In the original mtcars, the car model names are stored as the row.names attribute of the dataframe instead of as a variable.

+
head(mtcars)
+
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
+## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
row.names(mtcars)
+
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
+##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
+##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
+## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
+## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
+## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
+## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
+## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
+## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
+## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
+## [31] "Maserati Bora"       "Volvo 142E"
+

Since it’s the first column of the raw data, you can fix this with an additional argument to read.csv:

+
my.mtcars <- read.csv(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv"), 
+                      row.names=1)
+head(my.mtcars)
+
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
+## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
table(my.mtcars == mtcars)
+
## 
+## TRUE 
+##  352
+
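If the data has already been imported with the model names in a column, a similar fix can be applied after the fact. This is a minimal sketch, assuming a dataframe shaped like the first import (model names in a column called X); the object name `cars.x` is made up for the example:

```r
# Rebuild a version of mtcars that stores model names in a column called X,
# mimicking the result of the first import above
cars.x <- data.frame(X = row.names(mtcars), mtcars, row.names = NULL)

# Move the X column into the row names, then drop the column
row.names(cars.x) <- cars.x$X
cars.x$X <- NULL

head(cars.x)
```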

This illustrates a common issue that relates back to variable types (classes). Most of the commands in R that import data try to “guess” what class is appropriate for each column of your dataset. Surprise, surprise, these guesses are sometimes not so great and often quite different from what you might guess. As a result, it’s a great idea to inspect the classes of every column of a dataset after you import it (review last week’s R lecture materials for more on this).
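One way to carry out that inspection is sketched below. It round-trips mtcars through a temporary csv file (an assumption made so the example runs without a network connection; the names `csv.file` and `reimported` are hypothetical) and compares the class of each column before and after. Columns that contain only whole numbers will typically come back with a different class than in the original:

```r
# Round-trip mtcars through a temporary csv file
csv.file <- tempfile(fileext = ".csv")
write.csv(mtcars, file = csv.file)
reimported <- read.csv(csv.file, row.names = 1)

# Compare the class R assigned to each column with the original
data.frame(original = sapply(mtcars, class),
           reimported = sapply(reimported, class))

# One way to override the guess for a column after import
reimported$cyl <- as.numeric(reimported$cyl)
class(reimported$cyl)
```

Note that a comparison with `==` only checks values, so it will not reveal class differences like these.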

+
+

1.1 Importing proprietary data formats

+

R has libraries that can read (and write) many proprietary data file formats, including files from Stata, SAS, MS Excel, and SPSS (among others).
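As one hedged illustration, the sketch below uses the foreign package (a recommended package that ships with most R installations) to write mtcars to a temporary Stata file and read it back. The choice of foreign over alternatives is an assumption made for this example, as are the names `dta.file` and `stata.cars`; other packages may handle newer file versions better:

```r
# 'foreign' is a recommended package bundled with most R installations;
# skip quietly if it is unavailable
if (requireNamespace("foreign", quietly = TRUE)) {
  # Write mtcars to a temporary Stata (.dta) file, then read it back
  dta.file <- tempfile(fileext = ".dta")
  foreign::write.dta(mtcars, dta.file)
  stata.cars <- foreign::read.dta(dta.file)
  head(stata.cars)
}
```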

+

That same helpful RStudio how-to data import article includes several examples of these. You can find other suggestions and examples online.

+
+
+
+

2 Tidy and manage data

+

The last tutorial introduced the "*apply" family of functions (e.g., sapply, lapply, and tapply). This time around, I want to introduce an example of how to use them repeatedly over a set of objects.
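As a quick recap of the general pattern, here is a minimal sketch that nests tapply() inside sapply() to compute group means for several variables at once. It uses mtcars, and the choice of variables and the object names `vars` and `cond.means` are assumptions for illustration:

```r
# For each selected column, compute the mean within each cylinder group
vars <- c("mpg", "hp", "wt")
cond.means <- sapply(mtcars[vars], function(column) {
  tapply(column, mtcars$cyl, mean)
})

# Rows are cylinder groups, columns are variables
cond.means
```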

+
+

2.1 Conditional means with nested *apply functions

+

I used mtcars for this last time around, so let’s start off