r_tutorials/w03-R_tutorial.Rmd

   1 ---
   2 title: "Week 3 R Tutorial"
   3 author: "Aaron Shaw"
   4 date: "September 28, 2020"
   5 output:
   6   html_document:
   7     toc: yes
   8     toc_depth: 3
   9     theme: readable
  10   pdf_document:
  11     toc: yes
  12     toc_depth: '3'
  13 subtitle: "Statistics and Statistical Programming  \nNorthwestern University  \nMTS
  14   525"
  15 ---
  16
  17 ```{r setup, include=FALSE}
  18 knitr::opts_chunk$set(echo = TRUE)
  19 ```
  20
  21 ## More RStudio, RMarkdown, and R background + tips
  22
  23 Last week, I encouraged you to learn some basic things about RStudio, RMarkdown, and R in order to get started. All three have a vast (overwhelming?) number of available options, opportunities for customization, and documentation. While I think you'll be able to do most of what you need for the course with what I'm providing in these tutorials and/or links to other resources, you should absolutely explore further (I'll keep adding materials to the list of resources on the course wiki page). In the meantime, here are a few more things that I think are likely to be useful in the coming weeks of our course. Partly, I want to keep introducing other "features" of these tools that may not have been obvious. I also want to continue to explain some of the choices I've made in how I introduce you to these resources and encourage you to explore more broadly, cultivate your own preferences, and find ways to work in R that suit your goals, needs, and skill level.
  24
  25 ### Working in "Base R"
  26
  27 You may encounter a distinction between what people often call *Base R* and other tools or software packages. *Base R* usually refers to the syntax and functions that R employs without the addition of extra libraries or packages (or with libraries and packages that merely extend that syntax rather than develop an entirely different syntax). The most common syntax people currently use that is not Base R comes from a large family of packages called [the Tidyverse](https://www.tidyverse.org/). The Tidyverse is maintained and distributed by the same organization that maintains and distributes RStudio. The Tidyverse packages provide a suite of tools developed to facilitate data science and statistical analysis with a programming syntax (or "grammar" as the creators might prefer) designed to overcome some of the initial obstacles and idiosyncrasies of Base R.
  28
  29 For the purposes of our course, you might want to learn more about Tidyverse code and packages. You will almost certainly encounter example code snippets and suggestions around the web that assume you are familiar with the Tidyverse syntax and software. Indeed, those of you who are accustomed to working in Python or other "modern" programming languages may find it easier to work with the Tidyverse than with Base R. Books like those by Healy and by Wickham and Grolemund linked from the course wiki page forgo almost any introduction to Base R in favor of working almost exclusively within the Tidyverse.
  30
  31 All of this is a bit far out in the weeds, so let me skip to the key takeaway: I start off teaching you Base R and will not require you to learn the Tidyverse. That said, I strongly encourage you to embrace the Tidyverse, especially as you become familiar with some of the basic tasks and operations you can perform with Base R. I especially encourage you to become familiar with the `ggplot2` package as it produces visualizations superior to those you can make in Base R in almost every way. My rationale for starting with Base R is that even the Tidyverse packages and syntax operate *within* the R software environment and rely on the underlying functionalities of Base R. In that respect, starting with Base R provides you with a foundation that should allow you greater flexibility as you deepen your R skills. You will not be constrained to any particular software package within R.
  32
  33 ### Notebooks vs. scripts vs. other ways of developing R code
  34
  35 My first tutorial made at least one other leap/assumption worth noting. If you've taken other introductory courses or tutorials in data science or programming with R or Python, you may have interacted with other languages/tools via software "Notebooks." Notebooks are a type of software development environment that make it possible to work with text and execute code alongside each other. Sound familiar? It should! The R Markdown scripts I've introduced and suggested here are similar in many ways. The big difference is that in a notebook you can iteratively execute code chunks and view the results of that execution as you go along.
  36
  37 Notebooks are fantastic software development and learning tools, and RStudio and R Markdown support them as well. You can learn more via the [RStudio documentation](https://rmarkdown.rstudio.com/lesson-10.html) and the (much more exhaustive) [R Markdown "cookbook"](https://bookdown.org/yihui/rmarkdown-cookbook/notebook.html), and you can try one out by selecting 'File → New File → R Notebook' from the RStudio dropdown menus.
  38
  39 My attitude here is analogous to my thoughts on the Tidyverse. Even if you wind up working primarily/exclusively in Notebooks, you should have some grounding in working with R via the console, scripts, and "regular" R Markdown files. This is because R Markdown Notebooks rely on all of these tools as well and by knowing even a little bit about what's going on in the background you'll be better able to deepen your knowledge and extend your work beyond the limitations of Notebooks.
  40
  41 tl;dr: If you (want to) love Notebooks, you can/should use them. I wanted to make sure you knew how to work with some more foundational stuff too.
  42
  43 ### Working directories
  44
  45 The concept of "working directories" refers to the location on your computer where R is running, looking for files, and/or storing output files. Sounds simple enough, but managing working directories seems to often induce confusion as you start working with external datasets from the web or stored locally elsewhere on your computer.
  46
  47 For your purposes here, the simplest way to manage/avoid working directory issues may be to select an option from the "Session → Set Working Directory" dropdown menu in RStudio. My best guess is that while you're working on a given .Rmd file you'll be happy most of the time if you choose the "To source file location" option. That said, you might have other preferences and I don't mean to suggest this as a rule/requirement. Whatever the case, when you choose that menu option, you'll see RStudio generate something at the console that might look a bit like this:
  48
  49 > `> setwd("~/Documents/Courses/2020/stats/r_tutorials")`
  50
  51 This `setwd()` ("set working directory") command is what R calls under the hood to... set the working directory. You can also ask R to tell you where your current working directory is with the related command, `getwd()`. Try running it just like that with nothing in the parentheses to see what it returns.
  52
  53 ```{r}
  54 getwd()
  55 ```
  56
  57 Whatever this says is where R thinks it's "doing stuff" on your machine. So, if you ask R to load a dataset from a file stored somewhere else on your computer, R might struggle to find that file unless you point to exactly where it lives. More on this in one of the examples below.
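
To make the link between the working directory and file locations a bit more concrete, here is a minimal sketch. The file name `my_data.RData` is hypothetical (I made it up for illustration), so expect `file.exists()` to return `FALSE` unless you happen to have a file with that name.

```{r}
# "my_data.RData" is a hypothetical file name used only for illustration
rel.path <- "my_data.RData"

# R resolves a relative path like this one against the working directory,
# so building the full path by hand looks like this:
abs.path <- file.path(getwd(), rel.path)
abs.path

# A quick way to check whether R can find a file before trying to load it
file.exists(rel.path)
```

If `file.exists()` says `FALSE` for a file you know you have, the working directory is usually the first thing to check.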
  58
  59 ### Adding comments to your R code chunks
  60
  61 The concept of "comments" in code is potentially intuitive. A comment is just some text that the programming language interpreter ignores. Comments are generally inserted in code to make it easier for people to read in various ways. R interprets the `#` character and anything that comes after it on the same line as a comment, so R will not try to run whatever follows the `#` as a command:
  62
  63 ```{r}
  64 2+2
  65
  66 # This is a comment. The next line is too:
  67 # 2+2
  68 ```
  69
  70 Comments are often less common in the context of R Markdown scripts and notebooks. That said, you may encounter them in examples, R documentation, and elsewhere. You may also want to use them to leave notes for yourself or the teaching team in your R scripts. Whatever the case, it's good to know how to comment.
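
Comments don't need a line of their own. Here is a small sketch (the object names are made up for illustration) showing a comment at the end of a line of code, plus the one place a `#` is not a comment: inside a quoted character string.

```{r}
total <- 2 + 2 # everything after the # on this line is ignored
total

hashtag <- "#rstats" # inside quotes, the # is just an ordinary character
nchar(hashtag)
```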
  71
  72 ### Importing datasets from libraries, the web, and locally (on your computer)
  73
  74 Data import is crucial and can be a time-consuming step in quantitative/computational research (maybe especially in R). In the previous tutorial and problem set you needed to load a library/package in R. Many packages come with datasets pre-installed that you will use for assignments in the course and/or to try out example code. You will also need to learn how to import datasets from the web and locally from files stored on your computer. Here are examples of each.
  75
  76 #### Loading a dataset from an R package
  77
  78 Let's find the `email50` dataset that's included in the `openintro` package provided by the textbook authors. First, I'll load the library, then I can use the `data()` command to call the dataset.
  79
  80 ```{r}
  81 library(openintro)
  82 data(email50)
  83
  84 ## Take a look at the first few rows of the email50 dataset
  85 head(email50)
  86
  87 ```
  88
  89 #### Loading a dataset from the web
  90
  91 This gets a bit more complicated because you have to use the `url()` command to tell R the address you want to use, then you will need to use a second command to actually import the dataset file. In this case, I'm going to point to another dataset provided by the OpenIntro authors containing NOAA temperature information ([more information about the dataset is available on the OpenIntro website](https://www.openintro.org/data/index.php?data=climate70)). The format for the file is `.rda`, which is one of several common R dataset file format suffixes (another one is `.rdata`), and you'll usually use the `load()` command to import an `.rda` or `.rdata` file.
  92
  93 ```{r}
  94 load(url("https://www.openintro.org/data/rda/climate70.rda"))
  95
  96 ## Again, check out the first few rows to see what you've got.
  97 head(climate70)
  98 ```
  99
 100 #### Loading a dataset stored locally
 101
 102 Loading from local storage is last because, ironically, it may be the least intuitive. The best practice here is to use an [absolute path](https://en.wikipedia.org/wiki/Path_%28computing%29) to point R to the unique location on your computer where the file in question is stored. In the example below, my code reflects the operating system and directory structure of my laptop. Your computer will likely (I assume/hope!) use something quite different. Nevertheless, I am providing an example because I think you may be able to work with it and it can at least provide a demonstration that we can talk about later on.
 103
 104 ```{r}
 105
 106 load("/home/ads/Documents/Teaching/2020/stats/data/week_03/group_07.RData")
 107
 108 ls() ## list objects in my global environment
 109
 110 head(d) ## and inspect the first few rows of the new object
 111
 112 ```
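
Since the path above only exists on my laptop, here is a sketch you can run anywhere. It assumes nothing about your directory structure: it writes a tiny, made-up data frame (`example.df`) to R's temporary directory with `save()`, removes it, and then brings it back with `load()`. Everything named here is invented for the demonstration.

```{r}
# A small made-up data frame, just for this demonstration
example.df <- data.frame(id = 1:3, score = c(90, 85, 77))

# Build an absolute path inside R's temporary directory and save the object there
tmp.file <- file.path(tempdir(), "example.RData")
save(example.df, file = tmp.file)

# Remove the object, then load it back from the file
rm(example.df)
loaded <- load(tmp.file)

loaded            # load() returns the names of the objects it restored
head(example.df)  # ...and the object is back under its original name
```

Notice that `load()` restores objects under whatever names they had when they were saved, which is why running `ls()` afterwards is a good habit.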
 113
 114 ## More (complicated) variable types
 115
 116 In the previous tutorial we introduced some basic variable types (numeric, character, and logical). Here are some other common Base R variable types that you will encounter and need to learn to recognize/manage in your work:
 117
 118 ### Factors
 119
 120 Factors usually work like character variables that take a finite and pre-specified set of values. They are useful for encoding categorical variables. You can create them with the `factor()` command or by running `as.factor()` on a character vector.
 121
 122 ```{r}
 123
 124 ## Create a vector of cities as a factor:
 125 cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
 126 summary(cities)
 127 class(cities)
 128
 129 ## Create another vector as a character first...
 130 more.cities <- c("Oakland", "Seattle", "San Diego")
 131 summary(more.cities)
 132 class(more.cities)
 133
 134 ## ...and coerce it to a factor:
 135 more.cities <- as.factor(more.cities)
 136 summary(more.cities)
 137 class(more.cities)
 138
 139 ```
 140
 141 You can usually run `as.factor()` to coerce other kinds of variables into a factor; however, be warned that doing so has some risks as R may not share your intuitions about what ought to happen. Whenever you coerce or convert or reshape data you should immediately run something like `summary()` and `class()` to inspect the results and confirm whether the results seem to align with your aspirations.
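
Here is one small illustration of how R's intuitions can differ from yours, using a made-up numeric vector. Converting numbers to a factor and then back with `as.numeric()` returns the internal level codes rather than the original values:

```{r}
nums <- c(10, 2, 33)
nums.factor <- as.factor(nums)

levels(nums.factor)      # levels are sorted: "2" "10" "33"
as.numeric(nums.factor)  # the level codes (2 1 3), not the original numbers!

# Going through character first recovers the original values
as.numeric(as.character(nums.factor))
```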
 142
 143 ### Lists
 144
 145 Lists are a bit like vectors, but can contain many other kinds of object (e.g., other variables). In this sense they are more of a data structure than a type, but for the purposes of R, the distinctions between data structures and types are a bit mushy...
 146
 147 ```{r}
 148 cities.list <- list(cities, more.cities)
 149 class(cities.list)
 150 cities.list
 151 ```
 152 We can name the items in the list just like we did for a vector:
 153 ```{r}
 154 names(cities.list) <- c("midwest", "west")
 155
 156 # This works too:
 157 cities.list <- list("midwest" = cities, "west" = more.cities)
 158 ```
 159 You can index into the list just like you can with a vector except that, to extract an item itself, you have to use two sets of square brackets instead of one (a single set of brackets returns a smaller list containing the item):
 160 ```{r}
 161 cities.list[["midwest"]]
 162 cities.list[[2]]
 163 ```
 164 With a list you can also index recursively (down into the individual objects contained in the list). For example:
 165 ```{r}
 166 cities.list[["west"]][2]
 167 ```
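
To see the difference between single and double brackets, here is a short sketch with a made-up list (`example.list` is just for illustration). A single set of brackets returns a smaller list, while double brackets (or `$`) return the item itself:

```{r}
example.list <- list(words = c("a", "b", "c"), numbers = 1:3)

class(example.list["numbers"])    # single brackets: still a list
class(example.list[["numbers"]])  # double brackets: the integer vector itself

# The $ notation works like double brackets for named items
example.list$numbers[2]
```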
 168
 169 Some functions that operate on vectors or other data structures "just work" on lists. Others don't or produce weird output that you probably weren't expecting (if you're coming to R from Python this may induce tears and gnashing of teeth). As a practical matter, you should not assume R will be perfectly consistent with how functions treat different variable types and inspect the output of commands to see what happens:
 170
 171 ```{r}
 172
 173 # summary works as you might hope:
 174 summary(cities.list)
 175
 176 # table produces something very weird:
 177 table(cities.list)
 178 ```
 179
 180 ### Matrices
 181
 182 Matrices are a little less common for everyday use, but it's good to know they're there and that R can help you do matrix arithmetic to your heart's content. In my day-to-day work, I most frequently encounter matrices in R as an intermediate format used or generated by functions to complete specific tasks (e.g., running regression models). This means that I sometimes want/need to interact with matrix objects and, by extension, you might too.
 183
 184 An example is below. Check out the help documentation for the `matrix()` function for more.
 185
 186 ```{r}
 187
 188 m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
 189 m1
 190
 191 m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
 192 m2
 193
 194 m1*m2 # element-wise multiplication (use %*% for the matrix product)
 195 t(m2) # transposition
 196
 197 ```
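
One thing that can trip people up: the `*` operator multiplies matrices element by element. The matrix product from linear algebra uses `%*%` instead. A quick sketch with two small made-up matrices:

```{r}
a <- matrix(1:4, nrow = 2)  # filled column by column: rows are (1, 3) and (2, 4)
b <- matrix(5:8, nrow = 2)  # rows are (5, 7) and (6, 8)

a * b    # element-wise: each cell of a times the matching cell of b
a %*% b  # matrix product: rows of a combined with columns of b
```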
 198
 199 ### Data frames
 200
 201 A data frame is a structured format for storing tabular data. More formally in R, a data frame consists of a list of vectors of equal length.
 202
 203 For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly. They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: you can read the help documentation on `faithful` to learn about the dataset):
 204
 205 ```{r}
 206
 207 ## This first command calls a dataset into my working "Global" environment and makes it available for subsequent use.
 208 data("faithful")
 209
 210 dim(faithful) # Returns the number of rows and columns. Often the first thing I do with any data frame
 211
 212 nrow(faithful)
 213 ncol(faithful)
 214
 215 names(faithful)  ## try colnames(faithful) too
 216
 217 head(faithful) ## look at the first few rows of data
 218
 219 summary(faithful)
 220
 221 ```
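
If the "list of vectors of equal length" description feels abstract, you can check it directly. This sketch just asks R a few questions about `faithful`:

```{r}
is.list(faithful)  # a data frame really is a list under the hood
length(faithful)   # so its "length" is the number of columns

# Each column is a vector, and they all have the same length
length(faithful$eruptions)
length(faithful$waiting)

str(faithful)      # a compact overview of the whole structure
```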
 222
 223 Some datasets are built into packages. Once you've installed the package (see the R Tutorial from Week 1 or the help documentation for `install.packages()`), you can call a dataset from that package by first loading the package with `library()` and then using `data()`, just as with the `email50` example above.
 224
 225 You can index into a data frame using numeric values or variable names. The notation uses square brackets again and requires you to remember the convention of `[<rows>, <columns>]`:
 226
 227 ```{r}
 228 faithful[1,1] # The item in the first row of the first column
 229
 230 faithful[,2] # all of the items in the second column
 231
 232 faithful[10:20, 2] # ranges work too. This returns the 10-20th values of the second column.
 233
 234 faithful[37, "eruptions"] # The 37th value of the column called "eruptions."
 235 ```
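
The same `[<rows>, <columns>]` convention also accepts logical conditions and vectors of names. Here is a brief sketch; the cutoff of 90 minutes is arbitrary and `long.waits` is a name I made up for the example:

```{r}
# Rows where the waiting time exceeds 90 (leave the column slot empty to keep all columns)
long.waits <- faithful[faithful$waiting > 90, ]
nrow(long.waits)

# Several columns at once, selected by name
faithful[1:3, c("eruptions", "waiting")]

# Selecting a single column returns a plain vector unless you add drop = FALSE
class(faithful[, 2])
class(faithful[, 2, drop = FALSE])
```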
 236
 237 It is very useful to work with column (variable) names in a data frame using the `$` symbol:
 238
 239 ```{r}
 240 faithful$eruptions
 241
 242 mean(faithful$waiting)
 243
 244 boxplot(faithful$waiting)
 245 ```
 246
 247 Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):
 248
 249 ```{r}
 250 plot(eruptions ~ waiting, data=faithful)
 251 ```
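
The `~` notation is not limited to `plot()`. As a sketch, here it is with two other base R functions and the built-in `mtcars` dataset (introduced properly just below). Read `mpg ~ cyl` as "miles per gallon by number of cylinders":

```{r}
# One box per value of cyl
boxplot(mpg ~ cyl, data = mtcars)

# Mean mpg for each value of cyl, returned as a small data frame
mpg.by.cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg.by.cyl
```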
 252
 253
 254 Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars` (read the help documentation! It contains a codebook that tells you about each variable). Let's look at that one next:
 255
 256 ```{r}
 257 data("mtcars")
 258
 259 dim(mtcars)
 260
 261 head(mtcars)
 262 ```
 263
 264 There are many ways to create and modify data frames. Here are some examples using the `mtcars` data. I create new vectors from the variables (columns) in the dataset, then I use the `data.frame` command to build a new data frame from the three vectors and do some data cleanup/recoding:
 265
 266 ```{r}
 267
 268 my.mpg <- mtcars$mpg
 269 my.cyl <- mtcars$cyl
 270 my.disp <- mtcars$disp
 271
 272 df.small <- data.frame(my.mpg, my.cyl, my.disp)
 273 class(df.small)
 274 head(df.small)
 275
 276 # recode a value as missing
 277 df.small[5,1] <- NA
 278
 279 # removing a column
 280 df.small[,3] <- NULL
 281 dim(df.small)
 282 head(df.small)
 283
 284 ```
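
There are usually several routes to the same result. As an alternative sketch (the object names `df.named` and `df.subset` are mine), you can name the columns when you call `data.frame()`, or pull columns out by name and drop one with `$`:

```{r}
# Name the columns directly in the call to data.frame()
df.named <- data.frame(mpg = mtcars$mpg, cyl = mtcars$cyl)
names(df.named)

# Or select columns by name from the original data frame...
df.subset <- mtcars[, c("mpg", "cyl", "disp")]

# ...and remove one by assigning NULL to it
df.subset$disp <- NULL
names(df.subset)
```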
 285
 286 Creating new variables, recoding, and transformations look very similar to working with vectors. Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here:
 287
 288 ```{r}
 289 df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)
 290
 291 table(df.small$mpg.big)
 292
 293 df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p()
 294 head(df.small$mpg.l)
 295
 296 ## convert a number into a factor:
 297 df.small$my.cyl.factor <- factor(df.small$my.cyl)
 298 summary(df.small$my.cyl.factor)
 299 ```
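
In case `log1p()` is new to you: it calculates `log(1 + x)`. A quick check with a few made-up values shows the equivalence, and why it can be handy when a variable contains zeros:

```{r}
x <- c(0, 1, 9)

log1p(x)
all.equal(log1p(x), log(1 + x))  # the two calculations agree

log(0)    # the plain log of zero is -Inf...
log1p(0)  # ...whereas log1p(0) is 0
```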
 300
 301 Some special functions are particularly useful for working with data frames:
 302
 303 ```{r}
 304 is.na(df.small$my.mpg)
 305
 306 sum(is.na(df.small$my.mpg)) # sum() treats TRUE as 1 and FALSE as 0, so this counts the missing values
 307
 308 complete.cases(df.small)
 309 sum(complete.cases(df.small))
 310 ```
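
The reason `sum()` can count things here is that R treats `TRUE` as 1 and `FALSE` as 0 when you do arithmetic on logical values. A tiny sketch with made-up vectors:

```{r}
flags <- c(TRUE, FALSE, TRUE, TRUE)

sum(flags)   # the number of TRUE values
mean(flags)  # the proportion of TRUE values

# The same trick counts missing values in any vector
sum(is.na(c(3, NA, 7, NA)))
```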
 311
 312 ## Vectorized operations: "Apply" functions and beyond
 313
 314 R has a lot of built-in functionality to support working on objects very quickly/efficiently in a "vectorized" way. The specifics of vectorization are beyond the scope of our course, but the significance is that while you may have learned to do things with "loops" in other programming languages (and we'll learn how to use them in R) vectorized functions in R almost always provide a faster/better way to achieve the same goals. As a result, I like to introduce vectorized functions first.
 315
 316 The reasons you might use a vectorized function are kind of simple: imagine you have an object (vector, data frame, list, or whatever) and you want to perform the same operation over all of the rows or columns. Loops are good if you want to do this in a strict order (e.g., first row #1, then row #2, etc.), but most of the time the order doesn't matter. Vectorized functions allow R to apply operations over data objects in an arbitrary order, resulting in massively faster and less-verbose code.
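
Here is a minimal side-by-side sketch of the two approaches, taking square roots of a made-up vector. Don't worry about the loop syntax yet; the point is just that the vectorized version does the same job in one line:

```{r}
values <- c(1, 4, 9, 16)

# The loop version: visit each item in order and store the result
roots.loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots.loop[i] <- sqrt(values[i])
}

# The vectorized version: sqrt() operates on the whole vector at once
roots.vec <- sqrt(values)

identical(roots.loop, roots.vec)
```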
 317
 318 Most of the base R versions of these functions have "apply" in the name. There are also Tidyverse alternatives. I will stick to the base R versions here. Please feel free to read about and use the alternatives! I'll try to introduce and document them a little later on in our course.
 319
 320 Let's start with an example using the `mtcars` dataset again. The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument. The differences emerge in the structure of the output: `sapply()` returns a (usually more user-friendly) vector or matrix by default whereas `lapply()` returns a list:
 321
 322 ```{r}
 323 sapply(mtcars, quantile)
 324
 325 lapply(mtcars, quantile) # Same output, different format/class
 326 ```
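
To see the difference in output structure more directly, here is a sketch that uses `median` (so as not to spoil the exercise below) and inspects the class of each result. The object names are made up:

```{r}
medians.s <- sapply(mtcars, median)
medians.l <- lapply(mtcars, median)

class(medians.s)  # a named numeric vector
class(medians.l)  # a list with one item per variable

# Because the results are named, you can index them by variable name
medians.s["mpg"]
medians.l[["mpg"]]
```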
 327
 328 Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable in the `mtcars` data using either `sapply` or `lapply`?
 329
 330 The `tapply` function allows you to apply functions conditionally. For example, the chunk below finds the mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) provides an index into the first (`mtcars$mpg`) before the third argument (`mean`) is applied to each of the conditional subsets:
 331
 332 ```{r}
 333 tapply(mtcars$mpg, mtcars$cyl, mean)
 334 ```
 335
 336 Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?
 337
 338 Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.
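
For matrices, the second argument to `apply()` tells R whether to work over rows (`1`) or columns (`2`). A small sketch with a made-up matrix:

```{r}
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)
m

apply(m, 1, sum)  # one sum per row
apply(m, 2, sum)  # one sum per column
```

For simple sums and means, the dedicated `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` functions give the same answers.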
 339
 340 ## Some basic graphs with ggplot2
 341
 342 ggplot2 is what I like to use for plotting so I'll develop examples with it from here on out.
 343
 344 Make sure you've installed the package with `install.packages("ggplot2")` and load it with `library(ggplot2)`.
 345
 346 There is another built-in (automotive) dataset that comes along with the ggplot2 package called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.
 347
 348 I'll develop a few simple examples below. For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics. Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.
 349
 350 ```{r}
 351 library(ggplot2)
 352 data("mpg")
 353
 354 # First thing, call ggplot() to start building up a plot
 355 # aes() indicates which variables to use as "aesthetic" mappings
 356
 357 p <- ggplot(data=mpg, aes(manufacturer, hwy))
 358
 359 p + geom_boxplot()
 360
 361 # another relationship:
 362 p <- ggplot(data=mpg, aes(fl, hwy))
 363 p + geom_boxplot()
 364 p + geom_violin()
 365
 366 ```
 367
 368 Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:
 369
 370 ```{r}
 371 p <- ggplot(data=mpg, aes(cty, hwy))
 372
 373 p+geom_point()
 374
 375 # Multivariate graphical displays can get pretty wild
 376 p + geom_point(aes(color=factor(class), shape=factor(cyl)))
 377 ```
 378
 379