---
title: "Week 3 R Tutorial"
author: "Aaron Shaw"
date: "September 28, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    theme: readable
  pdf_document:
    toc: yes
    toc_depth: '3'
subtitle: "Statistics and Statistical Programming \nNorthwestern University \nMTS 525"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## More RStudio, RMarkdown, and R background + tips

Last week, I encouraged you to learn some basic things about RStudio, RMarkdown, and R in order to get started. All three have a vast (overwhelming?) number of available options, opportunities for customization, and documentation. While I think you'll be able to do most of what you need for the course with what I'm providing in these tutorials and/or links to other resources, you should absolutely explore further (I'll keep adding materials to the list of resources on the course wiki page).

In the meantime, here are a few more things that I think are likely to be useful in the coming weeks of our course. Partly, I want to keep introducing other "features" of these tools that may not have been obvious. I also want to continue to explain some of the choices I've made in how I introduce you to these resources and encourage you to explore more broadly, cultivate your own preferences, and find ways to work in R that suit your goals, needs, and skill level.

### Working in "Base R"

You may encounter a distinction between what people often call *Base R* and other tools or software packages. *Base R* usually refers to the syntax and functions that R employs without the addition of extra libraries or packages (or with libraries and packages that merely extend that syntax rather than develop an entirely different syntax). The most common syntax people currently use that is not Base R comes from a large family of packages called [the Tidyverse](https://www.tidyverse.org/). The Tidyverse is maintained and distributed by the same organization that maintains and distributes RStudio.
The Tidyverse packages provide a suite of tools developed to facilitate data science and statistical analysis with a programming syntax (or "grammar" as the creators might prefer) designed to overcome some of the initial obstacles and idiosyncrasies of Base R.

For the purposes of our course, you might want to learn more about Tidyverse code and packages. You will almost certainly encounter example code snippets and suggestions around the web that assume you are familiar with the Tidyverse syntax and software. Indeed, those of you who are accustomed to working in Python or other "modern" programming languages may find it easier to work with the Tidyverse than with Base R. Books like those by Healy and by Wickham and Grolemund (linked from the course wiki page) forgo almost any introduction to Base R in favor of working almost exclusively within the Tidyverse.

All of this is a bit far out in the weeds, so let me skip to the key takeaway: I start off teaching you Base R and will not require you to learn the Tidyverse. That said, I strongly encourage you to embrace the Tidyverse, especially as you become familiar with some of the basic tasks and operations you can perform with Base R. I especially encourage you to become familiar with the `ggplot2` package as it produces visualizations superior to those you can make in Base R in almost every way.

My rationale is that even the Tidyverse packages and syntax operate *within* the R software environment and rely on the underlying functionalities of Base R. In that respect, starting with Base R provides you with a foundation that should allow you greater flexibility as you deepen your R skills. You will not be constrained to any particular software package within R.

### Notebooks vs. scripts vs. other ways of developing R code

My first tutorial made at least one other leap/assumption worth noting.
If you've taken other introductory courses or tutorials in data science or programming with R or Python, you may have interacted with other languages/tools via software "Notebooks." Notebooks are a type of software development environment that makes it possible to work with text and execute code alongside each other. Sound familiar? It should! The R Markdown scripts I've introduced and suggested here are similar in many ways. The big difference is that in a notebook you can iteratively execute code chunks and view the results of that execution as you go along.

Notebooks are fantastic software development and learning tools and RStudio/R Markdown support them as well. You can learn more via the [RStudio documentation](https://rmarkdown.rstudio.com/lesson-10.html) and the (much more exhaustive) [R Markdown "cookbook"](https://bookdown.org/yihui/rmarkdown-cookbook/notebook.html), and you can try one out by selecting 'File → New File → R Notebook' from the RStudio dropdown menus.

My attitude here is analogous to my thoughts on the Tidyverse. Even if you wind up working primarily/exclusively in Notebooks, you should have some grounding in working with R via the console, scripts, and "regular" R Markdown files. This is because R Markdown Notebooks rely on all of these tools as well, and by knowing even a little bit about what's going on in the background you'll be better able to deepen your knowledge and extend your work beyond the limitations of Notebooks.

tl;dr: If you (want to) love Notebooks, you can/should use them. I wanted to make sure you knew how to work with some more foundational stuff too.

### Working directories

The concept of "working directories" refers to the location on your computer where R is running, looking for files, and/or storing output files. Sounds simple enough, but managing working directories often seems to induce confusion as you start working with external datasets from the web or stored locally elsewhere on your computer.
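
To make the idea a bit more concrete, here is a minimal sketch of how R treats a bare filename. It assumes nothing beyond Base R: `getwd()` reports the current working directory and `file.path()` glues path pieces together. The filename `my_data.RData` is hypothetical and only there for illustration; the file does not need to exist.

```{r}
# "my_data.RData" is a hypothetical filename used only for illustration
relative.path <- "my_data.RData"

# R resolves a bare filename like this against the current working directory,
# so the two paths below point at the same location:
absolute.path <- file.path(getwd(), relative.path)
absolute.path

# Both ways of writing the path agree about whether the file exists
file.exists(relative.path) == file.exists(absolute.path)
```

If the file you want lives anywhere other than the working directory, the bare filename will not find it, which is where most of the confusion comes from.
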
For your purposes here, the simplest way to manage/avoid working directory issues may be to select an option from the "Session → Set Working Directory" dropdown menu in RStudio. My best guess is that while you're working on a given .Rmd file you'll be happy most of the time if you choose the "To source file location" option. That said, you might have other preferences and I don't mean to suggest this as a rule/requirement. Whatever the case, when you choose that menu option, you'll see RStudio generate something at the console that might look a bit like this:

> `> setwd("~/Documents/Courses/2020/stats/r_tutorials")`

This `setwd()` ("set working directory") command is what R calls under the hood to...set the working directory. You can also ask R to tell you where your current working directory is with the related command, `getwd()`. Try running it just like that with nothing in the parentheses to see what it returns.

```{r}
getwd()
```

Whatever this says is where R thinks it's "doing stuff" on your machine. If you ask R to load a dataset from a file stored somewhere else on your computer, R might struggle to find it unless you point to exactly where it lives. More on this in one of the examples below.

### Adding comments to your R code chunks

The concept of "comments" in code is potentially intuitive. A comment is just some text that the programming language interpreter ignores: it is never executed, though it will still appear when your code is echoed in the output. Comments are generally inserted in code to make it easier for people to read in various ways. R interprets the `#` character and anything that comes after it on the same line as a comment. R will not try to interpret whatever comes next as a command:

```{r}
2+2

# This is a comment. The next line is too:
# 2+2
```

Comments are less common in the context of R Markdown scripts and notebooks. That said, you may encounter them in examples, R documentation, and elsewhere.
You may also want to use them to leave notes for yourself or the teaching team in your R scripts. Whatever the case, it's good to know how to comment.

### Importing datasets from libraries, the web, and locally (on your computer)

Data import is crucial and can be a time-consuming step in quantitative/computational research (maybe especially in R). In the previous tutorial and problem set you needed to load a library/package in R. Many packages come with datasets pre-installed that you will use for assignments in the course and/or to try out example code. You will also need to learn how to import datasets from the web and locally from files stored on your computer. Here are examples of each.

#### Loading a dataset from an R package

Let's find the `email50` dataset that's included in the `openintro` package provided by the textbook authors. First, I'll load the library, then I can use the `data()` command to call the dataset.

```{r}
library(openintro)
data(email50)

## Take a look at the first few rows of the email50 dataset
head(email50)
```

#### Loading a dataset from the web

This gets a bit more complicated because you have to use the `url()` command to tell R the address you want to use, then you will need to use a second command to actually import the dataset file. In this case, I'm going to point to another dataset provided by the OpenIntro authors containing NOAA temperature information ([more information about the dataset is available on the OpenIntro website](https://www.openintro.org/data/index.php?data=climate70)). The format for the file is `.rda`, which is one of several common R dataset file format suffixes (another one is `.rdata`), and you'll usually use the `load()` command to import an `.rda` or `.rdata` file.

```{r}
load(url("https://www.openintro.org/data/rda/climate70.rda"))

## Again, check out the first few rows to see what you've got.
head(climate70)
```

#### Loading a dataset stored locally

Loading from local storage is last because, ironically, it may be the least intuitive. The best practice here is to use an [absolute path](https://en.wikipedia.org/wiki/Path_%28computing%29) to point R to the unique location on your computer where the file in question is stored. In the example below, my code reflects the operating system and directory structure of my laptop. Your computer will likely (I assume/hope!) use something quite different. Nevertheless, I am providing an example because I think you may be able to work with it and it can at least provide a demonstration that we can talk about later on.

```{r}
load("/home/ads/Documents/Teaching/2020/stats/data/week_03/group_07.RData")

ls() ## list objects in my global environment

head(d) ## and inspect the first few rows of the new object
```

## More (complicated) variable types

In the previous tutorial we introduced some basic variable types (numeric, character, and logical). Here are some other common Base R variable types that you will encounter and need to learn to recognize/manage in your work:

### Factors

Factors usually work like character variables that take a finite and pre-specified set of values. They are useful for encoding categorical variables. You can create them with the `factor()` command or by running `as.factor()` on a character vector.

```{r}
## Create a vector of cities as a factor:
cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
summary(cities)
class(cities)

## Create another vector as a character first...
more.cities <- c("Oakland", "Seattle", "San Diego")
summary(more.cities)
class(more.cities)

## ...and coerce it to a factor:
more.cities <- as.factor(more.cities)
summary(more.cities)
class(more.cities)
```

You can usually run `as.factor()` to coerce other kinds of variables into a factor; however, be warned that doing so has some risks as R may not share your intuitions about what ought to happen.
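
Here is a minimal sketch of one such surprise, using a made-up numeric vector (`ages` is hypothetical and not part of any course dataset). Once numbers have been coerced to a factor, asking for them back with `as.numeric()` returns R's internal level codes rather than the original values:

```{r}
# Hypothetical example: a few numbers we coerce into a factor
ages <- c(10, 20, 5)
ages.factor <- as.factor(ages)
levels(ages.factor) # the levels are stored as text, in sorted order

# Converting the factor straight back to numeric returns the
# internal level codes (positions in the levels), not the numbers:
as.numeric(ages.factor)

# Going through character first recovers the original values:
as.numeric(as.character(ages.factor))
```

Nothing here produces an error or a warning, which is exactly why this kind of thing is easy to miss.
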
Whenever you coerce or convert or reshape data you should immediately run something like `summary()` and `class()` to inspect the results and confirm whether the results seem to align with your aspirations.

### Lists

Lists are a bit like vectors, but can contain many other kinds of object (e.g., other variables). In this sense they are more of a data structure than a type, but for the purposes of R, the distinctions between data structures and types are a bit mushy...

```{r}
cities.list <- list(cities, more.cities)
class(cities.list)
cities.list
```

We can name the items in the list just like we did for a vector:

```{r}
names(cities.list) <- c("midwest", "west")

# This works too:
cities.list <- list("midwest" = cities, "west" = more.cities)
```

You can index into the list just like you can with a vector except that instead of one set of square brackets you have to use two:

```{r}
cities.list[["midwest"]]
cities.list[[2]]
```

With a list you can also index recursively (down into the individual objects contained in the list). For example:

```{r}
cities.list[["west"]][2]
```

Some functions that operate on vectors or other data structures "just work" on lists. Others don't or produce weird output that you probably weren't expecting (if you're coming to R from Python this may induce tears and gnashing of teeth). As a practical matter, you should not assume R will be perfectly consistent with how functions treat different variable types and inspect the output of commands to see what happens:

```{r}
# summary works as you might hope:
summary(cities.list)

# table produces something very weird:
table(cities.list)
```

### Matrices

Matrices are a little less common for everyday use, but it's good to know they're there and that R can help you do matrix arithmetic to your heart's content. In my day-to-day work, I most frequently encounter matrices in R as an intermediate format used or generated by functions to complete specific tasks (e.g., running regression models).
This means that I sometimes want/need to interact with matrix objects and, by extension, you might too. An example is below. Check out the help documentation for the `matrix()` function for more.

```{r}
m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
m1

m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
m2

m1*m2 # element-wise multiplication (matrix multiplication uses %*% instead)

t(m2) # transposition
```

### Data frames

A data frame is a structured format for storing tabular data. More formally in R, a data frame consists of a list of vectors of equal length.

For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly. They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: you can read the help documentation on `faithful` to learn about the dataset):

```{r}
## This first command calls a dataset into my working "Global" environment and makes it available for subsequent use.
data("faithful")

dim(faithful) # Returns the number of rows and columns. Often the first thing I do with any data frame

nrow(faithful)
ncol(faithful)

names(faithful) ## try colnames(faithful) too

head(faithful) ## look at the first few rows of data

summary(faithful)
```

Some datasets are built in to packages. Once you've installed the package (see the R Tutorial from Week 1 or the help documentation for `install.packages()`), you can call a dataset from that package by first invoking the package with `library()` and then using `data()`.

You can index into a data frame using numeric values or variable names. The notation uses square brackets again and requires you to remember the convention of `[row, column]`:

```{r}
faithful[1,1] # The item in the first row of the first column

faithful[,2] # all of the items in the second column

faithful[10:20, 2] # ranges work too. This returns the 10th through 20th values of the second column.
faithful[37, "eruptions"] # The 37th value of the column called "eruptions."
```

It is very useful to work with column (variable) names in a data frame using the `$` symbol:

```{r}
faithful$eruptions

mean(faithful$waiting)

boxplot(faithful$waiting)
```

Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):

```{r}
plot(eruptions ~ waiting, data=faithful)
```

Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars` (read the help documentation! It contains a codebook that tells you about each variable). Let's look at that one next:

```{r}
data("mtcars")

dim(mtcars)

head(mtcars)
```

There are many ways to create and modify data frames. Here are some examples using the `mtcars` data. I create new vectors from the variables (columns) in the dataset, then I use the `data.frame()` command to build a new data frame from the three vectors and do some data cleanup/recoding:

```{r}
my.mpg <- mtcars$mpg
my.cyl <- mtcars$cyl
my.disp <- mtcars$disp

df.small <- data.frame(my.mpg, my.cyl, my.disp)
class(df.small)
head(df.small)

# recode a value as missing
df.small[5,1] <- NA

# removing a column
df.small[,3] <- NULL
dim(df.small)
head(df.small)
```

Creating new variables, recoding, and transformations look very similar to working with vectors.
Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here (it tells R to drop missing values before calculating the mean; without it, a single `NA` makes the result `NA`):

```{r}
df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)

table(df.small$mpg.big)

df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p()
head(df.small$mpg.l)

## convert a number into a factor:
df.small$my.cyl.factor <- factor(df.small$my.cyl)
summary(df.small$my.cyl.factor)
```

Some special functions are particularly useful for working with data frames:

```{r}
is.na(df.small$my.mpg)

sum(is.na(df.small$my.mpg)) # sum() works in mysterious ways sometimes: here it counts the TRUE values

complete.cases(df.small)

sum(complete.cases(df.small))
```

## Vectorized operations: "Apply" functions and beyond

R has a lot of built-in functionality to support working on objects very quickly/efficiently in a "vectorized" way. The specifics of vectorization are beyond the scope of our course, but the significance is that while you may have learned to do things with "loops" in other programming languages (and we'll learn how to use them in R), vectorized functions in R usually provide a faster/better way to achieve the same goals. As a result, I like to introduce vectorized functions first.

The reasons you might use a vectorized function are kind of simple: imagine you have an object (vector, data frame, list, or whatever) and you want to perform the same operation over all of the rows or columns. Loops are good if you want to do this in a strict order (e.g., first row #1, then row #2, etc.), but most of the time the order doesn't matter. Vectorized functions allow R to apply operations over data objects without you spelling out the order, which often results in faster and usually less verbose code.

Most of the base R versions of these functions have "apply" in the name. There are also Tidyverse alternatives. I will stick to the base R versions here. Please feel free to read about and use the alternatives! I'll try to introduce and document them a little later on in our course.
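
To illustrate the contrast with loops, here is a minimal sketch that computes the same thing both ways. The vector `x` is a made-up example (not one of the tutorial datasets), and the loop syntax is only a preview of something we'll cover properly later:

```{r}
# Hypothetical example vector (not from the tutorial datasets)
x <- c(1, 4, 9, 16)

# Loop version: fill in a pre-allocated result one element at a time
roots.loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots.loop[i] <- sqrt(x[i])
}

# Vectorized version: sqrt() operates on the whole vector at once
roots.vec <- sqrt(x)

# Both approaches produce the same result
identical(roots.loop, roots.vec)
```

The vectorized version is a single line and leaves the bookkeeping (indexes, pre-allocation) to R.
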
Let's start with an example using the `mtcars` dataset again. The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument. The differences emerge in the structure of the output: `sapply()` returns a (usually more user-friendly) vector or matrix by default whereas `lapply()` returns a list:

```{r}
sapply(mtcars, quantile)

lapply(mtcars, quantile) # Same output, different format/class
```

Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable in the `mtcars` data using either `sapply` or `lapply`?

The `tapply` function allows you to apply functions conditionally. For example, the chunk below finds the mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) provides an index into the first (`mtcars$mpg`) before the third argument (`mean`) is applied to each of the conditional subsets:

```{r}
tapply(mtcars$mpg, mtcars$cyl, mean)
```

Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?

Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.

## Some basic graphs with ggplot2

ggplot2 is what I like to use for plotting so I'll develop examples with it from here on out.

Make sure you've installed the package with `install.packages("ggplot2")` and load it with `library(ggplot2)`.

There is another built-in (automotive) dataset that comes along with the ggplot2 package called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.

I'll develop a few simple examples below.
For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics. Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.

```{r}
library(ggplot2)
data("mpg")

# First thing, call ggplot() to start building up a plot
# aes() indicates which variables to use as "aesthetic" mappings
p <- ggplot(data=mpg, aes(manufacturer, hwy))

p + geom_boxplot()

# another relationship:
p <- ggplot(data=mpg, aes(fl, hwy))
p + geom_boxplot()
p + geom_violin()
```

Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:

```{r}
p <- ggplot(data=mpg, aes(cty, hwy))

p + geom_point()

# Multivariate graphical displays can get pretty wild
p + geom_point(aes(color=factor(class), shape=factor(cyl)))
```
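
As one small taste of those many options, here is a minimal sketch that adds axis labels and a title to the city vs. highway scatterplot using ggplot2's `labs()` function. The label text and the object name `p.labeled` are my own placeholders, so change them to whatever suits your plot:

```{r}
library(ggplot2)

# Same city vs. highway scatterplot as above, with (placeholder) labels added
p.labeled <- ggplot(data=mpg, aes(cty, hwy)) +
  geom_point() +
  labs(x="City miles per gallon",
       y="Highway miles per gallon",
       title="Fuel economy in the mpg dataset")

p.labeled
```

Building the plot up with `+` like this means you can keep layering on pieces (geoms, labels, themes) one at a time and check the result as you go.
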