r_lectures/w02-R_lecture.Rmd

---
title: "Week 2 R lecture"
subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
author: "Aaron Shaw"
date: "April 4, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Adding comments to your code

Sorry, I realized that I forgot to explain this in last week's R lecture!

R treats the `#` character and everything after it on the same line as a comment, so it will not try to run that text as a command:

```{r}
2+2

# This is a comment. The next line is too:
# 2+2
```

## More advanced variable types:

### Factors

These are for categorical data. You can create them with the `factor()` command or by running `as.factor()` on a character vector.

```{r}
cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
summary(cities)
class(cities)

more.cities <- c("Oakland", "Seattle", "San Diego")
summary(more.cities)
class(more.cities)

more.cities <- as.factor(more.cities)
summary(more.cities)
class(more.cities)
```
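
Factors store their categories as "levels." Here is a minimal sketch of how the level order works. The `sizes` vector and its values are made up for illustration and are not part of the lecture data. By default R sorts the levels alphabetically, and the `levels=` argument to `factor()` lets you pick the order yourself:

```{r}
# Hypothetical example data (not one of the lecture datasets)
sizes <- c("small", "large", "medium", "small")

# By default, levels are sorted alphabetically
sizes.default <- factor(sizes)
levels(sizes.default)

# Use levels= to choose the order yourself
sizes.ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes.ordered)

# Functions like table() report categories in level order
table(sizes.ordered)
```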

### Lists

Lists are a lot like vectors, but can contain any kind of object (e.g., other variables).

```{r}
cities.list <- list(cities, more.cities)
cities.list
```

We can name the items in the list just like we did for a vector:

```{r}
names(cities.list) <- c("midwest", "west")

# This works too:
cities.list <- list("midwest" = cities, "west" = more.cities)
```

You can index into a list much as you would a vector, except that you use two sets of square brackets to extract an item (a single set returns a smaller list instead):

```{r}
cities.list[["midwest"]]
cities.list[[2]]
```

With a list you can also index recursively (down into the individual objects contained in the list). For example:

```{r}
cities.list[["west"]][2]
```
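
The difference between one and two sets of square brackets is easy to miss, so here is a small stand-alone sketch. The `example.list` object is made up for illustration (it holds character vectors rather than the factors used above), and the `$` shorthand shown here is an extra that only works when the items have names:

```{r}
# A small stand-alone list (hypothetical, similar in spirit to cities.list)
example.list <- list("midwest" = c("Chicago", "Detroit"),
                     "west" = c("Oakland", "Seattle"))

class(example.list["west"])    # single brackets return a (smaller) list
class(example.list[["west"]])  # double brackets return the item itself

example.list$west              # $ is shorthand for [["west"]]
example.list[["west"]][2]      # the second element of the "west" item
```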

Some functions "just work" on lists. Others don't or produce weird output that you probably weren't expecting. You should be careful and check the output to see what happens:

```{r}
# summary works as you might hope:
summary(cities.list)

# table produces something very weird:
table(cities.list)
```

### Matrices

Matrices are a little less common for everyday use, but it's good to know they're there and that you can do matrix arithmetic to your heart's content. An example is below. Check out the help documentation for the `matrix()` function for more.

```{r}
m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
m1

m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
m2

m1 * m2 # element-wise multiplication
t(m2) # transposition
```
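
One thing to watch for: the `*` operator multiplies matrices element by element. Matrix multiplication in the linear algebra sense uses the `%*%` operator instead. A minimal sketch with two made-up 2-by-2 matrices (`a` and `b` are illustrative names):

```{r}
a <- matrix(1:4, nrow = 2)   # filled column by column: 1, 2 then 3, 4
b <- matrix(5:8, nrow = 2)

a * b     # element-wise multiplication
a %*% b   # matrix multiplication
```

With the 4-by-3 matrices above, `m1 %*% m2` would fail because the dimensions do not conform, while `m1 %*% t(m2)` works.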

### Data frames

A data frame is a format for storing tabular data. Formally, it consists of a list of vectors of equal length.

For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly. They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: read the help documentation on `faithful` to learn about the dataset!):

```{r}
faithful <- faithful # This makes the dataset visible in the "Environment" tab in RStudio.

dim(faithful) # often the first thing I do with any data frame
nrow(faithful)

names(faithful)  ## try colnames(faithful) too

head(faithful) ## look at the first few rows of data

summary(faithful)
```
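
To make the "list of vectors of equal length" idea concrete, here is a quick check on `faithful`. This is only a sketch that reuses the list indexing from the Lists section:

```{r}
is.list(faithful)                # TRUE: a data frame is a special kind of list

# Each column is one item in that list, and all items have the same length
length(faithful[["eruptions"]])
length(faithful[["waiting"]])
nrow(faithful)
```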

You can index into a data frame using numeric values or variable names. The notation uses square brackets again and requires you to remember the convention of `[<rows>, <columns>]`:

```{r}
faithful[1,1] # The item in the first row of the first column

faithful[,2] # all of the items in the second column

faithful[10:20, 2] # ranges work too

faithful[37, "eruptions"]
```
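
The `<rows>` position also accepts a logical test, which is a common way to pull out a subset of observations. A minimal sketch (the 90-minute cutoff and the `long.waits` name are arbitrary choices for illustration):

```{r}
# Rows where the waiting time was longer than 90 minutes, all columns
long.waits <- faithful[faithful[, "waiting"] > 90, ]
long.waits
nrow(long.waits)

# Several columns by name, first three rows
faithful[1:3, c("eruptions", "waiting")]
```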

It is very useful to work with column (variable) names in a data frame using the `$` symbol:

```{r}
faithful$eruptions

mean(faithful$waiting)

boxplot(faithful$waiting)
```

Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):

```{r}
plot(eruptions ~ waiting, data=faithful)
```
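
The `~` notation creates a formula object that reads "Y ~ X". As a rough sketch of how it relates to the other ways of calling `plot()` (the name `f` is just for illustration), the three calls below draw the same points and differ only in their axis labels:

```{r}
# A formula is an object in its own right
f <- eruptions ~ waiting
class(f)

plot(f, data = faithful)                     # formula plus data=
plot(faithful$eruptions ~ faithful$waiting)  # formula without data=
plot(faithful$waiting, faithful$eruptions)   # x first, then y
```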

Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars` (read the help documentation! it contains a codebook that tells you about each variable). Let's look at that one next:

```{r}
mtcars <- mtcars

dim(mtcars)

head(mtcars)
```

There are many ways to create and modify data frames. Here is an example playing with the `mtcars` data. I use the `data.frame` command to build a new data frame from three vectors:

```{r}
my.mpg <- mtcars$mpg
my.cyl <- mtcars$cyl
my.disp <- mtcars$disp

df.small <- data.frame(my.mpg, my.cyl, my.disp)
class(df.small)
head(df.small)

# recode a value as missing
df.small[5,1] <- NA

# removing a column
df.small[,3] <- NULL
dim(df.small)
head(df.small)
```

Creating new variables, recoding, and transformations look very similar to working with vectors. Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here:

```{r}
df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)

table(df.small$mpg.big)

df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p()
head(df.small$mpg.l)

## convert a number into a factor:
df.small$my.cyl.factor <- factor(df.small$my.cyl)
summary(df.small$my.cyl.factor)
```
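
A note on `log1p()`: it computes `log(1 + x)`. One common reason to reach for it (an assumption about the motivation here, not something stated above) is that it stays defined when a variable contains zeros, whereas `log(0)` is `-Inf`. A small sketch with made-up values:

```{r}
x <- c(0, 1, 10, 100)   # made-up values for illustration

log1p(x)                # same as log(1 + x)
log(1 + x)
all.equal(log1p(x), log(1 + x))

log(0)                  # -Inf
log1p(0)                # 0
```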

Some special functions are particularly useful for working with data frames:

```{r}
is.na(df.small$my.mpg)

sum(is.na(df.small$my.mpg)) # counts the missing values: sum() treats each TRUE as 1

complete.cases(df.small)
sum(complete.cases(df.small))
```
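
Calling `sum()` on the output of `is.na()` works because R converts logical values to numbers when you do arithmetic on them: `TRUE` becomes 1 and `FALSE` becomes 0. A minimal sketch (the `missing.example` vector is made up for illustration):

```{r}
missing.example <- c(21.0, NA, 22.8, NA, 18.7)  # made-up values

is.na(missing.example)
as.numeric(is.na(missing.example))  # TRUE becomes 1, FALSE becomes 0

sum(is.na(missing.example))         # so the sum is a count of the TRUEs
mean(is.na(missing.example))        # and the mean is the proportion of TRUEs
```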

## "Apply" functions and beyond

R has some special functions to help apply operations over vectors, lists, etc. These can seem a little complicated at first, but they are super, super useful.

Most of the base R versions of these have "apply" in the name. There are also alternatives, some created by the same people who created ggplot2; you can read more about them in, for example, the Healy *Data Visualization* book. I will stick to the base R versions here. Please feel free to read about and use the alternatives!

Let's start with an example using the `mtcars` dataset again. The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument:

```{r}
sapply(mtcars, quantile)

lapply(mtcars, quantile) # Same output, different format/class
```
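
To see the "different format/class" part directly, you can store both results and inspect them. This is only a sketch; `s.out` and `l.out` are illustrative names:

```{r}
s.out <- sapply(mtcars, quantile)
l.out <- lapply(mtcars, quantile)

is.matrix(s.out)   # sapply() simplified the results into a matrix
is.list(l.out)     # lapply() always returns a list

# The same numbers live in both objects, so indexing differs accordingly
s.out[, "mpg"]
l.out[["mpg"]]
```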

Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable using either `sapply` or `lapply`?

The `tapply` function allows you to apply functions conditionally. For example, below I find mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) provides an index into the first (`mtcars$mpg`) before the third argument (`mean`) is applied to each of the conditional subsets:

```{r}
tapply(mtcars$mpg, mtcars$cyl, mean)
```
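
One way to picture what `tapply()` is doing is to carry out the two steps separately: first split the values into groups, then apply the function to each group. The sketch below uses `split()` and `sapply()` for this. It illustrates the idea and is not a description of how `tapply()` is implemented internally (`mpg.by.cyl` is an illustrative name):

```{r}
# Step 1: split mpg into one vector per number of cylinders
mpg.by.cyl <- split(mtcars$mpg, mtcars$cyl)
mpg.by.cyl

# Step 2: apply mean() to each of those vectors
sapply(mpg.by.cyl, mean)
```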

Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?

Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.
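
Here is a minimal sketch of `apply()` on a matrix. The matrix `m` is made up for illustration. The second argument says which dimension to work over: 1 for rows, 2 for columns.

```{r}
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
m

apply(m, 1, sum)   # apply sum() to each row
apply(m, 2, sum)   # apply sum() to each column
```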

### Some basic graphs with ggplot2

ggplot2 is what I like to use for plotting so I'll develop examples with it from here on out.

Make sure you've installed the package with `install.packages()` and load it with `library()`.

There is another built-in (automotive) dataset that comes along with the ggplot2 package called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.

I'll develop a few simple examples below. For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics. Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.

```{r}
library(ggplot2)
mpg <- mpg

# First thing, call ggplot() to start building up a plot
# aes() indicates which variables to use as "aesthetic" mappings

p <- ggplot(data=mpg, aes(manufacturer, hwy))

p + geom_boxplot()

# another relationship:
p <- ggplot(data=mpg, aes(fl, hwy))
p + geom_boxplot()
p + geom_violin()
```

Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:

```{r}
p <- ggplot(data=mpg, aes(cty, hwy))

p + geom_point()

# Multivariate graphical displays can get pretty wild
p + geom_point(aes(color=factor(class), shape=factor(cyl)))
```
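
As one example of the many options mentioned above, the `labs()` function adds a title and readable axis and legend labels. This is only a sketch: the label text and the `p.labeled` name are my own choices for illustration, and it assumes ggplot2 is installed.

```{r}
library(ggplot2)

p <- ggplot(data = mpg, aes(cty, hwy))

# Build the plot up in layers, then add labels with labs()
p.labeled <- p +
  geom_point(aes(color = factor(cyl))) +
  labs(
    title = "City vs. highway fuel economy",
    x = "City (miles per gallon)",
    y = "Highway (miles per gallon)",
    color = "Cylinders"
  )

p.labeled
```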