---
title: "Week 2 R lecture"
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Adding comments to your code

Sorry, I realized that I forgot to explain this in last week's R lecture!

R interprets the `#` character and anything that comes after it on the same line as a comment. R will not try to run anything after the `#` as a command:

# This is a comment. The next line is too:
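
As a small illustration (the variable name `x` is made up for this sketch), a comment can sit on its own line or at the end of a line of code, and a command that has been commented out is never run:

```r
# A comment on its own line: R skips it entirely.
x <- 5  # A comment can also follow code on the same line.

# x <- 10   (this assignment is commented out, so it never runs)

x  # still 5
```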

## More advanced variable types:

Factors are for categorical data. You can create them with the `factor()` command or by running `as.factor()` on a character vector.

cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
more.cities <- c("Oakland", "Seattle", "San Diego")
more.cities <- as.factor(more.cities)
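
To see what makes a factor different from a character vector, here is a short sketch (the `sizes` vector is invented for illustration). A factor stores its distinct values as levels, sorted alphabetically by default, and stores each element as an integer code pointing to one of those levels:

```r
# Hypothetical example data, not part of the lecture datasets
sizes <- factor(c("small", "large", "small", "medium"))

levels(sizes)      # the distinct categories, in alphabetical order
as.integer(sizes)  # the integer code behind each element
table(sizes)       # counts per category
```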

Lists are a lot like vectors, but can contain any kind of object (e.g., other variables).

cities.list <- list(cities, more.cities)

We can name the items in the list just like we did for a vector:

names(cities.list) <- c("midwest", "west")
# Equivalently, assign the names when creating the list:
cities.list <- list("midwest" = cities, "west" = more.cities)

You can index into a list much like you can with a vector, except that to pull out an item itself you use two sets of square brackets instead of one (a single set returns a smaller list containing the item):

cities.list[["midwest"]]

With a list you can also index recursively (down into the individual objects contained in the list). For example:

cities.list[["west"]][2]
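
The difference between one and two sets of brackets is easy to check directly. This sketch rebuilds a small list so that it can run on its own (the object name `example.list` is made up here):

```r
example.list <- list(midwest = factor(c("Chicago", "Detroit", "Milwaukee")),
                     west = factor(c("Oakland", "Seattle", "San Diego")))

class(example.list["west"])    # single brackets: still a list (of length 1)
class(example.list[["west"]])  # double brackets: the factor itself

# Recursive indexing: the second item of the "west" factor
as.character(example.list[["west"]][2])
```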

Some functions "just work" on lists. Others fail or produce output that you probably weren't expecting. Be careful and check the output to see what happens:

# summary works as you might hope:
# table produces something very weird:
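
As a sketch of what such a check might look like (the list is rebuilt here under the made-up name `example.list` so the block runs on its own, and the exact calls are an assumption), `summary()` describes each item in the list, while `table()` treats the two factors as variables to cross-tabulate and returns a 3-by-3 grid of counts:

```r
example.list <- list(midwest = factor(c("Chicago", "Detroit", "Milwaukee")),
                     west = factor(c("Oakland", "Seattle", "San Diego")))

# One row per list item: its length, class, and mode
summary(example.list)

# A cross-tabulation of "midwest" against "west", probably not what you wanted
list.table <- table(example.list)
dim(list.table)
```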

Matrices are a little less common for everyday use, but it's good to know they're there and that you can do matrix arithmetic to your heart's content. An example is below. Check out the help documentation for the `matrix()` function for more.

m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
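
A few operations to try with these two matrices are sketched below; the definitions are repeated so the block stands alone. Note that `*` multiplies element by element, while `%*%` is true matrix multiplication and needs the dimensions to conform, which is why the second matrix is transposed with `t()`:

```r
m1 <- matrix(1:12, nrow=4, byrow=FALSE)
m2 <- matrix(seq(2, 24, 2), nrow=4, byrow=FALSE)

dim(m1)          # 4 rows, 3 columns
m1 + m2          # element-wise addition
m1 * m2          # element-wise multiplication
m1 %*% t(m2)     # matrix multiplication: (4 x 3) times (3 x 4) gives 4 x 4
```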

A data frame is a format for storing tabular data. Formally, it consists of a list of vectors of equal length.

For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly. They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: read the help documentation on `faithful` to learn about the dataset!):

faithful <- faithful # This makes the dataset visible in the "Environment" tab in RStudio.
dim(faithful) # often the first thing I do with any data frame
names(faithful) ## try colnames(faithful) too
head(faithful) ## look at the first few rows of data

You can index into a data frame using numeric values or variable names. The notation uses square brackets again and requires you to remember the convention of `[<rows>, <columns>]`:

faithful[1,1] # The item in the first row of the first column
faithful[,2] # all of the items in the second column
faithful[10:20, 2] # ranges work too
faithful[37, "eruptions"] # row 37 of the column named "eruptions"

You can also refer to a column (variable) in a data frame by name with the `$` symbol, which is very useful:

mean(faithful$waiting)
boxplot(faithful$waiting)

Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):

plot(eruptions ~ waiting, data=faithful)
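
The same `~` notation works with other base R plotting functions. As a sketch, here another built-in dataset, `mtcars`, is used to draw one boxplot of gas mileage per cylinder count; read the formula as "Y by X":

```r
# mpg (Y axis) broken out by number of cylinders (X axis)
bp <- boxplot(mpg ~ cyl, data=mtcars)

# boxplot() invisibly returns the statistics it drew; bp$names holds the group labels
bp$names
```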

Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars`. Read the help documentation! It contains a codebook that tells you about each variable. Let's look at that one next:
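
A first look might go something like this sketch, which reuses the functions applied to `faithful` earlier and adds `str()` to summarize the type of every column:

```r
dim(mtcars)     # 32 rows (cars) and 11 columns (variables)
names(mtcars)
head(mtcars)
str(mtcars)     # every column is numeric
# ?mtcars       # opens the help page with the codebook
```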

There are many ways to create and modify data frames. Here is an example playing with the `mtcars` data. I use the `data.frame` command to build a new data frame from three vectors:

my.mpg <- mtcars$mpg
my.cyl <- mtcars$cyl
my.disp <- mtcars$disp
df.small <- data.frame(my.mpg, my.cyl, my.disp)

# recode a value as missing
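
The recoding step itself might look like the following sketch. It works on a separate copy (the name `df.example` and the choice of row 5 are arbitrary, made up for illustration) and uses the special value `NA`, which is how R represents missing data:

```r
# Build a small stand-alone data frame from mtcars
df.example <- data.frame(my.mpg = mtcars$mpg, my.cyl = mtcars$cyl)

df.example[5, "my.mpg"] <- NA   # recode row 5 of my.mpg as missing

mean(df.example$my.mpg)              # NA: missing values propagate by default
mean(df.example$my.mpg, na.rm=TRUE)  # drops the NA before averaging
```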

Creating new variables, recoding, and transformations look very similar to working with vectors. Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here:

df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)
table(df.small$mpg.big)
df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p(x) computes log(1 + x)
## convert a number into a factor:
df.small$my.cyl.factor <- factor(df.small$my.cyl)
summary(df.small$my.cyl.factor)

Some special functions are particularly useful for working with data frames:

is.na(df.small$my.mpg)
sum(is.na(df.small$my.mpg)) # sum() treats TRUE as 1, so this counts the missing values
complete.cases(df.small)
sum(complete.cases(df.small))
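
One common use of `complete.cases()` is to keep only the rows with no missing values. A minimal sketch on a made-up data frame (`df.example` and its contents are invented for illustration):

```r
df.example <- data.frame(x = c(1, NA, 3, 4),
                         y = c("a", "b", NA, "d"))

keep <- complete.cases(df.example)  # TRUE where a row has no NA in any column
keep

df.complete <- df.example[keep, ]   # logical row index, all columns
nrow(df.complete)
```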

## "Apply" functions and beyond

R has some special functions to help apply operations over vectors, lists, etc. These can seem a little complicated at first, but they are super, super useful.

Most of the base R versions of these have "apply" in the name. There are also alternatives, some created by the same people who created ggplot2 (you can read more about them in, for example, the Healy *Data Visualization* book). I will stick to the base R versions here. Please feel free to read about and use the alternatives!

Let's start with an example using the `mtcars` dataset again. The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument:

sapply(mtcars, quantile)
lapply(mtcars, quantile) # Same output, different format/class
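
To see the difference in format, you can inspect what each call returns. In this sketch (the names `s.out` and `l.out` are made up), `sapply()` simplifies the results into a matrix with one column per variable, while `lapply()` always returns a list with one item per variable:

```r
s.out <- sapply(mtcars, quantile)
l.out <- lapply(mtcars, quantile)

is.matrix(s.out)  # TRUE: sapply() simplified the results into a matrix
dim(s.out)        # 5 quantiles (rows) by 11 variables (columns)
is.list(l.out)    # TRUE: lapply() keeps the results in a list
length(l.out)     # 11 items, one per variable

# The same numbers live in both objects
s.out[, "mpg"]
l.out[["mpg"]]
```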

Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable using either `sapply` or `lapply`?

The `tapply` function allows you to apply functions conditionally. For example, below I find mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) splits the first (`mtcars$mpg`) into groups, and then the third argument (`mean`) is applied to each of the conditional subsets:

tapply(mtcars$mpg, mtcars$cyl, mean)
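
One way to see what `tapply()` is doing is to carry out the two steps by hand. This sketch first uses `split()` to break `mtcars$mpg` into one vector per cylinder count and then uses `sapply()` to take the mean of each; the result matches the `tapply()` call (the object names are made up for illustration):

```r
# Step 1: split mpg into groups defined by cyl (a list of three vectors)
mpg.by.cyl <- split(mtcars$mpg, mtcars$cyl)
names(mpg.by.cyl)

# Step 2: apply mean() to each group
by.hand <- sapply(mpg.by.cyl, mean)

# Compare with tapply()
with.tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)
all.equal(as.vector(by.hand), as.vector(with.tapply))
```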

Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?

Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.
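
For the matrix case, a minimal sketch: the second argument to `apply()` is the margin, where `1` means "apply the function to each row" and `2` means "apply it to each column" (the matrix `m` is made up for illustration):

```r
m <- matrix(1:12, nrow=4, byrow=FALSE)

row.sums <- apply(m, 1, sum)    # one sum per row (4 values)
col.means <- apply(m, 2, mean)  # one mean per column (3 values)

row.sums
col.means
```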

### Some basic graphs with ggplot2

ggplot2 is what I like to use for plotting, so I'll develop examples with it from here on out.

Make sure you've installed the package with `install.packages("ggplot2")` and loaded it with `library(ggplot2)`.

The ggplot2 package comes with another built-in (automotive) dataset called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.

I'll develop a few simple examples below. For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics. Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.

library(ggplot2)
# First thing, call ggplot() to start building up a plot
# aes() indicates which variables to use as "aesthetic" mappings
p <- ggplot(data=mpg, aes(manufacturer, hwy))
# another relationship:
p <- ggplot(data=mpg, aes(fl, hwy))
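
On its own, `ggplot()` only sets up the data and mappings; nothing is drawn until you add a layer with `+`. As a sketch of one reasonable choice for these two plots, where a categorical variable is on the X axis, `geom_boxplot()` draws one box per category (whether the lecture used this particular geom is an assumption, and the name `p.box` is made up):

```r
library(ggplot2)

# Highway mpg by manufacturer
p <- ggplot(data=mpg, aes(manufacturer, hwy))
p + geom_boxplot()

# Highway mpg by fuel type
p <- ggplot(data=mpg, aes(fl, hwy))
p.box <- p + geom_boxplot()  # a plot can be stored in a variable...
p.box                        # ...and drawn by printing it
```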

Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:

p <- ggplot(data=mpg, aes(cty, hwy))
# Multivariate graphical displays can get pretty wild
p + geom_point(aes(color=factor(class), shape=factor(cyl)))