r_tutorials/w10-R_tutorial.rmd

   1 ---
   2 title: "Week 10 R tutorial"
   3 author: "Aaron Shaw"
   4 date: "November 10, 2020"
   5 output:
   6   html_document:
   7     toc: yes
   8     toc_depth: 3
   9     toc_float:
  10       collapsed: true
  11       smooth_scroll: true
  12     theme: readable
  13   pdf_document:
  14     toc: yes
  15     toc_depth: '3'
  16   header-includes:
  17   - \newcommand{\lt}{<}
  18   - \newcommand{\gt}{>}
  19 ---
  20
  21 ```{r setup, include=FALSE}
  22 knitr::opts_chunk$set(echo = TRUE)
  23 ```
  24
  25 This week's R tutorial focuses on the basics of correlations and linear regression. I'll work with the `mtcars` dataset that comes built-in with R.
  26 ```{r}
  27 data(mtcars)
  28
  29 ```
  30
  31 ## Correlations and covariance
  32
  33 Calculating correlation coefficients is straightforward: use the `cor()` function:
  34 ```{r}
  35 with(mtcars, cor(mpg, hp))
  36 ```
  37 All you prius drivers out there will be shocked to learn that miles-per-gallon is negatively correlated with horsepower.
  38
  39 The `cor()` function works with two variables or with more—the following generates a correlation matrix for the whole dataset!
  40 ```{r}
  41 cor(mtcars)
  42 ```
  43
  44 Note that if you are calculating correlations with variables that are not distributed normally you should use `cor(method="spearman")` because it calculates rank-based correlations (look it up online for more details).
  45
  46 To calculate covariance, you use the `cov()` function:
  47 ```{r}
  48 with(mtcars, cov(mpg, hp))
  49 ```
  50
  51 While *OpenIntro* does not spend much time on either covariance or correlation, I highly recommend taking the time to review at least the corresponding Wikipedia articles and ideally another statistics textbook) to understand what goes into each one. [^1] A key point to notice/understand is that covariance is calculated in such a way that the magnitude may vary depending on the scale of the variables involved. Correlation is not (every correlation coefficient falls between -1 and 1).
  52
  53 [^1]: Here are those corresponding Wikipedia articles: [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence) and [covariance](https://en.wikipedia.org/wiki/Covariance).
  54
  55 ## Fitting a linear model
  56
  57 Linear models are fit using the `lm()` command. As with `aov()`, the `lm()` function requires a formula as an input and is usually presented with a call to `summary()`. You can enter the formula directly in the call to `lm()` or define it separately. For this example, I'll regress `mpg` on a single predictor, `hp`:
  58 ```{r}
  59 model1 <- lm(mpg ~ hp, data=mtcars)
  60
  61 summary(model1)
  62 ```
  63 Notice how much information the output of `summary()` gives you for a linear model! You have details about the residuals, the usual information about the coefficients, standard errors, t-values, etc., little stars corresponding to conventional significance levels, $R^2$ values, degrees of freedom, F-statistics (remember those?) and p-values for the overall model fit.
  64
  65 There's even more under the hood. Try looking at all the different things in the model object R has created:
  66 ```{r}
  67 names(model1)
  68
  69 ```
  70 You can directly inspect the residuals using `model1$residuals`. This makes plotting and other diagnostic activities pretty straightforward:
  71 ```{r}
  72 summary(model1$residuals)
  73 ```
  74
  75 More on that in a moment. In the meantime, you can also use the items generated by the call to `summary()` as well:
  76 ```{r}
  77 names(summary(model1))
  78 summary(model1)$coefficients
  79 ```
  80
  81 ### CIs around coefficients and predicted values
  82
  83 There are also functions to help you do things with the model such as predict the fitted values for new (unobserved)  data. For example, if I found some new cars with horsepowers ranging from 90-125, what would this model predict for the corresponding mpg for each car?
  84 ```{r}
  85 new.data <- data.frame(hp=seq(90,125,5))
  86 predict(model1, new.data, type="response")
  87 ```
  88 A call to predict can also provide standard errors around these predictions (which you could use, for example, to construct a 95% confidence interval around the model-predicted values):
  89 ```{r}
  90 predict(model1, new.data, type="response", se.fit = TRUE)
  91 ```
  92 Linear model objects also have a built-in method for generating confidence intervals around the values of $\beta$:
  93 ```{r}
  94 confint(model1, "hp", level=0.95) # Note that I provide the variable name in quotes
  95 ```
  96 Feeling old-fashioned? You can always calculate residuals or confidence intervals (or anything else) "by hand":
  97 ```{r}
  98 # Residuals
  99 mtcars$mpg - model1$fitted.values
 100
 101 # 95% CI for the coefficient on horsepower
 102 est <- model1$coefficients["hp"]
 103 se <- summary(model1)$coefficients[2,2]
 104
 105 est + 1.96 * c(-1,1) * se
 106 ```
 107
 108 ### Plotting residuals
 109
 110 You can generate diagnostic plots of residuals in various ways:
 111
 112 ```{r}
 113 hist(residuals(model1))
 114 hist(model1$residuals)
 115 ```
 116
 117 Plot the residuals against the original predictor variable:
 118
 119 ```{r}
 120 library(ggplot2)
 121
 122 qplot(x=mtcars$hp, y=residuals(model1), geom="point")
 123 ```
 124
 125
 126 Quantile-quantile plots can be done using `qqnorm()` on the residuals:
 127 ```{r}
 128 qqnorm(residuals(model1))
 129 ```
 130 The easiest way to generate a few generic diagnostic plots in ggplot is documented pretty well on StackExchange and elsewhere:
 131 ```{r}
 132 library(ggfortify)
 133
 134 autoplot(model1)
 135 ```
 136
 137 ### Adding additional variables (multiple regression—really useful next week)
 138
 139 You can, of course, have models with many variables. This might happen by creating a brand new formula or using a command `update.formula()` to...well, you probably guessed it:
 140 ```{r}
 141 f1 <- formula(mpg ~ hp)
 142
 143 f2 <- formula(mpg ~ hp + disp + cyl + vs)
 144
 145 f2a <- update.formula(f1, . ~ . + disp + cyl + vs) ## Same as f2 above
 146
 147 model2 <- lm(f2, data=mtcars)
 148
 149 summary(model2)
 150 ```
 151 Estimating linear models with predictor variables that are not continuous (numeric or integers) is no problem. Just go for it:
 152 ```{r}
 153 mtcars$cyl <- factor(mtcars$cyl)
 154 mtcars$vs <- as.logical(mtcars$vs)
 155
 156 ## Refit the same model:
 157 model2 <- lm(f2, data=mtcars)
 158 summary(model2)
 159 ```
 160 We'll talk more about how to interpret these results with categorical predictors next week, but for now you can see that R has no trouble handling multiple types or classes of variables in a regression model.
 161
 162 ## Producing nice regression tables
 163 Generating regression tables directly from your statistical software is very important for preventing mistakes and typos. There are many ways to do this and a variety of packages that may be helpful (LaTex users: see [this StackExchange post](https://stackoverflow.com/questions/5465314/tools-for-making-latex-tables-in-r) for a big list).
 164
 165 One especially easy-to-use package that can output text and html (both eminently paste-able into a variety of typesetting/word-processing systems) is called `stargazer`. Here I use it to generate an ASCII table summarizing the two models we've fit in this tutorial.
 166 ```{r}
 167 library(stargazer)
 168
 169 stargazer(model1, model2, type="text")
 170 ```
 171
 172 ## Back to ANOVAs for a moment
 173
 174 You may recall that I mentioned that R actually calls `lm()` when it estimates an ANOVA. As I said before, I'm not going to walk through the details, but an important thing to note is that the F-statistics and the p-values for those F-statistics are identical when you use `aov()` and when you use `lm()`. That means that you already know what hypothesis is being tested there and how to interpret that part of the regression model output.
 175
 176 ```{r}
 177 summary(aov(data=mtcars, mpg ~ factor(cyl)))
 178
 179 summary(lm(data=mtcars, mpg ~ factor(cyl)))
 180 ```
 181
 182