---
title: "Week 8 R Lecture"
author: "Aaron Shaw"
date: "May 16, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This week's R tutorial materials focus on the basics of correlations and linear regressions. I'll work with the mtcars dataset that comes built-in with R.

## Correlations

Calculating correlation coefficients is straightforward: use the cor() function:

```{r}
with(mtcars, cor(mpg, hp))
```

All you Prius drivers out there will be shocked to learn that miles-per-gallon is negatively correlated with horsepower.

The cor() function works with two variables or with more: the following generates a correlation matrix for the whole dataset!

```{r}
cor(mtcars)
```

Note that if you are calculating correlations with variables that are not distributed normally you should use cor(..., method="spearman"), which calculates rank-based correlations (look it up online for more details).

## Fitting a linear model (with one variable)

Linear models are fit using the lm() command. As with aov(), the lm() function requires a formula as an input and is usually presented with a call to summary(). You can enter the formula directly in the call to lm() or define it separately. For this example, I'll regress mpg on a single predictor, hp:

```{r}
model1 <- lm(mpg ~ hp, data=mtcars)
summary(model1)
```

Notice how much information the output of summary() gives you for a linear model! You have details about the residuals, the usual information about the coefficients, standard errors, t-values, etc., little stars corresponding to conventional significance levels, $R^2$ values, degrees of freedom, F-statistics (remember those?), and p-values for the overall model fit.

There's even more under the hood. Try looking at all the different things in the model object R has created:

```{r}
names(model1)
```

You can directly inspect the residuals using model1$residuals. This makes plotting and other diagnostic activities pretty straightforward:
```{r}
summary(model1$residuals)
```

More on that in a moment. In the meantime, you can also work with the items generated by the call to summary():

```{r}
names(summary(model1))
summary(model1)$coefficients
```
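
For example, if you just want the $R^2$ value on its own, it lives in the summary object too:

```{r}
summary(model1)$r.squared
```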
There are also functions to help you do things with the model, such as predicting fitted values for new data. For example, if I found some new cars with horsepower ranging from 90 to 125, what would this model predict for the corresponding mpg of each car?
```{r}
new.data <- data.frame(hp=seq(90,125,5))
predict(model1, new.data, type="response")
```
A call to predict() can also provide standard errors around these predictions (which you could use, for example, to construct a 95% confidence interval around the model-predicted values):
```{r}
predict(model1, new.data, type="response", se.fit = TRUE)
```
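
For instance, here's a quick sketch of how you might turn those standard errors into approximate 95% confidence intervals around the predicted values (using the same 1.96 normal approximation that appears again below):

```{r}
preds <- predict(model1, new.data, type="response", se.fit = TRUE)

# Approximate 95% CI: fitted value plus/minus 1.96 standard errors
data.frame(hp = new.data$hp,
           fit = preds$fit,
           lower = preds$fit - 1.96 * preds$se.fit,
           upper = preds$fit + 1.96 * preds$se.fit)
```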
Linear model objects also have a built-in method for generating confidence intervals around the values of $\beta$:
```{r}
confint(model1, "hp", level=0.95) # Note that I provide the variable name in quotes
```
Feeling old-fashioned? You can always calculate residuals or confidence intervals (or anything else) "by hand":
```{r}
# Residuals
mtcars$mpg - model1$fitted.values

# 95% CI for the coefficient on horsepower
est <- model1$coefficients["hp"]
se <- summary(model1)$coefficients[2,2]

est + 1.96 * c(-1,1) * se
```
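
The 1.96 here is a normal approximation; confint() actually uses the t distribution with the model's residual degrees of freedom, which is why its answer will differ very slightly. If you want to match it exactly:

```{r}
t.crit <- qt(0.975, df = model1$df.residual)
est + t.crit * c(-1, 1) * se
```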
## Plotting residuals

You can generate diagnostic plots of residuals in various ways:
```{r}
hist(residuals(model1))
hist(model1$residuals)
```

Plot the residuals against the original predictor variable:

```{r}
library(ggplot2)
qplot(x=mtcars$hp, y=residuals(model1), geom="point")
```
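
If you prefer the regular ggplot() syntax to qplot(), the same plot looks like this:

```{r}
ggplot(data.frame(hp = mtcars$hp, resid = residuals(model1)),
       aes(x = hp, y = resid)) +
  geom_point()
```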
Quantile-quantile plots can be done using qqnorm() on the residuals:
```{r}
qqnorm(residuals(model1))
```
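
If you want a reference line to compare the points against, qqline() will add one (it needs to go in the same chunk as the qqnorm() call so both draw on the same plot):

```{r}
qqnorm(residuals(model1))
qqline(residuals(model1))
```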
The easiest way to generate a few generic diagnostic plots with ggplot is the ggfortify package, and the approach is documented pretty well on StackExchange and elsewhere:
```{r}
library(ggfortify)
autoplot(model1)
```
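
For what it's worth, base R produces a very similar set of diagnostic plots without any extra packages if you just call plot() on the model object:

```{r}
plot(model1)
```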
## Adding additional variables (multiple regression, really useful next week)
You can, of course, have models with many variables. You might do this by creating a brand new formula or by using the update.formula() command to...well, you probably guessed it:
```{r}
f1 <- formula(mpg ~ hp)
f2 <- formula(mpg ~ hp + disp + cyl + vs)
f2a <- update.formula(f1, . ~ . + disp + cyl + vs) ## Same as f2 above

model2 <- lm(f2, data=mtcars)
summary(model2)
```
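
If you want to double check that update.formula() really did produce the same formula we wrote out by hand, you can just print both:

```{r}
f2
f2a
```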
Estimating linear models with predictor variables that are not continuous (numeric or integers) is no problem. Just go for it:
```{r}
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- as.logical(mtcars$vs)

## Refit the same model:
model2 <- lm(f2, data=mtcars)
summary(model2)
```
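
If you want to confirm how R now sees these variables, you can inspect their classes (and, for the factor, its levels):

```{r}
class(mtcars$cyl)
levels(mtcars$cyl)
class(mtcars$vs)
```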
We'll talk more about how to interpret these results with categorical predictors next week, but for now you can see that R has no trouble handling multiple types or classes of variables in a regression model.
## Producing nice regression tables
Generating regression tables directly from your statistical software is very important for preventing mistakes and typos. There are many ways to do this and a variety of packages that may be helpful (LaTeX users: see [this StackExchange post](https://stackoverflow.com/questions/5465314/tools-for-making-latex-tables-in-r) for a big list).

One especially easy-to-use package that can output text and HTML (both eminently paste-able into a variety of typesetting/word-processing systems) is called stargazer. Here I use it to generate an ASCII table summarizing the two models we've fit in this tutorial.
```{r}
library(stargazer)
stargazer(model1, model2, type="text")
```
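
If you want HTML instead (for pasting into a word processor or web page), just change the type argument; stargazer can also write the table straight to a file via its out argument. A quick sketch (not evaluated here, and the filename is just an example):

```{r, eval=FALSE}
# Writes an HTML version of the same table to a file of your choosing
stargazer(model1, model2, type="html", out="regression-table.html")
```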
## Back to ANOVAs for a moment

You may recall that I mentioned that R actually calls lm() when it estimates an ANOVA. As I said before, I'm not going to walk through the details, but an important thing to note is that the F-statistics and the p-values for those F-statistics are identical when you use aov() and when you use lm(). That means that you already know what hypothesis is being tested there and how to interpret that part of the regression model output.
```{r}
summary(aov(data=mtcars, mpg ~ factor(cyl)))
summary(lm(data=mtcars, mpg ~ factor(cyl)))
```
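
If you want to see the equivalence even more directly, anova() applied to the lm() fit reproduces the classic ANOVA table, with the same F-statistic and p-value:

```{r}
anova(lm(mpg ~ factor(cyl), data=mtcars))
```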