X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/9d912e87ec9ee97b3ffb0e3b7dc8b2361ccca9f2..43106301b566a661e090a0b173aa52cea1bd1ae7:/r_tutorials/w10-R_tutorial.rmd diff --git a/r_tutorials/w10-R_tutorial.rmd b/r_tutorials/w10-R_tutorial.rmd new file mode 100644 index 0000000..bc92a06 --- /dev/null +++ b/r_tutorials/w10-R_tutorial.rmd @@ -0,0 +1,182 @@ +--- +title: "Week 10 R tutorial" +author: "Aaron Shaw" +date: "November 10, 2020" +output: + html_document: + toc: yes + toc_depth: 3 + toc_float: + collapsed: true + smooth_scroll: true + theme: readable + pdf_document: + toc: yes + toc_depth: '3' + header-includes: + - \newcommand{\lt}{<} + - \newcommand{\gt}{>} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +This week's R tutorial focuses on the basics of correlations and linear regression. I'll work with the `mtcars` dataset that comes built-in with R. +```{r} +data(mtcars) + +``` + +## Correlations and covariance + +Calculating correlation coefficients is straightforward: use the `cor()` function: +```{r} +with(mtcars, cor(mpg, hp)) +``` +All you prius drivers out there will be shocked to learn that miles-per-gallon is negatively correlated with horsepower. + +The `cor()` function works with two variables or with more—the following generates a correlation matrix for the whole dataset! +```{r} +cor(mtcars) +``` + +Note that if you are calculating correlations with variables that are not distributed normally you should use `cor(method="spearman")` because it calculates rank-based correlations (look it up online for more details). + +To calculate covariance, you use the `cov()` function: +```{r} +with(mtcars, cov(mpg, hp)) +``` + +While *OpenIntro* does not spend much time on either covariance or correlation, I highly recommend taking the time to review at least the corresponding Wikipedia articles and ideally another statistics textbook) to understand what goes into each one. [^1] A key point to notice/understand is that covariance is calculated in such a way that the magnitude may vary depending on the scale of the variables involved. Correlation is not (every correlation coefficient falls between -1 and 1). + +[^1]: Here are those corresponding Wikipedia articles: [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence) and [covariance](https://en.wikipedia.org/wiki/Covariance). + +## Fitting a linear model + +Linear models are fit using the `lm()` command. As with `aov()`, the `lm()` function requires a formula as an input and is usually presented with a call to `summary()`. You can enter the formula directly in the call to `lm()` or define it separately. For this example, I'll regress `mpg` on a single predictor, `hp`: +```{r} +model1 <- lm(mpg ~ hp, data=mtcars) + +summary(model1) +``` +Notice how much information the output of `summary()` gives you for a linear model! You have details about the residuals, the usual information about the coefficients, standard errors, t-values, etc., little stars corresponding to conventional significance levels, $R^2$ values, degrees of freedom, F-statistics (remember those?) and p-values for the overall model fit. + +There's even more under the hood. Try looking at all the different things in the model object R has created: +```{r} +names(model1) + +``` +You can directly inspect the residuals using `model1$residuals`. This makes plotting and other diagnostic activities pretty straightforward: +```{r} +summary(model1$residuals) +``` + +More on that in a moment. In the meantime, you can also use the items generated by the call to `summary()` as well: +```{r} +names(summary(model1)) +summary(model1)$coefficients +``` + +### CIs around coefficients and predicted values + +There are also functions to help you do things with the model such as predict the fitted values for new (unobserved) data. For example, if I found some new cars with horsepowers ranging from 90-125, what would this model predict for the corresponding mpg for each car? +```{r} +new.data <- data.frame(hp=seq(90,125,5)) +predict(model1, new.data, type="response") +``` +A call to predict can also provide standard errors around these predictions (which you could use, for example, to construct a 95% confidence interval around the model-predicted values): +```{r} +predict(model1, new.data, type="response", se.fit = TRUE) +``` +Linear model objects also have a built-in method for generating confidence intervals around the values of $\beta$: +```{r} +confint(model1, "hp", level=0.95) # Note that I provide the variable name in quotes +``` +Feeling old-fashioned? You can always calculate residuals or confidence intervals (or anything else) "by hand": +```{r} +# Residuals +mtcars$mpg - model1$fitted.values + +# 95% CI for the coefficient on horsepower +est <- model1$coefficients["hp"] +se <- summary(model1)$coefficients[2,2] + +est + 1.96 * c(-1,1) * se +``` + +### Plotting residuals + +You can generate diagnostic plots of residuals in various ways: + +```{r} +hist(residuals(model1)) +hist(model1$residuals) +``` + +Plot the residuals against the original predictor variable: + +```{r} +library(ggplot2) + +qplot(x=mtcars$hp, y=residuals(model1), geom="point") +``` + + +Quantile-quantile plots can be done using `qqnorm()` on the residuals: +```{r} +qqnorm(residuals(model1)) +``` +The easiest way to generate a few generic diagnostic plots in ggplot is documented pretty well on StackExchange and elsewhere: +```{r} +library(ggfortify) + +autoplot(model1) +``` + +### Adding additional variables (multiple regression—really useful next week) + +You can, of course, have models with many variables. This might happen by creating a brand new formula or using a command `update.formula()` to...well, you probably guessed it: +```{r} +f1 <- formula(mpg ~ hp) + +f2 <- formula(mpg ~ hp + disp + cyl + vs) + +f2a <- update.formula(f1, . ~ . + disp + cyl + vs) ## Same as f2 above + +model2 <- lm(f2, data=mtcars) + +summary(model2) +``` +Estimating linear models with predictor variables that are not continuous (numeric or integers) is no problem. Just go for it: +```{r} +mtcars$cyl <- factor(mtcars$cyl) +mtcars$vs <- as.logical(mtcars$vs) + +## Refit the same model: +model2 <- lm(f2, data=mtcars) +summary(model2) +``` +We'll talk more about how to interpret these results with categorical predictors next week, but for now you can see that R has no trouble handling multiple types or classes of variables in a regression model. + +## Producing nice regression tables +Generating regression tables directly from your statistical software is very important for preventing mistakes and typos. There are many ways to do this and a variety of packages that may be helpful (LaTex users: see [this StackExchange post](https://stackoverflow.com/questions/5465314/tools-for-making-latex-tables-in-r) for a big list). + +One especially easy-to-use package that can output text and html (both eminently paste-able into a variety of typesetting/word-processing systems) is called `stargazer`. Here I use it to generate an ASCII table summarizing the two models we've fit in this tutorial. +```{r} +library(stargazer) + +stargazer(model1, model2, type="text") +``` + +## Back to ANOVAs for a moment + +You may recall that I mentioned that R actually calls `lm()` when it estimates an ANOVA. As I said before, I'm not going to walk through the details, but an important thing to note is that the F-statistics and the p-values for those F-statistics are identical when you use `aov()` and when you use `lm()`. That means that you already know what hypothesis is being tested there and how to interpret that part of the regression model output. + +```{r} +summary(aov(data=mtcars, mpg ~ factor(cyl))) + +summary(lm(data=mtcars, mpg ~ factor(cyl))) +``` + +