---
title: "Week 11 R tutorial"
author: "Aaron Shaw"
date: "November 24, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: true
      smooth_scroll: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Leftovers!

There is a whole lot related to regression analysis and interpretation that we will not have a chance to cover in this course, but I wanted to write out a brief R tutorial on some topics that are just beyond the reach of the material introduced by our textbook and problem sets. If you have any interest/need to apply these kinds of techniques in your final projects, I hope these examples will help and give you a sense of where/how to direct your questions. As always, please get in touch when you run into trouble.

## Import data for the examples

For all of the examples this week, I'll work with the `population.tsv` dataset from back in [Week 6](https://communitydata.science/~ads/teaching/2020/stats/data/week_06/). The following lines of code load it and clean it up a bit. Since we've worked with this dataset before, I will not revisit the process of "getting to know" it.

```{r Import and cleanup data}
d <- read.delim(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_06/population.tsv"))

d$j <- as.logical(d$j)
d$l <- as.logical(d$l)
d$k <- factor(d$k,
              levels=c(1,2,3),
              labels=c("some", "lots", "all"))

d <- d[complete.cases(d),]
```

## Transformations

It is often necessary to transform variables for modeling (we'll discuss some reasons why in class). Here are some common transformations with example code to perform them in R. Keep in mind that you also know some others (e.g., you know how to standardize a variable or take the square root).

### Interaction terms

Interaction terms are often best handled using the `I()` function (note the capitalization). You can also incorporate interactions by multiplying the two variables involved directly within the model formula. In the example below, I start with a "base" model and then update it to add the interaction term.
```{r}
m.base <- formula(y ~ x + j)
summary(lm(m.base, data=d))

m.i <- update.formula(m.base, . ~ . + I(x*j))
summary(lm(m.i, data=d))
```

Evaluating whether interaction terms are important or improve the fit of your model is a topic best left out of this discussion for now. That said, if you need to include an interaction term, you now have a basic idea of how to do it. Interpreting interaction terms is generally best done using model-predicted values. For example, in this case where x is continuous and j is dichotomous, you might generate a "hypothetical" dataset incorporating the range of observed values for x at each of the two values of j (that would yield a "fake" data frame with $n \times 2$ rows). You can then plot the predicted values of y for each of the j categories over the x distribution (imagine: a plot with x on the x-axis, y on the y-axis, and lines of different colors corresponding to the different levels of j).[^1]

[^1]: An okay example of this appears in [this paper](https://doi.org/10.1093/joc/jqx003) that I worked on a few years ago.

Needless to say, there's a lot more to be said about interactions. You can read more [in this econometrics textbook](https://www.econometrics-with-r.org/8-3-interactions-between-independent-variables.html), which seems to have pretty thorough coverage of the fundamentals as well as example R code.

### Polynomial (square, cube, etc.) terms

Polynomial terms can easily be created using `I()` as well. You can also create a new variable in your data called `x.2` or something like that by writing `x.2 <- x^2` if that seems simpler. Either way, you'll wind up adding that term to your model formula (either by creating a new formula or using the `update.formula()` function as I do here).

```{r}
m.poly <- update.formula(m.base, . ~ . + I(x^2))
summary(lm(m.poly, data=d))
```

Need higher order polynomials? Try including `I(x^3)` and so on...
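As a quick sanity check on these squared terms, a polynomial created with `I()` fits exactly the same model as a "raw" polynomial from `poly()`. The chunk below is a self-contained sketch using simulated data (the variables mimic x, j, and y above, but the numbers are made up), not the `d` dataset:

```{r}
set.seed(42)
## toy data standing in for d; the values are made up
fake <- data.frame(x = rnorm(100), j = rbinom(100, 1, 0.5) == 1)
fake$y <- 1 + 2 * fake$x + 0.5 * fake$x^2 + rnorm(100)

fit.I   <- lm(y ~ j + x + I(x^2), data = fake)
fit.raw <- lm(y ~ j + poly(x, 2, raw = TRUE), data = fake)

## same model, same fitted values (only the coefficient labels differ)
all.equal(fitted(fit.I), fitted(fit.raw))  ## should print TRUE
```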
Creating polynomials this way is intuitive, but it can also create a bit of a messy situation for reasons that go beyond the scope of our course (look up "orthogonalized polynomials" online to learn more). In those circumstances, the `poly()` function is useful, but potentially confusing. Generally speaking, creating polynomials this way affects the interpretation as well as the model estimates, so you should only use it if you need to and once you've taken the time to actually learn what is happening. That said, here's what the code could look like:

```{r}
m.poly2 <- formula(y ~ j + poly(x,2))
summary(lm(m.poly2, data=d))
```

Higher order (to the nth degree) terms can be created by passing higher values of n as the second argument (e.g., `poly(x, n)`).

### Log-transformations

We covered log transformations (usually natural logarithms) before, but just in case, here they are again. I usually default to using `log1p()` because it is less prone to fail in the event your data (like most of mine) contain many zeroes. That said, if you have a lot of -1 values you may need something else:

```{r}
m.log <- update.formula(m.base, . ~ log1p(x) + j)
summary(lm(m.log, data=d))
```

Keep in mind that you can use other bases for your logarithmic transformations. Check out the documentation for `log()` for more information.

## Why model-predicted values?

When you report the results of a regression model, you should provide a table summarizing the model as well as some interpretation that renders the model results back into the original, human-intelligible measures and units specific to the study.

This was covered in one of the [resources](https://communitydata.science/~ads/teaching/2020/stats/r_tutorials/logistic_regression_interpretation.html) I distributed last week (the handout on logistic regression), but I wanted to bring it back because it is **important**. Please revisit that handout to see a worked example that walks through the process.
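For a linear model, the same predicted-values workflow sketched earlier (a "hypothetical" dataset covering the range of x at each value of j, fed to `predict()`) might look like this. The chunk is self-contained and uses simulated stand-ins for x, j, and y rather than the `d` dataset, so the numbers are purely illustrative:

```{r}
set.seed(7)
## simulated stand-ins for x, j, and y; the values are made up
fake <- data.frame(x = rnorm(200), j = rbinom(200, 1, 0.5) == 1)
fake$y <- 1 + 2 * fake$x + 3 * fake$j + 1.5 * fake$x * fake$j + rnorm(200)

fit <- lm(y ~ x + j + I(x * j), data = fake)

## "hypothetical" data: the observed range of x at each value of j
fake.grid <- expand.grid(x = seq(min(fake$x), max(fake$x), length.out = 50),
                         j = c(FALSE, TRUE))
fake.grid$y.hat <- predict(fit, newdata = fake.grid)

## one line per level of j; the diverging slopes reflect the interaction
plot(y.hat ~ x, data = fake.grid, type = "n", ylab = "predicted y")
lines(y.hat ~ x, data = fake.grid[!fake.grid$j, ], col = "blue")
lines(y.hat ~ x, data = fake.grid[fake.grid$j, ], col = "red")
```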
The rest of this section is a bit of a rant about why you should bother to do so.

When is a regression table not enough? In textbook/homework examples, this is not an issue, but with real data it matters all the time. Recall that the coefficient estimated for any single predictor is the expected change in the outcome for a 1-unit change in the predictor *holding all the other predictors constant.* At what value are those other predictors held constant? Algebraically, the best answer is zero![^2] This is unlikely to be the most helpful or intuitive way to understand your estimates (for example, if you have a dichotomous predictor, what does holding it at zero mean?). Once your models get even a little bit complicated (quick, exponentiate a log-transformed value and tell me what it means!), the regression-table-alone approach becomes arguably worse than useless.

[^2]: Think about it this way: an ordinary least squares regression equation can be written $\hat{y} = \alpha + \beta_1x_1 + \beta_2x_2$. When I want to talk about the change in $\hat{y}$ associated with a one-unit change in $x_1$, I am essentially taking $x_2$ out of the equation. In other words, I am acting as if $x_2=0$ (and as if $\alpha = 0$, although that's a separate topic) for the purposes of interpreting the $\beta_1$ coefficient directly.
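The point in the footnote can be made concrete with a few lines of R. In a purely additive linear model, the predicted change in the outcome for a one-unit change in one predictor is the same no matter what fixed value the other predictor is held at; the trouble starts once interactions or transformations enter the model. This sketch uses simulated data (the variable names are made up for illustration):

```{r}
set.seed(1)
## made-up data with two continuous predictors
ex <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
ex$y <- 2 + 3 * ex$x1 - ex$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = ex)

## predicted change in y for a one-unit change in x1,
## holding x2 at 0 and then at 5:
p.at.0 <- predict(fit, newdata = data.frame(x1 = c(0, 1), x2 = 0))
p.at.5 <- predict(fit, newdata = data.frame(x1 = c(0, 1), x2 = 5))

diff(p.at.0)     ## equals the estimated coefficient on x1
diff(p.at.5)     ## same value again
coef(fit)["x1"]
```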