psets/pset8-worked_solution.rmd

   1 ---
   2 title: "Problem set 8: Worked solutions"
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "November 23, 2020"
   6 output:
   7   pdf_document:
   8     toc: yes
   9     toc_depth: '3'
  10   html_document:
  11     toc: yes
  12     toc_depth: 3
  13     toc_float:
  14       collapsed: true
  15       smooth_scroll: true
  16     theme: readable
  17   header-includes:
  18   - \newcommand{\lt}{<}
  19   - \newcommand{\gt}{>}
  20 ---
  21
  22 ```{r setup, include=FALSE}
  23 knitr::opts_chunk$set(echo = TRUE, message=FALSE, tidy='styler')
  24 ```
  25
  26 # Import libraries
  27
  28 I'll start by loading some useful libraries...
  29 ```{r dependencies}
  30 library(openintro)
  31 library(tidyverse)
  32 library(ggfortify)
  33 library(haven)
  34 ```
  35
  36 # Part I: Mario kart replication/extension
  37
  38 ## Import data and setup
  39 ```{r}
  40 data(mariokart)
  41
  42 mariokart
  43 ```
  44
  45 To make things a bit easier to manage, I'll select the variables I want to use in the analysis and do some cleanup. Note that I conver the `cond_new` and `stock_photo` variables to logical first (using boolean comparisons) and also coerce them to be numeric values (using `as.numeric()`). This results in 1/0 values corresponding to the observations shown/described in Figure 9.13 and 9.14 on p. 365 of the textbook.
  46
  47 ```{r}
  48 mariokart <- mariokart %>%
  49   select(
  50     price = total_pr, cond_new = cond, stock_photo, duration, wheels
  51   ) %>% mutate(
  52     cond_new = as.numeric(cond_new == "new"),
  53     stock_photo = as.numeric(stock_photo == "yes")
  54   )
  55
  56 mariokart
  57 ```
  58
  59 Now let's look at the variables in our model. Summary statistics for each, univariate density plots for the continuous measures, and some bivariate plots w each predictor and the outcome variable.
  60
  61 ```{r}
  62 summary(mariokart)
  63
  64 sd(mariokart$price)
  65 sd(mariokart$duration)
  66 sd(mariokart$wheels)
  67
  68 qplot(data=mariokart, x=price, geom="density")
  69 qplot(data=mariokart, x=duration, geom="density")
  70 qplot(data=mariokart, x=wheels, geom="density")
  71
  72 ggplot(data=mariokart, aes(cond_new, price)) + geom_boxplot()
  73 ggplot(data=mariokart, aes(stock_photo, price)) + geom_boxplot()
  74 ggplot(data=mariokart, aes(duration, price)) + geom_point()
  75 ggplot(data=mariokart, aes(wheels, price)) + geom_point()
  76
  77 ```
  78
  79 I'm also going to calculate correlation coefficients for all of the variables. We can discuss what to make of this in class:
  80
  81 ```{r}
  82 cor(mariokart)
  83 ```
  84
  85 ## Replicate model results
  86
  87 The description of the model
  88 ```{r}
  89 model <- lm(price ~ cond_new + stock_photo + duration + wheels, data=mariokart)
  90 summary(model)
  91 ```
  92
  93 While I'm looking at that, I'll go ahead and calculate a confidence interval around the only parameter for which the model rejects the null hypothesis (`wheels`):
  94
  95 ```{r}
  96 confint(model, "wheels")
  97 ```
  98
  99
 100 Overall, this resembles the model results in Figure 9.15, but notice that it's different in a number of ways! Without more information from the authors of the textbook it's very hard to determine exactly where or why the differences emerge.
 101
 102 ## Assess model fit/assumptions
 103
 104 I've already generated a bunch of univariate and bivariate summaries and plots. Let's inspect the residuals more closely to see what else we can learn. I'll use the `autoplot()` function from the `ggfortify` package to help me do this.
 105
 106 ```{r}
 107 autoplot(model)
 108 ```
 109
 110 Overall, there are a number of issues with this model fit that I'd want to mention/consider:
 111 * The distribution of the dependent variable (`price`) is very skewed, with two extreme outliers. I'd recommend trying some transformations to see if it would look more appropriate for a linear regression and/or inspecting the cases that produced the outlying values more closely to understand what's happening there.
 112 * The plots of the residuals reveal those same two outlying points are also outliers with respect to the line of best fit. That said, they are not exerting huge amounts of leverage on the estimates, so it's possible that the estimates from the fitted model wouldn't change much without those two points. Indeed, based on the degrees of freedom reported in Figure 9.15 (136) vs. the number reported in our version of the model (138) my best guess is that the textbook authors silently dropped those two outlying observations from their model!
 113
 114 More out of curiosity than anything, I'll create a version of the model that drops the two largest values of price. From the plots, I can see that those two are the only ones above \$100, so I'll use that information here:
 115
 116 ```{r}
 117 summary(lm(price ~ cond_new + stock_photo + duration + wheels, data = mariokart[mariokart$price < 100,]))
 118 ```
 119
 120 What do you know. That was it.
 121
 122 ## Interpret some results
 123
 124 The issues above notwithstanding, we can march ahead and interpret the model results. Here are some general comments and some specifically focused on the `cond_new` and `stock_photo` variables:
 125 * Overall, the linear model regressing total auction price on condition, stock photo, duration, and number of Wii wheels shows evidence of a positive, significant relationship between number of wheels and price. According to this model fit, an increase of 1 wheel is associated with a total auction price increase of $10 with the 95% confindence interval of (\$4.57-\$15.32).
 126 * The point estimate for selling a new condition game is positive, but with a large standard error. As a result, the model fails to reject the null of no association and provides no evidence of any relationship between the game condition and auction price.
 127 * The point estimate for including a stock photo is negative, but again, the standard error is very large and the model fails to reject the null hypothesis. There is no evidence of any relationship between including a stock photo and the final auction price.
 128
 129 ## Recommendations
 130
 131 Based on this model result, I'd recommend the prospective vendor of a **used** copy of the game not worry about it too much unless they can get their hands on some extra Wii wheels, since it seems like the number of Wii wheels explains variations in the auction price outcome more than anything else they can control (such as whether or not a stock photo is included).
 132
 133 # Part II: Hypothetical study
 134
 135 ## Import and explore
 136
 137 I'll start off by just importing things and summarizing the different variables we care about here:
 138 ```{r}
 139 grads <- readRDS(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_11/grads.rds"))
 140
 141 summary(grads)
 142
 143 table(grads$cohort)
 144 sd(grads$gpa)
 145 sd(grads$income)
 146
 147 qplot(data=grads, x=gpa, geom="density")
 148 qplot(data=grads, x=income, geom="density")
 149
 150 ggplot(data=grads, aes(x=gpa, y=income)) + geom_point()
 151 ```
 152
 153 I'll also calculate some summary statistics and visual comparisons within cohorts:
 154
 155 ```{r}
 156
 157 tapply(grads$income, grads$cohort, summary)
 158 tapply(grads$income, grads$cohort, sd)
 159
 160 tapply(grads$gpa, grads$cohort, summary)
 161 tapply(grads$gpa, grads$cohort, sd)
 162 ```
 163
 164 Huh. Those are remarkably similar values for the group means and the group standard deviations...
 165
 166 Onwards to plotting:
 167 ```{r}
 168 ggplot(data=grads, aes(x=cohort, y=income)) + geom_boxplot()
 169
 170 ggplot(data=grads, aes(x=gpa, y=income, color=cohort)) + geom_point()
 171 ```
 172
 173 Those plots are also a little strange. I know this is just a simulated analysis, but it still seems weird that overall it just looks like a big mass of random points, but when I add the colors by cohort, I can see there are some lines and other regularities within groups. I wonder what happens when I plot each scatter within cohorts?
 174
 175 ```{r}
 176 ggplot(data=grads, aes(x=gpa, y=income, color=cohort)) + geom_point() + facet_wrap(vars(cohort))
 177 ```
 178
 179 Hmmm. That's...absurd (in particular, cohort 8 looks like a sideways dinosaur). At this point, if I were really working as a consultant on this project, I would write to the client and start asking some uncomfortable questions about data quality (who collected this data? how did it get recorded/stored/etc.? what quality controls were in place?). I would also feel obligated to tell them that there's just no way the data correspond to the variables they think are here. If you did that and the client was honest, they might tell you [where the data actually came from](https://www.autodesk.com/research/publications/same-stats-different-graphs).
 180
 181 In the event that you marched ahead with the analysis and are curious about what that could have looked like, I've provided some example code below. That said, *this is a situation where the assumptions and conditions necessary to identify ANOVA, t-tests, or regression are all pretty broken* because the data was generated programmatically in ways that undermine the kinds of interpretation you've been asked to make. The best response here (IMHO) is to abandon these kinds of analysis once you discover that there's something systematically weird going on. The statistical procedures will "work" in the sense that they will return a result, but because those results aren't even close to meaningful, any relationships you do observe in the data reflect something different than the sorts of relationships the statistical procedures were designed to identify.
 182
 183 ```{r}
 184 summary(aov(income ~ cohort, data=grads))  # no global differences of means across groups
 185
 186 summary(grads.model <-  lm(income ~ gpa + cohort, data=grads)) # gpa percentile has a small, negative association with income.
 187
 188 confint(grads.model, "gpa") # 95% confidence interval
 189
 190 ```
 191
 192 Note that the failure to reject the null of any association between district and income in the ANOVA does not provide conclusive evidence that the relationship between GPA and income does not vary by cohort. There were several things you might have done here. One is to calculate correlation coefficients within groups. Here's some tidyverse code that does that:
 193
 194 ```{r}
 195 grads %>%
 196   group_by(cohort) %>%
 197   summarize(
 198     correlation = cor(income, gpa)
 199   )
 200 ```
 201 Because these correlation coefficients are nearly identical, I would likely end my analysis here and conclude that the correlation between gpa and income appears to be consistently small and negative. If you wanted to go further, you could theoretically calculate an interaction term in the model (by including `I(gpa*cohort)` in the model formula), but the analysis up to this point gives no indication that you'd be likely to find much of anything (and we haven't really talked about interactions yet).
 202
 203 # Part III: Trick or treating again
 204 ## Import and update data
 205
 206 ```{r import}
 207 ## reminder that the "read_dta()" function requires the "haven" library
 208
 209 df <- read_dta(url('https://communitydata.science/~ads/teaching/2020/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta'))
 210
 211 df <- df %>%
 212   mutate(
 213     obama = as.logical(obama),
 214     fruit = as.logical(fruit),
 215     year = as.factor(year),
 216     male = as.logical(male),
 217     age = age-mean(age, na.rm=T)
 218   )
 219
 220 df
 221 ```
 222
 223 Let's fit and summarize the model:
 224 ```{r}
 225 f <- formula(fruit ~ obama + age + male + year)
 226
 227 fit <- glm(f, data=df, family=binomial("logit"))
 228
 229 summary(fit)
 230 ```
 231
 232 Interesting. Looks like adjusting for these other variables in a regression setting can impact the results.
 233
 234 Onwards to generating more interpretable results:
 235
 236 ```{r}
 237 ## Odds ratios (exponentiated log-odds!)
 238 exp(coef(fit))
 239 exp(confint(fit))
 240
 241 ## model-predicted probabilities for prototypical observations:
 242 fake.kids = data.frame(
 243   obama = rep(c(FALSE, TRUE), 2),
 244   year = factor(rep(c("2015", "2012"), 2)),
 245   age = rep(c(9, 7), 2),
 246   male= rep(c(FALSE, TRUE), 2)
 247 )
 248
 249 fake.kids.pred <- cbind(fake.kids, pred.prob = predict(fit, fake.kids, type="response"))
 250
 251 fake.kids.pred
 252
 253
 254 ```
 255
 256 Note that [this UCLA logit regression tutorial](https://stats.idre.ucla.edu/r/dae/logit-regression/) also contains example code to help extract standard errors and confidence intervals around these predicted probabilities. You were not asked to produce them here, but if you'd like an example here you go (I can try to clarify in class):
 257
 258 ```{r}
 259 fake.kids.more.pred <- cbind(fake.kids,
 260                       predict(fit, fake.kids, type= "link", se=TRUE))
 261
 262 within(fake.kids.more.pred, {
 263   pred.prob = plogis(fit);
 264   lower = plogis(fit - (1.96*se.fit));
 265   upper = plogis(fit + (1.96*se.fit))
 266 } )
 267 ```
 268
 269 ## Sub-group analysis
 270
 271 ```{r}
 272 f2 <- formula(fruit ~ obama + age + male)
 273
 274 summary( glm(f2, data=df[df$year == "2012",], family=binomial("logit")))
 275 summary( glm(f2, data=df[df$year == "2014",], family=binomial("logit")))
 276 summary( glm(f2, data=df[df$year == "2015",], family=binomial("logit")))
 277 ```
 278
 279 Interesting. The treatment effect seems to emerge overwhelmingly within a single year of the data.
 280