problem_sets/week_08/ps8-worked-solutions.Rmd

   1 ---
   2 title: "Week 8 problem set: Worked solutions"
   3 author: "Jeremy Foote & Aaron Shaw"
   4 date: "May 23, 2019"
   5 output: html_document
   6 ---
   7
   8 ```{r setup, include=FALSE}
   9 knitr::opts_chunk$set(echo = TRUE)
  10 ```
  11 ## Programming Challenges
  12
  13 ```{r}
  14 w3 <- read.csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/group_08.csv"))
  15
  16 ## Cleanup and recoding
  17 w3$j <- as.logical(w3$j)
  18 w3$l <- as.logical(w3$l)
  19
  20 w3$k <- factor(w3$k,
  21                levels=c(0,1,2,3),
  22                labels=c("none", "some", "lots", "all"))
  23
  24 summary(w3)
  25
  26 ```
  27
  28
  29 ### PC2
  30
  31 ```{r}
  32 t.test(w3$x, w3$y)
  33 ```
  34 ### PC3
  35
  36 ```{r}
  37 cor(w3$x, w3$y, use="complete.obs")
  38 ```
  39
  40 ### PC4
  41
  42 ```{r}
  43 f <- formula(y ~ x)
  44 m <- lm(f, data=w3) # 'lm' stands for 'linear model', and is used for linear regression
  45
  46 summary(m)
  47 ```
  48 ### PC5
  49
  50 ```{r}
  51 ## Using base R graphing tools
  52 plot(m$residuals)
  53 hist(m$residuals)
  54 plot(m$fitted.values, m$residuals)
  55 qqnorm(m$residuals)
  56
  57 library(ggplot2)
  58 library(ggfortify)
  59
  60 autoplot(m)
  61 ```
  62
  63 ### PC6
  64
  65 ```{r}
  66 library(stargazer)
  67
  68 stargazer(m, type="text")
  69 ```
  70
  71 ### PC7
  72
  73 ```{r}
  74 library(readstata13)
  75 df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta')
  76 df <- df[complete.cases(df),] # Remove rows with NA
  77
  78 f1 <- formula(fruit ~ obama) # Create a formula for the regression
  79
  80 m1 <- lm(f1, data=df) # Create a model and store it as m1
  81
  82 summary(m1) # Display a summary of the regression model
  83 ```
  84 ### PC8
  85
  86 ```{r}
  87 autoplot(m1)
  88
  89 hist(residuals(m1))
  90
  91 plot(df$obama, m1$residuals)
  92 ```
  93
  94 ### PC9
  95 I'll fit the models separately and then present them using Stargazer to facilitate readability:
  96 ```{r}
  97 m2012 <- lm(f1, data=df[df$year == 2012,]) # Create a regression model for just 2012
  98 m2014 <- lm(f1, data=df[df$year == 2014,])
  99 m2015 <- lm(f1, data=df[df$year == 2015,])
 100
 101 stargazer(m2012, m2014, m2015, column.labels = c("2012", "2014", "2015"), type="text")
 102 ```
 103
 104 ## Statistical Questions
 105
 106 ### SQ1—8.14
 107
 108 Assumptions and how they look from the plots:
 109 *Linear relationship between explanatory and outcome variables*: Some outliers in the plots of residuals against each of the explanatory variables (IQ and gender) suggests that the fit is less than ideal. It does not look like there's a strong linear relationship from these plots.
 110 *Normally distributed residuals*: The quantile-quantile plot (top left) shows some systematic deviation from the normal distribution.
 111 *Constant variability of residuals*: There are some outliers (individual points with very large residual values) in the top right plot. There also seems to be some relationship between the residuals and gender (bottom plot). These features suggest that the assumption is not met.
 112 *Independent residuals*: None of the plots show obvious evidence of dependence between data points, but there are (again) some outliers that could be concerning.
 113
 114 Overall, the data do not seem to meet several of the assumptions necessary to identify the model due to some outliers and some unobserved relationships structuring the data. It would be risky to rely on this model to draw valid inferences about the relationships between the variables as a result.
 115
 116 ### SQ2—8.16
 117
 118 (a) The damaged O-rings almost all occurred at the lowest launch-time temperature.
 119
 120 (b) The model suggests that lower launch-time temperatures result in a higher probability of O-ring damage. The coefficient of the "Temperature" term is negative with a very small (proportionally speaking) standard error. It is statistically significant with a p-value near 0 ($H_0:~\beta_{temp} = 0$, $H_A:~\beta_{temp}\neq 0$), indicating that the data provide evidence that the coefficient is likely different from 0. Exponentiating the coefficient (see the R-code below), we see that a one degree farenheit increase in launch-time temperature is associated with 81% as large odds of O-ring damage occurring.
 121 ```{r}
 122 exp(-.2162)
 123 ```
 124
 125 (c) The corresponding logistic model where $\hat{p}_{damage}$ represents the probability of a damaged O-ring:
 126 $$log(\frac{\hat{p}_{damage}}{1-\hat{p}_{damage}}) = 11.663 - 0.2162\times~Temperature$$
 127 (d)  Given the high stakes in terms of human lives and vast costs involved, concerns about the relationship between O-rings and launch-time temperature seem more than justified from this data. The large negative association between temperature and O-ring damage suggest increased potential for failures at low launch temperatures. That said, several limitations of the data, modeling strategy, and estimates should be kept in mind. See my answer to part (c) of SQ3 below for more.
 128
 129 ### SQ3—8.18
 130
 131 (a)  Let's do this in R. Remember that we'll need to plug in the values for temperature *and* do some algebra with the natural logarithm parts of the formula (on the left hand side above) to find the predicted probability of O-ring damage:
 132
 133 ```{r}
 134 my.fits <- function(x){
 135   p.hat <- exp(11.663-(0.2162*x) ) # Function to get the odds
 136   print(p.hat)
 137   return(p.hat / (1 + p.hat)) # Convert the odds into a probability
 138 }
 139
 140 vals <- c(51,52, 53, 55) # Vector of values we want probabilities for
 141
 142 my.fits(vals)
 143 ```
 144
 145 (b) I'll use my little function above to build a data frame with the predicted values and plot everything in ggplot. Note that the question asks for a "smooth curve" fit to the dots. There are many ways to do this in ggplot. I demonstrate one here using `geom_smooth()` that fits a quadratic function ($y = x^2$) to the points. You might experiment with the different options for `geom_smooth()` or, for a simpler solution, just try `geom_line()` (with no arguments) instead.
 146
 147 ```{r}
 148 vals <- seq(51, 71, by=2) # This creates a vector from 51 to 71, counting by twos
 149
 150 probs <- my.fits(vals) # store the probabilities as another vector
 151
 152 pred <- data.frame(temp=vals, pred.probability=probs) # Change to a data frame for ggplot
 153
 154 ggplot(data=pred, aes(x=temp, y=pred.probability)) +
 155   geom_point(color="purple") + # Plot the points
 156   geom_smooth(color="orange", # Add a smooth line
 157               method="glm", # Create a line fit to the data
 158               formula = y ~ poly(x, 2), # Using this formula
 159               se=FALSE) # Don't show standard errors
 160
 161 ```
 162
 163
 164 (c)  I have several concerns about this application of logistic regression to the problem and data at hand. First, this is a fairly small observational dataset modeling a relatively rare outcome. It seems like the model treats each O-ring as independent (the problem wording is not clear about the model in this respect) and, if we accept that assumption, \~11 damaged O-ring outcomes occurred out of almost 140 trials (less than 10%). This assumption of independence between the observations seems likely to be problematic since each O-ring in a given mission is probably more similar to the others on that mission that to O-rings on another mission. It is also possible that the O-ring production or installation procedures may have changed across the missions over time. Any such clustered or time-dependent structures lurking in the data could lead to unobserved bias in the estimates when we model each O-ring outcome as independent events. Furthermore, about 50% of the observed failures occurred in a single mission—the mission with the lowest observed launch-time temperature. The result is that one mission drives the model results disproportionately (it generates observations that exert "high leverage" on the model fit to the data). Without knowing ahead of time that temperature was a likely explanation (as compared against any of the other infinite details of that one mission), it's hard to see how NASA analysts necessarily should have drawn this conclusion on the basis of evidence like this.
 165
 166 ## Empirical Questions
 167
 168 We will discuss these in class.
 169