2 title: "Week 8 problem set: Worked solutions"
3 author: "Jeremy Foote & Aaron Shaw"
8 ```{r setup, include=FALSE}
9 knitr::opts_chunk$set(echo = TRUE)
11 ## Programming Challenges
14 w3 <- read.csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/group_08.csv"))
16 ## Cleanup and recoding
17 w3$j <- as.logical(w3$j)
18 w3$l <- as.logical(w3$l)
22 labels=c("none", "some", "lots", "all"))
37 cor(w3$x, w3$y, use="complete.obs")
44 m <- lm(f, data=w3) # 'lm' stands for 'linear model', and is used for linear regression
51 ## Using base R graphing tools
54 plot(m$fitted.values, m$residuals)
68 stargazer(m, type="text")
75 df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta')
76 df <- df[complete.cases(df),] # Remove rows with NA
78 f1 <- formula(fruit ~ obama) # Create a formula for the regression
80 m1 <- lm(f1, data=df) # Create a model and store it as m1
82 summary(m1) # Display a summary of the regression model
91 plot(df$obama, m1$residuals)
95 I'll fit the models separately and then present them using Stargazer to facilitate readability:
97 m2012 <- lm(f1, data=df[df$year == 2012,]) # Create a regression model for just 2012
98 m2014 <- lm(f1, data=df[df$year == 2014,])
99 m2015 <- lm(f1, data=df[df$year == 2015,])
101 stargazer(m2012, m2014, m2015, column.labels = c("2012", "2014", "2015"), type="text")
104 ## Statistical Questions
108 Assumptions and how they look from the plots:
109 *Linear relationship between explanatory and outcome variables*: Some outliers in the plots of residuals against each of the explanatory variables (IQ and gender) suggests that the fit is less than ideal. It does not look like there's a strong linear relationship from these plots.
110 *Normally distributed residuals*: The quantile-quantile plot (top left) shows some systematic deviation from the normal distribution.
111 *Constant variability of residuals*: There are some outliers (individual points with very large residual values) in the top right plot. There also seems to be some relationship between the residuals and gender (bottom plot). These features suggest that the assumption is not met.
112 *Independent residuals*: None of the plots show obvious evidence of dependence between data points, but there are (again) some outliers that could be concerning.
114 Overall, the data do not seem to meet several of the assumptions necessary to identify the model due to some outliers and some unobserved relationships structuring the data. It would be risky to rely on this model to draw valid inferences about the relationships between the variables as a result.
118 (a) The damaged O-rings almost all occurred at the lowest launch-time temperature.
120 (b) The model suggests that lower launch-time temperatures result in a higher probability of O-ring damage. The coefficient of the "Temperature" term is negative with a very small (proportionally speaking) standard error. It is statistically significant with a p-value near 0 ($H_0:~\beta_{temp} = 0$, $H_A:~\beta_{temp}\neq 0$), indicating that the data provide evidence that the coefficient is likely different from 0. Exponentiating the coefficient (see the R-code below), we see that a one degree farenheit increase in launch-time temperature is associated with 81% as large odds of O-ring damage occurring.
125 (c) The corresponding logistic model where $\hat{p}_{damage}$ represents the probability of a damaged O-ring:
126 $$log(\frac{\hat{p}_{damage}}{1-\hat{p}_{damage}}) = 11.663 - 0.2162\times~Temperature$$
127 (d) Given the high stakes in terms of human lives and vast costs involved, concerns about the relationship between O-rings and launch-time temperature seem more than justified from this data. The large negative association between temperature and O-ring damage suggest increased potential for failures at low launch temperatures. That said, several limitations of the data, modeling strategy, and estimates should be kept in mind. See my answer to part (c) of SQ3 below for more.
131 (a) Let's do this in R. Remember that we'll need to plug in the values for temperature *and* do some algebra with the natural logarithm parts of the formula (on the left hand side above) to find the predicted probability of O-ring damage:
134 my.fits <- function(x){
135 p.hat <- exp(11.663-(0.2162*x) ) # Function to get the odds
137 return(p.hat / (1 + p.hat)) # Convert the odds into a probability
140 vals <- c(51,52, 53, 55) # Vector of values we want probabilities for
145 (b) I'll use my little function above to build a data frame with the predicted values and plot everything in ggplot. Note that the question asks for a "smooth curve" fit to the dots. There are many ways to do this in ggplot. I demonstrate one here using `geom_smooth()` that fits a quadratic function ($y = x^2$) to the points. You might experiment with the different options for `geom_smooth()` or, for a simpler solution, just try `geom_line()` (with no arguments) instead.
148 vals <- seq(51, 71, by=2) # This creates a vector from 51 to 71, counting by twos
150 probs <- my.fits(vals) # store the probabilities as another vector
152 pred <- data.frame(temp=vals, pred.probability=probs) # Change to a data frame for ggplot
154 ggplot(data=pred, aes(x=temp, y=pred.probability)) +
155 geom_point(color="purple") + # Plot the points
156 geom_smooth(color="orange", # Add a smooth line
157 method="glm", # Create a line fit to the data
158 formula = y ~ poly(x, 2), # Using this formula
159 se=FALSE) # Don't show standard errors
164 (c) I have several concerns about this application of logistic regression to the problem and data at hand. First, this is a fairly small observational dataset modeling a relatively rare outcome. It seems like the model treats each O-ring as independent (the problem wording is not clear about the model in this respect) and, if we accept that assumption, \~11 damaged O-ring outcomes occurred out of almost 140 trials (less than 10%). This assumption of independence between the observations seems likely to be problematic since each O-ring in a given mission is probably more similar to the others on that mission that to O-rings on another mission. It is also possible that the O-ring production or installation procedures may have changed across the missions over time. Any such clustered or time-dependent structures lurking in the data could lead to unobserved bias in the estimates when we model each O-ring outcome as independent events. Furthermore, about 50% of the observed failures occurred in a single mission—the mission with the lowest observed launch-time temperature. The result is that one mission drives the model results disproportionately (it generates observations that exert "high leverage" on the model fit to the data). Without knowing ahead of time that temperature was a likely explanation (as compared against any of the other infinite details of that one mission), it's hard to see how NASA analysts necessarily should have drawn this conclusion on the basis of evidence like this.
166 ## Empirical Questions
168 We will discuss these in class.