From: aaronshaw
Date: Tue, 24 Nov 2020 02:37:19 +0000 (-0600)
Subject: more text, typo and bug fixes
X-Git-Url: https://code.communitydata.science/stats_class_2020.git/commitdiff_plain/fb64f892792941055a14f285f859056f3afb6ede?ds=sidebyside

more text, typo and bug fixes
---

diff --git a/psets/pset8-worked_solution.html b/psets/pset8-worked_solution.html
index 2bd1884..756fe38 100644
--- a/psets/pset8-worked_solution.html
+++ b/psets/pset8-worked_solution.html
@@ -1571,7 +1571,7 @@ mariokart
## 10 3.30e11     7      8 used     20.0       4     50 firstC…   201
## # … with 133 more rows, and 3 more variables: stock_photo <fct>, wheels <int>,
## #   title <fct>
-

To make things a bit easier to manage, I’ll select the variables I want to use in the analysis and do some cleanup. Note that I conver the cond_new and stock_photo variables to logical first (using boolean comparisons) and also coerce them to be numeric values (using as.numeric()). This results in 1/0 values corresponding to the observations shown/described in Figure 9.13 and 9.14 on p. 365 of the textbook.

+

To make things a bit easier to manage, I’ll select the variables I want to use in the analysis and do some cleanup. Note that I convert the cond_new and stock_photo variables to logical TRUE/FALSE values first (using boolean comparisons) and then coerce them to be numeric values (using as.numeric()). This results in 1/0 values corresponding to the observations shown/described in Figure 9.13 and 9.14 on p. 365 of the textbook.
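For reference, here's a minimal sketch of what that cleanup might look like in full. The mutate() step is truncated in this diff, so the code below is my reconstruction (it assumes the usual "new"/"yes" factor levels in the mariokart data), not the original code:

## hypothetical sketch of the cleanup described above; assumes cond has levels "new"/"used"
## and stock_photo has levels "yes"/"no"
mariokart <- mariokart %>%
  select(
    price = total_pr, cond_new = cond, stock_photo, duration, wheels
  ) %>%
  mutate(
    cond_new = as.numeric(cond_new == "new"),
    stock_photo = as.numeric(stock_photo == "yes")
  )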

mariokart <- mariokart %>%
   select(
     price = total_pr, cond_new = cond, stock_photo, duration, wheels
@@ -1596,7 +1596,7 @@ mariokart
##  9    47        0           1        3      1
## 10    50        0           0        7      1
## # … with 133 more rows
-

Now let’s look at the variables in our model. Summary statistics for each, univariate density plots for the continuous measures, and some bivariate plots w each predictor and the outcome variable.

+

Now let’s summarize and explore the variables in our model. I’ll calculate summary statistics for each, univariate density plots for the continuous measures, boxplots for the dichotomous indicators, and some bivariate plots with each predictor and the outcome variable.

summary(mariokart)
##      price           cond_new       stock_photo        duration     
 ##  Min.   : 28.98   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
@@ -1620,6 +1620,8 @@ mariokart
## [1] 0.8471829
qplot(data = mariokart, x = price, geom = "density")

+
qplot(data = mariokart, y = price, geom = "boxplot") # check out the outliers!
+

qplot(data = mariokart, x = duration, geom = "density")

qplot(data = mariokart, x = wheels, geom = "density")
@@ -1649,7 +1651,7 @@ mariokart

Replicate model results

-

The description of the model

+

Based on the information in the textbook, I’m assuming that the model is an ordinary least squares regression.

model <- lm(price ~ cond_new + stock_photo + duration + wheels, data = mariokart)
 summary(model)
## 
@@ -1674,15 +1676,16 @@ summary(model)
## Residual standard error: 24.4 on 138 degrees of freedom
## Multiple R-squared:  0.1235, Adjusted R-squared:  0.09808 
## F-statistic:  4.86 on 4 and 138 DF,  p-value: 0.001069
-

While I’m looking at that, I’ll go ahead and calculate a confidence interval around the only parameter for which the model rejects the null hypothesis (wheels):

+

Huh, that doesn’t quite look like what’s in the textbook Figure 9.15…

+

I’ll go ahead and calculate a confidence interval around the only parameter for which the model rejects the null hypothesis (wheels):

confint(model, "wheels")
##           2.5 %   97.5 %
 ## wheels 4.572473 15.32278
-

Overall, this resembles the model results in Figure 9.15, but notice that it’s different in a number of ways! Without more information from the authors of the textbook it’s very hard to determine exactly where or why the differences emerge.

+

Without more information from the authors of the textbook it’s very hard to determine exactly where or why the differences emerge. My guess at this point is that it might have something to do with those outlying price values (take a look at that boxplot and density plot again). Maybe they did something to transform the variable? Remove the outliers? I have no idea.
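If you wanted to chase down that transformation hunch, a quick exploratory check might look something like the following. Note that the log transformation (and the model.log object name) is my own guess for illustration, not something the textbook documents:

## exploratory sketch: refit the same model with a log-transformed outcome (my assumption, not the textbook's)
model.log <- lm(log(price) ~ cond_new + stock_photo + duration + wheels, data = mariokart)
summary(model.log)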

Assess model fit/assumptions

-

I’ve already generated a bunch of univariate and bivariate summaries and plots. Let’s inspect the residuals more closely to see what else we can learn. I’ll use the autoplot() function from the ggfortify package to help me do this.

+

I’ve already generated a bunch of univariate and bivariate summaries and plots. Let’s inspect the residuals more closely to see what else we can learn. I’ll use the autoplot() function from the ggfortify package.

autoplot(model)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
 ## Please use `arrange()` instead.
@@ -1690,9 +1693,13 @@ summary(model)
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

-

Overall, there are a number of issues with this model fit that I’d want to mention/consider: * The distribution of the dependent variable (price) is very skewed, with two extreme outliers. I’d recommend trying some transformations to see if it would look more appropriate for a linear regression and/or inspecting the cases that produced the outlying values more closely to understand what’s happening there. * The plots of the residuals reveal those same two outlying points are also outliers with respect to the line of best fit. That said, they are not exerting huge amounts of leverage on the estimates, so it’s possible that the estimates from the fitted model wouldn’t change much without those two points. Indeed, based on the degrees of freedom reported in Figure 9.15 (136) vs. the number reported in our version of the model (138) my best guess is that the textbook authors silently dropped those two outlying observations from their model!

+

There are those outliers again. At this point, there are a number of issues with this model fit that I’d want to mention/consider:
* The distribution of the dependent variable (price) is skewed, with two extreme outliers. I’d recommend trying some transformations to see if it would look more appropriate for a linear regression and/or inspecting the cases that produced the outlying values more closely to understand what’s happening there and identify reasons why they’re so different.
* The plots of the residuals reveal those same two outlying points are also outliers with respect to the line of best fit. That said, they are not exerting huge amounts of leverage on the estimates, so it’s possible that the estimates from the fitted model wouldn’t change too much without those two points. Indeed, based on the degrees of freedom reported in Figure 9.15 (136) vs. the number reported in our version of the model (138), my best guess at this point is that the textbook authors silently dropped those two outlying observations from their model.

More out of curiosity than anything, I’ll create a version of the model that drops the two largest values of price. From the plots, I can see that those two are the only ones above $100, so I’ll use that information here:

-
summary(lm(price ~ cond_new + stock_photo + duration + wheels, data = mariokart[mariokart$price < 100, ]))
+
summary(
+  lm(price ~ cond_new + stock_photo + duration + wheels,
+    data = mariokart[mariokart$price < 100, ]
+  )
+)
## 
 ## Call:
 ## lm(formula = price ~ cond_new + stock_photo + duration + wheels, 
@@ -1715,11 +1722,11 @@ summary(model)
## Residual standard error: 4.901 on 136 degrees of freedom
## Multiple R-squared:  0.719,  Adjusted R-squared:  0.7108 
## F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16
-

What do you know. That was it.

+

What do you know. That was it. The difference in \(R^2\) is huge!

Interpret some results

-

The issues above notwithstanding, we can march ahead and interpret the model results. Here are some general comments and some specifically focused on the cond_new and stock_photo variables: * Overall, the linear model regressing total auction price on condition, stock photo, duration, and number of Wii wheels shows evidence of a positive, significant relationship between number of wheels and price. According to this model fit, an increase of 1 wheel is associated with a total auction price increase of $10 with the 95% confindence interval of ($4.57-$15.32). * The point estimate for selling a new condition game is positive, but with a large standard error. As a result, the model fails to reject the null of no association and provides no evidence of any relationship between the game condition and auction price. * The point estimate for including a stock photo is negative, but again, the standard error is very large and the model fails to reject the null hypothesis. There is no evidence of any relationship between including a stock photo and the final auction price.

+

The issues above notwithstanding, we can march ahead and interpret the results of the original model that I fit. Here are some general comments and some specifically focused on the cond_new and stock_photo variables:
* Overall, the linear model regressing total auction price on condition, stock photo, duration, and number of Wii wheels shows evidence of a positive, significant relationship between number of wheels and price. According to this model fit, an increase of 1 wheel in a listing is associated with a total auction price increase of $10 on average (95% confidence interval: $4.57-$15.32).
* The point estimate for selling a new condition game is positive, but with a large standard error. The model fails to reject the null of no association and provides no evidence of any relationship between the game condition and auction price.
* The point estimate for including a stock photo is negative, but again, the standard error is very large and the model fails to reject the null hypothesis. There is no evidence of any relationship between including a stock photo and the final auction price.
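To make the wheels interpretation concrete, here's a small sketch I'm adding for illustration: it compares model predictions for two hypothetical listings that differ by exactly one wheel (the new.listings object and its covariate values are arbitrary choices of mine):

## two hypothetical listings that differ only in the number of wheels (values chosen arbitrarily)
new.listings <- data.frame(
  cond_new = c(0, 0),
  stock_photo = c(1, 1),
  duration = c(3, 3),
  wheels = c(1, 2)
)
diff(predict(model, newdata = new.listings))  # equals the wheels coefficient, about $10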

Recommendations

@@ -1728,8 +1735,8 @@ summary(model)

Part II: Hypothetical study

-
-

Import and explore

+
+

Import, explore, summarize

I’ll start off by just importing things and summarizing the different variables we care about here:

grads <- readRDS(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_11/grads.rds"))
 
@@ -1759,7 +1766,7 @@ summary(grads)
ggplot(data = grads, aes(x = gpa, y = income)) +
   geom_point()

-

I’ll also calculate some summary statistics and visual comparisons within cohorts:

+

I’ll also calculate some summary statistics and visual comparisons within districts (cohorts):

tapply(grads$income, grads$cohort, summary)
## $cohort_01
 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
@@ -1874,7 +1881,58 @@ summary(grads)
## 26.93027 26.93019 26.93861 26.93988 26.93768 26.94000 26.93000 26.93540 
## cohort_09 cohort_10 cohort_11 cohort_12 cohort_13 
##  26.93974  26.93790  26.93573  26.93004  26.93610
-

Huh. Those are remarkably similar values for the group means and the group standard deviations…

+

Note that you could also do this pretty easily with a call to group_by and summarize in the tidyverse:

+
grads %>%
+  group_by(cohort) %>%
+  summarize(
+    n = n(),
+    min = min(income),
+    mean = mean(income),
+    max = max(income),
+    sd = sd(income)
+  )
+
## # A tibble: 13 x 6
+##    cohort        n   min  mean   max    sd
+##    <fct>     <int> <dbl> <dbl> <dbl> <dbl>
+##  1 cohort_01   142  27.0  54.3  86.4  16.8
+##  2 cohort_02   142  25.4  54.3  78.0  16.8
+##  3 cohort_03   142  20.2  54.3  95.3  16.8
+##  4 cohort_04   142  22.0  54.3  98.3  16.8
+##  5 cohort_05   142  30.4  54.3  89.5  16.8
+##  6 cohort_06   142  17.9  54.3  96.1  16.8
+##  7 cohort_07   142  31.1  54.3  85.4  16.8
+##  8 cohort_08   142  22.3  54.3  98.2  16.8
+##  9 cohort_09   142  15.6  54.3  91.6  16.8
+## 10 cohort_10   142  27.4  54.3  77.9  16.8
+## 11 cohort_11   142  19.3  54.3  91.7  16.8
+## 12 cohort_12   142  21.9  54.3  85.7  16.8
+## 13 cohort_13   142  18.1  54.3  95.6  16.8
+
grads %>%
+  group_by(cohort) %>%
+  summarize(
+    n = n(),
+    min = min(gpa),
+    mean = mean(gpa),
+    max = max(gpa),
+    sd = sd(gpa)
+  )
+
## # A tibble: 13 x 6
+##    cohort        n     min  mean   max    sd
+##    <fct>     <int>   <dbl> <dbl> <dbl> <dbl>
+##  1 cohort_01   142 14.4     47.8  92.2  26.9
+##  2 cohort_02   142 15.8     47.8  94.2  26.9
+##  3 cohort_03   142  5.65    47.8  99.6  26.9
+##  4 cohort_04   142 10.5     47.8  90.5  26.9
+##  5 cohort_05   142  2.73    47.8  99.7  26.9
+##  6 cohort_06   142 14.9     47.8  87.2  26.9
+##  7 cohort_07   142  4.58    47.8  97.8  26.9
+##  8 cohort_08   142  2.95    47.8  99.5  26.9
+##  9 cohort_09   142  0.0151  47.8  97.5  26.9
+## 10 cohort_10   142  0.217   47.8  99.3  26.9
+## 11 cohort_11   142  9.69    47.8  85.9  26.9
+## 12 cohort_12   142 16.3     47.8  85.6  26.9
+## 13 cohort_13   142  0.304   47.8  99.6  26.9
+

Huh. Those are remarkably similar values for the group means and the group standard deviations…weird.

Onwards to plotting:

ggplot(data = grads, aes(x = cohort, y = income)) +
   geom_boxplot()
@@ -1882,13 +1940,17 @@ summary(grads)
ggplot(data = grads, aes(x = gpa, y = income, color = cohort)) +
   geom_point()

-

Those plots are also a little strange. I know this is just a simulated analysis, but it still seems weird that overall it just looks like a big mass of random points, but when I add the colors by cohort, I can see there are some lines and other regularities within groups. I wonder what happens when I plot each scatter within cohorts?

+

Those plots are also a little strange. Even though this is just a simulated analysis, it still seems weird that overall the scatterplot just looks like a big mass of points, but when I color the points by district, I can see some regularities within groups. At this point, I might want to facet the scatterplots by district to see any patterns more clearly.

ggplot(data = grads, aes(x = gpa, y = income, color = cohort)) +
   geom_point() +
   facet_wrap(vars(cohort))

-

Hmmm. That’s…absurd (in particular, cohort 8 looks like a sideways dinosaur). At this point, if I were really working as a consultant on this project, I would write to the client and start asking some uncomfortable questions about data quality (who collected this data? how did it get recorded/stored/etc.? what quality controls were in place?). I would also feel obligated to tell them that there’s just no way the data correspond to the variables they think are here. If you did that and the client was honest, they might tell you where the data actually came from.

-

In the event that you marched ahead with the analysis and are curious about what that could have looked like, I’ve provided some example code below. That said, this is a situation where the assumptions and conditions necessary to identify ANOVA, t-tests, or regression are all pretty broken because the data was generated programmatically in ways that undermine the kinds of interpretation you’ve been asked to make. The best response here (IMHO) is to abandon these kinds of analysis once you discover that there’s something systematically weird going on. The statistical procedures will “work” in the sense that they will return a result, but because those results aren’t even close to meaningful, any relationships you do observe in the data reflect something different than the sorts of relationships the statistical procedures were designed to identify.

+

Okay, that’s…a joke (in particular, cohort 8 looks like a sideways dinosaur). At this point, if I were really working as a consultant on this project, I would write to the client and start asking some probing questions about data quality (who collected this data? how did it get recorded/stored/etc.? what quality controls were in place?). I would also feel obligated to tell them that I suspect there’s just no way the data correspond to the variables they think are here. If you did that and the client was honest, they might tell you where the data actually came from.

+

In the event that you marched ahead with the analysis and are curious about what that could have looked like, I’ve provided some example code below. That said, this is a situation where the assumptions and conditions necessary to identify ANOVA, t-tests, or regression are all pretty broken because the data was generated programmatically in ways that undermine the kinds of interpretation you’ve been asked to make. The best response here (IMHO) is to abandon these kinds of analysis once you discover that there’s something systematically weird going on like this. While the experience of discovering a scatterplot dinosaur in your data is…unlikely outside of the context of a problem set, there are many situations in which careful data exploration will bring a realization that you just don’t understand some important things about the sources or qualities of your data. You have to learn to identify these moments and develop strategies for dealing with them! Often, the statistical procedures will “work” in the sense that they will return a result without any apparent errors, but because those results aren’t even close to meaningful, any relationships you do observe in the data reflect something different than the sorts of relationships the statistical procedures were designed to identify.

+
+
+

Fake analysis for fake data

+

Okay, if you wanted example code to look at for this, here it is. Please just keep in mind that the results are not informative!

summary(aov(income ~ cohort, data = grads)) # no global differences of means across groups
##               Df Sum Sq Mean Sq F value Pr(>F)
 ## cohort        12      0     0.0       0      1
@@ -1927,7 +1989,7 @@ summary(grads)
confint(grads.model, "gpa") # 95% confidence interval
##           2.5 %      97.5 %
 ## gpa -0.06955974 -0.01263512
-

Note that the failure to reject the null of any association between district and income in the ANOVA does not provide conclusive evidence that the relationship between GPA and income does not vary by cohort. There were several things you might have done here. One is to calculate correlation coefficients within groups. Here’s some tidyverse code that does that:

+

Note that the failure to reject the null of any association between district and income in the ANOVA would not (even in the event of more realistic data) provide very compelling evidence that the relationship between GPA and income does not vary by cohort. There were several things you might have done here. One is to calculate correlation coefficients within groups. Here’s some tidyverse code that does that:

grads %>%
   group_by(cohort) %>%
   summarize(
@@ -1949,13 +2011,14 @@ summary(grads)
## 11 cohort_11     -0.0686
## 12 cohort_12     -0.0683
## 13 cohort_13     -0.0690
-

Because these correlation coefficients are nearly identical, I would likely end my analysis here and conclude that the correlation between gpa and income appears to be consistently small and negative. If you wanted to go further, you could theoretically calculate an interaction term in the model (by including I(gpa*cohort) in the model formula), but the analysis up to this point gives no indication that you’d be likely to find much of anything (and we haven’t really talked about interactions yet).

+

Because these correlation coefficients are nearly identical, I would likely end an analysis here and conclude that the correlation between gpa and income appears to be consistently small and negative. If you wanted to go further, you could theoretically calculate an interaction term in the model (by including I(gpa*cohort) in the model formula), but the analysis up to this point gives no indication that you’d be likely to find much of anything (and we haven’t really talked about interactions yet). On top of that, there’s a literal dinosaur lurking in your data…just give up!
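For completeness, here's a minimal sketch of how such an interaction model could be specified with R's formula syntax (gpa * cohort expands to the main effects plus the interaction terms). This is illustrative only, not something I'd interpret given what we now know about these data:

## illustrative only: gpa-by-cohort interaction model
summary(lm(income ~ gpa * cohort, data = grads))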

Part III: Trick or treating again

Import and update data

+

Revisit the text and worked solutions for problem set 7 for more details about the study design, data collection and more.

## reminder that the "read_dta()" function requires the "haven" library
 
 df <- read_dta(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta"))
@@ -1984,7 +2047,7 @@ df
##  9 FALSE FALSE 2014  -0.520 TRUE      1     4
## 10 FALSE TRUE  2014  -1.52  FALSE     1     4
## # … with 1,213 more rows
-

Let’s fit and summarize the model:

+

That looks consistent with what we want here. Let’s fit and summarize the model:

f <- formula(fruit ~ obama + age + male + year)
 
 fit <- glm(f, data = df, family = binomial("logit"))
@@ -2017,8 +2080,9 @@ summary(fit)
## AIC: 1379.2
## 
## Number of Fisher Scoring iterations: 4
-

Interesting. Looks like adjusting for these other variables in a regression setting can impact the results.

-

Onwards to generating more interpretable results:

+

Interesting. Looks like adjusting for these other variables in a regression setting allows us to uncover some different results.

+

Onwards to generating more interpretable results.

+

First, odds-ratios instead of “raw” log-odds coefficients:

## Odds ratios (exponentiated log-odds!)
 exp(coef(fit))
## (Intercept)   obamaTRUE         age    maleTRUE    year2014    year2015 
@@ -2031,9 +2095,9 @@ exp(coef(fit))
## maleTRUE  0.6697587 1.1280707
## year2014  0.6518923 1.5564945
## year2015  0.8649706 1.9799543
-
## model-predicted probabilities for prototypical observations:
-fake.kids <- data.frame(
-  obama = rep(c(FALSE, TRUE), 2),
+

Now, model-predicted probabilities for prototypical observations. Recall that it’s necessary to create synthetic (“fake”), hypothetical individuals to generate predicted probabilities like these. In this case, I’ll create two versions of each fake kid: one assigned to the treatment condition and one assigned to the control. Then I’ll use the predict() function to generate fitted values for each of the fake kids.

+
fake.kids <- data.frame(
+  obama = c(FALSE, FALSE, TRUE, TRUE),
   year = factor(rep(c("2015", "2012"), 2)),
   age = rep(c(9, 7), 2),
   male = rep(c(FALSE, TRUE), 2)
@@ -2044,10 +2108,10 @@ fake.kids.pred <- cbind(fake.kids, pred.prob = predict(fit, fake.kids, type =
 fake.kids.pred
##   obama year age  male pred.prob
 ## 1 FALSE 2015   9 FALSE 0.2906409
-## 2  TRUE 2012   7  TRUE 0.2537326
-## 3 FALSE 2015   9 FALSE 0.2906409
+## 2 FALSE 2012   7  TRUE 0.2117531
+## 3  TRUE 2015   9 FALSE 0.3414844
 ## 4  TRUE 2012   7  TRUE 0.2537326
-

Note that this UCLA logit regression tutorial also contains example code to help extract standard errors and confidence intervals around these predicted probabilities. You were not asked to produce them here, but if you’d like an example here you go (I can try to clarify in class):

+

Note that this UCLA logit regression tutorial also contains example code to help extract standard errors and confidence intervals around these predicted probabilities. You were not asked to create them here, but if you’d like an example, here you go. The workhorse is the plogis() function, which essentially does the inverse logit transformation detailed in chapter 8 of the textbook:

fake.kids.more.pred <- cbind(
   fake.kids,
   predict(fit, fake.kids, type = "link", se = TRUE)
@@ -2060,100 +2124,61 @@ within(fake.kids.more.pred, {
 })
##   obama year age  male        fit    se.fit residual.scale     upper     lower
 ## 1 FALSE 2015   9 FALSE -0.8922736 0.2243091              1 0.3887361 0.2088421
-## 2  TRUE 2012   7  TRUE -1.0788031 0.2594135              1 0.3611555 0.1697706
-## 3 FALSE 2015   9 FALSE -0.8922736 0.2243091              1 0.3887361 0.2088421
+## 2 FALSE 2012   7  TRUE -1.3143903 0.2682218              1 0.3124531 0.1370389
+## 3  TRUE 2015   9 FALSE -0.6566864 0.2363758              1 0.4518027 0.2460144
 ## 4  TRUE 2012   7  TRUE -1.0788031 0.2594135              1 0.3611555 0.1697706
 ##   pred.prob
 ## 1 0.2906409
-## 2 0.2537326
-## 3 0.2906409
+## 2 0.2117531
+## 3 0.3414844
 ## 4 0.2537326
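As a small added sanity check on that plogis() point (this snippet is mine, using values taken from the output above):

## plogis() maps log-odds (the "link" scale) back onto the probability scale
plogis(-0.8922736)            # ~0.2906, matching pred.prob for the first fake kid above
1 / (1 + exp(0.8922736))      # the same inverse logit written out by hand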

Sub-group analysis

+

To do this, we’ll need to create a slightly new model formula that drops the year term since we’re going to restrict each subset of the data along that dimension.

+

Once I fit the models, I’ll use the stargazer package to create a reasonably pretty regression table that incorporates all three summaries.

f2 <- formula(fruit ~ obama + age + male)
 
-summary(glm(f2, data = df[df$year == "2012", ], family = binomial("logit")))
-
## 
-## Call:
-## glm(formula = f2, family = binomial("logit"), data = df[df$year == 
-##     "2012", ])
-## 
-## Deviance Residuals: 
-##     Min       1Q   Median       3Q      Max  
-## -1.0041  -0.7564  -0.6804  -0.5434   1.9929  
-## 
-## Coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept) -1.53780    0.37711  -4.078 4.55e-05 ***
-## obamaTRUE    0.32568    0.38680   0.842    0.400    
-## age          0.08532    0.06699   1.273    0.203    
-## maleTRUE     0.32220    0.38802   0.830    0.406    
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## (Dispersion parameter for binomial family taken to be 1)
-## 
-##     Null deviance: 177.56  on 163  degrees of freedom
-## Residual deviance: 174.90  on 160  degrees of freedom
-##   (1 observation deleted due to missingness)
-## AIC: 182.9
-## 
-## Number of Fisher Scoring iterations: 4
-
summary(glm(f2, data = df[df$year == "2014", ], family = binomial("logit")))
-
## 
-## Call:
-## glm(formula = f2, family = binomial("logit"), data = df[df$year == 
-##     "2014", ])
-## 
-## Deviance Residuals: 
-##     Min       1Q   Median       3Q      Max  
-## -0.8016  -0.7373  -0.6846  -0.6594   1.8071  
-## 
-## Coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept) -1.13351    0.19798  -5.725 1.03e-08 ***
-## obamaTRUE    0.07348    0.24158   0.304    0.761    
-## age          0.01198    0.03673   0.326    0.744    
-## maleTRUE    -0.23980    0.23499  -1.020    0.308    
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## (Dispersion parameter for binomial family taken to be 1)
-## 
-##     Null deviance: 450.11  on 421  degrees of freedom
-## Residual deviance: 448.85  on 418  degrees of freedom
-## AIC: 456.85
-## 
-## Number of Fisher Scoring iterations: 4
-
summary(glm(f2, data = df[df$year == "2015", ], family = binomial("logit")))
+m2012 <- glm(f2, data = df[df$year == "2012", ], family = binomial("logit"))
+m2014 <- glm(f2, data = df[df$year == "2014", ], family = binomial("logit"))
+m2015 <- glm(f2, data = df[df$year == "2015", ], family = binomial("logit"))
+
+
+## I can make a pretty table out of that:
+library(stargazer)
+stargazer(m2012, m2014, m2015, column.labels = c("2012", "2014", "2015"), type = "text")
## 
-## Call:
-## glm(formula = f2, family = binomial("logit"), data = df[df$year == 
-##     "2015", ])
-## 
-## Deviance Residuals: 
-##     Min       1Q   Median       3Q      Max  
-## -0.9067  -0.7931  -0.7338   1.4761   1.7012  
-## 
-## Coefficients:
-##               Estimate Std. Error z value Pr(>|z|)    
-## (Intercept) -0.9971079  0.1383417  -7.208  5.7e-13 ***
-## obamaTRUE    0.3170892  0.1898404   1.670   0.0949 .  
-## age          0.0004615  0.0290552   0.016   0.9873    
-## maleTRUE    -0.1791721  0.1791828  -1.000   0.3173    
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## (Dispersion parameter for binomial family taken to be 1)
-## 
-##     Null deviance: 743.8  on 634  degrees of freedom
-## Residual deviance: 740.0  on 631  degrees of freedom
-##   (1 observation deleted due to missingness)
-## AIC: 748
-## 
-## Number of Fisher Scoring iterations: 4
-

Interesting. The treatment effect seems to emerge overwhelmingly within a single year of the data.

+## ===============================================
+##                        Dependent variable:     
+##                   -----------------------------
+##                               fruit            
+##                      2012      2014      2015  
+##                      (1)       (2)       (3)   
+## -----------------------------------------------
+## obama               0.326     0.073    0.317*  
+##                    (0.387)   (0.242)   (0.190) 
+##                                                 
+## age                 0.085     0.012     0.0005 
+##                    (0.067)   (0.037)   (0.029) 
+##                                                 
+## male                0.322    -0.240    -0.179  
+##                    (0.388)   (0.235)   (0.179) 
+##                                                 
+## Constant          -1.538*** -1.134*** -0.997***
+##                    (0.377)   (0.198)   (0.138) 
+##                                                 
+## -----------------------------------------------
+## Observations         164       422       635   
+## Log Likelihood     -87.451  -224.427  -370.002 
+## Akaike Inf. Crit.  182.902   456.853   748.005 
+## ===============================================
+## Note:               *p<0.1; **p<0.05; ***p<0.01
+
+
+

Interpret and discuss

+

Well, for starters, the model providing a “pooled” estimate of treatment effects while adjusting for age, gender, and study year suggests that the point estimate is “marginally” statistically significant (\(p < 0.1\)), indicating some evidence that the data support the alternative hypothesis (being shown a picture of Michelle Obama causes trick-or-treaters to be more likely to pick up fruit than the control condition). In more concrete terms, the trick-or-treaters shown the Obama picture were, on average, about 25% more likely to pick up fruit than those exposed to the control (95% CI: \(-4\%\) to \(+66\%\)). In even more concrete terms, the estimated probability that a 9 year-old girl in 2015 and a 7 year-old boy in 2012 would take fruit increases by about 17% and 19% respectively on average (from 29% to 34% in the case of the 9 year-old and from 21% to 25% in the case of the 7 year-old). These findings are sort of remarkable given the simplicity of the intervention and the fairly strong norm that Halloween is all about candy.
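In case it's useful, here's a short sketch (mine, not part of the original solution) of how those percentages can be recovered by re-expressing the odds ratio and its confidence interval from the exponentiated output shown earlier:

## percent change in the odds of taking fruit under treatment (a sketch)
(exp(coef(fit))["obamaTRUE"] - 1) * 100       # roughly +25%
(exp(confint(fit))["obamaTRUE", ] - 1) * 100  # roughly -4% to +66%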

+

All of that said, the t-test results from Problem set 7 and the “unpooled” results reported in the sub-group analysis point to some potential concerns and limitations. For starters, the fact that the experiment was run iteratively over multiple years and that the sample size grew each year raises some concerns that the study design may not have anticipated the small effect sizes eventually observed and/or was adapted on the fly. This would undermine confidence in some of the test statistics and procedures. Furthermore, because the experiment occurred in sequential years, there’s a very real possibility that the significance of a picture of Michelle Obama shifted during that time period and/or the house in question developed a reputation for being “that weird place where they show you pictures of Michelle Obama and offer you fruit.” Whatever the case, my confidence in the findings here is not so great and I have some lingering suspicions that the results might not replicate.

+

On a more nuanced/advanced statistical note, I also have some concerns about the standard errors. This goes beyond the content of our course, but basically, a randomized controlled trial introduces clustering into the data by design (you can think of it as analogous to the observations coming from the treatment “cluster” and the control “cluster”). Because of this, the usual standard error formulas can be biased. Luckily, there’s a fix: compute “robust” standard errors and re-calculate the corresponding confidence intervals. Indeed, robust standard errors are often considered to be the best choice even when you don’t know about potential latent clustering or heteroskedastic error structures in your data. This short pdf provides a little more explanation, citations, and example R code for how you might calculate robust standard errors.
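To make that concrete, here's a minimal sketch of one common way to compute heteroskedasticity-robust standard errors in R. It assumes the sandwich and lmtest packages are installed, and it is my illustration rather than code taken from the linked pdf:

## heteroskedasticity-robust ("sandwich") standard errors for the pooled logit model
library(sandwich)
library(lmtest)

coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # coefficient table with robust SEs
coefci(fit, vcov = vcovHC(fit, type = "HC1"))    # corresponding robust confidence intervals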

diff --git a/psets/pset8-worked_solution.pdf b/psets/pset8-worked_solution.pdf
index b67bf06..5d96e20 100644
Binary files a/psets/pset8-worked_solution.pdf and b/psets/pset8-worked_solution.pdf differ
diff --git a/psets/pset8-worked_solution.rmd b/psets/pset8-worked_solution.rmd
index 7f25571..8c4680f 100644
--- a/psets/pset8-worked_solution.rmd
+++ b/psets/pset8-worked_solution.rmd
@@ -4,9 +4,6 @@ subtitle: "Statistics and statistical programming  \nNorthwestern University  \n
 author: "Aaron Shaw"
 date: "November 23, 2020"
 output:
-  pdf_document:
-    toc: yes
-    toc_depth: '3'
   html_document:
     toc: yes
     toc_depth: 3
@@ -14,6 +11,9 @@ output:
     collapsed: true
     smooth_scroll: true
     theme: readable
+  pdf_document:
+    toc: yes
+    toc_depth: '3'
 header-includes:
   - \newcommand{\lt}{<}
   - \newcommand{\gt}{>}