problem_sets/week_06/ps6-worked-solution.Rmd

   1 ---
   2 title: 'Week 6 problem set: Worked solutions'
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Jeremy Foote"
   5 date: "April 11, 2019"
   6 output: html_document
   7 ---
   8
   9 ```{r setup, include=FALSE}
  10 knitr::opts_chunk$set(echo = TRUE, messages = F)
  11 ```
  12
  13 ## Programming Questions
  14
  15 Note that Jeremy created this solution set with some code that uses functions drawn from the `tidyverse`. Wherever possible, we have also provided solution code using base R functions or alternatives since the course has not really provided an introduction to the tidyverse.
  16
  17 ### PC0
  18
  19 First we import the data.
  20
  21 ```{r}
  22 raw_df = read.csv("/tmp/owan03.csv") # Note that I saved the file as a CSV for importing to R
  23 head(raw_df)
  24 ```
  25
  26 ### PC1
  27
  28 Let's organize the data. We want this in "long" format where the survival times in weeks for each of the three dosage levels are all arranged in a single column and another variable labels the dosage level. There are (surprise!) many ways to do this. Here are two examples using a function called `melt()` from a library called `reshape` and another function called `gather()` from the `tidyverse` library. Note that you'll need to install the respective library and load it before you can call either function!
  29
  30 ```{r}
  31 library(reshape)
  32 library(tidyr)
  33
  34 # Rename the columns
  35 colnames(raw_df) <- c("None", "Low", "Medium","High")
  36
  37 # Here's a version using "melt"
  38 d <- melt(raw_df, na.rm=TRUE)
  39 names(d) <- c("dose", "weeks_alive")
  40
  41 # "gather" similarly takes all of the columns from the dataframe and uses them as keys, with the value as the value from that # cell. This gives us a "long" dataframe
  42 df <- gather(raw_df, dose, weeks_alive)
  43 # We want to treat dose as a factor, so we can group by it easier. Specifying the levels helps with labels later on.
  44 df$dose <- factor(df$dose, levels=c("None", "Low", "Medium", "High"))
  45 # Finally, remove NAs
  46 df <- df[complete.cases(df),]
  47
  48 # Uncomment this to see that the results are the same:
  49 ## df==d
  50 ```
  51
  52 ### PC2
  53
  54 Now we're going to calculate summary statistics and create some visualizations
  55
  56 ```{r}
  57 # First, using the tapply function
  58 tapply(df$weeks_alive, df$dose, summary)
  59
  60 # Alternative way to do this using tidyverse "group_by" function:
  61 library(dplyr)
  62 df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
  63
  64 ```
  65
  66 When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.
  67
  68 ```{r}
  69 library(ggplot2)
  70 # Histograms
  71
  72 h_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_histogram(position = 'dodge', bins = 5)
  73 h_plot
  74
  75 # In this case, faceted histograms is probably better
  76 h_facet <- ggplot(data=df, aes(x=weeks_alive)) + geom_histogram(bins = 5) + facet_grid(~dose)
  77 h_facet
  78
  79 # Even easier to compare if we flip the coordinates (rotate the axes)
  80 h_facet+coord_flip()
  81
  82 # Density plots. Note that setting a value <1 for "alpha" makes the fill more transparent
  83 d_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_density(alpha = .2)
  84 d_plot
  85
  86 # Boxplots
  87 box_plot <- ggplot(data=df, aes(y=weeks_alive,  x = dose)) + geom_boxplot()
  88 box_plot
  89
  90 # Jeremy's favorite - ridgeline plots
  91
  92 # install.packages('ggridges')
  93 library(ggridges)
  94
  95 ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange')
  96 ridge_plot
  97
  98 # add a fancy minimalist theme to make it prettier:
  99 ridge_plot + theme_minimal()
 100
 101 ```
 102
 103 A two sample t-test assumes independence and normality. An ANOVA assumes independence, normality, and equal variance. It's a bit tough to tell, but the overall assumption of equal variance seems reasonable. Normality is a bit of a hard sell within groups or overall. Nevertheless, most analysts would march ahead with the analysis despite these violations of assumptions. We can discuss how you might think and talk about this in class.
 104
 105 The global mean is
 106
 107 ```{r}
 108 mean(df$weeks_alive)
 109 ```
 110
 111 ### PC3
 112
 113 ANOVA! Remember that we have to call `summary()` to see the ANOVA table:
 114
 115 ```{r}
 116 summary(aov(weeks_alive ~ dose, data = df))
 117 ```
 118
 119 This provides evidence in support of the alternative hypothesis that the group means are not equal. It seems that the dosage of Red Dye #40 has a relationship with how many weeks the mice lived.
 120
 121 ### PC4
 122
 123 T-test between None and Any, and between None and High.
 124
 125 ```{r}
 126 t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose
 127        df$weeks_alive[df$dose != 'None'] # Samples with any dose
 128        )
 129
 130 # Or, using formula notation
 131 t.test(weeks_alive ~ dose == 'None', data = df)
 132
 133 # T-test between None and High
 134 t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose
 135        df$weeks_alive[df$dose == 'High'] # Samples with high dose
 136        )
 137
 138 # Formula notation is a bit tricker. I would create a temprorary dataframe
 139 df.tmp <- subset(df, subset= dose=="None" | dose=="High")
 140 t.test(weeks_alive ~ dose, data = df.tmp)
 141
 142 ```
 143
 144 These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$.
 145
 146 The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example,  the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups. The Reinhart book also provides a description of the Benjamini-Hochberg correction.
 147
 148 ## Statistical Questions
 149
 150 ### SQ1 — 6.12
 151 a) It is a sample statistic (the sample mean), because it comes from a sample. The population parameter is unknown in this case, but can be estimated from the sample statistic.
 152
 153 b) As given in the textbook, confidence intervals for proportions are equal to:
 154
 155 $$\hat{p} \pm z^* \sqrt{ \frac{\hat{p}\times(1-\hat{p})}{n}}$$
 156
 157 Where we calculate $z^*$ from the Z distribution table. For a 95% confidence interval, $z = 1.96$, so we can calculate it like this:
 158
 159
 160 ```{r}
 161
 162 lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
 163
 164 upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
 165
 166 ci = c(lower, upper)
 167 print(ci)
 168
 169 ```
 170
 171 This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.
 172
 173 c) We have a large enough sample (collected randomly) in which enough responses are drawn from each potential outcome to assume that the observations are independent and the distribution of the sample proportion is approximately normal.
 174
 175 d) The statement is **not** justified, since our confidence interval includes 50%.
 176
 177 ### SQ2 — 6.20
 178
 179 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
 180
 181 In this case, we want the margin or error to be 2%. Using the same formula from above, this translates to: $1.96 * \sqrt{\frac{.48 * .52}{n}} = .02$
 182
 183 We can solve for $n$:
 184
 185 $$\sqrt{\frac{.48 * .52}{n}} = .02/1.96$$
 186 $$\frac{.48 * .52}{n} = (.02/1.96)^2$$
 187 $$n = (.48 * .52)/(.02/1.96)^2$$
 188 Let R solve this:
 189 ```{r}
 190 (.48 * .52)/(.02/1.96)^2
 191 ```
 192 So, we need a sample of at least 2,398 (since we can't survey less than a whole person).
 193
 194 ### SQ3 — 6.38
 195
 196 The question is whether there has been a change in the proportion across two groups (the students pre- and post-class) over time. While the tools we have learned could allow you to answer that question for two groups, they assume that responses are independent. In this case, the responses are not independent because they come from the same students. You could go ahead and calculate a statistical test for difference in pooled proportions (it's just math!), but the dependence between observations violates the assumptions and means that the baseline expectations necessary to identify the hypothesis test are not met. Your test statistic will not mean what you want it to mean.
 197
 198 ### SQ4 — 6.50
 199
 200 (a) Our hypothesis focuses on the relationship of the sample and census distributions by region:
 201   $H_0$ : The sample distribution of regions follows the census distribution.
 202   $H_A$ : The sample distribution of regions does not follow the census distribution.
 203
 204 Before we can calculate the test statistic we should check the conditions:
 205 1. Independence: $500 < 10%$ of US population, and the sample is random, hence we can assume that one individual in the sample is independent of another.
 206 2. Sample size: The expected counts can be calculated as follows:
 207 $$E N E = 500 × 0.18 = 90$$
 208 $$E N C = 500 × 0.22 = 110$$
 209 $$E S = 500 × 0.37 = 185$$
 210 $$E W = 500 × 0.23 = 115$$
 211
 212 All expected counts are greater than 5 (which meets the rule of thumb re: empty "cells" in the table).
 213
 214 We can test this with a $\chi^2$ test.
 215
 216 ```{r}
 217 # Manually using the expected counts and formula for chi-squared
 218 chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
 219 chisq
 220
 221 # We could then look up the chi-square distribution for 3 degrees of freedom in the book or find it in R.
 222 # lower.tail=FALSE ensures that R returns the proportion of the distribution at least as large as our critical value:
 223 pchisq(chisq, df=3, lower.tail=FALSE)
 224
 225 # We could also use R's chisq.test function:
 226 chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
 227
 228 ```
 229
 230 The p-value for this is large, meaning that we cannot reject the null hypothesis that the sample proportions do not follow the census distribution.
 231
 232 (b) Part b answers below:
 233   i) We would probably want to treat opinion as the response and region as the explanatory variable, since it's unlikely that people move to a region based on their opinion (although that could be an empirical question!).
 234   ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship. More formally:
 235     $H_0$ : Region and opinion on direction are independent.
 236     $H_A$ : Region and opinion on direction are dependent.
 237   iii) We can again use a $\chi^2$ test.
 238
 239 You can do this by hand calculating the expected counts and then plugging the values into the formula for $\chi^2$ as we did above. If you do that, here are the expected counts:
 240
 241 Right direction:
 242 ```{r}
 243 # NE:
 244 round(171*83/500)
 245 # NC:
 246 round(171*121/500)
 247 # S:
 248 round(171*193/500)
 249 # W:
 250 round(171*103/500)
 251 ```
 252 Wrong direction:
 253 ```{r}
 254 round(329*83/500)
 255 round(329*121/500)
 256 round(329*193/500)
 257 round(329*103/500)
 258 ```
 259 Working through the formula for the $\chi^2$ test statistic and then calculating the p-value for 3 degrees of freedom:
 260
 261 ```{r}
 262 chisq <- (29-28)^2 /28 +
 263   (44-41)^2 /41 +
 264   (62-66)^2 /66 +
 265   (36-35)^2 / 35 +
 266   (54-55)^2 / 55 +
 267   (77-80)^2 / 80 +
 268   (131-127)^2 / 127 +
 269   (67-68)^2 / 68
 270
 271 chisq
 272 pchisq(chisq, df=3, lower.tail=FALSE)
 273 ```
 274
 275 And using R's `chisq.test()` function here:
 276 ```{r}
 277 x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows
 278             byrow=T) # And this says that we've entered it row by row
 279 chisq.test(x)
 280 ```
 281 The answers don't match exactly (because of the rounding I did in computing the expected counts), however the payoff/interpretation is the same: we cannot reject the null hypothesis of no relationship between region and direction opinoin in the sample.
 282
 283 ## Empirical Questions
 284
 285 ### EQ1
 286
 287 a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test.
 288
 289 b) For this type of comparison statistical tests help to give (or take away) confidence in any observed differences across categories. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.
 290
 291 c) The null hypothesis ($H_0$) is that the board purchased is independent of the gender of the customer. The alternative hypothesis ($H_A$) is that board purchase choice is dependent on gender.
 292
 293 d) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what the authors set out to understand, which is the difference between men and women (the test could have identified a significant difference in the number of unknown gender customers across board types). Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the observed differences.
 294
 295 ### — EQ2
 296
 297 a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right ideologically.
 298
 299 b) Assuming that the null hypothesis of no difference across the two groups is compelling, it would be surprising to see these results in a world where ideological orientation and blog governance have no relationship. In this respect, it makes sense to believe that this difference is likely real. Perhaps the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.
 300
 301 c) We can do this in R.
 302
 303 ```{r}
 304
 305 # First we create the dataframe
 306 df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'),
 307                 Ideology = c('Left','Left','Right','Right'),
 308                 Count = c(13,51,27,38))
 309
 310 # We can make sure it's the same by testing the Chi-squared
 311 chisq.test(matrix(df$Count, nrow=2))
 312
 313 # Using Jeremy's tidyverse code:
 314 percentage_data <- df %>%
 315   group_by(Ideology) %>%
 316   summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count),
 317             group_count = sum(Count))
 318
 319 shaw_benkler_plot = percentage_data %>%
 320   ggplot(aes(x=Ideology, y=individual_ratio * 100)) +
 321     geom_bar(stat='identity', aes(fill=c('red','blue')), show.legend=F) +
 322     ylab('Percentage of Blogs') + theme_minimal()
 323
 324 shaw_benkler_plot
 325 ```
 326
 327
 328 If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!).
 329
 330 For our purposes here, Jeremy decided to use confidence intervals. The standard error is another reasonable choice. Either way, ggplot has a `geom_errorbar` layer that is very useful.
 331
 332 Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are
 333 $\hat{p} \pm z^* \sqrt{\frac{\hat{p}~(1-\hat{p})}{n}}$
 334
 335 ```{r}
 336 ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
 337
 338 shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100),
 339                                   alpha = .3,
 340                                   size=1.1,
 341                                   width=.4)
 342
 343 ```
 344
 345 The error bars do overlap in this case, indicating that the true population proportions may not be as far apart as our point estimates suggest. Note that this is not the same as the hypothesis test.
 346
 347 d) On the one hand, we don't need to worry about the base rate fallacy because the sizes of both groups are about the same and the paper does not abuse the evidence too egregiously. The base rate fallacy would likely come into play, however, in the ways that the results are (mis)represented. For example, you might imagine some news coverage looking at these results and claiming something (totally wrong!) like "Harvard study finds right wing blogs more than twice as likely to be solo affairs." This is taking a relationship between the sample proportions ($\hat{p}$ in the language of our textbook) and converting that into a statement about the relationship between population proportions ($p$). That would be a mess.
 348
 349 Another way in which the base rate fallacy could play a role in this paper, however, concerns the presence of multiple comparisons. The authors conducted numerous statistical tests (indeed, one of the authors seems to recall that some of the tests were not even reported in the paper <Gasp!>) and they make no effort to address the baseline probability of false positives.
 350
 351 In any case, the point here is that the statistical tests reported in the paper may not mean exactly what the authors said they did in the context of the publication. That may or may not change the validity of the results, but it should inspire us all to do better statistical analysis!