problem_sets/week_06/ps6-worked-solution.Rmd

   1 ---
   2 title: 'Week 6 problem set: Worked solutions'
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Jeremy Foote"
   5 date: "April 11, 2019"
   6 output: html_document
   7 ---
   8
   9 ```{r setup, include=FALSE}
  10 knitr::opts_chunk$set(echo = TRUE, messages = F)
  11 ```
  12
  13 ## Programming Questions
  14
  15 ### PC0
  16
  17 First we import the data.
  18
  19 ```{r}
  20 raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R
  21 head(raw_df)
  22 ```
  23
  24 ### PC1
  25
  26 Let's reshape the data
  27
  28 ```{r}
  29 library(tidyverse)
  30
  31 # Rename the columns
  32 colnames(raw_df) <- c("None", "Low", "Medium","High")
  33
  34 # Gather gets all of the columns from the dataframe and
  35 # use them as keys, with the value as the value from that cell.
  36 # This gives us a "long" dataframe
  37 df <- gather(raw_df, dose, weeks_alive)
  38
  39 # We want to treat dose as a factor, so we can group by it easier
  40 df$dose <- as.factor(df$dose)
  41
  42 # Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same.
  43 # We can safely remove these.
  44 df <- df[complete.cases(df),]
  45 ```
  46
  47 ### PC2
  48
  49 Now we're going to calculate summary statistics and create some visualizations
  50
  51 ```{r}
  52
  53 tapply(df$weeks_alive, df$dose, summary)
  54
  55 # Alternative way to do this using tidyverse
  56
  57 df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
  58
  59 ```
  60
  61 When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.
  62
  63 ```{r}
  64
  65 # Histograms
  66
  67 h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  68                       fill = dose # How to group by color
  69                       )) + geom_histogram(position = 'dodge', bins = 5)
  70 h_plot
  71
  72 # In this case, faceted histograms is probably better
  73
  74 h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize
  75                       )) + geom_histogram(bins = 5) + facet_grid(~dose)
  76
  77 h_facet
  78
  79 # Density plots
  80 d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  81                       fill = dose # How to group by color
  82                       )) + geom_density(alpha = .2)
  83 d_plot
  84
  85
  86 # Boxplots
  87 box_plot = df %>% ggplot(aes(y=weeks_alive,
  88                       x = dose
  89                       )) + geom_boxplot()
  90 box_plot
  91
  92 # My favorite - ridgeline plots
  93
  94 # install.packages('ggridges')
  95
  96 library(ggridges)
  97
  98 ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) +
  99   geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal()
 100 ridge_plot
 101
 102 ```
 103
 104 It's a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.
 105
 106 The global mean is
 107
 108 ```{r}
 109 mean(df$weeks_alive)
 110 ```
 111
 112 ### PC3
 113
 114 Anova!
 115 ```{r}
 116 summary(aov(weeks_alive ~ dose, data = df))
 117
 118 ```
 119
 120 This provides evidence that the group means are different.
 121
 122 ### PC4
 123
 124 T-test between None and Any, and between None and High.
 125
 126 ```{r}
 127
 128
 129 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 130        df[df$dose != 'None','weeks_alive'] # Samples with any dose
 131        )
 132
 133 # Or, using formula notation
 134 t.test(weeks_alive ~ dose == 'None', data = df)
 135
 136 # T-test between None and High
 137
 138 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 139        df[df$dose == 'High','weeks_alive'] # Samples with high dose
 140        )
 141
 142 # Formula notation is a bit tricker. I would probably create a temprorary dataframe
 143
 144 tmp = df %>% filter(dose %in% c('None', 'High'))
 145
 146 t.test(weeks_alive ~ dose, data = tmp)
 147
 148 ```
 149
 150 These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$.
 151
 152 The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example,  the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups.
 153
 154 ## Statistical Questions
 155
 156 Q1.
 157 a) It is a sample statistic, because it comes from a sample.
 158 b) Confidence intervals for proportions are equal to
 159
 160 $$p \pm z * \sqrt{ \frac{p*(1-p)}{n}}$$
 161
 162 For a 95% confidence interval, $z = 1.96$, so we can calculate it like this:
 163
 164
 165 ```{r}
 166
 167 lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
 168
 169 upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
 170
 171 ci = c(lower, upper)
 172
 173 print(ci)
 174
 175 ```
 176
 177 This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.
 178
 179 c) We have a large enough sample, which is collected randomly, to assume that the distribution is normal.
 180 d) The statement isn't justified, since our confidence interval include 50%
 181
 182 6.20
 183 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
 184
 185 Basically, we want each half of the confidence interval to be 1%, i.e.,  $1.96 * \sqrt{\frac{.48 * .52}{n}} = .01$
 186
 187 We can solve for $n$:
 188
 189 $$\sqrt{\frac{.48 * .52}{n}} = .01/1.96$$
 190 $$\frac{.48 * .52}{n} = (.01/1.96)^2$$
 191 $$n = (.48 * .52)/(.01/1.96)^2$$
 192 ```{r}
 193 (.48 * .52)/(.01/1.96)^2
 194 ```
 195 So, we need a sample of approximately 9,589
 196
 197 6.38
 198
 199 The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students.
 200
 201 6.50
 202
 203 a) We can test this with a $\chi^2$ test.
 204
 205 ```{r}
 206 chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
 207
 208 # Or manually
 209
 210 # Calculate expected values
 211 500 * c(.18,.22,.37,.23)
 212
 213 # Use the formula for chi-squared
 214 chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
 215
 216 chisq
 217
 218 # We could then look up the chi-square distribution for 3 degrees of freedom
 219
 220 ```
 221
 222 The p-value for this is large, meaning that we don't have evidence that the sample differs from the census distribution.
 223
 224 b)
 225 i) Opinion is the response and location is the explanatory variable, since it's unlikely that people move to a region based on their opinion.
 226 ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship.
 227 iii) We can again use a $\chi^2$ test.
 228
 229 ```{r}
 230
 231 x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows
 232             byrow=T) # And this says that we've entered it row by row
 233 chisq.test(x)
 234 ```
 235
 236 ## Empirical Questions
 237
 238
 239 EQ0.
 240
 241 a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test.
 242
 243 b) The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased.
 244
 245 c) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference.
 246
 247 d) Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.
 248
 249
 250 EQ1
 251
 252 a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right.
 253
 254 b) It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.
 255
 256 c)
 257
 258 ```{r}
 259
 260 # First we create the dataframe
 261 df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'),
 262                 Ideology = c('Left','Left','Right','Right'),
 263                 Count = c(13,51,27,38))
 264
 265 # We can make sure it's the same by testing the Chi-squared
 266 chisq.test(matrix(df$Count, nrow=2))
 267
 268 percentage_data = df %>%
 269   group_by(Ideology) %>%
 270   summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count),
 271             group_count = sum(Count))
 272
 273 shaw_benkler_plot = percentage_data %>%
 274   ggplot(aes(x=Ideology, y=individual_ratio * 100)) +
 275     geom_bar(stat='identity', aes(fill=c('red','blue')), show.legend=F) +
 276     ylab('Percentage of Blogs') + theme_minimal()
 277
 278 shaw_benkler_plot
 279
 280 # If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!)
 281
 282 # I decided to use confidence intervals. The standard error is another reasonable choice
 283
 284 # Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n)
 285
 286 ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
 287
 288 shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100),
 289                                   alpha = .3,
 290                                   size=1.1,
 291                                   width=.4)
 292
 293 ```
 294
 295 The error bars do overlap in this case, but that actually doesn't tell us whether they are significantly different.
 296
 297 d) We don't need to be very worried about the base rate fallacy because the sizes of both groups are about the same.