problem_sets/week_06/ps6-worked-solution.Rmd

   1 ---
   2 title: "Week 6 Worked Examples"
   3 author: "Jeremy Foote"
   4 date: "April 11, 2019"
   5 output: html_document
   6 ---
   7
   8 ```{r setup, include=FALSE}
   9 knitr::opts_chunk$set(echo = TRUE, messages = F)
  10 ```
  11
  12 ## Programming Questions
  13
  14 PC0. First we import the data.
  15
  16 ```{r}
  17 raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R
  18 head(raw_df)
  19 ```
  20
  21 PC1. Let's reshape the data
  22
  23
  24 ```{r}
  25 library(tidyverse)
  26
  27 # Rename the columns
  28 colnames(raw_df) <- c("None", "Low", "Medium","High")
  29
  30 # Gather gets all of the columns from the dataframe and
  31 # use them as keys, with the value as the value from that cell.
  32 # This gives us a "long" dataframe
  33 df <- gather(raw_df, dose, weeks_alive)
  34
  35 # We want to treat dose as a factor, so we can group by it easier
  36 df$dose <- as.factor(df$dose)
  37
  38 # Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same.
  39 # We can safely remove these.
  40 df <- df[complete.cases(df),]
  41 ```
  42
  43 PC2: Now we're goint to get statistics and create some visualizations
  44
  45 ```{r}
  46
  47 tapply(df$weeks_alive, df$dose, summary)
  48
  49 # Alternative way to do this using tidyverse
  50
  51 df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
  52
  53 ```
  54
  55 When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.
  56
  57 ```{r}
  58
  59 # Histograms
  60
  61 h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  62                       fill = dose # How to group by color
  63                       )) + geom_histogram(position = 'dodge', bins = 5)
  64 h_plot
  65
  66 # In this case, faceted histograms is probably better
  67
  68 h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize
  69                       )) + geom_histogram(bins = 5) + facet_grid(~dose)
  70
  71 h_facet
  72
  73 # Density plots
  74 d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  75                       fill = dose # How to group by color
  76                       )) + geom_density(alpha = .2)
  77 d_plot
  78
  79
  80 # Boxplots
  81 box_plot = df %>% ggplot(aes(y=weeks_alive,
  82                       x = dose
  83                       )) + geom_boxplot()
  84 box_plot
  85
  86 # My favorite - ridgeline plots
  87
  88 # install.packages('ggridges')
  89
  90 library(ggridges)
  91
  92 ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) +
  93   geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal()
  94 ridge_plot
  95
  96 ```
  97
  98 It's a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.
  99
 100 The global mean is
 101
 102 ```{r}
 103 mean(df$weeks_alive)
 104 ```
 105
 106
 107 PC3. Anova
 108
 109 ```{r}
 110 summary(aov(weeks_alive ~ dose, data = df))
 111
 112 ```
 113
 114 This provides evidence that the group means are different.
 115
 116 PC4. T-test between None and Any, and between None and High.
 117
 118 ```{r}
 119
 120
 121 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 122        df[df$dose != 'None','weeks_alive'] # Samples with any dose
 123        )
 124
 125 # Or, using formula notation
 126 t.test(weeks_alive ~ dose == 'None', data = df)
 127
 128 # T-test between None and High
 129
 130 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 131        df[df$dose == 'High','weeks_alive'] # Samples with high dose
 132        )
 133
 134 # Formula notation is a bit tricker. I would probably create a temprorary dataframe
 135
 136 tmp = df %>% filter(dose %in% c('None', 'High'))
 137
 138 t.test(weeks_alive ~ dose, data = tmp)
 139
 140 ```
 141
 142 These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$.
 143
 144 The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example,  the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups.
 145
 146
 147 ## Statistical Questions
 148
 149 Q1.
 150 a) It is a sample statistic, because it comes from a sample.
 151 b) Confidence intervals for proportions are equal to
 152
 153 $$p \pm z * \sqrt{ \frac{p*(1-p)}{n}}$$
 154
 155 For a 95% confidence interval, $z = 1.96$, so we can calculate it like this:
 156
 157
 158 ```{r}
 159
 160 lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
 161
 162 upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
 163
 164 ci = c(lower, upper)
 165
 166 print(ci)
 167
 168 ```
 169
 170 This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.
 171
 172 c) We have a large enough sample, which is collected randomly, to assume that the distribution is normal.
 173 d) The statement isn't justified, since our confidence interval include 50%
 174
 175 6.20
 176 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
 177
 178 Basically, we want each half of the confidence interval to be 1%, i.e.,  $1.96 * \sqrt{\frac{.48 * .52}{n}} = .01$
 179
 180 We can solve for $n$:
 181
 182 $$\sqrt{\frac{.48 * .52}{n}} = .01/1.96$$
 183 $$\frac{.48 * .52}{n} = (.01/1.96)^2$$
 184 $$n = (.48 * .52)/(.01/1.96)^2$$
 185 ```{r}
 186 (.48 * .52)/(.01/1.96)^2
 187 ```
 188 So, we need a sample of approximately 9,589
 189
 190 6.38
 191
 192 The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students.
 193
 194 6.50
 195
 196 a) We can test this with a $\chi^2$ test.
 197
 198 ```{r}
 199 chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
 200
 201 # Or manually
 202
 203 # Calculate expected values
 204 500 * c(.18,.22,.37,.23)
 205
 206 # Use the formula for chi-squared
 207 chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
 208
 209 chisq
 210
 211 # We could then look up the chi-square distribution for 3 degrees of freedom
 212
 213 ```
 214
 215 The p-value for this is large, meaning that we don't have evidence that the sample differs from the census distribution.
 216
 217 b)
 218 i) Opinion is the response and location is the explanatory variable, since it's unlikely that people move to a region based on their opinion.
 219 ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship.
 220 iii) We can again use a $\chi^2$ test.
 221
 222 ```{r}
 223
 224 x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows
 225             byrow=T) # And this says that we've entered it row by row
 226 chisq.test(x)
 227 ```
 228
 229 ## Empirical Questions
 230
 231
 232 EQ0.
 233
 234 a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test.
 235
 236 b) The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased.
 237
 238 c) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference.
 239
 240 d) Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.
 241
 242
 243 EQ1
 244
 245 a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right.
 246
 247 b) It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.
 248
 249 c)
 250
 251 ```{r}
 252
 253 # First we create the dataframe
 254 df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'),
 255                 Ideology = c('Left','Left','Right','Right'),
 256                 Count = c(13,51,27,38))
 257
 258 # We can make sure it's the same by testing the Chi-squared
 259 chisq.test(matrix(df$Count, nrow=2))
 260
 261 percentage_data = df %>%
 262   group_by(Ideology) %>%
 263   summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count),
 264             group_count = sum(Count))
 265
 266 shaw_benkler_plot = percentage_data %>%
 267   ggplot(aes(x=Ideology, y=individual_ratio * 100)) +
 268     geom_bar(stat='identity', aes(fill=c('red','blue')), show.legend=F) +
 269     ylab('Percentage of Blogs') + theme_minimal()
 270
 271 shaw_benkler_plot
 272
 273 # If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!)
 274
 275 # I decided to use confidence intervals. The standard error is another reasonable choice
 276
 277 # Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n)
 278
 279 ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
 280
 281 shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100),
 282                                   alpha = .3,
 283                                   size=1.1,
 284                                   width=.4)
 285
 286 ```
 287
 288 The error bars do overlap in this case, but that actually doesn't tell us whether they are significantly different.
 289
 290 d) We don't need to be very worried about the base rate fallacy because the sizes of both groups are about the same.