problem_sets/week_06/ps6-worked-solution.Rmd

   1 ---
   2 title: "Week 6 Worked Examples"
   3 author: "Jeremy Foote"
   4 date: "April 11, 2019"
   5 output: html_document
   6 ---
   7
   8 ```{r setup, include=FALSE}
   9 knitr::opts_chunk$set(echo = TRUE, messages = F)
  10 ```
  11
  12 ## Programming Questions
  13
  14 PC0. First we import the data.
  15
  16 ```{r}
  17 raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R
  18 head(raw_df)
  19 ```
  20
  21 PC1. Let's reshape the data
  22
  23
  24 ```{r}
  25 library(tidyverse)
  26
  27 # Rename the columns
  28 colnames(raw_df) <- c("None", "Low", "Medium","High")
  29
  30 # Gather gets all of the columns from the dataframe and
  31 # use them as keys, with the value as the value from that cell.
  32 # This gives us a "long" dataframe
  33 df <- gather(raw_df, dose, weeks_alive)
  34
  35 # We want to treat dose as a factor, so we can group by it easier
  36 df$dose <- as.factor(df$dose)
  37
  38 # Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same.
  39 # We can safely remove these.
  40 df <- df[complete.cases(df),]
  41 ```
  42
  43 PC2: Now we're goint to get statistics and create some visualizations
  44
  45 ```{r}
  46
  47 tapply(df$weeks_alive, df$dose, summary)
  48
  49 # Alternative way to do this using tidyverse
  50
  51 df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
  52
  53 ```
  54
  55 When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.
  56
  57 ```{r}
  58
  59 # Histograms
  60
  61 h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  62                       fill = dose # How to group by color
  63                       )) + geom_histogram(position = 'dodge', bins = 5)
  64 h_plot
  65
  66 # In this case, faceted histograms is probably better
  67
  68 h_facet = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  69                       )) + geom_histogram(bins = 5) + facet_grid(~dose)
  70
  71 h_facet
  72
  73 # Density plots
  74 d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
  75                       fill = dose # How to group by color
  76                       )) + geom_density(alpha = .2)
  77 d_plot
  78
  79
  80 # Boxplots
  81 box_plot = df %>% ggplot(aes(y=weeks_alive,
  82                       x = dose
  83                       )) + geom_boxplot()
  84 box_plot
  85
  86 # My favorite - ridgeline plots
  87
  88 # install.packages('ggridges')
  89
  90 library(ggridges)
  91
  92 ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) +
  93   geom_density_ridges(jittered_points = T)
  94 ridge_plot
  95
  96 ```
  97
  98 It's a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.
  99
 100 The global mean is
 101
 102 ```{r}
 103 mean(df$weeks_alive)
 104 ```
 105
 106 PC3. T-test between None and Any, and between None and High.
 107
 108 ```{r}
 109
 110
 111 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 112        df[df$dose != 'None','weeks_alive'] # Samples with any dose
 113        )
 114
 115 # Or, using formula notation
 116 t.test(weeks_alive ~ dose == 'None', data = df)
 117
 118 # T-test between None and High
 119
 120 t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 121        df[df$dose == 'High','weeks_alive'] # Samples with high dose
 122        )
 123
 124 # Formula notation is a bit tricker. I would probably create a temprorary dataframe
 125
 126 tmp = df %>% filter(dose %in% c('None', 'High'))
 127
 128 t.test(weeks_alive ~ dose, data = tmp)
 129
 130 ```
 131
 132 The t-test supports the idea that receiving a dose of RD40 reduces lifespan
 133
 134 PC4. Anova
 135
 136 ```{r}
 137 summary(aov(weeks_alive ~ dose, data = df))
 138
 139 ```
 140
 141 This provides evidence that the group means are different.
 142
 143 ## Statistical Questions
 144
 145 Q1.
 146 a) It is a sample statistic, because it comes from a sample.
 147 b) Confidence intervals for proportions are equal to
 148
 149 $$p \pm z * \sqrt{ \frac{p*(1-p)}{n}}$$
 150
 151 For a 95% confidence interval, $z = 1.96$, so we can calculate it like this:
 152
 153
 154 ```{r}
 155
 156 lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
 157
 158 upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
 159
 160 ci = c(lower, upper)
 161
 162 print(ci)
 163
 164 ```
 165
 166 This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.
 167
 168 c) We have a large enough sample, which is collected randomly, to assume that the distribution is normal.
 169 d) The statement isn't justified, since our confidence interval include 50%
 170
 171 6.20
 172 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
 173
 174 Basically, we want each half of the confidence interval to be 1%, i.e.,  $1.96 * \sqrt{\frac{.48 * .52}{n}} = .01$
 175
 176 We can solve for $n$:
 177
 178 $$\sqrt{\frac{.48 * .52}{n}} = .01/1.96$$
 179 $$\frac{.48 * .52}{n} = (.01/1.96)^2$$
 180 $$n = (.48 * .52)/(.01/1.96)^2$$
 181 ```{r}
 182 (.48 * .52)/(.01/1.96)^2
 183 ```
 184 So, we need a sample of approximately 9,589
 185
 186 6.38
 187
 188 The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students.
 189
 190 6.50
 191
 192 a) We can test this with a $\chi^2$ test.
 193
 194 ```{r}
 195 chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
 196
 197 # Or manually
 198
 199 # Calculate expected values
 200 500 * c(.18,.22,.37,.23)
 201
 202 # Use the formula for chi-squared
 203 chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
 204
 205
 206 ```
 207
 208 The p-value for this is large, meaning that we don't have evidence that the sample differs from the census distribution.
 209
 210 b)
 211 i) Opinion is the response and location is the explanatory variable, since it's unlikely that people move to a region based on their opinion.
 212 ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship.
 213 iii) We can again use a $\chi^2$ test.
 214
 215 ```{r}
 216
 217 x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows
 218             byrow=T) # And this says that we've entered it row by row
 219 chisq.test(x)
 220 ```