title: "Week 6 Worked Examples"
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
## Programming Questions
PC0. First we import the data.
raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R
PC1. Let's reshape the data.
colnames(raw_df) <- c("None", "Low", "Medium", "High")
# gather() takes all of the columns from the data frame and
# uses their names as keys, with each cell's value as the value.
# This gives us a "long" data frame
df <- gather(raw_df, dose, weeks_alive)
# We want to treat dose as a factor, so we can group by it more easily
df$dose <- as.factor(df$dose)
# Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same.
# We can safely remove these.
df <- df[complete.cases(df),]
PC2. Now we're going to get summary statistics and create some visualizations.
tapply(df$weeks_alive, df$dose, summary)
# Alternative way to do this using tidyverse
df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.
h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
fill = dose # How to group by color
)) + geom_histogram(position = 'dodge', bins = 5)
# In this case, faceted histograms are probably better
h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize
)) + geom_histogram(bins = 5) + facet_grid(~dose)
d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
fill = dose # How to group by color
)) + geom_density(alpha = .2)
box_plot = df %>% ggplot(aes(y=weeks_alive,
# My favorite - ridgeline plots
# install.packages('ggridges')
ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) +
geom_density_ridges(jittered_points = TRUE)
It's a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.
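The plots can be supplemented with a quick numerical check. The following is a minimal sketch, not part of the original solution: `check_assumptions` is a hypothetical helper, and the `toy` data frame is simulated placeholder data (not the owan03 values), used only so the block runs on its own. It reports each group's standard deviation and a Shapiro-Wilk p-value, then runs base R's `bartlett.test()` for equal variances.

```r
# Hypothetical helper (an assumption, not from the original solution):
# per-group sample size, standard deviation, and Shapiro-Wilk p-value.
check_assumptions <- function(values, groups) {
  groups <- droplevels(as.factor(groups))
  data.frame(
    n = as.vector(tapply(values, groups, length)),
    sd = as.vector(tapply(values, groups, sd)),
    shapiro_p = as.vector(tapply(values, groups,
                                 function(v) shapiro.test(v)$p.value)),
    row.names = levels(groups)
  )
}

# Simulated placeholder data only; these are NOT the real measurements.
set.seed(1)
toy <- data.frame(
  dose = rep(c("None", "High"), each = 10),
  weeks_alive = c(rnorm(10, mean = 90, sd = 10),
                  rnorm(10, mean = 70, sd = 10))
)

res <- check_assumptions(toy$weeks_alive, toy$dose)
print(res)

# Bartlett's test of equal variances across groups (base R, stats package)
print(bartlett.test(weeks_alive ~ dose, data = toy))
```

With the real data, the equivalent call would be `check_assumptions(df$weeks_alive, df$dose)`. With groups this small, these tests have little power, so they complement the plots rather than replace them.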
PC3. T-test between None and Any, and between None and High.
t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
df[df$dose != 'None','weeks_alive'] # Samples with any dose
# Or, using formula notation
t.test(weeks_alive ~ dose == 'None', data = df)
# T-test between None and High
t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
df[df$dose == 'High','weeks_alive'] # Samples with high dose
# Formula notation is a bit trickier. I would probably create a temporary dataframe
tmp = df %>% filter(dose %in% c('None', 'High'))
t.test(weeks_alive ~ dose, data = tmp)
The t-test supports the idea that receiving a dose of RD40 reduces lifespan.
summary(aov(weeks_alive ~ dose, data = df))
This provides evidence that the group means are different.
## Statistical Questions
a) It is a sample statistic, because it comes from a sample.
b) Confidence intervals for proportions are equal to
$$p \pm z * \sqrt{ \frac{p*(1-p)}{n}}$$
For a 95% confidence interval, $z = 1.96$, so we can calculate it like this:
lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.
c) We have a large enough sample, which is collected randomly, to assume that the sampling distribution of the proportion is approximately normal.
d) The statement isn't justified, since our confidence interval includes 50%.
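The interval in (b) and the check in (d) can be wrapped in a small helper. This is a minimal sketch: `prop_ci` is a hypothetical name rather than anything from the course materials, and it relies on the same normal approximation as the formula in (b). It uses `qnorm()` instead of the rounded 1.96, which does not change the result at this precision.

```r
# Hypothetical helper (an assumption): normal-approximation confidence
# interval for a sample proportion p_hat from a sample of size n.
prop_ci <- function(p_hat, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)          # about 1.96 for 95%
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)   # half-width of the interval
  c(lower = p_hat - margin, upper = p_hat + margin)
}

# Values from the poll: 48% support, n = 1259
ci <- prop_ci(0.48, 1259)
print(round(ci, 3))

# Part (d): is 50% inside the interval?
contains_half <- ci[["lower"]] < 0.5 && 0.5 < ci[["upper"]]
print(contains_half)
```

Because 50% falls inside the interval, the data do not settle whether a majority supports or opposes legalization.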
We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
Basically, we want each half of the confidence interval to be 1%, i.e., $1.96 * \sqrt{\frac{.48 * .52}{n}} = .01$
We can solve for $n$:
$$\sqrt{\frac{.48 * .52}{n}} = .01/1.96$$
$$\frac{.48 * .52}{n} = (.01/1.96)^2$$
$$n = (.48 * .52)/(.01/1.96)^2$$
(.48 * .52)/(.01/1.96)^2
So, we need a sample of approximately 9,589.
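The same algebra generalizes to any point estimate and margin of error. A minimal sketch follows; `sample_size_for_margin` is a hypothetical helper name, and the default `z = 1.96` simply mirrors the value used in the derivation.

```r
# Hypothetical helper (an assumption): smallest sample size for which the
# half-width of the normal-approximation interval is at most `margin`.
sample_size_for_margin <- function(p_hat, margin, z = 1.96) {
  n_exact <- (p_hat * (1 - p_hat)) / (margin / z)^2
  ceiling(n_exact)  # round up, since n must be a whole number
}

n_needed <- sample_size_for_margin(0.48, 0.01)
print(n_needed)
```

Rounding up rather than to the nearest integer guarantees the margin is no wider than requested.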
The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students.
a) We can test this with a $\chi^2$ test.
chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
# Calculate expected values
500 * c(.18,.22,.37,.23)
# Use the formula for chi-squared
chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
The p-value for this is large, meaning that we don't have evidence that the sample differs from the census distribution.
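To connect the hand-computed statistic to that p-value, the calculation can be vectorized and passed to `pchisq()`. This is a minimal sketch; the variable names (`observed`, `census_p`, and so on) are my own, and the counts and proportions are the ones used in the `chisq.test()` call.

```r
observed <- c(83, 121, 193, 103)   # counts from the sample
census_p <- c(.18, .22, .37, .23)  # census proportions

# Expected counts under the null hypothesis
expected <- sum(observed) * census_p

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq_stat <- sum((observed - expected)^2 / expected)

# Goodness-of-fit test with k categories has k - 1 degrees of freedom
dof <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chisq_stat, df = dof, lower.tail = FALSE)

print(c(statistic = chisq_stat, df = dof, p_value = p_value))
```

The result should agree with the output of `chisq.test()` on the same inputs.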
i) Opinion is the response and location is the explanatory variable, since it's unlikely that people move to a region based on their opinion.
ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship.
iii) We can again use a $\chi^2$ test.
x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # This makes a matrix with 4 rows
byrow = TRUE) # And this says that we've entered it row by row