os_exercises/ch6_exercises_solutions.rmd

   1 ---
   2 title: "Chapter 6 Textbook exercises"
   3 subtitle: "Solutions to even-numbered questions  \nStatistics and statistical programming  \nNorthwestern University  \nMTS
   4   525"
   5 author: "Aaron Shaw"
   6 date: "October 22, 2020"
   7 output:
   8   html_document:
   9     toc: yes
  10     toc_depth: 3
  11     toc_float:
  12       collapsed: false
  13       smooth_scroll: true
  14     theme: readable
  15   pdf_document:
  16     toc: no
  17     toc_depth: '3'
  18     latex_engine: xelatex
  19 header-includes:
  20   - \newcommand{\lt}{<}
  21   - \newcommand{\gt}{>}
  22   - \renewcommand{\leq}{≤}
  23   - \usepackage{lmodern}
  24 ---
  25
  26 ```{r setup, include=FALSE}
  27 knitr::opts_chunk$set(echo = TRUE)
  28
  29 ```
  30
  31
  32
  33 All exercises taken from the *OpenIntro Statistics* textbook, $4^{th}$ edition, Chapter 6.
  34
  35
  36 # 6.10 Marijuana legalization, Part I
  37
(a) It is a sample statistic (the sample proportion, $\hat{p}$), because it comes from a sample. The population parameter (the population proportion, $p$) is unknown in this case, but can be estimated from the sample statistic.
  39
  40 (b) As given in the textbook, confidence intervals for proportions are equal to:
  41
  42 $$\hat{p} \pm z^* \sqrt{ \frac{\hat{p}\times(1-\hat{p})}{n}}$$
  43
  44 Where we calculate $z^*$ from the Z distribution table. For a 95% confidence interval, $z^* = 1.96$, so we can plug in the values $\hat{p}$ and $n$ given in the problem and calculate the interval like this:
```{r}
# 95% CI for a proportion: p-hat = 0.61, n = 1578, z* = 1.96
lower <- .61 - 1.96 * sqrt(.61 * .39 / 1578)
upper <- .61 + 1.96 * sqrt(.61 * .39 / 1578)

ci <- c(lower, upper)
print(ci)
```
  52
  53 This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~58.6% and ~63.4%.
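If you would rather not look up $z^*$ in a table, R's standard normal quantile function can recover it. Here is a minimal sketch that recomputes the interval that way; the names `conf_level`, `z_star`, `p_hat`, `n`, `margin`, and `ci_check` are mine, chosen just for illustration:

```r
# Recover z* for a 95% interval from the standard normal quantile function
conf_level <- 0.95
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Recompute the interval from part (b) using that critical value
p_hat <- 0.61
n <- 1578
margin <- z_star * sqrt(p_hat * (1 - p_hat) / n)
ci_check <- c(lower = p_hat - margin, upper = p_hat + margin)
round(ci_check, 3)
```

Because `qnorm()` returns a slightly more precise value than 1.96, the endpoints may differ from the hand calculation in the later decimal places.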
  54
(c) We should believe that the distribution of the sample proportion is approximately normal because both conditions are plausibly met: (1) the respondents were sampled randomly, so the observations can be treated as independent, and (2) the sample is large enough that we see at least 10 responses for each outcome (the success-failure condition: $1578 \times 0.61 \approx 963$ supporters and $1578 \times 0.39 \approx 615$ non-supporters).
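The sample-size part of that argument can be checked directly. The textbook's success-failure condition asks for at least 10 expected responses in each outcome. A quick sketch, with variable names that are mine and purely illustrative:

```r
# Values given in the problem
n <- 1578
p_hat <- 0.61

# Expected counts of "successes" (support) and "failures" (do not support)
successes <- n * p_hat
failures <- n * (1 - p_hat)
c(successes = successes, failures = failures)

# The condition holds if both counts are at least 10
all(c(successes, failures) >= 10)
```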
  56
  57 (d) Yes, the statement is justified, since our confidence interval is entirely above 50%.
  58
  59 # 6.16 Marijuana legalization, Part II
  60
  61 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.
  62
In this case, we want the margin of error to be at most $2\%$. Using the same formula from above, this translates to:
  64 $$1.96 \times \sqrt{\frac{.61 \times .39}{n}} \leq .02$$
  65
  66 Rearrange to solve for $n$:
  67
  68 $$\begin{array}{c c c}
  69 \sqrt{\frac{.61 \times .39}{n}} & \leq & \frac{.02}{1.96}\\ \\
  70 \frac{.61 \times .39}{n} & \leq & \left(\frac{.02}{1.96}\right)^2\\ \\
  71 \frac{(.61 \times .39)}{\left(\frac{.02}{1.96}\right)^2} & \leq & n
  72 \end{array}$$
  73
  74 Let R solve this:
```{r}
# Minimum n for a margin of error of at most 0.02
(.61 * .39)/(.02/1.96)^2
```
  78 So, we need a sample of at least 2,285 people (since we can't survey fractions of a person).
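The rounding-up step can be made explicit with `ceiling()`. The helper below is a hypothetical convenience function (`min_sample_size` is my name, not something from the textbook), and it assumes the same point estimate and $z^*$ used above:

```r
# Hypothetical helper: smallest whole n whose margin of error is at most `moe`
min_sample_size <- function(p_hat, moe, z_star = 1.96) {
  ceiling(p_hat * (1 - p_hat) / (moe / z_star)^2)
}

n_needed <- min_sample_size(p_hat = 0.61, moe = 0.02)
n_needed

# Sanity check: the margin of error at n_needed, and with one fewer respondent
1.96 * sqrt(0.61 * 0.39 / n_needed)
1.96 * sqrt(0.61 * 0.39 / (n_needed - 1))
```

The first margin should come in at or just under 0.02 and the second just over it, which is why we round up rather than to the nearest integer.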
  79
  80 # 6.22 Sleepless on the west coast
  81
Before we march ahead and plug numbers into formulas, it's probably good to verify that the conditions for calculating a valid confidence interval for a difference of proportions are met. Those conditions (and the corresponding situation here) are:
  83
1. *Independence*: Both samples are random and each includes less than $10\%$ of its state's population, so the observations within each state are likely independent. In addition, the two samples are independent of each other (the individuals sampled in Oregon are unlikely to depend on the individuals sampled in California).
  85
2. *Success-failure*: The numbers of "successes" and "failures" in each state are all greater than 10 (you can multiply the respective sample sizes by the proportions in the "success" and "failure" outcomes to calculate this directly).
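Here is one way to run that success-failure calculation for both states at once. This is just a sketch using the sample sizes and proportions from the problem; the object names (`n`, `p_hat`, `sf_counts`) are mine:

```r
# Sample sizes and observed proportions reporting sleep deprivation
n <- c(CA = 11545, OR = 4691)
p_hat <- c(CA = 0.08, OR = 0.088)

# Expected "successes" and "failures" in each state
sf_counts <- rbind(
  successes = n * p_hat,
  failures = n * (1 - p_hat)
)
sf_counts

# The condition holds if every cell is at least 10
all(sf_counts >= 10)
```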
  87
With that settled, we can move on to the calculation. The first equation here is the formula for a confidence interval for a difference of proportions. Note that the subscripts $CA$ and $OR$ identify the observed proportions ($\hat{p}$) reporting sleep deprivation and the sample sizes ($n$) for each of the two states.
  89
  90 $$\begin{array}{l}
  91 \hat{p}_{CA}-\hat{p}_{OR} ~\pm~ z^*\sqrt{\frac{\hat{p}_{CA}(1-\hat{p}_{CA})}{n_{CA}} + \frac{\hat{p}_{OR}(1-\hat{p}_{OR})}{n_{OR}}}
  92 \end{array}$$
  93 Plug values in and recall that $z^*=1.96$ for a $95\%$ confidence interval:
$$\begin{array}{l}
0.08-0.088 ~\pm~ 1.96\sqrt{\frac{0.08 \times 0.92}{11545} + \frac{0.088 \times 0.912}{4691}}
\end{array}$$
  97
  98 Let's let R take it from there:
```{r 6.22 CI}
# Variance of each sample proportion
var.ca <- (0.08*0.92 ) / 11545
var.or <- (0.088*0.912) / 4691
# Margin of error: z* times the standard error of the difference
moe.diff <- 1.96 * sqrt(var.or + var.ca)
upper <- 0.08-0.088 + moe.diff
lower <- 0.08-0.088 - moe.diff
print(c(lower, upper))
```
 107
The 95% confidence interval for the difference between the proportions of California residents and Oregon residents reporting sleep deprivation runs from about $-1.75\%$ to $0.15\%$. In other words, we can be 95% confident that the true difference between the two proportions (California minus Oregon) falls within that range.
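If you find yourself doing this calculation often, it can be wrapped in a small function. The sketch below is a hypothetical helper (`diff_prop_ci` is my name, not a base R function) that applies the same formula, followed by a check of whether the interval includes 0:

```r
# Hypothetical helper: CI for a difference of two proportions (p1 - p2)
diff_prop_ci <- function(p1, n1, p2, n2, z_star = 1.96) {
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  est <- p1 - p2
  c(lower = est - z_star * se, upper = est + z_star * se)
}

# California minus Oregon, using the values from the problem
ci_sleep <- diff_prop_ci(p1 = 0.08, n1 = 11545, p2 = 0.088, n2 = 4691)
round(ci_sleep, 4)

# Does the interval include 0 (that is, no difference)?
unname(ci_sleep["lower"] < 0 & ci_sleep["upper"] > 0)
```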
 109
 110 # 6.30 Apples, doctors, and informal experiments on children
 111
 112 **tl;dr answer**: No. Constructing the test implied by the question is not possible without violating the assumptions that define the estimation procedure, and thereby invalidating the estimate.
 113
 114 **longer answer**: The question the teacher wants to answer is whether there has been a meaningful change in a proportion across two data collection points (the students pre- and post-class). While the tools we have learned could allow you to answer that question for *two independent groups*, the responses are not independent in this case because they come from the same students. You could go ahead and calculate a statistical test for difference in pooled proportions (after all, it's just plugging values into an equation!) and explain how the data violates a core assumption of the test. However, since the dependence between observations violates that core assumption, the baseline expectations necessary to construct the null distribution against which the observed test statistic can be evaluated are not met. The results of the hypothesis test under these conditions may or may not mean what you might expect (the test has nothing to say about that).
 115
 116 # 6.40 Website experiment
 117
 118 (a) The question gives us the total sample size and the proportions cross-tabulated for treatment condition (position) and outcome (download or not). I'll use R to work out the answers here.
 119
```{r}
# Proportions of all 701 visitors by link position and outcome
props <- data.frame(
  "position" = c("pos1", "pos2", "pos3"),
  "download" = c(.138, .146, .121),
  "no_download" = c(.183, .185, .227)
)
props
```
 128
 129 Now multiply those values by the sample size to get the counts:
 130
```{r}
# Convert proportions to whole-number counts
counts <- data.frame(
  "position" = props$position,
  "download" = round(props$download*701, 0),
  "no_download" =  round(props$no_download*701, 0)
)

counts
```
 140
(b) This setup leads toward a $\chi^2$ test for goodness of fit to evaluate balance in a one-way table (revisit the section of the chapter dealing with this test for more details). We can construct and conduct the test using the textbook's (slightly cumbersome, but delightfully thorough and transparent) "prepare-check-calculate-conclude" algorithm for hypothesis testing. Let's walk through that:
 142
**Prepare**: The first thing to consider is the values the question is actually asking us to compare: the total number of study participants in each condition. We can get those from the table in part (a) above:
```{r}
# Total visitors assigned to each position
counts$total <- counts$download + counts$no_download
counts$total
```
 148 So the idea here is to figure out whether these counts are less balanced than might be expected. (And this is maybe a good time to point out that you might eyeball these values and notice that they're all pretty close together.)
 149
Here are the hypotheses stated more formally:

$H_0$: The chance of a site visitor being in any of the three groups is equal.

$H_A$: The chance of a site visitor being in one group or another is not equal.
 153
**Check**: Now we can check the assumptions for the test. If $H_0$ were true, we might expect $1/3$ of the 701 visitors (233.67 visitors) to be in each group. This expected (and observed) count is greater than 5 for all three groups, satisfying the *sample size / distribution condition*. Because the visitors were assigned into the groups randomly and only appear in their respective group once, the *independence condition* is also satisfied. That's both of the conditions for this test, so we can go ahead and conduct it.
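The expected counts under $H_0$ are easy to compute in R. A small sketch, using the group totals from the **Prepare** step; the names `observed` and `expected` are mine:

```r
# Observed number of visitors in each position
observed <- c(pos1 = 225, pos2 = 232, pos3 = 244)

# Under H0, each group is expected to receive an equal share of all visitors
expected <- rep(sum(observed) / length(observed), length(observed))
expected

# Sample size / distribution condition: every expected count is at least 5
all(expected >= 5)
```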
 155
 156 **Calculate**: For a $\chi^2$ test, we need to calculate a test statistic as well as the number of degrees of freedom. Here we go, in that order.
 157
 158 First up, let's set up the test statistic given some number of cells ($k$) in the one-way table:
$$\begin{array}{l}
~\chi^2 = \sum\limits_{i=1}^{k} \frac{(Observed_i-Expected_i)^2}{Expected_i}\\\\
\phantom{~\chi^2} = \frac{(225-233.67)^2}{233.67} + \frac{(232-233.67)^2}{233.67} + \frac{(244-233.67)^2}{233.67}\\\\
\phantom{~\chi^2} = 0.79\\
\end{array}$$
 164 Now the degrees of freedom:
 165 $$df = k-1 = 2$$
 166 You can look up the results in the tables at the end of the book or calculate it in R using the `pchisq()` function. Note that the `pchisq()` function returns "lower tail" area values from the $\chi^2$ distribution. However, for these tests, we usually want the corresponding "upper tail" area, which can be found by subtracting the results of a call to `pchisq()` from 1.
 167
```{r}
# Upper-tail area (p-value) for chi-squared = 0.79 with 2 degrees of freedom
1-pchisq(.79, df=2)
```
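`pchisq()` also accepts a `lower.tail` argument, so the subtraction can be skipped if you prefer. A quick sketch (the name `p_upper` is mine, just for illustration):

```r
# Ask pchisq() for the upper-tail area directly
p_upper <- pchisq(0.79, df = 2, lower.tail = FALSE)
p_upper

# Confirm that it matches the subtraction approach
all.equal(p_upper, 1 - pchisq(0.79, df = 2))
```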
 171
 172 **Conclude**: Because this p-value is *larger* than 0.05, we cannot reject $H_0$. That is, we do not find evidence that randomization of site visitors to the groups is imbalanced.
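As a cross-check on the hand calculation, base R's `chisq.test()` runs a goodness-of-fit test when given a vector of counts, and by default it assumes equal expected proportions across the cells. A minimal sketch (the object names `observed` and `gof` are mine):

```r
# Observed number of visitors in each position
observed <- c(pos1 = 225, pos2 = 232, pos3 = 244)

# Goodness-of-fit test against equal proportions (the default)
gof <- chisq.test(observed)

gof$statistic  # chi-squared test statistic
gof$parameter  # degrees of freedom
gof$p.value    # upper-tail p-value
```

The results should agree with the values worked out by hand, up to rounding.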
 173
 174 (c)  I said you did *not* need to do this one, but I'll walk through the setup and solution anyway because it's useful to have an example. We're doing a $\chi^2$ test again, but this time for independence in a two-way table. Because the underlying setup is pretty similar, my solution here is a bit more concise.
 175
**Prepare**: Create the null and alternative hypotheses (in words here, but we could do this in notation too).

$H_0$: No difference in download rate across the experiment groups.

$H_A$: Some difference in download rate across the groups.
 179
**Check**: Each visitor was randomly assigned to a group and only counted once in the table, so the observations are independent. The expected counts under $H_0$ can be computed by following the procedure described on p.241 of the textbook (row total times column total, divided by the table total). Those expected counts in this case are (reading down the first column then down the second): 91.2, 94.0, 98.9, 133.8, 138.0, 145.1. All of these expected counts are at least 5 (which, let's be honest, you might have been able to infer/guess just by looking at the observed counts). Therefore we can use the $\chi^2$ test.
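The expected counts can be reproduced in R from the observed table in part (a). This is a sketch of the row-total-times-column-total calculation; the object names `observed` and `expected` are mine:

```r
# Observed counts from part (a): rows are positions, columns are outcomes
observed <- matrix(
  c(97, 102, 85, 128, 130, 159),
  ncol = 2,
  dimnames = list(
    position = c("pos1", "pos2", "pos3"),
    outcome = c("download", "no_download")
  )
)

# Expected count for each cell: (row total * column total) / table total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Condition for the test: every expected count is at least 5
all(expected >= 5)
```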
 181
**Calculate**: We need the test statistic, summed over all $k = 6$ cells of the two-way table, and the corresponding degrees of freedom:

$$\begin{array}{l}
~\chi^2 = \sum\limits_{i=1}^{k} \frac{(Observed_i-Expected_i)^2}{Expected_i}\\\\
\phantom{~\chi^2} = \frac{(97-91.2)^2}{91.2} +~ ... ~+\frac{(159-145.1)^2}{145.1}\\\\
\phantom{~\chi^2} = 5.04\\\\
df = (3-1) \times (2-1) = 2
\end{array}$$
 190
 191 Once again, I'll let R calculate the p-value:
```{r}
# Upper-tail area (p-value) for chi-squared = 5.04 with 2 degrees of freedom
1-pchisq(5.04, 2)
```
 195
 196 **Conclude**: The p-value is (just a little bit!) greater than 0.05, so assuming a typical hypothesis testing framework, we would be unable to reject $H_0$ that there is no difference in the download rates. In other words, we do not find compelling evidence that the position of the link led to any difference in download rates. That said, given that the p-value is quite close to the conventional threshold, you might also note that it's possible that there's a small effect that our study design was insufficiently sensitive to detect.
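For comparison, here is a sketch of the same test of independence run with base R's `chisq.test()` on the two-way table of observed counts from part (a). The object names `observed` and `indep` are mine:

```r
# Observed counts from part (a): rows are positions, columns are outcomes
observed <- matrix(
  c(97, 102, 85, 128, 130, 159),
  ncol = 2,
  dimnames = list(
    position = c("pos1", "pos2", "pos3"),
    outcome = c("download", "no_download")
  )
)

# Chi-squared test of independence for the 3 x 2 table
indep <- chisq.test(observed)

indep$statistic  # chi-squared test statistic
indep$parameter  # degrees of freedom
indep$p.value    # upper-tail p-value
```

The statistic, degrees of freedom, and p-value should line up with the hand calculation, up to rounding.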