---
title: "Chapter 5 Textbook exercises"
subtitle: "Solutions to even-numbered questions  \nStatistics and statistical programming  \nNorthwestern University  \nMTS 525"
author: "Aaron Shaw"
date: "October 15, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
    theme: readable
  pdf_document:
    toc: no
    toc_depth: '3'
    latex_engine: xelatex
header-includes:
  - \newcommand{\lt}{<}
  - \newcommand{\gt}{>}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

All exercises taken from the *OpenIntro Statistics* textbook, $4^{th}$ edition, Chapter 5.

# 5.4 Unexpected expenses

(a) Adults in the United States.

(b) The proportion of adults in the US who could not cover a $\$400$ expense without borrowing money or going into debt.

(c) $$\hat{p} = \frac{322}{765} = 0.421$$

(d) The standard error ($SE$).

(e) The formula for the standard error of a proportion can be used to do this:
$$\begin{array}{l}
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\\
\phantom{SE} = \sqrt{\frac{0.421(1-0.421)}{765}}\\
\phantom{SE} = 0.0179
\end{array}$$

(f) The standard error of a point estimate is analogous to the standard deviation of a random variable's distribution, so this question is best answered by counting the standard errors between the point estimate ($42\%$) and the news pundit's baseline expectation ($50\%$). The difference is $0.50 - 0.42 = 0.08$, which is more than four times the standard error ($0.0179$ from part e above), so the news pundit should be quite surprised.

(g) Note that this concerns the distinction between $\hat{p}$ and $p$. In this case, the two values are very close ($0.42$ vs. $0.40$). The standard error does not change much:
$$\begin{array}{l}
SE = \sqrt{\frac{p(1-p)}{n}}\\
\phantom{SE} = \sqrt{\frac{0.40(1-0.40)}{765}}\\
\phantom{SE} = 0.0177
\end{array}$$
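
The arithmetic in parts (c), (e), (f), and (g) can be checked directly in R. The chunk below is a minimal sketch using the counts from the exercise; the variable names are illustrative and not part of the textbook. Because it keeps $\hat{p}$ at full precision rather than rounding to $0.421$ first, the last printed digit may differ slightly from the hand calculations.

```{r}
# Counts reported in the exercise: 322 of 765 respondents
x <- 322
n <- 765

# (c) point estimate
p_hat <- x / n

# (e) standard error based on the sample proportion
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

# (f) distance between the pundit's 50% and the estimate, in SE units
z_pundit <- (0.50 - p_hat) / se_hat

# (g) standard error if the true proportion were 0.40
se_40 <- sqrt(0.40 * (1 - 0.40) / n)

c(p_hat = p_hat, se_hat = se_hat, z_pundit = z_pundit, se_40 = se_40)
```

The value of `z_pundit` is the "number of standard errors" referred to in part (f).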

# 5.8 Twitter users and news, Part I

The general formula for a confidence interval is $point~estimate~\pm~z^*\times~SE$, where $z^*$ is the z-score corresponding to the desired value of $\alpha$.

To estimate the interval from the data described in the question, identify the three values. The point estimate is 52%, $z^* = 2.58$ for a 99% confidence level (the number of standard deviations around the mean that contains 99% of a standard normal distribution), and $SE = 2.4\%$.
With this we can plug and chug:

$$52\% \pm 2.58 \times 2.4\%$$
And that yields:
$$99\%~CI = (45.8\%, 58.2\%)$$

This means that from these data we are 99% confident that between 45.8% and 58.2% of U.S. adult Twitter users get some news through the site.
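
The same interval can be computed in R. This is a minimal sketch that takes the point estimate and standard error from the exercise and obtains the critical value from `qnorm()` instead of the Z-score table; the variable names are illustrative.

```{r}
# Values given in the exercise
point_estimate <- 0.52
se <- 0.024

# Critical value for a 99% confidence level: 0.5% in each tail
z_star <- qnorm(1 - 0.01 / 2)

# Lower and upper bounds of the interval
ci_99 <- point_estimate + c(-1, 1) * z_star * se
ci_99
```

Using the unrounded critical value changes the bounds only beyond the first decimal place of the percentages.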

# 5.10 Twitter users and news, Part II

(a) False. See the answer to exercise 5.8 above. With $\alpha = 0.01$, we can consult the 99% confidence interval. It includes 50% but also goes lower. A null hypothesis of $p=0.50$ would not be rejected at this level.

(b) False. The standard error of the sample proportion does not contain any information about the proportion of the population included in the sample. It estimates the variability of the sample proportion.

(c) False. All else being equal, increasing the sample size will decrease the standard error. Consider the general formula for a standard error: $\frac{\sigma}{\sqrt{n}}$ or the formula for the standard error of a proportion: $\sqrt{\frac{p(1-p)}{n}}$. A smaller value of $n$ will result in a larger standard error.

(d) False. All else being equal, a lower confidence level produces a narrower interval and a higher confidence level produces a wider one. To confirm this, revisit the formula from the previous exercise and plug in the critical value for a 90% confidence level ($\alpha = 0.10$), which is $z^* = 1.645$ rather than $2.58$ (see the Z-score table in the back of *OpenIntro* and/or calculate this directly with the R command `qnorm(0.95)`).
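
Parts (c) and (d) can both be checked numerically. The sketch below assumes the standard error of 2.4% from exercise 5.8 for part (d), and uses hypothetical sample sizes with $p = 0.52$ for part (c); none of these variable names come from the textbook.

```{r}
se <- 0.024  # standard error given in exercise 5.8

# (d) two-sided critical values and interval widths by confidence level
conf_levels <- c(0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_levels) / 2)
widths <- 2 * z_star * se
data.frame(conf_level = conf_levels, z_star = z_star, width = widths)

# (c) standard error of a proportion for hypothetical, increasing n
n <- c(100, 400, 1600)
se_by_n <- sqrt(0.52 * (1 - 0.52) / n)
data.frame(n = n, se = se_by_n)
```

The interval width grows with the confidence level, and the standard error shrinks as $n$ grows (quadrupling $n$ halves it).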

# 5.17 Online communication

Key points here: (1) The hypotheses should be about the population proportion (p), not the sample proportion. (2) The null hypothesis should have an equal sign. (3) The alternative hypothesis should have a not-equals sign and reference the null value rather than the observed sample proportion.

The correct way to set up these hypotheses is:
$$H_0~:~p = 0.6$$
$$H_A~:~p \neq 0.6$$

# 5.30 True or false
(a) True. See 5.10 part d above.

(b) False. The alpha value (significance level) *is* the probability of Type 1 Error, so reducing the one reduces the other.

(c) False. Failing to reject the null ($H_0$) means only that we cannot conclude that the true value differs from the null value. This is **very** different from evidence that the null hypothesis is true.

(d) True. We'll revisit this in a moment below, but consider the relationship between a statistical test, the standard error, and the sample size as the sample size grows infinitely large. Given the formula for a standard error, the standard error of arbitrarily large samples approaches zero. The resulting point estimates are arbitrarily precise, so *any* nonzero difference between the point estimate and the null value will eventually lead to rejecting the null hypothesis at any significance level $\alpha$.
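
The claim in part (b) that $\alpha$ is the Type 1 Error rate can be illustrated with a small simulation. This is a sketch under assumed values (a null value of $0.5$, samples of size $1000$, and $10000$ replications, all hypothetical), using a two-sided Z-test for a proportion.

```{r}
set.seed(525)  # arbitrary seed so the sketch is reproducible

p_null <- 0.5   # hypothetical null value, also used as the true proportion
n <- 1000       # hypothetical sample size
reps <- 10000   # number of simulated samples

# Simulate sample proportions when the null hypothesis is true
p_hat <- rbinom(reps, size = n, prob = p_null) / n
z <- (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)
p_values <- 2 * pnorm(-abs(z))

# Share of simulated samples that (wrongly) reject the null at each alpha
reject_rates <- c(alpha_05 = mean(p_values < 0.05),
                  alpha_01 = mean(p_values < 0.01))
reject_rates
```

Both shares should land close to their nominal $\alpha$, up to simulation noise and the discreteness of the binomial counts, and the share at $\alpha = 0.01$ should be the smaller of the two.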

# 5.35 Practical vs. statistical significance
True. If the sample size gets ever larger, then the standard error will become ever smaller. Eventually, when the sample size is large enough and the standard error is tiny, we can find statistically significant yet very small differences between the null value and point estimate (assuming they are not exactly equal).

# 5.36 Same observation, different sample size
As the sample size increases, the standard error will decrease, the test statistic (a Z-score comparing the point estimate against the null value in all of the examples developed in this chapter) will increase in magnitude, and the resulting p-value will decrease.
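
A short numerical sketch makes the pattern concrete. The null value, observed proportion, and sample sizes below are hypothetical and chosen only for illustration; the observed proportion is held fixed while $n$ grows.

```{r}
# Hypothetical values chosen only for illustration
p_null <- 0.50
p_hat <- 0.52
n <- c(100, 1000, 10000)

se <- sqrt(p_null * (1 - p_null) / n)  # standard error under the null
z <- (p_hat - p_null) / se             # test statistic
p_value <- 2 * pnorm(-abs(z))          # two-sided p-value

data.frame(n = n, se = se, z = z, p_value = p_value)
```

The same two-point difference moves from far short of conventional significance to well past it as the sample grows, which is also the point of exercise 5.35.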