From: aaronshaw
Date: Wed, 11 Nov 2020 19:48:26 +0000 (-0600)
Subject: initial commit of these materials
X-Git-Url: https://code.communitydata.science/stats_class_2020.git/commitdiff_plain/031b7bd587ecb01c5628cc1a282ca121b06c381c?ds=inline

initial commit of these materials
---

diff --git a/os_exercises/ch8_exercises_solutions.html b/os_exercises/ch8_exercises_solutions.html
new file mode 100644
index 0000000..ce2da90
--- /dev/null
+++ b/os_exercises/ch8_exercises_solutions.html
@@ -0,0 +1,1695 @@

Chapter 8 Textbook exercises

All exercises taken from the OpenIntro Statistics textbook, \(4^{th}\) edition, Chapter 8.


8.6 British married straight couples

a) Husband and wife ages are positively, linearly correlated, with a few possible outliers.
b) Husband and wife heights also appear to be positively and linearly correlated, but very weakly.
c) The age plot shows a much more evident linear trend. The data points also align more tightly.
d) No. Correlation is not affected by linear rescaling of the variables, such as a change of units. Revisit the formula for calculating correlation for more depth on this.
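
The claim in part (d) can be checked numerically. The sketch below is only an illustration: the heights are simulated (they are not the textbook data), and the centimeter-to-inch conversion is an assumed example of rescaling.

```r
# Hypothetical data: simulated heights in centimeters (NOT the textbook data).
set.seed(1)
husband_cm <- rnorm(100, mean = 175, sd = 7)
wife_cm <- 0.3 * husband_cm + rnorm(100, mean = 110, sd = 6)

# Rescale both variables from centimeters to inches.
husband_in <- husband_cm / 2.54
wife_in <- wife_cm / 2.54

r_cm <- cor(husband_cm, wife_cm)
r_in <- cor(husband_in, wife_in)

# The two correlations agree up to floating-point error.
all.equal(r_cm, r_in)
```

Dividing by a positive constant changes each variable's standard deviation and the covariance by matching factors, which cancel in the correlation formula.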

8.36 The heads of babies

a) Substituting a gestational age of 28 into the regression equation gives the predicted head circumference:
\(\hat{y} = 3.91 + 0.78 \times 28\)

y_hat = 3.91 + .78 * 28

y_hat

## [1] 25.75

b) Let’s write out the hypotheses first: \[H_0:~\beta_{gestational\_age} = 0\] \[H_A:~\beta_{gestational\_age} \neq 0\]

The test statistic for this can be calculated as a t-score as follows:

\[T = \frac{0.78 - 0}{0.35} = 2.23\]

We can look up a p-value for this in a t-table for \(df = 23\) or we can compute it in R. The pt(q, df) function gives the proportion of the t-distribution with df degrees of freedom that lies below q.

# To get a two-tailed p-value, take the area above q and multiply it by 2
(1 - pt(q = 2.23, df = 23)) * 2

## [1] 0.03579437

# Alternatively, request the upper tail directly and multiply that by 2
# (multiplying by 2 works because t-distributions are symmetric)
pt(q = 2.23, df = 23, lower.tail = FALSE) * 2

## [1] 0.03579437

The p-value is less than 0.05, so with a traditional \(\alpha = 0.05\) we would reject \(H_0\) and conclude that there is substantial evidence of an association between gestational age and head circumference for low birth-weight babies (the true slope parameter is unlikely to be zero).
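
The same calculation can be wrapped up so that the t-score is not rounded before the p-value is computed. This is a minimal sketch: `slope_t_test` is a hypothetical helper name introduced here for illustration, not something from the original solutions or from a package.

```r
# Hypothetical helper (an illustration, not part of the original solutions):
# two-sided t-test for a regression slope from its summary statistics.
slope_t_test <- function(estimate, se, df, null_value = 0) {
  t_stat <- (estimate - null_value) / se
  # Use |t| and the upper tail so the result is valid for negative slopes too.
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  list(t = t_stat, p_value = p_value)
}

res <- slope_t_test(estimate = 0.78, se = 0.35, df = 23)
res$t        # unrounded t-score, about 2.23
res$p_value  # two-sided p-value, below 0.05
```

Because the t-score is left unrounded here, the p-value differs very slightly from the one computed above with q = 2.23, but the conclusion is the same.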


8.40 Cats

a) Here are the hypotheses:
\[H_0:~ \beta_{body\_weight} = 0\] \[H_A:~ \beta_{body\_weight} \neq 0\]

b) The estimate is large (and positive) with a small standard error, resulting in a large test statistic and a correspondingly small p-value. That p-value is almost certainly smaller than any conventional significance level, so the data provide compelling evidence for the alternative hypothesis that the true slope parameter for body weight is not zero. Body weight is positively associated with heart weight in cats.

c) The confidence interval for the parameter can be calculated like this:

\[b_{body\_weight} \pm t^* \times SE\] Plugging in values from the model estimates and 1.98 for the value of \(t^*\): \[4.034 \pm 1.98 \times 0.250\] \[(3.539, 4.529)\]
In words, we can be 95% confident that each additional kilogram of a cat’s body weight is associated with an increase of 3.539 to 4.529 grams in the weight of its heart, on average.

d) Yes. The 95% confidence interval calculated in part (c) lies entirely above zero, which is consistent with the alternative hypothesis from parts (a) and (b).
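
The interval arithmetic from part (c) can be reproduced in R. This sketch uses only the three numbers quoted in the solution; the variable names are chosen here for illustration.

```r
# Values taken from the solution text for part (c).
b_body_weight <- 4.034  # slope estimate
se <- 0.250             # standard error of the slope
t_star <- 1.98          # critical value used in the text

margin <- t_star * se
ci <- c(lower = b_body_weight - margin, upper = b_body_weight + margin)
round(ci, 3)
```

The value of \(t^*\) is typed in rather than computed because the residual degrees of freedom are not stated in this section. Given that number as df, qt(0.975, df) would return the critical value for a 95% interval.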

8.44 Rating professors

a) Here’s how to calculate the slope:
\[\bar{y} = b_0 + b_1 \bar{x}\] \[3.9983 = 4.010 + b_1 \times (-0.0883)\] I’ll let R handle the arithmetic:

b_1 = (3.9983 - 4.010) / -0.0883
b_1

## [1] 0.1325028

b) The null hypothesis is \(H_0:~\beta_1 = 0\). With a t-score of 4.13 and a p-value of ~0.00, there is strong evidence of a positive relationship between beauty and teaching scores (we would reject the null hypothesis).

c)
  1. Nearly normal residuals: The distribution of residuals looks a little bit (left) skewed but fairly normal.
  2. Homoscedasticity (constant variance) of residuals: The residuals are scattered around zero with approximately equal variance across the range of beauty.
  3. Independent observations/residuals: The question does not really shed much light on this. Looking at the original paper, it is actually a bit unclear how the instructors were chosen and whether or not they are a random sample. The residuals do not seem to have any relationship with the order of data collection, so that’s good at least.
  4. Linear relationship between independent/dependent variables: From the scatterplot, the relationship appears to be linear (or at least not obviously non-linear).
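
The plots used to judge these four conditions can be produced along the following lines. This is a rough sketch on simulated data: the variables `beauty` and `eval_score`, the sample size, and the coefficients are all made up for illustration and are not the study data.

```r
# Hypothetical data: simulated beauty and evaluation scores, NOT the study data.
set.seed(42)
n <- 200
beauty <- rnorm(n, mean = 0, sd = 0.8)
eval_score <- 4 + 0.13 * beauty + rnorm(n, sd = 0.5)

fit <- lm(eval_score ~ beauty)
res <- residuals(fit)

# 1. Nearly normal residuals: histogram and normal Q-Q plot.
hist(res, main = "Residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# 2. Constant variance and 4. linearity: residuals against the predictor.
plot(beauty, res, xlab = "beauty", ylab = "Residual")
abline(h = 0, lty = 2)

# 3. Independence: residuals in row order, which stands in for the
# order of data collection in this simulated example.
plot(seq_along(res), res, xlab = "Order", ylab = "Residual")
abline(h = 0, lty = 2)
```

A fan shape or a curve in the second plot would point to non-constant variance or non-linearity, and a trend in the third plot would suggest dependence between observations.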
+ + + + + + + + + + + + + + + + diff --git a/os_exercises/ch8_exercises_solutions.pdf b/os_exercises/ch8_exercises_solutions.pdf new file mode 100644 index 0000000..d23dc76 Binary files /dev/null and b/os_exercises/ch8_exercises_solutions.pdf differ diff --git a/os_exercises/ch8_exercises_solutions.rmd b/os_exercises/ch8_exercises_solutions.rmd new file mode 100644 index 0000000..8c9dad6 --- /dev/null +++ b/os_exercises/ch8_exercises_solutions.rmd @@ -0,0 +1,106 @@ +--- +title: "Chapter 8 Textbook exercises" +subtitle: "Solutions to even-numbered questions \nStatistics and statistical programming \nNorthwestern University \nMTS + 525" +author: "Aaron Shaw" +date: "November 11, 2020" +output: + pdf_document: + toc: no + toc_depth: '3' + latex_engine: xelatex + html_document: + toc: yes + toc_depth: 3 + toc_float: + collapsed: false + smooth_scroll: true + theme: readable +header-includes: + - \newcommand{\lt}{<} + - \newcommand{\gt}{>} + - \renewcommand{\leq}{≤} + - \usepackage{lmodern} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) + +``` + + +All exercises taken from the *OpenIntro Statistics* textbook, $4^{th}$ edition, Chapter 8. + +# 8.6 British married straight couples + +a) Husband and wife ages are positively, linearly correlated, with a few possible outliers. +b) Husband and wife heights also appear to be positively and linearly correlated, but very weakly. +c) The age plot shows a much more evident linear trend. The data points also align more tightly. +d) No. Correlation is not affected by rescaling. Revisit the formulas for calculating correlation for more depth on this.. 
+ + +# 8.36 The heads of babies + +a) Substituting into the regression equation: +$\bar{y} = 3.91 + .78 * 28$ +```{r} +y_bar = 3.91 + .78 * 28 + +y_bar +``` + +b) Let's write out the hypotheses first: +$$H_0:~\beta_{gestational\_age} = 0$$ +$$H_A:~\beta_{gestational\_age} \neq 0$$ + +The test statistic for this can be calculated as a t-score as follows: + +$$T = \frac{0.78 - 0}{.35} = 2.23$$ + +We can look up a p-value for this in a t-table for $df = 23$ or we can figure it out in R. The `pt(q,df)` function gives the proportion of the T-distribution which is less than `q` with `df` degrees of freedom. + +```{r} +(1 - pt(q = 2.23, df = 23)) * 2 # To get 2-tailed p-value, we multiply this by 2 + +pt(q = 2.23, df = 23, lower.tail = FALSE) * 2 # Alternatively, we can get the area of the upper tail and multiply that by 2 (multiplying by 2 works because t-distributions are symmetrical) +``` + +The p-value is less than 0.05, so with a traditional $\alpha = 0.05$ we would reject $H_0$ and conclude that there is substantial evidence of an association between gestational age and head circumference for low birth-weight babies (the true slope parameter is unlikely to be zero). + +# 8.40 Cats + +(a) Here are the hypotheses: +$$H_0:~ \beta_{body\_weight} = 0$$ +$$H_A:~ \beta_{body\_weight} \neq 0$$ +(b) The estimate is large (and positive) with a small standard error, resulting in a large test-statistic and a correspondingly small p-value. This is likely lower than whatever critical value was selected for the significance level, so the data provide compelling evidence in support of the alternative hypothesis that the true slope parameter for body weight is not equal to zero, and is positively associated with heart weight in cats. 
+ +(c) The confidence interval for the parameter can be calculated like this: + +$$b_{body\_weight} \pm t^* \times SE$$ +Plugging in values from the model estimates and 1.98 for the value of $t^*$: +$$4.034 \pm 1.98 \times 0.250$$ +$$(3.539, 4.529)$$ +In words, this model indicates that for each additional kilogram in a cat's weight we can be 95% confident that the wight of it's heart will increase by 3.539 to 4.529 grams on average. + +(d) Yes. The alternative hypothesis from parts (a) and (b) corresponds to the 95% CI above zero calculated in part (d). + +# 8.44 Rating professors + +a) Here's how to calculate the slope: +$$\bar{y} = b_0 + b_1 \bar{x}$$ +$$3.9983 = 4.010 + b_1 \times -0.0883$$ +I'll let R handle the arithmetic: +```{r} +b_1 = (3.9983-4.010) / -0.0883 +b_1 +``` + +b) The null hypothesis is that $H_0:~\beta_1 = 0$. With a t-score of 4.13 and a p-value of \~0.00 there is strong evidence of a positive relationship between beauty and teaching scores (we would reject the null hypothesis). + +c) + 1. Nearly normal residuals: The distribution of residuals looks a little bit (left) skewed but fairly normal. + 2. Homoscedasticity (constant variance) of residuals: The residuals have approximately equal variance across the range of beauty around zero. + 3. Independent observations/residuals: The question does not really shed much light on this. Looking at the original paper, it is actually a bit unclear how the instructors were chosen and whether or not they are a random sample. The residuals do not seem to have any relationship with the order of data collection, so that's good at least. + 4. Linear relationship between independent/dependent variables: From the scatterplot, the relationship appears to be linear (or at least not obviously non-linear). + +