From: aaronshaw
Date: Sat, 19 Sep 2020 18:24:29 +0000 (-0500)
Subject: textbook exercises for OS ch1 and ch2
X-Git-Url: https://code.communitydata.science/stats_class_2020.git/commitdiff_plain/d7a0a0bfc4cebcab15d7d93c8d2b32e34b949e4b?hp=d6d41996cb1f73b32130cfd82648b29cc2438872

textbook exercises for OS ch1 and ch2
---

diff --git a/os_exercises/ch1_exercises_solutions.Rmd b/os_exercises/ch1_exercises_solutions.Rmd
new file mode 100644
index 0000000..af765fb
--- /dev/null
+++ b/os_exercises/ch1_exercises_solutions.Rmd
@@ -0,0 +1,61 @@
+---
+title: 'Chapter 1 Textbook exercises'
+author: "Aaron Shaw"
+date: "September 21, 2020"
+output:
+  pdf_document: default
+  html_document: default
+subtitle: "Solutions to even-numbered questions \nStatistics and statistical programming \nNorthwestern University \nMTS
+  525"
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+All questions taken from the *OpenIntro Statistics* textbook, $4^{th}$ edition, Chapter 1.
+
+
+### 1.6
+
+(a) Many possible answers here. A basic one rearranges the first sentence of the question in the textbook: "Is there a difference in unethical behaviors by people from different (perceived) social classes?"
+(b) 129 UC Berkeley undergraduate students.
+(c) From the description in the question text, it seems like there are two primary measures:
+* Unethical behavior (candies taken): a discrete numerical measure.
+* Perceived social class (experimental treatment): a categorical measure.
+
+### 1.10
+
+(a) Each row represents the data collected about a single participant in the survey.
+(b) There were 1,691 participants.
+(c) See the table below:
+
+variable | type | sub-type (if applicable)
+--- | --- | ---
+sex|categorical|
+age|numerical|discrete (rounded to year)
+maritalStatus|categorical|
+grossIncome|categorical|ordinal
+smoke|categorical|
+amtWeekends|numerical|discrete
+amtWeekdays|numerical|discrete
+
+### 1.16
+
+(a) The population of interest is all people. The sample is the 129 UC Berkeley undergraduates who participated in the study.
+(b) Given that this is an observational study conducted on a convenience sample of UC Berkeley undergraduate students, any claims to either causal identification or generalizability seem...implausible.
+
+
+### 1.40
+
+(a) The explanatory (independent, predictor, $x$) variable is percent of a county's population with a bachelor's degree and the response (dependent, outcome, $y$) variable is each county's per capita income (measured in thousands of US$).
+(b) There is a positive, linear relationship between the two variables. There are few counties where more than 50% of residents hold bachelor's degrees and few counties with a per capita income greater than $40k.
+(c) No. Based on the description, a causal interpretation is not justified. The data suggest a *positive association* between education and income.
+
+### 1.42
+
+(a) This is an observational study.
+(b) The explanatory (independent, predictor, $x$) variables are the child's screen time, sex, and age, and the mother's education, ethnicity, psychological distress, and employment. If you just said child's screen time, that would probably be okay, since that seems to be the key explanatory variable in the study; the other variables are included to support a more accurate estimate of the relationship between screen time and psychological well-being.
+(c) The response (dependent, outcome, $y$) variable is child's psychological well-being.
+(d) The best answer to this depends on the target population of the study. The surveys come from three nationally representative samples from the UK, Ireland, and the United States. If the target population of the study is the populations of those three countries, then sure, the study results should generalize. If the target population is "all children on earth" or something like that, well, then there's no reason to believe it generalizes, since these three countries are in no way representative of the world.
+(e) The study is observational and lacks any clear strategy for identifying a causal relationship between screen time and psychological well-being. As a result, it does not support any claims to have identified causal effects.
\ No newline at end of file
diff --git a/os_exercises/ch1_exercises_solutions.html b/os_exercises/ch1_exercises_solutions.html
new file mode 100644
index 0000000..90a6b50
--- /dev/null
+++ b/os_exercises/ch1_exercises_solutions.html
@@ -0,0 +1,517 @@
+Chapter 1 Textbook exercises

All questions taken from the OpenIntro Statistics textbook, \(4^{th}\) edition, Chapter 1.


1.6

(a) Many possible answers here. A basic one rearranges the first sentence of the question in the textbook: "Is there a difference in unethical behaviors by people from different (perceived) social classes?"
(b) 129 UC Berkeley undergraduate students.
(c) From the description in the question text, it seems like there are two primary measures:
  • Unethical behavior (candies taken): a discrete numerical measure.
  • Perceived social class (experimental treatment): a categorical measure.

1.10

(a) Each row represents the data collected about a single participant in the survey.
(b) There were 1,691 participants.
(c) See the table below:
variable | type | sub-type (if applicable)
--- | --- | ---
sex | categorical |
age | numerical | discrete (rounded to year)
maritalStatus | categorical |
grossIncome | categorical | ordinal
smoke | categorical |
amtWeekends | numerical | discrete
amtWeekdays | numerical | discrete
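
As a rough illustration of how the variable types in this table might be encoded in R, the sketch below builds a tiny data frame. The rows, the income labels (`low`, `middle`, `high`), and the object names `survey` and `var_types` are invented for illustration; they are not the textbook data.

```r
# Hypothetical rows: every value here is made up for illustration only.
survey <- data.frame(
  sex = factor(c("Female", "Male", "Female")),        # categorical (nominal)
  age = c(42L, 53L, 40L),                             # numerical, discrete
  grossIncome = factor(
    c("low", "high", "middle"),
    levels = c("low", "middle", "high"),
    ordered = TRUE                                    # categorical, ordinal
  ),
  amtWeekends = c(12L, 0L, 6L)                        # numerical, discrete
)

# Report the first class of each column ("factor", "ordered", or "integer").
var_types <- sapply(survey, function(col) class(col)[1])
print(var_types)
```

Storing an ordinal variable as an ordered factor keeps the level order available to later summaries and plots, which a plain character column would lose.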

1.16

(a) The population of interest is all people. The sample is the 129 UC Berkeley undergraduates who participated in the study.
(b) Given that this is an observational study conducted on a convenience sample of UC Berkeley undergraduate students, any claims to either causal identification or generalizability seem…implausible.

1.40

(a) The explanatory (independent, predictor, \(x\)) variable is percent of a county’s population with a bachelor’s degree and the response (dependent, outcome, \(y\)) variable is each county’s per capita income (measured in thousands of US$).
(b) There is a positive, linear relationship between the two variables. There are few counties where more than 50% of residents hold bachelor’s degrees and few counties with a per capita income greater than $40k.
(c) No. Based on the description, a causal interpretation is not justified. The data suggest a positive association between education and income.

1.42

(a) This is an observational study.
(b) The explanatory (independent, predictor, \(x\)) variables are the child’s screen time, sex, and age, and the mother’s education, ethnicity, psychological distress, and employment. If you just said child’s screen time, that would probably be okay, since that seems to be the key explanatory variable in the study; the other variables are included to support a more accurate estimate of the relationship between screen time and psychological well-being.
(c) The response (dependent, outcome, \(y\)) variable is child’s psychological well-being.
(d) The best answer to this depends on the target population of the study. The surveys come from three nationally representative samples from the UK, Ireland, and the United States. If the target population of the study is the populations of those three countries, then sure, the study results should generalize. If the target population is “all children on earth” or something like that, well, then there’s no reason to believe it generalizes, since these three countries are in no way representative of the world.
(e) The study is observational and lacks any clear strategy for identifying a causal relationship between screen time and psychological well-being. As a result, it does not support any claims to have identified causal effects.
diff --git a/os_exercises/ch1_exercises_solutions.pdf b/os_exercises/ch1_exercises_solutions.pdf
new file mode 100644
index 0000000..3c862ee
Binary files /dev/null and b/os_exercises/ch1_exercises_solutions.pdf differ
diff --git a/os_exercises/ch2_exercises_solutions.Rmd b/os_exercises/ch2_exercises_solutions.Rmd
new file mode 100644
index 0000000..75bb891
--- /dev/null
+++ b/os_exercises/ch2_exercises_solutions.Rmd
@@ -0,0 +1,46 @@
+---
+title: 'Chapter 2 Textbook exercises'
+author: "Aaron Shaw"
+date: "September 24, 2020"
+output:
+  pdf_document: default
+  html_document: default
+subtitle: "Solutions to even-numbered questions \nStatistics and statistical programming \nNorthwestern University \nMTS
+  525"
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+All questions taken from the *OpenIntro Statistics* textbook, $4^{th}$ edition, Chapter 2.
+
+
+### 2.12
+
+The median seems to be around 80. The mean would be slightly lower than the median because the distribution has a long left tail (is left skewed).
+
+### 2.16
+
+(a) The distribution is right skewed with potential outliers on the positive (high) end; therefore, the median and the IQR are preferable measures of center and spread because they are robust to outliers.
+
+(b) The distribution is somewhat symmetric and has few, if any, extreme observations; therefore, the mean and the standard deviation are preferable measures of center and spread.
+
+(c) The distribution would be right skewed. There would be some students who did not consume any alcohol, but this is the minimum since students cannot consume fewer than 0 drinks. There would be a few students who consume *many* more drinks than their peers, giving the distribution a long right tail. Due to the skew, the median and IQR would be preferable measures of center and spread.
+
+(d) The distribution would be right skewed. Most employees would make something on the order of the median salary, but we would anticipate upper management makes much more. The distribution would have a long right tail, and the median and the IQR would be preferable measures of center and spread.
+
+### 2.20
+
+(a) The distribution of the percentage of the population that is Hispanic is extremely right skewed, with the majority of counties having less than 10% Hispanic residents. However, there are a few counties that have more than 90% Hispanic population. It might be preferable, in certain analyses, to use the log-transformed values since this distribution would be much less skewed.
+
+(b) The map reveals that counties with higher proportions of Hispanic residents are clustered along the Southwest border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California, and in Southern Florida. In the map all counties with more than 40% Hispanic residents are indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go. The histogram reveals that there are counties with over 90% Hispanic residents. The histogram is also useful for estimating measures of center and spread.
+
+(c) Both visualizations are useful and a preference for one over the other most likely depends on the context in which you plan to use it. The textbook authors seem to prefer the map, so if you chose that one you can rejoice in having anticipated the authors' preferences?
+
+### 2.30
+
+(a) This distribution would most likely be symmetric, resulting in equal values of the mean ($\bar{x}$) and the median.
+(b) This distribution would most likely be left-skewed (have a long tail of values towards zero) since the mean would be pulled down towards zero more than the median, resulting in a fraction where $\frac{\bar{x}}{median}<1$.
+(c) This distribution would most likely be right-skewed (have a long tail of values far from zero) since the mean would be pulled up away from zero more than the median, resulting in a fraction where $\frac{\bar{x}}{median}>1$.
diff --git a/os_exercises/ch2_exercises_solutions.html b/os_exercises/ch2_exercises_solutions.html
new file mode 100644
index 0000000..970ecf4
--- /dev/null
+++ b/os_exercises/ch2_exercises_solutions.html
@@ -0,0 +1,454 @@
+Chapter 2 Textbook exercises

All questions taken from the OpenIntro Statistics textbook, \(4^{th}\) edition, Chapter 2.


2.12


The median seems to be around 80. The mean would be slightly lower than the median because the distribution has a long left tail (is left skewed).


2.16

(a) The distribution is right skewed with potential outliers on the positive (high) end; therefore, the median and the IQR are preferable measures of center and spread because they are robust to outliers.
(b) The distribution is somewhat symmetric and has few, if any, extreme observations; therefore, the mean and the standard deviation are preferable measures of center and spread.
(c) The distribution would be right skewed. There would be some students who did not consume any alcohol, but this is the minimum since students cannot consume fewer than 0 drinks. There would be a few students who consume many more drinks than their peers, giving the distribution a long right tail. Due to the skew, the median and IQR would be preferable measures of center and spread.
(d) The distribution would be right skewed. Most employees would make something on the order of the median salary, but we would anticipate upper management makes much more. The distribution would have a long right tail, and the median and the IQR would be preferable measures of center and spread.
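
The robustness argument in these answers can be checked with a small sketch. The vectors `x` and `x_out` and the helper `center_spread` are hypothetical, made up for this illustration; they are not the textbook's data.

```r
# Hypothetical, roughly symmetric sample, then the same sample with one
# extreme value appended to mimic a long right tail.
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)
x_out <- c(x, 60)

# Collect both pairs of center/spread measures for a numeric vector.
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

before <- center_spread(x)
after <- center_spread(x_out)

# Compare the summaries side by side.
print(round(rbind(before, after), 2))
```

In this toy sample the mean moves from 4 to 9.6 once the single extreme value is added, while the median stays at 4. The standard deviation grows sharply and the IQR barely changes, which is the sense in which the median and IQR are robust.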

2.20

(a) The distribution of the percentage of the population that is Hispanic is extremely right skewed, with the majority of counties having less than 10% Hispanic residents. However, there are a few counties that have more than 90% Hispanic population. It might be preferable, in certain analyses, to use the log-transformed values since this distribution would be much less skewed.
(b) The map reveals that counties with higher proportions of Hispanic residents are clustered along the Southwest border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California, and in Southern Florida. In the map all counties with more than 40% Hispanic residents are indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go. The histogram reveals that there are counties with over 90% Hispanic residents. The histogram is also useful for estimating measures of center and spread.
(c) Both visualizations are useful and a preference for one over the other most likely depends on the context in which you plan to use it. The textbook authors seem to prefer the map, so if you chose that one you can rejoice in having anticipated the authors’ preferences?
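
To see why a log transformation might reduce the skew mentioned in (a), here is a minimal sketch. The vector `pct_hispanic` holds invented percentages rather than the county data from the textbook, and `skewness` is a hand-written moment-based helper, not a function from base R.

```r
# Hypothetical county percentages: mostly small values plus a few large ones.
pct_hispanic <- c(0.5, 1, 1, 2, 2, 3, 4, 5, 8, 12, 25, 60, 95)

# Moment-based sample skewness; positive values indicate a right skew.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

raw_skew <- skewness(pct_hispanic)
log_skew <- skewness(log(pct_hispanic))  # natural log; needs values > 0

print(c(raw = raw_skew, log = log_skew))
```

The logged values still lean to the right in this toy example, but much less than the raw percentages do. Real data containing counties at exactly 0% would need some adjustment before taking logs; how to handle that is left open here.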

2.30

(a) This distribution would most likely be symmetric, resulting in equal values of the mean (\(\bar{x}\)) and the median.
(b) This distribution would most likely be left-skewed (have a long tail of values towards zero) since the mean would be pulled down towards zero more than the median, resulting in a fraction where \(\frac{\bar{x}}{median}<1\).
(c) This distribution would most likely be right-skewed (have a long tail of values far from zero) since the mean would be pulled up away from zero more than the median, resulting in a fraction where \(\frac{\bar{x}}{median}>1\).
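
The three cases above can be illustrated with a short sketch. The vectors `symmetric`, `left_skewed`, and `right_skewed` and the helper `mean_median_ratio` are hypothetical names and values chosen only to make the arithmetic easy to follow.

```r
# Ratio of the sample mean to the sample median.
mean_median_ratio <- function(v) mean(v) / median(v)

# Hypothetical samples, one per shape discussed above.
symmetric <- c(1, 2, 3, 4, 5)      # mean 3, median 3
left_skewed <- c(1, 7, 8, 9, 10)   # mean 7, median 8
right_skewed <- c(1, 2, 3, 4, 20)  # mean 6, median 3

ratios <- c(
  symmetric = mean_median_ratio(symmetric),
  left_skewed = mean_median_ratio(left_skewed),
  right_skewed = mean_median_ratio(right_skewed)
)
print(ratios)
```

For these toy samples the ratios are 1, 0.875, and 2, matching the pattern in (a), (b), and (c): a single low value drags the mean below the median, and a single high value pulls it above.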
diff --git a/os_exercises/ch2_exercises_solutions.pdf b/os_exercises/ch2_exercises_solutions.pdf
new file mode 100644
index 0000000..bcf17f5
Binary files /dev/null and b/os_exercises/ch2_exercises_solutions.pdf differ