From: aaronshaw Date: Fri, 12 Apr 2019 19:12:45 +0000 (-0500) Subject: adding worked solutions for weeks 1 and 2 X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/6a19048e00f594ee1a194f1457462ad33e24a8a4?ds=inline adding worked solutions for weeks 1 and 2 --- diff --git a/problem_sets/week_01/ps1-worked_solution.Rmd b/problem_sets/week_01/ps1-worked_solution.Rmd new file mode 100644 index 0000000..d6305ca --- /dev/null +++ b/problem_sets/week_01/ps1-worked_solution.Rmd @@ -0,0 +1,95 @@ +--- +title: 'Week 1 Problem Set: Worked solutions' +subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525" +author: "Aaron Shaw" +date: "April 5, 2019" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +## Programming Challenges + +No worked solutions to programming challenges this week. Make sure you completed the assignment as described. Contact us if you have trouble or are uncertain. + +## Statistical Questions + +All questions taken from the *OpenIntro* textbook, Chapter 1. + +### SQ1 — 1.6 + +(a) 129 UC Berkeley students. +(b) From the description in the question text and the paper it seems like there are two primary measures: +* Unethical behavior (candies taken): discrete numerical measure. +* Perceived social class (experimental treatment): categorical measure. +(c) Many possible answers here. A basic one turns the first sentence around: What is the difference in unethical behavior by (perceived) social class? + +Note that the summary in *OpenIntro* sort of bungles/ignores the fact that the paper focuses on *perceived* social class rather than social class. That's a shame since the paper is very clear. I plan to contact the textbook authors about it if there are no corrections in the 4th edition. + +### SQ2 — 1.12 + +(a) The population of interest is humans. The sample is the 129 UC Berkeley undergraduates who participated in the study. 
+(b) Generalizability seems...limited in this study, although the paper describes an experiment in which *perceived* social class is manipulated and thus supports causal identification of effects of priming on unethical behavior. The study as described in the textbook does not support causal interpretations. + + +### SQ3 — 1.52 + +The median here looks to be about 80, while the mean looks to be about 75. You would expect this relationship (median > mean) because of the left skew of the data. + +### SQ4 — 1.56 + +(a) The distribution is right skewed with potential outliers on the positive end, therefore the median and the IQR are preferable measures of center and spread. + +(b) The distribution is somewhat symmetric and has few, if any, extreme observations, therefore the mean and the standard deviation are preferable measures of center and spread. + +(c) The distribution would be right skewed. There would be some students who did not consume any alcohol, but this is the minimum since students cannot consume fewer than 0 drinks. There would be a few students who consume *many* more drinks than their peers, giving the distribution a long right tail. Due to the skew, the median and IQR would be preferable measures of center and spread. + +(d) The distribution would be right skewed. Most employees would make something on the order of the median salary, but we would anticipate upper management makes much more. The distribution would have a long right tail, and the median and the IQR would be preferable measures of center and spread. + +### SQ5 — 1.64 + +(a) The distribution of the percentage of the population that is Hispanic is extremely right skewed, with the majority of counties having less than 10% Hispanic residents. However, there are a few counties that have more than 90% Hispanic population. It might be preferable, in certain analyses, to use the log-transformed values since their distribution is much less skewed. 
+ +(b) The map reveals that counties with higher proportions of Hispanic residents are clustered along the Southwest border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California, and in Southern Florida. In the map all counties with more than 40% Hispanic residents are indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go. The histogram reveals that there are counties with over 90% Hispanic residents. The histogram is also useful for estimating measures of center and spread. + +(c) Both visualizations are useful and a preference for one over the other most likely depends on the context in which you planned to use it. + +## Empirical Paper Questions + +All questions refer to the Kramer et al. study identified on the course website. + +### EQ1 + +(a) The cases are 689,003 individuals included in the study. + +(b) The key variables include: +* Emotional exposure (whether positive or negative words were potentially reduced): dichotomous categorical. +* Percentage of positive/negative emotion words in posts: continuous numeric. + +(c) Many possible responses, but here's one: Does emotional contagion occur via social networks online? + +### EQ2 + +(a) The treatment groups received a reduced portion of posts in their news feed with either positive or negative words in them. The control groups had random posts removed from their news feed. + +(b) The study uses random sampling (by an internal account ID number). + +(c) See part (a) of this question. The manipulation involved probabilistically reducing the proportion of news feed posts with either positive or negative language in them. + +### EQ3 + +(a) Humans. + +(b) 689,003 English language Facebook users during a week or so in January 2012. + +(c) Many possible answers. Personally, I find such generalization a bit iffy (despite the sample size) given the likely biases of the sample in comparison to the target population of the study. 
English language Facebook users have a number of attributes that differentiate them from even the broader English-speaking population in the U.S. and from Facebook users overall, and it's likely that some of these attributes systematically covary with the outcomes of interest in this study. Covariance (non-independence) between these factors and the outcomes of interest would render the estimated effects subject to bias. + +### EQ4 + +See the figure and the discussion in the paper. The four panels capture the comparisons of the treatment and control conditions by the two outcome measures. They show that the treatments had the anticipated effects with reduced positive posts reducing/increasing the proportion of positive/negative words in subsequent posts and reduced negative posts reducing/increasing the proportion of negative/positive words in subsequent posts. + +### EQ5 + +The study finds evidence of emotional contagion in social networks among English language Facebook users in 2012. The estimated effect sizes are tiny (Cohen's *d* ~ 0.02; read more about [Cohen's *d*](https://trendingsideways.com/the-cohens-d-formula)). However, the effect is potentially still quite meaningful and substantively important because the manipulation was itself quite small in scope (the scale of the study notwithstanding). The estimates may contain unobserved bias due to the construction of the sample, but overall this is a landmark study demonstrating the existence of an infrequently observed phenomenon (emotional contagion) in a context dominated by large numbers of short computer-mediated communications. 
\ No newline at end of file diff --git a/problem_sets/week_01/ps1-worked_solution.html b/problem_sets/week_01/ps1-worked_solution.html new file mode 100644 index 0000000..c9634e7 --- /dev/null +++ b/problem_sets/week_01/ps1-worked_solution.html @@ -0,0 +1,325 @@ + + + + + + + + + + + + + + +Week 1 Problem Set: Worked solutions + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

Programming Challenges

+

No worked solutions to programming challenges this week. Make sure you completed the assignment as described. Contact us if you have trouble or are uncertain.

+
+
+

Statistical Questions

+

All questions taken from the OpenIntro textbook, Chapter 1.

+
+

SQ1 — 1.6

+
    +
  1. 129 UC Berkeley students.
  2. +
  3. From the description in the question text and the paper it seems like there are two primary measures:
  4. +
+
    +
  • Unethical behavior (candies taken): discrete numerical measure.
  • +
  • Perceived social class (experimental treatment): categorical measure.
  • +
+
    +
  1. Many possible answers here. A basic one turns the first sentence around: What is the difference in unethical behavior by (perceived) social class?
  2. +
+

Note that the summary in OpenIntro sort of bungles/ignores the fact that the paper focuses on perceived social class rather than social class. That’s a shame since the paper is very clear. I plan to contact the textbook authors about it if there are no corrections in the 4th edition.

+
+
+

SQ2 — 1.12

+
    +
  1. The population of interest is humans. The sample is the 129 UC Berkeley undergraduates who participated in the study.
  2. +
  3. Generalizability seems…limited in this study, although the paper describes an experiment in which perceived social class is manipulated and thus supports causal identification of effects of priming on unethical behavior. The study as described in the textbook does not support causal interpretations.
  4. +
+
+
+

SQ3 — 1.52

+

The median here looks to be about 80, while the mean looks to be about 75. You would expect this relationship (median > mean) because of the left skew of the data.
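To see why a left skew puts the mean below the median, here is a minimal sketch with simulated values. The `scores` vector, its seed, and its parameters are made up for illustration and are not the textbook's data.

```r
# Simulated, hypothetical scores: most values are high, with a long tail of low ones.
set.seed(525)
scores <- 100 - rexp(1000, rate = 1 / 15)

mean_score <- mean(scores)
median_score <- median(scores)

# With a left skew, the low tail pulls the mean below the median.
c(mean = mean_score, median = median_score)
```

Running `hist(scores)` on the simulated values shows the long tail on the left that drags the mean down.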

+
+
+

SQ4 — 1.56

+
    +
  1. The distribution is right skewed with potential outliers on the positive end, therefore the median and the IQR are preferable measures of center and spread.

  2. +
  3. The distribution is somewhat symmetric and has few, if any, extreme observations, therefore the mean and the standard deviation are preferable measures of center and spread.

  4. +
  5. The distribution would be right skewed. There would be some students who did not consume any alcohol, but this is the minimum since students cannot consume fewer than 0 drinks. There would be a few students who consume many more drinks than their peers, giving the distribution a long right tail. Due to the skew, the median and IQR would be preferable measures of center and spread.

  6. +
  7. The distribution would be right skewed. Most employees would make something on the order of the median salary, but we would anticipate upper management makes much more. The distribution would have a long right tail, and the median and the IQR would be preferable measures of center and spread.

  8. +
+
+
+

SQ5 — 1.64

+
    +
  1. The distribution of the percentage of the population that is Hispanic is extremely right skewed, with the majority of counties having less than 10% Hispanic residents. However, there are a few counties that have more than 90% Hispanic population. It might be preferable, in certain analyses, to use the log-transformed values since their distribution is much less skewed.

  2. +
  3. The map reveals that counties with higher proportions of Hispanic residents are clustered along the Southwest border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California, and in Southern Florida. In the map all counties with more than 40% Hispanic residents are indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go. The histogram reveals that there are counties with over 90% Hispanic residents. The histogram is also useful for estimating measures of center and spread.

  4. +
  5. Both visualizations are useful and a preference for one over the other most likely depends on the context in which you planned to use it.

  6. +
+
+
+
+

Empirical Paper Questions

+

All questions refer to the Kramer et al. study identified on the course website.

+
+

EQ1

+
    +
  1. The cases are 689,003 individuals included in the study.

  2. +
  3. The key variables include:
  4. +
+
    +
  • Emotional exposure (whether positive or negative words were potentially reduced): dichotomous categorical.
  • +
  • Percentage of positive/negative emotion words in posts: continuous numeric.
  • +
+
    +
  1. Many possible responses, but here’s one: Does emotional contagion occur via social networks online?
  2. +
+
+
+

EQ2

+
    +
  1. The treatment groups received a reduced portion of posts in their news feed with either positive or negative words in them. The control groups had random posts removed from their news feed.

  2. +
  3. The study uses random sampling (by an internal account ID number).

  4. +
  5. See part (a) of this question. The manipulation involved probabilistically reducing the proportion of news feed posts with either positive or negative language in them.

  6. +
+
+
+

EQ3

+
    +
  1. Humans.

  2. +
  3. 689,003 English language Facebook users during a week or so in January 2012.

  4. +
  5. Many possible answers. Personally, I find such generalization a bit iffy (despite the sample size) given the likely biases of the sample in comparison to the target population of the study. English language Facebook users have a number of attributes that differentiate them from even the broader English-speaking population in the U.S. and from Facebook users overall, and it’s likely that some of these attributes systematically covary with the outcomes of interest in this study. Covariance (non-independence) between these factors and the outcomes of interest would render the estimated effects subject to bias.

  6. +
+
+
+

EQ4

+

See the figure and the discussion in the paper. The four panels capture the comparisons of the treatment and control conditions by the two outcome measures. They show that the treatments had the anticipated effects with reduced positive posts reducing/increasing the proportion of positive/negative words in subsequent posts and reduced negative posts reducing/increasing the proportion of negative/positive words in subsequent posts.

+
+
+

EQ5

+

The study finds evidence of emotional contagion in social networks among English language Facebook users in 2012. The estimated effect sizes are tiny (Cohen’s d ~ 0.02; read more about Cohen’s d). However, the effect is potentially still quite meaningful and substantively important because the manipulation was itself quite small in scope (the scale of the study notwithstanding). The estimates may contain unobserved bias due to the construction of the sample, but overall this is a landmark study demonstrating the existence of an infrequently observed phenomenon (emotional contagion) in a context dominated by large numbers of short computer-mediated communications.
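For readers who have not met Cohen's d before, the sketch below computes it for two independent groups as the difference in means divided by the pooled standard deviation. The helper name `cohens_d` and the two small vectors are hypothetical illustrations, not values from the paper.

```r
# Cohen's d for two independent groups: difference in means over the pooled SD.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Hypothetical outcome values, not data from the study.
control <- c(5.1, 4.8, 5.3, 5.0, 4.9, 5.2)
treated <- c(5.0, 4.9, 5.4, 5.1, 5.0, 5.3)

d_example <- cohens_d(treated, control)
d_example
```

A d of about 0.02 means the group means differ by roughly two hundredths of a standard deviation.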

+
+
+ + + + +
+ + + + + + + + diff --git a/problem_sets/week_02/example.Rmd b/problem_sets/week_02/example.Rmd new file mode 100644 index 0000000..00e26ce --- /dev/null +++ b/problem_sets/week_02/example.Rmd @@ -0,0 +1,16 @@ +--- +title: "example" +author: "aaron shaw" +date: "April 11, 2019" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r} +ls() +``` + diff --git a/problem_sets/week_02/example.html b/problem_sets/week_02/example.html new file mode 100644 index 0000000..8ca2a4a --- /dev/null +++ b/problem_sets/week_02/example.html @@ -0,0 +1,232 @@ + + + + + + + + + + + + + + + +example + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
ls()
+
## character(0)
+ + + + +
+ + + + + + + + diff --git a/problem_sets/week_02/ps2-worked_solution.Rmd b/problem_sets/week_02/ps2-worked_solution.Rmd new file mode 100644 index 0000000..f348edd --- /dev/null +++ b/problem_sets/week_02/ps2-worked_solution.Rmd @@ -0,0 +1,220 @@ +--- +title: "Week 2 Problem set: Worked solutions" +subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525" +author: "Aaron Shaw" +date: "April 11, 2019" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +## Programming Challenges + +### PC2 & PC3 + +If you downloaded the file, you'll need to point R's `load()` command at the correct file location on your machine. You can also use `url()` within `load()` to point at a web address. Here I've chosen to demonstrate with group 8's data. + +```{r} +load(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_02/group_02.RData")) + +ls() # shows me what's available +``` +A clarifying point that came up in class: if you're writing your own RMarkdown script, you will need to load the dataset explicitly (not just open the file with RStudio). When RMarkdown tries to "knit" the .Rmd file into HTML or whatever, it is as if you are running the entire contents of the .Rmd in an entirely new R session. This means that if you don't load something explicitly, R will not know it's there! + +### PC4 + +```{r} +min(d) +max(d) +mean(d) +median(d) +var(d) +sd(d) +IQR(d) + +``` + +### PC5 + +Graphing using R's built-in functions: +```{r} + +hist(d) +boxplot(d) + +``` + +Next week, we will start learning to use more of the `ggplot2` package (make sure it's installed first). I'll reproduce the histogram and boxplot in ggplot2 and leave these examples here for future reference. 
Graphing a single vector in ggplot2 is a bit weird, so this looks a little different than the examples from the R lecture: + +```{r} +library(ggplot2) +p <- ggplot() + aes(d) +p + geom_histogram() +p + geom_boxplot(aes(x="", y=d)) +``` + +During today's class a question came up about changing the axis limits, so here is an example of how that would look for the histogram in ggplot2. Note that the `coord_cartesian` layer only adjusts the "zoom" of the image on the axes. The axes themselves are generated in the call to `geom_histogram`, which looks at the range of the underlying data and (by default) distributes that into 30 "bins." See the ggplot2 documentation to learn how to adjust the number of bins directly. Also, feel free to fiddle around with the values passed to the `xlim` and `ylim` arguments below to see what happens: + +```{r} +p + geom_histogram() + coord_cartesian(xlim=c(-50,17000), ylim=c(0,100)) +``` + + + +### PC6 + +```{r} + +dd <- d # You might create a new object to work with just for clarity. +dd[d < 0] <- NA + +mean(dd, na.rm=T) # R can understand "T" in place of TRUE +sd(dd, na.rm=T) +``` + + +### PC7 + +```{r} +d.log <- log1p(dd) # Note: I use log1p() because some values are very close to zero + +mean(d.log, na.rm=T) +median(d.log, na.rm=T) +sd(d.log, na.rm=T) + +hist(d.log) +boxplot(d.log) + +``` + +## Statistical Questions + +Please get in touch with any questions or clarifications. I realize these solutions are a bit concise. I'm happy to clarify the underlying logic. + +### SQ 1 (2.12) + +(a) By the addition rule: $P(no~missed~days) = 1 - (0.25 + 0.15 + 0.28) = 0.32$ +(b) $P(1~miss~or~less) = P(no~misses) + P(1~miss)$ + $= 0.32 + 0.25 = 0.57$ +(c) $P(at~least~1~miss) = P(1~miss) + P(2~misses) + P(\geq 3~misses)$ + $= 1 - P(no~misses) = 1 - 0.32 = 0.68$ +(d) Assume (foolishly!) that the absences are independent. 
This allows us to use the multiplication rule: + $P(neither~miss~any) = P(no~misses) × P(no~misses) = 0.32^2 = 0.1024$ +(e) Again, assume that the absences are independent and use the multiplication rule: + $P(both~miss~some) = P(at~least~1~miss) × P(at~least~1~miss) = 0.68^2 = 0.4624$ +(f) Siblings are likely to get each other sick, so the independence assumption is not sound. + +### SQ 2 (2.20) + +This one is all about calculating the conditional probabilities from a contingency table: + +(a) $P(man~or~partner~has~blue~eyes) = (108 + 114 - 78) / 204 = 0.7059$ +(b) $P(partner~with~blue~eyes | man~with~blue~eyes) = 78 / 114 = 0.6842$ +(c) $P(partner~with~blue~eyes | man~with~brown~eyes) = 19 / 54 = 0.3519$ + $P(partner~with~blue~eyes | man~with~green~eyes) = 11 / 36 = 0.3056$ +(d) Partner eye color does not appear to be independent of the man's eye color within this sample: blue-eyed partners are more common among blue-eyed men than among men with other eye colors. + +### SQ 3 (2.26) + +More conditional probabilities, this time calculating a compound probability: + +$P(identical|females) = \frac{P(identical~and~females)}{P(females)}$ + $= \frac{0.15}{0.15 + 0.175}$ + $= 0.46$ + +(A decision tree may also be useful here) + +### SQ 4 (2.32) + +(a) Once you have one person's birthday, the probability that the second person has the same birthday is: $P(first~two~share~birthday) = \frac{1}{365} = 0.0027$ + +(b) This one is more challenging! 
I find it easier to think about by focusing on the probability that the first two don't share a birthday, followed by the probability that the next person doesn't share a birthday either: + + $P(at~least~two~share~birthday) = 1-P(none~of~three~share~birthday)$ + $=1-P(first~two~don't~share) \times P(third~doesn't~share~either)$ + $=1-(\frac{364}{365}) \times (\frac{363}{365}) =0.0082$ + +#### Bonus pop-quiz question from class: + +If I offered you a choice between a bet on the flip of a fair coin or whether any two people in our (25 person) class share a birthday, what should you choose? + +I like the following approach to this question: + +Consider that 25 people can be combined into pairs ${25 \choose 2}$ ways, which is equal to $\frac{25 \times 24}{2} = 300$ (see the Wikipedia article on [binomial coefficients](https://en.wikipedia.org/wiki/Binomial_coefficient) for more on that). + +Generalizing the logic from part b of the previous problem, I assume that each of these possible pairings is independent and thus each one has a probability $P = (\frac{364}{365})$ of producing a non-matched set of birthdays. + +Put everything together, employ the multiplication rule and you get the following: +$$P(any~match) = 1 - P(no~matches)$$ +$$P(no~matches) = (\frac{364}{365})^{300}$$ +Let's let R take it from here: +```{r} +1-((364/365)^300) + +``` + +That's about 0.56, better than the 0.5 of a fair coin, so I'd take the birthday bet if I were you! + + +### SQ 5 (2.38) + +(a) First, the average fee ($F$) per passenger is the expected value of the fees per passenger: + +$E(Fee~per~passenger) = \$0(0.54) + \$25(0.34) + \$60(0.12)$ +$= \$0 + \$8.5 + \$7.2 = \$15.70$ + +To calculate the standard deviation, we need to calculate the square root of the variance. 
To find the variance, we take the squared deviation at each fee level, multiply that by the probability (frequency) of the fee level, and calculate the sum: + +Squared deviation: $(F - E(F))^2$ +No baggage: $(0-15.70)^2 = 246.49$ +1 bag: $(25-15.70)^2 = 86.49$ +2 bags: $(60-15.70)^2 = 1962.49$ + +Squared deviation multiplied by probability: $(F - E(F))^2 \times P(F)$ +No baggage: $246.49 \times 0.54 = 133.10$ +1 bag: $86.49 \times 0.34 = 29.41$ +2 bags: $1962.49 \times 0.12 = 235.50$ + +Variance ($V$): $398.01$ (in squared dollars) +Standard deviation ($SD$): $\$19.95$ + + +(b) Assume independence of individual passengers (probably wrong, but maybe not catastrophic?): + +For 120 passengers: +$E(Revenue) = 120 \times 15.70 = \$1,884$ +$V = 120 \times 398.01 = 47,761.20$ +$SD = \sqrt{47761.20} = \$218.54$ + +### SQ 6 (2.44) + +(a) Right skewed, with a median somewhere around \$35,000-\$50,000. There's a long tail out to the right. +(b) Addition rule: $P(Income \lt \$50k) = 2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2\%$ +(c) Assume that income and gender are independent and the general multiplication rule holds: + $P(Income \lt \$50k~and~female) = P(Income \lt \$50k) \times P(female) = 0.622 \times 0.41 = 0.255$ +(d) It seems that the assumption was not so great since the actual proportion of women with incomes less than $\$50k$ was a lot higher than we would expect based on the product of the marginal probabilities in part (c). + +## Empirical Paper Questions + +### EQ 1 + +(a) "Top" U.S. political blogs. +(b) 155 blogs selected for the study in summer 2008. + +### EQ 2 + +$P(sole~authored | left~perspective) = \frac{17}{17+37+10} = 0.266$ + +### EQ 3 + +$P(right~perspective|large~scale~collab.) = \frac{4}{10+4} = 0.286$ + +### EQ 4 + +The conditional probabilities provide some evidence that left wing blogs were more collaborative than right wing blogs. 
That said, a substantial proportion of left blogs are sole-authored, so it's not clear (to me at least) that left blogs are all that likely to be collaborative in their own right. In addition, the number of large scale collaborative blogs overall is not very big (consider Jeremy's point that if you had 4 heads out of 14 coin flips you might not be that surprised!). Nevertheless, the conditional probabilities *do* suggest that right blogs are highly unlikely to be large collaborations. A formal hypothesis test such as the $\chi^2$ test in the paper is arguably more convincing, but the assumptions underlying the null hypothesis may be a little funny (we can discuss this more in a few weeks). + +### EQ 5 + +I'll be curious to hear what you think! \ No newline at end of file diff --git a/problem_sets/week_02/ps2-worked_solution.html b/problem_sets/week_02/ps2-worked_solution.html new file mode 100644 index 0000000..cbeb45a --- /dev/null +++ b/problem_sets/week_02/ps2-worked_solution.html @@ -0,0 +1,425 @@ + + + + + + + + + + + + + + + +Week 2 Problem set: Worked solutions + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

Programming Challenges

+
+

PC2 & PC3

+

If you downloaded the file, you’ll need to point R’s load() command at the correct file location on your machine. You can also use url() within load() to point at a web address. Here I’ve chosen to demonstrate with group 8’s data.

+
load(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_02/group_02.RData"))
+
+ls() # shows me what's available
+
## [1] "d"
+

A clarifying point that came up in class: if you’re writing your own RMarkdown script, you will need to load the dataset explicitly (not just open the file with RStudio). When RMarkdown tries to “knit” the .Rmd file into HTML or whatever, it is as if you are running the entire contents of the .Rmd in an entirely new R session. This means that if you don’t load something explicitly, R will not know it’s there!
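A minimal sketch of the local-file case, assuming a made-up stand-in vector and a temporary file in place of your downloaded copy (the values and path here are hypothetical). It mimics the "fresh session" that knitting starts from by removing the object before loading it back.

```r
# Hypothetical local copy: save an object to a temporary .RData file, then load it back.
d <- c(1.5, 2.5, 10)              # stand-in values, not the class dataset
path <- tempfile(fileext = ".RData")
save(d, file = path)

rm(d)                             # simulate the empty workspace a knit starts from
exists("d", inherits = FALSE)     # nothing called d has been loaded yet

loaded_names <- load(path)        # load() returns the names of the restored objects
loaded_names
```

In your own .Rmd you would replace `path` with the location of the file you downloaded.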

+
+
+

PC4

+
min(d)
+
## [1] -5.550215
+
max(d)
+
## [1] 15415.83
+
mean(d)
+
## [1] 738.7643
+
median(d)
+
## [1] 7.278341
+
var(d)
+
## [1] 6380870
+
sd(d)
+
## [1] 2526.038
+
IQR(d)
+
## [1] 87.33717
+
+
+

PC5

+

Graphing using R’s built-in functions:

+
hist(d)
+

+
boxplot(d)
+

+

Next week, we will start learning to use more of the ggplot2 package (make sure it’s installed first). I’ll reproduce the histogram and boxplot in ggplot2 and leave these examples here for future reference. Graphing a single vector in ggplot2 is a bit weird, so this looks a little different than the examples from the R lecture:

+
library(ggplot2)
+p <- ggplot() + aes(d)
+p + geom_histogram()
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+

+
p + geom_boxplot(aes(x="", y=d))
+

+

During today’s class a question came up about changing the axis limits, so here is an example of how that would look for the histogram in ggplot2. Note that the coord_cartesian layer only adjusts the “zoom” of the image on the axes. The axes themselves are generated in the call to geom_histogram, which looks at the range of the underlying data and (by default) distributes that into 30 “bins.” See the ggplot2 documentation to learn how to adjust the number of bins directly. Also, feel free to fiddle around with the values passed to the xlim and ylim arguments below to see what happens:

+
p + geom_histogram() + coord_cartesian(xlim=c(-50,17000), ylim=c(0,100))
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
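As a sketch of adjusting the bins directly, `geom_histogram()` accepts either a `bins` count or a `binwidth` in the units of the data. The simulated `d_sim` vector and the specific values 50 and 250 are illustrative assumptions standing in for the class data.

```r
library(ggplot2)

# Simulated, hypothetical skewed values standing in for the class vector `d`.
set.seed(2)
d_sim <- rexp(500, rate = 1 / 700)

p_sim <- ggplot() + aes(d_sim)

# Option 1: set the number of bins.
p_bins <- p_sim + geom_histogram(bins = 50)

# Option 2: set the width of each bin in the units of the data.
p_width <- p_sim + geom_histogram(binwidth = 250)

p_bins
p_width
```

Setting either argument also silences the `stat_bin()` message about the default of 30 bins.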
+

+
+
+

PC6

+
dd <- d # You might create a new object to work with just for clarity.
+dd[d < 0] <- NA
+
+mean(dd, na.rm=T) # R can understand "T" in place of TRUE
+
## [1] 777.7835
+
sd(dd, na.rm=T)
+
## [1] 2586.407
+
+
+

PC7

+
d.log <- log1p(dd) # Note: I use log1p() because some values are very close to zero
+
+mean(d.log, na.rm=T)
+
## [1] 3.163984
+
median(d.log, na.rm=T)
+
## [1] 2.258885
+
sd(d.log, na.rm=T)
+
## [1] 2.655523
+
hist(d.log)
+

+
boxplot(d.log)
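A short sketch of why `log1p()` is a convenient choice when values sit at or near zero. The vector `x` is hypothetical and only meant to show the behavior at the boundary.

```r
x <- c(0, 1e-10, 1, 100)   # hypothetical non-negative values

log(x)      # log(0) is -Inf, which would break the mean and sd
log1p(x)    # log(1 + x): zero stays zero, so every value is finite

# log1p() is also more accurate than log(1 + x) for tiny x
log(1 + 1e-20)    # 0, because 1 + 1e-20 rounds to 1
log1p(1e-20)      # 1e-20
```

The inverse transformation is `expm1()`, which is handy for putting summaries back on the original scale.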
+

+
+
+
+

Statistical Questions

+

Please get in touch with any questions or clarifications. I realize these solutions are a bit concise. I’m happy to clarify the underlying logic.

+
+

SQ 1 (2.12)

+
    +
  1. By the addition rule: \(P(no~missed~days) = 1 - (0.25 + 0.15 + 0.28) = 0.32\)
  2. +
  3. \(P(1~miss~or~less) = P(no~misses) + P(1~miss)\) \(= 0.32 + 0.25 = 0.57\)
  4. +
  5. \(P(at~least~1~miss) = P(1~miss) + P(2~misses) + P(\geq 3~misses)\) \(= 1 - P(no~misses) = 1 - 0.32 = 0.68\)
  6. +
  7. Assume (foolishly!) that the absences are independent. This allows us to use the multiplication rule:
    +\(P(neither~miss~any) = P(no~misses) × P(no~misses) = 0.32^2 = 0.1024\)
  8. +
  9. Again, assume that the absences are independent and use the multiplication rule:
    +\(P(both~miss~some) = P(at~least~1~miss) × P(at~least~1~miss) = 0.68^2 = 0.4624\)
  10. +
  11. Siblings are likely to get each other sick, so the independence assumption is not sound.
  12. +
+
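The arithmetic for parts (a) through (e) can be checked in R. This is a small sketch using the probabilities quoted above; the variable names are my own.

```r
# Probabilities of missing 1, 2, or 3+ days, as given in the question.
p_miss <- c(one = 0.25, two = 0.15, three_plus = 0.28)

p_none <- 1 - sum(p_miss)                   # (a)
p_one_or_less <- p_none + p_miss[["one"]]   # (b)
p_at_least_one <- 1 - p_none                # (c)
p_neither <- p_none^2                       # (d), assuming independence
p_both <- p_at_least_one^2                  # (e), assuming independence

round(c(p_none, p_one_or_less, p_at_least_one, p_neither, p_both), 4)
```

Note that parts (d) and (e) square the single-child probability, which is what the multiplication rule gives for two independent children.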
+
+

SQ 2 (2.20)

+

This one is all about calculating the conditional probabilities from a contingency table:

+
    +
  1. \(P(man~or~partner~has~blue~eyes) = (108 + 114 - 78) / 204 = 0.7059\)
  2. +
  3. \(P(partner~with~blue~eyes | man~with~blue~eyes) = 78 / 114 = 0.6842\)
  4. +
  5. \(P(partner~with~blue~eyes | man~with~brown~eyes) = 19 / 54 = 0.3519\) \(P(partner~with~blue~eyes | man~with~green~eyes) = 11 / 36 = 0.3056\)
  6. +
  7. Partner eye color does not appear to be independent of the man’s eye color within this sample: blue-eyed partners are more common among blue-eyed men than among men with other eye colors.
  8. +
+
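Here is a sketch of the same calculations in R, using only the counts quoted in the solution above (the variable names are illustrative).

```r
# Counts quoted in the solution above.
n_total <- 204          # couples in the sample
n_male_blue <- 114      # men with blue eyes
n_partner_blue <- 108   # partners with blue eyes
n_both_blue <- 78       # couples where both have blue eyes

# (a) General addition rule: subtract the overlap so it is not counted twice.
p_either_blue <- (n_partner_blue + n_male_blue - n_both_blue) / n_total

# (b) and (c) Conditional probabilities: the denominator is the conditioning group.
p_blue_given_blue <- n_both_blue / n_male_blue
p_blue_given_brown <- 19 / 54
p_blue_given_green <- 11 / 36

round(c(p_either_blue, p_blue_given_blue, p_blue_given_brown, p_blue_given_green), 4)
```

If eye colors were independent, the three conditional probabilities would all be close to the overall share of blue-eyed partners, 108 / 204, or about 0.53.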
+
+

SQ 3 (2.26)

+

More conditional probabilities, this time calculating a compound probability:

+

\(P(identical|females) = \frac{P(identical~and~females)}{P(females)}\) \(= \frac{0.15}{0.15 + 0.175}\) \(= 0.46\)

+

(A decision tree may also be useful here)
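The two joint probabilities in the denominator can be built up from the branches of the tree. The sketch below assumes the inputs as I read them from the textbook question (30% of twins identical, identical twins both female half the time, fraternal twins both female a quarter of the time); those three inputs are not stated in the solution above, so treat them as assumptions.

```r
# Assumed inputs (my reading of the textbook question, not stated in the solution above):
p_identical <- 0.30              # share of twins who are identical
p_ff_given_identical <- 0.50     # identical twins are both female half the time
p_ff_given_fraternal <- 0.25     # fraternal twins are both female a quarter of the time

p_identical_and_ff <- p_identical * p_ff_given_identical        # 0.15
p_fraternal_and_ff <- (1 - p_identical) * p_ff_given_fraternal  # 0.175

p_identical_given_ff <- p_identical_and_ff / (p_identical_and_ff + p_fraternal_and_ff)
round(p_identical_given_ff, 2)
```

The two intermediate products match the 0.15 and 0.175 used in the formula above.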

+
+
+

SQ 4 (2.32)

+
    +
  1. Once you have one person’s birthday, the probability that the second person has the same birthday is: \(P(first~two~share~birthday) = \frac{1}{365} = 0.0027\)

  2. +
  3. This one is more challenging! I find it easier to think about by focusing on the probability that the first two don’t share a birthday, followed by the probability that the next person doesn’t share a birthday either:

  4. +
+

\(P(at~least~two~share~birthday) = 1-P(none~of~three~share~birthday)\) \(=1-P(first~two~don't~share) \times P(third~doesn't~share~either)\)
+\(=1-(\frac{364}{365}) \times (\frac{363}{365}) =0.0082\)

+
+

Bonus pop-quiz question from class:

+

If I offered you a choice between a bet on the flip of a fair coin or whether any two people in our (25 person) class share a birthday, what should you choose?

+

I like the following approach to this question:

+

Consider that 25 people can be combined into pairs \({25 \choose 2}\) ways, which is equal to \(\frac{25 \times 24}{2} = 300\) (see the Wikipedia article on binomial coefficients for more on that).

+

Generalizing the logic from part b of the previous problem, I assume that each of these possible pairings is independent and thus each one has a probability \(P = (\frac{364}{365})\) of producing a non-matched set of birthdays.

+

Put everything together, employ the multiplication rule and you get the following: \[P(any~match) = 1 - P(no~matches)\]
+\[P(no~matches) = (\frac{364}{365})^{300}\]
+Let’s let R take it from here:

+
1-((364/365)^300) 
+
## [1] 0.5609078
+

That’s about 0.56, better than the 0.5 of a fair coin, so I’d take the birthday bet if I were you!
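The pairwise-independence approach is an approximation. As a hedged comparison, the sketch below sets it next to an exact calculation and to `pbirthday()` from R's built-in stats package; the variable names are my own.

```r
n_people <- 25

# Approximation used above: treat all choose(25, 2) pairs as independent.
p_approx <- 1 - (364 / 365)^choose(n_people, 2)

# Exact calculation: each new person must avoid every birthday already taken.
p_exact <- 1 - prod((365 - 0:(n_people - 1)) / 365)

# R's stats package has a helper for the same quantity.
p_builtin <- pbirthday(n_people)

round(c(approx = p_approx, exact = p_exact, builtin = p_builtin), 4)
```

The approximate and exact answers differ by less than 0.01, and both are above one half.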

+
+
+
+

SQ 5 (2.38)

+
    +
  1. First, the average fee (\(F\)) per passenger is the expected value of the fees per passenger:
  2. +
+

\(E(Fee~per~passenger) = \$0(0.54) + \$25(0.34) + \$60(0.12)\) \(= \$0 + \$8.5 + \$7.2 = \$15.70\)

+

To calculate the standard deviation, we need to calculate the square root of the variance. To find the variance, we take the squared deviation at each fee level, multiply that by the probability (frequency) of the fee level, and calculate the sum:

+

Squared deviation: \((F - E(F))^2\)
+No baggage: \((0-15.70)^2 = 246.49\)
+1 bag: \((25-15.70)^2 = 86.49\)
+2 bags: \((60-15.70)^2 = 1962.49\)

+

Squared deviation multiplied by probability: \((F - E(F))^2 \times P(F)\)
+No baggage: \(246.49 \times 0.54 = 133.10\)
+1 bag: \(86.49 \times 0.34 = 29.41\)
+2 bags: \(1962.49 \times 0.12 = 235.50\)

+

Variance (\(V\)): \(398.01\) (in squared dollars)
+Standard deviation (\(SD\)): \(\$19.95\)

+
    +
  1. Assume independence of individual passengers (probably wrong, but maybe not catastrophic?):
  2. +
+

For 120 passengers:
+\(E(Revenue) = 120 \times 15.70 = \$1,884\)
+\(V = 120 \times 398.01 = 47,761.20\)
+\(SD = \sqrt{47761.20} = \$218.54\)
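
The whole calculation for parts (a) and (b) fits in a few lines of R. This is a sketch using the fees and probabilities from the solution above; the variable names are illustrative.

```r
fee <- c(none = 0, one_bag = 25, two_bags = 60)
prob <- c(none = 0.54, one_bag = 0.34, two_bags = 0.12)

# (a) Per-passenger expected value, variance, and standard deviation.
ev <- sum(fee * prob)
v <- sum((fee - ev)^2 * prob)
sd_fee <- sqrt(v)

# (b) For 120 independent passengers, expected values and variances both add.
n <- 120
ev_total <- n * ev
v_total <- n * v
sd_total <- sqrt(v_total)

round(c(ev, v, sd_fee, ev_total, v_total, sd_total), 2)
```

Because variances add but standard deviations do not, the total standard deviation is the per-passenger value times the square root of 120, not times 120.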

+
+
+

SQ 6 (2.44)

+
    +
  1. Right skewed, with a median somewhere around $35,000-$50,000. There’s a long tail out to the right.
  2. +
  3. Addition rule: \(P(Income \lt \$50k) = 2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2\%\)
    +
  4. +
  5. Assume that income and gender are independent and the general multiplication rule holds:
    +\(P(Income \lt \$50k~and~female) = P(Income \lt \$50k) \times P(female) = 0.622 \times 0.41 = 0.255\)
    +
  6. +
  7. It seems that the assumption was not so great since the actual proportion of women with incomes less than \(\$50k\) was a lot higher than we would expect based on the product of the marginal probabilities in part (c).
  8. +
+
+
+
+

Empirical Paper Questions

+
+

EQ 1

+
    +
  1. “Top” U.S. political blogs.
  2. +
  3. 155 blogs selected for the study in summer 2008.
  4. +
+
+
+

EQ 2

+

\(P(sole~authored | left~perspective) = \frac{17}{17+37+10} = 0.266\)

+
+
+

EQ 3

+

\(P(right~perspective|large~scale~collab.) = \frac{4}{10+4} = 0.286\)

+
+
+

EQ 4

+

The conditional probabilities provide some evidence that left wing blogs were more collaborative than right wing blogs. That said, a substantial proportion of left blogs are sole-authored, so it’s not clear (to me at least) that left blogs are all that likely to be collaborative in their own right. In addition, the number of large scale collaborative blogs overall is not very big (consider Jeremy’s point that if you had 4 heads out of 14 coin flips you might not be that surprised!). Nevertheless, the conditional probabilities do suggest that right blogs are highly unlikely to be large collaborations. A formal hypothesis test such as the \(\chi^2\) test in the paper is arguably more convincing, but the assumptions underlying the null hypothesis may be a little funny (we can discuss this more in a few weeks).
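To put a rough number on the coin-flip comparison, here is a sketch using R's binomial functions. It treats the 14 large collaborative blogs as 14 flips of a fair coin, which is an illustrative assumption and not the test used in the paper.

```r
# Probability of exactly 4 heads in 14 flips of a fair coin.
p_exactly_4 <- dbinom(4, size = 14, prob = 0.5)

# Probability of 4 or fewer heads in 14 flips.
p_4_or_fewer <- pbinom(4, size = 14, prob = 0.5)

round(c(p_exactly_4, p_4_or_fewer), 4)
```

A result of 4 or fewer heads turns up in roughly 9% of such experiments, so it is uncommon but hardly shocking.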

+
+
+

EQ 5

+

I’ll be curious to hear what you think!

+
+
+ + + + +
+ + + + + + + +