problem_sets/week_02/ps2-worked_solution.Rmd

---
title: "Week 2 Problem set: Worked solutions"
subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
author: "Aaron Shaw"
date: "April 11, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Programming Challenges

### PC2 & PC3

If you downloaded the file, you'll need to point R's `load()` command at the correct file location on your machine. You can also use `url()` within `load()` to point at a web address. Here I've chosen to demonstrate with the `group_02.RData` file.

```{r}
load(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_02/group_02.RData"))

ls() # shows me what's available
```

A clarifying point that came up in class: if you're writing your own RMarkdown script, you need to load the dataset explicitly in the script (not just open the file with RStudio). When RMarkdown "knits" the .Rmd file into HTML or another format, it runs the entire contents of the .Rmd in a fresh R session. This means that if you don't load something explicitly, R will not know it's there!
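
As a minimal sketch of what `load()` does, the chunk below saves a made-up object to a temporary file, removes it, and loads it back. The names `tmp` and `d_demo` are invented for illustration and have nothing to do with the class data:

```{r}
tmp <- tempfile(fileext = ".RData")   # throwaway file path for this demo
d_demo <- c(1.5, 2.5, 4)              # made-up object standing in for a dataset
save(d_demo, file = tmp)
rm(d_demo)                            # the object is now gone from the session

loaded_names <- load(tmp)             # load() restores the object and returns its name
loaded_names
exists("d_demo")
```

The same pattern applies to a downloaded file: replace `tmp` with the path to the file on your machine.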

### PC4

```{r}
min(d)
max(d)
mean(d)
median(d)
var(d)
sd(d)
IQR(d)
```

### PC5

Graphing using R's built-in functions:
```{r}
hist(d)
boxplot(d)
```

Next week, we will start learning to use more of the `ggplot2` package (make sure it's installed first). I'll reproduce the histogram and boxplot in ggplot2 and leave these examples here for future reference. Graphing a single vector in ggplot2 is a bit weird, so this looks a little different from the examples in the R lecture:

```{r}
library(ggplot2)
p <- ggplot() + aes(d)            # map the vector d to the x axis
p + geom_histogram()
p + geom_boxplot(aes(x="", y=d))  # override the mapping: d on the y axis, one empty category on x
```

During today's class a question came up about changing the axis limits, so here is an example of how that would look for the histogram in ggplot2. Note that the `coord_cartesian` layer only adjusts the "zoom" of the image on the axes. The axes themselves are generated in the call to `geom_histogram`, which looks at the range of the underlying data and (by default) distributes that into 30 "bins." See the ggplot2 documentation to learn how to adjust the number of bins directly. Also, feel free to fiddle around with the values passed to the `xlim` and `ylim` arguments below to see what happens:

```{r}
p + geom_histogram() + coord_cartesian(xlim=c(-50,17000), ylim=c(0,100))
```
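
As a minimal sketch of adjusting the bins directly: `geom_histogram()` accepts a `bins` argument (the number of bins) or a `binwidth` argument (the width of each bin). The simulated vector `x_demo` and the specific values 50 and 500 are made up for illustration and are not tuned to the class data:

```{r}
library(ggplot2)

set.seed(525)                         # arbitrary seed so the example is reproducible
x_demo <- rexp(500, rate = 1/2000)    # simulated right-skewed values (illustration only)

p_demo  <- ggplot() + aes(x_demo)
p_bins  <- p_demo + geom_histogram(bins = 50)        # 50 bins instead of the default 30
p_width <- p_demo + geom_histogram(binwidth = 500)   # or fix the width of each bin

p_bins
p_width
```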

### PC6

```{r}
dd <- d # You might create a new object to work with just for clarity.
dd[d < 0] <- NA # Recode negative values as missing

mean(dd, na.rm=T) # R can understand "T" in place of TRUE (TRUE is safer, since T can be reassigned)
sd(dd, na.rm=T)
```
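
To see why `na.rm` matters, here is a small sketch with a made-up vector (`toy` is invented for illustration). Once any value is `NA`, `mean()` returns `NA` unless you tell it to drop the missing values:

```{r}
toy <- c(-3, 0, 2, 10)            # made-up values, one of them negative
toy_clean <- toy
toy_clean[toy < 0] <- NA          # recode the negative value as missing

mean(toy_clean)                   # NA, because one value is missing
mean(toy_clean, na.rm = TRUE)     # 4, the mean of 0, 2, and 10
```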


### PC7

```{r}
d.log <- log1p(dd) # Note: I use log1p() because some values are very close to zero

mean(d.log, na.rm=T)
median(d.log, na.rm=T)
sd(d.log, na.rm=T)

hist(d.log)
boxplot(d.log)
```
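
A short sketch of the difference between `log()` and `log1p()`, using made-up values (`toy_vals` is invented for illustration). `log1p(x)` computes `log(1 + x)`, so zeros map to zero instead of negative infinity:

```{r}
toy_vals <- c(0, 0.001, 1, 100)   # made-up values, including an exact zero

log(toy_vals)                     # first element is -Inf
log1p(toy_vals)                   # log(1 + x): first element is 0
expm1(log1p(toy_vals))            # expm1() undoes log1p(), recovering the original values
```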

## Statistical Questions

Please get in touch with any questions or clarifications. I realize these solutions are a bit concise. I'm happy to clarify the underlying logic.

### SQ 1 (2.12)

(a) By the addition rule: $P(no~missed~days) = 1 - (0.25 + 0.15 + 0.28) = 0.32$
(b) $P(1~miss~or~less) = P(no~misses) + P(1~miss)$
  $= 0.32 + 0.25 = 0.57$
(c) $P(at~least~1~miss) = P(1~miss) + P(2~misses) + P(\geq 3~misses)$
  $= 1 - P(no~misses) = 1 - 0.32 = 0.68$
(d) Assume (foolishly!) that the absences are independent. This allows us to use the multiplication rule:
  $P(neither~miss~any) = P(no~misses) \times P(no~misses) = 0.32^2 = 0.1024$
(e) Again, assume that the absences are independent and use the multiplication rule:
  $P(both~miss~some) = P(at~least~1~miss) \times P(at~least~1~miss) = 0.68^2 = 0.4624$
(f) Siblings are likely to get each other sick, so the independence assumption is not sound.
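
If you want to check this arithmetic in R, a minimal sketch follows. The variable names are made up, and the assignment of 0.15 and 0.28 to "2 misses" and "3 or more misses" is assumed from the order of the terms in part (c):

```{r}
p_one   <- 0.25   # 1 missed day
p_two   <- 0.15   # 2 missed days (assumed from the order of terms above)
p_three <- 0.28   # 3 or more missed days (assumed from the order of terms above)

p_none <- 1 - (p_one + p_two + p_three)   # (a)
p_none + p_one                            # (b)
1 - p_none                                # (c)
p_none^2                                  # (d), assuming independence
(1 - p_none)^2                            # (e), assuming independence
```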

### SQ 2 (2.20)

This one is all about calculating the conditional probabilities from a contingency table:

(a) $P(man~or~partner~has~blue~eyes) = (108 + 114 - 78) / 204 = 0.7059$
(b) $P(partner~with~blue~eyes | man~with~blue~eyes) = 78 / 114 = 0.6842$
(c) $P(partner~with~blue~eyes | man~with~brown~eyes) = 19 / 54 = 0.3519$
  $P(partner~with~blue~eyes | man~with~green~eyes) = 11 / 36 = 0.3056$
(d) Partner eye color does not appear to be independent of the man's eye color within this sample: blue-eyed partners are more common for blue-eyed men than for men with other eye colors.
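
The same calculations can be sketched in R using only the counts that appear in the answers above. The variable names are made up for illustration:

```{r}
n_total      <- 204   # couples in the table
man_blue     <- 114   # men with blue eyes
partner_blue <- 108   # partners with blue eyes
both_blue    <- 78    # couples where both have blue eyes

(man_blue + partner_blue - both_blue) / n_total   # (a) subtract the overlap so it is not counted twice
both_blue / man_blue                              # (b)
19 / 54                                           # (c) brown-eyed men
11 / 36                                           # (c) green-eyed men
```

If eye colors were independent, the three conditional probabilities in (b) and (c) would be roughly equal.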

### SQ 3 (2.26)

More conditional probabilities, this time calculating a compound probability:

$P(identical|females) = \frac{P(identical~and~females)}{P(females)}$
  $= \frac{0.15}{0.15 + 0.175}$
  $= 0.46$

(A decision tree may also be useful here.)
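
A minimal R sketch of the same calculation, with made-up variable names and the numbers taken from the fraction above:

```{r}
p_identical_and_females <- 0.15           # numerator used above
p_females               <- 0.15 + 0.175   # denominator used above

p_identical_given_females <- p_identical_and_females / p_females
p_identical_given_females
```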

### SQ 4 (2.32)

(a) Once you have one person's birthday, the probability that the second person has the same birthday is: $P(first~two~share~birthday) = \frac{1}{365} = 0.0027$

(b) This one is more challenging! I find it easier to think about by focusing on the probability that the first two don't share a birthday, followed by the probability that the third person doesn't share a birthday with either of them:

  $P(at~least~two~share~birthday) = 1-P(none~of~three~share~birthday)$
  $=1-P(first~two~don't~share) \times P(third~doesn't~share~either)$
  $=1-(\frac{364}{365}) \times (\frac{363}{365}) =0.0082$
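
As a quick check, base R's `stats` package includes `pbirthday()`, which gives the probability that at least two of `n` people share a birthday. A minimal sketch (the variable names are made up):

```{r}
by_hand  <- 1 - (364/365) * (363/365)   # the calculation from part (b)
built_in <- pbirthday(3)                # same probability for n = 3 people, 365 equally likely days

c(by_hand = by_hand, built_in = built_in)
```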

#### Bonus pop-quiz question from class:

If I offered you a choice between a bet on the flip of a fair coin or on whether any two people in our (25 person) class share a birthday, what should you choose?

I like the following approach to this question:

Consider that 25 people can be combined into pairs ${25 \choose 2}$ ways, which is equal to $\frac{25 \times 24}{2} = 300$ (see the Wikipedia article on [binomial coefficients](https://en.wikipedia.org/wiki/Binomial_coefficient) for more on that).

Generalizing the logic from part b of the previous problem, I assume that each of these possible pairings is independent (an approximation, since the pairs overlap) and thus each one has a probability $P = (\frac{364}{365})$ of producing a non-matched set of birthdays.

Put everything together, employ the multiplication rule and you get the following:
$$P(any~match) = 1 - P(no~matches)$$
$$P(no~matches) = (\frac{364}{365})^{300}$$
Let's let R take it from here:
```{r}
1-((364/365)^300)
```

The chance of at least one match comes out to about 0.56, which beats the 0.5 you get from a fair coin. I'd take the coin flip only if the alternative were betting *against* a shared birthday!
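
For comparison, here is a sketch of the exact calculation next to the pairwise approximation. The exact version multiplies person by person (the second person must avoid 1 birthday, the third must avoid 2, and so on), and `pbirthday()` from base R should agree with it. The variable names are made up:

```{r}
n_people <- 25

approx_match <- 1 - (364/365)^choose(n_people, 2)          # treats all 300 pairs as independent
exact_match  <- 1 - prod((365 - 0:(n_people - 1)) / 365)   # exact, multiplying person by person

c(approximate = approx_match, exact = exact_match, pbirthday = pbirthday(n_people))
```

Both versions land a bit above 0.5 (roughly 0.56 and 0.57), so the approximation is close for a class of this size.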


### SQ 5 (2.38)

(a) First, the average fee ($F$) per passenger is the expected value of the fees per passenger:

$E(Fee~per~passenger) = \$0(0.54) + \$25(0.34) + \$60(0.12)$
$= \$0 + \$8.5 + \$7.2 = \$15.70$

To calculate the standard deviation, we need the square root of the variance. To find the variance, we take the squared deviation from the expected value at each fee level, multiply that by the probability (frequency) of the fee level, and calculate the sum:

Squared deviation: $(F - E(F))^2$
No baggage: $(0-15.70)^2 = 246.49$
1 bag:      $(25-15.70)^2 = 86.49$
2 bags:     $(60-15.70)^2 = 1962.49$

Squared deviation multiplied by probability: $(F - E(F))^2 \times P(F)$
No baggage: $246.49 \times 0.54 = 133.10$
1 bag:      $86.49 \times 0.34 = 29.41$
2 bags:     $1962.49 \times 0.12 = 235.50$

Variance ($V$): $398.01$ (in squared dollars)
Standard deviation ($SD$): $\$19.95$
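
The same steps can be sketched in R with two short vectors. The variable names are made up, and the fees and probabilities are the ones used above:

```{r}
fee  <- c(0, 25, 60)          # fee in dollars for no baggage, 1 bag, 2 bags
prob <- c(0.54, 0.34, 0.12)   # probability of each fee level

expected_fee <- sum(fee * prob)                      # expected value
variance_fee <- sum((fee - expected_fee)^2 * prob)   # probability-weighted squared deviations
sd_fee       <- sqrt(variance_fee)

c(expected = expected_fee, variance = variance_fee, sd = sd_fee)
```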


(b) Assume independence of individual passengers (probably wrong, but maybe not catastrophic?). Expected values add across passengers and, under independence, so do variances:

For 120 passengers:
$E(Revenue) = 120 \times 15.70 = \$1,884$
$V = 120 \times 398.01 = 47,761.20$
$SD = \sqrt{47761.20} = \$218.54$
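
A minimal R sketch of part (b), using the per-passenger values from part (a). The variable names are made up. Note that the standard deviation scales with the square root of the number of passengers, not with the number itself:

```{r}
n_passengers       <- 120
per_passenger_mean <- 15.70    # expected fee from part (a)
per_passenger_var  <- 398.01   # variance from part (a)

revenue_mean <- n_passengers * per_passenger_mean
revenue_sd   <- sqrt(n_passengers * per_passenger_var)   # variances add under independence

c(mean = revenue_mean, sd = revenue_sd)
```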

### SQ 6 (2.44)

(a) Right skewed, with a median somewhere around \$35,000-\$50,000. There's a long tail out to the right.
(b) Addition rule: $P(Income \lt \$50k) = 2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2\%$
(c) Assume that income and gender are independent, so that the multiplication rule for independent events applies:
  $P(Income \lt \$50k~and~female) = P(Income \lt \$50k) \times P(female) = 0.622 \times 0.41 = 0.255$
(d) It seems that the assumption was not so great, since the actual proportion of women with incomes less than $\$50k$ was a lot higher than we would expect under the independence assumption in part c.
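
A small sketch of the logic in parts (c) and (d), with made-up variable names. Under independence the joint probability is just the product, and the conditional probability of a low income among women would equal the overall probability:

```{r}
p_below_50k <- 0.622   # from part (b)
p_female    <- 0.41    # share of respondents who are female, as used in part (c)

p_joint_if_independent <- p_below_50k * p_female
p_joint_if_independent

# Under independence, P(income < $50k | female) is the same as P(income < $50k)
p_joint_if_independent / p_female
```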

## Empirical Paper Questions

### EQ 1

(a) "Top" U.S. political blogs.
(b) 155 blogs selected for the study in summer 2008.

### EQ 2

$P(sole~authored | left~perspective) = \frac{17}{17+37+10} = 0.266$

### EQ 3

$P(right~perspective|large~scale~collab.) = \frac{4}{10+4} = 0.286$

### EQ 4

The conditional probabilities provide some evidence that left wing blogs were more collaborative than right wing blogs. That said, a substantial proportion of left blogs are sole-authored, so it's not clear (to me at least) that left blogs are all that likely to be collaborative in their own right. In addition, the number of large scale collaborative blogs overall is not very big (consider Jeremy's point that if you had 4 heads out of 14 coin flips you might not be that surprised!). Nevertheless, the conditional probabilities *do* suggest that right blogs are highly unlikely to be large collaborations. A formal hypothesis test such as the $\chi^2$ test in the paper is arguably more convincing, but the assumptions underlying the null hypothesis may be a little funny (we can discuss this more in a few weeks).
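
For reference, here is a sketch of the EQ 2 and EQ 3 calculations in R, plus a rough look at the coin-flip comparison using the binomial distribution. The variable names are made up, and I am assuming that the 10 in both denominators is the count of large-scale collaborative blogs on the left:

```{r}
left_counts <- c(17, 37, 10)                      # left-perspective blogs; 17 are sole-authored
p_sole_given_left <- left_counts[1] / sum(left_counts)   # EQ 2

p_right_given_large <- 4 / (10 + 4)               # EQ 3: 4 right blogs among the large collaborations

# Coin-flip comparison: chance of 4 or fewer heads in 14 flips of a fair coin
p_four_or_fewer <- pbinom(4, size = 14, prob = 0.5)

c(eq2 = p_sole_given_left, eq3 = p_right_given_large, coin = p_four_or_fewer)
```

The last number is the kind of quantity a formal test would formalize; on its own it is only a rough sanity check on how unusual a 4-to-10 split would be under a 50/50 assumption.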

### EQ 5

I'll be curious to hear what you think!