problem_sets/week_03/ps3-worked_solution.Rmd

   1 ---
   2 title: 'Week 3 Problem set: Worked solutions'
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "April 18, 2019"
   6 output: html_document
   7 ---
   8
   9 ## Programming challenges
  10
  11 ```{r setup, include=FALSE}
  12 knitr::opts_chunk$set(echo = TRUE)
  13 ```
  14 ### PC1 & PC2: Download and read the data into R
  15
  16 I like to load libraries at the beginning of my R scripts. Doing so helps me find them later. Since I know we're graphing stuff later in this problem set I'll call the ggplot2 package here.
  17
  18 ```{r}
  19 library(ggplot2)
  20 ```
  21
  22 Now, I'll go ahead and load the CSV file into R. As with last week, I'll do this directly with a `url()` command. However, I included a couple of lines (commented out) that you might adapt to set the working directory on your machine, ask R to show you what's there, and read the data. I'll walk through this in the screencast and/or in class, but **please note that you'll need to edit this code to get it to work on your own file system!**
  23
  24 ```{r}
  25
  26 ### Uncomment, edit, and try running this on your machine:
  27 ###
  28 ### setwd("~/Documents/Teaching/2019/stats/")
  29 ### list.files("data/week_03") # just take a look around
  30 ### w3.data <- read.csv("data/week_03/group_01.csv")
  31
  32 w3.data <- read.csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/group_02.csv"))
  33 ```
  34
  35 ### PC3. Get to know your data!
  36
  37 First, I like to see the first few rows of the dataset and the dimensions of it:
  38
  39 ```{r}
  40 head(w3.data)
  41 nrow(w3.data)
  42 ncol(w3.data)
  43 ```
  44
  45 Now I'll use the `lapply()` command to ask for some summary information about each variable (column) in the data frame.
  46
  47 ```{r}
  48 lapply(w3.data, class)
  49 lapply(w3.data, summary)
  50 lapply(w3.data, sd, na.rm=TRUE)
  51
  52 ```
  53
  54 This code below is not ideal since it makes it hard to figure out which variable is being graphed and outputs a bunch of extra crud at the end (the figures are in order, but are all labeled as x[[l]]).  That said, the code sure is expedient and short! We'll learn more intelligible ways to do this later.
  55
  56
  57 ```{r}
  58 names(w3.data)
  59 lapply(w3.data, hist)
  60
  61 ### You might also have done this variable by variable with code like the following:
  62 ### hist(w3.data$x)
  63
  64 ```
  65
  66
  67 ### PC4. Roll your own functions to find the mean and median of "x"
  68
  69 First, I have to define the `my.mean()` function. Then I can run it.
  70
  71 ```{r}
  72
  73 ### Run this first...
  74 my.mean <- function(z){
  75   z <- z[!is.na(z)]
  76   sigma <- sum(z)
  77   n <- length(z)
  78   out.value <- sigma/n
  79   return(out.value)
  80 }
  81
  82 ### Now you can call the function we just defined
  83 my.mean(w3.data$x)
  84
  85 ```
  86
  87 Now, for the median bit. Medians can be complicated because the calculation depends on whether there's a midpoint of the data or not. Luckily in this case, there should be 95 non-missing values in the x vector, so a midpoint exists (the 48th value) and the following should work:
  88
  89 ```{r}
  90
  91 my.median <- function(z){
  92   z <- z[!is.na(z)]
  93   midpoint <- (length(z) / 2) + .5
  94   sorted <- sort(z)
  95   output <- sorted[midpoint]
  96   return(output)
  97 }
  98
  99 my.median(w3.data$x)
 100
 101 ```
 102 Since R has built-in functions that do both of these things, I can even check my answers:
 103 ```{r}
 104 mean(w3.data$x, na.rm=T)
 105 median(w3.data$x, na.rm=T)
 106 ```
 107
 108
 109
 110 ### PC5. Load the Week 2 data and clean it up again.
 111
 112 ```{r}
 113 load(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_02/group_02.RData"))
 114
 115 ### I'll rename it for clarity:
 116 w2.data <- d
 117 rm(d)
 118
 119 ### and recode:
 120 w2.data[w2.data < 0] <- NA
 121 w2.data <- log1p(w2.data)
 122 ```
 123
 124 ### PC6. Compare Week 2 data against the x variable from Week 3
 125
 126 First, some summary statistics. These look quite similar...
 127 ```{r}
 128 summary(w2.data)
 129 summary(w3.data$x)
 130 ```
 131
 132 But they don't quite match up...
 133
 134 ```{r}
 135 table(w2.data == w3.data$x)
 136
 137 head(w2.data)
 138 head(w3.data$x)
 139 ```
 140
 141 Inspecting the first few values returned by `head()` gave you a clue. Rounded to six decimal places, the vectors match!
 142
 143 I can create a table comparing the sorted rounded values to check this.
 144 ```{r}
 145
 146 table(sort(round(w2.data, 6)) == sort(round(w3.data$x, 6)))
 147 ```
 148
 149 Can you explain what each piece of that last line of code is doing?
 150
 151
 152 ### PC7. Visualize the data using ggplot2
 153
 154 First create a basic plotting object. Then add the call to `geom_point()` to show just the x and y:
 155 ```{r}
 156 p <- ggplot(w3.data, mapping=aes(x=x, y=y))
 157
 158 p + geom_point()
 159 ```
 160
 161 Now uncomment this line to try to add the color, size, and shape to the point layer:
 162
 163 ```{r}
 164 # p + geom_point(aes(color=j, size=l, shape=k))
 165 ```
 166
 167 Hmm. That doesn't work, but the error message is actually pretty helpful. If you search the text of that message online you might discover that we should try turning the aesthetic mappings of the call to `geom_point` into factors.
 168
 169 ```{r}
 170 p + geom_point(aes(color=as.factor(j), size=as.factor(l), shape=as.factor(k)))
 171 ```
 172
 173 Now that's more like it!
 174
 175
 176 ### PC8 Recoding (again)
 177
 178 ```{r}
 179 w3.data$j <- as.logical(w3.data$j)
 180 w3.data$l <- as.logical(w3.data$l)
 181
 182 ### Create a new column just so I can double check this bit:
 183 w3.data$k.factor <- factor(w3.data$k,
 184                                  levels=c(0,1,2,3),
 185                                  labels=c("none", "some", "lots", "all"))
 186
 187 ### Spotcheck to make sure it looks good
 188 head(w3.data, 10)
 189
 190 ### Now cleanup my extra column:
 191 w3.data$k <- w3.data$k.factor
 192 w3.data$k.factor <- NULL
 193
 194 head(w3.data)
 195 ```
 196
 197 ### PC9. Summarize again
 198
 199 ```{r}
 200 lapply(w3.data, summary)
 201
 202 ### Run this line again to assign the new dataframe to p
 203 p <- ggplot(w3.data, aes(x=x, y=y))
 204
 205 p + geom_point(aes(color=j, size=l, shape=k))
 206 ```
 207
 208
 209 Check out how much more readable the legends look now!
 210
 211
 212 ## Statistical questions
 213
 214 ### SQ1 — 3.4
 215
 216 (a) Let $X$ denote the finishing times of *Men, Ages 30 - 34* and $Y$ denote the finishing times of *Women, Ages 25 - 29*. Then,
 217
 218 $$X ∼ N (μ = 4313, σ = 583)$$
 219 $$Y ∼ N (μ = 5261, σ = 807)$$
 220
 221
 222 (b) Z-scores are a standardization measure: to calculate for a given value you subtract the mean of the corresponding distribution from the value and divide by the standard deviation. The formula notation is given in the *OpenIntro* textbook.
 223
 224     Let's let R calculate it for us:
 225 ```{r}
 226 ## Mary
 227 (5513 - 5261) / 807
 228
 229 ## Leo:
 230 (4948 - 4313) / 583
 231 ```
 232 Since the Z score tells you how many standard deviation units above/below the mean each value is, we can see that Mary finished 0.31 standard deviations above the mean in her category while Leo finished 1.09 standard deviations above the mean in his.
 233
 234 (c) Mary finished in a much faster time with respect to her category.
 235
 236 (d) Using the Z-score table (Table 3.8) on p. 132 of the book, Leo finished *faster* than approximately $1-0.86 = .14$ or $14\%$ of his category. This corresponds the probability $P(Z \gt 1.09)$ for a normal distribution.
 237
 238 (e) Mary finished *faster* than approximately $1-0.62 = .38$ or $38\%$ of her category. This corresponds to the probability $P(Z \gt 0.31)$ for a normal distribution.
 239
 240 (f) The answer for part b would not change as standardized values (Z-scores) can be computed for any distribution. However, the interpretation and percentile calculations (parts c-e) *would* be different because they all presume a normal distribution.
 241
 242
 243 ### SQ2 — 3.6
 244
 245 (a) The fastest $5\%$ are in the $5^{th}$ percentile of the distribution. Using Table 3.8 again, the Z score corresponding to the $5^{th}$ percentile of the normal distribution is approximately -1.64. Then,
 246 $$Z = −1.65 = \frac{x − 4313}{583} → x = −1.65 × 583 + 4313 = 3351~seconds$$
 247     Divide that by 60 and it looks like the fastest $5\%$ of males in this age group finished in less than 56 minutes.
 248
 249 (b) The slowest $10\%$ are in the $90^{th}$ percentile of the distribution. The Z score corresponding to the $90^{th}$ percentile of the normal distribution is approximately 1.28. Then,
 250 $$Z = 1.27 = \frac{y-5261}{807} → y = 1.28 \times 807 + 5261 = 6294 ~seconds$$
 251     Divide that by 60 and it looks like the slowest $10\%$ of females in this age group finished in about 1 hour 45 minutes (or longer).
 252
 253 ### SQ3 — 3.18
 254
 255 (a) I'll do the math for this one in R. First, I can enter the data, then can use the information provided to calculate some ranges for the middle 68, 95, and 99 percent of the data (one, two, and three standard deviations from the mean in a normal distribution). Finally, I'll see what percentage of the empirical distribution falls within the ranges of those quantiles.
 256
 257 ```{r}
 258
 259 d <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)
 260 m <- 61.52
 261 sdev <- 4.58
 262
 263 ## Here are the ranges for +/- 1, 2, and 3 SDs from the mean
 264 r68 <- c(m-sdev, m+sdev)
 265 r95 <- c(m-(2*sdev), m+(2*sdev))
 266 r99 <- c(m-(3*sdev), m+(3*sdev))
 267
 268 ##
 269 l68 <- length(d[d>r68[1] & d<r68[2]])
 270 l95 <- length(d[d>r95[1] & d<r95[2]])
 271 l99 <- length(d[d>r99[1] & d<r99[2]])
 272
 273 l68/length(d)
 274 l95/length(d)
 275 l99/length(d)
 276 ```
 277 Those proportions look consistent with the 68-95-99 rule to me.
 278
 279 (b) The data look pretty consistent with the normal distribution from the visualizations as well. The historgram is pretty symmetric around the theoretical mean of the density plot (red line). The quantile-quantile plot also looks pretty consistent with a normal distribution, altough most of the points that deviate seem to do so in the same direction and there's a slight trend away from the diagonal towards the upper end of the distribution  (right side of the plot).
 280
 281 ### SQ4 — 3.32
 282
 283 This question focuses on applying the knowledge from section 3.4 of the textbook on binomial distributions. Review this section and especially equation 3.40 if this gets confusing
 284
 285
 286 (a) $P(at~least~1~arachnophobe)=1-P(none)$
 287     $1-P(none)=1-{10 \choose 0}0.07^{0}(1-0.07)^{10-0}$
 288 ```{r}
 289 1-(choose(10,0)*1*(.93^10))
 290 ```
 291
 292 (b) $P(2~arachnophobes)={10 \choose 2}0.07^2(1-0.07)^{(10-2)}$
 293 ```{r}
 294 choose(10,2)*0.07^2*0.93^8
 295 ```
 296 (c) $P(\leq1~arachnophobes)=P(none)+P(one)$
 297     ${10 \choose 0}0.07^00.93^{10}+{10 \choose 1}0.07^1 0.93^9$
 298
 299 ```{r}
 300 (choose(10,0)*1*(.93^10))+(choose(10,1)*0.07*(0.93^9))
 301 ```
 302
 303 (d) Is random assignment to tents likely to ensure $\leq1~arachnophobe$ per tent?
 304
 305  Random assignment and the independence assumption means that the answer to part c is the inverse of the outcome we're looking to avoid: $P(\gt1~arachnophobes) = 1-P(\leq1~arachnophobe)$. So, $P(\gt1~arachnophobes) = 1-0.84 = 0.16 = 16\%$. Those are the probabilities, but the interpretation really depends on how confident the camp counselor feels about a $16\%$ chance of having multiple arachnophobic campers in one of the tents.
 306
 307