X-Git-Url: https://code.communitydata.science/stats_class_2019.git/blobdiff_plain/b28ed253c872d0f13c7dbfab658c1b7b52d2e0a0..94e323497df92322a9e05f1c820a38fb56e38b39:/problem_sets/week_03/ps3-worked_solution.html diff --git a/problem_sets/week_03/ps3-worked_solution.html b/problem_sets/week_03/ps3-worked_solution.html new file mode 100644 index 0000000..9927124 --- /dev/null +++ b/problem_sets/week_03/ps3-worked_solution.html @@ -0,0 +1,677 @@ + + + + +
+ + + + + + + + + + +I like to load libraries at the beginning of my R scripts. Doing so helps me find them later. Since I know weâre graphing stuff later in this problem set Iâll call the ggplot2 package here.
+library(ggplot2)
+Now, Iâll go ahead and load the CSV file into R. As with last week, Iâll do this directly with a url()
command. However, I included a couple of lines (commented out) that you might adapt to set the working directory on your machine, ask R to show you whatâs there, and read the data. Iâll walk through this in the screencast and/or in class, but please note that youâll need to edit this code to get it to work on your own file system!
### Uncomment, edit, and try running this on your machine:
+###
+### setwd("~/Documents/Teaching/2019/stats/")
+### list.files("data/week_03") # just take a look around
+### w3.data <- read.csv("data/week_03/group_01.csv")
+
+w3.data <- read.csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/group_02.csv"))
+First, I like to see the first few rows of the dataset and the dimensions of it:
+head(w3.data)
+## x j l k y
+## 1 9.643215 0 0 2 27.43
+## 2 2.158358 0 0 2 1.48
+## 3 1.396595 0 0 1 9.19
+## 4 0.192623 1 0 3 -1.42
+## 5 1.752234 0 1 2 0.26
+## 6 0.170634 1 1 1 0.51
+nrow(w3.data)
+## [1] 100
+ncol(w3.data)
+## [1] 5
+Now Iâll use the lapply()
command to ask for some summary information about each variable (column) in the data frame.
lapply(w3.data, class)
+## $x
+## [1] "numeric"
+##
+## $j
+## [1] "integer"
+##
+## $l
+## [1] "integer"
+##
+## $k
+## [1] "integer"
+##
+## $y
+## [1] "numeric"
+lapply(w3.data, summary)
+## $x
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## 0.04996 0.98868 2.25888 3.16398 4.58581 9.64321 5
+##
+## $j
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## 0.00 0.00 0.00 0.49 1.00 1.00
+##
+## $l
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## 0.00 0.00 1.00 0.52 1.00 1.00
+##
+## $k
+## Min. 1st Qu. Median Mean 3rd Qu. Max.
+## 1.00 1.00 2.00 1.81 2.00 3.00
+##
+## $y
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## -4.42 3.19 7.81 9.96 14.61 33.14 5
+lapply(w3.data, sd, na.rm=TRUE)
+## $x
+## [1] 2.655523
+##
+## $j
+## [1] 0.5024184
+##
+## $l
+## [1] 0.5021167
+##
+## $k
+## [1] 0.7204796
+##
+## $y
+## [1] 8.667574
+This code below is not ideal since it makes it hard to figure out which variable is being graphed and outputs a bunch of extra crud at the end (the figures are in order, but are all labeled as x[[l]]). That said, the code sure is expedient and short! Weâll learn more intelligible ways to do this later.
+names(w3.data)
+## [1] "x" "j" "l" "k" "y"
+lapply(w3.data, hist)
+
+## $x
+## $breaks
+## [1] 0 1 2 3 4 5 6 7 8 9 10
+##
+## $counts
+## [1] 24 19 9 10 11 6 5 5 3 3
+##
+## $density
+## [1] 0.25263158 0.20000000 0.09473684 0.10526316 0.11578947 0.06315789
+## [7] 0.05263158 0.05263158 0.03157895 0.03157895
+##
+## $mids
+## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
+##
+## $xname
+## [1] "X[[i]]"
+##
+## $equidist
+## [1] TRUE
+##
+## attr(,"class")
+## [1] "histogram"
+##
+## $j
+## $breaks
+## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
+##
+## $counts
+## [1] 51 0 0 0 0 0 0 0 0 49
+##
+## $density
+## [1] 5.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.9
+##
+## $mids
+## [1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
+##
+## $xname
+## [1] "X[[i]]"
+##
+## $equidist
+## [1] TRUE
+##
+## attr(,"class")
+## [1] "histogram"
+##
+## $l
+## $breaks
+## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
+##
+## $counts
+## [1] 48 0 0 0 0 0 0 0 0 52
+##
+## $density
+## [1] 4.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.2
+##
+## $mids
+## [1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
+##
+## $xname
+## [1] "X[[i]]"
+##
+## $equidist
+## [1] TRUE
+##
+## attr(,"class")
+## [1] "histogram"
+##
+## $k
+## $breaks
+## [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
+##
+## $counts
+## [1] 37 0 0 0 45 0 0 0 0 18
+##
+## $density
+## [1] 1.85 0.00 0.00 0.00 2.25 0.00 0.00 0.00 0.00 0.90
+##
+## $mids
+## [1] 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9
+##
+## $xname
+## [1] "X[[i]]"
+##
+## $equidist
+## [1] TRUE
+##
+## attr(,"class")
+## [1] "histogram"
+##
+## $y
+## $breaks
+## [1] -5 0 5 10 15 20 25 30 35
+##
+## $counts
+## [1] 7 25 25 15 9 7 5 2
+##
+## $density
+## [1] 0.014736842 0.052631579 0.052631579 0.031578947 0.018947368 0.014736842
+## [7] 0.010526316 0.004210526
+##
+## $mids
+## [1] -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5
+##
+## $xname
+## [1] "X[[i]]"
+##
+## $equidist
+## [1] TRUE
+##
+## attr(,"class")
+## [1] "histogram"
+### You might also have done this variable by variable with code like the following:
+### hist(w3.data$x)
+First, I have to define the my.mean()
function. Then I can run it.
### Run this first...
+my.mean <- function(z){
+ z <- z[!is.na(z)]
+ sigma <- sum(z)
+ n <- length(z)
+ out.value <- sigma/n
+ return(out.value)
+}
+
+### Now you can call the function we just defined
+my.mean(w3.data$x)
+## [1] 3.163984
+Now, for the median bit. Medians can be complicated because the calculation depends on whether thereâs a midpoint of the data or not. Luckily in this case, there should be 95 non-missing values in the x vector, so a midpoint exists (the 48th value) and the following should work:
+my.median <- function(z){
+ z <- z[!is.na(z)]
+ midpoint <- (length(z) / 2) + .5
+ sorted <- sort(z)
+ output <- sorted[midpoint]
+ return(output)
+}
+
+my.median(w3.data$x)
+## [1] 2.258885
+Since R has built-in functions that do both of these things, I can even check my answers:
+mean(w3.data$x, na.rm=T)
+## [1] 3.163984
+median(w3.data$x, na.rm=T)
+## [1] 2.258885
+load(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_02/group_02.RData"))
+
+### I'll rename it for clarity:
+w2.data <- d
+rm(d)
+
+### and recode:
+w2.data[w2.data < 0] <- NA
+w2.data <- log1p(w2.data)
+First, some summary statistics. These look quite similarâ¦
+summary(w2.data)
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## 0.04996 0.98868 2.25888 3.16398 4.58581 9.64321 5
+summary(w3.data$x)
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## 0.04996 0.98868 2.25888 3.16398 4.58581 9.64321 5
+But they donât quite match upâ¦
+table(w2.data == w3.data$x)
+##
+## FALSE
+## 95
+head(w2.data)
+## [1] 9.6432153 2.1583583 1.3965948 0.1926225 1.7522343 0.1706340
+head(w3.data$x)
+## [1] 9.643215 2.158358 1.396595 0.192623 1.752234 0.170634
+Inspecting the first few values returned by head()
gave you a clue. Rounded to six decimal places, the vectors match!
I can create a table comparing the sorted rounded values to check this.
+table(sort(round(w2.data, 6)) == sort(round(w3.data$x, 6)))
+##
+## TRUE
+## 95
+Can you explain what each piece of that last line of code is doing?
+First create a basic plotting object. Then add the call to geom_point()
to show just the x and y:
p <- ggplot(w3.data, mapping=aes(x=x, y=y))
+
+p + geom_point()
+## Warning: Removed 5 rows containing missing values (geom_point).
+
+Now uncomment this line to try to add the color, size, and shape to the point layer:
+# p + geom_point(aes(color=j, size=l, shape=k))
+Hmm. That doesnât work, but the error message is actually pretty helpful. If you search the text of that message online you might discover that we should try turning the aesthetic mappings of the call to geom_point
into factors.
p + geom_point(aes(color=as.factor(j), size=as.factor(l), shape=as.factor(k)))
+## Warning: Using size for a discrete variable is not advised.
+## Warning: Removed 5 rows containing missing values (geom_point).
+
+Now thatâs more like it!
+w3.data$j <- as.logical(w3.data$j)
+w3.data$l <- as.logical(w3.data$l)
+
+### Create a new column just so I can double check this bit:
+w3.data$k.factor <- factor(w3.data$k,
+ levels=c(0,1,2,3),
+ labels=c("none", "some", "lots", "all"))
+
+### Spotcheck to make sure it looks good
+head(w3.data, 10)
+## x j l k y k.factor
+## 1 9.643215 FALSE FALSE 2 27.43 lots
+## 2 2.158358 FALSE FALSE 2 1.48 lots
+## 3 1.396595 FALSE FALSE 1 9.19 some
+## 4 0.192623 TRUE FALSE 3 -1.42 all
+## 5 1.752234 FALSE TRUE 2 0.26 lots
+## 6 0.170634 TRUE TRUE 1 0.51 some
+## 7 NA TRUE TRUE 1 NA some
+## 8 4.327381 TRUE TRUE 2 17.98 lots
+## 9 NA FALSE FALSE 1 NA some
+## 10 2.937375 TRUE FALSE 2 7.81 lots
+### Now cleanup my extra column:
+w3.data$k <- w3.data$k.factor
+w3.data$k.factor <- NULL
+
+head(w3.data)
+## x j l k y
+## 1 9.643215 FALSE FALSE lots 27.43
+## 2 2.158358 FALSE FALSE lots 1.48
+## 3 1.396595 FALSE FALSE some 9.19
+## 4 0.192623 TRUE FALSE all -1.42
+## 5 1.752234 FALSE TRUE lots 0.26
+## 6 0.170634 TRUE TRUE some 0.51
+lapply(w3.data, summary)
+## $x
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## 0.04996 0.98868 2.25888 3.16398 4.58581 9.64321 5
+##
+## $j
+## Mode FALSE TRUE
+## logical 51 49
+##
+## $l
+## Mode FALSE TRUE
+## logical 48 52
+##
+## $k
+## none some lots all
+## 0 37 45 18
+##
+## $y
+## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+## -4.42 3.19 7.81 9.96 14.61 33.14 5
+### Run this line again to assign the new dataframe to p
+p <- ggplot(w3.data, aes(x=x, y=y))
+
+p + geom_point(aes(color=j, size=l, shape=k))
+## Warning: Using size for a discrete variable is not advised.
+## Warning: Removed 5 rows containing missing values (geom_point).
+
+Check out how much more readable the legends look now!
+\[X â¼ N (μ = 4313, Ï = 583)\] \[Y â¼ N (μ = 5261, Ï = 807)\]
+Z-scores are a standardization measure: to calculate for a given value you subtract the mean of the corresponding distribution from the value and divide by the standard deviation. The formula notation is given in the OpenIntro textbook.
+Letâs let R calculate it for us:
## Mary
+(5513 - 5261) / 807
+## [1] 0.3122677
+## Leo:
+(4948 - 4313) / 583
+## [1] 1.089194
+Since the Z score tells you how many standard deviation units above/below the mean each value is, we can see that Mary finished 0.31 standard deviations above the mean in her category while Leo finished 1.09 standard deviations above the mean in his.
+Mary finished in a much faster time with respect to her category.
Using the Z-score table (Table 3.8) on p. 132 of the book, Leo finished faster than approximately \(1-0.86 = .14\) or \(14\%\) of his category. This corresponds the probability \(P(Z \gt 1.09)\) for a normal distribution.
Mary finished faster than approximately \(1-0.62 = .38\) or \(38\%\) of her category. This corresponds to the probability \(P(Z \gt 0.31)\) for a normal distribution.
The answer for part b would not change as standardized values (Z-scores) can be computed for any distribution. However, the interpretation and percentile calculations (parts c-e) would be different because they all presume a normal distribution.
The fastest \(5\%\) are in the \(5^{th}\) percentile of the distribution. Using Table 3.8 again, the Z score corresponding to the \(5^{th}\) percentile of the normal distribution is approximately -1.64. Then, \[Z = â1.65 = \frac{x â 4313}{583} â x = â1.65 Ã 583 + 4313 = 3351~seconds\] Divide that by 60 and it looks like the fastest \(5\%\) of males in this age group finished in less than 56 minutes.
The slowest \(10\%\) are in the \(90^{th}\) percentile of the distribution. The Z score corresponding to the \(90^{th}\) percentile of the normal distribution is approximately 1.28. Then, \[Z = 1.27 = \frac{y-5261}{807} â y = 1.28 \times 807 + 5261 = 6294 ~seconds\] Divide that by 60 and it looks like the slowest \(10\%\) of females in this age group finished in about 1 hour 45 minutes (or longer).
d <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61, 62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)
+m <- 61.52
+sdev <- 4.58
+
+## Here are the ranges for +/- 1, 2, and 3 SDs from the mean
+r68 <- c(m-sdev, m+sdev)
+r95 <- c(m-(2*sdev), m+(2*sdev))
+r99 <- c(m-(3*sdev), m+(3*sdev))
+
+##
+l68 <- length(d[d>r68[1] & d<r68[2]])
+l95 <- length(d[d>r95[1] & d<r95[2]])
+l99 <- length(d[d>r99[1] & d<r99[2]])
+
+l68/length(d)
+## [1] 0.68
+l95/length(d)
+## [1] 0.96
+l99/length(d)
+## [1] 1
+Those proportions look consistent with the 68-95-99 rule to me.
+This question focuses on applying the knowledge from section 3.4 of the textbook on binomial distributions. Review this section and especially equation 3.40 if this gets confusing
+1-(choose(10,0)*1*(.93^10))
+## [1] 0.5160177
+choose(10,2)*0.07^2*0.93^8
+## [1] 0.1233878
+(choose(10,0)*1*(.93^10))+(choose(10,1)*0.07*(0.93^9))
+## [1] 0.8482701
+Random assignment and the independence assumption means that the answer to part c is the inverse of the outcome weâre looking to avoid: \(P(\gt1~arachnophobes) = 1-P(\leq1~arachnophobe)\). So, \(P(\gt1~arachnophobes) = 1-0.84 = 0.16 = 16\%\). Those are the probabilities, but the interpretation really depends on how confident the camp counselor feels about a \(16\%\) chance of having multiple arachnophobic campers in one of the tents.
+