assessment/interactive_assessment.rmd

   1 ---
   2 title: "Interactive Self-Assessment"
   3 subtitle: "Fall 2020 MTS 525 / COMMST 395 Statistics and Statistical Programming"
   4 output: learnr::tutorial
   5 runtime: shiny_prerendered
   6 ---
   7
   8
   9 ```{r setup, include=FALSE}
  10 library(learnr)
  11 library(tidyverse)
  12
  13 knitr::opts_chunk$set(echo = FALSE, tidy=TRUE)
  14
  15 t <- Sys.time()
  16 question_filename <- paste("question_submission_", t, ".csv", sep="")
  17 code_filename <- paste("code_", t, ".csv", sep="")
  18
  19 #df <- data.frame(label=c('test'), question=c('asd'), answer=c('asd'), correct=c(TRUE), stringsAsFactors=FALSE)
  20 df <- data.frame()
  21 print('*')
  22 print(df)
  23
  24 tutorial_event_recorder <- function(tutorial_id, tutorial_version, user_id,
  25                                     event, data) {
  26   # quiz question answered
  27   if (event == "question_submission"){
  28     # nick exasperatedly believes this is the correct way to index the result of strsplit... [[1]][[1]]
  29     data$category <- strsplit(data$label, '_')[[1]][[1]]
  30     #print(data)
  31
  32     df <<- rbind(df, data, stringsAsFactors=FALSE)
  33     #write.table(data, question_filename, append=TRUE, sep=",", row.names=TRUE, col.names=FALSE)
  34     write.table(df, question_filename, append=FALSE, sep=",", row.names=TRUE, col.names=TRUE)
  35
  36   }
  37   # code
  38   if (event == "exercise_submitted"){
  39     write.table(data, code_filename, append=TRUE, sep=",", row.names=TRUE, col.names=FALSE)
  40   }
  41
  42 }
  43 options(tutorial.event_recorder = tutorial_event_recorder)
  44 ```
  45
  46
  47
  48 ## Section 1: Warmup exercises
  49
  50 TODO add a short description. State the number of questions that will be asked. Include expectations about time commitment.
  51
  52 ### Code Chunk Warm-up
  53
  54 To get familiar with how code chunks work in `learnr`, let's write R code required to add two numbers: 1234 and 5678 (and the answer is 6912).
  55
  56 The code chunk below is editable and is "pre-populated" with an unfinished function definition. The goal is to add arguments and fill in the body of the function. When finished, you can run the code chunk and it should produce the answer.
  57
  58 If you click "Run Code", you should see the answer below the chunk. That answer will persist as you navigate around this doc.
  59
  60 You can clear your answers by clicking "Start Over" in the top-left of the chunk. You can also clear **all** your answers by clicking "Start Over" in the left-hand sidebar, but doing that basically erases all progress in the document *Use with caution!*
  61
  62 ```{r warmup_1, exercise=TRUE, exercise.lines=10}
  63 add <- function() {
  64
  65 }
  66 x = 1234
  67 y = 5678
  68 add(x,y)
  69 ```
  70
  71 ```{r warmup_1-solution}
  72 add <- function(value1, value2) {
  73   return(value1 + value2)
  74 }
  75
  76 x = 1234
  77 y = 5678
  78
  79 add(x,y)
  80 ```
  81
  82 ### Multiple Choice Question Warmup
  83 The question below shows how the multiple choice answering and "feedback" works.
  84 ```{r warmup_2}
  85 quiz(
  86   question("Select the answer choice that will return `TRUE` in R.",
  87     answer("1 == 1", message="Good work! Feedback appears here.", correct=TRUE),
  88     answer("1 == 0", message="Not quite! Feedback appears here.")
  89   )
  90 )
  91 ```
  92
  93 ### Debugging a Function
  94 Below, you'll see code to define a function that is *supposed* to perform a transformation on a vector. The problem is that it doesn't work right now.
  95
  96 In theory, the function will take a numeric vector as input (let's call it $x$) and scale the values so they lie between zero and one. [^1] The way it *should* do this is by first subtracting the minimum value of $x$ from each element of $x$. Then, the function will divide each element by the difference between the maximum value of $x$ and the minimum value of $x$.
  97
  98 [^1]: This is sometimes called min-max [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling), and is sometimes used for machine learning.
  99
 100 As written now, however, the function does not work! There are at least three issues you will need to fix to get it working. Once you fix them, you should be able to confirm that your function works with the pre-populated example (with the correct output provided). You might also be able to make this code more "elegant" (or alternatively, improve the comments and variable names as you see fit).
 101
 102 Bonus: how might we update this function to scale between any "floor" and "ceiling" value?
 103
 104 ```{r R_debug1, exercise=TRUE}
 105 zeroToOneRescaler <- function() {
 106   # the minimum value
 107   minval <- min(x)
 108   # let's "shift" our vector by subtracting the minimum value of x from each element
 109   shifted <- x - minval
 110
 111   # let's find the difference between max val and min val
 112   difference <- min(x) - max(x)
 113
 114   scaled <- shifted / difference
 115   scaled
 116 }
 117
 118 test_vector = c(1,2,3,4,5)
 119 zeroToOneRescaler(test_vector)
 120 # Should print c(0, 0.25, 0.5, 0.75, 1.00)
 121 ```
 122
 123 ```{r R_debug1-solution}
 124 zeroToOneRescaler <- function(x) {
 125   shifted <- x - min(x)
 126   difference = max(x) - min(x)
 127   return(shifted / difference)
 128 }
 129
 130 test_vector = c(1,2,3,4,5)
 131 zeroToOneRescaler(test_vector)
 132 # Should print c(0, 0.25, 0.5, 0.75, 1.00)
 133 ```
 134
 135 ```{r R_debug1-response}
 136 quiz(
 137   question("Were you able to solve the debugging question? (this question is for feedback purposes)",
 138     answer("Yes", message="Nice work!", correct = TRUE),
 139     answer("No", message="")
 140   )
 141 )
 142 ```
 143
 144
 145 The following commented chunk has at least five (annoying) bugs. Can you uncomment the code, fix all the bugs, and get this chunk to run? These are drawn from real experiences from your TA!
 146 ```{r R_debug2, exercise=TRUE}
 147 # ps2 <- readcsv(file = url(
 148 #   " https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"), row.names = NULL
 149 # )
 150 #
 151 # ps2$y[is.na(ps2$y)] <- 0
 152 # "ps2$'My First New Column' <- ps2$y * -1"
 153 # ps2$'My Second New Column" <- ps2$y + ps2$'My First New Column'
 154 #
 155 # summary(ps2$'My Second New Column']
 156 ```
 157
 158 ```{r R_debug2-solution}
 159 ps2 <- read.csv(file = url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"), row.names = NULL)
 160 ps2$y[is.na(ps2$y)] <- 0
 161 ps2$'My First New Column' <- ps2$y * -1
 162 ps2$'My Second New Column' <- ps2$y + ps2$'My First New Column'
 163 summary(ps2$'My Second New Column')
 164 ```
 165
 166 ```{r R_debug2-response}
 167 quiz(
 168   question("Were you able to solve the above debugging question? (this question is for feedback purposes)",
 169            answer("Yes", message="Nice work!", correct = TRUE),
 170            answer("No", message="")
 171   )
 172 )
 173 ```
 174
 175 ### Updating a visualization
 176 Imagine you've created a histogram to visualize some data from your research (below, we'll use R's built-in "PlantGrowth" dataset). You show your collaborator a histogram of this plot using default R, and they express some concerns about your plot's aesthetics. Replace the base-R histogram with a `ggplot2` histogram that also includes a density plot overlaid on it (maybe in a bright, contrasting color like red).
 177
 178 ```{r R_ggplot, exercise=TRUE}
 179 data("PlantGrowth")
 180 hist(PlantGrowth$weight)
 181 ```
 182
 183 ```{r R_ggplot-solution}
 184 library(ggplot2)
 185
 186 ggplot(PlantGrowth, aes(weight, after_stat(density))) + geom_histogram() + geom_density(color = "red")
 187 ```
 188
 189 Bonus: How would you find more information about the source of this dataset?
 190
 191
 192 ### Interpret a dataframe
 193 ```{r R_columns-setup, exercise=TRUE}
 194 data <- mtcars
 195 data$mpgGreaterThan20 <- data$mpg > 20
 196 data$gear <- as.factor(data$gear)
 197 data$mpgRounded <- round(data$mpg)
 198 ```
 199
 200 The below questions relate to the `data` data.frame defined above, which is a modified version of the classic `mtcars`.
 201
 202 For all answers, assume the above code chunks *has completely run*, i.e. assume all modifications described above were made.
 203 ```{r R_columns}
 204 quiz(
 205   question("Which of the following best describes the `mpg` variable?",
 206     answer("Numeric, continuous", correct=TRUE),
 207     answer("Numeric, discrete"),
 208     answer("Categorical, dichotomous"),
 209     answer("Categorical, ordinal"),
 210     answer("Categorical")
 211   ),
 212   question("Which of the following best describes the `mpgGreaterThan20` variable?",
 213     answer("Numeric, continuous"),
 214     answer("Numeric, discrete"),
 215     answer("Categorical, dichotomous", correct=TRUE),
 216     answer("Categorical, ordinal"),
 217     answer("Categorical")
 218   ),
 219   question("Which of the following best describes the `mpgRounded` variable?",
 220     answer("Numeric, continuous"),
 221     answer("Numeric, discrete", correct=TRUE),
 222     answer("Categorical, dichotomous"),
 223     answer("Categorical, ordinal"),
 224     answer("Categorical")
 225   ),
 226   question("Which of the following best describes the `gear` variable?",
 227     answer("Numeric, continuous"),
 228     answer("Numeric, discrete"),
 229     answer("Categorical, dichotomous"),
 230     answer("Categorical, ordinal", correct=TRUE),
 231     answer("Categorical")
 232   )
 233 )
 234 ```
 235
 236 ## Section 2
 237 The following is a series of short multiple choice questions. These questions focus on definitions, and should not require performing any computations or writing any code.
 238 ```{r Stats_lightninground}
 239 m1 <- ""
 240 wolf <- "Think of the 'Boy who cried wolf', with a null hypothesis that no wolf exists. First the boy claims the alternative hypothesis: there is a wolf. The villagers believe this, and reject the correct null hypothesis. Second, the villagers make an error by not believing the boy when he presents a correct alternative hypothesis."
 241
 242 quiz(
 243   question("A hypothesis is typically concerned with a:",
 244     answer("population statistic.", correct = TRUE),
 245     answer("sample statistic.")
 246   ),
 247   question("A sampling distribution is:",
 248     answer("critical to report in your papers."),
 249     answer("theoretically helpful, but rarely available to researchers in practice.", correct = TRUE),
 250     answer("practically useful, but not relies on assumptions that are rarely met.")
 251   ),
 252   question("Z-scores tell us about a value in terms of:",
 253     answer("mean and standard deviation.", correct = TRUE),
 254     answer("sample size and sampling strategy."),
 255     answer("if an effect is causal or not.")
 256   ),
 257   question("A distribution that is right-skewed has a long tail to the:",
 258     answer("right", correct = TRUE),
 259     answer("left")
 260   ),
 261   question("A normal distribution can be characterized with only this many parameters:",
 262     answer("1"),
 263     answer("2", correct = TRUE),
 264     answer("3")
 265   ),
 266   question("When we calculate standard error, we calculate",
 267     answer("using a different formula for every type of variable."),
 268     answer("the sample standard error, which is an estimate of the population standard error.", correct = TRUE),
 269     answer("whether or not our result is causal.")
 270   ),
 271   question("When we calculate standard error, we calculate",
 272     answer("using a different formula for every type of variable."),
 273     answer("the sample standard error, which is an estimate of the population standard error.", correct = TRUE),
 274     answer("whether or not our result is causal.")
 275   ),
 276   question("P values tell us about",
 277     answer("the world in which our null hypothesis is true.", correct = TRUE),
 278     answer("the world in which our null hypothesis is false."),
 279     answer("the world in which our data describe a causal effect")
 280   ),
 281   question("P values are",
 282     answer("a conditional probability.", correct = TRUE),
 283     answer("completely misleading."),
 284     answer("only useful when our data has a normal distribution.")
 285   ),
 286   question("A type 1 error occurs when",
 287     answer("when we reject a correct null hypothesis (i.e. false positive).", correct = TRUE, message=wolf),
 288     answer("when we accept a correct null hypothesis", message=wolf),
 289     answer("when we accept an incorrect null hypothesis (i.e. false negative)", message=wolf)
 290   ),
 291   question("Before we assume independence of two random samples, it is useful to check that",
 292     answer("both samples include over 90% of the population."),
 293     answer("both samples include less than 10% of the population.", correct = TRUE)
 294   )
 295 )
 296 ```
 297
 298 ## Section 3
 299
 300 ### About this Section
 301
 302 The following questions are in the style of pen-and-paper statistics class exam questions. There a few sections that you may want or need to run some R code; there are a variety of empty "scratch paper" code chunks for this purpose. Note that this document contains a section with helpful formulas, which you can navigate to via the leftmost column.
 303
 304 ### Sampling
 305
 306 ```{r Stats_sampling}
 307 quiz(
 308   question("A political scientist is interested in the effect of government type on economic development.
 309 She wants to use a sample of 30 countries evenly represented among the Americas, Europe,
 310 Asia, and Africa to conduct her analysis. What type of study should she use to ensure that
 311 countries are selected from each region of the world? Assume a limitied research budget.",
 312     answer("Observational - simple random sample"),
 313     answer("Observational - cluster"),
 314     answer("Observational - stratifed", correct=TRUE),
 315     answer("Experimental")
 316   )
 317 )
 318 ```
 319
 320 For the following question, you may want to use this "scratch paper" code chunk.
 321 ```{r Stats_quartile-scratch, exercise=TRUE}
 322
 323 ```
 324
 325 ```{r Stats_quartile}
 326 quiz(
 327   question("Heights of boys in a high school are approximately normally distributed with mean of 175 cm
 328 standard deviation of 5 cm. What is the first quartile of heights?",
 329     answer("25 cm"),
 330     answer("167.3 cm"),
 331     answer("171.7 cm", correct=TRUE),
 332     answer("173.5 cm"),
 333     answer("178.3 cm")
 334   )
 335 )
 336 ```
 337
 338
 339 ### Outliers and Skew
 340 Suppose we are reading a paper which reports the following about a column of a dataset:
 341
 342 Minimum value is 0.00125 and Maximum Value is 2.1100.
 343
 344 Mean is 0.41100 and median is 0.27800.
 345
 346 1st quartile is 0.13000 and 3rd quartile is 0.56200.
 347
 348 ```{r Stats_summary}
 349 m1 <- "Under R's default setting, outliers are values that are either greater than the upper bound $Q_3 + 1.5\\times IQR$ OR less than the lower bound $Q_1 - 1.5\\times IQR$. Here, $IQR = 0.562-0.130=0.432$. The upper bound $= 0.562 + 1.5\\times (0.432) = 1.21$. The lower bound is $0.13 - 1.5\\times (0.432) = -0.518$. We see that the maximum value is 2.11, greater than the upper bound. Thus, there is at least one outlier in this sample."
 350
 351 m2 <- "There is at least one outlier on the right, whereas there is none on the left. $|Q_3-Q_2| > |Q_2-Q_1|$, so the whisker for this box plot would be longer on the right-hand side. The mean is larger than the median."
 352 quiz(
 353   question("Are there outliers (in terms of IQR) in this sample?",
 354     answer("Yes", correct = TRUE, message=m1),
 355     answer("No", message="asd")
 356   ),
 357   question("Based on these summary statistics, we might expect the skew of the distribution to be:",
 358     answer("left-skewed", message=m2),
 359     answer("right-skewed", message=m2, correct=TRUE),
 360     answer("symmetric", message=m2)
 361   )
 362 )
 363 ```
 364
 365
 366 ### Computing Probabilities
 367 For each of the below questions, you will need to calculate some probabilities by hand.
 368 You may want to use this "scratch paper" code chunk (possibly in conjunction with actual paper).
 369
 370 ```{r Stats_probs-scratch, exercise=TRUE}
 371
 372 ```
 373
 374 ```{r Stats_probs}
 375 m1 <- "$P(\\text{Coffee} \\cap \\text{No Milk}) = P(\\text{Coffee})\\cdot P(\\text{No Milk}) = 0.5 \\cdot (1-0.1)  = 0.45$"
 376
 377 m2 <- "Let H be the event of hypertension, M be event of being a male. We see here that $P(H) = 0.15$ whereas $P(H|M) = 0.18$. Since $P(H) \\neq P(H|M)$, then hypertension is not independent of sex."
 378
 379 m3 <- "$P(HIV \\cap HCV) = P(HIV|HCV)\\cdot P(HCV) = 0.1\\cdot 0.02 = 0.002$"
 380
 381 quiz(
 382   question("Suppose in a population, half prefer coffee to tea, and assume that 10 percent of the population does not put milk in their coffee or tea. If coffee vs. tea preference and cow milk are independent, what fraction of the population both prefers coffee and does put milk in their coffee?",
 383     answer("40%", message=m1),
 384     answer("45%", correct = TRUE, message=m1),
 385     answer("50%", message=m1),
 386     answer("55%", message=m1)
 387   ),
 388   question("In the general population, about 15 percent of adults between 25 and 40 years of age are hypertensive.  Suppose that among males of this age, hypertension occurs about 18 percent of the time. Is hypertension independent of sex? ",
 389     answer("Yes", message=m2),
 390     answer("No.", correct=TRUE, message=m2)
 391   ),
 392   question("What might you search for (in Google, your notes, the OpenIntro PDF, etc.) to help with this question?",
 393     answer("t test"),
 394     answer("laws of probability", correct=TRUE),
 395     answer("linear regression"),
 396     answer("R debugging")
 397   ),
 398   question("Co-infection with HIV and hepatitis C (HCV) occurs when a patient has both diseases, and is on the rise in some countries. Assume that in a given country, only about 2% of the population has HCV,  but 25% of the population with HIV have HCV.  Assume as well that 10% of the population with HCV have HIV.  What is the probability that a randomly chosen member of the population has both HIV and HCV?",
 399     answer("0.001", message=m3),
 400     answer("0.01", message=m3),
 401     answer("0.002", correct=TRUE, message=m3),
 402     answer("0.02", message=m3)
 403   ),
 404   question("What might you search for (in Google, your notes, the OpenIntro PDF, etc.) to help with this question?",
 405     answer("t test"),
 406     answer("laws of probability", correct=TRUE),
 407     answer("linear regression"),
 408     answer("R debugging")
 409   )
 410 )
 411 ```
 412
 413 ### Calculating Probabilities: A Biostats Example
 414 This question is adapted from a biostats midterm exam.
 415 In the past (2015, to be specific), the US Preventive Services
 416 Task Force recommended that women under the age of 50 should
 417 not get routine mammogram screening for breast cancer.  The Task Force
 418 argued that for a woman with a positive mammogram (one suggesting the
 419 presence of breast cancer), the chance that she has breast cancer was
 420 too low to justify a surgical biopsy.
 421
 422 Suppose the data below describe a cohort of 100,000 women age 40 -
 423 49 in whom mammogram screening and breast cancer behaves just like the
 424 larger population.  For instance, in this table, the 3,333 women with
 425 breast cancer represent a rate of 1 in 30 women with undiagnosed
 426 cancer. The numbers in the table are realistic for US women in this
 427 age category.
 428
 429 Has Breast Cancer: 3,296 Positive Test Results and 37 negative test results (3,333 total)
 430
 431 Does not Have Breast Cancer: 8,313 Positive Test Results and 88,354 negative test results (96,667 total)
 432
 433 First, compute the "margins" of the above contingency table.
 434 Row margins: How many total women have breast cancer? How many total women do not have breast cancer?
 435 Column margins: How many total positive test? How many total negative tests?
 436 ```{r Stats_mammogram-chunk, exercise=TRUE}
 437
 438 ```
 439
 440 ```{r Stats_mammogram}
 441 m1 <- "
 442 $\\Pr(\\textrm{Test}^+ \\cap \\textrm{Cancer}) = 3,296$
 443
 444 $\\Pr(Cancer) = 3,333$
 445
 446 $\\Pr(\\textrm{Test}^+|\\textrm{Cancer}) =$ \
 447 $\\dfrac{\\Pr(\\textrm{Test}^+ \\cap \\textrm{Cancer})}{\\Pr(\\textrm{Cancer})} =$\
 448 $\\dfrac{3,296}{3,333} = 0.989$"
 449
 450 m2 <- "
 451 $Pr(\\textrm{Cancer}|\\textrm{Test}^+) =$
 452
 453 $\\dfrac{\\Pr(\\textrm{Cancer} \\cap \\textrm{Test}^+)}
 454      {\\Pr(\\textrm{Test}^+)}=$
 455
 456
 457  $\\dfrac{3,296}{11,609} = 0.284$"
 458
 459 quiz(
 460   question("Based on this data, what is the probability that a woman has a positive test given that women has cancer?",
 461     answer("98.9%", correct = TRUE, message=m1),
 462     answer("99.9%",message=m1),
 463     answer("89.9%",message=m1),
 464     answer("88.9%",message=m1)
 465   ),
 466   question("Based on this data, what is the probability that a woman has cancer receives a positive test?",
 467     answer("28.4%", correct = TRUE,message=m2),
 468     answer("10.3%",message=m2),
 469     answer("50.7%",message=m2),
 470     answer("97.9%",message=m2)
 471   ),
 472   question("Is the Task Force correct to claim that there is a low probability that a women between 40-49 who tests positive has breast cancer?",
 473     answer("Yes", correct=TRUE),
 474     answer("No")
 475   )
 476 )
 477 ```
 478
 479
 480
 481 ## Useful Formulas
 482 Sample Mean (sample statistic):
 483 $\bar{x}=\frac{\sum_{i=1}^n x_i}{n}$ |
 484 Standard deviation:
 485 $s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}}$ |
 486 Variance:
 487 $var = s^2$
 488
 489 Useful probability axioms:
 490 $\mbox{Pr}(A^c)=1-\mbox{Pr}(A)$ | Pr(A and B) = Pr(A) $\times$ Pr(B) | Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
 491
 492 $\mbox{Pr}(A|B)=\frac{\mbox{Pr(A and B)}}{\mbox{Pr(B)}}$\\
 493
 494 Population mean (population statistic):
 495 $\mu = \sum_{i=1}^{n}x\mbox{Pr}(x)$
 496
 497 Z-score:
 498 $z=\frac{x-\mu}{\sigma}$
 499
 500 $x=\mu + z\sigma$\\
 501
 502 $\mbox{P}(x)=\frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}$
 503     ~for~ $x=0,1,2,...,n$
 504
 505 $\mu=np$, $\sigma=\sqrt{np(1-p)}$\\
 506
 507 $\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$
 508
 509 $\sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}$
 510
 511 $Q_1 - 1.5 \times IQR, \quad Q_3 + 1.5 \times IQR$
 512
 513
 514
 515 ## Answer Report
 516 Finally, let's generate a report that summarizes your answers to this evaluation.
 517
 518 Answers are written to a file that looks like this: `question_submission-{CURRENT TIME}.csv`. We can actually quickly analyze them.
 519
 520 ```{r report1, exercise=TRUE}
 521 df
 522 ```
 523
 524 ```{r report2, exercise=TRUE}
 525 mean(df$correct)
 526 ```
 527
 528 ```{r report3, exercise=TRUE}
 529 df %>% group_by(category) %>% summarize(avg = mean(correct))
 530 ```