assessment/interactive_assessment.rmd

   1 ---
   2 title: "Interactive Self-Assessment"
   3 subtitle: "Fall 2020 MTS 525 / COMMST 395 Statistics and Statistical Programming"
   4 output: learnr::tutorial
   5 runtime: shiny_prerendered
   6 ---
   7
   8
   9 ```{r setup, include=FALSE}
  10 library(learnr)
  11 library(tidyverse)
  12
  13 knitr::opts_chunk$set(echo = FALSE, tidy=TRUE)
  14
  15 t <- Sys.time()
  16 question_filename <- paste("question_submission_", t, ".csv", sep="")
  17 code_filename <- paste("code_", t, ".csv", sep="")
  18
  19 #df <- data.frame(label=c('test'), question=c('asd'), answer=c('asd'), correct=c(TRUE), stringsAsFactors=FALSE)
  20 df <- data.frame()
  21 print('*')
  22 print(df)
  23
  24 tutorial_event_recorder <- function(tutorial_id, tutorial_version, user_id,
  25                                     event, data) {
  26   # quiz question answered
  27   if (event == "question_submission"){
  28     # nick exasperatedly believes this is the correct way to index the result of strsplit... [[1]][[1]]
  29     data$category <- strsplit(data$label, '_')[[1]][[1]]
  30     #print(data)
  31
  32     df <<- rbind(df, data, stringsAsFactors=FALSE)
  33     #write.table(data, question_filename, append=TRUE, sep=",", row.names=TRUE, col.names=FALSE)
  34     write.table(df, question_filename, append=FALSE, sep=",", row.names=TRUE, col.names=TRUE)
  35
  36   }
  37   # code
  38   if (event == "exercise_submitted"){
  39     write.table(data, code_filename, append=TRUE, sep=",", row.names=TRUE, col.names=FALSE)
  40   }
  41
  42 }
  43 options(tutorial.event_recorder = tutorial_event_recorder)
  44 ```
  45
  46
  47
  48 ## Overview
  49
  50 TODO add a short description. State the number of questions that will be asked. Include expectations about time commitment. Explain the idea of the solution report
  51
  52 Note that you can clear **all** your answers to *all* questions by clicking "Start Over" in the left-hand sidebar, but doing that basically erases all progress in the document and your answers to any questions will be deleted. *Use with caution* (if at all)!
  53
  54 ## Section 1: Warmup exercises
  55
  56 TODO add a short description of this section.
  57
  58 ### Code Chunk Warm-up
  59
  60 To get familiar with how code chunks work in `learnr`, let's write R code required to add two numbers: 1234 and 5678 (and the answer is 6912).
  61
  62 The code chunk below is editable and is "pre-populated" with an unfinished function definition. The goal is to add arguments and fill in the body of the function. When finished, you can run the code chunk and it should produce the answer.
  63
  64 If you click "Run Code", you should see the answer below the chunk. That answer will persist as you navigate around this doc.
  65
  66 You can clear your answers by clicking "Start Over" in the top-left of the chunk. You can also clear **all** your answers by clicking "Start Over" in the left-hand sidebar, but doing that basically erases all progress in the document *Use with caution!*
  67
  68 ```{r warmup_1, exercise=TRUE, exercise.lines=10}
  69 add <- function() {
  70
  71 }
  72 x = 1234
  73 y = 5678
  74 add(x,y)
  75 ```
  76
  77 ```{r warmup_1-solution}
  78 add <- function(value1, value2) {
  79   return(value1 + value2)
  80 }
  81
  82 x = 1234
  83 y = 5678
  84
  85 add(x,y)
  86 ```
  87
  88 ### Multiple Choice Question Warmup
  89 The question below shows how the multiple choice answering and "feedback" works.
  90 ```{r warmup_2}
  91 quiz(
  92   question("Select the answer choice that will return `TRUE` in R.",
  93     answer("1 == 1", message="Good work! Feedback appears here.", correct=TRUE),
  94     answer("1 == 0", message="Not quite! Feedback appears here.")
  95   )
  96 )
  97 ```
  98
  99 ### Debugging a Function
 100 Below, you'll see code to define a function that is *supposed* to perform a transformation on a vector. The problem is that it doesn't work right now.
 101
 102 In theory, the function will take a numeric vector as input (let's call it $x$) and scale the values so they lie between zero and one. [^1] The way it *should* do this is by first subtracting the minimum value of $x$ from each element of $x$. Then, the function will divide each element by the difference between the maximum value of $x$ and the minimum value of $x$.
 103
 104 [^1]: This is sometimes called min-max [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling), and is sometimes used for machine learning.
 105
 106 As written now, however, the function does not work! There are at least three issues you will need to fix to get it working. Once you fix them, you should be able to confirm that your function works with the pre-populated example (with the correct output provided). You might also be able to make this code more "elegant" (or alternatively, improve the comments and variable names as you see fit).
 107
 108 Bonus: how might we update this function to scale between any "floor" and "ceiling" value?
 109
 110 ```{r R_debug1, exercise=TRUE}
 111 zeroToOneRescaler <- function() {
 112   # the minimum value
 113   minval <- min(x)
 114   # let's "shift" our vector by subtracting the minimum value of x from each element
 115   shifted <- x - minval
 116
 117   # let's find the difference between max val and min val
 118   difference <- min(x) - max(x)
 119
 120   scaled <- shifted / difference
 121   scaled
 122 }
 123
 124 test_vector = c(1,2,3,4,5)
 125 zeroToOneRescaler(test_vector)
 126 # Should print c(0, 0.25, 0.5, 0.75, 1.00)
 127 ```
 128
 129 ```{r R_debug1-solution}
 130 zeroToOneRescaler <- function(x) {
 131   shifted <- x - min(x)
 132   difference = max(x) - min(x)
 133   return(shifted / difference)
 134 }
 135
 136 test_vector = c(1,2,3,4,5)
 137 zeroToOneRescaler(test_vector)
 138 # Should print c(0, 0.25, 0.5, 0.75, 1.00)
 139 ```
 140
 141 ```{r R_debug1-response}
 142 quiz(
 143   question("Were you able to solve the debugging question? (this question is for feedback purposes)",
 144     answer("Yes", message="Nice work!", correct = TRUE),
 145     answer("No", message="")
 146   )
 147 )
 148 ```
 149
 150
 151 The following commented chunk has at least five (annoying) bugs. Can you uncomment the code, fix all the bugs, and get this chunk to run? These are drawn from real experiences from your TA!
 152 ```{r R_debug2, exercise=TRUE}
 153 # ps2 <- readcsv(file = url(
 154 #   " https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"), row.names = NULL
 155 # )
 156 #
 157 # ps2$y[is.na(ps2$y)] <- 0
 158 # "ps2$'My First New Column' <- ps2$y * -1"
 159 # ps2$'My Second New Column" <- ps2$y + ps2$'My First New Column'
 160 #
 161 # summary(ps2$'My Second New Column']
 162 ```
 163
 164 ```{r R_debug2-solution}
 165 ps2 <- read.csv(file = url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"), row.names = NULL)
 166 ps2$y[is.na(ps2$y)] <- 0
 167 ps2$'My First New Column' <- ps2$y * -1
 168 ps2$'My Second New Column' <- ps2$y + ps2$'My First New Column'
 169 summary(ps2$'My Second New Column')
 170 ```
 171
 172 ```{r R_debug2-response}
 173 quiz(
 174   question("Were you able to solve the above debugging question? (this question is for feedback purposes)",
 175            answer("Yes", message="Nice work!", correct = TRUE),
 176            answer("No", message="")
 177   )
 178 )
 179 ```
 180
 181 ### Updating a visualization
 182 Imagine you've created a histogram to visualize some data from your research (below, we'll use R's built-in "PlantGrowth" dataset). You show your collaborator a histogram of this plot using default R, and they express some concerns about your plot's aesthetics. Replace the base-R histogram with a `ggplot2` histogram that also includes a density plot overlaid on it (maybe in a bright, contrasting color like red).
 183
 184 ```{r R_ggplot, exercise=TRUE}
 185 data("PlantGrowth")
 186 hist(PlantGrowth$weight)
 187 ```
 188
 189 ```{r R_ggplot-solution}
 190 library(ggplot2)
 191
 192 ggplot(PlantGrowth, aes(weight, after_stat(density))) + geom_histogram() + geom_density(color = "red")
 193 ```
 194
 195 Bonus: How would you find more information about the source of this dataset?
 196
 197
 198 ### Interpret a dataframe
 199 ```{r R_columns-setup, exercise=TRUE}
 200 data <- mtcars
 201 data$mpgGreaterThan20 <- data$mpg > 20
 202 data$gear <- as.factor(data$gear)
 203 data$mpgRounded <- round(data$mpg)
 204 ```
 205
 206 The below questions relate to the `data` data.frame defined above, which is a modified version of the classic `mtcars`.
 207
 208 For all answers, assume the above code chunks *has completely run*, i.e. assume all modifications described above were made.
 209 ```{r R_columns}
 210 quiz(
 211   question("Which of the following best describes the `mpg` variable?",
 212     answer("Numeric, continuous", correct=TRUE),
 213     answer("Numeric, discrete"),
 214     answer("Categorical, dichotomous"),
 215     answer("Categorical, ordinal"),
 216     answer("Categorical")
 217   ),
 218   question("Which of the following best describes the `mpgGreaterThan20` variable?",
 219     answer("Numeric, continuous"),
 220     answer("Numeric, discrete"),
 221     answer("Categorical, dichotomous", correct=TRUE),
 222     answer("Categorical, ordinal"),
 223     answer("Categorical")
 224   ),
 225   question("Which of the following best describes the `mpgRounded` variable?",
 226     answer("Numeric, continuous"),
 227     answer("Numeric, discrete", correct=TRUE),
 228     answer("Categorical, dichotomous"),
 229     answer("Categorical, ordinal"),
 230     answer("Categorical")
 231   ),
 232   question("Which of the following best describes the `gear` variable?",
 233     answer("Numeric, continuous"),
 234     answer("Numeric, discrete"),
 235     answer("Categorical, dichotomous"),
 236     answer("Categorical, ordinal", correct=TRUE),
 237     answer("Categorical")
 238   )
 239 )
 240 ```
 241
 242 ## Section 2
 243 The following is a series of short multiple choice questions. These questions focus on definitions, and should not require performing any computations or writing any code.
 244 ```{r Stats_lightninground}
 245 m1 <- ""
 246 wolf <- "Think of the 'Boy who cried wolf', with a null hypothesis that no wolf exists. First the boy claims the alternative hypothesis: there is a wolf. The villagers believe this, and reject the correct null hypothesis. Second, the villagers make an error by not believing the boy when he presents a correct alternative hypothesis."
 247
 248 quiz(
 249   question("A hypothesis is typically concerned with a:",
 250     answer("population statistic.", correct = TRUE),
 251     answer("sample statistic.")
 252   ),
 253   question("A sampling distribution is:",
 254     answer("critical to report in your papers."),
 255     answer("theoretically helpful, but rarely available to researchers in practice.", correct = TRUE),
 256     answer("practically useful, but not relies on assumptions that are rarely met.")
 257   ),
 258   question("Z-scores tell us about a value in terms of:",
 259     answer("mean and standard deviation.", correct = TRUE),
 260     answer("sample size and sampling strategy."),
 261     answer("if an effect is causal or not.")
 262   ),
 263   question("A distribution that is right-skewed has a long tail to the:",
 264     answer("right", correct = TRUE),
 265     answer("left")
 266   ),
 267   question("A normal distribution can be characterized with only this many parameters:",
 268     answer("1"),
 269     answer("2", correct = TRUE),
 270     answer("3")
 271   ),
 272   question("When we calculate standard error, we calculate",
 273     answer("using a different formula for every type of variable."),
 274     answer("the sample standard error, which is an estimate of the population standard error.", correct = TRUE),
 275     answer("whether or not our result is causal.")
 276   ),
 277   question("When we calculate standard error, we calculate",
 278     answer("using a different formula for every type of variable."),
 279     answer("the sample standard error, which is an estimate of the population standard error.", correct = TRUE),
 280     answer("whether or not our result is causal.")
 281   ),
 282   question("P values tell us about",
 283     answer("the world in which our null hypothesis is true.", correct = TRUE),
 284     answer("the world in which our null hypothesis is false."),
 285     answer("the world in which our data describe a causal effect")
 286   ),
 287   question("P values are",
 288     answer("a conditional probability.", correct = TRUE),
 289     answer("completely misleading."),
 290     answer("only useful when our data has a normal distribution.")
 291   ),
 292   question("A type 1 error occurs when",
 293     answer("when we reject a correct null hypothesis (i.e. false positive).", correct = TRUE, message=wolf),
 294     answer("when we accept a correct null hypothesis", message=wolf),
 295     answer("when we accept an incorrect null hypothesis (i.e. false negative)", message=wolf)
 296   ),
 297   question("Before we assume independence of two random samples, it is useful to check that",
 298     answer("both samples include over 90% of the population."),
 299     answer("both samples include less than 10% of the population.", correct = TRUE)
 300   )
 301 )
 302 ```
 303
 304 ## Section 3
 305
 306 ### About this Section
 307
 308 The following questions are in the style of pen-and-paper statistics class exam questions. There a few sections that you may want or need to run some R code; there are a variety of empty "scratch paper" code chunks for this purpose. Note that this document contains a section with helpful formulas, which you can navigate to via the leftmost column.
 309
 310 ### Sampling
 311
 312 ```{r Stats_sampling}
 313 quiz(
 314   question("A political scientist is interested in the effect of government type on economic development.
 315 She wants to use a sample of 30 countries evenly represented among the Americas, Europe,
 316 Asia, and Africa to conduct her analysis. What type of study should she use to ensure that
 317 countries are selected from each region of the world? Assume a limitied research budget.",
 318     answer("Observational - simple random sample"),
 319     answer("Observational - cluster"),
 320     answer("Observational - stratifed", correct=TRUE),
 321     answer("Experimental")
 322   )
 323 )
 324 ```
 325
 326 For the following question, you may want to use this "scratch paper" code chunk.
 327 ```{r Stats_quartile-scratch, exercise=TRUE}
 328
 329 ```
 330
 331 ```{r Stats_quartile}
 332 quiz(
 333   question("Heights of boys in a high school are approximately normally distributed with mean of 175 cm
 334 standard deviation of 5 cm. What is the first quartile of heights?",
 335     answer("25 cm"),
 336     answer("167.3 cm"),
 337     answer("171.7 cm", correct=TRUE),
 338     answer("173.5 cm"),
 339     answer("178.3 cm")
 340   )
 341 )
 342 ```
 343
 344
 345 ### Outliers and Skew
 346 Suppose we are reading a paper which reports the following about a column of a dataset:
 347
 348 Minimum value is 0.00125 and Maximum Value is 2.1100.
 349
 350 Mean is 0.41100 and median is 0.27800.
 351
 352 1st quartile is 0.13000 and 3rd quartile is 0.56200.
 353
 354 ```{r Stats_summary}
 355 m1 <- "Under R's default setting, outliers are values that are either greater than the upper bound $Q_3 + 1.5\\times IQR$ OR less than the lower bound $Q_1 - 1.5\\times IQR$. Here, $IQR = 0.562-0.130=0.432$. The upper bound $= 0.562 + 1.5\\times (0.432) = 1.21$. The lower bound is $0.13 - 1.5\\times (0.432) = -0.518$. We see that the maximum value is 2.11, greater than the upper bound. Thus, there is at least one outlier in this sample."
 356
 357 m2 <- "There is at least one outlier on the right, whereas there is none on the left. $|Q_3-Q_2| > |Q_2-Q_1|$, so the whisker for this box plot would be longer on the right-hand side. The mean is larger than the median."
 358 quiz(
 359   question("Are there outliers (in terms of IQR) in this sample?",
 360     answer("Yes", correct = TRUE, message=m1),
 361     answer("No", message="asd")
 362   ),
 363   question("Based on these summary statistics, we might expect the skew of the distribution to be:",
 364     answer("left-skewed", message=m2),
 365     answer("right-skewed", message=m2, correct=TRUE),
 366     answer("symmetric", message=m2)
 367   )
 368 )
 369 ```
 370
 371
 372 ### Computing Probabilities
 373 For each of the below questions, you will need to calculate some probabilities by hand.
 374 You may want to use this "scratch paper" code chunk (possibly in conjunction with actual paper).
 375
 376 ```{r Stats_probs-scratch, exercise=TRUE}
 377
 378 ```
 379
 380 ```{r Stats_probs}
 381 m1 <- "$P(\\text{Coffee} \\cap \\text{No Milk}) = P(\\text{Coffee})\\cdot P(\\text{No Milk}) = 0.5 \\cdot (1-0.1)  = 0.45$"
 382
 383 m2 <- "Let H be the event of hypertension, M be event of being a male. We see here that $P(H) = 0.15$ whereas $P(H|M) = 0.18$. Since $P(H) \\neq P(H|M)$, then hypertension is not independent of sex."
 384
 385 m3 <- "$P(HIV \\cap HCV) = P(HIV|HCV)\\cdot P(HCV) = 0.1\\cdot 0.02 = 0.002$"
 386
 387 quiz(
 388   question("Suppose in a population, half prefer coffee to tea, and assume that 10 percent of the population does not put milk in their coffee or tea. If coffee vs. tea preference and cow milk are independent, what fraction of the population both prefers coffee and does put milk in their coffee?",
 389     answer("40%", message=m1),
 390     answer("45%", correct = TRUE, message=m1),
 391     answer("50%", message=m1),
 392     answer("55%", message=m1)
 393   ),
 394   question("In the general population, about 15 percent of adults between 25 and 40 years of age are hypertensive.  Suppose that among males of this age, hypertension occurs about 18 percent of the time. Is hypertension independent of sex? ",
 395     answer("Yes", message=m2),
 396     answer("No.", correct=TRUE, message=m2)
 397   ),
 398   question("What might you search for (in Google, your notes, the OpenIntro PDF, etc.) to help with this question?",
 399     answer("t test"),
 400     answer("laws of probability", correct=TRUE),
 401     answer("linear regression"),
 402     answer("R debugging")
 403   ),
 404   question("Co-infection with HIV and hepatitis C (HCV) occurs when a patient has both diseases, and is on the rise in some countries. Assume that in a given country, only about 2% of the population has HCV,  but 25% of the population with HIV have HCV.  Assume as well that 10% of the population with HCV have HIV.  What is the probability that a randomly chosen member of the population has both HIV and HCV?",
 405     answer("0.001", message=m3),
 406     answer("0.01", message=m3),
 407     answer("0.002", correct=TRUE, message=m3),
 408     answer("0.02", message=m3)
 409   ),
 410   question("What might you search for (in Google, your notes, the OpenIntro PDF, etc.) to help with this question?",
 411     answer("t test"),
 412     answer("laws of probability", correct=TRUE),
 413     answer("linear regression"),
 414     answer("R debugging")
 415   )
 416 )
 417 ```
 418
 419 ### Calculating Probabilities: A Biostats Example
 420 This question is adapted from a biostats midterm exam.
 421 In the past (2015, to be specific), the US Preventive Services
 422 Task Force recommended that women under the age of 50 should
 423 not get routine mammogram screening for breast cancer.  The Task Force
 424 argued that for a woman with a positive mammogram (one suggesting the
 425 presence of breast cancer), the chance that she has breast cancer was
 426 too low to justify a surgical biopsy.
 427
 428 Suppose the data below describe a cohort of 100,000 women age 40 -
 429 49 in whom mammogram screening and breast cancer behaves just like the
 430 larger population.  For instance, in this table, the 3,333 women with
 431 breast cancer represent a rate of 1 in 30 women with undiagnosed
 432 cancer. The numbers in the table are realistic for US women in this
 433 age category.
 434
 435 Has Breast Cancer: 3,296 Positive Test Results and 37 negative test results (3,333 total)
 436
 437 Does not Have Breast Cancer: 8,313 Positive Test Results and 88,354 negative test results (96,667 total)
 438
 439 First, compute the "margins" of the above contingency table.
 440 Row margins: How many total women have breast cancer? How many total women do not have breast cancer?
 441 Column margins: How many total positive test? How many total negative tests?
 442 ```{r Stats_mammogram-chunk, exercise=TRUE}
 443
 444 ```
 445
 446 ```{r Stats_mammogram}
 447 m1 <- "
 448 $\\Pr(\\textrm{Test}^+ \\cap \\textrm{Cancer}) = 3,296$
 449
 450 $\\Pr(Cancer) = 3,333$
 451
 452 $\\Pr(\\textrm{Test}^+|\\textrm{Cancer}) =$ \
 453 $\\dfrac{\\Pr(\\textrm{Test}^+ \\cap \\textrm{Cancer})}{\\Pr(\\textrm{Cancer})} =$\
 454 $\\dfrac{3,296}{3,333} = 0.989$"
 455
 456 m2 <- "
 457 $Pr(\\textrm{Cancer}|\\textrm{Test}^+) =$
 458
 459 $\\dfrac{\\Pr(\\textrm{Cancer} \\cap \\textrm{Test}^+)}
 460      {\\Pr(\\textrm{Test}^+)}=$
 461
 462
 463  $\\dfrac{3,296}{11,609} = 0.284$"
 464
 465 quiz(
 466   question("Based on this data, what is the probability that a woman has a positive test given that women has cancer?",
 467     answer("98.9%", correct = TRUE, message=m1),
 468     answer("99.9%",message=m1),
 469     answer("89.9%",message=m1),
 470     answer("88.9%",message=m1)
 471   ),
 472   question("Based on this data, what is the probability that a woman has cancer receives a positive test?",
 473     answer("28.4%", correct = TRUE,message=m2),
 474     answer("10.3%",message=m2),
 475     answer("50.7%",message=m2),
 476     answer("97.9%",message=m2)
 477   ),
 478   question("Is the Task Force correct to claim that there is a low probability that a women between 40-49 who tests positive has breast cancer?",
 479     answer("Yes", correct=TRUE),
 480     answer("No")
 481   )
 482 )
 483 ```
 484
 485
 486
 487 ## Useful Formulas
 488 Sample Mean (sample statistic):
 489 $\bar{x}=\frac{\sum_{i=1}^n x_i}{n}$ |
 490 Standard deviation:
 491 $s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}}$ |
 492 Variance:
 493 $var = s^2$
 494
 495 Useful probability axioms:
 496 $\mbox{Pr}(A^c)=1-\mbox{Pr}(A)$ | Pr(A and B) = Pr(A) $\times$ Pr(B) | Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
 497
 498 $\mbox{Pr}(A|B)=\frac{\mbox{Pr(A and B)}}{\mbox{Pr(B)}}$\\
 499
 500 Population mean (population statistic):
 501 $\mu = \sum_{i=1}^{n}x\mbox{Pr}(x)$
 502
 503 Z-score:
 504 $z=\frac{x-\mu}{\sigma}$
 505
 506 $x=\mu + z\sigma$\\
 507
 508 $\mbox{P}(x)=\frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}$
 509     ~for~ $x=0,1,2,...,n$
 510
 511 $\mu=np$, $\sigma=\sqrt{np(1-p)}$\\
 512
 513 $\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$
 514
 515 $\sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}$
 516
 517 $Q_1 - 1.5 \times IQR, \quad Q_3 + 1.5 \times IQR$
 518
 519
 520
 521 ## Answer Report
 522 Finally, let's generate a report that summarizes your answers to this evaluation.
 523
 524 Answers are written to a file that looks like this: `question_submission-{CURRENT TIME}.csv`. We can actually quickly analyze them.
 525
 526 ```{r report1, exercise=TRUE}
 527 df
 528 ```
 529
 530 ```{r report2, exercise=TRUE}
 531 mean(df$correct)
 532 ```
 533
 534 ```{r report3, exercise=TRUE}
 535 df %>% group_by(category) %>% summarize(avg = mean(correct))
 536 ```