r_lectures/w07-R_lecture.Rmd

---
title: "Week 7 R Lecture"
author: "Jeremy Foote"
date: "April 4, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Categorical Data

The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.

### Estimating proportions

If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's?

Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's?

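```{r}
# Sample proportion and sample size from the survey
est <- .45
sample_size <- 50
# Standard error of a sample proportion
SE <- sqrt(est * (1 - est) / sample_size)

# 95% confidence interval: the estimate plus or minus 1.96 standard errors
conf_int <- c(est - 1.96 * SE, est + 1.96 * SE)
conf_int
```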
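
One way to answer the question about the 50% hypothesis directly is a one-proportion z-test. The sketch below is an addition to the lecture code: it assumes the normal approximation is adequate for this sample, and the names `null_p`, `SE_null`, `z`, and `p_value` are introduced here for illustration. Note that the test computes the standard error from the hypothesized value rather than from the estimate.

```{r}
est <- .45
sample_size <- 50
null_p <- .5  # hypothesized proportion (illustrative name)

# Under the null hypothesis, the standard error is computed with null_p
SE_null <- sqrt(null_p * (1 - null_p) / sample_size)

# How many standard errors is the estimate from the null value?
z <- (est - null_p) / SE_null

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))
c(z = z, p_value = p_value)
```

With 50 respondents the estimate is less than one standard error away from 50%, so we cannot reject the hypothesis. This agrees with the confidence interval, which contains .5.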
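
What if we had found the same proportion but had sampled 500 people?

```{r}
# Same estimate, but with ten times as many respondents
est <- .45
sample_size <- 500
# A larger sample gives a smaller standard error
SE <- sqrt(est * (1 - est) / sample_size)

# The resulting 95% confidence interval is narrower
conf_int <- c(est - 1.96 * SE, est + 1.96 * SE)
conf_int
```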
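
R's built-in `prop.test()` can run the same kind of test from counts. As a sketch, assuming the 45% corresponds to 225 of the 500 respondents (`successes` and `result` are names introduced here): setting `correct = FALSE` turns off the continuity correction so that the test corresponds to a plain z-test. The interval that `prop.test()` reports is a Wilson score interval, so it will differ slightly from the one computed by hand. With 500 respondents the p-value falls below .05 and the interval excludes .5, so here we would reject the 50% hypothesis.

```{r}
successes <- 225  # 45% of 500 respondents (illustrative name)

# Test the null hypothesis that the true proportion is .5
result <- prop.test(successes, n = 500, p = .5, correct = FALSE)

result$p.value   # two-sided p-value
result$conf.int  # 95% confidence interval (Wilson score)
```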

### Tabular Data

The `iris` dataset contains measurements of flower dimensions for three species of iris. It comes packaged with R and is often used in examples. Here we make a table that counts, for each species, how many flowers have a sepal width greater than 3.

```{r}
# Rows are species; columns show whether the sepal width is greater than 3
table(iris$Species, iris$Sepal.Width > 3)
```
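
Counts are often easier to compare as proportions. As a small sketch (the names `width_table` and `row_props` are introduced here), `prop.table()` with `margin = 1` divides each row by its total, giving the share of each species with a sepal width greater than 3.

```{r}
width_table <- table(iris$Species, iris$Sepal.Width > 3)

# margin = 1 computes proportions within each row (species)
row_props <- prop.table(width_table, margin = 1)
row_props
```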

The chi-squared test measures how much the frequencies we observe in a table differ from the frequencies we would expect if there were no difference between the groups.

```{r}
chisq.test(table(iris$Species, iris$Sepal.Width > 3))
```
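
To make the comparison concrete: under the null hypothesis, the expected count for each cell is its row total times its column total, divided by the grand total. The following sketch, with variable names chosen here for illustration, computes the expected counts and the test statistic by hand. The statistic should match the `X-squared` value that `chisq.test()` reports.

```{r}
observed <- table(iris$Species, iris$Sepal.Width > 3)

# Expected counts if species and sepal width were independent
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of squared differences, each scaled by its expected count
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
c(chi_sq = chi_sq, deg_free = deg_free, p_value = p_value)
```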

The very low p-value means that a table like this would be very unlikely if the proportion of flowers with a sepal width greater than 3 were the same for every species. We therefore reject the null hypothesis and conclude that sepal width differs by species.

## BONUS: Using simulation to test hypotheses and calculate "exact" p-values

When the assumptions of $\chi^2$ tests aren't met, we can use simulation to approximate how likely a given result is. The material here comes from the final two sections of Chapter 6 of the *OpenIntro* textbook. The book uses the example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%.

The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications if the true rate were 10%.

```{r}
# We write a function that we are going to replicate
simulation <- function(rate = .1, n = 62){
  # Draw n random numbers from a uniform distribution from 0 to 1
  draws <- runif(n)
  # If rate = .4, on average, .4 of the draws will be less than .4
  # So, we consider those draws where the value is less than `rate` as complications
  complication_count <- sum(draws < rate)
  # Then, we return the total count
  return(complication_count)
}

# The replicate function runs a function many times
simulated_complications <- replicate(5000, simulation())
```
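
Each run of `simulation()` amounts to one draw from a binomial distribution, so the same simulation can be sketched more compactly with `rbinom()`. This is an alternative offered for comparison, not the lecture's method; `simulated_binom` is an illustrative name, and the seed value is arbitrary and only makes the run reproducible.

```{r}
set.seed(2019)  # arbitrary seed so the results are reproducible

# 5000 simulated sets of 62 procedures, each procedure having
# a 10% chance of a complication
simulated_binom <- rbinom(5000, size = 62, prob = .1)

# The average count should be close to 62 * .1 = 6.2
mean(simulated_binom)
```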

We can look at the distribution of our simulated complication counts:

```{r}
hist(simulated_complications)
```

Then we calculate the proportion of simulations that are as extreme as or more extreme than the value we saw (3 or fewer complications). This is the simulated one-sided p-value, which approximates the "exact" p-value.

```{r}
# Share of simulations with 3 or fewer complications
sum(simulated_complications <= 3)/length(simulated_complications)
```
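
Because the number of complications follows a binomial distribution under the null hypothesis, the simulated value can be checked against the exact tail probability. A sketch using base R's `pbinom()` and `binom.test()` (the variable names are illustrative):

```{r}
# Exact probability of 3 or fewer complications in 62 procedures
# when the true complication rate is 10%
exact_p <- pbinom(3, size = 62, prob = .1)
exact_p

# binom.test() reports the same one-sided p-value
test_p <- binom.test(3, n = 62, p = .1, alternative = "less")$p.value
test_p
```

The simulated proportion should land close to this value (about .12), with small differences from run to run because of random sampling.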