---
title: "Week 6 R Lecture"
author: "Jeremy Foote"
date: "April 4, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Categorical Data

The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.

### Estimating proportions

If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's? Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's?

```{r}
est = .45
sample_size = 50
# Standard error of a sample proportion
SE = sqrt(est*(1-est)/sample_size)
# 95% confidence interval: estimate plus or minus 1.96 standard errors
conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
conf_int
```

The interval runs from about 0.31 to 0.59. It includes 0.5, so we cannot reject the hypothesis that 50% of Chicagoans prefer Giordano's.

What if we had the same result but had sampled 500 people?

```{r}
est = .45
sample_size = 500
SE = sqrt(est*(1-est)/sample_size)
conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
conf_int
```

With the larger sample, the interval narrows to about 0.41 to 0.49. It no longer includes 0.5, so we can reject the hypothesis at the 5% level.

### Tabular Data

The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples.

Here we make a table of how often each species in the dataset has a sepal width greater than 3.

```{r}
table(iris$Species, iris$Sepal.Width > 3)
```

The chi-squared test is a test of how much the frequencies we see in a table differ from what we would expect if there were no difference between the groups.

```{r}
chisq.test(table(iris$Species, iris$Sepal.Width > 3))
```

The very low p-value means that we would be very unlikely to see differences this large if all three species had the same distribution of sepal widths. This is evidence that sepal width differs by species.

## Using Simulation

When the assumptions of chi-squared tests aren't met, we can use simulation to approximate how likely a given result is.

The book uses the example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%.
The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications if the true rate is 10%.

```{r}
# We write a function that we are going to replicate
simulation <- function(rate = .1, n = 62){
  # Draw n random numbers from a uniform distribution from 0 to 1
  draws = runif(n)
  # If rate = .4, on average, .4 of the draws will be less than .4
  # So, we consider those draws where the value is less than `rate` as complications
  complication_count = sum(draws < rate)
  # Then, we return the total count
  return(complication_count)
}

# The replicate function runs a function many times
simulated_complications <- replicate(5000, simulation())
```

We can look at our simulated complications:

```{r}
hist(simulated_complications)
```

And determine what proportion of them are as extreme as or more extreme than the value we saw. This proportion is the p-value.

```{r}
sum(simulated_complications <= 3)/length(simulated_complications)
```
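
Because the simulation treats each procedure as an independent event with a 10% chance of a complication, the number of complications follows a binomial distribution, so the simulated p-value can be checked against an exact calculation. The chunk below is a minimal sketch of that check and is not part of the original lecture: the names `exact_p`, `binom_complications`, and `simulated_p` are made up for illustration, and the seed is an arbitrary choice added only to make the result reproducible.

```{r}
# Exact probability of 3 or fewer complications in 62 procedures
# when the true complication rate is 10%
exact_p <- pbinom(3, size = 62, prob = 0.1)
exact_p

# The same simulation written with rbinom(), which draws the
# complication counts directly instead of counting uniform draws
set.seed(42)  # arbitrary seed, assumed here for reproducibility
binom_complications <- rbinom(5000, size = 62, prob = 0.1)

# Proportion of simulated counts that are 3 or fewer
simulated_p <- mean(binom_complications <= 3)
simulated_p
```

The two values should be close, and they should get closer as the number of simulated draws increases. When an exact distribution is available, it is a useful sanity check on a simulation; the simulation approach is still worth knowing because it carries over to cases where no simple formula exists.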