From: aaronshaw
Date: Thu, 2 May 2019 02:23:29 +0000 (-0500)
Subject: just updating the name here from wk 6 to wk 7
X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/f4347aef36223c656d1e8fc2bc209fe4d09b6cb9?ds=sidebyside

just updating the name here from wk 6 to wk 7
---

diff --git a/r_lectures/w07-R_lecture.Rmd b/r_lectures/w07-R_lecture.Rmd
new file mode 100644
index 0000000..1861cb6
--- /dev/null
+++ b/r_lectures/w07-R_lecture.Rmd
@@ -0,0 +1,105 @@
+---
+title: "Week 7 R Lecture"
+author: "Jeremy Foote"
+date: "April 4, 2019"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## Categorical Data
+
+The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.
+
+### Estimating proportions
+
+If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's?
+
+Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's?
+
+
+```{r}
+est = .45
+sample_size = 50
+SE = sqrt(est*(1-est)/sample_size)
+
+conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
+conf_int
+```
+
+What if we had the same result but had sampled 500 people?
+
+
+```{r}
+est = .45
+sample_size = 500
+SE = sqrt(est*(1-est)/sample_size)
+
+conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
+conf_int
+```
+
+### Tabular Data
+
+The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples. Here we make a table of how often each species in the dataset has a sepal width greater than 3.
+
+```{r}
+
+table(iris$Species, iris$Sepal.Width > 3)
+
+```
+
+
+The chi-squared test is a test of how much the frequencies we see in a table differ from what we would expect if there were no difference between the groups.
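To make "what we would expect" concrete, the expected counts under independence can be computed directly from the row and column totals of the table. The chunk below is a minimal sketch rather than part of the original lecture; the names `observed`, `expected`, and `chi_sq` are introduced here purely for illustration.

```{r}
# Observed counts: species by whether sepal width exceeds 3
observed <- table(iris$Species, iris$Sepal.Width > 3)

# Expected count in each cell if species and sepal width were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# The chi-squared statistic sums the squared deviations from the expected
# counts, each scaled by its expected count
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq
```

For a table larger than 2x2, `chisq.test()` applies no continuity correction, so `chi_sq` should agree with the statistic it reports.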
+
+```{r}
+
+chisq.test(table(iris$Species, iris$Sepal.Width > 3))
+```
+
+The extremely low p-value means that counts like these would be very unlikely if all three species shared the same distribution of sepal widths, so we conclude that sepal width differs by species.
+
+
+
+## Using Simulation
+
+When the assumptions of chi-squared tests aren't met, we can use simulation to approximate how likely a given result is.
+
+The book uses the example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%.
+
+The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications if the true rate is 10%.
+
+```{r}
+# We write a function that we are going to replicate
+simulation <- function(rate = .1, n = 62){
+  # Draw n random numbers from a uniform distribution from 0 to 1
+  draws = runif(n)
+  # For example, if rate = .4, on average, .4 of the draws will be less than .4
+  # So, we consider those draws where the value is less than `rate` as complications
+  complication_count = sum(draws < rate)
+  # Then, we return the total count
+  return(complication_count)
+}
+
+# The replicate function runs a function many times
+
+simulated_complications <- replicate(5000, simulation())
+
+```
+
+We can look at the distribution of our simulated complication counts:
+
+```{r}
+
+hist(simulated_complications)
+```
+
+We can then determine what proportion of them are as extreme as, or more extreme than, the value we saw. This proportion is the (one-sided) p-value.
+
+```{r}
+
+sum(simulated_complications <= 3)/length(simulated_complications)
+```
+
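As a sanity check on the simulated p-value, the same probability can be computed exactly, because the number of complications in 62 independent procedures follows a binomial distribution under the null hypothesis. The chunk below is a sketch under that assumption and is not part of the original lecture; the names `exact_p`, `sim_counts`, and `sim_p` and the seed value are hypothetical choices made only to keep the example self-contained and reproducible.

```{r}
# Exact probability of 3 or fewer complications in 62 procedures
# when the true complication rate is 10%
exact_p <- pbinom(3, size = 62, prob = 0.1)

# An equivalent simulation that draws the complication counts directly
set.seed(2019)  # hypothetical seed, used only for reproducibility
sim_counts <- rbinom(5000, size = 62, prob = 0.1)
sim_p <- mean(sim_counts <= 3)

# The two values should be close; the gap shrinks with more replications
c(exact = exact_p, simulated = sim_p)
```

Drawing with `rbinom()` is a shortcut for the `runif()`-based function in the lecture: both count how many of 62 independent trials come up as complications at a 10% rate.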