From: aaronshaw
Date: Thu, 2 May 2019 02:35:15 +0000 (-0500)
Subject: updating to w06 content outline. still no substance
X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/6b6a21a00c0fc0eff5645b65b080c5cae393fe25?ds=inline;hp=d1fbf48eefd7acc23c5d311c256662f830a51a63

updating to w06 content outline. still no substance
---

diff --git a/r_lectures/w06-R_lecture.Rmd b/r_lectures/w06-R_lecture.Rmd
index 1861cb6..56c75cf 100644
--- a/r_lectures/w06-R_lecture.Rmd
+++ b/r_lectures/w06-R_lecture.Rmd
@@ -1,7 +1,8 @@
 ---
-title: "Week 7 R Lecture"
-author: "Jeremy Foote"
-date: "April 4, 2019"
+title: "Week 6 R lecture"
+subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
+author: "Aaron Shaw"
+date: "May 3, 2019"
 output: html_document
 ---

@@ -9,97 +10,17 @@ output: html_document
 knitr::opts_chunk$set(echo = TRUE)
 ```

-## Categorical Data
+## T-tests
+You learned the theory/concepts behind t-tests last week, so here's a brief run-down on how to use built-in functions in R to conduct them and interpret the results.

-The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.
+## ANOVAs

-### Estimating proportions
+Analogous situation with t-tests. Here's a brief introduction to how they work in R.

-If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's?
+## Visualizing confidence intervals

-Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's?
+We spent a lot of time on confidence intervals in the past few weeks. Since they can be so useful, surely we should learn some approaches to incorporating them into data visualizations.

+## Date/time arithmetic

-```{r}
-est = .45
-sample_size = 50
-SE = sqrt(est*(1-est)/sample_size)
-
-conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
-conf_int
-```
-
-What if we had the same result but had sampled 500 people?
-
-
-```{r}
-est = .45
-sample_size = 500
-SE = sqrt(est*(1-est)/sample_size)
-
-conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
-conf_int
-```
-
-### Tabular Data
-
-The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples. Here we make a table of how often each species in the dataset has a sepal width greater than 3.
-
-```{r}
-
-table(iris$Species, iris$Sepal.Width > 3)
-
-```
-
-
-The chi-squared test is a test of how much the frequencies we see in a table differ from what we would expect if there was no difference between the groups.
-
-```{r}
-
-chisq.test(table(iris$Species, iris$Sepal.Width > 3))
-```
-
-The incredibly low p-value means that it is very unlikely that these came from the same distribution and that sepal width differs by species.
-
-
-
-## Using Simulation
-
-When the assumptions of Chi-squared tests aren't met, we can use simulation to approximate how likely a given result is.
-
-The book uses the example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%.
-
-The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications, if the true rate is 10%.
-
-```{r}
-# We write a function that we are going to replicate
-simulation <- function(rate = .1, n = 62){
-  # Draw n random numbers from a uniform distribution from 0 to 1
-  draws = runif(n)
-  # If rate = .4, on average, .4 of the draws will be less than .4
-  # So, we consider those draws where the value is less than `rate` as complications
-  complication_count = sum(draws < rate)
-  # Then, we return the total count
-  return(complication_count)
-}
-
-# The replicate function runs a function many times
-
-simulated_complications <- replicate(5000, simulation())
-
-```
-
-We can look at our simulated complications
-
-```{r}
-
-hist(simulated_complications)
-```
-
-And determine how many of them are as extreme or more extreme than the value we saw. This is the p-value.
-
-```{r}
-
-sum(simulated_complications <= 3)/length(simulated_complications)
-```
-
+Last, but not least, another wrinkle in time...or at least how to manage date-time objects in R.
\ No newline at end of file