From: aaronshaw Date: Tue, 20 Oct 2020 17:54:20 +0000 (-0500) Subject: initial commit of categorical data analysis examples X-Git-Url: https://code.communitydata.science/stats_class_2020.git/commitdiff_plain/f588b148aef8c3e05f61d1c0c245a0c4bede1693?ds=sidebyside initial commit of categorical data analysis examples --- diff --git a/r_tutorials/w06-R_tutorial.html b/r_tutorials/w06-R_tutorial.html new file mode 100644 index 0000000..2d62246 --- /dev/null +++ b/r_tutorials/w06-R_tutorial.html @@ -0,0 +1,1690 @@ + + + + + + + + + + + + + + + +Week 6 R Lecture + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+

Analyzing categorical data in R

+

The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.

+
+

Estimating proportions

+

If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano’s made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano’s?

+

Can we reject the hypothesis that 50% of Chicagoans prefer Giordano’s?

+
est = .45
+sample_size = 50
+SE = sqrt(est*(1-est)/sample_size)
+
+conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
+conf_int
+
## [1] 0.3121018 0.5878982
+

What if we had the same result but had sampled 500 people?

+
est = .45
+sample_size = 500
+SE = sqrt(est*(1-est)/sample_size)
+
+conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
+conf_int
+
## [1] 0.4063928 0.4936072
+
+
+

Tabular Data

+

The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples. Here we make a table of how often each species in the dataset has a sepal width greater than 3.

+
data(iris)
+table(iris$Species, iris$Sepal.Width > 3)
+
##             
+##              FALSE TRUE
+##   setosa         8   42
+##   versicolor    42    8
+##   virginica     33   17
+

As described in the most recent OpenIntro reading, the chi-squared (\(\chi^2\)) test is a test of whether the frequencies we see in a table differ from what we would expect if there was no difference between the groups. Base-R provides a chisq.test() function to conduct these tests. Here’s an example of the test evaluating whether the frequency of Iris sepal widths greater than 3 differs by species (the null hypothesis is that there is no difference in the frequencies and the groups are drawn from the same population). It’s always good to examine the contingency table before conducting the test!

+
table(iris$Species, iris$Sepal.Width > 3)
+
##             
+##              FALSE TRUE
+##   setosa         8   42
+##   versicolor    42    8
+##   virginica     33   17
+
chisq.test(table(iris$Species, iris$Sepal.Width > 3))
+
## 
+##  Pearson's Chi-squared test
+## 
+## data:  table(iris$Species, iris$Sepal.Width > 3)
+## X-squared = 50.225, df = 2, p-value = 1.241e-11
+

While we know little about the dataset, sample, or measurement, the incredibly low p-value means that the observed frequencies likely did not come from the same distribution and that the frequency of sepal widths more/less than 3 differs by species.

+

Note that you can create tables to conduct \(\chi^2\) tests by hand. The chisq.test() function will need the object to be of “class” table or matrix in order to run the test.

+
+
+
+

BONUS: Using simulation to test hypotheses and calculate “exact” p-values

+

Section 6.1.4 of OpenIntro mentions a simulation-based approach to inference when assumptions of \(\chi^2\) tests aren’t met. You possess all the tools to implement something like this in R, so let’s do it. Basically, the idea is that we use simulated data to construct a null distribution of outcomes and then compare our observed test statistic against the null to estimate an “exact” p-value corresponding to the proportion of samples drawn from the null would produce statistics at least as large as our observed data.

+

The earlier (3rd) edition of Chapter 6 of the OpenIntro textbook uses an example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%. We can use that information to create an example here based on the proportions. The null hypothesis is that this practitioner’s true rate is also 10%, so we’re trying to figure out how rare it would be to have 3 or fewer complications, if the true rate is 10%.

+
# We write a function that we are going to replicate
+simulation <- function(rate = .1, n = 62){
+  # Draw n random numbers from a uniform distribution from 0 to 1
+  draws = runif(n)
+  # If rate = .4, on average, .4 of the draws will be less than .4
+  # So, we consider those draws where the value is less than `rate` as complications
+  complication_count = sum(draws < rate)
+  # Then, we return the total count
+  return(complication_count)
+}
+
+# The replicate function runs a function many times
+simulated_complications <- replicate(5000, simulation())
+

We can look at our simulated complications

+
hist(simulated_complications)
+

+

And determine how many of them are as extreme or more extreme than the value we saw. This is the “exact” p-value.

+
sum(simulated_complications <= 3)/length(simulated_complications)
+
## [1] 0.1324
+
+ + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/r_tutorials/w06-R_tutorial.pdf b/r_tutorials/w06-R_tutorial.pdf new file mode 100644 index 0000000..c3825ce Binary files /dev/null and b/r_tutorials/w06-R_tutorial.pdf differ diff --git a/r_tutorials/w06-R_tutorial.rmd b/r_tutorials/w06-R_tutorial.rmd new file mode 100644 index 0000000..1f031d1 --- /dev/null +++ b/r_tutorials/w06-R_tutorial.rmd @@ -0,0 +1,114 @@ +--- +title: "Week 6 R Lecture" +author: "Jeremy Foote and Aaron Shaw" +date: "October 20, 2020" +output: + html_document: + toc: yes + toc_depth: 3 + toc_float: + collapsed: true + smooth_scroll: true + theme: readable + pdf_document: + toc: yes + toc_depth: '3' + header-includes: + - \newcommand{\lt}{<} + - \newcommand{\gt}{>} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +## Analyzing categorical data in R + +The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation. + +### Estimating proportions + +If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's? + +Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's? + + +```{r} +est = .45 +sample_size = 50 +SE = sqrt(est*(1-est)/sample_size) + +conf_int = c(est - 1.96 * SE, est + 1.96 * SE) +conf_int +``` + +What if we had the same result but had sampled 500 people? + + +```{r} +est = .45 +sample_size = 500 +SE = sqrt(est*(1-est)/sample_size) + +conf_int = c(est - 1.96 * SE, est + 1.96 * SE) +conf_int +``` + +### Tabular Data + +The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples. Here we make a table of how often each species in the dataset has a sepal width greater than 3. + +```{r} +data(iris) +table(iris$Species, iris$Sepal.Width > 3) + +``` + + +As described in the most recent *OpenIntro* reading, the chi-squared ($\chi^2$) test is a test of whether the frequencies we see in a table differ from what we would expect if there was no difference between the groups. Base-R provides a `chisq.test()` function to conduct these tests. Here's an example of the test evaluating whether the frequency of Iris sepal widths greater than 3 differs by species (the null hypothesis is that there is no difference in the frequencies and the groups are drawn from the same population). It's always good to examine the contingency table before conducting the test! + +```{r} +table(iris$Species, iris$Sepal.Width > 3) + +chisq.test(table(iris$Species, iris$Sepal.Width > 3)) +``` + +While we know little about the dataset, sample, or measurement, the incredibly low p-value means that the observed frequencies likely did not come from the same distribution and that the frequency of sepal widths more/less than 3 differs by species. + +Note that you can create tables to conduct $\chi^2$ tests by hand. The `chisq.test()` function will need the object to be of "class" `table` or `matrix` in order to run the test. + +## BONUS: Using simulation to test hypotheses and calculate "exact" p-values + +Section 6.1.4 of *OpenIntro* mentions a simulation-based approach to inference when assumptions of $\chi^2$ tests aren't met. You possess all the tools to implement something like this in R, so let's do it. Basically, the idea is that we use simulated data to construct a null distribution of outcomes and then compare our observed test statistic against the null to estimate an "exact" p-value corresponding to the proportion of samples drawn from the null would produce statistics at least as large as our observed data. + +The earlier (3rd) edition of Chapter 6 of the *OpenIntro* textbook uses an example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%. We can use that information to create an example here based on the proportions. The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications, if the true rate is 10%. + +```{r} +# We write a function that we are going to replicate +simulation <- function(rate = .1, n = 62){ + # Draw n random numbers from a uniform distribution from 0 to 1 + draws = runif(n) + # If rate = .4, on average, .4 of the draws will be less than .4 + # So, we consider those draws where the value is less than `rate` as complications + complication_count = sum(draws < rate) + # Then, we return the total count + return(complication_count) +} + +# The replicate function runs a function many times +simulated_complications <- replicate(5000, simulation()) + +``` + +We can look at our simulated complications + +```{r} +hist(simulated_complications) +``` + +And determine how many of them are as extreme or more extreme than the value we saw. This is the "exact" p-value. +```{r} + +sum(simulated_complications <= 3)/length(simulated_complications) +``` +