From: aaronshaw Date: Wed, 8 May 2019 21:21:00 +0000 (-0500) Subject: updates and corrections X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/792aab610bb1281787880b132a74d329b380ffa6 updates and corrections --- diff --git a/problem_sets/week_06/ps6-worked-solution.Rmd b/problem_sets/week_06/ps6-worked-solution.Rmd index 23d9a77..d99f639 100644 --- a/problem_sets/week_06/ps6-worked-solution.Rmd +++ b/problem_sets/week_06/ps6-worked-solution.Rmd @@ -12,36 +12,41 @@ knitr::opts_chunk$set(echo = TRUE, messages = F) ## Programming Questions +Note that Jeremy created this solution set with some code that uses functions drawn from the `tidyverse`. Wherever possible, we have also provided solution code using base R functions or alternatives since the course has not really provided an introduction to the tidyverse. + ### PC0 First we import the data. ```{r} -raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R +raw_df = read.csv("/tmp/owan03.csv") # Note that I saved the file as a CSV for importing to R head(raw_df) ``` ### PC1 -Let's reshape the data +Let's organize the data. We want this in "long" format where the survival times in weeks for each of the three dosage levels are all arranged in a single column and another variable labels the dosage level. There are (surprise!) many ways to do this. Here are two examples using a function called `melt()` from a library called `reshape` and another function called `gather()` from the `tidyverse` library. Note that you'll need to install the respective library and load it before you can call either function! ```{r} -library(tidyverse) +library(reshape) +library(tidyr) # Rename the columns colnames(raw_df) <- c("None", "Low", "Medium","High") -# Gather gets all of the columns from the dataframe and -# use them as keys, with the value as the value from that cell. -# This gives us a "long" dataframe -df <- gather(raw_df, dose, weeks_alive) - -# We want to treat dose as a factor, so we can group by it easier -df$dose <- as.factor(df$dose) +# Here's a version using "melt" +d <- melt(raw_df, na.rm=TRUE) +names(d) <- c("dose", "weeks_alive") -# Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same. -# We can safely remove these. +# "gather" similarly takes all of the columns from the dataframe and uses them as keys, with the value as the value from that # cell. This gives us a "long" dataframe +df <- gather(raw_df, dose, weeks_alive) +# We want to treat dose as a factor, so we can group by it easier. Specifying the levels helps with labels later on. +df$dose <- factor(df$dose, levels=c("None", "Low", "Medium", "High")) +# Finally, remove NAs df <- df[complete.cases(df),] + +# Uncomment this to see that the results are the same: +## df==d ``` ### PC2 @@ -49,11 +54,11 @@ df <- df[complete.cases(df),] Now we're going to calculate summary statistics and create some visualizations ```{r} - +# First, using the tapply function tapply(df$weeks_alive, df$dose, summary) -# Alternative way to do this using tidyverse - +# Alternative way to do this using tidyverse "group_by" function: +library(dplyr) df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR')) ``` @@ -61,47 +66,41 @@ df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR')) When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do. ```{r} - +library(ggplot2) # Histograms -h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize - fill = dose # How to group by color - )) + geom_histogram(position = 'dodge', bins = 5) +h_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_histogram(position = 'dodge', bins = 5) h_plot # In this case, faceted histograms is probably better - -h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize - )) + geom_histogram(bins = 5) + facet_grid(~dose) - +h_facet <- ggplot(data=df, aes(x=weeks_alive)) + geom_histogram(bins = 5) + facet_grid(~dose) h_facet -# Density plots -d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize - fill = dose # How to group by color - )) + geom_density(alpha = .2) +# Even easier to compare if we flip the coordinates (rotate the axes) +h_facet+coord_flip() + +# Density plots. Note that setting a value <1 for "alpha" makes the fill more transparent +d_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_density(alpha = .2) d_plot - # Boxplots -box_plot = df %>% ggplot(aes(y=weeks_alive, - x = dose - )) + geom_boxplot() +box_plot <- ggplot(data=df, aes(y=weeks_alive, x = dose)) + geom_boxplot() box_plot -# My favorite - ridgeline plots +# Jeremy's favorite - ridgeline plots # install.packages('ggridges') - library(ggridges) -ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) + - geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal() +ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange') ridge_plot +# add a fancy minimalist theme to make it prettier: +ridge_plot + theme_minimal() + ``` -It's a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable. +A two sample t-test assumes independence and normality. An ANOVA assumes independence, normality, and equal variance. It's a bit tough to tell, but the overall assumption of equal variance seems reasonable. Normality is a bit of a hard sell within groups or overall. Nevertheless, most analysts would march ahead with the analysis despite these violations of assumptions. We can discuss how you might think and talk about this in class. The global mean is @@ -111,55 +110,51 @@ mean(df$weeks_alive) ### PC3 -Anova! +ANOVA! Remember that we have to call `summary()` to see the ANOVA table: + ```{r} summary(aov(weeks_alive ~ dose, data = df)) - ``` -This provides evidence that the group means are different. +This provides evidence in support of the alternative hypothesis that the group means are not equal. It seems that the dosage of Red Dye #40 has a relationship with how many weeks the mice lived. ### PC4 T-test between None and Any, and between None and High. ```{r} - - -t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose - df[df$dose != 'None','weeks_alive'] # Samples with any dose +t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose + df$weeks_alive[df$dose != 'None'] # Samples with any dose ) # Or, using formula notation t.test(weeks_alive ~ dose == 'None', data = df) # T-test between None and High - -t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose - df[df$dose == 'High','weeks_alive'] # Samples with high dose +t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose + df$weeks_alive[df$dose == 'High'] # Samples with high dose ) -# Formula notation is a bit tricker. I would probably create a temprorary dataframe - -tmp = df %>% filter(dose %in% c('None', 'High')) - -t.test(weeks_alive ~ dose, data = tmp) +# Formula notation is a bit tricker. I would create a temprorary dataframe +df.tmp <- subset(df, subset= dose=="None" | dose=="High") +t.test(weeks_alive ~ dose, data = df.tmp) ``` -These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$. +These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$. -The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups. +The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups. The Reinhart book also provides a description of the Benjamini-Hochberg correction. ## Statistical Questions -Q1. -a) It is a sample statistic, because it comes from a sample. -b) Confidence intervals for proportions are equal to +### SQ1 — 6.12 +a) It is a sample statistic (the sample mean), because it comes from a sample. The population parameter is unknown in this case, but can be estimated from the sample statistic. -$$p \pm z * \sqrt{ \frac{p*(1-p)}{n}}$$ +b) As given in the textbook, confidence intervals for proportions are equal to: -For a 95% confidence interval, $z = 1.96$, so we can calculate it like this: +$$\hat{p} \pm z^* \sqrt{ \frac{\hat{p}\times(1-\hat{p})}{n}}$$ + +Where we calculate $z^*$ from the Z distribution table. For a 95% confidence interval, $z = 1.96$, so we can calculate it like this: ```{r} @@ -169,91 +164,141 @@ lower = .48 - 1.96 * sqrt(.48 * .52 / 1259) upper = .48 + 1.96 * sqrt(.48 * .52 / 1259) ci = c(lower, upper) - print(ci) ``` This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%. -c) We have a large enough sample, which is collected randomly, to assume that the distribution is normal. -d) The statement isn't justified, since our confidence interval include 50% +c) We have a large enough sample (collected randomly) in which enough responses are drawn from each potential outcome to assume that the observations are independent and the distribution of the sample proportion is approximately normal. + +d) The statement is **not** justified, since our confidence interval includes 50%. + +### SQ2 — 6.20 -6.20 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width. -Basically, we want each half of the confidence interval to be 1%, i.e., $1.96 * \sqrt{\frac{.48 * .52}{n}} = .01$ +In this case, we want the margin or error to be 2%. Using the same formula from above, this translates to: $1.96 * \sqrt{\frac{.48 * .52}{n}} = .02$ We can solve for $n$: -$$\sqrt{\frac{.48 * .52}{n}} = .01/1.96$$ -$$\frac{.48 * .52}{n} = (.01/1.96)^2$$ -$$n = (.48 * .52)/(.01/1.96)^2$$ +$$\sqrt{\frac{.48 * .52}{n}} = .02/1.96$$ +$$\frac{.48 * .52}{n} = (.02/1.96)^2$$ +$$n = (.48 * .52)/(.02/1.96)^2$$ +Let R solve this: ```{r} -(.48 * .52)/(.01/1.96)^2 +(.48 * .52)/(.02/1.96)^2 ``` -So, we need a sample of approximately 9,589 +So, we need a sample of at least 2,398 (since we can't survey less than a whole person). -6.38 +### SQ3 — 6.38 -The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students. +The question is whether there has been a change in the proportion across two groups (the students pre- and post-class) over time. While the tools we have learned could allow you to answer that question for two groups, they assume that responses are independent. In this case, the responses are not independent because they come from the same students. You could go ahead and calculate a statistical test for difference in pooled proportions (it's just math!), but the dependence between observations violates the assumptions and means that the baseline expectations necessary to identify the hypothesis test are not met. Your test statistic will not mean what you want it to mean. -6.50 +### SQ4 — 6.50 -a) We can test this with a $\chi^2$ test. +(a) Our hypothesis focuses on the relationship of the sample and census distributions by region: + $H_0$ : The sample distribution of regions follows the census distribution. + $H_A$ : The sample distribution of regions does not follow the census distribution. -```{r} -chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23)) +Before we can calculate the test statistic we should check the conditions: +1. Independence: $500 < 10%$ of US population, and the sample is random, hence we can assume that one individual in the sample is independent of another. +2. Sample size: The expected counts can be calculated as follows: +$$E N E = 500 × 0.18 = 90$$ +$$E N C = 500 × 0.22 = 110$$ +$$E S = 500 × 0.37 = 185$$ +$$E W = 500 × 0.23 = 115$$ -# Or manually +All expected counts are greater than 5 (which meets the rule of thumb re: empty "cells" in the table). -# Calculate expected values -500 * c(.18,.22,.37,.23) +We can test this with a $\chi^2$ test. -# Use the formula for chi-squared +```{r} +# Manually using the expected counts and formula for chi-squared chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115 - chisq -# We could then look up the chi-square distribution for 3 degrees of freedom +# We could then look up the chi-square distribution for 3 degrees of freedom in the book or find it in R. +# lower.tail=FALSE ensures that R returns the proportion of the distribution at least as large as our critical value: +pchisq(chisq, df=3, lower.tail=FALSE) + +# We could also use R's chisq.test function: +chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23)) ``` -The p-value for this is large, meaning that we don't have evidence that the sample differs from the census distribution. +The p-value for this is large, meaning that we cannot reject the null hypothesis that the sample proportions do not follow the census distribution. + +(b) Part b answers below: + i) We would probably want to treat opinion as the response and region as the explanatory variable, since it's unlikely that people move to a region based on their opinion (although that could be an empirical question!). + ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship. More formally: + $H_0$ : Region and opinion on direction are independent. + $H_A$ : Region and opinion on direction are dependent. + iii) We can again use a $\chi^2$ test. + +You can do this by hand calculating the expected counts and then plugging the values into the formula for $\chi^2$ as we did above. If you do that, here are the expected counts: -b) -i) Opinion is the response and location is the explanatory variable, since it's unlikely that people move to a region based on their opinion. -ii) One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship. -iii) We can again use a $\chi^2$ test. +Right direction: +```{r} +# NE: +round(171*83/500) +# NC: +round(171*121/500) +# S: +round(171*193/500) +# W: +round(171*103/500) +``` +Wrong direction: +```{r} +round(329*83/500) +round(329*121/500) +round(329*193/500) +round(329*103/500) +``` +Working through the formula for the $\chi^2$ test statistic and then calculating the p-value for 3 degrees of freedom: ```{r} +chisq <- (29-28)^2 /28 + + (44-41)^2 /41 + + (62-66)^2 /66 + + (36-35)^2 / 35 + + (54-55)^2 / 55 + + (77-80)^2 / 80 + + (131-127)^2 / 127 + + (67-68)^2 / 68 + +chisq +pchisq(chisq, df=3, lower.tail=FALSE) +``` +And using R's `chisq.test()` function here: +```{r} x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows byrow=T) # And this says that we've entered it row by row chisq.test(x) ``` +The answers don't match exactly (because of the rounding I did in computing the expected counts), however the payoff/interpretation is the same: we cannot reject the null hypothesis of no relationship between region and direction opinoin in the sample. ## Empirical Questions +### EQ1 -EQ0. - -a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test. +a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test. -b) The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased. +b) For this type of comparison statistical tests help to give (or take away) confidence in any observed differences across categories. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions. -c) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference. +c) The null hypothesis ($H_0$) is that the board purchased is independent of the gender of the customer. The alternative hypothesis ($H_A$) is that board purchase choice is dependent on gender. -d) Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions. +d) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what the authors set out to understand, which is the difference between men and women (the test could have identified a significant difference in the number of unknown gender customers across board types). Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the observed differences. +### — EQ2 -EQ1 +a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right ideologically. -a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right. +b) Assuming that the null hypothesis of no difference across the two groups is compelling, it would be surprising to see these results in a world where ideological orientation and blog governance have no relationship. In this respect, it makes sense to believe that this difference is likely real. Perhaps the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical. -b) It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical. - -c) +c) We can do this in R. ```{r} @@ -265,7 +310,8 @@ df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'), # We can make sure it's the same by testing the Chi-squared chisq.test(matrix(df$Count, nrow=2)) -percentage_data = df %>% +# Using Jeremy's tidyverse code: +percentage_data <- df %>% group_by(Ideology) %>% summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count), group_count = sum(Count)) @@ -276,13 +322,17 @@ shaw_benkler_plot = percentage_data %>% ylab('Percentage of Blogs') + theme_minimal() shaw_benkler_plot +``` + -# If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!) +If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!). -# I decided to use confidence intervals. The standard error is another reasonable choice +For our purposes here, Jeremy decided to use confidence intervals. The standard error is another reasonable choice. Either way, ggplot has a `geom_errorbar` layer that is very useful. -# Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n) +Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are +$\hat{p} \pm z^* \sqrt{\frac{\hat{p}~(1-\hat{p})}{n}}$ +```{r} ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count) shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100), @@ -292,6 +342,10 @@ shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(i ``` -The error bars do overlap in this case, but that actually doesn't tell us whether they are significantly different. +The error bars do overlap in this case, indicating that the true population proportions may not be as far apart as our point estimates suggest. Note that this is not the same as the hypothesis test. + +d) On the one hand, we don't need to worry about the base rate fallacy because the sizes of both groups are about the same and the paper does not abuse the evidence too egregiously. The base rate fallacy would likely come into play, however, in the ways that the results are (mis)represented. For example, you might imagine some news coverage looking at these results and claiming something (totally wrong!) like "Harvard study finds right wing blogs more than twice as likely to be solo affairs." This is taking a relationship between the sample proportions ($\hat{p}$ in the language of our textbook) and converting that into a statement about the relationship between population proportions ($p$). That would be a mess. + +Another way in which the base rate fallacy could play a role in this paper, however, concerns the presence of multiple comparisons. The authors conducted numerous statistical tests (indeed, one of the authors seems to recall that some of the tests were not even reported in the paper ) and they make no effort to address the baseline probability of false positives. -d) We don't need to be very worried about the base rate fallacy because the sizes of both groups are about the same. +In any case, the point here is that the statistical tests reported in the paper may not mean exactly what the authors said they did in the context of the publication. That may or may not change the validity of the results, but it should inspire us all to do better statistical analysis! diff --git a/problem_sets/week_06/ps6-worked-solution.html b/problem_sets/week_06/ps6-worked-solution.html index 50075ed..a0c9e48 100644 --- a/problem_sets/week_06/ps6-worked-solution.html +++ b/problem_sets/week_06/ps6-worked-solution.html @@ -13,202 +13,17 @@ -Week 6 Worked Examples +Week 6 problem set: Worked solutions - + - - - - - - - + + + + + + + - - -
- + + + @@ -301,23 +176,37 @@ $(document).ready(function () { + + + + + +
+ +

Programming Questions

-

PC0. First we import the data.

-
raw_df = read.csv("~/Desktop/DeleteMe/Teaching/owan03.csv") # Note that I saved the file as a CSV for importing to R
+

Note that Jeremy created this solution set with some code that uses functions drawn from the tidyverse. Wherever possible, we have also provided solution code using base R functions or alternatives since the course has not really provided an introduction to the tidyverse.

+
+

PC0

+

First we import the data.

+
raw_df = read.csv("/tmp/owan03.csv") # Note that I saved the file as a CSV for importing to R
 head(raw_df)
##   X1 X2 X3 X4
 ## 1 70 49 30 34
@@ -326,35 +215,43 @@ head(raw_df)
## 4 87 67 65 48 ## 5 92 70 76 65 ## 6 93 74 83 91
-

PC1. Let’s reshape the data

-
library(tidyverse)
-
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
-
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
-## ✔ tibble  2.1.1     ✔ dplyr   0.7.7
-## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
-## ✔ readr   1.1.1     ✔ forcats 0.3.0
-
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
-## ✖ dplyr::filter() masks stats::filter()
-## ✖ dplyr::lag()    masks stats::lag()
+
+
+

PC1

+

Let’s organize the data. We want this in “long” format where the survival times in weeks for each of the three dosage levels are all arranged in a single column and another variable labels the dosage level. There are (surprise!) many ways to do this. Here are two examples using a function called melt() from a library called reshape and another function called gather() from the tidyverse library. Note that you’ll need to install the respective library and load it before you can call either function!

+
library(reshape)
+library(tidyr)
+
## 
+## Attaching package: 'tidyr'
+
## The following objects are masked from 'package:reshape':
+## 
+##     expand, smiths
# Rename the columns
 colnames(raw_df) <- c("None", "Low", "Medium","High")
 
-# Gather gets all of the columns from the dataframe and 
-# use them as keys, with the value as the value from that cell.
-# This gives us a "long" dataframe
-df <- gather(raw_df, dose, weeks_alive)
+# Here's a version using "melt"
+d <- melt(raw_df, na.rm=TRUE)
+
## Using  as id variables
+
names(d) <- c("dose", "weeks_alive")
 
-# We want to treat dose as a factor, so we can group by it easier
-df$dose <- as.factor(df$dose)
+# "gather" similarly takes all of the columns from the dataframe and uses them as keys, with the value as the value from that # cell. This gives us a "long" dataframe
+df <- gather(raw_df, dose, weeks_alive)
+# We want to treat dose as a factor, so we can group by it easier. Specifying the levels helps with labels later on.
+df$dose <- factor(df$dose, levels=c("None", "Low", "Medium", "High"))
+# Finally, remove NAs
+df <- df[complete.cases(df),]
 
-# Finally, when we look at it, we see some missing data (NA); this is because not all group sizes were the same.
-# We can safely remove these.
-df <- df[complete.cases(df),]
-

PC2: Now we’re goint to get statistics and create some visualizations

-
tapply(df$weeks_alive, df$dose, summary)
-
## $High
+# Uncomment this to see that the results are the same:
+## df==d
+
+
+

PC2

+

Now we’re going to calculate summary statistics and create some visualizations

+
# First, using the tapply function
+tapply(df$weeks_alive, df$dose, summary)
+
## $None
 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-##   34.00   45.00   56.50   65.25   92.75  102.00 
+##   70.00   85.00   93.00   91.36  101.00  103.00 
 ## 
 ## $Low
 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
@@ -364,81 +261,100 @@ df <- df[complete.cases(df),]
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 30.00 58.25 79.50 71.50 89.25 97.00 ## -## $None +## $High ## Min. 1st Qu. Median Mean 3rd Qu. Max. -## 70.00 85.00 93.00 91.36 101.00 103.00 -
# Alternative way to do this using tidyverse
-
-df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
+## 34.00 45.00 56.50 65.25 92.75 102.00 +
# Alternative way to do this using tidyverse "group_by" function:
+library(dplyr)
+
## 
+## Attaching package: 'dplyr'
+
## The following object is masked from 'package:reshape':
+## 
+##     rename
+
## The following objects are masked from 'package:stats':
+## 
+##     filter, lag
+
## The following objects are masked from 'package:base':
+## 
+##     intersect, setdiff, setequal, union
+
df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
## # A tibble: 4 x 5
 ##   dose     min   max  mean   IQR
 ##   <fct>  <dbl> <dbl> <dbl> <dbl>
-## 1 High      34   102  65.2  47.8
+## 1 None      70   103  91.4  16  
 ## 2 Low       49    89  69.9  14  
 ## 3 Medium    30    97  71.5  31  
-## 4 None      70   103  91.4  16
+## 4 High 34 102 65.2 47.8

When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.

+
library(ggplot2)
+
## Registered S3 methods overwritten by 'ggplot2':
+##   method         from 
+##   [.quosures     rlang
+##   c.quosures     rlang
+##   print.quosures rlang
# Histograms
 
-h_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
-                      fill = dose # How to group by color
-                      )) + geom_histogram(position = 'dodge', bins = 5)
+h_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_histogram(position = 'dodge', bins = 5)
 h_plot
-

+

# In this case, faceted histograms is probably better
-
-h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize
-                      )) + geom_histogram(bins = 5) + facet_grid(~dose)
-
+h_facet <- ggplot(data=df, aes(x=weeks_alive)) + geom_histogram(bins = 5) + facet_grid(~dose)
 h_facet
-

-
# Density plots
-d_plot = df %>% ggplot(aes(x=weeks_alive, # What to summarize
-                      fill = dose # How to group by color
-                      )) + geom_density(alpha = .2)
+

+
# Even easier to compare if we flip the coordinates (rotate the axes)
+h_facet+coord_flip()
+

+
# Density plots. Note that setting a value <1 for "alpha" makes the fill more transparent
+d_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_density(alpha = .2)
 d_plot
-

+

# Boxplots
-box_plot = df %>% ggplot(aes(y=weeks_alive,
-                      x = dose
-                      )) + geom_boxplot()
+box_plot <- ggplot(data=df, aes(y=weeks_alive,  x = dose)) + geom_boxplot()
 box_plot
-

-
# My favorite - ridgeline plots
+

+
# Jeremy's favorite - ridgeline plots
 
 # install.packages('ggridges')
-
 library(ggridges) 
## 
 ## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
 ## 
 ##     scale_discrete_manual
-
ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) + 
-  geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal()
+
ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange')
 ridge_plot
## Picking joint bandwidth of 10.5
-

-

It’s a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.

+

+
# add a fancy minimalist theme to make it prettier:
+ridge_plot + theme_minimal()
+
## Picking joint bandwidth of 10.5
+

+

A two sample t-test assumes independence and normality. An ANOVA assumes independence, normality, and equal variance. It’s a bit tough to tell, but the overall assumption of equal variance seems reasonable. Normality is a bit of a hard sell within groups or overall. Nevertheless, most analysts would march ahead with the analysis despite these violations of assumptions. We can discuss how you might think and talk about this in class.

The global mean is

mean(df$weeks_alive)
## [1] 75.55263
-

PC3. Anova

+
+
+

PC3

+

ANOVA! Remember that we have to call summary() to see the ANOVA table:

summary(aov(weeks_alive ~ dose, data = df))
##             Df Sum Sq Mean Sq F value Pr(>F)  
 ## dose         3   4052  1350.7    3.55 0.0245 *
 ## Residuals   34  12937   380.5                 
 ## ---
 ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-

This provides evidence that the group means are different.

-

PC4. T-test between None and Any, and between None and High.

-
t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
-       df[df$dose != 'None','weeks_alive'] # Samples with any dose
+

This provides evidence in support of the alternative hypothesis that the group means are not equal. It seems that the dosage of Red Dye #40 has a relationship with how many weeks the mice lived.

+
+
+

PC4

+

T-test between None and Any, and between None and High.

+
t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose
+       df$weeks_alive[df$dose != 'None'] # Samples with any dose
        )
## 
 ##  Welch Two Sample t-test
 ## 
-## data:  df[df$dose == "None", "weeks_alive"] and df[df$dose != "None", "weeks_alive"]
+## data:  df$weeks_alive[df$dose == "None"] and df$weeks_alive[df$dose != "None"]
 ## t = 4.2065, df = 33.732, p-value = 0.0001806
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
@@ -460,14 +376,13 @@ t.test(weeks_alive ~ dose == 'None', data = df)
## mean in group FALSE mean in group TRUE ## 69.11111 91.36364
# T-test between None and High
-
-t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
-       df[df$dose == 'High','weeks_alive'] # Samples with high dose
+t.test(df$weeks_alive[df$dose == 'None'], # Samples with no dose
+       df$weeks_alive[df$dose == 'High'] # Samples with high dose
        )
## 
 ##  Welch Two Sample t-test
 ## 
-## data:  df[df$dose == "None", "weeks_alive"] and df[df$dose == "High", "weeks_alive"]
+## data:  df$weeks_alive[df$dose == "None"] and df$weeks_alive[df$dose == "High"]
 ## t = 2.4958, df = 8.5799, p-value = 0.0353
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
@@ -475,82 +390,140 @@ t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
 ## sample estimates:
 ## mean of x mean of y 
 ##  91.36364  65.25000
-
# Formula notation is a bit tricker. I would probably create a temprorary dataframe
-
-tmp = df %>% filter(dose %in% c('None', 'High'))
-
-t.test(weeks_alive ~ dose, data = tmp)
+
# Formula notation is a bit tricker. I would create a temprorary dataframe
+df.tmp <- subset(df, subset= dose=="None" | dose=="High")
+t.test(weeks_alive ~ dose, data = df.tmp)
## 
 ##  Welch Two Sample t-test
 ## 
 ## data:  weeks_alive by dose
-## t = -2.4958, df = 8.5799, p-value = 0.0353
+## t = 2.4958, df = 8.5799, p-value = 0.0353
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
-##  -49.960874  -2.266399
+##   2.266399 49.960874
 ## sample estimates:
-## mean in group High mean in group None 
-##           65.25000           91.36364
+## mean in group None mean in group High +## 91.36364 65.25000

These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if \(\alpha < .05/m\) where \(m\) is the number of tests. In this case, we would fail to reject the null for the second test because we would set \(\alpha = .025\).

-

The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the TukeyHSD function takes in an anova result and does post-hoc comparisons with corrections for all of the groups.

+

The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the TukeyHSD function takes in an anova result and does post-hoc comparisons with corrections for all of the groups. The Reinhart book also provides a description of the Benjamini-Hochberg correction.

+

Statistical Questions

-

Q1. a) It is a sample statistic, because it comes from a sample. b) Confidence intervals for proportions are equal to

-

\[p \pm z * \sqrt{ \frac{p*(1-p)}{n}}\]

-

For a 95% confidence interval, \(z = 1.96\), so we can calculate it like this:

+
+

SQ1 — 6.12

+
    +
  1. It is a sample statistic (the sample mean), because it comes from a sample. The population parameter is unknown in this case, but can be estimated from the sample statistic.

  2. +
  3. As given in the textbook, confidence intervals for proportions are equal to:

  4. +
+

\[\hat{p} \pm z^* \sqrt{ \frac{\hat{p}\times(1-\hat{p})}{n}}\]

+

Where we calculate \(z^*\) from the Z distribution table. For a 95% confidence interval, \(z = 1.96\), so we can calculate it like this:

lower = .48 - 1.96 * sqrt(.48 * .52 / 1259)
 
 upper = .48 + 1.96 * sqrt(.48 * .52 / 1259)
 
 ci = c(lower, upper)
-
 print(ci)
## [1] 0.4524028 0.5075972

This means that we are 95% confident that the true proportion of Americans who support legalizing marijuana is between ~45% and ~51%.

    -
  1. We have a large enough sample, which is collected randomly, to assume that the distribution is normal.
  2. -
  3. The statement isn’t justified, since our confidence interval include 50%
  4. +
  5. We have a large enough sample (collected randomly) in which enough responses are drawn from each potential outcome to assume that the observations are independent and the distribution of the sample proportion is approximately normal.

  6. +
  7. The statement is not justified, since our confidence interval includes 50%.

-

6.20 We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.

-

Basically, we want each half of the confidence interval to be 1%, i.e., \(1.96 * \sqrt{\frac{.48 * .52}{n}} = .01\)

+
+
+

SQ2 — 6.20

+

We can use the point estimate of the poll to estimate how large a sample we would need to have a confidence interval of a given width.

+

In this case, we want the margin or error to be 2%. Using the same formula from above, this translates to: \(1.96 * \sqrt{\frac{.48 * .52}{n}} = .02\)

We can solve for \(n\):

-

\[\sqrt{\frac{.48 * .52}{n}} = .01/1.96\] \[\frac{.48 * .52}{n} = (.01/1.96)^2\] \[n = (.48 * .52)/(.01/1.96)^2\]

-
(.48 * .52)/(.01/1.96)^2
-
## [1] 9588.634
-

So, we need a sample of approximately 9,589

-

6.38

-

The question is whether there has been a change in the proportion over time. While the tools we have learned could allow you to answer that question, they assume that responses are independent. In this case, they are obviously not independent as they come from the same students.

-

6.50

+

\[\sqrt{\frac{.48 * .52}{n}} = .02/1.96\] \[\frac{.48 * .52}{n} = (.02/1.96)^2\] \[n = (.48 * .52)/(.02/1.96)^2\] Let R solve this:

+
(.48 * .52)/(.02/1.96)^2
+
## [1] 2397.158
+

So, we need a sample of at least 2,398 (since we can’t survey less than a whole person).

+
+
+

SQ3 — 6.38

+

The question is whether there has been a change in the proportion across two groups (the students pre- and post-class) over time. While the tools we have learned could allow you to answer that question for two groups, they assume that responses are independent. In this case, the responses are not independent because they come from the same students. You could go ahead and calculate a statistical test for difference in pooled proportions (it’s just math!), but the dependence between observations violates the assumptions and means that the baseline expectations necessary to identify the hypothesis test are not met. Your test statistic will not mean what you want it to mean.

+
+
+

SQ4 — 6.50

    -
  1. We can test this with a \(\chi^2\) test.
  2. +
  3. Our hypothesis focuses on the relationship of the sample and census distributions by region:
    +\(H_0\) : The sample distribution of regions follows the census distribution.
    +\(H_A\) : The sample distribution of regions does not follow the census distribution.
-
chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
+

Before we can calculate the test statistic we should check the conditions:
+1. Independence: \(500 < 10%\) of US population, and the sample is random, hence we can assume that one individual in the sample is independent of another.
+2. Sample size: The expected counts can be calculated as follows: \[E N E = 500 × 0.18 = 90\] \[E N C = 500 × 0.22 = 110\] \[E S = 500 × 0.37 = 185\] \[E W = 500 × 0.23 = 115\]

+

All expected counts are greater than 5 (which meets the rule of thumb re: empty “cells” in the table).

+

We can test this with a \(\chi^2\) test.

+
# Manually using the expected counts and formula for chi-squared
+chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
+chisq
+
## [1] 3.242564
+
# We could then look up the chi-square distribution for 3 degrees of freedom in the book or find it in R. 
+# lower.tail=FALSE ensures that R returns the proportion of the distribution at least as large as our critical value:
+pchisq(chisq, df=3, lower.tail=FALSE)
+
## [1] 0.355717
+
# We could also use R's chisq.test function:
+chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23))
## 
 ##  Chi-squared test for given probabilities
 ## 
 ## data:  c(83, 121, 193, 103)
 ## X-squared = 3.2426, df = 3, p-value = 0.3557
-
# Or manually
-
-# Calculate expected values
-500 * c(.18,.22,.37,.23)
-
## [1]  90 110 185 115
-
# Use the formula for chi-squared
-chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
-
-chisq
-
## [1] 3.242564
-
# We could then look up the chi-square distribution for 3 degrees of freedom
-

The p-value for this is large, meaning that we don’t have evidence that the sample differs from the census distribution.

+

The p-value for this is large, meaning that we cannot reject the null hypothesis that the sample proportions do not follow the census distribution.

    -
  1. -
  2. Opinion is the response and location is the explanatory variable, since it’s unlikely that people move to a region based on their opinion.
  3. +
  4. Part b answers below:
    +
-
    -
  1. One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship.
  2. +
      +
    1. We would probably want to treat opinion as the response and region as the explanatory variable, since it’s unlikely that people move to a region based on their opinion (although that could be an empirical question!).
      +
    2. +
    3. One hypothesis is that opinions differ by region. The null hypothesis is that opinion is independent of region, while the alternative hypothesis is that there is a relationship. More formally:
      +\(H_0\) : Region and opinion on direction are independent.
      +\(H_A\) : Region and opinion on direction are dependent.
      +
    4. We can again use a \(\chi^2\) test.
    +

    You can do this by hand calculating the expected counts and then plugging the values into the formula for \(\chi^2\) as we did above. If you do that, here are the expected counts:

    +

    Right direction:

    +
    # NE:
    +round(171*83/500)
    +
    ## [1] 28
    +
    # NC:  
    +round(171*121/500)
    +
    ## [1] 41
    +
    # S:  
    +round(171*193/500)
    +
    ## [1] 66
    +
    # W:
    +round(171*103/500)
    +
    ## [1] 35
    +

    Wrong direction:

    +
    round(329*83/500)
    +
    ## [1] 55
    +
    round(329*121/500)
    +
    ## [1] 80
    +
    round(329*193/500)
    +
    ## [1] 127
    +
    round(329*103/500)
    +
    ## [1] 68
    +

    Working through the formula for the \(\chi^2\) test statistic and then calculating the p-value for 3 degrees of freedom:

    +
    chisq <- (29-28)^2 /28 +
    +  (44-41)^2 /41 +
    +  (62-66)^2 /66 +
    +  (36-35)^2 / 35 +
    +  (54-55)^2 / 55 +
    +  (77-80)^2 / 80 +
    +  (131-127)^2 / 127 +
    +  (67-68)^2 / 68
    +
    +chisq
    +
    ## [1] 0.7975941
    +
    pchisq(chisq, df=3, lower.tail=FALSE)
    +
    ## [1] 0.8500424
    +

    And using R’s chisq.test() function here:

    x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 rows
                 byrow=T) # And this says that we've entered it row by row
     chisq.test(x)
    @@ -559,21 +532,26 @@ chisq.test(x) ## ## data: x ## X-squared = 0.66724, df = 3, p-value = 0.8809 +

    The answers don’t match exactly (because of the rounding I did in computing the expected counts), however the payoff/interpretation is the same: we cannot reject the null hypothesis of no relationship between region and direction opinoin in the sample.

    +

Empirical Questions

-

EQ0.

+
+

EQ1

  1. The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test.

  2. -
  3. The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased.

  4. -
  5. A \(\chi^2\) test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference.

  6. -
  7. Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.

  8. +
  9. For this type of comparison statistical tests help to give (or take away) confidence in any observed differences across categories. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.

  10. +
  11. The null hypothesis (\(H_0\)) is that the board purchased is independent of the gender of the customer. The alternative hypothesis (\(H_A\)) is that board purchase choice is dependent on gender.

  12. +
  13. A \(\chi^2\) test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what the authors set out to understand, which is the difference between men and women (the test could have identified a significant difference in the number of unknown gender customers across board types). Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the observed differences.

-

EQ1

+
+
+

— EQ2

    -
  1. These are counts for two categorical variables, so the procedure used was a \(\chi^2\) test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right.

  2. -
  3. It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.

  4. -
  5. +
  6. These are counts for two categorical variables, so the procedure used was a \(\chi^2\) test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right ideologically.

  7. +
  8. Assuming that the null hypothesis of no difference across the two groups is compelling, it would be surprising to see these results in a world where ideological orientation and blog governance have no relationship. In this respect, it makes sense to believe that this difference is likely real. Perhaps the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.

  9. +
  10. We can do this in R.

# First we create the dataframe
 df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'),
@@ -587,7 +565,8 @@ chisq.test(matrix(df$Count, nrow=2))
## ## data: matrix(df$Count, nrow = 2) ## X-squared = 5.8356, df = 1, p-value = 0.01571 -
percentage_data = df %>% 
+
# Using Jeremy's tidyverse code:
+percentage_data <- df %>% 
   group_by(Ideology) %>% 
   summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count),
             group_count = sum(Count))
@@ -599,23 +578,23 @@ shaw_benkler_plot = percentage_data %>%
 
 shaw_benkler_plot

-
# If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!)
-
-# I decided to use confidence intervals. The standard error is another reasonable choice
-
-# Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n)
-
-ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
+

If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!).

+

For our purposes here, Jeremy decided to use confidence intervals. The standard error is another reasonable choice. Either way, ggplot has a geom_errorbar layer that is very useful.

+

Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are \(\hat{p} \pm z^* \sqrt{\frac{\hat{p}~(1-\hat{p})}{n}}\)

+
ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
 
 shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100),
                                   alpha = .3, 
                                   size=1.1, 
                                   width=.4)

-

The error bars do overlap in this case, but that actually doesn’t tell us whether they are significantly different.

+

The error bars do overlap in this case, indicating that the true population proportions may not be as far apart as our point estimates suggest. Note that this is not the same as the hypothesis test.

    -
  1. We don’t need to be very worried about the base rate fallacy because the sizes of both groups are about the same.
  2. +
  3. On the one hand, we don’t need to worry about the base rate fallacy because the sizes of both groups are about the same and the paper does not abuse the evidence too egregiously. The base rate fallacy would likely come into play, however, in the ways that the results are (mis)represented. For example, you might imagine some news coverage looking at these results and claiming something (totally wrong!) like “Harvard study finds right wing blogs more than twice as likely to be solo affairs.” This is taking a relationship between the sample proportions (\(\hat{p}\) in the language of our textbook) and converting that into a statement about the relationship between population proportions (\(p\)). That would be a mess.
+

Another way in which the base rate fallacy could play a role in this paper, however, concerns the presence of multiple comparisons. The authors conducted numerous statistical tests (indeed, one of the authors seems to recall that some of the tests were not even reported in the paper ) and they make no effort to address the baseline probability of false positives.

+

In any case, the point here is that the statistical tests reported in the paper may not mean exactly what the authors said they did in the context of the publication. That may or may not change the validity of the results, but it should inspire us all to do better statistical analysis!

+