From: Jeremy Foote Date: Fri, 26 Apr 2019 16:20:29 +0000 (-0500) Subject: Adding solutions for new questions X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/d365bb775080c45e0f701f206a516e291a035f2f Adding solutions for new questions --- diff --git a/problem_sets/week_06/ps6-worked-solution.Rmd b/problem_sets/week_06/ps6-worked-solution.Rmd index f5791b0..986a9a0 100644 --- a/problem_sets/week_06/ps6-worked-solution.Rmd +++ b/problem_sets/week_06/ps6-worked-solution.Rmd @@ -65,7 +65,7 @@ h_plot # In this case, faceted histograms is probably better -h_facet = df %>% ggplot(aes(x=weeks_alive, # What to summarize +h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize )) + geom_histogram(bins = 5) + facet_grid(~dose) h_facet @@ -90,7 +90,7 @@ box_plot library(ggridges) ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) + - geom_density_ridges(jittered_points = T) + geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal() ridge_plot ``` @@ -103,7 +103,17 @@ The global mean is mean(df$weeks_alive) ``` -PC3. T-test between None and Any, and between None and High. + +PC3. Anova + +```{r} +summary(aov(weeks_alive ~ dose, data = df)) + +``` + +This provides evidence that the group means are different. + +PC4. T-test between None and Any, and between None and High. ```{r} @@ -129,16 +139,10 @@ t.test(weeks_alive ~ dose, data = tmp) ``` -The t-test supports the idea that receiving a dose of RD40 reduces lifespan - -PC4. Anova - -```{r} -summary(aov(weeks_alive ~ dose, data = df)) +These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if $\alpha < .05/m$ where $m$ is the number of tests. In this case, we would fail to reject the null for the second test because we would set $\alpha = .025$. -``` +The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the `TukeyHSD` function takes in an anova result and does post-hoc comparisons with corrections for all of the groups. -This provides evidence that the group means are different. ## Statistical Questions @@ -202,6 +206,9 @@ chisq.test(x = c(83,121,193,103), p = c(.18,.22,.37,.23)) # Use the formula for chi-squared chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115 +chisq + +# We could then look up the chi-square distribution for 3 degrees of freedom ``` @@ -218,3 +225,66 @@ x <- matrix(c(29,54,44,77,62,131,36,67), nrow = 4, # this makes a matrix with 4 byrow=T) # And this says that we've entered it row by row chisq.test(x) ``` + +## Empirical Questions + + +EQ0. + +a) The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test. + +b) The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased. + +c) A $\chi^2$ test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference. + +d) Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions. + + +EQ1 + +a) These are counts for two categorical variables, so the procedure used was a $\chi^2$ test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right. + +b) It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical. + +c) + +```{r} + +# First we create the dataframe +df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'), + Ideology = c('Left','Left','Right','Right'), + Count = c(13,51,27,38)) + +# We can make sure it's the same by testing the Chi-squared +chisq.test(matrix(df$Count, nrow=2)) + +percentage_data = df %>% + group_by(Ideology) %>% + summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count), + group_count = sum(Count)) + +shaw_benkler_plot = percentage_data %>% + ggplot(aes(x=Ideology, y=individual_ratio * 100)) + + geom_bar(stat='identity', aes(fill=c('red','blue')), show.legend=F) + + ylab('Percentage of Blogs') + theme_minimal() + +shaw_benkler_plot + +# If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!) + +# I decided to use confidence intervals. The standard error is another reasonable choice + +# Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n) + +ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count) + +shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100), + alpha = .3, + size=1.1, + width=.4) + +``` + +The error bars do overlap in this case, but that actually doesn't tell us whether they are significantly different. + +d) We don't need to be very worried about the base rate fallacy because the sizes of both groups are about the same. diff --git a/problem_sets/week_06/ps6-worked-solution.html b/problem_sets/week_06/ps6-worked-solution.html index 7a80c67..50075ed 100644 --- a/problem_sets/week_06/ps6-worked-solution.html +++ b/problem_sets/week_06/ps6-worked-solution.html @@ -328,12 +328,12 @@ head(raw_df) ## 6 93 74 83 91

PC1. Let’s reshape the data

library(tidyverse)
-
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
+
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
 ## ✔ tibble  2.1.1     ✔ dplyr   0.7.7
 ## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
 ## ✔ readr   1.1.1     ✔ forcats 0.3.0
-
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
+
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 ## ✖ dplyr::filter() masks stats::filter()
 ## ✖ dplyr::lag()    masks stats::lag()
# Rename the columns
@@ -387,7 +387,7 @@ h_plot

# In this case, faceted histograms is probably better
 
-h_facet = df %>% ggplot(aes(x=weeks_alive, # What to summarize
+h_facet = df %>% ggplot(aes(x=weeks_alive # What to summarize
                       )) + geom_histogram(bins = 5) + facet_grid(~dose)
 
 h_facet
@@ -415,15 +415,23 @@ library(ggridges)
## ## scale_discrete_manual
ridge_plot = df %>% ggplot(aes(x=weeks_alive, y = dose)) + 
-  geom_density_ridges(jittered_points = T)
+  geom_density_ridges(jittered_points = T, fill = 'orange') + theme_minimal()
 ridge_plot
## Picking joint bandwidth of 10.5
-

+

It’s a bit tough to tell, but the overall assumptions of normality and equal variance seem reasonable.

The global mean is

mean(df$weeks_alive)
## [1] 75.55263
-

PC3. T-test between None and Any, and between None and High.

+

PC3. Anova

+
summary(aov(weeks_alive ~ dose, data = df))
+
##             Df Sum Sq Mean Sq F value Pr(>F)  
+## dose         3   4052  1350.7    3.55 0.0245 *
+## Residuals   34  12937   380.5                 
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+

This provides evidence that the group means are different.

+

PC4. T-test between None and Any, and between None and High.

t.test(df[df$dose == 'None', 'weeks_alive'], # Samples with no dose
        df[df$dose != 'None','weeks_alive'] # Samples with any dose
        )
@@ -483,15 +491,8 @@ t.test(weeks_alive ~ dose, data = tmp) ## sample estimates: ## mean in group High mean in group None ## 65.25000 91.36364 -

The t-test supports the idea that receiving a dose of RD40 reduces lifespan

-

PC4. Anova

-
summary(aov(weeks_alive ~ dose, data = df))
-
##             Df Sum Sq Mean Sq F value Pr(>F)  
-## dose         3   4052  1350.7    3.55 0.0245 *
-## Residuals   34  12937   380.5                 
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-

This provides evidence that the group means are different.

+

These t-tests both support the idea that receiving a dose of RD40 reduces lifespan. However, we should not completely trust these p-values, since we are doing multiple comparisons. One option is to do a Bonferroni correction, where we only cnsider things significant if \(\alpha < .05/m\) where \(m\) is the number of tests. In this case, we would fail to reject the null for the second test because we would set \(\alpha = .025\).

+

The Bonferroni correction is more conservative than it needs to be, ane there are other approaches; for example, the TukeyHSD function takes in an anova result and does post-hoc comparisons with corrections for all of the groups.

Statistical Questions

@@ -536,7 +537,11 @@ print(ci) 500 * c(.18,.22,.37,.23)
## [1]  90 110 185 115
# Use the formula for chi-squared
-chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115
+chisq = (83-90)^2/90 + (121-110)^2/110 + (193-185)^2/185 + (103-115)^2/115 + +chisq +
## [1] 3.242564
+
# We could then look up the chi-square distribution for 3 degrees of freedom

The p-value for this is large, meaning that we don’t have evidence that the sample differs from the census distribution.

  1. @@ -555,6 +560,63 @@ chisq.test(x) ## data: x ## X-squared = 0.66724, df = 3, p-value = 0.8809
+
+

Empirical Questions

+

EQ0.

+
    +
  1. The unit of analysis is the customer. The dependent variable is the type of board purchased and the independent variable is gender. Males, females, and unkown gender customers are being compared. This is a two-way test.

  2. +
  3. The null hypothesis is that the board purchased is independent of the gender of the customer. The alternative hypothesis is that if we know the gender of the customer that will tell us something about the type of board they purchased.

  4. +
  5. A \(\chi^2\) test found statistically significant evidence that board purchase behavior differs by gender. This difference is convincing, but it does directly not tell us what we really want to know, which is the difference between men and women. It could be possible that it is simply identifying a significant difference in the number of unknown gender customers across board types. Many of these concerns are addressed in the text and with additional tests, giving increased confidence in the reality of this difference.

  6. +
  7. Statistical tests help to give (or take away) confidence in a conclusion. People are not natively good at estimating how likely something is due to chance and tests help us to make these judgments. Choosing a statistical test is based on the question that you want to answer and the type of data that you have available to answer it. For example, if this were numeric data (e.g., the amount of money spent on electronics for men and women) then we could choose a t-test to compare those distributions.

  8. +
+

EQ1

+
    +
  1. These are counts for two categorical variables, so the procedure used was a \(\chi^2\) test. The null hypothesis is that whether or not a blog is governed by one person is independent of whether it is on the left or the right.

  2. +
  3. It would be surprising to see these results by chance and it make sense to believe that this difference is real. However, the main reason to be skeptical is the way that the data are grouped. The authors could have grouped them differently (e.g., 1-2 people, 3-4 people, and 5+ people); if the decision on how to group was made after seeing the data then we have good reason to be skeptical.

  4. +
  5. +
+
# First we create the dataframe
+df = data.frame(Governance=c('Individual','Multiple', 'Individual','Multiple'),
+                Ideology = c('Left','Left','Right','Right'),
+                Count = c(13,51,27,38))
+
+# We can make sure it's the same by testing the Chi-squared                
+chisq.test(matrix(df$Count, nrow=2))
+
## 
+##  Pearson's Chi-squared test with Yates' continuity correction
+## 
+## data:  matrix(df$Count, nrow = 2)
+## X-squared = 5.8356, df = 1, p-value = 0.01571
+
percentage_data = df %>% 
+  group_by(Ideology) %>% 
+  summarize(individual_ratio = sum(Count[Governance=='Individual']) / sum(Count),
+            group_count = sum(Count))
+
+shaw_benkler_plot = percentage_data %>%
+  ggplot(aes(x=Ideology, y=individual_ratio * 100)) + 
+    geom_bar(stat='identity', aes(fill=c('red','blue')), show.legend=F) + 
+    ylab('Percentage of Blogs') + theme_minimal()
+
+shaw_benkler_plot
+

+
# If we want to add error bars, we need to calculate them (Note that ggplot could do this for us if we had raw data - always share your data!)
+
+# I decided to use confidence intervals. The standard error is another reasonable choice
+
+# Remember that for a binomial distribution (we can consider individual/non-individual as binomial), confidence intervals are mu +- x * sqrt((p*(1-p))/n)
+
+ci_95 = 1.96 * sqrt(percentage_data$individual_ratio * (1 - percentage_data$individual_ratio)/percentage_data$group_count)
+
+shaw_benkler_plot + geom_errorbar(aes(ymin=(individual_ratio-ci_95)*100, ymax=(individual_ratio + ci_95)*100),
+                                  alpha = .3, 
+                                  size=1.1, 
+                                  width=.4)
+

+

The error bars do overlap in this case, but that actually doesn’t tell us whether they are significantly different.

+
    +
  1. We don’t need to be very worried about the base rate fallacy because the sizes of both groups are about the same.
  2. +
+