Programming Challenges

PC1. Import

You had the option of downloading the dataset in several formats, including a Stata 13 “.dta” file. For demonstration purposes, I’ll use that proprietary format here. I also uploaded a copy of the dataset to the course website for pedagogical purposes (so you can replicate my code here on your own machine).

There are a few ways to import data from Stata 13 files. One involves using the appropriately named readstata13 package. Another uses the haven package (more of a general purpose tool for importing data from binary/proprietary formats). I’ll go with the read_dta() command from haven here because it works better with importing from a URL.

library(haven)

df <- read_dta(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta"))

PC2. Explore and cleanup

head(df)
## # A tibble: 6 x 7
##   obama fruit  year   age  male  neob treat_year
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1     0     0  2014     6     0     1          4
## 2     0     1  2014     5     0     1          4
## 3     0     0  2014     9     1     1          4
## 4     0     0  2014     5     1     1          4
## 5     0     0  2014     7     0     1          4
## 6     0     0  2014     9     0     1          4
summary(df)
##      obama            fruit             year           age       
##  Min.   :0.0000   Min.   :0.0000   Min.   :2012   Min.   : 2.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2014   1st Qu.: 6.00  
##  Median :0.0000   Median :0.0000   Median :2015   Median : 8.00  
##  Mean   :0.3639   Mean   :0.2512   Mean   :2014   Mean   : 8.52  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2015   3rd Qu.:11.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2015   Max.   :19.00  
##                   NA's   :1                                      
##       male             neob          treat_year   
##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.0000   Median :1.0000   Median :5.000  
##  Mean   :0.5262   Mean   :0.6361   Mean   :4.406  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:6.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :6.000  
##  NA's   :1

There are a few noteable things about the dataset. One is the neob, which the codebook says means “not equal to obama”; in other words, it’s the converse of the obama column. The treat_year column is a unique index of the obama column and the year column. See the codebook for more information. Happily, we only need to use the first two columns for now.

I’ll drop everything else and convert those two columns into logical vectors here using a couple of tidyverse dplyr commands:

library(tidyverse)

df <- df %>%
  select(obama, fruit) %>%
  mutate(
    obama = as.logical(obama),
    fruit = as.logical(fruit)
  )

head(df)
## # A tibble: 6 x 2
##   obama fruit
##   <lgl> <lgl>
## 1 FALSE FALSE
## 2 FALSE TRUE 
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE

PC3. Summarize key variables

I’ll run summary again and then make my contingency table.

summary(df)
##    obama           fruit        
##  Mode :logical   Mode :logical  
##  FALSE:778       FALSE:915      
##  TRUE :445       TRUE :307      
##                  NA's :1
obama.tbl <- table(took_fruit = df$fruit, saw_flotus = df$obama)
obama.tbl
##           saw_flotus
## took_fruit FALSE TRUE
##      FALSE   593  322
##      TRUE    185  122
obama.prop.tbl <- proportions(table(took_fruit = df$fruit, saw_flotus = df$obama), margin = 2)
obama.prop.tbl
##           saw_flotus
## took_fruit     FALSE      TRUE
##      FALSE 0.7622108 0.7252252
##      TRUE  0.2377892 0.2747748

PC4. Test for differences between groups

So, the test we want to conduct here focuses on the difference between the proportion of the two groups (those shown a picture of Michelle Obama vs those not shown a picture of Michelle Obama) who took fruit vs. candy. Labeling the difference in proportions as \(\Delta_{fruit}\), the comparison can be constructed around the following (two-sided) hypothesis test

\[ H_0:~\Delta_{fruit} = 0\] \[ H_A:~\Delta_{fruit} \neq 0\] As discussed at length in OpenIntro Chapter 6, a great way to determine if two groups are independent (in terms of proportions or counts) is a \(\chi^2\) test.

Are the conditions for a valid test met? It seems so, since there are many observations in each cell of the table being compared and the observations appear to have been collected in a way that ensures independence.

The \(\chi^2\) test is implemented as chisq.test() in R. Since it’s a 2x2 comparison, we can also test for a difference in proportions using the prop.test() function. Let’s do both.

chisq.test(obama.tbl)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
prop.test(obama.tbl)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01957437  0.11053751
## sample estimates:
##    prop 1    prop 2 
## 0.6480874 0.6026059

Notice that both functions report identical \(\chi^2\) test results and p-values. Did you expect this based on the OpenIntro reading?

Recall that if you want to double-check that p-value, you can also calculate it “by-hand” using the pchisq() function:

pchisq(1.8637, df = 1, lower.tail = FALSE)
## [1] 0.1721984

We’ll discuss the interpretation of these results in our class session this week.

PC5. Replicate a figure

In order to replicate the top panel of Figure 1, we’ll first want to calculate the proportion and standard error for fruit-takers in the treatment and control groups. These can be calculated individually or using a function (guess which one we’ll document here). Also, note that I’m going to use the complete.cases() function to eliminate the missing items for the sake of simplicity.

df <- df[complete.cases(df), ]

prop.se <- function(values) { # Takes in a vector of T/F values
  N <- length(values)
  prop <- mean(as.numeric(values))
  se <- sqrt(prop * (1 - prop) / N) ## textbook formula for SE of a proportion
  return(c(prop, se))
}

prop.se(df$fruit[df$obama])
## [1] 0.27477477 0.02118524
prop.se(df$fruit[!df$obama])
## [1] 0.23778920 0.01526314

In order to graph that it will help to convert the results into a data frame with clearly-labeled variable names and values:

prop.and.se <- data.frame(
  rbind(
    prop.se(df$fruit[df$obama]),
    prop.se(df$fruit[!df$obama])
  )
)

names(prop.and.se) <- c("proportion", "se")
prop.and.se$obama <- c(TRUE, FALSE)

prop.and.se
##   proportion         se obama
## 1  0.2747748 0.02118524  TRUE
## 2  0.2377892 0.01526314 FALSE

Now we can start to build a visualization

## library(ggplot2)  ## already imported w the tidyverse

p <- ggplot(prop.and.se, aes(x = obama, y = proportion)) +
  geom_point(aes(color = obama), size = 5)

p