Northwestern University

MTS 525

You had the option of downloading the dataset in several formats, including a Stata 13 “.dta” file. For demonstration purposes, I’ll use that proprietary format here. I also uploaded a copy of the dataset to the course website for pedagogical purposes (so you can replicate my code here on your own machine).

There are a few ways to import data from Stata 13 files. One involves using the appropriately named `readstata13`

package. Another uses the `haven`

package (more of a general purpose tool for importing data from binary/proprietary formats). I’ll go with the `read_dta()`

command from `haven`

here because it works better with importing from a URL.

```
library(haven)
df <- read_dta(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta"))
```

`head(df)`

```
## # A tibble: 6 x 7
## obama fruit year age male neob treat_year
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 2014 6 0 1 4
## 2 0 1 2014 5 0 1 4
## 3 0 0 2014 9 1 1 4
## 4 0 0 2014 5 1 1 4
## 5 0 0 2014 7 0 1 4
## 6 0 0 2014 9 0 1 4
```

`summary(df)`

```
## obama fruit year age
## Min. :0.0000 Min. :0.0000 Min. :2012 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2014 1st Qu.: 6.00
## Median :0.0000 Median :0.0000 Median :2015 Median : 8.00
## Mean :0.3639 Mean :0.2512 Mean :2014 Mean : 8.52
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2015 3rd Qu.:11.00
## Max. :1.0000 Max. :1.0000 Max. :2015 Max. :19.00
## NA's :1
## male neob treat_year
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.0000 Median :1.0000 Median :5.000
## Mean :0.5262 Mean :0.6361 Mean :4.406
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:6.000
## Max. :1.0000 Max. :1.0000 Max. :6.000
## NA's :1
```

There are a few noteable things about the dataset. One is the `neob`

, which the codebook says means “not equal to obama”; in other words, it’s the converse of the `obama`

column. The `treat_year`

column is a unique index of the `obama`

column and the `year`

column. See the codebook for more information. Happily, we only need to use the first two columns for now.

I’ll drop everything else and convert those two columns into logical vectors here using a couple of tidyverse `dplyr`

commands:

```
library(tidyverse)
df <- df %>%
select(obama, fruit) %>%
mutate(
obama = as.logical(obama),
fruit = as.logical(fruit)
)
head(df)
```

```
## # A tibble: 6 x 2
## obama fruit
## <lgl> <lgl>
## 1 FALSE FALSE
## 2 FALSE TRUE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE
```

I’ll run summary again and then make my contingency table.

`summary(df)`

```
## obama fruit
## Mode :logical Mode :logical
## FALSE:778 FALSE:915
## TRUE :445 TRUE :307
## NA's :1
```

```
obama.tbl <- table(took_fruit = df$fruit, saw_flotus = df$obama)
obama.tbl
```

```
## saw_flotus
## took_fruit FALSE TRUE
## FALSE 593 322
## TRUE 185 122
```

```
obama.prop.tbl <- proportions(table(took_fruit = df$fruit, saw_flotus = df$obama), margin = 2)
obama.prop.tbl
```

```
## saw_flotus
## took_fruit FALSE TRUE
## FALSE 0.7622108 0.7252252
## TRUE 0.2377892 0.2747748
```

So, the test we want to conduct here focuses on the difference between the proportion of the two groups (those shown a picture of Michelle Obama vs those not shown a picture of Michelle Obama) who took fruit vs. candy. Labeling the difference in proportions as \(\Delta_{fruit}\), the comparison can be constructed around the following (two-sided) hypothesis test

\[ H_0:~\Delta_{fruit} = 0\] \[ H_A:~\Delta_{fruit} \neq 0\] As discussed at length in *OpenIntro* Chapter 6, a great way to determine if two groups are independent (in terms of proportions or counts) is a \(\chi^2\) test.

Are the conditions for a valid test met? It seems so, since there are many observations in each cell of the table being compared and the observations appear to have been collected in a way that ensures independence.

The \(\chi^2\) test is implemented as `chisq.test()`

in R. Since it’s a 2x2 comparison, we can also test for a difference in proportions using the `prop.test()`

function. Let’s do both.

`chisq.test(obama.tbl)`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
```

`prop.test(obama.tbl)`

```
##
## 2-sample test for equality of proportions with continuity correction
##
## data: obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01957437 0.11053751
## sample estimates:
## prop 1 prop 2
## 0.6480874 0.6026059
```

Notice that both functions report identical \(\chi^2\) test results and p-values. Did you expect this based on the *OpenIntro* reading?

Recall that if you want to double-check that p-value, you can also calculate it “by-hand” using the `pchisq()`

function:

`pchisq(1.8637, df = 1, lower.tail = FALSE)`

`## [1] 0.1721984`

We’ll discuss the interpretation of these results in our class session this week.

In order to replicate the top panel of Figure 1, we’ll first want to calculate the proportion and standard error for fruit-takers in the treatment and control groups. These can be calculated individually or using a function (guess which one we’ll document here). Also, note that I’m going to use the `complete.cases()`

function to eliminate the missing items for the sake of simplicity.

```
df <- df[complete.cases(df), ]
prop.se <- function(values) { # Takes in a vector of T/F values
N <- length(values)
prop <- mean(as.numeric(values))
se <- sqrt(prop * (1 - prop) / N) ## textbook formula for SE of a proportion
return(c(prop, se))
}
prop.se(df$fruit[df$obama])
```

`## [1] 0.27477477 0.02118524`

`prop.se(df$fruit[!df$obama])`

`## [1] 0.23778920 0.01526314`

In order to graph that it will help to convert the results into a data frame with clearly-labeled variable names and values:

```
prop.and.se <- data.frame(
rbind(
prop.se(df$fruit[df$obama]),
prop.se(df$fruit[!df$obama])
)
)
names(prop.and.se) <- c("proportion", "se")
prop.and.se$obama <- c(TRUE, FALSE)
prop.and.se
```

```
## proportion se obama
## 1 0.2747748 0.02118524 TRUE
## 2 0.2377892 0.01526314 FALSE
```

Now we can start to build a visualization

```
## library(ggplot2) ## already imported w the tidyverse
p <- ggplot(prop.and.se, aes(x = obama, y = proportion)) +
geom_point(aes(color = obama), size = 5)
p
```