We’ll import the .dta file first using an appropriate command from the readstata13 package. Also, after looking through the dataverse files, it turns out there is a tab-separated (TSV) version of the data, which can be imported with read.delim().
library(readstata13)
df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta')
## Same result
## df.t <- read.delim('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.tab')
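If you prefer the readr package, the same tab-separated file could also be read with read_delim() (a sketch, assuming readr is installed; readr wants the delimiter spelled out):

# library(readr)
# df.t <- read_delim('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.tab', delim = '\t')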
head(df)
## obama fruit year age male neob treat_year
## 1 0 0 2014 6 0 1 4
## 2 0 1 2014 5 0 1 4
## 3 0 0 2014 9 1 1 4
## 4 0 0 2014 5 1 1 4
## 5 0 0 2014 7 0 1 4
## 6 0 0 2014 9 0 1 4
summary(df)
## obama fruit year age
## Min. :0.0000 Min. :0.0000 Min. :2012 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2014 1st Qu.: 6.00
## Median :0.0000 Median :0.0000 Median :2015 Median : 8.00
## Mean :0.3639 Mean :0.2512 Mean :2014 Mean : 8.52
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2015 3rd Qu.:11.00
## Max. :1.0000 Max. :1.0000 Max. :2015 Max. :19.00
## NA's :1
## male neob treat_year
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.0000 Median :1.0000 Median :5.000
## Mean :0.5262 Mean :0.6361 Mean :4.406
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:6.000
## Max. :1.0000 Max. :1.0000 Max. :6.000
## NA's :1
There are a few strange things about the dataset. One is the neob column, which the codebook says means “not equal to obama”; in other words, it’s the logical negation of the obama column. The treat_year column is a unique index combining the obama column and the year column. See the codebook for more information. Happily, we only need to use the first two columns for now.
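Since neob is supposed to be the negation of obama, a quick sanity check on the raw 0/1 columns can confirm what the codebook says (a sketch; it should return TRUE):

# neob should always equal 1 - obama if the codebook is right
all(df$neob == 1 - df$obama, na.rm = TRUE)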
The table function is a great way to create contingency tables. We’ll recode the variables as logical TRUE/FALSE values first.
# Change both measures into T/F
df$obama = as.logical(df$obama)
df$fruit = as.logical(df$fruit)
# create the table with nice labels
obama.tbl <- table(fruit=df$fruit, flotus=df$obama)
obama.tbl
## flotus
## fruit FALSE TRUE
## FALSE 593 322
## TRUE 185 122
The simplest way to determine if the groups are independent is a \(\chi^2\) test. Since it’s a 2x2 comparison, we could also test for a difference in proportions using the prop.test() function.
chisq.test(obama.tbl)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
prop.test(obama.tbl)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: obama.tbl
## X-squared = 1.8637, df = 1, p-value = 0.1722
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01957437 0.11053751
## sample estimates:
## prop 1 prop 2
## 0.6480874 0.6026059
Notice that both functions report identical \(\chi^2\) test results and p-values.
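To see where prop.test() gets its “sample estimates,” note that it treats each row of the table as a group and the first column as the outcome of interest; the reported proportions are just row proportions, which we can verify directly:

# Share of each row (fruit = FALSE/TRUE) falling in the first column (flotus = FALSE)
obama.tbl[, 1] / rowSums(obama.tbl)
##     FALSE      TRUE 
## 0.6480874 0.6026059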
There are many ways you could answer the question about why these results are different from the regression results presented in the paper. We’ll discuss some of them in class this week and some more next week as part of our discussion of multiple regression.
First we want to get the proportion and standard error for fruit in each group. These can be calculated individually or using a function (guess which one we’ll document here). Also, note that I’m just going to use the complete.cases() function to eliminate the missing items for the sake of simplicity.
df <- df[complete.cases(df),]
prop.se = function(values){  # Takes in a vector of T/F values
  N = length(values)
  prop = mean(as.numeric(values))
  se = sqrt(prop * (1 - prop) / N)
  return(c(prop, se))
}
prop.se(df$fruit[df$obama])
## [1] 0.27477477 0.02118524
prop.se(df$fruit[!df$obama])
## [1] 0.23680824 0.01525123
In order to graph that, it will help to convert the results into a data frame:
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
prop.and.se <- data.frame(rbind(
  prop.se(df$fruit[df$obama]),
  prop.se(df$fruit[!df$obama])
))
names(prop.and.se) <- c("proportion", "se")
prop.and.se$obama <- c(TRUE, FALSE)
ggplot(prop.and.se, aes(x=obama, y=proportion)) +
  geom_point(aes(color=obama), size=5) +  # Add the points for the proportions
  geom_errorbar(aes(ymin=proportion - 1.96 * se,  # Add error bars
                    ymax=proportion + 1.96 * se,
                    width=0,  # Remove the whiskers
                    color=obama), size=1.1) +  # Make them a little bigger and color them
  coord_flip() +  # Flip the chart
  theme_light() +  # Change the theme (theme_minimal is also nice)
  scale_color_manual(values = c('gray','black'), guide=FALSE) +  # Change the colors
  ylim(0, .5) +  # Change the y axis to go from 0 to .5
  ylab('Proportion choosing fruit') +  # Add labels
  xlab('Picture shown was Michelle Obama')
Another way to do that involves a slightly different version of the function we created above, together with the group_by and summarize functions from the dplyr library.
prop.se = function(values){  # Takes in a vector of T/F values; returns only the standard error this time
  N = length(values)
  prop = mean(values)
  se = sqrt(prop * (1 - prop) / N)
  return(se)
}
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
prop.and.se = df %>%
  filter(!is.na(fruit)) %>%
  group_by(obama) %>%
  summarize(
    proportion = mean(fruit),
    se = prop.se(fruit)
  )
## Same exact plotting code
ggplot(prop.and.se, aes(x=obama, y=proportion)) +
  geom_point(aes(color=obama), size=5) +
  geom_errorbar(aes(ymin=proportion - 1.96 * se,
                    ymax=proportion + 1.96 * se,
                    width=0,
                    color=obama), size=1.1) +
  coord_flip() +
  theme_light() +
  scale_color_manual(values = c('gray','black'), guide=FALSE) +
  ylim(0, .5) +
  ylab('Proportion choosing fruit') +
  xlab('Picture shown was Michelle Obama')
Here’s one way to export our table, using write.csv:
write.csv(obama.tbl, file = 'crosstabs.csv')
We can make sure it worked:
read.csv('crosstabs.csv')
## X FALSE. TRUE.
## 1 FALSE 593 322
## 2 TRUE 185 122
We lost some information, because the dimension names of the table (fruit and flotus) don’t survive the trip through write.csv. Another way to do this would be to change the table into a data frame first, like this:
as.data.frame(obama.tbl)
## fruit flotus Freq
## 1 FALSE FALSE 593
## 2 TRUE FALSE 185
## 3 FALSE TRUE 322
## 4 TRUE TRUE 122
and then save that dataframe.
write.csv(as.data.frame(obama.tbl), file = 'crosstabs.csv')
read.csv('crosstabs.csv')
## X fruit flotus Freq
## 1 1 FALSE FALSE 593
## 2 2 TRUE FALSE 185
## 3 3 FALSE TRUE 322
## 4 4 TRUE TRUE 122
You could also use the xtable package to do this. The package has many functions to customize table outputs, but a relatively simple way to generate an HTML table looks like this:
library(xtable)
print(xtable(obama.tbl), type="html")
## <!-- html table generated in R 3.6.0 by xtable 1.8-4 package -->
## <!-- Thu May 16 08:40:44 2019 -->
## <table border=1>
## <tr> <th> </th> <th> FALSE </th> <th> TRUE </th> </tr>
## <tr> <td align="right"> FALSE </td> <td align="right"> 593 </td> <td align="right"> 322 </td> </tr>
## <tr> <td align="right"> TRUE </td> <td align="right"> 185 </td> <td align="right"> 122 </td> </tr>
## </table>
There is a lot of documentation and examples online to help you customize as you see fit.
# Solving for the slope coefficient b_1
b_1 = (3.9983 - 4.010) / -0.0883
b_1
## [1] 0.1325028
# Predicted value: plug x = 28 into the fitted line y-hat = 3.91 + 0.78x
y_bar = 3.91 + .78 * 28
y_bar
## [1] 25.75
The pt(q, df) function gives the proportion of the t-distribution which is less than q with df degrees of freedom.

(1 - pt(q = 2.23, df = 23)) * 2  # To get the 2-tailed p-value, multiply by 2
## [1] 0.03579437
pt(q = 2.23, df = 23, lower.tail = FALSE) * 2  # Alternatively, take the area of the upper tail and multiply by 2 (this works because t-distributions are symmetrical)
## [1] 0.03579437
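Because the t-distribution is symmetric, the area below -q equals the area above q, so yet another equivalent version is:

pt(q = -2.23, df = 23) * 2
## [1] 0.03579437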
The p-value is less than 0.05, so with a traditional \(\alpha = 0.05\) we would reject the null hypothesis and conclude that there is statistically significant evidence of an association between gestational age and head circumference for low birth-weight babies.
Final score had a strong positive correlation with “Modded”, “Starting score”, and “Karma”, and a strong negative correlation with “Anonymous user”. Other correlations were fairly weak.
The authors have presented a histogram of the dependent variable and while it’s not quite normal, it does not seem so skewed that we would be too worried about fitting a linear model. It might be helpful to see more information about the linearity of relationships between the independent variables and the outcome. In addition, the fact that this is data collected over time suggests that there may be temporal dependencies in the data that undermine the assumptions of a linear model. It might be helpful to see a residual plot and QQ-plot to check for homogeneity of variance and normal distribution of errors.
The coefficients in the table represent the expected change in final score for a one-unit change in a given measure, holding the other measures constant (in this sense, each coefficient is interpreted much like the slope in a regression with a single predictor). For example, if we focus on the first row in the table, for every 1-point increase in starting score, we would expect the final score to increase by 1.08 on average.
The t-statistic is the coefficient (minus 0, the null-hypothesis value) divided by its standard error, and the p-value is the proportion of the t-distribution which is as extreme or more extreme than that t-statistic. For the first coefficient, the extremely high t-statistic and extremely low p-value indicate that the probability of observing a relationship this strong under the null hypothesis of no association between starting score and final score is very, very small. This means we can reject the null hypothesis.
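As a sketch of that arithmetic (with made-up numbers, since the paper’s standard errors aren’t reproduced here), the t-statistic and p-value for any row could be recomputed like this:

est <- 1.08   # hypothetical coefficient, like the table's first row
se  <- 0.02   # hypothetical standard error
t.stat <- (est - 0) / se                                      # 54
p.val  <- pt(abs(t.stat), df = 1000, lower.tail = FALSE) * 2  # effectively 0; df would come from the fitted model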
The \(R^2\) value is the proportion of variation in final scores which is explained by these measures. In this case, the \(R^2\) value of 0.52 indicates that the variables in the model explain a substantial amount of the variation in the final scores, but that other explanatory factors likely exist that are not captured by the model.
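To make the definition concrete, \(R^2 = 1 - SS_{residual}/SS_{total}\); here it is applied to toy numbers (not the paper’s data):

y     <- c(2, 4, 5, 4, 5)          # hypothetical outcomes
y.hat <- c(2.2, 3.6, 4.8, 4.4, 5)  # hypothetical fitted values
1 - sum((y - y.hat)^2) / sum((y - mean(y))^2)
## [1] 0.9333333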