---
title: "Problem set 4: Worked solutions"
subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
author: "Aaron Shaw"
date: "October 19, 2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: true
      smooth_scroll: true
    theme: readable
  pdf_document:
    toc: yes
    toc_depth: '3'
  header-includes:
  - \newcommand{\lt}{<}
  - \newcommand{\gt}{>}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, tidy='styler')
```

# Programming Challenges  
## PC1. Import  
Load the dataset. As usual, you'll need to edit this block of code to point at the appropriate file location to make it work on your machine.
```{r}
pop <- read.delim(file=url("https://communitydata.science/~ads/teaching/2020/stats/data/week_06/population.tsv"))

# Same thing using `read_tsv()`. Notice that it can handle URLs directly.
library(tidyverse)

pop <- read_tsv("https://communitydata.science/~ads/teaching/2020/stats/data/week_06/population.tsv")

head(pop)

```

While we're at it, let's import the data from Problem Set 2:

```{r}
## Insert an appropriate group number as needed:
ps2 <- read.csv(file=url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"), row.names=NULL)

head(ps2)
```


## PC2. Compare means  
Straightforward enough:  

```{r}
mean(pop$x, na.rm=T)

mean(ps2$x, na.rm=T)
``` 

Given the structure of the dataset for this week, it's also no big deal to calculate all of the group means (and you can compare your answers against these if you like). Notice that I can define and call a function within my call to `tapply()`:
```{r}
s.means <- tapply(pop$x, pop$group, function(x){
                    round(mean(x, na.rm=T), 2)})

s.means
```

For anyone who decided to try this using a Tidyverse approach, here's a way you might do that with a call to `group_by()` and `summarize()`:
```{r}
pop %>%
  group_by(group) %>%
  summarize(
    mean(x, na.rm=T)
  )
```

Knowing that each group was a random sample from the entire population, we might think of the individual group (sample) means as constructing a *sampling distribution of the (population) mean*.


## PC3. CI of the mean  

I'll do this two ways. First, I can plug in the values from one group sample into the formula for the standard error and then add/subtract twice the standard error from the mean to find the 95% CI.
```{r}
se <- sd(ps2$x, na.rm=T) / sqrt(length(ps2$x[!is.na(ps2$x)]))

mean(ps2$x, na.rm=T)-(2*se) ## lower
mean(ps2$x, na.rm=T)+(2*se) ## upper
```

Now, I'll write a more general function to calculate confidence intervals. Note that my function here takes an argument for `x` as well as an `alpha` argument with a default value of 0.05. The function then goes on to divide alpha (and it's complementary probability) by 2. Can you explain why this division step is necessary?

```{r}
ci <- function (x, alpha=0.05) {
    x <- x[!is.na(x)]
    probs <- c(alpha/2, 1-alpha/2)
    critical.values <- qnorm(probs, mean=0, sd=1)
    se <- sd(x)/sqrt(length(x))
    return(
      round( mean(x) + critical.values * se, 2) )
}

## Here I run the function on the group 3 data
ci(ps2$x)
```

Again, it's possible to use `tapply()` to estimate this for every group:

```{r}
group.confints <- tapply(pop$x, pop$group, ci)
group.confints
```

Recall that the population mean for `x` was 2.60. Since the group samples are random samples, we should not be surprised that the group means are different from the population mean. We should also not be surprised that the 95% CI for the population mean estimated from at least one of the samples does *not* include the true population mean. Recall that our confidence interval is 95%, so we can expect to be wrong about 1/20 times (on average)! In this case, we got unlucky since 2 of our confidence intervals around the sample means do not include the population mean.  

## PC4. Compare distributions  

I'll start with the single comparison between the population and my Problem Set 2 sample using base-R functions:

```{r}
hist(pop$x)
hist(ps2$x)

summary(pop$x)
sd(pop$x, na.rm=T)

summary(ps2$x)
sd(ps2$x, na.rm=T)
```

Notice the differences between the shapes of the histograms as well as the summary statistics. In particular, you might consider that the sample has a *larger* standard deviation than the population. Given what you know about this statistic and the conditions under which this data was generated, could it have worked out otherwise (i.e., could the standard deviation of the population have been larger than that of the sample?). How?


Moving right along, here's `tapply()` code to construct the same comparisons for each group. 

```{r}
tapply(pop$x, pop$group, summary)

tapply(pop$x, pop$group, sd, na.rm=T)

```

And here's a visual comparison. A ggplot2 "faceted" set of histograms will produce an easy visual comparison (and also makes for concise code). First I'll do it in a slightly simplified way:
```{r}
library(ggplot2)
ggplot(pop, aes(x)) + geom_histogram() + labs(title="Population distribution of X")

ggplot(pop, aes(x)) + geom_histogram() + facet_wrap(. ~ group) + labs(title="Conditional distributions of X")
```

It's possible to see some of the differences there, but it might be helpful to overlay a brightly colored density plot. In the process, I can also convert the histograms themselves into density plots to make it easier to directly contrast each group against the population without having to worry about the divergent y-axis scales. The code below does that by providing a call to `after_stat(density)` inside the initial plotting aesthetics and a `color` argument to `geom_density()`. For more examples like this, you might search the phrase "density histogram" online.

```{r}
library(ggplot2)
ggplot(pop, aes(x, after_stat(density))) + geom_histogram() + geom_density(color="red") + labs(title="Population distribution of X")

ggplot(pop, aes(x, after_stat(density))) + geom_histogram(aes(x,)) + facet_wrap(. ~ group) + geom_density(color="red") + labs(title="Group distributions of X")

```

The conditional distributions all look a little bit different from each other and from the population distribution. Again, none of this should be shocking given the relationship of the samples to the population.  

## PC5. Std. dev. of conditional means  

I can do this by constructing my vector of group means again and calling `sd()` on that.
```{r}
s.means <- tapply(pop$x, pop$group, mean, na.rm=T)
s.means

sd(s.means)

## This was the standard error from one of the groups that I calculated earlier:
se
```
As mentioned earlier, the distribution of sample means drawn from the population is the *sampling distribution*. The standard error of the mean estimated from any of the individual groups/samples should be a good approximation of (but not necessarily equal to!) the standard deviation of the sampling distribution of the means. 

## PC6. A simulation  

Since there's going to be some randomness in the next few commands, I'll set a seed value to ensure the results are reproducible and consistent on other machines.

```{r}
set.seed(20201019)
```

**(a) Simulate draws from a known distribution**  

There are a few ways to do this, but one of the most intuitive is to define my vector of possible values and then sample it repeatedly *with replacement*.
```{r}
my.vals <- seq(0, 9)
pop.unif <- sample(my.vals, 10000, replace=TRUE)
```

**(b) Take the mean**  

```{r}
mean(pop.unif)
hist(pop.unif)
```

**(c) Draw samples and describe them**  

Again, many ways to go about this. A good first step might be to do the sampling part once:

```{r}
sample(pop.unif, 2, replace=T)
```
Now I can call the mean of that:

```{r}
mean(sample(pop.unif, 2, replace=T))
```
Now, I want to run that 100 times. I might do that with a for-loop and store the values in a new vector:
```{r}
sample.means <- 0

for(i in 1:100){
  sample.means[i] <- mean(sample(pop.unif, 2, replace=T))
}
```

You could also do the same thing by nesting the sampling step inside of a call to `sapply()` that runs over an arbitrary index (`rep(1, 100)` here).

```{r}
sample.means.2 <- sapply( rep(1, 100), function (x) { 
  mean(sample(pop.unif, 2))}
  )
```

Then I can plot them: 
```{r}
hist(sample.means)

hist(sample.means.2)
```

Note that you certainly didn't need to do this twice, but having done so provides a nice illustration of sampling variability!

**(d) Draw more samples and describe them too**  

Same `tapply()` command as (c) just changing the sample size

```{r}
hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 10))}))

hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 100))}))
```

Those axis labels are just terrible, but let's try to focus on the substance of the plots. Some notable things you might observe include that the sampling distribution of the means approaches normality as each sample gets larger in size (and this is true whether the population we draw from is uniform, log-normal, or really just about any other smooth distribution). In this simulation, the number of samples is constant (100), so the changes in the distribution are *solely* due to changes in sample size. This is an illustration of some aspects of the *central limit theorem* applied to sampling distributions. It is also an illustration of the *t-distribution* (the basis for "t-tests" that you'll read about soon).  

# Reading Questions  
## RQ1. CIs vs. P-values?  

We'll discuss this one as a group and I'm eager to hear others' thoughts. Personally, I am inclined to agree with Reinhart as I find the focus on p-values somewhat thought-stopping and would prefer to see researchers and publications embrace reporting standards that yield more intuitive, flexible, uncertain, and meaningful interpretations of findings in terms of the original measurements/units. Confidence intervals usually accomplish these goals more effectively than p-values and, in that respect, I like confidence intervals quite a lot. 

## RQ2 Emotional contagion revisited  
**(a) Hypotheses**    

In my words (or rather formulas since I think that's less ambiguous), the key pairs of null/alternative hypotheses look something like the following:

Let $\Delta$ be the parameter estimate for the difference in mean percentage of positive ($\mu_{pos}$) and negative ($\mu_{neg}$) words between the experimental and control conditions for the treatments of reduced negative content ($R_{neg}$ and reduced positive content ($R_{pos}$).

For the reduced negative content conditions (the left-hand side of Figure 1), the paper tests:

$$HR_{neg}1_0: \Delta_{\mu_{pos}} = 0$$
$$HR_{neg}1_a: \Delta{\mu_{pos}} \neq 0$$
And:
$$HR_{neg}2_0: \Delta_{\mu_{neg}} = 0$$
$$HR_{neg}2_a: \Delta_{\mu_{neg}} \neq 0$$
Then, for the reduced positive content conditions (the right-hand side of Figure 1), the paper tests:

$$HR_{pos}1_0:~~ \Delta_{\mu_{pos}} = 0$$
$$HR_{pos}1_a:~~ \Delta{\mu_{pos}} \neq 0$$

And:

$$HR_{pos}2_0:~~ \Delta_{\mu_{neg}} = 0$$
$$HR_{pos}2_a:~~ \Delta_{\mu_{neg}} \neq 0$$
Note that the theories the authors used to motivate the study imply directions for the alternative hypotheses, but nothing in the description of the analysis suggests that they used one-tailed tests. I've written these all in terms of undirected two-tailed tests to correspond to the analysis conducted in the paper. That said, given that the theories correspond to specific directions you might (arguably more accurately) have written the hypotheses in terms of directed inequalities (e.g., "$\gt$" or "$\lt$").


**(b) Describing the effects**  

The authors' estimates suggest that reduced negative News Feed content causes an increase in the percentage of positive words and a decrease in the percentage of negative words in subsequent News Feed posts by study participants (supporting $HR_{neg}1_a$ and $HR_{neg}2_a$ respectively).

They also find that reduced positive News Feed content causes a decrease in the percentage of negative words and an increase in the percentage of positive words in subsequent News Feed posts (supporting $HR_{pos}1_a$ and $HR_{pos}2_a$)

**(c) Statistical vs. practical significance**  

Cohen's $d$ puts estimates of experimental effects in standardized units (much like a Z-score!) in order to help understand their size relative to the underlying distribution of the dependent variable(s). The $d$ values for each of the effects estimated in the paper are 0.02, 0.001, 0.02, and 0.008 respectively (in the order presented in the paper, not in order of the hypotheses above). These are miniscule by the standards of most approaches to Cohen's $d$! However, as the authors' argue, the treatment itself is quite narrow in scope, suggesting that the presence of any treatment effect at all is an indication of the underlying phenomenon (emotional contagion). Personally, I find it difficult to attribute much substantive significance to the results because I'm not even convinced that tiny shifts in the percentage of positive/negative words used in News Feed updates accurately index meaningful emotional shifts (I might call it linguistic contagion instead?). That said, I have a hard time thinking about micro-level psychological processes and I'm probably being overly narrow/skeptical in my response. Despite these concerns and the ethical considerations that attracted so much public attention, I consider this a clever, well-executed study and I think it's quite compelling. I expect many of you will have different opinions of various kinds and I'm eager to hear about them.