problem_sets/week_04/ps4-worked_solution.Rmd

   1 ---
   2 title: 'Week 4 Problem set: Worked solutions'
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "April 25, 2019"
   6 output: html_document
   7 ---
   8
   9 ```{r setup, include=FALSE}
  10 knitr::opts_chunk$set(echo = TRUE)
  11 ```
  12
  13 # Programming challenges
  14
  15
  16 ## PC2
  17
  18 You may need to edit these first lines to work on your own machine. Note that for working with .Rmd files interactively in Rstudio you may find it easier to do this using the drop down menus: "Session" → "Set Working Directory" → "To Source File Location"
  19
  20 ```{r}
  21 ## setwd("~/Documents/Teaching/2019/stats/")
  22 ## list.files("data/week_04")
  23
  24 mobile <- read.csv("data/week_04/COS-Statistics-Mobile_Sessions.csv")
  25 total <- read.csv("data/week_04/COS-Statistics-Gov-Domains-Only.csv")
  26 ```
  27
  28 I'll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.
  29 ```{r}
  30 summary.df <- function (d) {
  31     print(nrow(d))
  32     print(ncol(d))
  33     print(head(d))
  34     print(d[sample(seq(1, nrow(d)), 5),])
  35 }
  36 ```
  37
  38 Then I can run these two lines a few times to look at some samples
  39 ```{r}
  40 summary.df(mobile)
  41
  42 summary.df(total)
  43 ```
  44 I can check for missing values and summarize the different columns using `lapply`:
  45
  46 ```{r}
  47 lapply(total, summary)
  48
  49 lapply(mobile, summary)
  50 ```
  51
  52 ## PC3
  53
  54 First let's create a table/array using `tapply` that sums pageviews per month across all the sites:
  55 ```{r}
  56 total.views.bymonth.tbl <- tapply(total$pageviews, total$month, sum)
  57 total.views.bymonth.tbl
  58 ```
  59 If you run `class` on `total.views.bymonth.tbl` you'll notice it's not a data frame yet. We can change that:
  60 ```{r}
  61 total.views <- data.frame(months=names(total.views.bymonth.tbl),
  62                           total=total.views.bymonth.tbl)
  63
  64 head(total.views)
  65 ```
  66 Let's cleanup the rownames (this would all work the same if i didn't do this part).
  67
  68 ```{r}
  69 rownames(total.views) <- NULL
  70
  71 head(total.views)
  72 ```
  73 ## PC4
  74 Onwards to the mobile dataset!
  75
  76 Here we have a challenge because we have to estimate total pageviews (it's not given in the raw dataset). I'll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I'm not sure what else we could do under the circumstances.
  77 ```{r}
  78 mobile$total.pages <- mobile$Sessions * mobile$PagesPerSession
  79 ```
  80 Then, making the views-per-month array is more or less copy/pasted from above:
  81 ```{r}
  82 mobile.views.bymonth.tbl <- tapply(mobile$total.pages, mobile$Month, sum)
  83 mobile.views.bymonth.tbl
  84
  85 mobile.views <- data.frame(months=names(mobile.views.bymonth.tbl),
  86                            mobile=mobile.views.bymonth.tbl)
  87 rownames(mobile.views) <- NULL
  88 ```
  89 ## PC5
  90 Now we merge the two datasets. Notice that I have created the `months` column in both datasets with *exactly* the same name.
  91 ```{r}
  92 views <- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by="months")
  93 ```
  94
  95 These are sorted in strange ways and will be difficult to graph because the dates are stored as characters. Let's convert them into Date objects. Then I can use `sort.list` to sort everything.
  96
  97 ```{r}
  98 views$months <- as.Date(views$months, format="%m/%d/%Y %H:%M:%S")
  99
 100 views <- views[sort.list(views$months),]
 101 ```
 102
 103 Take a look at the data. Some rows are missing observations. We can drop those rows using `complete.cases`:
 104 ```{r}
 105 lapply(views, summary)
 106
 107 views[rowSums(is.na(views)) > 0,]
 108
 109 views.complete <- views[complete.cases(views),]
 110 ```
 111
 112 ## PC6
 113
 114 For my proportion measure, I'll take the mobile views divided by the total views.
 115
 116 ```{r}
 117 views.complete$prop.mobile <- views.complete$mobile / views.complete$total
 118
 119 ```
 120 ## PC7.
 121
 122 ```{r}
 123 library(ggplot2)
 124 ggplot(data=views.complete) + aes(x=months, y=prop.mobile) + geom_point() + geom_line() + scale_y_continuous(limits=c(0, 1))
 125
 126 ```
 127
 128 (a) For my estimate of the proportion I'll just calculate an average from the monthly numbers:
 129
 130 ```{r}
 131 mean(views.complete$prop.mobile)
 132
 133 ```
 134 (b) From the graph, this proportion seems quite stable with the exception of a single outlier month in late 2015.
 135
 136 # Statistical questions
 137
 138 ## SQ1 — 4.8
 139
 140 The general formula for a confidence interval is $point~estimate~±~z^*\times~SE$. First, identify the three different values. The point estimate is 45%, $z^* = 2.58$ for a 99% confidence level (that's the number of standard deviations around the mean that ensure that 99% of a Z-score distribution is included), and $SE = 2.4\%$.
 141
 142 With this we can plug and chug:
 143
 144 $$52\% ± 2.58 \times 2.4\% → (45.8\%, 58.2\%)$$
 145
 146 From this data we are 99% confident that between 45.8% and 58.2% U.S. adult Twitter users get some news through the site.
 147
 148 ## SQ2 — 4.10
 149
 150 (a) False. See the answer to 4.8 above. With $\alpha = 0.01$, we can consult the 99% confidence interval. It includes 50% but also goes lower.
 151
 152 (b) False. The standard error of the sample does not contain any information about the proportion of the population included in the sample. It measures the variability of the sample distribution.
 153
 154 (c) False. Increasing the sample size will decrease the standard error. Consider the formula: $\frac{\sigma}{\sqrt{n}}$. A smaller $n$ will result in a larger standard error.
 155
 156 (d) False. All else being equal, a lower confidence interval will cover a narrower range. A higher interval will cover a wider range. To confirm this, revisit the formula in SQ1 above. and plug in the corresponding alpha value of .9, resulting in a $z^*$ value of 1.28 (see the Z-score table in the back of *OpenIntro*).
 157
 158 ## SQ3 — 4.19
 159
 160 The hypotheses should be about the population mean ($\mu$) and not the sample mean ($\bar{x}$). The null hypothesis should have an equal sign. The alternative hypothesis should be about the critical value, not the sample mean. The following would have been better:
 161
 162 $$H_0: \mu = 10~hours$$
 163 $$H_A: \mu \gt 10~hours$$
 164
 165 ## SQ4 — 4.32
 166
 167 (a) True. See part (d) of SQ2 above.
 168 (b) False. A lower alpha value is the probability of Type 1 Error, so reducing the one reduces the other.
 169 (c) False. Failure to reject the null is evidence that we cannot conclude that the true value is different from the null. This is **very** different from evidence that the null hypothesis is true.
 170 (d) True. Consult the section of *OpenIntro* discussing statistical power and Type 2 Error.
 171 (e) True. We'll revisit this in a moment below, but consider the relationship between statistical test, the standard error, and the sample size. As the sample size increases towards infinity, the standard error approaches zero, resulting in arbitrarily precise point estimates that will result in rejecting the null hypothesis for *any* value of a test statistic for any critical value of $\alpha$.
 172
 173 # Empirical paper questions
 174
 175 ## EQ1
 176
 177 In my words (or rather formulas since I think that's less ambiguous), the key pairs of null/alternative hypotheses look something like the following:
 178
 179 Let $\Delta$ be the parameter estimate for the difference in mean percentage of positive ($\mu_{pos}$) and negative ($\mu_{neg}$) words between the experimental and control conditions for the treatments of reduced negative content ($R_{neg}$ and reduced positive content ($R_{pos}$).
 180
 181 For the reduced negative content conditions (the left-hand side of Figure 1), the paper tests:
 182
 183 $$HR_{neg}1_0: \Delta_{\mu_{pos}} = 0$$
 184 $$HR_{neg}1_a: \Delta{\mu_{pos}} \gt 0$$
 185 And:
 186 $$HR_{neg}2_0: \Delta_{\mu_{neg}} = 0$$
 187 $$HR_{neg}2_a: \Delta_{\mu_{neg}} \lt 0$$
 188 Then, for the reduced positive content conditions (the right-hand side of Figure 1), the paper tests:
 189
 190 $$HR_{pos}1_0:~~ \Delta_{\mu_{pos}} = 0$$
 191 $$HR_{pos}1_a:~~ \Delta{\mu_{pos}} \lt 0$$
 192
 193 And:
 194
 195 $$HR_{pos}2_0:~~ \Delta_{\mu_{neg}} = 0$$
 196 $$HR_{pos}2_a:~~ \Delta_{\mu_{neg}} \gt 0$$
 197 Note that the theories the authors used to motivate the study imply directions for the alternative hypotheses, but nothing in the description of the analysis suggests that they used one-tailed tests. I've written these all in terms of specific directions here to correspond with the theories stated in the paper. They could also (arguably more accurately) have been written in terms of inequalities ("$\neq$").
 198
 199 ## EQ2
 200
 201 The authors' estimates suggest that reduced negative News Feed content causes an increase in the percentage of positive words and a decrease in the percentage of negative words in subsequent News Feed posts by study participants (supporting $HR_{neg}1_a$ and $HR_{neg}2_a$ respectively).
 202
 203 They also find that reduced positive News Feed content causes a decrease in the percentage of negative words and an increase in the percentage of positive words in susbequent News Feed posts (supporting $HR_{pos}1_a$ and $HR_{pos}2_a$)
 204
 205 ## EQ3
 206
 207 Cohen's $d$ puts estimates of experimental effects in standardized units (much like a Z-score!) in order to help understand their size relative to the underlying distribution of the dependent variable(s). The d-values for each of the effects estimated in the paper are 0.02, 0.001, 0.02, and 0.008 respectively (in the order presented in the paper, not in order of the hypotheses above!). These are miniscule effects. However, the treatment itself is also quite narrow in scope, suggesting that the presence of any treatment effect at all is an indication of the underlying phenomenon (emotional contagion). Personally, I find it difficult to attribute much substantive significance to the results because I'm not even convinced that tiny shifts in the percentage of positive/negative words used in News Feed updates accurately index meaningful emotional shifts (maybe we could call it linguistic contagion instead?). Despite these concerns and the ethical considerations that attracted so much public attention, I consider this a clever, well-executed study and I think it's quite compelling. I expect many of you will have different opinions of various kinds!