psets/pset2-worked_solution.rmd

   1 ---
   2 title: "Problem set 2: Worked solutions"
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "October 5, 2020"
   6 output:
   7   html_document:
   8     toc: yes
   9     toc_depth: 3
  10     toc_float:
  11       collapsed: true
  12       smooth_scroll: true
  13     theme: readable
  14   pdf_document:
  15     toc: yes
  16     toc_depth: '3'
  17 ---
  18
  19 ```{r setup, include=FALSE}
  20 knitr::opts_chunk$set(echo = TRUE)
  21 ```
  22
  23 ## Programming challenges
  24
  25 ### PC 1: Import data from a .csv file
  26
  27 So, the nice thing about running `set.seed()` before I chose my random group number in the previous problem set is that I can replicate the **exact same** random draw again:
  28
  29 ```{r}
  30 set.seed(220920)
  31 sample(x=c(1:20), size=1)
  32 ```
  33 My dataset number for the rest of this programming challenge is, once again, 3. I'll read the dataset in using the `read.csv()` command (note that I assign it to an object).
  34
  35 ```{r}
  36 w4 <- read.csv(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/group_03.csv"))
  37
  38 ```
  39 ### PC 2: Explore and describe the data
  40
  41 **Note: You can complete this programming challenge in numerous ways. What follows is one path that uses a number of the commands introduced in the r tutorials provided for the course**
  42
  43 Let's start by just seeing what's here:
  44 ```{r}
  45 dim(w4)
  46 lapply(w4, class)
  47 ```
  48
  49 Summarizing the variables:
  50 ```{r}
  51 summary(w4)
  52 ```
  53 It looks like `x` and `y` are continuous. That last call to `summary()` gave us the range, median, mean and IQR for both. I'll go ahead and create some boxplots to see which measure of spread might be most appropriate:
  54
  55 ```{r}
  56 boxplot(w4$x)
  57 boxplot(w4$y)
  58 ```
  59
  60 Both are a little skewed with a longer "right" tail. Given that the skew isn't totally egregious, I'll calculate the standard deviation for each as well (and remember the `na.rm=TRUE` bit otherwise this won't work).
  61
  62 ```{r}
  63 sd(w4$x, na.rm=TRUE)
  64 sd(w4$y, na.rm=TRUE)
  65 ```
  66
  67 Despite the class R attributes to the other variables, it sure looks like j, l, and k are categorical/discrete variables (note the nice round numbers in the range), so let's look at the first few values of each to see if that seems right:summarize those with contingency tables
  68 ```{r}
  69 head(w4[, c("j","l", "k")])
  70 ```
  71 This confirms my intuitions, so I'll go ahead and create univariate contingency tables for each:
  72 ```{r}
  73 table(w4$j)
  74 table(w4$l)
  75 table(w4$k)
  76 ```
  77
  78
  79 ### PC 3: Use and write user-defined functions
  80
  81 For starters, I'll go copy-paste that `my.mean()` function from the tutorial. Remember, that unless I execute the code that defines the function, R doesn't know that it exists! Once I've executed it, I can call it and calculate the value for my `x` variable.
  82
  83 ```{r}
  84 my.mean <- function(z){
  85   z <- z[!is.na(z)]
  86   sigma <- sum(z)
  87   n <- length(z)
  88   out.value <- sigma/n
  89   return(out.value)
  90 }
  91
  92 my.mean(w4$x)
  93 ```
  94
  95 Creating a function to calculate the median is quite a bit harder! Medians can be complicated because the calculation depends on whether there's a midpoint of the data or not. Luckily in this case, there should be 95 non-missing values in the x vector, so a midpoint exists (the 48th value) and the following should work:
  96
  97 ```{r}
  98 my.median <- function(z){
  99   z <- z[!is.na(z)]
 100   midpoint <- (length(z) / 2) + .5
 101   sorted <- sort(z)
 102   output <- sorted[midpoint]
 103   return(output)
 104 }
 105
 106 my.median(w4$x)
 107 ```
 108 I can check my work here using R's built-in functions:
 109
 110 ```{r}
 111 mean(w4$x, na.rm=TRUE)
 112 median(w4$x, na.rm=TRUE)
 113 ```
 114 Note that the functions here work for this problem set and the `x` variable provided here, but might break under other circumstances. Can you think of how they might break? Can you imagine ways that you might rewrite either the `my.mean()` or the `my.median()` functions presented here (or in your own solutions) to be more robust to these potential problems?
 115
 116 ### PC 4: Compare two vectors
 117
 118 For starters, I have to load and cleanup my week 3 variable again:
 119 ```{r week 3 data}
 120
 121 load(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_03/group_03.RData"))
 122
 123 d[d < 0] <- NA
 124 d <- log1p(d)
 125 ```
 126 Onwards to some comparisons. At first glance, they look very similar...
 127
 128 ```{r}
 129 summary(w4$x)
 130 summary(d)
 131 ```
 132
 133 But closer inspection suggests something is off...
 134 ```{r}
 135 table(w4$x == d)
 136 ```
 137 It might be good to look at the first few values of each to see if there's anything noteworthy...
 138
 139 ```{r}
 140 head(w4$x)
 141 head(d)
 142 ```
 143 It looks like the vectors might match if I round them to six decimal places. I can create a table comparing the sorted rounded values to check this.
 144
 145 ```{r}
 146 table(round(w4$x, 6) == round(d, 6))
 147
 148 ```
 149 Note that you should be able to explain what each part of that last line of code is doing!
 150
 151 ### PC 5: Cleanup/tidy your data
 152
 153
 154 ```{r}
 155 w4$j <- as.logical(w4$j)
 156 w4$l <- as.logical(w4$l)
 157
 158 ### Note that I am creating a new column in my w4 dataframe so I can double check this:
 159 w4$k.factor <- factor(w4$k,
 160                       levels=c(0,1,2,3),
 161                       labels=c("none", "some", "lots", "all"))
 162
 163 ### Spotcheck to make sure it looks good
 164 head(w4, 10)
 165 summary(w4$k)
 166
 167 ### Now cleanup my extra column:
 168 w4$k <- w4$k.factor
 169 w4$k.factor <- NULL
 170
 171 head(w4)
 172 ```
 173
 174 ### PC 6: Calculate conditional summary statistics
 175
 176 As usual, there are many ways to approach this. My approach here uses the `tapply()` function, passing in calls to the `summary()` and `sd()` functions respectively:
 177
 178 ```{r}
 179
 180 tapply(w4$x, w4$k, summary)
 181 tapply(w4$x, w4$k, sd, na.rm=TRUE)
 182 ```
 183 Here's a tidyverse example that achieves the same goal:
 184
 185 ```{r, message=FALSE}
 186 library(tidyverse)
 187
 188 w4 %>%
 189   group_by(k) %>%
 190   summarize(
 191     n= n(),
 192     min_x = min(x, na.rm=T),
 193     max_x = max(x, na.rm=T),
 194     mean_x = mean(x, na.rm=T),
 195     sd_x = sd(x, na.rm=T)
 196   )
 197
 198 ```
 199
 200 Note that the tidyverse has "silently" dropped our "none" level from the summary because there are no values of it in the dataset. This could be altered by passing an argument of `.drop=FALSE` to the `group_by()` call. It produces very ugly results, so I have ommitted it here.
 201
 202
 203 ### PC 7: Create a bivariate table
 204
 205 ```{r}
 206 table(w4$k, w4$j)
 207 ```
 208
 209 ### PC 8: Create a bivariate visualizations
 210
 211
 212 First create a basic plotting object. Then add the call to `geom_point()` to show just the x and y:
 213 ```{r}
 214 p <- ggplot(w4, mapping=aes(x=x, y=y))
 215
 216 p + geom_point()
 217 ```
 218
 219 Now, let's add some additional dimensions to the plot using our other variables. Note that the following line of code works very well with the recoded version of the dataset, but will produce errors if I try to run it on the original version that was treating `j`, `k`, and `l` as integers:
 220
 221 ```{r}
 222 p + geom_point(aes(color=j, shape=l, size=k))
 223 ```
 224
 225 This is very visually cluttered and therefore probably not a great data visualization. It's hard to tell whether there might be any patterns other than the positive association of `x` and `y` reflected in the very clear upward slope of the points. That said, it's good to be able to incorporate multiple dimensions of variation into your plots and sometimes you can learn/convey a lot by doing so.
 226
 227 ## Statistical questions
 228 ### SQ 1: Interpret bivariate analyses
 229
 230 #### 1.1 Interpret conditional means
 231
 232 The most noteworthy thing is that none of the riders logging miles in this subset of the dataset fell into the "none" category measuring how frequently they rode in weather defined to be "bad" by the NOAA. They must be a hardy bunch of cyclists! Bayond that, the only possible pattern seems like there might be a weak association between the distance ridden and riding on "all" bad weather days (the mean and max are slightly higher in this group than the others). However, this might just reflect the presence of a few outliers as the spread (standard deviation) of distances ridden among those in the "all" bad weather days ridden category is also a bit bigger than in other categories.
 233
 234 #### 1.2 Interpret contingency table
 235
 236 The contingency table doesn't seem to show much of a relationship between riding January-March and riding in bad weather days. Maybe it was a relatively mild winter in the year of data collection?
 237
 238 #### 1.3 Interpret scatterplot
 239
 240 The plot reveals a very strong positive relationship between average daily distance ridden and income. It is difficult to see any other patterns. This suggests that wealthier cyclists may be more likely to make further trips.
 241
 242 ### SQ 2: Birthdays revisited (optional bonus)
 243
 244 #### SQ2.1 Which bet?
 245
 246 If you are willing to assume that birthdays in the class are independent (not a terrible assumption) and that birthdays are distributed randomly throughout the year (a terrible assumption, as it turns out!), you should take Bet #2. Here's a way to explain why:
 247
 248 Consider that 25 people can be combined into pairs ${25 \choose 2}$ ways (which you can read as "25 choose 2"), which is equal to $\frac{25 \times 24}{2} = 300$ (and that little calculation is where those [binomial coefficients](https://en.wikipedia.org/wiki/Binomial_coefficient) I mentioned in my hint come in handy).
 249
 250 Generalizing the logic from part b of the textbook exercise last week, I have assumed that each of these possible pairings are independent and thus each one has a probability $P = (\frac{364}{365})$ of producing an *unmatched* set of birthdays.
 251
 252 Putting everything together, I can employ the multiplication rule from *OpenIntro* Ch. 3 and get the following:
 253 $$P(any~match) = 1 - P(no~matches)$$
 254 $$P(no~matches) = (\frac{364}{365})^{300}$$
 255 And I can let R do the arithmetic:
 256 ```{r}
 257 1-((364/365)^300)
 258
 259 ```
 260 A fair coin flip would give you a lower probability of winning the bet.
 261
 262 #### SQ2.2
 263
 264 Once you have the answer above, this shouldn't be too hard. I'll start with the calculation for how many pairings exist in our seven person class:
 265 ```{r}
 266 choose(7, 2)
 267 ```
 268 Then I can plug that result into the formulas from SQ2.1 to calculate the probability of any shared birthdays in our class:
 269 ```{r}
 270 1-((364/365)^21)
 271 ```
 272
 273 ## Empirical paper questions: Emotional contagion in social networks
 274
 275 The responses below about the Kramer et al. paper may be taken as a guide, but are intended neither to be comprehensive nor the only possible accurate responses.
 276
 277 ### EQ 1: Research questions and objectives
 278
 279 The paper asks: does emotional contagion occur via social networks (online)? It seeks to answer this question in human populations (i.e., the target population seems to be "people").
 280
 281 ### EQ 2: Sample and experiment design
 282
 283 The study uses a random sample of 689,003 Facebook accounts (drawn from the population of internal Facebook account ID numbers as of a week in January 2012). The treatment groups received a reduced portion of posts in their news feed with either positive or negative words in them respectively. The control groups had random posts removed from their news feeds. The experimental manipulation consists in a probabilistic reduction the proportion of news feed posts with either positive or negative language in them.
 284
 285 ### EQ 3: Data and variables
 286
 287 The unit of analysis is a Facebook account. The key variables used in the analysis are the treatment/control indicator (categorical, dichotomous) and the dependent variables: "the percentage of all words produced by a given account that was either positive or negative during the experimental period" (continuous, numeric).
 288
 289 ### EQ 4: Results
 290
 291 The study finds evidence of emotional contagion in soicial networks among English language Facebook users in 2012. Accounts exposed to fewer posts with positive (negative) language tended to produce few posts with positive (negative) language than accounts with randomly removed posts. The four panels of Figure 1 capture the comparisons of the treatment and control conditions by each of the two outcome measures. They show that the treatments had the anticipated effects with reduced positive posts reducing/increasing the proportion of positive/negative words in subsequent posts and reduced negative posts reducing/increasing the proportion of negative/positive words in subsequent posts.
 292
 293 ### EQ 5: Interpretation and contribution (significance)
 294
 295 The authors argue that the paper demonstrates evidence of emotional contagion in social networks in a particularly novel and compelling way. They underscore the scale of the study and resulting precision of the estimates that support the existence of an infrequently observed phenomenon (emotional contagion) in a context dominated by large numbers of short computer-mediated communications (which could be expected to have reduced emotional immediacy/effects).
 296
 297 With respect to generalization, the authors seem to base the core of their claim on the scale and scope of the Facebook user base (as of 2012). However, things get a bit iffy (despite the sample size) given the likely biases of the sample in comparison to the target population of the study. English language Facebook users have a number of attributes that differentiate them from even a broader English language speaking population in the U.S. and Facebook users and it's likely that some of these attributes systematically covary with the outcomes of interest in this study. Covariance (non-independence) between these factors and the outcomes of interest could bias the effects reported in the paper.
 298
 299