You may need to edit these first lines to work on your own machine. Note that when working with .Rmd files interactively in RStudio you may find it easier to set the working directory through the drop-down menus: “Session” → “Set Working Directory” → “To Source File Location”.
## setwd("~/Documents/Teaching/2019/stats/")
## list.files("data/week_04")
mobile <- read.csv("data/week_04/COS-Statistics-Mobile_Sessions.csv")
total <- read.csv("data/week_04/COS-Statistics-Gov-Domains-Only.csv")
I’ll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.
summary.df <- function (d) {
  # dimensions of the data frame
  print(nrow(d))
  print(ncol(d))
  # first six rows
  print(head(d))
  # five rows drawn at random
  print(d[sample(seq(1, nrow(d)), 5), ])
}
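The last line draws five rows at random, so repeated calls show different slices of the data. As a minimal sketch of what that indexing does (using a toy data frame invented here purely for illustration):

# toy data frame, for illustration only
toy <- data.frame(x = 1:10, y = letters[1:10])
# sample(nrow(toy), 5) draws 5 row indices at random;
# indexing with [rows, ] returns those rows with every column
toy[sample(nrow(toy), 5), ]

When the first argument to sample is a single number n, it samples from 1:n, so this does the same thing as the seq(1, nrow(d)) version used in the function above.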
Then I can run these two lines a few times to look at some random samples:
summary.df(mobile)
## [1] 231
## [1] 8
## Operating_System Sessions New_Sessions New_Users Bounce_Rate
## 1 iOS 332291 47.75 158674 60.79
## 2 Android 170107 45.53 77453 58.14
## 3 Windows 27325 44.76 12231 44.60
## 4 Windows Phone 10109 45.71 4621 59.01
## 5 BlackBerry 1375 39.27 540 62.98
## 6 (not set) 408 83.09 339 72.30
## PagesPerSession AvgSessionDuration Month
## 1 2.34 0:02:11 01/01/2015 12:00:00 AM
## 2 2.98 0:03:53 01/01/2015 12:00:00 AM
## 3 3.26 0:02:40 01/01/2015 12:00:00 AM
## 4 2.14 0:01:45 01/01/2015 12:00:00 AM
## 5 2.10 0:02:24 01/01/2015 12:00:00 AM
## 6 1.82 0:01:01 01/01/2015 12:00:00 AM
## Operating_System Sessions New_Sessions New_Users Bounce_Rate
## 182 Playstation Vita 6 100.00 6 100.0
## 214 Android 214077 47.17 100978 59.4
## 194 Android 178625 47.32 84526 57.4
## 92 Series40 8 100.00 8 100.0
## 14 Playstation Vita 6 100.00 6 100.0
## PagesPerSession AvgSessionDuration Month
## 182 1.00 0:00:00 01/01/2016 12:00:00 AM
## 214 3.65 0:04:45 07/01/2016 12:00:00 AM
## 194 3.65 0:04:48 03/01/2016 12:00:00 AM
## 92 1.00 0:00:00 07/01/2015 12:00:00 AM
## 14 1.00 0:00:00 01/01/2015 12:00:00 AM
summary.df(total)
## [1] 1242
## [1] 7
## domain pageviews unique.pageviews average.time.on.page
## 1 www.seattle.gov/ 3525737 2689843 0:01:19
## 2 www2.seattle.gov/ 2158182 125984 0:01:12
## 3 web6.seattle.gov/ 367871 204803 0:01:18
## 4 spdblotter.seattle.gov/ 117645 91076 0:01:14
## 5 web1.seattle.gov/ 79529 32258 0:01:09
## 6 find.seattle.gov/ 78611 62516 0:00:39
## bounce.rate exit.percent month
## 1 50.86 36.53 04/01/2015 12:00:00 AM
## 2 41.69 4.53 04/01/2015 12:00:00 AM
## 3 40.66 23.23 04/01/2015 12:00:00 AM
## 4 69.29 46.42 04/01/2015 12:00:00 AM
## 5 59.57 18.76 04/01/2015 12:00:00 AM
## 6 25.67 21.74 04/01/2015 12:00:00 AM
## domain pageviews unique.pageviews
## 542 dpdwinw101.ad.seattle.gov/ 52 25
## 678 murray.seattle.gov/ 41246 35629
## 776 consultants.seattle.gov/ 2790 2203
## 808 perspectives.seattle.gov/ 46 44
## 644 NA NA
## average.time.on.page bounce.rate exit.percent month
## 542 125.76 0.0 11.54 07/01/2015 12:00:00 AM
## 678 0:02:38 0.8 0.69 09/01/2015 12:00:00 AM
## 776 0:01:10 5446.0 3391.00 10/01/2015 12:00:00 AM
## 808 0:02:17 8667.0 4130.00 10/01/2015 12:00:00 AM
## 644 NA NA
I can check for missing values and summarize the different columns using lapply:
lapply(total, summary)
## $domain
##
## 34
## 2035.seattle.gov/
## 15
## artbeat.seattle.gov/
## 15
## atyourservice.seattle.gov/
## 15
## bagshaw.seattle.gov/
## 15
## bottomline.seattle.gov/
## 15
## brainstorm.seattle.gov/
## 15
## buildingconnections.seattle.gov/
## 15
## centerspotlight.seattle.gov/
## 15
## cityclerk.seattle.gov/
## 15
## clark.seattle.gov/
## 15
## clerk.seattle.gov/
## 15
## climatechange.seattle.gov/
## 15
## conlin.seattle.gov/
## 15
## consultants.seattle.gov/
## 15
## council.seattle.gov/
## 15
## find.seattle.gov/
## 15
## fireline.seattle.gov/
## 15
## frontporch.seattle.gov/
## 15
## godden.seattle.gov/
## 15
## grantsandfunding.seattle.gov/
## 15
## greenspace.seattle.gov/
## 15
## hackthecommute.seattle.gov/
## 15
## humaninterests.seattle.gov/
## 15
## licata.seattle.gov/
## 15
## married.seattle.gov/
## 15
## mayormcginn.seattle.gov/
## 15
## m.seattle.gov/
## 15
## news.seattle.gov/
## 15
## obrien.seattle.gov/
## 15
## onthemove.seattle.gov/
## 15
## parkways.seattle.gov/
## 15
## perspectives.seattle.gov/
## 15
## powerlines.seattle.gov/
## 15
## rasmussen.seattle.gov/
## 15
## rectech.seattle.gov/
## 15
## sawant.seattle.gov/
## 15
## sdotblog.seattle.gov/
## 15
## sdotperformance.seattle.gov/
## 15
## seattlerdy.seattle.gov/
## 15
## spdblotter.seattle.gov/
## 15
## techtalk.seattle.gov/
## 15
## thebuyline.seattle.gov/
## 15
## thescoop.seattle.gov/
## 15
## web6.seattle.gov/
## 15
## www2.seattle.gov/
## 15
## www.clerk.seattle.gov/
## 15
## wwwqa.seattle.gov/
## 15
## cmstrn.seattle.gov/
## 14
## cms8.seattle.gov/
## 13
## igxqa8.seattle.gov/
## 13
## seattle.gov/
## 13
## cttab.seattle.gov/
## 12
## okamoto.seattle.gov/
## 12
## web5.seattle.gov/
## 12
## web7.seattle.gov/
## 12
## education.seattle.gov/
## 11
## web1.seattle.gov/
## 11
## webqa7.seattle.gov/
## 11
## www4.seattle.gov/
## 11
## alert.seattle.gov/
## 10
## alerts.seattle.gov/
## 10
## data.seattle.gov/
## 10
## seattle-govstat.demo.socrata.com/
## 10
## connect.seattle.gov/
## 9
## igx8.seattle.gov/
## 9
## murray.seattle.gov/
## 9
## webqa6.seattle.gov/
## 9
## www.seattle.gov/
## 9
## www.seattle.gov.googleweblight.com/
## 9
## alphaqa.seattle.gov/
## 8
## cmsdev8.seattle.gov/
## 8
## dpdwinw101.ad.seattle.gov/
## 8
## web6.seattle.gov.googleweblight.com/
## 8
## cms.seattle.gov/
## 7
## ctab.seattle.gov/
## 7
## www.citylink.seattle.gov/
## 7
## aboveandbeyond.seattle.gov/
## 6
## citylink.seattle.gov/
## 6
## langstoninstitute.org/
## 6
## mayormurray.seattle.gov/
## 6
## take21.seattlechannel.org/
## 6
## web8.seattle.gov/
## 6
## wwwdev.seattle.gov/
## 6
## www.evergreenapps.org/
## 6
## www.safeyouthseattle.org/
## 6
## cityofseattle.gov/
## 5
## councilconnection.seattle.gov/
## 5
## filmandmusic.seattle.gov/
## 5
## gonzalez.seattle.gov/
## 5
## homebase.seattle.gov/
## 5
## igxdev8.seattle.gov/
## 5
## www.mayor.seattle.gov/
## 5
## www.seattle.gov.offcampus.lib.washington.edu/
## 5
## capitalprojects.seattle.gov/
## 4
## dpdwina307.ad.seattle.gov/
## 4
## herbold.seattle.gov/
## 4
## johnson.seattle.gov/
## 4
## juarez.seattle.gov/
## 4
## (Other)
## 97
##
## $pageviews
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 24 402 66417 2752 4172985 34
##
## $unique.pageviews
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 17 285 28515 2204 3213093 34
##
## $average.time.on.page
## 0:00:00 0:01:11 0:01:18 0.00 0:01:12 0:01:13 0:01:14 0:01:20
## 134 34 17 17 16 15 13 13 12
## 0:01:53 0:01:09 0:01:17 0:01:23 0:01:32 0:01:05 0:01:24 0:01:29 0:01:36
## 12 11 11 11 11 10 10 10 10
## 0:01:51 0:01:54 0:01:58 0:00:55 0:01:01 0:01:06 0:01:08 0:01:10 0:01:16
## 10 10 10 9 9 9 9 9 9
## 0:01:22 0:01:25 0:01:30 0:01:35 0:01:37 0:01:56 0:00:39 0:00:53 0:00:56
## 9 9 9 9 9 9 8 8 8
## 0:00:57 0:01:03 0:01:27 0:01:31 0:01:38 0:01:43 0:01:47 0:00:42 0:00:48
## 8 8 8 8 8 8 8 7 7
## 0:01:07 0:01:19 0:01:40 0:01:41 0:01:42 0:01:45 0:01:50 0:01:52 0:02:00
## 7 7 7 7 7 7 7 7 7
## 0:02:04 0:02:31 0:00:31 0:00:54 0:00:59 0:01:21 0:01:26 0:01:44 0:01:48
## 7 7 6 6 6 6 6 6 6
## 0:01:59 0:02:06 0:02:07 0:02:23 0:02:35 0:00:08 0:00:38 0:01:00 0:01:02
## 6 6 6 6 6 5 5 5 5
## 0:01:04 0:01:33 0:01:34 0:01:39 0:01:46 0:02:09 0:02:12 0:02:19 0:02:21
## 5 5 5 5 5 5 5 5 5
## 0:02:27 0:02:29 0:02:42 0:02:47 0:02:51 0:02:54 0:03:03 0:00:11 0:00:12
## 5 5 5 5 5 5 5 4 4
## 0:00:20 0:00:27 0:00:33 0:00:41 0:00:49 0:00:50 0:00:58 0:01:15 0:01:28
## 4 4 4 4 4 4 4 4 4
## (Other)
## 350
##
## $bounce.rate
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 24.89 65.75 430.47 79.32 10000.00 34
##
## $exit.percent
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 17.67 42.09 347.91 62.37 10000.00 34
##
## $month
## 01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM
## 34 84 84
## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM
## 78 79 80
## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM
## 88 83 87
## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM
## 75 84 85
## 08/01/2015 12:00:00 AM 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM
## 70 84 77
## 12/01/2015 12:00:00 AM
## 70
lapply(mobile, summary)
## $Operating_System
## Android Bada BlackBerry
## 34 17 4 17
## Firefox OS iOS LG LGE
## 5 10 12 1
## MOT Nintendo 3DS Nokia (not set)
## 1 7 16 17
## Playstation Vita Samsung Series40 SymbianOS
## 12 17 10 17
## Windows Windows Phone
## 17 17
##
## $Sessions
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6 16 217 38469 10718 519563 34
##
## $New_Sessions
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.44 45.53 84.62 72.65 100.00 100.00 34
##
## $New_Users
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6 13 124 17575 4853 236550 34
##
## $Bounce_Rate
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 53.85 62.98 66.21 84.62 100.00 34
##
## $PagesPerSession
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.210 1.860 2.082 2.500 9.000 34
##
## $AvgSessionDuration
## 0:00:00 0:00:06 0:00:41 0:01:06 0:01:21 0:01:45 0:01:50 0:02:00
## 46 34 3 3 3 3 3 3 3
## 0:00:04 0:00:09 0:00:25 0:00:42 0:01:01 0:01:05 0:01:07 0:01:09 0:01:20
## 2 2 2 2 2 2 2 2 2
## 0:01:46 0:01:56 0:02:02 0:02:06 0:02:40 0:02:49 0:03:01 0:03:05 0:03:53
## 2 2 2 2 2 2 2 2 2
## 0:00:02 0:00:14 0:00:17 0:00:20 0:00:21 0:00:24 0:00:26 0:00:29 0:00:32
## 1 1 1 1 1 1 1 1 1
## 0:00:34 0:00:38 0:00:40 0:00:43 0:00:44 0:00:46 0:00:48 0:00:49 0:00:50
## 1 1 1 1 1 1 1 1 1
## 0:00:52 0:00:55 0:00:56 0:01:03 0:01:08 0:01:12 0:01:14 0:01:16 0:01:19
## 1 1 1 1 1 1 1 1 1
## 0:01:24 0:01:25 0:01:26 0:01:28 0:01:29 0:01:33 0:01:34 0:01:35 0:01:37
## 1 1 1 1 1 1 1 1 1
## 0:01:41 0:01:42 0:01:51 0:01:52 0:01:54 0:02:01 0:02:03 0:02:05 0:02:08
## 1 1 1 1 1 1 1 1 1
## 0:02:09 0:02:10 0:02:11 0:02:13 0:02:14 0:02:15 0:02:17 0:02:18 0:02:19
## 1 1 1 1 1 1 1 1 1
## 0:02:24 0:02:26 0:02:34 0:02:39 0:02:47 0:02:48 0:02:52 0:02:56 0:02:57
## 1 1 1 1 1 1 1 1 1
## 0:03:04 0:03:07 0:03:14 0:03:18 0:03:21 0:03:25 0:03:26 0:03:29 0:03:36
## 1 1 1 1 1 1 1 1 1
## (Other)
## 22
##
## $Month
## 01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM
## 34 15 9
## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM
## 13 11 15
## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM
## 9 12 10
## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM
## 11 14 12
## 07/01/2016 12:00:00 AM 08/01/2015 12:00:00 AM 08/01/2016 12:00:00 AM
## 9 14 10
## 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM 12/01/2015 12:00:00 AM
## 10 12 11
First let’s create a table/array using tapply that sums pageviews per month across all the sites:
total.views.bymonth.tbl <- tapply(total$pageviews, total$month, sum)
total.views.bymonth.tbl
## 01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM
## NA 6350440 3471121
## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM
## 5820453 3366834 6609602
## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM
## 4087054 6481483 3644750
## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM
## 6544055 6952488 8084318
## 08/01/2015 12:00:00 AM 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM
## 7045189 3067760 2961681
## 12/01/2015 12:00:00 AM
## 5745045
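As an aside (not used below), aggregate with a formula gives a data frame directly; a hedged alternative sketch, noting that the formula interface silently drops rows with missing pageviews, so its totals can differ slightly from the tapply result:

# alternative: aggregate returns a data frame rather than an array
total.views.agg <- aggregate(pageviews ~ month, data=total, FUN=sum)
head(total.views.agg)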
If you run class on total.views.bymonth.tbl you’ll notice it’s not a data frame yet. We can change that:
total.views <- data.frame(months=names(total.views.bymonth.tbl),
total=total.views.bymonth.tbl)
head(total.views)
## months total
## NA
## 01/01/2015 12:00:00 AM 01/01/2015 12:00:00 AM 6350440
## 01/01/2016 12:00:00 AM 01/01/2016 12:00:00 AM 3471121
## 02/01/2015 12:00:00 AM 02/01/2015 12:00:00 AM 5820453
## 02/01/2016 12:00:00 AM 02/01/2016 12:00:00 AM 3366834
## 03/01/2015 12:00:00 AM 03/01/2015 12:00:00 AM 6609602
Let’s clean up the rownames (this would all work the same if I didn’t do this part).
rownames(total.views) <- NULL
head(total.views)
## months total
## 1 NA
## 2 01/01/2015 12:00:00 AM 6350440
## 3 01/01/2016 12:00:00 AM 3471121
## 4 02/01/2015 12:00:00 AM 5820453
## 5 02/01/2016 12:00:00 AM 3366834
## 6 03/01/2015 12:00:00 AM 6609602
Onwards to the mobile dataset!
Here we have a challenge because we have to estimate total pageviews (it’s not given in the raw dataset). I’ll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I’m not sure what else we could do under the circumstances.
mobile$total.pages <- mobile$Sessions * mobile$PagesPerSession
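As a quick sanity check on this estimate (just a sketch), we can compare the reconstructed totals against the inputs for a few rows; total.pages should always be at least as large as Sessions because PagesPerSession is never below one:

# spot-check the reconstructed pageview totals
head(mobile[, c("Sessions", "PagesPerSession", "total.pages")])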
Then, making the views-per-month array is more or less copy/pasted from above:
mobile.views.bymonth.tbl <- tapply(mobile$total.pages, mobile$Month, sum)
mobile.views.bymonth.tbl
## 01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM
## NA 1399185.6 668275.2
## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM
## 1275315.2 592607.8 1402086.4
## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM
## 800842.8 1381295.1 788533.7
## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM
## 1605914.9 1722519.5 1988848.0
## 07/01/2016 12:00:00 AM 08/01/2015 12:00:00 AM 08/01/2016 12:00:00 AM
## 878142.6 1741067.8 912435.4
## 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM 12/01/2015 12:00:00 AM
## 564453.5 1285288.0 1223414.0
mobile.views <- data.frame(months=names(mobile.views.bymonth.tbl),
mobile=mobile.views.bymonth.tbl)
rownames(mobile.views) <- NULL
Now we merge the two datasets. Notice that I have created the months column in both datasets with exactly the same name.
views <- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by="months")
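Because both all.x and all.y are TRUE, unmatched months from either dataset are kept (a full outer join). As far as I know, the all=TRUE shorthand is equivalent:

# identical result using the all=TRUE shorthand for a full outer join
views.alt <- merge(mobile.views, total.views, all=TRUE, by="months")
identical(views, views.alt)  # should be TRUE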
These are sorted in strange ways and will be difficult to graph because the dates are stored as characters. Let’s convert them into Date objects. Then I can use sort.list to sort everything.
views$months <- as.Date(views$months, format="%m/%d/%Y %H:%M:%S")
views <- views[sort.list(views$months),]
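For a single vector, sort.list behaves like the more familiar order, so a hedged equivalent (both place NA dates last by default) is:

# equivalent ordering with order()
views <- views[order(views$months), ]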
Take a look at the data. Some rows are missing observations. We can drop those rows using complete.cases:
lapply(views, summary)
## $months
## Min. 1st Qu. Median Mean 3rd Qu.
## "2015-01-01" "2015-05-01" "2015-09-01" "2015-09-20" "2016-02-01"
## Max. NA's
## "2016-08-01" "1"
##
## $mobile
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 564454 800843 1275315 1190013 1402086 1988848 1
##
## $total
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2961681 3557936 5820453 5348818 6576828 8084318 3
views[rowSums(is.na(views)) > 0,]
## months mobile total
## 13 2016-07-01 878142.6 NA
## 15 2016-08-01 912435.4 NA
## 1 <NA> NA NA
views.complete <- views[complete.cases(views),]
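An alternative sketch with the same effect on the rows is na.omit, which drops every row containing an NA (and records which rows were dropped in an na.action attribute); views.complete.alt is a name introduced here just for comparison:

# should keep the same rows as the complete.cases() approach
views.complete.alt <- na.omit(views)
nrow(views.complete.alt) == nrow(views.complete)  # should be TRUE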
For my proportion measure, I’ll take the mobile views divided by the total views.
views.complete$prop.mobile <- views.complete$mobile / views.complete$total
library(ggplot2)
ggplot(data=views.complete, aes(x=months, y=prop.mobile)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits=c(0, 1))
mean(views.complete$prop.mobile)
## [1] 0.2308486
The general formula for a confidence interval is \(point~estimate~\pm~z^*\times~SE\). First, identify the three different values. The point estimate is 52%, \(z^* = 2.58\) for a 99% confidence level (that’s the number of standard deviations around the mean that ensures 99% of a Z-score distribution is included), and \(SE = 2.4\%\).
With this we can plug and chug:
\[52\% \pm 2.58 \times 2.4\% \rightarrow (45.8\%,~58.2\%)\]
From this data we are 99% confident that between 45.8% and 58.2% of U.S. adult Twitter users get some news through the site.
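A quick way to double-check the arithmetic in R (a sketch; qnorm(0.995) returns the exact 99% critical value, roughly 2.576):

# 99% confidence interval: point estimate +/- z* * SE
0.52 + c(-1, 1) * qnorm(0.995) * 0.024  # roughly (0.458, 0.582)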
False. See the answer to 4.8 above. With \(\alpha = 0.01\), we can consult the 99% confidence interval. It includes 50% but also goes lower.
False. The standard error of the sample does not contain any information about the proportion of the population included in the sample. It measures the variability of the sampling distribution.
False. Increasing the sample size will decrease the standard error. Consider the formula: \(\frac{\sigma}{\sqrt{n}}\). A smaller \(n\) will result in a larger standard error.
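A tiny numerical illustration of this (with sigma chosen arbitrarily): quadrupling the sample size halves the standard error.

# standard error sigma / sqrt(n) shrinks as n grows
sigma <- 1  # arbitrary
sigma / sqrt(c(25, 100, 400))  # 0.20, 0.10, 0.05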
False. All else being equal, a lower confidence level produces a narrower interval and a higher confidence level produces a wider one. To confirm this, revisit the formula in SQ1 above and plug in the corresponding value of .9, resulting in a \(z^*\) value of 1.28 (see the Z-score table in the back of OpenIntro).
The hypotheses should be about the population mean (\(\mu\)) and not the sample mean (\(\bar{x}\)). The null hypothesis should have an equal sign. The alternative hypothesis should be about the critical value, not the sample mean. The following would have been better:
\[H_0: \mu = 10~hours\] \[H_A: \mu \gt 10~hours\]
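For concreteness, here is a hedged sketch of how such a one-sided test could be run in R, using a hypothetical vector of observed hours (made up here, not data from the problem):

# hypothetical observations, for illustration only
hours <- c(11.2, 9.5, 12.1, 10.4, 13.0, 9.8, 11.7)
# one-sided test of H0: mu = 10 against HA: mu > 10
t.test(hours, mu = 10, alternative = "greater")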
In my words (or rather formulas since I think that’s less ambiguous), the key pairs of null/alternative hypotheses look something like the following:
Let \(\Delta\) be the parameter estimate for the difference in mean percentage of positive (\(\mu_{pos}\)) and negative (\(\mu_{neg}\)) words between the experimental and control conditions for the treatments of reduced negative content (\(R_{neg}\)) and reduced positive content (\(R_{pos}\)).
For the reduced negative content conditions (the left-hand side of Figure 1), the paper tests:
\[HR_{neg}1_0: \Delta_{\mu_{pos}} = 0\] \[HR_{neg}1_a: \Delta_{\mu_{pos}} \gt 0\]
And:
\[HR_{neg}2_0: \Delta_{\mu_{neg}} = 0\] \[HR_{neg}2_a: \Delta_{\mu_{neg}} \lt 0\]
Then, for the reduced positive content conditions (the right-hand side of Figure 1), the paper tests:
\[HR_{pos}1_0:~~ \Delta_{\mu_{pos}} = 0\] \[HR_{pos}1_a:~~ \Delta_{\mu_{pos}} \lt 0\]
And:
\[HR_{pos}2_0:~~ \Delta_{\mu_{neg}} = 0\] \[HR_{pos}2_a:~~ \Delta_{\mu_{neg}} \gt 0\]
Note that the theories the authors used to motivate the study imply directions for the alternative hypotheses, but nothing in the description of the analysis suggests that they used one-tailed tests. I’ve written these all in terms of specific directions here to correspond with the theories stated in the paper. They could also (arguably more accurately) have been written as two-sided alternatives (“\(\neq\)”).
The authors’ estimates suggest that reduced negative News Feed content causes an increase in the percentage of positive words and a decrease in the percentage of negative words in subsequent News Feed posts by study participants (supporting \(HR_{neg}1_a\) and \(HR_{neg}2_a\) respectively).
They also find that reduced positive News Feed content causes a decrease in the percentage of positive words and an increase in the percentage of negative words in subsequent News Feed posts (supporting \(HR_{pos}1_a\) and \(HR_{pos}2_a\) respectively).
Cohen’s \(d\) puts estimates of experimental effects in standardized units (much like a Z-score!) in order to help understand their size relative to the underlying distribution of the dependent variable(s). The d-values for each of the effects estimated in the paper are 0.02, 0.001, 0.02, and 0.008 respectively (in the order presented in the paper, not in the order of the hypotheses above!). These are minuscule effects. However, the treatment itself is also quite narrow in scope, suggesting that the presence of any treatment effect at all is an indication of the underlying phenomenon (emotional contagion). Personally, I find it difficult to attribute much substantive significance to the results because I’m not even convinced that tiny shifts in the percentage of positive/negative words used in News Feed updates accurately index meaningful emotional shifts (maybe we could call it linguistic contagion instead?). Despite these concerns and the ethical considerations that attracted so much public attention, I consider this a clever, well-executed study and I think it’s quite compelling. I expect many of you will have different opinions of various kinds!
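For reference, a generic sketch of how Cohen’s d can be computed from two groups of observations (this is the standard pooled-SD formula, not necessarily the paper’s exact procedure):

# Cohen's d: difference in means divided by the pooled standard deviation
cohens.d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled.sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled.sd
}
# hypothetical example with a tiny true difference in means
cohens.d(rnorm(1000, mean = 0.02), rnorm(1000, mean = 0))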