title: 'Week 4 Problem set: Worked solutions'
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Programming challenges
You may need to edit these first lines to work on your own machine. Note that when working with .Rmd files interactively in RStudio, you may find it easier to set the working directory using the drop-down menus: "Session" → "Set Working Directory" → "To Source File Location".
## setwd("~/Documents/Teaching/2019/stats/")
## list.files("data/week_04")
mobile <- read.csv("data/week_04/COS-Statistics-Mobile_Sessions.csv")
total <- read.csv("data/week_04/COS-Statistics-Gov-Domains-Only.csv")
I'll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.
summary.df <- function (d) {
    print(d[sample(seq(1, nrow(d)), 5),])
}
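The last line of the function is the least obvious part, so here is a minimal sketch of it on made-up data (the `toy` data frame is hypothetical and not part of the problem set files). `seq(1, nrow(d))` lists the row numbers, `sample(..., 5)` draws five of them at random without replacement, and the result is used as a row index.

```r
# Hypothetical toy data, not the problem set files
toy <- data.frame(id = 1:10, value = letters[1:10])

set.seed(42)                     # only so the draw is repeatable here
row.ids <- seq(1, nrow(toy))     # the row numbers 1 through 10
picked <- sample(row.ids, 5)     # five distinct row numbers in random order
toy.sample <- toy[picked, ]      # those rows, all columns

print(toy.sample)
```

Because the draw is random, each call prints a different set of rows unless a seed is set first.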
Then I can run the function on each dataset a few times to look at some samples.
I can check for missing values and summarize the different columns using `lapply`:
lapply(total, summary)
lapply(mobile, summary)
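To see why this works as a check for missing values, here is a small sketch on made-up data (the `toy` data frame and its columns are hypothetical). `lapply` applies `summary` to each column in turn, and for a numeric column with missing values the summary includes an `NA's` entry.

```r
# Hypothetical toy data with one missing pageview count
toy <- data.frame(pageviews = c(10, NA, 30, 40),
                  domain = c("a.gov", "b.gov", "c.gov", "d.gov"),
                  stringsAsFactors = FALSE)

col.summaries <- lapply(toy, summary)   # a list with one element per column
names(col.summaries)
col.summaries$pageviews                 # includes an NA's entry

# A more direct count of missing values per column
na.counts <- sapply(toy, function (x) sum(is.na(x)))
na.counts
```

The `sapply` line is just an alternative I find convenient when all I want is the count of missing values.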
First let's create a table/array using `tapply` that sums pageviews per month across all the sites:
total.views.bymonth.tbl <- tapply(total$pageviews, total$month, sum)
total.views.bymonth.tbl
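If `tapply` is new to you, this sketch on made-up numbers may help (the `toy` data frame is hypothetical). The first argument holds the values, the second holds the grouping variable, and the third is the function applied within each group.

```r
# Hypothetical toy data: several sites reporting pageviews in two months
toy <- data.frame(month = c("jan", "jan", "feb", "feb", "feb"),
                  pageviews = c(1, 2, 10, 20, 30))

# Sum pageviews within each month
views.by.month <- tapply(toy$pageviews, toy$month, sum)
views.by.month

# The group labels become the names of the result
names(views.by.month)
views.by.month["jan"]
```

The result is a named array with one entry per month, not a data frame.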
If you run `class` on `total.views.bymonth.tbl` you'll notice it's not a data frame yet. We can change that:
total.views <- data.frame(months=names(total.views.bymonth.tbl),
                          total=total.views.bymonth.tbl)
Let's clean up the row names (this would all work the same if I didn't do this part).
rownames(total.views) <- NULL
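Here is the same conversion as a sketch on a tiny made-up table (`views.tbl` and `views.df` are hypothetical names). It shows what the conversion produces and what resetting the row names does.

```r
# Hypothetical toy table built the same way as above
views.tbl <- tapply(c(1, 2, 10), c("jan", "jan", "feb"), sum)
class(views.tbl)            # not a data frame

# Put the group labels in one column and the sums in another
views.df <- data.frame(months = names(views.tbl), total = views.tbl)
class(views.df)
rownames(views.df)          # inspect the row names before resetting them

rownames(views.df) <- NULL  # reset to the default 1, 2, ...
views.df
```

Resetting the row names is cosmetic. The `months` column already holds the labels we need for merging later.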
Onwards to the mobile dataset!
Here we have a challenge because we have to estimate total pageviews (it's not given in the raw dataset). I'll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I'm not sure what else we could do under the circumstances.
mobile$total.pages <- mobile$Sessions * mobile$PagesPerSession
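To make the precision caveat concrete, here is a sketch with invented numbers. I am assuming, purely for illustration, that pages-per-session was rounded to one decimal place before being reported. I don't know how the real column was computed.

```r
# Hypothetical numbers for a single site and month
sessions <- 1000
true.pageviews <- 2346    # unknown to us in practice

# Assumption for illustration: the reported ratio was rounded to one decimal
pages.per.session <- round(true.pageviews / sessions, 1)

# Our estimate multiplies the two reported quantities back together
estimated.pageviews <- sessions * pages.per.session
estimated.pageviews

# The gap between the estimate and the truth comes entirely from rounding
estimated.pageviews - true.pageviews
```

If the reported ratio were not rounded, the estimate would recover the true pageviews exactly.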
Then, making the views-per-month array is more or less copy/pasted from above:
mobile.views.bymonth.tbl <- tapply(mobile$total.pages, mobile$Month, sum)
mobile.views.bymonth.tbl
mobile.views <- data.frame(months=names(mobile.views.bymonth.tbl),
                           mobile=mobile.views.bymonth.tbl)
rownames(mobile.views) <- NULL
Now we merge the two datasets. Notice that I have created the `months` column in both datasets with *exactly* the same name.
views <- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by="months")
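Here is what `all.x=TRUE, all.y=TRUE` does, sketched on two tiny made-up data frames (`mobile.toy` and `total.toy` are hypothetical). Rows that appear in only one of the two inputs are kept, and the columns from the other input are filled with `NA`.

```r
# Hypothetical toy data: the two tables share only "feb"
mobile.toy <- data.frame(months = c("jan", "feb"), mobile = c(5, 6))
total.toy <- data.frame(months = c("feb", "mar"), total = c(60, 70))

# Keep every month from both tables
merged <- merge(mobile.toy, total.toy, all.x = TRUE, all.y = TRUE, by = "months")
merged

# Without the all.* arguments only the shared month survives
merged.inner <- merge(mobile.toy, total.toy, by = "months")
merged.inner
```

Writing `all = TRUE` is a shorthand for setting both `all.x` and `all.y`.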
These are sorted in strange ways and will be difficult to graph because the dates are stored as characters. Let's convert them into Date objects. Then I can use `sort.list` to sort everything.
views$months <- as.Date(views$months, format="%m/%d/%Y %H:%M:%S")
views <- views[sort.list(views$months),]
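A quick sketch of why the conversion matters, using made-up strings that I have written to match the format string above (the values themselves are hypothetical). Sorting the raw strings compares them character by character, while sorting the converted values compares actual dates.

```r
# Hypothetical month labels in month/day/year hour:minute:second form
month.strings <- c("10/01/2014 00:00:00",
                   "2/01/2014 00:00:00",
                   "1/01/2015 00:00:00")
sort(month.strings)    # character order, which is not chronological

# Convert to Date; the time part is parsed and then dropped
month.dates <- as.Date(month.strings, format = "%m/%d/%Y %H:%M:%S")
month.dates

# sort.list returns the positions that put the dates in order
ordering <- sort.list(month.dates)
ordering
month.dates[ordering]
```

Using the ordering as a row index, as in `views[sort.list(views$months),]`, reorders the whole data frame rather than just the date column.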
Take a look at the data. Some rows are missing observations. We can drop those rows using `complete.cases`:
lapply(views, summary)
views[rowSums(is.na(views)) > 0,]
views.complete <- views[complete.cases(views),]
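The two indexing tricks here are mirror images of each other, which this sketch on made-up data tries to show (the `toy` data frame is hypothetical). The `rowSums(is.na(...)) > 0` test picks out rows with at least one missing value, and `complete.cases` picks out rows with none.

```r
# Hypothetical toy data with a missing value in two different columns
toy <- data.frame(months = c("jan", "feb", "mar"),
                  mobile = c(5, 6, NA),
                  total = c(NA, 60, 70))

has.missing <- rowSums(is.na(toy)) > 0   # TRUE where any column is NA
is.complete <- complete.cases(toy)       # TRUE where no column is NA

toy[has.missing, ]                       # the rows we are about to drop
toy.complete <- toy[is.complete, ]       # the rows we keep
toy.complete
```

Looking at the dropped rows before dropping them is a good habit, since it shows which months are lost and from which dataset.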
For my proportion measure, I'll take the mobile views divided by the total views.
views.complete$prop.mobile <- views.complete$mobile / views.complete$total
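Since the mobile pageviews are an estimate, it seems worth checking that the resulting proportions are plausible. Here is a sketch of the calculation and a simple range check on made-up numbers (the `toy` data frame is hypothetical).

```r
# Hypothetical toy data: mobile and total pageviews for two months
toy <- data.frame(mobile = c(20, 45), total = c(100, 150))

# Division is vectorized, so this computes one proportion per row
toy$prop.mobile <- toy$mobile / toy$total
toy

# A proportion should fall between 0 and 1; anything else signals a problem
in.range <- toy$prop.mobile >= 0 & toy$prop.mobile <= 1
all(in.range)
```

Running the same check on `views.complete$prop.mobile` would flag any month where the estimated mobile views exceed the reported total.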
library(ggplot2)
ggplot(data=views.complete) + aes(x=months, y=prop.mobile) + geom_point() + geom_line() + scale_y_continuous(limits=c(0, 1))
# Statistical questions
# Empirical paper questions