---
title: 'Week 4 Problem set: Worked solutions'
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
author: "Aaron Shaw"
date: "April 25, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Programming challenges

## PC2

You may need to edit these first lines to work on your own machine. Note that for working with .Rmd files interactively in RStudio you may find it easier to do this using the drop-down menus: "Session" → "Set Working Directory" → "To Source File Location".

```{r}
## setwd("~/Documents/Teaching/2019/stats/")
## list.files("data/week_04")

mobile <- read.csv("data/week_04/COS-Statistics-Mobile_Sessions.csv")
total <- read.csv("data/week_04/COS-Statistics-Gov-Domains-Only.csv")
```

I'll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.

```{r}
summary.df <- function (d) {
    print(nrow(d))
    print(ncol(d))
    print(head(d))
    print(d[sample(seq(1, nrow(d)), 5),])
}
```

Then I can run these two lines a few times to look at some samples:

```{r}
summary.df(mobile)
summary.df(total)
```

I can check for missing values and summarize the different columns using `lapply`:

```{r}
lapply(total, summary)
lapply(mobile, summary)
```

## PC3

First let's create a table/array using `tapply` that sums pageviews per month across all the sites:

```{r}
total.views.bymonth.tbl <- tapply(total$pageviews, total$month, sum)
total.views.bymonth.tbl
```

If you run `class` on `total.views.bymonth.tbl` you'll notice it's not a data frame yet. We can change that:

```{r}
total.views <- data.frame(months=names(total.views.bymonth.tbl),
                          total=total.views.bymonth.tbl)
head(total.views)
```

Let's clean up the row names (this would all work the same if I didn't do this part).

```{r}
rownames(total.views) <- NULL
head(total.views)
```

## PC4

Onwards to the mobile dataset!
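
The mobile data reuses the `tapply`-then-`data.frame` pattern from PC3, so here is a minimal sketch of that pattern on a made-up data frame. The object names (`toy`, `toy.tbl`, `toy.views`) and all of the numbers are hypothetical and are not part of the course data; the point is only to show what shape each step returns.

```{r}
## Toy data: made-up pageview counts for two sites over two months
toy <- data.frame(month = c("2019-01", "2019-01", "2019-02", "2019-02"),
                  pageviews = c(10, 20, 5, 15))

## tapply splits pageviews by month and applies sum to each group
toy.tbl <- tapply(toy$pageviews, toy$month, sum)
toy.tbl

## The result is a named array rather than a data frame,
## so names() recovers the group labels for a new column
toy.views <- data.frame(months = names(toy.tbl),
                        total = as.vector(toy.tbl))
toy.views
```

Using `as.vector` here drops the names from the array, which has the same effect as setting the row names to `NULL` afterwards.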
Here we have a challenge because we have to estimate total pageviews (it's not given in the raw dataset). I'll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I'm not sure what else we could do under the circumstances.

```{r}
mobile$total.pages <- mobile$Sessions * mobile$PagesPerSession
```

Then, making the views-per-month array is more or less copy/pasted from above:

```{r}
mobile.views.bymonth.tbl <- tapply(mobile$total.pages, mobile$Month, sum)
mobile.views.bymonth.tbl

mobile.views <- data.frame(months=names(mobile.views.bymonth.tbl),
                           mobile=mobile.views.bymonth.tbl)
rownames(mobile.views) <- NULL
```

## PC5

Now we merge the two datasets. Notice that I have created the `months` column in both datasets with *exactly* the same name.

```{r}
views <- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by="months")
```

These are sorted in strange ways and will be difficult to graph because the dates are stored as characters. Let's convert them into Date objects. Then I can use `sort.list` to sort everything.

```{r}
views$months <- as.Date(views$months, format="%m/%d/%Y %H:%M:%S")
views <- views[sort.list(views$months),]
```

Take a look at the data. Some rows are missing observations. We can drop those rows using `complete.cases`:

```{r}
lapply(views, summary)
views[rowSums(is.na(views)) > 0,]
views.complete <- views[complete.cases(views),]
```

## PC6

For my proportion measure, I'll take the mobile views divided by the total views.

```{r}
views.complete$prop.mobile <- views.complete$mobile / views.complete$total
```

## PC7

```{r}
library(ggplot2)
ggplot(data=views.complete) +
    aes(x=months, y=prop.mobile) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits=c(0, 1))
```

# Statistical questions

# Empirical paper questions