adding a bunch of new files around week 4
diff --git a/problem_sets/week_04/ps4-worked_solution.Rmd b/problem_sets/week_04/ps4-worked_solution.Rmd
new file mode 100644 (file)
index 0000000..cc3c81f
--- /dev/null
@@ -0,0 +1,130 @@
+---
+title: 'Week 4 Problem set: Worked solutions'
+subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
+author: "Aaron Shaw"
+date: "April 25, 2019"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+# Programming challenges
+
+
+## PC2
+
+You may need to edit these first lines to work on your own machine. When working with .Rmd files interactively in RStudio, you may find it easier to set the working directory from the drop-down menus: "Session" → "Set Working Directory" → "To Source File Location".
+
+```{r}
+## setwd("~/Documents/Teaching/2019/stats/")
+## list.files("data/week_04")
+
+mobile <- read.csv("data/week_04/COS-Statistics-Mobile_Sessions.csv")
+total <- read.csv("data/week_04/COS-Statistics-Gov-Domains-Only.csv")
+```
+
+I'll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.
+```{r}
+summary.df <- function (d) {
+    print(nrow(d))
+    print(ncol(d))
+    print(head(d))
+    print(d[sample(seq(1, nrow(d)), 5),])
+}
+```
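+
+To unpack that last line, here is a minimal sketch using a made-up data frame (`toy` is hypothetical and not part of the assignment data). `sample()` draws five row numbers at random, and leaving the slot after the comma empty keeps every column:
+
+```{r}
+## toy data frame, invented purely for illustration
+toy <- data.frame(id = 1:10, letter = letters[1:10])
+
+## five row numbers drawn at random, without replacement
+idx <- sample(seq(1, nrow(toy)), 5)
+idx
+
+## index rows by idx; the empty column slot returns all columns
+toy[idx, ]
+```
+
+Because the rows are drawn at random, each run shows a different slice of the data.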
+
+Then I can run these two lines a few times to look at some samples:
+```{r}
+summary.df(mobile)
+
+summary.df(total)
+```
+I can check for missing values and summarize the different columns using `lapply`:
+
+```{r}
+lapply(total, summary)
+
+lapply(mobile, summary)
+```
+
+## PC3
+
+First let's create a table/array using `tapply` that sums pageviews per month across all the sites:
+```{r}    
+total.views.bymonth.tbl <- tapply(total$pageviews, total$month, sum)
+total.views.bymonth.tbl
+```
+If you run `class` on `total.views.bymonth.tbl` you'll notice it's not a data frame yet. We can change that:
+```{r}
+total.views <- data.frame(months=names(total.views.bymonth.tbl),
+                          total=total.views.bymonth.tbl)
+
+head(total.views)
+```
+Let's clean up the row names (everything would work the same if I skipped this step).
+
+```{r}
+rownames(total.views) <- NULL
+
+head(total.views)
+```
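+
+The same `tapply`-then-`data.frame` pattern on a made-up dataset may make the intermediate steps easier to see. The `toy` object and its values are hypothetical:
+
+```{r}
+## invented example data: two observations in each of two months
+toy <- data.frame(month = c("jan", "jan", "feb", "feb"),
+                  pageviews = c(10, 20, 5, 15))
+
+## sum pageviews within each month; the result is a named array
+toy.tbl <- tapply(toy$pageviews, toy$month, sum)
+class(toy.tbl)
+
+## the names become one column and the sums become another
+toy.views <- data.frame(months = names(toy.tbl), total = toy.tbl)
+rownames(toy.views) <- NULL
+toy.views
+```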
+## PC4
+Onwards to the mobile dataset!
+
+Here we have a challenge because we have to estimate total pageviews (it's not given in the raw dataset). I'll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I'm not sure what else we could do under the circumstances.
+```{r}
+mobile$total.pages <- mobile$Sessions * mobile$PagesPerSession 
+```
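+
+As a small illustration of that estimate with invented numbers (not values from the dataset), R multiplies the two vectors element by element, so each row gets its own estimated page count:
+
+```{r}
+## hypothetical values for two rows
+sessions <- c(100, 200)
+pages.per.session <- c(2.5, 1.5)
+
+## estimated pages for each row
+est.pages <- sessions * pages.per.session
+est.pages
+```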
+Then, making the views-per-month array is more or less copy/pasted from above:
+```{r}
+mobile.views.bymonth.tbl <- tapply(mobile$total.pages, mobile$Month, sum)
+mobile.views.bymonth.tbl
+
+mobile.views <- data.frame(months=names(mobile.views.bymonth.tbl),
+                           mobile=mobile.views.bymonth.tbl)
+rownames(mobile.views) <- NULL
+```
+## PC5
+Now we merge the two datasets. Notice that I have created the `months` column in both datasets with *exactly* the same name.
+```{r}
+views <- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by="months")
+```
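+
+A minimal sketch of how `merge` treats keys that appear in only one dataset, using made-up data frames (`a` and `b` are hypothetical):
+
+```{r}
+## invented data: "feb" appears only in a, "mar" only in b
+a <- data.frame(months = c("jan", "feb"), mobile = c(1, 2))
+b <- data.frame(months = c("jan", "mar"), total = c(10, 30))
+
+## all.x and all.y keep unmatched rows from both sides, filled with NA
+both <- merge(a, b, all.x = TRUE, all.y = TRUE, by = "months")
+both
+
+## with the defaults, only keys found in both datasets survive
+merge(a, b, by = "months")
+```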
+
+The rows are sorted oddly and will be difficult to graph because the dates are stored as character strings. Let's convert them into Date objects. Then I can use `sort.list` to sort everything.
+
+```{r}
+views$months <- as.Date(views$months, format="%m/%d/%Y %H:%M:%S")
+
+views <- views[sort.list(views$months),]
+```
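+
+A brief sketch of why the conversion matters, using invented date strings. Their layout is only assumed to match the format string used above:
+
+```{r}
+## invented date strings in month/day/year layout
+x <- c("3/1/2019 0:00:00", "11/1/2018 0:00:00", "1/1/2019 0:00:00")
+
+## sorted as text, the November 2018 entry does not come first
+sort(x)
+
+## parsed as Dates, the values sort chronologically
+d <- as.Date(x, format = "%m/%d/%Y %H:%M:%S")
+d[sort.list(d)]
+```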
+
+Take a look at the data. Some rows are missing observations. We can drop those rows using `complete.cases`:
+```{r}
+lapply(views, summary)
+
+views[rowSums(is.na(views)) > 0,]
+
+views.complete <- views[complete.cases(views),]
+```
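+
+To see what `complete.cases` returns, here is a minimal sketch on a made-up data frame (`toy` is hypothetical):
+
+```{r}
+## invented data with one missing value in each of two rows
+toy <- data.frame(months = c("jan", "feb", "mar"),
+                  mobile = c(1, NA, 3),
+                  total = c(10, 20, NA))
+
+## TRUE only for rows with no missing values in any column
+keep <- complete.cases(toy)
+keep
+
+## use the logical vector to keep only the complete rows
+toy[keep, ]
+```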
+
+## PC6 
+
+For my proportion measure, I'll take the mobile views divided by the total views.
+
+```{r}
+views.complete$prop.mobile <- views.complete$mobile / views.complete$total
+    
+```
+## PC7
+
+```{r}
+library(ggplot2)
+ggplot(data=views.complete) + aes(x=months, y=prop.mobile) +
+    geom_point() + geom_line() +
+    scale_y_continuous(limits=c(0, 1))
+
+```
+
+# Statistical questions
+
+# Empirical paper questions
\ No newline at end of file
