code.communitydata.science - stats_class_2019.git/blobdiff - problem_sets/week_04/ps4-worked_solution.html
updates including statistical and empirical questions
index cbb90a617090a2ef964d9654700d79fd9194e7e5..19bd13cfb6200baad9a299708dfc6cffc237349f 100644 (file)
@@ -11,7 +11,7 @@
 
 <meta name="author" content="Aaron Shaw" />
 
-<meta name="date" content="2019-04-16" />
+<meta name="date" content="2019-04-25" />
 
 <title>Week 4 Problem set: Worked solutions</title>
 
@@ -195,7 +195,7 @@ $(document).ready(function () {
 Northwestern University<br />
 MTS 525</h3>
 <h4 class="author">Aaron Shaw</h4>
-<h4 class="date">April 16, 2019</h4>
+<h4 class="date">April 25, 2019</h4>
 
 </div>
 
@@ -204,28 +204,21 @@ MTS 525</h3>
 <h1>Programming challenges</h1>
 <div id="pc2" class="section level2">
 <h2>PC2</h2>
-<pre class="r"><code>## You'll need to edit these first lines to work on your own machine
-##
-## Note that for working with .Rmd files interactively in Rstudio you may find it easier to do this 
-## using the drop down menus: &quot;Session&quot; → &quot;Set Working Directory&quot; → &quot;To Source File Location&quot; 
-##
-
-## setwd(&quot;~/Documents/Teaching/2019/stats/&quot;)
+<p>You may need to edit these first lines to work on your own machine. When working with .Rmd files interactively in RStudio, you may find it easier to set the working directory from the drop-down menus: “Session” → “Set Working Directory” → “To Source File Location”.</p>
+<pre class="r"><code>## setwd(&quot;~/Documents/Teaching/2019/stats/&quot;)
 ## list.files(&quot;data/week_04&quot;)
 
 mobile &lt;- read.csv(&quot;data/week_04/COS-Statistics-Mobile_Sessions.csv&quot;)
-total &lt;- read.csv(&quot;data/week_04/COS-Statistics-Gov-Domains-Only.csv&quot;)
-
-
-summary.df &lt;- function (d) {
+total &lt;- read.csv(&quot;data/week_04/COS-Statistics-Gov-Domains-Only.csv&quot;)</code></pre>
+<p>I’ll write a little function to help inspect the data. Make sure you understand what the last line of the function is doing.</p>
+<pre class="r"><code>summary.df &lt;- function (d) {
     print(nrow(d))
     print(ncol(d))
     print(head(d))
     print(d[sample(seq(1, nrow(d)), 5),])
-}
-
-## run these two lines a few times to look at the numbers
-summary.df(mobile)</code></pre>
+}</code></pre>
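+<p>As a hint for that last line, here is a minimal sketch on a made-up data frame (<code>toy.df</code>, <code>row.ids</code>, and <code>toy.sample</code> are hypothetical names, not part of the assignment data). <code>sample</code> draws five distinct row numbers, and indexing with them returns those rows with every column:</p>
+<pre class="r"><code>## hypothetical data, for illustration only
+toy.df &lt;- data.frame(id = 1:10, value = (1:10) * 10)
+
+## five row numbers drawn without replacement from 1..nrow(toy.df)
+row.ids &lt;- sample(seq(1, nrow(toy.df)), 5)
+
+## select those rows; the empty slot after the comma keeps all columns
+toy.sample &lt;- toy.df[row.ids, ]
+print(toy.sample)</code></pre>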
+<p>Then I can run these two lines a few times to look at different random samples:</p>
+<pre class="r"><code>summary.df(mobile)</code></pre>
 <pre><code>## [1] 231
 ## [1] 8
 ##   Operating_System Sessions New_Sessions New_Users Bounce_Rate
@@ -243,17 +236,17 @@ summary.df(mobile)</code></pre>
 ## 5            2.10            0:02:24 01/01/2015 12:00:00 AM
 ## 6            1.82            0:01:01 01/01/2015 12:00:00 AM
 ##     Operating_System Sessions New_Sessions New_Users Bounce_Rate
-## 51           Samsung       59        44.07        26       77.97
-## 160            Nokia       12       100.00        12      100.00
-## 162       Firefox OS        6       100.00         6        0.00
-## 224        (not set)      419        87.11       365       74.22
-## 172               LG        6       100.00         6        0.00
+## 182 Playstation Vita        6       100.00         6       100.0
+## 214          Android   214077        47.17    100978        59.4
+## 194          Android   178625        47.32     84526        57.4
+## 92          Series40        8       100.00         8       100.0
+## 14  Playstation Vita        6       100.00         6       100.0
 ##     PagesPerSession AvgSessionDuration                  Month
-## 51             1.54            0:03:01 04/01/2015 12:00:00 AM
-## 160            1.00            0:00:00 10/01/2015 12:00:00 AM
-## 162            2.00            0:00:02 10/01/2015 12:00:00 AM
-## 224            1.37            0:01:09 08/01/2016 12:00:00 AM
-## 172            1.83            0:01:37 12/01/2015 12:00:00 AM</code></pre>
+## 182            1.00            0:00:00 01/01/2016 12:00:00 AM
+## 214            3.65            0:04:45 07/01/2016 12:00:00 AM
+## 194            3.65            0:04:48 03/01/2016 12:00:00 AM
+## 92             1.00            0:00:00 07/01/2015 12:00:00 AM
+## 14             1.00            0:00:00 01/01/2015 12:00:00 AM</code></pre>
 <pre class="r"><code>summary.df(total)</code></pre>
 <pre><code>## [1] 1242
 ## [1] 7
@@ -271,25 +264,354 @@ summary.df(mobile)</code></pre>
 ## 4       69.29        46.42 04/01/2015 12:00:00 AM
 ## 5       59.57        18.76 04/01/2015 12:00:00 AM
 ## 6       25.67        21.74 04/01/2015 12:00:00 AM
-##                              domain pageviews unique.pageviews
-## 963    sdotperformance.seattle.gov/        20               15
-## 1022            obrien.seattle.gov/       283              241
-## 914     lc317web.light.seattle.gov/         4                4
-## 455        mayormurray.seattle.gov/      9896             8925
-## 744  councilconnection.seattle.gov/         2                2
-##      average.time.on.page bounce.rate exit.percent                  month
-## 963               0:00:13       81.82        55.00 01/01/2016 12:00:00 AM
-## 1022              0:01:51        0.61         0.34 02/01/2016 12:00:00 AM
-## 914               0:00:04        0.00        50.00 01/01/2016 12:00:00 AM
-## 455               0:03:23       90.59        87.12 06/01/2015 12:00:00 AM
-## 744               0:00:00        1.00         1.00 09/01/2015 12:00:00 AM</code></pre>
-<pre class="r"><code>## PC3. Using the top 5000 dataset, create a new data frame that one
-## column per month (as described in the data) and a second column is
-## the total number of views made to all pages in the dataset over
-## that month.
-
-## first create a table/array using tapply
-total.views.bymonth.tbl &lt;- tapply(total$pageviews, total$month, sum)
+##                         domain pageviews unique.pageviews
+## 542 dpdwinw101.ad.seattle.gov/        52               25
+## 678        murray.seattle.gov/     41246            35629
+## 776   consultants.seattle.gov/      2790             2203
+## 808  perspectives.seattle.gov/        46               44
+## 644                                   NA               NA
+##     average.time.on.page bounce.rate exit.percent                  month
+## 542               125.76         0.0        11.54 07/01/2015 12:00:00 AM
+## 678              0:02:38         0.8         0.69 09/01/2015 12:00:00 AM
+## 776              0:01:10      5446.0      3391.00 10/01/2015 12:00:00 AM
+## 808              0:02:17      8667.0      4130.00 10/01/2015 12:00:00 AM
+## 644                               NA           NA</code></pre>
+<p>I can check for missing values and summarize the different columns using <code>lapply</code>:</p>
+<pre class="r"><code>lapply(total, summary)</code></pre>
+<pre><code>## $domain
+##                                               
+##                                            34 
+##                             2035.seattle.gov/ 
+##                                            15 
+##                          artbeat.seattle.gov/ 
+##                                            15 
+##                    atyourservice.seattle.gov/ 
+##                                            15 
+##                          bagshaw.seattle.gov/ 
+##                                            15 
+##                       bottomline.seattle.gov/ 
+##                                            15 
+##                       brainstorm.seattle.gov/ 
+##                                            15 
+##              buildingconnections.seattle.gov/ 
+##                                            15 
+##                  centerspotlight.seattle.gov/ 
+##                                            15 
+##                        cityclerk.seattle.gov/ 
+##                                            15 
+##                            clark.seattle.gov/ 
+##                                            15 
+##                            clerk.seattle.gov/ 
+##                                            15 
+##                    climatechange.seattle.gov/ 
+##                                            15 
+##                           conlin.seattle.gov/ 
+##                                            15 
+##                      consultants.seattle.gov/ 
+##                                            15 
+##                          council.seattle.gov/ 
+##                                            15 
+##                             find.seattle.gov/ 
+##                                            15 
+##                         fireline.seattle.gov/ 
+##                                            15 
+##                       frontporch.seattle.gov/ 
+##                                            15 
+##                           godden.seattle.gov/ 
+##                                            15 
+##                 grantsandfunding.seattle.gov/ 
+##                                            15 
+##                       greenspace.seattle.gov/ 
+##                                            15 
+##                   hackthecommute.seattle.gov/ 
+##                                            15 
+##                   humaninterests.seattle.gov/ 
+##                                            15 
+##                           licata.seattle.gov/ 
+##                                            15 
+##                          married.seattle.gov/ 
+##                                            15 
+##                      mayormcginn.seattle.gov/ 
+##                                            15 
+##                                m.seattle.gov/ 
+##                                            15 
+##                             news.seattle.gov/ 
+##                                            15 
+##                           obrien.seattle.gov/ 
+##                                            15 
+##                        onthemove.seattle.gov/ 
+##                                            15 
+##                         parkways.seattle.gov/ 
+##                                            15 
+##                     perspectives.seattle.gov/ 
+##                                            15 
+##                       powerlines.seattle.gov/ 
+##                                            15 
+##                        rasmussen.seattle.gov/ 
+##                                            15 
+##                          rectech.seattle.gov/ 
+##                                            15 
+##                           sawant.seattle.gov/ 
+##                                            15 
+##                         sdotblog.seattle.gov/ 
+##                                            15 
+##                  sdotperformance.seattle.gov/ 
+##                                            15 
+##                       seattlerdy.seattle.gov/ 
+##                                            15 
+##                       spdblotter.seattle.gov/ 
+##                                            15 
+##                         techtalk.seattle.gov/ 
+##                                            15 
+##                       thebuyline.seattle.gov/ 
+##                                            15 
+##                         thescoop.seattle.gov/ 
+##                                            15 
+##                             web6.seattle.gov/ 
+##                                            15 
+##                             www2.seattle.gov/ 
+##                                            15 
+##                        www.clerk.seattle.gov/ 
+##                                            15 
+##                            wwwqa.seattle.gov/ 
+##                                            15 
+##                           cmstrn.seattle.gov/ 
+##                                            14 
+##                             cms8.seattle.gov/ 
+##                                            13 
+##                           igxqa8.seattle.gov/ 
+##                                            13 
+##                                  seattle.gov/ 
+##                                            13 
+##                            cttab.seattle.gov/ 
+##                                            12 
+##                          okamoto.seattle.gov/ 
+##                                            12 
+##                             web5.seattle.gov/ 
+##                                            12 
+##                             web7.seattle.gov/ 
+##                                            12 
+##                        education.seattle.gov/ 
+##                                            11 
+##                             web1.seattle.gov/ 
+##                                            11 
+##                           webqa7.seattle.gov/ 
+##                                            11 
+##                             www4.seattle.gov/ 
+##                                            11 
+##                            alert.seattle.gov/ 
+##                                            10 
+##                           alerts.seattle.gov/ 
+##                                            10 
+##                             data.seattle.gov/ 
+##                                            10 
+##             seattle-govstat.demo.socrata.com/ 
+##                                            10 
+##                          connect.seattle.gov/ 
+##                                             9 
+##                             igx8.seattle.gov/ 
+##                                             9 
+##                           murray.seattle.gov/ 
+##                                             9 
+##                           webqa6.seattle.gov/ 
+##                                             9 
+##                              www.seattle.gov/ 
+##                                             9 
+##           www.seattle.gov.googleweblight.com/ 
+##                                             9 
+##                          alphaqa.seattle.gov/ 
+##                                             8 
+##                          cmsdev8.seattle.gov/ 
+##                                             8 
+##                    dpdwinw101.ad.seattle.gov/ 
+##                                             8 
+##          web6.seattle.gov.googleweblight.com/ 
+##                                             8 
+##                              cms.seattle.gov/ 
+##                                             7 
+##                             ctab.seattle.gov/ 
+##                                             7 
+##                     www.citylink.seattle.gov/ 
+##                                             7 
+##                   aboveandbeyond.seattle.gov/ 
+##                                             6 
+##                         citylink.seattle.gov/ 
+##                                             6 
+##                        langstoninstitute.org/ 
+##                                             6 
+##                      mayormurray.seattle.gov/ 
+##                                             6 
+##                    take21.seattlechannel.org/ 
+##                                             6 
+##                             web8.seattle.gov/ 
+##                                             6 
+##                           wwwdev.seattle.gov/ 
+##                                             6 
+##                        www.evergreenapps.org/ 
+##                                             6 
+##                     www.safeyouthseattle.org/ 
+##                                             6 
+##                            cityofseattle.gov/ 
+##                                             5 
+##                councilconnection.seattle.gov/ 
+##                                             5 
+##                     filmandmusic.seattle.gov/ 
+##                                             5 
+##                         gonzalez.seattle.gov/ 
+##                                             5 
+##                         homebase.seattle.gov/ 
+##                                             5 
+##                          igxdev8.seattle.gov/ 
+##                                             5 
+##                        www.mayor.seattle.gov/ 
+##                                             5 
+## www.seattle.gov.offcampus.lib.washington.edu/ 
+##                                             5 
+##                  capitalprojects.seattle.gov/ 
+##                                             4 
+##                    dpdwina307.ad.seattle.gov/ 
+##                                             4 
+##                          herbold.seattle.gov/ 
+##                                             4 
+##                          johnson.seattle.gov/ 
+##                                             4 
+##                           juarez.seattle.gov/ 
+##                                             4 
+##                                       (Other) 
+##                                            97 
+## 
+## $pageviews
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##       1      24     402   66417    2752 4172985      34 
+## 
+## $unique.pageviews
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##       1      17     285   28515    2204 3213093      34 
+## 
+## $average.time.on.page
+## 0:00:00         0:01:11 0:01:18    0.00 0:01:12 0:01:13 0:01:14 0:01:20 
+##     134      34      17      17      16      15      13      13      12 
+## 0:01:53 0:01:09 0:01:17 0:01:23 0:01:32 0:01:05 0:01:24 0:01:29 0:01:36 
+##      12      11      11      11      11      10      10      10      10 
+## 0:01:51 0:01:54 0:01:58 0:00:55 0:01:01 0:01:06 0:01:08 0:01:10 0:01:16 
+##      10      10      10       9       9       9       9       9       9 
+## 0:01:22 0:01:25 0:01:30 0:01:35 0:01:37 0:01:56 0:00:39 0:00:53 0:00:56 
+##       9       9       9       9       9       9       8       8       8 
+## 0:00:57 0:01:03 0:01:27 0:01:31 0:01:38 0:01:43 0:01:47 0:00:42 0:00:48 
+##       8       8       8       8       8       8       8       7       7 
+## 0:01:07 0:01:19 0:01:40 0:01:41 0:01:42 0:01:45 0:01:50 0:01:52 0:02:00 
+##       7       7       7       7       7       7       7       7       7 
+## 0:02:04 0:02:31 0:00:31 0:00:54 0:00:59 0:01:21 0:01:26 0:01:44 0:01:48 
+##       7       7       6       6       6       6       6       6       6 
+## 0:01:59 0:02:06 0:02:07 0:02:23 0:02:35 0:00:08 0:00:38 0:01:00 0:01:02 
+##       6       6       6       6       6       5       5       5       5 
+## 0:01:04 0:01:33 0:01:34 0:01:39 0:01:46 0:02:09 0:02:12 0:02:19 0:02:21 
+##       5       5       5       5       5       5       5       5       5 
+## 0:02:27 0:02:29 0:02:42 0:02:47 0:02:51 0:02:54 0:03:03 0:00:11 0:00:12 
+##       5       5       5       5       5       5       5       4       4 
+## 0:00:20 0:00:27 0:00:33 0:00:41 0:00:49 0:00:50 0:00:58 0:01:15 0:01:28 
+##       4       4       4       4       4       4       4       4       4 
+## (Other) 
+##     350 
+## 
+## $bounce.rate
+##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
+##     0.00    24.89    65.75   430.47    79.32 10000.00       34 
+## 
+## $exit.percent
+##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
+##     0.00    17.67    42.09   347.91    62.37 10000.00       34 
+## 
+## $month
+##                        01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM 
+##                     34                     84                     84 
+## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM 
+##                     78                     79                     80 
+## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM 
+##                     88                     83                     87 
+## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM 
+##                     75                     84                     85 
+## 08/01/2015 12:00:00 AM 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM 
+##                     70                     84                     77 
+## 12/01/2015 12:00:00 AM 
+##                     70</code></pre>
+<pre class="r"><code>lapply(mobile, summary)</code></pre>
+<pre><code>## $Operating_System
+##                           Android             Bada       BlackBerry 
+##               34               17                4               17 
+##       Firefox OS              iOS               LG              LGE 
+##                5               10               12                1 
+##              MOT     Nintendo 3DS            Nokia        (not set) 
+##                1                7               16               17 
+## Playstation Vita          Samsung         Series40        SymbianOS 
+##               12               17               10               17 
+##          Windows    Windows Phone 
+##               17               17 
+## 
+## $Sessions
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##       6      16     217   38469   10718  519563      34 
+## 
+## $New_Sessions
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##    0.44   45.53   84.62   72.65  100.00  100.00      34 
+## 
+## $New_Users
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##       6      13     124   17575    4853  236550      34 
+## 
+## $Bounce_Rate
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##    0.00   53.85   62.98   66.21   84.62  100.00      34 
+## 
+## $PagesPerSession
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##   1.000   1.210   1.860   2.082   2.500   9.000      34 
+## 
+## $AvgSessionDuration
+## 0:00:00         0:00:06 0:00:41 0:01:06 0:01:21 0:01:45 0:01:50 0:02:00 
+##      46      34       3       3       3       3       3       3       3 
+## 0:00:04 0:00:09 0:00:25 0:00:42 0:01:01 0:01:05 0:01:07 0:01:09 0:01:20 
+##       2       2       2       2       2       2       2       2       2 
+## 0:01:46 0:01:56 0:02:02 0:02:06 0:02:40 0:02:49 0:03:01 0:03:05 0:03:53 
+##       2       2       2       2       2       2       2       2       2 
+## 0:00:02 0:00:14 0:00:17 0:00:20 0:00:21 0:00:24 0:00:26 0:00:29 0:00:32 
+##       1       1       1       1       1       1       1       1       1 
+## 0:00:34 0:00:38 0:00:40 0:00:43 0:00:44 0:00:46 0:00:48 0:00:49 0:00:50 
+##       1       1       1       1       1       1       1       1       1 
+## 0:00:52 0:00:55 0:00:56 0:01:03 0:01:08 0:01:12 0:01:14 0:01:16 0:01:19 
+##       1       1       1       1       1       1       1       1       1 
+## 0:01:24 0:01:25 0:01:26 0:01:28 0:01:29 0:01:33 0:01:34 0:01:35 0:01:37 
+##       1       1       1       1       1       1       1       1       1 
+## 0:01:41 0:01:42 0:01:51 0:01:52 0:01:54 0:02:01 0:02:03 0:02:05 0:02:08 
+##       1       1       1       1       1       1       1       1       1 
+## 0:02:09 0:02:10 0:02:11 0:02:13 0:02:14 0:02:15 0:02:17 0:02:18 0:02:19 
+##       1       1       1       1       1       1       1       1       1 
+## 0:02:24 0:02:26 0:02:34 0:02:39 0:02:47 0:02:48 0:02:52 0:02:56 0:02:57 
+##       1       1       1       1       1       1       1       1       1 
+## 0:03:04 0:03:07 0:03:14 0:03:18 0:03:21 0:03:25 0:03:26 0:03:29 0:03:36 
+##       1       1       1       1       1       1       1       1       1 
+## (Other) 
+##      22 
+## 
+## $Month
+##                        01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM 
+##                     34                     15                      9 
+## 02/01/2015 12:00:00 AM 02/01/2016 12:00:00 AM 03/01/2015 12:00:00 AM 
+##                     13                     11                     15 
+## 03/01/2016 12:00:00 AM 04/01/2015 12:00:00 AM 04/01/2016 12:00:00 AM 
+##                      9                     12                     10 
+## 05/01/2015 12:00:00 AM 06/01/2015 12:00:00 AM 07/01/2015 12:00:00 AM 
+##                     11                     14                     12 
+## 07/01/2016 12:00:00 AM 08/01/2015 12:00:00 AM 08/01/2016 12:00:00 AM 
+##                      9                     14                     10 
+## 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM 12/01/2015 12:00:00 AM 
+##                     10                     12                     11</code></pre>
+</div>
+<div id="pc3" class="section level2">
+<h2>PC3</h2>
+<p>First let’s create a table/array using <code>tapply</code> that sums pageviews per month across all the sites:</p>
+<pre class="r"><code>total.views.bymonth.tbl &lt;- tapply(total$pageviews, total$month, sum)
 total.views.bymonth.tbl</code></pre>
 <pre><code>##                        01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM 
 ##                     NA                6350440                3471121 
@@ -303,13 +625,20 @@ total.views.bymonth.tbl</code></pre>
 ##                7045189                3067760                2961681 
 ## 12/01/2015 12:00:00 AM 
 ##                5745045</code></pre>
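+<p>The <code>NA</code> in the first cell appears to come from the rows with a blank month, which also have missing pageviews. As a small sketch with made-up numbers (the <code>toy.pv</code> data frame and <code>toy.tbl</code> are hypothetical), <code>tapply</code> applies <code>sum</code> within each group, and a group that contains a missing value sums to <code>NA</code>:</p>
+<pre class="r"><code>## hypothetical data: two months plus one blank row with a missing value
+toy.pv &lt;- data.frame(month = c(&quot;jan&quot;, &quot;jan&quot;, &quot;feb&quot;, &quot;&quot;),
+                     pageviews = c(10, 5, 7, NA))
+
+## one sum per distinct value of month; the blank group sorts first
+toy.tbl &lt;- tapply(toy.pv$pageviews, toy.pv$month, sum)
+toy.tbl</code></pre>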
-<pre class="r"><code>## now construct a data frame
-total.views &lt;- data.frame(months=names(total.views.bymonth.tbl),
+<p>If you run <code>class</code> on <code>total.views.bymonth.tbl</code> you’ll notice it’s not a data frame yet. We can change that:</p>
+<pre class="r"><code>total.views &lt;- data.frame(months=names(total.views.bymonth.tbl),
                           total=total.views.bymonth.tbl)
 
-## zero out the rownames so it looks a bit better (this would all work
-## the same if i didn't do this part)
-rownames(total.views) &lt;- NULL
+head(total.views)</code></pre>
+<pre><code>##                                        months   total
+##                                                    NA
+## 01/01/2015 12:00:00 AM 01/01/2015 12:00:00 AM 6350440
+## 01/01/2016 12:00:00 AM 01/01/2016 12:00:00 AM 3471121
+## 02/01/2015 12:00:00 AM 02/01/2015 12:00:00 AM 5820453
+## 02/01/2016 12:00:00 AM 02/01/2016 12:00:00 AM 3366834
+## 03/01/2015 12:00:00 AM 03/01/2015 12:00:00 AM 6609602</code></pre>
+<p>Let’s clean up the row names (this would all work the same if I didn’t do this part).</p>
+<pre class="r"><code>rownames(total.views) &lt;- NULL
 
 head(total.views)</code></pre>
 <pre><code>##                   months   total
@@ -319,19 +648,14 @@ head(total.views)</code></pre>
 ## 4 02/01/2015 12:00:00 AM 5820453
 ## 5 02/01/2016 12:00:00 AM 3366834
 ## 6 03/01/2015 12:00:00 AM 6609602</code></pre>
-<pre class="r"><code>## PC4. Using the mobile dataset, create a new data frame where one
-## column is each month described in the data and the second is a
-## measure (estimate?) of the total number of views made by mobiles
-## (all platforms) over each month. This will will involve at least
-## two steps since total views are included. You'll need to first use
-## the data there to create a measure of the total views per platform.
-
-## first, multiply sessions by pages per session to get an estimate of
-## total pages
-mobile$total.pages &lt;- mobile$Sessions * mobile$PagesPerSession 
-
-# see above, this is more or less copy/pasted from above
-mobile.views.bymonth.tbl &lt;- tapply(mobile$total.pages, mobile$Month, sum)
+</div>
+<div id="pc4" class="section level2">
+<h2>PC4</h2>
+<p>Onwards to the mobile dataset!</p>
+<p>Here we have a challenge because we have to estimate total pageviews (it’s not given in the raw dataset). I’ll do this by multiplying sessions by pages-per-session. This assumes that the original pages-per-session calculation is precise, but I’m not sure what else we could do under the circumstances.</p>
+<pre class="r"><code>mobile$total.pages &lt;- mobile$Sessions * mobile$PagesPerSession </code></pre>
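+<p>To make the estimate concrete, here is a minimal sketch with made-up numbers (the <code>toy.mobile</code> data frame is hypothetical). Because pages-per-session is reported rounded to two decimals, the product is an approximation and will often not be a whole number:</p>
+<pre class="r"><code>## hypothetical rows: one large platform, one tiny one
+toy.mobile &lt;- data.frame(Sessions = c(100, 6),
+                         PagesPerSession = c(3.65, 1.00))
+
+## estimated pageviews = sessions x average pages viewed per session
+toy.mobile$total.pages &lt;- toy.mobile$Sessions * toy.mobile$PagesPerSession
+toy.mobile</code></pre>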
+<p>Then, making the views-per-month array is more or less copy/pasted from above:</p>
+<pre class="r"><code>mobile.views.bymonth.tbl &lt;- tapply(mobile$total.pages, mobile$Month, sum)
 mobile.views.bymonth.tbl</code></pre>
 <pre><code>##                        01/01/2015 12:00:00 AM 01/01/2016 12:00:00 AM 
 ##                     NA              1399185.6               668275.2 
@@ -346,52 +670,114 @@ mobile.views.bymonth.tbl</code></pre>
 ## 09/01/2015 12:00:00 AM 10/01/2015 12:00:00 AM 12/01/2015 12:00:00 AM 
 ##               564453.5              1285288.0              1223414.0</code></pre>
 <pre class="r"><code>mobile.views &lt;- data.frame(months=names(mobile.views.bymonth.tbl),
-                           mobile=mobile.views.bymonth.tbl)</code></pre>
-<pre class="r"><code>## PC5. Merge your two datasets together into a new dataset with
-## columns for each month, total views (across the top 5000 pages) and
-## total mobile views. Are there are missing data? Can you tell why?
-
-### TODO cleanup variable names to match
-
-views &lt;- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by=&quot;months&quot;)
-
-## these don't sort well at the moment because they're not really
-## dates, so lets recode them
-views$months &lt;- as.Date(views$months, format=&quot;%m/%d/%Y %H:%M:%S&quot;)
-
-## as then sort them
-views &lt;- views[sort.list(views$months),]
-
-## there's one line that is all missing, so lets drop that
-views &lt;- views[apply(views, 1, function (x) {!all(is.na(x))}),]
-
-## inspect it, looks like there's some missing data. lets drop
-## that. there are a few ways but complete.cases() might make most
-## cases
-views.complete &lt;- views[complete.cases(views),]</code></pre>
-<pre class="r"><code>## PC6. Create a new column in your merged dataset that describes your
-## best estimate of the proportion (or percentage, if you really
-## must!) of views that comes from mobile. Be able to talk about the
-## assumptions you've made here. Make sure that date, in this final
-## column, is a date or datetime object in R.
-
-views.complete$pct.mobile &lt;- views.complete$mobile / views.complete$total
-    
-## PC6. Graph this over time and be ready to describe: (a) your best
-## estimate of the proportion of views from mobiles to the Seattle
-## City website over time and (b) an indication of whether it's going
-## up or down.
-
-library(ggplot2)
-ggplot(data=views.complete) + aes(x=months, y=pct.mobile) + geom_point() + scale_y_continuous(limits=c(0, 1))</code></pre>
-<p><img src="" width="672" /></p>
+                           mobile=mobile.views.bymonth.tbl)
+rownames(mobile.views) &lt;- NULL</code></pre>
+</div>
+<div id="pc5" class="section level2">
+<h2>PC5</h2>
+<p>Now we merge the two datasets. Notice that I have created the <code>months</code> column in both datasets with <em>exactly</em> the same name.</p>
+<pre class="r"><code>views &lt;- merge(mobile.views, total.views, all.x=TRUE, all.y=TRUE, by=&quot;months&quot;)</code></pre>
+<p>These are sorted in strange ways and will be difficult to graph because the dates are stored as characters. Let’s convert them into Date objects. Then I can use <code>sort.list</code> to sort everything.</p>
+<pre class="r"><code>views$months &lt;- as.Date(views$months, format=&quot;%m/%d/%Y %H:%M:%S&quot;)
+
+views &lt;- views[sort.list(views$months),]</code></pre>
+<p>Take a look at the data. Some rows are missing observations. We can drop those rows using <code>complete.cases</code>:</p>
+<pre class="r"><code>lapply(views, summary)</code></pre>
+<pre><code>## $months
+##         Min.      1st Qu.       Median         Mean      3rd Qu. 
+## &quot;2015-01-01&quot; &quot;2015-05-01&quot; &quot;2015-09-01&quot; &quot;2015-09-20&quot; &quot;2016-02-01&quot; 
+##         Max.         NA's 
+## &quot;2016-08-01&quot;          &quot;1&quot; 
+## 
+## $mobile
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##  564454  800843 1275315 1190013 1402086 1988848       1 
+## 
+## $total
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+## 2961681 3557936 5820453 5348818 6576828 8084318       3</code></pre>
+<pre class="r"><code>views[rowSums(is.na(views)) &gt; 0,]</code></pre>
+<pre><code>##        months   mobile total
+## 13 2016-07-01 878142.6    NA
+## 15 2016-08-01 912435.4    NA
+## 1        &lt;NA&gt;       NA    NA</code></pre>
+<pre class="r"><code>views.complete &lt;- views[complete.cases(views),]</code></pre>
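+<p>To see what <code>complete.cases</code> returns, here is a small sketch on a made-up data frame (the name <code>toy.na</code> and its values are hypothetical). It gives one logical value per row, which is why it works as a row index.</p>
+<pre class="r"><code>## Hypothetical data frame with a missing value in two different rows
+toy.na &lt;- data.frame(x=c(1, NA, 3), y=c(&quot;a&quot;, &quot;b&quot;, NA))
+
+## TRUE only for rows with no NA in any column
+complete.cases(toy.na)
+
+## Use the logical vector to keep only the fully observed rows
+toy.na[complete.cases(toy.na),]</code></pre>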
+</div>
+<div id="pc6" class="section level2">
+<h2>PC6</h2>
+<p>For my proportion measure, I’ll take the mobile views divided by the total views.</p>
+<pre class="r"><code>views.complete$prop.mobile &lt;- views.complete$mobile / views.complete$total</code></pre>
+</div>
+<div id="pc7." class="section level2">
+<h2>PC7</h2>
+<pre class="r"><code>library(ggplot2)
+ggplot(data=views.complete) + aes(x=months, y=prop.mobile) + geom_point() + geom_line() + scale_y_continuous(limits=c(0, 1))</code></pre>
+<p><img src="" width="672" /></p>
+<ol style="list-style-type: lower-alpha">
+<li>For my estimate of the proportion I’ll just calculate an average from the monthly numbers:</li>
+</ol>
+<pre class="r"><code>mean(views.complete$prop.mobile)</code></pre>
+<pre><code>## [1] 0.2308486</code></pre>
+<ol start="2" style="list-style-type: lower-alpha">
+<li>From the graph, this proportion seems quite stable with the exception of a single outlier month in late 2015.</li>
+</ol>
 </div>
 </div>
 <div id="statistical-questions" class="section level1">
 <h1>Statistical questions</h1>
+<div id="sq1-4.8" class="section level2">
+<h2>SQ1 — 4.8</h2>
+<p>The general formula for a confidence interval is <span class="math inline">\(point~estimate~±~z^*\times~SE\)</span>. First, identify the three different values. The point estimate is 52%, <span class="math inline">\(z^* = 2.58\)</span> for a 99% confidence level (that’s the number of standard deviations around the mean needed to include 99% of a Z-score distribution), and <span class="math inline">\(SE = 2.4\%\)</span>.</p>
+<p>With this we can plug and chug:</p>
+<p><span class="math display">\[52\% ± 2.58 \times 2.4\% → (45.8\%, 58.2\%)\]</span></p>
+<p>From these data we are 99% confident that between 45.8% and 58.2% of U.S. adult Twitter users get some news through the site.</p>
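+<p>You can check this arithmetic in R. This is just a sketch: the variable names are my own, and I use <code>qnorm</code> to look up <span class="math inline">\(z^*\)</span> instead of the table, so the result may differ from the hand calculation in the last decimal place.</p>
+<pre class="r"><code>point.estimate &lt;- 0.52
+se &lt;- 0.024
+
+## A 99% confidence level leaves 0.5% in each tail; z.star is about 2.58
+z.star &lt;- qnorm(1 - 0.01/2)
+
+## Lower and upper bounds of the interval
+ci &lt;- point.estimate + c(-1, 1) * z.star * se
+round(ci, 3)</code></pre>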
+</div>
+<div id="sq2-4.10" class="section level2">
+<h2>SQ2 — 4.10</h2>
+<ol style="list-style-type: lower-alpha">
+<li><p>False. See the answer to 4.8 above. With <span class="math inline">\(\alpha = 0.01\)</span>, we can consult the 99% confidence interval. It includes 50% but also goes lower.</p></li>
+<li><p>False. The standard error does not contain any information about the proportion of the population included in the sample. It measures the variability of the sampling distribution of the estimate.</p></li>
+<li><p>False. Increasing the sample size will decrease the standard error. Consider the formula: <span class="math inline">\(\frac{\sigma}{\sqrt{n}}\)</span>. A smaller <span class="math inline">\(n\)</span> will result in a larger standard error.</p></li>
+<li><p>False. All else being equal, a lower confidence level will produce a narrower interval and a higher confidence level will produce a wider one. To confirm this, revisit the formula in SQ1 above and plug in the <span class="math inline">\(z^*\)</span> value for a 90% confidence level (<span class="math inline">\(\alpha = 0.1\)</span>), which is 1.65 (see the Z-score table in the back of <em>OpenIntro</em>).</p></li>
+</ol>
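+<p>Parts (c) and (d) are easy to check numerically. The sketch below compares the widths of 90% and 99% intervals for the same standard error, then shows how <span class="math inline">\(\frac{\sigma}{\sqrt{n}}\)</span> shrinks as <span class="math inline">\(n\)</span> grows. The value of <code>sigma</code> and the sample sizes are made up for illustration.</p>
+<pre class="r"><code>se &lt;- 0.024
+conf.levels &lt;- c(0.90, 0.99)
+
+## Two-sided z* values: about 1.64 and 2.58
+z.stars &lt;- qnorm(1 - (1 - conf.levels)/2)
+
+## Full width of each interval; the 90% interval is the narrower one
+widths &lt;- 2 * z.stars * se
+data.frame(conf.levels, z.stars, widths)
+
+## Standard error for a hypothetical sigma at increasing sample sizes
+sigma &lt;- 1
+n &lt;- c(100, 400, 1600)
+sigma / sqrt(n)</code></pre>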
+</div>
+<div id="sq3-4.19" class="section level2">
+<h2>SQ3 — 4.19</h2>
+<p>The hypotheses should be about the population mean (<span class="math inline">\(\mu\)</span>) and not the sample mean (<span class="math inline">\(\bar{x}\)</span>). The null hypothesis should have an equal sign. The alternative hypothesis should be about the null hypothesized value, not the observed sample mean. The following would have been better:</p>
+<p><span class="math display">\[H_0: \mu = 10~hours\]</span> <span class="math display">\[H_A: \mu \gt 10~hours\]</span></p>
+</div>
+<div id="sq4-4.32" class="section level2">
+<h2>SQ4 — 4.32</h2>
+<ol style="list-style-type: lower-alpha">
+<li>True. See part (d) of SQ2 above.</li>
+<li>False. The alpha value is the probability of a Type 1 Error, so reducing the one reduces the other.</li>
+<li>False. Failing to reject the null means only that we cannot conclude that the true value is different from the null value. This is <strong>very</strong> different from evidence that the null hypothesis is true.</li>
+<li>True. Consult the section of <em>OpenIntro</em> discussing statistical power and Type 2 Error.</li>
+<li>True. We’ll revisit this in a moment below, but consider the relationship between the test statistic, the standard error, and the sample size. As the sample size increases towards infinity, the standard error approaches zero, so the point estimate becomes arbitrarily precise and even a tiny difference between the estimate and the null value will lead to rejecting the null hypothesis at any significance level <span class="math inline">\(\alpha\)</span>.</li>
+</ol>
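+<p>Part (e) is worth seeing with numbers. In this sketch I hold a tiny difference from the null value fixed and only increase the sample size. All of the values (<code>effect</code>, <code>sigma</code>, and <code>n</code>) are hypothetical, and I assume a simple two-sided Z-test with a known standard deviation.</p>
+<pre class="r"><code>## Hypothetical setup: a very small difference from the null value
+effect &lt;- 0.01
+sigma &lt;- 1
+n &lt;- c(100, 10000, 1000000)
+
+## The standard error shrinks with n, so the test statistic grows
+se &lt;- sigma / sqrt(n)
+z &lt;- effect / se
+
+## Two-sided p-values for the same tiny effect
+p.values &lt;- 2 * pnorm(-abs(z))
+data.frame(n, se, z, p.values)</code></pre>
+<p>The effect never changes, but the p-value falls from far above any conventional threshold to essentially zero.</p>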
+</div>
 </div>
 <div id="empirical-paper-questions" class="section level1">
 <h1>Empirical paper questions</h1>
+<div id="eq1" class="section level2">
+<h2>EQ1</h2>
+<p>In my words (or rather formulas since I think that’s less ambiguous), the key pairs of null/alternative hypotheses look something like the following:</p>
+<p>Let <span class="math inline">\(\Delta\)</span> be the parameter estimate for the difference in mean percentage of positive (<span class="math inline">\(\mu_{pos}\)</span>) and negative (<span class="math inline">\(\mu_{neg}\)</span>) words between the experimental and control conditions for the treatments of reduced negative content (<span class="math inline">\(R_{neg}\)</span>) and reduced positive content (<span class="math inline">\(R_{pos}\)</span>).</p>
+<p>For the reduced negative content conditions (the left-hand side of Figure 1), the paper tests:</p>
+<p><span class="math display">\[HR_{neg}1_0: \Delta_{\mu_{pos}} = 0\]</span> <span class="math display">\[HR_{neg}1_a: \Delta_{\mu_{pos}} \gt 0\]</span> And: <span class="math display">\[HR_{neg}2_0: \Delta_{\mu_{neg}} = 0\]</span> <span class="math display">\[HR_{neg}2_a: \Delta_{\mu_{neg}} \lt 0\]</span> Then, for the reduced positive content conditions (the right-hand side of Figure 1), the paper tests:</p>
+<p><span class="math display">\[HR_{pos}1_0:~~ \Delta_{\mu_{pos}} = 0\]</span> <span class="math display">\[HR_{pos}1_a:~~ \Delta_{\mu_{pos}} \lt 0\]</span></p>
+<p>And:</p>
+<p><span class="math display">\[HR_{pos}2_0:~~ \Delta_{\mu_{neg}} = 0\]</span> <span class="math display">\[HR_{pos}2_a:~~ \Delta_{\mu_{neg}} \gt 0\]</span> Note that the theories the authors used to motivate the study imply directions for the alternative hypotheses, but nothing in the description of the analysis suggests that they used one-tailed tests. I’ve written these all in terms of specific directions here to correspond with the theories stated in the paper. They could also (arguably more accurately) have been written in terms of inequalities (“<span class="math inline">\(\neq\)</span>”).</p>
+</div>
+<div id="eq2" class="section level2">
+<h2>EQ2</h2>
+<p>The authors’ estimates suggest that reduced negative News Feed content causes an increase in the percentage of positive words and a decrease in the percentage of negative words in subsequent News Feed posts by study participants (supporting <span class="math inline">\(HR_{neg}1_a\)</span> and <span class="math inline">\(HR_{neg}2_a\)</span> respectively).</p>
+<p>They also find that reduced positive News Feed content causes a decrease in the percentage of positive words and an increase in the percentage of negative words in subsequent News Feed posts (supporting <span class="math inline">\(HR_{pos}1_a\)</span> and <span class="math inline">\(HR_{pos}2_a\)</span> respectively).</p>
+</div>
+<div id="eq3" class="section level2">
+<h2>EQ3</h2>
+<p>Cohen’s <span class="math inline">\(d\)</span> puts estimates of experimental effects in standardized units (much like a Z-score!) in order to help understand their size relative to the underlying distribution of the dependent variable(s). The d-values for each of the effects estimated in the paper are 0.02, 0.001, 0.02, and 0.008 respectively (in the order presented in the paper, not in order of the hypotheses above!). These are minuscule effects. However, the treatment itself is also quite narrow in scope, suggesting that the presence of any treatment effect at all is an indication of the underlying phenomenon (emotional contagion). Personally, I find it difficult to attribute much substantive significance to the results because I’m not even convinced that tiny shifts in the percentage of positive/negative words used in News Feed updates accurately index meaningful emotional shifts (maybe we could call it linguistic contagion instead?). Despite these concerns and the ethical considerations that attracted so much public attention, I consider this a clever, well-executed study and I think it’s quite compelling. I expect many of you will have different opinions of various kinds!</p>
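+<p>If you want to see how a standardized effect size like this can be computed, here is a minimal sketch. It uses one common definition of Cohen’s <span class="math inline">\(d\)</span> (the difference in means divided by the pooled standard deviation), which is not necessarily the exact calculation used in the paper. The function name <code>cohens.d</code> and the example values are my own and purely illustrative.</p>
+<pre class="r"><code>## Difference in group means in units of the pooled standard deviation
+cohens.d &lt;- function (treatment, control) {
+    n1 &lt;- length(treatment)
+    n2 &lt;- length(control)
+    pooled.sd &lt;- sqrt(((n1 - 1) * var(treatment) + (n2 - 1) * var(control)) / (n1 + n2 - 2))
+    (mean(treatment) - mean(control)) / pooled.sd
+}
+
+## Hypothetical percentages of positive words in two small groups
+cohens.d(c(5.3, 5.1, 5.6, 5.2), c(5.0, 5.2, 5.4, 5.1))</code></pre>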
+</div>
 </div>
 
 
