X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/efa6913590499d105277c2907cde2a96aa7bd51f..f79b3ace181536090eb8fbedfaa3a887ac028ac6:/psets/pset3-worked_solution.html?ds=inline

diff --git a/psets/pset3-worked_solution.html b/psets/pset3-worked_solution.html
index 0d9a589..20d0eae 100644
--- a/psets/pset3-worked_solution.html
+++ b/psets/pset3-worked_solution.html
@@ -1786,18 +1786,73 @@ round(
 ##   other                   0.00 0.00
 ##   white                   0.64 0.51</code></pre>
 <p>Here the noteworthy comparisons again arise within the black and hispanic categories. Both groups account for a substantially larger proportion of stops resulting in searches vs. those that do not result in searches.</p>
-<p>For the sake of completeness/comparison, here’s a way to do similar cross-tabulations in single a chunk of tidyverse code. I include conditional proportions of all stops to facilitate comparison with some of the tables I created earlier as well:</p>
+<p>For the sake of completeness/comparison, here’s a way to do similar cross-tabulations in chunks of tidyverse code. This first bit summarizes the number of stops and proportion of total stops accounted for within each of the categories of <code>subject_race</code>.</p>
 <pre class="r"><code>ilstops %&gt;%
   group_by(subject_race) %&gt;%
   filter(!is.na(subject_race)) %&gt;%
+  summarize(
+    n_stops = n(),
+    prop_total_stops = round(n() / nrow(ilstops), digits = 3),
+  )</code></pre>
+<pre><code>## # A tibble: 5 x 3
+##   subject_race           n_stops prop_total_stops
+##   &lt;fct&gt;                    &lt;int&gt;            &lt;dbl&gt;
+## 1 asian/pacific islander    4053            0.032
+## 2 black                    25627            0.202
+## 3 hispanic                 16940            0.133
+## 4 other                      335            0.003
+## 5 white                    80105            0.63</code></pre>
+<p>In that block I first make a call to <code>group_by()</code> to tell R that I want it to run subsequent commands on the data “grouped” within the categories of <code>subject_race</code>. Then I pipe the grouped data to <code>summarize()</code>, which I use to calculate the number of stops within each group (in this data that’s just the number of observations within each group) as well as the proportion of total stops within each group.</p>
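<p>As a minimal sketch of the same <code>group_by()</code> then <code>summarize()</code> pattern, here is a version that runs on a small made-up table. The <code>toy_stops</code> data and the <code>toy_counts</code> name are hypothetical stand-ins, not the real <code>ilstops</code> data.</p>

```r
library(dplyr)

# Hypothetical toy data standing in for ilstops (not the real dataset).
toy_stops <- tibble(
  subject_race = factor(c("black", "white", "white", "hispanic", "white", NA)),
  search_conducted = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

toy_counts <- toy_stops %>%
  filter(!is.na(subject_race)) %>%   # drop rows with no recorded race
  group_by(subject_race) %>%         # later verbs run once per race category
  summarize(
    n_stops = n(),                                # rows in the current group
    prop_total_stops = n() / nrow(toy_stops)      # share of ALL rows, NA included
  )

toy_counts
```

<p>Because the denominator here is <code>nrow(toy_stops)</code>, it still counts the row with a missing <code>subject_race</code>, so the proportions in this sketch sum to 5/6 rather than 1. The same can happen with real data whenever the grouping variable has missing values.</p>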
+<p>What about counting up the number and proportion of searches within each group? One way to think about that is as another call to <code>summarize()</code> (since, after all, I want to calculate the summary information for searches within the same groups). Within the Tidyverse approach to things, this kind of summarizing within groups and within another variable (<code>search_conducted</code> in this case) can be accomplished with the <code>across()</code> function.</p>
+<p>In general, <code>across()</code> is usually called within another verb like <code>summarize()</code> or <code>mutate()</code>, and its syntax is similar to theirs. It requires two things: (1) at least one variable to summarize across (<code>search_conducted</code> here) and (2) the function(s) to apply to that variable, which determine the outputs I want.</p>
+<p>In this particular case, I’ll use it to calculate the within-group sums of <code>search_conducted</code>. Notice that I also filter out the missing values from <code>search_conducted</code> before I call <code>summarize()</code> here.</p>
+<pre class="r"><code>ilstops %&gt;%
+  group_by(subject_race) %&gt;%
+  filter(!is.na(subject_race), !is.na(search_conducted)) %&gt;%
+  summarize(
+    across(search_conducted, sum)
+  )</code></pre>
+<pre><code>## # A tibble: 5 x 2
+##   subject_race           search_conducted
+##   &lt;fct&gt;                             &lt;int&gt;
+## 1 asian/pacific islander               68
+## 2 black                              1806
+## 3 hispanic                           1049
+## 4 other                                14
+## 5 white                              3010</code></pre>
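<p>To see why the missing values get filtered out before summing, here is a small sketch on made-up data. The <code>toy_stops</code>, <code>unfiltered</code>, and <code>filtered</code> names are hypothetical and the values are invented for illustration.</p>

```r
library(dplyr)

# Hypothetical toy data; search_conducted has one missing value.
toy_stops <- tibble(
  subject_race = c("black", "black", "white", "white"),
  search_conducted = c(TRUE, NA, FALSE, TRUE)
)

# Without filtering, the NA propagates into the "black" group's sum.
unfiltered <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(across(search_conducted, sum))

# Dropping incomplete rows first gives a usable count for every group.
filtered <- toy_stops %>%
  filter(!is.na(search_conducted)) %>%
  group_by(subject_race) %>%
  summarize(across(search_conducted, sum))

unfiltered
filtered
```

<p>An alternative is to keep the rows and pass <code>na.rm = TRUE</code> to <code>sum()</code>. Filtering also shrinks each group's row count, so any <code>n()</code> computed afterwards counts only the rows with complete data.</p>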
+<p>If I want <code>across()</code> to calculate more than one summary, I need to provide it a list of things (in a <code>name = value</code> format sort of similar to <code>summarize()</code> or <code>mutate()</code>).</p>
+<pre class="r"><code>ilstops %&gt;%
+  group_by(subject_race) %&gt;%
+  filter(!is.na(subject_race) &amp; !is.na(search_conducted)) %&gt;%
+  summarize(
+    across(
+      search_conducted,
+      list(
+        sum = sum,
+        over_n_stops = mean
+      )
+    )
+  )</code></pre>
+<pre><code>## # A tibble: 5 x 3
+##   subject_race           search_conducted_sum search_conducted_over_n_stops
+##   &lt;fct&gt;                                 &lt;int&gt;                         &lt;dbl&gt;
+## 1 asian/pacific islander                   68                        0.0168
+## 2 black                                  1806                        0.0707
+## 3 hispanic                               1049                        0.0620
+## 4 other                                    14                        0.0419
+## 5 white                                  3010                        0.0376</code></pre>
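<p>The sketch below, on made-up data, shows two things about that call: the output columns are named by joining the variable name and the list name with an underscore, and the <code>mean()</code> of a logical variable is the proportion of <code>TRUE</code> values. The <code>toy_stops</code>, <code>with_across</code>, and <code>by_hand</code> names are hypothetical.</p>

```r
library(dplyr)

# Hypothetical toy data (not the real ilstops dataset).
toy_stops <- tibble(
  subject_race = c("black", "black", "white", "white", "white", "white"),
  search_conducted = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# across() with a named list of functions.
with_across <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(
    across(search_conducted, list(sum = sum, over_n_stops = mean))
  )

# The same summaries spelled out one column at a time.
by_hand <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(
    search_conducted_sum = sum(search_conducted),
    search_conducted_over_n_stops = mean(search_conducted)
  )

# Both approaches produce the same column names.
identical(names(with_across), names(by_hand))
```

<p>With a single variable the two versions are about equally long. <code>across()</code> starts to pay off when the same list of functions should be applied to several columns at once.</p>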
+<p>I can clean this up a bit by sorting the output in descending order by one of the columns. I do this with a nested call to two functions, <code>arrange()</code> and <code>desc()</code>. I can also insert my earlier summary statistics for the number and proportions of stops by group back into the table.</p>
+<pre class="r"><code>ilstops %&gt;%
+  group_by(subject_race) %&gt;%
+  filter(!is.na(subject_race) &amp; !is.na(search_conducted)) %&gt;%
   summarize(
     n_stops = n(),
     prop_total_stops = round(n() / nrow(ilstops), digits = 3),
     across(
       search_conducted,
       list(
-        sum = ~ sum(.x, na.rm = TRUE),
-        over_n_stops = ~ round(mean(.x, na.rm = TRUE), digits = 3)
+        sum = sum,
+        over_n_stops = mean
       )
     )
   ) %&gt;%
@@ -1805,12 +1860,11 @@ round(
 <pre><code>## # A tibble: 5 x 5
 ##   subject_race    n_stops prop_total_stops search_conducted… search_conducted_o…
 ##   &lt;fct&gt;             &lt;int&gt;            &lt;dbl&gt;             &lt;int&gt;               &lt;dbl&gt;
-## 1 white             80105            0.63               3010               0.038
-## 2 black             25627            0.202              1806               0.071
-## 3 hispanic          16940            0.133              1049               0.062
-## 4 asian/pacific …    4053            0.032                68               0.017
-## 5 other               335            0.003                14               0.042</code></pre>
-<p>Notice the use of <code>across()</code> within a call to <code>summarize()</code> provides one way to calculate conditional summariy info. I also use a nested call to <code>arrange()</code> and <code>desc()</code> at the end to sort my results in descending order by one of the columns.</p>
+## 1 white             80043            0.63               3010              0.0376
+## 2 black             25548            0.201              1806              0.0707
+## 3 hispanic          16914            0.133              1049              0.0620
+## 4 asian/pacific …    4049            0.032                68              0.0168
+## 5 other               334            0.003                14              0.0419</code></pre>
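<p>The sorting step is not visible in this part of the diff, so here is a minimal sketch of what a nested <code>arrange()</code> and <code>desc()</code> call does. Sorting by <code>n_stops</code> is an assumption based on the order of the rows in the table; <code>toy_summary</code> and <code>sorted</code> are hypothetical names, with three rows copied from that table.</p>

```r
library(dplyr)

# Three rows copied from the summary table, deliberately out of order.
toy_summary <- tibble(
  subject_race = c("asian/pacific islander", "black", "white"),
  n_stops = c(4049, 25548, 80043)
)

# desc() flips the sort order, so the largest group comes first.
sorted <- toy_summary %>%
  arrange(desc(n_stops))

sorted
```

<p>Leaving out <code>desc()</code> would sort the same column in ascending order instead.</p>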
 </div>
 <div id="searches-by-date" class="section level3">
 <h3>Searches by <code>date</code></h3>