X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/68275101f1d127a83e2c6fd076c9929c6d3f4dd4..c36ea661e7d087328543688db2ccdc22e12ededd:/psets/pset3-worked_solution.html diff --git a/psets/pset3-worked_solution.html b/psets/pset3-worked_solution.html index 0970f3e..20d0eae 100644 --- a/psets/pset3-worked_solution.html +++ b/psets/pset3-worked_solution.html @@ -1786,18 +1786,73 @@ round( ## other 0.00 0.00 ## white 0.64 0.51
Here the noteworthy comparisons again arise within the black and hispanic categories. Both groups account for a substantially larger proportion of stops resulting in searches vs. those that do not result in searches.
-For the sake of completeness/comparison, here’s a way to do similar cross-tabulations in single a chunk of tidyverse code. I include conditional proportions of all stops to facilitate comparison with some of the tables I created earlier as well:
+For the sake of completeness/comparison, here’s a way to do similar cross-tabulations in chunks of tidyverse code. This first bit summarizes the number of stops and the proportion of total stops accounted for within each category of subject_race.
ilstops %>%
group_by(subject_race) %>%
filter(!is.na(subject_race)) %>%
+ summarize(
+ n_stops = n(),
+ prop_total_stops = round(n() / nrow(ilstops), digits = 3),
+ )
+## # A tibble: 5 x 3
+## subject_race n_stops prop_total_stops
+## <fct> <int> <dbl>
+## 1 asian/pacific islander 4053 0.032
+## 2 black 25627 0.202
+## 3 hispanic 16940 0.133
+## 4 other 335 0.003
+## 5 white 80105 0.63
+In that block I first make a call to group_by() to tell R that I want it to run subsequent commands on the data “grouped” within the categories of subject_race. After filtering out rows with a missing subject_race, I pipe the grouped data to summarize(), which I use to calculate the number of stops within each group (in this data that’s just the number of observations within each group) as well as the proportion of total stops within each group.
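The same group_by() and summarize() pattern can be sketched on a small made-up data frame. This is only an illustration: toy_stops and toy_counts are hypothetical names, and the values are invented rather than taken from ilstops.

```r
library(dplyr)

# Hypothetical toy data standing in for ilstops (values are invented)
toy_stops <- data.frame(
  subject_race = factor(c("black", "white", "white", "hispanic", "white", NA)),
  search_conducted = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

toy_counts <- toy_stops %>%
  group_by(subject_race) %>%
  filter(!is.na(subject_race)) %>%
  summarize(
    # n() counts the rows in the current group
    n_stops = n(),
    # the denominator is every row of the unfiltered data, missing race included
    prop_total_stops = round(n() / nrow(toy_stops), digits = 3)
  )

toy_counts
```

Because the denominator is nrow() of the unfiltered data, the proportions here are shares of all rows, so they add up to slightly less than 1 whenever some rows have a missing subject_race.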
What about counting up the number and proportion of searches within each group? One way to think about that is as another call to summarize() (since, after all, I want to calculate the summary information for searches within the same groups). Within the Tidyverse approach to things, this kind of summarizing within groups and within another variable (search_conducted in this case) can be accomplished with the across() function.
In general, a call to across() is usually made within a call to another verb like summarize() or mutate(). The syntax for across() is similar to these others. It requires two things: (1) at least one variable to summarize across (search_conducted here) and (2) the function or functions to apply to it, which determine the outputs I get.
In this particular case, I’ll use it to calculate the within-group sums of search_conducted. Notice that I also filter out the missing values from search_conducted before I call summarize() here.
ilstops %>%
+ group_by(subject_race) %>%
+ filter(!is.na(subject_race), !is.na(search_conducted)) %>%
+ summarize(
+ across(search_conducted, sum)
+ )
+## # A tibble: 5 x 2
+## subject_race search_conducted
+## <fct> <int>
+## 1 asian/pacific islander 68
+## 2 black 1806
+## 3 hispanic 1049
+## 4 other 14
+## 5 white 3010
+If I want across() to calculate more than one summary, I need to provide it a list of functions (in a name = value format sort of similar to summarize() or mutate()).
ilstops %>%
+ group_by(subject_race) %>%
+ filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
+ summarize(
+ across(
+ search_conducted,
+ list(
+ sum = sum,
+ over_n_stops = mean
+ )
+ )
+ )
+## # A tibble: 5 x 3
+## subject_race search_conducted_sum search_conducted_over_n_stops
+## <fct> <int> <dbl>
+## 1 asian/pacific islander 68 0.0168
+## 2 black 1806 0.0707
+## 3 hispanic 1049 0.0620
+## 4 other 14 0.0419
+## 5 white 3010 0.0376
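It may help to see why sum and mean give a count and a proportion here. Assuming search_conducted is a logical (TRUE/FALSE) column, R treats TRUE as 1 and FALSE as 0, so the sum is the number of searches and the mean is the share of stops with a search. A minimal sketch on invented data (toy_stops and toy_searches are hypothetical names; across() needs dplyr 1.0.0 or later):

```r
library(dplyr)

# Hypothetical toy data; search_conducted is assumed to be logical
toy_stops <- data.frame(
  subject_race = c("black", "black", "white", "white", "white", "white"),
  search_conducted = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

toy_searches <- toy_stops %>%
  group_by(subject_race) %>%
  summarize(
    across(
      search_conducted,
      list(
        sum = sum,           # number of TRUE values in each group
        over_n_stops = mean  # proportion of TRUE values in each group
      )
    )
  )

toy_searches
```

The output columns are named by pasting the variable name and the list name together, which is where names like search_conducted_sum and search_conducted_over_n_stops come from.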
+I can clean this up a bit by sorting the output in descending order by one of the columns. I do this with a nested call to two functions, arrange() and desc(). I can also insert my earlier summary statistics for the number and proportions of stops by group back into the table.
ilstops %>%
+ group_by(subject_race) %>%
+ filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
summarize(
n_stops = n(),
prop_total_stops = round(n() / nrow(ilstops), digits = 3),
across(
search_conducted,
list(
- sum = ~ sum(.x, na.rm = TRUE),
- over_n_stops = ~ round(mean(.x, na.rm = TRUE), digits = 3)
+ sum = sum,
+ over_n_stops = mean
)
)
) %>%
@@ -1805,12 +1860,11 @@ round(
## # A tibble: 5 x 5
## subject_race n_stops prop_total_stops search_conducted… search_conducted_o…
## <fct> <int> <dbl> <int> <dbl>
-## 1 white 80105 0.63 3010 0.038
-## 2 black 25627 0.202 1806 0.071
-## 3 hispanic 16940 0.133 1049 0.062
-## 4 asian/pacific … 4053 0.032 68 0.017
-## 5 other 335 0.003 14 0.042
-Notice the use of across()
within a call to summarize()
provides one way to calculate conditional summariy info. I also use a nested call to arrange()
and desc()
at the end to sort my results in descending order by one of the columns.
+## 1 white 80043 0.63 3010 0.0376
+## 2 black 25548 0.201 1806 0.0707
+## 3 hispanic 16914 0.133 1049 0.0620
+## 4 asian/pacific … 4049 0.032 68 0.0168
+## 5 other 334 0.003 14 0.0419
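The nested arrange() and desc() call mentioned above can be sketched in isolation. This is an illustration on invented numbers; toy_summary and toy_sorted are hypothetical names, and sorting on n_stops is an assumption made for the example.

```r
library(dplyr)

# Hypothetical summary table with invented counts
toy_summary <- data.frame(
  subject_race = c("asian/pacific islander", "black", "white"),
  n_stops = c(4, 25, 80)
)

# desc() reverses the sort order, so arrange() puts the largest value first
toy_sorted <- toy_summary %>%
  arrange(desc(n_stops))

toy_sorted
```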
date
Again, many possible things worth mentioning here, so I’ll provide a few that stand out to me.