initial commit for week 10 (linear models) material

[stats_class_2020.git] / psets / pset3-worked_solution.rmd
diff --git a/psets/pset3-worked_solution.rmd b/psets/pset3-worked_solution.rmd

index b7f3ec390c12533e9e64976dfbeec38f8a5a571b..f99c666c989f45757448c22cfd9f6bd358087000 100644
--- a/psets/pset3-worked_solution.rmd
+++ b/psets/pset3-worked_solution.rmd
@@ -176,26 +176,65 @@ round(
  ```
  Here the noteworthy comparisons again arise within the black and hispanic categories. Both groups account for a substantially larger proportion of stops resulting in searches vs. those that do not result in searches. 
  
-For the sake of completeness/comparison, here's a way to do similar cross-tabulations in single a chunk of tidyverse code. I include conditional proportions of all stops to facilitate comparison with some of the tables I created earlier as well:  
-```{r tidyverse crosstabs}
+For the sake of completeness/comparison, here's a way to do similar cross-tabulations in chunks of tidyverse code. This first bit summarizes the number of stops and proportion of total stops accounted for within each of the categories of `subject_race`.
+```{r tidyverse stops by subject_race}
  ilstops %>%
    group_by(subject_race) %>%
    filter(!is.na(subject_race)) %>%
    summarize(
      n_stops = n(),
      prop_total_stops = round(n() / nrow(ilstops), digits=3),
+    )
+```
+In that block I first make a call to `group_by()` to tell R that I want it to run subsequent commands on the data "grouped" within the categories of `subject_race`. After filtering out the stops with a missing `subject_race`, I pipe the grouped data to `summarize()`, which I use to calculate the number of stops within each group (in this data that's just the number of observations within each group) as well as the proportion of total stops within each group.  
+
+What about counting up the number and proportion of searches within each group? One way to think about that is as another call to `summarize()` (since, after all, I want to calculate the summary information for searches within the same groups). Within the Tidyverse approach to things, this kind of summarizing within groups and within another variable (`search_conducted` in this case) can be accomplished with the `across()` function. 
+
+Calls to `across()` are usually made within a call to another verb like `summarize()` or `mutate()`. The syntax for `across()` is similar to these others. It requires two things: (1) at least one variable to summarize across (`search_conducted` here) and (2) the function(s) that produce the outputs I want.
+
+In this particular case, I'll use it to calculate the within-group sums of `search_conducted`. Notice that I also filter out the missing values from `search_conducted` before I call `summarize()` here.
+```{r }
+ilstops %>%
+  group_by(subject_race) %>%
+  filter(!is.na(subject_race), !is.na(search_conducted)) %>%
+  summarize(
+    across(search_conducted, sum)
+    )
+```
+If I want `across()` to calculate more than one summary, I need to provide it a named list of functions (in a `name = value` format similar to `summarize()` or `mutate()`).  
+
+```{r}
+ilstops %>%
+  group_by(subject_race) %>%
+  filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
+  summarize(
      across(
        search_conducted,
        list(
-        sum = ~ sum(.x, na.rm=TRUE),
-        over_n_stops = ~ round(mean(.x, na.rm=TRUE), digits=3)  
+        sum = sum,
+        over_n_stops = mean
          )
        )
-    ) %>%
-      arrange(desc(n_stops))
+    )
  ```
+I can clean this up a bit by sorting the output in descending order by one of the columns. I do this with a nested call to two functions, `arrange()` and `desc()`. I can also insert my earlier summary statistics for the number and proportions of stops by group back into the table.
  
-Notice the use of `across()` within a call to `summarize()` provides one way to calculate conditional summariy info. I also use a nested call to `arrange()` and `desc()` at the end to sort my results in descending order by one of the columns.  
+```{r}
+ilstops %>%
+  group_by(subject_race) %>%
+  filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
+  summarize(
+    n_stops = n(),
+    prop_total_stops = round(n() / nrow(ilstops), digits=3),
+    across(
+      search_conducted,
+      list(
+        sum = sum,
+        over_n_stops = mean)
+      )
+    ) %>%
+  arrange(desc(n_stops))
+```
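To see what the combined `summarize()` and `across()` call returns without loading the full data, here is a minimal sketch on a made-up toy table. The `toy_stops` tibble and its values are hypothetical stand-ins for `ilstops`, and the sketch assumes dplyr 1.0.0 or later (when `across()` was introduced).

```r
library(dplyr)

# Hypothetical toy data standing in for `ilstops`; the values are made up.
toy_stops <- tibble(
  subject_race = c("a", "a", "a", "b", "b", NA),
  search_conducted = c(TRUE, FALSE, FALSE, TRUE, NA, TRUE)
)

toy_summary <- toy_stops %>%
  # Drop rows with missing values in either variable before summarizing
  filter(!is.na(subject_race), !is.na(search_conducted)) %>%
  group_by(subject_race) %>%
  summarize(
    n_stops = n(),
    # A named list of functions yields one output column per function
    across(
      search_conducted,
      list(sum = sum, over_n_stops = mean)
    )
  ) %>%
  # Sort groups from most to fewest stops
  arrange(desc(n_stops))

toy_summary
```

By default, `across()` names each output column by joining the input column name and the list name with an underscore, so the result here has columns `search_conducted_sum` and `search_conducted_over_n_stops` alongside `n_stops`.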
  
  ### Searches by `date`
  
@@ -501,6 +540,7 @@ Several noteworthy comparisons come looking across the different proportions for
  Again, many possible things worth mentioning here, so I'll provide a few that stand out to me.  
  
  * The generalizability of analysis focused on one state during one six-year period is limited.
+* Working with a random $1\%$ sample of the full dataset means that our results here could diverge from those we would find in an analysis of the full population of traffic stops in unpredictable ways. That said, even the very small sample is quite big and once you've read *OpenIntro* §5 you'll have some tools to estimate standard errors and confidence intervals around the various results from this analysis.   
  * The data seem very prone to measurement errors of various kinds. In particular, I suspect the race/ethnicity classifications provided by officers are subject to some biases that are hard to identify and might also shift over time/region. The prevalence of missing values during the first two years of the dataset illustrates one aspect of this and may impact estimates of raw counts and proportions.  
  * While the comparisons across racial/ethnic groups and between the traffic stops/searches and baseline population proportions illustrate a number of suggestive patterns, conclusive interpretation or attribution of those patterns to any specific cause or causes is quite difficult in the absence of additional information or assumptions. For one example, see my comments regarding statistical independence and the possible explanations in SQ2 above.
  * Extensions of this analysis might seek to investigate how some of the patterns identified in the aggregate state-level data vary across sub-regions (e.g., counties or police districts) or even in comparison to other states.
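
As a rough illustration of the point above about standard errors and confidence intervals for a sample proportion, here is a minimal sketch using the usual normal-approximation formula. The values of `p_hat` and `n` are made up for illustration and are not results from `ilstops`.

```r
# Hypothetical numbers for illustration only; not results from `ilstops`.
p_hat <- 0.05   # made-up sample proportion (e.g., share of stops with a search)
n <- 10000      # made-up number of stops in the sample

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% confidence interval using the normal approximation
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * se

se
ci_95
```

Because the standard error shrinks with the square root of `n`, even a small-percentage sample can give fairly tight intervals when the number of sampled stops is large.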