psets/pset3-worked_solution.rmd

   1 ---
   2 title: "Problem set 3: Worked solutions"
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "October 12, 2020"
   6 output:
   7   html_document:
   8     toc: yes
   9     toc_depth: 3
  10     toc_float:
  11       collapsed: true
  12       smooth_scroll: true
  13     theme: readable
  14   pdf_document:
  15     toc: yes
  16     toc_depth: '3'
  17 ---
  18
  19 ```{r setup, include=FALSE}
  20 knitr::opts_chunk$set(echo = TRUE, message=FALSE, tidy='styler')
  21 ```
  22
  23 # Programming challenges
  24 ## PC1. Learn about the data.
  25
  26 The documentation makes it sound like the Illinois data is particularly messy/incomplete. It explains that the data is limited to 2012-2017 (instead of extending back to 2004) "due to format issues and relevance." The notes point out that data related to searches vary substantially year-to-year, suggesting that the measurement was inconsistent and that caution should be exercised in the analysis of this measure in particular. The `search_conducted` variable seems to include searches of any kind (vehicles, driver, passengers). I also notice that the codebook does not provide much insight into *how* `subject_race` and `subject_sex` information were collected, measured, and/or recoded. I think it's best to presume that the records reflect the judgments/notes/documents accessed by the officer(s) and/or police forces involved rather than the perspectives of the subjects of the traffic stops. There are likely some inconsistencies in these measures as well.
  27
  28 ## PC2. Import, explore, clean
  29
  30 I'll use the tidyverse `read_csv()` function to do this. While I'm at it, I'll import the `lubridate` library because I already know this is time series data and there are a bunch of helpful functions in `lubridate` that I will probably want to use.
  31
  32 ```{r import}
  33 library(tidyverse)
  34 library(lubridate)
  35 data.dir <- "https://communitydata.science/~ads/teaching/2020/stats/data/week_05"
  36 filename <- "il_statewide_sample-2020_04_01.csv"
  37 ilstops <- read_csv(url(paste(data.dir, filename, sep="/")))
  38
  39 ilstops
  40 ```
  41
  42 ### Recode and explore key variables
  43
  44 Right off the bat, I see a lot of `NA` values and some things likely to need recoding. Let's look a little more closely at the variables we are going to work with.
  45 ```{r class check}
  46
  47 key_variables <- c("date", "vehicle_year", "subject_race", "subject_sex", "search_conducted")
  48
  49 lapply(ilstops[key_variables], class)
  50 ```
  51 Indeed, I want to create some factors for the `subject_race` and `subject_sex` variables.
  52 ```{r recode factors}
  53 ilstops$subject_race <- factor(ilstops$subject_race)
  54 ilstops$subject_sex <- factor(ilstops$subject_sex)
  55 ```
  56 ## PC3 Summarize key variables
  57 Now, let's look at summaries of the key variables. Keep in mind that these are all variables calculated over all traffic stops in the dataset.
  58 ```{r summarize}
  59 lapply(ilstops[key_variables], summary)
  60 ```
  61 Note that the number of missing (`NA`) values is quite different depending on the variable.
  62
  63 ### Recode nonsense `vehicle_year` values
  64
  65 Review those summaries and you might notice something a little funny about the `vehicle_year` variable. Doesn't this data end in 2017? None of the `vehicle_year` values should be more recent than 2018 (if we're feeling generous since car manufacturers sometimes start selling "next year's car" at the end of a given year?). I'm also modestly skeptical that the police pulled over a car from 1900, but at least that's a plausible value! In the abence of additional information about the provenance of this variable or reasons why values larger than 2018 might be reasonable, I will go ahead and recode anything impossibly new as missing (`NA`).
  66 ```{r recode vehicle_year}
  67 ilstops$vehicle_year[ilstops$vehicle_year > 2018] <- NA
  68
  69 summary(ilstops$vehicle_year)
  70 hist(ilstops$vehicle_year)
  71 ```
  72
  73 ## PC4. Summarize bivariate relationships
  74 For bivariate summaries and comparisons, I'll walk through things variable by variable. In all of these cases except for the `date` variable, I'm considering the bivariate relationships in aggregate (rather than taking the "over time" character of the data into account). This is good to keep in mind since it might mean that the summaries mask temporal trends.
  75
  76 ### Searches by `vehicle year`
  77 I'll start with a continuous predictor, `vehicle_year`. Since the outcome variable is a dichotomous category, I can look at this by calculating summary statistics within the two outcomes or create a pair of plots. I love visual summaries, so I'll start with side-by-side boxplots:
  78
  79 ```{r pair boxplot vehicle year x searches}
  80 ilstops %>%
  81   filter(!is.na(vehicle_year) & !is.na(search_conducted)) %>%
  82   ggplot(aes(x=search_conducted, y=vehicle_year)) +  geom_boxplot()
  83
  84 ```
  85
  86 It looks like stops of older vehicles are, on average, a little bit more likely to result in a search, but stops involving the very oldest vehicles tend to result in fewer searches.
  87
  88 With that in mind, I'll calculate summary statistics across the two groups. One way to do this is with a call to the `tapply` function nested within a call to the `with()` command. I don't think we've used `with` yet, but it can allow me to identify the object I want to use (the first argument to `with()`) to run some command on (the second argument, in this case the call to `tapply()`).
  89
  90 ```{r crosstab vehicle year x searches}
  91 with(ilstops,
  92      tapply(vehicle_year, search_conducted, summary)
  93      )
  94 ```
  95 Indeed, as I suspected based on the histograms, the center of the distribution for stops resulting in searches is a little lower (older vehicles) than among stops not resulting in searches. I'll calculate the standard deviations with a separate command:
  96 ```{r}
  97 with(ilstops, tapply(vehicle_year, search_conducted, sd, na.rm=TRUE))
  98 ```
  99
 100 The stops not resulting in searches (`FALSE`) have a larger/wider spread of vehicle ages than those that do result in searches (`TRUE`). This difference is likely driven by the larger number of old vehicle outliers inovlved in stops that do not result in searches. This is a nice example of how the visual and numerical summaries might help convey a somewhat subtle feature of my data.
 101
 102 ### Searches by `subject_sex`
 103
 104 Next, we'll focus on the `subject_sex` variable. Both the outcome and this predictor variable are categorical measures that take only two (non-missing) values. That means that I can focus on creating contingency tables to report the raw numbers of traffic stops in each of the four possible combinations of the two variables. I've provided a couple of methods here:
 105 ```{r crosstab searches and sex}
 106 table(ilstops$subject_sex, ilstops$search_conducted)
 107
 108 ## the xtabs() command produces nicely formatted output that includes both variable names
 109 xtabs(~ subject_sex + search_conducted, ilstops)
 110
 111 ```
 112 One way to get the proportions in Base-R is to call the `proportions()` function on a contingency table.
 113 ```{r props searches}
 114 proportions(
 115   xtabs(~ subject_sex + search_conducted, ilstops)
 116   )
 117 ```
 118
 119 That's a lot of significant digits...especially for proportions. I can clean it up a bit by nesting everything within a call to the `round()` function and restricting the output to just two significant digits.
 120
 121 ```{r pretty prop searches}
 122 round(
 123   proportions(
 124     xtabs(~ subject_sex + search_conducted, ilstops)
 125     )
 126   , digits=2)
 127 ```
 128 Much better! However, notice that these proportions are all calculated against the *total* number of non-missing values. I can specify row or column proportions with an additional argument to the `proportions` function to specify which margins to use. Following R conventions, `margin=1` calculates proportions row-wise and `margin=2` calculates them column-wise:
 129
 130 ```{r marginal proportions}
 131
 132 ## Row-wise proportions
 133 round(
 134   proportions(
 135     xtabs(~ subject_sex + search_conducted, ilstops),1
 136     )
 137   , digits=2)
 138
 139 ## Column-wise proportions
 140 round(
 141   proportions(
 142     xtabs(~ subject_sex + search_conducted, ilstops),2
 143     )
 144   , digits=2)
 145 ```
 146 These different marginal proportions support distinct statements about the data. Here is an example for each:
 147
 148 * **Within the rows (`subject_sex`):** *the proportion of searches in stops of male drivers is twice as large as the proportion in stops of female drivers.*
 149 * **Within the columns (`search_conducted`):** *stops that resulted in searches involved male drivers almost $75\%$ of the time.*
 150
 151 ### Searches by `subject_race`
 152
 153 The `subject_race` variable is also categorical, but takes one of five non-missing values in this data. As a result, the basic structure/approach for my summaries is almost identical to that pursued with `subject_sex`.
 154
 155 First, tabular summaries of raw counts and proportions:
 156 ```{r}
 157 xtabs(~ subject_race + search_conducted, ilstops)
 158 ```
 159 In this case, I think the marginal proportions are far more interesting than the aggregate proportions. Here are proportions of searches within each category of `subject_race` (row-wise given the way I've structured the call to `xtabs`):
 160 ```{r prop searches within group}
 161 ## Proportions of searches among stops within groups
 162 round(
 163   proportions(
 164     xtabs(~ subject_race + search_conducted, ilstops), 1)
 165   , digits=2)
 166 ```
 167 Notice that the proportion of black and hispanic drivers with stops resulting in searches is higher than within other groups.
 168
 169 Now, I'll calculate the proportions of each `subject_race` category among stops resulting in searches and stops that do not result in searches:
 170 ```{r prop group within searches}
 171 ## Proportions of groups within those stops (not) resulting in searches
 172 round(
 173   proportions(
 174     xtabs(~ subject_race + search_conducted, ilstops), 2)
 175   , digits=2)
 176 ```
 177 Here the noteworthy comparisons again arise within the black and hispanic categories. Both groups account for a substantially larger proportion of stops resulting in searches vs. those that do not result in searches.
 178
 179 For the sake of completeness/comparison, here's a way to do similar cross-tabulations in chunks of tidyverse code. This first bit summarizes the number of stops and proportion of total stops accounted for within each of the categories of `subject_race`.
 180 ```{r tidyverse stops by subject_race}
 181 ilstops %>%
 182   group_by(subject_race) %>%
 183   filter(!is.na(subject_race)) %>%
 184   summarize(
 185     n_stops = n(),
 186     prop_total_stops = round(n() / nrow(ilstops), digits=3),
 187     )
 188 ```
 189 In that block I first make a call to `group_by()` to tell R that I want it to run subsequent commands on the data "grouped" within the categories of `subject_race`. Then I pipe the grouped data to `summarize()`, which I use to calculate the number of stops within each group (in this data that's just the number of observations within each group) as well as the proportion of total stops within each group.
 190
 191 What about counting up the number and proportion of searches within each group? One way to think about that is as another call to `summarize()` (since, after all, I want to calculate the summary information for searches within the same groups). Within the Tidyverse approach to things, this kind of summarizing within groups and within another variable (`search_conducted` in this case) can be accomplished with the `across()` function.
 192
 193 In general, the `across()` function seems to usually be made within a call to another verb like `summarize()` or `mutate()`. The syntax for `across()` is similar to these others. It requires two things: (1) at least one variable to summarize across (`search_conducted` here) and (2) the outputs I want.
 194
 195 In this particular case, I'll use it to calculate the within group sums of `search_conducted`. Notice that I also filter out the missing values from `search_conducted` before I call `summarize` here.
 196 ```{r }
 197 ilstops %>%
 198   group_by(subject_race) %>%
 199   filter(!is.na(subject_race), !is.na(search_conducted)) %>%
 200   summarize(
 201     across(search_conducted, sum)
 202     )
 203 ```
 204 If I want `across()` to calculate more than one summary, I need to provide it a list of things (in a `name = value` format sort of similar to `summarize()` or `mutate()`).
 205
 206 ```{r}
 207 ilstops %>%
 208   group_by(subject_race) %>%
 209   filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
 210   summarize(
 211     across(
 212       search_conducted,
 213       list(
 214         sum = sum,
 215         over_n_stops = mean
 216         )
 217       )
 218     )
 219 ```
 220 I can clean this up a bit by using two functions to the output in descending order by one of the columns. I do this with a nested call to two functions `arrange()` and `desc()`. I can also insert my earlier summary statistics for the number and proportions of stops by group back into the table.
 221
 222 ```{r}
 223 ilstops %>%
 224   group_by(subject_race) %>%
 225   filter(!is.na(subject_race) & !is.na(search_conducted)) %>%
 226   summarize(
 227     n_stops = n(),
 228     prop_total_stops = round(n() / nrow(ilstops), digits=3),
 229     across(
 230       search_conducted,
 231       list(
 232         sum = sum,
 233         over_n_stops = mean)
 234         )
 235       ) %>%
 236       arrange(desc(n_stops))
 237 ```
 238
 239 ### Searches by `date`
 240
 241 Last, but not least, I can inspect how the number of searches is distributed over time. I can do this a few ways, but once again will start with a visual summary because that is how I prefer to approach things. Remember that `search_conducted` is a dichotomous categorical measure, so a boxplot is a good starting point.
 242 ```{r searches over time}
 243 ilstops %>%
 244   ggplot(aes(x=search_conducted, y=date)) + geom_boxplot()
 245
 246 ```
 247
 248 How considerate of ggplot to include the `NA` category by default! Notice how seemingly all of the missing values come from 2012-2014? There's some of that missing data the SOPP folks warned us about in their documentation. Let's inspect things a bit more closely:
 249
 250 ```{r explore missing values in searches}
 251 ilstops %>%
 252   group_by(search_conducted) %>%
 253   summarize(
 254     min = min(date, na.rm=TRUE),
 255     max = max(date, na.rm=TRUE)
 256   )
 257 ```
 258
 259 All the missing data for `search_conducted` comes within a two-year range.
 260
 261 I can use `filter` to look at the `search_conducted` variable within that two year range alone:
 262 ```{r missing search values pre 2014}
 263 ilstops %>%
 264   filter(date < as_date('2014-01-11')) %>%
 265   group_by(search_conducted) %>%
 266   summarize(
 267     n = n()
 268   )
 269 ```
 270 *Most* of the `search_conducted` data between 2012-2014 isn't missing, even if that's the only period within which missing values for `search_conducted` appear. There's not much we can do about this, but it's worth taking into consideration as we go.
 271
 272
 273 ## PC5 Searches within `subject_race` groups over time
 274
 275 The goal here is to synthesize pieces of the work we did in PC3 and PC4 to analyze subgroups of the stops and the stops resulting in searches over time. I'll start by reorganizing/tidying the data to support the creation of my time series plots.
 276
 277 ### 5.1 Binning data within time periods
 278
 279 I want to plot this data as a time series of some numeric measure, but right now my outcome variable is a categorical indicator. How can I plot TRUE/FALSE values over time? One good way is to bin my data into meaningful time-periods (e.g., months or weeks) and count the number of TRUE values (and/or calculate proportions of TRUE values) within each bin.
 280
 281
 282 Let's do that using some of the handy features of `lubridate` for working with date objects. I have demonstrated a few approaches to binning and grouping in the R tutorial materials, but here's yet another that takes advantage of some `lubridate` features (specifically the call to `format` within the `mutate` step below) quite effectively:
 283
 284 ```{r bin to months}
 285 ## library(lubridate)
 286
 287 ilstops_monthly <- ilstops %>%
 288   filter(!is.na(subject_race)) %>%
 289   mutate(
 290     yearmonth = format(date, '%Y-%m')
 291   ) %>%
 292   group_by(yearmonth) %>%
 293   summarize(
 294     stops = n(),
 295     searches = sum(as.numeric(search_conducted), na.rm=T),
 296     prop_searched = round(searches / stops,3)
 297     )
 298
 299 ilstops_monthly
 300 ```
 301 Looks great, except do you notice how my new `yearmonth` variable is no longer recognized as a date or date-time object? I think I can fix that with the handy `parse_date_time` function from `lubridate`:
 302 ```{r clean binned dates}
 303 ilstops_monthly$date <- parse_date_time(ilstops_monthly$yearmonth, "ym")
 304 class(ilstops_monthly$date)
 305 ```
 306
 307 Now I can plot it since `ggplot` knows what to do with objects in the `POSIXct` class. I'm also going to add a type of layer we haven't worked with yet (`geom_smooth`) which incorporates a smoothed conditional mean to the plot (appropriately enough in this dataset, in the form of a thin blue line).
 308
 309 ```{r plot monthly stops}
 310 ilstops_monthly %>%
 311   ggplot(aes(x=date, y=stops)) + geom_line() + geom_smooth() + labs(title="Illinois traffic stops", y="stops (per month)")
 312
 313 ```
 314
 315 Turns out traffic stops are not distributed smoothly across the months of the year (I notice, in particular, the big dips and spikes around the end of each year). It also looks like there is a slight $\cup$-shaped trend overall, but the shifts do not look very large relative to either the month-to-month variations or the total number of stops per year.
 316
 317 Here's a similar plot for searches:
 318 ```{r plot monthly searches}
 319 ilstops_monthly %>%
 320   ggplot(aes(x=date, y=searches)) + geom_line() + geom_smooth() + labs(title="Illinois traffic stops that result in searches",
 321                                                                        y="searches (per month)")
 322
 323 ```
 324 Turns out searches are also not distributed smoothly across the months of the year (again, the big dips and spikes around the end of the year). In this case, there seems to be a very slight downward trend on average, but this decrease mostly appears in the first couple of years of the data and does not look very large relative to the month-to-month variations or the total number of searches per year (maybe ten fewer searches per month on average during the latter half of the dataset?).
 325
 326 Let's try looking at the proportion of stops that result in searches.
 327
 328 ```{r plot monthly prop searched}
 329 ilstops_monthly %>%
 330   ggplot(aes(x=date, y=prop_searched)) + geom_line() + geom_smooth() + labs(title="Proportion of Illinois traffic stops that result in searches",  y="proportion searched\n(per month)")
 331 ```
 332 Still quite spiky! Interestingly, we see an inverse-$\cap$ trend in the average here (contrast with the raw counts of traffic stops above), but these fluctuations are, again, very small relative to the underlying variation or the range of measure. Also, notice the narrow y-axis scale sort of hides the fact that almost all the values fall between $4-6\%$. In other words, all of the variation here is within quite a narrow range.
 333
 334 ### 5.2 Plot stops within `subject_race` categories
 335
 336 To do this, I'll need to go back to the code that created my `ilstops_monthly` tibble above and incorporate the conditional summary data. To do this, I'll "explode" the aggregated summaries from the old tibble by inserting an extra variable to `group_by()` before making my call to `summarize`. Notice that I go ahead and cleanup the dates again too (since I know to anticipate that issue)
 337
 338 ```{r bin to subgroup months}
 339 ilstops_monthly_grouped <- ilstops %>%
 340   filter(!is.na(subject_race)) %>%
 341   mutate(
 342     yearmonth = format(date, '%Y-%m')
 343   ) %>%
 344   group_by(yearmonth, subject_race) %>%
 345   summarize(
 346     stops = n(),
 347     searches = sum(as.numeric(search_conducted), na.rm=T),
 348     prop_searched = round(searches / stops,3)
 349     )
 350
 351 ilstops_monthly_grouped$date <- parse_date_time(ilstops_monthly_grouped$yearmonth, "ym")
 352
 353 ilstops_monthly_grouped
 354
 355 ```
 356 Great! Now, I need to add the values for total stops and total searches each month (note that this will be the same for all five groups/rows within any given value of `date` or `yearmonth`) so that I can calculate the proportion of total stops and total searches accounted for by each group. There are a bunch of ways we could go about this. Since I've made it this far with tidyverse code, I'll keep going in that direction and use the `mutate` verb to keep all my existing rows while also creating some new ones. Notice that I have to do this twice because my first call to `group_by` is only by `yearmonth` so that I can create the monthly totals, whereas my second call to `group_by` once again looks wihtin `yearmonth` and `subject_race` to calculate the proportions. In order to make sure that nothing weird happens along the way I insert a call to `ungroup` as well (which does exactly what you might hope following an operation performed on some data that's already grouped in some way).
 357
 358 ```{r create total stops, total searches and prop for each within groups}
 359 ilstops_monthly_grouped <- ilstops_monthly_grouped %>%
 360   group_by(yearmonth) %>%
 361   mutate(
 362     total_stops = sum(stops),
 363     total_searches = sum(searches)
 364   ) %>%
 365   ungroup() %>%
 366   group_by(yearmonth, subject_race) %>%
 367   mutate(
 368     prop_total_stops = stops / total_stops,
 369     prop_total_searches = searches / total_searches
 370   )
 371
 372 ilstops_monthly_grouped
 373
 374 ## I think the html drops a few of our new columns, so I'll inspect them manually here
 375 head(ilstops_monthly_grouped[,c("yearmonth", "subject_race", "total_searches", "prop_total_stops", "prop_total_searches")])
 376 ```
 377
 378 Now my plotting code will look pretty similar to the univariate plots above, just with the addition of another aesthetic...
 379
 380 ```{r plot monthly subgroup stops}
 381 ilstops_monthly_grouped %>%
 382   ggplot(aes(x=date, y=stops, color=subject_race)) + geom_line() + geom_smooth() + labs(title="Illinois traffic stops",  y="stops (per month)")
 383
 384 ```
 385
 386 ### 5.3 Plot searches within `subject_race` categories
 387 ```{r plot monthly subgroup searches}
 388 ilstops_monthly_grouped %>%
 389   ggplot(aes(x=date, y=searches, color=subject_race)) + geom_line() + geom_smooth() + labs(title="Illinois traffic stops resulting in searches",  y="searches (per month)")
 390
 391 ```
 392
 393
 394 ### 5.4 Plot proportion of searches within `subject_race` categories
 395
 396 The question for this one only asked about proportions of total searches accounted for by each group of `subject_race`, when several other interesting baselines/comparisons are available. For example, the proportion of total stops across groups as well as stops resulting in searches *within* each group also strike me as potentially illustrative. I'll plot each of these proportions just for the sake of completeness. Let's start with the proportion of total stops plotted across groups.
 397 ```{r plot monthly subgroup prop total stops}
 398 ilstops_monthly_grouped %>%
 399   ggplot(aes(x=date, y=prop_total_stops, color=subject_race)) + geom_line() + geom_smooth() + labs(title="Proportion of total Illinois traffic stops by race/ethnicity",  y="proportion of total stops\n(per month)\n")
 400 ```
 401
 402 Onwards to the proportion of stops resulting in searches calculated *within* each group of `subject_race`:
 403
 404 ```{r plot monthly subgroup prop searched}
 405 ilstops_monthly_grouped %>%
 406   ggplot(aes(x=date, y=prop_searched, color=subject_race)) + geom_line() + geom_smooth() + labs(title="Proportion of Illinois traffic stops that result in searches",  y="proportion searched\n(calculated per month)")
 407 ```
 408
 409 Hmm, that looks pretty messy. The proportions compress everything into a narrow region of the y-axis. Also, the data for some of these groups (especially "other") fluctuates all over the place, which is a distration since that's the smallest category.
 410
 411 Instead of plotting all the proportions on the same figure, it might be easier to see/compare variations if I use facets and don't include the "other" category (because the y-axis scale is just so different).
 412
 413 ```{r plot monthly subgroup prop searched facets}
 414 ilstops_monthly_grouped %>%
 415   filter(subject_race != "other") %>%
 416   ggplot(aes(x=date, y=prop_searched, color=subject_race)) + geom_line() + geom_smooth() + facet_grid(rows=vars(subject_race)) + labs(title="Proportion of Illinois traffic stops that result in searches by race/ethnicity",  y="proportion searched\n(per month)\n")
 417 ```
 418
 419 The title runs a little long and facet label for "asian/pacific islander" gets truncated. If I were preparing this for publication I'd fix those issues, but for now, I'm going to acknowledge them and move on.
 420
 421 Now, let's look at the proportion of total searches.
 422
 423 ```{r plot monthly subgroup prop total searches}
 424 ilstops_monthly_grouped %>%
 425   ggplot(aes(x=date, y=prop_total_searches, color=subject_race)) + geom_line() + geom_smooth() + labs(title="Proportion of total Illinois traffic stops resulting in searches by race/ethnicity",  y="proportion of total searches\n(per month)\n")
 426 ```
 427
 428 ## PC6. Calculate baseline population proportions
 429
 430 As recommended, I'll calculate this using the [`county_complete`](https://www.openintro.org/data/index.php?data=county_complete) provided as part of the `openintro` package to build the relevant variables for Illinois in 2010. You'll want to review the [list of variables](https://www.openintro.org/data/index.php?data=county_complete) to figure out what you need to use here.
 431
 432 Note that what we're looking for are *statewide-proportions* of race/ethnic groups in 2010 within the same categories in the SOPP traffic stop data for Illinois. The `county_complete` dataset does not provide these values directly, but instead has county-level populations and county-level proportions of slightly different race/ethnic groups over multiple different years across all the counties in the United States. That means that we'll need to take some steps (like the following) to generate state-level proportion estimates:
 433
 434 * Restrict the dataset to Illinois counties only.
 435 * Multiply county-level populations by county-level proportions for salient groups to create county-level population counts within groups.
 436 * Sum county-level populations within groups to create state-level group populations.
 437 * Divide state-level group populations by total state population (sum of all county populations) to calculate state-level proportions.
 438
 439 In the code chunk below, I'll tackle these steps in a very "verbose" and redundant way:
 440 ```{r verbose county baselines}
 441 library(openintro)
 442 data(county_complete)
 443
 444 il_county <- county_complete[county_complete$state == "Illinois",]
 445
 446 ## total state population is easy
 447 il_pop2010 <- sum(il_county$pop2010)
 448
 449 ## Turn IL county-level proportions into counts
 450 il_county$white2010 <- il_county$pop2010*il_county$white_2010
 451 il_county$black2010 <- il_county$pop2010*il_county$black_2010
 452 il_county$hispanic2010 <- il_county$pop2010*il_county$hispanic_2010
 453
 454 ## We'll have to sum these two groups to approximate "Asian / Pacific Islander" from SOPP data
 455 il_county$asian2010 <- (il_county$pop2010*il_county$asian_2010)
 456 il_county$pi2010 <- (il_county$pop2010*il_county$pac_isl_2010)
 457
 458 ## Then go through summing and dividing by the total population
 459 ##
 460 ##prop.white
 461 sum(il_county$white2010) / il_pop2010
 462 ##prop.black
 463 sum(il_county$black2010) / il_pop2010
 464 ##prop.hispanic
 465 sum(il_county$hispanic2010) / il_pop2010
 466 ##prop.api
 467 (sum(il_county$asian2010, na.rm=TRUE)+ sum(il_county$pi2010, na.rm=TRUE)) / il_pop2010
 468 ```
 469
 470 That is all perfectly accurate; however, the code is verbose and redundant because it calculates each (intermediate and group) value with separate lines of code that are very repetitive. There are many more concise ways to go about this, all of which make it faster, less prone to typing mistakes, and easier to extend. How to proceed? Anytime I find myself repeating the same steps of analysis more than once, I consider creating a function and then calling that function within a call to an `*apply()` function. Here's what that might look like here:
 471 ```{r county baselines via function}
 472 gen_subgroup_prop <- function(group_var, d=il_county){
 473   total_pop2010 <- sum(d["pop2010"]) # state population total
 474   county_ests <- d["pop2010"]*d[group_var] # county-level counts within groups
 475   group_prop <- sum(county_ests, na.rm=TRUE) / total_pop2010 # state-level proportions within groups
 476   return(round(group_prop, 4)) # prettify output
 477 }
 478
 479 ## give it a try:
 480 gen_subgroup_prop("hispanic_2010")
 481
 482 ## Now let's tie everything together
 483 groups <- c("white_2010", "black_2010", "hispanic_2010", "asian_2010", "pac_isl_2010")
 484
 485 sapply(groups, gen_subgroup_prop)
 486
 487 ```
 488
 489 I need to add those last two values together to reproduce the "asian/pacific islander" category from the SOPP data. Note also that there are slight differences between my results here and my results above. Any differences **should** be due entirely to rounding and, in a scenario where this analysis was being prepared for publication and/or informing a policy decision, I'd ansp;ite;y want to check that and think carefully how many significant digits to include in my reported estimates. The Illinois population is sufficiently large that even a few hundredths of a percent turn into very meaningful numbers of people!
 490
 491 # Statistical questions
 492
 493 Note that my "solutions" below reflect possible interpretations, but are not intended to be exhaustive or exclusive of other possible interpretations. We'll discuss these in our class meeting.
 494
 495 ## SQ1. Interpret analysis from PCs3-5
 496
 497 Some interpretation appeared above alongside specific results. Here is a brief summary of several striking points.
 498
 499 * Overall, traffic stops and searches in Illinois between 2012-2018 were distributed unequally across different racial/ethnic groups.
 500     + The proportion of stops resulting in searches is higher among stops of black and hispanic drivers ($7\%$ and $6\%$ respectively vs. $4\%$ or less for all other groups).
 501     + Black and hispanic drivers account for a higher proportion of stops resulting in searches than stops not resulting in searches ($30\%$ and $18\%$ vs. $20\%$ and $13\%$ respectively).
 502
 503 * Over time, the tpyical number of traffic stops, stops resulting in searches, and the proportion of stops resulting in searches fluctuate widely, but have fairly stationary central trends.
 504     + The typical number of monthly stops dipped a bit between 2014-2015.
 505     + The typical number of monthly stops resulting in searches may have decreased 2012-2015, but held flat 2015-2018.
 506     + The typical monthly proportion of stops resulting in searches increased a little 2012-2014, but then declined a little bit 2014-2017.
 507
 508 * Comparing across the racial/ethnic groups identified in the data, the *typical monthly numbers of stops and searches* vary substantially and exhibit distinct trends.
 509     + The typical number of black and hispanic drivers stopped per month has increased.
 510     + The typical number of white drivers stopped per month has decreased.
 511     + The typical number of asian/pacific islander and "other" category drivers per month has remained nearly flat.
 512     + The typical number of stops resulting in searches among black, white, and hispanic drivers per month have tended to be quite close to each other, and sometimes overlap,
 513     + The typical number of stops resulting in searches among white drivers per month has increased slightly.
 514     + The typical number of stops resulting in searches among black and hispanic drivers per month has decreased slightly.
 515
 516 * The typical *proportion of stops resulting in searches within racial/ethnic groups* have shifted somewhat over time.
 517     + The proportion has increased slightly among white and asian/pacific islander drivers.
 518     + The proportion has decreased slightly among black and hispanice drivers.
 519
 520 * The typical *proportion of total stops accounted for by each racial/ethnic group* has shifted somewhat over time.
 521     + The proportion has decreased substantially among white drivers.
 522     + The proportion has increased among black drivers (substantially) and among hispanic drivers (slightly).
 523
 524 * The typical *proportion of total searches accounted for by each racial/ethnic group* has also shifted over time.
 525     + As of 2012, the proportion accounted for by black and white drivers was nearly identical (close to $40\%$ each)!
 526     + The proprtion has increased substantially among white drivers (up to .
 527     + The proportion has decreased substantially among black and hispanic drivers.
 528
 529 ## SQ2. Contextualize SQ1 interpretation in relation to PC6
 530
 531 Several noteworthy comparisons come looking across the different proportions for traffic stops and searches and comparing those against the baseline population proportions accounted for by each racial/ethnic group category. Here are several specific observations:
 532
 533 * The white proportion of the state population is consistently larger than either the proportion of white drivers involved in all traffic stops or the proportion of white drivers involved in all stops resulting in searches.
 534 * The hispanic proportion of the state population is about the same as the proportion of hispanic drivers involved in all traffic stops and usually a little smaller than the proportion of hispanic drivers involved in all stops resulting in searches (although this latter difference shrank over the period of data collection).
 535 * The black proportion of the state population is consistently and substantially smaller than the proportion of black drivers involved in all traffic stops and consistently and substantially smaller than the proportion of black drivers involved in all stops resulting in searches. This latter difference in particular stands out as it is almost more than double the baseline population proportion.
 536 * Overall, these results suggest that race/ethnicity and traffic stops and searches are not indepdendent in a statistical sense. The nature of their relationship is complex and could be due to a number of factors including biased policing practices, socioeconomic inequalities, different levels of behaviors linked to traffic stops within specific sub-groups, regional variations masked by these state-level analyses, political/cultural factors shaping police/driver interactions, and more.
 537
 538 ## SQ3. Limitations and possible extensions
 539
 540 Again, many possible things worth mentioning here, so I'll provide a few that stand out to me.
 541
 542 * The generalizability of analysis focused on one state during one 6 year period is limited.
 543 * Working with a random $1\%$ sample of the full dataset means that our results here could diverge from those we would find in an analysis of the full population of traffic stops in unpredictable ways. That said, even the very small sample is quite big and once you've read *OpenIntro* §5 you'll have some tools to estimate standard errors and confidence intervals around the various results from this analysis.
 544 * The data seem very prone to measurement errors of various kinds. In particular, I suspect the race/ethnicity classifications provided by officers are subject to some biases that are hard to identify and might also shift over time/region. The prevalence of missing values during the first two years of the dataset illustrate one aspect of this and may impact estimates of raw counts and proportions.
 545 * While the comparisons across racial/ethnic groups and between the traffic stops/searches and baseline population proportions illustrates a number of suggestive patterns, conclusive interpretation or attribution of those patterns to any specific cause or causes is quite difficult in the absence of additional information or assumptions. For one example, see my comments regarding statistical independence and the possible explanations in SQ2 above.
 546 * Extensions of this analysis might seek to investigate how some of the patterns identified in the aggregate sate-level data vary across sub-regions (e.g., counties or police districts) or even in comparison to other states.
 547 * Another extension might investigate how specific policy and/or socioeconomic changes may or may not relate to shifting patterns in the data.
 548 * Yet another extension could try to identify some factor that introduces some sudden, uncontrolled (by the drivers or the police) change into the process of traffic stops that could make it possible to identify and estimate the effects of specific explanatory factors. The original SOPP paper is an extraordinary example of this that we can discuss!