X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/68275101f1d127a83e2c6fd076c9929c6d3f4dd4..c0a584cb21431f64a48de4cf3043e8e07b63ec7d:/psets/pset3-worked_solution.rmd

diff --git a/psets/pset3-worked_solution.rmd b/psets/pset3-worked_solution.rmd
index b7f3ec3..e4b6d32 100644
--- a/psets/pset3-worked_solution.rmd
+++ b/psets/pset3-worked_solution.rmd
@@ -501,6 +501,7 @@ Several noteworthy comparisons come looking across the different proportions for
 Again, many possible things worth mentioning here, so I'll provide a few that stand out to me.
 
 * The generalizability of analysis focused on one state during one 6-year period is limited.
+* Working with a random $1\%$ sample of the full dataset means that our results here could diverge from those we would find in an analysis of the full population of traffic stops in unpredictable ways. That said, even the very small sample is quite big and once you've read *OpenIntro* §5 you'll have some tools to estimate standard errors and confidence intervals around the various results from this analysis.
 * The data seem very prone to measurement errors of various kinds. In particular, I suspect the race/ethnicity classifications provided by officers are subject to some biases that are hard to identify and might also shift over time/region. The prevalence of missing values during the first two years of the dataset illustrates one aspect of this and may impact estimates of raw counts and proportions.
 * While the comparisons across racial/ethnic groups and between the traffic stops/searches and baseline population proportions illustrate a number of suggestive patterns, conclusive interpretation or attribution of those patterns to any specific cause or causes is quite difficult in the absence of additional information or assumptions. For one example, see my comments regarding statistical independence and the possible explanations in SQ2 above.
 * Extensions of this analysis might seek to investigate how some of the patterns identified in the aggregate state-level data vary across sub-regions (e.g., counties or police districts) or even in comparison to other states.
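
To make the point about standard errors and confidence intervals concrete, here is a rough sketch of the usual large-sample calculation for a single proportion. It is not part of the original solution. It assumes a simple random sample, and the symbols $\hat{p}$ (sample proportion) and $n$ (number of sampled stops) are my own notation rather than names from the pset.

$$
SE_{\hat{p}} = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}, \qquad \text{approximate 95\% CI: } \hat{p} \pm 1.96 \times SE_{\hat{p}}
$$

With purely hypothetical numbers (not taken from the dataset), $n = 10{,}000$ and $\hat{p} = 0.05$ give $SE_{\hat{p}} = \sqrt{0.05 \times 0.95 / 10{,}000} \approx 0.0022$, so the interval is roughly $0.05 \pm 0.0043$, or about $(0.046, 0.054)$. The interval narrows as $n$ grows, which is why even a $1\%$ sample can yield fairly precise estimates when the full dataset is large.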