2 <html lang="" xml:lang="">
4 <title>How good of a model do you need? Accounting for classification errors in machine assisted content analysis.</title>
5 <meta charset="utf-8" />
6 <meta name="author" content="Nathan TeBlunthuis" />
7 <script src="libs/header-attrs-2.14/header-attrs.js"></script>
8 <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
9 <link rel="stylesheet" href="my-theme.css" type="text/css" />
10 <link rel="stylesheet" href="fontawesome.min.css" type="text/css" />
13 <textarea id="source">
16 class: center, middle, narrow
18 <script type='javascript'>
20 loader: {load: ['[tex]/xcolor']},
21 tex: {packages: {'[+]': ['xcolor']}}
25 <div class="my-header"></div>
28 ### .title-heading[Unlocking the power of big data: The importance of measurement error in machine assisted content analysis]
31 <img src="images/nu_logo.png" height="170px" style="padding:21px"/> <img src="images/uw_logo.png" height="170px" style="padding:21px"/> <img src="images/cdsc_logo.png" height="170px" style="padding:21px"/>
34 nathan.teblunthuis@northwestern.edu
36 [https://teblunthuis.cc](https://teblunthuis.cc)
40 This talk will be me presenting my "lab notebook" rather than a polished research talk. Maybe it would make a good week of a graduate seminar? In sum, machine assisted content analysis has unique limitations and threats to validity that I wanted to understand better. I've learned how the noise introduced by predictive models can result in misleading statistical inferences, but also that a sample of human-labeled validation data can often be used to account for this noise and obtain accurate inferences in the end. Statistical knowledge of this problem and computational tools for addressing it are still in development. My goals for this presentation are to start sharing this information with the community and, hopefully, to stimulate us to extend existing approaches or use them in our work.
42 This is going to be a boring talk about some *very* technical material. If you're not that interested, please return to your hackathon. Please interrupt me if I'm going too fast or if you don't understand something. I will try to move quickly in the interest of those wishing to wrap up their hackathon projects. I will also ask for a show of hands once or twice to see whether you are already familiar with concepts that it might be expedient to skip.
46 class:center, middle, inverse
47 ## Machine assisted content analysis (MACA)
51 I'm going to start by defining a study design that is increasingly common, especially in Communication and Political Science, but also across the social sciences and beyond. I call it *machine assisted content analysis* (MACA).
54 <div class="my-header"></div>
56 ### .border[Machine assisted content analysis (MACA) uses machine learning for scientific measurement.]
58 .emph[Content analysis:] Statistical analysis of variables measured by human labeling ("coding") of content. This might be simple categorical labels, or maybe more advanced annotations.
62 *Downside:* Human labeling is *a lot* of work.
66 .emph[Machine assisted content analysis:] Use a *predictive algorithm* (often trained on human-made labels) to measure variables for use in a downstream *primary analysis.*
70 *Downside:* Algorithms can be *biased* and *inaccurate* in ways that could invalidate the statistical analysis.
75 A machine assisted content analysis can be part of a more complex or more powerful study design (e.g., an experiment, time series analysis &c).
80 <!-- <div class="my-header"></div> -->
82 <!-- ### .border[Hypothetical Example: Predicting Racial Harassement in Social Media Comments] -->
87 <div class="my-header"></div>
89 ### .border[How can MACA go wrong?]
91 Algorithms can be *biased* and *error prone* (*noisy*).
95 Predictor bias is a potentially difficult problem that requires causal inference methods. I'll focus on *noise* for now.
99 Noise in the predictive model introduces bias in the primary analysis.
103 .indent[We can reduce and sometimes even *eliminate* this bias introduced by noise.]
107 <div class="my-header"></div>
109 ### .border[Example 1: An unbiased, but noisy classifier]
111 .large[.left-column[![](images/example_1_dag.png)]]
115 Please show hands if you are familiar with causal graphs or Bayesian networks. Should I explain what this diagram means?
121 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
128 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
130 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
137 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
139 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
141 The predictions `\(w\)` are a *proxy variable* `\(g(k) = \hat{x} = w\)`.
149 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
151 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
153 The predictions `\(w\)` are a *proxy variable* `\(g(k) = \hat{x} = w\)`.
155 `\(x = w + \xi\)` because the predictive model makes errors.
163 <div class="my-header"></div>
165 ### .border[Noise in a *covariate* creates *attenuation bias*.]
167 .large[.left-column[![](images/example_1_dag.png)]]
172 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
174 `\(x = w + \xi\)` because the predictive model makes errors.
181 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
183 `\(x = w + \xi\)` because the predictive model makes errors.
186 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
194 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
196 `\(x = w + \xi\)` because the predictive model makes errors.
198 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
200 `$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j - \xi_j - \overline{(x - \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \xi_j - \bar{x}){^2}}}$$`
209 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
211 `\(x = w + \xi\)` because the predictive model makes errors.
213 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
215 `$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j - \xi_j - \overline{(x - \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \color{red}{\xi_j} - \bar{x})\color{red}{^2}}}$$`
218 In this scenario, it's clear that `\(\widehat{B_w}^{ols}\)` is attenuated toward zero: `\(|\widehat{B_w}^{ols}| < |B_x|\)`.
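To see the attenuation in action, here is a minimal simulation sketch (not one of the talk's simulations; all parameter values are made up for illustration):

```r
## Minimal sketch of attenuation bias from a noisy proxy (illustrative values only).
set.seed(1)
n <- 10000
x <- rnorm(n)                    # true (human-codable) variable
w <- x + rnorm(n, sd = 1)        # noisy prediction of x, so xi = x - w
y <- 0.2 * x + rnorm(n, sd = 1)  # true effect B_x = 0.2

coef(lm(y ~ x))["x"]  # ~0.20: feasible estimate using the true x
coef(lm(y ~ w))["w"]  # ~0.10: attenuated toward zero by the noise in w
```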
226 Please raise your hands if you're familiar with attenuation bias. I expect that it's covered in some graduate stats classes, but not universally.
231 <div class="my-header"></div>
233 ### .border[Beyond attenuation bias]
234 .larger[Measurement error can threaten validity because:]
236 - Attenuation bias *spreads* (e.g., to marginal effects as illustrated later).
240 - Measurement error can be *differential*: not distributed evenly, and possibly correlated with `\(x\)`, `\(y\)`, or `\(\varepsilon\)`.
244 - *Bias can be away from 0* in GLMs and nonlinear models or if measurement error is differential.
248 - *Confounding* if the *predictive model is biased*, introducing a correlation between the measurement error and the residuals `\((E[\xi\varepsilon]\ne0)\)`.
255 <div class="my-header"></div>
257 ### .border[Correcting measurement error]
259 There's a vast literature in statistics on measurement error. Mostly about noise you'd find in sensors. Lots of ideas. No magic bullets.
263 I'm going to briefly cover 3 different approaches: *multiple imputation*, *regression calibration* and *2SLS+GMM*.
267 These all depend on *validation data*. I'm going to ignore where this comes from, but assume it's a random sample of the hypothesis testing dataset.
271 You can *and should* use it to improve your statistical estimates.
275 <div class="my-header"></div>
277 ### .border[Multiple imputation (MI) treats measurement error as a missing data problem]
279 1. Use validation data to estimate `\(f(x|w,y)\)`, a probabilistic model of `\(x\)`.
283 2. *Sample* `\(m\)` datasets from `\(\widehat{f(x|w,y)}\)`.
287 3. Run your analysis on each of the `\(m\)` datasets.
291 4. Average the results from the `\(m\)` analyses using Rubin's rules.
295 .e[Advantages:] *Very flexible!* Can sometimes work even if the predictor `\(g(k)\)` is biased. Good R packages (**`{Amelia}`**, `{mi}`, `{mice}`, `{brms}`).
299 .e[Disadvantages:] Results depend on the quality of `\(\widehat{f(x|w,y)}\)`; may require more validation data; computationally expensive and statistically inefficient; doesn't seem to benefit much from larger datasets.
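A minimal sketch of this workflow with `{mice}` (the names `x_star`, `w`, and `y` are hypothetical: `x_star` holds the human labels and is `NA` outside the validation sample, and the defaults here are just one reasonable choice):

```r
library(mice)

## Sketch: impute the unlabeled x from w and y, then pool the analyses.
dat  <- data.frame(x = x_star, w = w, y = y)
imp  <- mice(dat, m = 20, printFlag = FALSE)  # estimates f(x | w, y) from the validation rows
fits <- with(imp, lm(y ~ x))                  # run the primary analysis on each imputed dataset
summary(pool(fits))                           # combine the m estimates using Rubin's rules
```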
303 ### .border[Regression calibration directly adjusts for attenuation bias.]
305 1. Use validation data to estimate the errors `\(\hat{\xi}\)`.
309 2. Use `\(\hat{\xi}\)` to correct the OLS estimate.
313 3. Correct the standard errors using MLE or bootstrapping.
317 .e[Advantages:] Simple, fast.
321 .e[Disadvantages:] Limited to OLS models. Requires an unbiased predictor `\(g(k)\)`. R support (`{mecor}` R package) is pretty new.
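A hand-rolled sketch of the same idea (hypothetical data frame `dat` with the model score `w` for every row and human labels `x_star` only on the validation rows; `{mecor}` wraps this kind of correction with proper standard errors):

```r
## Sketch of regression calibration "by hand"; bootstrap the whole pipeline for SEs.
val       <- subset(dat, !is.na(x_star))     # validation rows with human labels
calib     <- lm(x_star ~ w, data = val)      # calibration model for E[x | w]
dat$x_hat <- predict(calib, newdata = dat)   # calibrated covariate for every row
lm(y ~ x_hat, data = dat)                    # primary analysis with the corrected slope
```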
325 ### .border[2SLS+GMM is designed for this specific problem]
327 .left-column[![](images/Fong_Taylor.png)]
329 *Regression calibration with a trick.*
334 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
341 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
343 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
350 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
352 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
354 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
361 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
363 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
365 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
367 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
374 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
376 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
378 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
380 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
382 Advantages: Accurate. Sometimes robust even if the predictor `\(g(k)\)` is biased. In theory, flexible to any model that can be fit using GMM.
390 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
392 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
394 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
396 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
398 Advantages: Accurate. Sometimes robust even if the predictor `\(g(k)\)` is biased. In theory, flexible to any model that can be fit using GMM.
400 Disadvantages: The implementation (`{predictionError}`) is new. The API is cumbersome and only supports linear models. Not robust if `\(E(w\varepsilon) \ne 0\)`. GMM may be unfamiliar to audiences.
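The least-squares pieces are easy to sketch with the same hypothetical `dat` as above; the final step, combining `\(B^{2sls}\)` and `\(B^{val}\)` with the right covariance via GMM, is what `{predictionError}` implements and is only indicated here:

```r
## Sketch of the least-squares stages of the 2SLS+GMM approach (GMM step omitted).
val <- subset(dat, !is.na(x_star))

stage1    <- lm(x_star ~ w, data = val)      # 1. first-stage LS on the validation rows
dat$x_hat <- predict(stage1, newdata = dat)

stage2 <- lm(y ~ x_hat, data = dat)          # 2. second-stage LS / regression calibration
stage3 <- lm(y ~ x_star, data = val)         # 3. validation-only model

coef(stage2)["x_hat"]; coef(stage3)["x_star"]  # 4. GMM combines these two estimates
```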
406 ### .border[Testing attenuation bias correction]
408 <div class="my-header"></div>
410 I've run simulations to test these approaches in several scenarios.
412 The model is not very good: about 70% accurate.
414 Most plausible scenario:
416 `\(y\)` is continuous and normal-ish.
420 `\(x\)` is binary (human labels) `\(P(x)=0.5\)`.
424 `\(w\)` is the *continuous score* (e.g., a predicted probability) output by the predictive model `\(g(k)\)` (not binary predictions).
428 If `\(w\)` is binary, most methods struggle, but regression calibration and 2SLS+GMM can do okay.
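For concreteness, here is a sketch of a data-generating process consistent with this description (illustrative values, not the exact parameters of my simulations):

```r
## Sketch of the example-1 setup (illustrative values only).
set.seed(2)
n <- 5000
x <- rbinom(n, 1, 0.5)   # binary human-coded variable, P(x) = 0.5
k <- x + rnorm(n)        # features carrying a noisy signal about x
w <- plogis(2 * k - 1)   # continuous score; roughly 70% accurate when thresholded at 0.5
y <- 0.2 * x + rnorm(n)  # continuous, normal-ish outcome

x_star <- x
x_star[sample(n, n - 300)] <- NA  # keep human labels only for a random validation sample
```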
433 ### .border[Example 1: Estimates of the effect of x]
436 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-2-1.svg)<!-- -->
440 All methods work in this scenario
442 Multiple imputation is inefficient.
448 ### .border[What about bias?]
451 .large[![](images/example_2_dag.png)]
455 A few notes on this scenario.
457 `\(B_x = 0.2\)`, `\(B_g=-0.2\)` and `\(sd(\varepsilon)=3\)`, so the noise is large relative to the effects (a low signal-to-noise ratio).
459 `\(r\)` can be conceived of as a missing feature in the predictive model `\(g(k)\)` that is also correlated with `\(y\)`.
461 For example, `\(r\)` might be the *race* of a commenter, `\(x\)` could be *racial harassment*, `\(y\)` whether the commenter gets banned, and `\(k\)` only has textual features, but human coders can see user profiles and so know `\(r\)`.
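One way to simulate the structure just described (hypothetical code; I'm treating `\(B_g\)` as the coefficient on `\(r\)`):

```r
## Sketch of the example-2 setup: g(k) cannot see r, so its errors are correlated
## with r, which also affects y (illustrative values only).
set.seed(3)
n <- 5000
r <- rbinom(n, 1, 0.5)                     # e.g., commenter's race; visible to coders, not to g(k)
x <- rbinom(n, 1, plogis(-0.5 + r))        # true label depends partly on r
k <- x + rnorm(n)                          # textual features only
w <- plogis(2 * k - 1)                     # classifier score; it misses the signal carried by r
y <- 0.2 * x - 0.2 * r + rnorm(n, sd = 3)  # B_x = 0.2, coefficient on r = -0.2, sd(eps) = 3
```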
467 ### .border[Example 2: Estimates of the effect of x ]
470 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-3-1.svg)<!-- -->
475 ### .border[Example 2: Estimates of the effect of r]
478 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-4-1.svg)<!-- -->
485 ### .border[Takeaways from example 2]
487 Bias in the predictive model creates bias in hypothesis tests.
491 Bias can be corrected *in this case*.
495 The next scenario has bias that's more tricky.
499 Multiple imputation helps, but doesn't fully correct the bias.
505 ### .border[When will GMM+2SLS fail?]
507 .large[.left-column[![](images/example_3_dag.png)]]
509 .right-column[The catch with GMM:
511 .emph[Exclusion restriction:] `\(E[w \varepsilon] = 0\)`.
513 The restriction is violated if a variable `\(U\)` causes both `\(K\)` and `\(Y\)`, and `\(X\)` causes `\(K\)` (not vice versa).
519 GMM fits the model to a system of moment conditions, of which the exclusion restriction is one. So if that assumption isn't true, the estimates will be biased.
521 This is a different assumption from those underlying OLS or GLM models.
527 ### .border[Example 3: Estimates of the effect of x]
530 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-5-1.svg)<!-- -->
537 ### .border[Takeaways]
539 - Attenuation bias can be a big problem with noisy predictors, leading to estimates that are biased toward zero.
541 - For more general hypothesis tests or if the predictor is biased, measurement error can lead to false discovery.
543 - It's fixable with validation data; you may not need that much, and you should already be getting it.
545 - This means it can be okay to use poor predictors for hypothesis testing.
547 - The ecosystem is underdeveloped, but a lot of methods have been researched.
549 - Take advantage of machine learning + big data and get precise estimates when the signal-to-noise ratio is high!
554 ### .border[Future work: Noise in the *outcome*]
556 I've been focusing on noise in *covariates.* What if the predictive algorithm is used to measure the *outcome* `\(y\)`?
560 This isn't a problem in the simplest case (linear regression with homoskedastic errors). Noise in `\(y\)` is projected into the error term.
564 Noise in the outcome is still a problem if errors are heteroskedastic and for GLMs / non-linear regression (e.g., logistic regression).
568 Multiple imputation (in theory) could help here. The other methods aren't designed for this case.
572 Solving this problem could be an important methodological contribution with a very broad impact.
575 # .border[Questions?]
577 Links to slides: [html](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.html) [pdf](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.pdf)
579 Link to a messy git repository:
581 <i class="fa fa-envelope" aria-hidden='true'></i> nathan.teblunthuis@northwestern.edu
583 <i class="fa fa-twitter" aria-hidden='true'></i> @groceryheist
585 <i class="fa fa-globe" aria-hidden='true'></i> [https://communitydata.science](https://communitydata.science)
589 <!-- ### .border[Multiple imputation struggles with discrete variables] -->
591 <!-- In my experiments I've found that the 2SLS+GMM method works well with a broader range of data types. -->
593 <!-- To illustrate, Example 3 is the same as Example 2, but with `\(x\)` and `\(w\)` as discrete variables. -->
595 <!-- Practicallly speaking, a continuous "score" `\(w\)` is often available, and my opinion is that usually this is better + more informative than model predictions in all cases. Continuous validation data may be more difficult to obtain, but it is often possible using techniques like pairwise comparison. -->
596 <!-- layout:false -->
597 <!-- ### .border[Example 3: Estimates of the effect of x ] -->
599 <!-- .center[ -->
600 <!-- ```{r echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='svg', fig.width=8, fig.asp=.625,cache=F} -->
602 <!-- #plot.df <- -->
603 <!-- plot.df <- plot.df.example.2[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=T), -->
604 <!-- N=factor(N), -->
605 <!-- m=factor(m))] -->
607 <!-- plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=5000) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")] -->
608 <!-- p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method)) -->
609 <!-- p <- p + geom_hline(aes(yintercept=0.2),linetype=2) -->
611 <!-- p <- p + geom_pointrange() + facet_grid(m~N,as.table=F) + scale_x_discrete(labels=label_wrap_gen(4)) -->
613 <!-- print(p) -->
615 <!-- # get gtable object -->
617 <!-- .large[.left [![](images/example_2_dag.png)]] -->
619 <!-- There are at two general ways using a predictive model can introduce bias: *attenuation*, and *confounding.* -->
621 <!-- Counfounding can be broken down into 4 types: -->
623 <!-- .right[Confounding on `\(X\)` by observed variables -->
625 <!-- Confounding on `\(Y\)` by observed variables -->
628 <!-- .left[Confounding on `\(X\)` by *un*observed variables -->
630 <!-- Confounding on `\(Y\)` by *un*observed variables -->
633 <!-- Attenuation and the top-right column can be dealt with relative ease using a few different methods. -->
635 <!-- The bottom-left column can be addressed, but so far I haven't found a magic bullet. -->
637 <!-- The left column is pretty much a hopeless situation. -->
639 <style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
640 <script src="libs/remark-latest.min.js"></script>
641 <script>var slideshow = remark.create({
642 "highlightStyle": "github",
644 "countIncrementalSlides": true,
645 "slideNumberFormat": "<div class=\"progress-bar-container\">\n <div class=\"progress-bar\" style=\"width: calc(%current% / %total% * 100%);\">\n </div>\n</div>\n"
647 if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
648 window.dispatchEvent(new Event('resize'));
651 var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
653 s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
654 d.head.appendChild(s);
658 var el = d.getElementsByClassName("remark-slides-area");
660 var slide, slides = slideshow.getSlides(), els = el[0].children;
661 for (var i = 1; i < slides.length; i++) {
663 if (slide.properties.continued === "true" || slide.properties.count === "false") {
664 els[i - 1].className += ' has-continuation';
667 var s = d.createElement("style");
668 s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
669 d.head.appendChild(s);
671 // delete the temporary CSS (for displaying all slides initially) when the user
672 // starts to view slides
675 slideshow.on('beforeShowSlide', function(slide) {
677 var sheets = document.styleSheets, node;
678 for (var i = 0; i < sheets.length; i++) {
679 node = sheets[i].ownerNode;
680 if (node.dataset["target"] !== "print-only") continue;
681 node.parentNode.removeChild(node);
686 // add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
687 // screen reader (see PR #262)
690 d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
691 const t = tr.querySelector('td:nth-child(2)').innerText;
692 tr.querySelectorAll('td:first-child .key').forEach(key => {
693 const k = key.innerText;
694 if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
697 d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
701 // Replace <script> tags in slides area to make them executable
702 var scripts = document.querySelectorAll(
703 '.remark-slides-area .remark-slide-container script'
705 if (!scripts.length) return;
706 for (var i = 0; i < scripts.length; i++) {
707 var s = document.createElement('script');
708 var code = document.createTextNode(scripts[i].textContent);
710 var scriptAttrs = scripts[i].attributes;
711 for (var j = 0; j < scriptAttrs.length; j++) {
712 s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
714 scripts[i].parentElement.replaceChild(s, scripts[i]);
718 var links = document.getElementsByTagName('a');
719 for (var i = 0; i < links.length; i++) {
720 if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
721 links[i].target = '_blank';
727 slideshow._releaseMath = function(el) {
728 var i, text, code, codes = el.getElementsByTagName('code');
729 for (i = 0; i < codes.length;) {
731 if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
732 text = code.textContent;
733 if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
734 /^\$\$(.|\s)+\$\$$/.test(text) ||
735 /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
736 code.outerHTML = code.innerHTML; // remove <code></code>
743 slideshow._releaseMath(document);
745 <!-- dynamically load mathjax for compatibility with self-contained -->
748 var script = document.createElement('script');
749 script.type = 'text/javascript';
750 script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
751 if (location.protocol !== 'file:' && /^https?:/.test(script.src))
752 script.src = script.src.replace(/^https?:/, '');
753 document.getElementsByTagName('head')[0].appendChild(script);