2 <html lang="" xml:lang="">
4 <title>How good of a model do you need? Accounting for classification errors in machine assisted content analysis.</title>
5 <meta charset="utf-8" />
6 <meta name="author" content="Nathan TeBlunthuis" />
7 <script src="libs/header-attrs-2.14/header-attrs.js"></script>
8 <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
9 <link rel="stylesheet" href="my-theme.css" type="text/css" />
10 <link rel="stylesheet" href="fontawesome.min.css" type="text/css" />
13 <textarea id="source">
16 class: center, middle, narrow
18 <script type='javascript'>
20 loader: {load: ['[tex]/xcolor']},
21 tex: {packages: {'[+]': ['xcolor']}}
25 <div class="my-header"></div>
28 ### .title-heading[Unlocking the power of big data: The importance of measurement error in machine assisted content analysis]
31 <img src="images/nu_logo.png" height="170px" style="padding:21px"/> <img src="images/uw_logo.png" height="170px" style="padding:21px"/> <img src="images/cdsc_logo.png" height="170px" style="padding:21px"/>
34 nathan.teblunthuis@northwestern.edu
36 [https://teblunthuis.cc](https://teblunthuis.cc)
40 This talk will be me presenting my "lab notebook" rather than a polished research talk. Maybe it would make a good week of a graduate seminar? In sum, machine assisted content analysis has unique limitations and threats to validity that I wanted to understand better. I've learned how the noise introduced by predictive models can result in misleading statistical inferences, but also that a sample of human-labeled validation data can often be used to account for this noise and obtain accurate inferences in the end. Statistical knowledge of this problem and computational tools for addressing it are still in development. My goals for this presentation are to start sharing this information with the community and, hopefully, to stimulate us to extend existing approaches or use them in our work.
42 This is going to be a boring talk about some *very* technical material. If you're not that interested, please return to your hackathon. Please interrupt me if I'm going too fast or if you don't understand something. I will try to move quickly in the interest of those wishing to wrap up their hackathon projects. I will also ask for a show of hands once or twice to see whether you are already familiar with concepts that it might be expedient to skip.
46 class:center, middle, inverse
47 ## Machine assisted content analysis (MACA)
51 I'm going to start by defining a study design that is increasingly common, especially in Communication and Political Science, but also across the social sciences and beyond. I call it *machine assisted content analysis* (MACA).
54 <div class="my-header"></div>
56 ### .border[Machine assisted content analysis (MACA) uses machine learning for scientific measurement.]
58 .emph[Content analysis:] Statistical analysis of variables measured by human labeling ("coding") of content. This might be simple categorical labels, or maybe more advanced annotations.
62 *Downside:* Human labeling is *a lot* of work.
66 .emph[Machine assisted content analysis:] Use a *predictive algorithm* (often trained on human-made labels) to measure variables for use in a downstream *primary analysis.*
70 *Downside:* Algorithms can be *biased* and *inaccurate* in ways that could invalidate the statistical analysis.
75 A machine assisted content analysis can be part of a more complex or more powerful study design (e.g., an experiment, time series analysis &c).
80 <!-- <div class="my-header"></div> -->
82 <!-- ### .border[Hypothetical Example: Predicting Racial Harassement in Social Media Comments] -->
87 <div class="my-header"></div>
89 ### .border[How can MACA go wrong?]
91 Algorithms can be *biased* and *error prone* (*noisy*).
95 Predictor bias is a potentially difficult problem that requires causal inference methods. I'll focus on *noise* for now.
99 Noise in the predictive model introduces bias in the primary analysis.
103 .indent[We can reduce and sometimes even *eliminate* this bias introduced by noise.]
107 <div class="my-header"></div>
109 ### .border[Example 1: An unbiased, but noisy classifier]
111 .large[.left-column[![](images/example_1_dag.png)]]
115 Please show hands if you are familiar with causal graphs or Bayesian networks. Should I explain what this diagram means?
121 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
128 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
130 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
137 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
139 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
141 The predictions `\(w\)` are a *proxy variable* `\(g(k) = \hat{x} = w\)`.
149 `\(x\)` is *partly observed* because we have *validation data* `\(x^*\)`.
151 `\(k\)` are the *features* used by the *predictive model* `\(g(k)\)`.
153 The predictions `\(w\)` are a *proxy variable* `\(g(k) = \hat{x} = w\)`.
155 `\(x = w + \xi\)` because the predictive model makes errors.
163 <div class="my-header"></div>
165 ### .border[Noise in a *covariate* creates *attenuation bias*.]
167 .large[.left-column[![](images/example_1_dag.png)]]
172 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
174 `\(x = w + \xi\)` because the predictive model makes errors.
181 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
183 `\(x = w + \xi\)` because the predictive model makes errors.
186 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
194 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
196 `\(x = w + \xi\)` because the predictive model makes errors.
198 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
200 `$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j - \xi_j - \overline{(x - \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \xi_j - \bar{x}){^2}}}$$`
209 We want to estimate, `\(y = Bx + \varepsilon\)`, but we estimate `\(y = Bw + \varepsilon\)` instead.
211 `\(x = w + \xi\)` because the predictive model makes errors.
213 Assume `\(g(k)\)` is *unbiased* so `\(E(\xi)=0\)`. Also assume error is *nondifferential* so `\(E(\xi y)=0\)`:
215 `$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j - \xi_j - \overline{(x - \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \color{red}{\xi_j} - \bar{x})\color{red}{^2}}}$$`
218 In this scenario, it's clear that `\(\widehat{B_w}^{ols}\)` is attenuated toward zero: `\(|\widehat{B_w}^{ols}| < |B_x|\)`.
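To see the attenuation in action, here is a minimal simulation sketch (not one of the talk's simulations; all parameter values are made up for illustration):

```r
## Minimal sketch of attenuation bias from a noisy proxy (illustrative values only).
set.seed(1)
n <- 10000
x <- rnorm(n)                    # true (human-codable) variable
w <- x + rnorm(n, sd = 1)        # noisy prediction of x, so xi = x - w
y <- 0.2 * x + rnorm(n, sd = 1)  # true effect B_x = 0.2

coef(lm(y ~ x))["x"]  # ~0.20: feasible estimate using the true x
coef(lm(y ~ w))["w"]  # ~0.10: attenuated toward zero by the noise in w
```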
226 Please raise your hands if you're familiar with attenuation bias. I expect that it's covered in some graduate stats classes, but not universally.
231 <div class="my-header"></div>
233 ### .border[Beyond attenuation bias]
234 .larger[Measurement error can threaten validity because:]
236 - Attenuation bias *spreads* (e.g., to marginal effects as illustrated later).
240 - Measurement error can be *differential*: not distributed evenly, and possibly correlated with `\(x\)`, `\(y\)`, or `\(\varepsilon\)`.
244 - *Bias can be away from 0* in GLMs and nonlinear models or if measurement error is differential.
248 - *Confounding* if the *predictive model is biased*, introducing a correlation between the measurement error and the residuals `\((E[\xi\varepsilon]\ne0)\)`.
255 <div class="my-header"></div>
257 ### .border[Correcting measurement error]
259 There's a vast literature in statistics on measurement error. Mostly about noise you'd find in sensors. Lots of ideas. No magic bullets.
263 I'm going to briefly cover 3 different approaches: *multiple imputation*, *regression calibration* and *2SLS+GMM*.
267 These all depend on *validation data*. I'm going to ignore where this comes from, but assume it's a random sample of the hypothesis testing dataset.
271 You can *and should* use it to improve your statistical estimates.
275 <div class="my-header"></div>
277 ### .border[Multiple imputation (MI) treats measurement error as a missing data problem]
279 1. Use validation data to estimate `\(f(x|w,y)\)`, a probabilistic model of `\(x\)`.
283 2. *Sample* `\(m\)` datasets from `\(\widehat{f(x|w,y)}\)`.
287 3. Run your analysis on each of the `\(m\)` datasets.
291 4. Average the results from the `\(m\)` analyses using Rubin's rules.
295 .e[Advantages:] *Very flexible!* Can sometimes work even if the predictor `\(g(k)\)` is biased. Good R packages (**`{Amelia}`**, `{mi}`, `{mice}`, `{brms}`).
299 .e[Disadvantages:] Results depend on the quality of `\(\widehat{f(x|w,y)}\)`; may require more validation data; computationally expensive and statistically inefficient; doesn't seem to benefit much from larger datasets.
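A minimal sketch of this workflow with `{mice}` (the names `x_star`, `w`, and `y` are hypothetical: `x_star` holds the human labels and is `NA` outside the validation sample, and the defaults here are just one reasonable choice):

```r
library(mice)

## Sketch: impute the unlabeled x from w and y, then pool the analyses.
dat  <- data.frame(x = x_star, w = w, y = y)
imp  <- mice(dat, m = 20, printFlag = FALSE)  # estimates f(x | w, y) from the validation rows
fits <- with(imp, lm(y ~ x))                  # run the primary analysis on each imputed dataset
summary(pool(fits))                           # combine the m estimates using Rubin's rules
```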
303 ### .border[Regression calibration directly adjusts for attenuation bias.]
305 1. Use validation data to estimate the errors `\(\hat{\xi}\)`.
309 2. Use `\(\hat{\xi}\)` to correct the OLS estimate.
313 3. Correct the standard errors using MLE or bootstrapping.
317 .e[Advantages:] Simple, fast.
321 .e[Disadvantages:] Limited to OLS models. Requires an unbiased predictor `\(g(k)\)`. R support (`{mecor}` R package) is pretty new.
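A hand-rolled sketch of the same idea (hypothetical data frame `dat` with the model score `w` for every row and human labels `x_star` only on the validation rows; `{mecor}` wraps this kind of correction with proper standard errors):

```r
## Sketch of regression calibration "by hand"; bootstrap the whole pipeline for SEs.
val       <- subset(dat, !is.na(x_star))     # validation rows with human labels
calib     <- lm(x_star ~ w, data = val)      # calibration model for E[x | w]
dat$x_hat <- predict(calib, newdata = dat)   # calibrated covariate for every row
lm(y ~ x_hat, data = dat)                    # primary analysis with the corrected slope
```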
325 ### .border[2SLS+GMM is designed for this specific problem]
327 .left-column[![](images/Fong_Taylor.png)]
329 *Regression calibration with a trick.*
334 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
341 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
343 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
350 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
352 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
354 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
361 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
363 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
365 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
367 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
374 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
376 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
378 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
380 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
382 Advantages: Accurate. Sometimes robust even if the predictor `\(g(k)\)` is biased. In theory, flexible to any model that can be fit using GMM.
390 1. Estimate `\(x = w + \xi\)` to obtain `\(\hat{x}\)`. (First-stage LS).
392 2. Estimate `\(y = B^{2sls}\hat{x} + \varepsilon^{2sls}\)`. (Second-stage LS / regression calibration).
394 3. Estimate `\(y = B^{val}x^* + \varepsilon^{val}\)`. (Validation dataset model).
396 4. Combine `\(B^{val}\)` and `\(B^{2sls}\)` using the generalized method of moments (GMM).
398 Advantages: Accurate. Sometimes robust even if the predictor `\(g(k)\)` is biased. In theory, flexible to any model that can be fit using GMM.
400 Disadvantages: The implementation (`{predictionError}`) is new. The API is cumbersome and only supports linear models. Not robust if `\(E(w\varepsilon) \ne 0\)`. GMM may be unfamiliar to audiences.
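The least-squares pieces are easy to sketch with the same hypothetical `dat` as above; the final step, combining `\(B^{2sls}\)` and `\(B^{val}\)` with the right covariance via GMM, is what `{predictionError}` implements and is only indicated here:

```r
## Sketch of the least-squares stages of the 2SLS+GMM approach (GMM step omitted).
val <- subset(dat, !is.na(x_star))

stage1    <- lm(x_star ~ w, data = val)      # 1. first-stage LS on the validation rows
dat$x_hat <- predict(stage1, newdata = dat)

stage2 <- lm(y ~ x_hat, data = dat)          # 2. second-stage LS / regression calibration
stage3 <- lm(y ~ x_star, data = val)         # 3. validation-only model

coef(stage2)["x_hat"]; coef(stage3)["x_star"]  # 4. GMM combines these two estimates
```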
406 ### .border[Testing attenuation bias correction]
408 <div class="my-header"></div>
410 I've run simulations to test these approaches in several scenarios.
412 The model is not very good: about 70% accurate.
414 Most plausible scenario:
416 `\(y\)` is continuous and normal-ish.
420 `\(x\)` is binary (human labels) `\(P(x)=0.5\)`.
424 `\(w\)` is the *continuous score* (e.g., a predicted probability) output by the predictive model `\(g(k)\)` (not binary predictions).
428 If `\(w\)` is binary, most methods struggle, but regression calibration and 2SLS+GMM can do okay.
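For concreteness, here is a sketch of a data-generating process consistent with this description (illustrative values, not the exact parameters of my simulations):

```r
## Sketch of the example-1 setup (illustrative values only).
set.seed(2)
n <- 5000
x <- rbinom(n, 1, 0.5)   # binary human-coded variable, P(x) = 0.5
k <- x + rnorm(n)        # features carrying a noisy signal about x
w <- plogis(2 * k - 1)   # continuous score; roughly 70% accurate when thresholded at 0.5
y <- 0.2 * x + rnorm(n)  # continuous, normal-ish outcome

x_star <- x
x_star[sample(n, n - 300)] <- NA  # keep human labels only for a random validation sample
```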
433 ### .border[Example 1: Estimates of the effect of x]
436 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-2-1.svg)<!-- -->
440 All methods work in this scenario
442 Multiple imputation is inefficient.
448 ### .border[What about bias?]
451 .large[![](images/example_2_dag.png)]
455 A few notes on this scenario.
457 `\(B_x = 0.2\)`, `\(B_g=-0.2\)` and `\(sd(\varepsilon)=3\)`, so the noise is large relative to the effects (a low signal-to-noise ratio).
459 `\(r\)` can be conceived of as a missing feature in the predictive model `\(g(k)\)` that is also correlated with `\(y\)`.
461 For example, `\(r\)` might be the *race* of a commenter, `\(x\)` could be *racial harassment*, `\(y\)` whether the commenter gets banned, and `\(k\)` only has textual features, but human coders can see user profiles and so know `\(r\)`.
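One way to simulate the structure just described (hypothetical code; I'm treating `\(B_g\)` as the coefficient on `\(r\)`):

```r
## Sketch of the example-2 setup: g(k) cannot see r, so its errors are correlated
## with r, which also affects y (illustrative values only).
set.seed(3)
n <- 5000
r <- rbinom(n, 1, 0.5)                     # e.g., commenter's race; visible to coders, not to g(k)
x <- rbinom(n, 1, plogis(-0.5 + r))        # true label depends partly on r
k <- x + rnorm(n)                          # textual features only
w <- plogis(2 * k - 1)                     # classifier score; it misses the signal carried by r
y <- 0.2 * x - 0.2 * r + rnorm(n, sd = 3)  # B_x = 0.2, coefficient on r = -0.2, sd(eps) = 3
```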
467 ### .border[Example 2: Estimates of the effect of x ]
470 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-3-1.svg)<!-- -->
475 ### .border[Example 2: Estimates of the effect of r]
478 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-4-1.svg)<!-- -->
485 ### .border[Takeaways from example 2]
487 Bias in the predictive model creates bias in hypothesis tests.
491 Bias can be corrected *in this case*.
495 The next scenario has bias that's more tricky.
499 Multiple imputation helps, but doesn't fully correct the bias.
505 ### .border[When will GMM+2SLS fail?]
507 .large[.left-column[![](images/example_3_dag.png)]]
509 .right-column[The catch with GMM:
511 .emph[Exclusion restriction:] `\(E[w \varepsilon] = 0\)`.
513 The restriction is violated if a variable `\(U\)` causes both `\(K\)` and `\(Y\)`, and `\(X\)` causes `\(K\)` (not vice versa).
519 GMM fits the model to a system of moment conditions, of which the exclusion restriction is one. So if that assumption isn't true, the estimates will be biased.
521 This is a different assumption from those underlying OLS or GLM models.
527 ### .border[Example 3: Estimates of the effect of x]
530 ![](ica_hackathon_2022_files/figure-html/unnamed-chunk-5-1.svg)<!-- -->
537 ### .border[Takeaways]
539 - Attenuation bias can be a big problem with noisy predictors, leading to estimates that are biased toward zero.
541 - For more general hypothesis tests or if the predictor is biased, measurement error can lead to false discovery.
543 - It's fixable with validation data; you may not need that much, and you should already be getting it.
545 - This means it can be okay to use poor predictors for hypothesis testing.
547 - The ecosystem is underdeveloped, but a lot of methods have been researched.
549 - Take advantage of machine learning + big data and get precise estimates when the signal-to-noise ratio is high!
554 ### .border[Future work: Noise in the *outcome*]
556 I've been focusing on noise in *covariates.* What if the predictive algorithm is used to measure the *outcome* `\(y\)`?
560 This isn't a problem in the simplest case (linear regression with homoskedastic errors). Noise in `\(y\)` is projected into the error term.
564 Noise in the outcome is still a problem if errors are heteroskedastic and for GLMs / non-linear regression (e.g., logistic regression).
568 Multiple imputation (in theory) could help here. The other methods aren't designed for this case.
572 Solving this problem could be an important methodological contribution with a very broad impact.
575 # .border[Questions?]
577 Links to slides: [html](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.html) [pdf](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.pdf)
579 Link to a messy git repository:
581 <i class="fa fa-envelope" aria-hidden='true'></i> nathan.teblunthuis@northwestern.edu
583 <i class="fa fa-twitter" aria-hidden='true'></i> @groceryheist
585 <i class="fa fa-globe" aria-hidden='true'></i> [https://communitydata.science](https://communitydata.science)
589 <!-- ### .border[Multiple imputation struggles with discrete variables] -->
591 <!-- In my experiments I've found that the 2SLS+GMM method works well with a broader range of data types. -->
593 <!-- To illustrate, Example 3 is the same as Example 2, but with `\(x\)` and `\(w\)` as discrete variables. -->
595 <!-- Practicallly speaking, a continuous "score" `\(w\)` is often available, and my opinion is that usually this is better + more informative than model predictions in all cases. Continuous validation data may be more difficult to obtain, but it is often possible using techniques like pairwise comparison. -->
596 <!-- layout:false -->
597 <!-- ### .border[Example 3: Estimates of the effect of x ] -->
599 <!-- .center[ -->
600 <!-- ```{r echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='svg', fig.width=8, fig.asp=.625,cache=F} -->
602 <!-- #plot.df <- -->
603 <!-- plot.df <- plot.df.example.2[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=T), -->
604 <!-- N=factor(N), -->
605 <!-- m=factor(m))] -->
607 <!-- plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=5000) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")] -->
608 <!-- p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method)) -->
609 <!-- p <- p + geom_hline(aes(yintercept=0.2),linetype=2) -->
611 <!-- p <- p + geom_pointrange() + facet_grid(m~N,as.table=F) + scale_x_discrete(labels=label_wrap_gen(4)) -->
613 <!-- print(p) -->
615 <!-- # get gtable object -->
617 <!-- .large[.left [![](images/example_2_dag.png)]] -->
619 <!-- There are at two general ways using a predictive model can introduce bias: *attenuation*, and *confounding.* -->
621 <!-- Counfounding can be broken down into 4 types: -->
623 <!-- .right[Confounding on `\(X\)` by observed variables -->
625 <!-- Confounding on `\(Y\)` by observed variables -->
628 <!-- .left[Confounding on `\(X\)` by *un*observed variables -->
630 <!-- Confounding on `\(Y\)` by *un*observed variables -->
633 <!-- Attenuation and the top-right column can be dealt with relative ease using a few different methods. -->
635 <!-- The bottom-left column can be addressed, but so far I haven't found a magic bullet. -->
637 <!-- The left column is pretty much a hopeless situation. -->
639 <style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
640 <script src="libs/remark-latest.min.js"></script>
641 <script>var slideshow = remark.create({
642 "highlightStyle": "github",
644 "countIncrementalSlides": true,
645 "slideNumberFormat": "<div class=\"progress-bar-container\">\n <div class=\"progress-bar\" style=\"width: calc(%current% / %total% * 100%);\">\n </div>\n</div>\n"
647 if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
648 window.dispatchEvent(new Event('resize'));
651 var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
653 s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
654 d.head.appendChild(s);
658 var el = d.getElementsByClassName("remark-slides-area");
660 var slide, slides = slideshow.getSlides(), els = el[0].children;
661 for (var i = 1; i < slides.length; i++) {
663 if (slide.properties.continued === "true" || slide.properties.count === "false") {
664 els[i - 1].className += ' has-continuation';
667 var s = d.createElement("style");
668 s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
669 d.head.appendChild(s);
671 // delete the temporary CSS (for displaying all slides initially) when the user
672 // starts to view slides
675 slideshow.on('beforeShowSlide', function(slide) {
677 var sheets = document.styleSheets, node;
678 for (var i = 0; i < sheets.length; i++) {
679 node = sheets[i].ownerNode;
680 if (node.dataset["target"] !== "print-only") continue;
681 node.parentNode.removeChild(node);
686 // add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
687 // screen reader (see PR #262)
690 d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
691 const t = tr.querySelector('td:nth-child(2)').innerText;
692 tr.querySelectorAll('td:first-child .key').forEach(key => {
693 const k = key.innerText;
694 if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
697 d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
701 // Replace <script> tags in slides area to make them executable
702 var scripts = document.querySelectorAll(
703 '.remark-slides-area .remark-slide-container script'
705 if (!scripts.length) return;
706 for (var i = 0; i < scripts.length; i++) {
707 var s = document.createElement('script');
708 var code = document.createTextNode(scripts[i].textContent);
710 var scriptAttrs = scripts[i].attributes;
711 for (var j = 0; j < scriptAttrs.length; j++) {
712 s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
714 scripts[i].parentElement.replaceChild(s, scripts[i]);
718 var links = document.getElementsByTagName('a');
719 for (var i = 0; i < links.length; i++) {
720 if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
721 links[i].target = '_blank';
727 slideshow._releaseMath = function(el) {
728 var i, text, code, codes = el.getElementsByTagName('code');
729 for (i = 0; i < codes.length;) {
731 if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
732 text = code.textContent;
733 if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
734 /^\$\$(.|\s)+\$\$$/.test(text) ||
735 /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
736 code.outerHTML = code.innerHTML; // remove <code></code>
743 slideshow._releaseMath(document);
745 <!-- dynamically load mathjax for compatibility with self-contained -->
748 var script = document.createElement('script');
749 script.type = 'text/javascript';
750 script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
751 if (location.protocol !== 'file:' && /^https?:/.test(script.src))
752 script.src = script.src.replace(/^https?:/, '');
753 document.getElementsByTagName('head')[0].appendChild(script);