presentations/ica_hackathon_2022/ica_hackathon_2022.Rmd

   1 ---
   2 title: "How good of a model do you need? Accounting for classification errors in machine assisted content analysis."
   3 author: Nathan TeBlunthuis
   4 date: May 24 2022
   5 template: "../resources/template.html"
   6 output:
   7   xaringan::moon_reader:
   8     lib_dir: libs
   9     seal: false
  10     nature:
  11         highlightStyle: github
  12         ratio: 16:9
  13         countIncrementalSlides: true
  14         slideNumberFormat: |
  15           <div class="progress-bar-container">
  16             <div class="progress-bar" style="width: calc(%current% / %total% * 100%);">
  17             </div>
  18           </div>
  19     self_contained: false
  20     css: [default, my-theme.css, fontawesome.min.css]
  21     chakra: libs/remark-latest.min.js
  22
  23 ---
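```{r echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
library(ggplot2)
library(data.table)
# Format integers with a thousands separator.
f <- function (x) {formatC(x, format="d", big.mark=',')}

theme_set(theme_bw())
# Load the saved results and attach them so their elements
# (e.g. plot.df.example.1, used in the plots) can be referenced by name.
r <- readRDS('remembr.RDS')
attach(r)

```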
  35 class: center, middle, narrow
  36
<script type='text/javascript'>
window.MathJax = {
  loader: {load: ['[tex]/xcolor']},
  tex: {packages: {'[+]': ['xcolor']}}
};
</script>
  43
  44 <div class="my-header"></div>
  45
  46
  47 ###  .title-heading[Unlocking the power of big data: The importance of measurement error in machine assisted content analysis]
  48 ## Nathan TeBlunthuis
  49
  50 <img src="images/nu_logo.png" height="170px" style="padding:21px"/> <img src="images/uw_logo.png" height="170px" style="padding:21px"/> <img src="images/cdsc_logo.png" height="170px" style="padding:21px"/>
  51
  52
  53 nathan.teblunthuis@northwestern.edu
  54
  55 [https://teblunthuis.cc](https://teblunthuis.cc)
  56
  57 ???
  58
This talk will be me presenting my "lab notebook" and not a polished research talk.  Maybe it would be a good week of a graduate seminar? In sum, machine assisted content analysis has unique limitations and threats to validity that I wanted to understand better.  I've learned how the noise introduced by predictive models can result in misleading statistical inferences, but that a sample of human-labeled validation data can often be used to account for this noise and obtain accurate inferences in the end.  Statistical knowledge of this problem and computational tools for addressing it are still in development.  My goals for this presentation are to start sharing this information with the community and, hopefully, to stimulate us to work on extending existing approaches or using them in our work.
  60
  61 This is going to be a boring talk about some *very* technical material. If you're not that interested please return to your hackathon. Please interrupt me if I'm going too fast for you or if you don't understand something.  I will try to move quickly in the interests of those wishing to wrap up their hackathon projects. I will also ask you to show hands once or twice, if you are already familiar with some concepts that it might be expedient to skip.
  62
  63 ---
  64
  65 class:center, middle, inverse
## Machine assisted content analysis (MACA)
  67
  68 ???
  69
  70 I'm going to start by defining a study design that is increasingly common, especially in Communication and Political Science, but also across the social sciences and beyond. I call it *machine assisted content analysis* (MACA).
  71
  72 ---
  73 <div class="my-header"></div>
  74
  75 ### .border[Machine assisted content analysis (MACA) uses machine learning for scientific measurement.]
  76
  77 .emph[Content analysis:] Statistical analysis of variables measured by human labeling ("coding") of content.  This might be simple categorical labels, or maybe more advanced annotations.
  78
  79 --
  80
  81 *Downside:* Human labeling is *a lot* of work.
  82
  83 --
  84
  85 .emph[Machine assisted content analysis:] Use a *predictive algorithm* (often trained on human-made labels) to measure variables for use in a downstream *primary analysis.*
  86
  87 --
  88
  89 *Downside:*  Algorithms can be *biased* and *inaccurate* in ways that could invalidate the statistical analysis.
  90
  91
  92 ???
  93
A machine assisted content analysis can be part of a more complex or more powerful study design (e.g., an experiment or a time series analysis).
  95
  96 ---
  97
  98
  99 <!-- <div class="my-header"></div> -->
 100
<!-- ### .border[Hypothetical Example: Predicting Racial Harassment in Social Media Comments] -->
 102
 103 ---
 104 class:large
 105
 106 <div class="my-header"></div>
 107
 108 ### .border[How can MACA go wrong?]
 109
 110 Algorithms can be *biased* and *error prone* (*noisy*).
 111
 112 --
 113
 114 Predictor bias is a potentially difficult problem that requires causal inference methods. I'll focus on *noise* for now.
 115
 116 --
 117
 118 Noise in the predictive model introduces bias in the primary analysis.
 119
 120 --
 121
 122 .indent[We can reduce and sometimes even *eliminate* this bias introduced by noise.]
 123
 124 ---
 125 layout:true
 126 <div class="my-header"></div>
 127
 128 ### .border[Example 1: An unbiased, but noisy classifier]
 129
 130 .large[.left-column[![](images/example_1_dag.png)]]
 131
 132 ???
 133
Please show hands if you are familiar with causal graphs or Bayesian networks.  Should I explain what this diagram means?
 135
 136
 137 ---
 138
 139 .right-column[
 140 $x$ is *partly observed* because we have *validation data* $x^*$.
 141 ]
 142
 143 ---
 144
 145
 146 .right-column[
 147 $x$ is *partly observed* because we have *validation data* $x^*$.
 148
 149 $k$ are the *features* used by the *predictive model* $g(k)$.
 150
 151 ]
 152
 153 ---
 154
 155 .right-column[
 156 $x$ is *partly observed* because we have *validation data* $x^*$.
 157
 158 $k$ are the *features* used by the *predictive model* $g(k)$.
 159
 160 The predictions $w$ are a *proxy variable*  $g(k) = \hat{x} = w$.
 161
 162 ]
 163
 164 ---
 165
 166
 167 .right-column[
 168 $x$ is *partly observed* because we have *validation data* $x^*$.
 169
 170 $k$ are the *features* used by the *predictive model* $g(k)$.
 171
 172 The predictions $w$ are a *proxy variable*  $g(k) = \hat{x} = w$.
 173
 174 $x = w + \xi$ because the predictive model makes errors.
 175
 176 ]
 177
 178 ---
 179
 180
 181 layout:true
 182 <div class="my-header"></div>
 183
 184 ### .border[Noise in a *covariate* creates *attenuation bias*.]
 185
 186 .large[.left-column[![](images/example_1_dag.png)]]
 187
 188 ---
 189 .right-column[
 190
We want to estimate $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 192
 193 $x = w + \xi$ because the predictive model makes errors.
 194
 195 ]
 196 ---
 197
 198 .right-column[
 199
We want to estimate $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 201
 202 $x = w + \xi$ because the predictive model makes errors.
 203
 204
 205 Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:
 206
 207 ]
 208
 209 ---
 210
 211 .right-column[
 212
We want to estimate $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.

$x = w + \xi$ because the predictive model makes errors.

Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:

$$\widehat{B_w}^{ols}=\frac{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x - \xi)})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} \approx \frac{\sum_{j=1}^n{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \xi_j - \bar{x})^2}}$$
 221
 222 ]
 223
 224 ---
 225
 226 .right-column[
 227
We want to estimate $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.

$x = w + \xi$ because the predictive model makes errors.

Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:

$$\widehat{B_w}^{ols}=\frac{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x - \xi)})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \xi_j - \overline{(x-\xi)})^2}} \approx \frac{\sum_{j=1}^n{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j - \color{red}{\xi_j} - \bar{x})\color{red}{^2}}}$$

The extra $\xi_j$ term (in red) inflates the denominator but not the numerator, so the estimate is pulled toward zero: $|\widehat{B_w}^{ols}| < |B_x|$.
 238
 239
 240 ]
 241
 242
 243 ???
 244
Please raise your hands if you're familiar with attenuation bias.  I expect that it's covered in some graduate stats classes, but not universally.
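
---
layout:false
<div class="my-header"></div>

### .border[Sketch: attenuation bias in a simulation]

A minimal sketch of the attenuation argument, not the simulation code behind these slides. It assumes a continuous $x$, normal errors, and made-up values for $B$ and the variances; all variable names are illustrative.

```r
set.seed(2022)
n  <- 1e5
B  <- 0.2                  # true coefficient (hypothetical value)
x  <- rnorm(n)             # true covariate
xi <- rnorm(n)             # classifier error: mean zero, independent of x and y
w  <- x - xi               # proxy, so that x = w + xi
y  <- B * x + rnorm(n)     # the outcome depends on x only

b_feasible <- coef(lm(y ~ x))[["x"]]   # requires the true x
b_naive    <- coef(lm(y ~ w))[["w"]]   # what w alone gives us
shrinkage  <- var(x) / (var(x) + var(xi))   # expected attenuation factor

round(c(feasible = b_feasible, naive = b_naive,
        expected_naive = B * shrinkage), 3)
```

Because $x$ and $\xi$ have equal variance here, the naive slope should land near half of $B$.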
 246
 247 ---
 248 class:large
 249 layout:false
 250 <div class="my-header"></div>
 251
 252 ### .border[Beyond attenuation bias]
.larger[Measurement error can threaten validity because:]

- Attenuation bias *spreads* (e.g., to marginal effects as illustrated later).

--

- Measurement error can be *differential*: not distributed evenly and possibly correlated with $x$, $y$, or $\varepsilon$.

--

- *Bias can be away from 0* in GLMs and nonlinear models or if measurement error is differential.

--

- *Confounding* if the *predictive model is biased*, introducing a correlation between the measurement error and the residuals $(E[\xi\varepsilon] \ne 0)$.
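
---
layout:false
<div class="my-header"></div>

### .border[Sketch: error correlated with the residuals]

A hedged illustration of the last bullet: when the measurement error is correlated with the residuals of the primary model, the naive estimate can be biased *away* from zero. The data-generating values below are assumptions picked for illustration only.

```r
set.seed(3)
n   <- 1e5
B   <- 0.2                                  # true coefficient (hypothetical)
x   <- rnorm(n)
eps <- rnorm(n)                             # residual of the primary model
y   <- B * x + eps
w   <- x + 0.5 * eps + rnorm(n, sd = 0.5)   # proxy contaminated by eps
xi  <- x - w                                # measurement error, since x = w + xi

b_naive <- coef(lm(y ~ w))[["w"]]

round(c(true = B, naive = b_naive, cor_xi_eps = cor(xi, eps)), 3)
```

Here the naive slope comes out well above $B$ rather than below it.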
 268
 269
 270 ---
 271
 272 class:large
 273 layout:false
 274 <div class="my-header"></div>
 275
 276 ### .border[Correcting measurement error]
 277
 278 There's a vast literature in statistics on measurement error. Mostly about noise you'd find in sensors. Lots of ideas. No magic bullets.
 279
 280 --
 281
 282 I'm going to briefly cover 3 different approaches: *multiple imputation*,  *regression calibration* and *2SLS+GMM*.
 283
 284 --
 285
 286 These all depend on *validation data*. I'm going to ignore where this comes from, but assume it's a random sample of the hypothesis testing dataset.
 287
 288 --
 289
 290 You can *and should* use it to improve your statistical estimates.
 291
 292 ---
 293
 294 <div class="my-header"></div>
 295
 296 ### .border[Multiple Imputation (MI) treats Measurement Error as a Missing Data Problem]
 297
 298 1. Use validation data to estimate $f(x|w,y)$, a probabilistic model of $x$.
 299
 300 --
 301
 302 2. *Sample* $m$ datasets from $\widehat{f(x|w,y)}$.
 303
 304 --
 305
 306 3. Run your analysis on each of the $m$ datasets.
 307
 308 --
 309
 310 4. Average the results from the $m$ analyses using Rubin's rules.
 311
 312 --
 313
.e[Advantages:] *Very flexible!* Can sometimes work even if the predictor $g(k)$ is biased. Good R packages (**`{Amelia}`**, `{mi}`, `{mice}`, `{brms}`).

--

.e[Disadvantages:] Results depend on the quality of $\widehat{f(x|w,y)}$. May require more validation data. Computationally expensive, statistically inefficient, and doesn't seem to benefit much from larger datasets.
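
---
layout:false
<div class="my-header"></div>

### .border[Sketch: multiple imputation by hand]

The four steps can be sketched in base R. This is a simplified illustration, not how `{Amelia}`, `{mi}`, `{mice}`, or `{brms}` work internally. It assumes a continuous $x$, a linear-normal imputation model, and a bootstrap of the validation rows to carry the imputation model's uncertainty into the draws. All names and parameter values are hypothetical.

```r
set.seed(42)
n <- 20000   # rows in the full dataset (hypothetical)
m <- 5000    # rows with human-validated x (hypothetical)
B <- 0.5     # true coefficient (hypothetical)
M <- 20      # number of imputed datasets

x_true <- rnorm(n)
w <- x_true - rnorm(n)        # proxy, so that x = w + xi
y <- B * x_true + rnorm(n)
val <- seq_len(n) <= m        # rows are iid, so the first m act as a random sample
d <- data.frame(y = y, w = w, x = ifelse(val, x_true, NA))

est <- numeric(M)   # per-imputation estimates of B
u   <- numeric(M)   # per-imputation sampling variances
for (i in seq_len(M)) {
  # Step 1: estimate f(x | w, y) on a bootstrap resample of the validation rows
  boot <- d[sample(which(val), replace = TRUE), ]
  imp_model <- lm(x ~ w + y, data = boot)
  # Step 2: draw the unobserved x (prediction plus residual noise)
  d_i <- d
  miss <- is.na(d_i$x)
  d_i$x[miss] <- predict(imp_model, newdata = d_i[miss, ]) +
    rnorm(sum(miss), sd = sigma(imp_model))
  # Step 3: run the primary analysis on the completed dataset
  fit <- lm(y ~ x, data = d_i)
  est[i] <- coef(fit)[["x"]]
  u[i]   <- vcov(fit)["x", "x"]
}

# Step 4: Rubin's rules (within-imputation plus inflated between-imputation variance)
mi_est <- mean(est)
mi_se  <- sqrt(mean(u) + (1 + 1 / M) * var(est))
naive_est <- coef(lm(y ~ w))[["w"]]

round(c(naive = naive_est, mi = mi_est, mi_se = mi_se), 3)
```

Including $y$ in the imputation model matters: imputing $x$ from $w$ alone would reproduce the attenuation.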
 319
 320 ---
 321
 322 ### .border[Regression calibration directly adjusts for attenuation bias.]
 323
 324 1. Use validation data to estimate the errors $\hat{\xi}$.
 325
 326 --
 327
 328 2. Use $\hat{\xi}$ to correct the OLS estimate.
 329
 330 --
 331
 332 3. Correct the standard errors using MLE or bootstrapping.
 333
 334 --
 335
 336 .e[Advantages:] Simple, fast.
 337
 338 --
 339
 340 .e[Disadvantages:] Limited to OLS models. Requires an unbiased predictor $g(k)$. R support (`{mecor}` R package) is pretty new.
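
---
layout:false
<div class="my-header"></div>

### .border[Sketch: regression calibration with a bootstrap]

One common form of regression calibration, sketched under assumptions: fit a calibration model for $x$ given $w$ on the validation rows, replace $w$ with the calibrated prediction everywhere, and bootstrap the whole procedure for a standard error. This is not the `{mecor}` API, and the data-generating values are hypothetical.

```r
set.seed(7)
n <- 10000   # rows in the full dataset (hypothetical)
m <- 1000    # validated rows (hypothetical)
B <- 0.5     # true coefficient (hypothetical)

x <- rnorm(n)
w <- x - rnorm(n)            # proxy, so that x = w + xi
y <- B * x + rnorm(n)
d <- data.frame(y = y, w = w, x = x, val = seq_len(n) <= m)

rc_estimate <- function(d) {
  # Step 1: calibration model; x is only used on the validated rows
  calib <- lm(x ~ w, data = d[d$val, ])
  # Step 2: regress y on the calibrated prediction instead of on w
  d$x_hat <- predict(calib, newdata = d)
  coef(lm(y ~ x_hat, data = d))[["x_hat"]]
}

rc_est <- rc_estimate(d)
# Step 3: bootstrap rows so the standard error reflects both stages
boot  <- replicate(100, rc_estimate(d[sample.int(n, replace = TRUE), ]))
rc_se <- sd(boot)
naive_est <- coef(lm(y ~ w, data = d))[["w"]]

round(c(naive = naive_est, calibrated = rc_est, se = rc_se), 3)
```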
 341
 342 ---
 343 layout:true
 344 ### .border[2SLS+GMM is designed for this specific problem]
 345
 346 .left-column[![](images/Fong_Taylor.png)]
 347
 348 *Regression calibration with a trick.*
 349
 350 ---
 351 .right-column[
 352
 353 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 354
 355 ]
 356
 357 ---
 358 .right-column[
 359
 360 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 361
 362 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$. (Second-stage LS / regression calibration).
 363
 364 ]
 365
 366 ---
 367 .right-column[
 368
 369 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 370
 371 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 372
 373 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 374
 375 ]
 376
 377 ---
 378 .right-column[
 379
 380 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 381
 382 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 383
 384 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 385
 386 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 387
 388 ]
 389
 390 ---
 391 .right-column[
 392
 393 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 394
 395 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 396
 397 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 398
 399 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 400
Advantages: Accurate. Sometimes robust if the predictor $g(k)$ is biased.  In theory, flexible to any models that can be fit using GMM.
 402
 403 ]
 404
 405
 406 ---
 407 .right-column[
 408
 409 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 410
 411 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 412
 413 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 414
 415 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 416
Advantages: Accurate. Sometimes robust if the predictor $g(k)$ is biased.  In theory, flexible to any models that can be fit using GMM.
 418
 419 Disadvantages: Implementation (`{predictionError}`) is new. API is cumbersome and only supports linear models. Not robust if $E(w\varepsilon) \ne 0$. GMM may be unfamiliar to audiences.
 420
 421 ]
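
---
layout:false
<div class="my-header"></div>

### .border[Sketch: the four steps with a simplified step 4]

A rough sketch of the four steps, with one loud assumption: step 4 uses a simple inverse-variance weighted average as a stand-in for the GMM step. That is *not* GMM and not the `{predictionError}` API. It also treats the two estimates as independent and ignores first-stage uncertainty in the second-stage variance. All values are hypothetical.

```r
set.seed(11)
n <- 10000   # rows in the full dataset (hypothetical)
m <- 1000    # validated rows (hypothetical)
B <- 0.5     # true coefficient (hypothetical)

x <- rnorm(n)
w <- x - rnorm(n)            # proxy, so that x = w + xi
y <- B * x + rnorm(n)
val <- seq_len(n) <= m

# Step 1: first stage on the validation rows
first <- lm(x ~ w, subset = val)
x_hat <- predict(first, newdata = data.frame(w = w))

# Step 2: second stage on the rows without validation data
second <- lm(y ~ x_hat, subset = !val)

# Step 3: validation dataset model
valfit <- lm(y ~ x, subset = val)

# Step 4 (stand-in for GMM): inverse-variance weighted average
b <- c(tsls = coef(second)[["x_hat"]], val = coef(valfit)[["x"]])
v <- c(tsls = vcov(second)["x_hat", "x_hat"], val = vcov(valfit)["x", "x"])
wts    <- (1 / v) / sum(1 / v)
b_comb <- sum(wts * b)

round(c(b, combined = b_comb), 3)
```

The point is only to show why combining the two estimates helps: the validation model is unbiased but noisy, while the second stage draws on many more rows.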
 422
 423 ---
 424 layout:false
### .border[Testing attenuation bias correction]

<div class="my-header"></div>

I've run simulations to test these approaches in several scenarios.

The predictive model is not very good: about 70% accurate.

Most plausible scenario:

$y$ is continuous and normal-ish.

--

$x$ is binary (human labels) with $P(x=1)=0.5$.

--

$w$ is the *continuous predictor* (e.g., probability) output of $g(k)$ (not binary predictions).

--

If $w$ is binary, most methods struggle, but regression calibration and 2SLS+GMM can do okay.
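
---
layout:false
<div class="my-header"></div>

### .border[Sketch: simulating this scenario]

One way to generate data like this, offered as a sketch and not as the code used for the results that follow. The function name, its arguments, and the reading of `N` as the dataset size and `m` as the number of validated rows are assumptions.

```r
simulate_maca <- function(N, m, B = 0.2, accuracy = 0.7, sd_eps = 1) {
  x <- rbinom(N, 1, 0.5)                    # binary human label, P(x = 1) = 0.5
  # Latent score: thresholding it at 0 is right with probability `accuracy`
  score <- qnorm(accuracy) * (2 * x - 1) + rnorm(N)
  w <- plogis(score)                        # continuous, probability-like output
  y <- B * x + rnorm(N, sd = sd_eps)        # continuous, normal-ish outcome
  validated <- seq_len(N) %in% sample.int(N, m)   # random validation sample
  data.frame(y = y, w = w, x = x,
             x_obs = ifelse(validated, x, NA),    # x is seen only if validated
             validated = validated)
}

set.seed(1)
d <- simulate_maca(N = 1e5, m = 200)
acc <- mean((d$w > 0.5) == (d$x == 1))      # accuracy of the binary predictions
round(c(accuracy = acc, share_x = mean(d$x)), 3)
```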
 448
 449 ---
 450 layout:false
 451
 452 ### .border[Example 1: estimator of the effect of x]
 453
 454 .right-column[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=7.5, fig.asp=.625, cache=FALSE}
# Order the methods for the x-axis; N and m become the facet variables.
plot.df <- plot.df.example.1[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=TRUE),
                                     N=factor(N),
                                     m=factor(m))]

# Keep the estimates for x; drop some N and m settings and the extra multiple imputation variant.
plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")]
# Point ranges span mean.est +/- var.est/2; the dashed reference line is at 0.2.
p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method))
p <- p + geom_hline(aes(yintercept=0.2),linetype=2)

p <- p + geom_pointrange() + facet_grid(m~N,as.table=FALSE) + scale_x_discrete(labels=label_wrap_gen(4))

print(p)
```
 473 ]
 474 .left-column[
 475
All methods work in this scenario.
 477
 478 Multiple imputation is inefficient.
 479
 480 ]
 481
 482
 483 ---
 484 ### .border[What about bias?]
 485
 486 .left-column[
 487 .large[![](images/example_2_dag.png)]
 488 ]
 489
 490 .right-column[
 491 A few notes on this scenario.
 492
$B_x = 0.2$, $B_g=-0.2$ (the coefficient on $r$) and $sd(\varepsilon)=3$. So the signal-to-noise ratio is low.

$r$ can be conceived of as a missing feature in the predictive model $g(k)$ that is also correlated with $y$.

For example, $r$ might be the *race* of a commenter, $x$ could be *racial harassment*, and $y$ whether the commenter gets banned. $k$ only has textual features, but human coders can see user profiles and so know $r$.
 498
 499 ]
 500
 501 ---
 502 layout:false
 503 ### .border[Example 2: Estimates of the effect of x ]
 504
 505 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for the x-axis; N and m become the facet variables.
plot.df <- plot.df.example.2B[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=TRUE),
                                     N=factor(N),
                                     m=factor(m))]

# Keep the estimates for x; drop some N and m settings and the extra multiple imputation variant.
plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")]
# Point ranges span mean.est +/- var.est/2; the dashed line marks B_x = 0.2.
p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method))
p <- p + geom_hline(aes(yintercept=0.2),linetype=2)

p <- p + geom_pointrange() + facet_grid(m~N,as.table=FALSE) + scale_x_discrete(labels=label_wrap_gen(4))

print(p)
```
 524 ]
 525 ---
 526 layout:false
 527
 528 ### .border[Example 2: Estimates of the effect of r]
 529
 530 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for the x-axis; N and m become the facet variables.
plot.df <- plot.df.example.2B[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=TRUE),
                                     N=factor(N),
                                     m=factor(m))]

# Keep the estimates for the second covariate (r on the slides appears to be stored as variable 'g').
plot.df <- plot.df[(variable=='g') & (m != 1000) & (m!=500) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")]
# Point ranges span mean.est +/- var.est/2; the dashed line marks B_g = -0.2.
p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method))
p <- p + geom_hline(aes(yintercept=-0.2),linetype=2)

p <- p + geom_pointrange() + facet_grid(m~N,as.table=FALSE) + scale_x_discrete(labels=label_wrap_gen(4))

print(p)
```
 547 ]
 548 ---
 549
 550 layout:false
 551 class:large
 552
### .border[Takeaways from example 2]
 554
 555 Bias in the predictive model creates bias in hypothesis tests.
 556
 557 --
 558
 559 Bias can be corrected *in this case*.
 560
 561 --
 562
 563 The next scenario has bias that's more tricky.
 564
 565 --
 566
 567 Multiple imputation helps, but doesn't fully correct the bias.
 568
 569 ---
 570
 571 layout:false
 572
### .border[When will 2SLS+GMM fail?]
 574
 575 .large[.left-column[![](images/example_3_dag.png)]]
 576
 577 .right-column[The catch with GMM:
 578
 579 .emph[Exclusion restriction:] $E[w \varepsilon] = 0$.
 580
The restriction is violated if a variable $U$ causes both $K$ and $Y$ and $X$ causes $K$ (not vice versa).
 582
 583 ]
 584
 585 ???
 586
GMM optimizes a model to a system of equations of which the exclusion restriction is one.  So if that assumption isn't true, the estimate will be biased.
 588
 589 This is a different assumption than that of OLS or GLM models.
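
---
layout:false
<div class="my-header"></div>

### .border[Sketch: violating the exclusion restriction]

A hedged illustration of the failure mode, using only the two least-squares stages (no GMM step). An unobserved $u$ feeds into both the classifier score and $y$, so $E[w \varepsilon] \ne 0$. The data-generating values are assumptions picked for illustration.

```r
set.seed(5)
n <- 1e5     # rows in the full dataset (hypothetical)
m <- 1e4     # validated rows (hypothetical)
B <- 0.2     # true coefficient (hypothetical)

x <- rnorm(n)
u <- rnorm(n)                      # unobserved cause of the features and of y
w <- x + u + rnorm(n, sd = 0.5)    # score built from features that u also affects
y <- B * x + u + rnorm(n)          # so the residual of y ~ x contains u
val <- seq_len(n) <= m

first  <- lm(x ~ w, subset = val)                       # first stage
x_hat  <- predict(first, newdata = data.frame(w = w))
b_2sls <- coef(lm(y ~ x_hat))[["x_hat"]]                # second stage, all rows
b_val  <- coef(lm(y ~ x, subset = val))[["x"]]          # validation model

round(c(true = B, two_stage = b_2sls, validation_only = b_val), 3)
```

The validation-only model stays close to $B$ because $u$ is independent of $x$ here; the two-stage estimate does not.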
 590
 591 ---
 592
 593 layout:false
 594
 595 ### .border[Example 3: Estimates of the effect of x]
 596
 597 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for the x-axis; N and m become the facet variables.
plot.df <- plot.df.example.3[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=TRUE),
                                     N=factor(N),
                                     m=factor(m))]

# Keep the estimates for x; drop some N and m settings and the extra multiple imputation variant.
plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=10000) & (method!="Multiple imputation (Classifier features unobserved)")]
# Point ranges span mean.est +/- var.est/2; the dashed reference line is at 0.2.
p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method))
p <- p + geom_hline(aes(yintercept=0.2),linetype=2)

p <- p + geom_pointrange() + facet_grid(m~N,as.table=FALSE) + scale_x_discrete(labels=label_wrap_gen(4))

print(p)
```
 614 ]
 615
 616
 617
 618 ---
 619
### .border[Takeaways]

- Attenuation bias can be a big problem with noisy predictors, leading to small and biased estimates.

- For more general hypothesis tests or if the predictor is biased, measurement error can lead to false discovery.

- It's fixable with validation data. You may not need that much and you should already be getting it.

- This means it can be okay to use poor predictors for hypothesis testing.
 629
 630 - The ecosystem is underdeveloped, but a lot of methods have been researched.
 631
- Take advantage of machine learning + big data and get precise estimates even when the signal-to-noise ratio is low!
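
---
layout:false
<div class="my-header"></div>

### .border[Sketch: how much validation data?]

A rough way to explore the validation-data takeaway: repeat a regression calibration estimate many times for several validation sizes $m$ and compare the spread. The helper `rc_once`, the sizes, and the data-generating values are hypothetical, and a continuous $x$ is assumed for simplicity.

```r
set.seed(8)

# One simulated dataset, one regression calibration estimate of B
rc_once <- function(n, m, B = 0.5) {
  x <- rnorm(n)
  w <- x - rnorm(n)                 # proxy, so that x = w + xi
  y <- B * x + rnorm(n)
  v <- seq_len(m)                   # validated rows
  lambda_hat <- cov(x[v], w[v]) / var(w[v])   # calibration slope of x on w
  b_naive    <- cov(y, w) / var(w)            # naive slope of y on w
  b_naive / lambda_hat              # same as the slope of y on the calibrated x
}

sizes <- c(50, 200, 1000)
res <- sapply(sizes, function(m) {
  est <- replicate(200, rc_once(n = 2000, m = m))
  c(m = m, mean = mean(est), sd = sd(est))
})

round(t(res), 3)
```

The spread shrinks as $m$ grows, which gives a rough sense of how much validation data a given level of precision needs.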
 633
 634 ---
 635 layout:false
 636
 637 ### .border[Future work: Noise in the *outcome*]
 638
 639 I've been focusing on noise in *covariates.* What if the predictive algorithm is used to measure the *outcome* $y$?
 640
 641 --
 642
 643 This isn't a problem in the simplest case (linear regression with homoskedastic errors).  Noise in $y$ is projected into the error term.
 644
 645 --
 646
 647 Noise in the outcome is still a problem if errors are heteroskedastic and for GLMs / non-linear regression (e.g., logistic regression).
 648
 649 --
 650
Multiple imputation (in theory) could help here. The other methods aren't designed for this case.
 652
 653 --
 654
 655 Solving this problem could be an important methodological contribution with a very broad impact.
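
---
layout:false
<div class="my-header"></div>

### .border[Sketch: noise in the outcome]

A small simulation of the contrast described here, under assumed values: additive mean-zero noise in a continuous $y$ leaves the linear slope alone, while random misclassification of a binary $y$ shrinks a logistic slope. The 20% flip rate and the other values are hypothetical.

```r
set.seed(99)
n <- 1e5
B <- 1                                   # true coefficient (hypothetical)
x <- rnorm(n)

# Linear regression: noise in y only inflates the error term
y       <- B * x + rnorm(n)
y_noisy <- y + rnorm(n, sd = 2)
b_lin   <- coef(lm(y_noisy ~ x))[["x"]]

# Logistic regression: flip 20% of the binary outcomes at random
yb       <- rbinom(n, 1, plogis(B * x))
flip     <- rbinom(n, 1, 0.2) == 1
yb_noisy <- ifelse(flip, 1 - yb, yb)
b_logit_true  <- coef(glm(yb ~ x, family = binomial))[["x"]]
b_logit_noisy <- coef(glm(yb_noisy ~ x, family = binomial))[["x"]]

round(c(linear_noisy = b_lin, logit_true = b_logit_true,
        logit_noisy = b_logit_noisy), 3)
```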
 656
 657 ---
 658 # .border[Questions?]
 659
Links to slides: [html](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.html) [pdf](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.pdf)

Link to a messy git repository: [https://code.communitydata.science/ml_measurement_error_public.git](https://code.communitydata.science/ml_measurement_error_public.git)
 663
 664 <i class="fa fa-envelope" aria-hidden='true'></i> nathan.teblunthuis@northwestern.edu
 665
 666 <i class="fa fa-twitter" aria-hidden='true'></i> @groceryheist
 667
 668 <i class="fa fa-globe" aria-hidden='true'></i> [https://communitydata.science](https://communitydata.science)
 669
 670
 671
 672 <!-- ### .border[Multiple imputation struggles with discrete variables] -->
 673
 674 <!-- In my experiments I've found that the 2SLS+GMM method works well with a broader range of data types.  -->
 675
 676 <!-- To illustrate, Example 3 is the same as Example 2, but with $x$ and $w$ as discrete variables.  -->
 677
<!-- Practically speaking, a continuous "score" $w$ is often available, and my opinion is that this is usually better and more informative than the model's discrete predictions.  Continuous validation data may be more difficult to obtain, but it is often possible using techniques like pairwise comparison. -->
 679 <!-- layout:false -->
 680 <!-- ### .border[Example 3: Estimates of the effect of x ] -->
 681
 682 <!-- .center[ -->
 683 <!-- ```{r echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='svg', fig.width=8, fig.asp=.625,cache=F} -->
 684
 685 <!-- #plot.df <-  -->
 686 <!-- plot.df <- plot.df.example.2[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=T), -->
 687 <!--                                      N=factor(N), -->
 688 <!--                                      m=factor(m))] -->
 689
 690 <!-- plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=5000) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")] -->
 691 <!-- p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method)) -->
 692 <!-- p <- p + geom_hline(aes(yintercept=0.2),linetype=2) -->
 693
 694 <!-- p <- p + geom_pointrange() + facet_grid(m~N,as.table=F) + scale_x_discrete(labels=label_wrap_gen(4)) -->
 695
 696 <!-- print(p) -->
 697
 698 <!-- # get gtable object -->
 699
 700 <!-- .large[.left [![](images/example_2_dag.png)]] -->
 701
<!-- There are at least two general ways using a predictive model can introduce bias: *attenuation*, and *confounding.* -->

<!-- Confounding can be broken down into 4 types: -->
 705
 706 <!-- .right[Confounding on $X$ by observed variables -->
 707
 708 <!--    Confounding on $Y$ by observed variables -->
 709 <!-- ] -->
 710
 711 <!-- .left[Confounding on $X$ by *un*observed variables -->
 712
 713 <!--    Confounding on $Y$ by *un*observed variables -->
 714 <!-- ] -->
 715
 716 <!-- Attenuation and the top-right column can be dealt with relative ease using a few different methods. -->
 717
 718 <!-- The bottom-left column can be addressed, but so far I haven't found a magic bullet. -->
 719
 720 <!-- The left column is pretty much a hopeless situation. -->