presentations/ica_hackathon_2022/ica_hackathon_2022.Rmd

   1 ---
   2 title: "How good of a model do you need? Accounting for classification errors in machine assisted content analysis."
   3 author: Nathan TeBlunthuis
   4 date: May 24 2022
   5 template: "../resources/template.html"
   6 output:
   7   xaringan::moon_reader:
   8     lib_dir: libs
   9     seal: false
  10     nature:
  11         highlightStyle: github
  12         ratio: 16:9
  13         countIncrementalSlides: true
  14         slideNumberFormat: |
  15           <div class="progress-bar-container">
  16             <div class="progress-bar" style="width: calc(%current% / %total% * 100%);">
  17             </div>
  18           </div>
  19     self_contained: false
  20     css: [default, my-theme.css, fontawesome.min.css]
  21     chakra: libs/remark-latest.min.js
  22
  23 ---
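```{r echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
library(ggplot2)
library(data.table)
library(icons)

# Format integers with thousands separators.
f <- function(x) formatC(x, format = "d", big.mark = ",")

theme_set(theme_bw())

# Load the saved simulation results and attach them so that objects used in
# later chunks (e.g. plot.df.example.1) can be referenced by name.
r <- readRDS("remembr.RDS")
attach(r)
```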
  37 class: center, middle, narrow
  38
  39 <script type='javascript'>
  40 window.MathJax = {
  41   loader: {load: ['[tex]/xcolor']},
  42   tex: {packages: {'[+]': ['xcolor']}}
  43 };
  44 </script>
  45
  46 <div class="my-header"></div>
  47
  48
  49 ###  .title-heading[Unlocking the power of big data: The importance of measurement error in machine assisted content analysis]
  50 ## Nathan TeBlunthuis
  51
  52 <img src="images/nu_logo.png" height="170px" style="padding:21px"/> <img src="images/uw_logo.png" height="170px" style="padding:21px"/> <img src="images/cdsc_logo.png" height="170px" style="padding:21px"/>
  53
  54
  55 `r icons::fontawesome('envelope')` nathan.teblunthuis@northwestern.edu
  56
  57 `r icons::fontawesome('globe')` [https://teblunthuis.cc](https://teblunthuis.cc)
  58
  59 ???
  60
This talk will be me presenting my "lab notebook" and not a polished research talk.  Maybe it would be a good week of a graduate seminar? In sum, machine assisted content analysis has unique limitations and threats to validity that I wanted to understand better.  I've learned how the noise introduced by predictive models can result in misleading statistical inferences, but that a sample of human-labeled validation data can often be used to account for this noise and obtain accurate inferences in the end.  Statistical knowledge of this problem and computational tools for addressing it are still in development.  My goals for this presentation are to start sharing this information with the community and, hopefully, to stimulate us to work on extending existing approaches or using them in our work.
  62
  63 This is going to be a boring talk about some *very* technical material. If you're not that interested please return to your hackathon. Please interrupt me if I'm going too fast for you or if you don't understand something.  I will try to move quickly in the interests of those wishing to wrap up their hackathon projects. I will also ask you to show hands once or twice, if you are already familiar with some concepts that it might be expedient to skip.
  64
  65 ---
  66
  67 class:center, middle, inverse
## Machine assisted content analysis (MACA)
  69
  70 ???
  71
  72 I'm going to start by defining a study design that is increasingly common, especially in Communication and Political Science, but also across the social sciences and beyond. I call it *machine assisted content analysis* (MACA).
  73
  74 ---
  75 <div class="my-header"></div>
  76
  77 ### .border[Machine assisted content analysis (MACA) uses machine learning for scientific measurement.]
  78
  79 .emph[Content analysis:] Statistical analysis of variables measured by human labeling ("coding") of content.  This might be simple categorical labels, or maybe more advanced annotations.
  80
  81 --
  82
  83 *Downside:* Human labeling is *a lot* of work.
  84
  85 --
  86
  87 .emph[Machine assisted content analysis:] Use a *predictive algorithm* (often trained on human-made labels) to measure variables for use in a downstream *primary analysis.*
  88
  89 --
  90
  91 *Downside:*  Algorithms can be *biased* and *inaccurate* in ways that could invalidate the statistical analysis.
  92
  93
  94 ???
  95
  96 A machine assisted content analysis can be part of a more complex or more powerful study design (e.g., an experiment, time series analysis &c).
  97
  98 ---
  99
 100
 101 <!-- <div class="my-header"></div> -->
 102
 103 <!-- ### .border[Hypothetical Example: Predicting Racial Harassement in Social Media Comments] -->
 104
 105 ---
 106 class:large
 107
 108 <div class="my-header"></div>
 109
 110 ### .border[How can MACA go wrong?]
 111
 112 Algorithms can be *biased* and *error prone* (*noisy*).
 113
 114 --
 115
 116 Predictor bias is a potentially difficult problem that requires causal inference methods. I'll focus on *noise* for now.
 117
 118 --
 119
 120 Noise in the predictive model introduces bias in the primary analysis.
 121
 122 --
 123
 124 .indent[We can reduce and sometimes even *eliminate* this bias introduced by noise.]
 125
 126 ---
 127 layout:true
 128 <div class="my-header"></div>
 129
 130 ### .border[Example 1: An unbiased, but noisy classifier]
 131
 132 .large[.left-column[![](images/example_1_dag.png)]]
 133
 134 ???
 135
Please show hands if you are familiar with causal graphs or Bayesian networks.  Should I explain what this diagram means?
 137
 138
 139 ---
 140
 141 .right-column[
 142 $x$ is *partly observed* because we have *validation data* $x^*$.
 143 ]
 144
 145 ---
 146
 147
 148 .right-column[
 149 $x$ is *partly observed* because we have *validation data* $x^*$.
 150
 151 $k$ are the *features* used by the *predictive model* $g(k)$.
 152
 153 ]
 154
 155 ---
 156
 157 .right-column[
 158 $x$ is *partly observed* because we have *validation data* $x^*$.
 159
 160 $k$ are the *features* used by the *predictive model* $g(k)$.
 161
 162 The predictions $w$ are a *proxy variable*  $g(k) = \hat{x} = w$.
 163
 164 ]
 165
 166 ---
 167
 168
 169 .right-column[
 170 $x$ is *partly observed* because we have *validation data* $x^*$.
 171
 172 $k$ are the *features* used by the *predictive model* $g(k)$.
 173
 174 The predictions $w$ are a *proxy variable*  $g(k) = \hat{x} = w$.
 175
 176 $x = w + \xi$ because the predictive model makes errors.
 177
 178 ]
 179
 180 ---
 181
 182
 183 layout:true
 184 <div class="my-header"></div>
 185
 186 ### .border[Noise in a *covariate* creates *attenuation bias*.]
 187
 188 .large[.left-column[![](images/example_1_dag.png)]]
 189
 190 ---
 191 .right-column[
 192
 193 We want to estimate, $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 194
 195 $x = w + \xi$ because the predictive model makes errors.
 196
 197 ]
 198 ---
 199
 200 .right-column[
 201
 202 We want to estimate, $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 203
 204 $x = w + \xi$ because the predictive model makes errors.
 205
 206
 207 Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:
 208
 209 ]
 210
 211 ---
 212
 213 .right-column[
 214
 215 We want to estimate, $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 216
 217 $x = w + \xi$ because the predictive model makes errors.
 218
 219 Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:
 220
$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j + \xi_j - \overline{(x + \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j + \xi_j - \overline{(x+\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j + \xi_j - \bar{x}){^2}}}$$
 223
 224 ]
 225
 226 ---
 227
 228 .right-column[
 229
 230 We want to estimate, $y = Bx + \varepsilon$, but we estimate $y = Bw + \varepsilon$ instead.
 231
 232 $x = w + \xi$ because the predictive model makes errors.
 233
 234 Assume $g(k)$ is *unbiased* so $E(\xi)=0$. Also assume error is *nondifferential* so $E(\xi y)=0$:
 235
$$\widehat{B_w}^{ols}=\frac{\sum^n_{j=1}{(x_j + \xi_j - \overline{(x + \xi)})}(y_j - \bar{y})}{\sum_{j=1}^n{(x_j + \xi_j - \overline{(x+\xi)})^2}} = \frac{\sum^n_{j=1}{(x_j - \bar{x})(y_j - \bar{y})}}{\sum_{j=1}^n{(x_j + \color{red}{\xi_j} - \bar{x})\color{red}{^2}}}$$
 238
In this scenario, the extra $\xi_j$ term inflates the denominator, so $\widehat{B_w}^{ols}$ is pulled toward zero: $|\widehat{B_w}^{ols}| < |B_x|$.
 240
 241
 242 ]
 243
 244
 245 ???
 246
Please raise your hands if you're familiar with attenuation bias.  I expect that it's covered in some graduate stats classes, but not universally.
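
A quick simulation can make the attenuation concrete. This is only a sketch under assumptions that go beyond the slide: classical measurement error ($w = x + \xi$ with $\xi$ independent of $x$ and $y$, matching the derivation), standard normal $x$ and $\xi$, and made-up values for `n` and `B`. With equal variances for $x$ and $\xi$, the naive slope should land at roughly half of the true one.

```r
# Hypothetical simulation of attenuation bias (all values are illustrative).
set.seed(1)
n <- 10000
B <- 0.2                      # true effect of x on y

x  <- rnorm(n)                # true covariate, variance 1
xi <- rnorm(n)                # classifier noise, independent of x and y
w  <- x + xi                  # noisy proxy produced by the predictive model
y  <- B * x + rnorm(n)

b_true  <- unname(coef(lm(y ~ x))["x"])   # regression on the true covariate
b_naive <- unname(coef(lm(y ~ w))["w"])   # regression on the proxy

# Large-sample attenuation factor: var(x) / (var(x) + var(xi)).
attenuation <- var(x) / (var(x) + var(xi))
print(c(true = b_true, naive = b_naive, expected_naive = B * attenuation))
```

Shrinking the variance of `xi` (a more accurate classifier) moves the naive slope back toward `B`.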
 248
 249 ---
 250 class:large
 251 layout:false
 252 <div class="my-header"></div>
 253
 254 ### .border[Beyond attenuation bias]
.larger[Measurement error can threaten validity because:]
 256
 257 - Attenuation bias *spreads* (e.g., to marginal effects as illustrated later).
 258
 259 --
 260
- Measurement error can be *differential*: not distributed evenly and possibly correlated with $x$, $y$, or $\varepsilon$.
 262
 263 --
 264
 265 - *Bias can be away from 0* in GLMs and nonlinear models or if measurement error is differential.
 266
 267 --
 268
- *Confounding* if the *predictive model is biased*, introducing a correlation between the measurement error and the residuals $(E[\xi\varepsilon] \ne 0)$.
 270
 271
 272 ---
 273
 274 class:large
 275 layout:false
 276 <div class="my-header"></div>
 277
 278 ### .border[Correcting measurement error]
 279
 280 There's a vast literature in statistics on measurement error. Mostly about noise you'd find in sensors. Lots of ideas. No magic bullets.
 281
 282 --
 283
 284 I'm going to briefly cover 3 different approaches: *multiple imputation*,  *regression calibration* and *2SLS+GMM*.
 285
 286 --
 287
 288 These all depend on *validation data*. I'm going to ignore where this comes from, but assume it's a random sample of the hypothesis testing dataset.
 289
 290 --
 291
 292 You can *and should* use it to improve your statistical estimates.
 293
 294 ---
 295
 296 <div class="my-header"></div>
 297
 298 ### .border[Multiple Imputation (MI) treats Measurement Error as a Missing Data Problem]
 299
 300 1. Use validation data to estimate $f(x|w,y)$, a probabilistic model of $x$.
 301
 302 --
 303
 304 2. *Sample* $m$ datasets from $\widehat{f(x|w,y)}$.
 305
 306 --
 307
 308 3. Run your analysis on each of the $m$ datasets.
 309
 310 --
 311
 312 4. Average the results from the $m$ analyses using Rubin's rules.
 313
 314 --
 315
 316 .e[Advantages:] *Very flexible!* Sometimes can work if the predictor $g(k) $ is biased. Good R packages (**`{Amelia}`**, `{mi}`, `{mice}`, `{brms}`).
 317
 318 --
 319
.e[Disadvantages:] Results depend on the quality of $\widehat{f(x|w,y)}$; may require more validation data; computationally expensive; statistically inefficient; doesn't seem to benefit much from larger datasets.
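
The four steps can be sketched by hand in base R. This is a simplified illustration and not how `{Amelia}` or `{mice}` work internally: the data are simulated, the imputation model is a plain linear regression, and the draws ignore uncertainty in the imputation model's parameters (so it is "improper" imputation). All names and values below are hypothetical.

```r
# Hand-rolled multiple imputation for a mismeasured covariate (illustrative only).
set.seed(2)
N <- 5000; n_val <- 1000; B <- 0.2
x <- rnorm(N)
w <- x + rnorm(N)                    # proxy from the predictive model
y <- B * x + rnorm(N)
is_val <- seq_len(N) <= n_val        # rows with human labels
x_obs  <- ifelse(is_val, x, NA)      # x is only observed in the validation rows

# Step 1: estimate f(x | w, y) on the validation data (lm drops the NA rows).
imp_model <- lm(x_obs ~ w + y)
sigma_hat <- summary(imp_model)$sigma
mu <- predict(imp_model, newdata = data.frame(w = w, y = y))

# Steps 2 and 3: sample m completed datasets and run the analysis on each.
m <- 20
est <- numeric(m)
v   <- numeric(m)
for (i in seq_len(m)) {
  x_imp <- ifelse(is_val, x, rnorm(N, mean = mu, sd = sigma_hat))
  fit <- lm(y ~ x_imp)
  est[i] <- coef(fit)["x_imp"]
  v[i]   <- vcov(fit)["x_imp", "x_imp"]
}

# Step 4: Rubin's rules. Total variance = within + (1 + 1/m) * between.
pooled_est  <- mean(est)
within_var  <- mean(v)
between_var <- var(est)
pooled_se   <- sqrt(within_var + (1 + 1 / m) * between_var)
print(c(estimate = pooled_est, se = pooled_se))
```

Including $y$ in the imputation model matters: imputing $x$ from $w$ alone would build the attenuation back into the completed datasets.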
 321
 322 ---
 323
 324 ### .border[Regression calibration directly adjusts for attenuation bias.]
 325
 326 1. Use validation data to estimate the errors $\hat{\xi}$.
 327
 328 --
 329
 330 2. Use $\hat{\xi}$ to correct the OLS estimate.
 331
 332 --
 333
 334 3. Correct the standard errors using MLE or bootstrapping.
 335
 336 --
 337
 338 .e[Advantages:] Simple, fast.
 339
 340 --
 341
 342 .e[Disadvantages:] Limited to OLS models. Requires an unbiased predictor $g(k)$. R support (`{mecor}` R package) is pretty new.
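
As a rough illustration of the steps, here is a hand-written version of textbook regression calibration with a bootstrap for the standard error. It is a sketch on simulated data, not the `{mecor}` API: it assumes classical error ($w = x + \xi$), a linear calibration model fit on the validation rows, and hypothetical values for `N`, `n_val` and `B`.

```r
# Regression calibration by hand (illustrative; not the {mecor} interface).
set.seed(3)
N <- 10000; n_val <- 500; B <- 0.2
x <- rnorm(N)
w <- x + rnorm(N)
y <- B * x + rnorm(N)
val <- seq_len(n_val)                 # rows with human labels

rc_estimate <- function(x_val, w_val, w_all, y_all) {
  # Calibration model: E[x | w], estimated on the validation data.
  calib <- lm(x_val ~ w_val)
  x_hat <- coef(calib)[1] + coef(calib)[2] * w_all
  # Outcome model on the calibrated covariate.
  unname(coef(lm(y_all ~ x_hat))["x_hat"])
}

b_naive <- unname(coef(lm(y ~ w))["w"])
b_rc    <- rc_estimate(x[val], w[val], w, y)

# Bootstrap both stages so the standard error reflects calibration uncertainty.
boot <- replicate(200, {
  i_val <- sample(val, replace = TRUE)
  i_all <- sample(N, replace = TRUE)
  rc_estimate(x[i_val], w[i_val], w[i_all], y[i_all])
})
se_rc <- sd(boot)
print(c(naive = b_naive, calibrated = b_rc, se = se_rc))
```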
 343
 344 ---
 345 layout:true
 346 ### .border[2SLS+GMM is designed for this specific problem]
 347
 348 .left-column[![](images/Fong_Taylor.png)]
 349
 350 *Regression calibration with a trick.*
 351
 352 ---
 353 .right-column[
 354
 355 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 356
 357 ]
 358
 359 ---
 360 .right-column[
 361
 362 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 363
 364 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$. (Second-stage LS / regression calibration).
 365
 366 ]
 367
 368 ---
 369 .right-column[
 370
 371 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 372
 373 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 374
 375 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 376
 377 ]
 378
 379 ---
 380 .right-column[
 381
 382 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 383
 384 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 385
 386 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 387
 388 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 389
 390 ]
 391
 392 ---
 393 .right-column[
 394
 395 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 396
 397 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 398
 399 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 400
 401 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 402
Advantages: Accurate. Sometimes robust if the predictor $g(k)$ is biased.  In theory, flexible to any models that can be fit using GMM.
 404
 405 ]
 406
 407
 408 ---
 409 .right-column[
 410
 411 1. Estimate $x = w + \xi$ to obtain $\hat{x}$. (First-stage LS).
 412
 413 2. Estimate $y = B^{2sls}\hat{x} + \varepsilon^{2sls}$.  (Second-stage LS / regression calibration).
 414
 415 3. Estimate $y = B^{val}x^* + \varepsilon^{val}$. (Validation dataset model).
 416
 417 4. Combine $B^{val}$ and $B^{2sls}$ using the generalized method of moments (GMM).
 418
Advantages: Accurate. Sometimes robust if the predictor $g(k)$ is biased.  In theory, flexible to any models that can be fit using GMM.
 420
 421 Disadvantages: Implementation (`{predictionError}`) is new. API is cumbersome and only supports linear models. Not robust if $E(w\varepsilon) \ne 0$. GMM may be unfamiliar to audiences.
 422
 423 ]
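
The four steps can be walked through on simulated data. Note the loud assumption in step 4: rather than a real GMM fit, this sketch combines the two estimates with inverse-variance weights, which ignores the dependence between them and the first-stage uncertainty. It is a stand-in for intuition only and is not what `{predictionError}` does; all names and values are hypothetical.

```r
# Two-stage estimate plus a simplified combination step (illustrative only).
set.seed(4)
N <- 10000; n_val <- 500; B <- 0.2
d <- data.frame(x = rnorm(N))
d$w <- d$x + rnorm(N)
d$y <- B * d$x + rnorm(N)
val  <- seq_len(n_val)                 # rows with human labels
rest <- setdiff(seq_len(N), val)       # rows with predictions only

# Step 1: first stage, estimated on the validation rows.
first <- lm(x ~ w, data = d[val, ])
d$x_hat <- predict(first, newdata = d)

# Step 2: second stage on the rows without labels.
second <- lm(y ~ x_hat, data = d[rest, ])
b_2sls <- unname(coef(second)["x_hat"])
v_2sls <- vcov(second)["x_hat", "x_hat"]

# Step 3: validation-only model using the true labels.
val_fit <- lm(y ~ x, data = d[val, ])
b_val <- unname(coef(val_fit)["x"])
v_val <- vcov(val_fit)["x", "x"]

# Step 4 (stand-in for GMM): inverse-variance weighted average.
wts <- c(1 / v_2sls, 1 / v_val)
b_comb <- sum(wts * c(b_2sls, b_val)) / sum(wts)
print(c(two_stage = b_2sls, validation = b_val, combined = b_comb))
```

The combined estimate leans on whichever piece is more precise, which is usually the two-stage estimate when the unlabeled data are much larger than the validation sample.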
 424
 425 ---
 426 layout:false
### .border[Testing attenuation bias correction]
 428
 429 <div class="my-header"></div>
 430
 431 I've run simulations to test these approaches in several scenarios.
 432
 433 I simulate random data, fit 100 models and plot the average estimate and its variance.
 434
The predictive model is not very good: about 70% accurate.
 436
 437 Most plausible scenario:
 438
 439 y is continuous and normal-ish.
 440
 441 --
 442
 443 $x$ is binary (human labels) $P(x)=0.5$.
 444
 445 --
 446
$w$ is the *continuous predictor* (e.g., probability) output of $g(k)$ (not binary predictions).
 448
 449 --
 450
If $w$ is binary, most methods struggle, but regression calibration and 2SLS+GMM can do okay.
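
For readers who want to reproduce a setup like this, one possible data-generating process is sketched below. It is an assumption-laden illustration, not the simulation code behind these slides: the latent-score construction, `delta`, and the noise scale for `y` are all hypothetical choices that happen to give a classifier with roughly 70% accuracy.

```r
# One hypothetical way to simulate the scenario described on this slide.
set.seed(5)
N <- 10000
x <- rbinom(N, 1, 0.5)                    # binary human label, P(x = 1) = 0.5

# Latent classifier score: shifted by +/- delta depending on x, plus noise.
# Thresholding at 0 is correct with probability pnorm(delta) = 0.7.
delta <- qnorm(0.7)
score <- (2 * x - 1) * delta + rnorm(N)

w        <- pnorm(score)                  # continuous, probability-like output
w_binary <- as.integer(w > 0.5)           # thresholded prediction
accuracy <- mean(w_binary == x)

y <- 0.2 * x + rnorm(N)                   # continuous, roughly normal outcome
print(c(accuracy = accuracy, mean_x = mean(x)))
```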
 452
 453 ---
 454 layout:false
 455
 456 ### .border[Example 1: estimator of the effect of x]
 457
 458 .right-column[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=7.5, fig.asp=.625, cache=FALSE}
# Order the methods for plotting and treat N and m as factors for faceting.
# Note: ':=' updates plot.df.example.1 by reference.
plot.df <- plot.df.example.1[, ':='(method = factor(method, levels = c("Naive", "Multiple imputation", "Multiple imputation (Classifier features unobserved)", "Regression Calibration", "2SLS+gmm", "Feasible"), ordered = TRUE),
                                    N = factor(N),
                                    m = factor(m))]

# Keep the estimates for x in the settings shown on the slide.
plot.df <- plot.df[(variable == 'x') & (m != 1000) & (m != 500) & (N != 10000) & !is.na(p.true.in.ci) & (method != "Multiple imputation (Classifier features unobserved)")]

# Point: mean estimate; range: mean estimate +/- half of var.est.
p <- ggplot(plot.df, aes(y = mean.est, ymax = mean.est + var.est / 2, ymin = mean.est - var.est / 2, x = method))
# Dashed reference line at 0.2.
p <- p + geom_hline(aes(yintercept = 0.2), linetype = 2)
p <- p + geom_pointrange() + facet_grid(m ~ N, as.table = FALSE) + scale_x_discrete(labels = label_wrap_gen(4))

print(p)
```
 477 ]
 478 .left-column[
 479
All methods work in this scenario.
 481
 482 Multiple imputation is inefficient.
 483
 484 ]
 485
 486
 487 ---
 488 ### .border[What about bias?]
 489
 490 .left-column[
 491 .large[![](images/example_2_dag.png)]
 492 ]
 493
 494 .right-column[
 495 A few notes on this scenario.
 496
$B_x = 0.2$, $B_g=-0.2$ and $sd(\varepsilon)=3$. So the signal-to-noise ratio is low.
 498
$r$ can be conceived of as a missing feature in the predictive model $g(k)$ that is also correlated with $y$.
 500
For example, $r$ might be the *race* of a commenter, $x$ could be *racial harassment*, and $y$ whether the commenter gets banned; $k$ only has textual features, but human coders can see user profiles to know $r$.
 502
 503 ]
 504
 505 ---
 506 layout:false
 507 ### .border[Example 2: Estimates of the effect of x ]
 508
 509 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for plotting and treat N and m as factors for faceting.
# Note: ':=' updates plot.df.example.2B by reference.
plot.df <- plot.df.example.2B[, ':='(method = factor(method, levels = c("Naive", "Multiple imputation", "Multiple imputation (Classifier features unobserved)", "Regression Calibration", "2SLS+gmm", "Feasible"), ordered = TRUE),
                                     N = factor(N),
                                     m = factor(m))]

# Keep the estimates for x in the settings shown on the slide.
plot.df <- plot.df[(variable == 'x') & (m != 1000) & (m != 500) & (N != 10000) & !is.na(p.true.in.ci) & (method != "Multiple imputation (Classifier features unobserved)")]

# Point: mean estimate; range: mean estimate +/- half of var.est.
p <- ggplot(plot.df, aes(y = mean.est, ymax = mean.est + var.est / 2, ymin = mean.est - var.est / 2, x = method))
# Dashed line marks the true value, B_x = 0.2.
p <- p + geom_hline(aes(yintercept = 0.2), linetype = 2)
p <- p + geom_pointrange() + facet_grid(m ~ N, as.table = FALSE) + scale_x_discrete(labels = label_wrap_gen(4))

print(p)
```
 528 ]
 529 ---
 530 layout:false
 531
 532 ### .border[Example 2: Estimates of the effect of r]
 533
 534 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for plotting and treat N and m as factors for faceting.
# Note: ':=' updates plot.df.example.2B by reference.
plot.df <- plot.df.example.2B[, ':='(method = factor(method, levels = c("Naive", "Multiple imputation", "Multiple imputation (Classifier features unobserved)", "Regression Calibration", "2SLS+gmm", "Feasible"), ordered = TRUE),
                                     N = factor(N),
                                     m = factor(m))]

# Keep the estimates for the second covariate (stored as variable 'g').
plot.df <- plot.df[(variable == 'g') & (m != 1000) & (m != 500) & (N != 10000) & !is.na(p.true.in.ci) & (method != "Multiple imputation (Classifier features unobserved)")]

# Point: mean estimate; range: mean estimate +/- half of var.est.
p <- ggplot(plot.df, aes(y = mean.est, ymax = mean.est + var.est / 2, ymin = mean.est - var.est / 2, x = method))
# Dashed line marks the true value, B_g = -0.2.
p <- p + geom_hline(aes(yintercept = -0.2), linetype = 2)
p <- p + geom_pointrange() + facet_grid(m ~ N, as.table = FALSE) + scale_x_discrete(labels = label_wrap_gen(4))

print(p)
```
 551 ]
 552 ---
 553
 554 layout:false
 555 class:large
 556
### .border[Takeaways from example 2]
 558
 559 Bias in the predictive model creates bias in hypothesis tests.
 560
 561 --
 562
 563 Bias can be corrected *in this case*.
 564
 565 --
 566
The next scenario has bias that's trickier.
 568
 569 --
 570
 571 Multiple imputation helps, but doesn't fully correct the bias.
 572
 573 ---
 574
 575 layout:false
 576
### .border[When will 2SLS+GMM fail?]
 578
 579 .large[.left-column[![](images/example_3_dag.png)]]
 580
 581 .right-column[The catch with GMM:
 582
 583 .emph[Exclusion restriction:] $E[w \varepsilon] = 0$.
 584
The restriction is violated if a variable $U$ causes both $K$ and $Y$ and $X$ causes $K$ (not vice versa).
 586
 587 ]
 588
 589 ???
 590
GMM optimizes a model to a system of equations of which the exclusion restriction is one.  So if that assumption isn't true, it will be biased.
 592
 593 This is a different assumption than that of OLS or GLM models.
 594
 595 ---
 596
 597 layout:false
 598
 599 ### .border[Example 3: Estimates of the effect of x]
 600
 601 .center[
```{r echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='svg', fig.width=8, fig.asp=.625, cache=FALSE}
# Order the methods for plotting and treat N and m as factors for faceting.
# Note: ':=' updates plot.df.example.3 by reference.
plot.df <- plot.df.example.3[, ':='(method = factor(method, levels = c("Naive", "Multiple imputation", "Multiple imputation (Classifier features unobserved)", "Regression Calibration", "2SLS+gmm", "Feasible"), ordered = TRUE),
                                    N = factor(N),
                                    m = factor(m))]

# Keep the estimates for x in the settings shown on the slide.
plot.df <- plot.df[(variable == 'x') & (m != 1000) & (m != 500) & (N != 10000) & (method != "Multiple imputation (Classifier features unobserved)")]

# Point: mean estimate; range: mean estimate +/- half of var.est.
p <- ggplot(plot.df, aes(y = mean.est, ymax = mean.est + var.est / 2, ymin = mean.est - var.est / 2, x = method))
# Dashed reference line at 0.2.
p <- p + geom_hline(aes(yintercept = 0.2), linetype = 2)
p <- p + geom_pointrange() + facet_grid(m ~ N, as.table = FALSE) + scale_x_discrete(labels = label_wrap_gen(4))

print(p)
```
 618 ]
 619
 620
 621
 622 ---
 623
### .border[Takeaways]
 625
 626 - Attenuation bias can be a big problem with noisy predictors—leading to small and biased estimates.
 627
 628 - For more general hypothesis tests or if the predictor is biased, measurement error can lead to false discovery.
 629
 630 - It's fixable with validation data—you may not need that much and you should already be getting it.
 631
- This means it can be okay to use poor predictors for hypothesis testing.
 633
 634 - The ecosystem is underdeveloped, but a lot of methods have been researched.
 635
 636 - Take advantage of machine learning + big data and get precise estimates when the signal-to-noise ratio is high!
 637
 638 ---
 639 layout:false
 640
 641 ### .border[Future work: Noise in the *outcome*]
 642
 643 I've been focusing on noise in *covariates.* What if the predictive algorithm is used to measure the *outcome* $y$?
 644
 645 --
 646
 647 This isn't a problem in the simplest case (linear regression with homoskedastic errors).  Noise in $y$ is projected into the error term.
 648
 649 --
 650
 651 Noise in the outcome is still a problem if errors are heteroskedastic and for GLMs / non-linear regression (e.g., logistic regression).
 652
 653 --
 654
Multiple imputation (in theory) could help here. The other methods aren't designed for this case.
 656
 657 --
 658
 659 Solving this problem could be an important methodological contribution with a very broad impact.
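
The contrast between the linear and the non-linear case can be seen in a small simulation. This is a sketch with made-up settings: the noise added to the continuous outcome is independent of $x$, and the binary outcome has 20% of its labels flipped at random, which is only one simple form of outcome misclassification.

```r
# Noise in the outcome: harmless for OLS here, biasing for logistic regression.
set.seed(6)
N <- 20000
x <- rnorm(N)

# Linear model: independent noise in y is absorbed by the error term.
y       <- 0.2 * x + rnorm(N)
y_noisy <- y + rnorm(N)
b_lin_true  <- unname(coef(lm(y ~ x))["x"])
b_lin_noisy <- unname(coef(lm(y_noisy ~ x))["x"])

# Logistic model: randomly flipping 20% of labels pulls the slope toward zero.
y_bin       <- rbinom(N, 1, plogis(1 * x))
flip        <- rbinom(N, 1, 0.2) == 1
y_bin_noisy <- ifelse(flip, 1 - y_bin, y_bin)
b_logit_true  <- unname(coef(glm(y_bin ~ x, family = binomial))["x"])
b_logit_noisy <- unname(coef(glm(y_bin_noisy ~ x, family = binomial))["x"])

print(c(lin_true = b_lin_true, lin_noisy = b_lin_noisy,
        logit_true = b_logit_true, logit_noisy = b_logit_noisy))
```

The linear slope keeps its expected value but becomes less precise, while the logistic slope is systematically too small.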
 660
 661 ---
 662 # .border[Questions?]
 663
Links to slides: [html](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.html) [pdf](https://teblunthuis.cc/~nathante/slides/ecological_adaptation_ica_2022.pdf)
 665
 666 Link to a messy git repository:[https://code.communitydata.science/ml_measurement_error_public.git](https://code.communitydata.science/ml_measurement_error_public.git)
 667
 668 `r icons::fontawesome("envelope")` nathan.teblunthuis@northwestern.edu
 669
 670 `r icons::fontawesome("twitter")` @groceryheist
 671
 672 `r icons::fontawesome("globe")` [https://communitydata.science](https://communitydata.science)
 673
 674
 675
 676 <!-- ### .border[Multiple imputation struggles with discrete variables] -->
 677
 678 <!-- In my experiments I've found that the 2SLS+GMM method works well with a broader range of data types.  -->
 679
 680 <!-- To illustrate, Example 3 is the same as Example 2, but with $x$ and $w$ as discrete variables.  -->
 681
 682 <!-- Practicallly speaking, a continuous "score" $w$ is often available, and my opinion is that usually this is better + more informative than model predictions in all cases.  Continuous validation data may be more difficult to obtain, but it is often possible using techniques like pairwise comparison. -->
 683 <!-- layout:false -->
 684 <!-- ### .border[Example 3: Estimates of the effect of x ] -->
 685
 686 <!-- .center[ -->
 687 <!-- ```{r echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='svg', fig.width=8, fig.asp=.625,cache=F} -->
 688
 689 <!-- #plot.df <-  -->
 690 <!-- plot.df <- plot.df.example.2[,':='(method=factor(method,levels=c("Naive","Multiple imputation", "Multiple imputation (Classifier features unobserved)","Regression Calibration","2SLS+gmm","Feasible"),ordered=T), -->
 691 <!--                                      N=factor(N), -->
 692 <!--                                      m=factor(m))] -->
 693
 694 <!-- plot.df <- plot.df[(variable=='x') & (m != 1000) & (m!=500) & (N!=5000) & (N!=10000) & !is.na(p.true.in.ci) & (method!="Multiple imputation (Classifier features unobserved)")] -->
 695 <!-- p <- ggplot(plot.df, aes(y=mean.est, ymax=mean.est + var.est/2, ymin=mean.est-var.est/2, x=method)) -->
 696 <!-- p <- p + geom_hline(aes(yintercept=0.2),linetype=2) -->
 697
 698 <!-- p <- p + geom_pointrange() + facet_grid(m~N,as.table=F) + scale_x_discrete(labels=label_wrap_gen(4)) -->
 699
 700 <!-- print(p) -->
 701
 702 <!-- # get gtable object -->
 703
 704 <!-- .large[.left [![](images/example_2_dag.png)]] -->
 705
 706 <!-- There are at two general ways using a predictive model can introduce bias: *attenuation*, and *confounding.* -->
 707
 708 <!-- Counfounding can be broken down into 4 types: -->
 709
 710 <!-- .right[Confounding on $X$ by observed variables -->
 711
 712 <!--    Confounding on $Y$ by observed variables -->
 713 <!-- ] -->
 714
 715 <!-- .left[Confounding on $X$ by *un*observed variables -->
 716
 717 <!--    Confounding on $Y$ by *un*observed variables -->
 718 <!-- ] -->
 719
 720 <!-- Attenuation and the top-right column can be dealt with relative ease using a few different methods. -->
 721
 722 <!-- The bottom-left column can be addressed, but so far I haven't found a magic bullet. -->
 723
 724 <!-- The left column is pretty much a hopeless situation. -->