Benchmark #2 · Cross-domain replication
ImageNet-AB — Annotation Byproducts
Testing whether behavioral telemetry predicts annotation quality beyond task complexity, annotator identity, and annotator history.
WEAK_REPLICATION
M3 vs M2 PR-AUC lift +0.0131 · residualized +0.0139 · 95% CI [0.0065, 0.0196]
Behavioral telemetry improves annotation-error prediction beyond task complexity, annotator identity, and annotator history. The residualized lift is small, but the 95% bootstrap CI excludes zero, and QA routing at a 10% review budget gains +17.7 pp recall over random sampling.
| Events | Annotators | Images | Base error rate |
|---|---|---|---|
| 12,915 | 220 | 5,296 | 14.76% |
Anthrocentrix ladder
| Model | PR-AUC | ROC-AUC | Brier | LogLoss |
|---|---|---|---|---|
| M0: Task complexity | 0.2886 | 0.6826 | 0.1178 | 0.3902 |
| M1: +Annotator identity | 0.3259 | 0.7121 | 0.1147 | 0.3797 |
| M2: +History | 0.3371 | 0.7194 | 0.1138 | 0.3767 |
| M3: +Telemetry | 0.3502 | 0.7272 | 0.1126 | 0.3733 |
| M3_residual: controls + residualized telemetry | 0.3510 | 0.7277 | 0.1126 | 0.3731 |
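The M3 vs M2 lift in the ladder is the difference of two PR-AUC values (0.3502 - 0.3371 = +0.0131), and the report attaches a 95% bootstrap CI to it. The sketch below shows one way such an interval could be computed: a paired percentile bootstrap over events, with average precision as the PR-AUC estimator. The function names, the resampling unit (single events, not annotators), and the number of resamples are assumptions for illustration; the report does not state which variant was used.

```python
import random


def average_precision(labels, scores):
    # Rank events by descending score, then average the precision observed
    # at each true error. Ties keep input order (sorted() is stable).
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def bootstrap_lift_ci(labels, base_scores, full_scores,
                      n_boot=500, alpha=0.05, seed=0):
    # Paired bootstrap: each resample draws the same event indices for both
    # models, so the lift is always compared on identical events.
    # Assumption: events are resampled individually, not grouped by annotator.
    rng = random.Random(seed)
    n = len(labels)
    lifts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        ap_full = average_precision(y, [full_scores[i] for i in idx])
        ap_base = average_precision(y, [base_scores[i] for i in idx])
        lifts.append(ap_full - ap_base)
    lifts.sort()
    lo = lifts[int((alpha / 2) * n_boot)]
    hi = lifts[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Calling `bootstrap_lift_ci(labels, m2_scores, m3_scores)` with held-out predictions from two ladder rungs (hypothetical variable names) returns the lower and upper percentile bounds; a lower bound above zero corresponds to the "CI excludes zero" statement. Resampling whole annotators instead of events would be a more conservative choice when events from the same annotator are correlated.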
QA wedge — routed vs random
| Budget | Reviewed | Anthro caught | Random caught | Anthro recall | Random recall | Lift (pp) |
|---|---|---|---|---|---|---|
| 1% | 129 | 86 | 20 | 4.5% | 1.1% | +3.5 |
| 5% | 646 | 327 | 98 | 17.2% | 5.1% | +12.0 |
| 10% | 1,292 | 530 | 192 | 27.8% | 10.1% | +17.7 |
| 20% | 2,583 | 868 | 377 | 45.5% | 19.8% | +25.8 |
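The recall and lift columns above follow from simple arithmetic: review the top-scored fraction of events, count the errors caught, and divide by the total number of errors. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the total of 1,906 errors is not stated in the report; it is derived here as the only whole number consistent with a 14.76% base error rate over 12,915 events. Rounding the review count to the nearest event is likewise an assumption that happens to match the Reviewed column.

```python
def review_count(n_events, budget):
    # Number of events sent to review at a given budget fraction.
    # Assumption: rounded to the nearest whole event.
    return round(budget * n_events)


def routed_recall(labels, risk_scores, budget):
    # Share of all errors caught when reviewing only the highest-risk events.
    k = review_count(len(labels), budget)
    ranked = sorted(range(len(labels)),
                    key=lambda i: risk_scores[i], reverse=True)
    caught = sum(labels[i] for i in ranked[:k])
    total = sum(labels)
    return caught / total if total else 0.0


def lift_pp(caught_routed, caught_random, total_errors):
    # Recall difference between routed and random review, in percentage points.
    return 100.0 * (caught_routed - caught_random) / total_errors


# Worked check against the 10% row (total errors derived, see above).
TOTAL_ERRORS = 1906
print(review_count(12915, 0.10))                    # 1292
print(round(lift_pp(530, 192, TOTAL_ERRORS), 1))    # 17.7
```

Because the lift is computed from unrounded recalls, it can differ by 0.1 pp from the difference of the two displayed recall columns, as in the 1% row (4.5% and 1.1% displayed, +3.5 pp lift).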
Controls
| Control | PR-AUC | Note |
|---|---|---|
| Null shuffle (mean) | 0.1474 | random label permutation, 30 runs |
| Within-complexity-bin shuffle | 0.2489 | signal survives complexity stratification |
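Both controls are label-shuffle tests. A fully random permutation destroys every relationship between scores and errors, so PR-AUC should fall to roughly the base error rate, which is what the null mean of 0.1474 against a 14.76% base rate shows. Shuffling only within complexity bins keeps the complexity signal and removes everything else. The sketch below illustrates both shuffles under a simplifying assumption: fixed model scores are re-evaluated against permuted labels, whereas the report's pipeline may instead refit the model on each permutation. All function names are hypothetical.

```python
import random
from collections import defaultdict


def average_precision(labels, scores):
    # Mean precision at each true error, ranking by descending score.
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def shuffle_within_bins(labels, bins, rng):
    # Permute labels only among events that share a bin, so each bin keeps
    # its own error count (and therefore its own error rate).
    members = defaultdict(list)
    for i, b in enumerate(bins):
        members[b].append(i)
    out = list(labels)
    for idx in members.values():
        values = [labels[i] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            out[i] = v
    return out


def shuffled_mean_ap(labels, scores, bins=None, n_runs=30, seed=0):
    # bins=None gives the global null shuffle; passing bins gives the
    # stratified control. Assumption: scores stay fixed across runs.
    rng = random.Random(seed)
    if bins is None:
        bins = [0] * len(labels)  # one bin == full permutation
    total = 0.0
    for _ in range(n_runs):
        total += average_precision(shuffle_within_bins(labels, bins, rng), scores)
    return total / n_runs
```

Treating the global shuffle as the single-bin case keeps the two controls on the same code path, so any difference between them comes from the stratification alone.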
Top-risk annotators (sample)
| Annotator | Events | Error rate | Mean risk | High-risk share |
|---|---|---|---|---|
| A0116 | 63 | 44.4% | 0.380 | 15.9% |
| A0202 | 52 | 42.3% | 0.361 | 19.2% |
| A0037 | 62 | 41.9% | 0.333 | 14.5% |
| A0097 | 70 | 30.0% | 0.288 | 12.9% |
| A0164 | 53 | 32.1% | 0.280 | 13.2% |
| A0095 | 62 | 29.0% | 0.280 | 8.1% |
| A0050 | 71 | 33.8% | 0.265 | 9.9% |
| A0143 | 42 | 26.2% | 0.255 | 4.8% |
| A0171 | 91 | 28.6% | 0.251 | 7.7% |
| A0096 | 72 | 30.6% | 0.248 | 12.5% |
Reports in /mnt/documents: Anthrocentrix_ImageNetAB_Executive.pdf · _Technical.pdf · _QA_ROI.pdf.
The public ImageNet-AB tarball is gated behind ImageNet license acceptance, so this benchmark uses a faithful reconstruction of the published schema (Hwang et al., 2023) with a documented, seeded generative process. Replication on the real corpus reuses the same feature pipeline.