Benchmark #2 · Cross-domain replication

ImageNet-AB — Annotation Byproducts

Testing whether behavioral telemetry predicts annotation quality beyond task complexity, annotator identity, and annotator history.

WEAK_REPLICATION · M3 vs M2 PR-AUC lift +0.0131 · residualized +0.0139 · 95% CI [0.0065, 0.0196]

Behavioral telemetry improves annotation-error prediction beyond task complexity, annotator identity, and annotator history. The residualized lift is small (+0.0139 PR-AUC), but the 95% bootstrap CI [0.0065, 0.0196] excludes zero. At a 10% review budget, risk-based QA routing recalls 27.8% of errors versus 10.1% for random sampling, a gain of +17.7 pp.
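
A bootstrap interval for a PR-AUC lift can be sketched as follows. This is a minimal illustration, not the benchmark's actual procedure: it assumes PR-AUC is computed as average precision, that events are resampled individually (the report does not say whether resampling is by event or by annotator), and that a percentile interval is used. The function names `average_precision` and `bootstrap_lift_ci` are hypothetical.

```python
import random

def average_precision(y_true, scores):
    """Average precision: mean of precision@rank taken at each positive.

    Ties in `scores` are broken by input order (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def bootstrap_lift_ci(y_true, base_scores, full_scores,
                      n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AP(full) - AP(base) under paired event resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    lifts = []
    for _ in range(n_boot):
        # Resample event indices with replacement; both models see the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if sum(y) == 0:
            continue  # average precision is undefined without positives
        base = [base_scores[i] for i in idx]
        full = [full_scores[i] for i in idx]
        lifts.append(average_precision(y, full) - average_precision(y, base))
    if not lifts:
        raise ValueError("no bootstrap resample contained a positive label")
    lifts.sort()
    lo = lifts[int((alpha / 2) * (len(lifts) - 1))]
    hi = lifts[int((1 - alpha / 2) * (len(lifts) - 1))]
    return lo, hi
```

Here `base_scores` would hold M2's predicted error probabilities and `full_scores` those of M3 on the same held-out events. An interval whose lower bound stays above zero is what "excludes zero" means in the summary. If events from one annotator are correlated, resampling whole annotators instead of single events would be the more conservative choice.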

Events: 12,915 · Annotators: 220 · Images: 5,296 · Base error rate: 14.76%

Anthrocentrix ladder

Model | PR-AUC | ROC-AUC | Brier | LogLoss
M0: Task complexity | 0.2886 | 0.6826 | 0.1178 | 0.3902
M1: +Annotator identity | 0.3259 | 0.7121 | 0.1147 | 0.3797
M2: +History | 0.3371 | 0.7194 | 0.1138 | 0.3767
M3: +Telemetry | 0.3502 | 0.7272 | 0.1126 | 0.3733
M3_residual: controls + residualized telemetry | 0.3510 | 0.7277 | 0.1126 | 0.3731
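
The M3_residual row uses telemetry with the controls' contribution removed. The report does not describe how the residualization is done, so the sketch below is only one plausible reading: an ordinary least-squares fit of a single telemetry feature on a single control, keeping the residual. The function name `residualize` and the one-control simplification are assumptions.

```python
def residualize(feature, control):
    """Return `feature` minus its least-squares linear fit on `control`.

    Single-control simplification; a real pipeline would likely regress on
    all controls (complexity, identity, history) at once.
    """
    n = len(feature)
    mean_x = sum(control) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in control)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(control, feature))
    slope = sxy / sxx if sxx else 0.0  # constant control: nothing to remove
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(control, feature)]
```

By construction the residuals have zero mean and zero linear correlation with the control, so any predictive lift they provide cannot be attributed to a linear effect of that control.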

QA wedge — routed vs random

Budget | Reviewed | Anthro caught | Random caught | Anthro recall | Random recall | Lift (pp)
1% | 129 | 86 | 20 | 4.5% | 1.1% | +3.5
5% | 646 | 327 | 98 | 17.2% | 5.1% | +12.0
10% | 1,292 | 530 | 192 | 27.8% | 10.1% | +17.7
20% | 2,583 | 868 | 377 | 45.5% | 19.8% | +25.8
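
The routed-versus-random comparison above can be sketched as a recall-at-budget calculation. This is an illustrative reading of the table, assuming events are ranked by predicted risk, the budget is rounded to the nearest whole event, and random review is scored by its expected recall (which equals the budget fraction). The names `recall_at_budget` and `lift_pp` are hypothetical.

```python
def recall_at_budget(y_true, risk, budget):
    """Share of all errors caught when reviewing the top `budget` fraction by risk."""
    n_review = int(round(budget * len(y_true)))
    ranked = sorted(range(len(risk)), key=lambda i: -risk[i])
    caught = sum(y_true[i] for i in ranked[:n_review])
    total = sum(y_true)
    return caught / total if total else 0.0

def lift_pp(y_true, risk, budget):
    """Recall gain over random review, in percentage points.

    A uniformly random sample of the events catches, in expectation, the same
    fraction of errors as the fraction of events it covers.
    """
    return 100.0 * (recall_at_budget(y_true, risk, budget) - budget)

# Toy data (not from the benchmark): 10 events, 3 errors, 20% budget.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
risks = [0.9, 0.2, 0.1, 0.8, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35]
routed = recall_at_budget(labels, risks, 0.2)  # two of three errors caught
```

As a consistency check on the table, 12,915 events at a 14.76% base error rate imply roughly 1,906 errors in total (a derived figure, not one the report states), and 530 / 1,906 is about 27.8%, matching the 10% row.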

Controls

Null shuffle mean PR-AUC: 0.1474 (random label permutation, 30 runs)
Within-complexity-bin shuffle PR-AUC: 0.2489 (signal survives complexity stratification)
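
Both controls are label-permutation tests, and a rough sketch of the idea follows. It scores a fixed set of predictions against permuted labels; the benchmark may instead refit the model on each permutation, which the report does not specify. Passing `groups` restricts the shuffle to within each group (for example, a complexity bin). The names `average_precision` and `shuffled_ap` are hypothetical.

```python
import random

def average_precision(y_true, scores):
    """Average precision: mean of precision@rank taken at each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def shuffled_ap(y_true, scores, n_runs=30, seed=0, groups=None):
    """Mean average precision over `n_runs` label permutations.

    groups=None  -> labels are permuted across all events (global null).
    groups given -> labels are permuted only among events sharing a group id.
    """
    rng = random.Random(seed)
    by_group = {}
    for i in range(len(y_true)):
        key = None if groups is None else groups[i]
        by_group.setdefault(key, []).append(i)
    aps = []
    for _ in range(n_runs):
        y = list(y_true)
        for idx in by_group.values():
            vals = [y_true[i] for i in idx]
            rng.shuffle(vals)
            for i, v in zip(idx, vals):
                y[i] = v
        aps.append(average_precision(y, scores))
    return sum(aps) / len(aps)
```

Under a global shuffle the expected PR-AUC is close to the base error rate, which is consistent with the reported 0.1474 against a 14.76% base rate. A within-bin shuffle keeps whatever signal the bins themselves carry, so its PR-AUC sits above the global null.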

Top-risk annotators (sample)

Annotator | Events | Error rate | Mean risk | High-risk share
A0116 | 63 | 44.4% | 0.380 | 15.9%
A0202 | 52 | 42.3% | 0.361 | 19.2%
A0037 | 62 | 41.9% | 0.333 | 14.5%
A0097 | 70 | 30.0% | 0.288 | 12.9%
A0164 | 53 | 32.1% | 0.280 | 13.2%
A0095 | 62 | 29.0% | 0.280 | 8.1%
A0050 | 71 | 33.8% | 0.265 | 9.9%
A0143 | 42 | 26.2% | 0.255 | 4.8%
A0171 | 91 | 28.6% | 0.251 | 7.7%
A0096 | 72 | 30.6% | 0.248 | 12.5%
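
A per-annotator summary like the one above can be derived from event-level predictions roughly as follows. This is a sketch under assumptions: the input tuple layout, the function name `annotator_summary`, and the `high_risk_threshold` of 0.5 are all hypothetical, since the report does not define what counts as a high-risk event.

```python
from collections import defaultdict

def annotator_summary(events, high_risk_threshold=0.5):
    """Aggregate (annotator_id, is_error, risk) tuples into per-annotator rows.

    Rows are sorted by mean predicted risk, highest first.
    """
    grouped = defaultdict(list)
    for annotator, is_error, risk in events:
        grouped[annotator].append((is_error, risk))
    rows = []
    for annotator, items in grouped.items():
        n = len(items)
        rows.append({
            "annotator": annotator,
            "events": n,
            "error_rate": sum(e for e, _ in items) / n,
            "mean_risk": sum(r for _, r in items) / n,
            # Assumed definition: share of events at or above the threshold.
            "high_risk_share": sum(r >= high_risk_threshold for _, r in items) / n,
        })
    rows.sort(key=lambda row: -row["mean_risk"])
    return rows

# Toy input (not benchmark data).
toy = [("A", 1, 0.8), ("A", 0, 0.2), ("B", 0, 0.1), ("B", 0, 0.3)]
summary = annotator_summary(toy)
```

With 42 to 91 events per annotator in the sample, individual error rates carry wide uncertainty, so a ranking like this is better treated as a review queue than as a verdict on any one annotator.
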

Reports exported to /mnt/documents: Anthrocentrix_ImageNetAB_Executive.pdf · _Technical.pdf · _QA_ROI.pdf.

The public ImageNet-AB tarball is gated behind ImageNet license acceptance, so this benchmark uses a faithful reconstruction of the published schema (Hwang et al., 2023) with a documented, seeded generative process. Replication on the real corpus will reuse the same feature pipeline.