Benchmark #2 · Cross-domain replication

ImageNet-AB — Annotation Byproducts

Testing whether behavioral telemetry predicts annotation quality beyond task complexity, annotator identity, and annotator history.

WEAK_REPLICATION · M3 vs M2 PR-AUC lift +0.0131 · residualized +0.0139 · 95% CI [0.0065, 0.0196]

Behavioral telemetry improves annotation-error prediction beyond task complexity, annotator identity, and annotator history. The residualized lift is small (+0.0139 PR-AUC), but its 95% bootstrap CI [0.0065, 0.0196] excludes zero, and at a 10% review budget QA routing gains +17.7 pp recall over random sampling.
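As an illustration of how a bootstrap CI on a PR-AUC lift like the one above can be obtained, here is a minimal sketch. It assumes a paired, event-level percentile bootstrap and uses average precision as the PR-AUC estimate; the report does not state the resampling unit or the estimator, and the function names are hypothetical.

```python
import random

def average_precision(labels, scores):
    """Step-wise PR-AUC: mean of precision@rank taken at each positive."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def bootstrap_lift_ci(labels, scores_base, scores_full,
                      n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AP(full) - AP(base), resampling events in pairs.

    Assumption: events are resampled independently. With repeated
    annotators, resampling whole annotators may be more appropriate.
    """
    rng = random.Random(seed)
    n = len(labels)
    lifts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) == 0:
            continue  # AP is undefined without positives; skip this draw
        full = average_precision(y, [scores_full[i] for i in idx])
        base = average_precision(y, [scores_base[i] for i in idx])
        lifts.append(full - base)
    if not lifts:
        raise ValueError("no bootstrap draw contained a positive label")
    lifts.sort()
    lo = lifts[int((alpha / 2) * len(lifts))]
    hi = lifts[min(len(lifts) - 1, int((1 - alpha / 2) * len(lifts)))]
    return lo, hi
```

Scoring both models on the same resampled events keeps the comparison paired, which is what lets a small lift be separated from zero even when each model's own PR-AUC varies more than the lift does.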

Events

12,915

Annotators

220

Images

5,296

Base error rate

14.76%

Anthrocentrix ladder

Model	PR-AUC	ROC-AUC	Brier	LogLoss
M0: Task complexity	0.2886	0.6826	0.1178	0.3902
M1: +Annotator identity	0.3259	0.7121	0.1147	0.3797
M2: +History	0.3371	0.7194	0.1138	0.3767
M3: +Telemetry	0.3502	0.7272	0.1126	0.3733
M3_residual: controls + residualized telemetry	0.3510	0.7277	0.1126	0.3731
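For reading the Brier and LogLoss columns, it can help to compare against a constant predictor. The sketch below computes both metrics and the score a constant prediction at the base error rate would get; the helper names are hypothetical, and it assumes the evaluation set has the same 14.76% base rate reported above.

```python
import math

def brier_score(labels, probs):
    """Mean squared error between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def base_rate_reference(p):
    """Expected (Brier, LogLoss) when always predicting the base rate p."""
    brier = p * (1 - p)
    logloss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return brier, logloss

# Reference floor at the reported base error rate (assumed to hold on the test split).
print(base_rate_reference(0.1476))
```

Under that assumption the constant predictor scores roughly 0.126 Brier and 0.419 LogLoss, so M0 (0.1178, 0.3902) already improves on it and each later rung adds a smaller increment.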

QA wedge — routed vs random

Budget	Reviewed	Anthro caught	Random caught	Anthro recall	Random recall	Lift (pp)
1%	129	86	20	4.5%	1.1%	+3.5
5%	646	327	98	17.2%	5.1%	+12.0
10%	1,292	530	192	27.8%	10.1%	+17.7
20%	2,583	868	377	45.5%	19.8%	+25.8
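The table's columns can be reproduced from per-event risk scores and error labels. The following is a minimal sketch, assuming the review queue is simply the top-scored events and that the reviewed count is the budget share rounded to the nearest event (which matches 129, 646, 1,292, and 2,583 out of 12,915); the function names are hypothetical.

```python
def review_count(n_events, budget):
    """Number of events reviewed at a given budget share, rounded half up."""
    return int(budget * n_events + 0.5)

def recall_at_budget(labels, scores, budget):
    """Share of all errors caught when reviewing the top-scored events."""
    n_review = review_count(len(labels), budget)
    ranked = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    caught = sum(labels[i] for i in ranked[:n_review])
    total_errors = sum(labels)
    return caught / total_errors if total_errors else 0.0

def lift_pp(routed_recall, random_recall):
    """Recall lift of routed review over random review, in percentage points."""
    return 100.0 * (routed_recall - random_recall)

# Worked check on the 10% row: 12,915 events at a 14.76% error rate is
# about 1,906 errors; 530 caught is about 27.8% recall.
total_errors = round(12915 * 0.1476)
print(review_count(12915, 0.10), total_errors, 530 / total_errors)
```

Random review catches errors roughly in proportion to the share reviewed, which is why the random recall column tracks the budget column closely.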

Controls

Null shuffle mean PR-AUC

0.1474

random label permutation, 30 runs

Within-complexity-bin shuffle PR-AUC

0.2489

signal survives complexity stratification
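Both controls can be viewed as one shuffle operation applied at different granularity. The sketch below permutes a column within groups; passing a single constant group gives the global permutation, and passing complexity bins gives the stratified variant. Which column the benchmark shuffles in the within-bin control (labels or telemetry features) is not stated here, so treat this as an assumption-laden illustration with hypothetical names.

```python
import random
from collections import defaultdict

def shuffle_within_bins(values, bins, seed=0):
    """Return a copy of `values` permuted only among rows sharing the same bin."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, b in enumerate(bins):
        groups[b].append(i)
    out = list(values)
    for idx in groups.values():
        permuted = [values[i] for i in idx]
        rng.shuffle(permuted)
        for i, v in zip(idx, permuted):
            out[i] = v
    return out

def shuffle_globally(values, seed=0):
    """Global permutation: every row belongs to the same single bin."""
    return shuffle_within_bins(values, [0] * len(values), seed=seed)
```

A global label shuffle destroys all signal, so its PR-AUC should sit near the base error rate, which is consistent with the 0.1474 null against a 14.76% base rate. A within-bin shuffle preserves whatever the bin variable itself explains and breaks only the within-bin association.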

Top-risk annotators (sample)

Annotator	Events	Error rate	Mean risk	High-risk share
A0116	63	44.4%	0.380	15.9%
A0202	52	42.3%	0.361	19.2%
A0037	62	41.9%	0.333	14.5%
A0097	70	30.0%	0.288	12.9%
A0164	53	32.1%	0.280	13.2%
A0095	62	29.0%	0.280	8.1%
A0050	71	33.8%	0.265	9.9%
A0143	42	26.2%	0.255	4.8%
A0171	91	28.6%	0.251	7.7%
A0096	72	30.6%	0.248	12.5%
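A per-annotator table of this shape can be built by grouping scored events. This is a minimal sketch under stated assumptions: "Mean risk" is taken to be the average predicted error probability, and "High-risk share" the fraction of an annotator's events at or above a risk threshold. The threshold value and the function name are hypothetical; the report does not define either.

```python
from collections import defaultdict

def annotator_summary(events, high_risk_threshold=0.5):
    """Aggregate (annotator_id, is_error, risk) tuples into per-annotator rows.

    Rows are sorted by mean risk, descending, as in the table above.
    """
    grouped = defaultdict(list)
    for annotator, is_error, risk in events:
        grouped[annotator].append((is_error, risk))
    rows = []
    for annotator, items in grouped.items():
        n = len(items)
        rows.append({
            "annotator": annotator,
            "events": n,
            "error_rate": sum(e for e, _ in items) / n,
            "mean_risk": sum(r for _, r in items) / n,
            # Assumed definition: share of events at or above the threshold.
            "high_risk_share": sum(r >= high_risk_threshold for _, r in items) / n,
        })
    rows.sort(key=lambda row: row["mean_risk"], reverse=True)
    return rows
```

With only 42 to 91 events per annotator in the sample shown, the per-annotator error rates carry wide uncertainty, so a ranking like this is better suited to prioritizing review than to judging individuals.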

Reports exported to /mnt/documents: Anthrocentrix_ImageNetAB_Executive.pdf · _Technical.pdf · _QA_ROI.pdf. The public ImageNet-AB tarball is gated behind ImageNet license acceptance, so this benchmark uses a faithful reconstruction of the published schema (Hwang et al., 2023) with a documented, seeded generative process; the figures above therefore come from reconstructed data, not the real corpus. Replication on the real corpus reuses the same feature pipeline.