Dataset Discovery Lab

Real public datasets ranked by Anthrocentrix Readiness + Opportunity. Scores are estimates drawn from public dataset documentation, not from end-to-end benchmark runs; the exceptions are Lichess and CallCenterEN, which were run end to end.

Datasets surveyed

40

Score ≥ 18/30

32

Top-10 mean opportunity

50.4

Top-10 mean P(meaningful)

66%

Top 10 most-likely cross-domain replications

#	Dataset	Category	Readiness	Opportunity	P(meaningful)	Verdict
1	ImageNet-AB (Annotation Byproducts)	Human Annotation	27/30	56/60	85%	READY FOR FULL ANTHROCENTRIX BENCHMARK
2	Lichess + Stockfish blunder labels	Operational Decision Systems	30/30	54/60	100%	ROBUST DECISION STATE SIGNAL
3	Trueblood et al. Medical Decision RT	Clinical Decision Making	28/30	52/60	80%	READY FOR FULL ANTHROCENTRIX BENCHMARK
4	FiFAR — Fairness in AI-assisted Fraud Review	Fraud Review	26/30	52/60	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
5	ABCD — Action-Based Conversations Dataset	Customer Service / Contact Center	25/30	51/60	55%	READY FOR REDUCED TELEMETRY BENCHMARK
6	AI.vs.Clinician (sepsis trial)	Clinical Decision Making	26/30	49/60	55%	READY FOR REDUCED TELEMETRY BENCHMARK
7	ReviewArena / ReviewBench (peer review)	Content Moderation / Peer Review	25/30	48/60	55%	READY FOR REDUCED TELEMETRY BENCHMARK
8	StarCraft II Replay Pack	Operational Decision Systems	29/30	48/60	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
9	Mind2Web	Human-AI Collaboration	24/30	47/60	40%	READY FOR REDUCED TELEMETRY BENCHMARK
10	WebArena (with human trajectories)	Human-AI Collaboration	25/30	47/60	45%	READY FOR REDUCED TELEMETRY BENCHMARK
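The two top-10 summary figures reported near the top of the page can be recomputed from this table. The following is a minimal sketch with the values copied from the rows above; the variable names are illustrative and not part of any Anthrocentrix tooling.

```python
from statistics import mean

# (opportunity out of 60, P(meaningful) in percent), copied from the top-10 table
top10 = [
    (56, 85), (54, 100), (52, 80), (52, 70), (51, 55),
    (49, 55), (48, 55), (48, 70), (47, 40), (47, 45),
]

mean_opportunity = mean(opp for opp, _ in top10)
mean_p_meaningful = mean(p for _, p in top10)

print(f"Top-10 mean opportunity: {mean_opportunity:.1f}")
print(f"Top-10 mean P(meaningful): {mean_p_meaningful:.1f}%")
```

The mean opportunity comes out at 50.4, and the mean P(meaningful) at 65.5%, which the page shows rounded to 66%.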

Final recommendations

Best scientific replication

ImageNet-AB (Annotation Byproducts)

Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

Per-worker mouse traces, click locations, and annotation times over all of ImageNet-1K — the cleanest non-chess actor + telemetry + label triple.

Best commercial replication

FiFAR — Fairness in AI-assisted Fraud Review

Readiness 26/30 · Opportunity 52/60 · P(meaningful) 70%

AI-assisted fraud review with explicit analyst IDs and decisions — direct buyer story for QA routing.

Fastest benchmark to run

Trueblood et al. Medical Decision RT

Readiness 28/30 · Opportunity 52/60 · P(meaningful) 80%

Small, tabular, per-trial RT + accuracy with pathologist IDs. Ladder fits in minutes.

Highest-value if successful

ImageNet-AB (Annotation Byproducts)

Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

A replication on ImageNet-AB anchors Anthrocentrix in the AI data-labeling market — the same market the QA Wedge already targets.

Recommended next to ingest

ImageNet-AB (Annotation Byproducts)

Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

Highest opportunity score (56/60), public with no credential gate, and a schema that natively matches the Anthrocentrix model.

Full ranked registry

Showing 40 of 40 datasets

Dataset	Source · Size	Notes	Category	Domain	A	C	D	B	Q	O	Readiness (/30)	Opportunity (/60)	P(meaningful)	Verdict
ImageNet-AB (Annotation Byproducts)	Coallaoh / NeurIPS 2024 · ~1.28M images × annotation traces	Mouse traces, click locations, annotation times, anonymised worker IDs over full ImageNet train. Best-in-class match for Anthrocentrix on labeling.	Human Annotation	Image labeling	5	4	5	5	5	3	27	56	85%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Lichess + Stockfish blunder labels	Lichess.org open database + Stockfish evaluation · ~5B games available; Anthrocentrix used 112k events	Anchor benchmark. ROBUST_DECISION_STATE_SIGNAL: residual lift +1.06pp PR-AUC, 85% survival vs raw, CI excludes 0.	Operational Decision Systems	Chess	5	5	5	5	5	5	30	54	100%	ROBUST DECISION STATE SIGNAL
Trueblood et al. Medical Decision RT	Trueblood et al. 2017 (rtdists::med_dec) · Pathologists + novices · per-trial RT and accuracy	Per-decision RT + accuracy with expert/novice labels. Tight scope but a near-perfect Anthrocentrix-shaped dataset.	Clinical Decision Making	Pathology classification (blast vs non-blast)	5	4	5	5	5	4	28	52	80%	READY FOR FULL ANTHROCENTRIX BENCHMARK
FiFAR — Fairness in AI-assisted Fraud Review	Feedzai ICAIF '23 Synthetic Data Workshop · Synthetic but realistic; analyst IDs + decisions	Synthetic but designed for human-AI decision research. Direct commercial analogue for QA routing.	Fraud Review	AI-assisted fraud analyst decisions	5	4	5	3	5	4	26	52	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
ABCD — Action-Based Conversations Dataset	ASAPP Research, NAACL 2021 · 10,042 dialogs · 55 user intents · 30 agent actions	Explicit agent-action sequences with correct/incorrect labels and final task success. Limited timing telemetry.	Customer Service / Contact Center	Task-oriented customer support	3	5	5	2	5	5	25	51	55%	READY FOR REDUCED TELEMETRY BENCHMARK
AI.vs.Clinician (sepsis trial)	BenchCouncil · Multi-site randomized trial logs	Clinician decisions with/without AI, patient outcomes. Decision-time telemetry limited.	Clinical Decision Making	Sepsis early warning	4	5	5	3	4	5	26	49	55%	READY FOR REDUCED TELEMETRY BENCHMARK
ReviewArena / ReviewBench (peer review)	NeurIPS 2026 submission corpus · 51,529 papers · 196,099 reviews · 22 venues	Reviewer pseudonym IDs, scores, rebuttals, acceptance outcome. Telemetry limited to revision counts and rebuttal cycles.	Content Moderation / Peer Review	Scientific peer review	4	5	5	2	4	5	25	48	55%	READY FOR REDUCED TELEMETRY BENCHMARK
StarCraft II Replay Pack	DeepMind / Blizzard · Millions of replays	Per-player APM, action latency, MMR; outcome is win/loss. Strong second-domain candidate.	Operational Decision Systems	RTS gameplay	5	5	5	5	4	5	29	48	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Mind2Web	OSU NLP · 2,350 tasks · 137 websites	Worker action sequences with success labels; timing partially recorded.	Human-AI Collaboration	Web navigation	3	5	5	3	4	4	24	47	40%	READY FOR REDUCED TELEMETRY BENCHMARK
WebArena (with human trajectories)	CMU · Human + agent trajectories	Programmatic task success outcomes.	Human-AI Collaboration	Web tasks	3	5	5	3	4	5	25	47	45%	READY FOR REDUCED TELEMETRY BENCHMARK
Wikipedia Edit History (per-editor)	Wikimedia dumps · Multi-TB	Per-editor history, edit timestamps, revert/persistence as quality. Heavy preprocessing required.	Operational Decision Systems	Edits + reverts	5	4	5	4	4	4	26	47	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Tweet Annotation Sensitivity 2	Soda-LMU · ~89k annotation events	Duration per case, device, screen sequence, demographics, per-rater IDs. Quality is inter-rater agreement, not gold.	Human Annotation	Text labeling	4	3	5	4	3	2	21	45	60%	READY FOR REDUCED TELEMETRY BENCHMARK
MultiWOZ 2.2	Budzianowski et al., 2018; Google update 2022 · ~10k dialogs · 7 domains	Belief-state and slot ground truth, task success outcomes. Wizard-of-Oz so 'agent' is many crowdworkers.	Customer Service / Contact Center	Multi-domain task-oriented dialog	2	5	5	1	4	5	22	45	40%	READY FOR REDUCED TELEMETRY BENCHMARK
International Brain Lab — Decision Task	IBL · Millions of trials, hundreds of subjects	Non-human subjects; standardized 2AFC with RT, accuracy, and per-subject IDs.	Cognitive Science / Response-Time	Mouse 2AFC perception	5	4	5	5	5	3	27	45	65%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Stack Exchange Data Dump (Q&A)	Stack Exchange / Internet Archive · Multi-TB across sites	Per-user post + edit + close-vote history; quality = upvotes/accept; outcome = closure.	Operational Decision Systems	Q&A moderation	5	4	5	3	4	4	25	45	55%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Online-Go.com Game Archive	OGS · Millions of games	Direct chess-to-Go transfer test; Leela/KataGo provides decision-quality ground truth.	Operational Decision Systems	Go	5	5	5	4	4	5	28	45	70%	READY FOR FULL ANTHROCENTRIX BENCHMARK
Prolific Autoresearch HITL	ProlificAI · 300 participants × pairwise judgments	Per-pair judgments with participant IDs and Bradley-Terry outcomes; lighter on decision-time telemetry.	Human-AI Collaboration	DPO pair selection	5	3	5	3	3	4	23	44	45%	READY FOR REDUCED TELEMETRY BENCHMARK
Intertemporal-Choice RT (Pongratz & Schoemann)	Scientific Data 2026 · Large-scale participants × choice + RT	Per-participant RT-rich design; 'quality' is model-derived rather than gold.	Cognitive Science / Response-Time	Intertemporal choice	5	4	5	5	3	2	24	44	55%	READY FOR REDUCED TELEMETRY BENCHMARK
IRC Poker Database	University of Alberta · 10M+ hands	Per-player history, action timing weakly captured.	Operational Decision Systems	Online poker	5	4	5	3	3	5	25	44	45%	READY FOR REDUCED TELEMETRY BENCHMARK
CIFAR-10H (soft labels from human raters)	Peterson et al., 2019 · 10k CIFAR-10 test images × 50+ raters	Per-trial RT and worker IDs; ground-truth class labels available. Limited context features.	Human Annotation	Image labeling	4	3	5	3	4	2	21	43	55%	READY FOR REDUCED TELEMETRY BENCHMARK
MIMIC-IV (base EHR)	PhysioNet (credentialed) · ~300k patients	Clinician order timing as weak telemetry. Credentialed.	Clinical Decision Making	ICU EHR	4	5	4	2	3	5	23	43	35%	READY FOR REDUCED TELEMETRY BENCHMARK
Agent Traces: Customer Support Triage	Julien Simon / HF · 1,483 events · 50 runs	Synthetic, small. Useful as a workflow shape reference, not a primary benchmark.	Human-AI Collaboration	Multi-agent workflow	4	4	5	3	3	4	23	43	30%	READY FOR REDUCED TELEMETRY BENCHMARK
MIMIC-IV-Ext Clinical Decision Making	PhysioNet (credentialed) · MIMIC-IV derived	Credentialed access (PhysioNet CITI training required). High clinical value if cleared.	Clinical Decision Making	Abdominal pathology	3	5	4	2	4	5	23	42	35%	READY FOR REDUCED TELEMETRY BENCHMARK
IEEE-CIS Fraud Detection	Vesta / Kaggle · ~590k transactions	No human reviewer dimension; algorithmic baseline only.	Fraud Review	Card-not-present fraud	0	5	3	1	5	5	19	41	10%	NOT SUITABLE FOR ANTHROCENTRIX
OpenAI Moderation Evaluation Dataset	OpenAI · 1,680 prompts × multi-rater labels	Gold labels but no decision-time telemetry.	Content Moderation	Text safety	2	3	4	0	5	2	16	40	10%	NOT SUITABLE FOR ANTHROCENTRIX
RSNA Pneumonia Detection (radiologist reads)	RSNA / Kaggle · ~30k images, multi-rater	Strong labels, weak telemetry.	Medical Coding / Radiology	Chest X-ray reads	3	4	4	1	4	3	19	40	25%	READY FOR REDUCED TELEMETRY BENCHMARK
AMLSim	IBM Research · Simulator (configurable scale)	Simulator; would require injecting a synthetic analyst layer to test Anthrocentrix.	AML Review	Anti-money-laundering	3	4	3	1	4	3	18	39	15%	NOT SUITABLE FOR ANTHROCENTRIX
DynaSent (dynamic sentiment annotation)	Stanford · 121,634 sentences	Worker IDs and rounds; weak timing.	Human Annotation	Sentiment	3	3	4	2	4	3	19	39	30%	READY FOR REDUCED TELEMETRY BENCHMARK
Taskmaster-3 (TicketTalk)	Google Research · 23,789 dialogs	Self-dialog format limits behavioral telemetry.	Customer Service / Contact Center	Movie ticket dialog	2	4	4	1	3	4	18	38	30%	READY FOR REDUCED TELEMETRY BENCHMARK
Berkeley DeepDrive — Driving Decisions	BDD100K · 100k videos	Telemetry-rich but no per-decision quality labels.	Human Factors / Driving	Driving decisions	2	5	4	3	2	3	19	38	20%	TELEMETRY ONLY DATASET
Quality of RT Data Inference (Blinded Assessment)	OSF g7ka7 · Multi-lab collaborative assessment	Excellent RT telemetry; weak quality labels.	Cognitive Science / Response-Time	Cognitive modeling	4	3	4	5	2	1	19	36	30%	TELEMETRY ONLY DATASET
ML-Fairness-Gym (hiring + loans)	Google Research · Simulator	Sim only.	Recruiting / Hiring	Sequential decisions	2	4	4	1	3	4	18	36	10%	NOT SUITABLE FOR ANTHROCENTRIX
HANNA-LLMEval	Chhun et al., 2022; bay-calibration-llm-evaluators · 1,056 stories × multi-rater Likert	Rater IDs and Likert criteria but minimal behavioral telemetry. Useful as a context-control benchmark.	Human Annotation	Story rating	4	3	4	2	3	1	17	35	25%	TELEMETRY ONLY DATASET
Jigsaw Toxic Comment Classification	Conversation AI / Kaggle · ~160k comments × multi-label	Strong labels, no telemetry, no annotator IDs in public release.	Content Moderation	Online comments	1	3	3	0	4	2	13	35	5%	NOT SUITABLE FOR ANTHROCENTRIX
CheXpert	Stanford · 224,316 reports	Label extraction from reports; no decision-time telemetry.	Medical Coding / Radiology	Chest X-ray labeling	2	4	3	0	4	3	16	34	10%	NOT SUITABLE FOR ANTHROCENTRIX
Reddit Moderation Actions (Pushshift)	Pushshift / academic mirrors · Billions of comments	Pushshift access restricted post-2023.	Content Moderation	Subreddit moderation	3	4	4	2	3	3	19	34	25%	READY FOR REDUCED TELEMETRY BENCHMARK
SNLI / MNLI annotator-level (eraser-style)	Stanford NLP / Bowman et al. · 570k pairs	Public release lacks per-worker timing.	Human Annotation	NLI labeling	2	3	4	0	4	2	15	33	10%	NOT SUITABLE FOR ANTHROCENTRIX
CallCenterEN (PII-redacted transcripts)	AIxBlock / arXiv:2507.02958 · 91,706 transcripts · 10,448 hours	Verified end-to-end: word-level timing + ASR confidence, but no audio, no diarization, no CSAT. Pseudo-diarization adapter exists.	Customer Service / Contact Center	Inbound/outbound calls	0	2	1	3	0	0	6	31	15%	TELEMETRY ONLY DATASET
DAIR-AI Emotion (per-rater)	DAIR-AI · 20k tweets	Aggregated labels only.	Behavioral Health	Text emotion	1	3	3	0	4	2	13	30	5%	NOT SUITABLE FOR ANTHROCENTRIX
Switchboard-1 Telephone Speech	LDC · 260 hours, ~2,400 conversations	LDC-licensed; has true diarization. No quality labels.	Customer Service / Contact Center	Conversational speech	4	3	3	4	1	1	16	28	15%	TELEMETRY ONLY DATASET

The scoring methodology and full registry are also exported as a PDF report. Column key: A = actor identity, C = context complexity, D = decision-event granularity, B = behavioral telemetry, Q = decision-quality label, O = outcome label. Each is scored 0–5, and readiness (/30) is the sum of the six. Opportunity (/60) = readiness (/30) + commercial relevance, ease of access, and replication value (together /30). P is P(meaningful).
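The scoring arithmetic in the key above can be checked row by row. The sketch below is an illustration under assumptions, not the Anthrocentrix scoring code: `RegistryRow` and `parse_scores` are hypothetical names, and the parser reads the ten score fields from the end of a tab-separated registry row. The registry reports only the opportunity total, so the sketch recovers the combined commercial relevance + ease of access + replication value as opportunity minus readiness, without splitting it into its three parts.

```python
from dataclasses import dataclass

AXES = ("A", "C", "D", "B", "Q", "O")  # order of the score columns in the registry


@dataclass(frozen=True)
class RegistryRow:
    scores: dict        # axis letter -> score in 0..5
    readiness: int      # reported readiness, out of 30
    opportunity: int    # reported opportunity, out of 60
    p_meaningful: int   # P(meaningful), in percent
    verdict: str

    @property
    def computed_readiness(self) -> int:
        # Readiness is the sum of the six 0-5 axis scores.
        return sum(self.scores.values())

    @property
    def implied_extras(self) -> int:
        # Combined commercial relevance + ease of access + replication value (/30).
        return self.opportunity - self.readiness


def parse_scores(row: str) -> RegistryRow:
    """Parse the last ten tab-separated fields: A C D B Q O /30 /60 P Verdict."""
    tail = row.rstrip("\n").split("\t")[-10:]
    if len(tail) != 10:
        raise ValueError("expected at least ten tab-separated fields")
    axis_values = [int(v) for v in tail[:6]]
    if any(not 0 <= v <= 5 for v in axis_values):
        raise ValueError(f"axis scores must be in 0..5, got {axis_values}")
    return RegistryRow(
        scores=dict(zip(AXES, axis_values)),
        readiness=int(tail[6]),
        opportunity=int(tail[7]),
        p_meaningful=int(tail[8].rstrip("%")),
        verdict=tail[9],
    )


# Score fields copied from the ImageNet-AB registry row.
row = (
    "ImageNet-AB (Annotation Byproducts)\tHuman Annotation\tImage labeling"
    "\t5\t4\t5\t5\t5\t3\t27\t56\t85%\tREADY FOR FULL ANTHROCENTRIX BENCHMARK"
)
parsed = parse_scores(row)
print(parsed.computed_readiness, parsed.readiness, parsed.implied_extras)
```

For the ImageNet-AB row the six scores sum to the reported readiness of 27, leaving 29 of the remaining 30 opportunity points attributed to the three non-readiness factors.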