Dataset Discovery Lab

Real public datasets ranked by Anthrocentrix Readiness and Opportunity. Scores are estimates drawn from public dataset documentation, not from end-to-end benchmark runs; the exceptions are Lichess and CallCenterEN, which were run end to end.

Datasets surveyed: 40
Readiness ≥ 18/30: 32
Top-10 mean opportunity: 50.4 (out of 60)
Top-10 mean P(meaningful): 66%
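
The first two summary figures can be re-derived from the readiness column of the full ranked registry further down the page. The following is a minimal Python sketch, assuming the 18/30 threshold applies to the readiness score (the page does not say so explicitly). The values are transcribed by hand from the registry, and the variable names are illustrative only.

```python
# Readiness scores (/30) transcribed by hand from the full ranked registry,
# in registry order. Names here are illustrative, not taken from the page.
readiness = [
    27, 30, 28, 26, 25, 26, 25, 29, 24, 25,
    26, 21, 22, 27, 25, 28, 23, 24, 25, 21,
    23, 23, 23, 19, 16, 19, 18, 19, 18, 19,
    19, 18, 17, 13, 16, 19, 15, 6, 13, 16,
]

THRESHOLD = 18  # assumption: the cut-off is on readiness, not opportunity

surveyed = len(readiness)
at_or_above = sum(score >= THRESHOLD for score in readiness)

print(surveyed, at_or_above)  # prints: 40 32
```

Under that assumption the counts agree with the figures reported above.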

Top 10 most-likely cross-domain replications

# | Dataset | Category | Readiness | Opportunity | P(meaningful) | Verdict
1 | ImageNet-AB (Annotation Byproducts) | Human Annotation | 27/30 | 56/60 | 85% | READY FOR FULL ANTHROCENTRIX BENCHMARK
2 | Lichess + Stockfish blunder labels | Operational Decision Systems | 30/30 | 54/60 | 100% | ROBUST DECISION STATE SIGNAL
3 | Trueblood et al. Medical Decision RT | Clinical Decision Making | 28/30 | 52/60 | 80% | READY FOR FULL ANTHROCENTRIX BENCHMARK
4 | FiFAR — Fairness in AI-assisted Fraud Review | Fraud Review | 26/30 | 52/60 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK
5 | ABCD — Action-Based Conversations Dataset | Customer Service / Contact Center | 25/30 | 51/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK
6 | AI.vs.Clinician (sepsis trial) | Clinical Decision Making | 26/30 | 49/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK
7 | ReviewArena / ReviewBench (peer review) | Content Moderation / Peer Review | 25/30 | 48/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK
8 | StarCraft II Replay Pack | Operational Decision Systems | 29/30 | 48/60 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK
9 | Mind2Web | Human-AI Collaboration | 24/30 | 47/60 | 40% | READY FOR REDUCED TELEMETRY BENCHMARK
10 | WebArena (with human trajectories) | Human-AI Collaboration | 25/30 | 47/60 | 45% | READY FOR REDUCED TELEMETRY BENCHMARK
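
The two top-10 summary figures (mean opportunity 50.4, mean P(meaningful) 66%) can be checked against the rows above. This is a minimal Python sketch; the short dataset labels and variable names are illustrative, and treating the reported 66% as a rounded 65.5% is an assumption about how the page computes it.

```python
from statistics import mean

# (short label, opportunity out of 60, P(meaningful) in percent),
# copied by hand from the top-10 table above.
top10 = [
    ("ImageNet-AB", 56, 85),
    ("Lichess", 54, 100),
    ("Trueblood", 52, 80),
    ("FiFAR", 52, 70),
    ("ABCD", 51, 55),
    ("AI.vs.Clinician", 49, 55),
    ("ReviewArena", 48, 55),
    ("StarCraft II", 48, 70),
    ("Mind2Web", 47, 40),
    ("WebArena", 47, 45),
]

mean_opportunity = mean(opp for _, opp, _ in top10)
mean_p = mean(p for _, _, p in top10)

print(f"{mean_opportunity:.1f}")  # 50.4
print(f"{mean_p:.1f}")            # 65.5, shown on the page as 66%
```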

Final recommendations

Best scientific replication
ImageNet-AB (Annotation Byproducts)
Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

Per-worker mouse traces, click locations, and annotation times over all of ImageNet-1K — the cleanest non-chess actor + telemetry + label triple.

Best commercial replication
FiFAR — Fairness in AI-assisted Fraud Review
Readiness 26/30 · Opportunity 52/60 · P(meaningful) 70%

AI-assisted fraud review with explicit analyst IDs and decisions — direct buyer story for QA routing.

Fastest benchmark to run
Trueblood et al. Medical Decision RT
Readiness 28/30 · Opportunity 52/60 · P(meaningful) 80%

Small, tabular, per-trial RT + accuracy with pathologist IDs. Ladder fits in minutes.

Highest-value if successful
ImageNet-AB (Annotation Byproducts)
Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

A replication on ImageNet-AB anchors Anthrocentrix in the AI data-labeling market — the same market the QA Wedge already targets.

Recommended next to ingest
ImageNet-AB (Annotation Byproducts)
Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85%

Highest opportunity score, public, no credential gate, schema natively matches the Anthrocentrix model.

Full ranked registry

Dataset · Category · Domain · A · C · D · B · Q · O · Readiness (/30) · Opportunity (/60) · P(meaningful) · Verdict
ImageNet-AB (Annotation Byproducts)
Coallaoh / NeurIPS 2024 · ~1.28M images × annotation traces
Mouse traces, click locations, annotation times, anonymised worker IDs over full ImageNet train. Best-in-class match for Anthrocentrix on labeling.
Human Annotation · Image labeling · A 5 · C 4 · D 5 · B 5 · Q 5 · O 3 · Readiness 27/30 · Opportunity 56/60 · P(meaningful) 85% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Lichess + Stockfish blunder labels
Lichess.org open database + Stockfish evaluation · ~5B games available; Anthrocentrix used 112k events
Anchor benchmark. ROBUST_DECISION_STATE_SIGNAL: residual lift +1.06pp PR-AUC, 85% survival vs raw, CI excludes 0.
Operational Decision Systems · Chess · A 5 · C 5 · D 5 · B 5 · Q 5 · O 5 · Readiness 30/30 · Opportunity 54/60 · P(meaningful) 100% · ROBUST DECISION STATE SIGNAL
Trueblood et al. Medical Decision RT
Trueblood et al. 2017 (rtdists::med_dec) · Pathologists + novices · per-trial RT and accuracy
Per-decision RT + accuracy with expert/novice labels. Tight scope but a near-perfect Anthrocentrix-shaped dataset.
Clinical Decision Making · Pathology classification (blast vs non-blast) · A 5 · C 4 · D 5 · B 5 · Q 5 · O 4 · Readiness 28/30 · Opportunity 52/60 · P(meaningful) 80% · READY FOR FULL ANTHROCENTRIX BENCHMARK
FiFAR — Fairness in AI-assisted Fraud Review
Feedzai ICAIF '23 Synthetic Data Workshop · Synthetic but realistic; analyst IDs + decisions
Synthetic but designed for human-AI decision research. Direct commercial analogue for QA routing.
Fraud Review · AI-assisted fraud analyst decisions · A 5 · C 4 · D 5 · B 3 · Q 5 · O 4 · Readiness 26/30 · Opportunity 52/60 · P(meaningful) 70% · READY FOR FULL ANTHROCENTRIX BENCHMARK
ABCD — Action-Based Conversations Dataset
ASAPP Research, NAACL 2021 · 10,042 dialogs · 55 user intents · 30 agent actions
Explicit agent-action sequences with correct/incorrect labels and final task success. Limited timing telemetry.
Customer Service / Contact Center · Task-oriented customer support · A 3 · C 5 · D 5 · B 2 · Q 5 · O 5 · Readiness 25/30 · Opportunity 51/60 · P(meaningful) 55% · READY FOR REDUCED TELEMETRY BENCHMARK
AI.vs.Clinician (sepsis trial)
BenchCouncil · Multi-site randomized trial logs
Clinician decisions with/without AI, patient outcomes. Decision-time telemetry limited.
Clinical Decision Making · Sepsis early warning · A 4 · C 5 · D 5 · B 3 · Q 4 · O 5 · Readiness 26/30 · Opportunity 49/60 · P(meaningful) 55% · READY FOR REDUCED TELEMETRY BENCHMARK
ReviewArena / ReviewBench (peer review)
NeurIPS 2026 submission corpus · 51,529 papers · 196,099 reviews · 22 venues
Reviewer pseudonym IDs, scores, rebuttals, acceptance outcome. Telemetry limited to revision counts and rebuttal cycles.
Content Moderation / Peer Review · Scientific peer review · A 4 · C 5 · D 5 · B 2 · Q 4 · O 5 · Readiness 25/30 · Opportunity 48/60 · P(meaningful) 55% · READY FOR REDUCED TELEMETRY BENCHMARK
StarCraft II Replay Pack
DeepMind / Blizzard · Millions of replays
Per-player APM, action latency, MMR; outcome is win/loss. Strong second-domain candidate.
Operational Decision Systems · RTS gameplay · A 5 · C 5 · D 5 · B 5 · Q 4 · O 5 · Readiness 29/30 · Opportunity 48/60 · P(meaningful) 70% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Mind2Web
OSU NLP · 2,350 tasks · 137 websites
Worker action sequences with success labels; timing partially recorded.
Human-AI Collaboration · Web navigation · A 3 · C 5 · D 5 · B 3 · Q 4 · O 4 · Readiness 24/30 · Opportunity 47/60 · P(meaningful) 40% · READY FOR REDUCED TELEMETRY BENCHMARK
WebArena (with human trajectories)
CMU · Human + agent trajectories
Programmatic task success outcomes.
Human-AI Collaboration · Web tasks · A 3 · C 5 · D 5 · B 3 · Q 4 · O 5 · Readiness 25/30 · Opportunity 47/60 · P(meaningful) 45% · READY FOR REDUCED TELEMETRY BENCHMARK
Wikipedia Edit History (per-editor)
Wikimedia dumps · Multi-TB
Per-editor history, edit timestamps, revert/persistence as quality. Heavy preprocessing required.
Operational Decision Systems · Edits + reverts · A 5 · C 4 · D 5 · B 4 · Q 4 · O 4 · Readiness 26/30 · Opportunity 47/60 · P(meaningful) 70% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Tweet Annotation Sensitivity 2
Soda-LMU · ~89k annotation events
Duration per case, device, screen sequence, demographics, per-rater IDs. Quality is inter-rater agreement, not gold.
Human Annotation · Text labeling · A 4 · C 3 · D 5 · B 4 · Q 3 · O 2 · Readiness 21/30 · Opportunity 45/60 · P(meaningful) 60% · READY FOR REDUCED TELEMETRY BENCHMARK
MultiWOZ 2.2
Budzianowski et al., 2018; Google update 2022 · ~10k dialogs · 7 domains
Belief-state and slot ground truth, task success outcomes. Wizard-of-Oz so 'agent' is many crowdworkers.
Customer Service / Contact Center · Multi-domain task-oriented dialog · A 2 · C 5 · D 5 · B 1 · Q 4 · O 5 · Readiness 22/30 · Opportunity 45/60 · P(meaningful) 40% · READY FOR REDUCED TELEMETRY BENCHMARK
International Brain Lab — Decision Task
IBL · Millions of trials, hundreds of subjects
Non-human subjects; standardized 2AFC with RT, accuracy, and per-subject IDs.
Cognitive Science / Response-Time · Mouse 2AFC perception · A 5 · C 4 · D 5 · B 5 · Q 5 · O 3 · Readiness 27/30 · Opportunity 45/60 · P(meaningful) 65% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Stack Exchange Data Dump (Q&A)
Stack Exchange / Internet Archive · Multi-TB across sites
Per-user post + edit + close-vote history; quality = upvotes/accept; outcome = closure.
Operational Decision Systems · Q&A moderation · A 5 · C 4 · D 5 · B 3 · Q 4 · O 4 · Readiness 25/30 · Opportunity 45/60 · P(meaningful) 55% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Online-Go.com Game Archive
OGS · Millions of games
Direct chess-to-Go transfer test; Leela/KataGo provides decision-quality ground truth.
Operational Decision Systems · Go · A 5 · C 5 · D 5 · B 4 · Q 4 · O 5 · Readiness 28/30 · Opportunity 45/60 · P(meaningful) 70% · READY FOR FULL ANTHROCENTRIX BENCHMARK
Prolific Autoresearch HITL
ProlificAI · 300 participants × pairwise judgments
Per-pair judgments with participant IDs and Bradley-Terry outcomes; lighter on decision-time telemetry.
Human-AI Collaboration · DPO pair selection · A 5 · C 3 · D 5 · B 3 · Q 3 · O 4 · Readiness 23/30 · Opportunity 44/60 · P(meaningful) 45% · READY FOR REDUCED TELEMETRY BENCHMARK
Intertemporal-Choice RT (Pongratz & Schoemann)
Scientific Data 2026 · Large-scale participants × choice + RT
Per-participant RT-rich design; 'quality' is model-derived rather than gold.
Cognitive Science / Response-Time · Intertemporal choice · A 5 · C 4 · D 5 · B 5 · Q 3 · O 2 · Readiness 24/30 · Opportunity 44/60 · P(meaningful) 55% · READY FOR REDUCED TELEMETRY BENCHMARK
IRC Poker Database
University of Alberta · 10M+ hands
Per-player history, action timing weakly captured.
Operational Decision Systems · Online poker · A 5 · C 4 · D 5 · B 3 · Q 3 · O 5 · Readiness 25/30 · Opportunity 44/60 · P(meaningful) 45% · READY FOR REDUCED TELEMETRY BENCHMARK
CIFAR-10H (soft labels from human raters)
Peterson et al., 2019 · 10k CIFAR-10 test images × 50+ raters
Per-trial RT and worker IDs; ground-truth class labels available. Limited context features.
Human Annotation · Image labeling · A 4 · C 3 · D 5 · B 3 · Q 4 · O 2 · Readiness 21/30 · Opportunity 43/60 · P(meaningful) 55% · READY FOR REDUCED TELEMETRY BENCHMARK
MIMIC-IV (base EHR)
PhysioNet (credentialed) · ~300k patients
Clinician order timing as weak telemetry. Credentialed.
Clinical Decision Making · ICU EHR · A 4 · C 5 · D 4 · B 2 · Q 3 · O 5 · Readiness 23/30 · Opportunity 43/60 · P(meaningful) 35% · READY FOR REDUCED TELEMETRY BENCHMARK
Agent Traces: Customer Support Triage
Julien Simon / HF · 1,483 events · 50 runs
Synthetic, small. Useful as a workflow shape reference, not a primary benchmark.
Human-AI Collaboration · Multi-agent workflow · A 4 · C 4 · D 5 · B 3 · Q 3 · O 4 · Readiness 23/30 · Opportunity 43/60 · P(meaningful) 30% · READY FOR REDUCED TELEMETRY BENCHMARK
MIMIC-IV-Ext Clinical Decision Making
PhysioNet (credentialed) · MIMIC-IV derived
Credentialed access (PhysioNet CITI training required). High clinical value if cleared.
Clinical Decision Making · Abdominal pathology · A 3 · C 5 · D 4 · B 2 · Q 4 · O 5 · Readiness 23/30 · Opportunity 42/60 · P(meaningful) 35% · READY FOR REDUCED TELEMETRY BENCHMARK
IEEE-CIS Fraud Detection
Vesta / Kaggle · ~590k transactions
No human reviewer dimension — algorithmic baseline only.
Fraud Review · Card-not-present fraud · A 0 · C 5 · D 3 · B 1 · Q 5 · O 5 · Readiness 19/30 · Opportunity 41/60 · P(meaningful) 10% · NOT SUITABLE FOR ANTHROCENTRIX
OpenAI Moderation Evaluation Dataset
OpenAI · 1,680 prompts × multi-rater labels
Gold labels but no decision-time telemetry.
Content Moderation · Text safety · A 2 · C 3 · D 4 · B 0 · Q 5 · O 2 · Readiness 16/30 · Opportunity 40/60 · P(meaningful) 10% · NOT SUITABLE FOR ANTHROCENTRIX
RSNA Pneumonia Detection (radiologist reads)
RSNA / Kaggle · ~30k images, multi-rater
Strong labels, weak telemetry.
Medical Coding / Radiology · Chest X-ray reads · A 3 · C 4 · D 4 · B 1 · Q 4 · O 3 · Readiness 19/30 · Opportunity 40/60 · P(meaningful) 25% · READY FOR REDUCED TELEMETRY BENCHMARK
AMLSim
IBM Research · Simulator (configurable scale)
Simulator — would require injecting synthetic analyst layer to test Anthrocentrix.
AML Review · Anti-money-laundering · A 3 · C 4 · D 3 · B 1 · Q 4 · O 3 · Readiness 18/30 · Opportunity 39/60 · P(meaningful) 15% · NOT SUITABLE FOR ANTHROCENTRIX
DynaSent (dynamic sentiment annotation)
Stanford · 121,634 sentences
Worker IDs and rounds; weak timing.
Human Annotation · Sentiment · A 3 · C 3 · D 4 · B 2 · Q 4 · O 3 · Readiness 19/30 · Opportunity 39/60 · P(meaningful) 30% · READY FOR REDUCED TELEMETRY BENCHMARK
Taskmaster-3 (TicketTalk)
Google Research · 23,789 dialogs
Self-dialog format limits behavioral telemetry.
Customer Service / Contact Center · Movie ticket dialog · A 2 · C 4 · D 4 · B 1 · Q 3 · O 4 · Readiness 18/30 · Opportunity 38/60 · P(meaningful) 30% · READY FOR REDUCED TELEMETRY BENCHMARK
Berkeley DeepDrive — Driving Decisions
BDD100K · 100k videos
Telemetry-rich but no per-decision quality labels.
Human Factors / Driving · Driving decisions · A 2 · C 5 · D 4 · B 3 · Q 2 · O 3 · Readiness 19/30 · Opportunity 38/60 · P(meaningful) 20% · TELEMETRY ONLY DATASET
Quality of RT Data Inference (Blinded Assessment)
OSF g7ka7 · Multi-lab collaborative assessment
Excellent RT telemetry; weak quality labels.
Cognitive Science / Response-Time · Cognitive modeling · A 4 · C 3 · D 4 · B 5 · Q 2 · O 1 · Readiness 19/30 · Opportunity 36/60 · P(meaningful) 30% · TELEMETRY ONLY DATASET
ML-Fairness-Gym (hiring + loans)
Google Research · Simulator
Sim only.
Recruiting / Hiring · Sequential decisions · A 2 · C 4 · D 4 · B 1 · Q 3 · O 4 · Readiness 18/30 · Opportunity 36/60 · P(meaningful) 10% · NOT SUITABLE FOR ANTHROCENTRIX
HANNA-LLMEval
Chhun et al., 2022; bay-calibration-llm-evaluators · 1,056 stories × multi-rater Likert
Rater IDs and Likert criteria but minimal behavioral telemetry. Useful as a context-control benchmark.
Human Annotation · Story rating · A 4 · C 3 · D 4 · B 2 · Q 3 · O 1 · Readiness 17/30 · Opportunity 35/60 · P(meaningful) 25% · TELEMETRY ONLY DATASET
Jigsaw Toxic Comment Classification
Conversation AI / Kaggle · ~160k comments × multi-label
Strong labels, no telemetry, no annotator IDs in public release.
Content Moderation · Online comments · A 1 · C 3 · D 3 · B 0 · Q 4 · O 2 · Readiness 13/30 · Opportunity 35/60 · P(meaningful) 5% · NOT SUITABLE FOR ANTHROCENTRIX
CheXpert
Stanford · 224,316 reports
Label extraction from reports — no decision-time telemetry.
Medical Coding / Radiology · Chest X-ray labeling · A 2 · C 4 · D 3 · B 0 · Q 4 · O 3 · Readiness 16/30 · Opportunity 34/60 · P(meaningful) 10% · NOT SUITABLE FOR ANTHROCENTRIX
Reddit Moderation Actions (Pushshift)
Pushshift / academic mirrors · Billions of comments
Pushshift access restricted post-2023.
Content Moderation · Subreddit moderation · A 3 · C 4 · D 4 · B 2 · Q 3 · O 3 · Readiness 19/30 · Opportunity 34/60 · P(meaningful) 25% · READY FOR REDUCED TELEMETRY BENCHMARK
SNLI / MNLI annotator-level (eraser-style)
Stanford NLP / Bowman et al. · 570k pairs
Public release lacks per-worker timing.
Human Annotation · NLI labeling · A 2 · C 3 · D 4 · B 0 · Q 4 · O 2 · Readiness 15/30 · Opportunity 33/60 · P(meaningful) 10% · NOT SUITABLE FOR ANTHROCENTRIX
CallCenterEN (PII-redacted transcripts)
AIxBlock / arXiv:2507.02958 · 91,706 transcripts · 10,448 hours
Verified end-to-end: word-level timing + ASR confidence, but no audio, no diarization, no CSAT. Pseudo-diarization adapter exists.
Customer Service / Contact Center · Inbound/outbound calls · A 0 · C 2 · D 1 · B 3 · Q 0 · O 0 · Readiness 6/30 · Opportunity 31/60 · P(meaningful) 15% · TELEMETRY ONLY DATASET
DAIR-AI Emotion (per-rater)
DAIR-AI · 20k tweets
Aggregated labels only.
Behavioral Health · Text emotion · A 1 · C 3 · D 3 · B 0 · Q 4 · O 2 · Readiness 13/30 · Opportunity 30/60 · P(meaningful) 5% · NOT SUITABLE FOR ANTHROCENTRIX
Switchboard-1 Telephone Speech
LDC · 260 hours, ~2,400 conversations
LDC-licensed; has true diarization. No quality labels.
Customer Service / Contact Center · Conversational speech · A 4 · C 3 · D 3 · B 4 · Q 1 · O 1 · Readiness 16/30 · Opportunity 28/60 · P(meaningful) 15% · TELEMETRY ONLY DATASET

The scoring methodology and full registry are also exported as a PDF report. Column key: A = actor identity, C = context complexity, D = decision-event granularity, B = behavioral telemetry, Q = decision-quality label, O = outcome label. Each is scored 0–5, and readiness (/30) is their sum. Opportunity (/60) = readiness (/30) plus up to 30 further points for commercial relevance, ease of access, and replication value.
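
The scoring arithmetic can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual implementation. The class and function names are hypothetical. The page does not say how the non-readiness half of the opportunity score is split between commercial relevance, ease of access, and replication value, so the sketch takes it as a single combined number. The value 29 used for ImageNet-AB is back-solved from its published 56/60, not a documented figure.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Readiness:
    """Six readiness components (A, C, D, B, Q, O), each scored 0-5."""

    actor_identity: int        # A
    context_complexity: int    # C
    decision_granularity: int  # D
    behavioral_telemetry: int  # B
    decision_quality: int      # Q
    outcome: int               # O

    def __post_init__(self) -> None:
        # Reject any component outside the documented 0-5 range.
        for value in astuple(self):
            if not 0 <= value <= 5:
                raise ValueError(f"component score {value} is outside 0-5")

    def total(self) -> int:
        """Readiness out of 30: the plain sum of the six components."""
        return sum(astuple(self))


def opportunity(readiness: Readiness, other_points: int) -> int:
    """Opportunity out of 60.

    other_points is a hypothetical combined score (0-30) standing in for
    commercial relevance + ease of access + replication value.
    """
    if not 0 <= other_points <= 30:
        raise ValueError("other_points must be between 0 and 30")
    return readiness.total() + other_points


# ImageNet-AB component scores as listed in the registry: A5 C4 D5 B5 Q5 O3.
imagenet_ab = Readiness(5, 4, 5, 5, 5, 3)
print(imagenet_ab.total())           # 27
print(opportunity(imagenet_ab, 29))  # 56 (29 is back-solved, see lead-in)
```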

See also: imported datasets · dataset qualification gate.