Dataset Discovery Lab
Real public datasets, ranked by Anthrocentrix Readiness and Opportunity. Scores are estimates drawn from public dataset documentation rather than end-to-end benchmark runs; the two exceptions are Lichess and CallCenterEN, which were verified end to end.
Top 10 most-likely cross-domain replications
| # | Dataset | Category | Readiness | Opportunity | P(meaningful) | Verdict |
|---|---|---|---|---|---|---|
| 1 | ImageNet-AB (Annotation Byproducts) | Human Annotation | 27/30 | 56/60 | 85% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| 2 | Lichess + Stockfish blunder labels | Operational Decision Systems | 30/30 | 54/60 | 100% | ROBUST DECISION STATE SIGNAL |
| 3 | Trueblood et al. Medical Decision RT | Clinical Decision Making | 28/30 | 52/60 | 80% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| 4 | FiFAR (Fairness in AI-assisted Fraud Review) | Fraud Review | 26/30 | 52/60 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| 5 | ABCD (Action-Based Conversations Dataset) | Customer Service / Contact Center | 25/30 | 51/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| 6 | AI.vs.Clinician (sepsis trial) | Clinical Decision Making | 26/30 | 49/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| 7 | ReviewArena / ReviewBench (peer review) | Content Moderation / Peer Review | 25/30 | 48/60 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| 8 | StarCraft II Replay Pack | Operational Decision Systems | 29/30 | 48/60 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| 9 | Mind2Web | Human-AI Collaboration | 24/30 | 47/60 | 40% | READY FOR REDUCED TELEMETRY BENCHMARK |
| 10 | WebArena (with human trajectories) | Human-AI Collaboration | 25/30 | 47/60 | 45% | READY FOR REDUCED TELEMETRY BENCHMARK |
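The rows above appear to be ordered by descending Opportunity, with ties left in their registry order. The sketch below reproduces that ordering under this assumption; the `Row` type and the `top_n` helper are hypothetical names used only for illustration, not part of any Anthrocentrix tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    readiness: int    # out of 30
    opportunity: int  # out of 60


def top_n(rows, n=10):
    """Return the n rows with the highest Opportunity score.

    sorted() is stable, so rows with equal Opportunity keep their input order.
    """
    return sorted(rows, key=lambda r: r.opportunity, reverse=True)[:n]


# Four rows taken from the table above, deliberately out of order.
rows = [
    Row("Trueblood et al. Medical Decision RT", 28, 52),
    Row("ImageNet-AB", 27, 56),
    Row("FiFAR", 26, 52),
    Row("Lichess + Stockfish blunder labels", 30, 54),
]

for rank, row in enumerate(top_n(rows, n=3), start=1):
    print(rank, row.name, f"{row.opportunity}/60")
```

Because the sort is stable, Trueblood stays ahead of FiFAR at 52/60 only because it comes first in the input; a different tie-break (for example Readiness or P(meaningful)) would need an explicit secondary key.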
Final recommendations
ImageNet-AB: per-worker mouse traces, click locations, and annotation times over all of ImageNet-1K. It is the cleanest non-chess triple of actor, telemetry, and label.
FiFAR: AI-assisted fraud review with explicit analyst IDs and decisions, which gives a direct buyer story for QA routing.
Trueblood et al. Medical Decision RT: small and tabular, with per-trial RT and accuracy plus pathologist IDs. The ladder fits in minutes.
A replication on ImageNet-AB anchors Anthrocentrix in the AI data-labeling market, the same market the QA Wedge already targets.
ImageNet-AB also has the highest Opportunity score (56/60), is public with no credential gate, and has a schema that natively matches the Anthrocentrix model.
Full ranked registry
| Dataset | Source and scale | Notes | Category | Domain | A | C | D | B | Q | O | Readiness (/30) | Opportunity (/60) | P(meaningful) | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-AB (Annotation Byproducts) | Coallaoh / NeurIPS 2024 · ~1.28M images × annotation traces | Mouse traces, click locations, annotation times, anonymised worker IDs over full ImageNet train. Best-in-class match for Anthrocentrix on labeling. | Human Annotation | Image labeling | 5 | 4 | 5 | 5 | 5 | 3 | 27 | 56 | 85% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Lichess + Stockfish blunder labels | Lichess.org open database + Stockfish evaluation · ~5B games available; Anthrocentrix used 112k events | Anchor benchmark. ROBUST_DECISION_STATE_SIGNAL: residual lift +1.06pp PR-AUC, 85% survival vs raw, CI excludes 0. | Operational Decision Systems | Chess | 5 | 5 | 5 | 5 | 5 | 5 | 30 | 54 | 100% | ROBUST DECISION STATE SIGNAL |
| Trueblood et al. Medical Decision RT | Trueblood et al. 2017 (rtdists::med_dec) · Pathologists + novices · per-trial RT and accuracy | Per-decision RT + accuracy with expert/novice labels. Tight scope but a near-perfect Anthrocentrix-shaped dataset. | Clinical Decision Making | Pathology classification (blast vs non-blast) | 5 | 4 | 5 | 5 | 5 | 4 | 28 | 52 | 80% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| FiFAR (Fairness in AI-assisted Fraud Review) | Feedzai ICAIF '23 Synthetic Data Workshop · Synthetic but realistic; analyst IDs + decisions | Synthetic but designed for human-AI decision research. Direct commercial analogue for QA routing. | Fraud Review | AI-assisted fraud analyst decisions | 5 | 4 | 5 | 3 | 5 | 4 | 26 | 52 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| ABCD (Action-Based Conversations Dataset) | ASAPP Research, NAACL 2021 · 10,042 dialogs · 55 user intents · 30 agent actions | Explicit agent-action sequences with correct/incorrect labels and final task success. Limited timing telemetry. | Customer Service / Contact Center | Task-oriented customer support | 3 | 5 | 5 | 2 | 5 | 5 | 25 | 51 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| AI.vs.Clinician (sepsis trial) | BenchCouncil · Multi-site randomized trial logs | Clinician decisions with/without AI, patient outcomes. Decision-time telemetry limited. | Clinical Decision Making | Sepsis early warning | 4 | 5 | 5 | 3 | 4 | 5 | 26 | 49 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| ReviewArena / ReviewBench (peer review) | NeurIPS 2026 submission corpus · 51,529 papers · 196,099 reviews · 22 venues | Reviewer pseudonym IDs, scores, rebuttals, acceptance outcome. Telemetry limited to revision counts and rebuttal cycles. | Content Moderation / Peer Review | Scientific peer review | 4 | 5 | 5 | 2 | 4 | 5 | 25 | 48 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| StarCraft II Replay Pack | DeepMind / Blizzard · Millions of replays | Per-player APM, action latency, MMR; outcome is win/loss. Strong second-domain candidate. | Operational Decision Systems | RTS gameplay | 5 | 5 | 5 | 5 | 4 | 5 | 29 | 48 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Mind2Web | OSU NLP · 2,350 tasks · 137 websites | Worker action sequences with success labels; timing partially recorded. | Human-AI Collaboration | Web navigation | 3 | 5 | 5 | 3 | 4 | 4 | 24 | 47 | 40% | READY FOR REDUCED TELEMETRY BENCHMARK |
| WebArena (with human trajectories) | CMU · Human + agent trajectories | Programmatic task success outcomes. | Human-AI Collaboration | Web tasks | 3 | 5 | 5 | 3 | 4 | 5 | 25 | 47 | 45% | READY FOR REDUCED TELEMETRY BENCHMARK |
| Wikipedia Edit History (per-editor) | Wikimedia dumps · Multi-TB | Per-editor history, edit timestamps, revert/persistence as quality. Heavy preprocessing required. | Operational Decision Systems | Edits + reverts | 5 | 4 | 5 | 4 | 4 | 4 | 26 | 47 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Tweet Annotation Sensitivity 2 | Soda-LMU · ~89k annotation events | Duration per case, device, screen sequence, demographics, per-rater IDs. Quality is inter-rater agreement, not gold. | Human Annotation | Text labeling | 4 | 3 | 5 | 4 | 3 | 2 | 21 | 45 | 60% | READY FOR REDUCED TELEMETRY BENCHMARK |
| MultiWOZ 2.2 | Budzianowski et al., 2018; Google update 2022 · ~10k dialogs · 7 domains | Belief-state and slot ground truth, task success outcomes. Wizard-of-Oz so 'agent' is many crowdworkers. | Customer Service / Contact Center | Multi-domain task-oriented dialog | 2 | 5 | 5 | 1 | 4 | 5 | 22 | 45 | 40% | READY FOR REDUCED TELEMETRY BENCHMARK |
| International Brain Lab: Decision Task | IBL · Millions of trials, hundreds of subjects | Non-human subjects; standardized 2AFC with RT, accuracy, and per-subject IDs. | Cognitive Science / Response-Time | Mouse 2AFC perception | 5 | 4 | 5 | 5 | 5 | 3 | 27 | 45 | 65% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Stack Exchange Data Dump (Q&A) | Stack Exchange / Internet Archive · Multi-TB across sites | Per-user post + edit + close-vote history; quality = upvotes/accept; outcome = closure. | Operational Decision Systems | Q&A moderation | 5 | 4 | 5 | 3 | 4 | 4 | 25 | 45 | 55% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Online-Go.com Game Archive | OGS · Millions of games | Direct chess-to-Go transfer test; Leela/KataGo provides decision-quality ground truth. | Operational Decision Systems | Go | 5 | 5 | 5 | 4 | 4 | 5 | 28 | 45 | 70% | READY FOR FULL ANTHROCENTRIX BENCHMARK |
| Prolific Autoresearch HITL | ProlificAI · 300 participants × pairwise judgments | Per-pair judgments with participant IDs and Bradley-Terry outcomes; lighter on decision-time telemetry. | Human-AI Collaboration | DPO pair selection | 5 | 3 | 5 | 3 | 3 | 4 | 23 | 44 | 45% | READY FOR REDUCED TELEMETRY BENCHMARK |
| Intertemporal-Choice RT (Pongratz & Schoemann) | Scientific Data 2026 · Large-scale participants × choice + RT | Per-participant RT-rich design; 'quality' is model-derived rather than gold. | Cognitive Science / Response-Time | Intertemporal choice | 5 | 4 | 5 | 5 | 3 | 2 | 24 | 44 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| IRC Poker Database | University of Alberta · 10M+ hands | Per-player history, action timing weakly captured. | Operational Decision Systems | Online poker | 5 | 4 | 5 | 3 | 3 | 5 | 25 | 44 | 45% | READY FOR REDUCED TELEMETRY BENCHMARK |
| CIFAR-10H (soft labels from human raters) | Peterson et al., 2019 · 10k CIFAR-10 test images × 50+ raters | Per-trial RT and worker IDs; ground-truth class labels available. Limited context features. | Human Annotation | Image labeling | 4 | 3 | 5 | 3 | 4 | 2 | 21 | 43 | 55% | READY FOR REDUCED TELEMETRY BENCHMARK |
| MIMIC-IV (base EHR) | PhysioNet (credentialed) · ~300k patients | Clinician order timing as weak telemetry. Credentialed. | Clinical Decision Making | ICU EHR | 4 | 5 | 4 | 2 | 3 | 5 | 23 | 43 | 35% | READY FOR REDUCED TELEMETRY BENCHMARK |
| Agent Traces: Customer Support Triage | Julien Simon / HF · 1,483 events · 50 runs | Synthetic, small. Useful as a workflow shape reference, not a primary benchmark. | Human-AI Collaboration | Multi-agent workflow | 4 | 4 | 5 | 3 | 3 | 4 | 23 | 43 | 30% | READY FOR REDUCED TELEMETRY BENCHMARK |
| MIMIC-IV-Ext Clinical Decision Making | PhysioNet (credentialed) · MIMIC-IV derived | Credentialed access (PhysioNet CITI training required). High clinical value if cleared. | Clinical Decision Making | Abdominal pathology | 3 | 5 | 4 | 2 | 4 | 5 | 23 | 42 | 35% | READY FOR REDUCED TELEMETRY BENCHMARK |
| IEEE-CIS Fraud Detection | Vesta / Kaggle · ~590k transactions | No human reviewer dimension; algorithmic baseline only. | Fraud Review | Card-not-present fraud | 0 | 5 | 3 | 1 | 5 | 5 | 19 | 41 | 10% | NOT SUITABLE FOR ANTHROCENTRIX |
| OpenAI Moderation Evaluation Dataset | OpenAI · 1,680 prompts × multi-rater labels | Gold labels but no decision-time telemetry. | Content Moderation | Text safety | 2 | 3 | 4 | 0 | 5 | 2 | 16 | 40 | 10% | NOT SUITABLE FOR ANTHROCENTRIX |
| RSNA Pneumonia Detection (radiologist reads) | RSNA / Kaggle · ~30k images, multi-rater | Strong labels, weak telemetry. | Medical Coding / Radiology | Chest X-ray reads | 3 | 4 | 4 | 1 | 4 | 3 | 19 | 40 | 25% | READY FOR REDUCED TELEMETRY BENCHMARK |
| AMLSim | IBM Research · Simulator (configurable scale) | Simulator; would require injecting synthetic analyst layer to test Anthrocentrix. | AML Review | Anti-money-laundering | 3 | 4 | 3 | 1 | 4 | 3 | 18 | 39 | 15% | NOT SUITABLE FOR ANTHROCENTRIX |
| DynaSent (dynamic sentiment annotation) | Stanford · 121,634 sentences | Worker IDs and rounds; weak timing. | Human Annotation | Sentiment | 3 | 3 | 4 | 2 | 4 | 3 | 19 | 39 | 30% | READY FOR REDUCED TELEMETRY BENCHMARK |
| Taskmaster-3 (TicketTalk) | Google Research · 23,789 dialogs | Self-dialog format limits behavioral telemetry. | Customer Service / Contact Center | Movie ticket dialog | 2 | 4 | 4 | 1 | 3 | 4 | 18 | 38 | 30% | READY FOR REDUCED TELEMETRY BENCHMARK |
| Berkeley DeepDrive: Driving Decisions | BDD100K · 100k videos | Telemetry-rich but no per-decision quality labels. | Human Factors / Driving | Driving decisions | 2 | 5 | 4 | 3 | 2 | 3 | 19 | 38 | 20% | TELEMETRY ONLY DATASET |
| Quality of RT Data Inference (Blinded Assessment) | OSF g7ka7 · Multi-lab collaborative assessment | Excellent RT telemetry; weak quality labels. | Cognitive Science / Response-Time | Cognitive modeling | 4 | 3 | 4 | 5 | 2 | 1 | 19 | 36 | 30% | TELEMETRY ONLY DATASET |
| ML-Fairness-Gym (hiring + loans) | Google Research · Simulator | Sim only. | Recruiting / Hiring | Sequential decisions | 2 | 4 | 4 | 1 | 3 | 4 | 18 | 36 | 10% | NOT SUITABLE FOR ANTHROCENTRIX |
| HANNA-LLMEval | Chhun et al., 2022; bay-calibration-llm-evaluators · 1,056 stories × multi-rater Likert | Rater IDs and Likert criteria but minimal behavioral telemetry. Useful as a context-control benchmark. | Human Annotation | Story rating | 4 | 3 | 4 | 2 | 3 | 1 | 17 | 35 | 25% | TELEMETRY ONLY DATASET |
| Jigsaw Toxic Comment Classification | Conversation AI / Kaggle · ~160k comments × multi-label | Strong labels, no telemetry, no annotator IDs in public release. | Content Moderation | Online comments | 1 | 3 | 3 | 0 | 4 | 2 | 13 | 35 | 5% | NOT SUITABLE FOR ANTHROCENTRIX |
| CheXpert | Stanford · 224,316 reports | Label extraction from reports; no decision-time telemetry. | Medical Coding / Radiology | Chest X-ray labeling | 2 | 4 | 3 | 0 | 4 | 3 | 16 | 34 | 10% | NOT SUITABLE FOR ANTHROCENTRIX |
| Reddit Moderation Actions (Pushshift) | Pushshift / academic mirrors · Billions of comments | Pushshift access restricted post-2023. | Content Moderation | Subreddit moderation | 3 | 4 | 4 | 2 | 3 | 3 | 19 | 34 | 25% | READY FOR REDUCED TELEMETRY BENCHMARK |
| SNLI / MNLI annotator-level (eraser-style) | Stanford NLP / Bowman et al. · 570k pairs | Public release lacks per-worker timing. | Human Annotation | NLI labeling | 2 | 3 | 4 | 0 | 4 | 2 | 15 | 33 | 10% | NOT SUITABLE FOR ANTHROCENTRIX |
| CallCenterEN (PII-redacted transcripts) | AIxBlock / arXiv:2507.02958 · 91,706 transcripts · 10,448 hours | Verified end-to-end: word-level timing + ASR confidence, but no audio, no diarization, no CSAT. Pseudo-diarization adapter exists. | Customer Service / Contact Center | Inbound/outbound calls | 0 | 2 | 1 | 3 | 0 | 0 | 6 | 31 | 15% | TELEMETRY ONLY DATASET |
| DAIR-AI Emotion (per-rater) | DAIR-AI · 20k tweets | Aggregated labels only. | Behavioral Health | Text emotion | 1 | 3 | 3 | 0 | 4 | 2 | 13 | 30 | 5% | NOT SUITABLE FOR ANTHROCENTRIX |
| Switchboard-1 Telephone Speech | LDC · 260 hours, ~2,400 conversations | LDC-licensed; has true diarization. No quality labels. | Customer Service / Contact Center | Conversational speech | 4 | 3 | 3 | 4 | 1 | 1 | 16 | 28 | 15% | TELEMETRY ONLY DATASET |
Column key: A = actor identity, C = context complexity, D = decision-event granularity, B = behavioral telemetry, Q = decision-quality label, O = outcome label. Each is scored 0–5, and Readiness (/30) is the sum of the six. Opportunity (/60) = Readiness (/30) + a combined score for commercial relevance, ease of access, and replication value (/30). P is the estimated probability of a meaningful replication. The scoring methodology and full registry are also exported as a PDF report.
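The score arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the `other` argument (the combined commercial relevance, ease of access, and replication value component) is an assumption, since the registry publishes only the Readiness and Opportunity totals. The value 29 used for ImageNet-AB is back-computed as 56 minus 27.

```python
DIMENSIONS = ("A", "C", "D", "B", "Q", "O")


def readiness(scores):
    """Sum the six 0-5 dimension scores into a Readiness value out of 30."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 5:
            raise ValueError(f"{d} must be in 0..5, got {scores[d]}")
    return sum(scores[d] for d in DIMENSIONS)


def opportunity(scores, other):
    """Readiness (/30) plus the combined non-readiness component (/30).

    `other` is a hypothetical name for commercial relevance + ease of access
    + replication value taken together.
    """
    if not 0 <= other <= 30:
        raise ValueError(f"other must be in 0..30, got {other}")
    return readiness(scores) + other


# Dimension scores copied from the registry rows.
imagenet_ab = {"A": 5, "C": 4, "D": 5, "B": 5, "Q": 5, "O": 3}
callcenter_en = {"A": 0, "C": 2, "D": 1, "B": 3, "Q": 0, "O": 0}

print(readiness(imagenet_ab))              # 27
print(opportunity(imagenet_ab, other=29))  # 56 (29 is back-computed, not published)
print(readiness(callcenter_en))            # 6
```

Re-running `readiness` over each registry row is a quick consistency check that the Readiness column matches its six dimension scores.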