Dataset Qualification

Pre-analysis gate. A dataset cannot enter ACTIVE TEST until all five core criteria are satisfied: actor identity, case context, explicit decision, observable outcome, and repeated observations per actor.
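The gate can be read as a simple all-of check over five booleans. The sketch below is illustrative only: it assumes the five core criteria map onto the first five names listed under "Qualification fields" in the evidence cards, and `CoreCriteria`, `missing_criteria`, and `gate_clear` are hypothetical names, not the platform's server-side implementation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CoreCriteria:
    # Assumed mapping of the five core criteria to the card field names.
    actor_present: bool
    case_context_present: bool
    decision_present: bool
    downstream_outcome_present: bool
    repeated_actors_present: bool


def missing_criteria(c: CoreCriteria) -> list[str]:
    """Return the names of the criteria that are not satisfied."""
    return [f.name for f in fields(c) if not getattr(c, f.name)]


def gate_clear(c: CoreCriteria) -> bool:
    """A dataset may enter ACTIVE TEST only if nothing is missing."""
    return not missing_criteria(c)


# MultiWOZ 2.2 in the register is blocked on actor identity and
# repeated observations per actor.
multiwoz = CoreCriteria(False, True, True, True, False)
print(missing_criteria(multiwoz))
print(gate_clear(multiwoz))
```

Returning the list of missing criteria, rather than a bare boolean, matches the way the register reports why a dataset is blocked.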

Warning · Do not interpret actor main effects as execution-fit. Anthrocentrix proof requires actor × case interaction and ideally downstream causal improvement.

Evidence axes — project-wide

Every dataset result is reported on four independent axes. A result on one axis is never treated as a result on another — actor main effects are not execution-fit, and modeled policy lift is not a causal proof.

Actor Main Effect
Does actor identity add information beyond case context? A positive result here is necessary but NOT sufficient for execution-fit.
Actor main effects are proven across crossed actor×case datasets (EOIR, STAR, LaborSupply, Grunfeld, Arizona Open Policing).
STRONGLY SUPPORTED
Actor × Case Interaction (Execution-Fit)
Does the actor's contribution depend on the case? This is the execution-fit claim. A main effect alone does not establish it.
Execution-fit is proven to exist. MLB Statcast Umpires (1.47M called pitches, 124 umpires) shows an interaction share of 67.8% (95% CI 58.8%–75.9%), classified INTERACTION_DOMINATED. EOIR shows a weak interaction; Arizona is main-effect dominated. Allocation strategy must be measured per domain, not assumed.
SUPPORTED
Policy Gain (actor-aware vs actor-blind)
Does an actor-aware routing/conditioning policy beat the actor-blind baseline in offline/modeled evaluation? Modeled lift only — not yet a causal claim.
Actor-aware allocation produces measurable downstream gains (e.g. Arizona Open Policing: +1.3 pts recall over the actor-blind baseline). Observed across multiple domains; magnitude is domain-dependent.
SUPPORTED
Causal Confidence
Is the policy gain backed by random or quasi-random assignment so it can be read causally? Without this, lift can be confounded by selection.
Causal proof is domain dependent. MLB variance decomposition supports a high-confidence interaction read in that domain; EOIR and Arizona remain associational pending quasi-random assignment.
WEAKLY SUPPORTED

Observed domain types

Allocation strategy must be measured, not assumed. The platform classifies each domain empirically by decomposition.

Main-Effect Dominated
e.g. Arizona Open Policing
Actor quality drives most of the observed actor value. Allocate on actor quality; execution-fit layer stays OFF.
Interaction-Dominated
e.g. MLB Statcast Umpires
Actor × case interaction drives most of the observed actor value. Activate execution-fit allocation.

Benchmark leaderboard

Dataset | Verdict | Interaction share | Note
MLB Statcast Umpires | INTERACTION DOMINATED | 67.8% · CI 58.8%–75.9% | 1.47M called pitches · 124 umpires · variance decomposition
Arizona Open Policing | MAIN EFFECT DOMINATED | 9% | +1.3 pts recall; 91% actor main effect / 9% execution-fit
EOIR Judges | WEAK INTERACTION | ≈30% | Interaction statistically significant (p<0.001) but below practical-meaningfulness threshold

Dataset evidence cards

MLB Statcast Umpires (2021–2024)
Sports · human decision making · ball/strike calls
INTERACTION DOMINATED · Actor × case interaction confirmed

Execution-fit dominates actor main effects in a multidimensional decision environment. Variance decomposition of calibration residuals over 1.47M called pitches across 124 umpires yields an interaction share of 67.8% (95% CI 58.8%–75.9%) — the first INTERACTION_DOMINATED benchmark in the Anthrocentrix evidence base.

Actor × case interaction is large and replicated; downstream causal benefit not yet demonstrated.
Evidence axes
Actor Main Effect
Umpire identity contributes a stable main-effect component, but it is the minority share of total actor value in this domain.
SUPPORTED
Actor × Case Interaction (Execution-Fit)
Interaction share 67.8% (95% CI 58.8%–75.9%). Umpire × pitch-context interaction materially exceeds main effects and clears the deployment threshold.
STRONGLY SUPPORTED
Policy Gain (actor-aware vs actor-blind)
Decomposition implies large headroom for execution-fit-aware allocation in pitch-call contexts.
SUPPORTED
Causal Confidence
Causal confidence HIGH for the decomposition claim: 1.47M pitches and 272,978 contested pitches across 124 actors yield tight CIs; variance decomposition is identified within the calibration-residual frame.
SUPPORTED
Actor Main Effect: 32% of policy gain
Execution-Fit: 68% of policy gain
Policy Gain: interaction-dominated headroom (decomposition-implied) vs actor-blind baseline
Fit Layer: ON. Execution-fit contributes 68% of total actor value, clearing the 25% deployment threshold; CI does not cross the threshold.
Decomposition gate: PASS. Interaction share 67.8% (CI 58.8%–75.9%) materially exceeds the 25% deployment threshold.
Volume gate: PASS. 1.47M called pitches across 124 repeated actors.
Causal confidence gate: QUASI-CAUSAL. Variance decomposition is identified within the calibration-residual frame; causal confidence labeled HIGH for the decomposition claim.
Called pitches: 1,469,350
Umpires: 124
Contested pitches: 272,978
Method: Variance decomposition of calibration residuals
Interaction share: 67.8%
Interaction 95% CI: 58.8%–75.9%
Verdict: INTERACTION DOMINATED
Causal confidence: HIGH
Evidence status: YES
Qualification fields
actor_present
case_context_present
decision_present
downstream_outcome_present
repeated_actors_present
assignment_random_or_quasi_random
proof_eligibility_score: 92/100
commercial_relevance_score: 7/10
current_status: INTERACTION_CONFIRMED
First INTERACTION_DOMINATED benchmark. Demonstrates that some domains are genuinely execution-fit-led; allocation strategy must be measured per domain, not assumed.
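To make the idea of an interaction share concrete, here is a minimal sketch of a two-way decomposition on a balanced actor × case table of mean residuals. It is an assumption-laden illustration, not the platform's method: the function name `interaction_share` is hypothetical, the table is assumed to be fully crossed with one mean per cell, and sampling noise in the cell means (which would inflate the interaction term) is ignored. No confidence interval is computed.

```python
import numpy as np


def interaction_share(cell_means: np.ndarray) -> float:
    """Share of actor-related variation attributable to actor x case interaction.

    cell_means[i, j] is the mean residual for actor i in case-context bin j.
    Assumes a balanced, fully crossed table (hypothetical simplification).
    """
    grand = cell_means.mean()
    actor_effect = cell_means.mean(axis=1, keepdims=True) - grand  # row effects
    case_effect = cell_means.mean(axis=0, keepdims=True) - grand   # column effects
    interaction = cell_means - grand - actor_effect - case_effect  # what is left

    n_cases = cell_means.shape[1]
    ss_actor = n_cases * float((actor_effect ** 2).sum())
    ss_interaction = float((interaction ** 2).sum())
    total = ss_actor + ss_interaction
    return ss_interaction / total if total > 0 else 0.0


# Actors differ only in level: no interaction.
pure_main = np.array([[1.0, 1.0], [3.0, 3.0]])
# One actor deviates in one context only: half main effect, half interaction.
mixed = np.array([[2.0, 0.0], [0.0, 0.0]])

print(interaction_share(pure_main))  # 0.0
print(interaction_share(mixed))      # 0.5
```

The case main effect is removed before computing the share, so the ratio compares only the two actor-related components: what the actor contributes everywhere versus what the actor contributes in specific contexts.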
Arizona Open Policing
Traffic stops · officer × incident
MAIN EFFECT DOMINATED · Actor × case interaction detected

Arizona is the first dataset where an actor-aware allocation policy materially improved a downstream outcome. The bulk of the gain (~91%) is actor main effect; execution-fit contributes the remaining ~9%. Result is associational, not causal.

Statistically significant actor × case interaction exists, but magnitude is modest and/or downstream causal effect is unproven.
Evidence axes
Actor Main Effect
Officer identity carries strong, stable performance signal independent of incident type — drives ~91% of observed policy gain.
STRONGLY SUPPORTED
Actor × Case Interaction (Execution-Fit)
Officer × incident execution-fit is real but small; contributes only ~9% of total actor value.
WEAKLY SUPPORTED
Policy Gain (actor-aware vs actor-blind)
Actor-aware allocation produced a measurable downstream improvement (+1.3 points recall) over the actor-blind baseline.
WEAKLY SUPPORTED
Causal Confidence
Assignment is not random or quasi-random; gain is associational and cannot yet be read as a causal interventional claim.
NOT YET ESTABLISHED
Actor Main Effect: 91% of policy gain
Execution-Fit: 9% of policy gain
Policy Gain: +1.3 points recall vs actor-blind baseline
Fit Layer: OFF. Execution-fit contributes only 9% of total actor value and does not clear deployment threshold (25%).
Decomposition gate: BLOCKED. Execution-fit is real (statistically significant) but contributes <25% of total actor gain.
Volume gate: PASS. Repeated officers across many stops; volume gate cleared.
Causal confidence gate: ASSOCIATIONAL. Officer assignment is not random/quasi-random; observed gain cannot yet be read causally.
Policy gain: +1.3 pts recall
Actor main effect share: 91%
Execution-fit share: 9%
Verdict: MAIN EFFECT DOMINATED
Causal confidence: ASSOCIATIONAL
Fit layer: OFF
Qualification fields
actor_present
case_context_present
decision_present
downstream_outcome_present
repeated_actors_present
assignment_random_or_quasi_random
proof_eligibility_score: 78/100
commercial_relevance_score: 7/10
current_status: INTERACTION_DETECTED
Headline result: actor-aware allocation works. Mechanism is dominated by actor quality, not execution-fit. Anthrocentrix must report this as an Actor Quality Allocation win, not an execution-fit win.
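The Fit Layer ON/OFF decisions on the MLB and Arizona cards can be sketched as a threshold check on the execution-fit share. This is a hedged reading of the cards, not the platform's rule: `fit_layer_on` is a hypothetical helper, the strict `>` comparison is an assumption, and requiring the CI lower bound (when available) to clear the threshold is inferred from the phrase "CI does not cross the threshold". EOIR is gated on a separate AUC-based practical-meaningfulness threshold and is not covered here.

```python
from typing import Optional

DEPLOYMENT_THRESHOLD = 0.25  # the 25% deployment threshold cited on the cards


def fit_layer_on(share: float,
                 ci_low: Optional[float] = None,
                 threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """Return True if the execution-fit share clears the deployment threshold.

    If a CI lower bound is supplied, it must clear the threshold too
    (assumed interpretation of "CI does not cross the threshold").
    """
    lower = share if ci_low is None else min(share, ci_low)
    return lower > threshold


# MLB Statcast Umpires: share 67.8%, 95% CI lower bound 58.8%.
print(fit_layer_on(0.678, ci_low=0.588))  # True
# Arizona Open Policing: share 9%, no CI reported on the card.
print(fit_layer_on(0.09))                 # False
```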
U.S. Immigration Judges (EOIR)
Asylum adjudication
MAIN EFFECT DOMINATED · Actor × case interaction detected

EOIR shows statistically significant judge × case execution-fit in decision behavior, but the effect is modest and downstream causal improvement is not proven.

Statistically significant actor × case interaction exists, but magnitude is modest and/or downstream causal effect is unproven.
Evidence axes
Actor Main Effect
Actor identity adds information beyond context (actor lift +0.0082 AUC over context-only).
STRONGLY SUPPORTED
Actor × Case Interaction (Execution-Fit)
Judge × case interaction is statistically significant (AUC +0.0036, 95% CI [0.0026, 0.0046], p<0.001) but below the 0.01 practical-meaningfulness threshold.
WEAKLY SUPPORTED
Policy Gain (actor-aware vs actor-blind)
Actor-aware policy beats the actor-blind baseline in modeled offline evaluation, but the lift is small.
WEAKLY SUPPORTED
Causal Confidence
Judge assignment is not quasi-random in EOIR; Phase 3 downstream reversal/remand test did not establish a causal interventional claim.
NOT YET ESTABLISHED
Actor Main Effect: 70% of policy gain
Execution-Fit: 30% of policy gain
Policy Gain: small modeled lift vs actor-blind baseline; not validated downstream
Fit Layer: OFF. Execution-fit is statistically significant but below the practical-meaningfulness threshold; does not clear deployment gate.
Decomposition gate: BLOCKED. Interaction is statistically significant but practically below threshold.
Volume gate: PASS. 300,000 modeled decisions across many judges.
Causal confidence gate: ASSOCIATIONAL. Judge assignment is not quasi-random; no clean interventional contrast.
Merits decisions scanned: 6,485,038
Modeled sample: 300,000
Context-only AUC: 0.8947
Actor + context AUC: 0.9029
Actor × case AUC: 0.9065
Actor main lift: +0.0082
Interaction lift: +0.0036
Interaction 95% CI: [0.0026, 0.0046]
p-value: < 0.001
Interaction as % of main effect: 43.5%
Verdict: MAIN EFFECT DOMINATED
Qualification fields
actor_present
case_context_present
decision_present
downstream_outcome_present
repeated_actors_present
assignment_random_or_quasi_random
proof_eligibility_score: 72/100
commercial_relevance_score: 6/10
current_status: INTERACTION_DETECTED
Phase 3 downstream reversal/remand test did NOT prove the interventional claim — judge assignment was not quasi-random and the downstream actor × case interaction was practically negligible. Phase 4 decision-level execution-fit test detected a statistically significant but practically modest interaction (below the 0.01 practical-meaningfulness threshold).
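The EOIR lifts follow directly from the three AUC values on the card, and the arithmetic is worth spelling out. The sketch below assumes the roughly 30% interaction share is the interaction lift divided by the sum of the two lifts; that reading is an inference from the numbers, not something the card states. With the rounded lifts, the ratio of interaction to main effect comes out near 43.9%, slightly different from the 43.5% on the card, which may reflect unrounded inputs.

```python
# AUC values as reported on the EOIR card.
context_only_auc = 0.8947
actor_context_auc = 0.9029
actor_case_auc = 0.9065

PRACTICAL_THRESHOLD = 0.01  # practical-meaningfulness threshold from the card

# Lifts, rounded to the card's four decimals.
main_lift = round(actor_context_auc - context_only_auc, 4)       # 0.0082
interaction_lift = round(actor_case_auc - actor_context_auc, 4)  # 0.0036

# Assumed definition of the share: interaction lift over total actor lift.
interaction_share = interaction_lift / (main_lift + interaction_lift)
ratio_to_main = interaction_lift / main_lift
clears_practical = interaction_lift >= PRACTICAL_THRESHOLD

print(f"main lift        = {main_lift:+.4f}")
print(f"interaction lift = {interaction_lift:+.4f}")
print(f"interaction share of actor lift = {interaction_share:.1%}")
print(f"interaction / main effect       = {ratio_to_main:.1%}")
print(f"clears 0.01 threshold: {clears_practical}")
```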

Claim ladder

Actor effects exist
Actor main effects proven across crossed actor×case datasets — EOIR, STAR, LaborSupply, Grunfeld, and Arizona Open Policing all show actor identity adds information beyond context.
SUPPORTED
Actor × case fit exists
MLB Statcast Umpires (1.47M called pitches, 124 umpires): interaction share 67.8% (95% CI 58.8%–75.9%). Execution-fit is proven to exist in at least one production-scale domain.
SUPPORTED
Actor-aware allocation improves downstream outcomes
Arizona Open Policing: actor-aware allocation produced +1.3 points recall over the actor-blind baseline. Observed; magnitude is domain-dependent.
SUPPORTED
Execution-fit can be the dominant mechanism in some domains
MLB Statcast: INTERACTION_DOMINATED (68% execution-fit share). Arizona: MAIN_EFFECT_DOMINATED (9%). EOIR: WEAK_INTERACTION. Allocation strategy must be measured per domain, not assumed.
SUPPORTED
Actor-conditioned intervention causally improves outcomes
Causal proof is domain dependent. MLB variance decomposition is high-confidence within its frame; EOIR and Arizona remain associational pending quasi-random assignment.
WEAKLY SUPPORTED

Next proof target

PhysioNet MIMIC-IV — clinician × patient decision dataset
Closest available dataset that supplies all components needed to clear the downstream causal-improvement bar.
Required proof
  • · clinician identity
  • · patient context
  • · treatment / decision
  • · downstream outcome
  • · repeated clinicians
  • · actor × case interaction
  • · policy lift on downstream outcome

Datasets: 40
Gate-clear (all 5 criteria): 9
ELIGIBLE: 7
ACTIVE TEST: 0

Lifecycle distribution

DISCOVERED: 2 · SCREENING: 21 · REJECTED: 8 · PARKED: 1 · ELIGIBLE: 7 · ACTIVE TEST: 0 · PROVEN: 1 · FAILED: 0

Qualification register

ImageNet-AB (Annotation Byproducts)
Human Annotation · Image labeling · ~1.28M images × annotation traces
Proof candidate score: 93 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 1,280,000
Domain: Image labeling
Commercial relevance: 9/10
Proposed state: ELIGIBLE
State: ELIGIBLE
Lichess + Stockfish blunder labels
Operational Decision Systems · Chess · ~5B games available; Anthrocentrix used 112k events
Proof candidate score: 90 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 5,000,000,000
Domain: Chess
Commercial relevance: 4/10
Proposed state: PROVEN
State: PROVEN
Trueblood et al. Medical Decision RT
Clinical Decision Making · Pathology classification (blast vs non-blast) · Pathologists + novices · per-trial RT and accuracy
Proof candidate score: 87 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Pathology classification (blast vs non-blast)
Commercial relevance: 6/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
FiFAR — Fairness in AI-assisted Fraud Review
Fraud Review · AI-assisted fraud analyst decisions · Synthetic but realistic; analyst IDs + decisions
Proof candidate score: 87 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: AI-assisted fraud analyst decisions
Commercial relevance: 9/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
ABCD — Action-Based Conversations Dataset
Customer Service / Contact Center · Task-oriented customer support · 10,042 dialogs · 55 user intents · 30 agent actions
Proof candidate score: 85 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 10,042
Domain: Task-oriented customer support
Commercial relevance: 8/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
AI.vs.Clinician (sepsis trial)
Clinical Decision Making · Sepsis early warning · Multi-site randomized trial logs
Proof candidate score: 82 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Sepsis early warning
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
ReviewArena / ReviewBench (peer review)
Content Moderation / Peer Review · Scientific peer review · 51,529 papers · 196,099 reviews · 22 venues
Proof candidate score: 80 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 196,099
Domain: Scientific peer review
Commercial relevance: 5/10
Proposed state: ELIGIBLE
State: ELIGIBLE
StarCraft II Replay Pack
Operational Decision Systems · RTS gameplay · Millions of replays
Proof candidate score: 80 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: RTS gameplay
Commercial relevance: 3/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Mind2Web
Human-AI Collaboration · Web navigation · 2,350 tasks · 137 websites
Proof candidate score: 78 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 2,350
Domain: Web navigation
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
WebArena (with human trajectories)
Human-AI Collaboration · Web tasks · Human + agent trajectories
Proof candidate score: 78 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Web tasks
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Wikipedia Edit History (per-editor)
Operational Decision Systems · Edits + reverts · Multi-TB
Proof candidate score: 78 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Edits + reverts
Commercial relevance: 4/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Tweet Annotation Sensitivity 2
Human Annotation · Text labeling · ~89k annotation events
Proof candidate score: 75 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 89,000
Domain: Text labeling
Commercial relevance: 7/10
Proposed state: ELIGIBLE
State: ELIGIBLE
MultiWOZ 2.2
Customer Service / Contact Center · Multi-domain task-oriented dialog · ~10k dialogs · 7 domains
Proof candidate score: 75 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 10,000
Domain: Multi-domain task-oriented dialog
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: SCREENING
International Brain Lab — Decision Task
Cognitive Science / Response-Time · Mouse 2AFC perception · Millions of trials, hundreds of subjects
Proof candidate score: 75 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Mouse 2AFC perception
Commercial relevance: 2/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Stack Exchange Data Dump (Q&A)
Operational Decision Systems · Q&A moderation · Multi-TB across sites
Proof candidate score: 75 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Q&A moderation
Commercial relevance: 4/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Online-Go.com Game Archive
Operational Decision Systems · Go · Millions of games
Proof candidate score: 75 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Go
Commercial relevance: 2/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Prolific Autoresearch HITL
Human-AI Collaboration · DPO pair selection · 300 participants × pairwise judgments
Proof candidate score: 73 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 300
Domain: DPO pair selection
Commercial relevance: 6/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Intertemporal-Choice RT (Pongratz & Schoemann)
Cognitive Science / Response-Time · Intertemporal choice · Large-scale participants × choice + RT
Proof candidate score: 73 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Intertemporal choice
Commercial relevance: 3/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
IRC Poker Database
Operational Decision Systems · Online poker · 10M+ hands
Proof candidate score: 73 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 10,000,000
Domain: Online poker
Commercial relevance: 4/10
Proposed state: ELIGIBLE
State: ELIGIBLE
CIFAR-10H (soft labels from human raters)
Human Annotation · Image labeling · 10k CIFAR-10 test images × 50+ raters
Proof candidate score: 72 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 10,000
Domain: Image labeling
Commercial relevance: 5/10
Proposed state: ELIGIBLE
State: ELIGIBLE
MIMIC-IV (base EHR)
Clinical Decision Making · ICU EHR · ~300k patients
Proof candidate score: 72 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 300,000
Domain: ICU EHR
Commercial relevance: 8/10
Proposed state: ELIGIBLE
State: ELIGIBLE
Agent Traces: Customer Support Triage
Human-AI Collaboration · Multi-agent workflow · 1,483 events · 50 runs
Proof candidate score: 72 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 1,483
Domain: Multi-agent workflow
Commercial relevance: 6/10
Proposed state: ELIGIBLE
State: ELIGIBLE
MIMIC-IV-Ext Clinical Decision Making
Clinical Decision Making · Abdominal pathology · MIMIC-IV derived
Proof candidate score: 70 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Abdominal pathology
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
IEEE-CIS Fraud Detection
Fraud Review · Card-not-present fraud · ~590k transactions
Proof candidate score: 68 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 590,000
Domain: Card-not-present fraud
Commercial relevance: 9/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
OpenAI Moderation Evaluation Dataset
Content Moderation · Text safety · 1,680 prompts × multi-rater labels
Proof candidate score: 67 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 1,680
Domain: Text safety
Commercial relevance: 8/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
RSNA Pneumonia Detection (radiologist reads)
Medical Coding / Radiology · Chest X-ray reads · ~30k images, multi-rater
Proof candidate score: 67 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 30,000
Domain: Chest X-ray reads
Commercial relevance: 7/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
AMLSim
AML Review · Anti-money-laundering · Simulator (configurable scale)
Proof candidate score: 65 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Anti-money-laundering
Commercial relevance: 8/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: REJECTED
DynaSent (dynamic sentiment annotation)
Human Annotation · Sentiment · 121,634 sentences
Proof candidate score: 65 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 121,634
Domain: Sentiment
Commercial relevance: 5/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
Taskmaster-3 (TicketTalk)
Customer Service / Contact Center · Movie ticket dialog · 23,789 dialogs
Proof candidate score: 63 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 23,789
Domain: Movie ticket dialog
Commercial relevance: 5/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: SCREENING
Berkeley DeepDrive — Driving Decisions
Human Factors / Driving · Driving decisions · 100k videos
Proof candidate score: 63 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 100,000
Domain: Driving decisions
Commercial relevance: 6/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: SCREENING
Quality of RT Data Inference (Blinded Assessment)
Cognitive Science / Response-Time · Cognitive modeling · Multi-lab collaborative assessment
Proof candidate score: 60 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Cognitive modeling
Commercial relevance: 2/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: observable outcome, repeated observations per actor.
State: SCREENING
ML-Fairness-Gym (hiring + loans)
Recruiting / Hiring · Sequential decisions · Simulator
Proof candidate score: 60 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Sequential decisions
Commercial relevance: 6/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
HANNA-LLMEval
Human Annotation · Story rating · 1,056 stories × multi-rater Likert
Proof candidate score: 58 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 1,056
Domain: Story rating
Commercial relevance: 4/10
Proposed state: PARKED
State: PARKED
Jigsaw Toxic Comment Classification
Content Moderation · Online comments · ~160k comments × multi-label
Proof candidate score: 58 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 160,000
Domain: Online comments
Commercial relevance: 8/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
CheXpert
Medical Coding / Radiology · Chest X-ray labeling · 224,316 reports
Proof candidate score: 57 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 224,316
Domain: Chest X-ray labeling
Commercial relevance: 7/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
Reddit Moderation Actions (Pushshift)
Content Moderation · Subreddit moderation · Billions of comments
Proof candidate score: 57 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 0
Domain: Subreddit moderation
Commercial relevance: 5/10
Proposed state: SCREENING
Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.
State: SCREENING
SNLI / MNLI annotator-level (eraser-style)
Human Annotation · NLI labeling · 570k pairs
Proof candidate score: 55 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 570,000
Domain: NLI labeling
Commercial relevance: 4/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
CallCenterEN (PII-redacted transcripts)
Customer Service / Contact Center · Inbound/outbound calls · 91,706 transcripts · 10,448 hours
Proof candidate score: 52 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 91,706
Domain: Inbound/outbound calls
Commercial relevance: 9/10
Proposed state: DISCOVERED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, case context, explicit decision, observable outcome, repeated observations per actor.
State: DISCOVERED
DAIR-AI Emotion (per-rater)
Behavioral Health · Text emotion · 20k tweets
Proof candidate score: 50 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 20,000
Domain: Text emotion
Commercial relevance: 4/10
Proposed state: REJECTED
Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.
State: REJECTED
Switchboard-1 Telephone Speech
Customer Service / Contact Center · Conversational speech · 260 hours, ~2,400 conversations
Proof candidate score: 47 / 100
Criteria: Actor · Case context · Decision · Outcome · Repeated actors
Sample size: 2,400
Domain: Conversational speech
Commercial relevance: 4/10
Proposed state: DISCOVERED
Gate blocked: cannot enter ACTIVE TEST. Missing: observable outcome, repeated observations per actor.
State: DISCOVERED

Lifecycle states: DISCOVERED → SCREENING → (REJECTED · PARKED) → ELIGIBLE → ACTIVE TEST → (PROVEN · FAILED). ACTIVE TEST is gated server-side by the five core criteria; all other transitions are operator-driven and persisted locally. See also: discovery registry.
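
The lifecycle above can be sketched as a small state machine. The transition table below is one reading of the arrow notation, not a confirmed specification: it assumes REJECTED and PARKED are reachable only from SCREENING and are terminal, and the names `State`, `ALLOWED`, and `can_transition` are hypothetical. The only hard rule taken from the text is that entry to ACTIVE TEST requires the five-criteria gate to be clear.

```python
from enum import Enum


class State(Enum):
    DISCOVERED = "DISCOVERED"
    SCREENING = "SCREENING"
    REJECTED = "REJECTED"
    PARKED = "PARKED"
    ELIGIBLE = "ELIGIBLE"
    ACTIVE_TEST = "ACTIVE TEST"
    PROVEN = "PROVEN"
    FAILED = "FAILED"


# Assumed transition graph, read from the arrow notation.
ALLOWED = {
    State.DISCOVERED: {State.SCREENING},
    State.SCREENING: {State.REJECTED, State.PARKED, State.ELIGIBLE},
    State.ELIGIBLE: {State.ACTIVE_TEST},
    State.ACTIVE_TEST: {State.PROVEN, State.FAILED},
    State.REJECTED: set(),  # treated as terminal in this sketch
    State.PARKED: set(),    # treated as terminal in this sketch
    State.PROVEN: set(),
    State.FAILED: set(),
}


def can_transition(current: State, target: State, gate_clear: bool) -> bool:
    """Check a proposed transition; ACTIVE TEST also needs the gate to be clear."""
    if target not in ALLOWED[current]:
        return False
    if target is State.ACTIVE_TEST and not gate_clear:
        return False
    return True


print(can_transition(State.ELIGIBLE, State.ACTIVE_TEST, gate_clear=True))   # True
print(can_transition(State.ELIGIBLE, State.ACTIVE_TEST, gate_clear=False))  # False
print(can_transition(State.DISCOVERED, State.ELIGIBLE, gate_clear=True))    # False
```

Keeping the gate check inside the transition function mirrors the split described above: one transition is enforced by rule, the rest are left to the operator.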