Dataset Qualification

Pre-analysis gate. A dataset cannot enter ACTIVE TEST until all five core criteria are satisfied: actor identity, case context, explicit decision, observable outcome, and repeated observations per actor.
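The gate can be read as a simple all-of check over the five criteria. The sketch below is illustrative only: the `Qualification` record, its field names, and the helper functions are hypothetical and are not taken from the platform's code. The example row mirrors the MultiWOZ 2.2 entry in the register.

```python
from dataclasses import dataclass

# Labels for the five core criteria, in the order listed in the gate description.
CORE_CRITERIA = {
    "actor": "Actor identity",
    "case_context": "Case context",
    "decision": "Explicit decision",
    "outcome": "Observable outcome",
    "repeated_actors": "Repeated observations per actor",
}


@dataclass(frozen=True)
class Qualification:
    """Hypothetical record of which core criteria a dataset satisfies."""
    actor: bool
    case_context: bool
    decision: bool
    outcome: bool
    repeated_actors: bool


def missing_criteria(q: Qualification) -> list[str]:
    """Return the labels of the criteria that are not satisfied."""
    return [label for name, label in CORE_CRITERIA.items() if not getattr(q, name)]


def can_enter_active_test(q: Qualification) -> bool:
    """The gate clears only when all five criteria are satisfied."""
    return not missing_criteria(q)


# MultiWOZ 2.2 in the register lacks actor identity and repeated actors.
multiwoz = Qualification(actor=False, case_context=True, decision=True,
                         outcome=True, repeated_actors=False)
print(can_enter_active_test(multiwoz), missing_criteria(multiwoz))
```

A dataset that fails the check stays out of ACTIVE TEST, and the returned labels correspond to the "Missing" list shown on blocked register entries.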

Warning · Do not interpret actor main effects as execution-fit. Anthrocentrix proof requires actor × case interaction and ideally downstream causal improvement.

Evidence axes — project-wide

Every dataset result is reported on four independent axes. A result on one axis is never treated as a result on another — actor main effects are not execution-fit, and modeled policy lift is not a causal proof.

Actor Main Effect

Does actor identity add information beyond case context? A positive result here is necessary but NOT sufficient for execution-fit.

Actor main effects are proven across crossed actor×case datasets (EOIR, STAR, LaborSupply, Grunfeld, Arizona Open Policing).

STRONGLY SUPPORTED

Actor × Case Interaction (Execution-Fit)

Does the actor's contribution depend on the case? This is the execution-fit claim. A main effect alone does not establish it.

Execution-fit is proven to exist. MLB Statcast Umpires (1.47M called pitches, 124 umpires) shows an interaction share of 67.8% (95% CI 58.8%–75.9%), classified INTERACTION_DOMINATED. EOIR shows a weak interaction; Arizona is main-effect dominated. Allocation strategy must be measured per domain, not assumed.

SUPPORTED

Policy Gain (actor-aware vs actor-blind)

Does an actor-aware routing/conditioning policy beat the actor-blind baseline in offline/modeled evaluation? Modeled lift only — not yet a causal claim.

Actor-aware allocation produces measurable downstream gains (e.g. Arizona Open Policing: +1.3 pts recall over the actor-blind baseline). Observed across multiple domains; magnitude is domain-dependent.

SUPPORTED
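A policy gain quoted in "points of recall" is the difference between the recall of the actor-aware policy and the recall of the actor-blind baseline on the same cases. The sketch below shows that arithmetic on toy, made-up labels; the function names are hypothetical and the numbers are not drawn from any dataset in this register. As the axis definition says, a gain computed this way is modeled lift, not a causal claim.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Recall = true positives / all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("recall is undefined without positive cases")
    return tp / positives


def policy_gain_points(y_true, actor_aware_pred, actor_blind_pred) -> float:
    """Recall difference in percentage points (actor-aware minus actor-blind)."""
    return 100.0 * (recall(y_true, actor_aware_pred) - recall(y_true, actor_blind_pred))


# Toy labels purely to exercise the functions; not data from any dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
aware_pred = [1, 1, 1, 0, 0, 1, 0, 0]   # finds 3 of the 4 positives
blind_pred = [1, 1, 0, 0, 0, 0, 0, 0]   # finds 2 of the 4 positives

gain = policy_gain_points(y_true, aware_pred, blind_pred)
print(f"{gain:+.1f} pts recall")   # prints "+25.0 pts recall"
```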

Causal Confidence

Is the policy gain backed by random or quasi-random assignment so it can be read causally? Without this, lift can be confounded by selection.

Causal proof is domain dependent. MLB variance decomposition supports a high-confidence interaction read in that domain; EOIR and Arizona remain associational pending quasi-random assignment.

WEAKLY SUPPORTED

Observed domain types

Allocation strategy must be measured, not assumed. The platform classifies each domain empirically by decomposition.

Main-Effect Dominated

e.g. Arizona Open Policing

Actor quality drives most of the observed actor value. Allocate on actor quality; execution-fit layer stays OFF.

Interaction-Dominated

e.g. MLB Statcast Umpires

Actor × case interaction drives most of the observed actor value. Activate execution-fit allocation.

Benchmark leaderboard

Dataset	Verdict	Interaction share	Note
MLB Statcast Umpires	INTERACTION DOMINATED	67.8% · CI 58.8%–75.9%	1.47M called pitches · 124 umpires · variance decomposition
Arizona Open Policing	MAIN EFFECT DOMINATED	9%	+1.3 pts recall; 91% actor main effect / 9% execution-fit
EOIR Judges	WEAK INTERACTION	≈30%	Interaction statistically significant (p<0.001) but below practical-meaningfulness threshold

Dataset evidence cards

MLB Statcast Umpires (2021–2024)

Sports · human decision making · ball/strike calls

INTERACTION DOMINATED · Actor × case interaction confirmed

Execution-fit dominates actor main effects in a multidimensional decision environment. Variance decomposition of calibration residuals over 1.47M called pitches across 124 umpires yields an interaction share of 67.8% (95% CI 58.8%–75.9%) — the first INTERACTION_DOMINATED benchmark in the Anthrocentrix evidence base.

Actor × case interaction is large and replicated; downstream causal benefit not yet demonstrated.
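The card does not spell out the estimator behind "variance decomposition of calibration residuals". As a rough illustration of the idea only, the sketch below applies a plain two-way sum-of-squares split to a balanced table of per-actor, per-case-bin mean residuals and reports interaction as a share of actor-related variation (main effect plus interaction). That share definition is inferred from the 32% / 68% split shown on this card; the function name, the balanced-table assumption, and the neglect of sampling noise are all assumptions of this sketch, not the platform's method.

```python
def interaction_share(cell_means: list[list[float]]) -> float:
    """Share of actor-related variation attributable to actor x case interaction.

    cell_means[i][j] is the mean residual for actor i in case bin j
    (a balanced, fully crossed table is assumed).
    """
    n_actors, n_cases = len(cell_means), len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (n_actors * n_cases)
    actor_eff = [sum(row) / n_cases - grand for row in cell_means]
    case_eff = [
        sum(cell_means[i][j] for i in range(n_actors)) / n_actors - grand
        for j in range(n_cases)
    ]

    # Actor main effect: each actor's offset applies to every case bin.
    ss_actor = n_cases * sum(a * a for a in actor_eff)
    # Interaction: what remains after removing grand mean, actor and case effects.
    ss_inter = sum(
        (cell_means[i][j] - grand - actor_eff[i] - case_eff[j]) ** 2
        for i in range(n_actors)
        for j in range(n_cases)
    )
    total = ss_actor + ss_inter
    if total == 0:
        raise ValueError("no actor-related variation to decompose")
    return ss_inter / total


additive = [[1.0, 2.0], [3.0, 4.0]]    # actors differ by a constant offset
crossed = [[1.0, -1.0], [-1.0, 1.0]]   # actors differ only in which case bin suits them
print(interaction_share(additive), interaction_share(crossed))
```

The two toy tables mark the extremes: a purely additive table yields a share of 0, and a purely crossed table yields a share of 1. Real data with unequal cell counts and noisy cell means would need a model-based estimate rather than this raw split.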

Evidence axes

Actor Main Effect

Umpire identity contributes a stable main-effect component, but it is the minority share of total actor value in this domain.

SUPPORTED

Actor × Case Interaction (Execution-Fit)

Interaction share 67.8% (95% CI 58.8%–75.9%). Umpire × pitch-context interaction materially exceeds main effects and clears the deployment threshold.

STRONGLY SUPPORTED

Policy Gain (actor-aware vs actor-blind)

Decomposition implies large headroom for execution-fit-aware allocation in pitch-call contexts.

SUPPORTED

Causal Confidence

Causal confidence HIGH for the decomposition claim: 1.47M pitches and 272,978 contested pitches across 124 actors yield tight CIs; variance decomposition is identified within the calibration-residual frame.

SUPPORTED

Actor Main Effect

32%

of policy gain

Execution-Fit

68%

of policy gain

Policy Gain

interaction-dominated headroom (decomposition-implied)

vs actor-blind baseline

Fit Layer: ON

Execution-fit contributes 68% of total actor value, clearing the 25% deployment threshold; CI does not cross the threshold.

Decomposition gate

PASS

Interaction share 67.8% (CI 58.8%–75.9%) materially exceeds the 25% deployment threshold.
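The decomposition gate outcomes reported across the three evidence cards are consistent with a rule of the following shape. This is a sketch under assumptions: the 25% share threshold and the 0.01 AUC practical-meaningfulness threshold come from the cards, but the way they are combined here, the function name, and its parameters are hypothetical.

```python
from typing import Optional

DEPLOYMENT_THRESHOLD = 0.25       # 25% execution-fit share, as stated on the cards
PRACTICAL_LIFT_THRESHOLD = 0.01   # AUC practical-meaningfulness threshold (EOIR card)


def decomposition_gate(share: float,
                       ci_low: Optional[float] = None,
                       interaction_lift: Optional[float] = None) -> str:
    """Return "PASS" or "BLOCKED" for the decomposition gate (assumed rule).

    share            -- execution-fit share of total actor value, in [0, 1]
    ci_low           -- lower CI bound on the share, when one is reported
    interaction_lift -- interaction lift in AUC, when one is reported
    """
    if share < DEPLOYMENT_THRESHOLD:
        return "BLOCKED"
    if ci_low is not None and ci_low < DEPLOYMENT_THRESHOLD:
        return "BLOCKED"   # the CI crosses the deployment threshold
    if interaction_lift is not None and interaction_lift < PRACTICAL_LIFT_THRESHOLD:
        return "BLOCKED"   # statistically significant but practically too small
    return "PASS"


print(decomposition_gate(0.678, ci_low=0.588))             # MLB Statcast Umpires
print(decomposition_gate(0.09))                            # Arizona Open Policing
print(decomposition_gate(0.30, interaction_lift=0.0036))   # EOIR
```

With the figures quoted on the cards, this returns PASS for MLB Statcast Umpires and BLOCKED for both Arizona Open Policing and EOIR, matching the gate states shown.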

Volume gate

PASS

1.47M called pitches across 124 repeated actors.

Causal confidence gate

QUASI-CAUSAL

Variance decomposition is identified within the calibration-residual frame; causal confidence labeled HIGH for the decomposition claim.

Called pitches

1,469,350

Umpires

124

Contested pitches

272,978

Method

Variance decomposition of calibration residuals

Interaction share

67.8%

Interaction 95% CI

58.8%–75.9%

Verdict

INTERACTION DOMINATED

Causal confidence

HIGH

Evidence status

YES

Qualification fields

✓actor_present

✓case_context_present

✓decision_present

✓downstream_outcome_present

✓repeated_actors_present

✓assignment_random_or_quasi_random

proof_eligibility_score: 92/100

commercial_relevance_score: 7/10

current_status: INTERACTION_CONFIRMED

First INTERACTION_DOMINATED benchmark. Demonstrates that some domains are genuinely execution-fit-led; allocation strategy must be measured per domain, not assumed.

Arizona Open Policing

Traffic stops · officer × incident

MAIN EFFECT DOMINATED · Actor × case interaction detected

Arizona is the first dataset where an actor-aware allocation policy materially improved a downstream outcome. The bulk of the gain (~91%) is actor main effect; execution-fit contributes the remaining ~9%. Result is associational, not causal.

Statistically significant actor × case interaction exists, but magnitude is modest and/or downstream causal effect is unproven.

Evidence axes

Actor Main Effect

Officer identity carries strong, stable performance signal independent of incident type — drives ~91% of observed policy gain.

STRONGLY SUPPORTED

Actor × Case Interaction (Execution-Fit)

Officer × incident execution-fit is real but small; contributes only ~9% of total actor value.

WEAKLY SUPPORTED

Policy Gain (actor-aware vs actor-blind)

Actor-aware allocation produced a measurable downstream improvement (+1.3 points recall) over the actor-blind baseline.

WEAKLY SUPPORTED

Causal Confidence

Assignment is not random or quasi-random; gain is associational and cannot yet be read as a causal interventional claim.

NOT YET ESTABLISHED

Actor Main Effect

91%

of policy gain

Execution-Fit

9%

of policy gain

Policy Gain

+1.3 points recall

vs actor-blind baseline

Fit Layer: OFF

Execution-fit contributes only 9% of total actor value and does not clear deployment threshold (25%).

Decomposition gate

BLOCKED

Execution-fit is real (statistically significant) but contributes <25% of total actor gain.

Volume gate

PASS

Repeated officers across many stops; volume gate cleared.

Causal confidence gate

ASSOCIATIONAL

Officer assignment is not random/quasi-random; observed gain cannot yet be read causally.

Policy gain

+1.3 pts recall

Actor main effect share

91%

Execution-fit share

9%

Verdict

MAIN EFFECT DOMINATED

Causal confidence

ASSOCIATIONAL

Fit layer

OFF

Qualification fields

✓actor_present

✓case_context_present

✓decision_present

✓downstream_outcome_present

✓repeated_actors_present

✗assignment_random_or_quasi_random

proof_eligibility_score: 78/100

commercial_relevance_score: 7/10

current_status: INTERACTION_DETECTED

Headline result: actor-aware allocation works. Mechanism is dominated by actor quality, not execution-fit. Anthrocentrix must report this as an Actor Quality Allocation win, not an execution-fit win.

U.S. Immigration Judges (EOIR)

Asylum adjudication

MAIN EFFECT DOMINATED · Actor × case interaction detected

EOIR shows statistically significant judge × case execution-fit in decision behavior, but the effect is modest and downstream causal improvement is not proven.

Statistically significant actor × case interaction exists, but magnitude is modest and/or downstream causal effect is unproven.

Evidence axes

Actor Main Effect

Actor identity adds information beyond context (actor lift +0.0082 AUC over context-only).

STRONGLY SUPPORTED

Actor × Case Interaction (Execution-Fit)

Judge × case interaction is statistically significant (AUC +0.0036, 95% CI [0.0026, 0.0046], p<0.001) but below the 0.01 practical-meaningfulness threshold.

WEAKLY SUPPORTED

Policy Gain (actor-aware vs actor-blind)

Actor-aware policy beats the actor-blind baseline in modeled offline evaluation, but the lift is small.

WEAKLY SUPPORTED

Causal Confidence

Judge assignment is not quasi-random in EOIR; Phase 3 downstream reversal/remand test did not establish a causal interventional claim.

NOT YET ESTABLISHED

Actor Main Effect

70%

of policy gain

Execution-Fit

30%

of policy gain

Policy Gain

small modeled lift; not validated downstream

vs actor-blind baseline

Fit Layer: OFF

Execution-fit is statistically significant but below the practical-meaningfulness threshold; does not clear deployment gate.

Decomposition gate

BLOCKED

Interaction is statistically significant but practically below threshold.

Volume gate

PASS

300,000 modeled decisions across many judges.

Causal confidence gate

ASSOCIATIONAL

Judge assignment is not quasi-random; no clean interventional contrast.

Merits decisions scanned

6,485,038

Modeled sample

300,000

Context-only AUC

0.8947

Actor + context AUC

0.9029

Actor × case AUC

0.9065

Actor main lift

+0.0082

Interaction lift

+0.0036

Interaction 95% CI

[0.0026, 0.0046]

p-value

< 0.001

Interaction as % of main effect

43.5%

Verdict

MAIN EFFECT DOMINATED
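The lift and share figures in this card follow from the three AUC values by subtraction and division. The sketch below reproduces that arithmetic from the rounded AUCs as displayed; the variable names are illustrative. From these rounded inputs the execution-fit share comes out near 30.5% and the interaction-to-main ratio near 43.9%, slightly above the 43.5% listed, a gap that may reflect rounding of the displayed AUCs.

```python
context_only_auc = 0.8947
actor_context_auc = 0.9029
actor_x_case_auc = 0.9065

# Lift from adding actor identity to the context-only model.
actor_main_lift = actor_context_auc - context_only_auc
# Additional lift from adding actor x case interaction terms.
interaction_lift = actor_x_case_auc - actor_context_auc

# Execution-fit share of the total actor-related lift.
fit_share = interaction_lift / (actor_main_lift + interaction_lift)
# Interaction lift expressed relative to the main-effect lift.
ratio_to_main = interaction_lift / actor_main_lift

print(f"actor main lift     {actor_main_lift:+.4f}")
print(f"interaction lift    {interaction_lift:+.4f}")
print(f"execution-fit share {fit_share:.1%}")
print(f"interaction / main  {ratio_to_main:.1%}")
```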

Qualification fields

✓actor_present

✓case_context_present

✓decision_present

✓downstream_outcome_present

✓repeated_actors_present

✗assignment_random_or_quasi_random

proof_eligibility_score: 72/100

commercial_relevance_score: 6/10

current_status: INTERACTION_DETECTED

Phase 3 downstream reversal/remand test did NOT prove the interventional claim — judge assignment was not quasi-random and the downstream actor × case interaction was practically negligible. Phase 4 decision-level execution-fit test detected a statistically significant but practically modest interaction (below the 0.01 practical-meaningfulness threshold).

Claim ladder

Actor effects exist

Actor main effects proven across crossed actor×case datasets — EOIR, STAR, LaborSupply, Grunfeld, and Arizona Open Policing all show actor identity adds information beyond context.

SUPPORTED

Actor × case fit exists

MLB Statcast Umpires (1.47M called pitches, 124 umpires): interaction share 67.8% (95% CI 58.8%–75.9%). Execution-fit is proven to exist in at least one production-scale domain.

SUPPORTED

Actor-aware allocation improves downstream outcomes

Arizona Open Policing: actor-aware allocation produced +1.3 points recall over the actor-blind baseline. Observed; magnitude is domain-dependent.

SUPPORTED

Execution-fit can be the dominant mechanism in some domains

MLB Statcast: INTERACTION_DOMINATED (68% execution-fit share). Arizona: MAIN_EFFECT_DOMINATED (9%). EOIR: WEAK_INTERACTION. Allocation strategy must be measured per domain, not assumed.

SUPPORTED

Actor-conditioned intervention causally improves outcomes

Causal proof is domain dependent. MLB variance decomposition is high-confidence within its frame; EOIR and Arizona remain associational pending quasi-random assignment.

WEAKLY SUPPORTED

Next proof target

PhysioNet MIMIC-IV — clinician × patient decision dataset

Closest available dataset that supplies all components needed to clear the downstream causal-improvement bar.

Required proof

· clinician identity
· patient context
· treatment / decision
· downstream outcome
· repeated clinicians
· actor × case interaction
· policy lift on downstream outcome

Datasets

Gate-clear (all 5 criteria)

ELIGIBLE

ACTIVE TEST

Lifecycle distribution

DISCOVERED 2 · SCREENING 21 · REJECTED 8 · PARKED 1 · ELIGIBLE 7 · ACTIVE TEST 0 · PROVEN 1 · FAILED 0

Filter

Only gate-clear · 40 of 40

Qualification register

ImageNet-AB (Annotation Byproducts)

Human Annotation · Image labeling · ~1.28M images × annotation traces

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

1,280,000

Domain

Image labeling

Commercial relevance

9/10

Proposed state

ELIGIBLE

State: ELIGIBLE

Lichess + Stockfish blunder labels

Operational Decision Systems · Chess · ~5B games available; Anthrocentrix used 112k events

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

5,000,000,000

Domain

Chess

Commercial relevance

4/10

Proposed state

PROVEN

State: PROVEN

Trueblood et al. Medical Decision RT

Clinical Decision Making · Pathology classification (blast vs non-blast) · Pathologists + novices · per-trial RT and accuracy

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Pathology classification (blast vs non-blast)

Commercial relevance

6/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

FiFAR — Fairness in AI-assisted Fraud Review

Fraud Review · AI-assisted fraud analyst decisions · Synthetic but realistic; analyst IDs + decisions

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

AI-assisted fraud analyst decisions

Commercial relevance

9/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

ABCD — Action-Based Conversations Dataset

Customer Service / Contact Center · Task-oriented customer support · 10,042 dialogs · 55 user intents · 30 agent actions

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

10,042

Domain

Task-oriented customer support

Commercial relevance

8/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

AI.vs.Clinician (sepsis trial)

Clinical Decision Making · Sepsis early warning · Multi-site randomized trial logs

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Sepsis early warning

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

ReviewArena / ReviewBench (peer review)

Content Moderation / Peer Review · Scientific peer review · 51,529 papers · 196,099 reviews · 22 venues

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

196,099

Domain

Scientific peer review

Commercial relevance

5/10

Proposed state

ELIGIBLE

State: ELIGIBLE

StarCraft II Replay Pack

Operational Decision Systems · RTS gameplay · Millions of replays

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

RTS gameplay

Commercial relevance

3/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Mind2Web

Human-AI Collaboration · Web navigation · 2,350 tasks · 137 websites

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

2,350

Domain

Web navigation

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

WebArena (with human trajectories)

Human-AI Collaboration · Web tasks · Human + agent trajectories

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Web tasks

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Wikipedia Edit History (per-editor)

Operational Decision Systems · Edits + reverts · Multi-TB

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Edits + reverts

Commercial relevance

4/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Tweet Annotation Sensitivity 2

Human Annotation · Text labeling · ~89k annotation events

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

89,000

Domain

Text labeling

Commercial relevance

7/10

Proposed state

ELIGIBLE

State: ELIGIBLE

MultiWOZ 2.2

Customer Service / Contact Center · Multi-domain task-oriented dialog · ~10k dialogs · 7 domains

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

10,000

Domain

Multi-domain task-oriented dialog

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: SCREENING

International Brain Lab — Decision Task

Cognitive Science / Response-Time · Mouse 2AFC perception · Millions of trials, hundreds of subjects

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Mouse 2AFC perception

Commercial relevance

2/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Stack Exchange Data Dump (Q&A)

Operational Decision Systems · Q&A moderation · Multi-TB across sites

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Q&A moderation

Commercial relevance

4/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Online-Go.com Game Archive

Operational Decision Systems · Go · Millions of games

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Go

Commercial relevance

2/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Prolific Autoresearch HITL

Human-AI Collaboration · DPO pair selection · 300 participants × pairwise judgments

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

300

Domain

DPO pair selection

Commercial relevance

6/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Intertemporal-Choice RT (Pongratz & Schoemann)

Cognitive Science / Response-Time · Intertemporal choice · Large-scale participants × choice + RT

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Intertemporal choice

Commercial relevance

3/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

IRC Poker Database

Operational Decision Systems · Online poker · 10M+ hands

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

10,000,000

Domain

Online poker

Commercial relevance

4/10

Proposed state

ELIGIBLE

State: ELIGIBLE

CIFAR-10H (soft labels from human raters)

Human Annotation · Image labeling · 10k CIFAR-10 test images × 50+ raters

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

10,000

Domain

Image labeling

Commercial relevance

5/10

Proposed state

ELIGIBLE

State: ELIGIBLE

MIMIC-IV (base EHR)

Clinical Decision Making · ICU EHR · ~300k patients

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

300,000

Domain

ICU EHR

Commercial relevance

8/10

Proposed state

ELIGIBLE

State: ELIGIBLE

Agent Traces: Customer Support Triage

Human-AI Collaboration · Multi-agent workflow · 1,483 events · 50 runs

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

1,483

Domain

Multi-agent workflow

Commercial relevance

6/10

Proposed state

ELIGIBLE

State: ELIGIBLE

MIMIC-IV-Ext Clinical Decision Making

Clinical Decision Making · Abdominal pathology · MIMIC-IV derived

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Abdominal pathology

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

IEEE-CIS Fraud Detection

Fraud Review · Card-not-present fraud · ~590k transactions

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

590,000

Domain

Card-not-present fraud

Commercial relevance

9/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

OpenAI Moderation Evaluation Dataset

Content Moderation · Text safety · 1,680 prompts × multi-rater labels

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

1,680

Domain

Text safety

Commercial relevance

8/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

RSNA Pneumonia Detection (radiologist reads)

Medical Coding / Radiology · Chest X-ray reads · ~30k images, multi-rater

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

30,000

Domain

Chest X-ray reads

Commercial relevance

7/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

AMLSim

AML Review · Anti-money-laundering · Simulator (configurable scale)

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Anti-money-laundering

Commercial relevance

8/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: REJECTED

DynaSent (dynamic sentiment annotation)

Human Annotation · Sentiment · 121,634 sentences

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

121,634

Domain

Sentiment

Commercial relevance

5/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

Taskmaster-3 (TicketTalk)

Customer Service / Contact Center · Movie ticket dialog · 23,789 dialogs

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

23,789

Domain

Movie ticket dialog

Commercial relevance

5/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: SCREENING

Berkeley DeepDrive — Driving Decisions

Human Factors / Driving · Driving decisions · 100k videos

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

100,000

Domain

Driving decisions

Commercial relevance

6/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: SCREENING

Quality of RT Data Inference (Blinded Assessment)

Cognitive Science / Response-Time · Cognitive modeling · Multi-lab collaborative assessment

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✗Outcome

✗Repeated actors

Sample size

Domain

Cognitive modeling

Commercial relevance

2/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: observable outcome, repeated observations per actor.

State: SCREENING

ML-Fairness-Gym (hiring + loans)

Recruiting / Hiring · Sequential decisions · Simulator

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Sequential decisions

Commercial relevance

6/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

HANNA-LLMEval

Human Annotation · Story rating · 1,056 stories × multi-rater Likert

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✓Repeated actors

Sample size

1,056

Domain

Story rating

Commercial relevance

4/10

Proposed state

PARKED

State: PARKED

Jigsaw Toxic Comment Classification

Content Moderation · Online comments · ~160k comments × multi-label

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

160,000

Domain

Online comments

Commercial relevance

8/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

CheXpert

Medical Coding / Radiology · Chest X-ray labeling · 224,316 reports

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

224,316

Domain

Chest X-ray labeling

Commercial relevance

7/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

Reddit Moderation Actions (Pushshift)

Content Moderation · Subreddit moderation · Billions of comments

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

Domain

Subreddit moderation

Commercial relevance

5/10

Proposed state

SCREENING

Gate blocked: cannot enter ACTIVE TEST. Missing: repeated observations per actor.

State: SCREENING

SNLI / MNLI annotator-level (eraser-style)

Human Annotation · NLI labeling · 570k pairs

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

570,000

Domain

NLI labeling

Commercial relevance

4/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

CallCenterEN (PII-redacted transcripts)

Customer Service / Contact Center · Inbound/outbound calls · 91,706 transcripts · 10,448 hours

Proof Candidate

/ 100

✗Actor

✗Case context

✗Decision

✗Outcome

✗Repeated actors

Sample size

91,706

Domain

Inbound/outbound calls

Commercial relevance

9/10

Proposed state

DISCOVERED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, case context, explicit decision, observable outcome, repeated observations per actor.

State: DISCOVERED

DAIR-AI Emotion (per-rater)

Behavioral Health · Text emotion · 20k tweets

Proof Candidate

/ 100

✗Actor

✓Case context

✓Decision

✓Outcome

✗Repeated actors

Sample size

20,000

Domain

Text emotion

Commercial relevance

4/10

Proposed state

REJECTED

Gate blocked: cannot enter ACTIVE TEST. Missing: actor identity, repeated observations per actor.

State: REJECTED

Switchboard-1 Telephone Speech

Customer Service / Contact Center · Conversational speech · 260 hours, ~2,400 conversations

Proof Candidate

/ 100

✓Actor

✓Case context

✓Decision

✗Outcome

✗Repeated actors

Sample size

2,400

Domain

Conversational speech

Commercial relevance

4/10

Proposed state

DISCOVERED

Gate blocked: cannot enter ACTIVE TEST. Missing: observable outcome, repeated observations per actor.

State: DISCOVERED

Lifecycle states: DISCOVERED → SCREENING → (REJECTED · PARKED) → ELIGIBLE → ACTIVE TEST → (PROVEN · FAILED). ACTIVE TEST is gated server-side by the five core criteria; all other transitions are operator-driven and persisted locally. See also: discovery registry.
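The lifecycle can be pictured as a small state machine. The sketch below is one possible reading of the arrow notation, not the platform's implementation: the `TRANSITIONS` table and `can_transition` helper are hypothetical, and treating REJECTED and PARKED as having no outgoing transitions is an assumption, since the notation does not say whether such datasets can be re-screened.

```python
# Allowed next states for each lifecycle state (assumed reading of the arrows).
TRANSITIONS = {
    "DISCOVERED": {"SCREENING"},
    "SCREENING": {"REJECTED", "PARKED", "ELIGIBLE"},
    "REJECTED": set(),    # assumption: no way back in this sketch
    "PARKED": set(),      # assumption: no way back in this sketch
    "ELIGIBLE": {"ACTIVE TEST"},
    "ACTIVE TEST": {"PROVEN", "FAILED"},
    "PROVEN": set(),
    "FAILED": set(),
}

CORE_CRITERIA_COUNT = 5


def can_transition(current: str, target: str, criteria_met: int = 0) -> bool:
    """True when current -> target is allowed.

    Entering ACTIVE TEST additionally requires all five core criteria;
    every other transition is left to the operator.
    """
    if target not in TRANSITIONS.get(current, set()):
        return False
    if target == "ACTIVE TEST" and criteria_met < CORE_CRITERIA_COUNT:
        return False
    return True


print(can_transition("ELIGIBLE", "ACTIVE TEST", criteria_met=5))   # True
print(can_transition("ELIGIBLE", "ACTIVE TEST", criteria_met=4))   # False
print(can_transition("DISCOVERED", "ELIGIBLE"))                    # False
```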