Core Capability

Calibration

The central capability of Anthrocentrix: making confidence mean what it says. Calibrated probabilities are a precondition for risk-weighted routing, review prioritization, and audit.

Metric	Value	Change
Brier (case only)	0.214	baseline
Brier (case + actor)	0.162	Δ 0.052
ECE (case only)	0.094	baseline
ECE (case + actor)	0.031	Δ 0.063

Reliability Diagram

predicted vs. observed

perfect · case only · case + actor

Actor Calibration Distribution

ECE bin	n actors
0.00–0.02	412
0.02–0.04	884
0.04–0.06	612
0.06–0.08	311
0.08–0.10	142
>0.10

Calibration Interpretation

how to read these metrics

Brier Score

Average squared error between the predicted probability and the actual outcome (1 if it happened, 0 if not). Penalizes every error, and penalizes confident errors most heavily.

Lower is better. 0.00 = perfect; 0.25 = the score of always predicting 0.5 on a binary task.
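The Brier score can be sketched in a few lines. This is a minimal illustration that assumes binary 0/1 outcomes; the function name `brier_score` is hypothetical and is not taken from the engine.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Always predicting 0.5 scores 0.25 regardless of the outcomes.
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# A confident miss costs far more than a hedged miss.
confident_miss = brier_score([0.9], [0])  # roughly 0.81
hedged_miss = brier_score([0.6], [0])     # roughly 0.36
```

The squared term is what makes confidence expensive: missing at 0.9 costs more than twice as much as missing at 0.6.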

Expected Calibration Error (ECE)

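When the engine says 70%, does it happen 70% of the time? ECE groups predictions into confidence bins, measures the gap between mean predicted probability and observed frequency in each bin, and averages those gaps weighted by the number of predictions in each bin.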

Lower is better. <0.05 is production-grade.
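One common way to compute ECE is sketched below. The binning scheme is an assumption: ten equal-width bins over the predicted probability of a binary outcome. The page does not state which scheme the engine uses, and the function name is hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Count-weighted mean of |mean predicted - observed frequency| per bin."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    # Per bin: [sum of predictions, sum of outcomes, count].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    total = len(probs)
    ece = 0.0
    for sum_p, sum_y, count in bins:
        if count:
            ece += (count / total) * abs(sum_p / count - sum_y / count)
    return ece


# Ten predictions at 0.7 with 7 hits (calibrated),
# ten predictions at 0.9 with 6 hits (overconfident by 0.3).
probs = [0.7] * 10 + [0.9] * 10
outcomes = [1] * 7 + [0] * 3 + [1] * 6 + [0] * 4
ece = expected_calibration_error(probs, outcomes)  # roughly 0.5 * 0.0 + 0.5 * 0.3
```

Because each bin is weighted by its share of predictions, a badly calibrated bin with few predictions moves ECE very little.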

Reliability

Predicted probabilities plotted against observed frequencies, with the diagonal marking perfect calibration. Points above the diagonal indicate underconfidence; points below indicate overconfidence.

Curve should hug the diagonal across all bins.
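The points of a reliability diagram can be derived as in the following sketch. It assumes equal-width bins over binary predictions; `reliability_points` is a hypothetical helper, not part of the engine.

```python
from typing import List, Sequence, Tuple


def reliability_points(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted, observed frequency, count) per non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    points = []
    for members in bins:
        if members:
            n = len(members)
            mean_p = sum(p for p, _ in members) / n
            freq = sum(y for _, y in members) / n
            points.append((mean_p, freq, n))
    return points


# Five predictions at 0.2 with 1 hit: on the diagonal.
# Five predictions at 0.8 with 3 hits: below the diagonal (overconfident).
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1] + [1, 1, 1, 0, 0]
points = reliability_points(probs, outcomes)
```

Plotting the first element of each tuple on the x-axis and the second on the y-axis gives the curve; the count shows how much evidence sits behind each point.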

Overconfidence

Engine assigns high probability to outcomes that fail more often than predicted. The most dangerous failure mode for routing and review.

Detectable as the curve sagging below the diagonal in the 0.7–0.9 range.

Underconfidence

Engine hedges on outcomes that turn out to be reliable. Wastes review capacity but does not produce harm.

Detectable as the curve rising above the diagonal in the 0.3–0.6 range.
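Both failure modes can be checked numerically with a signed gap over a probability range, as in this rough sketch. The helper `signed_gap` is hypothetical; the ranges passed to it are the ones quoted above, and treating the range as a single pooled group is a simplifying assumption.

```python
from typing import Optional, Sequence


def signed_gap(
    probs: Sequence[float], outcomes: Sequence[int], lo: float, hi: float
) -> Optional[float]:
    """Observed frequency minus mean predicted probability for lo <= p <= hi.

    Negative means overconfident (curve below the diagonal); positive means
    underconfident (curve above it). Returns None if no prediction is in range.
    """
    selected = [(p, y) for p, y in zip(probs, outcomes) if lo <= p <= hi]
    if not selected:
        return None
    n = len(selected)
    return sum(y for _, y in selected) / n - sum(p for p, _ in selected) / n


# Ten predictions at 0.8 with 6 hits, ten predictions at 0.4 with 6 hits.
probs = [0.8] * 10 + [0.4] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 6 + [0] * 4

high_range = signed_gap(probs, outcomes, 0.7, 0.9)  # negative: overconfident
mid_range = signed_gap(probs, outcomes, 0.3, 0.6)   # positive: underconfident
```

Pooling a whole range can hide opposite-signed errors in neighbouring bins, so a per-bin view is still worth checking alongside this summary.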

Domain Calibration

Domain	ECE	Grade
Lichess	0.041	production
LeWiDi	0.094	acceptable
MLB	0.029	production