Core Capability

Calibration

The central capability of Anthrocentrix: making confidence mean what it says. Calibrated probabilities are a precondition for risk-weighted routing, review prioritization, and audit.

Brier (case only): 0.214 (baseline)
Brier (case + actor): 0.162 (Δ 0.052)
ECE (case only): 0.094
ECE (case + actor): 0.031 (Δ 0.063)

Reliability Diagram

[Figure: predicted vs. observed. x-axis: predicted probability; y-axis: observed. Series: perfect, case only, case + actor.]

Actor Calibration Distribution

ECE bin      n actors
0.00–0.02    412
0.02–0.04    884
0.04–0.06    612
0.06–0.08    311
0.08–0.10    142
>0.10        61

Calibration Interpretation

how to read these metrics
Brier Score

Average squared error between predicted probability and what actually happened. Penalizes both being wrong and being wrong with confidence.

Lower is better. 0.00 = perfect, 0.25 = what an uninformative constant 50% forecast scores on a balanced binary task.
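As a rough illustration of the definition above, the sketch below computes a Brier score for binary outcomes in plain Python. The function name `brier_score` and the toy inputs are made up for this example; they are not taken from the engine.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Confident and right: small penalty (about 0.01).
confident_right = brier_score([0.9, 0.1], [1, 0])

# Always 50% on a balanced task: the 0.25 reference point.
coin_flip = brier_score([0.5, 0.5], [1, 0])

# Confident and wrong: heavy penalty (about 0.81).
confident_wrong = brier_score([0.9, 0.1], [0, 1])

print(confident_right, coin_flip, confident_wrong)
```

The three toy cases show why the score penalizes confident errors more than hedged ones: the squared term grows fastest when the prediction sits far from the outcome.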

Expected Calibration Error (ECE)

When the engine says 70%, does it happen 70% of the time? ECE measures the gap, weighted across confidence bins.

Lower is better. <0.05 is production-grade.
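One common way to compute ECE is sketched below, assuming ten equal-width confidence bins. The binning scheme the engine actually uses is not stated here, so the bin count and the function name `expected_calibration_error` are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Sum over bins of (bin share) * |observed rate - mean predicted probability|."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")

    # Assumption: equal-width bins over [0, 1].
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(observed - mean_pred)
    return ece


# Ten predictions of 70%, seven of which happen: the gap is essentially zero.
well_calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)

# Ten predictions of 90%, only six of which happen: the gap is about 0.3.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)

print(well_calibrated, overconfident)
```

Because each bin's gap is weighted by its share of predictions, a large miscalibration in a sparsely populated bin moves ECE less than a small one in a crowded bin.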

Reliability

Predicted probabilities plotted against observed frequencies, with the diagonal marking perfect calibration. Points above the diagonal mean the engine is under-confident; points below mean it is over-confident.

Curve should hug the diagonal across all bins.

Overconfidence

Engine assigns high probability to outcomes that fail more often than predicted. The most dangerous failure mode for routing and review.

Detectable as the curve sagging below the diagonal in the 0.7–0.9 range.

Underconfidence

Engine hedges on outcomes that turn out to be reliable. Wastes review capacity but does not produce harm.

Detectable as the curve rising above the diagonal in the 0.3–0.6 range.
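The over- and under-confidence checks described above can be sketched as a comparison of each reliability point against the diagonal. This is a minimal illustration under stated assumptions: ten equal-width bins, a hypothetical tolerance of 0.02 around the diagonal, and invented function names and toy data.

```python
from typing import List, Sequence, Tuple


def reliability_points(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed frequency) per non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            points.append((mean_pred, observed))
    return points


def label_point(mean_pred: float, observed: float, tolerance: float = 0.02) -> str:
    """Below the diagonal is over-confident; above it is under-confident."""
    # Assumption: gaps within `tolerance` of the diagonal count as calibrated.
    if observed < mean_pred - tolerance:
        return "over-confident"
    if observed > mean_pred + tolerance:
        return "under-confident"
    return "calibrated"


# Toy data: 20% predictions are on target, 40% predictions are too cautious,
# and 80% predictions are too sure of themselves.
probs = [0.2] * 10 + [0.4] * 10 + [0.8] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4 + [1] * 6 + [0] * 4

points = reliability_points(probs, outcomes)
labels = [label_point(mean_pred, observed) for mean_pred, observed in points]
for (mean_pred, observed), label in zip(points, labels):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  {label}")
```

On this toy data the 0.2 bin is labeled calibrated, the 0.4 bin under-confident, and the 0.8 bin over-confident, matching the ranges where each failure mode is said to show up on the curve.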

Domain Calibration

Domain     ECE      Grade
Lichess    0.041    production
LeWiDi     0.094    acceptable
MLB        0.029    production