Calibration
The central capability of Anthrocentrix: making confidence mean what it says. Calibrated probabilities are a precondition for risk-weighted routing, review prioritization, and audit.
Reliability Diagram
predicted vs. observed
Actor Calibration Distribution
ECE bins · n actors
Calibration Interpretation
how to read these metrics
Brier score: average squared error between predicted probability and what actually happened. Penalizes both being wrong and being wrong with confidence.
Lower is better. 0.00 = perfect; 0.25 = the score of a constant 0.5 prediction on a balanced binary task.
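The average squared error described above is commonly called the Brier score. A minimal sketch in plain Python, assuming binary outcomes encoded as 0/1; the function name `brier_score` is illustrative and not part of the engine:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# A constant 0.5 prediction on a balanced binary task scores 0.25.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))

# Perfect predictions score 0.0.
print(brier_score([1.0, 0.0], [1, 0]))

# Being wrong with confidence costs more than being wrong while hedging.
print(brier_score([0.9], [0]) > brier_score([0.6], [0]))
```

The last line illustrates the "wrong with confidence" penalty: the squared term grows quickly as a wrong prediction moves toward 0 or 1.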
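ECE (expected calibration error): when the engine says 70%, does it happen 70% of the time? ECE measures the gap between predicted confidence and observed frequency in each confidence bin, averaged with each bin weighted by its share of predictions.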
Lower is better. <0.05 is production-grade.
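One common way to compute ECE is sketched below. The binning scheme is an assumption: the source does not say how many bins the engine uses or whether they are equal-width, so this sketch uses 10 equal-width bins over predicted probability. The name `expected_calibration_error` is illustrative.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average of |mean predicted - observed frequency| over bins.

    Assumes equal-width bins on [0, 1] and 0/1 outcomes (both are assumptions).
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_pred - observed)
    return ece


# Says 70%, happens 70% of the time: gap is approximately zero.
calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)

# Says 90%, happens 50% of the time: gap is approximately 0.4.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(calibrated, overconfident)
```

Because ECE depends on the number and placement of bins, values are only comparable when computed with the same binning.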
Reliability diagram: predicted probabilities plotted against observed frequencies, with the diagonal marking perfect calibration. Points above the diagonal indicate under-confidence; points below indicate over-confidence.
Curve should hug the diagonal across all bins.
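The points on such a diagram can be derived by binning predictions and comparing each bin's mean predicted probability with its observed frequency. This is a sketch under assumptions: 10 equal-width bins, 0/1 outcomes, and a hypothetical helper name `reliability_curve`; plotting is left out.

```python
def reliability_curve(probs, outcomes, n_bins=10):
    """Return (mean_predicted, observed_frequency, count) for each non-empty bin."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be the same length")
    pred_sums = [0.0] * n_bins
    hit_counts = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        pred_sums[index] += p
        hit_counts[index] += y
        counts[index] += 1
    return [
        (pred_sums[i] / counts[i], hit_counts[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


probs = [0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 1, 1, 1, 0, 0]
for mean_pred, observed, count in reliability_curve(probs, outcomes):
    if observed > mean_pred:
        position = "above diagonal (under-confident)"
    elif observed < mean_pred:
        position = "below diagonal (over-confident)"
    else:
        position = "on diagonal"
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={count} {position}")
```

Returning the per-bin count alongside each point makes it easy to see which points rest on too few predictions to be trusted.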
Over-confidence: the engine assigns high probability to outcomes that fail more often than predicted. The most dangerous failure mode for routing and review.
Detectable as the curve sagging below the diagonal in the 0.7–0.9 range.
Under-confidence: the engine hedges on outcomes that turn out to be reliable. Wastes review capacity but does not produce harm.
Detectable as the curve rising above the diagonal in the 0.3–0.6 range.
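Both failure modes can be checked numerically as a signed gap (observed frequency minus mean predicted probability) within a probability range. The sketch below is illustrative only: `signed_gap` is a hypothetical helper, the 0.7–0.9 and 0.3–0.6 ranges come from the text above, and any alerting threshold on the gap would be a separate choice not specified here.

```python
def signed_gap(probs, outcomes, low, high):
    """Observed frequency minus mean predicted probability for low <= p <= high.

    Negative: curve sags below the diagonal (over-confident).
    Positive: curve rises above the diagonal (under-confident).
    Returns None when no prediction falls in the range.
    """
    selected = [(p, y) for p, y in zip(probs, outcomes) if low <= p <= high]
    if not selected:
        return None
    mean_pred = sum(p for p, _ in selected) / len(selected)
    observed = sum(y for _, y in selected) / len(selected)
    return observed - mean_pred


# Toy data: 80% predictions that succeed 40% of the time,
# and 40% predictions that succeed 80% of the time.
probs = [0.8] * 5 + [0.4] * 5
outcomes = [1, 1, 0, 0, 0] + [1, 1, 1, 1, 0]

high_range_gap = signed_gap(probs, outcomes, 0.7, 0.9)  # negative: over-confident
low_range_gap = signed_gap(probs, outcomes, 0.3, 0.6)   # positive: under-confident
print(high_range_gap, low_range_gap)
```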
Domain Calibration
| Domain | ECE | Grade |
|---|---|---|
| Lichess | 0.041 | production |
| LeWiDi | 0.094 | acceptable |
| MLB | 0.029 | production |