Commercial Prototype

Annotation QA — Decision-Quality Routing

Route AI-annotation work to QA review by behavioral decision-quality risk. Catch more bad labels in fewer reviewer hours.

Buyer · Head of Data Quality

prototype v0.1

Random or uniform QA sampling gives every annotation the same chance of review, so it ignores the systematic patterns in human labeling mistakes. Anthrocentrix scores each annotation decision from its telemetry (time on task, revisions, reopens, skip-and-return, self-confidence, fatigue, disagreement history) and routes only the high-risk decisions to human review.

Decisions Modeled

3,000

Labelers

Base Error Rate

42.2%

PR-AUC (errors)

0.590

anthrocentrix model

Errors @ 20% Review

391

random: 260

Recall @ 20% Review

30.9%

random: 20.5%
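
The PR-AUC tile summarizes how well the risk score ranks errors above correct labels. The page does not say which estimator it uses; the sketch below shows one common choice, average precision, as an assumption. The function name and toy inputs are illustrative only. For reference, a ranking with no signal would be expected to score close to the base error rate (42.2% here).

```python
def average_precision(scores, is_error):
    """Average precision of a risk ranking (one common PR-AUC estimator).

    scores:   risk score per decision, higher means more likely an error
    is_error: 0/1 ground-truth flag per decision
    Ties between equal scores are broken by input order in this sketch.
    """
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    # Walk decisions from highest to lowest risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_error[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_errors


# Toy example (made-up values): errors sit at ranks 1 and 3.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```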

Workflow

labeler → telemetry → risk → routing

Step 1: Labeler decides (submits annotation)
Step 2: Telemetry captured (time, revisions, fatigue)
Step 3: Risk scored (anthrocentrix model)
Step 4: High-risk routed (→ human QA)
Step 5: Low-risk bypassed (auto-accept)
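
Steps 4 and 5 amount to ranking decisions by risk and cutting at a review budget. A minimal sketch of that cut follows, assuming risk scores are already available per task; `route_by_risk` is a hypothetical helper, and the three low-risk task IDs and scores in the example are made up (the two high-risk rows mirror the routing preview).

```python
def route_by_risk(risk_by_task, review_fraction=0.20):
    """Send the highest-risk share of tasks to review, auto-accept the rest.

    risk_by_task:    dict of task_id -> risk score (higher = riskier)
    review_fraction: share of tasks to review, e.g. 0.20 for the top 20%
    """
    n_review = round(len(risk_by_task) * review_fraction)
    ranked = sorted(risk_by_task, key=risk_by_task.get, reverse=True)
    return {
        task: ("review" if rank < n_review else "auto-accept")
        for rank, task in enumerate(ranked)
    }


# Two scores from the routing preview plus three hypothetical low-risk tasks.
risks = {"t_31_54": 0.781, "t_1_64": 0.776, "t_a": 0.30, "t_b": 0.12, "t_c": 0.05}
routes = route_by_risk(risks, review_fraction=0.40)
```

With five tasks and a 40% budget, the two highest-risk tasks go to review and the remaining three are auto-accepted.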

Review-Efficiency Frontier

anthrocentrix routing vs random QA sampling

Review %	Anthrocentrix recall	Random recall	Lift	Errors / 100 reviews (Anthrocentrix)	Errors / 100 reviews (random)
5%	8.3%	4.7%	+3.6 pp	70.0	40.0
10%	16.0%	9.7%	+6.3 pp	67.7	41.0
15%	23.3%	15.6%	+7.7 pp	65.6	44.0
20%	30.9%	20.5%	+10.3 pp	65.2	43.3
30%	43.1%	29.8%	+13.3 pp	60.7	41.9
40%	53.6%	39.8%	+13.8 pp	56.6	42.0
50%	63.6%	49.9%	+13.7 pp	53.7	42.1
75%	85.8%	75.5%	+10.3 pp	48.3	42.5
100%	100.0%	100.0%	+0.0 pp	42.2	42.2
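
Each row of the frontier can be recomputed from risk scores and gold labels. The sketch below assumes a list of scores and the matching `is_error` flags from the dataset schema; `frontier_row` is an illustrative name, not part of the product.

```python
def frontier_row(scores, is_error, review_fraction):
    """Return (recall, errors_per_100_reviews) when reviewing the top share by risk."""
    n = len(scores)
    k = round(n * review_fraction)          # number of decisions reviewed
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    caught = sum(is_error[i] for i in order[:k])
    total_errors = sum(is_error)
    recall = caught / total_errors if total_errors else 0.0
    errors_per_100 = 100.0 * caught / k if k else 0.0
    return recall, errors_per_100


# Toy example (made-up values): reviewing the top half catches 1 of 2 errors.
toy_recall, toy_per_100 = frontier_row([0.9, 0.8, 0.1, 0.2], [1, 0, 1, 0], 0.5)
```

The headline tiles are consistent with this definition: 3,000 decisions at a 42.2% base error rate is 1,266 errors, and 391 errors caught in 600 reviews gives 391 / 1,266 ≈ 30.9% recall and 391 / 600 ≈ 65.2 errors per 100 reviews, matching the 20% row.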

ROI Calculator

adjust inputs to match your operation

Inputs: Annotations / month · Baseline review % · Reviewer cost ($/h) · Minutes / review · Cost / missed error ($) · Target recall (0–1)

Result · monthly

Anthrocentrix chose a 75% review fraction to hit the 75% target recall.
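
One way to arrive at that choice is to pick the smallest tabulated review fraction whose recall meets the target. This is an assumption about the calculator (it may interpolate instead), but it reproduces the displayed result. The `FRONTIER` values are copied from the Anthrocentrix recall column of the frontier table; the function name is illustrative.

```python
# (review fraction, Anthrocentrix recall) pairs from the frontier table.
FRONTIER = [
    (0.05, 0.083), (0.10, 0.160), (0.15, 0.233),
    (0.20, 0.309), (0.30, 0.431), (0.40, 0.536),
    (0.50, 0.636), (0.75, 0.858), (1.00, 1.000),
]


def choose_review_fraction(target_recall, frontier=FRONTIER):
    """Smallest tabulated review fraction whose recall is at least the target."""
    for fraction, recall in frontier:
        if recall >= target_recall:
            return fraction
    return 1.0  # reviewing everything always reaches full recall


chosen = choose_review_fraction(0.75)
```

For a 0.75 target, the 50% row (63.6% recall) falls short, so the 75% row (85.8% recall) is the first that qualifies.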

Baseline cost

$182,292

Routed cost

$546,875

Hours saved

-10,417

Reviews avoided

-250,000

Errors caught Δ

129,500

Error exposure ↓

$518,000

Net monthly savings

$153,417
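
The displayed figures are consistent with net savings being the error-exposure reduction minus the added review cost: $518,000 - ($546,875 - $182,292) = $153,417. The sketch below spells out that arithmetic. The calculator's input values are not shown on the page, so the inputs in the example are hypothetical, back-solved so that they reproduce the displayed costs and hours. How the page derives "Errors caught Δ" is not stated, so it is passed in directly.

```python
def monthly_roi(annotations, baseline_frac, routed_frac,
                reviewer_cost_per_hour, minutes_per_review,
                cost_per_missed_error, errors_caught_delta):
    """Monthly cost comparison between baseline sampling and risk routing."""
    hours_per_review = minutes_per_review / 60
    baseline_cost = annotations * baseline_frac * hours_per_review * reviewer_cost_per_hour
    routed_cost = annotations * routed_frac * hours_per_review * reviewer_cost_per_hour
    reviews_avoided = annotations * (baseline_frac - routed_frac)  # negative = more reviews
    hours_saved = reviews_avoided * hours_per_review
    exposure_reduction = errors_caught_delta * cost_per_missed_error
    net_savings = exposure_reduction - (routed_cost - baseline_cost)
    return {
        "baseline_cost": baseline_cost,
        "routed_cost": routed_cost,
        "reviews_avoided": reviews_avoided,
        "hours_saved": hours_saved,
        "exposure_reduction": exposure_reduction,
        "net_savings": net_savings,
    }


# Hypothetical inputs, back-solved from the displayed outputs (not shown on the page).
result = monthly_roi(
    annotations=500_000, baseline_frac=0.25, routed_frac=0.75,
    reviewer_cost_per_hour=35.0, minutes_per_review=2.5,
    cost_per_missed_error=4.0, errors_caught_delta=129_500,
)
```

In this scenario hours saved and reviews avoided are negative: hitting the recall target means reviewing more than the baseline does, and the net figure is positive only because the reduced error exposure outweighs the extra review cost.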

Routing Preview

top 20% risk → human QA

Task	Labeler	Time (s)	Revisions	Reopened	Confidence	Fatigue	Risk	Route
t_31_54	lbl_032	44.2	5	yes	0.00	0.70	0.781	review
t_1_64	lbl_002	44.5	5	yes	0.33	0.86	0.776	review
t_11_63	lbl_012	37.9	5	yes	0.11	0.83	0.776	review
t_2_72	lbl_003	46.1	5	yes	0.03	0.91	0.763	review
t_9_66	lbl_010	46	4	yes	0.00	0.91	0.762	review
t_29_74	lbl_030	43.4	4	yes	0.00	1.00	0.762	review
t_26_72	lbl_027	46	4	yes	0.00	0.94	0.758	review
t_12_67	lbl_013	44	4	yes	0.00	0.85	0.755	review
t_30_70	lbl_031	47.5	5	yes	0.39	0.98	0.754	review
t_9_70	lbl_010	40	4	yes	0.00	0.92	0.752	review
t_36_71	lbl_037	47	5	yes	0.37	0.91	0.750	review
t_25_73	lbl_026	39.4	4	yes	0.04	0.94	0.747	review

Uploadable Dataset Schema

CSV or JSON

Annotation QA Dataset Schema
============================
Required columns (CSV or JSON):

  task_id               string   unique per labeling decision
  labeler_id            string   stable labeler identifier
  task_index            int      position in labeler's session (0-based)
  time_on_task_s        number   seconds spent on the task
  revision_count        int      number of edits before submission
  reopened              0|1      labeler reopened after submit
  skipped_then_returned 0|1      task was skipped and later completed
  self_confidence       0..1     optional self-reported confidence
  session_fatigue       0..1     fraction of session elapsed
  disagreement_history  0..1     labeler's historical disagreement rate
  label                 string   submitted label (any taxonomy)

Optional (for evaluation only):
  is_error              0|1      ground-truth flag from gold review
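
Before uploading, a file can be checked against the schema above. The following is a minimal sketch for CSV input using only the standard library; `validate_row` is a hypothetical helper, not part of the product. Because `self_confidence` is listed as a column but described as optional, the sketch assumes the column must exist while its value may be left empty. In the sample, the first row borrows task, labeler, time, revision, confidence, and fatigue values from the routing preview; all other values are made up.

```python
import csv
import io

REQUIRED = [
    "task_id", "labeler_id", "task_index", "time_on_task_s", "revision_count",
    "reopened", "skipped_then_returned", "self_confidence", "session_fatigue",
    "disagreement_history", "label",
]
FLAG_COLUMNS = ["reopened", "skipped_then_returned"]          # 0|1
UNIT_COLUMNS = ["self_confidence", "session_fatigue", "disagreement_history"]  # 0..1
COUNT_COLUMNS = ["task_index", "revision_count"]              # non-negative ints


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def validate_row(row):
    """Return a list of problems for one CSV row (empty list = row passes)."""
    missing = [c for c in REQUIRED if c not in row]
    if missing:
        return [f"missing column: {c}" for c in missing]
    row = {k: (v or "") for k, v in row.items()}  # treat absent values as empty
    problems = []
    for col in ("task_id", "labeler_id", "label"):
        if not row[col]:
            problems.append(f"{col} is empty")
    for col in COUNT_COLUMNS:
        if not row[col].isdigit():
            problems.append(f"{col} must be a non-negative integer")
    if not _is_number(row["time_on_task_s"]) or float(row["time_on_task_s"]) < 0:
        problems.append("time_on_task_s must be a non-negative number")
    for col in FLAG_COLUMNS:
        if row[col] not in ("0", "1"):
            problems.append(f"{col} must be 0 or 1")
    for col in UNIT_COLUMNS:
        if col == "self_confidence" and row[col] == "":
            continue  # assumption: an empty self_confidence value is allowed
        if not _is_number(row[col]) or not 0.0 <= float(row[col]) <= 1.0:
            problems.append(f"{col} must be between 0 and 1")
    if row.get("is_error", "") not in ("", "0", "1"):
        problems.append("is_error must be 0 or 1 when present")
    return problems


SAMPLE = """task_id,labeler_id,task_index,time_on_task_s,revision_count,reopened,skipped_then_returned,self_confidence,session_fatigue,disagreement_history,label
t_31_54,lbl_032,54,44.2,5,1,0,0.00,0.70,0.25,cat
t_bad,lbl_001,3,12.0,1,yes,0,0.50,1.40,0.10,dog
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
reports = [validate_row(r) for r in rows]
```

The first sample row passes; the second is flagged twice, once for `reopened` holding `yes` instead of `0` or `1` and once for a `session_fatigue` value above 1.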