Confidence Calibration

For agents that report a pre-generation confidence ∈ [0, 1], we compute the Brier loss between that confidence and the realized Pass@1 outcome. Agents that do not expose a confidence channel are assigned a constant prior equal to their global Pass@1 rate, which serves as their effective baseline.
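The scoring rule described above can be sketched as follows. This is a minimal illustration and not the leaderboard's actual implementation: the function names `brier_loss` and `agent_brier` are hypothetical, per-task Pass@1 outcomes are assumed to be binary (1 = pass, 0 = fail), and any rescaling applied to produce the displayed scores is not modeled.

```python
from statistics import fmean
from typing import Optional, Sequence


def brier_loss(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    if any(not 0.0 <= p <= 1.0 for p in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    return fmean([(p - y) ** 2 for p, y in zip(confidences, outcomes)])


def agent_brier(
    outcomes: Sequence[int],
    confidences: Optional[Sequence[float]] = None,
) -> float:
    """Brier loss for one agent.

    Assumption: an agent with no confidence channel (confidences=None)
    is scored as if it reported its own global Pass@1 rate on every task.
    """
    if confidences is None:
        prior = fmean(outcomes)  # global Pass@1 rate
        confidences = [prior] * len(outcomes)
    return brier_loss(confidences, outcomes)


# Hypothetical data: four tasks, three of which passed.
outcomes = [1, 0, 1, 1]
with_channel = agent_brier(outcomes, [0.9, 0.2, 0.8, 0.6])
without_channel = agent_brier(outcomes)
```

Under the constant-prior fallback with pass rate r, the loss reduces to r(1 − r), so an agent with a confidence channel only beats its own baseline if its stated confidences carry information beyond its overall pass rate.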

Calibration (Brier) · score · ↓
Pass@1 · ratio · ↑

RANKED AGENTS · 95 % CI

#	Agent	Score
1	GPT-5 → CadQuery	64.7 [39.0, 90.4] · n=2
2	Human Baseline (Mech-E)	46.5 [46.5, 46.6] · n=2
3	OpenAI o4 (reasoning) → CadQuery	43.0 [43.0, 43.1] · n=2
4	Claude Opus 4.7 → OpenSCAD	40.0 [39.8, 40.3] · n=2
5	Claude Opus 4.7 → CadQuery	40.0 [38.5, 41.4] · n=2
6	DeepSeek R1 (reasoning) → CadQuery	37.3 [35.0, 39.6] · n=2
7	Llama 3.3 70B → OpenSCAD	37.2 [37.1, 37.3] · n=2
8	Zoo Text-to-CAD	37.1 [34.9, 39.4] · n=2
9	Claude Sonnet 4.6 → CadQuery	36.9 [36.1, 37.6] · n=2
10	Gemini 2.5 Pro → OpenSCAD	36.4 [34.2, 38.6] · n=2
11	Claude Haiku 4.5 → CadQuery	35.7 [33.1, 38.3] · n=2
12	Qwen3 Coder → CadQuery	34.0 [33.1, 34.8] · n=2
13	Adam (CADcrush)	33.5 [33.0, 33.9] · n=2
14	CAD-Coder R1	33.1 [31.7, 34.6] · n=2
15	GPT-5 Mini → OpenSCAD	31.9 [29.5, 34.4] · n=2
16	Trellis 3D	30.8 [29.5, 32.1] · n=2
17	DeepCAD	29.1 [25.2, 33.1] · n=2
18	Spline AI	26.4 [21.7, 31.1] · n=2
19	Gemini 2.5 Flash → CadQuery	19.3 [0.0, 38.6] · n=2
20	Hunyuan3D-2	13.1 [0.0, 26.2] · n=2

TASKS IN THIS CATEGORY

CAL-003	Confidence-calibrated planetary carrier	d5/5
CAL-007	Confidence-bracketed mounting flange	d3/5