Confidence Calibration
For agents that report a pre-generation confidence in [0, 1], we score the Brier loss of that confidence against the realized Pass@1 outcome. Agents that do not expose a confidence channel are assigned a constant prior equal to their global Pass@1 rate, which serves as their effective baseline.
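As a rough illustration of the scoring described above, the following is a minimal sketch and not the leaderboard's actual implementation. The function name `brier_loss` and its arguments are hypothetical, and it assumes Pass@1 is recorded as one pass/fail outcome per task. How the result is scaled into the numbers shown in the table is not stated here, so the sketch returns the raw mean squared error in [0, 1].

```python
from typing import Optional, Sequence


def brier_loss(
    passed: Sequence[bool],
    confidences: Optional[Sequence[float]] = None,
) -> float:
    """Mean squared error between stated confidence and pass/fail outcome.

    `passed` holds one Pass@1 outcome per task (assumed binary).
    `confidences` holds the agent's pre-generation confidence per task,
    or None when the agent exposes no confidence channel.
    """
    if not passed:
        raise ValueError("need at least one task outcome")

    outcomes = [1.0 if p else 0.0 for p in passed]

    if confidences is None:
        # No confidence channel: fall back to a constant prior equal to
        # the agent's global Pass@1 rate, as the text describes.
        prior = sum(outcomes) / len(outcomes)
        confidences = [prior] * len(outcomes)
    elif len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must align")

    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    squared_errors = [(c - y) ** 2 for c, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical outcomes for four tasks.
outcomes = [True, False, True, True]
baseline = brier_loss(outcomes)                          # constant-prior fallback
reported = brier_loss(outcomes, [0.9, 0.2, 0.8, 0.6])    # agent-reported confidences
```

With a constant prior p, the loss reduces to p(1 − p), so under this sketch an agent without a confidence channel is scored purely by how far its pass rate sits from 0 or 1. An agent that reports confidences beats that baseline only if its confidences track its per-task outcomes.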
Calibration (Brier) · score · ↓
Pass@1 · ratio · ↑
RANKED AGENTS · 95% CI
| # | Agent | Score [95% CI] · n |
|---|---|---|
| 1 | GPT-5 → CadQuery | 64.7 [39.0, 90.4] · n=2 |
| 2 | Human Baseline (Mech-E) | 46.5 [46.5, 46.6] · n=2 |
| 3 | OpenAI o4 (reasoning) → CadQuery | 43.0 [43.0, 43.1] · n=2 |
| 4 | Claude Opus 4.7 → OpenSCAD | 40.0 [39.8, 40.3] · n=2 |
| 5 | Claude Opus 4.7 → CadQuery | 40.0 [38.5, 41.4] · n=2 |
| 6 | DeepSeek R1 (reasoning) → CadQuery | 37.3 [35.0, 39.6] · n=2 |
| 7 | Llama 3.3 70B → OpenSCAD | 37.2 [37.1, 37.3] · n=2 |
| 8 | Zoo Text-to-CAD | 37.1 [34.9, 39.4] · n=2 |
| 9 | Claude Sonnet 4.6 → CadQuery | 36.9 [36.1, 37.6] · n=2 |
| 10 | Gemini 2.5 Pro → OpenSCAD | 36.4 [34.2, 38.6] · n=2 |
| 11 | Claude Haiku 4.5 → CadQuery | 35.7 [33.1, 38.3] · n=2 |
| 12 | Qwen3 Coder → CadQuery | 34.0 [33.1, 34.8] · n=2 |
| 13 | Adam (CADcrush) | 33.5 [33.0, 33.9] · n=2 |
| 14 | CAD-Coder R1 | 33.1 [31.7, 34.6] · n=2 |
| 15 | GPT-5 Mini → OpenSCAD | 31.9 [29.5, 34.4] · n=2 |
| 16 | Trellis 3D | 30.8 [29.5, 32.1] · n=2 |
| 17 | DeepCAD | 29.1 [25.2, 33.1] · n=2 |
| 18 | Spline AI | 26.4 [21.7, 31.1] · n=2 |
| 19 | Gemini 2.5 Flash → CadQuery | 19.3 [0.0, 38.6] · n=2 |
| 20 | Hunyuan3D-2 | 13.1 [0.0, 26.2] · n=2 |