Confidence Calibration
For agents that report a pre-generation confidence ∈ [0, 1], we compute the Brier loss of that confidence against the realized Pass@1 outcome. Agents that don't expose a confidence channel are assigned a constant prior equal to their global Pass@1 rate, and the Brier loss of that constant prediction serves as their effective baseline.
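As a rough illustration of the scoring rule described above, here is a minimal Python sketch. The function name `brier_loss` and its arguments are hypothetical, and the sketch assumes each task outcome is a binary Pass@1 result (1 = pass, 0 = fail). Any rescaling applied to the scores shown in the table is not specified here and is not modeled.

```python
from typing import Optional, Sequence


def brier_loss(
    outcomes: Sequence[int],
    confidences: Optional[Sequence[float]] = None,
) -> float:
    """Mean squared gap between stated confidence and binary Pass@1 outcome.

    `outcomes` holds 1 (pass) or 0 (fail) per task. `confidences` holds the
    pre-generation confidence per task, or None if the agent exposes no
    confidence channel. Both names are illustrative assumptions.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")

    if confidences is None:
        # No confidence channel: use the constant prior, i.e. the agent's
        # global Pass@1 rate, as the prediction for every task.
        prior = sum(outcomes) / len(outcomes)
        confidences = [prior] * len(outcomes)

    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    squared_errors = [(c - y) ** 2 for c, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Agent with a confidence channel.
with_channel = brier_loss([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6])

# Agent without one: falls back to its global Pass@1 rate (0.75 here).
baseline = brier_loss([1, 0, 1, 1])
```

Under the constant prior p, the loss reduces to p(1 − p), so an agent with a reported confidence only beats its own baseline if its confidences carry information beyond its overall pass rate.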
Calibration (Brier) · score · ↓ (lower is better)
Pass@1 · ratio · ↑ (higher is better)
RANKED AGENTS · 95% CI
| # | Agent | Score |
|---|---|---|
| 1 | Human Baseline (Mech-E) | 44.6 [44.6, 44.6] · n=1 |
| 2 | Claude Opus 4.7 → CadQuery | 39.5 [39.5, 39.5] · n=1 |
| 3 | Claude Opus 4.7 → OpenSCAD | 39.5 [39.5, 39.5] · n=1 |
| 4 | Zoo Text-to-CAD | 39.4 [39.4, 39.4] · n=1 |
| 5 | Adam (CADcrush) | 37.1 [37.1, 37.1] · n=1 |
| 6 | Gemini 2.5 Pro → OpenSCAD | 34.9 [34.9, 34.9] · n=1 |
| 7 | GPT-5 → CadQuery | 34.7 [34.7, 34.7] · n=1 |
| 8 | DeepCAD | 27.6 [27.6, 27.6] · n=1 |
| 9 | Spline AI | 21.9 [21.9, 21.9] · n=1 |
| 10 | Trellis 3D | 0.0 [0.0, 0.0] · n=1 |