CAD-Bench

Confidence Calibration

For agents that report a pre-generation confidence ∈ [0, 1], we score the Brier loss of that confidence against the realized Pass@1 outcome. Agents that do not expose a confidence channel are assigned a constant prior equal to their global Pass@1 rate, which serves as their effective baseline.
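The scoring rule described above can be sketched as follows. This is a minimal illustration and not the benchmark's actual scoring code: the function names `brier_loss` and `constant_prior_loss` are hypothetical, outcomes are assumed to be binary (1 = pass, 0 = fail), and the conversion from the raw loss to the score shown on the leaderboard is not specified on this page, so it is left out.

```python
def brier_loss(confidences, outcomes):
    """Mean squared error between stated confidences and binary outcomes.

    confidences: floats in [0, 1], one per task (reported before generation).
    outcomes:    1 if the task passed at the first attempt, else 0.
    Lower is better; 0.0 means perfectly confident and always right.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not outcomes:
        raise ValueError("at least one task is required")
    for p in confidences:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p} is outside [0, 1]")
    squared_errors = [(p - y) ** 2 for p, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def constant_prior_loss(outcomes):
    """Brier loss for an agent with no confidence channel.

    Assumption: the agent is scored as if it reported its own global
    Pass@1 rate as the confidence on every task.
    """
    if not outcomes:
        raise ValueError("at least one task is required")
    rate = sum(outcomes) / len(outcomes)
    return brier_loss([rate] * len(outcomes), outcomes)


if __name__ == "__main__":
    # Illustrative values only, not benchmark data.
    outcomes = [1, 0, 1, 1]
    confidences = [0.9, 0.2, 0.6, 0.8]
    print("reported confidence:", brier_loss(confidences, outcomes))
    print("constant prior:     ", constant_prior_loss(outcomes))
```

Under the constant-prior assumption the loss reduces to r(1 − r) for a global pass rate r, so an agent that reports confidences only does better than its own baseline if those confidences carry per-task information.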


RANKED AGENTS · 95 % CI

 #  Agent                               Score  95 % CI       n
 1  GPT-5 → CadQuery                    64.7   [39.0, 90.4]  2
 2  Human Baseline (Mech-E)             46.5   [46.5, 46.6]  2
 3  OpenAI o4 (reasoning) → CadQuery    43.0   [43.0, 43.1]  2
 4  Claude Opus 4.7 → OpenSCAD          40.0   [39.8, 40.3]  2
 5  Claude Opus 4.7 → CadQuery          40.0   [38.5, 41.4]  2
 6  DeepSeek R1 (reasoning) → CadQuery  37.3   [35.0, 39.6]  2
 7  Llama 3.3 70B → OpenSCAD            37.2   [37.1, 37.3]  2
 8  Zoo Text-to-CAD                     37.1   [34.9, 39.4]  2
 9  Claude Sonnet 4.6 → CadQuery        36.9   [36.1, 37.6]  2
10  Gemini 2.5 Pro → OpenSCAD           36.4   [34.2, 38.6]  2
11  Claude Haiku 4.5 → CadQuery         35.7   [33.1, 38.3]  2
12  Qwen3 Coder → CadQuery              34.0   [33.1, 34.8]  2
13  Adam (CADcrush)                     33.5   [33.0, 33.9]  2
14  CAD-Coder R1                        33.1   [31.7, 34.6]  2
15  GPT-5 Mini → OpenSCAD               31.9   [29.5, 34.4]  2
16  Trellis 3D                          30.8   [29.5, 32.1]  2
17  DeepCAD                             29.1   [25.2, 33.1]  2
18  Spline AI                           26.4   [21.7, 31.1]  2
19  Gemini 2.5 Flash → CadQuery         19.3   [0.0, 38.6]   2
20  Hunyuan3D-2                         13.1   [0.0, 26.2]   2

TASKS IN THIS CATEGORY