CAD-Bench

Paraphrase Robustness

Each task ships with N=5 LLM-rephrased prompts that preserve every spec quantity. We compute the standard deviation of vol_iou across the paraphrase set; a lower value indicates that the agent reads intent rather than surface form.
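As a rough illustration of the metric described above, the per-task spread can be sketched as follows. This is a minimal sketch, not the benchmark's own code: the function name `paraphrase_sigma` and the sample values are hypothetical, and the choice of population standard deviation (rather than sample standard deviation) is an assumption, since the page does not say which is used.

```python
from statistics import pstdev
from typing import Sequence


def paraphrase_sigma(vol_ious: Sequence[float]) -> float:
    """Spread of vol_iou over one task's paraphrase set.

    Assumption: population std-dev (pstdev). The benchmark may
    instead use the sample std-dev; the page does not specify.
    """
    if len(vol_ious) < 2:
        raise ValueError("need at least two paraphrase scores")
    return pstdev(vol_ious)


# Hypothetical vol_iou values for the N=5 paraphrases of one task.
stable = [0.90, 0.90, 0.90, 0.90, 0.90]   # same result for every phrasing
brittle = [0.90, 0.50, 0.90, 0.50, 0.90]  # result depends on the wording

print(f"stable  sigma = {paraphrase_sigma(stable):.3f}")
print(f"brittle sigma = {paraphrase_sigma(brittle):.3f}")
```

Under these made-up inputs the first agent has zero spread and the second has a clearly larger one, which is the sense in which a lower value suggests the agent is reading intent rather than surface form.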


RANKED AGENTS · 95 % CI

#   Agent                               Score  95% CI        n
1   Human Baseline (Mech-E)             97.1   [96.8, 97.5]  2
2   Zoo Text-to-CAD                     86.5   [85.8, 87.3]  2
3   OpenAI o4 (reasoning) → CadQuery    84.5   [83.0, 86.0]  2
4   Claude Sonnet 4.6 → CadQuery        82.6   [78.3, 87.0]  2
5   Adam (CADcrush)                     82.5   [79.5, 85.5]  2
6   Claude Opus 4.7 → OpenSCAD          81.9   [81.3, 82.5]  2
7   Claude Opus 4.7 → CadQuery          81.1   [80.3, 82.0]  2
8   GPT-5 → CadQuery                    79.6   [78.5, 80.8]  2
9   Gemini 2.5 Flash → CadQuery         76.0   [71.5, 80.5]  2
10  DeepSeek R1 (reasoning) → CadQuery  74.0   [72.8, 75.3]  2
11  GPT-5 Mini → OpenSCAD               72.5   [72.5, 72.5]  2
12  Qwen3 Coder → CadQuery              71.9   [71.8, 72.0]  2
13  Claude Haiku 4.5 → CadQuery         71.5   [66.5, 76.5]  2
14  Llama 3.3 70B → OpenSCAD            66.1   [62.8, 69.5]  2
15  Hunyuan3D-2                         66.0   [63.0, 69.0]  2
16  DeepCAD                             62.8   [59.8, 65.8]  2
17  Trellis 3D                          61.4   [56.8, 66.0]  2
18  Spline AI                           57.4   [50.0, 64.8]  2
19  CAD-Coder R1                        42.0   [0.0, 84.0]   2
20  Gemini 2.5 Pro → OpenSCAD           36.6   [0.0, 73.3]   2

TASKS IN THIS CATEGORY