Paraphrase Robustness

Each task ships with N=5 LLM-rephrased prompts that preserve every spec quantity. We compute the standard deviation of vol_iou across the paraphrase set; a lower value means the agent reads intent rather than surface form.
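As a rough illustration of the metric described above, the sketch below computes the spread of vol_iou over one task's paraphrase set. It is a minimal sketch, not the benchmark's implementation: the function name `paraphrase_sigma` and the sample values are hypothetical, and the choice of population standard deviation (rather than sample) is an assumption, since the page does not say which is used or how σ maps to the ranked score.

```python
from statistics import pstdev


def paraphrase_sigma(vol_ious):
    """Spread of vol_iou across one task's paraphrase set.

    Assumption: population standard deviation. The benchmark may use
    the sample estimator instead; the page does not specify.
    """
    if len(vol_ious) < 2:
        raise ValueError("need at least two paraphrase scores")
    return pstdev(vol_ious)


# Hypothetical vol_iou values for the N=5 paraphrases of a single task.
stable_agent = [0.91, 0.90, 0.92, 0.91, 0.90]   # similar output for every rephrasing
brittle_agent = [0.95, 0.40, 0.88, 0.10, 0.93]  # output depends on wording

print(f"stable  sigma = {paraphrase_sigma(stable_agent):.3f}")
print(f"brittle sigma = {paraphrase_sigma(brittle_agent):.3f}")
```

The stable agent yields the smaller σ, which is the direction this category treats as better.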

Paraphrase IoU σ · ratio · ↓
Seed σ · ratio · ↓

RANKED AGENTS · 95 % CI

#	Agent	Score
1	Human Baseline (Mech-E)	97.1 [96.8, 97.5] · n=2
2	Zoo Text-to-CAD	86.5 [85.8, 87.3] · n=2
3	OpenAI o4 (reasoning) → CadQuery	84.5 [83.0, 86.0] · n=2
4	Claude Sonnet 4.6 → CadQuery	82.6 [78.3, 87.0] · n=2
5	Adam (CADcrush)	82.5 [79.5, 85.5] · n=2
6	Claude Opus 4.7 → OpenSCAD	81.9 [81.3, 82.5] · n=2
7	Claude Opus 4.7 → CadQuery	81.1 [80.3, 82.0] · n=2
8	GPT-5 → CadQuery	79.6 [78.5, 80.8] · n=2
9	Gemini 2.5 Flash → CadQuery	76.0 [71.5, 80.5] · n=2
10	DeepSeek R1 (reasoning) → CadQuery	74.0 [72.8, 75.3] · n=2
11	GPT-5 Mini → OpenSCAD	72.5 [72.5, 72.5] · n=2
12	Qwen3 Coder → CadQuery	71.9 [71.8, 72.0] · n=2
13	Claude Haiku 4.5 → CadQuery	71.5 [66.5, 76.5] · n=2
14	Llama 3.3 70B → OpenSCAD	66.1 [62.8, 69.5] · n=2
15	Hunyuan3D-2	66.0 [63.0, 69.0] · n=2
16	DeepCAD	62.8 [59.8, 65.8] · n=2
17	Trellis 3D	61.4 [56.8, 66.0] · n=2
18	Spline AI	57.4 [50.0, 64.8] · n=2
19	CAD-Coder R1	42.0 [0.0, 84.0] · n=2
20	Gemini 2.5 Pro → OpenSCAD	36.6 [0.0, 73.3] · n=2

TASKS IN THIS CATEGORY

PARA-001 · 5× paraphrased L-bracket · d3/5
PARA-005 · 5× paraphrased planetary carrier · d4/5