Paraphrase Robustness
Each task ships with N=5 LLM-rephrased prompts that preserve every spec quantity. We compute the standard deviation of vol_iou across the paraphrase set. Lower is better: a small spread suggests the agent reads intent rather than surface form.
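A minimal sketch of the spread computation described above, under assumptions that the text does not state. The function name `paraphrase_sigma`, the sample values, and the default choice of population (rather than sample) standard deviation are all hypothetical; the benchmark may aggregate differently.

```python
from statistics import pstdev, stdev


def paraphrase_sigma(vol_ious, sample=False):
    """Standard deviation of vol_iou over one task's paraphrase set.

    vol_ious: one vol_iou value per paraphrased prompt (N=5 in the text).
    sample:   use the sample (n-1) estimator instead of the population one.
              Which estimator the benchmark uses is an assumption here.
    """
    if len(vol_ious) < 2:
        raise ValueError("need at least two paraphrase results")
    return stdev(vol_ious) if sample else pstdev(vol_ious)


# Hypothetical vol_iou values for two agents on the same task.
stable = [0.91, 0.90, 0.92, 0.91, 0.90]   # similar output for every rephrasing
brittle = [0.95, 0.40, 0.88, 0.10, 0.93]  # output depends on the wording

sigma_stable = paraphrase_sigma(stable)
sigma_brittle = paraphrase_sigma(brittle)

# The lower spread belongs to the agent that is less sensitive to phrasing.
print(f"stable:  {sigma_stable:.3f}")
print(f"brittle: {sigma_brittle:.3f}")
```

An agent that returns the same vol_iou for every paraphrase gets a spread of exactly zero, regardless of how good that vol_iou is, so this metric measures consistency and says nothing about accuracy on its own.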
Metrics: Paraphrase IoU σ (ratio, ↓) · Seed σ (ratio, ↓)
RANKED AGENTS · 95 % CI
| # | Agent | Score [95% CI] · n |
|---|---|---|
| 1 | Human Baseline (Mech-E) | 97.1 [96.8, 97.5] · n=2 |
| 2 | Zoo Text-to-CAD | 86.5 [85.8, 87.3] · n=2 |
| 3 | OpenAI o4 (reasoning) → CadQuery | 84.5 [83.0, 86.0] · n=2 |
| 4 | Claude Sonnet 4.6 → CadQuery | 82.6 [78.3, 87.0] · n=2 |
| 5 | Adam (CADcrush) | 82.5 [79.5, 85.5] · n=2 |
| 6 | Claude Opus 4.7 → OpenSCAD | 81.9 [81.3, 82.5] · n=2 |
| 7 | Claude Opus 4.7 → CadQuery | 81.1 [80.3, 82.0] · n=2 |
| 8 | GPT-5 → CadQuery | 79.6 [78.5, 80.8] · n=2 |
| 9 | Gemini 2.5 Flash → CadQuery | 76.0 [71.5, 80.5] · n=2 |
| 10 | DeepSeek R1 (reasoning) → CadQuery | 74.0 [72.8, 75.3] · n=2 |
| 11 | GPT-5 Mini → OpenSCAD | 72.5 [72.5, 72.5] · n=2 |
| 12 | Qwen3 Coder → CadQuery | 71.9 [71.8, 72.0] · n=2 |
| 13 | Claude Haiku 4.5 → CadQuery | 71.5 [66.5, 76.5] · n=2 |
| 14 | Llama 3.3 70B → OpenSCAD | 66.1 [62.8, 69.5] · n=2 |
| 15 | Hunyuan3D-2 | 66.0 [63.0, 69.0] · n=2 |
| 16 | DeepCAD | 62.8 [59.8, 65.8] · n=2 |
| 17 | Trellis 3D | 61.4 [56.8, 66.0] · n=2 |
| 18 | Spline AI | 57.4 [50.0, 64.8] · n=2 |
| 19 | CAD-Coder R1 | 42.0 [0.0, 84.0] · n=2 |
| 20 | Gemini 2.5 Pro → OpenSCAD | 36.6 [0.0, 73.3] · n=2 |