Paraphrase Robustness
Each task ships with N = 5 LLM-rephrased prompts that preserve every quantity in the spec. We compute the standard deviation (σ) of vol_iou across the paraphrase set; a lower σ means the agent reads intent rather than surface form.
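The computation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the task names and vol_iou values are made up, the choice of population standard deviation (rather than sample) is an assumption, and averaging the per-task values into one agent-level number is likewise an assumption the source does not confirm.

```python
from statistics import mean, pstdev


def paraphrase_sigma(vol_ious):
    """Std-dev of vol_iou across one task's paraphrase set.

    Assumption: population std-dev (divide by N). The source does not
    say whether the population or sample form is used.
    """
    if len(vol_ious) < 2:
        raise ValueError("need at least two paraphrase results")
    return pstdev(vol_ious)


# Hypothetical vol_iou results: one list per task, one entry per paraphrase (N=5).
results = {
    "bracket": [0.90, 0.92, 0.88, 0.91, 0.89],  # stable across rephrasings
    "flange": [0.95, 0.60, 0.80, 0.99, 0.70],   # sensitive to wording
}

per_task_sigma = {task: paraphrase_sigma(v) for task, v in results.items()}

# Assumption: the agent-level figure is the mean of the per-task values.
agent_sigma = mean(list(per_task_sigma.values()))

for task, sigma in per_task_sigma.items():
    print(f"{task}: sigma = {sigma:.4f}")
print(f"agent: sigma = {agent_sigma:.4f}")
```

In this sketch the "flange" task yields a much larger σ than "bracket", which is the pattern the metric is meant to flag: output quality that depends on how the prompt is worded rather than on what it specifies.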
Metrics: Paraphrase IoU σ (ratio, ↓ lower is better) · Seed σ (ratio, ↓ lower is better)
RANKED AGENTS · 95% CI
| # | Agent | Score |
|---|---|---|
| 1 | Human Baseline (Mech-E) | 97.8 [97.8, 97.8] · n=1 |
| 2 | Zoo Text-to-CAD | 85.8 [85.8, 85.8] · n=1 |
| 3 | Claude Opus 4.7 → CadQuery | 85.0 [85.0, 85.0] · n=1 |
| 4 | Adam (CADcrush) | 83.5 [83.5, 83.5] · n=1 |
| 5 | Gemini 2.5 Pro → OpenSCAD | 78.8 [78.8, 78.8] · n=1 |
| 6 | GPT-5 → CadQuery | 77.3 [77.3, 77.3] · n=1 |
| 7 | Claude Opus 4.7 → OpenSCAD | 77.3 [77.3, 77.3] · n=1 |
| 8 | DeepCAD | 68.3 [68.3, 68.3] · n=1 |
| 9 | Trellis 3D | 65.8 [65.8, 65.8] · n=1 |
| 10 | Spline AI | 60.3 [60.3, 60.3] · n=1 |