Preprint · cad-bench/v0.5 · sweep 2026-04-12 · open · MIT
CAD·Bench v0.5

Paraphrase Robustness

Each task ships with N = 5 LLM-rephrased prompts that preserve every quantity in the spec. We compute the standard deviation of vol_iou across the paraphrase set; a lower value means the agent reads intent rather than surface form.
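As a minimal sketch of the per-task statistic described above, the snippet below computes the standard deviation of vol_iou over one paraphrase set. The function name `paraphrase_sigma`, the sample values, and the choice of the population form of the standard deviation are assumptions for illustration; the benchmark does not state which estimator it uses, nor how σ is mapped onto the 0-100 score.

```python
from statistics import pstdev


def paraphrase_sigma(vol_ious):
    """Std-dev of vol_iou across one task's paraphrase set.

    Assumption: population std-dev (pstdev). The benchmark text only
    says "std-dev", so the sample form (stdev) is equally plausible.
    """
    if len(vol_ious) < 2:
        raise ValueError("need at least two paraphrase scores")
    for v in vol_ious:
        # An intersection-over-union value always lies in [0, 1].
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"vol_iou out of range: {v}")
    return pstdev(vol_ious)


# Hypothetical vol_iou values for one task, N = 5 paraphrases per agent.
stable_agent = [0.91, 0.90, 0.92, 0.91, 0.90]
brittle_agent = [0.95, 0.60, 0.88, 0.40, 0.93]

sigma_stable = paraphrase_sigma(stable_agent)
sigma_brittle = paraphrase_sigma(brittle_agent)

print(f"stable agent:  sigma = {sigma_stable:.3f}")
print(f"brittle agent: sigma = {sigma_brittle:.3f}")
```

The brittle agent reaches a higher vol_iou than the stable one on some paraphrases, yet its spread is much larger. That sensitivity to wording is what this category is meant to penalize.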

Paraphrase IoU σ · ratio · Seed σ · ratio

RANKED AGENTS · 95 % CI

#   Agent                          Score   95 % CI         n
1   Human Baseline (Mech-E)        97.8    [97.8, 97.8]    1
2   Zoo Text-to-CAD                85.8    [85.8, 85.8]    1
3   Claude Opus 4.7 → CadQuery     85.0    [85.0, 85.0]    1
4   Adam (CADcrush)                83.5    [83.5, 83.5]    1
5   Gemini 2.5 Pro → OpenSCAD      78.8    [78.8, 78.8]    1
6   GPT-5 → CadQuery               77.3    [77.3, 77.3]    1
7   Claude Opus 4.7 → OpenSCAD     77.3    [77.3, 77.3]    1
8   DeepCAD                        68.3    [68.3, 68.3]    1
9   Trellis 3D                     65.8    [65.8, 65.8]    1
10  Spline AI                      60.3    [60.3, 60.3]    1

TASKS IN THIS CATEGORY