Report 02 · CAD-Bench Lab · May 2026
A research-grade benchmark for AI CAD agents.
A 28-task pilot subset of the 343-task suite, run across 10 agents at 5 seeds each. Scoring is layered (geometry, engineering, manufacturability, cognition) and reported with bootstrapped 95 % CIs, a worst-case 5th-percentile score (p5), and a 2PL IRT ability θ calibrated against task difficulty. Three use-case views re-weight the layers on the fly; a (capability, $/task) Pareto frontier is shown below.
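For reference, the standard two-parameter logistic (2PL) item response model is written out here. The report does not say how a continuous task score is turned into a pass/fail response or how the fit is performed, so the symbols below are the textbook ones and their mapping onto this benchmark is an assumption: θ_i is the ability of agent i, b_j the difficulty of task j, a_j the discrimination of task j, and X_ij = 1 when agent i passes task j.

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
```

Under this model an agent whose θ equals a task's difficulty b_j passes that task with probability 0.5, and a larger a_j makes the pass probability change more steeply around that point.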
§1 Leaderboard
use-case weighted
95 % CI · p5 worst-case · IRT 2PL θ
| # | Agent | Details | Pareto | Composite [95 % CI] · p5 | Pass@1 |
|---|---|---|---|---|---|
| 1 | Human Baseline (Mech-E) | n=4 senior engineers | PARETO | 86.4 [84.5, 88.8] · p5=81.5 | 39% |
| 2 | Zoo Text-to-CAD | Zoo (KittyCAD) | PARETO | 71.9 [66.0, 76.1] · p5=39.4 | 4% |
| 3 | Claude Opus 4.7 → CadQuery | Anthropic + CadQuery 2.4 | | 70.8 [68.9, 76.6] · p5=57.7 | 11% |
| 4 | Adam (CADcrush) | CADcrush | | 68.3 [65.5, 73.1] · p5=56.1 | 4% |
| 5 | GPT-5 → CadQuery | OpenAI + CadQuery 2.4 | | 66.2 [63.7, 70.6] · p5=52.5 | 0% |
| 6 | Gemini 2.5 Pro → OpenSCAD | Google + OpenSCAD 2024.06 | PARETO | 53.7 [49.9, 59.8] · p5=33.3 | 0% |
| 7 | Claude Opus 4.7 → OpenSCAD | Anthropic + OpenSCAD 2024.06 | | 51.0 [42.2, 59.7] · p5=0.0 | 0% |
| 8 | DeepCAD | Wu et al. 2021 (research) | PARETO | 39.0 [34.0, 50.3] · p5=0.0 | 0% |
| 9 | Trellis 3D | Microsoft Research | | 22.3 [18.1, 33.5] · p5=0.0 | 0% |
| 10 | Spline AI | Spline.design | | 14.8 [10.7, 25.9] · p5=0.0 | 0% |
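The interval and p5 columns above could be produced along the following lines. This is a minimal sketch, not the lab's pipeline: it assumes a percentile bootstrap over one agent's per-run scores (task × seed), and the names `percentile` and `summarize` are hypothetical. The report does not state which unit is resampled or how many resamples are drawn.

```python
import random
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize(scores, n_boot=2000, seed=0):
    """Mean, percentile-bootstrap 95 % CI of the mean, and 5th-percentile score.

    `scores` is assumed to hold one agent's per-run scores; resampling runs
    with replacement is an assumption, not the report's stated method.
    """
    rng = random.Random(seed)
    n = len(scores)
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    return {
        "mean": statistics.fmean(scores),
        "ci95": (percentile(boot_means, 2.5), percentile(boot_means, 97.5)),
        "p5": percentile(sorted(scores), 5),
    }


if __name__ == "__main__":
    # Made-up scores for illustration only; not taken from the benchmark.
    demo = [72.0, 68.5, 0.0, 81.2, 77.4, 64.9, 70.3, 59.8, 83.1, 75.0]
    print(summarize(demo))
```

A single failed run (score 0.0) barely moves the mean but pulls p5 down sharply, which is consistent with agents in the table that combine a mid-range composite with p5=0.0.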
§2 Pareto frontier
capability · $/task · production weighting
Filled markers lie on the (capability, $/task) Pareto frontier: every other agent is dominated, meaning some frontier agent scores at least as high on capability at no greater cost per task and is strictly better on at least one of the two. The frontier shifts as the use-case weighting changes; non-production weightings move different agents onto it.
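The dominance test behind such a frontier can be sketched as follows. The `Agent` tuple, the function names, and the numbers are all hypothetical; the report gives no $/task figures in this section, so the example uses placeholder agents rather than the ones on the leaderboard.

```python
from typing import NamedTuple


class Agent(NamedTuple):
    name: str
    capability: float  # composite score, higher is better
    cost: float        # dollars per task, lower is better


def dominates(a: Agent, b: Agent) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.capability >= b.capability and a.cost <= b.cost
    strictly_better = a.capability > b.capability or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(agents: list[Agent]) -> list[Agent]:
    """Return the agents that no other agent dominates, in input order."""
    return [a for a in agents if not any(dominates(b, a) for b in agents)]


if __name__ == "__main__":
    # Placeholder data for illustration only.
    pool = [
        Agent("A", 80.0, 5.00),
        Agent("B", 70.0, 1.00),
        Agent("C", 60.0, 2.00),  # dominated by B: lower capability, higher cost
        Agent("D", 80.0, 6.00),  # dominated by A: same capability, higher cost
    ]
    print([a.name for a in pareto_frontier(pool)])
```

Because capability is the use-case weighted composite, recomputing it under a different weighting and rerunning `pareto_frontier` is enough to see the frontier membership change.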
§3 Per-layer composite
L1·geom / L2·eng / L3·mfg / L4·cog
| Agent | L1 Geom | L2 Eng | L3 Mfg | L4 Cog |
|---|---|---|---|---|
| Human Baseline (Mech-E) | 94.1 | 86.6 | 85.3 | 81.8 |
| Zoo Text-to-CAD | 75.9 | 72.1 | 72.3 | 66.8 |
| Claude Opus 4.7 → CadQuery | 87.2 | 68.5 | 70.2 | 70.1 |
| Adam (CADcrush) | 81.5 | 68.5 | 65.0 | 63.7 |
| GPT-5 → CadQuery | 75.2 | 64.6 | 66.9 | 65.5 |
| Gemini 2.5 Pro → OpenSCAD | 60.9 | 51.0 | 56.7 | 54.6 |
| Claude Opus 4.7 → OpenSCAD | 57.1 | 45.9 | 60.0 | 51.2 |
| DeepCAD | 66.9 | 36.6 | 34.3 | 38.5 |
| Trellis 3D | 48.8 | 17.4 | 22.9 | 23.1 |
| Spline AI | 34.9 | 13.2 | 11.0 | 14.3 |
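The use-case re-weighting can be illustrated with the layer scores in the table above. The sketch below assumes the composite is a normalized weighted mean of the four layer scores; the weight vectors and the `composite` helper are hypothetical, since the report does not publish its weights and may aggregate at task level instead, so the results are not expected to reproduce the leaderboard values.

```python
LAYERS = ("geom", "eng", "mfg", "cog")


def composite(layer_scores, weights):
    """Weighted mean of layer scores; weights are normalized to sum to 1."""
    total = sum(weights[k] for k in LAYERS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(layer_scores[k] * weights[k] for k in LAYERS) / total


# Layer scores copied from the per-layer table.
claude_cadquery = {"geom": 87.2, "eng": 68.5, "mfg": 70.2, "cog": 70.1}
zoo = {"geom": 75.9, "eng": 72.1, "mfg": 72.3, "cog": 66.8}

# Hypothetical use-case weightings, not the report's.
equal = {"geom": 1, "eng": 1, "mfg": 1, "cog": 1}
mfg_heavy = {"geom": 0.1, "eng": 0.3, "mfg": 0.5, "cog": 0.1}

if __name__ == "__main__":
    for label, w in (("equal", equal), ("mfg_heavy", mfg_heavy)):
        print(label, round(composite(claude_cadquery, w), 2), round(composite(zoo, w), 2))
```

With equal weights the CadQuery agent's strong geometry layer puts it ahead; under the manufacturing-heavy vector the order of the two agents reverses, which is the kind of rank change a use-case view can produce.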
§4 Per-category matrix
20 categories across 4 layers · top-3 per column bolded
Column groups, left to right: L1 · GEOMETRY, L2 · ENGINEERING, L3 · MANUFACTURING, L4 · COGNITION
| Agent | Primitives | Bool. | BREP | Surf. | Parametric Mech | Mate | Standards | Sheet | Seals | Kinem. | CNC | Mould | FDM | CAM | Constraint Solving/Editability | RevEng | Sketch | Func/FEA | Paraphrase Robust | Calib |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human Baseline (Mech-E) | 95.3 | 94.5 | 94.2 | 92.5 | 87.5 | 84.9 | 92.2 | 83.2 | 87.0 | 84.2 | 85.6 | 81.5 | 86.3 | 89.6 | 84.0 | 82.2 | 87.6 | 82.9 | 97.8 | 44.6 |
| Claude Opus 4.7 → CadQuery | 89.7 | 87.5 | 90.7 | 80.6 | 72.3 | 58.7 | 69.5 | 70.1 | 73.2 | 73.1 | 71.0 | 64.3 | 75.7 | 72.2 | 79.8 | 70.8 | 84.8 | 62.6 | 85.0 | 39.5 |
| Zoo Text-to-CAD | 93.3 | 44.0 | 90.4 | 87.2 | 73.6 | 68.9 | 69.7 | 76.5 | 76.8 | 69.2 | 71.6 | 71.5 | 74.6 | 72.3 | 69.0 | 64.9 | 83.8 | 64.2 | 85.8 | 39.4 |
| Adam (CADcrush) | 89.5 | 75.3 | 89.3 | 74.0 | 72.9 | 59.2 | 68.4 | 70.8 | 73.2 | 69.0 | 65.7 | 61.0 | 72.3 | 62.6 | 72.6 | 60.3 | 81.2 | 57.0 | 83.5 | 37.1 |
| GPT-5 → CadQuery | 87.5 | 74.5 | 62.6 | 84.5 | 68.7 | 53.7 | 62.1 | 71.2 | 73.8 | 66.7 | 69.0 | 57.2 | 76.8 | 68.3 | 70.8 | 66.0 | 83.8 | 61.4 | 77.3 | 34.7 |
| Gemini 2.5 Pro → OpenSCAD | 84.2 | 60.6 | 33.3 | 82.6 | 55.2 | 39.9 | 47.9 | 58.5 | 62.0 | 51.8 | 57.5 | 50.7 | 71.3 | 49.8 | 55.1 | 55.4 | 72.5 | 47.6 | 78.8 | 34.9 |
| Claude Opus 4.7 → OpenSCAD | 0.0 | 87.9 | 34.6 | 84.2 | 38.9 | 43.9 | 51.6 | 60.0 | 63.8 | 56.9 | 63.2 | 54.4 | 71.3 | 52.5 | 29.3 | 58.0 | 76.4 | 51.5 | 77.3 | 39.5 |
| DeepCAD | 91.8 | 45.3 | 82.5 | 58.3 | 33.9 | 34.8 | 31.2 | 52.6 | 47.5 | 39.7 | 44.2 | 36.5 | 50.6 | 0.0 | 27.1 | 41.0 | 53.7 | 35.1 | 68.3 | 27.6 |
| Trellis 3D | 83.9 | 23.7 | 26.3 | 89.0 | 25.5 | 9.5 | 5.0 | 26.7 | 24.7 | 0.0 | 26.6 | 14.6 | 52.7 | 0.0 | 5.0 | 41.0 | 12.6 | 16.8 | 65.8 | 0.0 |
| Spline AI | 0.0 | 23.1 | 23.7 | 88.7 | 19.4 | 4.6 | 2.3 | 19.3 | 23.9 | 6.1 | 0.0 | 6.9 | 37.9 | 6.8 | 3.8 | 12.8 | 12.1 | 8.0 | 60.3 | 21.9 |
Tasks evaluated: 28 / 343 (pilot subset · full v0.5 suite)
Categories: 20 (across 4 layers)
Agents: 10 (incl. n=4 human baseline)
Compute: 431.7 min (aggregate wall-clock)