Preprint · cad-bench/v0.5 · sweep 2026-04-12 · open · MIT
CAD·Bench v0.5
Report 02 · CAD-Bench Lab · May 2026

A research-grade benchmark for AI CAD agents.

A 28-task pilot subset of the 343-task suite, run across 10 agents at 5 seeds each. Scoring is layered (geometry, engineering, manufacturability, cognition) and reported with bootstrapped 95 % CIs, a worst-case 5th percentile (p5), and a 2PL IRT ability θ calibrated against task difficulty. Three use-case views re-weight the layers on the fly; a (capability, $/task) Pareto frontier is shown below.
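The bootstrapped interval and the p5 figure can be illustrated with a minimal sketch. It assumes a percentile bootstrap over per-run scores and a linear-interpolation percentile; the report does not state which bootstrap variant or resampling unit (runs, tasks, or seeds) is actually used, and the `runs` values below are made up for illustration only.

```python
import random
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(resample))
    return (percentile(means, 100 * alpha / 2),
            percentile(means, 100 * (1 - alpha / 2)))


# HYPOTHETICAL per-run scores (not benchmark data); one run failed outright.
runs = [72.0, 65.5, 80.1, 0.0, 91.3, 58.4, 77.7, 69.9, 83.2, 74.6]

mean = statistics.fmean(runs)
low, high = bootstrap_ci(runs)
p5 = percentile(runs, 5)
print(f"mean={mean:.1f}  95% CI=[{low:.1f}, {high:.1f}]  p5={p5:.1f}")
```

A single failed run barely moves the mean but pulls p5 far down, which is one reason a worst-case percentile is worth reporting next to the interval.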

§1 Leaderboard · use-case weighted

95 % CI · p5 worst-case · IRT 2PL θ
| # | Agent | Source / stack | Pareto | Composite | 95 % CI | p5 | Pass@1 |
|---|---|---|---|---|---|---|---|
| 1 | Human Baseline (Mech-E) | n=4 senior engineers | yes | 86.4 | [84.5, 88.8] | 81.5 | 39% |
| 2 | Zoo Text-to-CAD | Zoo (KittyCAD) | yes | 71.9 | [66.0, 76.1] | 39.4 | 4% |
| 3 | Claude Opus 4.7 → CadQuery | Anthropic + CadQuery 2.4 | | 70.8 | [68.9, 76.6] | 57.7 | 11% |
| 4 | Adam (CADcrush) | CADcrush | | 68.3 | [65.5, 73.1] | 56.1 | 4% |
| 5 | GPT-5 → CadQuery | OpenAI + CadQuery 2.4 | | 66.2 | [63.7, 70.6] | 52.5 | 0% |
| 6 | Gemini 2.5 Pro → OpenSCAD | Google + OpenSCAD 2024.06 | yes | 53.7 | [49.9, 59.8] | 33.3 | 0% |
| 7 | Claude Opus 4.7 → OpenSCAD | Anthropic + OpenSCAD 2024.06 | | 51.0 | [42.2, 59.7] | 0.0 | 0% |
| 8 | DeepCAD | Wu et al. 2021 (research) | yes | 39.0 | [34.0, 50.3] | 0.0 | 0% |
| 9 | Trellis 3D | Microsoft Research | | 22.3 | [18.1, 33.5] | 0.0 | 0% |
| 10 | Spline AI | Spline.design | | 14.8 | [10.7, 25.9] | 0.0 | 0% |
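For reference, the textbook two-parameter logistic (2PL) item response model behind an ability estimate θ is given below. The symbols are assumptions introduced here: θ_i is the ability of agent i, b_j the difficulty of task j, and a_j its discrimination. The report does not say how task scores are binarised or how the model is fitted, so this is the generic form rather than the lab's exact calibration.

```latex
P(X_{ij} = 1 \mid \theta_i, a_j, b_j)
  = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
```

When θ_i = b_j the success probability is exactly 0.5; a larger a_j makes the curve steeper around that point, so the task separates agents of nearby ability more sharply.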

§2 Pareto frontier

capability · $/task · production weighting
[Figure: scatter plot. x-axis: capability score (composite, 0–100). y-axis: cost per task ($, log scale, $0.01 to $10). Points: Zoo, Adam, Claude→CQ, GPT-5→CQ, Gemini→SCAD, Claude→SCAD, DeepCAD, Trellis, Spline, Human.]

Filled markers are on the (capability, $/task) Pareto frontier: every other agent is dominated, meaning some agent on the frontier scores at least as high at no greater cost. The frontier shifts as the use-case weighting changes; non-production weightings move different agents onto it.
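The dominance test described above can be sketched in a few lines. The agent names, scores, and costs here are hypothetical placeholders, since the per-agent $/task values appear only in the plot; `Agent`, `dominates`, and `pareto_frontier` are illustrative names, not part of the benchmark code.

```python
from typing import NamedTuple


class Agent(NamedTuple):
    name: str
    capability: float  # composite score, higher is better
    cost: float        # dollars per task, lower is better


def dominates(a: Agent, b: Agent) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.capability >= b.capability and a.cost <= b.cost
    strictly_better = a.capability > b.capability or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(agents):
    """Agents not dominated by any other agent, sorted by ascending cost."""
    return sorted(
        (a for a in agents if not any(dominates(o, a) for o in agents)),
        key=lambda a: a.cost,
    )


# HYPOTHETICAL points for illustration; not values from the report.
agents = [
    Agent("A", 85.0, 8.00),
    Agent("B", 70.0, 0.50),
    Agent("C", 68.0, 0.90),  # dominated by B: lower score, higher cost
    Agent("D", 50.0, 0.05),
    Agent("E", 40.0, 0.02),
    Agent("F", 20.0, 0.10),  # dominated by D: lower score, higher cost
]

frontier = pareto_frontier(agents)
print([a.name for a in frontier])
```

The pairwise check is quadratic in the number of agents, which is fine for ten entries; a sort-and-sweep would be the usual choice for larger sets.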

§3 Per-layer composite

L1·geom / L2·eng / L3·mfg / L4·cog
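| Agent | L1 Geom | L2 Eng | L3 Mfg | L4 Cog |
|---|---|---|---|---|
| Human Baseline (Mech-E) | 94.1 | 86.6 | 85.3 | 81.8 |
| Zoo Text-to-CAD | 75.9 | 72.1 | 72.3 | 66.8 |
| Claude Opus 4.7 → CadQuery | 87.2 | 68.5 | 70.2 | 70.1 |
| Adam (CADcrush) | 81.5 | 68.5 | 65.0 | 63.7 |
| GPT-5 → CadQuery | 75.2 | 64.6 | 66.9 | 65.5 |
| Gemini 2.5 Pro → OpenSCAD | 60.9 | 51.0 | 56.7 | 54.6 |
| Claude Opus 4.7 → OpenSCAD | 57.1 | 45.9 | 60.0 | 51.2 |
| DeepCAD | 66.9 | 36.6 | 34.3 | 38.5 |
| Trellis 3D | 48.8 | 17.4 | 22.9 | 23.1 |
| Spline AI | 34.9 | 13.2 | 11.0 | 14.3 |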
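How a use-case view can change the ranking is easiest to see by recomputing a weighted composite from the per-layer scores. The sketch below takes the layer scores for two agents from this section, but the view names and weight vectors are hypothetical: the report does not publish its weights, so these numbers will not reproduce the leaderboard composites.

```python
LAYERS = ("geom", "eng", "mfg", "cog")

# Per-layer scores (L1..L4) copied from the per-layer table, two agents only.
layer_scores = {
    "Zoo Text-to-CAD": (75.9, 72.1, 72.3, 66.8),
    "Claude Opus 4.7 → CadQuery": (87.2, 68.5, 70.2, 70.1),
}

# HYPOTHETICAL use-case views; the real weightings are not given in the report.
views = {
    "balanced": (0.25, 0.25, 0.25, 0.25),
    "mfg_heavy": (0.10, 0.20, 0.60, 0.10),
}


def composite(scores, weights):
    """Weighted mean of layer scores; weights are normalised to sum to 1."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


def rank(view):
    """Agents sorted by composite under the named view, best first."""
    weights = views[view]
    table = {name: composite(s, weights) for name, s in layer_scores.items()}
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)


for view in views:
    print(view, [(name, round(score, 1)) for name, score in rank(view)])
```

Under these assumed weights the two agents swap places between views: the geometry lead of the CadQuery pipeline carries the balanced view, while the manufacturing-heavy view favours Zoo.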

§4 Per-category matrix

20 categories across 4 layers · top-3 per column bolded
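L1 · GEOMETRY / L2 · ENGINEERING / L3 · MANUFACTURING / L4 · COGNITION

| Agent | Primitives | Bool. | BREP | Surf. | Parametric Mech | Mate | Standards | Sheet | Seals | Kinem. | CNC | Mould | FDM | CAM | Constraint Solving/Editability | RevEng | Sketch | Func/FEA | Paraphrase Robust | Calib |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human Baseline (Mech-E) | 95.3 | 94.5 | 94.2 | 92.5 | 87.5 | 84.9 | 92.2 | 83.2 | 87.0 | 84.2 | 85.6 | 81.5 | 86.3 | 89.6 | 84.0 | 82.2 | 87.6 | 82.9 | 97.8 | 44.6 |
| Claude Opus 4.7 → CadQuery | 89.7 | 87.5 | 90.7 | 80.6 | 72.3 | 58.7 | 69.5 | 70.1 | 73.2 | 73.1 | 71.0 | 64.3 | 75.7 | 72.2 | 79.8 | 70.8 | 84.8 | 62.6 | 85.0 | 39.5 |
| Zoo Text-to-CAD | 93.3 | 44.0 | 90.4 | 87.2 | 73.6 | 68.9 | 69.7 | 76.5 | 76.8 | 69.2 | 71.6 | 71.5 | 74.6 | 72.3 | 69.0 | 64.9 | 83.8 | 64.2 | 85.8 | 39.4 |
| Adam (CADcrush) | 89.5 | 75.3 | 89.3 | 74.0 | 72.9 | 59.2 | 68.4 | 70.8 | 73.2 | 69.0 | 65.7 | 61.0 | 72.3 | 62.6 | 72.6 | 60.3 | 81.2 | 57.0 | 83.5 | 37.1 |
| GPT-5 → CadQuery | 87.5 | 74.5 | 62.6 | 84.5 | 68.7 | 53.7 | 62.1 | 71.2 | 73.8 | 66.7 | 69.0 | 57.2 | 76.8 | 68.3 | 70.8 | 66.0 | 83.8 | 61.4 | 77.3 | 34.7 |
| Gemini 2.5 Pro → OpenSCAD | 84.2 | 60.6 | 33.3 | 82.6 | 55.2 | 39.9 | 47.9 | 58.5 | 62.0 | 51.8 | 57.5 | 50.7 | 71.3 | 49.8 | 55.1 | 55.4 | 72.5 | 47.6 | 78.8 | 34.9 |
| Claude Opus 4.7 → OpenSCAD | 0.0 | 87.9 | 34.6 | 84.2 | 38.9 | 43.9 | 51.6 | 60.0 | 63.8 | 56.9 | 63.2 | 54.4 | 71.3 | 52.5 | 29.3 | 58.0 | 76.4 | 51.5 | 77.3 | 39.5 |
| DeepCAD | 91.8 | 45.3 | 82.5 | 58.3 | 33.9 | 34.8 | 31.2 | 52.6 | 47.5 | 39.7 | 44.2 | 36.5 | 50.6 | 0.0 | 27.1 | 41.0 | 53.7 | 35.1 | 68.3 | 27.6 |
| Trellis 3D | 83.9 | 23.7 | 26.3 | 89.0 | 25.5 | 9.5 | 5.0 | 26.7 | 24.7 | 0.0 | 26.6 | 14.6 | 52.7 | 0.0 | 5.0 | 41.0 | 12.6 | 16.8 | 65.8 | 0.0 |
| Spline AI | 0.0 | 23.1 | 23.7 | 88.7 | 19.4 | 4.6 | 2.3 | 19.3 | 23.9 | 6.1 | 0.0 | 6.9 | 37.9 | 6.8 | 3.8 | 12.8 | 12.1 | 8.0 | 60.3 | 21.9 |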

Tasks evaluated: 28 / 343 (pilot subset · full v0.5 suite)
Categories: 20 (across 4 layers)
Agents: 10 (incl. n=4 human baseline)
Compute: 431.7 min (aggregate wall-clock)