CAD-Bench
AGENTS

20 CAD-generation systems under test

Human Baseline (Mech-E) Tier A
n=4 senior engineers · Onshape-2026-04
85.7θ 100
Human·BREP·open·p5 47

Four mechanical engineers (median 9 yrs CAD experience) modelled the same prompts in Onshape. Wall-clock time and tool cost ($/seat·hr) are recorded. Scores are inter-rater averaged.

Zoo Text-to-CAD Tier A
Zoo (KittyCAD) · 2.4
73.3θ 96
API·BREP·proprietary·p5 39

Native BREP generator. Outputs valid AP242 STEP. Trained on the Zoo internal corpus + filtered GrabCAD. Endpoint: text-to-cad.zoo.dev/api.

OpenAI o4 (reasoning) → CadQuery Tier A
OpenAI + CadQuery 2.4 · o4-2026-02 + cadquery 2.4
71.4θ 100
LLM+CadQuery·CadQuery·proprietary·p5 43

Reasoning model with private chain-of-thought. Self-repair is often unnecessary: single-shot pass rate is ~14 pp higher than GPT-5 on PARAM-* tasks, at the cost of 3-4× wall-clock latency.

Adam (CADcrush) Tier A
CADcrush · 1.1
66.2θ 85
API·BREP·proprietary·p5 0

Closed-beta natural-language modeller; emits a parametric Onshape FeatureScript export. Tested through a partner key (rate-limited to 60 req/h).
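A 60 req/h limit works out to one request per 60 s if calls are spaced evenly. The sketch below shows one way a test harness might pace itself under such a limit. It is an illustration under that even-spacing assumption, not the harness actually used; the `Throttle` class and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls so consecutive requests are at least one interval apart."""

    def __init__(
        self,
        max_per_hour: int = 60,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 3600.0 / max_per_hour  # 60 s at 60 req/h
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()  # re-read in case the sleep overshot
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `wait()` before each request keeps the client under the limit without needing to handle rejected calls; the injectable `clock` and `sleep` exist only to make the class testable without real delays.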

Claude Sonnet 4.6 → CadQuery Tier A
Anthropic + CadQuery 2.4 · sonnet-4.6 + cadquery 2.4
65.1θ 86
LLM+CadQuery·CadQuery·proprietary·p5 0

Same scaffold and self-repair budget as the Opus 4.7 pipeline. About 5× cheaper at the cost of ~6 IoU points on hard parametric tasks. Best Pareto candidate for high-throughput sweeps.
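"Pareto candidate" here means a system that no other system beats on both cost and score at once. A minimal sketch of that dominance check follows. The (cost, score) pairs are placeholders chosen for illustration, not benchmark results, and `pareto_front` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names that no other entry dominates.

    Each value is (cost, score); lower cost and higher score are better.
    An entry is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, (cost, score) in points.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, (c, s) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder (relative cost, score) pairs, NOT values from the leaderboard.
systems = {
    "large": (5.0, 70.0),
    "medium": (1.0, 64.0),
    "small": (0.3, 48.0),
    "weak_and_pricey": (2.0, 50.0),
}
print(pareto_front(systems))
```

With these placeholder numbers, `weak_and_pricey` drops out because `medium` is both cheaper and higher-scoring; the other three each represent a different cost/score trade-off.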

GPT-5 → CadQuery Tier B
OpenAI + CadQuery 2.4 · gpt-5 + cadquery 2.4
64.6θ 92
LLM+CadQuery·CadQuery·proprietary·p5 39

Same scaffold as the Claude pipeline for fair comparison. Self-repair budget capped at 3 attempts.

Claude Opus 4.7 → CadQuery Tier B
Anthropic + CadQuery 2.4 · opus-4.7 + cadquery 2.4
64.4θ 89
LLM+CadQuery·CadQuery·proprietary·p5 0

Few-shot scaffold (8 exemplars from the OCC tutorial set) with a self-repair loop of up to 3 rounds of OCC error feedback. Executes in a Vercel Sandbox per call.
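The self-repair loop can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `generate` and `execute` are hypothetical callables standing in for the LLM call and the sandboxed CadQuery run, and feeding the kernel error text back verbatim is an assumption.

```python
from typing import Callable, Optional, Tuple


def self_repair(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    execute: Callable[[str], Optional[str]],
    max_feedbacks: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (working_script_or_None, feedback_rounds_used).

    generate(prompt, previous_script, error) -> new script (hypothetical LLM call)
    execute(script) -> None on success, or the kernel error text on failure
    """
    script = generate(prompt, None, None)  # initial attempt, no feedback yet
    for used in range(max_feedbacks + 1):
        error = execute(script)
        if error is None:
            return script, used
        if used == max_feedbacks:
            break  # budget exhausted; do not request another repair
        script = generate(prompt, script, error)
    return None, max_feedbacks
```

Under this reading, a budget of 3 allows one initial attempt plus three repaired attempts; a pipeline that counts the first attempt against the budget would stop one round earlier.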

DeepSeek R1 (reasoning) → CadQuery Tier B
DeepSeek + CadQuery 2.4 · r1-distill-70b + cadquery 2.4
60.3θ 74
LLM+CadQuery·CadQuery·open·p5 0

Open-weight reasoning baseline. Self-hosted on a single H100; results below assume bf16 with vLLM. Best public open-weight model on PARAM tasks; lags closed-source systems on GD&T.

CAD-Coder R1 Tier B
CAD-Coder Labs (research) · r1-cad-7b + cadquery 2.4
60.3θ 64
LLM+CadQuery·CadQuery·research·p5 32

Specialty model fine-tuned on a 1.2 M synthetic CadQuery corpus from the ABC dataset. Punches well above its weight on primitives and brep_fidelity at 7B params; falls off on functional_intent (no FEA training signal).

Qwen3 Coder → CadQuery Tier B
Alibaba + CadQuery 2.4 · qwen3-coder-32b + cadquery 2.4
56.9θ 45
LLM+CadQuery·CadQuery·open·p5 32

Code-specialized open-weight baseline. Strong at translating prompts into syntactically clean CadQuery, weaker at engineering judgement (e.g. picking sensible draft angles).

Gemini 2.5 Flash → CadQuery Tier B
Google + CadQuery 2.4 · 2.5-flash + cadquery 2.4
56.2θ 53
LLM+CadQuery·CadQuery·proprietary·p5 0

Cheaper alternative to Gemini Pro on the CadQuery scaffold. Within 4 pp of the Pro variant on geometry but visibly worse on multi-feature parts, where attention budget matters.

Claude Opus 4.7 → OpenSCAD Tier B
Anthropic + OpenSCAD 2024.06 · opus-4.7 + openscad 2024.06
54.3θ 57
LLM+OpenSCAD·OpenSCAD·proprietary·p5 0

Same prompt template as the Gemini pipeline. Output is mesh-only.

Gemini 2.5 Pro → OpenSCAD Tier B
Google + OpenSCAD 2024.06 · 2.5-pro + openscad 2024.06
50.9θ 26
LLM+OpenSCAD·OpenSCAD·proprietary·p5 0

Mesh-only output (OpenSCAD does not produce BREP); STEP round-trip therefore disabled. CSG kernel: CGAL.

GPT-5 Mini → OpenSCAD Tier B
OpenAI + OpenSCAD 2024.06 · gpt-5-mini + openscad 2024.06
48.0θ 2
LLM+OpenSCAD·OpenSCAD·proprietary·p5 30

Targets the hobbyist envelope: cheap, mesh-only, OpenSCAD CSG. Surprisingly competent on primitives; collapses on standards-compliance and reverse-engineering.

Claude Haiku 4.5 → CadQuery Tier B
Anthropic + CadQuery 2.4 · haiku-4.5 + cadquery 2.4
47.8θ 2
LLM+CadQuery·CadQuery·proprietary·p5 0

Cost-floor entry. Still passes most L1 primitive tasks but degrades sharply on GD&T and standards. Useful as a 'can a small model do this at all?' canary.

DeepCAD Tier B
Wu et al. 2021 (research) · official checkpoint, retrained 2024-11
46.4θ 19
Diffusion-3D·BREP·research·p5 26

Transformer over CAD command sequences (extrude, revolve, sketch). Limited prompt vocabulary; we wrap it with a Claude-3.5-mini paraphraser that converts natural-language prompts into the in-distribution token grammar.

Llama 3.3 70B → OpenSCAD Tier C
Meta + OpenSCAD 2024.06 · 3.3-70b-instruct + openscad 2024.06
43.4θ 3
LLM+OpenSCAD·OpenSCAD·open·p5 0

Open-weight non-reasoning baseline. Treated as a sanity floor: an agent that scores below this is a regression for the field.

Trellis 3D Tier C
Microsoft Research · 1.0 (image-to-3D)
25.8θ 5
Diffusion-3D·Mesh·open·p5 0

Diffusion model over structured latents. Outputs a mesh only; STEP round-trip and BREP-fidelity tasks score 0 by definition.
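The "score 0 by definition" rule for mesh-only systems amounts to a gate applied before any task-specific scoring. A minimal sketch, assuming task categories are identified by string keys: `brep_fidelity` appears elsewhere on this page, while `step_round_trip` and the function name `gated_score` are hypothetical.

```python
from enum import Enum


class Representation(Enum):
    BREP = "brep"
    MESH = "mesh"


# Task categories that require a BREP output (key names partly assumed).
BREP_ONLY_TASKS = frozenset({"step_round_trip", "brep_fidelity"})


def gated_score(task: str, output: Representation, raw_score: float) -> float:
    """Return 0.0 for BREP-only tasks when the system emitted a mesh."""
    if task in BREP_ONLY_TASKS and output is Representation.MESH:
        return 0.0
    return raw_score
```

Gating on the output representation rather than on the system name means the same rule covers every mesh-only entry, including the OpenSCAD pipelines.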

Hunyuan3D-2 Tier D
Tencent · v2.1 (text+image conditioned)
23.1θ 0
Diffusion-3D·Mesh·open·p5 0

Diffusion 3D model, image- or text-conditioned. Output is a high-poly mesh; we run an automatic remesh + STL export. Excellent on freeform and aesthetic surfaces, near-zero on GD&T and standards.

Spline AI Tier D
Spline.design · 2.7
19.2θ 0
Diffusion-3D·Mesh·proprietary·p5 0

Aimed at game/UX assets, not engineering CAD. Included as a non-CAD baseline to quantify the gap.