Preprint · cad-bench/v0.5 · sweep 2026-04-12 · open · MIT
CAD·Bench v0.5
Design note 03 · v0.5 · 2026-05-09

From first principles: judging AI CAD agents at research-lab rigor

The v0.4 leaderboard was a single weighted mean over ten categories scored by automatic geometric metrics. That number is concise, but it conflates capabilities that need separate judging modalities, hides reliability, and is gameable. v0.5 rebuilds scoring on top of an explicit task-space taxonomy, a four-layer category hierarchy, a 2PL Item-Response-Theory ability θ, worst-case p5 reporting, a (capability, $/task) Pareto frontier, and three pre-baked use-case re-weightings. This page is the rationale.
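For readers unfamiliar with the 2PL model behind θ: each task gets a discrimination a and a difficulty b, and an agent with ability θ passes with probability 1 / (1 + e^(−a(θ−b))). Below is a minimal TypeScript sketch of that item-response function and a grid-search ability fit; the names (ItemParams, fitAbility) and the grid-search estimator are illustrative, not the benchmark's actual scorer.

```ts
// 2PL item-response model: P(pass | theta) = 1 / (1 + exp(-a * (theta - b))).
// Illustrative sketch only; the benchmark's actual estimator may differ.
interface ItemParams {
  a: number; // discrimination: how sharply the task separates abilities
  b: number; // difficulty: the theta at which pass probability is 0.5
}

const passProb = (theta: number, item: ItemParams): number =>
  1 / (1 + Math.exp(-item.a * (theta - item.b)));

// Crude maximum-likelihood theta over a grid (a real scorer would use Newton
// steps or expected-a-posteriori, but the likelihood is the same).
function fitAbility(outcomes: { item: ItemParams; passed: boolean }[]): number {
  let bestTheta = 0;
  let bestLogLik = -Infinity;
  for (let theta = -4; theta <= 4; theta += 0.01) {
    const logLik = outcomes.reduce((acc, { item, passed }) => {
      const p = passProb(theta, item);
      return acc + Math.log(passed ? p : 1 - p);
    }, 0);
    if (logLik > bestLogLik) {
      bestLogLik = logLik;
      bestTheta = theta;
    }
  }
  return bestTheta;
}
```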

1. What is "good at CAD" actually measuring?

An engineer making a hiring decision watches for five orthogonal capabilities, listed here in rough pipeline order:

  1. Intent → spec. Turning a prompt into a complete, unambiguous specification (dimensions, tolerances, datums, fits). Most evals skip this entirely and hand the agent an over-specified prompt.
  2. Spec → representation. Encoding that spec as a BREP or sketch graph other tools (CAM, FEA, downstream parametric edits) can consume. Mesh-only output passes "looks right" and fails everything past it.
  3. Engineering soundness. Implicit constraints: manufacturability for the chosen process, GD&T, standards compliance, fits, draft, stress-concentration awareness, no closed voids in DLP, etc.
  4. Editability. The model survives downstream parametric changes, paraphrase, prompt translation, partner-part substitution.
  5. Reliability. Low variance across seeds, calibrated confidence, graceful failure modes, paraphrase invariance. A 75 ± 25 agent is worse than a 70 ± 5 agent for production.

v0.4 mostly tested (3). v0.5 separates them.
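"Separates them" means the per-agent report becomes a vector over these five capabilities rather than a single scalar. A minimal sketch of what that record might look like; the field names are illustrative, not the repo's schema.

```ts
// Hypothetical per-agent report shape: one field per capability above.
// Not the repo's schema; shown only to make "separates them" concrete.
interface CapabilityReport {
  intentToSpec: number;         // 1. prompt -> complete, unambiguous spec
  specToRepresentation: number; // 2. spec -> consumable BREP / sketch graph
  engineeringSoundness: number; // 3. manufacturability, GD&T, standards
  editability: number;          // 4. survives parametric edits and paraphrase
  reliability: { mean: number; std: number }; // 5. a 70 ± 5 beats a 75 ± 25
}
```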

2. The actual space of CAD parts: 8 axes

The full Cartesian product of the eight axes is impractical to cover (~20k cells), but collapsing low-occupancy combinations yields roughly 80–120 truly distinct test classes; v0.5 covers about 60. The eight axes:

Axis · Levels
Topology class · genus-0 single shell, genus-N single shell, multi-shell, sheet body, hybrid
Representation · sketch+features (BREP), pure CSG, direct edit, NURBS class-A, SDF/lattice, mesh
Process regime · 3-ax CNC, 5-ax CNC, turning, sheet metal, injection, die cast, sand cast, FDM, SLA, SLS, DMLS, forging, stamping, layup, weldment
Functional class · structural, kinematic, sealing, thermal, fluid, optical, EM, containment, ergonomic, compliant, fastening
Scale · µm (MEMS), mm (electronics), cm-m (industrial), m+ (aerospace)
Precision regime · decorative ±0.5 mm, fit/clearance ±0.05 mm, aero ±0.01 mm, optical sub-µm
Spec source · NL only, drawing, photo, scan, functional req only, mating-context
Editability target · one-shot, parametric, configurable family, adaptive
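Written as a type, one cell of this task space looks roughly like the sketch below; the level names mirror the table, but the identifiers themselves are illustrative, not taken from the benchmark.

```ts
// Hypothetical encoding of one cell of the task space; level names mirror the
// table above but the identifiers themselves are illustrative.
type TopologyClass = "genus-0" | "genus-N" | "multi-shell" | "sheet-body" | "hybrid";
type Representation = "sketch+features" | "csg" | "direct-edit" | "nurbs-class-a" | "sdf-lattice" | "mesh";
type ProcessRegime =
  | "3-ax-cnc" | "5-ax-cnc" | "turning" | "sheet-metal" | "injection" | "die-cast" | "sand-cast"
  | "fdm" | "sla" | "sls" | "dmls" | "forging" | "stamping" | "layup" | "weldment";
type FunctionalClass =
  | "structural" | "kinematic" | "sealing" | "thermal" | "fluid" | "optical" | "em"
  | "containment" | "ergonomic" | "compliant" | "fastening";
type SpecSource = "nl-only" | "drawing" | "photo" | "scan" | "functional-req" | "mating-context";

interface TaskCell {
  topology: TopologyClass;
  representation: Representation;
  process: ProcessRegime;
  functionalClass: FunctionalClass;
  scale: "um" | "mm" | "cm-m" | "m+";
  precision: "decorative" | "fit-clearance" | "aero" | "optical";
  specSource: SpecSource;
  editability: "one-shot" | "parametric" | "configurable-family" | "adaptive";
}
```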

3. The four-layer category hierarchy

A flat list of categories is the wrong structure because each layer needs a different judging modality; mixing them obscures what an agent is actually good at.

Default layer weights (L1 0.20 · L2 0.35 · L3 0.25 · L4 0.20) reflect a "production mechanical engineer" prior; the leaderboard tabs let you re-weight on the fly to "design exploration" (L4 + L1 dominant) or "hobbyist" (L3-FDM + cost dominant).
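Mechanically, re-weighting is a dot product over the four layer scores, renormalised so an arbitrary weight vector still lands on the same 0–100 scale. A minimal sketch using the default production weights quoted above; the function and type names are illustrative.

```ts
// Composite = dot(layerScores, weights) / sum(weights); renormalising keeps any
// preset on the same 0-100 scale. Names are illustrative, not the repo's API.
type LayerScores = { L1: number; L2: number; L3: number; L4: number };
type LayerWeights = LayerScores;

// Default "production mechanical engineer" prior quoted above.
const PRODUCTION_WEIGHTS: LayerWeights = { L1: 0.20, L2: 0.35, L3: 0.25, L4: 0.20 };

function reweight(scores: LayerScores, w: LayerWeights): number {
  const total = w.L1 + w.L2 + w.L3 + w.L4;
  return (scores.L1 * w.L1 + scores.L2 * w.L2 + scores.L3 * w.L3 + scores.L4 * w.L4) / total;
}

// e.g. reweight({ L1: 82, L2: 61, L3: 70, L4: 55 }, PRODUCTION_WEIGHTS) === 66.25
```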

4. What's missing from the v0.4 metric set

v0.4 had 18 metrics, adequate for L1 and parts of L2. v0.5 adds ten more to cover the rest.

5. Composite scoring: the mean is misleading

A single weighted mean conceals three things you'd want to know: how reliably the agent works, how it does on hard items, and what it costs. v0.5 always shows three composites side by side, plus a Pareto chart.
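Under simple assumptions (capability as the mean of per-task scores, worst case as the empirical 5th percentile, cost as dollars per task), the three composites and Pareto membership reduce to a few lines. A sketch with illustrative names; the real scorer's category weighting and percentile method may differ.

```ts
// Illustrative sketch of the three side-by-side composites and Pareto membership.
// The real scorer weights per-category scores; a plain mean stands in here.
interface AgentRun {
  name: string;
  taskScores: number[];   // per-task scores, 0-100
  dollarsPerTask: number; // measured cost per task
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Empirical nearest-rank percentile; p = 5 gives the worst-case composite.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const idx = Math.max(0, Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1));
  return sorted[idx];
}

function composites(agent: AgentRun) {
  return {
    capability: mean(agent.taskScores),
    worstCase: percentile(agent.taskScores, 5),
    dollarsPerTask: agent.dollarsPerTask,
  };
}

// An agent sits on the (capability, $/task) frontier if no other agent is at
// least as capable AND at least as cheap, with a strict improvement in one.
function paretoFrontier(agents: AgentRun[]): AgentRun[] {
  const cap = (a: AgentRun) => mean(a.taskScores);
  return agents.filter((a) =>
    !agents.some(
      (b) =>
        b !== a &&
        cap(b) >= cap(a) &&
        b.dollarsPerTask <= a.dollarsPerTask &&
        (cap(b) > cap(a) || b.dollarsPerTask < a.dollarsPerTask)
    )
  );
}
```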

6. Capability tiers

Rank ordering is brittle when scores sit within each other's confidence intervals. v0.5 instead classifies agents into five tiers, each with explicit gates that must all be cleared:

Tier S · Production-Ready: ≥80 composite, ≥75 L2, ≥70 L3, ≥70 BREP fidelity, p5 ≥ 50
Tier A · Engineering-Capable: ≥65 composite, ≥60 L1, ≥55 L2, ≥40 BREP fidelity
Tier B · Engineering-Aided: ≥45 composite, ≥50 L1; useful as a starting point
Tier C · Conceptual: ≥25 composite; sketch-quality output
Tier D · Non-CAD Asset: below Tier C; generates 3D shapes, not for engineering

Tiers are floor gates, not score-based clusters. An agent with an 85 composite that flunks BREP fidelity (mesh-only output) does not reach Tier S; it falls to Tier B. This is a deliberate design choice: production users care about manufacturability gates, not capability at any cost.
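Because the gates are floors, tier assignment is "highest tier whose gates all clear", not a nearest-centroid clustering. A sketch assuming the thresholds listed above; the field names are illustrative.

```ts
// Floor-gate tier assignment: highest tier whose gates ALL clear.
// Thresholds are the ones listed above; field names are illustrative.
interface TierInputs {
  composite: number;
  l1: number;
  l2: number;
  l3: number;
  brepFidelity: number;
  p5: number;
}

const TIERS: { name: string; gate: (x: TierInputs) => boolean }[] = [
  { name: "S", gate: (x) => x.composite >= 80 && x.l2 >= 75 && x.l3 >= 70 && x.brepFidelity >= 70 && x.p5 >= 50 },
  { name: "A", gate: (x) => x.composite >= 65 && x.l1 >= 60 && x.l2 >= 55 && x.brepFidelity >= 40 },
  { name: "B", gate: (x) => x.composite >= 45 && x.l1 >= 50 },
  { name: "C", gate: (x) => x.composite >= 25 },
];

function assignTier(x: TierInputs): string {
  for (const tier of TIERS) if (tier.gate(x)) return tier.name;
  return "D"; // below every gate: a 3D asset, not an engineering part
}

// The example from the text: a high composite with mesh-only output (low BREP
// fidelity) fails the S and A gates and lands in B, assuming L1 clears 50.
// assignTier({ composite: 85, l1: 70, l2: 80, l3: 75, brepFidelity: 10, p5: 55 }) === "B"
```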

7. Three use-case views

Layer weights reshape the leaderboard for different consumers: same data, different weight vector.
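Concretely, a view is just a weight vector fed to the re-weighting sketch above. Only the production vector is pinned down by this note; the other two below are illustrative placeholders for "L4 + L1 dominant" and "L3 dominant", not the leaderboard's actual values.

```ts
// Preset weight vectors for the three tabs. Only "production" matches the
// numbers stated in this note; the other two are illustrative placeholders
// (cost shows up on the Pareto axis, not in the layer weights).
const VIEWS = {
  production:        { L1: 0.20, L2: 0.35, L3: 0.25, L4: 0.20 },
  designExploration: { L1: 0.30, L2: 0.15, L3: 0.10, L4: 0.45 }, // placeholder
  hobbyist:          { L1: 0.15, L2: 0.15, L3: 0.50, L4: 0.20 }, // placeholder
} as const;
```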

8. Better judging modalities

v0.4 relied purely on automatic geometric scoring, which doesn't reach L2-L4. v0.5 mixes eight modalities:

  1. Automatic geometric. ICP align, voxel IoU, Chamfer, Hausdorff, BREP topology checks. (L1)
  2. Automatic feature extraction + named-dim matching. OCC BRepFeat → expected-feature map. (L2)
  3. Automatic GD&T parser. Datum reference frame, position/parallelism/runout against datums. (L2)
  4. Automatic CAM postprocessor. FreeCAD-Path 3-ax simulation with Ø6 → Ø3 → Ø1 tools. (L3)
  5. Automatic FEA pipeline. CalculiX / Code_Aster linear-elastic at spec'd load, max von Mises gate. (L4)
  6. LLM-as-judge with rubric. Open-ended prompts where ground truth is a rubric, not a STEP file. Pairwise prompt → judge → Bradley-Terry; an aggregation sketch follows this list.
  7. Human pairwise ranking. Gold standard, expensive — used only for held-out validation.
  8. Real CAM / additive manufacturing → scan → compare. The most expensive option, reserved for headline tasks each year.
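Modality 6 ends in a Bradley-Terry fit over the pairwise judge outcomes. A minimal sketch of that aggregation step using the classic minorization-maximization update; the function name and the plain win-count matrix input are illustrative, not the repo's interface.

```ts
// Bradley-Terry strengths from pairwise judge outcomes via the classic
// minorization-maximization update: p_i <- wins_i / sum_j( n_ij / (p_i + p_j) ).
// Illustrative sketch of modality 6's aggregation step.
function bradleyTerry(wins: number[][], iters = 200): number[] {
  const n = wins.length; // wins[i][j] = times agent i beat agent j
  let p: number[] = new Array(n).fill(1 / n);
  for (let t = 0; t < iters; t++) {
    const next = p.slice();
    for (let i = 0; i < n; i++) {
      const winsI = wins[i].reduce((a, b) => a + b, 0);
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const nij = wins[i][j] + wins[j][i]; // comparisons between i and j
        if (nij > 0) denom += nij / (p[i] + p[j]);
      }
      next[i] = denom > 0 ? winsI / denom : p[i];
    }
    const total = next.reduce((a, b) => a + b, 0);
    p = next.map((x) => x / total); // renormalise each round
  }
  return p; // relative strengths; their order is the judged ranking
}
```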

9. Open problems

10. What v0.6 should add

  1. Multi-body assemblies (5–10 parts) with full mate graph and motion simulation.
  2. Live FEA-in-the-loop for L4 functional-intent tasks at production cadence (target < 90 s per task).
  3. Sheet-metal flat-pattern fidelity as a first-class L3 metric.
  4. Real-manufacture checkpoints. 12 headline tasks per year manufactured (CNC + SLA + injection-tool) and 3D-scanned for ground-truth comparison.
  5. Adversarial paraphrase generator that injects known-distractor terminology to test semantic robustness.
  6. Pairwise human study with N=20 mech-E judges via a custom CAD-diff UI; Bradley-Terry score reconciled against the automatic score.
  7. Calibrated abstention. Agents allowed to declare "out of scope"; abstentions counted as null (not zero) when the task is genuinely outside the agent's representation class.
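The null-versus-zero distinction in item 7 changes the denominator, not just the numerator: an abstention drops out of the mean, while a failed attempt drags it down. A minimal sketch of that assumed semantics; not a finalized rule.

```ts
// Assumed semantics for item 7: null = calibrated abstention (excluded from the
// mean), 0 = attempted and failed (counted). Not a finalized scoring rule.
function compositeWithAbstention(scores: (number | null)[]): number {
  const attempted = scores.filter((s): s is number => s !== null);
  return attempted.reduce((a, b) => a + b, 0) / attempted.length;
}

// [80, null, 60] -> 70 (abstention drops out); [80, 0, 60] -> ≈46.7 (failure counts)
```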

The full task list, formal metric definitions, and run protocol live in /methodology. The runner and reference scorer are in scripts/; adding a new agent is one TypeScript adapter.