From first principles: judging AI CAD agents at research-lab rigor
The v0.4 leaderboard was a single weighted mean over ten categories scored by automatic geometric metrics. That number is concise, but it conflates capabilities that need separate judging modalities, hides reliability, and is gameable. v0.5 rebuilds scoring on top of an explicit task-space taxonomy, a four-layer category hierarchy, a 2PL Item-Response-Theory ability θ, worst-case p5 reporting, a (capability, $/task) Pareto frontier, and three pre-baked use-case re-weightings. This page is the rationale.
1. What is "good at CAD" actually measuring?
An engineer making a hiring decision watches for five orthogonal capabilities, ordered roughly by where they sit in the workflow:
1. Intent → spec. Turning a prompt into a complete, unambiguous specification (dimensions, tolerances, datums, fits). Most evals skip this entirely and hand the agent an over-specified prompt.
2. Spec → representation. Encoding that spec as a BREP or sketch graph that other tools (CAM, FEA, downstream parametric edits) can consume. Mesh-only output passes "looks right" and fails everything past it.
3. Engineering soundness. Implicit constraints: manufacturability for the chosen process, GD&T, standards compliance, fits, draft, stress-concentration awareness, no closed voids in DLP, etc.
4. Editability. The model survives downstream parametric changes, paraphrase, prompt translation, and partner-part substitution.
5. Reliability. Low variance across seeds, calibrated confidence, graceful failure modes, paraphrase invariance. A 75 ± 25 agent is worse than a 70 ± 5 agent for production.
v0.4 mostly tested (3). v0.5 separates them.
2. The actual space of CAD parts: 8 axes
The full Cartesian product of the eight axes is intractable (~20k cells), but collapsing low-occupancy combinations yields roughly 80–120 truly distinct test classes. v0.5 covers about 60. The eight axes (a schema sketch follows the table):
| Axis | Levels |
|---|---|
| Topology class | genus-0 single shell, genus-N single shell, multi-shell, sheet body, hybrid |
| Representation | sketch+features (BREP), pure CSG, direct edit, NURBS class-A, SDF/lattice, mesh |
| Process regime | 3-ax CNC, 5-ax CNC, turning, sheet metal, injection, die cast, sand cast, FDM, SLA, SLS, DMLS, forging, stamping, layup, weldment |
| Functional class | structural, kinematic, sealing, thermal, fluid, optical, EM, containment, ergonomic, compliant, fastening |
| Scale | µm (MEMS), mm (electronics), cm-m (industrial), m+ (aerospace) |
| Precision regime | decorative ±0.5 mm, fit/clearance ±0.05 mm, aero ±0.01 mm, optical sub-µm |
| Spec source | NL only, drawing, photo, scan, functional req only, mating-context |
| Editability target | one-shot, parametric, configurable family, adaptive |
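
One way to read the table is as a schema: every collapsed test class is a point in this eight-axis space. A minimal TypeScript sketch of such a record, assuming nothing about the suite's real schema (type names and enum values are illustrative, and the process list is abbreviated):

```typescript
// Illustrative sketch only: field and level names mirror the table above,
// but this is not the benchmark's actual task schema.
type Topology = "genus0-shell" | "genusN-shell" | "multi-shell" | "sheet-body" | "hybrid";
type Representation = "sketch-features" | "csg" | "direct-edit" | "nurbs-class-a" | "sdf-lattice" | "mesh";
type Process = "cnc-3ax" | "cnc-5ax" | "turning" | "sheet-metal" | "injection" | "fdm" | "sla" | "sls"; // abbreviated
type SpecSource = "nl-only" | "drawing" | "photo" | "scan" | "functional-only" | "mating-context";

interface TaskClass {
  topology: Topology;
  representation: Representation;
  process: Process;
  functionalClass: string;                       // structural, kinematic, sealing, ...
  scale: "um" | "mm" | "cm-m" | "m-plus";
  precision: "decorative" | "fit" | "aero" | "optical";
  specSource: SpecSource;
  editability: "one-shot" | "parametric" | "family" | "adaptive";
}
```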
3. The four-layer category hierarchy
A flat list of categories is wrong because each layer has a different correct judging modality. Mixing them obscures what an agent is actually good at:
- L1 Geometry — judged by automatic geometric metrics (IoU, Chamfer, BREP topology). Cheap, deterministic, no domain reasoning.
- L2 Engineering — judged by named-dimension matching, GD&T parsers (ASME Y14.5-2018), partner-part assembly simulation, ISO/ANSI fit tolerance enforcement, standards-derived feature checks (ISO 4762, DIN 471, AS568, ANSI B5.50). Requires CAD-kernel reasoning about engineering intent.
- L3 Manufacturing — judged by running an actual CAM postprocessor (FreeCAD-Path) or DFM analyzer against the chosen process: detect undercuts, check tool reach, compute draft on each face, simulate FDM overhangs, extract parting line.
- L4 Cognition / Robustness — paraphrase consistency, parametric range survival, FEA-pass under spec'd loads, calibration of self-reported confidence, LLM-as-judge with rubric where ground truth is open-ended.
Default layer weights (L1 0.20 · L2 0.35 · L3 0.25 · L4 0.20) reflect a "production mechanical engineer" prior; the leaderboard tabs let you re-weight on the fly to "design exploration" (L4 + L1 dominant) or "hobbyist" (L3-FDM + cost dominant).
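
As a concrete illustration of the mechanism, a layer-weighted composite and an on-the-fly re-weighting fit in a few lines. This sketch assumes layer scores are already on a 0-100 scale; the names are illustrative, not the leaderboard's actual code:

```typescript
type LayerScores = { L1: number; L2: number; L3: number; L4: number }; // each 0-100
type LayerWeights = LayerScores;                                        // should sum to 1.0

// Default "production mechanical engineer" prior from the text above.
const DEFAULT_WEIGHTS: LayerWeights = { L1: 0.20, L2: 0.35, L3: 0.25, L4: 0.20 };

// Weighted mean over the four layers; switching the use-case view is just
// passing a different weight vector.
function composite(s: LayerScores, w: LayerWeights = DEFAULT_WEIGHTS): number {
  return w.L1 * s.L1 + w.L2 * s.L2 + w.L3 * s.L3 + w.L4 * s.L4;
}
```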
4. What's missing from the v0.4 metric set
v0.4 had 18 metrics — adequate for L1 and partial L2. v0.5 adds ten more to cover the rest:
- Named-dim RMSE — RMS error on the *labeled* dimensions, not bbox. Catches a part that hits bbox via wrong feature placement.
- GD&T compliance — fraction of position / parallelism / runout callouts satisfied via OCC + custom GD&T parser.
- CAM reachability — does a 3-axis FreeCAD-Path postprocessor produce a collision-free toolpath at 0.05 mm finish stepover?
- FEA-yield pass — automated mesh, run linear-elastic at the spec'd load, max von Mises < 0.8 σ_y.
- Parametric range integrity — over the declared parameter range sampled at N=20 points, what fraction preserve topology?
- Paraphrase IoU σ — std-dev of vol_iou across N=5 prompt paraphrases. Tests whether the agent reads intent or surface form.
- Seed σ — same prompt, k=5 seeds, std-dev of vol_iou.
- Confidence calibration (Brier) — Brier score on the agent's pre-generation self-assessed confidence; sketched in code, along with seed σ, after this list.
- Edit-vs-fresh latency ratio — is a parametric edit actually faster than regenerating the part from scratch? A ratio < 1 indicates genuine parametric editing rather than silent regeneration.
- Fit-class compliance — for ISO/ANSI fits (e.g. H7/g6), fraction of mating dimensions in the prescribed shaft/hole tolerance band.
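
Two of the reliability metrics are small enough to sketch directly. This assumes per-run vol_iou values and binary pass outcomes are already collected; the function names are illustrative:

```typescript
// Seed sigma: sample standard deviation of vol_iou across k runs of the
// same prompt with different seeds (k=5 in the suite).
function seedSigma(volIous: number[]): number {
  const n = volIous.length;
  const mean = volIous.reduce((a, b) => a + b, 0) / n;
  const variance = volIous.reduce((a, v) => a + (v - mean) ** 2, 0) / (n - 1);
  return Math.sqrt(variance);
}

// Brier score: mean squared gap between self-reported confidence (0-1) and
// the binary pass outcome. Lower is better; a constant 0.5 guess scores 0.25.
function brier(confidence: number[], passed: boolean[]): number {
  return (
    confidence.reduce((a, c, i) => a + (c - (passed[i] ? 1 : 0)) ** 2, 0) /
    confidence.length
  );
}
```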
5. Composite scoring: the mean is misleading
A single weighted mean conceals three things you'd want to know: how reliably the agent works, how it does on hard items, and what it costs. v0.5 always shows three composites side-by-side, plus a Pareto chart:
- Mean composite. The default — easy to read, easy to game by being good at easy categories.
- 2PL IRT ability θ. Item Response Theory fit jointly over agents and tasks: P(pass | θ_a, β_t, α_t) = σ(α_t (θ_a − β_t)). Hard tasks (high β_t) carry more weight, noisy tasks (low α_t) carry less, and the score can't be gamed by cherry-picking easy categories. Normalized to 0–100 across the agent set; a code sketch follows this list.
- Worst-case p5. 5th-percentile score across tasks. Reliability metric; for production users, the bad-day score matters more than the average.
- Pareto frontier in (capability, $/task). Many users want "best per dollar". The frontier rotates with use-case weighting so different agents become relevant.
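
A minimal sketch of the two non-mean composites, using the 2PL formula above. The real scorer fits θ, β, α jointly over all agents and tasks; this shows only the item response function, a per-agent log-likelihood, and a nearest-rank p5 (names are illustrative):

```typescript
// 2PL item response function from the formula above: probability that an
// agent of ability theta passes a task of difficulty beta and discrimination alpha.
function passProbability(theta: number, beta: number, alpha: number): number {
  return 1 / (1 + Math.exp(-alpha * (theta - beta)));
}

// Log-likelihood of one agent's pass/fail record; the objective a joint fit
// would maximize over theta (and each task's alpha, beta).
function logLikelihood(
  theta: number,
  tasks: { alpha: number; beta: number; passed: boolean }[],
): number {
  return tasks.reduce((ll, t) => {
    const p = passProbability(theta, t.beta, t.alpha);
    return ll + Math.log(t.passed ? p : 1 - p);
  }, 0);
}

// Worst-case p5: 5th-percentile task score, nearest-rank convention.
function p5(scores: number[]): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil(0.05 * sorted.length) - 1);
  return sorted[idx];
}
```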
6. Capability tiers
Rank ordering is brittle when scores sit within each other's confidence intervals. v0.5 instead classifies agents into five tiers with explicit gates that must all be cleared:
- Tier S: ≥80 composite, ≥75 L2, ≥70 L3, ≥70 BREP fidelity, p5 ≥ 50
- Tier A: ≥65 composite, ≥60 L1, ≥55 L2, ≥40 BREP fidelity
- Tier B: ≥45 composite, ≥50 L1; useful as a starting point
- Tier C: ≥25 composite; sketch-quality output
- Below C: generates 3D shapes, not for engineering
Tiers are floor gates, not score-based clusters. An agent with 85 composite that flunks BREP fidelity (mesh-only) does not reach tier S — it falls to tier B. This is a deliberate design choice: production users care about manufacturability gates, not any-cost capability.
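
The floor-gate semantics translate directly into code: an agent lands in the highest tier whose gates it clears in full, and nothing else about its score matters. A sketch using the thresholds from the list above (field names are illustrative):

```typescript
interface AgentRow {
  composite: number;
  L1: number;
  L2: number;
  L3: number;
  brepFidelity: number;
  p5: number;
}

// Ordered from strictest to loosest; every predicate in a gate must hold.
const TIERS: { name: string; gate: (a: AgentRow) => boolean }[] = [
  { name: "S", gate: a => a.composite >= 80 && a.L2 >= 75 && a.L3 >= 70 && a.brepFidelity >= 70 && a.p5 >= 50 },
  { name: "A", gate: a => a.composite >= 65 && a.L1 >= 60 && a.L2 >= 55 && a.brepFidelity >= 40 },
  { name: "B", gate: a => a.composite >= 45 && a.L1 >= 50 },
  { name: "C", gate: a => a.composite >= 25 },
];

// Highest tier whose floors are all cleared; everything else is below C.
function classify(a: AgentRow): string {
  return TIERS.find(t => t.gate(a))?.name ?? "below C";
}

// An 85-composite, mesh-only agent (brepFidelity near 0) fails the S and A
// gates and, assuming its L1 score clears 50, lands in B, as described above.
```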
7. Three use-case views
Layer weights reshape the leaderboard for different consumers. Same data, different weight vector (a short demonstration follows the list):
- Production engineering — L1 0.10 · L2 0.45 · L3 0.35 · L4 0.10. GD&T, mating, manufacturability dominate.
- Design exploration — L1 0.20 · L2 0.20 · L3 0.10 · L4 0.50. Reverse-eng, paraphrase robustness, parametric edit, FEA-gated function dominate.
- Hobbyist / maker — L1 0.25 · L2 0.15 · L3 0.40 · L4 0.20. FDM DFM, cost, simple parts dominate.
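
Continuing the composite() sketch from section 3 (redeclared here so the snippet stands alone), the three views are just different weight vectors. The agent scores below are made up purely to show that a ranking can flip between views:

```typescript
type Layers = { L1: number; L2: number; L3: number; L4: number };

// Same weighted mean as the section-3 sketch, redeclared for self-containment.
const composite = (s: Layers, w: Layers) =>
  w.L1 * s.L1 + w.L2 * s.L2 + w.L3 * s.L3 + w.L4 * s.L4;

const PROFILES: Record<string, Layers> = {
  production:  { L1: 0.10, L2: 0.45, L3: 0.35, L4: 0.10 },
  exploration: { L1: 0.20, L2: 0.20, L3: 0.10, L4: 0.50 },
  hobbyist:    { L1: 0.25, L2: 0.15, L3: 0.40, L4: 0.20 },
};

// Hypothetical layer scores, invented only for this illustration.
const agentA: Layers = { L1: 90, L2: 80, L3: 60, L4: 55 };
const agentB: Layers = { L1: 70, L2: 65, L3: 75, L4: 80 };

for (const [view, w] of Object.entries(PROFILES)) {
  console.log(view, composite(agentA, w).toFixed(1), composite(agentB, w).toFixed(1));
}
// production ranks agentA ahead; exploration and hobbyist rank agentB ahead.
```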
8. Better judging modalities
v0.4 was purely automatic geometric scoring. That doesn't reach L2-L4. v0.5 mixes:
- Automatic geometric. ICP align, voxel IoU, Chamfer, Hausdorff, BREP topology checks. (L1)
- Automatic feature extraction + named-dim matching. OCC BRepFeat → expected-feature map. (L2)
- Automatic GD&T parser. Datum reference frame, position/parallelism/runout against datums. (L2)
- Automatic CAM postprocessor. FreeCAD-Path 3-ax simulation with Ø6 → Ø3 → Ø1 tools. (L3)
- Automatic FEA pipeline. CalculiX / Code_Aster linear-elastic at spec'd load, max von Mises gate. (L4)
- LLM-as-judge with rubric. Open-ended prompts where ground truth is a rubric, not a STEP file. Pairwise: prompt → judge → Bradley-Terry (a fitting sketch follows this list).
- Human pairwise ranking. Gold standard, expensive — used only for held-out validation.
- Real CAM / additive manufacturing → scan → compare. The most expensive option, reserved for headline tasks each year.
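
For the pairwise judging paths (LLM-as-judge and the human study), the Bradley-Terry fit can use the standard minorization-maximization update. A hedged sketch, assuming a wins matrix has already been tallied from the pairwise verdicts:

```typescript
// wins[i][j] = number of pairwise judgments in which agent i beat agent j.
// Returns relative strengths p, normalized to sum to 1, where the model's
// P(i beats j) = p[i] / (p[i] + p[j]).
function bradleyTerry(wins: number[][], iterations = 200): number[] {
  const n = wins.length;
  let p = new Array<number>(n).fill(1 / n);
  for (let it = 0; it < iterations; it++) {
    const next = new Array<number>(n).fill(0);
    for (let i = 0; i < n; i++) {
      const winsI = wins[i].reduce((a, b) => a + b, 0);
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const nij = wins[i][j] + wins[j][i]; // total comparisons between i and j
        if (nij > 0) denom += nij / (p[i] + p[j]);
      }
      next[i] = denom > 0 ? winsI / denom : p[i];
    }
    const total = next.reduce((a, b) => a + b, 0);
    p = next.map(x => x / total);
  }
  return p;
}
```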
9. Open problems
- Functional intent at scale. FEA-gated tasks are expensive; LLM-as-FEA-judge is unreliable.
- Multi-body assembly mating. v0.5 covers 2-3 bodies. Real assemblies are ~200.
- Sensitivity to drawing conventions. ASME Y14.5 vs ISO 8015 produce subtly different ground truth; the suite must hold the convention fixed per task.
- Paraphrase generation as adversarial benchmark. Auto-generating paraphrases that preserve intent without leaking spec is itself a research problem.
- Tool-licence cost in human baseline. Currently we only count seat-hours; Onshape per-seat / Solidworks Premium licences are not amortized in.
- Process-aware DFM weights. The same part is unmanufacturable on 3-ax CNC and trivial on SLA — DFM scores should be reported per process, not as a composite.
10. What v0.6 should add
- Multi-body assemblies (5–10 parts) with full mate graph and motion simulation.
- Live FEA-in-the-loop for L4 functional-intent tasks at production cadence (target < 90 s per task).
- Sheet-metal flat-pattern fidelity as a first-class L3 metric.
- Real-manufacture checkpoints. 12 headline tasks per year manufactured (CNC + SLA + injection-tool) and 3D-scanned for ground-truth comparison.
- Adversarial paraphrase generator that injects known-distractor terminology to test semantic robustness.
- Pairwise human study with N=20 mech-E judges via a custom CAD-diff UI; Bradley-Terry score reconciled against the automatic score.
- Calibrated abstention. Agents allowed to declare "out of scope"; abstentions counted as null (not zero) when the task is genuinely outside the agent's representation class.
The full task list, formal metric definitions, and run protocol live in /methodology. The runner and reference scorer are in scripts/; adding a new agent is one TypeScript adapter.