How CAD-Bench scores agents
The harness is open-source and built to be reproducible: a single Python entry point reads tasks.jsonl, dispatches each prompt to a registered agent, and writes a per-run record (artifact, latency, tokens, metrics) into runs.jsonl. The site you are reading is a static render of those records.
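A minimal sketch of that loop, assuming the file names from the text; `score_run` and the `agent` callable signature are illustrative, not the harness's actual API:

```python
import json
import time


def score_run(tasks_path: str, runs_path: str, agent) -> None:
    """Hypothetical harness loop: read tasks.jsonl, dispatch each prompt to a
    registered agent, and append one record per run to runs.jsonl."""
    with open(tasks_path) as tasks, open(runs_path, "a") as runs:
        for line in tasks:
            task = json.loads(line)
            t0 = time.monotonic()
            artifact = agent(task["prompt"])  # registered agent callable
            record = {
                "task_id": task["id"],
                "artifact": artifact,
                "latency_s": time.monotonic() - t0,
                # real records also carry tokens and per-metric scores
            }
            runs.write(json.dumps(record) + "\n")
```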
Dataset
- 343 tasks across 20 categories. The pilot subset rendered on this site is a stratified sample of ~2 tasks from each of the 20 categories.
- Each task ships with a canonical reference: an AP242 STEP file and a 200 k-vertex tessellation, each SHA-256-fingerprinted in the task entry.
- Reference parts were authored in Onshape by four mechanical engineers and reviewed by a fifth (inter-rater κ = 0.84 on the GD&T sub-set).
- Eleven prompts have paraphrased variants; originals and paraphrases are scored separately but treated as a single task for averaging.
- Held-out: reference STEP files are signed and not exposed in the prompt context. Agents are scored against them blind.
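A sketch of how a reference fingerprint can be computed and stored, assuming field names that are illustrative rather than the harness's actual schema:

```python
import hashlib
import json


def fingerprint(path: str) -> str:
    """SHA-256 digest of a reference file, as recorded in the task entry."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


# Hypothetical shape of one tasks.jsonl entry (field names illustrative):
task = {
    "id": "bracket_007",
    "category": "brackets",
    "prompt": "…",
    "reference_step_sha256": "<fingerprint of the AP242 STEP file>",
    "reference_mesh_sha256": "<fingerprint of the tessellation>",
}
print(json.dumps(task))
```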
Run protocol
- Each (agent, task) pair is sampled k = 5 times with seeds 1…5.
- The agent receives the verbatim prompt and may return a STEP, STL, or GLB file, or executable source (OpenSCAD/CadQuery). Source is executed inside an isolated Vercel Sandbox with a 90 s wall-clock cap, no network access, and 4 GiB of memory.
- The output is rigidly aligned to the reference by ICP (≤ 5° rotation, ≤ 2 mm translation) before any geometric metric is computed. Misalignment that exceeds this budget counts as a hard fail.
- Boolean validity (watertightness, manifoldness, Euler compliance) is checked via OpenCascade 7.8 ShapeAnalysis.
- Latency is measured client-side, end-to-end. Cost is the verifiable invoice from the provider, not a list-price estimate.
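The alignment budget above can be checked directly on the ICP result. A minimal sketch, assuming the recovered rigid transform is a 4×4 matrix (the function name is illustrative):

```python
import numpy as np


def within_icp_budget(T: np.ndarray, max_deg: float = 5.0, max_mm: float = 2.0) -> bool:
    """Check a 4x4 rigid transform against the rotation/translation budget.

    The rotation angle comes from the trace of R; the translation from the
    last column. Exceeding either budget is scored as a hard fail.
    """
    R, t = T[:3, :3], T[:3, 3]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_theta))
    return bool(angle_deg <= max_deg and np.linalg.norm(t) <= max_mm)
```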
Metrics
- Volumetric IoU (vol_iou): |V(A) ∩ V(B)| / |V(A) ∪ V(B)| on a 1 mm³ voxel grid after ICP alignment (≤5°, ≤2 mm). (Jaccard 1912; Nooruddin & Turk 2003.)
- Chamfer distance: 0.5·E_x[min_y ||x−y||] + 0.5·E_y[min_x ||x−y||] over 50 k surface samples. (Fan, Su & Guibas 2017.)
- Hausdorff 95: 95th percentile of bidirectional surface distances; ignores triangulation outliers.
- Normal consistency: E[|n_A · n_{NN(A→B)}|] over corresponding nearest-neighbor surface samples. (Mescheder et al. 2019.)
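Two of the geometric metrics can be sketched in a few lines. A minimal illustration, assuming pre-aligned inputs (boolean occupancy grids for IoU, (n, 3) surface-sample arrays for Chamfer); the brute-force distance matrix here stands in for the KD-tree lookup a real scorer would use:

```python
import numpy as np


def vol_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """Volumetric IoU on pre-aligned boolean occupancy grids (1 mm^3 voxels)."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return float(inter) / float(union) if union else 1.0


def chamfer(xs: np.ndarray, ys: np.ndarray) -> float:
    """Symmetric Chamfer distance over surface samples, shapes (n,3) and (m,3)."""
    d = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    return float(0.5 * d.min(axis=1).mean() + 0.5 * d.min(axis=0).mean())
```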
- Watertightness: every edge is shared by exactly two faces; the BREP shell has no naked edges (OCC ShapeAnalysis_Wire).
- Manifoldness: 1 − |E_nm| / |E|, where E_nm are edges incident to ≠ 2 faces.
- Euler compliance: V − E + F = 2(S − G) matches the reference shell count S and genus G exactly.
- STEP round-trip: Chamfer after AP242 export → re-import via OpenCascade. Measures BREP fidelity loss; mesh-only agents score null.
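The manifoldness ratio and Euler check reduce to simple edge bookkeeping on an indexed face list. A minimal sketch on triangle/polygon faces (the real scorer runs on the BREP via OpenCascade, not on a Python face list):

```python
from collections import Counter


def manifold_score(faces) -> float:
    """1 - |E_nm|/|E|, where E_nm are edges not incident to exactly two faces."""
    edges = Counter()
    for f in faces:
        for i in range(len(f)):
            a, b = f[i], f[(i + 1) % len(f)]
            edges[tuple(sorted((a, b)))] += 1
    bad = sum(1 for count in edges.values() if count != 2)
    return 1.0 - bad / len(edges)


def euler_ok(n_vertices: int, faces, shells: int = 1, genus: int = 0) -> bool:
    """V - E + F == 2(S - G) for a closed shell."""
    edges = {tuple(sorted((f[i], f[(i + 1) % len(f)])))
             for f in faces for i in range(len(f))}
    return n_vertices - len(edges) + len(faces) == 2 * (shells - genus)
```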
- Dimension RMSE: RMSE between agent-output named dimensions and the spec list, measured by feature-graph extraction (OCC BRepFeat) followed by name-matched comparison; unmatched dimensions count as max(2× tolerance). Stronger than bbox error: it catches a part that hits the bbox via wrong feature placement.
- GD&T pass rate: fraction of GD&T callouts (position, parallelism, perpendicularity, concentricity, runout, flatness) satisfied within their declared tolerance band, evaluated against the listed datums by OCC + a custom GD&T parser. (ASME Y14.5-2018 / ISO 1101.)
- Feature recall: |F_pred ∩ F_true| / |F_true|; features auto-detected via OCC BRepFeat, then matched by type + position (≤2 mm).
- Clearance compliance: fraction of mating pairs whose realized clearance falls inside the spec'd [c_min, c_max] band after assembly mate.
- Fit compliance: for ISO/ANSI fit specs (e.g. H7/g6), fraction of mating dimensions that respect the prescribed shaft/hole tolerance bands. (ISO 286-2.)
- Standards compliance: for prompts referencing a standard (ISO 4762, DIN 471, AS568, ANSI B5.50, etc.), fraction of standard-derived feature parameters matched within the standard's tolerance.
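The clearance, fit, and standards checks all reduce to the same predicate: what fraction of realized values fall inside their prescribed band. A minimal sketch with illustrative dimension names (the real bands come from the spec list or the cited standard's tables):

```python
def band_pass_fraction(measured: dict, bands: dict) -> float:
    """Fraction of dimensions whose realized value lies inside its [lo, hi] band.

    `measured` maps a dimension name to its realized value; `bands` maps the
    same names to (lo, hi) tolerance bounds.
    """
    hits = sum(lo <= measured[name] <= hi for name, (lo, hi) in bands.items())
    return hits / len(bands)
```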
- DFM composite: process-specific weighted composite of draft, wall, undercut, tool reach, and overhang; see dfm_cnc / dfm_mold / dfm_fdm for the per-process variants. (Boothroyd-Dewhurst DFM; MIT 2.008.)
- Draft coverage: fraction of vertical (parting-axis) faces with ≥1° draft on cast/molded parts, computed face-wise from surface normal vs parting plane.
- Minimum wall: fraction of solid that survives a process-specific minimum-wall erosion test (FDM ≥0.8 mm, mould ≥1.0 mm, SLS ≥0.5 mm).
- Machinability: fraction of part surface a 3-axis CAM postprocessor can reach without collision using a Ø6 / Ø3 / Ø1 tool stack, computed via FreeCAD-Path simulation. (FreeCAD CAM Workbench; Mastercam-equivalent collision check.)
- Support ratio: volume of FDM/SLA support material required, divided by part volume. Lower = better-orientable design.
- Wall uniformity: 1 − stdev(t_i)/mean(t_i) over wall-thickness samples; reflects mouldability/sheet-metal regularity.
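The wall-uniformity formula is small enough to state exactly. A minimal sketch over a list of thickness samples (how those samples are taken from the solid is the harness's job, not shown here):

```python
import statistics


def wall_uniformity(thicknesses) -> float:
    """1 - stdev(t_i)/mean(t_i) over wall-thickness samples; 1.0 = uniform."""
    mean = statistics.fmean(thicknesses)
    return 1.0 - statistics.pstdev(thicknesses) / mean
```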
- Parametric edit validity: fraction of declared parametric edits that produce ΔV within ±5 % of the analytically expected ΔV without breaking topology.
- Sketch determinism: fraction of sketch constraints that resolve to a fully-determined (DOF = 0) sketch under the agent's native solver.
- Parameter-range robustness: across declared parameter ranges sampled at N = 20 points each, fraction of samples that produce a watertight, topologically-equivalent solid.
- FEA pass rate: fraction of functional-intent tasks where automatic linear-elastic FEA at the spec'd load returns max von Mises < 0.8·σ_yield. (CalculiX / Code_Aster pipeline; mesh size ≤ feature/4.)
- Paraphrase stability: std-dev of vol_iou across N = 5 prompt paraphrases of the same task; measures semantic robustness.
- Seed stability: std-dev of vol_iou across k = 5 seeds at the same prompt; measures sampling reliability.
- Calibration: Brier score on the agent's self-reported pre-generation confidence ∈ [0,1] vs. realized pass@1. Lower = better calibrated. (Brier 1950.)
- Edit latency ratio: median latency of a parametric edit divided by median latency of fresh generation; <1 means real parametric editing exists.
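The calibration metric is the standard Brier score: the mean squared gap between self-reported confidence and the realized binary outcome. A minimal sketch:

```python
def brier(confidences, outcomes) -> float:
    """Brier score: mean (confidence - outcome)^2 over paired samples.

    confidences are self-reported values in [0, 1]; outcomes are realized
    pass@1 results as 0/1. Lower = better calibrated.
    """
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)
```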
- Pass@1: 1[Vol_IoU ≥ τ_c ∧ DFM ≥ 70 ∧ watertight] on a single sample; τ_c is category-dependent (0.85 / 0.75 / 0.65). (Chen et al. 2021, HumanEval.)
- Pass@5: unbiased estimator 1 − C(n−c, k)/C(n, k) with n = 5, k = 5; same gating predicate as Pass@1.
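The Pass@5 estimator is the combinatorial formula from the HumanEval paper, with n samples of which c pass the gating predicate:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 5 this collapses to "did any of the 5 seeds pass", which is why a single passing seed yields 1.0.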
- Median latency: median wall-clock from prompt submission to artifact return.
- p95 latency: 95th-percentile wall-clock latency over n ≥ 30 trials.
- Cost: Σ provider invoice line-items per generated artifact (input + output tokens, image tokens, fixed surcharges).
Composite score
For each category c we compute a per-task score S_c(t) as the mean, over c.primaryMetrics, of normalize(metric), where normalize maps each metric onto 0..100 using the transforms documented in lib/data/results.ts (e.g. vol_iou·100, or max(0, 100 − chamfer·50)). A 95 % bias-corrected bootstrap CI over the per-task scores is reported. The composite is the category-weighted mean over all tasks, with weights w_c taken from the categories table.
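A Python re-statement of that aggregation, using the two example transforms named in the text; the authoritative versions live in lib/data/results.ts, so treat this as an illustrative sketch:

```python
def normalize(metric: str, value: float) -> float:
    """Map a raw metric onto 0..100 (two of the documented transforms)."""
    if metric == "vol_iou":
        return value * 100.0
    if metric == "chamfer":
        return max(0.0, 100.0 - value * 50.0)
    return value  # assume other metrics are already on a 0..100 scale


def composite(task_scores: dict, weights: dict) -> float:
    """Category-weighted mean of per-task scores S_c(t).

    task_scores maps category -> list of per-task scores; weights maps
    category -> w_c from the categories table.
    """
    num = sum(weights[c] * s for c, scores in task_scores.items() for s in scores)
    den = sum(weights[c] * len(scores) for c, scores in task_scores.items())
    return num / den
```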
Reproducibility
- All randomness (sample seeds, ICP initial poses, voxelisation orientation) is recorded.
- Run sheets are publishable as a single JSONL; we publish the exact one used to produce these tables.
- The runner ships with offline reference scoring code (scripts/score.py) so vendors can verify their numbers against ours.
- Adding a new agent is one TypeScript stub plus a runner adapter; see scripts/run-evals.ts.
Known limitations
- Mesh-only agents (Trellis, Spline) cannot be scored on STEP round-trip and BREP feature recall, and therefore lose ~25 weighted points by construction.
- The DFM rubric encodes a particular set of manufacturing assumptions (3-axis CNC + injection moulding). FDM-only or DLP-only suites would re-weight differently.
- Human baseline timing is wall-clock at the bench; the harness does not yet model tool licence cost (only seat-hour rate).
- The pilot subset shown on this site is small enough that some category CIs span ~15 points; full-suite numbers (n=194) tighten these by ≈√(194/22) ≈ 3×.