How CAD-Bench scores agents
The harness is open-source and built to be reproducible: a single Python entry point reads tasks.jsonl, dispatches each prompt to a registered agent, and writes a per-run record (artifact, latency, tokens, metrics) into runs.jsonl. The site you are reading is a static render of those records.
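A minimal sketch of that loop, assuming the file names from the text; `score_run` and the `agent` callable signature are illustrative, not the harness's actual API:

```python
import json
import time


def score_run(tasks_path: str, runs_path: str, agent) -> None:
    """Hypothetical harness loop: read tasks.jsonl, dispatch each prompt to a
    registered agent, and append one record per run to runs.jsonl."""
    with open(tasks_path) as tasks, open(runs_path, "a") as runs:
        for line in tasks:
            task = json.loads(line)
            t0 = time.monotonic()
            artifact = agent(task["prompt"])  # registered agent callable
            record = {
                "task_id": task["id"],
                "artifact": artifact,
                "latency_s": time.monotonic() - t0,
                # real records also carry tokens and per-metric scores
            }
            runs.write(json.dumps(record) + "\n")
```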
Dataset
- 343 tasks across 20 categories. The pilot subset rendered on this site is a stratified sample of ~2 tasks from each of the 20 categories.
- Each task ships with a canonical reference: an AP242 STEP file and a 200 k-vertex tessellation, each SHA-256-fingerprinted in the task entry.
- Reference parts were authored in Onshape by four mechanical engineers and reviewed by a fifth (inter-rater κ = 0.84 on the GD&T sub-set).
- Eleven prompts have paraphrased variants; originals and paraphrases are scored separately but treated as a single task for averaging.
- Held-out: reference STEP files are signed and not exposed in the prompt context. Agents are scored against them blind.
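A sketch of how a reference fingerprint can be computed and stored, assuming field names that are illustrative rather than the harness's actual schema:

```python
import hashlib
import json


def fingerprint(path: str) -> str:
    """SHA-256 digest of a reference file, as recorded in the task entry."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


# Hypothetical shape of one tasks.jsonl entry (field names illustrative):
task = {
    "id": "bracket_007",
    "category": "brackets",
    "prompt": "…",
    "reference_step_sha256": "<fingerprint of the AP242 STEP file>",
    "reference_mesh_sha256": "<fingerprint of the tessellation>",
}
print(json.dumps(task))
```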
Run protocol
- Each (agent, task) pair is sampled k = 5 times with seeds 1…5.
- The agent receives the verbatim prompt and may return a STEP, STL, or GLB file, or executable source (OpenSCAD/CadQuery). Source is executed inside an isolated Vercel Sandbox with a 90 s wall-clock cap, no network access, and 4 GiB of memory.
- The output is rigidly aligned to the reference by ICP (≤ 5° rotation, ≤ 2 mm translation) before any geometric metric is computed. Misalignment that exceeds this budget counts as a hard fail.
- Boolean validity (watertightness, manifoldness, Euler compliance) is checked via OpenCascade 7.8 ShapeAnalysis.
- Latency is measured client-side, end-to-end. Cost is the verifiable invoice from the provider, not a list-price estimate.
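The alignment budget above can be checked directly on the ICP result. A minimal sketch, assuming the recovered rigid transform is a 4×4 matrix (the function name is illustrative):

```python
import numpy as np


def within_icp_budget(T: np.ndarray, max_deg: float = 5.0, max_mm: float = 2.0) -> bool:
    """Check a 4x4 rigid transform against the rotation/translation budget.

    The rotation angle comes from the trace of R; the translation from the
    last column. Exceeding either budget is scored as a hard fail.
    """
    R, t = T[:3, :3], T[:3, 3]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_theta))
    return bool(angle_deg <= max_deg and np.linalg.norm(t) <= max_mm)
```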
Metrics
- Volumetric IoU (vol_iou): |V(A) ∩ V(B)| / |V(A) ∪ V(B)| on a 1 mm³ voxel grid after ICP alignment (≤5°, ≤2 mm). (Jaccard 1912; Nooruddin & Turk 2003.)
- Chamfer distance: 0.5·E_x[min_y ||x−y||] + 0.5·E_y[min_x ||x−y||] over 50 k surface samples. (Fan, Su & Guibas 2017.)
- Hausdorff 95: 95th percentile of bidirectional surface distances; ignores triangulation outliers.
- Normal consistency: E[|n_A · n_{NN(A→B)}|] over corresponding nearest-neighbor surface samples. (Mescheder et al. 2019.)
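Two of the geometric metrics can be sketched in a few lines. A minimal illustration, assuming pre-aligned inputs (boolean occupancy grids for IoU, (n, 3) surface-sample arrays for Chamfer); the brute-force distance matrix here stands in for the KD-tree lookup a real scorer would use:

```python
import numpy as np


def vol_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """Volumetric IoU on pre-aligned boolean occupancy grids (1 mm^3 voxels)."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return float(inter) / float(union) if union else 1.0


def chamfer(xs: np.ndarray, ys: np.ndarray) -> float:
    """Symmetric Chamfer distance over surface samples, shapes (n,3) and (m,3)."""
    d = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    return float(0.5 * d.min(axis=1).mean() + 0.5 * d.min(axis=0).mean())
```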
- Watertightness: every edge is shared by exactly two faces; the BREP shell has no naked edges (OCC ShapeAnalysis_Wire).
- Manifoldness: 1 − |E_nm| / |E|, where E_nm are edges incident to ≠ 2 faces.
- Euler compliance: V − E + F = 2(S − G) matches the reference shell count S and genus G exactly.
- STEP round-trip: Chamfer after AP242 export → re-import via OpenCascade. Measures BREP fidelity loss; mesh-only agents score null.
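The manifoldness ratio and Euler check reduce to simple edge bookkeeping on an indexed face list. A minimal sketch on triangle/polygon faces (the real scorer runs on the BREP via OpenCascade, not on a Python face list):

```python
from collections import Counter


def manifold_score(faces) -> float:
    """1 - |E_nm|/|E|, where E_nm are edges not incident to exactly two faces."""
    edges = Counter()
    for f in faces:
        for i in range(len(f)):
            a, b = f[i], f[(i + 1) % len(f)]
            edges[tuple(sorted((a, b)))] += 1
    bad = sum(1 for count in edges.values() if count != 2)
    return 1.0 - bad / len(edges)


def euler_ok(n_vertices: int, faces, shells: int = 1, genus: int = 0) -> bool:
    """V - E + F == 2(S - G) for a closed shell."""
    edges = {tuple(sorted((f[i], f[(i + 1) % len(f)])))
             for f in faces for i in range(len(f))}
    return n_vertices - len(edges) + len(faces) == 2 * (shells - genus)
```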
- Dimension RMSE: RMSE between agent-output named dimensions and the spec list, measured by feature-graph extraction (OCC BRepFeat) followed by name-matched comparison; unmatched dimensions count as max(2× tolerance). Stronger than bbox error: it catches a part that hits the bbox via wrong feature placement.
- GD&T pass rate: fraction of GD&T callouts (position, parallelism, perpendicularity, concentricity, runout, flatness) satisfied within their declared tolerance band, evaluated against the listed datums by OCC + a custom GD&T parser. (ASME Y14.5-2018 / ISO 1101.)
- Feature recall: |F_pred ∩ F_true| / |F_true|; features auto-detected via OCC BRepFeat, then matched by type + position (≤2 mm).
- Clearance compliance: fraction of mating pairs whose realized clearance falls inside the spec'd [c_min, c_max] band after assembly mate.
- Fit compliance: for ISO/ANSI fit specs (e.g. H7/g6), fraction of mating dimensions that respect the prescribed shaft/hole tolerance bands. (ISO 286-2.)
- Standards compliance: for prompts referencing a standard (ISO 4762, DIN 471, AS568, ANSI B5.50, etc.), fraction of standard-derived feature parameters matched within the standard's tolerance.
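The clearance, fit, and standards checks all reduce to the same predicate: what fraction of realized values fall inside their prescribed band. A minimal sketch with illustrative dimension names (the real bands come from the spec list or the cited standard's tables):

```python
def band_pass_fraction(measured: dict, bands: dict) -> float:
    """Fraction of dimensions whose realized value lies inside its [lo, hi] band.

    `measured` maps a dimension name to its realized value; `bands` maps the
    same names to (lo, hi) tolerance bounds.
    """
    hits = sum(lo <= measured[name] <= hi for name, (lo, hi) in bands.items())
    return hits / len(bands)
```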
- DFM composite: process-specific weighted composite of draft, wall, undercut, tool reach, and overhang; see dfm_cnc / dfm_mold / dfm_fdm for the per-process variants. (Boothroyd-Dewhurst DFM; MIT 2.008.)
- Draft coverage: fraction of vertical (parting-axis) faces with ≥1° draft on cast/molded parts, computed face-wise from surface normal vs parting plane.
- Minimum wall: fraction of solid that survives a process-specific minimum-wall erosion test (FDM ≥0.8 mm, mould ≥1.0 mm, SLS ≥0.5 mm).
- Machinability: fraction of part surface a 3-axis CAM postprocessor can reach without collision using a Ø6 / Ø3 / Ø1 tool stack, computed via FreeCAD-Path simulation. (FreeCAD CAM Workbench; Mastercam-equivalent collision check.)
- Support ratio: volume of FDM/SLA support material required, divided by part volume. Lower = better-orientable design.
- Wall uniformity: 1 − stdev(t_i)/mean(t_i) over wall-thickness samples; reflects mouldability/sheet-metal regularity.
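The wall-uniformity formula is small enough to state exactly. A minimal sketch over a list of thickness samples (how those samples are taken from the solid is the harness's job, not shown here):

```python
import statistics


def wall_uniformity(thicknesses) -> float:
    """1 - stdev(t_i)/mean(t_i) over wall-thickness samples; 1.0 = uniform."""
    mean = statistics.fmean(thicknesses)
    return 1.0 - statistics.pstdev(thicknesses) / mean
```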
- Parametric edit validity: fraction of declared parametric edits that produce ΔV within ±5 % of the analytically expected ΔV without breaking topology.
- Sketch determinism: fraction of sketch constraints that resolve to a fully-determined (DOF = 0) sketch under the agent's native solver.
- Parameter-range robustness: across declared parameter ranges sampled at N = 20 points each, fraction of samples that produce a watertight, topologically-equivalent solid.
- FEA pass rate: fraction of functional-intent tasks where automatic linear-elastic FEA at the spec'd load returns max von Mises < 0.8·σ_yield. (CalculiX / Code_Aster pipeline; mesh size ≤ feature/4.)
- Paraphrase stability: std-dev of vol_iou across N = 5 prompt paraphrases of the same task; measures semantic robustness.
- Seed stability: std-dev of vol_iou across k = 5 seeds at the same prompt; measures sampling reliability.
- Calibration: Brier score on the agent's self-reported pre-generation confidence ∈ [0,1] vs. realized pass@1. Lower = better calibrated. (Brier 1950.)
- Edit latency ratio: median latency of a parametric edit divided by median latency of fresh generation; <1 means real parametric editing exists.
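The calibration metric is the standard Brier score: the mean squared gap between self-reported confidence and the realized binary outcome. A minimal sketch:

```python
def brier(confidences, outcomes) -> float:
    """Brier score: mean (confidence - outcome)^2 over paired samples.

    confidences are self-reported values in [0, 1]; outcomes are realized
    pass@1 results as 0/1. Lower = better calibrated.
    """
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)
```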
- Pass@1: 1[Vol_IoU ≥ τ_c ∧ DFM ≥ 70 ∧ watertight] on a single sample; τ_c is category-dependent (0.85 / 0.75 / 0.65). (Chen et al. 2021, HumanEval.)
- Pass@5: unbiased estimator 1 − C(n−c, k)/C(n, k) with n = 5, k = 5; same gating predicate as Pass@1.
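The Pass@5 estimator is the combinatorial formula from the HumanEval paper, with n samples of which c pass the gating predicate:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 5 this collapses to "did any of the 5 seeds pass", which is why a single passing seed yields 1.0.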
- Median latency: median wall-clock from prompt submission to artifact return.
- p95 latency: 95th-percentile wall-clock latency over n ≥ 30 trials.
- Cost: Σ provider invoice line-items per generated artifact (input + output tokens, image tokens, fixed surcharges).
Composite score
For each category c we compute a per-task score S_c(t) as the mean, over c.primaryMetrics, of normalize(metric), where normalize maps each metric onto 0..100 using the transforms documented in lib/data/results.ts (e.g. vol_iou·100, or max(0, 100 − chamfer·50)). A 95 % bias-corrected bootstrap CI over the per-task scores is reported. The composite is the category-weighted mean over all tasks, with weights w_c taken from the categories table.
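A Python re-statement of that aggregation, using the two example transforms named in the text; the authoritative versions live in lib/data/results.ts, so treat this as an illustrative sketch:

```python
def normalize(metric: str, value: float) -> float:
    """Map a raw metric onto 0..100 (two of the documented transforms)."""
    if metric == "vol_iou":
        return value * 100.0
    if metric == "chamfer":
        return max(0.0, 100.0 - value * 50.0)
    return value  # assume other metrics are already on a 0..100 scale


def composite(task_scores: dict, weights: dict) -> float:
    """Category-weighted mean of per-task scores S_c(t).

    task_scores maps category -> list of per-task scores; weights maps
    category -> w_c from the categories table.
    """
    num = sum(weights[c] * s for c, scores in task_scores.items() for s in scores)
    den = sum(weights[c] * len(scores) for c, scores in task_scores.items())
    return num / den
```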
Reproducibility
- All randomness (sample seeds, ICP initial poses, voxelisation orientation) is recorded.
- Run sheets are publishable as a single JSONL; we publish the exact one used to produce these tables.
- The runner ships with offline reference scoring code (scripts/score.py) so vendors can verify their numbers against ours.
- Adding a new agent is one TypeScript stub plus a runner adapter; see scripts/run-evals.ts.
Known limitations
- Mesh-only agents (Trellis, Spline) cannot be scored on STEP round-trip and BREP feature recall, and therefore lose ~25 weighted points by construction.
- The DFM rubric encodes a particular set of manufacturing assumptions (3-axis CNC + injection moulding). FDM-only or DLP-only suites would re-weight differently.
- Human baseline timing is wall-clock at the bench; the harness does not yet model tool licence cost (only seat-hour rate).
- The pilot subset shown on this site is small enough that some category CIs span ~15 points; full-suite numbers (n=194) tighten these by ≈√(194/22) ≈ 3×.