20 categories across 4 layers
The CAD-Bench composite is a layer-weighted mean. The four layers each have a different *correct judging modality* — mixing them obscures what an agent is actually good at. Default layer weights: L1 0.20, L2 0.35, L3 0.25, L4 0.20. The leaderboard re-weights these per use case (production / exploration / hobbyist). See /design for the rationale.
L1 · Geometry
w_layer = 0.20 · 4 categoriesAutomatic geometric metrics. Cheap, deterministic, no domain reasoning needed. The noise floor of the suite.
Closed-form parametric primitives (boxes, cylinders, cones, tori, regular prisms) at exactly specified dimensions. Establishes a noise floor: agents that fail here cannot be trusted on harder tasks.
Edge-case CSG operations: tangent fillets, coplanar faces, near-degenerate intersections, high-genus subtractions. Stresses kernel ε-tolerance handling. Patterned after the OpenCascade and ACIS robustness suites.
Tests whether the agent emits a clean boundary representation (named faces, coherent edge graph, exact NURBS surfaces) versus a tessellated approximation. Round-trips through AP242 STEP.
Class-A surfaces (G2 continuity, lofted, swept) such as turbine blades and ergonomic handles. Scored against high-density (200 k vertex) ground-truth meshes.
L2 · Engineering
w_layer = 0.35 · 6 categoriesNamed-dimension matching, GD&T parsers, partner-part assembly simulation. Requires CAD-kernel reasoning about engineering intent.
Industry-grade brackets, fasteners, housings, and shafts specified with full GD&T (ISO 1101) including position, parallelism, and concentricity callouts. Volumes range 0.5 cm³ – 1.2 dm³.
Two- and three-body assemblies with pin-in-hole, dovetail, and threaded mate constraints. Scoring requires the candidate to mate the held-out reference partner within the prescribed clearance band.
Prompts cite a specific standard (ISO 4762 socket-head cap screw, DIN 471 retaining ring, AS568 O-ring, ANSI B5.50 dovetail). Score = fraction of standard-derived feature parameters matched within the standard's tolerance band.
Uniform-thickness bodies with bend specifications, k-factors, relief cuts, and unfoldable flat patterns. Tested by attempting to unfold the result and measuring the unfold error vs the spec'd flat pattern.
O-ring grooves (AS568 / ISO 3601), face seals, and lip-seal cavities. Score depends on cross-section area, groove width, and squeeze ratio matching the relevant standard.
Four-bar linkages, cams (radial/face), gear meshes (involute, ISO 53). Scored by simulating one full kinematic cycle and measuring (a) feasibility — no body interpenetration — and (b) prescribed motion error.
L3 · Manufacturing
w_layer = 0.25 · 4 categoriesProcess-specific analyzers — DFM rules, CAM postprocessor, draft / wall / overhang checks. Different per process.
Manufacturable on a 3-axis VMC with a Ø6 → Ø3 → Ø1 tool stack. Score gates on tool reachability (no closed pockets, no internal corners <tool radius), fixturable orientation, and workholding access.
Two-plate mould tooling: parting plane, ≥1° draft on every vertical face, no closed voids, uniform wall thickness ±10 %, gate/runner accessibility. Tested by an automatic draft + thickness analyzer plus parting-line extraction.
FDM-printable on a 0.4 mm nozzle: overhangs ≤45° from build plate, ≥0.8 mm walls, no enclosed cavities, optimal orientation auto-selected by the analyzer.
Stricter cousin of DFM-CNC: an actual 3-axis CAM postprocessor (FreeCAD-Path) must generate a collision-free G-code program at a ≤0.05 mm finish stepover. Score = fraction of part surface produced.
L4 · Cognition
w_layer = 0.20 · 6 categoriesRobustness and intent. Paraphrase variance, parametric range survival, FEA-pass at spec'd loads, calibration of self-reported confidence.
Probes whether the agent exposes a working parametric graph: after the part is built we issue downstream parameter edits (length+30 %, hole diameter→M8) and re-evaluate without topological breakage.
Multi-view orthographic drawings (front/top/side at 1:1, fully dimensioned) and product photos. The agent must reproduce the part. Adapted from the ABC dataset and a held-out subset of GrabCAD test parts.
Closed planar profiles defined purely by geometric constraints (tangency, equal-length, perpendicular, coincident). Score is fraction of constraints the agent honors after sketch resolution.
Prompts specify a *function* ("hold a 250 N transverse load with a 4× safety factor in 6061-T6") rather than a geometry. Score requires the agent's part to pass automatic linear-elastic FEA at the spec'd load with stress ≤ 0.8·σ_yield.
Each task ships with N=5 LLM-rephrased prompts that preserve every spec quantity. We compute the std-dev of vol_iou across the paraphrase set; lower = the agent reads intent rather than surface form.
For agents that report a pre-generation confidence ∈ [0, 1], we score the Brier loss against the realized Pass@1. Agents that don't expose a confidence channel are assigned the constant prior (their global Pass@1 rate); this becomes their effective baseline.