Preprint · cad-bench/v0.5 · sweep 2026-04-12open · MIT
CAD·Benchv0.5
CATEGORIES

20 categories across 4 layers

The CAD-Bench composite is a layer-weighted mean. The four layers each have a different *correct judging modality* — mixing them obscures what an agent is actually good at. Default layer weights: L1 0.20, L2 0.35, L3 0.25, L4 0.20. The leaderboard re-weights these per use case (production / exploration / hobbyist). See /design for the rationale.

L1 · Geometry

w_layer = 0.20 · 4 categories

Automatic geometric metrics. Cheap, deterministic, no domain reasoning needed. The noise floor of the suite.

L2 · Engineering

w_layer = 0.35 · 6 categories

Named-dimension matching, GD&T parsers, partner-part assembly simulation. Requires CAD-kernel reasoning about engineering intent.

Parametric Mechanical Parts
w=0.25 · n=30
judging: named-dim+GD&T

Industry-grade brackets, fasteners, housings, and shafts specified with full GD&T (ISO 1101) including position, parallelism, and concentricity callouts. Volumes range 0.5 cm³ – 1.2 dm³.

Named-Dimension RMSEGD&T ComplianceFeature Recall
Assembly & Mating
w=0.22 · n=16
judging: named-dim+GD&T

Two- and three-body assemblies with pin-in-hole, dovetail, and threaded mate constraints. Scoring requires the candidate to mate the held-out reference partner within the prescribed clearance band.

Mating ClearanceFit-Class ComplianceFeature Recall
Standards Compliance
w=0.15 · n=18
judging: named-dim+GD&T

Prompts cite a specific standard (ISO 4762 socket-head cap screw, DIN 471 retaining ring, AS568 O-ring, ANSI B5.50 dovetail). Score = fraction of standard-derived feature parameters matched within the standard's tolerance band.

Standards ComplianceGD&T Compliance
Sheet-Metal Bodies
w=0.13 · n=14
judging: process-analyzer

Uniform-thickness bodies with bend specifications, k-factors, relief cuts, and unfoldable flat patterns. Tested by attempting to unfold the result and measuring the unfold error vs the spec'd flat pattern.

Wall-Thickness UniformityNamed-Dimension RMSEFeature Recall
Sealing-Groove Design
w=0.10 · n=10
judging: named-dim+GD&T

O-ring grooves (AS568 / ISO 3601), face seals, and lip-seal cavities. Score depends on cross-section area, groove width, and squeeze ratio matching the relevant standard.

Standards ComplianceNamed-Dimension RMSE
Kinematic Mechanisms
w=0.15 · n=12
judging: process-analyzer

Four-bar linkages, cams (radial/face), gear meshes (involute, ISO 53). Scored by simulating one full kinematic cycle and measuring (a) feasibility — no body interpenetration — and (b) prescribed motion error.

Mating ClearanceFeature RecallParametric Edit Accuracy

L3 · Manufacturing

w_layer = 0.25 · 4 categories

Process-specific analyzers — DFM rules, CAM postprocessor, draft / wall / overhang checks. Different per process.

L4 · Cognition

w_layer = 0.20 · 6 categories

Robustness and intent. Paraphrase variance, parametric range survival, FEA-pass at spec'd loads, calibration of self-reported confidence.

Constraint Solving & Editability
w=0.18 · n=18
judging: variance-analysis

Probes whether the agent exposes a working parametric graph: after the part is built we issue downstream parameter edits (length+30 %, hole diameter→M8) and re-evaluate without topological breakage.

Parametric Edit AccuracyParametric Range IntegrityConstraint Solve Rate
Reverse Engineering
w=0.20 · n=22
judging: named-dim+GD&T

Multi-view orthographic drawings (front/top/side at 1:1, fully dimensioned) and product photos. The agent must reproduce the part. Adapted from the ABC dataset and a held-out subset of GrabCAD test parts.

Volumetric IoUFeature RecallNamed-Dimension RMSE
2-D Sketch Constraints
w=0.10 · n=20
judging: geometric

Closed planar profiles defined purely by geometric constraints (tangency, equal-length, perpendicular, coincident). Score is fraction of constraints the agent honors after sketch resolution.

Constraint Solve RateBidirectional Chamfer
Functional Intent · FEA-Gated
w=0.25 · n=16
judging: FEA

Prompts specify a *function* ("hold a 250 N transverse load with a 4× safety factor in 6061-T6") rather than a geometry. Score requires the agent's part to pass automatic linear-elastic FEA at the spec'd load with stress ≤ 0.8·σ_yield.

FEA-Yield PassMin-Wall ComplianceFeature Recall
Paraphrase Robustness
w=0.15 · n=20
judging: variance-analysis

Each task ships with N=5 LLM-rephrased prompts that preserve every spec quantity. We compute the std-dev of vol_iou across the paraphrase set; lower = the agent reads intent rather than surface form.

Paraphrase IoU σSeed σ
Confidence Calibration
w=0.12 · n=15
judging: rubric+pairwise

For agents that report a pre-generation confidence ∈ [0, 1], we score the Brier loss against the realized Pass@1. Agents that don't expose a confidence channel are assigned the constant prior (their global Pass@1 rate); this becomes their effective baseline.

Calibration (Brier)Pass@1