CAD-Bench
CAL-007 · Confidence Calibration · difficulty 3/5

Confidence-bracketed mounting flange

sha256:ad01ef02b3c40911

§1 Prompt verbatim

Cast-aluminium mounting flange, Ø 120 × 18 mm with central Ø 30 H7 bore and 6× M8 clearance holes on a 90 mm PCD. Report a self-assessed pre-generation confidence ∈ [0,1] in your output's correctness against the spec.
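The geometry in the prompt reduces to a handful of numbers, and the bolt pattern is the part most easily misplaced. The sketch below computes the six hole centres on the 90 mm PCD and the remaining wall to the outer edge. It rests on assumptions the prompt does not state: the holes are equally spaced and start at 0°, "M8 clearance" is read as the ISO 273 medium-fit diameter of 9.0 mm, and H7 on a 30 mm bore is taken as 0 / +0.021 mm. The function name `pcd_hole_centres` is illustrative only.

```python
import math

# Nominal dimensions from the prompt (mm).
OUTER_DIA = 120.0
PCD = 90.0
N_HOLES = 6
# Assumption: "M8 clearance" read as the ISO 273 medium-fit diameter.
M8_CLEARANCE_DIA = 9.0
# Assumption: H7 limits for a 30 mm bore (ISO 286), lower and upper size.
BORE_MIN, BORE_MAX = 30.000, 30.021


def pcd_hole_centres(pcd, count, start_angle_deg=0.0):
    """Return (x, y) centres of `count` holes equally spaced on a pitch circle."""
    radius = pcd / 2.0
    step = 360.0 / count
    centres = []
    for i in range(count):
        theta = math.radians(start_angle_deg + i * step)
        centres.append((radius * math.cos(theta), radius * math.sin(theta)))
    return centres


centres = pcd_hole_centres(PCD, N_HOLES)
for x, y in centres:
    print(f"{x:8.3f} {y:8.3f}")

# Material left between a bolt hole and the flange's outer edge.
edge_margin = OUTER_DIA / 2.0 - PCD / 2.0 - M8_CLEARANCE_DIA / 2.0
print(f"edge margin: {edge_margin:.1f} mm")
print(f"bore limits: {BORE_MIN:.3f} to {BORE_MAX:.3f} mm")
```

With six holes the pattern is a regular hexagon, so the distance between adjacent centres equals the pitch-circle radius (45 mm), which is a quick check on any generated part.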

§2 Ground-truth spec

shells: 1
watertight: true
manifold: true
acceptance ε: ±0.1 mm
features: bore_H7_30, M8_clearance_x6, PCD_90
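The page does not say how the ±0.1 mm acceptance band is applied. One plausible reading, sketched here as an assumption, is a per-dimension check of absolute deviation from nominal. The `NOMINAL` table, the `check_dimensions` helper, and the measured values are all hypothetical.

```python
EPSILON_MM = 0.1  # acceptance band from the ground-truth spec

# Nominal values taken from the task prompt (mm).
NOMINAL = {
    "outer_diameter": 120.0,
    "thickness": 18.0,
    "bore_diameter": 30.0,
    "pcd": 90.0,
}


def check_dimensions(measured, nominal=NOMINAL, eps=EPSILON_MM):
    """Map each dimension name to True if |measured - nominal| <= eps."""
    return {
        name: abs(measured[name] - target) <= eps
        for name, target in nominal.items()
    }


# Hypothetical measurements from a generated part (not benchmark data).
measured = {
    "outer_diameter": 120.04,
    "thickness": 18.0,
    "bore_diameter": 30.02,
    "pcd": 90.25,
}
result = check_dimensions(measured)
print(result)
print("accepted:", all(result.values()))
```

In this made-up case only the PCD falls outside the band, so the part as a whole would be rejected under an all-dimensions-must-pass rule.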

§3 Reference render

Visualisation is rebuilt in-browser from the canonical parametric description. Scoring is performed against the held-out reference STEP file (sha-256 fingerprint above).
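A fingerprint like the one above lets anyone holding the reference file confirm they have the same bytes. The sketch below assumes the published value is the leading hex characters of the file's SHA-256 digest, which the page does not confirm. `fingerprint` and `matches` are illustrative helpers, and the payload is a stand-in because the real STEP file is held out.

```python
import hashlib


def fingerprint(data: bytes, length: int = 16) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def matches(data: bytes, expected: str) -> bool:
    """Compare data against a fingerprint written as 'sha256:<hex prefix>'."""
    scheme, _, prefix = expected.partition(":")
    if scheme != "sha256" or not prefix:
        raise ValueError(f"unsupported fingerprint: {expected!r}")
    return fingerprint(data, len(prefix)) == prefix.lower()


# Stand-in bytes; the held-out reference STEP file is not available here.
payload = b"ISO-10303-21;"
tag = "sha256:" + fingerprint(payload)
print(tag)
print(matches(payload, tag))          # same bytes as the tag was built from
print(matches(payload + b"x", tag))   # one extra byte changes the digest
```

A 16-character prefix carries 64 bits, which is enough to catch accidental mismatches, though a full digest is the safer choice where tampering is a concern.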

§4 Per-agent renders

reference + 10 agent outputs · scored against the held-out STEP
vol IoU · BREP · manifold
| Agent | Details | Vol IoU | BREP |
|---|---|---|---|
| Reference | canonical · ground truth | 1.000 | 100 |
| GPT-5 → CadQuery | OpenAI + CadQuery 2.4 | 0.761 | 8 |
| Human Baseline (Mech-E) | n=4 senior engineers | 0.694 | 10 |
| Claude Sonnet 4.6 → CadQuery | Anthropic + CadQuery 2.4 | 0.552 | 11 |
| OpenAI o4 (reasoning) → CadQuery | OpenAI + CadQuery 2.4 | 0.544 | 8 |
| Claude Opus 4.7 → CadQuery | Anthropic + CadQuery 2.4 | 0.493 | 10 |
| Claude Opus 4.7 → OpenSCAD | Anthropic + OpenSCAD 2024.06 | 0.477 | 0 |
| Zoo Text-to-CAD | Zoo (KittyCAD) | 0.404 | 13 |
| Gemini 2.5 Flash → CadQuery | Google + CadQuery 2.4 | 0.389 | 17 |
| DeepSeek R1 (reasoning) → CadQuery | DeepSeek + CadQuery 2.4 | 0.370 | 15 |
| Qwen3 Coder → CadQuery | Alibaba + CadQuery 2.4 | 0.347 | 15 |
| Llama 3.3 70B → OpenSCAD | Meta + OpenSCAD 2024.06 | 0.341 | 18 |
| Claude Haiku 4.5 → CadQuery | Anthropic + CadQuery 2.4 | 0.340 | 15 |
| Adam (CADcrush) | CADcrush | 0.324 | 16 |
| Gemini 2.5 Pro → OpenSCAD | Google + OpenSCAD 2024.06 | 0.317 | 0 |
| GPT-5 Mini → OpenSCAD | OpenAI + OpenSCAD 2024.06 | 0.289 | 20 |
| CAD-Coder R1 | CAD-Coder Labs (research) | 0.247 | 21 |
| Hunyuan3D-2 | Tencent | 0.127 | 41 |
| DeepCAD | Wu et al. 2021 (research) | 0.124 | 38 |
| Trellis 3D | Microsoft Research | 0.055 | 0 |
| Spline AI | Spline.design | 0.008 | 0 |

Each tile is rebuilt from the canonical parametric description and degraded to match the agent's scored profile (tessellation, non-manifold face removal, dimension scale jitter, missing features). Image-only diffusion models render visually plausible meshes but score in the single digits on BREP fidelity — the geometry is not a manifold solid even when the render reads clean.
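To give a feel for how the listed degradations move a volumetric IoU, here is a minimal sketch that rasterises the flange profile onto a 1 mm grid and compares cell sets. It is not the benchmark's scorer. It assumes both parts share the same thickness, so the 2-D area ratio stands in for the volume ratio, and it assumes 9.0 mm clearance holes. `flange_cells` and `iou` are hypothetical helpers.

```python
import math


def flange_cells(outer_d=120.0, bore_d=30.0, pcd=90.0, hole_d=9.0,
                 n_holes=6, scale=1.0, cell=1.0):
    """Return grid cells (i, j) whose centre lies inside the flange profile."""
    r_out, r_bore, r_hole = (scale * d / 2.0 for d in (outer_d, bore_d, hole_d))
    r_pcd = scale * pcd / 2.0
    holes = [(r_pcd * math.cos(2 * math.pi * k / n_holes),
              r_pcd * math.sin(2 * math.pi * k / n_holes))
             for k in range(n_holes)]
    half = int(math.ceil(r_out / cell)) + 1
    cells = set()
    for i in range(-half, half):
        for j in range(-half, half):
            x, y = (i + 0.5) * cell, (j + 0.5) * cell
            r = math.hypot(x, y)
            if r > r_out or r < r_bore:
                continue  # outside the disc or inside the central bore
            if any(math.hypot(x - hx, y - hy) < r_hole for hx, hy in holes):
                continue  # inside a bolt hole
            cells.add((i, j))
    return cells


def iou(a, b):
    """Intersection over union of two cell sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


reference = flange_cells()
no_holes = flange_cells(n_holes=0)    # missing-feature degradation
jittered = flange_cells(scale=1.05)   # in-plane dimension scale jitter
print(f"missing bolt holes: {iou(reference, no_holes):.3f}")
print(f"5% scale jitter:    {iou(reference, jittered):.3f}")
```

Dropping all six bolt holes removes only a few percent of the material, so a volume-only metric barely notices a missing feature that would make the part unusable. That is one reason to report a feature-level score alongside volumetric IoU.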

§5 Per-agent metrics

ranked by Vol IoU · same data as the leaderboard, restricted to this task
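| Agent | Watertight | Manifold | Calibration (Brier) | P@1 | p50 latency | Cost |
|---|---|---|---|---|---|---|
| GPT-5 → CadQuery |  | 0.958 | 0.096 | 1.000 | 47.3 s | $0.197 |
| Human Baseline (Mech-E) |  | 0.957 | 0.035 | 0.000 | 843.6 s | $6.866 |
| Claude Sonnet 4.6 → CadQuery |  | 0.929 | 0.139 | 0.000 | 17.5 s | $0.059 |
| OpenAI o4 (reasoning) → CadQuery |  | 0.939 | 0.069 | 0.000 | 109.6 s | $1.142 |
| Claude Opus 4.7 → CadQuery |  | 0.923 | 0.086 | 0.000 | 36.9 s | $0.357 |
| Claude Opus 4.7 → OpenSCAD |  | 0.921 | 0.102 | 0.000 | 27.4 s | $0.360 |
| Zoo Text-to-CAD |  | 0.915 | 0.151 | 0.000 | 6.9 s | $0.187 |
| Gemini 2.5 Flash → CadQuery | × | 0.906 | 0.114 | 0.000 | 9.0 s | $0.016 |
| DeepSeek R1 (reasoning) → CadQuery | × | 0.903 | 0.104 | 0.000 | 93.1 s | $0.041 |
| Qwen3 Coder → CadQuery | × | 0.904 | 0.152 | 0.000 | 21.5 s | $0.030 |
| Llama 3.3 70B → OpenSCAD | × | 0.899 | 0.127 | 0.000 | 19.5 s | $0.020 |
| Claude Haiku 4.5 → CadQuery | × | 0.900 | 0.117 | 0.000 | 9.0 s | $0.019 |
| Adam (CADcrush) | × | 0.899 | 0.170 | 0.000 | 9.9 s | $0.221 |
| Gemini 2.5 Pro → OpenSCAD | × | 0.898 | 0.158 | 0.000 | 26.0 s | $0.085 |
| GPT-5 Mini → OpenSCAD | × | 0.894 | 0.205 | 0.000 | 14.1 s | $0.009 |
| CAD-Coder R1 | × | 0.889 | 0.183 | 0.000 | 5.4 s | $0.005 |
| Hunyuan3D-2 | × | 0.868 | 0.238 | 0.000 | 37.6 s | $0.056 |
| DeepCAD | × | 0.869 | 0.248 | 0.000 | 4.3 s | $0.020 |
| Trellis 3D | × | 0.858 | 0.179 | 0.000 | 11.3 s | $0.049 |
| Spline AI | × | 0.851 | 0.283 | 0.000 | 10.3 s | $0.034 |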
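The calibration column is a Brier score, which in its standard form is the mean squared gap between a stated confidence and the 0/1 outcome (lower is better). The sketch below shows that definition. How the benchmark defines the outcome and aggregates across runs is not stated on this page, so the pairing of confidence with a pass/fail flag is an assumption, and the example values are invented.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty sequences")
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
    total = sum((c - o) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)


# Hypothetical values, not taken from the table above.
print(brier_score([0.9], [1]))          # confident and correct: small penalty
print(brier_score([0.9], [0]))          # confident and wrong: large penalty
print(brier_score([0.5, 0.5], [1, 0]))  # always hedging at 0.5
```

An agent that always answers 0.5 scores 0.25 regardless of outcome, so values below that indicate the stated confidence carries some information.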