CAD-Bench
CAL-003 · Confidence Calibration · difficulty 5/5

Confidence-calibrated planetary carrier

sha256:2b97cc4d1ef0aa55

§1 Prompt verbatim

Build the planetary carrier of MECH-027 and report a self-assessed pre-generation confidence ∈ [0,1] in your output's correctness against the spec.

§2 Ground-truth spec

shells: 1
watertight: true
manifold: true
acceptance ε: ±0.05 mm
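The spec fields above can be illustrated on a tessellated mesh. The following is a minimal sketch, not the benchmark's scorer: the page indicates scoring runs against a STEP file through a BREP kernel, whereas this example assumes an indexed triangle mesh. The function names (`is_watertight`, `count_shells`, `within_tolerance`) are hypothetical, and the edge test does not detect non-manifold vertices.

```python
from collections import Counter


def edge_face_counts(faces):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def is_watertight(faces):
    """True if every edge is shared by exactly two triangles.

    An edge used once is a boundary (open surface); an edge used three
    or more times is non-manifold. An empty mesh is not watertight.
    """
    counts = edge_face_counts(faces)
    return bool(counts) and all(n == 2 for n in counts.values())


def count_shells(faces):
    """Number of connected components, linking faces through shared vertices."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, c in faces:
        root = find(a)
        parent[find(b)] = root
        parent[find(c)] = root
    return len({find(v) for face in faces for v in face})


def within_tolerance(measured_mm, nominal_mm, eps_mm=0.05):
    """Assumed reading of 'acceptance ε ±0.05 mm': absolute deviation per dimension."""
    return abs(measured_mm - nominal_mm) <= eps_mm
```

For a closed tetrahedron `[(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]`, `is_watertight` returns `True` and `count_shells` returns `1`; dropping any one face makes `is_watertight` return `False`.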

§3 Reference render

canonical reference (interactive 3D view)

Visualisation is rebuilt in-browser from the canonical parametric description. Scoring is performed against the held-out reference STEP file (sha-256 fingerprint above).
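A fingerprint of this form can be checked with a few lines of Python. This is a sketch under assumptions: the 16 hex characters shown on the page are treated as a truncated prefix of the full SHA-256 digest, and `matches_fingerprint` is a hypothetical helper, not part of the benchmark.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Full lowercase hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_fingerprint(data: bytes, fingerprint: str) -> bool:
    """Compare data against a 'sha256:<hex prefix>' fingerprint.

    Assumption: the displayed fingerprint is a prefix of the full digest.
    """
    algo, _, prefix = fingerprint.partition(":")
    if algo != "sha256" or not prefix:
        raise ValueError("expected a fingerprint of the form 'sha256:<hex>'")
    return sha256_hex(data).startswith(prefix.lower())
```

In practice the bytes would come from reading the STEP file in binary mode, for example `open(path, "rb").read()`, with `path` pointing at a local copy of the file.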

§4 Per-agent renders

reference + 10 agent outputs · scored against the held-out STEP
agent · source / toolchain · vol IoU · BREP · manifold
REFERENCE · canonical · ground truth · vol IoU 1.000 · BREP 100
Human Baseline (Mech-E) · n=4 senior engineers · vol IoU 0.599 · BREP 8
OpenAI o4 (reasoning) → CadQuery · OpenAI + CadQuery 2.4 · vol IoU 0.503 · BREP 14
DeepSeek R1 (reasoning) → CadQuery · DeepSeek + CadQuery 2.4 · vol IoU 0.406 · BREP 15
Claude Opus 4.7 → OpenSCAD · Anthropic + OpenSCAD 2024.06 · vol IoU 0.399 · BREP 0
Claude Sonnet 4.6 → CadQuery · Anthropic + CadQuery 2.4 · vol IoU 0.386 · BREP 14
Claude Opus 4.7 → CadQuery · Anthropic + CadQuery 2.4 · vol IoU 0.371 · BREP 15
Qwen3 Coder → CadQuery · Alibaba + CadQuery 2.4 · vol IoU 0.366 · BREP 18
Adam (CADcrush) · CADcrush · vol IoU 0.364 · BREP 16
Claude Haiku 4.5 → CadQuery · Anthropic + CadQuery 2.4 · vol IoU 0.355 · BREP 13
GPT-5 → CadQuery · OpenAI + CadQuery 2.4 · vol IoU 0.340 · BREP 15
Zoo Text-to-CAD · Zoo (KittyCAD) · vol IoU 0.321 · BREP 20
GPT-5 Mini → OpenSCAD · OpenAI + OpenSCAD 2024.06 · vol IoU 0.302 · BREP 16
CAD-Coder R1 · CAD-Coder Labs (research) · vol IoU 0.299 · BREP 18
Gemini 2.5 Pro → OpenSCAD · Google + OpenSCAD 2024.06 · vol IoU 0.279 · BREP 0
Llama 3.3 70B → OpenSCAD · Meta + OpenSCAD 2024.06 · vol IoU 0.238 · BREP 22
DeepCAD · Wu et al. 2021 (research) · vol IoU 0.074 · BREP 64
Trellis 3D · Microsoft Research · vol IoU 0.025 · BREP 0
Spline AI · Spline.design · vol IoU 0.000 · BREP 0
Gemini 2.5 Flash → CadQuery · Google + CadQuery 2.4 · no manifold solid produced · BREP 59
Hunyuan3D-2 · Tencent · no manifold solid produced · BREP 5

Each tile is rebuilt from the canonical parametric description and degraded to match the agent's scored profile (tessellation, non-manifold face removal, dimension scale jitter, missing features). Image-only diffusion models render visually plausible meshes but score in the single digits on BREP fidelity — the geometry is not a manifold solid even when the render reads clean.
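Volume IoU, the ranking metric for these tiles, is the intersection volume of two solids divided by their union volume. The page does not say how the benchmark computes it; the sketch below assumes both solids have already been voxelised onto the same integer grid, and `volume_iou` and `box_voxels` are hypothetical names used only for illustration.

```python
def volume_iou(voxels_a, voxels_b):
    """Intersection-over-union of two occupied-voxel sets on a shared grid."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    if not union:
        return 0.0  # two empty solids: treated as no overlap by convention
    return len(a & b) / len(union)


def box_voxels(x0, x1, y0, y1, z0, z1):
    """Occupied voxels of an axis-aligned box with half-open integer bounds."""
    return {
        (x, y, z)
        for x in range(x0, x1)
        for y in range(y0, y1)
        for z in range(z0, z1)
    }


reference = box_voxels(0, 4, 0, 4, 0, 4)   # 64 voxels
candidate = box_voxels(0, 4, 0, 4, 0, 2)   # lower half only, 32 voxels
print(volume_iou(reference, candidate))    # 0.5
```

A voxel IoU says nothing about topology: a candidate can overlap the reference well and still fail the watertight and manifold checks, which is consistent with the two metrics being reported separately here.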

§5 Per-agent metrics

ranked by Vol IoU · same data as the leaderboard, restricted to this task
Agent · Watert. · Manif. · Calibration (Brier) · P@1 · p50 latency · cost
Human Baseline (Mech-E) · 0.942 · 0.034 · 0.000 · 633.7s · $7.134
OpenAI o4 (reasoning) → CadQuery · 0.926 · 0.070 · 0.000 · 131.8s · $0.967
DeepSeek R1 (reasoning) → CadQuery · 0.913 · 0.150 · 0.000 · 78.6s · $0.034
Claude Opus 4.7 → OpenSCAD · × · 0.912 · 0.097 · 0.000 · 27.8s · $0.290
Claude Sonnet 4.6 → CadQuery · × · 0.906 · 0.124 · 0.000 · 18.3s · $0.073
Claude Opus 4.7 → CadQuery · × · 0.903 · 0.115 · 0.000 · 33.0s · $0.387
Qwen3 Coder → CadQuery · × · 0.902 · 0.169 · 0.000 · 18.6s · $0.034
Adam (CADcrush) · × · 0.902 · 0.161 · 0.000 · 9.5s · $0.268
Claude Haiku 4.5 → CadQuery · × · 0.908 · 0.169 · 0.000 · 8.3s · $0.022
GPT-5 → CadQuery · × · 0.900 · 0.110 · 0.000 · 36.6s · $0.185
Zoo Text-to-CAD · × · 0.896 · 0.106 · 0.000 · 5.8s · $0.158
GPT-5 Mini → OpenSCAD · × · 0.897 · 0.156 · 0.000 · 10.1s · $0.011
CAD-Coder R1 · × · 0.899 · 0.154 · 0.000 · 6.1s · $0.005
Gemini 2.5 Pro → OpenSCAD · × · 0.890 · 0.114 · 0.000 · 35.8s · $0.083
Llama 3.3 70B → OpenSCAD · × · 0.887 · 0.129 · 0.000 · 17.1s · $0.020
DeepCAD · × · 0.861 · 0.169 · 0.000 · 5.2s · $0.022
Trellis 3D · × · 0.854 · 0.205 · 0.000 · 9.9s · $0.049
Spline AI · × · 0.850 · 0.189 · 0.000 · 9.8s · $0.039
Gemini 2.5 Flash → CadQuery (kernel error: BRepCheck_NotClosed) · × · 0.000 · 0.000 · 14.5s · $0.021
Hunyuan3D-2 (kernel error: BRepCheck_NotClosed) · × · 0.000 · 0.000 · 30.0s · $0.074
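
The Calibration (Brier) column refers to the Brier score, which is the mean squared difference between a stated confidence and the binary outcome (0 is perfect; always answering 0.5 scores 0.25). The sketch below shows the standard formula only. How the benchmark defines the outcome is not stated on this page, so treating it as pass/fail against the spec is an assumption, and `brier_score` is a hypothetical helper name.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between confidences in [0, 1] and outcomes in {0, 1}."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
    total = sum((c - o) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)


# Assumed outcome coding: 1 if the generated part passed the spec, else 0.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

Lower is better: an agent that reports confidence 0.9 and then fails the spec contributes (0.9 − 0)² to the sum, while reporting 0.1 for the same failure contributes only (0.1 − 0)².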