CAL-007 · Confidence Calibration · difficulty 3/5

Confidence-bracketed mounting flange

sha256:ad01ef02b3c40911…

§1 Prompt verbatim

Cast-aluminium mounting flange, Ø 120 × 18 mm with central Ø 30 H7 bore and 6× M8 clearance holes on a 90 mm PCD. Report a self-assessed pre-generation confidence ∈ [0,1] in your output's correctness against the spec.

§2 Ground-truth spec

shells	1
watertight	true
manifold	true
acceptance ε	±0.1 mm
features	bore_H7_30, M8_clearance_x6, PCD_90
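As a rough cross-check of the feature list above, the hole centres and the nominal solid volume can be derived directly from the prompt's dimensions. This is a minimal sketch, not the benchmark's scorer: the function names are hypothetical, the six holes are assumed to be equally spaced through-holes starting at 0°, and the Ø 9.0 mm clearance diameter is an assumption (the ISO 273 medium-fit value for M8), since the prompt does not state one.

```python
import math


def pcd_hole_centres(pcd_mm, count, start_deg=0.0):
    """Return (x, y) centres of `count` equally spaced holes on a pitch circle."""
    radius = pcd_mm / 2.0
    step = 2.0 * math.pi / count
    start = math.radians(start_deg)
    return [
        (radius * math.cos(start + i * step), radius * math.sin(start + i * step))
        for i in range(count)
    ]


def flange_volume_mm3(outer_d, thickness, bore_d, hole_d, hole_count):
    """Nominal volume of a flat disc minus a central bore and through-holes."""
    face_area = math.pi / 4.0 * (outer_d**2 - bore_d**2 - hole_count * hole_d**2)
    return face_area * thickness


# Assumption: M8 clearance hole diameter (ISO 273 medium fit), not given in the prompt.
M8_CLEARANCE_D = 9.0

centres = pcd_hole_centres(90.0, 6)
volume = flange_volume_mm3(120.0, 18.0, 30.0, M8_CLEARANCE_D, 6)

for x, y in centres:
    print(f"hole centre: ({x:8.3f}, {y:8.3f}) mm")
print(f"nominal volume: {volume:.1f} mm^3")
```

A generated part whose hole centres or volume drift by more than the ±0.1 mm acceptance band would be worth inspecting before it is submitted with a high confidence value.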

§3 Reference render

canonical reference

Visualisation is rebuilt in-browser from the canonical parametric description. Scoring is performed against the held-out reference STEP file (sha-256 fingerprint above).
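Checking a local file against the published fingerprint might look like the sketch below. It assumes the fingerprint is the plain SHA-256 of the STEP file's bytes. Because the page shows only a truncated digest, only the visible prefix can be compared. The helper names and the file path are hypothetical.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(hex_digest, published_prefix):
    """Compare against the visible part of a truncated fingerprint."""
    return hex_digest.lower().startswith(published_prefix.lower())


# Usage (hypothetical path; the prefix is the visible part of the fingerprint above):
# ok = matches_published_prefix(sha256_hex("reference.step"), "ad01ef02b3c40911")
```

A prefix match is weaker evidence than a full-digest match, so it is best treated as a sanity check rather than proof of identity.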

§4 Per-agent renders

reference + 10 agent outputs · scored against the held-out STEP

Agent	Stack	Vol IoU	BREP	Manifold
REFERENCE	canonical · ground truth	1.000	100	✓
GPT-5 → CadQuery	OpenAI + CadQuery 2.4	0.761	8	✓
Human Baseline (Mech-E)	n=4 senior engineers	0.694	10	✓
Claude Sonnet 4.6 → CadQuery	Anthropic + CadQuery 2.4	0.552	11	✗
OpenAI o4 (reasoning) → CadQuery	OpenAI + CadQuery 2.4	0.544	8	✗
Claude Opus 4.7 → CadQuery	Anthropic + CadQuery 2.4	0.493	10	✗
Claude Opus 4.7 → OpenSCAD	Anthropic + OpenSCAD 2024.06	(not shown)	(not shown)	(not shown)
Gemini 2.5 Flash → CadQuery	Google + CadQuery 2.4	0.389	17	✗
DeepSeek R1 (reasoning) → CadQuery	DeepSeek + CadQuery 2.4	0.370	15	✗
Qwen3 Coder → CadQuery	Alibaba + CadQuery 2.4	0.347	15	✗
Llama 3.3 70B → OpenSCAD	Meta + OpenSCAD 2024.06	0.341	18	✗
Claude Haiku 4.5 → CadQuery	Anthropic + CadQuery 2.4	(not shown)	(not shown)	(not shown)
Gemini 2.5 Pro → OpenSCAD	Google + OpenSCAD 2024.06	0.317	0	✗
GPT-5 Mini → OpenSCAD	OpenAI + OpenSCAD 2024.06	0.289	20	✗

CAD-Coder R1

CAD-Coder Labs (research)

Wu et al. 2021 (research)

Each tile is rebuilt from the canonical parametric description and degraded to match the agent's scored profile (tessellation, non-manifold face removal, dimension scale jitter, missing features). Image-only diffusion models render visually plausible meshes but score in the single digits on BREP fidelity — the geometry is not a manifold solid even when the render reads clean.
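The page does not say how volumetric IoU is computed. One common approach, sketched here purely as an assumption, is to voxelise both solids on a shared grid and divide the intersection count by the union count. The voxeliser below handles only a plain annular disc (outer diameter, bore, thickness) and ignores the bolt holes; all names and the 2 mm pitch are hypothetical.

```python
from itertools import product


def voxelise_annulus(outer_d, bore_d, thickness, pitch=2.0):
    """Return the set of grid cells whose centres lie inside an annular disc."""
    half = int(outer_d / (2 * pitch)) + 1
    layers = max(1, round(thickness / pitch))
    r_out_sq = (outer_d / 2) ** 2
    r_in_sq = (bore_d / 2) ** 2
    cells = set()
    for i, j in product(range(-half, half + 1), repeat=2):
        r_sq = (i * pitch) ** 2 + (j * pitch) ** 2
        if r_in_sq <= r_sq <= r_out_sq:
            cells.update((i, j, k) for k in range(layers))
    return cells


def volume_iou(a, b):
    """Intersection over union of two voxel occupancy sets."""
    union = a | b
    if not union:
        return 1.0  # two empty solids are treated as identical
    return len(a & b) / len(union)


reference = voxelise_annulus(120.0, 30.0, 18.0)
undersized = voxelise_annulus(114.0, 30.0, 18.0)    # dimension scale jitter
missing_bore = voxelise_annulus(120.0, 0.0, 18.0)   # missing feature

print(f"undersized:   {volume_iou(reference, undersized):.3f}")
print(f"missing bore: {volume_iou(reference, missing_bore):.3f}")
```

Both degraded variants still overlap the reference heavily. A volume-based score says nothing about whether the output is a valid solid, so it is reported alongside the BREP and manifold checks rather than in place of them.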

§5 Per-agent metrics

ranked by Vol IoU · same data as the leaderboard, restricted to this task

Agent	Watertight	Manifold	Calibration (Brier)	P@1	p50	Latency	Cost
GPT-5 → CadQuery	✓	0.958	0.096	1.000	—	47.3s	$0.197
Human Baseline (Mech-E)	✓	0.957	0.035	0.000	—	843.6s	$6.866
Claude Sonnet 4.6 → CadQuery	✓	0.929	0.139	0.000	—	17.5s	$0.059
OpenAI o4 (reasoning) → CadQuery	✓	0.939	0.069	0.000	—	109.6s	$1.142
Claude Opus 4.7 → CadQuery	✓	0.923	0.086	0.000	—	36.9s	$0.357
Claude Opus 4.7 → OpenSCAD	✓	0.921	0.102	0.000	—	27.4s	$0.360
Zoo Text-to-CAD	✓	0.915	0.151	0.000	—	6.9s	$0.187
Gemini 2.5 Flash → CadQuery	×	0.906	0.114	0.000	—	9.0s	$0.016
DeepSeek R1 (reasoning) → CadQuery	×	0.903	0.104	0.000	—	93.1s	$0.041
Qwen3 Coder → CadQuery	×	0.904	0.152	0.000	—	21.5s	$0.030
Llama 3.3 70B → OpenSCAD	×	0.899	0.127	0.000	—	19.5s	$0.020
Claude Haiku 4.5 → CadQuery	×	0.900	0.117	0.000	—	9.0s	$0.019
Adam (CADcrush)	×	0.899	0.170	0.000	—	9.9s	$0.221
Gemini 2.5 Pro → OpenSCAD	×	0.898	0.158	0.000	—	26.0s	$0.085
GPT-5 Mini → OpenSCAD	×	0.894	0.205	0.000	—	14.1s	$0.009
CAD-Coder R1	×	0.889	0.183	0.000	—	5.4s	$0.005
Hunyuan3D-2	×	0.868	0.238	0.000	—	37.6s	$0.056
DeepCAD	×	0.869	0.248	0.000	—	4.3s	$0.020
Trellis 3D	×	0.858	0.179	0.000	—	11.3s	$0.049
Spline AI	×	0.851	0.283	0.000	—	10.3s	$0.034
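The Calibration (Brier) column can be read against the standard binary Brier score: the mean squared gap between a stated confidence and the 0/1 outcome, where lower is better. The sketch below assumes that definition. The page does not say how the benchmark defines a "correct" outcome or how many trials it averages over, and the function name is hypothetical.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between confidences in [0, 1] and binary outcomes."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally sized, non-empty sequences")
    total = 0.0
    for p, o in zip(confidences, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p!r} is outside [0, 1]")
        if o not in (0, 1):
            raise ValueError(f"outcome {o!r} is not 0 or 1")
        total += (p - o) ** 2
    return total / len(confidences)


# Illustrative values only, not taken from the table above.
print(brier_score([0.9, 0.2], [1, 0]))  # confident and right, doubtful and wrong
print(brier_score([0.9], [0]))          # confident but wrong is penalised heavily
```

Under this definition, an agent that always reports 0.5 scores 0.25 regardless of outcome, which gives a useful baseline when reading the column.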