CAL-003 · Confidence Calibration · difficulty 5/5

Confidence-calibrated planetary carrier

sha256:2b97cc4d1ef0aa55…

§1 Prompt verbatim

Build the planetary carrier of MECH-027 and report a self-assessed pre-generation confidence ∈ [0,1] in your output's correctness against the spec.

§2 Ground-truth spec

shells: 1

watertight: true

manifold: true

acceptance ε: ±0.05 mm
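The spec fields above can be illustrated on a triangle mesh. The following is a minimal sketch, not the benchmark's checker: the scoring described on this page runs against a STEP file in a CAD kernel, whereas this example assumes a plain list of triangle index triples. The names `is_closed_edge_manifold`, `count_shells`, and `within_tolerance` are hypothetical helpers introduced for illustration.

```python
from collections import Counter


def _edges(face):
    """Yield the three undirected edges of a triangle as sorted index pairs."""
    a, b, c = face
    for u, v in ((a, b), (b, c), (c, a)):
        yield (min(u, v), max(u, v))


def is_closed_edge_manifold(faces):
    """True if every edge is shared by exactly two triangles.

    This is a mesh-level stand-in for the 'watertight' and 'manifold' flags;
    it does not detect non-manifold vertices or self-intersections.
    """
    counts = Counter(e for f in faces for e in _edges(f))
    return bool(counts) and all(n == 2 for n in counts.values())


def count_shells(faces):
    """Count groups of triangles connected through shared edges (union-find)."""
    parent = list(range(len(faces)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_face_on_edge = {}
    for idx, face in enumerate(faces):
        for edge in _edges(face):
            if edge in first_face_on_edge:
                parent[find(idx)] = find(first_face_on_edge[edge])
            else:
                first_face_on_edge[edge] = idx
    return len({find(i) for i in range(len(faces))})


def within_tolerance(measured_mm, nominal_mm, eps_mm=0.05):
    """Acceptance check for one dimension, using the spec's ±0.05 mm band."""
    return abs(measured_mm - nominal_mm) <= eps_mm


# A tetrahedron: the smallest closed triangle mesh.
TETRA = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_closed_edge_manifold(TETRA), count_shells(TETRA))  # True 1
```

Dropping any one face from `TETRA` leaves three edges used only once, so the closed-surface check fails, which is the mesh analogue of a part that is not watertight.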

§3 Reference render

canonical reference

Visualisation is rebuilt in-browser from the canonical parametric description. Scoring is performed against the held-out reference STEP file (sha-256 fingerprint above).
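A reader who holds a candidate copy of the reference file could compare it against the published fingerprint roughly as follows. This is a sketch under stated assumptions: the page shows only a truncated digest, so the comparison is against the visible prefix, and `sha256_of_file` and `matches_fingerprint` are hypothetical helper names, not part of the benchmark tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_fingerprint(path, published_prefix):
    """Compare a file's digest with the visible prefix of a published one.

    A prefix match is weaker evidence than a full-digest match; it is used
    here only because the page truncates the fingerprint.
    """
    return sha256_of_file(path).startswith(published_prefix.lower())
```

Usage would look like `matches_fingerprint("reference.step", "2b97cc4d1ef0aa55")`, where the file name is a placeholder and the prefix is the part of the digest shown at the top of this page.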

§4 Per-agent renders

reference + 10 agent outputs · scored against the held-out STEP

vol IoU · BREP · manifold

REFERENCE

canonical · ground truth

1.000 · 100 · ✓

Human Baseline (Mech-E)

n=4 senior engineers

0.599 · 8 · ✗

OpenAI o4 (reasoning) → CadQuery

OpenAI + CadQuery 2.4

0.503 · 14 · ✗

DeepSeek R1 (reasoning) → CadQuery

DeepSeek + CadQuery 2.4

0.406 · 15 · ✗

Claude Opus 4.7 → OpenSCAD

Anthropic + OpenSCAD 2024.06

0.399 · 0 · ✗

Claude Sonnet 4.6 → CadQuery

Anthropic + CadQuery 2.4

0.386 · 14 · ✗

Claude Opus 4.7 → CadQuery

Anthropic + CadQuery 2.4

0.371 · 15 · ✗

Qwen3 Coder → CadQuery

Alibaba + CadQuery 2.4

Claude Haiku 4.5 → CadQuery

Anthropic + CadQuery 2.4

0.355 · 13 · ✗

GPT-5 → CadQuery

OpenAI + CadQuery 2.4

GPT-5 Mini → OpenSCAD

OpenAI + OpenSCAD 2024.06

0.302 · 16 · ✗

CAD-Coder R1

CAD-Coder Labs (research)

0.299 · 18 · ✗

Gemini 2.5 Pro → OpenSCAD

Google + OpenSCAD 2024.06

0.279 · 0 · ✗

Llama 3.3 70B → OpenSCAD

Meta + OpenSCAD 2024.06

0.238 · 22 · ✗

DeepCAD

Wu et al. 2021 (research)

Gemini 2.5 Flash → CadQuery

Google + CadQuery 2.4

— · 59 · ✗ · no manifold solid produced

Hunyuan3D-2

Tencent

— · 5 · ✗ · no manifold solid produced

Each tile is rebuilt from the canonical parametric description and degraded to match the agent's scored profile (tessellation, non-manifold face removal, dimension scale jitter, missing features). Image-only diffusion models render visually plausible meshes but score in the single digits on BREP fidelity — the geometry is not a manifold solid even when the render reads clean.
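The vol IoU figure on each tile is an intersection-over-union of solid volumes. The page does not say how the benchmark computes it, so the following is only an illustrative sketch that assumes both solids have already been voxelised onto the same integer grid; `box_voxels` and `volumetric_iou` are hypothetical names.

```python
from itertools import product


def box_voxels(x0, x1, y0, y1, z0, z1):
    """Set of unit voxels (by integer corner) filling an axis-aligned box."""
    return set(product(range(x0, x1), range(y0, y1), range(z0, z1)))


def volumetric_iou(a, b):
    """Intersection-over-union of two voxel sets on a shared grid.

    Returns 0.0 when both sets are empty; that convention is a choice made
    for this sketch, not something stated by the benchmark.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two 4x2x2 boxes offset by 2 voxels along x: 8 shared voxels, 24 in the union.
reference = box_voxels(0, 4, 0, 2, 0, 2)
candidate = box_voxels(2, 6, 0, 2, 0, 2)
print(volumetric_iou(reference, candidate))
```

A candidate that reproduces the reference exactly scores 1.0, matching the REFERENCE tile. A clean-looking render can still score low if features are missing or dimensions are scaled.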

§5 Per-agent metrics

ranked by Vol IoU · same data as the leaderboard, restricted to this task

Agent	Watertight	Manifold	Calibration (Brier)	p50	latency	cost
Human Baseline (Mech-E)	✓	0.942	0.034	—	633.7s	$7.134
OpenAI o4 (reasoning) → CadQuery	✓	0.926	0.070	—	131.8s	$0.967
DeepSeek R1 (reasoning) → CadQuery	✓	0.913	0.150	—	78.6s	$0.034
Claude Opus 4.7 → OpenSCAD	×	0.912	0.097	—	27.8s	$0.290
Claude Sonnet 4.6 → CadQuery	×	0.906	0.124	—	18.3s	$0.073
Claude Opus 4.7 → CadQuery	×	0.903	0.115	—	33.0s	$0.387
Qwen3 Coder → CadQuery	×	0.902	0.169	—	18.6s	$0.034
Adam (CADcrush)	×	0.902	0.161	—	9.5s	$0.268
Claude Haiku 4.5 → CadQuery	×	0.908	0.169	—	8.3s	$0.022
GPT-5 → CadQuery	×	0.900	0.110	—	36.6s	$0.185
Zoo Text-to-CAD	×	0.896	0.106	—	5.8s	$0.158
GPT-5 Mini → OpenSCAD	×	0.897	0.156	—	10.1s	$0.011
CAD-Coder R1	×	0.899	0.154	—	6.1s	$0.005
Gemini 2.5 Pro → OpenSCAD	×	0.890	0.114	—	35.8s	$0.083
Llama 3.3 70B → OpenSCAD	×	0.887	0.129	—	17.1s	$0.020
DeepCAD	×	0.861	0.169	—	5.2s	$0.022
Trellis 3D	×	0.854	0.205	—	9.9s	$0.049
Spline AI	×	0.850	0.189	—	9.8s	$0.039
Gemini 2.5 Flash → CadQuery (kernel error: BRepCheck_NotClosed)	×	0.000	—	—	14.5s	$0.021
Hunyuan3D-2 (kernel error: BRepCheck_NotClosed)	×	0.000	—	—	30.0s	$0.074
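The Calibration (Brier) column can be read with the standard Brier score in mind: the mean squared gap between a stated confidence and the binary outcome, where lower is better. The sketch below assumes each attempt is scored pass (1) or fail (0) against the spec. The page does not state how outcomes are binarised or aggregated, so treat that as an assumption; `brier_score` is a hypothetical helper.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between confidences in [0, 1] and 0/1 outcomes."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally long, non-empty sequences")
    if any(not 0.0 <= p <= 1.0 for p in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(confidences)


# An agent that always says 0.5 scores 0.25 regardless of the outcomes.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

Under this reading, an agent that reports high confidence and then fails the spec is penalised more heavily than one that reports low confidence and fails.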