NVIDIA's Nemotron-3-Super sits at the top of PinchBench's average score leaderboard at 84.7% — ahead of Claude Opus 4.6 (80.8%), GPT-5.4 (80.5%), and every other frontier model tested. For a model that most people haven't heard of, that's a statement.
But look closer and the story gets more nuanced than "NVIDIA beats everyone."
The Architecture Advantage
Nemotron-3-Super is a hybrid Mamba-2 + Transformer + Latent Mixture-of-Experts model. Total parameters: 120 billion. Active per token: 12.7 billion. That's the key ratio — it carries the knowledge breadth of a 120B model while firing only a fraction of its parameters on each inference step.
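That active-to-total ratio is the essence of sparse expert routing: a router scores every expert but only the top few actually run per token. A toy sketch of the idea (all sizes and the scoring function here are invented for illustration, not Nemotron's actual configuration):

```python
# Toy sketch of sparse expert activation: a router scores all experts,
# but only the top-k highest-scoring ones fire for each token.
# Expert counts and sizes are invented, not Nemotron's real config.

def route(scores, k):
    """Return the indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

total_experts = 64
active_experts = 8            # hypothetical top-k
params_per_expert = 1.5e9     # invented size

# Stand-in router logits; a real router computes these from the token.
scores = [((i * 37) % 64) / 64 for i in range(total_experts)]
active = route(scores, active_experts)

active_params = active_experts * params_per_expert
total_params = total_experts * params_per_expert
print(f"active fraction: {active_params / total_params:.1%}")  # → active fraction: 12.5%
```

The knowledge lives in all 64 experts, but each forward pass pays for only 8 of them; that is how a 120B model can run with roughly 12B-model inference cost.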
The Mamba-2 layers give it linear scaling on long contexts (up to 1 million tokens), which is exactly what agents need when processing codebases, logs, or documentation. The LatentMoE routing keeps it focused — instead of activating everything like a dense model, it selects the relevant experts and ignores the rest.
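Why linear scaling matters at million-token contexts comes down to simple growth-rate arithmetic. A back-of-the-envelope comparison (constant factors omitted; only the growth rates matter):

```python
# Full attention cost grows quadratically with context length, while
# state-space (Mamba-style) layers grow linearly. Constants are dropped;
# this only illustrates how the two costs diverge as context grows.

def attention_cost(n):
    return n * n   # O(n^2): every token attends to every other token

def ssm_cost(n):
    return n       # O(n): one state update per token

short, long = 8_000, 1_000_000   # 125x longer context
print(attention_cost(long) / attention_cost(short))  # → 15625.0
print(ssm_cost(long) / ssm_cost(short))              # → 125.0
```

Going from an 8K to a 1M context makes attention ~15,625x more expensive but the state-space layers only 125x, which is the gap that makes whole-codebase and log-file contexts practical.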
The practical result: inference speeds of up to 484 tokens per second on serverless providers, roughly 10% more throughput per GPU than comparable open models. For an agent that needs to iterate, retry, and call tools in rapid succession, that speed compounds.
100% on the Boring Stuff
PinchBench's 23 tasks span everything from creative writing to file management to web research. Nemotron doesn't win by being brilliant at all of them. It wins by being perfect at the predictable ones.
Across three benchmark runs:
- Project scaffolding: 100% every time
- Calendar event creation: 100% every time
- Stock price research: 100% every time
- OpenClaw report comprehension: 100% every time
- CSV/Excel summarization: 98%, 100%, 100%
When half the benchmark tests deterministic, multi-step agent workflows — create this file, parse that data, navigate this structure — a model that never drops a step accumulates an enormous average advantage.
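The averaging effect is plain arithmetic. A hypothetical illustration (the per-task scores below are invented to land near the reported average, not the actual PinchBench results):

```python
# How acing the deterministic tasks lifts a 23-task average.
# These per-task scores are invented for illustration only.

deterministic = [1.00] * 15   # structured tasks, never drops a step
open_ended = [0.27, 0.45, 0.55, 0.60, 0.65, 0.50, 0.70, 0.76]  # uneven results

scores = deterministic + open_ended
average = sum(scores) / len(scores)
print(f"{average:.1%}")  # → 84.7%
```

A model that is merely decent at the open-ended third but perfect on the structured two-thirds beats a model that is strong everywhere but drops a file here and a parse step there.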
Where It Falls Apart
Nemotron-3-Super scored 27% on AI image generation. It tried to call the Pollinations API, got an error, then spent its remaining budget exploring ClawHub skills instead of completing the task. The grading gave it partial credit for prompt crafting and zero for the actual deliverable.
This is revealing. The model excels at structured, well-defined tasks with clear success criteria. Hand it something that requires creative tool discovery or graceful failure recovery, and the robotic consistency becomes robotic rigidity.
For comparison: Claude Opus 4.6 scored zero outright on image generation in its PinchBench run, as did most models. That, as Guido noted, raises a legitimate question about whether the benchmark's image task is testing the models or testing the test.
The "Native" Effect
There's a critical detail in the numbers. When Nemotron-3-Super runs through a proxy that injects artificial reasoning budgets or parameters, its average drops to 79.4%. Let the model run natively — with its own default behavior and its three built-in reasoning modes (off, low-effort, regular) — and the average jumps to 84.7%.
Those three reasoning gears are the model's secret weapon for agent work. Simple file operations get reasoning-off. Complex project scaffolding gets regular. The model dynamically adjusts its own compute budget per task, which is exactly what an agent needs: don't overthink the easy stuff, don't underthink the hard stuff.
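One way an agent harness might exploit those gears is a per-task dispatcher. This is a hypothetical sketch: the three mode names come from the article, but the task taxonomy and the mapping are invented, not Nemotron's actual API.

```python
# Hypothetical dispatcher: choose a reasoning mode per task type so
# easy tasks skip reasoning and hard ones get the full budget.
# Mode names (off / low / regular) follow the article; the task types
# and mapping are invented for illustration.

REASONING_MODE = {
    "file_op": "off",          # deterministic, no reasoning needed
    "data_parse": "off",
    "calendar": "low",
    "web_research": "low",
    "scaffolding": "regular",  # multi-step planning benefits from reasoning
    "creative": "regular",
}

def pick_mode(task_type: str) -> str:
    # Default to the full budget for unknown task types:
    # better to overthink than to underthink.
    return REASONING_MODE.get(task_type, "regular")

print(pick_mode("file_op"))      # → off
print(pick_mode("scaffolding"))  # → regular
```

The point is the asymmetry of the default: misclassifying a hard task as easy costs correctness, while misclassifying an easy task as hard only costs a little latency.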
What This Actually Means
Nemotron-3-Super doesn't top the best-run leaderboard — that goes to Claude Sonnet 4.6 at 86.9%. It doesn't have the deepest reasoning capability (AI Intelligence Index: 36 vs. Qwen 3.5's 42). It's not going to write you the most eloquent essay or have the most insightful conversation.
What it does is never screw up a CSV parse. Never botch a file structure. Never fumble a calendar event. And when your benchmark averages across 23 tasks and a model aces 15 of them with machine-like precision, the math works out.
For OpenClaw deployments where the agent's job is predominantly structured — data processing, file management, report generation, tool orchestration — Nemotron-3-Super is arguably the optimal choice. It's the model that treats agent work like assembly-line production: precise, fast, cheap, boring, and exactly right.
For everything else, you still want a frontier model that can think.