Most LLM benchmarks test models in isolation — pure reasoning, code generation, knowledge retrieval. PinchBench does something different: it tests models as they actually run inside OpenClaw. Same tools, same workspace, same constraints a real agent faces.
The result is a leaderboard that answers the question developers actually care about: which model should I plug into my OpenClaw agent?
## 23 Tasks, Three Grading Modes
PinchBench's task suite covers the full spectrum of what an OpenClaw agent does in practice:
- File operations: creating project structures, search-and-replace across files
- Data work: CSV/Excel summarization, extracting facts from documents
- Web research: stock prices, conference lookups, competitive analysis
- Creative output: blog posts, professional emails, ELI5 summaries
- Tool use: installing ClawHub skills, generating images, email triage
- Memory: persisting and recalling knowledge across sessions
Each task is defined as a markdown file with YAML frontmatter containing the exact prompt, expected behavior, and grading criteria. Grading happens three ways: automated (Python functions checking files and transcripts), LLM judge (Claude Opus scoring against detailed rubrics), or hybrid (both).
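For illustration, a task file might look something like this (the field names and layout here are invented for the sketch; the actual PinchBench schema may differ):

```markdown
---
id: project-scaffold          # hypothetical field names throughout
prompt: |
  Create a Python package named `demo` with a src/ layout,
  a pyproject.toml, and an empty test suite.
expected_behavior: |
  The agent creates the files with its file tools, without
  asking follow-up questions.
grading:
  mode: automated             # automated | llm_judge | hybrid
  checks:
    - file_exists: pyproject.toml
    - file_exists: src/demo/__init__.py
---

Free-form notes for graders go in the markdown body.
```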
The benchmark is versioned by git commit hash, so every result links to the exact task definitions and grading logic used. When scoring criteria change, a new generation begins — old results are preserved but kept separate.
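One straightforward way to keep generations separate is to key every result record on the benchmark commit it ran against. A minimal sketch (the record fields and function name are assumptions, not PinchBench's actual schema):

```python
def bucket_by_generation(results: list[dict]) -> dict[str, list[dict]]:
    """Group result records by the benchmark commit they ran against,
    so results from old grading logic stay preserved but separate.

    Assumes each record carries a hypothetical 'benchmark_commit' key.
    """
    generations: dict[str, list[dict]] = {}
    for record in results:
        generations.setdefault(record["benchmark_commit"], []).append(record)
    return generations
```

Comparing scores across buckets is then an explicit choice rather than an accident of mixing generations.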
## The Current Leaderboard
As of March 2026, the average scores tell an interesting story:
| Model | Avg Score |
|---|---|
| NVIDIA Nemotron-3-Super-120B | 84.7% |
| Claude Opus 4.6 | 80.8% |
| GPT-5.4 | 80.5% |
| Qwen 3.5-397B | 80.5% |
| Claude Sonnet 4 | 80.5% |
| Kimi K2.5 | 80.1% |
The best single-run scores reshuffle things — Claude Sonnet 4.6 tops out at 86.9%, followed by Opus at 86.3% and GPT-5.4 at 86.0%. The gap between average and best-run performance reveals which models are consistent versus which are brilliant but erratic.
## What PinchBench Exposes
The most revealing finding isn't which model wins — it's where models fail. Image generation is a massacre across the board. Nearly every model scores near zero, which raises a fair question: is this a model problem or a benchmark design problem? When the test expects agents to generate images using tools that may not be available in the sandbox, you're arguably testing environment setup more than intelligence.
Meanwhile, the deterministic tasks — file creation, data parsing, calendar events — separate the reliable workhorses from the unreliable geniuses. A model that scores 100% on project scaffolding across three runs is telling you something different from one that hits 86% once and 72% twice.
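The average-versus-best comparison is cheap to compute once you have per-run scores. A minimal sketch (function name and data shape are my own, not PinchBench's):

```python
from statistics import mean

def summarize(runs: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """For each model, report (average, best) score across runs.
    A wide gap between the two flags a brilliant-but-erratic model;
    a narrow gap flags a consistent one."""
    return {model: (mean(scores), max(scores))
            for model, scores in runs.items()}

# A model that nails the task every run vs. one that spikes once:
summarize({"steady": [1.0, 1.0, 1.0], "erratic": [0.86, 0.72, 0.72]})
```

The "steady" model's average equals its best; the "erratic" one's best run overstates what you can expect on any given day.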
## Open Source, Open Methodology
The entire system is open source across three repos: pinchbench/skill (runner + tasks + grading), pinchbench/leaderboard (the Next.js frontend), and pinchbench/api (Cloudflare Workers backend). Anyone can submit new tasks via PR using the provided template.
This matters because the OpenClaw ecosystem desperately needed a benchmark that tests agent behavior, not just model capability. PinchBench is that benchmark. Check the live leaderboard and draw your own conclusions.