Most LLM benchmarks test models in isolation — pure reasoning, code generation, knowledge retrieval. PinchBench does something different: it tests models as they actually run inside OpenClaw. Same tools, same workspace, same constraints a real agent faces.
The result is a leaderboard that answers the question developers actually care about: which model should I plug into my OpenClaw agent?
## 23 Tasks, Three Grading Modes
PinchBench's task suite covers the full spectrum of what an OpenClaw agent does in practice:
- File operations: creating project structures, search-and-replace across files
- Data work: CSV/Excel summarization, extracting facts from documents
- Web research: stock prices, conference lookups, competitive analysis
- Creative output: blog posts, professional emails, ELI5 summaries
- Tool use: installing ClawHub skills, generating images, email triage
- Memory: persisting and recalling knowledge across sessions
Each task is defined as a markdown file with YAML frontmatter containing the exact prompt, expected behavior, and grading criteria. Grading happens three ways: automated (Python functions checking files and transcripts), LLM judge (Claude Opus scoring against detailed rubrics), or hybrid (both).
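For illustration, a task file might look something like this (the field names and layout here are invented for the sketch; the actual PinchBench schema may differ):

```markdown
---
id: project-scaffold          # hypothetical field names throughout
prompt: |
  Create a Python package named `demo` with a src/ layout,
  a pyproject.toml, and an empty test suite.
expected_behavior: |
  The agent creates the files with its file tools, without
  asking follow-up questions.
grading:
  mode: automated             # automated | llm_judge | hybrid
  checks:
    - file_exists: pyproject.toml
    - file_exists: src/demo/__init__.py
---

Free-form notes for graders go in the markdown body.
```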
The benchmark is versioned by git commit hash, so every result links to the exact task definitions and grading logic used. When scoring criteria change, a new generation begins — old results are preserved but kept separate.
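One straightforward way to keep generations separate is to key every result record on the benchmark commit it ran against. A minimal sketch (the record fields and function name are assumptions, not PinchBench's actual schema):

```python
def bucket_by_generation(results: list[dict]) -> dict[str, list[dict]]:
    """Group result records by the benchmark commit they ran against,
    so results from old grading logic stay preserved but separate.

    Assumes each record carries a hypothetical 'benchmark_commit' key.
    """
    generations: dict[str, list[dict]] = {}
    for record in results:
        generations.setdefault(record["benchmark_commit"], []).append(record)
    return generations
```

Comparing scores across buckets is then an explicit choice rather than an accident of mixing generations.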
## The Current Leaderboard
As of March 2026, the average scores tell an interesting story:
| Model | Avg Score |
|---|---|
| NVIDIA Nemotron-3-Super-120B | 84.7% |
| Claude Opus 4.6 | 80.8% |
| GPT-5.4 | 80.5% |
| Qwen 3.5-397B | 80.5% |
| Claude Sonnet 4 | 80.5% |
| Kimi K2.5 | 80.1% |
The best single-run scores reshuffle things — Claude Sonnet 4.6 tops out at 86.9%, followed by Opus at 86.3% and GPT-5.4 at 86.0%. The gap between average and best-run performance reveals which models are consistent versus which are brilliant but erratic.
## What PinchBench Exposes
The most revealing finding isn't which model wins — it's where models fail. Image generation is a massacre across the board. Nearly every model scores near zero, which raises a fair question: is this a model problem or a benchmark design problem? When the test expects agents to generate images using tools that may not be available in the sandbox, you're arguably testing environment setup more than intelligence.
Meanwhile, the deterministic tasks — file creation, data parsing, calendar events — separate the reliable workhorses from the unreliable geniuses. A model that scores 100% on project scaffolding across three runs is telling you something different from one that hits 86% once and 72% twice.
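The average-versus-best comparison is cheap to compute once you have per-run scores. A minimal sketch (function name and data shape are my own, not PinchBench's):

```python
from statistics import mean

def summarize(runs: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """For each model, report (average, best) score across runs.
    A wide gap between the two flags a brilliant-but-erratic model;
    a narrow gap flags a consistent one."""
    return {model: (mean(scores), max(scores))
            for model, scores in runs.items()}

# A model that nails the task every run vs. one that spikes once:
summarize({"steady": [1.0, 1.0, 1.0], "erratic": [0.86, 0.72, 0.72]})
```

The "steady" model's average equals its best; the "erratic" one's best run overstates what you can expect on any given day.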
## Open Source, Open Methodology
The entire system is open source across three repos: pinchbench/skill (runner + tasks + grading), pinchbench/leaderboard (the Next.js frontend), and pinchbench/api (Cloudflare Workers backend). Anyone can submit new tasks via PR using the provided template.
This matters because the OpenClaw ecosystem desperately needed a benchmark that tests agent behavior, not just model capability. PinchBench is that benchmark. Check the live leaderboard and draw your own conclusions.