Code change accuracy improved 55% in preliminary benchmark.
We recently ran our first controlled preliminary benchmark to measure how Devramp improves agent accuracy. Using
Claude Code (Sonnet-4) on a 100K-line Go codebase, we tested eight representative pull requests (10–40 files each).
For each PR, we generated a structured summary: a 300-word explanation of the functional and technical changes (with
no file or symbol references), distilled into a 30-word prompt. With Devramp, the agent's outputs matched the human
reference diffs far more closely: on average, accuracy improved by 55% and variability dropped by 18 percentage
points, making results both more accurate and more consistent.
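For concreteness, here is a minimal Go sketch of how the two headline aggregates could be computed from per-PR scores. Everything in it is an assumption for illustration: the post does not define its accuracy metric or publish per-PR data, so the scores below are invented (chosen only so the aggregates land near the reported +55% and -18pp), and a per-PR score is taken to be some similarity in [0,1] between the agent's diff and the human reference diff.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and standard deviation of a slice of per-PR scores.
func meanStd(scores []float64) (mean, std float64) {
	n := float64(len(scores))
	for _, s := range scores {
		mean += s
	}
	mean /= n
	for _, s := range scores {
		std += (s - mean) * (s - mean)
	}
	return mean, math.Sqrt(std / n)
}

func main() {
	// Invented per-PR accuracy scores for the eight PRs, with and without
	// Devramp. Not real data: values are chosen so the aggregates below
	// land near the headline figures.
	baseline := []float64{0.70, 0.15, 0.62, 0.10, 0.55, 0.20, 0.68, 0.28}
	devramp := []float64{0.64, 0.57, 0.70, 0.58, 0.67, 0.60, 0.72, 0.60}

	mb, sb := meanStd(baseline)
	md, sd := meanStd(devramp)

	// Relative improvement in mean accuracy, and the change in
	// variability (standard deviation) in percentage points.
	fmt.Printf("accuracy: %+.0f%%\n", (md-mb)/mb*100)  // ≈ +55%
	fmt.Printf("variability: %+.0f pp\n", (sd-sb)*100) // ≈ -18pp
}
```

Reading variability as the standard deviation of per-PR scores is itself an assumption; the post only says it dropped by 18 percentage points.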
| Date           | Type        | Agent/Model            | /devramp Accuracy | /devramp Variability |
|----------------|-------------|------------------------|-------------------|----------------------|
| 23rd Sept 2025 | preliminary | Claude Code (Sonnet-4) | +55%              | -18pp                |
| TBC            | benchmark   | Frontier Agent/Models  | TBC               | TBC                  |
Ready to make AI work in your complex codebase?
Without context, AI stumbles in complex codebases. Devramp makes it work!