If Your Coding Agents Don't Share Memory, You're Burning Money
Most teams running AI coding agents pay for the same knowledge over and over. We ran 200+ evaluation runs to show how shared memory cuts token usage by up to 87% and lifts pass rates from 70% to 100%.
Most teams running AI coding agents are paying for the same knowledge over and over again. Every session starts from scratch. Every agent re-discovers the same framework patterns, hits the same dead ends, burns through the same tokens your teammate's agent already burned through last week. It's wasteful, and on any non-trivial codebase, it kills reliability too.
You might think a static markdown file in your repo solves this. An agents.md, a .cursorrules, whatever. It doesn't. Those are single-point-in-time snapshots that someone has to write and maintain. They don't learn from agent sessions, they don't accumulate knowledge as your team works, and they definitely don't transfer what one model learned to a completely different model. They also won't scale to adequately cover the thousands of lines of code in your proprietary codebase. Personal memory tools like mem0 get closer, but they're scoped to one user and one model. Your team's agents are still siloed.
Spark is multiplayer, multi-model, and multi-domain. One engineer's agent figures out how your state machine framework works, and every agent after that — different engineer, different model, different problem — starts with that knowledge already in hand. It accumulates automatically as agents work, compounds over time, and transfers across model boundaries. We ran 5 rounds of evaluation to pressure-test these claims. The short version: agents with Spark passed more tests, used up to 87% fewer tokens, and the knowledge generalized to problems no agent had ever seen.
The benchmark
We built TeamForge, a Node.js project with intentionally weird framework patterns. Custom routing, validation, state machines, event handling, authorization. None of it matches what agents see in training data, so they can't coast on memorized patterns. They have to read the code and figure it out.
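To give a flavor of what "doesn't match training data" means, here's an invented illustration (not actual TeamForge code): routing declared as nested tuples and resolved by a hand-rolled matcher, so an agent can't coast on Express or Fastify conventions.

```javascript
// Invented illustration of an "unfamiliar" framework pattern — not real TeamForge code.
// Routes are [pattern, method, handler] tuples; params use :name segments.
const routes = [
  ["teams/:id/members", "POST", (ctx) => ({ added: ctx.params.id })],
];

function dispatch(routes, path, method) {
  for (const [pattern, m, handler] of routes) {
    if (m !== method) continue;
    const parts = pattern.split("/");
    const actual = path.split("/");
    if (parts.length !== actual.length) continue;
    const params = {};
    // A segment either matches literally or binds a :param to the actual value.
    const matched = parts.every((p, i) =>
      p.startsWith(":") ? (params[p.slice(1)] = actual[i], true) : p === actual[i]
    );
    if (matched) return handler({ params });
  }
  return null;
}

console.log(dispatch(routes, "teams/42/members", "POST")); // { added: '42' }
```

An agent that has only seen `app.post("/teams/:id/members", handler)` has to actually read `dispatch` to figure out how this works — which is the point of the benchmark.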
Each round compares two conditions:
- Control: Agent works alone, no shared memory
- Treatment: Agent has access to Spark, seeded with knowledge from previous sessions
We track pass rate (do tests pass?), tokens (compute cost), and wall-clock time.
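A minimal sketch of how a harness might tally those three metrics (names are hypothetical; this is not Spark's or the eval's actual code). The one subtlety is tokens-per-pass: failed attempts still burn tokens, so they inflate the cost of each passing run.

```javascript
// Hypothetical per-condition bookkeeping — illustrative only.
class ConditionStats {
  constructor() { this.runs = []; }
  record(passed, tokens, seconds) {
    this.runs.push({ passed, tokens, seconds });
  }
  get passRate() {
    return this.runs.filter((r) => r.passed).length / this.runs.length;
  }
  get tokensPerPass() {
    // All tokens spent, divided by passing runs only: failures raise the cost.
    const total = this.runs.reduce((sum, r) => sum + r.tokens, 0);
    const passes = this.runs.filter((r) => r.passed).length;
    return passes ? total / passes : Infinity;
  }
}

const control = new ConditionStats();
control.record(true, 1_200_000, 300);
control.record(false, 1_500_000, 420);
console.log(control.passRate);      // 0.5
console.log(control.tokensPerPass); // 2700000
```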
Rounds 1-2: Nothing
Both conditions hit 100% pass rate. Every feature, every time. Sonnet 4.6 just figured out our framework on its own.
Shared memory doesn't help when the agent can solve things from scratch. We needed harder problems.
Round 3: First real signal
We redesigned the eval to simulate sprints, where each one builds on the last and the codebase grows. Three trials.
| Metric | Control | With Spark |
|---|---|---|
| Pass rate | 70% | 100% |
| Tokens per pass | 1.35M | 0.65M |
Spark rescued 7 features that control failed. Zero went the other direction. By Sprint 3, agents with Spark used 65% fewer tokens. The gap widened as complexity grew, which is what you'd expect if the memory is actually doing something useful.
Round 4: More trials, same result
Same setup, five trials instead of three, beefier hardware.
| Metric | Control | With Spark |
|---|---|---|
| Pass rate | 90% | 100% |
| Total cost (50 runs) | $215 | $93 |
Treatment passed every feature across all 5 runs. The compounding was interesting to watch: by run 3, Spark agents had plateaued at ~430K tokens per feature, down from 1M+ in run 1. Each session made the next one cheaper, and then the savings leveled off once the knowledge base was saturated.
Round 5: Does it actually generalize?
Here's the obvious objection to everything above: the Spark namespace contained solutions to the exact features being tested. Maybe agents were just looking up answers, not learning transferable patterns.
So we created two brand-new features that exercise the same hard patterns (state machines, event handling, authorization) with completely different business logic. No agent had ever seen them. Spark had zero direct knowledge.
| Metric | Control | With Spark |
|---|---|---|
| Pass rate | 80% | 90% |
| Avg tokens per pass | 2.7M | 1.8M |
| Avg time (complex feature) | 334s | 170s |
Even on problems it had never seen, Spark lifted the pass rate from 80% to 90%, cut token usage by a third, and halved wall-clock time on the hardest feature. The framework knowledge genuinely generalized.
What held up across 200+ runs
Pass rates went from 70% to 90-100% once the tasks were hard enough to matter. Token usage dropped 33-57% per round, with compounding savings as knowledge accumulated. Cross-model, the savings hit 87%.
And the knowledge transfers. Between models. To problems the agent has never seen. That's the part that matters if you're thinking about this at team scale.
Right now every agent session your team runs throws away everything it learned. Spark captures that and makes it available to every future session, whether that's the same engineer continuing tomorrow, a teammate working on something adjacent, or a cheaper model handling routine work.
Try Spark
Spark works with Claude Code today. Install the CLI or add the MCP server and point it at your codebase — the compound savings start from session one.
Install the CLI:

```sh
npm install -g @memco/spark
```

Or add the MCP server directly to Claude Code:

```sh
claude mcp add --transport http spark https://spark.memco.ai/mcp
```