Scott Taylor · February 17, 2026 · 4 min read

Agents need knowledge they can't generate themselves

A new paper confirms what we've been building toward: curated procedural knowledge improves agent performance by 16%, but agents can't write it themselves. Static skill files don't scale. Shared memory does.

Spark · Research · SkillsBench

A new paper just dropped that confirms what we've been building toward for the past year.

SkillsBench, published last week, is the first serious benchmark for measuring whether procedural knowledge actually helps AI agents. The researchers tested 86 tasks across 11 domains, ran over 7,000 agent trajectories, and arrived at two findings that matter for anyone deploying agents at scale.

Finding one: curated procedural knowledge works. Agents with access to the right how-to guidance improved their success rate by 16% on average. In some domains the effect was dramatic. Healthcare tasks jumped from 34% to 86%. Financial reporting went from near-zero to 75%. Manufacturing saw a 42% improvement.

Finding two: agents can't write this knowledge themselves. When models were asked to generate their own procedural skills before attempting a task, performance actually dropped by 1.3%. Not flat. Negative. The models that benefit most from consuming external knowledge are the worst at producing it.

That second finding is the one worth sitting with.

The cold-start problem is real

The Hacker News discussion around the paper was predictably split. Academics pointed to the data. Practitioners pushed back on the methodology, arguing that real-world skill creation involves iteration and human steering, not zero-shot generation from a task brief.

Both sides are right, and both are missing the bigger picture.

The paper tested what happens when you ask a model to plan before doing the work. Write a skill file, then execute. That approach fails because no new information enters the system. The model is writing a document from its own priors and then reading that document back to itself. Of course that adds nothing.

What actually works in practice is the opposite sequence. Do the work first. Notice what went wrong. Capture the non-obvious lessons. Test whether they help next time. Skills as memoization of hard-won experience, not pre-task planning.
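That sequence can be pictured as a tiny loop. This is a hypothetical sketch, not anything from the paper or a real library; every name in it is invented for illustration:

```python
# Sketch of "skills as memoization of hard-won experience":
# do the work first, capture the lesson after, reuse it next run.
# All names here are illustrative, not a real agent API.

skills = []  # the store starts empty: the cold-start problem

def do_task(topic):
    """Attempt the task, helped by any lessons already captured."""
    prior = [s["lesson"] for s in skills if s["topic"] == topic]
    # ... agent executes here, consulting `prior` if non-empty ...
    return {"ok": True, "lesson": f"non-obvious fix for {topic}"}

def capture(topic, outcome):
    """Only after real work produces a result do we memoize it."""
    if outcome["ok"] and outcome["lesson"]:
        skills.append({"topic": topic, "lesson": outcome["lesson"]})

outcome = do_task("13F filing analysis")
capture("13F filing analysis", outcome)
# The next run on the same topic starts with one lesson instead of zero.
```

Note the ordering: nothing is written before the attempt, so every stored lesson is grounded in an actual outcome rather than the model's priors.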

But here's the problem nobody in the thread addressed: that cycle dies with the session. Agent A discovers that a specific approach to 13F filing analysis works. That knowledge exists for exactly one run. Agent B, facing the same problem tomorrow, starts from zero.

Static files don't scale

The paper's entire architecture is fundamentally single-player. A markdown file sits in a directory. One agent reads it. End of story.

No learning loop. No sharing across agents within an organization. No accumulation of what worked. The 322 contributors who wrote skills for this benchmark did valuable manual work. But it's manual work that doesn't compound.

The domain results tell you exactly where this matters. The biggest gains came from domains where procedural knowledge is specialized, proprietary, and underrepresented in pretraining data. Healthcare. Manufacturing. Finance. Cybersecurity. These are also the domains where enterprises deploy the most agents, where institutional knowledge is most valuable, and where losing it between sessions is most expensive.

A static skill file written by one contributor and read by one agent is a start. It is not infrastructure.

What a learning loop actually looks like

We built Spark because we saw this gap 12 months ago, before the paper put numbers on it.

Spark is a shared memory layer for AI agents. When an agent solves a problem, the knowledge it gained gets extracted, scored for reliability, and stored for every other agent operating in the same space. Not as a trace to replay, but as reusable know-how that surfaces at the right moment in the right context.
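The extract, score, store, surface loop described above might be sketched like this. To be clear, the class and method names below are invented for illustration and are not Spark's actual API:

```python
# Minimal sketch of a shared memory layer: store know-how with a
# reliability score, surface only what clears the bar, adjust on feedback.
# Invented names throughout; this is not Spark's real interface.
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    topic: str
    reliability: float  # rises when reuse helps, falls when it doesn't

@dataclass
class SharedMemory:
    items: list = field(default_factory=list)

    def store(self, text, topic, reliability=0.5):
        self.items.append(Memory(text, topic, reliability))

    def surface(self, topic, min_reliability=0.3):
        """Return relevant know-how, most reliable first."""
        hits = [m for m in self.items
                if m.topic == topic and m.reliability >= min_reliability]
        return sorted(hits, key=lambda m: m.reliability, reverse=True)

    def feedback(self, memory, helped):
        memory.reliability += 0.1 if helped else -0.2

mem = SharedMemory()
# Agent A learns something once in its session...
mem.store("Parse the filing header before the holdings table", topic="13F")
# ...Agent B, in a different session, starts from it instead of from zero.
for m in mem.surface("13F"):
    print(m.text)
```

The reliability score is what turns a static file into a loop: memories that keep helping rise to the top, and memories that stop helping eventually fall below the surfacing threshold.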

The SkillsBench results map directly to what we see in our own benchmarks. The paper found 16% average improvement from curated skills. Our benchmarks show agents with Spark access achieving 100% task success, roughly 50% faster, using 50% fewer tokens. These aren't competing numbers. They're complementary layers of the same stack. Static skills provide procedural templates. Shared memory makes those templates, and everything else, better over time.

The paper also found that 2-3 focused skills outperform comprehensive documentation, and that loading too many skills actually hurts performance. This is a retrieval problem. The right knowledge at the right moment helps. Too much context degrades performance. SkillsBench treats this as authoring guidance: write shorter files. We treat it as infrastructure: surface the relevant memory, not all of it.
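The retrieval stance can be made concrete with a toy selector. The scoring below is deliberately crude keyword overlap, and the skill library is made up; the point is only the shape of the operation, select a few relevant items rather than load everything:

```python
# Toy sketch of "surface the relevant memory, not all of it":
# rank skills by relevance to the task and keep only the top few,
# mirroring the finding that 2-3 focused skills beat comprehensive docs.
def relevance(query_words, skill_words):
    # Jaccard overlap between word sets: a stand-in for real retrieval.
    return len(query_words & skill_words) / len(query_words | skill_words)

def select_skills(query, library, k=3):
    q = set(query.lower().split())
    scored = [(relevance(q, set(s.lower().split())), s) for s in library]
    scored.sort(reverse=True)
    return [s for score, s in scored[:k] if score > 0]

library = [
    "validate 13F holdings totals against the cover page",
    "schedule CNC tool changes from spindle-load history",
    "reconcile GAAP and non-GAAP revenue lines",
    "escalate sepsis alerts on two abnormal vitals",
]
picked = select_skills("reconcile quarterly revenue reporting", library, k=2)
print(picked)  # → ['reconcile GAAP and non-GAAP revenue lines']
```

Everything that scores zero stays out of the context window entirely, which is the difference between a retrieval system and a directory of markdown files loaded wholesale.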

And critically, shared memory compounds. The paper's architecture is linear. Contributor writes a skill, agent reads it, nothing flows back. Spark's architecture is a loop. Every agent interaction makes the memory better, which makes every subsequent agent interaction better. One engineer's breakthrough becomes every agent's starting point.

The real takeaway

SkillsBench proves three things. Agents need external knowledge to perform. They can't generate it themselves. And the current approach of hand-curated static files doesn't scale to the way enterprises actually deploy agents: dozens of them, across multiple tools and teams, from multiple vendors.

With Spark, knowledge is extracted transparently as users work with their agents in the usual way. There are no extra files to write and no extra buttons to press. Spark abstracts and assimilates the domain knowledge the agent was missing, and makes it available for the whole team to leverage.

The question isn't whether procedural knowledge helps. The paper settled that. The question is where that knowledge comes from, who maintains it, and whether it compounds. That's the infrastructure problem we're solving.


The SkillsBench paper is available at arxiv.org/abs/2602.12670. Read more about Spark's benchmarks and shared memory architecture at memco.ai.