Valentin Tablan · March 26, 2026 · 13 min read

Making Sense of Agentic Memory: A Map of the Design Space

The term 'memory' is used to describe wildly different systems. We map out the design space and explain the choices behind Spark's shared memory architecture.

Memory · Architecture · Professional Developers

TL;DR: "Memory" in the AI agent space covers everything from markdown files to model retraining, making the term almost meaningless. We break down the design space along four dimensions: where memory is stored, how it is created, how it evolves, and who it is for. Each dimension presents a spectrum of choices, and these choices are not independent: the decision to build shared institutional memory (rather than personal memory) logically drives the rest of Spark's architecture: automatic creation, dynamic trust scoring, continuous evolution, and a fully managed service.

---

The word "memory" has become one of the most overloaded terms in AI. Over the past year, we have seen memory features launched by major AI providers, well-funded startups building memory infrastructure, a wave of academic papers proposing memory architectures, and open source projects offering memory for every conceivable use case. A recent survey paper cataloguing the space (Memory in the Age of AI Agents) runs to over 40 pages of references, and the list keeps growing.

The problem is that "memory" is used to describe wildly different things. Using a single word for all of them is a bit like using the word "tool" to cover both a hammer and a CT scanner. They are both tools, in the sense that they help us achieve goals, but the category is so broad that it tells you almost nothing about what the thing actually does. When someone says they are building "memory for agents," that could mean anything from a markdown file with a few preferences, to a managed service processing millions of knowledge records, to retraining the weights of a foundation model. These are fundamentally different systems, solving different problems, with different trade-offs.

At Memco, we have spent the past year building Spark, a shared memory layer for AI agents. Along the way, we have had to make choices at every level of the design space: how to store knowledge, how to create it, how to keep it current, who it is for, and how to deliver it. This post is an attempt to map out that design space and explain the reasoning behind the choices we made.

Scott, my co-founder, has previously written about memory incentives and architectures, exploring the ownership dimension: who controls the memory, and who benefits from it. In this post I want to focus on the functional and technical dimensions.

A caveat before we start: these dimensions are not independent of each other. Real systems make bundled choices, and a decision along one dimension often constrains or shapes choices along others. In fact, one of the things I want to show is that the choices we made for Spark are not arbitrary: they form a coherent set that follows logically from our focus on shared, institutional memory. More on that at the end.

Where Does the Memory Live?

The most basic question about any memory system is where the knowledge is physically stored. Approaches in the wild range from the trivially simple to the genuinely complex.

Markdown files. The simplest form of memory is a set of files that live alongside your code or in your agent's working directory. These could be manually authored ("here are things I want my agent to know"), or they could be automatically populated by the agent itself when it encounters something worth recording. The AGENTS.md convention, adopted by thousands of open source projects, is a good example. A related pattern, used especially for procedural knowledge, is collecting SKILL.md files that capture how to solve specific types of problems. Systems like Context Hub provide tooling for curating these files.

This approach is appealing in its simplicity, and it works well when the volume of knowledge is modest: tens to perhaps a few hundred items. Beyond that, it starts to break down. Finding the right piece of knowledge in a sea of files becomes unreliable, and the agent's context window fills up fast if you try to include everything.

External memory servers. The middle ground, and the approach we chose for Spark, is to use a dedicated external server to store, index, and retrieve knowledge, with agents connecting to it via standard interfaces like MCP or CLI tools. Unlike a collection of markdown files, a server can scale to millions of pieces of knowledge. As the volume grows, the retrieval problem becomes increasingly important. Returning relevant results from a large corpus in milliseconds is well-understood, but it requires specialist infrastructure and careful engineering.

Several approaches in this category use knowledge graphs as the primary storage substrate. Systems like Zep build temporal knowledge graphs that capture relationships between entities and track how those relationships evolve over time. The appeal is clear: graphs make relationships explicit and navigable, which is powerful for reasoning about connected facts.

We took a different path. Spark uses a hybrid of keyword search and vector search, combined with semi-structured tags for organising knowledge. Our reasoning here comes down to a tension between explicit and implicit semantics. A knowledge graph forces you to commit to a specific schema of relationships upfront. This is powerful when the domain is well-understood and the relationships are stable, but it introduces rigidity. If a new type of relationship emerges, or the boundaries of the domain shift, you need to update the schema. This echoes a familiar pattern from the history of AI: the rules-based systems of classical AI were precise but brittle, while modern connectionist approaches trade some precision for flexibility and robustness.

By relying on keyword and vector-based search, with lightweight tagging to add structure where it is useful, we get a system that adapts more gracefully as the knowledge base evolves. We accept that we lose some of the explicit navigability that graphs provide, but we gain scalability and resilience — both as the amount of knowledge grows, and as query volume increases.
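To make the hybrid approach concrete, here is a minimal sketch of combined keyword and vector scoring with tags as a hard filter. The scoring functions, the `alpha` blend weight, and the record shape are all illustrative assumptions, not Spark's actual implementation (a production system would use a real BM25 index and an approximate nearest-neighbour store rather than the toy scorers below).

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query_terms, doc_terms):
    # Term-overlap ratio standing in for a real BM25 scorer.
    overlap = len(set(query_terms) & set(doc_terms))
    return overlap / max(len(set(query_terms)), 1)

def hybrid_search(query, records, alpha=0.5, required_tags=None):
    """Blend keyword and vector relevance; tags act as a lightweight
    structural filter rather than a rigid schema."""
    required = set(required_tags or [])
    results = []
    for rec in records:
        if not required <= set(rec["tags"]):
            continue  # record lacks a required tag
        score = (alpha * keyword_score(query["terms"], rec["terms"])
                 + (1 - alpha) * cosine(query["embedding"], rec["embedding"]))
        results.append((score, rec["id"]))
    return sorted(results, reverse=True)
```

The point of the sketch is the shape of the trade-off: relevance comes from two soft signals blended together, while tags add just enough structure to scope a query without committing to a graph schema upfront.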

Parametric memory (model weights). At the other end of the spectrum, some approaches aim to encode new knowledge directly into the model itself. This could mean full retraining, or more commonly, creating an adapter like a LoRA module that modifies the model's behaviour. The "parametric memory" layer in the MemOS framework is an example of this thinking.

This is an intellectually interesting line of research, but it faces significant practical constraints. Most teams in production use third-party foundation models from OpenAI, Anthropic, Google, and others, which they cannot retrain or apply adapters to. Even when retraining is possible, it is slow and expensive, which limits how frequently knowledge can be updated. A memory system that can only learn new things once a week is not much of a memory system.

There is also a more fundamental issue. Model weights are not a precise storage medium. The training process compresses and reshapes information, guided by a loss function that optimises for general performance, not for the faithful retention of specific facts. Models are good at remembering the gist of an idea (the general shape of a concept, the typical pattern), but they are unreliable at storing verbatim details. Research on LLM knowledge has repeatedly shown this: models struggle with precise factual recall, and the well-known "reversal curse" (Berglund et al., 2023) demonstrated that a model trained on "A is B" often cannot reliably infer "B is A." This is, in a sense, the same reason we humans need reference books. Our brains are excellent at gist and pattern, but we reach for a manual when precision matters. External memory is simply better at storing information that needs to be retrieved faithfully.

How Is the Memory Created?

The next question is how knowledge enters the system in the first place. Here the spectrum runs from fully manual to fully automatic.

Manual creation. Some approaches rely on the user to identify and add knowledge explicitly. You decide what is worth remembering, you write it down, and you add it to the memory. This is the model behind curated markdown files, and also behind systems like Context Hub, where a human maintains a structured repository of context for their agents.

The advantage is precision: the contents of the memory are exactly what the user intended. The disadvantage is that it creates a parallel task alongside the actual work. People use AI agents to get their jobs done. Manually curating a memory base is extra effort, and in practice, not everyone has the time or the inclination to do it consistently. The result is that manually curated memories tend to suffer from starvation: the knowledge base starts strong but falls behind as people get busy with the work itself.

Automatic creation. In Spark, we chose to make the memory creation process fully automatic. Rather than asking users to do extra work, we instruct agents to identify newly discovered knowledge during the course of normal interactions and to automatically submit those as candidate entries for the shared memory. These represent insights that were not obvious before the agent encountered the specific problem.

This means some of the submitted content will not be high quality. That is an expected consequence of removing the human filter at the input stage. To compensate, we implement input filtering that rejects insights that are trivial, duplicative, or show indicators of low quality. Our philosophy is that people should just use their agents in the usual way, to do their jobs. It is the machine (the agent and the memory server working together) that has the extra job of learning from those interactions. The friction should be zero.
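A minimal version of that input-stage filter might look like the following. The thresholds and the Jaccard-overlap duplicate check are hypothetical stand-ins for illustration; the real filtering pipeline is more involved (and would typically use embeddings rather than word overlap to detect near-duplicates).

```python
def accept_candidate(text, existing, min_words=8, dup_threshold=0.8):
    """Screen an agent-submitted insight before it enters shared memory.
    A toy sketch: rejects entries that are too short to carry real
    insight, or that are near-copies of existing knowledge."""
    words = set(text.lower().split())
    if len(words) < min_words:
        return False, "trivial"        # too short to be a real insight
    for prior in existing:
        prior_words = set(prior.lower().split())
        jaccard = len(words & prior_words) / len(words | prior_words)
        if jaccard >= dup_threshold:
            return False, "duplicate"  # near-copy of an existing entry
    return True, "accepted"
```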

How Does the Memory Change Over Time?

The world is not static, and a memory system that cannot evolve with it will gradually become a liability rather than an asset. This is one of the dimensions where the choices are most consequential.

Static memory. Approaches that require manual effort to create memories, like markdown files, also require manual effort to maintain them. When a library releases a breaking change, or an internal convention shifts, someone needs to go update the relevant files. Anyone who has worked in software engineering knows that maintenance effort tends to exceed creation effort over the long run. A memory base that starts as a useful resource can quietly become a source of outdated and misleading guidance if no one is keeping it current.

Dynamic memory. In Spark, memory evolution is an automatic, continuous process. We manage it through several mechanisms.

First, we use the feedback signal that comes from clients (agents and their users) to accumulate positive and negative evidence about individual pieces of knowledge over time. This evidence feeds a statistical trust score for each insight and for each contributor, computed with Bayesian methods that make good use of all available evidence. These trust scores inform retrieval ranking: when an agent queries the memory, results are ranked not just by relevance but by trustworthiness.
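The simplest Bayesian model of this kind treats each piece of feedback as a noisy vote on an insight's reliability and takes the posterior mean of a Beta-Bernoulli model as the trust score. This is a sketch of the idea, not Spark's production model (which also folds in contributor track record and consistency signals):

```python
def trust_score(positive, negative, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta-Bernoulli model with a uniform prior.
    With no evidence the score is 0.5; it moves toward 0 or 1 as
    feedback accumulates, and more evidence moves it more decisively."""
    return (prior_a + positive) / (prior_a + prior_b + positive + negative)

def ranked(results):
    # Order retrieval results by relevance weighted by trust, so a
    # highly relevant but poorly trusted insight can be outranked.
    return sorted(results,
                  key=lambda r: r["relevance"] * trust_score(r["pos"], r["neg"]),
                  reverse=True)
```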

Second, time matters. Spark implements a decay function where memories that are not being used gradually lose salience. This ensures that recent knowledge can naturally override older knowledge, without requiring anyone to manually identify and remove stale content.
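One common form for such a decay function is exponential decay parameterised by a half-life. The constants below are illustrative, not Spark's actual values:

```python
def salience(base, days_since_last_use, half_life_days=90.0):
    """Exponential decay of salience: an insight unused for one
    half-life retains half its weight, two half-lives a quarter,
    and so on, letting fresh knowledge naturally outrank stale."""
    return base * 0.5 ** (days_since_last_use / half_life_days)
```

Because each use resets the clock, knowledge that keeps proving useful stays salient indefinitely, while abandoned entries fade without anyone having to delete them.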

Third, we run a process of memory evolution that aims to extract new knowledge from existing knowledge. This involves a set of memory processing operators: decomposing complex insights into atomic units, identifying patterns across related insights, and recombining knowledge in new ways that may not have been obvious from any single contribution. The goal is to make the memory greater than the sum of its parts.
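The first two operators can be sketched in miniature. In production these steps are LLM-driven rather than rule-based; the functions below are hypothetical illustrations of the operator shapes only:

```python
def decompose(insight):
    """Split a compound insight into atomic units, one claim per
    sentence. A toy operator; real decomposition would use an LLM
    to split on meaning rather than on punctuation."""
    return [s.strip() for s in insight["text"].split(".") if s.strip()]

def find_patterns(insights, min_support=2):
    """Surface tags that recur across multiple insights: these are
    candidate clusters for synthesising new, combined knowledge."""
    counts = {}
    for ins in insights:
        for tag in ins["tags"]:
            counts[tag] = counts.get(tag, 0) + 1
    return {tag for tag, count in counts.items() if count >= min_support}
```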

What Is the Memory For?

Perhaps the most important dimension is what the memory is aiming to remember. We think of this as a spectrum running from general knowledge about the world, through institutional and team knowledge, to personal knowledge about an individual user.

General knowledge is the job of foundation models. This is what pre-training provides: a broad understanding of language, concepts, facts, and patterns distilled from the training corpus. No external memory system needs to replicate this.

Personal knowledge sits at the other end: your preferences, your conversation history, your working style. This is the focus of the memory features built into consumer AI products like ChatGPT and Claude, and also the primary use case for dedicated memory platforms like Mem0 and the stateful agent framework Letta (formerly MemGPT). These systems help make an individual agent feel personalised and consistent across sessions.

Our focus at Memco is in the middle of this spectrum: shared memory that captures team and institutional knowledge. This is the practical know-how, the tribal conventions, the hard-won lessons from past failures, the implicit context that experienced team members carry in their heads but never write down. It is the kind of knowledge that makes a senior engineer effective on day one of a new project, and its absence is what makes AI agents stumble when they encounter the idiosyncrasies of a real codebase.

This middle ground is underserved. The overwhelming majority of investment and attention in the memory space is going toward personal memory, which has a clear user story and an obvious path to consumer adoption. But shared memory is where the compounding value lives. When one agent discovers a solution to a tricky problem, and that solution becomes available to every other agent working in the same context, the collective capability of the team grows in a way that personal memory alone cannot achieve.

It is also where the hardest problems live. Shared memory introduces challenges that personal memory does not face:

Multi-model compatibility. In any real team, people use different tools. Not everyone will be using the same coding agent, the same foundation model, or the same IDE. Shared knowledge needs to be able to flow between agents running on different models of different capability levels. A piece of knowledge that was discovered by an agent running Claude needs to be useful to an agent running GPT, and vice versa.

Trust. Unlike personal memory, where you implicitly trust yourself, shared knowledge comes from multiple contributors who may not be equally competent or well-intentioned. An incorrect insight from one contributor, if uncritically propagated, can cause harm across the entire group. We have implemented content filtering, and a Bayesian trust model that leverages all available evidence (feedback signals, contributor track record, consistency with other knowledge) to make the best possible judgement about whether a given piece of knowledge is reliable.

Combinatorial value. A community using shared memory accumulates knowledge faster than any individual could. But to capture the full value of that accumulation, you need to combine disparate pieces of knowledge from different sources and contexts. This is what our memory evolution operators do: synthesising new understanding from the collective experience of the group.

The Choices We Made, and Why They Cohere

Looking at these dimensions individually is useful for understanding the design space. But the more important insight is that our choices along each dimension are not independent. They follow from a single core commitment: building shared, institutional memory for teams of agents. Once you commit to shared memory, the rest of the design follows.

Shared memory means knowledge comes from many sources, which means you cannot rely on manual curation: the volume is too high and the contributors too dispersed. So automatic creation becomes a necessity, not a nice-to-have.

Automatic creation from diverse sources means variable quality, which means the system must be able to assess and rank knowledge by reliability. So dynamic trust scoring becomes essential.

Knowledge that accumulates over time from a changing world will contain stale and contradictory entries. So automatic evolution (decay, deduplication, recombination) becomes necessary to keep the memory healthy.

The sophistication required by trust scoring, evolution, and large-scale hybrid retrieval means this is not something you spin up with a weekend project. It requires specialist infrastructure (authentication, scaling, monitoring, compliance) working together as an integrated system. So a fully managed service is the natural delivery model, one that minimises operational burden for the customer while ensuring the complex internals work correctly. For most of our customers, setting up Spark in their organisation takes minutes.

And shared knowledge must flow between agents of different kinds, so an external memory server accessible via standard protocols is the right storage medium, not markdown files tied to a single workspace, and not model weights tied to a single foundation model.

Each of these choices reinforces the others. Take away any one, and the system becomes less effective. This is why we believe the framing of "memory" as a single category is misleading. Different memory systems are making different bundled choices, optimised for different problems. A personal memory system that extracts user preferences from conversation history is solving a genuinely different problem from a shared memory system that curates institutional knowledge for a team of agents. Both are valuable. But they should not be confused, and evaluating one by the criteria of the other will lead you astray.

Where This Is Going

The memory space is young, and moving fast. The academic literature is producing genuinely interesting ideas, from Zettelkasten-inspired memory organisation (A-MEM) to memory operating systems that treat knowledge as a first-class schedulable resource (MemOS). The commercial landscape is maturing too, with dedicated memory platforms finding their niches and major AI providers integrating memory features into their core products.

We expect this space to develop along lines that mirror the evolution of databases. In the early days of computing, every application rolled its own storage. Then general-purpose databases emerged, and eventually specialised databases for different workloads: relational for transactions, document stores for flexible schemas, graph databases for connected data, time-series databases for temporal patterns. Memory for AI agents is at the "every application rolls its own" stage. We think it will follow a similar specialisation trajectory, with different memory architectures optimised for different use cases.

Our bet is on the shared, institutional layer. We believe that as AI agents become the primary interface between humans and software, the organisations that give their agents access to collective knowledge will dramatically outperform those that leave each agent to figure things out on its own. Our research shows this is already happening: 40% lower costs, 34% faster completion times, and more than halved variance in outcomes when agents have access to Spark's shared memory.

If that vision resonates, and you want to give your agents a memory that learns from the collective experience of your team, try Spark or get in touch.