Memory Reliability Lab · public diagnostic

How reliable is your agent memory?

Answer a few high-level questions about your coding-agent workflow. The Memory Reliability Lab scores where your team is losing lessons across tools, repeating human corrections, relying on stale context, or missing the governance needed for production agents.

No repo access · No code upload · No proprietary data · Pattern-level answers only

Memory Reliability Score · Live judge · sample
38 / 100
Band · Ad hoc
Memory Gap: 7 / 20
Context Decay: 6 / 15
Workflow Capture: 6 / 20
Memory Interference: 9 / 20
Trust & Governance: 5 / 15
Evidence Confidence: 5 / 10
Work → PR correction → Capture → Retrieval → Outcome

Why this matters now

Coding agents are moving from novelty to production. Their memory systems are not.

Teams are using Claude Code, Cursor, Codex, Copilot, Windsurf, internal agents, and local model stacks. The learning is fragmented. One tool remembers something another never sees. Human corrections happen in PRs and vanish. Context files grow stale. Agent traces show what happened, not what should survive.

Today

Each agent run starts cold.

  1. Agent run
  2. PR correction
  3. Slack / local file / one engineer's head
  4. Next agent repeats the same mistake

With reliable memory

Each run improves the next.

  1. Agent run
  2. Candidate memory
  3. Scope · provenance · trust
  4. Retrieval before next task
  5. Outcome feedback · decay · update

The first run proves an agent can work. The second run proves whether the team can learn.

What it measures

Six dimensions, one rubric judge.

Every answer is scored against six signals of production-memory health. Together they produce a Memory Reliability Score, a verdict band, and a confidence rating.

01
/ 20

Memory Gap

Are lessons from agent work retained and reused — or do agents keep relearning what your team already knows?

02
/ 15

Context Decay

Are instructions stale, generic, duplicated, or over-broad? Static files do not know when they should be retired.

03
/ 20

Workflow Capture

Is there a clear path from work → correction → memory → future retrieval — or do lessons disappear into chat?

04
/ 20

Memory Interference

Can old or conflicting guidance mislead future agents? After a change, does anyone retire guidance that should not survive?

05
/ 15

Trust & Governance

Can teams prove where guidance came from, who approved it, who can use it, and whether the agent retrieved it?

06
/ 10

Evidence Confidence

How much of the score is based on concrete workflow facts vs. unknowns? The judge tells you what it could not see.
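
As a rough illustration of how the six dimensions could roll up into one number, here is a minimal Python sketch. It assumes the overall score is the plain sum of the dimension scores, which matches the sample figures on this page but is not confirmed as the judge's actual method. The names `DimensionScore`, `total_score`, and `weakest_first` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    name: str
    points: int      # points awarded for this dimension
    max_points: int  # the dimension's share of the 100-point rubric


# Sample values copied from the sample score shown on this page.
SAMPLE = [
    DimensionScore("Memory Gap", 7, 20),
    DimensionScore("Context Decay", 6, 15),
    DimensionScore("Workflow Capture", 6, 20),
    DimensionScore("Memory Interference", 9, 20),
    DimensionScore("Trust & Governance", 5, 15),
    DimensionScore("Evidence Confidence", 5, 10),
]


def total_score(scores):
    """Sum the awarded points (assumption: the score is a plain sum)."""
    if sum(s.max_points for s in scores) != 100:
        raise ValueError("dimension maxima should add up to 100")
    return sum(s.points for s in scores)


def weakest_first(scores):
    """Order dimensions by the share of available points earned."""
    return sorted(scores, key=lambda s: s.points / s.max_points)
```

With the sample figures, `total_score(SAMPLE)` returns 38, and `weakest_first(SAMPLE)` puts Workflow Capture (6 of 20) first. Band thresholds are not published on this page, so the sketch leaves them out.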

Privacy guarantee

Designed for teams that cannot paste private code into random tools.

The public diagnostic stores only high-level answers and a score. Private deep-dive audits happen later, under your security process.

  • No repo connection. The public diagnostic does not read your code, tickets, traces, or logs.
  • No source, secrets, or customer data requested. Questions are answered at the pattern level.
  • Optional free text asks for sanitized examples only. One sentence, no names, no code.
  • Stays in the browser until you submit. Nothing leaves your machine while you fill out the form.
  • Private deep-dive audits happen separately, under customer-controlled security terms (VPC, on-prem, BYOC).

Inside your boundary: never collected
Source code · PR diffs · Agent traces · CI logs · Stack traces · Tickets · Customer data · Secrets / .env

Public diagnostic receives: pattern-level signal only
Categories · Counts & ranges · Process answers · Yes / partial / no · Optional sanitized examples

Sample report

What the readout looks like.

Every diagnostic ends with a score, a band, a plain-English verdict, ranked failure modes, a learning-leak map, and a recommended first pilot.

Memory Reliability Score
38 / 100
Band · Ad hoc
Medium confidence · 26 / 32 answers concrete

Your team is using coding agents heavily, but learning is not compounding.

The main issue is not lack of context. It is that context is static, tool-specific, and disconnected from outcomes. Agents are likely repeating corrections your team has already paid to make. The fastest improvement is not more documentation — it is a write path that turns recurring corrections into scoped, trusted memory before the next agent picks up the task.

Top failure modes

  1. Tool memory silos

    Claude, Cursor, Codex, and Copilot do not share a durable memory layer.

  2. Lost human corrections

    Useful review feedback appears in PRs or chat but is not promoted into scoped memory.

  3. Memory interference

    Old guidance survives after code, policies, or architecture change.

Recommended first pilot

Choose one repeated workflow where agents frequently need review. Capture ten recurring corrections as candidate memories. Each carries scope, provenance, owner, trust signal, and a retirement path.

The activation metric: the first reused memory that improves a later task.
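
To make the pilot concrete, a candidate memory with the five attributes listed above could be modeled roughly as follows. This is a sketch under stated assumptions, not Memco's schema: every field name, the 0.0 to 1.0 trust scale, and the use of a review-by date as the retirement path are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CandidateMemory:
    lesson: str        # the recurring correction, stated at the pattern level
    scope: str         # where it applies, e.g. one workflow or service
    provenance: str    # where the lesson came from, e.g. "PR review"
    owner: str         # who is accountable for keeping it current
    trust: float       # hypothetical trust signal between 0.0 and 1.0
    retire_after: Optional[date] = None  # hypothetical retirement path

    def is_active(self, today: date) -> bool:
        """A memory stays retrievable until its retirement date passes."""
        return self.retire_after is None or today <= self.retire_after


# Illustrative record; the values are invented for the example.
example = CandidateMemory(
    lesson="Run the schema check before editing migration files",
    scope="billing-service migrations",
    provenance="PR review comment",
    owner="platform team",
    trust=0.6,
    retire_after=date(2030, 1, 1),
)
```

Keeping the retirement path on the record itself means a stale lesson can drop out of retrieval without anyone having to remember to delete it.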

Failure modes it detects

Nine ways teams quietly leak agent learning.

The judge cross-references your answers against these patterns. It does not need to read your code to know which ones are showing up.

FM-01

Cold-start tax

Every run starts as if the repo were new, so the same token cost recurs on every task.

FM-02

Static context bloat

AGENTS.md / CLAUDE.md becomes broad, stale, and expensive.

FM-03

Tool memory silos

Claude, Cursor, Codex, and Copilot each learn separately.

FM-04

Lost human corrections

The same review comment appears again next week.

FM-05

Missing write path

Agents consume context, but nothing they learn becomes durable.

FM-06

Memory interference

Old or over-broad guidance misleads later agents after change.

FM-07

No provenance

Nobody knows why a rule or memory is trusted.

FM-08

Governance blind spot

Useful memory cannot safely scale across teams.

FM-09

No activation proof

The team cannot prove memory improved a later run.

How Memco fixes the loop

From session-end forgetting to scoped, governed reuse.

Before

Agent learns. Session ends. Lesson disappears.

Agent learns → Session ends → Lesson lost
The work happened. The cost was paid. None of it survived to the next run.

With Memco

Each run feeds a governed memory loop.

Agent learns → Candidate memory → Scope · provenance · trust → Future retrieval → Outcome feedback → Update · decay · delete
Not a bigger prompt file. The governed memory layer that lets one agent's learning improve the next agent's work across models, IDEs, repos, and harnesses.

Memco is the layer between agent runs. It turns repeated corrections into scoped candidate memories, attaches provenance and ownership, retrieves them before the next relevant task, and retires what stops working. The model can change. The memory should not.
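
The loop described above (capture, scoped retrieval, outcome feedback, retirement) can be sketched as a toy in-memory store. This is an illustration only and does not represent Memco's API: the class names, the integer trust scale, and the retirement threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    lesson: str
    scope: str      # task area the lesson applies to
    source: str     # provenance of the lesson
    trust: int = 5  # hypothetical 0-10 trust score
    retired: bool = False


class MemoryLoop:
    """Toy store; the step size and threshold are illustrative assumptions."""

    STEP = 2          # trust gained or lost per outcome report
    RETIRE_BELOW = 2  # memories under this trust level are retired

    def __init__(self):
        self.memories = []

    def capture(self, lesson, scope, source):
        """Record a correction as a candidate memory."""
        memory = Memory(lesson, scope, source)
        self.memories.append(memory)
        return memory

    def retrieve(self, scope):
        """Return active memories for a scope before the next task starts."""
        return [m for m in self.memories if m.scope == scope and not m.retired]

    def feedback(self, memory, helped):
        """Adjust trust from the task outcome and retire what stops working."""
        delta = self.STEP if helped else -self.STEP
        memory.trust = max(0, min(10, memory.trust + delta))
        if memory.trust < self.RETIRE_BELOW:
            memory.retired = True
```

In this sketch a memory that fails to help on two consecutive tasks falls from 5 to 1 and is no longer retrieved, while one that helps keeps gaining trust up to the cap.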

Run the diagnostic

Stop paying agents to relearn what your team already knows.

Seven minutes. Six dimensions. A rubric-based AI judge. No code, no traces, no logs. Just the patterns you already know.

Pattern-level answers only. No private code required.