I BUILT A MEMORY LAYER THAT FORGETS ONLY WHAT DOESN'T MATTER
A user tells your AI agent they live in San Francisco. Two months later, they mention they just moved to Boston. Your agent now has two facts: "lives in San Francisco" and "lives in Boston." Next time someone asks where this user lives, the answer depends on which embedding is closer to the query vector. Coin flip. The agent says San Francisco half the time.
This is not a retrieval problem. It is a curation problem. The system never decided that "lives in San Francisco" should be replaced. It accumulated both facts and left the mess for search to sort out.
I ran into this while building widemem, an open-source memory layer for LLMs. The project started as a straightforward vector-store-plus-extraction setup. It became something different when I realized that the hard part of memory is not storing things. It is deciding what to let go of.
EVERY MEMORY FRAMEWORK HAS THE SAME BLIND SPOT
I dug through LangChain's ConversationBufferMemory, early versions of mem0, and a few RAG-plus-vector-store setups before writing any code. They all shared the same architecture: text goes in, facts get extracted, vectors get stored, similarity search gets them back.
Adding information was a first-class operation. Removing or superseding information was bolted on after the fact, if it existed at all. Apparently "what if the user moves?" was not on anyone's product roadmap.
Dan Giannone wrote about this failure mode: "confidently wrong responses that blend stale retrieved context with current information." The LLM is not hallucinating from nothing. It is hallucinating from bad memories you handed it.
The longer an agent runs, the worse this gets. Contradictions pile up silently. Old preferences sit next to new ones. A drug allergy mentioned six months ago decays out of the retrieval window because the decay function treated it the same as a lunch preference.
THE BRAIN DOES NOT HAVE THIS PROBLEM
Before writing code, I spent a few weeks reading about biological memory. The single biggest takeaway: forgetting is not a defect. It is a feature the brain invests serious resources in.
The forgetting curve. Ebbinghaus measured this in the 1880s by memorizing nonsense syllables and testing himself at intervals. Without reinforcement, about 56% of information is gone within an hour, 66% after a day, 75% after six days. The retention follows R = e^(-t/S) where S is the stability of the memory. Low-stability memories vanish. High-stability memories persist.
Forgetting as active learning. In 2022, Ryan and Frankland argued in Nature Reviews Neuroscience that forgetting is a form of learning. The brain does not delete memories. It adjusts their accessibility based on environmental relevance. Follow-up optogenetics work in mice demonstrated the point: "forgotten" memories could be reactivated by stimulating the original engram cells. The information was still there. The brain had deprioritized it.
Retrieval suppresses competition. Anderson and Bjork showed in 1994 that retrieving one memory actively inhibits competing ones. Every time you remember where you parked today, your brain suppresses yesterday's parking spot. Not because yesterday's memory is damaged, but because surfacing it right now would cause errors.
Three principles fell out of this reading:
- Not all memories are equal. The system should know the difference at write time, not search time.
- Relevance decays, but not uniformly. Trivial facts should fade fast. Critical ones should resist decay.
- New information should actively displace contradicted old information. Not passively, not eventually. At write time.
HOW THIS BECAME CODE
Importance scoring at extraction
When text enters widemem, an LLM extracts facts and assigns each an importance score from 1 to 10:
"I live in San Francisco and I had pizza for lunch"
-> Fact: "lives in San Francisco" importance: 8
-> Fact: "had pizza for lunch" importance: 2Location, occupation, health conditions score high. Meal choices, weather observations, transient preferences score low. Is the LLM always right? No. But it is dramatically better than treating every fact as equally important, which is what a raw vector store does.
Decay that respects importance
Retrieval combines three signals:
final_score = (0.5 * similarity) + (0.3 * importance) + (0.2 * recency)

The recency component applies time decay. The implementation:
import math

def compute_recency_score(created_at, now, decay_function, decay_rate=0.01):
    # Age of the memory in days; clamp negatives (clock skew) to zero.
    age_days = max((now - created_at).total_seconds() / 86400, 0.0)
    if decay_function == "exponential":
        return math.exp(-decay_rate * age_days)
    if decay_function == "linear":
        return max(1.0 - decay_rate * age_days, 0.0)
    if decay_function == "step":
        if age_days < 7:
            return 1.0
        if age_days < 30:
            return 0.7
        if age_days < 90:
            return 0.4
        return 0.1
    return 1.0  # no decay

Recency is only 20% of the score. A six-month-old fact with importance 9 still outranks a one-day-old fact with importance 2. The peanut allergy survives. The pizza does not. As it should be. Nobody was going to ask about the pizza.
This maps loosely to Ebbinghaus's stability variable S. High-importance memories get a higher effective S, so they resist the exponential drop. Low-importance ones decay on the standard curve.
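A sketch of that mapping, with a made-up base stability of 30 days purely for illustration: scaling S by importance lets high-importance facts ride a much flatter curve than low-importance ones at the same age.

```python
import math

def retention(age_days: float, importance: int,
              base_stability: float = 30.0) -> float:
    # Ebbinghaus R = e^(-t/S), with stability S scaled by importance.
    # base_stability = 30 days is a hypothetical tuning constant.
    s = base_stability * importance
    return math.exp(-age_days / s)

high = retention(180, importance=9)  # still above half strength
low = retention(180, importance=2)   # effectively forgotten
```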
[Chart: Importance + recency score over 6 months. Similarity (50% of final score) excluded to isolate the decay effect.]
Batch conflict resolution
This is where active forgetting happens. When new facts arrive, the system does not just append them. It pulls related existing memories via vector similarity and sends everything to the LLM in one call:
N new facts checked against existing memories in a single LLM call.
New facts:
[0] "just moved to Boston" (importance: 8)
Existing memories:
[0] "lives in San Francisco"
-> LLM decides: UPDATE existing[0] with new[0]

Four possible actions per fact:
| Action | Meaning |
|---|---|
| ADD | Genuinely new information |
| UPDATE | Supersedes or refines an existing memory |
| DELETE | Invalidates an existing memory |
| NONE | Already captured, skip |
One LLM call for N facts, not N calls. Your API bill will notice the difference. The LLM also sees the full batch, so it can reason about interactions between facts ("fact 3 and fact 5 both relate to memory 7, but fact 3 is more recent").
The prompt tells the LLM to be conservative: prefer NONE over ADD if a fact is already captured, and explicitly detect contradictions and refinements. This works well for clear-cut cases (location changes, job changes, relationship status). It stumbles on subtle implications. Good enough for now.
What a real conversation looks like
Here is what happens when an agent uses widemem over three months with one user:
January 15
User mentions they live in San Francisco, work as a data engineer at a fintech startup, and are allergic to penicillin.
3 facts stored: "lives in San Francisco" (importance 8), "data engineer at fintech startup" (importance 7), "allergic to penicillin" (importance 9, YMYL-flagged, decay-immune)
February 3
User mentions they got promoted to senior data engineer.
Conflict resolver: UPDATE "data engineer at fintech startup" -> "senior data engineer at fintech startup". No duplicate. One clean fact.
March 10
User says they just relocated to Boston for a new role at a bank.
UPDATE "lives in San Francisco" -> "lives in Boston". UPDATE "senior data engineer at fintech startup" -> "works at a bank". The San Francisco fact is gone.
March 12
Agent is asked "what should I know about this user?"
Returns: "lives in Boston" (high similarity + importance + recent), "works at a bank" (same), "allergic to penicillin" (high importance, decay-immune, full strength despite being 2 months old). The pizza from January 15? Decayed out weeks ago.
The YMYL safety net
I added YMYL handling after a test run where the system decayed a user's medication fact below the retrieval threshold after 60 days of inactivity. The decay math was correct. The outcome was not acceptable.
YMYL (Your Money or Your Life) is borrowed from Google's search quality guidelines. In widemem, facts matching health, medical, financial, legal, safety, insurance, tax, or pharmaceutical patterns get protection:
- Strong matches ("blood pressure", "bank account", "drug interaction"): importance floor of 8, immune to time decay, forced contradiction detection.
- Weak matches ("doctor", "bank"): nudged to importance 6, normal decay.
Two tiers because "walked by the bank" should not trigger the same response as "opened a savings account at the bank." The detection is regex-based. It catches "diabetes diagnosis" and "drug interaction" reliably. It will miss "my sugar levels have been weird lately." Semantic classification would be better but would need an LLM call per fact, which kills the fast-path.
The tradeoff: some false positives (a mention of Doctor Who gets a minor importance bump) in exchange for no false negatives on things like medication allergies. I am comfortable with that asymmetry.
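The two-tier check can be sketched in a few lines. The patterns below are illustrative samples, not widemem's actual lists; the structure is what matters: strong matches get an importance floor of 8 plus decay immunity, weak matches only get nudged to 6.

```python
import re

# Illustrative patterns, not widemem's real YMYL lists.
STRONG = re.compile(r"blood pressure|bank account|drug interaction|allerg|medication", re.I)
WEAK = re.compile(r"\bdoctor\b|\bbank\b|\binsurance\b", re.I)

def ymyl_adjust(fact: str, importance: int) -> tuple[int, bool]:
    # Returns (adjusted importance, decay_immune).
    if STRONG.search(fact):
        return max(importance, 8), True   # floor of 8, immune to decay
    if WEAK.search(fact):
        return max(importance, 6), False  # nudged, normal decay
    return importance, False

ymyl_adjust("allergic to penicillin", 5)   # strong match: floored and protected
ymyl_adjust("walked by the bank", 2)       # weak match: nudged only
ymyl_adjust("Doctor Who marathon", 3)      # the false positive I accept
```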
WHAT I GOT WRONG BEFORE I GOT IT RIGHT
Per-fact conflict resolution. Version one checked each new fact against existing memories individually. Ten facts meant ten LLM calls, each deciding in isolation. The batch approach (one call, full context) was cheaper and made better decisions because the LLM could see relationships between the incoming facts.
Decay as the only forgetting mechanism. Early builds relied entirely on time decay to handle staleness. This fails for slow-moving contradictions. If you say San Francisco in January and Boston in March, the San Francisco fact has two months of retrieval history behind it. Decay alone will not fix it. You need explicit contradiction detection to mark it as wrong, not just old. Time decay handles irrelevance. Conflict resolution handles incorrectness. Both are needed.
Over-engineering the hierarchy. I built a three-tier system (facts roll up into summaries, summaries into themes) because it seemed like the right abstraction. In practice, flat fact retrieval with good scoring handles 90% of queries fine. The hierarchy helps for broad questions like "tell me about this user" but those are a small fraction of real usage. I should have shipped without it and added it when someone asked. Classic case of building for an imaginary user who never showed up.
WHAT I LEARNED FROM OTHER SYSTEMS
The AI memory space has gotten crowded in the last year. Rather than a feature comparison, here is what I learned from studying each system:
Mem0 is doing excellent work on graph memory. Entity relationships catch things that flat storage misses, and their research (arXiv:2504.19413) shows 26% better scores than OpenAI's memory with 90%+ token savings. If you need relationship-aware memory, Mem0 is worth a serious look.
From Zep's Graphiti I learned that temporal versioning (tracking how facts change over time, not just what the current state is) is valuable for audit trails and debugging. widemem logs history but does not maintain a full temporal graph. Zep's 94.8% on Deep Memory Retrieval is the number to beat.
From Letta (MemGPT) I learned that letting the LLM manage its own memory is elegant but fragile. The model can drift as it edits its own state, and every memory operation costs tokens. Explicit, system-level memory management is less flexible but more predictable.
From Google's Titans paper (arXiv:2501.00663) I got the idea that "surprise" is a useful signal. They measure the gradient of new input against current memory state. High gradient means high novelty, which means prioritize for storage. This is close to what neuroscience says about how the brain flags unexpected information for stronger encoding. I have not implemented surprise-based scoring yet but the concept is sound.
No system has solved AI memory. We are all making different tradeoffs along the same axes: complexity vs. simplicity, cloud vs. local, graph vs. flat, implicit vs. explicit. widemem bets on local-first, flat storage, and explicit conflict resolution. That bet works for single-user agents running on a laptop. It may not work for multi-tenant platforms with millions of users.
THE PROBLEMS I HAVE NOT SOLVED
Importance is subjective. "I'm learning Python" might be importance 3 for a hobbyist and 9 for someone changing careers. The LLM makes a guess with limited context. Topic weights (boost "programming" to 2x) help, but they are a patch on a deeper issue: importance depends on who the user is, and the system learns that slowly.
Similarity thresholds create blind spots. The conflict resolver only checks existing memories above a 0.6 similarity threshold. "Lives on the West Coast" vs. "just moved to Boston" might not overlap enough in embedding space to trigger resolution. The contradiction slips through.
Scale is an open question. Conflict resolution cost grows linearly with the number of related memories. At hundreds of memories per user this is fine. At tens of thousands, the context window fills up and you need a selection strategy for which memories to check. I have not built that yet.
Regex safety has a ceiling. YMYL detection catches explicit patterns but misses oblique references. A proper solution would combine fast regex as a first pass with semantic classification as a fallback, at the cost of one additional LLM call when the regex is uncertain. I may build this.
WHERE I THINK THIS IS HEADING
The current generation of memory systems, widemem included, is doing to memory what early search engines did to information retrieval. The approaches work. They are useful. And they will look primitive in a few years.
The research is converging fast. A-MEM (NeurIPS 2025) uses Zettelkasten-style interconnected notes with dynamic indexing. MAGMA (January 2026) introduced multi-graph architectures with async memory consolidation. There is an ICLR 2026 workshop dedicated entirely to agent memory. The field is arriving at the same conclusion from multiple directions: memory is not a storage problem. It is a curation problem.
The gap between biological memory and what we build today is not about capacity. It is about intelligence in forgetting. The brain runs synaptic pruning, retrieval-induced suppression, engram accessibility modulation, and surprise-based encoding. We approximate all of that with an exponential decay function and a conflict resolution prompt.
But the approximation clears a real bar: it is better than appending everything to a vector store and hoping similarity search will sort it out. That bar is low, and clearing it makes a material difference for agents that need to remember users across weeks and months.
The hard part was never making AI remember. The hard part is teaching it what to let go of.
widemem is open source (Apache 2.0) on GitHub and PyPI. If you are building with LLM memory and want to compare notes, I would like to hear about it.