MEMORY FOR AI THAT OPERATES UNDER AUDIT
Whisper hallucinated in clinical transcripts. Memory layers are the next surface. What audit-grade actually looks like.
In October 2024, the Associated Press reported that OpenAI's Whisper model was hallucinating in clinical transcripts. Researchers from Cornell, the University of Washington, and elsewhere documented invented phrases, fabricated medications, and racial commentary inserted into patient audio. Multiple US hospitals had Whisper-based scribes in production at the time. The vendors removed the tools, the lawyers showed up, and the conversation about "is the model good enough" changed shape.
The next surface is memory. An AI scribe that captures one note correctly and then stores it in a memory layer that silently decays it, mutates it under a competing fact, or loses it in a vector eviction is the same failure mode at a longer horizon. The transcript looked clean. The chart six weeks later did not.
This post is about what changes when you build a memory layer that assumes its outputs will be audited. We built widemem with that assumption. The audit trail is not bolted on. It is the spine.
What auditors actually ask
The questions are boring and they are always the same. They are not about AI. They are about records.
Five questions every regulated buyer asks. Most memory layers cannot answer them.
| Audit question | Typical memory layer | widemem |
|---|---|---|
| Who created this record, and when? | Not tracked | user_id + timestamp on every fact |
| When did it change, and to what? | Last write wins, no history | get_history(memory_id) returns ordered log |
| Can it silently disappear? | Yes (TTL, eviction, blanket decay) | YMYL facts immune to decay; deletes logged |
| Why was it weighted highly at recall? | Black-box scoring | importance + ymyl_category in metadata |
| Can we reconstruct state at time T? | No | History log replay (manual stitching today) |
You can pass each of these with a Postgres table and discipline. The problem is that most AI memory layers ship without any of them, and retrofitting audit plumbing after a contract is signed is the wrong order of operations.
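The "reconstruct state at time T" row deserves a concrete shape. Until first-class replay ships, you stitch it manually from the ordered history log. A minimal sketch, assuming only the fields the history entries expose (action, timestamp, old content, new content); the dataclass here is a stand-in for illustration, not widemem's own type:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class HistoryEntry:
    """Stand-in mirroring the fields get_history() returns; not widemem's own class."""
    action: str                 # "add" | "update" | "delete"
    timestamp: datetime
    old_content: Optional[str]
    new_content: Optional[str]

def state_at(entries: List[HistoryEntry], t: datetime) -> Optional[str]:
    """Replay an ordered history log; return the fact's content as of time t.

    None means the fact did not exist at t (never added, or already deleted)."""
    content: Optional[str] = None
    for e in sorted(entries, key=lambda e: e.timestamp):
        if e.timestamp > t:
            break
        content = None if e.action == "delete" else e.new_content
    return content

log = [
    HistoryEntry("add", datetime(2026, 5, 1, tzinfo=timezone.utc),
                 None, "allergic to penicillin"),
    HistoryEntry("update", datetime(2026, 5, 12, tzinfo=timezone.utc),
                 "allergic to penicillin", "allergic to penicillin and sulfa"),
]
state_at(log, datetime(2026, 5, 5, tzinfo=timezone.utc))  # -> "allergic to penicillin"
```

Fold the same replay into an automated check and "can we reconstruct state at time T" becomes a unit test rather than a scramble.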
History is a first-class citizen
Every add(), update(), and delete() in widemem writes a row to a SQLite history table. Not a debug log, not a metric, not a thing you have to remember to turn on. The history store is constructed at the same time as the vector store.
from widemem import WideMemory

memory = WideMemory()  # history.db wired up by default

result = memory.add(
    "Patient reports persistent chest pain for three days.",
    user_id="patient_42",
)
mid = result.memories[0].id

# Later, in compliance review:
for entry in memory.get_history(mid):
    print(entry.action.value, entry.timestamp.isoformat(),
          entry.old_content, "->", entry.new_content)

Each HistoryEntry carries an ID, the memory ID, the action (add / update / delete), the old content if any, the new content if any, and a UTC timestamp. That is enough to answer the five questions above without a single new dependency.
How a single fact moves through an audit-grade memory layer
Two things are worth stating plainly about what this is and what it isn't.
What it is. A tamper-resistant local record of how a memory got to its current state. Useful in a vendor questionnaire. Useful in a postmortem. Useful when a clinician asks "why did the assistant think the patient was allergic to penicillin" six weeks after the conversation that introduced the fact.
What it isn't. A source-message provenance log. The history tells you that on 2026-05-12 at 14:33 UTC the fact changed from A to B. It does not yet tell you which inbound message triggered the change. Source-message linking is on the roadmap, not in the library today. Better to say that here than have a buyer find it in a deposition.
Where OSS memory layers actually stand
Out of the major open-source AI memory layers, only widemem ships all five of the audit-relevant features by default. The next chart shows what you can install today and run with no configuration changes, compared against Zep, the most credible adjacent project on audit-shaped concerns.
Chart: audit features shipped by default, out of 5 — history log, importance scoring, configurable decay, YMYL category, zero-egress local stack.
Feature-by-feature breakdown, with the sources cited below the table:
| Audit feature | widemem | Zep |
|---|---|---|
| History log (per-memory action trail) | YES | YES |
| Importance scoring (1-10 weight at recall) | YES | no |
| Configurable decay function | YES | no |
| YMYL category flag + decay immunity | YES | no |
| Zero external services for full pipeline | YES | no |
How each was checked. widemem features are in the library source at github.com/remete618/widemem-ai: history store in widemem/storage/history.py, importance and decay in widemem/scoring/, YMYL in widemem/scoring/ymyl.py, local-only stack in the default provider configuration. Zep ships a temporal knowledge graph with a session change history, which is closer to a history log than most memory layers offer, but it does not carry importance scoring, a configurable decay model, YMYL semantics, or a zero-egress local-only stack. If any of this is out of date by the time you read it, the source links are above.
The point of the chart is not that widemem is the best memory layer for every use case. It isn't. Zep's temporal graph is impressive, and there are integration stories elsewhere in the OSS memory landscape that widemem doesn't match yet. The point is that audit features are still rare in this category, and if you operate under audit, that rarity is your problem.
YMYL classification and decay immunity
Most memory systems apply the same retention rules to every fact. A dinner reservation and a medication allergy get scored, decayed, and evicted by the same function. That is fine for a chatbot. It is malpractice for a clinical assistant.
widemem ships a two-tier YMYL (Your Money or Your Life) classifier: fast regex for strong patterns, semantic LLM check for the rest, with no extra API calls because it piggybacks on the extraction call you already pay for. The accuracy numbers and the regex failure modes live in the semantic YMYL post. What matters here is what changes once a fact is classified:
Importance floor of 8.0/10, regardless of the extractor's initial score.

Decay function skipped entirely for the lifetime of the fact.

Contradiction detection forced on every update, whether or not active retrieval is globally enabled.

The classification itself persists as ymyl_category in the vector store metadata and survives restarts.
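The first tier is easy to picture. A toy sketch of a regex gate — illustrative patterns invented for this post, not widemem's actual set, which lives in widemem/scoring/ymyl.py:

```python
import re
from typing import Optional

# Illustrative patterns only; widemem's real tier-1 set is in widemem/scoring/ymyl.py.
YMYL_PATTERNS = {
    "medical": re.compile(r"\b(allerg(?:y|ic|ies)|medication|dosage|diagnos(?:is|ed))\b", re.I),
    "financial": re.compile(r"\b(bank account|routing number|mortgage|pension)\b", re.I),
}

def tier1_ymyl(text: str) -> Optional[str]:
    """Return a YMYL category on a strong pattern hit; None defers to the semantic tier."""
    for category, pattern in YMYL_PATTERNS.items():
        if pattern.search(text):
            return category
    return None
```

The point of the two-tier split is cost: the regex gate settles the obvious cases for free, and only the ambiguous remainder rides along on the extraction call.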
From an auditor's perspective, decay immunity is the load-bearing property. A peanut allergy stored six months ago is as critical to recall today as it was the day it was stored. If you can't prove that holds in your memory layer, you cannot ship to a healthcare buyer.
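To make decay immunity concrete, here is a toy recall-weight formula — hypothetical shape and numbers, not widemem's actual scoring function — showing what the six-month mark does to an unprotected fact:

```python
def recall_weight(importance: float, age_days: int,
                  daily_decay: float = 0.01, ymyl: bool = False) -> float:
    """Toy scoring sketch (not widemem's real function): non-YMYL facts lose
    weight exponentially with age; YMYL facts keep the 8.0 floor and skip decay."""
    if ymyl:
        return max(importance, 8.0)
    return importance * (1.0 - daily_decay) ** age_days

# Six months on, a peanut allergy flagged YMYL still recalls at full weight,
# while an ordinary fact that started at the same importance has faded.
ordinary = recall_weight(8.0, age_days=180)            # ~1.3
protected = recall_weight(8.0, age_days=180, ymyl=True)  # 8.0
```

Whatever the real curve looks like in your stack, this is the property to demonstrate to a healthcare buyer: age the fact, query it, show the rank did not move.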
How this maps to actual compliance controls
Auditors don't care about your README. They care about specific control IDs. Here is how widemem's features map to the controls a regulated buyer's legal team will ask about first:
Where widemem features map to common compliance controls
| Control | What it asks | How widemem helps |
|---|---|---|
| HIPAA §164.312(b) Audit controls | Record and examine activity in systems handling ePHI. | get_history(memory_id) returns every add / update / delete with a UTC timestamp. |
| SOC 2 CC7.2 System monitoring | Detect and respond to unauthorized or unexpected changes. | History log is the change record. Importance + YMYL metadata flag what to monitor. |
| GDPR Art. 30 Records of processing | Maintain a record of processing activities, retention, and changes. | Per-fact created_at / updated_at, history log, configurable retention via decay. |
| GDPR Art. 17 Right to erasure | Erase personal data on request and prove it was erased. | Delete actions are logged. Manual today: stitch deletes to the data subject by user_id. |
| EU AI Act Art. 12 Record-keeping | High-risk AI systems must keep automatic logs throughout the lifecycle. | History log persists for the lifetime of the SQLite file. No telemetry, no external sink. |
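The Art. 17 row says "manual today"; concretely, the stitching looks something like this. A sketch under assumptions: your application keeps its own index of which memory IDs belong to each data subject, and you have already pulled (action, timestamp) pairs out of get_history() for each of them:

```python
from datetime import datetime
from typing import Dict, List, Tuple

def erasure_evidence(
    histories: Dict[str, List[Tuple[str, datetime]]],
) -> List[Tuple[str, str]]:
    """Collect logged delete events as erasure proof for one data subject.

    histories maps each of the subject's memory IDs (tracked at the
    application layer) to its (action, timestamp) pairs, as pulled
    from get_history() per memory ID."""
    return sorted(
        (mid, ts.isoformat())
        for mid, entries in histories.items()
        for action, ts in entries
        if action == "delete"
    )
```

The output is the artifact you hand back on an erasure request: for every memory that belonged to the subject, either a logged delete with a timestamp, or an open item.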
None of this means widemem is HIPAA-certified or SOC 2-certified. widemem is a library. Certifications live with the deployment, not the dependency. What the table above gives a buyer is a clean answer to "does this dependency help us answer the auditor's questions," and that answer is yes for the controls above.
Local-first is a regulatory shortcut
A surprising number of compliance questions stop being questions when the data does not leave your perimeter. "What is your sub-processor list" becomes "there is no sub-processor." "Where is the data residency" becomes "wherever your VPC is." "What is your incident response process for vendor data exposure" becomes a much shorter conversation when there is no vendor.
widemem ships with SQLite plus FAISS as the default stack. No external service is required to run the library at any feature level. Pair it with Ollama (for LLM extraction) and sentence-transformers (for embeddings), and the entire pipeline (ingest, extract, classify, embed, store, retrieve) runs inside your network.
# Fully local pipeline. No OpenAI key, no Pinecone, no Qdrant Cloud.
from widemem import WideMemory
from widemem.core.types import MemoryConfig, LLMConfig, EmbeddingConfig

memory = WideMemory(config=MemoryConfig(
    llm=LLMConfig(provider="ollama", model="llama3"),
    embeddings=EmbeddingConfig(
        provider="sentence_transformers",
        model="all-MiniLM-L6-v2",
    ),
))

This is not the most accurate configuration. GPT-4o-mini will catch edge cases that Llama 3 misses, and we are honest about that on the benchmarks page. But for an air-gapped deployment where "most accurate" loses to "cannot leave the room," a fully local stack is the only shape that ships.
What audit-grade memory still owes you
Three things widemem does not give you today, and you should not pretend otherwise to a buyer or an auditor:
Source-message provenance. The history log knows when a fact changed. It does not yet know which inbound message changed it. For most clinical and legal workflows, an auditor will want the original transcript line that produced "patient allergic to penicillin," not just the timestamp at which the fact was created. Today, you wire it up at the application layer by passing message IDs in metadata and correlating manually. First-class support is the next big audit-track item.
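The application-layer workaround can be as small as a wrapper that records the correlation itself. A sketch under assumptions: add() and result.memories behave as in the earlier examples, and the mapping is persisted wherever your audit records live (here, a plain dict for illustration):

```python
provenance: dict = {}  # memory_id -> inbound message ID; persist with your audit records

def add_with_source(memory, text: str, user_id: str, message_id: str):
    """Wrap memory.add() so every extracted fact is correlated with the
    inbound message that produced it (manual provenance, per the caveat above)."""
    result = memory.add(text, user_id=user_id)
    for m in result.memories:
        provenance[m.id] = message_id
    return result
```

When an auditor asks which transcript line produced a fact, you join the history log's memory ID against this table. Clunky, but it closes the gap until source linking is first-class.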
Certifications. widemem is a Python library, not a managed service. We do not hold SOC 2, HIPAA, or ISO 27001 because there is no service to certify. The compliance posture of your deployment is yours. What we can do is make the audit answers easier to produce: architecture diagrams, data flow docs, prepared answers for the vendor questionnaires your legal team will hand you. That is what a support contract pays for. The library itself is Apache 2.0 and free.
BAA. Same reason. If you need a Business Associate Agreement, you need it with the entity that hosts the model and the data store, not with a library author. We will help you find the right BAA partners and we will help you sign your own.
Honesty is the moat. Anyone claiming a library is HIPAA-certified is either lying or selling you a service in a library's clothing. Either way, your legal team should run.
Try it
pip install widemem-ai

from widemem import WideMemory
from widemem.core.types import MemoryConfig, YMYLConfig

memory = WideMemory(config=MemoryConfig(
    ymyl=YMYLConfig(enabled=True, decay_immune=True),
))

result = memory.add(
    "Patient is severely allergic to penicillin.",
    user_id="patient_42",
)
mid = result.memories[0].id

# Any later retrieval ranks this fact at the top:
# importance >= 8.0, decay function skipped, contradiction
# detection forced on every update.

# Audit trail is on by default. No flag to flip.
for entry in memory.get_history(mid):
    print(entry.action.value, entry.timestamp.isoformat())

If your AI agent operates under audit, the enterprise page covers the support shape: architecture review, vendor-questionnaire help, priority bug fixes, and a pilot scoped to a single production use case. First call is 30 minutes, technical, not a sales pitch.
The Whisper story is not finished. The next chapter will be about where the transcripts went after the model was done with them. That is the memory layer's problem. We'd rather it be a boring one.
RELATED
YOUR AI MEMORY CAN'T TELL A RIVER BANK FROM A SAVINGS ACCOUNT
How the two-tier YMYL classifier catches implied safety-critical content and rejects metaphors, at zero additional API cost.
YOUR AI FORGOT SOMEONE'S MEDICATION. NOW WHAT?
The original YMYL piece: why some facts should never decay, and what changes downstream when one does.