YOUR AI SHOULD KNOW WHEN IT DOESN'T KNOW
A user asks your agent: "What's my doctor's name?" The agent has never been told this. But the vector store returns something with a 0.38 similarity score — maybe a mention of a hospital visit, maybe a prescription refill. The agent confidently says "Dr. Martinez." It made that up. Not from nothing — from a weak match it should have ignored.
This is not a hallucination problem. It is an overconfidence problem. The memory system returned low-quality results and the agent treated them as facts. The system had no way to say "I don't have good information about this."
I hit this repeatedly while building widemem. The retrieval was working. The scoring was working. But the system was answering questions it had no business answering, because every query got results and every result looked the same. v1.4.0 adds confidence scoring, uncertainty modes, and retrieval depth controls. Here is what changed and why.
THE PROBLEM WITH ALWAYS ANSWERING
Most vector stores return the top K results for any query, regardless of quality. Ask for the user's blood type and you get back their favorite color, because that was the closest vector in the space. The similarity score might be 0.3, which is basically noise, but the system returns it anyway.
The downstream LLM sees retrieved context and assumes it is relevant. It weaves that context into its response. The user gets an answer that looks authoritative but is built on a foundation of "closest thing I had, which was not close at all."
This is worse than returning nothing. A blank response tells the user the system does not know. A confident wrong response tells them the system knows and is lying. The second failure mode erodes trust much faster.
I tracked this in testing with a synthetic user who mentioned living in San Francisco, working as a data engineer, and being allergic to peanuts. When I asked about their spouse's name — something never mentioned — the system pulled back the peanut allergy fact (similarity 0.31) and the agent worked it into a response. Technically a retrieval. Practically a disaster.
CONFIDENCE LEVELS: HIGH, MODERATE, LOW, NONE
widemem 1.4.0 assigns a confidence level to every retrieval result. Not the raw similarity score — a human-readable assessment based on multiple signals:
| Level | Score range | Meaning |
|---|---|---|
| HIGH | ≥ 0.7 | Strong match. Use with confidence. |
| MODERATE | 0.5 – 0.69 | Relevant but may be incomplete or tangential. |
| LOW | 0.3 – 0.49 | Weak match. Might be useful, might be noise. |
| NONE | < 0.3 | No meaningful match found. |
The score combines similarity, importance, and recency — the same composite score widemem has always computed. The difference is that now the system interprets that score before returning it. A result with a composite score of 0.35 comes back tagged as LOW, not just as a number the downstream code has to interpret.
results = mem.search("what's their doctor's name?", user_id="user_42")
for r in results:
    print(f"{r.fact} confidence={r.confidence} score={r.score:.2f}")
# allergic to peanuts confidence=LOW score=0.31
# lives in San Francisco confidence=LOW score=0.28

Now the agent can see that neither result is a good match. It can respond with "I don't have information about their doctor" instead of hallucinating a name from weak context.
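The score-to-level mapping from the table above is simple enough to sketch directly. This is an illustrative bucketing function using the published thresholds, not widemem's internal code:

```python
def confidence_level(score: float) -> str:
    """Map a composite score to a confidence bucket.

    Thresholds follow the table above; widemem's internal
    implementation may differ in detail.
    """
    if score >= 0.7:
        return "HIGH"
    if score >= 0.5:
        return "MODERATE"
    if score >= 0.3:
        return "LOW"
    return "NONE"
```

The point of the bucketing is that downstream code branches on four named levels instead of re-deriving meaning from a raw float at every call site.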
THREE UNCERTAINTY MODES
Confidence levels tell the agent how good the results are. Uncertainty modes tell the system what to do about it. Different applications need different behaviors:
Strict mode
Only return HIGH confidence results. Everything else is filtered out. If there are no high-confidence matches, the system returns an empty result set with a message explaining why.
mem = WideMemory(uncertainty_mode="strict")
results = mem.search("what's their blood type?", user_id="user_42")
# returns: []
# message: "No high-confidence matches found for this query."
# Good for: medical apps, financial advisors, legal tools
# Bad for: casual chatbots where "I think..." is acceptable

Strict mode is the right default for YMYL domains. If you are building a health assistant, returning nothing is better than returning something you are not sure about. The user can rephrase or provide more context. The system does not guess.
Helpful mode
Return HIGH and MODERATE results normally. Include LOW results but flag them explicitly. Filter out NONE. This is the default.
mem = WideMemory(uncertainty_mode="helpful")
results = mem.search("where do they work?", user_id="user_42")
# returns results with confidence tags
# HIGH/MODERATE results: presented normally
# LOW results: flagged with "low confidence — verify with user"
# NONE results: excluded
# Good for: general-purpose agents, productivity tools
# Bad for: nothing, honestly — this is a reasonable default

Creative mode
Return everything, including LOW and NONE matches, but label each with its confidence level. The agent decides what to do with them. Useful when you want the system to make connections that might not be obvious.
mem = WideMemory(uncertainty_mode="creative")
results = mem.search("what might they enjoy for dinner?", user_id="user_42")
# returns everything, including tangential matches
# "lives in San Francisco" (MODERATE) — agent might infer local cuisine
# "mentioned sushi last week" (HIGH) — direct preference
# "works late on Thursdays" (LOW) — might affect dinner timing
# Good for: recommendation engines, creative writing tools
# Bad for: anything where accuracy matters more than coverage

The mode name is "creative" but the real use case is exploration. Sometimes you want the system to surface weak signals because a human or a capable LLM can reason about them. The key is that the confidence labels are always there. The agent is never tricked into thinking a 0.3 match is a 0.8 match.
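Taken together, the three modes amount to a filtering policy over confidence levels. A minimal sketch of that policy, with result objects simplified to dicts (this is illustrative, not widemem's internal code):

```python
def apply_uncertainty_mode(results, mode="helpful"):
    """Filter or annotate results by uncertainty mode.

    `results` is a list of dicts like {"fact": ..., "confidence": ...},
    a stand-in for widemem's result objects. Returns (results, message).
    """
    if mode == "strict":
        # Only HIGH survives; explain when nothing does.
        kept = [r for r in results if r["confidence"] == "HIGH"]
        msg = None if kept else "No high-confidence matches found for this query."
        return kept, msg
    if mode == "helpful":
        kept = []
        for r in results:
            if r["confidence"] == "NONE":
                continue  # drop meaningless matches
            if r["confidence"] == "LOW":
                r = {**r, "note": "low confidence — verify with user"}
            kept.append(r)
        return kept, None
    if mode == "creative":
        return list(results), None  # everything, labels intact
    raise ValueError(f"unknown uncertainty mode: {mode!r}")
```

The shape of the policy is the important part: strict subtracts, helpful annotates, creative passes through.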
MEM.PIN() — MEMORIES THAT NEVER FADE
Temporal decay is useful for most facts. Your lunch order from three months ago should fade. But some things should be permanent. Not just YMYL-protected — genuinely permanent, immune to any form of decay or garbage collection.
# Pin a critical fact
mem.pin(memory_id="mem_abc123", reason="user-confirmed allergy")
# Pinned memories:
# - Always returned at full strength regardless of age
# - Cannot be deleted by automated cleanup
# - Can only be unpinned explicitly by the application
# - Show up with a "pinned" flag in search results
mem.search("health information", user_id="user_42")
# "allergic to peanuts" — confidence=HIGH, pinned=True, age=186 days
# Still at full strength after 6 months because it's pinned

YMYL protection slows decay and boosts importance. Pinning removes decay entirely. The distinction matters: a YMYL-flagged fact about someone's blood pressure medication will persist for a long time but could eventually fade if the user goes inactive for years. A pinned fact persists until explicitly unpinned.
Use pinning sparingly. The point of memory decay is that most facts should fade. If you pin everything, you have rebuilt the append-only vector store that created the problems in the first place. Pin the things that would cause real harm if forgotten: allergies, critical preferences, identity facts the user has explicitly confirmed.
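Mechanically, exempting pinned memories from decay can be as simple as a guard clause in the strength calculation. A minimal sketch — the exponential curve and 90-day half-life here are assumptions for illustration, not widemem's actual decay parameters:

```python
def effective_strength(importance: float, age_days: float,
                       half_life_days: float = 90.0,
                       pinned: bool = False) -> float:
    """Compute a memory's retrieval strength after decay.

    Pinned memories bypass decay entirely and keep their full
    importance regardless of age.
    """
    if pinned:
        return importance  # full strength, forever
    # Exponential half-life decay: strength halves every half_life_days.
    return importance * 0.5 ** (age_days / half_life_days)
```

With these assumed parameters, an importance-9 allergy fact would have faded to about 2.2 after 186 days if unpinned; pinned, it stays at 9.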
FRUSTRATION DETECTION
There is a specific interaction pattern that signals a memory failure: the user says something like "I already told you this" or "we talked about this last week." This means the system had the information and failed to retrieve it, or had it and let it decay when it should not have.
# widemem detects frustration signals in conversation text
result = mem.add(
    "I TOLD you I moved to Boston! We talked about this!",
    user_id="user_42",
)
# System response:
# - Searches for related memories about location
# - Finds "lives in San Francisco" (stale)
# - Updates to "lives in Boston" with boosted importance
# - Flags the memory for pinning review
# - Returns: frustration_detected=True, action="updated stale memory"
# Detection patterns:
# "I already told you" / "I said this before" / "we discussed this"
# "how many times do I have to say" / "I mentioned this last time"
# "you should know this" / "remember when I said"

When the system detects frustration, it does three things: searches harder for the relevant memory (expanding the similarity threshold), boosts the importance of whatever it finds or creates, and logs the incident for review. If the same type of frustration happens repeatedly, it suggests the decay parameters need adjustment.
This is not about making the AI seem more empathetic. It is about using user feedback as a signal that the memory system made a bad curation decision. The user is telling you, in real time, that a fact should not have decayed. Listen to that signal.
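A pattern-based detector along these lines can be sketched with a handful of case-insensitive regexes. The exact pattern list below is illustrative; widemem's detector may use a different set:

```python
import re

# Explicit frustration signals, mirroring the patterns listed above.
FRUSTRATION_PATTERNS = [
    r"\bI (already )?told you\b",
    r"\bI said this before\b",
    r"\bwe (talked about|discussed) this\b",
    r"\bhow many times do I have to\b",
    r"\bI mentioned this\b",
    r"\byou should know this\b",
    r"\bremember when I said\b",
]
_FRUSTRATION_RE = re.compile("|".join(FRUSTRATION_PATTERNS), re.IGNORECASE)


def frustration_detected(text: str) -> bool:
    """Return True if the text contains an explicit frustration signal."""
    return bool(_FRUSTRATION_RE.search(text))
```

As the post notes later, this catches only explicit phrasing; quiet rephrasing across turns needs session-level tracking that a regex cannot provide.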
RETRIEVAL MODES: FAST, BALANCED, DEEP
Not every query needs the same level of effort. Asking "what's their name?" should be instant. Asking "summarize everything we know about this user" justifies more computation. widemem 1.4.0 adds three retrieval modes that let you choose the tradeoff:
Fast mode
results = mem.search(
    "what's their name?",
    user_id="user_42",
    mode="fast",
)
# Vector similarity only. No re-ranking, no hierarchy traversal.
# ~5ms latency. Cheapest option. Good for simple lookups.

Balanced mode
results = mem.search(
    "what do they care about?",
    user_id="user_42",
    mode="balanced",
)
# Vector similarity + importance re-ranking + confidence scoring.
# ~15ms latency. Default mode. Good for most queries.

Deep mode
results = mem.search(
    "give me a full picture of this user",
    user_id="user_42",
    mode="deep",
)
# Full pipeline: vector search, hierarchy traversal, theme
# aggregation, cross-reference check, confidence scoring.
# ~50ms latency. Most thorough. Good for summary queries.

The cost difference is real. Fast mode is a single FAISS lookup. Deep mode might traverse the hierarchy, pull summaries, check for related themes, and re-rank everything. For a system processing thousands of queries per minute, letting the application choose the retrieval depth means the expensive path only runs when it is actually needed.
A practical pattern: use fast mode for inline autocomplete and real-time suggestions. Use balanced for normal conversation turns. Use deep for explicit "tell me everything" queries or when the agent detects it needs more context to answer well.
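That pattern can live in a thin application-side helper. A minimal sketch — the cue list and heuristic are assumptions, and widemem itself does not choose the mode for you:

```python
# Phrases that suggest a broad, summary-style query (illustrative list).
SUMMARY_CUES = ("summarize", "everything", "full picture", "tell me about")


def choose_mode(query: str, realtime: bool = False) -> str:
    """Pick a retrieval mode for a query.

    realtime=True covers autocomplete and inline suggestions,
    where latency matters more than depth.
    """
    if realtime:
        return "fast"
    q = query.lower()
    if any(cue in q for cue in SUMMARY_CUES):
        return "deep"  # broad queries justify the full pipeline
    return "balanced"  # sensible default for normal conversation turns
```

The returned string would then be passed as the `mode=` argument to `mem.search`.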
PUTTING IT TOGETHER
Here is what a real session looks like with all of these features working together:
Week 1
User mentions they live in San Francisco and work at a startup. Agent stores both facts. Importance: 8 and 7 respectively.
Week 3
User asks "do you remember my wife's name?" Agent searches in helpful mode. No matches above LOW confidence. Returns: "I don't have that information. Could you tell me?"
Old behavior: would have returned the closest match and hallucinated a name.
Week 6
User mentions they moved to Boston. Conflict resolver updates the location. User also says "I'm severely allergic to shellfish." YMYL flagged, importance 9. Agent pins it.
Week 10
User says "I told you I moved to Boston!" Frustration detected. System checks — the update did happen in week 6, but the agent's prompt was using cached context. System boosts the Boston fact and logs the retrieval miss.
Week 14
Application runs a deep retrieval: "summarize this user." Returns: lives in Boston (HIGH, recent), works at startup (MODERATE, aging), allergic to shellfish (HIGH, pinned). The San Francisco fact is gone. The startup fact is starting to fade but still relevant.
WHAT THIS DOES NOT SOLVE
Confidence calibration. The score ranges (0.7+ for HIGH, 0.5-0.69 for MODERATE, etc.) are based on testing with a synthetic dataset. They work well for the embedding models I have tested (sentence-transformers, OpenAI ada-002) but they are not universal. Different embedding spaces have different score distributions. The thresholds should be configurable per model, and eventually auto-calibrated. They are not yet.
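One direction for per-model calibration is to derive the cutoffs from the empirical score distribution of a sample of queries against a given embedding model. This is a sketch of that idea, not a shipped widemem feature, and the percentile choices are arbitrary assumptions:

```python
def calibrate_thresholds(scores, high_pct=0.90, mod_pct=0.70, low_pct=0.40):
    """Derive HIGH/MODERATE/LOW cutoffs from observed composite scores.

    `scores` is a sample of scores produced by one embedding model.
    Returns a dict of cutoffs positioned at the given percentiles of
    the sorted sample.
    """
    ranked = sorted(scores)

    def pct(p):
        idx = min(int(p * len(ranked)), len(ranked) - 1)
        return ranked[idx]

    return {"HIGH": pct(high_pct), "MODERATE": pct(mod_pct), "LOW": pct(low_pct)}
```

The appeal is that the same code adapts to embedding spaces with tight or spread-out score distributions, instead of hard-coding 0.7/0.5/0.3 everywhere.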
Frustration detection is pattern-based. It catches explicit signals like "I told you this" but misses subtle frustration. A user who quietly rephrases the same question three times is frustrated too. Detecting that requires tracking query patterns across a session, which is a different problem.
Mode selection is manual. The application chooses fast/balanced/deep. Ideally the system would auto-detect based on query complexity. A simple name lookup should automatically use fast mode. A broad "tell me about this user" should automatically use deep. This is on the roadmap but not shipped yet.
THE HONEST MEMORY THESIS
The core argument is simple: a memory system that says "I don't know" is more useful than one that always answers. Not because silence is better than information — it is not. But because false confidence destroys the trust that makes memory useful in the first place.
If your agent confidently tells a user they live in San Francisco when they moved to Boston three months ago, the user will stop trusting the agent's memory entirely. They will start re-stating facts every conversation. At that point, the memory system is not just broken — it is actively counterproductive, because it still retrieves and injects stale context that the LLM has to override.
Confidence scoring, uncertainty modes, and retrieval depth are all ways of saying the same thing: be honest about what you know and how well you know it. The user can handle uncertainty. They cannot handle confidently wrong.
widemem is open source (Apache 2.0) on GitHub and PyPI. v1.4.0 adds everything discussed in this post. If you have thoughts on confidence calibration or automatic mode selection, I would like to hear about it.