BENCHMARKS

Measured results for widemem v1.4.1 on LoCoMo, the standard long-term conversational memory benchmark from Snap Research (ACL 2024). Published reference numbers from Mem0, Zep, LangMem, A-Mem, OpenAI Memory, and a full-context baseline are included for context.

TL;DR

widemem v1.4.1 scores 54.81 overall J on the full 1,540-question LoCoMo set. Its strongest category is multi-hop reasoning at 57.27, ahead of every reference system. It is also among the most token-efficient in the field at about 214 tokens per query. Open-domain (42.71) is the weakest category and an accepted limit of the lean flat architecture. An earlier version of this page reported 45.32 and called widemem mid-pack; that figure measured a stale older index and understated the shipped library. The correction, the cause, and a failed attempt to lift open-domain are all documented below, because getting this wrong and fixing it in public is more useful than a clean story.

What is LoCoMo

LoCoMo (Long-term Conversational Memory) is the standard benchmark for evaluating AI memory systems, published by Snap Research at ACL 2024 (Maharana et al.). It is the benchmark used by Mem0, Zep, LangMem, A-Mem, and MemMachine in their own published numbers. We ran widemem against the same test set so the numbers sit alongside theirs on like-for-like terms.

Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents. Dataset: github.com/snap-research/locomo.

Read this before the numbers

These widemem numbers are the full 1,540-question v1.4.1 run, a single clean pass, no repair step, like-for-like with the other systems' published full runs. An earlier version of this page carried a six-conversation estimate of about 56; the full set came in at 54.81, a small downward revision we are stating plainly rather than quietly rounding. The prior 45.32 figure was a stale-index measurement and is explained below.

Headline: overall J score

LOCOMO OVERALL J SCORE (HIGHER IS BETTER)
J averaged across single-hop, multi-hop, open-domain, and temporal questions (n=1,540). widemem and all reference systems: full-run numbers.

System            Overall J
Full-context      72.90
Mem0^g            68.44
Mem0              66.88
Zep               65.99
LangMem           58.10
widemem v1.4.1    54.81
OpenAI Memory     52.90
A-Mem             48.38

An earlier version of this page reported 45.32 and called widemem mid-pack. That number was measured against a memory index built by widemem v1.3.0 and never rebuilt. v1.4.0 changed extraction to resolve relative dates (“yesterday”, “last week”) to absolute dates at write time. That change shipped on PyPI and was never re-benchmarked, because every run reused the v1.3.0 index. Re-ingesting all ten conversations with the current v1.4.1 code and re-scoring the full 1,540-question set gives 54.81. This is the confirmed number, not an estimate.

Multi-hop reasoning

MULTI-HOP J SCORE
Questions requiring synthesis across multiple sessions (the largest category)

System            Multi-hop J
widemem v1.4.1    57.27
Mem0              51.15
LangMem           47.92
Mem0^g            47.19
Full-context      42.92
Zep               41.35
A-Mem             18.85

Multi-hop questions are the hardest category. They require connecting facts across multiple sessions, sometimes weeks apart. This is widemem's strongest result on LoCoMo: 57.27 J, ahead of every reference system in the set. Importance-weighted retrieval filters to a small set of high-relevance, high-importance facts, which is what multi-hop synthesis needs. Published numbers for the other systems are linked in the methodology section.

Token efficiency

AVERAGE TOKENS PER QUERY (LOWER IS BETTER)
Total memory context delivered to the answer-generation LLM per question

System            Avg tokens per query
LangMem           127
widemem v1.4.1    ~214
Mem0              1,764
A-Mem             2,520
Mem0^g            3,616
Zep               3,911
OpenAI Memory     4,437
Full-context      26,031

widemem delivers about 214 tokens per query: roughly 8x fewer than Mem0 (1,764), 17x to 21x fewer than Mem0^g, Zep, and OpenAI Memory (3,616 to 4,437), and over 120x fewer than the full-context baseline (26,031). Only LangMem (127) is leaner. This matters because every retrieved memory gets prepended to the answer-generation call, so it sets operating cost, rate-limit pressure, and latency in production. Compact context is a deliberate design choice: importance-weighted retrieval surfaces a few high-signal memories instead of a large pool of mediocre ones.
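To make the operating-cost point concrete, the sketch below converts the per-query averages from the chart above into total context tokens at a given query volume, and into a ratio against widemem. The helper names and the volume of 1,000,000 queries are hypothetical, chosen only for illustration, and no token prices are assumed.

```python
# Average memory-context tokens per query, copied from the chart above.
AVG_TOKENS_PER_QUERY = {
    "LangMem": 127,
    "widemem v1.4.1": 214,
    "Mem0": 1764,
    "A-Mem": 2520,
    "Mem0^g": 3616,
    "Zep": 3911,
    "OpenAI Memory": 4437,
    "Full-context": 26031,
}


def context_tokens(system: str, queries: int) -> int:
    """Total memory-context tokens sent to the answer LLM for `queries` questions."""
    return AVG_TOKENS_PER_QUERY[system] * queries


def ratio_to(system: str, baseline: str = "widemem v1.4.1") -> float:
    """How many times more context tokens `system` sends than `baseline`."""
    return AVG_TOKENS_PER_QUERY[system] / AVG_TOKENS_PER_QUERY[baseline]


if __name__ == "__main__":
    queries = 1_000_000  # hypothetical volume, for illustration only
    for name in AVG_TOKENS_PER_QUERY:
        total = context_tokens(name, queries)
        print(f"{name:<16} {total:>16,} tokens  ({ratio_to(name):.1f}x widemem)")
```

Multiplying the totals by a provider's input-token price gives a rough context bill; the ratios hold regardless of price.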

J per 1,000 tokens (efficiency)

ANSWER QUALITY PER TOKEN (HIGHER IS BETTER)

System            J score    Avg tokens    J / 1k tokens
LangMem           58.10      127           457
widemem v1.4.1    54.81      214           256
Mem0              66.88      1,764         38
A-Mem             48.38      2,520         19
Mem0^g            68.44      3,616         19
Zep               65.99      3,911         17
OpenAI Memory     52.90      4,437         12
Full-context      72.90      26,031        3

For workloads where context cost is a real budget constraint (high-volume agents, rate-limited APIs, local-LLM deployments), answer quality per token is the number that decides the bill. The full field is shown so the trade-off is visible: systems that index more context tend to reach higher raw J at a large token cost.
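The J / 1k tokens column is plain arithmetic on the two columns before it: overall J divided by average tokens per query, times 1,000. A minimal sketch that reproduces the column from the figures in the table above (the helper name is ours, for illustration):

```python
# (overall J, average tokens per query), copied from the table above.
RESULTS = {
    "LangMem": (58.10, 127),
    "widemem v1.4.1": (54.81, 214),
    "Mem0": (66.88, 1764),
    "A-Mem": (48.38, 2520),
    "Mem0^g": (68.44, 3616),
    "Zep": (65.99, 3911),
    "OpenAI Memory": (52.90, 4437),
    "Full-context": (72.90, 26031),
}


def j_per_1k_tokens(j_score: float, avg_tokens: float) -> float:
    """Answer quality per 1,000 context tokens (illustrative helper)."""
    return j_score / avg_tokens * 1000


if __name__ == "__main__":
    # Rank systems by efficiency, highest first.
    ranked = sorted(
        RESULTS.items(),
        key=lambda item: j_per_1k_tokens(*item[1]),
        reverse=True,
    )
    for name, (j_score, avg_tokens) in ranked:
        print(f"{name:<16} {j_per_1k_tokens(j_score, avg_tokens):6.0f}")
```

Because the widemem token figure is itself approximate (about 214), its ratio should be read as approximate too.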

Category breakdown, including the weak spot

The full v1.4.1 run changes the per-category story sharply from the old v1.3.0 numbers.

Temporal questions (J 30.53 then, 59.09 now)

This is the category the stale index hid. v1.3.0 extraction stored relative time references unresolved, so “Caroline went yesterday” never became a date and time questions were effectively unanswerable, scoring 30.53. v1.4.0 changed extraction to resolve relative dates to absolute dates at write time. On the full run temporal is 59.09, one of widemem's stronger categories, not a weakness. This single fix, already shipped, is most of the gap between the old 45.32 and the confirmed 54.81.
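To illustrate what resolving relative dates at write time means, here is a minimal sketch. It is not widemem's extraction code; the function name, the phrase table, and the example session date are all assumptions made for illustration, and mapping "last week" to exactly seven days back is a deliberate simplification.

```python
import re
from datetime import date, timedelta

# Hypothetical phrase table: offset in days from the session date.
# "last week" -> 7 days back is a simplification for this sketch.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "today": 0,
    "last week": -7,
}

# Match any known phrase as a whole word, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in RELATIVE_OFFSETS) + r")\b",
    re.IGNORECASE,
)


def resolve_relative_dates(text: str, session_date: date) -> str:
    """Replace relative time phrases with ISO dates anchored to session_date."""

    def _swap(match: re.Match) -> str:
        offset = RELATIVE_OFFSETS[match.group(0).lower()]
        resolved = session_date + timedelta(days=offset)
        return "on " + resolved.isoformat()

    return _PATTERN.sub(_swap, text)


if __name__ == "__main__":
    # Hypothetical session date; the stored fact now carries an absolute date.
    print(resolve_relative_dates("Caroline went yesterday", date(2023, 5, 8)))
```

Stored this way, a later question such as "when did Caroline go?" can be answered from the memory text alone, without knowing when the original session took place.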

Single-hop factual recall (J 41.25 then, 46.69 now)

Simple “where does Alice live” recall is 46.69 on the full run, up from the stale-index 41.25. Still a growth area. Two-pass re-ranking (factual queries get a similarity boost toward the top pure-similarity match) is the path to closing the rest.

Open-domain questions (J 36.81 then, 42.71 now)

Open-domain is the weakest category at 42.71, though that is up from the stale-index 36.81. Graph-backed systems traverse entity connections natively; Zep's temporal knowledge graph is a strong example of that approach. widemem stores facts flat and leans on the retrieval layer for synthesis, which works for multi-hop but is weaker for broad relationship questions. We tried a lean, entity-aware re-rank to lift this; in a controlled gate it regressed multi-hop and the other categories without improving open-domain, so it was not shipped (the full negative result is in the repo issues). Becoming a graph database is not a goal. Open-domain at 42.71 is a deliberate, accepted limit of the lean flat architecture, and it is on the page precisely because it is the honest weak spot.

Latency

SEARCH AND TOTAL QUERY LATENCY (SECONDS)

Metric        widemem    Mem0     Zep      LangMem
Search p50    0.258      0.148    0.513    17.99
Search p95    0.519      0.200    0.778    59.82
Total p50     0.843      0.708    1.292    18.53
Total p95     1.567      1.440    2.926    60.40

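widemem's search latency sits in the competitive range with Mem0, within 150 ms at p50 (0.258 s versus 0.148 s). Mem0 has the lowest published latency in the set on every metric. widemem, Mem0, and Zep all keep search p95 under one second; on total p95, widemem and Mem0 are the only two systems under two seconds.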

Methodology

Running a fair benchmark is harder than running any benchmark. Here is what we did to keep the numbers comparable with the other systems' published results, and where the comparison is not yet apples-to-apples.

BENCHMARK CONFIGURATION

Parameter            Value                                   Rationale
widemem version      v1.4.1 (re-ingested)                    Current shipped
LLM (all phases)     GPT-4o-mini                             Same as Mem0 paper
Embeddings           text-embedding-3-small                  Same as Mem0 paper
Vector store         FAISS local                             widemem default
Decay                exponential, rate 0.01                  widemem default
Scoring weights      sim 0.5 / imp 0.3 / rec 0.2             widemem default
Top k per speaker    10                                      20 memories total per question
Judge runs           3                                       Mem0 paper uses 10; we use 3 for cost
Scoring pass         single clean pass                       No repair step
Coverage             all 10 conversations, 1,540 questions   Full confirmed run, like-for-like with the field
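
The decay and scoring-weight rows can be read as a weighted ranking score. The sketch below is one plausible reading, not widemem's actual implementation: it assumes the final score is a linear combination of similarity, importance, and recency, that recency is exp(-rate * age), and that age is measured in days. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

# Values taken from the configuration table above.
W_SIMILARITY, W_IMPORTANCE, W_RECENCY = 0.5, 0.3, 0.2
DECAY_RATE = 0.01
TOP_K_PER_SPEAKER = 10


@dataclass
class Memory:
    text: str
    similarity: float  # assumed to lie in [0, 1]
    importance: float  # assumed normalized to [0, 1]
    age_days: float    # assumption: decay is applied per day of age


def recency(age_days: float, rate: float = DECAY_RATE) -> float:
    """Exponential decay: 1.0 for a brand-new memory, falling toward 0."""
    return math.exp(-rate * age_days)


def combined_score(memory: Memory) -> float:
    """Assumed linear blend of the three signals using the table's weights."""
    return (
        W_SIMILARITY * memory.similarity
        + W_IMPORTANCE * memory.importance
        + W_RECENCY * recency(memory.age_days)
    )


def top_k(memories: list[Memory], k: int = TOP_K_PER_SPEAKER) -> list[Memory]:
    """Keep the k highest-scoring memories for one speaker."""
    return sorted(memories, key=combined_score, reverse=True)[:k]
```

Under these assumptions a memory loses about 63% of its recency signal after 100 days, and running `top_k` once per speaker in a two-speaker conversation yields the 20 memories per question listed in the table.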

Pipeline

Caveats and known issues

Raw data and reproducibility

Full JSON results, benchmark runner, and evaluator are in the benchmark directory of the widemem-ai repo. The re-baseline runner is benchmark/run_ws1.py; the original run used benchmark/run_locomo.py. The investigation that found the stale-index error is documented in the repo issues.

What's next

Confirmed run: May 2026, widemem v1.4.1, all 10 conversations (1,540 questions), clean single pass, GPT-4o-mini judge averaged over 3 runs. Original run: March 2026, v1.3.0, superseded by this page. If you want to discuss the numbers, open an issue on GitHub or start at /enterprise.