BENCHMARKS
Measured results for widemem v1.4.1 on LoCoMo, the standard long-term conversational memory benchmark from Snap Research (ACL 2024). Published reference numbers from Mem0, Zep, LangMem, A-Mem, OpenAI Memory, and a full-context baseline are included for context.
widemem v1.4.1 scores 54.81 overall J on the full 1,540-question LoCoMo set. Its strongest result relative to the field is multi-hop reasoning at 57.27, ahead of every reference system. It is also among the most token-efficient in the field at about 214 tokens per query. Open-domain (42.71) is the weakest category and an accepted limit of the lean flat architecture. An earlier version of this page reported 45.32 and called widemem mid-pack; that figure was measured against a stale index built by an older version and understated the shipped library. The correction, the cause, and a failed attempt to lift open-domain are all documented below, because getting this wrong and fixing it in public is more useful than a clean story.
What is LoCoMo
LoCoMo (Long-term Conversational Memory) is the standard benchmark for evaluating AI memory systems, published by Snap Research at ACL 2024 (Maharana et al.). It is the benchmark used by Mem0, Zep, LangMem, A-Mem, and MemMachine in their own published numbers. We ran widemem against the same test set so the numbers sit alongside theirs on like-for-like terms.
- 10 extended conversations between pairs of people
- Each conversation: 19-35 sessions spanning weeks to months
- 5,882 total dialogue turns across all conversations
- 1,540 evaluable questions across 4 categories (single-hop, multi-hop, open-domain, temporal)
- Primary metric: LLM-as-a-Judge J score (GPT-4o-mini judges correct vs wrong, averaged over 3 runs per question)
Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents. Dataset: github.com/snap-research/locomo.
These widemem numbers are the full 1,540-question v1.4.1 run, a single clean pass, no repair step, like-for-like with the other systems' published full runs. An earlier version of this page carried a six-conversation estimate of about 56; the full set came in at 54.81, a small downward revision we are stating plainly rather than quietly rounding. The prior 45.32 figure was a stale-index measurement and is explained below.
Headline: overall J score
An earlier version of this page reported 45.32 and called widemem mid-pack. That number was measured against a memory index built by widemem v1.3.0 and never rebuilt. v1.4.0 changed extraction to resolve relative dates (“yesterday”, “last week”) to absolute dates at write time. That change shipped on PyPI and was never re-benchmarked, because every run reused the v1.3.0 index. Re-ingesting all ten conversations with the current v1.4.1 code and re-scoring the full 1,540-question set gives 54.81. This is the confirmed number, not an estimate.
Multi-hop reasoning
Multi-hop questions are the hardest category. They require connecting facts across multiple sessions, sometimes weeks apart. This is widemem's strongest result relative to the field on LoCoMo: 57.27 J, ahead of every reference system in the set. Importance-weighted retrieval narrows the candidates to a small set of high-relevance, high-importance facts, which is what multi-hop synthesis needs. Published numbers for the other systems are linked in the methodology section.
Token efficiency
widemem delivers about 214 tokens per query, roughly an order of magnitude leaner than the graph and large-context systems. This matters because every retrieved memory gets prepended to the answer-generation call, so it sets operating cost, rate-limit pressure, and latency in production. Compact context is a deliberate design choice: importance-weighted retrieval surfaces a few high-signal memories instead of a large pool of mediocre ones.
J per 1,000 tokens (efficiency)
| System | J score | Avg tokens | J / 1k tokens |
|---|---|---|---|
| LangMem | 58.10 | 127 | 457 |
| widemem v1.4.1 | 54.81 | 214 | 256 |
| Mem0 | 66.88 | 1,764 | 38 |
| A-Mem | 48.38 | 2,520 | 19 |
| Mem0^g | 68.44 | 3,616 | 19 |
| Zep | 65.99 | 3,911 | 17 |
| OpenAI Memory | 52.90 | 4,437 | 12 |
| Full-context | 72.90 | 26,031 | 3 |
For workloads where context cost is a real budget constraint (high-volume agents, rate-limited APIs, local-LLM deployments), answer quality per token is the number that decides the bill. The full field is shown so the trade-off is visible: systems that index more context tend to reach higher raw J at a large token cost.
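As a quick check on the efficiency column, the ratio can be recomputed from the J scores and average token counts in the table. This is a minimal sketch: the dictionary and function names are illustrative, and only the numbers come from this page.

```python
# J score and average tokens per query, copied from the efficiency table.
RESULTS = {
    "LangMem": (58.10, 127),
    "widemem v1.4.1": (54.81, 214),
    "Mem0": (66.88, 1764),
    "A-Mem": (48.38, 2520),
    "Mem0^g": (68.44, 3616),
    "Zep": (65.99, 3911),
    "OpenAI Memory": (52.90, 4437),
    "Full-context": (72.90, 26031),
}


def j_per_1k_tokens(j_score: float, avg_tokens: float) -> float:
    """Return J points per 1,000 tokens of retrieved context."""
    return j_score / avg_tokens * 1000


# Round to whole numbers, as the table does.
efficiency = {
    name: round(j_per_1k_tokens(j, tokens)) for name, (j, tokens) in RESULTS.items()
}

# Print from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {value:4d}")
```

The ratio rewards small contexts heavily, so it is best read next to raw J rather than instead of it.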
Category breakdown, including the weak spot
The full v1.4.1 run changes the per-category story sharply from the old v1.3.0 numbers.
Temporal questions (J 30.53 then, 59.09 now)
This is the category the stale index hid. v1.3.0 extraction stored relative time references unresolved, so “Caroline went yesterday” never became a date and time questions were effectively unanswerable, scoring 30.53. v1.4.0 changed extraction to resolve relative dates to absolute dates at write time. On the full run temporal is 59.09, one of widemem's stronger categories, not a weakness. This single fix, already shipped, is most of the gap between the old 45.32 and the confirmed 54.81.
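To illustrate what resolving relative dates at write time means, here is a minimal sketch. It is not widemem's extraction code: the phrase table, the function name, and the session date are all hypothetical, and a real extractor would need to handle many more phrasings.

```python
from datetime import date, timedelta

# Hypothetical phrase-to-offset table (in days); not widemem's actual rules.
# "last week" is approximated as exactly seven days earlier.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "last week": -7,
}


def resolve_relative_dates(text: str, session_date: date) -> str:
    """Replace known relative phrases with ISO dates anchored to the session date."""
    resolved = text
    for phrase, offset_days in RELATIVE_OFFSETS.items():
        absolute = session_date + timedelta(days=offset_days)
        resolved = resolved.replace(phrase, f"on {absolute.isoformat()}")
    return resolved


# The session date here is made up for the example.
fact = resolve_relative_dates("Caroline went yesterday", date(2023, 5, 8))
print(fact)  # Caroline went on 2023-05-07
```

Stored this way, the fact stays answerable however long after the session the question is asked. An unresolved "yesterday" loses its meaning as soon as the session ends.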
Single-hop factual recall (J 41.25 then, 46.69 now)
Simple “where does Alice live” recall is 46.69 on the full run, up from the stale-index 41.25. It remains a growth area. Two-pass re-ranking (factual queries get a similarity boost toward the top pure-similarity match) is the planned path to closing the remaining gap.
Open-domain questions (J 36.81 then, 42.71 now)
The weakest category at 42.71, though up from the old 36.81. Graph-backed systems traverse entity connections natively; Zep's temporal knowledge graph is a strong example of that approach. widemem stores facts flat and leans on the retrieval layer for synthesis, which works for multi-hop but is weaker for broad relationship questions. We tried a lean, entity-aware re-rank to lift this; in a controlled gate it regressed multi-hop and the other categories without improving open-domain, so it was not shipped (the full negative result is in the repo issues). Becoming a graph database is not a goal. Open-domain at 42.71 is a deliberate, accepted limit of the lean flat architecture, and it is on the page precisely because it is the honest weak spot.
Latency
| Metric (seconds) | widemem | Mem0 | Zep | LangMem |
|---|---|---|---|---|
| Search p50 | 0.258 | 0.148 | 0.513 | 17.99 |
| Search p95 | 0.519 | 0.200 | 0.778 | 59.82 |
| Total p50 | 0.843 | 0.708 | 1.292 | 18.53 |
| Total p95 | 1.567 | 1.440 | 2.926 | 60.40 |
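widemem's search latency sits in the competitive range with Mem0, about 110 ms behind at p50 (0.258 s vs 0.148 s). Mem0 has the lowest published latency in the set. On search p95, widemem, Mem0, and Zep are all sub-second; on total p95, widemem and Mem0 are the only two systems under 2 seconds.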
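For readers reproducing the latency rows, p50 and p95 can be derived from recorded per-query timings roughly as follows. This is a sketch using Python's standard library; the benchmark runner's actual percentile method is not stated on this page, so the inclusive interpolation below is an assumption, as is the function name.

```python
import statistics


def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """Return p50 and p95 (same unit as the samples) from per-query timings."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Synthetic timings for illustration only; these are not benchmark data.
example = latency_percentiles([float(i) for i in range(1, 102)])
print(example)  # {'p50': 51.0, 'p95': 96.0}
```

Different interpolation methods can shift p95 noticeably on small samples, so the method should be held fixed when comparing runs.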
Methodology
Running a fair benchmark is harder than running any benchmark. Here is what we did to keep the numbers comparable with the other systems' published results, and where the comparison is not yet apples-to-apples.
| Parameter | Value | Rationale |
|---|---|---|
| widemem version | v1.4.1 (re-ingested) | Current shipped |
| LLM (all phases) | GPT-4o-mini | Same as Mem0 paper |
| Embeddings | text-embedding-3-small | Same as Mem0 paper |
| Vector store | FAISS local | widemem default |
| Decay | exponential, rate 0.01 | widemem default |
| Scoring weights | sim 0.5 / imp 0.3 / rec 0.2 | widemem default |
| Top k per speaker | 10 | 20 memories total per question |
| Judge runs | 3 | Mem0 paper uses 10; we use 3 for cost |
| Scoring pass | single clean pass | No repair step |
| Coverage | all 10 conversations, 1,540 questions | Full confirmed run, like-for-like with the field |
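
The scoring weights and decay rate in the table can be read as a weighted blend of three signals. The sketch below shows one plausible form. It assumes a linear combination, inputs normalised to the 0-1 range, and memory age measured in days; none of these is confirmed on this page, and the function names are illustrative rather than widemem's API.

```python
import math

# Values taken from the parameter table.
SIM_WEIGHT, IMP_WEIGHT, REC_WEIGHT = 0.5, 0.3, 0.2
DECAY_RATE = 0.01


def recency(age_days: float, rate: float = DECAY_RATE) -> float:
    """Exponential decay: 1.0 for a brand-new memory, approaching 0 as it ages.

    Measuring age in days is an assumption; the page does not state the unit.
    """
    return math.exp(-rate * age_days)


def combined_score(similarity: float, importance: float, age_days: float) -> float:
    """Blend similarity, importance, and recency with the table's weights."""
    return (
        SIM_WEIGHT * similarity
        + IMP_WEIGHT * importance
        + REC_WEIGHT * recency(age_days)
    )


# Same similarity and importance, different ages: the fresher memory ranks higher.
fresh = combined_score(similarity=0.8, importance=0.5, age_days=0)
old = combined_score(similarity=0.8, importance=0.5, age_days=100)
print(f"fresh={fresh:.3f} old={old:.3f}")
```

Under these assumptions recency carries at most 0.2 of the score, so a highly similar, important memory can still outrank a fresher but weaker match.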
Pipeline
- Phase 1 (ingestion): re-ingest the conversation turns into widemem v1.4.1. Each turn runs through extraction, conflict resolution, and FAISS storage. This is the step the original run skipped by reusing a v1.3.0 index, which is what produced the stale 45.32.
- Phase 2 (Q&A): run the questions. For each question, search memories for each speaker, build the prompt, and have GPT-4o-mini generate the answer. Record latency and tokens.
- Phase 3 (judge): score predictions against ground truth. F1 and BLEU computed locally. LLM-as-a-Judge runs 3 times per question, averaged to produce J. Single clean pass, no repair step.
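The averaging in Phase 3 can be sketched as below. This assumes each judge run returns a binary correct/wrong verdict, and that J is the mean of per-question averages scaled to 0-100. The function name and sample verdicts are illustrative and not taken from the benchmark runner.

```python
from statistics import mean


def j_score(verdicts_by_question: dict[str, list[bool]]) -> float:
    """Average the judge verdicts per question, then across questions (0-100)."""
    per_question = [
        mean([1.0 if verdict else 0.0 for verdict in verdicts])
        for verdicts in verdicts_by_question.values()
    ]
    return 100 * mean(per_question)


# Three judge runs per question; made-up verdicts for illustration.
sample = {
    "q1": [True, True, True],
    "q2": [True, False, False],
    "q3": [False, False, False],
}
j = j_score(sample)
print(f"{j:.2f}")  # 44.44
```

Averaging over several runs smooths out judge variance on borderline answers, which is why the number of runs appears as a methodology parameter.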
Caveats and known issues
- Open-domain is an accepted limit. At 42.71 it is the weakest category. A lean entity-aware re-rank was built and gated to lift it; it regressed multi-hop and the other categories without improving open-domain, so it was not shipped. The full negative result is recorded in the widemem-ai repo issues. Open-domain is a deliberate trade of the lean, no-graph architecture, not an unfixed bug.
- Small revision from the earlier estimate. A prior version of this page carried a six-conversation estimate of about 56. The full 1,540-question run came in at 54.81. Stated plainly rather than quietly rounded.
- Re-baseline, not the original run. These numbers are widemem v1.4.1 re-ingested with current code. The earlier 45.32 was a v1.3.0-era measurement on an index that was never rebuilt after the v1.4.0 extraction change shipped.
- Repair-gap, resolved. An earlier run hit OpenAI rate limits and used a repair pass, which produced a large clean-vs-repaired score gap. A separate clean single-pass run confirmed the old 45.32 was a real measurement, not a repair artifact, and the re-baseline uses the clean single-pass method throughout. No repair step is involved in these numbers.
- Adversarial excluded. LoCoMo includes adversarial questions with no ground-truth answers. All systems exclude these from scoring.
- No hierarchy, no active retrieval. widemem's hierarchical memory and active retrieval are disabled here to compare like-for-like with flat-memory baselines. Enabling them may shift some categories at the cost of comparability.
Raw data and reproducibility
Full JSON results, benchmark runner, and evaluator are in the benchmark directory of the widemem-ai repo. The re-baseline runner is benchmark/run_ws1.py; the original run used benchmark/run_locomo.py. The investigation that found the stale-index error is documented in the repo issues.
What's next
- LongMemEval pass (a second standard benchmark in the memory space)
- Memory-quality work: single-call ADD/UPDATE/DELETE consolidation and dedup
- Memory-footprint and throughput benchmarks at 100k, 500k, and 1M memories (for the self-hosting page)
Confirmed run: May 2026, widemem v1.4.1, all 10 conversations (1,540 questions), clean single pass, GPT-4o-mini judge averaged over 3 runs. Original run: March 2026, v1.3.0, superseded by this page. If you want to discuss the numbers, open an issue on GitHub or start at /enterprise.