BENCHMARKS
Measured results on LoCoMo, the standard long-term conversational memory benchmark from Snap Research (ACL 2024). widemem v1.3 vs Mem0, Zep, LangMem, A-Mem, OpenAI Memory, and full-context baselines. Real numbers, including the parts we lose on.
widemem wins multi-hop reasoning (53.31 J, beats every baseline including Mem0) and is the second-most token-efficient system tested (157 tokens per query, 11x more efficient than Mem0). It loses on single-hop factual recall and open-domain questions, which is where we are investing next.
What is LoCoMo
LoCoMo (Long-term Conversational Memory) is the standard benchmark for evaluating AI memory systems, published by Snap Research at ACL 2024 (Maharana et al.). It is the benchmark used by Mem0, Zep, LangMem, A-Mem, and MemMachine in their own published numbers. We ran widemem against the same test set under the same conditions so the comparison is apples to apples.
- 10 extended conversations between pairs of people
- Each conversation: 19-35 sessions spanning weeks to months
- 5,882 total dialogue turns across all conversations
- 1,540 evaluable questions across 4 categories (single-hop, multi-hop, open-domain, temporal)
- Primary metric: LLM-as-a-Judge J score (GPT-4o-mini judges correct vs wrong, averaged over 3 runs per question)
Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents. Dataset: github.com/snap-research/locomo.
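To make the primary metric concrete, here is a minimal sketch of how a J score is computed under the setup above: each question's judge verdicts (3 runs here) are averaged, then averaged across questions. The function name is ours, not the LoCoMo harness's.

```python
from statistics import mean

def j_score(judgments: list[list[bool]]) -> float:
    """Average LLM-as-a-Judge verdicts into a single J score (0-100).

    judgments[q][r] is True when the judge marked question q correct
    on run r (3 runs per question in this benchmark setup).
    """
    per_question = [mean(1.0 if v else 0.0 for v in runs) for runs in judgments]
    return 100.0 * mean(per_question)

# Two questions, 3 judge runs each: one always correct, one correct 2/3 runs.
print(round(j_score([[True, True, True], [True, True, False]]), 2))  # -> 83.33
```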
Headline: overall J score
On overall J score, widemem is mid-pack. We publish this because hiding it would be dishonest, and because the headline number masks where widemem genuinely wins: the overall average spans four categories, and widemem is strong in one of them (multi-hop) and weaker in the other three.
Where widemem wins: multi-hop reasoning
Multi-hop questions are the hardest category. They require connecting facts across multiple sessions, sometimes weeks apart. widemem outperforms every baseline we tested, including Mem0 (+2.16), LangMem (+5.39), Mem0^g (+6.12), Zep (+11.96), full-context (+10.39), and A-Mem (+34.46). The importance-weighted retrieval filters to highly-relevant, high-importance facts, which is exactly what multi-hop synthesis needs.
Where widemem wins: token efficiency
widemem uses 11x fewer tokens than Mem0, 25x fewer than Zep, and 166x fewer than full-context. Only LangMem uses fewer. Token efficiency matters because every retrieved memory gets prepended to the answer-generation call; paying for 4,000 tokens per query when 157 would do is a real operating cost.
J per 1,000 tokens (efficiency)
| System | J score | Avg tokens / query | J / 1k tokens | Efficiency vs Mem0 |
|---|---|---|---|---|
| LangMem | 58.10 | 127 | 457 | 12.1x |
| widemem v1.3 | 45.32 | 157 | 288 | 7.6x |
| Mem0 | 66.88 | 1,764 | 38 | 1.0x (baseline) |
| A-Mem | 48.38 | 2,520 | 19 | 0.5x |
| Mem0^g | 68.44 | 3,616 | 19 | 0.5x |
| Zep | 65.99 | 3,911 | 17 | 0.4x |
| OpenAI Memory | 52.90 | 4,437 | 12 | 0.3x |
| Full-context | 72.90 | 26,031 | 3 | 0.07x |
Every token widemem spends carries 7.6x more answer-quality signal than Mem0. For workloads where context cost is a budget constraint (high-volume agents, rate-limited APIs, local-LLM deployments), this matters more than raw J score.
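The efficiency column is simple arithmetic over the two measured quantities, which makes the table easy to verify:

```python
def j_per_1k_tokens(j: float, avg_tokens: float) -> float:
    """Answer-quality signal per 1,000 context tokens spent."""
    return j / avg_tokens * 1000.0

widemem = j_per_1k_tokens(45.32, 157)    # widemem v1.3 row
mem0 = j_per_1k_tokens(66.88, 1764)      # Mem0 row
print(round(widemem), round(mem0), round(widemem / mem0, 1))  # -> 289 38 7.6
```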
Where widemem loses
Being honest about the losses is the point of publishing this page. Three categories where we are weaker and why.
Single-hop factual recall (41.25 vs Mem0 67.13)
Our biggest weakness: 26 points behind Mem0 on “where does Alice live” style questions. Root cause: importance-weighted retrieval sometimes ranks a high-importance general memory (“Caroline is passionate about creating safe spaces”) above a lower-importance but directly relevant one (“Caroline moved from Sweden”). The two-pass re-ranking we added in v1.4 (factual-looking queries boost memories that also rank in the top-k by pure similarity) is designed to close this gap. We have not yet published v1.4 numbers against LoCoMo.
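A minimal sketch of what such a two-pass re-rank can look like. The interface, blend weights, and boost value here are illustrative assumptions, not widemem's actual v1.4 code:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    similarity: float  # cosine similarity to the query
    importance: float  # stored importance weight

def two_pass_rerank(memories, is_factual, k=1, boost=0.2):
    """Pass 1 ranks by a blended similarity/importance score.
    Pass 2, for factual-looking queries, boosts memories that also
    appear in the top-k by pure similarity, so a directly relevant
    fact can outrank a high-importance general memory."""
    def blended(m):
        return 0.5 * m.similarity + 0.3 * m.importance  # recency omitted for brevity
    ranked = sorted(memories, key=blended, reverse=True)
    if not is_factual:
        return ranked[:k]
    sim_top = {id(m) for m in sorted(memories, key=lambda m: m.similarity, reverse=True)[:k]}
    rescored = sorted(ranked, key=lambda m: blended(m) + (boost if id(m) in sim_top else 0.0),
                      reverse=True)
    return rescored[:k]

general = Memory("Caroline is passionate about creating safe spaces", 0.6, 0.95)
specific = Memory("Caroline moved from Sweden", 0.7, 0.2)

print(two_pass_rerank([general, specific], is_factual=False)[0].text)  # general wins
print(two_pass_rerank([general, specific], is_factual=True)[0].text)   # specific wins
```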
Open-domain questions (36.81 vs Zep 76.60)
Worst-in-class for questions needing broad relationship understanding. Zep wins here because its graph structure traverses entity connections natively. widemem stores facts flat and relies on the retrieval layer for synthesis, which works for multi-hop but struggles when the question needs a graph walk. We are not trying to become a graph database, so the fix is better retrieval prompts rather than a new storage model.
Temporal questions (30.53 vs Mem0^g 58.13)
Below most baselines on time-sensitive reasoning. Root cause: current retrieval does not strongly weight temporal metadata. Timestamps are stored and used for decay, but not as a primary retrieval signal. The query-adaptive scoring we added in v1.4 boosts recency for queries that look temporal, which we expect to move this number materially.
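A sketch of how query-adaptive recency weighting might work. The cue list, weights, and function name are illustrative assumptions, not widemem's implementation:

```python
import re

# Hypothetical cue words suggesting a query is time-sensitive.
TEMPORAL_CUES = re.compile(
    r"\b(when|last|yesterday|ago|recently|latest|before|after)\b", re.IGNORECASE
)

def recency_weight(query: str, base: float = 0.2, boosted: float = 0.5) -> float:
    """Return a larger recency weight for temporal-looking queries,
    so fresh memories rank higher in the blended retrieval score."""
    return boosted if TEMPORAL_CUES.search(query) else base

print(recency_weight("When did Melanie last go camping?"))  # -> 0.5
print(recency_weight("Where does Alice live?"))             # -> 0.2
```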
Latency
| Metric (seconds) | widemem v1.3 | Mem0 | Zep | LangMem |
|---|---|---|---|---|
| Search p50 | 0.258 | 0.148 | 0.513 | 17.99 |
| Search p95 | 0.519 | 0.200 | 0.778 | 59.82 |
| Total p50 | 0.843 | 0.708 | 1.292 | 18.53 |
| Total p95 | 1.567 | 1.440 | 2.926 | 60.40 |
Competitive with Mem0, significantly faster than Zep, and orders of magnitude faster than LangMem. Mem0 is the latency leader by a small margin; we are within 150ms at p50.
Methodology
Running a fair benchmark is harder than running any benchmark. Here's what we did to make the numbers comparable with the other systems' published results.
| Parameter | Value | Rationale |
|---|---|---|
| LLM (all phases) | GPT-4o-mini | Same as Mem0 paper |
| Embeddings | text-embedding-3-small | Same as Mem0 paper |
| Vector store | FAISS local | widemem default |
| Decay | exponential, rate 0.01 | widemem default |
| Scoring weights | sim 0.5 / imp 0.3 / rec 0.2 | widemem default |
| Top k per speaker | 10 | 20 memories total per question |
| Judge runs | 3 | Mem0 paper uses 10; we used 3 for cost |
| Hierarchy | disabled | Fair comparison with flat baselines |
| Active retrieval | disabled | Fair comparison with baselines |
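Read together, the decay and scoring-weight rows describe one blended retrieval score. A sketch under the assumption that recency is the exponentially decayed memory age; the exact function shape is ours, only the parameters come from the table:

```python
import math

def blended_score(similarity: float, importance: float, age_days: float,
                  w_sim: float = 0.5, w_imp: float = 0.3, w_rec: float = 0.2,
                  decay_rate: float = 0.01) -> float:
    """Blend the benchmark's three signals: sim 0.5 / imp 0.3 / rec 0.2,
    with recency decaying exponentially at rate 0.01 per day."""
    recency = math.exp(-decay_rate * age_days)  # 1.0 when fresh, falls toward 0
    return w_sim * similarity + w_imp * importance + w_rec * recency

# A fresh, highly similar memory vs a 90-day-old, high-importance one:
print(round(blended_score(0.9, 0.4, age_days=0), 3))   # -> 0.77
print(round(blended_score(0.5, 0.9, age_days=90), 3))  # -> 0.601
```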
Pipeline
- Phase 1 (ingestion): feed all 5,882 conversation turns into widemem. Each turn runs through extraction, conflict resolution, and FAISS storage. 8,972 memories created. Ratio: 1.52 memories per turn.
- Phase 2 (Q&A): run 1,540 questions. Search memories for each speaker, build prompt, GPT-4o-mini generates answer (max 100 tokens). Record latency and tokens.
- Phase 3 (judge): score predictions against ground truth. F1 and BLEU computed locally. LLM-as-a-Judge runs 3 times per question, averaged to produce J.
Total run time: 34 hours across ingestion, Q&A, and judging. Total API calls: approximately 24,000. Total cost: about $4.
Caveats and known issues
- v1.3, not v1.4. These numbers are from widemem v1.3. Several v1.4 changes (query-adaptive scoring, two-pass re-ranking, improved extraction prompts) specifically target the single-hop and temporal weaknesses. A v1.4 pass is on the near-term roadmap.
- Clean vs repaired J gap. Phase 3 hit OpenAI rate limits mid-run. We built a chunked, resumable evaluator and ran a repair pass. Clean predictions (never rate-limited) scored J=98.16 on average; repaired predictions scored J=4.64. The 93-point gap warrants investigation with a fully clean single-pass run. This is the biggest methodological question mark in our results.
- Adversarial excluded. LoCoMo includes 446 adversarial questions with no ground-truth answers. All systems exclude these from scoring.
- No hierarchy, no active retrieval. We disabled widemem's hierarchical memory and active retrieval features to compare like-for-like with flat-memory baselines. Enabling them may improve some categories at the cost of comparability.
Raw data and reproducibility
Full JSON results, the benchmark runner, and the evaluator are in the benchmark directory of the widemem-ai repo. You can reproduce the run with `python benchmark/run_locomo.py`. The markdown report we used to prepare this page is at `benchmark/BENCHMARK_RESULTS.md`.
What's next
- v1.4 pass on LoCoMo to measure the query-adaptive scoring and two-pass re-ranking improvements
- LongMemEval pass (a second standard benchmark in the memory space)
- A single-pass clean run to close the repair-gap methodological question
- Memory-footprint and throughput benchmarks at 100k, 500k, and 1M memories (for the self-hosting page)
Benchmark run: March 16-18, 2026. widemem v1.3.0. LoCoMo dataset (Maharana et al., 2024). GPT-4o-mini for LLM, text-embedding-3-small for embeddings. If you want to discuss the numbers, open an issue on GitHub or start at /enterprise.