BENCHMARKS

Measured results on LoCoMo, the standard long-term conversational memory benchmark from Snap Research (ACL 2024). widemem v1.3 vs Mem0, Zep, LangMem, A-Mem, OpenAI Memory, and full-context baselines. Real numbers, including the parts we lose on.

TL;DR

widemem wins multi-hop reasoning (53.31 J, beats every baseline including Mem0) and is the second-most token-efficient system tested (157 tokens per query, 11x more efficient than Mem0). It loses on single-hop factual recall and open-domain questions, which is where we are investing next.

What is LoCoMo

LoCoMo (Long-term Conversational Memory) is the standard benchmark for evaluating AI memory systems, published by Snap Research at ACL 2024 (Maharana et al.). It is the benchmark used by Mem0, Zep, LangMem, A-Mem, and MemMachine in their own published numbers. We ran widemem against the same test set under the same conditions so the comparison is apples to apples.

Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents. Dataset: github.com/snap-research/locomo.

Headline: overall J score

LOCOMO OVERALL J SCORE (HIGHER IS BETTER)
J score averaged across single-hop, multi-hop, open-domain, and temporal categories (n=1540 questions)
Full-context
72.90
Mem0^g
68.44
Mem0
66.88
Zep
65.99
LangMem
58.10
OpenAI Memory
52.90
A-Mem
48.38
widemem v1.3
45.32

On overall J score, widemem is mid-pack. We publish this because hiding it would be dishonest and because the headline number masks the places where widemem genuinely wins. The overall average spans four categories; widemem is strong in one of them (multi-hop) and weak in the other three.

Where widemem wins: multi-hop reasoning

MULTI-HOP J SCORE
Questions requiring synthesis across multiple sessions (n=841, the largest category)
widemem v1.3
53.31
Mem0
51.15
LangMem
47.92
Mem0^g
47.19
Full-context
42.92
Zep
41.35
A-Mem
18.85

Multi-hop questions are the hardest category: they require connecting facts across multiple sessions, sometimes weeks apart. widemem outperforms every baseline we tested, including Mem0 (+2.16), LangMem (+5.39), Mem0^g (+6.12), full-context (+10.39), Zep (+11.96), and A-Mem (+34.46). Importance-weighted retrieval filters context down to highly relevant, high-importance facts, which is exactly what multi-hop synthesis needs.
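The benchmark configuration below lists scoring weights of sim 0.5 / imp 0.3 / rec 0.2 and exponential decay at rate 0.01. A minimal sketch of a blend with that shape, not widemem's actual implementation:

```python
import math

def score_memory(similarity: float, importance: float, age_days: float,
                 w_sim: float = 0.5, w_imp: float = 0.3, w_rec: float = 0.2,
                 decay_rate: float = 0.01) -> float:
    """Blend similarity, importance, and recency into one retrieval score.

    Weights and decay rate mirror the benchmark configuration
    (sim 0.5 / imp 0.3 / rec 0.2, exponential decay at rate 0.01);
    the real scoring code may differ in detail.
    """
    recency = math.exp(-decay_rate * age_days)  # exponential time decay
    return w_sim * similarity + w_imp * importance + w_rec * recency

# A relevant, high-importance fact outranks a merely similar one:
hub_fact = score_memory(similarity=0.8, importance=0.9, age_days=30)
stray_fact = score_memory(similarity=0.9, importance=0.2, age_days=30)
assert hub_fact > stray_fact
```

The recency term decays smoothly rather than cutting off, so old but important facts stay retrievable.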

Where widemem wins: token efficiency

AVERAGE TOKENS PER QUERY (LOWER IS BETTER)
Total memory context delivered to the answer-generation LLM per question
LangMem
127
widemem v1.3
157
Mem0
1,764
A-Mem
2,520
Mem0^g
3,616
Zep
3,911
OpenAI Memory
4,437
Full-context
26,031

widemem uses 11x fewer tokens than Mem0, 25x fewer than Zep, and 166x fewer than full-context. Only LangMem uses fewer. Token efficiency matters because every retrieved memory gets prepended to the answer-generation call; paying for 4,000 tokens per query when 157 would do is a real operating cost.
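To put that in dollars, a back-of-envelope sketch; the $0.15 per 1M input tokens price is a placeholder assumption (roughly GPT-4o-mini-class input pricing), not a quote:

```python
# Hypothetical input price: $0.15 per 1M tokens. Actual pricing varies.
PRICE_PER_TOKEN = 0.15 / 1_000_000

def context_cost(tokens_per_query: float, queries: int) -> float:
    """Dollar cost of the memory context prepended across all queries."""
    return tokens_per_query * queries * PRICE_PER_TOKEN

# Memory-context cost for 1M queries at each system's average footprint:
widemem = context_cost(157, 1_000_000)     # ~$23.55
mem0 = context_cost(1_764, 1_000_000)      # ~$264.60
full = context_cost(26_031, 1_000_000)     # ~$3,904.65
```

The ratios hold at any price point; only the absolute dollar figures depend on the assumed rate.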

J per 1,000 tokens (efficiency)

ANSWER QUALITY PER TOKEN (HIGHER IS BETTER)
System           J score   Avg tokens   J / 1k tokens   vs Mem0
LangMem          58.10     127          457             12.1x
widemem v1.3     45.32     157          288             7.6x
Mem0             66.88     1,764        38              1.0x (baseline)
A-Mem            48.38     2,520        19              0.5x
Mem0^g           68.44     3,616        19              0.5x
Zep              65.99     3,911        17              0.4x
OpenAI Memory    52.90     4,437        12              0.3x
Full-context     72.90     26,031       3               0.07x

Every token widemem spends carries 7.6x more answer-quality signal than Mem0. For workloads where context cost is a budget constraint (high-volume agents, rate-limited APIs, local-LLM deployments), this matters more than raw J score.
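The efficiency column is simply J score divided by average tokens. Recomputing a few rows from the tables above:

```python
# (J score, avg tokens per query) from the published tables above.
systems = {
    "LangMem":      (58.10, 127),
    "widemem v1.3": (45.32, 157),
    "Mem0":         (66.88, 1_764),
    "Full-context": (72.90, 26_031),
}

def j_per_1k(j_score: float, avg_tokens: int) -> float:
    """Answer-quality signal per 1,000 tokens of memory context."""
    return j_score / avg_tokens * 1_000

baseline = j_per_1k(*systems["Mem0"])  # Mem0 is the comparison baseline
for name, (j, toks) in systems.items():
    ratio = j_per_1k(j, toks) / baseline
    print(f"{name}: {j_per_1k(j, toks):.1f} J/1k tokens, {ratio:.2f}x vs Mem0")
```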

Where widemem loses

Being honest about the losses is the point of publishing this page. Three categories where we are weaker and why.

Single-hop factual recall (41.25 vs Mem0 67.13)

Our biggest weakness. 26 points behind Mem0 on “where does Alice live” style questions. Root cause: importance-weighted retrieval sometimes ranks a high-importance general memory (“Caroline is passionate about creating safe spaces”) above a lower-importance but directly relevant one (“Caroline moved from Sweden”). The two-pass re-ranking we added in v1.4 (for factual queries, memories that also appear in the top-k pure-similarity matches get a score boost) is designed to close this gap. We have not yet published v1.4 numbers against LoCoMo.
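One plausible shape for such a two-pass re-rank; the function name, field names, and boost value here are illustrative, not the v1.4 code:

```python
def rerank_factual(memories: list[dict], k: int = 5,
                   boost: float = 0.15) -> list[dict]:
    """Second pass for factual queries: boost memories that are also
    top-k matches on pure similarity. All names are hypothetical."""
    top_by_similarity = {
        id(m) for m in sorted(memories, key=lambda m: m["similarity"],
                              reverse=True)[:k]
    }
    def boosted(m: dict) -> float:
        return m["score"] + (boost if id(m) in top_by_similarity else 0.0)
    return sorted(memories, key=boosted, reverse=True)

# The failure mode from the text: a general high-importance memory
# outscores the directly relevant fact until the boost flips the order.
general = {"text": "Caroline is passionate about safe spaces",
           "score": 0.82, "similarity": 0.55}
specific = {"text": "Caroline moved from Sweden",
            "score": 0.75, "similarity": 0.90}
assert rerank_factual([general, specific], k=1)[0] is specific
```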

Open-domain questions (36.81 vs Zep 76.60)

Worst-in-class for questions needing broad relationship understanding. Zep wins here because its graph structure traverses entity connections natively. widemem stores facts flat and relies on the retrieval layer for synthesis, which works for multi-hop but struggles when the question needs a graph walk. We are not trying to become a graph database, so the fix is better retrieval prompts rather than a new storage model.

Temporal questions (30.53 vs Mem0^g 58.13)

Below most baselines on time-sensitive reasoning. Root cause: current retrieval does not strongly weight temporal metadata. Timestamps are stored and used for decay, but not as a primary retrieval signal. The query-adaptive scoring we added in v1.4 boosts recency for queries that look temporal, which we expect will move this number materially.
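A sketch of what query-adaptive recency boosting can look like; the keyword list and the weight shift are illustrative assumptions, not widemem's actual values:

```python
import re

# Hypothetical temporal-query detector; real heuristics would be broader.
TEMPORAL_HINTS = re.compile(
    r"\b(when|last|recent(ly)?|yesterday|ago|before|after|first time)\b",
    re.IGNORECASE,
)

def adapt_weights(query: str,
                  base=(0.5, 0.3, 0.2)) -> tuple[float, float, float]:
    """Shift weight from similarity toward recency for temporal queries."""
    w_sim, w_imp, w_rec = base
    if TEMPORAL_HINTS.search(query):
        return (w_sim - 0.2, w_imp, w_rec + 0.2)
    return base

assert adapt_weights("When did Caroline move?") == (0.3, 0.3, 0.4)
assert adapt_weights("Where does Alice live?") == (0.5, 0.3, 0.2)
```

Keeping the weights a function of the query, rather than a global constant, means factual and temporal questions no longer compete for one setting.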

Latency

SEARCH AND TOTAL QUERY LATENCY (SECONDS)
Metric       widemem v1.3   Mem0    Zep     LangMem
Search p50   0.258          0.148   0.513   17.99
Search p95   0.519          0.200   0.778   59.82
Total p50    0.843          0.708   1.292   18.53
Total p95    1.567          1.440   2.926   60.40

Competitive with Mem0, significantly faster than Zep, and orders of magnitude faster than LangMem. Mem0 is the latency leader by a small margin; we are within 150ms at p50.
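For readers less familiar with the notation, p50 and p95 are percentiles over per-query latency samples, computed for example as:

```python
import statistics

def percentile(latencies: list[float], q: int) -> float:
    """q-th percentile (1-99) via statistics.quantiles, inclusive method."""
    return statistics.quantiles(latencies, n=100, method="inclusive")[q - 1]

# Ten illustrative per-query latencies (seconds), not real benchmark samples:
samples = [0.21, 0.24, 0.25, 0.26, 0.27, 0.31, 0.45, 0.52, 0.61, 0.98]
p50 = percentile(samples, 50)  # median latency
p95 = percentile(samples, 95)  # tail latency: 95% of queries finish faster
```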

Methodology

Running a fair benchmark is harder than running any benchmark. Here's what we did to make the numbers comparable with the other systems' published results.

BENCHMARK CONFIGURATION
Parameter           Value                         Rationale
LLM (all phases)    GPT-4o-mini                   Same as Mem0 paper
Embeddings          text-embedding-3-small        Same as Mem0 paper
Vector store        FAISS local                   widemem default
Decay               exponential, rate 0.01        widemem default
Scoring weights     sim 0.5 / imp 0.3 / rec 0.2   widemem default
Top k per speaker   10                            20 memories total per question
Judge runs          3                             Mem0 paper uses 10; we used 3 for cost
Hierarchy           disabled                      Fair comparison with flat baselines
Active retrieval    disabled                      Fair comparison with baselines

Pipeline

Total run time: 34 hours across ingestion, Q&A, and judging. Total API calls: approximately 24,000. Total cost: about $4.

Caveats and known issues

Raw data and reproducibility

Full JSON results, benchmark runner, and evaluator are in the benchmark directory of the widemem-ai repo. You can reproduce the run with python benchmark/run_locomo.py. The markdown report we used to prepare this page is at benchmark/BENCHMARK_RESULTS.md.

What's next

Benchmark run: March 16-18, 2026. widemem v1.3.0. LoCoMo dataset (Maharana et al., 2024). GPT-4o-mini for LLM, text-embedding-3-small for embeddings. If you want to discuss the numbers, open an issue on GitHub or start at /enterprise.