Benchmarks

widemem v1.4.1 on LoCoMo, measured in the open.

The standard long-term conversational memory benchmark from Snap Research (ACL 2024). Published reference numbers from Mem0, Zep, LangMem, A-Mem, OpenAI Memory, and a full-context baseline are included for context.

54.81

Overall J, full 1,540-question run

~214

Tokens per query, among the leanest in the field

1-12%

Of the token cost of most reference systems (1,700-26,000 tokens per query; LangMem, at 127, is leaner)

The short version

TL;DR, including the number we got wrong.

widemem v1.4.1 scores 54.81 overall J on the full 1,540-question LoCoMo set, at about 214 tokens per query, where every reference system except LangMem (127) uses roughly 1,700 to 26,000. That is the honest headline: mid-pack accuracy at roughly 1-12% of the token cost of those systems. Per category (official mapping): temporal 59.09, single-hop 57.27, multi-hop 46.69, open-domain 42.71. This page has now carried two public corrections. First, an earlier version reported 45.32, measured on a stale index; the fresh re-ingest gave 54.81. Second, our harness had the single-hop and multi-hop labels transposed, so we spent two months calling widemem a multi-hop leader when the 57.27 was measured on single-hop questions. That claim is retracted. Both corrections, their causes, and a failed attempt to lift open-domain are documented below, because getting this wrong and fixing it in public is more useful than a clean story.

An earlier version of this page reported 45.32 and called widemem mid-pack. That number was measured against a memory index built by widemem v1.3.0 and never rebuilt, before the v1.4.0 extraction change shipped. We corrected it in public rather than quietly restating the number.

45.32 → 54.81: LoCoMo overall J, corrected in the open

Re-ingesting all ten conversations with the current v1.4.1 code and re-scoring the full 1,540-question set gives 54.81. This is the confirmed number, not an estimate.

Second correction (2026-07-06): our LoCoMo harness had the single-hop and multi-hop category labels transposed. The official LoCoMo evaluation maps category 1 (282 questions) to multi-hop and category 4 (841 questions) to single-hop; our harness had it backwards. Every number this page published as “multi-hop”, including the 57.27 we called ahead of the field, was measured on the 841 single-hop questions. The multi-hop leadership claim is retracted.

“multi-hop 57.27” → single-hop 57.27, multi-hop 46.69: labels corrected to the official LoCoMo mapping

The overall 54.81 and the token numbers are unaffected: they never depended on category labels. Under the correct labels widemem trails Mem0 on both single-hop (57.27 vs 67.13) and multi-hop (46.69 vs 51.15); temporal (59.09) is now the strongest category. The harness mapping is fixed and a full rerun under the official mapping is planned. The retraction lives permanently in the public corrections log.
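Why the overall score survives a label swap can be shown with a toy example. The sketch below uses made-up per-question scores (not benchmark data) and hypothetical helper names; it only illustrates that an overall mean is computed over all questions regardless of category names, while per-category means move with the labels.

```python
from collections import defaultdict

# Toy per-question records: (LoCoMo category id, J for that question).
# Illustrative values only, not benchmark data.
records = [(1, 0.0), (1, 1.0), (4, 1.0), (4, 1.0), (4, 0.0), (2, 1.0)]

# Official mapping as described on this page, and the transposed mapping
# that the harness bug amounted to (categories 1 and 4 swapped).
OFFICIAL = {1: "multi-hop", 2: "temporal", 3: "open-domain", 4: "single-hop"}
TRANSPOSED = {**OFFICIAL, 1: "single-hop", 4: "multi-hop"}

def overall_j(rows):
    """Mean over every question; category labels are never consulted."""
    return 100 * sum(j for _, j in rows) / len(rows)

def per_category_j(rows, label_map):
    """Mean per category name under the given id-to-name mapping."""
    buckets = defaultdict(list)
    for cat, j in rows:
        buckets[label_map[cat]].append(j)
    return {name: 100 * sum(v) / len(v) for name, v in buckets.items()}

official = per_category_j(records, OFFICIAL)
transposed = per_category_j(records, TRANSPOSED)

print("overall:", round(overall_j(records), 2))
print("official:", official)
print("transposed:", transposed)
```

Under the transposed mapping the two affected rows simply trade names; temporal and the overall mean are identical in both outputs.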

Read this before the numbers

These widemem numbers are the full 1,540-question v1.4.1 run, a single clean pass, no repair step, like-for-like with the other systems' published full runs. An earlier version of this page carried a six-conversation estimate of about 56; the full set came in at 54.81, a small downward revision we are stating plainly rather than quietly rounding. The prior 45.32 figure was a stale-index measurement, and per-category labels published before 2026-07-06 had single-hop and multi-hop transposed. Both are explained below.

What is LoCoMo

The benchmark the whole field reports on.

LoCoMo (Long-term Conversational Memory) is the standard benchmark for evaluating AI memory systems, published by Snap Research at ACL 2024 (Maharana et al.). It is the benchmark used by Mem0, Zep, LangMem, A-Mem, and MemMachine in their own published numbers. We ran widemem against the same test set so the numbers sit alongside theirs on like-for-like terms.

10 extended conversations between pairs of people
Each conversation: 19-35 sessions spanning weeks to months
5,882 total dialogue turns across all conversations
1,540 evaluable questions across 4 categories (single-hop, multi-hop, open-domain, temporal)
Primary metric: LLM-as-a-Judge J score (GPT-4o-mini judges each answer correct or wrong; in our runs the verdict is averaged over 3 judge runs per question)

Paper: Evaluating Very Long-Term Conversational Memory of LLM Agents. Dataset: github.com/snap-research/locomo.
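As a rough illustration of how a J score can be aggregated, here is a minimal sketch. It assumes each judge run returns a binary verdict (1 for correct, 0 for wrong), that verdicts are averaged per question first, and that the per-question averages are then averaged and scaled to 0-100. The function name and the toy verdicts are hypothetical; the actual evaluator in the repo may aggregate differently.

```python
from statistics import mean

def j_score(verdicts_per_question):
    """Aggregate binary judge verdicts into a 0-100 J score.

    verdicts_per_question: one inner list per question, holding the
    0/1 verdict from each judge run for that question.
    """
    per_question = [mean(runs) for runs in verdicts_per_question]
    return 100 * mean(per_question)

# Three toy questions with three judge runs each (illustrative values).
toy_verdicts = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
print(round(j_score(toy_verdicts), 2))
```

Averaging per question before averaging across questions keeps a question with a split verdict from counting as fully right or fully wrong.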

Headline

Overall J score.

LOCOMO OVERALL J SCORE (HIGHER IS BETTER)

J averaged across single-hop, multi-hop, open-domain, temporal (n=1540). widemem and all reference systems: full-run numbers

System	Overall J
Full-context	72.90
Mem0^g	68.44
Mem0	66.88
Zep	65.99
LangMem	58.10
widemem v1.4.1	54.81
OpenAI Memory	52.90
A-Mem	48.38

The 45.32 that an earlier version of this page reported was measured against a memory index built by widemem v1.3.0 and never rebuilt. v1.4.0 changed extraction to resolve relative dates (“yesterday”, “last week”) to absolute dates at write time. That change shipped on PyPI but was never re-benchmarked, because every run reused the v1.3.0 index. Re-ingesting all ten conversations with the current v1.4.1 code and re-scoring the full 1,540-question set gives 54.81, the confirmed number rather than an estimate.

Multi-hop, correctly labeled

Multi-hop reasoning, after the label correction.

MULTI-HOP J SCORE (OFFICIAL LOCOMO MAPPING, 282 QUESTIONS)

Questions requiring synthesis across multiple sessions

System	Multi-hop J
Mem0	51.15
LangMem	47.92
Mem0^g	47.19
widemem v1.4.1	46.69
Full-context	42.92
Zep	41.35
A-Mem	18.85

An earlier version of this section claimed widemem led every reference system on multi-hop at 57.27. That was wrong. Our harness had the LoCoMo category labels transposed, so the 57.27 was measured on the 841 single-hop questions. Under the official mapping widemem scores 46.69 on the 282 multi-hop questions: mid-pack, behind Mem0 (51.15), LangMem (47.92), and Mem0^g (47.19), ahead of full-context, Zep, and A-Mem. The reference numbers above were always under the official mapping (they come from the systems' own papers), which is exactly why comparing our transposed row against them was misleading. Retraction details are in the corrections log.

Token efficiency

Answer quality per token is what pays the bill.

AVERAGE TOKENS PER QUERY (LOWER IS BETTER)

Total memory context delivered to the answer-generation LLM per question

System	Avg tokens per query
LangMem	127
widemem v1.4.1	~214
Mem0	1,764
A-Mem	2,520
Mem0^g	3,616
Zep	3,911
OpenAI Memory	4,437
Full-context	26,031

widemem delivers about 214 tokens per query, roughly an order of magnitude leaner than the graph and large-context systems. This matters because every retrieved memory gets prepended to the answer-generation call, so it sets operating cost, rate-limit pressure, and latency in production. Compact context is a deliberate design choice: importance-weighted retrieval surfaces a few high-signal memories instead of a large pool of mediocre ones.

J per 1,000 tokens (efficiency)

ANSWER QUALITY PER TOKEN (HIGHER IS BETTER)

System	J score	Avg tokens	J / 1k tokens
LangMem	58.10	127	457
widemem v1.4.1	54.81	214	256
Mem0	66.88	1,764	38
A-Mem	48.38	2,520	19
Mem0^g	68.44	3,616	19
Zep	65.99	3,911	17
OpenAI Memory	52.90	4,437	12
Full-context	72.90	26,031	3

For workloads where context cost is a real budget constraint (high-volume agents, rate-limited APIs, local-LLM deployments), answer quality per token is the number that decides the bill. The full field is shown so the trade-off is visible: systems that index more context tend to reach higher raw J at a large token cost.
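The efficiency column is plain arithmetic on the two columns before it: J divided by average tokens, times 1,000. A short sketch that recomputes it from the table values above (the variable and function names are illustrative):

```python
# (J score, average tokens per query), copied from the table above.
rows = {
    "LangMem": (58.10, 127),
    "widemem v1.4.1": (54.81, 214),
    "Mem0": (66.88, 1764),
    "A-Mem": (48.38, 2520),
    "Mem0^g": (68.44, 3616),
    "Zep": (65.99, 3911),
    "OpenAI Memory": (52.90, 4437),
    "Full-context": (72.90, 26031),
}

def j_per_1k(j, tokens):
    """Answer quality per 1,000 tokens of memory context."""
    return j / tokens * 1000

# Round to whole numbers, as the table does.
efficiency = {name: round(j_per_1k(j, t)) for name, (j, t) in rows.items()}

for name, value in efficiency.items():
    print(f"{name:16s} {value:4d}")
```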

Category breakdown

Including the weak spot.

The full v1.4.1 run changes the per-category story sharply from the old v1.3.0 numbers. All labels below use the official LoCoMo mapping (category 1 = multi-hop, 282 questions; category 4 = single-hop, 841 questions). Per-category numbers this page showed before 2026-07-06 had single-hop and multi-hop transposed; the then-and-now pairs below are both relabeled correctly.

Temporal questions (J 30.53 then, 59.09 now)

This is the category the stale index hid. v1.3.0 extraction stored relative time references unresolved, so “Caroline went yesterday” never became a date, and temporal questions were effectively unanswerable, scoring 30.53. v1.4.0 changed extraction to resolve relative dates to absolute dates at write time. On the full run temporal is 59.09, now widemem's strongest category, and it is unaffected by the label transposition. This single fix, already shipped, accounts for most of the gap between the old 45.32 and the confirmed 54.81.
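To make the idea of write-time date resolution concrete, here is a minimal sketch. It is not widemem's extraction code, which runs through an LLM: the function name, the phrase table, and the session date are all hypothetical, and a real extractor would handle far more phrasings.

```python
import re
from datetime import date, timedelta

# Hypothetical phrase table; only a handful of phrasings are covered.
# "last week" is collapsed to a single date here, which is lossy.
RELATIVE_OFFSETS = {
    "yesterday": timedelta(days=-1),
    "today": timedelta(days=0),
    "tomorrow": timedelta(days=1),
    "last week": timedelta(weeks=-1),
}

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in RELATIVE_OFFSETS) + r")\b",
    re.IGNORECASE,
)

def resolve_relative_dates(text, session_date):
    """Replace relative time phrases with ISO dates anchored at session_date."""
    def _sub(match):
        resolved = session_date + RELATIVE_OFFSETS[match.group(0).lower()]
        return f"on {resolved.isoformat()}"
    return _PATTERN.sub(_sub, text)

# The session date is an assumed example value.
stored = resolve_relative_dates("Caroline went yesterday", date(2023, 5, 8))
print(stored)
```

Resolving at write time means the stored memory carries an absolute date, so a later question about when something happened can be answered without knowing when the original turn was spoken.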

Single-hop factual recall (J 53.31 then, 57.27 now)

Simple “where does Alice live” recall is 57.27 on the full run, measured on the 841 single-hop questions. This is the number an earlier version of this page mislabeled as multi-hop. It still trails Mem0's published single-hop 67.13 by about 10 points, so it is a growth area, not a bragging right. Two-pass re-ranking (factual queries get a similarity boost toward the top pure-similarity match) is the path to closing the gap.

Multi-hop synthesis (J 41.25 then, 46.69 now)

Connecting facts across sessions scores 46.69 on the 282 multi-hop questions, up from the stale-index 41.25 but behind Mem0's published 51.15. The claim that this was widemem's strongest category is retracted; it rested on the transposed labels.

Open-domain questions (J 36.81 then, 42.71 now)

The weakest category at 42.71, though higher than the old number, not lower. Graph-backed systems traverse entity connections natively; Zep's temporal knowledge graph is a strong example of that approach. widemem stores facts flat and leans on the retrieval layer, which is weaker for broad relationship questions. We tried a lean, entity-aware re-rank to lift this; in a controlled gate it regressed the other categories without improving open-domain, so it was not shipped (the full negative result is in the repo issues). Becoming a graph database is not a goal. Open-domain at 42.71 is a deliberate, accepted limit of the lean flat architecture, and it is on the page precisely because it is the honest weak spot.

Latency

Sub-second search at p95, close to Mem0.

SEARCH AND TOTAL QUERY LATENCY (SECONDS)

Metric	widemem	Mem0	Zep	LangMem
Search p50	0.258	0.148	0.513	17.99
Search p95	0.519	0.200	0.778	59.82
Total p50	0.843	0.708	1.292	18.53
Total p95	1.567	1.440	2.926	60.40

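widemem's search latency is competitive with Mem0's: the two are within about 110 ms at search p50 and about 135 ms at total p50. Mem0 has the lowest published latency in the set. On search p95, widemem (0.519 s), Mem0 (0.200 s), and Zep (0.778 s) all stay under one second; on total p95, widemem (1.567 s) and Mem0 (1.440 s) are the only two systems under two seconds.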
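For readers reproducing the latency rows, p50 and p95 can be derived from recorded per-query timings. The sketch below uses the nearest-rank convention, which is an assumption: this page does not state which percentile method the runner uses, and the sample timings are made up.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Toy per-query search latencies in seconds (illustrative, not measured).
latencies = [0.21, 0.24, 0.25, 0.26, 0.27, 0.30, 0.33, 0.41, 0.48, 0.52]

p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)
print(p50, p95)
```

Interpolating methods give slightly different values on small samples, so the method should be held fixed when comparing runs.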

Methodology

Running a fair benchmark is harder than running any benchmark.

Here is what we did to keep the numbers comparable with the other systems' published results, and where the comparison is not yet apples-to-apples.

BENCHMARK CONFIGURATION

Parameter	Value	Rationale
widemem version	v1.4.1 (re-ingested)	Current shipped
LLM (all phases)	GPT-4o-mini	Same as Mem0 paper
Embeddings	text-embedding-3-small	Same as Mem0 paper
Vector store	FAISS local	widemem default
Decay	exponential, rate 0.01	widemem default
Scoring weights	sim 0.5 / imp 0.3 / rec 0.2	widemem default
Top k per speaker	10	20 memories total per question
Judge runs	3	Mem0 paper uses 10; we use 3 for cost
Scoring pass	single clean pass	No repair step
Coverage	all 10 conversations, 1,540 questions	Full confirmed run, like-for-like with the field
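The decay and scoring-weight rows suggest a weighted sum of similarity, importance, and recency. The following is a sketch under stated assumptions, not widemem's implementation: it assumes all three signals are normalised to [0, 1], that recency is exp(-rate × age), and that age is measured in days. The unit and the function names are guesses for illustration.

```python
import math

# Weights and decay rate taken from the configuration table.
W_SIM, W_IMP, W_REC = 0.5, 0.3, 0.2
DECAY_RATE = 0.01

def recency(age_days, rate=DECAY_RATE):
    """Exponential decay; measuring age in days is an assumption."""
    return math.exp(-rate * age_days)

def combined_score(similarity, importance, age_days):
    """Weighted sum of the three signals, each assumed to lie in [0, 1]."""
    return W_SIM * similarity + W_IMP * importance + W_REC * recency(age_days)

def top_k(memories, k=10):
    """memories: iterable of (text, similarity, importance, age_days) tuples."""
    ranked = sorted(memories, key=lambda m: combined_score(*m[1:]), reverse=True)
    return ranked[:k]

# Toy candidates: "b" is less similar but far more important than "a".
candidates = [("a", 0.9, 0.1, 0), ("b", 0.6, 0.9, 0)]
print([text for text, *_ in top_k(candidates, k=2)])
```

In this toy case the more important memory outranks the more similar one, which is the behaviour the phrase "importance-weighted retrieval" implies.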

Pipeline

Phase 1 (ingestion): re-ingest the conversation turns into widemem v1.4.1. Each turn runs through extraction, conflict resolution, and FAISS storage. This is the step the original run skipped by reusing a v1.3.0 index, which is what produced the stale 45.32.
Phase 2 (Q&A): run the questions. Search memories for each speaker, build the prompt, GPT-4o-mini generates the answer. Record latency and tokens.
Phase 3 (judge): score predictions against ground truth. F1 and BLEU computed locally. LLM-as-a-Judge runs 3 times per question, averaged to produce J. Single clean pass, no repair step.
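Phase 3 mentions F1 computed locally. A common way to do that for question answering is token-overlap F1 between the predicted and reference answers; the sketch below shows that convention. The tokenisation (lowercased word characters) is an assumption, and the repo's evaluator may normalise text differently.

```python
import re
from collections import Counter

def _tokens(text):
    """Lowercase and split into word tokens (assumed normalisation)."""
    return re.findall(r"\w+", text.lower())

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred, gold = _tokens(prediction), _tokens(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("She lives in Paris", "Paris"))
```

A verbose but correct answer is penalised on precision here, which is one reason the page reports the judge-based J as the primary metric.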

Caveats and known issues

Where the comparison is not yet apples-to-apples.

Open-domain is an accepted limit. At 42.71 it is the weakest category. A lean entity-aware re-rank was built and gated to lift it; it regressed the other categories without improving open-domain, so it was not shipped. The full negative result is recorded in the widemem-ai repo issues. Open-domain is a deliberate trade of the lean, no-graph architecture, not an unfixed bug.
Category labels were transposed until 2026-07-06. Our harness swapped the single-hop and multi-hop labels relative to the official LoCoMo evaluation. Per-category numbers published before that date carried the transposed names; overall, temporal, and open-domain scores were never affected. The harness mapping is fixed, this page now uses the official labels, and a full rerun under the corrected harness is planned. The retraction is recorded permanently in the corrections log.
Small revision from the earlier estimate. A prior version of this page carried a six-conversation estimate of about 56. The full 1,540-question run came in at 54.81. Stated plainly rather than quietly rounded.
Re-baseline, not the original run. These numbers are widemem v1.4.1 re-ingested with current code. The earlier 45.32 was a v1.3.0-era measurement on an index that was never rebuilt after the v1.4.0 extraction change shipped.
Repair-gap, resolved. An earlier run hit OpenAI rate limits and used a repair pass, which produced a large clean-vs-repaired score gap. A separate clean single-pass run confirmed the old 45.32 was a real measurement, not a repair artifact, and the re-baseline uses the clean single-pass method throughout. No repair step is involved in these numbers.
Adversarial excluded. LoCoMo includes adversarial questions with no ground-truth answers. All systems exclude these from scoring.
No hierarchy, no active retrieval. widemem's hierarchical memory and active retrieval are disabled here to compare like-for-like with flat-memory baselines. Enabling them may shift some categories at the cost of comparability.

Raw data and reproducibility

Verify it yourself.

Full JSON results, benchmark runner, and evaluator are in the benchmark directory of the widemem-ai repo. The re-baseline runner is benchmark/run_ws1.py; the original run used benchmark/run_locomo.py. The investigation that found the stale-index error is documented in the repo issues.

What's next

Full LoCoMo rerun with the corrected official category mapping in the harness
LongMemEval pass (a second standard benchmark in the memory space)
Memory-quality work: single-call ADD/UPDATE/DELETE consolidation and dedup
Memory-footprint and throughput benchmarks at 100k, 500k, and 1M memories (for the self-hosting page)

Confirmed run: May 2026, widemem v1.4.1, all 10 conversations (1,540 questions), clean single pass, GPT-4o-mini judge averaged over 3 runs. Original run: March 2026, v1.3.0, superseded by this page. If you want to discuss the numbers, open an issue on GitHub or start at /enterprise.