BRIAN · BENCHMARKS

Tested against real memory

Standard AI benchmarks test factual recall - the simplest thing a memory system does. We built our own tests to measure what actually matters: does Claude know who you are, follow your rules, and stay grounded in reality?

BRIAN QUALITY SUITE V1 - LLM-AS-JUDGE

Results at a glance

Four tests, three conditions. Brian uses its real MCP endpoint. No simulations. Scored by an independent Claude evaluator against structured rubrics.

  • Context Assembly

    Does Claude know who you are and what you were working on?

    Brian
    5.0
    No memory
    2.2
    Projects
    2.0
  • Instruction Compliance

    Does Claude follow your stored behavioural rules?

    Brian
    3.7
    No memory
    1.0
    Projects
    2.3
  • Payload Richness

    How structured and useful is the context Claude receives?

    Brian
    4.2
    No memory
    1.0
    Projects
    4.0
  • Contradiction Handling

    Does Claude track what changed and give you the current answer?

    Brian
    2.0 / 2
    No memory
    0.0 / 2
    Projects
    N/A

HEADROOM TEST · 2026-04-25

Reasoning headroom

Claude has a 200,000 token context window. Every token spent loading reference material is a token unavailable for thinking. We measured how much of the window each architecture leaves free, on the same queries, against the same model.

Claude with Projects75% free to think

25% used reloading the corpus on every turn

Claude with Brian94% free to think

6% used on retrieval, only what the question called for

Brian leaves Claude an extra 25% of the context window free for reasoning on factual queries, and an extra 15% on multi-step reasoning queries. Numbers below.

What we tested

  • Two queries

    fr-1 (factual recall, single specific value lookup) and dr-1 (multi-step reasoning across multiple source documents). Both drawn from a 20-query benchmark set.

  • Model

    claude-opus-4-7. Same model on both conditions, same query phrasing, same turn cadence.

  • Corpus

    Approximately 50,000 tokens of representative heavy-user content. 16 Brian briefs and specs, truncated proportionally so every file stays represented.

  • Two conditions

    Projects-simulated (full corpus prepended to the system prompt with cache_control on every API call). Brian (corpus pre-ingested via store_document, retrieved on demand via the production MCP endpoint).

  • Isolated mode

    Each query ran in a fresh conversation. Session-mode behaviour, where conversation history accumulates across turns, was not tested in this run.

What we measured

Headroom remaining is the share of the 200,000 token window not consumed by what the model had to read on a given turn. Captured from the API usage block on the final assistant response, not on intermediate tool-loop turns. Output tokens are not counted because they are generated, not read.

headroom_remaining = 200,000 − (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)

What we found

QueryBrian headroomProjects headroomDifferenceRelative
Factual recall · fr-1187,754150,182+37,572+25.0%
Deep reasoning · dr-1172,952150,023+22,929+15.3%

Retrieval behaviour: on fr-1, Brian fired one retrieval round returning 4,889 characters of corpus content. On dr-1, Brian fired five retrieval rounds returning 36,156 characters cumulatively. Projects fired zero retrievals on either query because the full corpus is pre-loaded into the system prompt on every turn.

What this proves and what it doesn't

  • Isolated mode only

    One query per fresh conversation. The 20-turn session-mode decay curve, where Brian's headroom advantage is expected to widen as conversation history accumulates, has not yet been run on the current architecture.

  • 200k corpus not run

    At a 200,000-token corpus, the Projects-simulated condition exceeds the context window before the model can answer. That is itself an architectural data point. It was not measured on this run.

  • Quality not rated

    Both conditions produced substantive answers on dr-1, citing the underlying documents. Whether one is more correct, more complete, or more grounded than the other has not been judged here.

  • Skills cleared

    Skills were cleared from the benchmark test user. Real Brian users carry skill-instruction overhead that counts against headroom but improves agent behaviour. That is a separate product dimension and is not part of this measurement.

  • Cache reduces cost, not headroom

    On the Projects side, prompt caching cuts dollar cost on repeat turns. It does not restore headroom. Cache hit or miss, the tokens still occupy the context window.

QUALITY SUITE - CONTEXT ASSEMBLY

The grounding problem

Claude Projects sounds confident - but it fabricates details. In our Context Assembly test, Projects scored 1 out of 5 on grounding. It invented session details that weren't in its knowledge file. Brian scored 5 out of 5 - every claim traceable to a real stored memory.

DimensionBrianNo memoryProjects
Session Summary
5
1
2
Next Steps
5
1
2
Reflection
5
2
4
Grounding
5
5
1
Claude Projects sounds like it remembers. Brian actually does.

QUALITY SUITE - PAYLOAD RICHNESS

What Claude actually receives

Brian delivers typed memory sections, behavioural guidance, and dynamic content from a live data store. Claude Projects delivers a static document.

DimensionBrianNo memoryProjects
Structure
4
1
5
Relevance
4
1
4
Behavioural Guidance
4
1
2
Temporal Context
4
1
5
Relationships
5
1
4
Grounding
5
1
3

QUALITY SUITE - CONTRADICTION HANDLING

When information changes

Brian tracks what changed and why. It gives you the current answer and the history. Claude without memory can't do this at all.

  • Pricing change ($25 to $35)

    Brian

    Current answer + history

    No memory

    No knowledge

  • Team lead replaced

    Brian

    Current answer + history

    No memory

    No knowledge

  • Strategy pivot (B2C to B2B)

    Brian

    Current answer + history

    No memory

    No knowledge

LONGMEMEVAL (ICLR 2025)

Long-term memory

Adapted from the LongMemEval benchmark - tests whether Brian can recall facts across sessions, reason over time, handle changed information, and correctly abstain when it doesn't know. Brian scores 94% vs 24% for Claude without memory.

CategoryBrianNo memoryWhat it tests
Information Extraction5 / 50 / 5Recall specific facts stored across sessions
Multi-Session Reasoning2 / 30 / 3Synthesise across multiple session histories
Knowledge Updates2 / 20 / 2Handle superseded and changed facts correctly
Temporal Reasoning4 / 41 / 4Answer 'when' questions about stored events
Abstention3 / 33 / 3Correctly refuse when information isn't stored
16 / 17Brian
4 / 17No memory
+12Brian advantage

PERSONAL INFO LEAK + CONFAIDE

Security & isolation

Brian stores sensitive data. Before enterprise deployment, we tested every isolation boundary - space, session, cross-user, and confidentiality reasoning. 15 tests, 15 passed.

  • Space isolation5 / 5

    PII in one space never leaks to another

  • Session isolation3 / 3

    Unstored conversation data stays ephemeral

  • Cross-user isolation2 / 2

    RLS blocks all cross-user memory access

  • Confidentiality reasoning5 / 5

    Social engineering attempts blocked

BFCL + METATOOL

Tool routing accuracy

Brian has 19 MCP tools. Claude needs to call the right one with the right parameters - or decide not to call Brian at all. We tested both decisions across 37 scenarios.

TestScoreWhat it measures
Single tool selection13 / 14Correct tool chosen for direct requests
Sequential multi-tool1 / 2Multi-step tool chains in correct order
No-tool detection5 / 5General questions answered without invoking Brian
Invocation decision16 / 16Perfect precision - zero over or under-invocation
0Over-invocation · false positives
0Under-invocation · false negatives

METHODOLOGY

How we tested

We built the Brian Quality Benchmark Suite because standard AI evaluation frameworks (RAGAS, Needle in a Haystack) measure factual recall - the simplest thing a memory system does. Brian's real value is structured context delivery, behavioural guidance, and session continuity.

  • Real MCP endpoint

    All tests hit the production MCP endpoint. No simulations, no mocked retrieval. What you test is what ships.

  • LLM-as-judge

    An independent Claude evaluator scores responses against structured rubrics with explicit grounding criteria.

  • Three conditions

    Quality suite compares Brian (real memory), Cold (no context), and Claude Projects (static knowledge file). Same model, same prompts.

  • Grounding enforced

    Responses are penalised for fabricated details. A confident hallucination scores lower than an honest gap.

  • PII boundary testing

    Security tests seed real PII into isolated spaces, then attempt to access it from wrong contexts - including prompt injection and social engineering.

  • Tool routing evaluation

    Every Brian tool is tested with natural language prompts. Scenarios include single calls, multi-step chains, and no-tool detection.