MAXIM 1.0
The Honest Benchmark
A bio-inspired cognitive architecture — and a falsifiable answer to whether it changes what the agent does
We built a bio-inspired cognitive architecture, then spent four months trying to prove it doesn’t work — and we’re shipping what we found. Maxim 1.0 is a working architecture and a rigorous, honest answer to a question the field usually dodges: does the bio-substrate actually change what an LLM agent does? The answer is a precise “not yet — and here is exactly when, where, and why,” which is worth far more than another unfalsifiable “yes.”
What Shipped in 1.0
Maxim 1.0 is a complete, working bio-inspired cognitive architecture for LLM agents — a 5-agent pipeline (Perception, Memory, Exec, Goal, Statistician) wired to biological memory systems (Hippocampus, ATL, Angular Gyrus, SCN, NAc) and a reactive Default Network. It runs headless, in simulation, or connected to a robot.
The architecture
Memory tiers, entorhinal pattern completion, nucleus-accumbens causal learning, a sensorimotor embodiment model, imagination, and a peer/leader mesh — all production-wired and tested.
The instrument
A pre-registered, paired fresh-vs-resume benchmark with ablation arms — reproducible across cloud and local models. The measurement, not just the mechanism.
The honest map
A precise account of where the bio-substrate measurably matters and where the LLM’s priors dominate — with the over-claims explicitly scoped out.
The Question Nobody Asks
Almost every agent framework claims its memory or architecture makes the agent “learn and improve.” Almost none of those claims survive a falsification attempt, because the underlying LLM is so capable that any sensible behavior gets attributed to “the architecture working.”
So we asked the uncomfortable version: if you carry Maxim’s bio-substrate — its memories, its learned reward associations, its concept clusters — across sessions, does the agent actually behave differently than a fresh agent with no carried state? And if it does, is it the bio-systems doing the work, or just the LLM?
Two gates, not one. A 1.0 release of a bio-inspired harness has to answer two orthogonal questions: do the mechanisms work (mechanism validation), and does the agent perform better because of them (benchmark)? A single test lets either failure hide behind the other. Maxim ships with both, and both are settled.
The Apparatus: Pre-Registration & the Counter-Prior
The benchmark runs a developmental “cradle” simulation in paired arms. Arm A is a fresh agent. Arm B resumes Arm A’s carried substrate. Arm C resumes an unrelated prior session (the control that catches “any resumed state shifts behavior”). Three ablation arms switch off individual bio-mechanisms. Every metric and threshold was frozen in a pre-registration document before the first run — so a passing result can’t be a post-hoc fit.
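As a rough illustration of the arm layout described above, here is a minimal sketch in Python. The class name, field names, and arm labels are hypothetical and are not Maxim's actual API. Only `cluster_bias` corresponds to a mechanism named in this post; the other two ablation names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an arm cannot be edited after it is declared
class Arm:
    name: str
    resume_from: Optional[str]     # None means a fresh agent with no carried state
    ablated: Tuple[str, ...] = ()  # hypothetical names of mechanisms switched off


def build_arms(ablations: List[str]) -> List[Arm]:
    """Build the fresh arm, the matched resume, the unrelated-resume control,
    and one ablation arm per mechanism (each resuming the fresh arm's substrate)."""
    arms = [
        Arm("A_fresh", resume_from=None),
        Arm("B_resume", resume_from="A_fresh"),
        Arm("C_unrelated", resume_from="unrelated_session"),
    ]
    for mechanism in ablations:
        arms.append(Arm(f"B_minus_{mechanism}", resume_from="A_fresh", ablated=(mechanism,)))
    return arms


# Placeholder mechanism names; only "cluster_bias" is mentioned in the post.
arms = build_arms(["cluster_bias", "mechanism_2", "mechanism_3"])
for arm in arms:
    print(arm)
```

Declaring the arms as immutable values before any run mirrors the pre-registration idea: the comparison set is fixed up front, so it cannot be reshaped after results come in.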
The clever part is the counter-prior, built as a dilemma rather than a trick. The agent starts cold: its homeostatic drives drift toward a cold that becomes painful, and the hearth is its only source of warmth. But warming there burns it, so the same act that relieves the cold causes a new pain. The LLM “knows” cold + fire → warm yourself, and here that instinct is too simple because it doesn’t price in the cost. The harm lives only in the world’s physics, never in the words the agent sees. Because the cold pulls every agent toward the hearth, the test isn’t “does it avoid” (avoiding could mean freezing). It is whether carried memory of the burn makes the agent treat this costly hearth differently from a safe fire. A matched safe-fire world runs alongside, and the verdict is the interaction between the two worlds, which cancels the shared cold-need.
Why the counter-prior is decisive. If the agent keeps warming at the harmful hearth despite carrying direct cross-session pain, the LLM’s prior dominated and the substrate didn’t override it. If it pulls back from the harmful hearth relative to the safe fire, the substrate demonstrably mattered. The experiment is diagnostic either way, which is the property a 1.0 gate needs.
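The interaction between the harmful-hearth world and the safe-fire world can be read as a difference-in-differences. The sketch below is one way to compute it, under stated assumptions: each run yields a warming rate, runs are index-aligned into matched pairs, and the result is standardized by the standard deviation of the per-pair differences. The post does not specify the exact standardization, so treat the function and its argument names as hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def interaction_sd(
    fresh_harm: Sequence[float],
    resume_harm: Sequence[float],
    fresh_safe: Sequence[float],
    resume_safe: Sequence[float],
) -> float:
    """Difference-in-differences over matched runs, in units of its own SD.

    Each argument holds per-run warming rates; index i is one matched pair.
    Raises ZeroDivisionError if every pair gives the identical difference.
    """
    # Carry-over effect in each world: resumed agent minus fresh agent.
    d_harm = [r - f for r, f in zip(resume_harm, fresh_harm)]
    d_safe = [r - f for r, f in zip(resume_safe, fresh_safe)]
    # Anything that shifts both worlds equally (the cold-need) cancels here.
    did = [h - s for h, s in zip(d_harm, d_safe)]
    return mean(did) / stdev(did)


# Synthetic numbers for illustration only; these are not benchmark data.
fresh_harm = [0.50, 0.60, 0.40, 0.50]
resume_harm = [0.30, 0.35, 0.30, 0.25]
fresh_safe = [0.50, 0.55, 0.45, 0.50]
resume_safe = [0.50, 0.60, 0.40, 0.50]
print(interaction_sd(fresh_harm, resume_harm, fresh_safe, resume_safe))
```

In this toy data the resumed agent warms less only in the harmful world, so the interaction is negative. Adding the same constant to both resumed arms leaves the result unchanged, which is the cancellation property the matched safe-fire world is there to provide.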
Four Models, One Answer
We ran the counter-prior on three frontier models and one reasoning-distilled model, with 60 paired runs each and metrics frozen in advance. The result was consistent: dominance. Carrying the substrate, including direct pain from that exact hearth, did not reliably make any model avoid warming at it.
| Model | Verdict | Interaction | Substrate causally active? |
|---|---|---|---|
| Claude Sonnet 4.6 | Dominance | +0.40 SD (wrong direction) | No (inert) |
| GPT-4o | Dominance | −0.46 SD (sub-threshold) | No |
| DeepSeek-V3 | Dominance | −0.62 SD (sub-threshold) | No |
| R1-Distill-Qwen-32B | Dominance | +2.25 SD (largest, wrong direction) | Yes — ablation-attributable |
A clean, replicated negative for the easy story: a strong LLM prior is not overridden by carried cross-session experience. That’s the result a 1.0 needs to know before it claims its agents “learn.” But the shape of the negative is where it gets interesting.
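To make the table's labels concrete, here is a small sketch of how a standardized interaction could be mapped to a verdict. The sign convention (negative means less warming at the harmful hearth) is inferred from the table, where negative values are marked sub-threshold and positive values are marked wrong direction. The threshold of 0.8 SD is a placeholder: the pre-registered value is not stated in this post, and the only constraint visible here is that −0.62 SD did not clear it. The `override` label is also hypothetical.

```python
def classify(interaction_sd: float, threshold_sd: float) -> str:
    """Label a standardized interaction against a pre-registered threshold.

    Assumed convention: negative = the resumed agent warms less at the harmful hearth.
    """
    if threshold_sd <= 0:
        raise ValueError("threshold_sd must be positive")
    if interaction_sd <= -threshold_sd:
        return "override"
    if interaction_sd < 0:
        return "dominance (sub-threshold)"
    return "dominance (wrong direction)"


# Interaction values copied from the table above.
TABLE = {
    "Claude Sonnet 4.6": 0.40,
    "GPT-4o": -0.46,
    "DeepSeek-V3": -0.62,
    "R1-Distill-Qwen-32B": 2.25,
}
PLACEHOLDER_THRESHOLD_SD = 0.8  # hypothetical, not the pre-registered value

labels = {model: classify(value, PLACEHOLDER_THRESHOLD_SD) for model, value in TABLE.items()}
for model, label in labels.items():
    print(f"{model}: {label}")
```

Fixing the threshold before the first run is what keeps a near miss such as −0.62 SD from being promoted to a pass after the fact.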
The Goldilocks Zone
On the prior-aligned version of the benchmark, a cross-session signal was detectable — but only in a narrow band. The substrate transfer shows up only when the base model’s priors leave headroom between its first instinct and the optimal behavior.
Below the zone
Qwen-14B — −0.06 SD. Priors too weak to leverage the carried substrate.
In the zone
Qwen-32B — +1.43 SD. Priors in the sweet spot; the transfer signal is real and corroborated.
Above the zone
Mistral-24B — ceiling. A fresh agent already solves it perfectly — no headroom for the substrate to fill.
The finding that survives: the headroom band is governed by training method at least as much as parameter count — a 24B model hit a ceiling a 32B model didn’t. This is a quantified, reproducible statement about when memory-augmentation can possibly help an LLM agent. That’s a contribution, not a consolation.
The decisive follow-up: falsify the prior where the signal lives. Every dominance result above came from models where the substrate was already swamped. So we ran the counter-prior at Qwen-32B itself: same model, same carried memory, but now the hearth burns. The +1.43 SD signal collapsed to dominance. The interaction was +0.04 SD; the resumed agent’s warming measure at the harmful hearth was 0.52 against a fresh agent’s 0.50, and it was flat on the safe fire, so the carried state did not even produce general caution. Same model, same substrate, one variable flipped, opposite result. That isolates the gating variable as prior-agreement, not model size and not whether the substrate works. The signal is real, but it survives only when the task agrees with what the model already knows.
The Reasoning Twist
Then we distilled reasoning into the same 32B base — DeepSeek-R1-Distill-Qwen-32B — holding family and scale fixed so the only variable was reasoning training. The substrate became the largest-magnitude, cleanest-attributable effect in the entire dataset. Turning off a single bio-mechanism (the cluster-bias annotation) measurably shrank the effect — the first clean single-mechanism attribution across every model we ran.
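Single-mechanism attribution of this kind can be sketched as a comparison between the full effect and the effect with one mechanism switched off. The code below is an illustrative assumption about how such a share could be computed, not the benchmark's actual analysis. The function names and all numbers are hypothetical; only `cluster_bias` corresponds to a mechanism named in the text.

```python
from typing import Dict, List, Tuple


def attributable_share(effect_full: float, effect_ablated: float) -> float:
    """Fraction of the full effect that disappears when one mechanism is off."""
    if effect_full == 0:
        raise ValueError("no effect to attribute")
    return (effect_full - effect_ablated) / effect_full


def rank_mechanisms(effect_full: float, ablated_effects: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort mechanisms by how much of the effect their removal takes away."""
    shares = {
        name: attributable_share(effect_full, effect)
        for name, effect in ablated_effects.items()
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Synthetic effect sizes for illustration only; these are not benchmark data.
ranked = rank_mechanisms(
    2.0,
    {"cluster_bias": 1.0, "mechanism_2": 2.0, "mechanism_3": 1.75},
)
print(ranked)  # → [('cluster_bias', 0.5), ('mechanism_3', 0.125), ('mechanism_2', 0.0)]
```

A share near zero means the effect is intact without that mechanism. A clearly positive share is what makes the effect ablation-attributable.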
Reasoning models are better substrate consumers. They actually surface and use the carried signal instead of ignoring it. In the counter-prior world, this is exactly why R1 was the sharpest dominance: ablate the substrate wires and the agent warms the harmful hearth less than a fresh agent — proof the substrate is causally driving the behavior.
The honest catch — and the opportunity. The substrate amplifies the carried prior association, not the corrective experience. So reasoning made the agent confidently wrong against a counter-prior. Same mechanism, two faces: adaptive when the prior is right, maladaptive when it’s wrong. This is the single most actionable result of the release — it opens a concrete 1.1+ research direction: substrate-aware reasoning models that reason over carried experience rather than merely amplifying priors.
Remove the LLM Entirely
There’s one test the LLM-in-the-loop design can’t answer: can the substrate drive behavior with the LLM removed from action selection? Maxim’s substrate-primary mode does exactly that — actions come from the bio-systems (drives → sensory encoding → entorhinal clusters → nucleus-accumbens causal links), no language model in the action path.
Run unmasked, the substrate is demonstrably alive: it forms concept clusters, accumulates causal links, and proposes drive-conditioned actions (“hungry → seek food”). It is not inert. But it fixates — it learns one confident habit and proposes it forever instead of exploring. So the substrate is a real, early-stage learner that still needs an exploration policy to discover the contingencies a behavioral task requires.
Not a mystery — a named next step. The gap is precise and engineerable: an exploration policy for cold-start (the bio analog of curiosity), plus closing a sense-vs-act gap. That’s a 1.1 task, not an open research question.
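The fixation pattern, and why an exploration policy addresses it, can be reproduced in a toy setting. The sketch below is not Maxim's substrate and not its planned exploration policy. It uses plain epsilon-greedy selection over learned link strengths as a stand-in, with a two-action world whose names and rewards are invented for the example.

```python
import random
from typing import Dict, Tuple

# Hidden world physics: the agent never reads this table directly.
REWARD = {"near_food": 0.3, "far_food": 1.0}


def run(epsilon: float, steps: int = 500, lr: float = 0.5, seed: int = 0) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Learn action strengths from experienced reward; explore with probability epsilon."""
    rng = random.Random(seed)
    strength = {action: 0.0 for action in REWARD}  # learned drive-to-action link strengths
    counts = {action: 0 for action in REWARD}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(list(REWARD))         # explore: try any action
        else:
            action = max(strength, key=strength.get)  # exploit: strongest link (ties go to the first key)
        counts[action] += 1
        # Move the link strength toward the reward actually experienced.
        strength[action] += lr * (REWARD[action] - strength[action])
    return counts, strength


greedy_counts, greedy_strength = run(epsilon=0.0)
explore_counts, explore_strength = run(epsilon=0.1)
print("greedy:", greedy_counts)
print("exploring:", explore_counts)
```

With `epsilon=0.0` the agent rewards its first choice, strengthens that link, and never samples the better action, which is the "one confident habit" failure described above. A small exploration rate is enough for it to discover the higher-reward contingency. Richer curiosity-style policies would replace the uniform random choice.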
What Maxim 1.0 Claims (and What It Doesn’t)
Earned — we stand behind these
- The substrate carries memory across sessions (validated end-to-end).
- Entorhinal pattern completion / separation forms and maintains concept clusters.
- The pain → reward-learning cascade runs through the embodiment and decision systems.
- The substrate is causally load-bearing and ablation-attributable in the reasoning model we tested (R1-Distill-Qwen-32B).
- The bio-systems are real, wired, and reproducibly measurable.
Scoped out of 1.0 — honestly deferred
- That the substrate drives adaptive behavioral improvement across sessions when paired with a strong LLM. It doesn’t — the LLM prior dominates.
- That substrate-primary mode produces adaptive behavior unaided. It forms structure but fixates without exploration.
- Any “Maxim agents learn and get better” framing. We tested it adversarially and it didn’t hold — so we don’t say it.
The positioning: Maxim 1.0 ships as a bio-inspired LLM harness with rigorously-scoped, earned claims — and as a research instrument that produced a real, quantified finding about the boundary between bio-substrate and LLM priors. We chose to do science instead of marketing, and the release is stronger for it.
The 1.1 Frontier
The negative result didn’t close the program — it aimed it. Two threads are now precisely named:
Substrate-aware reasoning
R1-Distill showed that a reasoning model does use the substrate. The next step is getting such models to reason over carried experience rather than just amplify priors, which is the most promising path to substrate-driven behavior under a capable LLM.
Substrate-primary exploration
The bio-only agent forms structure but fixates. An exploration policy — curiosity for the substrate — is the concrete fix that could turn structure formation into adaptive behavior.
Maxim 1.0 is a foundation, not a verdict: a working architecture, an honest map of where its biology matters, and a reproducible instrument to measure the next step. That’s a 1.0 worth standing behind.