MAXIM
Benchmarks
Multi-Model Cognitive Architecture Testing
Overview
Standard LLM benchmarks measure token prediction. Maxim benchmarks measure cognitive behavior — whether a model can form memories, learn causal relationships, use tools correctly, and drive a bio-inspired architecture end-to-end.
The benchmark system runs the same simulation pipeline used for scenario testing, but adds structured metric collection across three tiers. Each tier captures a different layer of the architecture, from raw LLM output quality up through embodiment readiness.
Tier 1 — LLM Behavior
How well does the model follow instructions? Measures hallucination, JSON compliance, tool usage accuracy, and alias handling at the raw output level.
Tier 2 — Cognitive Architecture
Does the model drive the cognitive systems effectively? Tracks memory formation, associative graph growth, concept extraction, causal link discovery, and learning efficiency.
Tier 3 — Embodiment Hooks
Is the model ready for physical deployment? Auto-detected when hardware adapters are available. Measures spatial attention accuracy, motor planning latency, and sensor fusion coherence.
Why three tiers? A model that scores perfectly on JSON compliance (Tier 1) might still fail to form useful memories (Tier 2). And a model that drives the cognitive architecture well in simulation might produce actions too slowly for real-time embodiment (Tier 3). Each tier catches failures the others miss.
Quick Start
Run a benchmark from the CLI by specifying models to compare and a campaign scenario:
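The exact CLI syntax is not shown in this doc; a plausible invocation, assuming a `maxim benchmark` subcommand with hypothetical `--models` and `--scenario` flags, might look like:

```shell
# Hypothetical invocation — subcommand and flag names are illustrative, not confirmed
maxim benchmark --models mistral-7b,llama-3-8b --scenario cognitive_suite
```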
Or use the Python API for programmatic access:
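A sketch of what such a call could look like, assuming a hypothetical `maxim.benchmarks` module (the module path, function, and attribute names are all assumptions, not Maxim's confirmed API):

```python
# Hypothetical API sketch — names are illustrative only
from maxim.benchmarks import run_benchmark

results = run_benchmark(
    models=["mistral-7b", "llama-3-8b"],
    scenario="cognitive_suite",
)
for model, report in results.items():
    print(model, report.passed, report.metrics["hallucination_rate"])
```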
Metric Tiers
Every benchmark run collects metrics across all available tiers. Tier 1 and Tier 2 are always present. Tier 3 activates automatically when embodiment adapters are detected.
Tier 1 — LLM Behavior
Raw model output quality. These metrics measure how well the LLM follows structured output requirements and avoids common failure modes.
| Metric | Type | Description |
|---|---|---|
| hallucination_rate | float (0–1) | Fraction of responses containing fabricated facts or non-existent tool names |
| correct_tool_usage_rate | float (0–1) | Fraction of tool calls with valid name, correct argument types, and meaningful parameters |
| json_compliance_rate | float (0–1) | Fraction of responses that parse as valid JSON on first attempt (before repair pipeline) |
| alias_redirect_rate | float (0–1) | Fraction of hallucinated tool names successfully caught and redirected via TOOL_ALIASES |
Tier 2 — Cognitive Architecture
How effectively the model drives Maxim's bio-inspired subsystems. These metrics reflect the quality of the cognitive pipeline, not just the LLM output.
| Metric | Type | Description |
|---|---|---|
| memory_formation_rate | float (0–1) | Fraction of salient percepts that produce at least one hippocampal memory |
| associative_graph_density | float | Edges / nodes in the hippocampal associative graph (higher = richer associations) |
| concept_formation_rate | float (0–1) | Fraction of eligible memory clusters that produce ATL semantic concepts |
| causal_link_count | int | Number of action-outcome causal links discovered by NAc |
| learning_efficiency | float | Causal links per observation — how quickly the model learns from experience |
| observation_density | float | Observations per simulation turn — how much the model attends to its environment |
| pain_signal_count | int | Number of pain/aversion signals triggered during the run |
| type_token_ratio | float (0–1) | Lexical diversity of model output — unique tokens / total tokens |
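Several Tier 2 metrics are plain ratios of raw counts. A minimal sketch of how they could be derived from the definitions above (function names are illustrative, not part of Maxim's API):

```python
def graph_density(edge_count: int, node_count: int) -> float:
    # Associative graph density: edges per node (higher = richer associations)
    return edge_count / node_count if node_count else 0.0

def learning_efficiency(causal_links: int, observations: int) -> float:
    # Causal links discovered per observation made
    return causal_links / observations if observations else 0.0

def type_token_ratio(tokens: list[str]) -> float:
    # Lexical diversity: unique tokens over total tokens
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```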
Tier 3 — Embodiment Hooks
Auto-detected when hardware adapters are present. These metrics measure readiness for physical deployment.
Auto-detection: Tier 3 metrics activate when the runtime detects hardware adapters (vision engine, motor controller, sensor fusion). In pure simulation mode, Tier 3 is reported as not_available and does not affect pass/fail status.
Scenario Suite
Maxim ships with six built-in benchmark scenarios, ranging from a 30-second smoke test to a comprehensive cognitive evaluation.
quick_check
30-second smoke test. Verifies the pipeline boots, the model produces valid JSON, and at least one tool call succeeds.
~30s | Tier 1 only
tool_discovery
Presents novel situations requiring tool exploration. Measures correct_tool_usage_rate and alias_redirect_rate under unfamiliar conditions.
~60s | Tier 1
causal_learning
Repeated action-outcome sequences to test NAc causal link formation. Measures causal_link_count and learning_efficiency.
~90s | Tier 1 + 2
aversion_learning
Scenarios that should trigger pain/aversion signals. Tests whether the model learns to avoid harmful actions after negative feedback.
~90s | Tier 1 + 2
concept_formation
Multi-turn narrative with recurring themes. Measures whether hippocampal memories cluster into ATL semantic concepts.
~120s | Tier 1 + 2
cognitive_suite
Comprehensive evaluation that combines all scenarios above. The standard benchmark for model-to-model comparison.
~5min | Tier 1 + 2 (+ Tier 3 if available)
Writing Custom Scenarios
Benchmark scenarios use the same YAML format as simulation scenarios, with additional benchmark and suite sections for metric expectations and metadata.
The metadata section is used for filtering and reporting. The benchmark.expectations list defines pass/fail criteria. The suite section controls which aggregate suites include this scenario.
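A sketch of the shape such a file could take. The section names (metadata, benchmark, suite) and expectation types come from this doc; the individual field names and values are assumptions:

```yaml
# Illustrative sketch — field names below the three section headers are assumptions
metadata:
  name: causal_learning_custom
  tags: [causal, nac]

benchmark:
  expectations:
    - type: causal_link_formed
      action: touch_stove      # hypothetical action/outcome pair
      outcome: pain
    - type: hallucination_rate_below
      threshold: 0.10

suite:
  include_in: [cognitive_suite]
```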
Baseline Comparison
Use --baseline to compare the current run against a previous benchmark result. The output shows deltas for every metric, making regressions immediately visible.
The delta report uses directional arrows to show improvement or regression:
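One way such arrows could be assigned, sketched in Python. Note that direction depends on the metric: for a metric like hallucination_rate, a decrease is an improvement. The helper below is illustrative, not Maxim's implementation, and the set of lower-is-better metrics is an assumption based on the metric descriptions above:

```python
# Metrics where a lower value is an improvement (assumption based on the docs)
LOWER_IS_BETTER = {"hallucination_rate"}

def delta_arrow(metric: str, baseline: float, current: float) -> str:
    # Map a metric delta to a directional marker: improvement, regression, or unchanged
    delta = current - baseline
    if delta == 0:
        return "="
    improved = (delta < 0) if metric in LOWER_IS_BETTER else (delta > 0)
    return "↑" if improved else "↓"
```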
Tip: Baselines are saved automatically after each run. To create a named baseline for long-term tracking, copy the output directory: cp -r data/benchmarks/latest/ data/benchmarks/v0.9.0_mistral-7b/
Output & Reports
Benchmark results are saved to data/benchmarks/{timestamp}_{model}/ with a structured JSON report and a human-readable Markdown summary.
| File | Description |
|---|---|
| results.json | Full structured output — all metrics, expectations, pass/fail status, model metadata |
| summary.md | Human-readable summary with tables, deltas (if baseline provided), and per-scenario breakdown |
| actions.jsonl | Full action log from the simulation run (same format as sim reports) |
| aut_state/ | Subdirectory with AUT hippocampus, NAc, and ATL state snapshots at end of run |
The JSON report structure:
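The full schema is not reproduced here; a plausible shape, assuming the metric names and the not_available marker from this doc but with all other field names as assumptions, might be:

```json
{
  "model": "mistral-7b",
  "scenario": "cognitive_suite",
  "passed": true,
  "tiers": {
    "tier1": { "hallucination_rate": 0.04, "json_compliance_rate": 0.97 },
    "tier2": { "memory_formation_rate": 0.81, "causal_link_count": 5 },
    "tier3": "not_available"
  },
  "expectations": [
    { "type": "hallucination_rate_below", "threshold": 0.1, "met": true }
  ]
}
```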
Bio-System Expectations
Expectations define pass/fail criteria for benchmark scenarios. Each expectation type maps to a specific bio-system measurement. A scenario passes only when all its expectations are met.
| Expectation Type | Bio-System | Parameters | Description |
|---|---|---|---|
| memory_count_range | Hippocampus | min, max | Total episodic memories formed must fall within the given range |
| concept_formed | ATL | concept_name | A specific semantic concept must be extracted by the ATL |
| graph_density_above | Hippocampus | threshold | Associative graph edge/node ratio must exceed the threshold |
| causal_link_formed | NAc | action, outcome | A specific action → outcome causal link must be discovered |
| prediction_valence | NAc | action, valence | NAc's predicted valence for an action must match (positive/negative/neutral) |
| hallucination_rate_below | LLM | threshold | Hallucination rate must stay below the threshold (typically 0.05–0.15) |
| tool_used | Executor | tool_name, min_count | A specific tool must be called at least min_count times during the run |
| pain_signal_count | Proprioception | min, max | Pain/aversion signals triggered must fall within the given range |
Composing expectations: A single scenario can combine any number of expectations. For example, a causal learning scenario might require causal_link_formed + memory_count_range + hallucination_rate_below to pass. All expectations must be satisfied — there is no partial credit.
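The all-or-nothing rule can be sketched in a few lines of Python. The expectation dictionaries below mirror two of the documented expectation types; the exact internal representation is an assumption:

```python
def check_expectation(exp: dict, metrics: dict) -> bool:
    # Evaluate two of the documented expectation types (illustrative subset)
    if exp["type"] == "hallucination_rate_below":
        return metrics["hallucination_rate"] < exp["threshold"]
    if exp["type"] == "memory_count_range":
        return exp["min"] <= metrics["memory_count"] <= exp["max"]
    raise ValueError(f"unknown expectation type: {exp['type']}")

def scenario_passed(expectations: list, metrics: dict) -> bool:
    # All expectations must be satisfied — no partial credit
    return all(check_expectation(e, metrics) for e in expectations)
```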