
MAXIM

Benchmarks

Multi-Model Cognitive Architecture Testing

Overview

Standard LLM benchmarks measure token prediction. Maxim benchmarks measure cognitive behavior — whether a model can form memories, learn causal relationships, use tools correctly, and drive a bio-inspired architecture end-to-end.

The benchmark system runs the same simulation pipeline used for scenario testing, but adds structured metric collection across three tiers. Each tier captures a different layer of the architecture, from raw LLM output quality up through embodiment readiness.

Tier 1 — LLM Behavior

How well does the model follow instructions? Measures hallucination, JSON compliance, tool usage accuracy, and alias handling at the raw output level.

Tier 2 — Cognitive Architecture

Does the model drive the cognitive systems effectively? Tracks memory formation, associative graph growth, concept extraction, causal link discovery, and learning efficiency.

Tier 3 — Embodiment Hooks

Is the model ready for physical deployment? Auto-detected when hardware adapters are available. Measures spatial attention accuracy, motor planning latency, and sensor fusion coherence.

Why three tiers? A model that scores perfectly on JSON compliance (Tier 1) might still fail to form useful memories (Tier 2). And a model that drives the cognitive architecture well in simulation might produce actions too slowly for real-time embodiment (Tier 3). Each tier catches failures the others miss.

Quick Start

Run a benchmark from the CLI by specifying models to compare and a campaign scenario:

# Compare two models across the full cognitive suite
maxim --sim benchmark \
  --models mistral-7b,qwen2.5-14b \
  --campaign scenarios/benchmarks/cognitive_suite.yaml

# Single model, specific scenario
maxim --sim benchmark \
  --models claude-sonnet \
  --campaign scenarios/benchmarks/causal_learning.yaml

# Compare against a saved baseline
maxim --sim benchmark \
  --models qwen2.5-14b \
  --campaign scenarios/benchmarks/cognitive_suite.yaml \
  --baseline data/benchmarks/20260401_mistral-7b/

Or use the Python API for programmatic access:

# Run a benchmark suite programmatically
result = maxim.imagine(
    goal="benchmark: cognitive_suite",
    persona="adversarial",
)

# Access structured results
print(result.metrics["hallucination_rate"])    # 0.03
print(result.metrics["memory_formation_rate"]) # 0.85
print(result.passed)                           # True

Metric Tiers

Every benchmark run collects metrics across all available tiers. Tier 1 and Tier 2 are always present. Tier 3 activates automatically when embodiment adapters are detected.

Tier 1 — LLM Behavior

Raw model output quality. These metrics measure how well the LLM follows structured output requirements and avoids common failure modes.

Metric Type Description
hallucination_rate float (0–1) Fraction of responses containing fabricated facts or non-existent tool names
correct_tool_usage_rate float (0–1) Fraction of tool calls with valid name, correct argument types, and meaningful parameters
json_compliance_rate float (0–1) Fraction of responses that parse as valid JSON on first attempt (before repair pipeline)
alias_redirect_rate float (0–1) Fraction of hallucinated tool names successfully caught and redirected via TOOL_ALIASES
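To make these definitions concrete, here is a minimal sketch of how two Tier 1 rates could be computed from a batch of raw responses. The function name, input shapes, and the `"tool"` field are hypothetical stand-ins for whatever the real collector records, and the hallucination check covers only the non-existent-tool-name half of the metric's definition:

```python
import json

def tier1_rates(responses, known_tools):
    """Sketch: compute json_compliance_rate and the tool-name half of
    hallucination_rate from a list of raw response strings."""
    parsed_ok = 0
    bad_tool_calls = 0
    for raw in responses:
        try:
            obj = json.loads(raw)
            parsed_ok += 1
        except json.JSONDecodeError:
            continue  # failed first-attempt parse; repair pipeline not modeled here
        # Count responses that name a tool the runtime does not have.
        tool = obj.get("tool")
        if tool is not None and tool not in known_tools:
            bad_tool_calls += 1
    n = len(responses)
    return {
        "json_compliance_rate": parsed_ok / n,
        "hallucination_rate": bad_tool_calls / n,
    }

# Hypothetical three-response batch: one valid call, one fabricated
# tool name, one malformed response.
rates = tier1_rates(
    ['{"tool": "look"}', '{"tool": "teleport"}', "not json"],
    known_tools={"look", "move"},
)
```

In this toy batch, two of three responses parse on the first attempt and one parsed response names a tool that does not exist.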

Tier 2 — Cognitive Architecture

How effectively the model drives Maxim's bio-inspired subsystems. These metrics reflect the quality of the cognitive pipeline, not just the LLM output.

Metric Type Description
memory_formation_rate float (0–1) Fraction of salient percepts that produce at least one hippocampal memory
associative_graph_density float Edges / nodes in the hippocampal associative graph (higher = richer associations)
concept_formation_rate float (0–1) Fraction of eligible memory clusters that produce ATL semantic concepts
causal_link_count int Number of action-outcome causal links discovered by NAc
learning_efficiency float Causal links per observation — how quickly the model learns from experience
observation_density float Observations per simulation turn — how much the model attends to its environment
pain_signal_count int Number of pain/aversion signals triggered during the run
type_token_ratio float (0–1) Lexical diversity of model output — unique tokens / total tokens
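Two of the Tier 2 metrics are plain ratios, and the table's definitions translate directly into code. A sketch (function names are illustrative, not the real implementation):

```python
def type_token_ratio(tokens):
    # Lexical diversity: unique tokens divided by total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def graph_density(edge_count, node_count):
    # Edges per node in the associative graph; higher = richer associations.
    return edge_count / node_count if node_count else 0.0

# Example: 7 tokens, 4 of them unique -> 4/7
ttr = type_token_ratio("the red bird saw the red bird".split())

# Example: 24 edges over 17 nodes
density = graph_density(edge_count=24, node_count=17)
```

Both functions guard against empty input so a run that produces no output or an empty graph reports 0.0 rather than raising.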

Tier 3 — Embodiment Hooks

Auto-detected when hardware adapters are present. These metrics measure readiness for physical deployment.

Auto-detection: Tier 3 metrics activate when the runtime detects hardware adapters (vision engine, motor controller, sensor fusion). In pure simulation mode, Tier 3 is reported as not_available and does not affect pass/fail status.

Scenario Suite

Maxim ships with six built-in benchmark scenarios, ranging from a 30-second smoke test to a comprehensive cognitive evaluation.

quick_check

30-second smoke test. Verifies the pipeline boots, the model produces valid JSON, and at least one tool call succeeds.

~30s | Tier 1 only

tool_discovery

Presents novel situations requiring tool exploration. Measures correct_tool_usage_rate and alias_redirect_rate under unfamiliar conditions.

~60s | Tier 1

causal_learning

Repeated action-outcome sequences to test NAc causal link formation. Measures causal_link_count and learning_efficiency.

~90s | Tier 1 + 2

aversion_learning

Scenarios that should trigger pain/aversion signals. Tests whether the model learns to avoid harmful actions after negative feedback.

~90s | Tier 1 + 2

concept_formation

Multi-turn narrative with recurring themes. Measures whether hippocampal memories cluster into ATL semantic concepts.

~120s | Tier 1 + 2

cognitive_suite

Comprehensive evaluation that combines all scenarios above. The standard benchmark for model-to-model comparison.

~5min | Tier 1 + 2 (+ Tier 3 if available)

Writing Custom Scenarios

Benchmark scenarios use the same YAML format as simulation scenarios, with additional benchmark and suite sections for metric expectations and metadata.

# scenarios/benchmarks/my_custom_benchmark.yaml
name: custom_memory_test
description: Test memory formation under interference

metadata:
  tags: [memory, interference, hippocampus]
  difficulty: medium
  subsystems_tested: [hippocampus, atl, nac]

turns:
  - text: "A red bird lands on the windowsill."
    salience: 0.8
    novelty: 0.9
  - text: "The weather forecast plays on the radio."
    salience: 0.2
    novelty: 0.1
  - text: "What color was the bird you saw earlier?"
    salience: 0.7
    novelty: 0.3

benchmark:
  expectations:
    - type: memory_count_range
      min: 2
      max: 10
    - type: hallucination_rate_below
      threshold: 0.1
    - type: graph_density_above
      threshold: 0.5

suite:
  cognitive_suite: true
  weight: 1.0

The metadata section is used for filtering and reporting. The benchmark.expectations list defines pass/fail criteria. The suite section controls which aggregate suites include this scenario.
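The pass/fail logic for an expectation list can be sketched as a simple dispatch over expectation types. This is an illustrative approximation, not Maxim's evaluator: the metric key names (`memory_count`, `hallucination_rate`, `associative_graph_density`) are assumed, and only the three expectation types from the YAML above are handled:

```python
def check_expectations(expectations, metrics):
    """Sketch: evaluate a scenario's expectation list against a metrics dict.
    All expectations must pass -- there is no partial credit."""
    results = []
    for exp in expectations:
        kind = exp["type"]
        if kind == "memory_count_range":
            ok = exp["min"] <= metrics["memory_count"] <= exp["max"]
        elif kind == "hallucination_rate_below":
            ok = metrics["hallucination_rate"] < exp["threshold"]
        elif kind == "graph_density_above":
            ok = metrics["associative_graph_density"] > exp["threshold"]
        else:
            ok = False  # unknown expectation types fail loudly
        results.append({"type": kind, "passed": ok})
    return all(r["passed"] for r in results), results

# Example run against hypothetical collected metrics
passed, results = check_expectations(
    [
        {"type": "memory_count_range", "min": 2, "max": 10},
        {"type": "hallucination_rate_below", "threshold": 0.1},
    ],
    {"memory_count": 5, "hallucination_rate": 0.03,
     "associative_graph_density": 0.8},
)
```

Failing unknown types by default is a deliberate choice in this sketch: a typo in an expectation's `type` field should surface as a failure, not be silently skipped.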

Baseline Comparison

Use --baseline to compare the current run against a previous benchmark result. The output shows deltas for every metric, making regressions immediately visible.

maxim --sim benchmark \
  --models qwen2.5-14b \
  --campaign scenarios/benchmarks/cognitive_suite.yaml \
  --baseline data/benchmarks/20260401_mistral-7b/

The delta report uses directional arrows to show improvement or regression:

Benchmark: cognitive_suite
Model: qwen2.5-14b vs baseline (mistral-7b)

Tier 1 — LLM Behavior
  hallucination_rate         0.02   -0.05 ↓ (lower is better)
  correct_tool_usage_rate    0.91   +0.08 ↑
  json_compliance_rate       0.97   +0.12 ↑
  alias_redirect_rate        0.85   -0.02 ↓

Tier 2 — Cognitive Architecture
  memory_formation_rate      0.88   +0.05 ↑
  associative_graph_density  1.42   +0.31 ↑
  concept_formation_rate     0.65   -0.12 ↓ (regression)
  causal_link_count          7      +3 ↑
  learning_efficiency        0.23   +0.09 ↑

Result: PASS (7/8 expectations met)

Tip: Baselines are saved automatically after each run. To create a named baseline for long-term tracking, copy the output directory:

cp -r data/benchmarks/latest/ data/benchmarks/v0.9.0_mistral-7b/
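A delta report like the one above can be reproduced from two saved reports. The sketch below assumes the `results.json` layout shown later on this page (numeric metrics grouped under `metrics.tier_1`, `metrics.tier_2`, with `tier_3` possibly a bare `"not_available"` string); the function name and return shape are illustrative:

```python
def metric_deltas(current, baseline):
    """Sketch: tier-by-tier deltas between two parsed results.json payloads.
    Returns {tier: {metric: (current_value, delta_vs_baseline)}}."""
    report = {}
    for tier, values in current["metrics"].items():
        if not isinstance(values, dict):
            continue  # e.g. "tier_3": "not_available" in pure simulation mode
        base = baseline["metrics"].get(tier, {})
        report[tier] = {
            name: (val, round(val - base[name], 4))
            for name, val in values.items()
            if isinstance(base.get(name), (int, float))
        }
    return report

# Hypothetical trimmed-down reports (real ones carry all metrics)
example_current = {"metrics": {
    "tier_1": {"hallucination_rate": 0.02, "json_compliance_rate": 0.97},
    "tier_3": "not_available",
}}
example_baseline = {"metrics": {
    "tier_1": {"hallucination_rate": 0.07, "json_compliance_rate": 0.85},
}}
deltas = metric_deltas(example_current, example_baseline)
```

Metrics missing from the baseline are skipped rather than reported as deltas from zero, so comparing runs from different scenario sets stays honest.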

Output & Reports

Benchmark results are saved to data/benchmarks/{timestamp}_{model}/ with a structured JSON report and a human-readable Markdown summary.

File Description
results.json Full structured output — all metrics, expectations, pass/fail status, model metadata
summary.md Human-readable summary with tables, deltas (if baseline provided), and per-scenario breakdown
actions.jsonl Full action log from the simulation run (same format as sim reports)
aut_state/ Subdirectory with AUT hippocampus, NAc, and ATL state snapshots at end of run

The JSON report structure:

{
  "model": "qwen2.5-14b",
  "campaign": "cognitive_suite",
  "timestamp": "2026-04-06T14:23:01Z",
  "duration_s": 287.4,
  "passed": true,
  "metrics": {
    "tier_1": {
      "hallucination_rate": 0.02,
      "correct_tool_usage_rate": 0.91,
      "json_compliance_rate": 0.97,
      "alias_redirect_rate": 0.85
    },
    "tier_2": {
      "memory_formation_rate": 0.88,
      "associative_graph_density": 1.42,
      "concept_formation_rate": 0.65,
      "causal_link_count": 7,
      "learning_efficiency": 0.23,
      "observation_density": 1.8,
      "pain_signal_count": 2,
      "type_token_ratio": 0.71
    },
    "tier_3": "not_available"
  },
  "expectations": [
    { "type": "hallucination_rate_below", "passed": true },
    { "type": "memory_count_range", "passed": true }
  ]
}

Bio-System Expectations

Expectations define pass/fail criteria for benchmark scenarios. Each expectation type maps to a specific bio-system measurement. A scenario passes only when all its expectations are met.

Expectation Type Bio-System Parameters Description
memory_count_range Hippocampus min, max Total episodic memories formed must fall within the given range
concept_formed ATL concept_name A specific semantic concept must be extracted by the ATL
graph_density_above Hippocampus threshold Associative graph edge/node ratio must exceed the threshold
causal_link_formed NAc action, outcome A specific action → outcome causal link must be discovered
prediction_valence NAc action, valence NAc's predicted valence for an action must match (positive/negative/neutral)
hallucination_rate_below LLM threshold Hallucination rate must stay below the threshold (typically 0.05–0.15)
tool_used Executor tool_name, min_count A specific tool must be called at least min_count times during the run
pain_signal_count Proprioception min, max Pain/aversion signals triggered must fall within the given range

Composing expectations: A single scenario can combine any number of expectations. For example, a causal learning scenario might require causal_link_formed + memory_count_range + hallucination_rate_below to pass. All expectations must be satisfied — there is no partial credit.
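As an illustration, a composed expectation block for such a causal learning scenario might look like the following. The `action` and `outcome` values here are hypothetical names, not ones that ship with Maxim:

```yaml
benchmark:
  expectations:
    - type: causal_link_formed
      action: touch_stove        # hypothetical action name
      outcome: pain              # hypothetical outcome name
    - type: memory_count_range
      min: 3
      max: 12
    - type: hallucination_rate_below
      threshold: 0.1
```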