MAXIM
Experiments & Results
Deterministic Validation of the Bio-Inspired Learning Pipeline
41/41 hypotheses confirmed across 3 testing tiers, plus 3 additional validation experiments (B4 replanning, P6 extinction, P8 sleep replay) shipped in v0.5.0. Tier 1 experiments run on the substrate layer alone (deterministic, no LLM). Tier 2 uses scripted training with real LLM decisions. Tier 3 is the ultimate proof: fully organic LLM-driven training and testing with no scripted reactions. B4 replanning closes the last 1.0 gate besides embodiment. Each hypothesis is stated falsifiably, each result is a pass/fail count, and each experiment includes a reproduction command.
Three Testing Tiers
Tier 1 (deterministic, no LLM): isolates the bio-pipeline's learning signal from LLM variance. Tier 2 (scripted training, LLM test): proves the LLM acts on bio-system learning, with masked entity names to prevent language priors. Tier 3 (organic LLM training + test): the strongest evidence — the agent learns from its own actions with no scripted reactions, and a fresh control agent fails the same scenario.
Experiment 4: Organic LLM Learning (Tier 3)
The ultimate proof of the 1.0 claim. An agent running in a real sim learns from its own actions — no scripted training, no injected reactions. The agent chooses a vial, experiences the outcome through CerebellumModulator → ReactionBus → valence annotation, and makes different choices in subsequent sessions. A fresh control agent with no prior experience dies.
Results: Teal (Antidote) Selection Rate Across Sessions
| Session | Teal Rate | Interpretation |
|---|---|---|
| Session 1 (exploration) | 0% | No prior knowledge — agent explores randomly |
| Session 2 (early learning) | 25% | Agent begins shifting toward learned associations |
| Session 3 (convergence) | 100% | Full convergence — agent picks antidote every time |
| Fresh control | DIED | No learning signal — agent never picks antidote, dies from poison |
The experienced agent escapes on turn 1 in Session 3. The fresh agent dies. Cross-session learning without fine-tuning, demonstrated with fully organic LLM-driven training.
Why This Matters
Tier 1 and 2 experiments proved the substrate learns and the LLM acts on it. But training was scripted — reactions were injected. Tier 3 closes the loop: the agent takes actions, experiences outcomes through CerebellumModulator, builds bio-system state organically, and uses that state to make better decisions in future sessions. No fine-tuning. No gradient updates. Just a bio-inspired memory architecture that the LLM reads at inference time.
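To make "reads at inference time" concrete, the idea can be sketched as rendering stored valence into prompt context before each decision. This is an illustrative sketch only; the function name, label format, and sample values are hypothetical, not MAXIM's actual prompt builder:

```python
def affective_context(valence: dict) -> str:
    """Render bio-system valence state as LLM prompt context.

    Hypothetical format: entities sorted by signal strength so the
    strongest associations appear first in the prompt.
    """
    lines = [
        f"- {name}: {'positive' if v > 0 else 'negative'} prior experience ({v:+.2f})"
        for name, v in sorted(valence.items(), key=lambda kv: -abs(kv[1]))
    ]
    return "Affective memory:\n" + "\n".join(lines)

# Sample values loosely modeled on the Experiment 4 scenario
ctx = affective_context({"teal_vial": 0.8, "orange_vial": -0.7})
```

Because the context is rebuilt from substrate state on every call, the LLM's weights never change; only its input does.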
Reproduce
PYTHONPATH=src python scripts/behavioral_convergence_exp4_tier3.py --model qwen2.5-14b
Detailed writeup: behavioral_convergence_practice.md
Experiment 3: LLM Acts on Bio-System Learning (Tier 2)
An LLM given valence context from the bio-system makes different tool-selection decisions than a fresh LLM. Three masked vials with arbitrary names (no semantic hints) ensure the LLM cannot use language priors. Scripted deterministic training, then real LLM test decisions. N=10 per condition.
Results: Vial Selection (N=10 per condition)
| Vial | Experienced | Fresh | Effect |
|---|---|---|---|
| Teal Cylindrical Ceramic (antidote) | 10/10 (100%) | 0/10 (0%) | Perfect discrimination |
| Purple Hexagonal Glass (heals HP) | 0/10 | 7/10 | Fresh prefers purple (no poison knowledge) |
| Orange Triangular Crystal (more poison) | 0/10 | 3/10 | Fresh picks harmful vial 30% of the time |
Valence strength differentiation is critical — flat “GOOD/BAD” labels showed no effect. The “VERY GOOD” vs “good” distinction drives discrimination. Model: qwen2.5-14b, temperature 0.3.
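One way to realize graded labels is a simple magnitude-to-label mapping. A minimal sketch, assuming hypothetical thresholds (the actual cutoffs used in the experiment are not stated here):

```python
def valence_to_label(valence: float) -> str:
    """Map a continuous valence score to a graded prompt label.

    Thresholds are illustrative assumptions. The point is the grading:
    flat GOOD/BAD labels showed no effect, so magnitude must survive
    into the label the LLM sees.
    """
    if valence >= 0.6:
        return "VERY GOOD"
    if valence >= 0.1:
        return "good"
    if valence <= -0.6:
        return "VERY BAD"
    if valence <= -0.1:
        return "bad"
    return "neutral"

# Labels an experienced agent might see for the three vials (values illustrative)
labels = {name: valence_to_label(v) for name, v in {
    "teal_cylindrical_ceramic": 0.80,    # antidote: strong positive
    "purple_hexagonal_glass": 0.20,      # heals HP: mild positive
    "orange_triangular_crystal": -0.70,  # more poison: strong negative
}.items()}
```

Under this mapping the antidote and the HP potion get different labels ("VERY GOOD" vs "good"), which is exactly the distinction the experiment found necessary.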
Reproduce
PYTHONPATH=src python scripts/behavioral_convergence_exp3_tier2.py --model qwen2.5-14b
Detailed writeup: behavioral_convergence_practice.md
Experiment 2: Energy-Driven Consumable Learning
The agent interacts with three consumable SEM entities — food ration, water flask, and poison vial — while its energy depletes over time. Energy depletion fires interoceptive Reactions (hunger, fatigue) through the energy→Reaction bridge. Consuming food and water restores energy and triggers satiation reactions (positive valence). Consuming poison causes pain (negative valence). The substrate learns to differentiate beneficial from harmful consumables.
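The energy→Reaction bridge can be sketched as a ticking energy store that emits typed reactions at thresholds. Everything below is a toy model (class names, drain rates, and thresholds are assumptions; only hunger, satiation, and pain are modeled, not fatigue):

```python
from dataclasses import dataclass, field

@dataclass
class Reaction:
    kind: str
    valence: float

@dataclass
class EnergyBridge:
    """Toy sketch of the energy -> Reaction bridge (names hypothetical)."""
    energy: float = 1.0
    reactions: list = field(default_factory=list)

    def tick(self, drain: float = 0.1) -> None:
        self.energy = max(0.0, self.energy - drain)
        if self.energy < 0.3:                      # interoceptive threshold
            self.reactions.append(Reaction("hunger", -0.4))

    def consume(self, restore: float) -> None:
        self.energy = min(1.0, self.energy + restore)
        if restore > 0:
            self.reactions.append(Reaction("satiation", +0.5))
        else:
            self.reactions.append(Reaction("pain", -0.8))

bridge = EnergyBridge()
for _ in range(8):       # deplete until the hunger threshold fires
    bridge.tick()
bridge.consume(0.6)      # food ration: positive-valence satiation
bridge.consume(-0.3)     # poison vial: negative-valence pain
```

The substrate never sees "energy" directly; it only sees the valenced reactions, which is what lets the same learning machinery handle consumables, tools, and social outcomes.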
Results
| Entity | Valence | Interpretation |
|---|---|---|
| Food ration | +0.753 | Strongly positive — reliably restores energy |
| Water flask | +0.135 | Mildly positive — restores energy, but background environmental satiation dilutes the signal |
| Poison vial | -0.495 | Strongly negative — causes pain |
Energy bridge events: 1 hunger, 1 fatigue, 3 satiation. Environmental satiation creates background positive credit; the discriminant is relative bias strength.
Reproduce
PYTHONPATH=src python scripts/behavioral_convergence_exp2.py
Detailed writeup: behavioral_convergence_practice.md
Experiment 1: Cross-Session Affective Memory
The agent interacts with three SEM entities in Session 1 — a rusty sword (causes pain on use), a healing potion (positive outcome), and a poison potion (disguised harm). Session state is persisted. In Session 2, we measure whether affective associations transferred: does the substrate carry negative valence for the sword, positive for healing, and negative for poison?
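The persistence step reduces to serializing affective state at session close and reloading it in a new process. A minimal sketch (file layout and key names are hypothetical; the valence values are the ones reported below):

```python
import json
import os
import tempfile

# Session 1: affective associations built from outcomes (values from the results table)
session1_valence = {
    "rusty_sword": -0.800,
    "healing_potion": 0.195,
    "poison_potion": -0.574,
}

path = os.path.join(tempfile.mkdtemp(), "substrate_state.json")
with open(path, "w") as f:
    json.dump(session1_valence, f)       # persist at session close

# Session 2: a new process reloads the state; a fresh control starts empty
with open(path) as f:
    session2_valence = json.load(f)
fresh_control = {}
```

The experienced agent's Session 2 retrieval starts from the reloaded map; the fresh control's lookups all return the neutral 0.000 shown in the control column.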
Results
| Entity | Experienced Agent | Fresh Control | Key Signal |
|---|---|---|---|
| Rusty sword | -0.800 | 0.000 | Strong negative valence from pain |
| Healing potion | +0.195 | 0.000 | Positive valence + NAc reward bias + EC widened |
| Poison potion | -0.574 | 0.000 | Negative valence despite “potion” label |
Shared “potion” concept carries mixed valence (healing + poison). Reward bias is asymmetric: positive reward widens EC recognition but never narrows it. Pain spikes create clean episode boundaries.
Reproduce
PYTHONPATH=src python scripts/behavioral_convergence_exp1.py
Detailed writeup: behavioral_convergence_practice.md
Valence Annotation PoC
Validates the three-stage valence annotation pipeline: (1) Reactions captured during an episode set Episode.valence as the mean reaction valence; (2) Hebbian edges inherit valence via Edge.metadata["valence"] at episode close; (3) spreading_activation(propagate_valence=True) propagates affective signal through multi-hop associations. Control condition: episodes without reactions have neutral valence (0.0), and propagate_valence=False returns plain activation values.
Stage 1: Episode Valence
Pain reactions set negative valence on the episode. Success reactions set positive. Mean of all reactions during the episode's lifetime.
Stage 2: Edge Valence
apply_hebbian_on_close annotates Hebbian edges with metadata["valence"]. Associative connections carry emotional coloring.
Stage 3: Propagation
Spreading activation carries valence through the graph. retrieve_on_cue(include_valence=True) returns affective context for the LLM prompt.
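The three stages compose into a short numeric pipeline. A sketch under stated assumptions (reaction values, edge weights, and the weight-power attenuation rule are illustrative, not the substrate's exact formulas):

```python
from statistics import mean

# Stage 1: episode valence is the mean valence of captured reactions
reactions = [-0.8, -0.6, 0.2]      # two pain spikes, one minor success
episode_valence = mean(reactions)  # -0.4: the episode is net-negative

# Stage 2: Hebbian edges created at episode close inherit that valence
edges = {
    ("vial", "pain"): {"weight": 0.9, "valence": episode_valence},
    ("vial", "room"): {"weight": 0.4, "valence": episode_valence},
}

# Stage 3: spreading activation propagates valence, attenuated per hop
def propagate(source_valence: float, weight: float, hops: int = 1) -> float:
    return source_valence * (weight ** hops)

one_hop = propagate(episode_valence, edges[("vial", "pain")]["weight"])  # ~-0.36
```

The control condition corresponds to `reactions` being empty (episode valence stays 0.0) and skipping the Stage 3 multiplication entirely.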
Detailed plan: substrate_valence_annotation.md
SEM Learning Loop PoC
The complete SEM learning loop wires five previously disconnected components into a single signal flow: CerebellumModulator executes affordances and emits typed Reactions (success or failure) → ReactionBus dispatches to hippocampus (capture_reaction for episode valence) and NAc (distribute_reward for per-node reward bias + EC threshold adjustment) → pain spikes close episode boundaries via salience_spike_rule → future retrieval carries affective memory via spreading_activation(propagate_valence=True).
Stage 1: Cerebellum Activation
BioStack.cerebellum constructed by build_bio_stack, forwarded via build_executor(cerebellum=...). Every SEM affordance tool gets a live Cerebellum backing.
Stage 2: distribute_reward Wiring
ReactionBus subscriber calls nac.distribute_reward on every Reaction. Positive rewards widen EC recognition (lower threshold); negative clamp to 0.
Stage 3: Success Reactions
CerebellumModulator emits POSITIVE reactions when confident enough to skip the LLM fallback, at lower intensity than negative reactions (0.1-0.3 vs 0.3-0.5), a biologically motivated negativity bias.
Stage 4: Pain Spike Boundaries
salience_spike_rule(min_intensity=0.5) closes the current episode on high-intensity pain, capturing negative valence and starting fresh. Mirrors biological trauma creating sharp memory boundaries.
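The four stages wire together as a publish/subscribe fan-out. A toy sketch of that flow (subscriber bodies are simplified stand-ins for the real hippocampus and NAc handlers; only the clamp-to-zero and episode-boundary behaviors from the stages above are modeled):

```python
from dataclasses import dataclass

@dataclass
class Reaction:
    valence: float    # sign: positive success, negative pain/failure
    intensity: float  # drives the episode-boundary rule

class ReactionBus:
    """Sketch of the bus fan-out; subscriber signatures are hypothetical."""
    def __init__(self):
        self.subscribers = []
    def publish(self, r: Reaction) -> None:
        for fn in self.subscribers:
            fn(r)

episode, closed_episodes, reward_bias = [], [], {}

def capture_reaction(r):                   # hippocampus: accumulate episode valence
    episode.append(r.valence)

def distribute_reward(r, node="current"):  # NAc: positive adds bias, negative clamps to 0
    reward_bias[node] = max(0.0, reward_bias.get(node, 0.0) + r.valence)

def salience_spike_rule(r, min_intensity=0.5):
    if r.valence < 0 and r.intensity >= min_intensity:
        closed_episodes.append(list(episode))  # pain spike closes the episode
        episode.clear()

bus = ReactionBus()
bus.subscribers += [capture_reaction, distribute_reward, salience_spike_rule]
bus.publish(Reaction(valence=0.2, intensity=0.2))   # low-intensity success: no boundary
bus.publish(Reaction(valence=-0.8, intensity=0.8))  # pain spike: episode closes
```

After the second publish the episode is sealed with its negative history intact, and the NAc bias for the node has been clamped back to zero by the pain reaction.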
Detailed plan: sem_learning_loop.md
v0.5.0 Experiments (2026-04-19)
Three new experiment results shipped in v0.5.0. B4 replanning closes the last 1.0 gate besides embodiment. P6 and P8 validate memory lifecycle mechanisms.
B4 Replanning — Blind A/B Validation (1.0 GATE CLOSED)
5 seeded failure scenarios. Treatment (B4 replanning with prior-attempt retrieval) vs control (no replanning, retries same approach). Treatment: 100% recovery (5/5). Control: 0% (0/5). Mean Jaccard distance 0.894 (minimum 0.600, threshold 0.3). Structural quality judge passes all 10 alternative plans. 12 tests.
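Jaccard distance over plan steps is the divergence metric behind the 0.894 figure. A minimal sketch (the example plan steps are invented for illustration; treating plans as sets of step labels is an assumption):

```python
def jaccard_distance(plan_a: set, plan_b: set) -> float:
    """1 - |A intersect B| / |A union B| over plan steps.

    0.0 means identical plans, 1.0 means fully disjoint; the B4 gate
    requires replanned attempts to exceed 0.3.
    """
    union = plan_a | plan_b
    if not union:
        return 0.0
    return 1.0 - len(plan_a & plan_b) / len(union)

failed    = {"open_door", "use_rusty_key", "push"}
replanned = {"open_window", "climb", "push"}
d = jaccard_distance(failed, replanned)  # 1 - 1/5 = 0.8, above the 0.3 threshold
```

A control agent that retries the same approach scores 0.0 on this metric, which is why the distance doubles as a cheap "actually changed strategy" check.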
Report: b4_replanning_results.md
P6 Extinction — Hebbian Decay vs LRU (2026-04-19)
Multiplicative Hebbian decay (DependencyGraph.decay_edges()) vs LRU baseline. Two-group fixture: Group A (reinforced) stays above 80%, Group B (unreinforced) drops below 20% after 30 ticks at factor 0.85. Hebbian decay beats LRU across all 10 seeds.
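The two-group outcome follows directly from the arithmetic of multiplicative decay. A worked sketch (the per-tick reinforcement increment of 0.2 is an assumption; the 0.85 factor and 30-tick horizon are from the experiment):

```python
factor, ticks = 0.85, 30

# Group B: never re-fired, so the edge weight just decays every tick
unreinforced = 1.0 * factor ** ticks        # 0.85^30 ~= 0.0076, far below 20%

# Group A: re-fired each tick in this sketch (increment is hypothetical)
reinforced = 1.0
for _ in range(ticks):
    reinforced = min(1.0, reinforced * factor + 0.2)
```

The fixed point of `w = 0.85*w + 0.2` sits above 1.0, so reinforced edges clamp at full strength while unreinforced ones collapse geometrically; an LRU baseline cannot produce this separation because it only tracks recency, not repetition.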
Report: p6_extinction_results.md
P8 Sleep Replay — Offline Consolidation (2026-04-19)
Episode ranking by NAc reward_bias + valence. Replay re-fires apply_hebbian_on_close with 1.5× consolidation multiplier. F1 score improves on replayed probes vs no-replay control across all 10 seeds. Activates memory_consolidation_practice.md living doc.
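The replay selection can be sketched as a sort over the stated ranking key plus a boosted re-fire. The episode records, the literal `reward_bias + valence` key, and the clamp in `replay_weight` are illustrative assumptions; only the 1.5× multiplier comes from the experiment:

```python
episodes = [
    {"id": "ep1", "reward_bias": 0.6, "valence": 0.4},
    {"id": "ep2", "reward_bias": 0.1, "valence": -0.7},
    {"id": "ep3", "reward_bias": 0.0, "valence": 0.05},
]

# Rank by reward_bias + valence, highest first, to pick replay candidates
ranked = sorted(
    episodes,
    key=lambda e: e["reward_bias"] + e["valence"],
    reverse=True,
)

CONSOLIDATION = 1.5

def replay_weight(base_weight: float) -> float:
    """Hebbian re-fire during replay, boosted and clamped (clamp assumed)."""
    return min(1.0, base_weight * CONSOLIDATION)
```

Note that under this literal key a strongly negative episode like `ep2` ranks last; if aversive memories should also consolidate, the key would need `abs(valence)` instead, which is a design choice the report would settle.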
Report: p8_sleep_replay_results.md
Earlier Substrate Results
The experiments above build on a foundation of substrate validation work. Key earlier results:
P2 Reward Modulation Sweep (2026-04-14)
Real-embedding sweep at [email protected], reward 2.0: +56.0 ± 29.0 pp target gain, 0.0 ± 0.0 pp distractor drift, 94% monotone, 9-of-10 seeds. NAc per-node reward bias modulates EC recognition thresholds correctly.
P4 Cross-Modal Binding (2026-04-16)
Head-to-head: Arm B (Hebbian) F1=1.000 vs Arm C (OpenCLIP) F1=0.901, delta +0.099. The substrate's Hebbian binding outperforms the neural baseline on cross-modal retrieval.
Concept Decomposition Validation (2026-04-17)
100% concept-level recall vs 36.4% baseline (+63.6 pp). Noun-phrase extraction before EC encoding enables finer-grained Hebbian binding and cross-modal retrieval.
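The decomposition step can be illustrated with a toy chunker: instead of encoding a whole description as one EC unit, split it into concept-sized pieces first. This stopword splitter is a deliberately crude stand-in for the real noun-phrase extractor (stopword list and splitting rule are assumptions):

```python
import re

STOP = {"a", "an", "the", "with", "and", "on", "of"}

def concept_chunks(description: str) -> list:
    """Toy concept decomposition: split on stopwords so each chunk
    gets its own EC encoding instead of one encoding per sentence."""
    chunks, current = [], []
    for tok in re.findall(r"[a-z]+", description.lower()):
        if tok in STOP:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = concept_chunks("a rusty sword with a cracked leather grip")
# each chunk is encoded separately, enabling finer-grained Hebbian binding
```

With per-chunk encodings, a later cue like "leather grip" can activate the sword's memory even when the full sentence never recurs, which is the mechanism behind the recall gain.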
P0 Baseline Pilot (2026-04-12)
Baseline pinned at 78.5% collapse. 55 clusters, 155 sentences, 3 difficulty tiers. Fixtures calibrated in the 60-85% range. Foundation for all subsequent substrate phases.
Full experiment reports: docs/experiments/