MAXIM
Multi-LLM Networking & Peer Mesh
Distributed Inference, Cloudflare Tunnels, and Cross-Machine Coordination
Why Distributed Inference
Maxim's agent pipeline calls the LLM multiple times per cycle: perception, memory consolidation, goal reasoning, execution planning, and statistical review. On a single machine with one GPU, these calls compete for the same inference server. When the model is large or the context window is deep, that bottleneck caps the cycle rate.
Distributed inference solves this by letting peer machines contribute their compute. A laptop on the same network, a desktop in another room, or a cloud VM halfway around the world can all send inference requests to the machine running the GPU. The leader handles the model; peers handle everything else.
The Core Problem
- Single-GPU saturation — A 7B model on a consumer GPU handles ~20 req/s; Maxim's full pipeline can generate 5-8 requests per cycle at higher frequencies.
- Model locality — The model lives in VRAM on one machine. Moving it is expensive. Moving the requests to it is cheap.
- Remote access — Cloudflare Tunnel exposes the inference server securely without opening firewall ports or configuring NAT, enabling peers anywhere on the internet.
Architecture Overview
The networking layer sits between the LLM router and the inference backend. On the leader, a reverse proxy accepts authenticated requests and forwards them to the local llama-cpp-server. On peers, the router is configured to point at the leader instead of localhost.
Both the leader's own agent loop and remote peers are independent HTTP clients of the same llama-cpp-server. The LeaderProxy authenticates and rate-limits peer traffic before it reaches the backend. The leader's own requests go directly to :8100, bypassing the proxy entirely.
Roles
Every Maxim instance operates in one of three roles. The role determines how the LLM router resolves inference endpoints and whether the proxy/tunnel subsystems activate.
| Role | Description | Runs |
|---|---|---|
| Leader | GPU machine. Hosts the model, runs the inference server, proxy, and tunnel. | llama-cpp-server + LeaderProxy + cloudflared + agent loop |
| Peer | Client machine. Sends inference requests to the leader over the tunnel. | agent loop only (LLM router points at leader) |
| Solo | Default. Everything local, no networking. Equivalent to a leader with no peers. | llama-cpp-server + agent loop (no proxy, no tunnel) |
Role Detection
Maxim detects the role at startup using the following priority:
- Explicit MAXIM_ROLE=leader|peer|solo environment variable.
- Presence of a cloudflared config file with an ingress rule pointing to :8099 implies leader.
- Presence of a peer.yml with a remote endpoint implies peer.
- Otherwise, solo.
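The priority order above can be sketched as a small pure function. This is an illustration only, not Maxim's actual detection code: the function name `detect_role`, the regular expressions, and the choice to ignore an unrecognized MAXIM_ROLE value are all assumptions. The caller is expected to read the two config files and pass their text in.

```python
import re
from typing import Mapping, Optional

VALID_ROLES = ("leader", "peer", "solo")


def detect_role(env: Mapping[str, str],
                cloudflared_config: Optional[str] = None,
                peer_yml: Optional[str] = None) -> str:
    """Resolve the role using the documented priority order (hypothetical helper)."""
    # 1. An explicit MAXIM_ROLE wins. Unknown values fall through (assumption).
    explicit = env.get("MAXIM_ROLE", "").strip().lower()
    if explicit in VALID_ROLES:
        return explicit
    # 2. A cloudflared ingress rule whose service targets port 8099 implies leader.
    if cloudflared_config and re.search(
            r"^\s*(?:-\s*)?service:\s*\S+:8099\b", cloudflared_config, re.MULTILINE):
        return "leader"
    # 3. A peer.yml that names a remote endpoint implies peer.
    if peer_yml and re.search(
            r"^\s*endpoint:\s*https?://\S+", peer_yml, re.MULTILINE):
        return "peer"
    # 4. Nothing matched: run everything locally.
    return "solo"
```

Passing the file contents as strings keeps the function independent of where the files live on disk, which this document only specifies for peer.yml.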
LeaderProxy (Phase 7a)
The LeaderProxy is a stdlib-only reverse proxy that listens on port 8099 and forwards authenticated requests to the local llama-cpp-server on port 8100. It uses only Python's http.server and urllib.request — no third-party dependencies.
Key Features
- Auth enforcement: Every request must carry a valid Authorization: Bearer <key> header. Keys are managed with maxim tunnel key rotate.
- Request-ID propagation: Each proxied request gets an X-Request-ID header (UUID4) for end-to-end tracing.
- Response headers: X-Maxim-Proxy: true, X-Maxim-Request-ID, and X-Maxim-Latency-Ms on every response.
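The three features above map onto a few stdlib calls. The following is a minimal sketch under stated assumptions, not the LeaderProxy source: the helper names (`is_authorized`, `new_request_id`, `response_headers`) are hypothetical, and keys are assumed to be available as an in-memory list.

```python
import hmac
import time
import uuid
from typing import Dict, Iterable, Optional


def is_authorized(auth_header: Optional[str], valid_keys: Iterable[str]) -> bool:
    """Check an 'Authorization: Bearer <key>' header against the known keys."""
    if not auth_header:
        return False
    scheme, _, token = auth_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest keeps each comparison constant-time for equal-length inputs.
    return any(hmac.compare_digest(token.encode(), key.encode())
               for key in valid_keys)


def new_request_id() -> str:
    """Generate the UUID4 value carried in X-Request-ID."""
    return str(uuid.uuid4())


def response_headers(request_id: str, started_at: float) -> Dict[str, str]:
    """Build the headers attached to every proxied response.

    started_at must come from time.monotonic() when the request arrived.
    """
    latency_ms = int((time.monotonic() - started_at) * 1000)
    return {
        "X-Maxim-Proxy": "true",
        "X-Maxim-Request-ID": request_id,
        "X-Maxim-Latency-Ms": str(latency_ms),
    }
```

A handler would call `is_authorized` first and answer 401 on failure (the exact status code is an assumption), then time the upstream call and merge `response_headers` into the reply.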
Debug Endpoints
The proxy exposes debug endpoints for operational visibility. All require the same Bearer token as inference requests (or localhost access).
| Endpoint | Purpose |
|---|---|
| /v1/debug/status | Proxy uptime, active connections, backend reachability |
| /v1/debug/heartbeat | Lightweight liveness check (200 OK) |
| /v1/debug/metrics | Request counts, latency percentiles, error rates |
| /v1/debug/last-requests | Ring buffer of recent requests (peer ID, latency, status) |
| /v1/debug/vram | Live VRAM usage (nvidia-smi ratio, spillover/warning flags) + projected model footprint. Returns 503 if no GPU. Prerequisite for capacity-aware routing. |
| /v1/debug/version | Maxim version, git hash, Python version |
| /v1/debug/logs | Recent structured log entries from the ring buffer |
| /v1/debug/deps | Installed Python packages and optional extras |
Admission Control (Phase 7b)
The inference server can only handle so many concurrent requests before latency degrades or VRAM is exhausted. Admission control prevents overload by rejecting excess traffic early, before it reaches the backend.
Two-Layer Protection
Concurrency Semaphore
Configured via MAXIM_PROXY_MAX_CONCURRENT (default: 4). When all slots are occupied, new requests receive a 429 Too Many Requests with an X-Maxim-Queue-Depth header indicating how many requests are waiting.
Per-Peer Rate Limiting
Configured via MAXIM_PROXY_RATE_LIMIT_RPM (default: 60). Each peer is tracked by its API key. Exceeding the limit returns a 429 with a Retry-After header.
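One way to combine the two layers is a per-key sliding window in front of a non-blocking semaphore. The sketch below is an assumption about how this could be built, not the proxy's code: the class name `AdmissionController` is hypothetical, it rejects immediately rather than queueing (so it does not emit X-Maxim-Queue-Depth), and a request rejected by the semaphore still counts against the peer's rate window.

```python
import math
import threading
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class AdmissionController:
    """Per-key rate limit plus a global concurrency cap (illustrative only)."""

    def __init__(self, max_concurrent: int = 4, rate_limit_rpm: int = 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._rpm = rate_limit_rpm
        self._clock = clock
        self._lock = threading.Lock()
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def try_admit(self, api_key: str) -> Tuple[int, Dict[str, str]]:
        """Return (status, headers). On 200 the caller must call release()."""
        now = self._clock()
        with self._lock:
            window = self._history[api_key]
            # Drop timestamps that have aged out of the 60-second window.
            while window and now - window[0] >= 60.0:
                window.popleft()
            if len(window) >= self._rpm:
                # The oldest entry leaves the window at window[0] + 60.
                wait = max(1, math.ceil(60.0 - (now - window[0])))
                return 429, {"Retry-After": str(wait)}
            window.append(now)
        # Non-blocking acquire: reject instead of waiting when all slots are busy.
        if not self._slots.acquire(blocking=False):
            return 429, {}
        return 200, {}

    def release(self) -> None:
        """Free the concurrency slot once the backend call has finished."""
        self._slots.release()
```

Injecting `clock` makes the limiter testable without sleeping; in production the default `time.monotonic` avoids jumps from wall-clock adjustments.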
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MAXIM_PROXY_MAX_CONCURRENT | 4 | Maximum simultaneous requests forwarded to backend |
| MAXIM_PROXY_RATE_LIMIT_RPM | 60 | Requests per minute allowed per peer API key |
On the peer side, the LLM router's retry logic handles 429s gracefully: it reads Retry-After, backs off, and retries. From the agent's perspective, the request is simply slower — the pipeline does not crash.
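The peer-side behaviour described above amounts to a retry loop that honours Retry-After. This is a hedged sketch, not the router's implementation: `send_with_retry` and its parameters are hypothetical, only the delta-seconds form of Retry-After is parsed, and the exponential fallback for a missing header is an assumption.

```python
import time
from typing import Callable, Dict, Tuple

# (status code, response headers, body)
Response = Tuple[int, Dict[str, str], bytes]


def send_with_retry(send: Callable[[], Response],
                    retry_max: int = 3,
                    default_backoff_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() and retry on 429, waiting as instructed by Retry-After."""
    attempt = 0
    while True:
        status, headers, body = send()
        # Anything other than 429, or an exhausted retry budget, is returned as-is.
        if status != 429 or attempt >= retry_max:
            return status, headers, body
        try:
            delay = float(headers.get("Retry-After", ""))
        except ValueError:
            # Header missing or not numeric: fall back to exponential backoff.
            delay = default_backoff_s * (2 ** attempt)
        sleep(max(0.0, delay))
        attempt += 1
```

With `retry_max=3` (the value shown in peer.yml) a request is attempted at most four times before the final 429 is surfaced to the caller.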
Lane Metrics (Phase 8)
The WorkerPool routes inference requests through capability tiers: large (14B+ GPU), medium (7B CPU/GPU), and small (1.7B CPU). Functions declare which tier they need via a FunctionRouter with fallback chains. Per-tier performance counters track throughput and latency in real time.
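A fallback chain can be pictured as an ordered list of tiers to try. The sketch below is illustrative only: the chain orderings in `FALLBACK_CHAINS` and the function `resolve_tier` are assumptions, since this document does not specify how FunctionRouter orders its fallbacks.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical chains: the requested tier first, then alternatives in order.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "large": ["large", "medium", "small"],
    "medium": ["medium", "small", "large"],
    "small": ["small", "medium", "large"],
}


def resolve_tier(requested: str, available: Iterable[str]) -> Optional[str]:
    """Return the first tier in the chain that currently has a worker."""
    up = set(available)
    for tier in FALLBACK_CHAINS.get(requested, [requested]):
        if tier in up:
            return tier
    # No tier in the chain is available; the caller decides how to fail.
    return None
```

For example, a function that asks for the large tier on a machine where only medium and small workers are running would be routed to medium.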
Tracked Counters
| Metric | Description |
|---|---|
| p50 / p99 latency | Median and tail latency per lane, computed over a sliding window |
| Failure rate | Percentage of requests that returned an error or timed out |
| Token throughput | Tokens per second generated, per lane |
The MetricsRegistry is a singleton shared between the agent runtime and the LeaderProxy. Both write to the same counters, giving a unified view of local and remote load. These metrics feed into maxim doctor, which flags lanes with elevated failure rates or latency spikes.
# View lane metrics from the proxy debug endpoint
curl -s -H "Authorization: Bearer $MAXIM_API_KEY" \
  https://maxim.yourdomain.com/v1/debug/metrics | python -m json.tool

# Example output
{
  "lanes": {
    "infer":  { "p50_ms": 142, "p99_ms": 890, "fail_pct": 0.3, "tok_per_sec": 48.2 },
    "review": { "p50_ms": 98,  "p99_ms": 520, "fail_pct": 0.0, "tok_per_sec": 31.7 },
    "record": { "p50_ms": 67,  "p99_ms": 310, "fail_pct": 0.1, "tok_per_sec": 22.4 }
  },
  "uptime_s": 3847,
  "total_requests": 1294
}
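Per-lane figures such as p50_ms, p99_ms, and fail_pct could be derived from a bounded window of recent samples. The following is a sketch under assumptions, not the MetricsRegistry code: `LaneStats` is a hypothetical name, the window is count-based rather than time-based, and percentiles use the nearest-rank method.

```python
import math
from collections import deque
from typing import Deque, Dict, List, Tuple


class LaneStats:
    """Sliding-window latency and failure counters for one lane (illustrative)."""

    def __init__(self, window: int = 1000) -> None:
        # Each sample is (latency_ms, ok); maxlen evicts the oldest automatically.
        self._samples: Deque[Tuple[float, bool]] = deque(maxlen=window)

    def record(self, latency_ms: float, ok: bool = True) -> None:
        self._samples.append((latency_ms, ok))

    @staticmethod
    def _percentile(sorted_values: List[float], pct: int) -> float:
        # Nearest-rank: the smallest value with at least pct% of samples at or below it.
        rank = max(1, math.ceil(pct * len(sorted_values) / 100))
        return sorted_values[rank - 1]

    def snapshot(self) -> Dict[str, float]:
        if not self._samples:
            return {"p50_ms": 0.0, "p99_ms": 0.0, "fail_pct": 0.0}
        latencies = sorted(sample[0] for sample in self._samples)
        failures = sum(1 for sample in self._samples if not sample[1])
        return {
            "p50_ms": self._percentile(latencies, 50),
            "p99_ms": self._percentile(latencies, 99),
            "fail_pct": round(100.0 * failures / len(self._samples), 1),
        }
```

Sorting on every snapshot is fine for windows of a few thousand samples; a registry shared between threads would additionally need a lock around `record` and `snapshot`.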
System Heartbeat
The heartbeat subsystem runs as a daemon thread that samples system vitals every 10 seconds. It provides early warning when hardware resources are constrained or the agent loop has stalled.
Sampled Signals
Hardware
- GPU utilization and VRAM usage
- CPU load average (1m / 5m / 15m)
- RAM usage (resident + swap)
- Disk free space on the data partition
- WiFi signal strength and link quality
Runtime
- Agent loop cycle count and timestamp of last cycle
- Stall detection: warns when the loop is idle >30 seconds
- WorkerPool queue depths per lane
- Active inference request count
Enabling the Heartbeat
# Enable heartbeat logging
export MAXIM_HEARTBEAT=1

# Or enable lane trace (which also enables heartbeat)
export MAXIM_LANE_TRACE=1

# Heartbeat output in the log
[heartbeat] gpu=72% vram=5.1/8.0GB cpu=2.4 ram=61% disk=42GB loop=+0.8s lanes=3/0/1
Stall detection is particularly useful for debugging distributed setups. If the agent loop blocks on a remote inference call that the leader has rate-limited, the heartbeat will flag the idle gap before the user notices the pause.
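Stall detection reduces to comparing the current time against the timestamp of the last completed cycle. This is a minimal sketch, assuming a 30-second threshold as stated above; the class `StallDetector` and the warning text it returns are hypothetical and do not reflect Maxim's actual log format.

```python
import time
from typing import Callable, Optional


class StallDetector:
    """Tracks the last agent-loop cycle and reports idle gaps (illustrative)."""

    def __init__(self, threshold_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold_s = threshold_s
        self._clock = clock
        self._last_cycle = clock()
        self.cycle_count = 0

    def mark_cycle(self) -> None:
        """Called by the agent loop at the end of every cycle."""
        self._last_cycle = self._clock()
        self.cycle_count += 1

    def idle_seconds(self) -> float:
        return self._clock() - self._last_cycle

    def check(self) -> Optional[str]:
        """Called by the heartbeat thread; returns a warning when stalled."""
        idle = self.idle_seconds()
        if idle > self._threshold_s:
            # Hypothetical message format.
            return f"[heartbeat] stall: agent loop idle for {idle:.1f}s"
        return None
```

The heartbeat thread would call `check()` on each 10-second sample, so a stall is reported at most about 10 seconds after the threshold is crossed.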
Peer Setup
Once Maxim is installed on the peer machine, setup takes three steps: connect to the leader, verify the link, and run with tracing enabled.
1. Connect to the Leader
# On the peer machine
maxim peer connect https://maxim.yourdomain.com/v1

# This prompts for the API key (generated on the leader with `maxim tunnel key rotate`)
# and writes ~/.config/maxim/peer.yml
2. Verify Connectivity
# Quick connectivity test (runs from the peer, no full agent runtime needed)
maxim peer test https://maxim.yourdomain.com/v1

# Expected output:
# Connecting to https://maxim.yourdomain.com/v1 ...
# Auth ............ OK (Bearer token accepted)
# Heartbeat ....... OK (proxy alive, 1294 requests served)
# Inference ....... OK (model loaded, 48 tok/s)
# Latency ......... 23ms round-trip
3. Run with Tracing
# Enable lane tracing to see which requests go remote
export MAXIM_LANE_TRACE=1
maxim --language-model mistral-7b

# Trace output shows remote routing:
# [lane:large] POST /v1/chat/completions -> remote (23ms RTT, 142ms total)
peer.yml Structure
endpoint: https://maxim.yourdomain.com/v1
api_key: mk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
timeout_s: 30
retry_max: 3
verify_ssl: true
Troubleshooting
Most networking issues fall into a few categories. Start with maxim doctor — it checks tunnel status, proxy reachability, auth validity, key hygiene, inference coherence, and system resources automatically. Use --json for CI integration or --as peer <url> to diagnose from the peer's perspective.
Peer-Mode Diagnostics
When Maxim runs as a peer pointed at a remote leader, doctor auto-detects the role (or you can force it with --as peer) and runs connectivity-specific checks instead of the tunnel and key setup checks:
# Full peer diagnostic with retry support
maxim doctor --as peer https://maxim.yourdomain.com/v1

# JSON output for scripts
maxim doctor --json --as peer https://maxim.yourdomain.com/v1

# Quick connectivity test (minimal, no retry)
maxim peer test https://maxim.yourdomain.com/v1
Peer checks cover: DNS resolution, URL reachability, API key validation, auth verification, model availability, and round-trip latency (p50/p95 from 5 probes). Fix hints point at the leader machine, not the peer.
Provider Priority
If inference is unexpectedly hitting a cloud API (Anthropic/OpenAI) instead of the peer leader, check the provider priority in ~/.maxim/config/llm.json. The router picks the highest-priority provider that's available. Ensure the local/peer provider is ranked above cloud providers for the model you're using.
Pricing Gate
The energy tracker applies a cost multiplier per provider. If the peer endpoint is not registered as a local-class provider, the router may reject requests that exceed the per-cycle cost budget. Fix by setting the provider class to local in llm.json.
Cloudflare WAF / Bot Fight Mode
Cloudflare's Web Application Firewall can return a 403 Forbidden for automated requests. If maxim peer test shows a 403 with an HTML body mentioning "Just a moment," Bot Fight Mode is interfering. Disable it in the Cloudflare dashboard under Security → Bots, or add a WAF exception rule for the tunnel hostname.
Stale DNS
After rotating the Cloudflare tunnel or changing the hostname, DNS propagation can take up to 5 minutes. If maxim peer test times out but the leader's cloudflared shows no incoming connections, flush DNS on the peer machine and retry.
Key Hygiene
maxim doctor checks API key age (warns after 90 days) and key file permissions (fails if world-readable on POSIX), and runs an auth smoke test that verifies the server accepts the real key and rejects bogus ones. If the auth smoke test reports "server accepts ANY key," your tunnel is bypassing the LeaderProxy: point the tunnel at port 8099 instead of 8100.
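The hygiene checks described here can be approximated with `os.stat` and a pair of probe requests. This is a sketch under stated assumptions rather than doctor's implementation: the function names are hypothetical, key age is approximated by the key file's modification time, and `probe` stands in for whatever authenticated request the caller chooses (it takes a key and returns the HTTP status).

```python
import os
import stat
import time
from typing import Callable, List, Optional

MAX_KEY_AGE_DAYS = 90


def check_key_file(path: str, now: Optional[float] = None) -> List[str]:
    """Report permission and age problems for the file that stores the API key."""
    findings: List[str] = []
    info = os.stat(path)
    # S_IROTH is the "readable by others" bit; it is only meaningful on POSIX.
    if os.name == "posix" and info.st_mode & stat.S_IROTH:
        findings.append("FAIL: key file is world-readable")
    # Assumption: the file's mtime is when the key was last rotated.
    current = time.time() if now is None else now
    age_days = (current - info.st_mtime) / 86400
    if age_days > MAX_KEY_AGE_DAYS:
        findings.append(f"WARN: key is {int(age_days)} days old")
    return findings


def auth_smoke(probe: Callable[[str], int], real_key: str) -> List[str]:
    """Verify the server accepts the real key and rejects a bogus one."""
    findings: List[str] = []
    if not 200 <= probe(real_key) < 300:
        findings.append("FAIL: server rejects the real key")
    # A made-up key must never be accepted; if it is, auth is being bypassed.
    if 200 <= probe("mk-bogus-" + "0" * 24) < 300:
        findings.append("FAIL: server accepts ANY key")
    return findings
```

An empty list from both functions means the checks passed; the second `auth_smoke` finding corresponds to the tunnel pointing at port 8100 rather than 8099.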
Inference Coherence
Doctor sends a fixed prompt ("What is 2+2?") and checks for "4" in the response. A wrong answer suggests the model is misconfigured, corrupted, or loaded in the wrong quantization. This catches silent failures where the server responds 200 but produces gibberish.
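The coherence probe is simple enough to sketch in a few lines. This is an illustration, not doctor's code: `check_coherence` is a hypothetical name, and `complete` is an assumed callable that sends a prompt to the inference endpoint and returns the response text.

```python
from typing import Callable, Tuple


def check_coherence(complete: Callable[[str], str],
                    prompt: str = "What is 2+2?",
                    expected: str = "4") -> Tuple[bool, str]:
    """Send a fixed prompt and look for the expected substring in the reply."""
    try:
        reply = complete(prompt)
    except Exception as exc:  # transport or server failure
        return False, f"inference call failed: {exc}"
    if expected in reply:
        return True, "coherent"
    # Truncate the reply so gibberish does not flood the report.
    return False, f"expected {expected!r} in response, got {reply[:80]!r}"
```

A substring match is deliberately loose: it accepts "4", "2+2 = 4", or "The answer is 4." while still catching a server that returns 200 with unrelated text.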
For deeper diagnostic procedures, see docs/troubleshooting/ in the repository. Use maxim doctor --json to generate machine-readable output for support bundles or CI pipelines.
What's Next
The networking layer has a clear progression from the current manual-setup model toward automatic discovery and intelligent routing.
Agent Mesh — mDNS Discovery
Automatic peer discovery on the local network using mDNS/DNS-SD. A Maxim instance broadcasts a _maxim-llm._tcp service record; peers discover it without manual endpoint configuration. Falls back to the current explicit peer.yml approach on networks where mDNS is blocked. Now part of the Agent Mesh plan (Phase 0a).
Agent Mesh — InferenceRouter
Smart request routing when multiple inference backends are available (local GPU + cloud API + peer leader). The InferenceRouter selects the backend per-request based on lane metrics (latency, queue depth, failure rate), cost constraints, and model compatibility. Now part of the Agent Mesh plan (Phase 0b).
Remote Self-Update & Soft Restart
maxim peer update auto-detects the leader's install mode. Pip-installed leaders upgrade via PyPI (use --version 0.3.1 to pin a release). Git-checkout leaders pull and reinstall (use --dev [branch] to force git mode). Installed extras are auto-detected and preserved during pip upgrades. maxim peer restart soft-restarts the leader via os.execv (same PID, clean import cycle).
LLM Hot-Swap
maxim peer llm <model> swaps the running LLM without restarting the Maxim process. It stops the current llama-cpp-server, starts a new one with the requested model, and health-checks it. The choice persists across restarts. maxim peer llm --status shows the active model, uptimes, and GPU utilization.
Cloud Provider Integration
Cloud LLMs (Claude, GPT-4o) can be added as fallback engines or dedicated lane providers. --cloud-fallback claude-sonnet adds Claude as a fallback when the self-hosted model fails. --cloud-lane review claude-haiku assigns a cloud model to a specific lane. Cost tracking, redaction gates, and session budgets enforce safety.