
MAXIM

Multi-LLM Networking & Peer Mesh

Distributed Inference, Cloudflare Tunnels, and Cross-Machine Coordination

Why Distributed Inference

Maxim's agent pipeline calls the LLM multiple times per cycle: perception, memory consolidation, goal reasoning, execution planning, and statistical review. On a single machine with one GPU, these calls compete for the same inference server. When the model is large or the context window is deep, that bottleneck caps the cycle rate.

Distributed inference solves this by letting peer machines contribute their compute. A laptop on the same network, a desktop in another room, or a cloud VM halfway around the world can all send inference requests to the machine running the GPU. The leader handles the model; peers handle everything else.

The Core Problem

  • Single-GPU saturation — A 7B model on a consumer GPU handles ~20 req/s; Maxim's full pipeline generates 5-8 requests per cycle, so raising the cycle rate quickly saturates a single inference server.
  • Model locality — The model lives in VRAM on one machine. Moving it is expensive. Moving the requests to it is cheap.
  • Remote access — Cloudflare Tunnel exposes the inference server securely without opening firewall ports or configuring NAT, enabling peers anywhere on the internet.

Architecture Overview

The networking layer sits between the LLM router and the inference backend. On the leader, a reverse proxy accepts authenticated requests and forwards them to the local llama-cpp-server. On peers, the router is configured to point at the leader instead of localhost.

Network Topology

Peer Machine                       Leader Machine
┌──────────────┐                   ┌─────────────────────────────────────┐
│  Agent Loop  │                   │  Agent Loop                         │
│      │       │                   │      │                              │
│      ▼       │                   │      ▼                              │
│  LLM Router  │                   │  LLM Router ──► llama-cpp-server    │
└──────┼───────┘                   │                     (:8100)         │
       │ HTTPS                     │                         ▲           │
       ▼                           │                         │           │
Cloudflare Tunnel ─► cloudflared ──┼──► LeaderProxy (:8099) ─┘           │
                                   └─────────────────────────────────────┘

Both the leader's own agent loop and remote peers are independent HTTP clients of the same llama-cpp-server. The LeaderProxy authenticates and rate-limits peer traffic before it reaches the backend. The leader's own requests go directly to :8100, bypassing the proxy entirely.

Roles

Every Maxim instance operates in one of three roles. The role determines how the LLM router resolves inference endpoints and whether the proxy/tunnel subsystems activate.

  • Leader — GPU machine. Hosts the model; runs the inference server, proxy, and tunnel. Runs: llama-cpp-server + LeaderProxy + cloudflared + agent loop.
  • Peer — Client machine. Sends inference requests to the leader over the tunnel. Runs: agent loop only (LLM router points at leader).
  • Solo — Default. Everything local, no networking; equivalent to a leader with no peers. Runs: llama-cpp-server + agent loop (no proxy, no tunnel).

Role Detection

Maxim detects the role at startup using the following priority:

  1. Explicit MAXIM_ROLE=leader|peer|solo environment variable.
  2. Presence of a cloudflared config file with an ingress rule pointing to :8099 implies leader.
  3. Presence of a peer.yml with a remote endpoint implies peer.
  4. Otherwise, solo.
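The priority order above can be sketched in a few lines of Python. This is illustrative only: the cloudflared config path and the exact file checks are assumptions, not Maxim's actual detection code.

```python
import os
from pathlib import Path

# Hypothetical config locations, for illustration only.
CLOUDFLARED_CONFIG = Path.home() / ".cloudflared" / "config.yml"
PEER_CONFIG = Path.home() / ".config" / "maxim" / "peer.yml"

def detect_role():
    """Resolve the instance role using the documented priority order."""
    # 1. Explicit override always wins.
    explicit = os.environ.get("MAXIM_ROLE")
    if explicit in ("leader", "peer", "solo"):
        return explicit
    # 2. A cloudflared config whose ingress points at the proxy implies leader.
    if CLOUDFLARED_CONFIG.exists() and ":8099" in CLOUDFLARED_CONFIG.read_text():
        return "leader"
    # 3. A peer.yml with a remote endpoint implies peer.
    if PEER_CONFIG.exists() and "endpoint:" in PEER_CONFIG.read_text():
        return "peer"
    # 4. Default: everything local.
    return "solo"
```

Because the environment variable is checked first, the file probes never override an explicit MAXIM_ROLE setting.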

LeaderProxy (Phase 7a)

The LeaderProxy is a stdlib-only reverse proxy that listens on port 8099 and forwards authenticated requests to the local llama-cpp-server on port 8100. It uses only Python's http.server and urllib.request — no third-party dependencies.

Key Features

  • Auth enforcement — Every request must carry a valid Authorization: Bearer <key> header. Keys are managed with maxim tunnel key rotate.
  • Request-ID propagation — Each proxied request gets an X-Request-ID header (UUID4) for end-to-end tracing.
  • Response headers — X-Maxim-Proxy: true, X-Maxim-Request-ID, and X-Maxim-Latency-Ms on every response.
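A stdlib-only proxy with these three behaviors can be sketched as below. This is a minimal illustration, not the actual LeaderProxy source: the placeholder API key, the authorized helper, and the fixed timeout are assumptions.

```python
import time
import uuid
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKEND = "http://127.0.0.1:8100"   # local llama-cpp-server
API_KEYS = {"mk-example-key"}       # placeholder; real keys come from `maxim tunnel key rotate`

def authorized(header):
    """Check an Authorization header against the known Bearer keys."""
    return header.startswith("Bearer ") and header[len("Bearer "):] in API_KEYS

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Auth enforcement: reject anything without a known Bearer key.
        if not authorized(self.headers.get("Authorization", "")):
            self.send_response(401)
            self.end_headers()
            return
        request_id = str(uuid.uuid4())          # propagated for end-to-end tracing
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        fwd = urllib.request.Request(BACKEND + self.path, data=body, method="POST")
        fwd.add_header("Content-Type", self.headers.get("Content-Type", "application/json"))
        fwd.add_header("X-Request-ID", request_id)
        start = time.monotonic()
        try:
            with urllib.request.urlopen(fwd, timeout=300) as resp:
                payload, status = resp.read(), resp.status
        except urllib.error.HTTPError as exc:
            payload, status = exc.read(), exc.code
        self.send_response(status)
        self.send_header("X-Maxim-Proxy", "true")
        self.send_header("X-Maxim-Request-ID", request_id)
        self.send_header("X-Maxim-Latency-Ms", str(int((time.monotonic() - start) * 1000)))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve: ThreadingHTTPServer(("0.0.0.0", 8099), ProxyHandler).serve_forever()
```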

Debug Endpoints

The proxy exposes four debug endpoints for operational visibility. All require the same Bearer token as inference requests.

Endpoint Purpose
/v1/debug/status Proxy uptime, active connections, backend reachability
/v1/debug/heartbeat Lightweight liveness check (200 OK)
/v1/debug/metrics Request counts, latency percentiles, error rates
/v1/debug/last-requests Ring buffer of recent requests (peer ID, latency, status)

Admission Control (Phase 7b)

The inference server can only handle so many concurrent requests before latency degrades or VRAM is exhausted. Admission control prevents overload by rejecting excess traffic early, before it reaches the backend.

Two-Layer Protection

Concurrency Semaphore

Configured via MAXIM_PROXY_MAX_CONCURRENT (default: 4). When all slots are occupied, new requests receive a 429 Too Many Requests with an X-Maxim-Queue-Depth header indicating how many requests are waiting.

Per-Peer Rate Limiting

Configured via MAXIM_PROXY_RATE_LIMIT_RPM (default: 60). Each peer is tracked by its API key. Exceeding the limit returns a 429 with a Retry-After header.
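Both layers can be built from stdlib primitives. The sketch below is illustrative only (admit and release are hypothetical names, and it assumes the caller releases the slot once the backend replies); it is not the actual proxy code.

```python
import threading
import time
from collections import defaultdict, deque

MAX_CONCURRENT = 4    # MAXIM_PROXY_MAX_CONCURRENT
RATE_LIMIT_RPM = 60   # MAXIM_PROXY_RATE_LIMIT_RPM

_slots = threading.BoundedSemaphore(MAX_CONCURRENT)
_history = defaultdict(deque)       # api_key -> recent request timestamps
_lock = threading.Lock()

def admit(api_key, now=None):
    """Return (True, {}) to forward, or (False, headers) for a 429 reply."""
    now = time.monotonic() if now is None else now
    # Layer 1: per-peer sliding 60 s window keyed by API key.
    with _lock:
        window = _history[api_key]
        while window and now - window[0] > 60.0:
            window.popleft()            # drop samples older than one minute
        if len(window) >= RATE_LIMIT_RPM:
            retry_after = int(60.0 - (now - window[0])) + 1
            return False, {"Retry-After": str(retry_after)}
        window.append(now)
    # Layer 2: global concurrency cap in front of the backend.
    if not _slots.acquire(blocking=False):
        # The real proxy reports how many requests are waiting; this
        # non-queuing sketch has no waiters to count.
        return False, {"X-Maxim-Queue-Depth": "0"}
    return True, {}

def release():
    """Call once the backend response has been fully relayed."""
    _slots.release()
```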

Environment Variables

Variable Default Description
MAXIM_PROXY_MAX_CONCURRENT 4 Maximum simultaneous requests forwarded to backend
MAXIM_PROXY_RATE_LIMIT_RPM 60 Requests per minute allowed per peer API key

On the peer side, the LLM router's retry logic handles 429s gracefully: it reads Retry-After, backs off, and retries. From the agent's perspective, the request is simply slower — the pipeline does not crash.
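A peer-side handler in that spirit might look like the following sketch. Function names here are illustrative, not Maxim's actual router API; the retry cap and timeout are assumptions.

```python
import time
import urllib.error
import urllib.request

def backoff_delay(retry_after, attempt, cap=30):
    """Prefer the server's Retry-After hint; else capped exponential backoff."""
    if retry_after is not None:
        return float(retry_after)
    return min(2 ** attempt, cap)

def post_with_retry(req, retries=3):
    """Send a request, retrying on 429 Too Many Requests."""
    for attempt in range(retries + 1):
        try:
            return urllib.request.urlopen(req, timeout=30)
        except urllib.error.HTTPError as exc:
            if exc.code != 429 or attempt == retries:
                raise               # non-429 errors and the final failure propagate
            time.sleep(backoff_delay(exc.headers.get("Retry-After"), attempt))
```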

Lane Metrics (Phase 8)

The WorkerPool routes inference requests through capability tiers: large (14B+ GPU), medium (7B CPU/GPU), and small (1.7B CPU). Functions declare which tier they need via a FunctionRouter with fallback chains. Per-tier performance counters track throughput and latency in real time.

Tracked Counters

Metric Description
p50 / p99 latency Median and tail latency per lane, computed over a sliding window
Failure rate Percentage of requests that returned an error or timed out
Token throughput Tokens per second generated, per lane
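As an illustration of how such sliding-window counters can be computed (class and method names below are hypothetical, not the actual MetricsRegistry internals):

```python
from collections import deque

class LaneCounter:
    """Sliding-window counters for one lane (window = last N requests)."""
    def __init__(self, window=256):
        self.samples = deque(maxlen=window)     # (latency_ms, ok, tokens)

    def record(self, latency_ms, ok, tokens=0):
        self.samples.append((latency_ms, ok, tokens))

    def percentile(self, p):
        """Nearest-rank percentile over the current window."""
        lat = sorted(s[0] for s in self.samples)
        if not lat:
            return 0.0
        return lat[min(int(p / 100 * len(lat)), len(lat) - 1)]

    def fail_pct(self):
        if not self.samples:
            return 0.0
        return 100.0 * sum(1 for s in self.samples if not s[1]) / len(self.samples)
```

The bounded deque keeps memory constant while naturally aging out old samples, which is why p50/p99 track recent load rather than lifetime averages.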

The MetricsRegistry is a singleton shared between the agent runtime and the LeaderProxy. Both write to the same counters, giving a unified view of local and remote load. These metrics feed into maxim doctor, which flags lanes with elevated failure rates or latency spikes.

# View lane metrics from the proxy debug endpoint
curl -s -H "Authorization: Bearer $MAXIM_API_KEY" \
  https://maxim.yourdomain.com/v1/debug/metrics | python -m json.tool

# Example output
{
  "lanes": {
    "infer":  { "p50_ms": 142, "p99_ms": 890, "fail_pct": 0.3, "tok_per_sec": 48.2 },
    "review": { "p50_ms": 98,  "p99_ms": 520, "fail_pct": 0.0, "tok_per_sec": 31.7 },
    "record": { "p50_ms": 67,  "p99_ms": 310, "fail_pct": 0.1, "tok_per_sec": 22.4 }
  },
  "uptime_s": 3847,
  "total_requests": 1294
}

System Heartbeat

The heartbeat subsystem runs as a daemon thread that samples system vitals every 10 seconds. It provides early warning when hardware resources are constrained or the agent loop has stalled.

Sampled Signals

Hardware

  • GPU utilization and VRAM usage
  • CPU load average (1m / 5m / 15m)
  • RAM usage (resident + swap)
  • Disk free space on the data partition
  • WiFi signal strength and link quality

Runtime

  • Agent loop cycle count and timestamp of last cycle
  • Stall detection: warns when the loop is idle >30 seconds
  • WorkerPool queue depths per lane
  • Active inference request count
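The stall-detection half of the subsystem can be sketched as follows. Heartbeat, beat, and stalled are illustrative names, not Maxim's actual API, and the hardware sampling is omitted.

```python
import threading
import time

class Heartbeat:
    """Daemon thread that samples vitals and flags agent-loop stalls."""
    def __init__(self, interval_s=10.0, stall_after_s=30.0):
        self.interval_s = interval_s
        self.stall_after_s = stall_after_s
        self.last_cycle = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """The agent loop calls this at the end of every cycle."""
        self.last_cycle = time.monotonic()

    def stalled(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_cycle > self.stall_after_s

    def _run(self):
        while True:
            if self.stalled():
                print(f"[heartbeat] WARNING: loop idle >{self.stall_after_s:.0f}s")
            # A full sampler would also read GPU/CPU/RAM/disk/WiFi here.
            time.sleep(self.interval_s)

    def start(self):
        self._thread.start()
```

Running as a daemon thread means the heartbeat never blocks process shutdown, and calling beat() from the loop itself makes a blocked remote inference call visible as a growing idle gap.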

Enabling the Heartbeat

# Enable heartbeat logging
export MAXIM_HEARTBEAT=1

# Or enable lane trace (which also enables heartbeat)
export MAXIM_LANE_TRACE=1

# Heartbeat output in the log
[heartbeat] gpu=72% vram=5.1/8.0GB cpu=2.4 ram=61% disk=42GB loop=+0.8s lanes=3/0/1

Stall detection is particularly useful for debugging distributed setups. If the agent loop blocks on a remote inference call that the leader has rate-limited, the heartbeat will flag the idle gap before the user notices the pause.

Peer Setup

With Maxim already installed on the peer machine, setup takes three steps: connect to the leader, verify the link, and run with tracing enabled to confirm requests go remote.

1. Connect to the Leader

# On the peer machine
maxim peer connect https://maxim.yourdomain.com/v1

# This prompts for the API key (generated on the leader with `maxim tunnel key rotate`)
# and writes ~/.config/maxim/peer.yml

2. Verify Connectivity

# Quick connectivity test (runs from the peer, no full agent runtime needed)
maxim peer test https://maxim.yourdomain.com/v1

# Expected output:
#   Connecting to https://maxim.yourdomain.com/v1 ...
#   Auth ............ OK (Bearer token accepted)
#   Heartbeat ....... OK (proxy alive, 1294 requests served)
#   Inference ....... OK (model loaded, 48 tok/s)
#   Latency ......... 23ms round-trip

3. Run with Tracing

# Enable lane tracing to see which requests go remote
export MAXIM_LANE_TRACE=1
maxim --language-model mistral-7b

# Trace output shows remote routing:
# [lane:infer] POST /v1/chat/completions -> remote (23ms RTT, 142ms total)

peer.yml Structure

endpoint: https://maxim.yourdomain.com/v1
api_key: mk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
timeout_s: 30
retry_max: 3
verify_ssl: true
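Since the file is flat key: value pairs, a stdlib-only loader could be as simple as the sketch below. load_peer_config is an illustrative name; the real loader presumably also coerces the numeric and boolean fields, which this sketch leaves as strings.

```python
from pathlib import Path

def load_peer_config(path):
    """Parse a flat `key: value` peer.yml without third-party dependencies."""
    cfg = {}
    for line in Path(path).read_text().splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            # Split on the first colon only, so URL values survive intact.
            key, _, value = line.partition(":")
            cfg[key.strip()] = value.strip()
    return cfg
```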

Troubleshooting

Most networking issues fall into a few categories. Start with maxim doctor — it checks tunnel status, proxy reachability, and auth validity automatically.

Provider Priority

If inference is unexpectedly hitting a cloud API (Anthropic/OpenAI) instead of the peer leader, check the provider priority in data/util/llm.json. The router picks the highest-priority provider that's available. Ensure the local/peer provider is ranked above cloud providers for the model you're using.

Pricing Gate

The energy tracker applies a cost multiplier per provider. If the peer endpoint is not registered as a local-class provider, the router may reject requests that exceed the per-cycle cost budget. Fix by setting the provider class to local in llm.json.
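The actual schema of data/util/llm.json is not reproduced here, so the field names in this fragment are hypothetical; the point it illustrates is that the peer provider needs both a higher priority than the cloud providers and a local class so the pricing gate treats it as local compute.

```json
{
  "providers": [
    { "name": "peer-leader", "class": "local", "priority": 100,
      "endpoint": "https://maxim.yourdomain.com/v1" },
    { "name": "anthropic", "class": "cloud", "priority": 50 }
  ]
}
```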

Cloudflare WAF / Bot Fight Mode

Cloudflare's Web Application Firewall can return a 403 Forbidden for automated requests. If maxim peer test shows a 403 with an HTML body mentioning "Just a moment," Bot Fight Mode is interfering. Disable it in the Cloudflare dashboard under Security → Bots, or add a WAF exception rule for the tunnel hostname.

Stale DNS

After rotating the Cloudflare tunnel or changing the hostname, DNS propagation can take up to 5 minutes. If maxim peer test times out but the leader's cloudflared shows no incoming connections, flush DNS on the peer machine and retry.

For deeper diagnostic procedures, see docs/troubleshooting/ in the repository.

What's Next

The networking layer has a clear progression from the current manual-setup model toward automatic discovery and intelligent routing.

Agent Mesh — mDNS Discovery

Automatic peer discovery on the local network using mDNS/DNS-SD. A Maxim instance broadcasts a _maxim-llm._tcp service record; peers discover it without manual endpoint configuration. Falls back to the current explicit peer.yml approach on networks where mDNS is blocked. Now part of the Agent Mesh plan (Phase 0a).

Agent Mesh — InferenceRouter

Smart request routing when multiple inference backends are available (local GPU + cloud API + peer leader). The InferenceRouter selects the backend per-request based on lane metrics (latency, queue depth, failure rate), cost constraints, and model compatibility. Now part of the Agent Mesh plan (Phase 0b).

Remote Self-Update & Soft Restart

maxim peer update pulls the latest code and installs it on the remote machine. maxim peer restart soft-restarts the leader via os.execv (same PID, clean import cycle). Use --force to stash a dirty working tree automatically.

LLM Hot-Swap

maxim peer llm <model> swaps the running LLM without restarting the Maxim process. It stops the current llama-cpp-server, starts a new one with the requested model, and health-checks it. The choice persists across restarts. maxim peer llm --status shows the active model, uptime, and GPU utilization.

Cloud Provider Integration

Cloud LLMs (Claude, GPT-4o) can be added as fallback engines or dedicated lane providers. --cloud-fallback claude-sonnet adds Claude as a fallback when the self-hosted model fails. --cloud-lane review claude-haiku assigns a cloud model to a specific lane. Cost tracking, redaction gates, and session budgets enforce safety.