
MAXIM

Multi-LLM Networking & Peer Mesh

Distributed Inference, Cloudflare Tunnels, and Cross-Machine Coordination

Why Distributed Inference

Maxim's agent pipeline calls the LLM multiple times per cycle: perception, memory consolidation, goal reasoning, execution planning, and statistical review. On a single machine with one GPU, these calls compete for the same inference server. When the model is large or the context window is deep, that bottleneck caps the cycle rate.

Distributed inference solves this by letting peer machines contribute their compute. A laptop on the same network, a desktop in another room, or a cloud VM halfway around the world can all send inference requests to the machine running the GPU. The leader handles the model; peers handle everything else.

The Core Problem

  • Single-GPU saturation — A 7B model on a consumer GPU handles ~20 req/s; Maxim's full pipeline generates 5-8 requests per cycle, so raising the cycle rate quickly saturates a single inference server.
  • Model locality — The model lives in VRAM on one machine. Moving it is expensive. Moving the requests to it is cheap.
  • Remote access — Cloudflare Tunnel exposes the inference server securely without opening firewall ports or configuring NAT, enabling peers anywhere on the internet.

Architecture Overview

The networking layer sits between the LLM router and the inference backend. On the leader, a reverse proxy accepts authenticated requests and forwards them to the local llama-cpp-server. On peers, the router is configured to point at the leader instead of localhost.

Network Topology

Peer Machine                       Leader Machine
┌──────────────┐                   ┌─────────────────────────────────────┐
│  Agent Loop  │                   │  Agent Loop                         │
│      │       │                   │      │                              │
│      ▼       │                   │      ▼                              │
│  LLM Router  │                   │  LLM Router ──► llama-cpp-server    │
└──────┼───────┘                   │                     (:8100)         │
       │ HTTPS                     │                         ▲           │
       ▼                           │                         │           │
Cloudflare Tunnel ─► cloudflared ──┼──► LeaderProxy (:8099) ─┘           │
                                   └─────────────────────────────────────┘

Both the leader's own agent loop and remote peers are independent HTTP clients of the same llama-cpp-server. The LeaderProxy authenticates and rate-limits peer traffic before it reaches the backend. The leader's own requests go directly to :8100, bypassing the proxy entirely.

Roles

Every Maxim instance operates in one of three roles. The role determines how the LLM router resolves inference endpoints and whether the proxy/tunnel subsystems activate.

  • Leader — GPU machine. Hosts the model; runs the inference server, proxy, and tunnel. Runs: llama-cpp-server + LeaderProxy + cloudflared + agent loop.
  • Peer — Client machine. Sends inference requests to the leader over the tunnel. Runs: agent loop only (LLM router points at leader).
  • Solo — Default. Everything local, no networking; equivalent to a leader with no peers. Runs: llama-cpp-server + agent loop (no proxy, no tunnel).

Role Detection

Maxim detects the role at startup using the following priority:

  1. Explicit MAXIM_ROLE=leader|peer|solo environment variable.
  2. Presence of a cloudflared config file with an ingress rule pointing to :8099 implies leader.
  3. Presence of a peer.yml with a remote endpoint implies peer.
  4. Otherwise, solo.
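The priority order above can be sketched in a few lines of Python. This is illustrative only: the cloudflared config path and the exact file checks are assumptions, not Maxim's actual detection code.

```python
import os
from pathlib import Path

# Hypothetical config locations, for illustration only.
CLOUDFLARED_CONFIG = Path.home() / ".cloudflared" / "config.yml"
PEER_CONFIG = Path.home() / ".config" / "maxim" / "peer.yml"

def detect_role():
    """Resolve the instance role using the documented priority order."""
    # 1. Explicit override always wins.
    explicit = os.environ.get("MAXIM_ROLE")
    if explicit in ("leader", "peer", "solo"):
        return explicit
    # 2. A cloudflared config whose ingress points at the proxy implies leader.
    if CLOUDFLARED_CONFIG.exists() and ":8099" in CLOUDFLARED_CONFIG.read_text():
        return "leader"
    # 3. A peer.yml with a remote endpoint implies peer.
    if PEER_CONFIG.exists() and "endpoint:" in PEER_CONFIG.read_text():
        return "peer"
    # 4. Default: everything local.
    return "solo"
```

Because the environment variable is checked first, the file probes never override an explicit MAXIM_ROLE setting.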

LeaderProxy (Phase 7a)

The LeaderProxy is a stdlib-only reverse proxy that listens on port 8099 and forwards authenticated requests to the local llama-cpp-server on port 8100. It uses only Python's http.server and urllib.request — no third-party dependencies.

Key Features

  • Auth enforcement — Every request must carry a valid Authorization: Bearer <key> header. Keys are managed with maxim tunnel key rotate.
  • Request-ID propagation — Each proxied request gets an X-Request-ID header (UUID4) for end-to-end tracing.
  • Response headers — X-Maxim-Proxy: true, X-Maxim-Request-ID, and X-Maxim-Latency-Ms on every response.
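A stdlib-only proxy with these three behaviors can be sketched as below. This is a minimal illustration, not the actual LeaderProxy source: the placeholder API key, the authorized helper, and the fixed timeout are assumptions.

```python
import time
import uuid
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKEND = "http://127.0.0.1:8100"   # local llama-cpp-server
API_KEYS = {"mk-example-key"}       # placeholder; real keys come from `maxim tunnel key rotate`

def authorized(header):
    """Check an Authorization header against the known Bearer keys."""
    return header.startswith("Bearer ") and header[len("Bearer "):] in API_KEYS

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Auth enforcement: reject anything without a known Bearer key.
        if not authorized(self.headers.get("Authorization", "")):
            self.send_response(401)
            self.end_headers()
            return
        request_id = str(uuid.uuid4())          # propagated for end-to-end tracing
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        fwd = urllib.request.Request(BACKEND + self.path, data=body, method="POST")
        fwd.add_header("Content-Type", self.headers.get("Content-Type", "application/json"))
        fwd.add_header("X-Request-ID", request_id)
        start = time.monotonic()
        try:
            with urllib.request.urlopen(fwd, timeout=300) as resp:
                payload, status = resp.read(), resp.status
        except urllib.error.HTTPError as exc:
            payload, status = exc.read(), exc.code
        self.send_response(status)
        self.send_header("X-Maxim-Proxy", "true")
        self.send_header("X-Maxim-Request-ID", request_id)
        self.send_header("X-Maxim-Latency-Ms", str(int((time.monotonic() - start) * 1000)))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve: ThreadingHTTPServer(("0.0.0.0", 8099), ProxyHandler).serve_forever()
```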

Debug Endpoints

The proxy exposes four debug endpoints for operational visibility. All require the same Bearer token as inference requests.

Endpoint Purpose
/v1/debug/status Proxy uptime, active connections, backend reachability
/v1/debug/heartbeat Lightweight liveness check (200 OK)
/v1/debug/metrics Request counts, latency percentiles, error rates
/v1/debug/last-requests Ring buffer of recent requests (peer ID, latency, status)

Admission Control (Phase 7b)

The inference server can only handle so many concurrent requests before latency degrades or VRAM is exhausted. Admission control prevents overload by rejecting excess traffic early, before it reaches the backend.

Two-Layer Protection

Concurrency Semaphore

Configured via MAXIM_PROXY_MAX_CONCURRENT (default: 4). When all slots are occupied, new requests receive a 429 Too Many Requests with an X-Maxim-Queue-Depth header indicating how many requests are waiting.

Per-Peer Rate Limiting

Configured via MAXIM_PROXY_RATE_LIMIT_RPM (default: 60). Each peer is tracked by its API key. Exceeding the limit returns a 429 with a Retry-After header.
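Both layers can be built from stdlib primitives. The sketch below is illustrative only (admit and release are hypothetical names, and it assumes the caller releases the slot once the backend replies); it is not the actual proxy code.

```python
import threading
import time
from collections import defaultdict, deque

MAX_CONCURRENT = 4    # MAXIM_PROXY_MAX_CONCURRENT
RATE_LIMIT_RPM = 60   # MAXIM_PROXY_RATE_LIMIT_RPM

_slots = threading.BoundedSemaphore(MAX_CONCURRENT)
_history = defaultdict(deque)       # api_key -> recent request timestamps
_lock = threading.Lock()

def admit(api_key, now=None):
    """Return (True, {}) to forward, or (False, headers) for a 429 reply."""
    now = time.monotonic() if now is None else now
    # Layer 1: per-peer sliding 60 s window keyed by API key.
    with _lock:
        window = _history[api_key]
        while window and now - window[0] > 60.0:
            window.popleft()            # drop samples older than one minute
        if len(window) >= RATE_LIMIT_RPM:
            retry_after = int(60.0 - (now - window[0])) + 1
            return False, {"Retry-After": str(retry_after)}
        window.append(now)
    # Layer 2: global concurrency cap in front of the backend.
    if not _slots.acquire(blocking=False):
        # The real proxy reports how many requests are waiting; this
        # non-queuing sketch has no waiters to count.
        return False, {"X-Maxim-Queue-Depth": "0"}
    return True, {}

def release():
    """Call once the backend response has been fully relayed."""
    _slots.release()
```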

Environment Variables

Variable Default Description
MAXIM_PROXY_MAX_CONCURRENT 4 Maximum simultaneous requests forwarded to backend
MAXIM_PROXY_RATE_LIMIT_RPM 60 Requests per minute allowed per peer API key

On the peer side, the LLM router's retry logic handles 429s gracefully: it reads Retry-After, backs off, and retries. From the agent's perspective, the request is simply slower — the pipeline does not crash.
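A peer-side handler in that spirit might look like the following sketch. Function names here are illustrative, not Maxim's actual router API; the retry cap and timeout are assumptions.

```python
import time
import urllib.error
import urllib.request

def backoff_delay(retry_after, attempt, cap=30):
    """Prefer the server's Retry-After hint; else capped exponential backoff."""
    if retry_after is not None:
        return float(retry_after)
    return min(2 ** attempt, cap)

def post_with_retry(req, retries=3):
    """Send a request, retrying on 429 Too Many Requests."""
    for attempt in range(retries + 1):
        try:
            return urllib.request.urlopen(req, timeout=30)
        except urllib.error.HTTPError as exc:
            if exc.code != 429 or attempt == retries:
                raise               # non-429 errors and the final failure propagate
            time.sleep(backoff_delay(exc.headers.get("Retry-After"), attempt))
```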

Lane Metrics (Phase 8)

The WorkerPool routes inference requests through capability tiers: large (14B+ GPU), medium (7B CPU/GPU), and small (1.7B CPU). Functions declare which tier they need via a FunctionRouter with fallback chains. Per-tier performance counters track throughput and latency in real time.

Tracked Counters

Metric Description
p50 / p99 latency Median and tail latency per lane, computed over a sliding window
Failure rate Percentage of requests that returned an error or timed out
Token throughput Tokens per second generated, per lane
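As an illustration of how such sliding-window counters can be computed (class and method names below are hypothetical, not the actual MetricsRegistry internals):

```python
from collections import deque

class LaneCounter:
    """Sliding-window counters for one lane (window = last N requests)."""
    def __init__(self, window=256):
        self.samples = deque(maxlen=window)     # (latency_ms, ok, tokens)

    def record(self, latency_ms, ok, tokens=0):
        self.samples.append((latency_ms, ok, tokens))

    def percentile(self, p):
        """Nearest-rank percentile over the current window."""
        lat = sorted(s[0] for s in self.samples)
        if not lat:
            return 0.0
        return lat[min(int(p / 100 * len(lat)), len(lat) - 1)]

    def fail_pct(self):
        if not self.samples:
            return 0.0
        return 100.0 * sum(1 for s in self.samples if not s[1]) / len(self.samples)
```

The bounded deque keeps memory constant while naturally aging out old samples, which is why p50/p99 track recent load rather than lifetime averages.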

The MetricsRegistry is a singleton shared between the agent runtime and the LeaderProxy. Both write to the same counters, giving a unified view of local and remote load. These metrics feed into maxim doctor, which flags lanes with elevated failure rates or latency spikes.

# View lane metrics from the proxy debug endpoint
curl -s -H "Authorization: Bearer $MAXIM_API_KEY" \
  https://maxim.yourdomain.com/v1/debug/metrics | python -m json.tool

# Example output
{
  "lanes": {
    "infer":  { "p50_ms": 142, "p99_ms": 890, "fail_pct": 0.3, "tok_per_sec": 48.2 },
    "review": { "p50_ms": 98,  "p99_ms": 520, "fail_pct": 0.0, "tok_per_sec": 31.7 },
    "record": { "p50_ms": 67,  "p99_ms": 310, "fail_pct": 0.1, "tok_per_sec": 22.4 }
  },
  "uptime_s": 3847,
  "total_requests": 1294
}

System Heartbeat

The heartbeat subsystem runs as a daemon thread that samples system vitals every 10 seconds. It provides early warning when hardware resources are constrained or the agent loop has stalled.

Sampled Signals

Hardware

  • GPU utilization and VRAM usage
  • CPU load average (1m / 5m / 15m)
  • RAM usage (resident + swap)
  • Disk free space on the data partition
  • WiFi signal strength and link quality

Runtime

  • Agent loop cycle count and timestamp of last cycle
  • Stall detection: warns when the loop is idle >30 seconds
  • WorkerPool queue depths per lane
  • Active inference request count
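The stall-detection half of the subsystem can be sketched as follows. Heartbeat, beat, and stalled are illustrative names, not Maxim's actual API, and the hardware sampling is omitted.

```python
import threading
import time

class Heartbeat:
    """Daemon thread that samples vitals and flags agent-loop stalls."""
    def __init__(self, interval_s=10.0, stall_after_s=30.0):
        self.interval_s = interval_s
        self.stall_after_s = stall_after_s
        self.last_cycle = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """The agent loop calls this at the end of every cycle."""
        self.last_cycle = time.monotonic()

    def stalled(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_cycle > self.stall_after_s

    def _run(self):
        while True:
            if self.stalled():
                print(f"[heartbeat] WARNING: loop idle >{self.stall_after_s:.0f}s")
            # A full sampler would also read GPU/CPU/RAM/disk/WiFi here.
            time.sleep(self.interval_s)

    def start(self):
        self._thread.start()
```

Running as a daemon thread means the heartbeat never blocks process shutdown, and calling beat() from the loop itself makes a blocked remote inference call visible as a growing idle gap.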

Enabling the Heartbeat

# Enable heartbeat logging
export MAXIM_HEARTBEAT=1

# Or enable lane trace (which also enables heartbeat)
export MAXIM_LANE_TRACE=1

# Heartbeat output in the log
[heartbeat] gpu=72% vram=5.1/8.0GB cpu=2.4 ram=61% disk=42GB loop=+0.8s lanes=3/0/1

Stall detection is particularly useful for debugging distributed setups. If the agent loop blocks on a remote inference call that the leader has rate-limited, the heartbeat will flag the idle gap before the user notices the pause.

Peer Setup

With Maxim already installed on the peer machine, setup takes three steps: connect to the leader, verify the link, and run with tracing enabled to confirm requests go remote.

1. Connect to the Leader

# On the peer machine
maxim peer connect https://maxim.yourdomain.com/v1

# This prompts for the API key (generated on the leader with `maxim tunnel key rotate`)
# and writes ~/.config/maxim/peer.yml

2. Verify Connectivity

# Quick connectivity test (runs from the peer, no full agent runtime needed)
maxim peer test https://maxim.yourdomain.com/v1

# Expected output:
#   Connecting to https://maxim.yourdomain.com/v1 ...
#   Auth ............ OK (Bearer token accepted)
#   Heartbeat ....... OK (proxy alive, 1294 requests served)
#   Inference ....... OK (model loaded, 48 tok/s)
#   Latency ......... 23ms round-trip

3. Run with Tracing

# Enable lane tracing to see which requests go remote
export MAXIM_LANE_TRACE=1
maxim --language-model mistral-7b

# Trace output shows remote routing:
# [lane:infer] POST /v1/chat/completions -> remote (23ms RTT, 142ms total)

peer.yml Structure

endpoint: https://maxim.yourdomain.com/v1
api_key: mk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
timeout_s: 30
retry_max: 3
verify_ssl: true
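Since the file is flat key: value pairs, a stdlib-only loader could be as simple as the sketch below. load_peer_config is an illustrative name; the real loader presumably also coerces the numeric and boolean fields, which this sketch leaves as strings.

```python
from pathlib import Path

def load_peer_config(path):
    """Parse a flat `key: value` peer.yml without third-party dependencies."""
    cfg = {}
    for line in Path(path).read_text().splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            # Split on the first colon only, so URL values survive intact.
            key, _, value = line.partition(":")
            cfg[key.strip()] = value.strip()
    return cfg
```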

Troubleshooting

Most networking issues fall into a few categories. Start with maxim doctor — it checks tunnel status, proxy reachability, and auth validity automatically.

Provider Priority

If inference is unexpectedly hitting a cloud API (Anthropic/OpenAI) instead of the peer leader, check the provider priority in data/util/llm.json. The router picks the highest-priority provider that's available. Ensure the local/peer provider is ranked above cloud providers for the model you're using.

Pricing Gate

The energy tracker applies a cost multiplier per provider. If the peer endpoint is not registered as a local-class provider, the router may reject requests that exceed the per-cycle cost budget. Fix by setting the provider class to local in llm.json.
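The actual schema of data/util/llm.json is not reproduced here, so the field names in this fragment are hypothetical; the point it illustrates is that the peer provider needs both a higher priority than the cloud providers and a local class so the pricing gate treats it as local compute.

```json
{
  "providers": [
    { "name": "peer-leader", "class": "local", "priority": 100,
      "endpoint": "https://maxim.yourdomain.com/v1" },
    { "name": "anthropic", "class": "cloud", "priority": 50 }
  ]
}
```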

Cloudflare WAF / Bot Fight Mode

Cloudflare's Web Application Firewall can return a 403 Forbidden for automated requests. If maxim peer test shows a 403 with an HTML body mentioning "Just a moment," Bot Fight Mode is interfering. Disable it in the Cloudflare dashboard under Security → Bots, or add a WAF exception rule for the tunnel hostname.

Stale DNS

After rotating the Cloudflare tunnel or changing the hostname, DNS propagation can take up to 5 minutes. If maxim peer test times out but the leader's cloudflared shows no incoming connections, flush DNS on the peer machine and retry.

For deeper diagnostic procedures, see docs/troubleshooting/ in the repository.

What's Next

The networking layer has a clear progression from the current manual-setup model toward automatic discovery and intelligent routing.

Agent Mesh — mDNS Discovery

Automatic peer discovery on the local network using mDNS/DNS-SD. A Maxim instance broadcasts a _maxim-llm._tcp service record; peers discover it without manual endpoint configuration. Falls back to the current explicit peer.yml approach on networks where mDNS is blocked. Now part of the Agent Mesh plan (Phase 0a).

Agent Mesh — InferenceRouter

Smart request routing when multiple inference backends are available (local GPU + cloud API + peer leader). The InferenceRouter selects the backend per-request based on lane metrics (latency, queue depth, failure rate), cost constraints, and model compatibility. Now part of the Agent Mesh plan (Phase 0b).

Remote Self-Update & Soft Restart

maxim peer update pulls the latest code and installs it on the remote machine. maxim peer restart soft-restarts the leader via os.execv (same PID, clean import cycle). Use --force to stash a dirty working tree automatically.

LLM Hot-Swap

maxim peer llm <model> swaps the running LLM without restarting the Maxim process. It stops the current llama-cpp-server, starts a new one with the requested model, and health-checks it. The choice persists across restarts. maxim peer llm --status shows the active model, uptime, and GPU utilization.

Cloud Provider Integration

Cloud LLMs (Claude, GPT-4o) can be added as fallback engines or dedicated lane providers. --cloud-fallback claude-sonnet adds Claude as a fallback when the self-hosted model fails. --cloud-lane review claude-haiku assigns a cloud model to a specific lane. Cost tracking, redaction gates, and session budgets enforce safety.