docs: add federation architecture design document
Comprehensive analysis of Gas Town federation via "Outposts" abstraction:
- LocalOutpost: current tmux model
- SSHOutpost: full Gas Town clone on VM
- CloudRunOutpost: elastic container workers

Key insights:
- Persistent HTTP/2 connections solve Cloud Run cold start
- ~$0.017 per 5-min worker session vs $50-200/mo VM
- Git remains source of truth for code and beads
- Local-first, remote for overflow/burst

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/federation-design.md
# Federation Architecture: Ultrathink

## The Problem

Gas Town needs to scale beyond a single machine:

- More workers than one machine can handle (RAM, CPU, context windows)
- Geographic distribution (workers close to data/services)
- Cost efficiency (pay-per-use vs always-on VMs)
- Platform flexibility (support various deployment targets)
## Two Deployment Models

### Model A: "Town Clone" (VMs)

Clone the entire `~/ai` workspace to a remote VM. It runs like a regular Gas Town:

```
┌─────────────────────────────────────────┐
│ GCE VM (or any Linux box)               │
│                                         │
│ ~/ai/              # Full town clone    │
│ ├── config/        # Town config        │
│ ├── mayor/         # Mayor (or none)    │
│ ├── gastown/       # Rig with agents    │
│ │   ├── polecats/  # Workers here       │
│ │   ├── refinery/                       │
│ │   └── witness/                        │
│ └── beads/         # Another rig        │
│                                         │
│ Runs autonomously, syncs via git        │
└─────────────────────────────────────────┘
```

**Characteristics:**

- Full autonomy if disconnected
- Familiar model - it's just another Gas Town
- VM overhead (cost, management, always-on)
- Coarse-grained scaling (spin up whole VMs)
- Good for: always-on capacity, long-running work, full independence

**Federation via:**

- Git sync for beads (already works)
- Extended mail routing (`vm1:gastown/polecat`)
- SSH for remote commands
### Model B: "Cloud Run Workers" (Containers)

Workers are stateless containers that wake on demand:

```
┌─────────────────────────────────────────┐
│ Cloud Run Service: gastown-worker       │
│                                         │
│ ┌────────────────────────────────┐      │
│ │ Container Instance             │      │
│ │ - Claude Code + git            │      │
│ │ - HTTP endpoint for work       │      │
│ │ - Persistent volume mount      │      │
│ │ - Scales 0→N automatically     │      │
│ └────────────────────────────────┘      │
│                                         │
│ Zero cost when idle                     │
│ Persistent connections keep warm        │
└─────────────────────────────────────────┘
```

**Characteristics:**

- Pay-per-use (nearly free when idle)
- Scales elastically (0 to many workers)
- No VM management
- Stateless(ish) - needs fast bootstrap or persistent storage
- Good for: burst capacity, background work, elastic scaling

**Key insight from your friend:**
Persistent connections solve the "zero to one" problem. Keep the connection open, container stays warm, subsequent requests are fast. This transforms Cloud Run from "cold functions" to "elastic workers."
## Unified Abstraction: Outposts

To support "however people want to do it," we need an abstraction that covers both models (and future ones like K8s, bare metal, etc.).

### The Outpost Concept

An **Outpost** is a remote compute environment that can run workers.
```go
type Outpost interface {
    // Identity
    Name() string
    Type() OutpostType // local, ssh, cloudrun, k8s

    // Capacity
    MaxWorkers() int
    ActiveWorkers() int

    // Worker lifecycle
    Spawn(issue string, config WorkerConfig) (Worker, error)
    Workers() []Worker

    // Health
    Ping() error

    // Optional: Direct communication (VM outposts)
    SendMail(worker string, msg Message) error
}

type OutpostType string

const (
    OutpostLocal    OutpostType = "local"
    OutpostSSH      OutpostType = "ssh"      // Full VM clone
    OutpostCloudRun OutpostType = "cloudrun" // Container workers
    OutpostK8s      OutpostType = "k8s"      // Future
)
```
### Worker Interface

```go
type Worker interface {
    ID() string
    Outpost() string
    Status() WorkerStatus // idle, working, done, failed
    Issue() string        // Current issue being worked

    // For interactive outposts (local, SSH)
    Attach() error // Connect to worker session

    // For all outposts
    Logs() (io.Reader, error)
    Stop() error
}
```
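To make the shape concrete, here is a compile-ready toy that satisfies the `Worker` interface and sketches a capacity-checked `Spawn`. The `toyWorker`/`toyOutpost` names and the simplified `WorkerConfig`/`WorkerStatus` stand-ins are illustrative only; the real types and a full `Outpost` implementation live elsewhere.

```go
package main

import (
    "fmt"
    "io"
    "strings"
)

// Simplified stand-ins for types the interfaces reference.
type WorkerConfig struct{ Agent string }
type WorkerStatus string

const (
    StatusIdle    WorkerStatus = "idle"
    StatusWorking WorkerStatus = "working"
)

// toyWorker satisfies the Worker interface with in-memory state.
type toyWorker struct {
    id, outpost, issue string
    status             WorkerStatus
}

func (w *toyWorker) ID() string           { return w.id }
func (w *toyWorker) Outpost() string      { return w.outpost }
func (w *toyWorker) Status() WorkerStatus { return w.status }
func (w *toyWorker) Issue() string        { return w.issue }
func (w *toyWorker) Attach() error        { return nil } // no-op in the sketch
func (w *toyWorker) Logs() (io.Reader, error) {
    return strings.NewReader("(no logs)"), nil
}
func (w *toyWorker) Stop() error { w.status = StatusIdle; return nil }

// toyOutpost sketches the Spawn half of the Outpost interface: a real
// LocalOutpost would create a tmux pane; here we just record the worker.
type toyOutpost struct {
    name    string
    max     int
    workers []*toyWorker
}

func (o *toyOutpost) Spawn(issue string, _ WorkerConfig) (*toyWorker, error) {
    if len(o.workers) >= o.max {
        return nil, fmt.Errorf("outpost %s at capacity (%d)", o.name, o.max)
    }
    w := &toyWorker{
        id:      fmt.Sprintf("%s-w%d", o.name, len(o.workers)+1),
        outpost: o.name,
        issue:   issue,
        status:  StatusWorking,
    }
    o.workers = append(o.workers, w)
    return w, nil
}

func main() {
    o := &toyOutpost{name: "local", max: 2}
    w, _ := o.Spawn("gt-abc123", WorkerConfig{Agent: "polecat"})
    fmt.Println(w.ID(), w.Status(), w.Issue())
    if _, err := o.Spawn("gt-xyz789", WorkerConfig{}); err != nil {
        fmt.Println("capacity error:", err)
    }
}
```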
### Outpost Implementations

#### LocalOutpost
- Current model: tmux panes on localhost
- Uses existing Connection interface (LocalConnection)
- Workers are tmux sessions

#### SSHOutpost
- Full Gas Town clone on remote VM
- Uses SSHConnection for remote ops
- Workers are remote tmux sessions
- Town config replicated to VM

#### CloudRunOutpost
- Workers are container instances
- HTTP/gRPC for work dispatch
- No tmux (stateless containers)
- Persistent connections for warmth
## Cloud Run Deep Dive

### Container Design

```dockerfile
FROM golang:1.21 AS builder
# Build gt binary...

FROM ubuntu:22.04

# Install git, common tools, and Node.js/npm (required for the
# Claude Code install below; a current Node LTS is assumed)
RUN apt-get update && apt-get install -y git curl nodejs npm

# Install Claude Code
RUN npm install -g @anthropic-ai/claude-code

# Copy gt binary
COPY --from=builder /app/gt /usr/local/bin/gt

# Entrypoint accepts work via HTTP
COPY worker-entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
### HTTP Work Protocol

```
POST /work
{
  "issue_id": "gt-abc123",
  "rig_url": "https://github.com/steveyegge/gastown",
  "beads_url": "https://github.com/steveyegge/gastown",
  "context": { /* optional hints */ }
}

Response (streaming):
{
  "status": "working|done|failed",
  "branch": "polecat/gt-abc123",
  "logs": "...",
  "pr_url": "..."  // if created
}
```
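A minimal Go sketch of both ends of this protocol, using an `httptest` server as a stand-in for the Cloud Run worker. The `WorkRequest`/`WorkUpdate` field sets and the `dispatch`/`newFakeWorker` helpers are assumptions for illustration, not a frozen schema:

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "net/http/httptest"
)

// WorkRequest/WorkUpdate mirror the JSON sketch above.
type WorkRequest struct {
    IssueID  string `json:"issue_id"`
    RigURL   string `json:"rig_url"`
    BeadsURL string `json:"beads_url"`
}

type WorkUpdate struct {
    Status string `json:"status"`
    Branch string `json:"branch,omitempty"`
    PRURL  string `json:"pr_url,omitempty"`
}

// newFakeWorker stands in for the Cloud Run container: it accepts a
// work request and streams newline-delimited JSON updates back.
func newFakeWorker() *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        var req WorkRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        enc := json.NewEncoder(w)
        enc.Encode(WorkUpdate{Status: "working"})
        enc.Encode(WorkUpdate{Status: "done", Branch: "polecat/" + req.IssueID})
    }))
}

// dispatch is the Mayor side: POST the work, then read streamed
// updates until the response body ends.
func dispatch(url string, req WorkRequest) ([]WorkUpdate, error) {
    body, err := json.Marshal(req)
    if err != nil {
        return nil, err
    }
    resp, err := http.Post(url+"/work", "application/json", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var updates []WorkUpdate
    dec := json.NewDecoder(resp.Body)
    for {
        var u WorkUpdate
        if err := dec.Decode(&u); err != nil {
            break // io.EOF once the stream ends
        }
        updates = append(updates, u)
    }
    return updates, nil
}

func main() {
    srv := newFakeWorker()
    defer srv.Close()
    updates, _ := dispatch(srv.URL, WorkRequest{IssueID: "gt-abc123"})
    for _, u := range updates {
        fmt.Println(u.Status, u.Branch)
    }
}
```

Newline-delimited JSON keeps the streaming side trivial on both ends; a real worker would flush after each update so the Mayor sees progress live.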
### Persistent Connections

The "zero to one" solution:

1. Mayor opens HTTP/2 connection to Cloud Run
2. Connection stays open (Cloud Run keeps container warm)
3. Send work requests over same connection
4. Container processes work, streams results back
5. On idle timeout, connection closes, container scales down
6. Next request: small cold start, but acceptable

```
┌──────────┐                     ┌────────────────┐
│  Mayor   │───HTTP/2 stream───▶│   Cloud Run    │
│          │◀──results stream───│   Container    │
└──────────┘                     └────────────────┘
      │                                 │
      │ Connection persists             │ Container stays
      │ for hours if needed             │ warm while
      │                                 │ connection open
      ▼                                 ▼
  [New work requests go over same connection]
```
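The warm-connection effect is easy to see from the client side. The sketch below counts real TCP dials across repeated requests to a test server; with keep-alives (Go's default) only the first request pays the dial, which is the same idea the persistent HTTP/2 stream exploits against Cloud Run. Shown here over plain HTTP/1.1 keep-alive for simplicity; the `warmRequests` helper is illustrative.

```go
package main

import (
    "context"
    "fmt"
    "io"
    "net"
    "net/http"
    "net/http/httptest"
)

// warmRequests issues n sequential requests over one http.Client and
// reports how many TCP connections were actually dialed.
func warmRequests(n int) (dials int) {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, "ok")
    }))
    defer srv.Close()

    // Default transport behavior plus a dial counter.
    tr := &http.Transport{
        DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
            dials++ // count real TCP dials
            return (&net.Dialer{}).DialContext(ctx, network, addr)
        },
    }
    client := &http.Client{Transport: tr}

    for i := 0; i < n; i++ {
        resp, err := client.Get(srv.URL)
        if err != nil {
            panic(err)
        }
        io.Copy(io.Discard, resp.Body) // drain so the connection is reusable
        resp.Body.Close()
    }
    return dials
}

func main() {
    fmt.Println("3 requests,", warmRequests(3), "dial(s)")
}
```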
### Git in Cloud Run

Options for code access:

1. **Clone on startup** (slow, ~30s+ for large repos)
   - Simple but adds latency
   - Acceptable if persistent connection keeps container warm

2. **Cloud Storage FUSE mount** (read-only)
   - Mount bucket with repo snapshot
   - Fast startup
   - Read-only limits usefulness

3. **Persistent volume** (Cloud Run now supports!)
   - Attach Cloud Storage or Filestore volume
   - Git clone persists across container restarts
   - Best of both worlds

4. **Shallow clone with depth**
   - `git clone --depth 1` for speed
   - Sufficient for most worker tasks
   - Can fetch more history if needed

**Recommendation:** Persistent volume with shallow clone. Container starts, checks if clone exists, pulls if yes, shallow clones if no.
### Beads Sync in Cloud Run

Workers need beads access. Options:

1. **Clone beads repo at startup**
   - Same as code: persistent volume helps
   - `bd sync` before and after work

2. **Beads as API** (future)
   - Central beads server
   - Workers query/update via HTTP
   - More complex but cleaner for distributed setups

3. **Beads in git (current)**
   - Works today
   - Worker clones .beads, does work, pushes
   - Git handles conflicts

**Recommendation:** Start with git-based beads. It works today, and Cloud Run workers can push to the beads repo just like local workers.
### Mail in Cloud Run

For VM outposts, mail is filesystem-based. For Cloud Run:

**Option A: No mail needed**
- Cloud Run workers are "fire and forget"
- Mayor pushes work via HTTP, gets results via HTTP
- Simpler model for stateless workers

**Option B: Mail via git**
- Worker checks `mail/inbox.jsonl` in repo
- Rare for workers to need incoming mail
- Mostly they just do work and report results

**Recommendation:** Start with Option A. Cloud Run workers receive work via HTTP, report via HTTP. Mail is for long-running stateful agents (Witness, Refinery), not burst workers.
### Cost Model

Cloud Run pricing (as of late 2024):

- CPU: ~$0.00002400/vCPU-second
- Memory: ~$0.00000250/GiB-second
- Requests: ~$0.40/million

For a worker running 5 minutes (300s) with 2 vCPU, 4 GiB RAM:

- CPU: 300 × 2 × $0.000024 = $0.0144
- Memory: 300 × 4 × $0.0000025 = $0.003
- **Total: ~$0.017 per worker session**

50 workers × 5 minutes each = ~$0.85

**Key insight:** When idle (connection closed, scaled to zero): **$0**

Compare to a VM running 24/7: ~$50-200/month

Cloud Run makes burst capacity essentially free when not in use.
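The arithmetic above as a small helper, useful for the cost-cap enforcement sketched in the configuration later; `sessionCost` and the constant names are illustrative:

```go
package main

import "fmt"

// Per-second Cloud Run rates from the section above (late-2024 ballpark).
const (
    cpuPerVCPUSecond = 0.000024
    memPerGiBSecond  = 0.0000025
)

// sessionCost estimates compute cost for one worker session
// (request pricing, at ~$0.40/million, is negligible here).
func sessionCost(seconds, vcpu, gib float64) float64 {
    return seconds*vcpu*cpuPerVCPUSecond + seconds*gib*memPerGiBSecond
}

func main() {
    c := sessionCost(300, 2, 4) // 5-minute worker, 2 vCPU, 4 GiB
    fmt.Printf("per session: $%.4f\n", c)
    fmt.Printf("50 sessions: $%.2f\n", 50*c)
}
```

The exact total is $0.0174 per session, which the section rounds to ~$0.017 (and ~$0.85 for 50).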
### Claude API in Cloud Run

Workers need Claude API access:

1. **API key in Secret Manager**
   - Cloud Run mounts secret as env var
   - Standard pattern

2. **Workload Identity** (if using Vertex AI)
   - Service account with Claude access
   - No keys to manage

3. **Rate limiting concerns**
   - Many concurrent workers = many API calls
   - May need to coordinate or queue
   - Could use Mayor as API proxy (future)
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                           MAYOR                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Outpost Manager                                      │   │
│  │ - Tracks all registered outposts                     │   │
│  │ - Routes work to appropriate outpost                 │   │
│  │ - Monitors worker status across outposts             │   │
│  └──────────────────────────────────────────────────────┘   │
│        │              │                  │                  │
│        ▼              ▼                  ▼                  │
│  ┌──────────┐   ┌──────────┐    ┌──────────────┐            │
│  │  Local   │   │   SSH    │    │   CloudRun   │            │
│  │ Outpost  │   │ Outpost  │    │   Outpost    │            │
│  └────┬─────┘   └────┬─────┘    └──────┬───────┘            │
└───────┼──────────────┼─────────────────┼────────────────────┘
        │              │                 │
        ▼              ▼                 ▼
   ┌─────────┐    ┌─────────┐    ┌─────────────┐
   │  tmux   │    │   SSH   │    │   HTTP/2    │
   │  panes  │    │sessions │    │ connections │
   └─────────┘    └─────────┘    └─────────────┘
        │              │                 │
        ▼              ▼                 ▼
   ┌─────────┐    ┌─────────┐    ┌─────────────┐
   │ Workers │    │ Workers │    │   Workers   │
   │ (local) │    │  (VM)   │    │ (containers)│
   └─────────┘    └─────────┘    └─────────────┘
        │              │                 │
        └──────────────┼─────────────────┘
                       ▼
              ┌─────────────────┐
              │    Git Repos    │
              │  (beads sync)   │
              │  (code repos)   │
              └─────────────────┘
```
## Configuration

```yaml
# ~/ai/config/outposts.yaml
outposts:
  # Always present - the local machine
  - name: local
    type: local
    max_workers: 4

  # VM with full Gas Town clone
  - name: gce-worker-1
    type: ssh
    host: 10.0.0.5
    user: steve
    ssh_key: ~/.ssh/gce_worker
    town_path: /home/steve/ai
    max_workers: 8

  # Cloud Run for burst capacity
  - name: cloudrun-burst
    type: cloudrun
    project: my-gcp-project
    region: us-central1
    service: gastown-worker
    max_workers: 20        # Or unlimited with cost cap
    cost_cap_hourly: 5.00  # Optional spending limit

# Work assignment policy
policy:
  # Try local first, then VM, then Cloud Run
  default_preference: [local, gce-worker-1, cloudrun-burst]

  # Override for specific scenarios
  overrides:
    - condition: "priority >= P3"            # Background work
      prefer: cloudrun-burst
    - condition: "estimated_duration > 30m"  # Long tasks
      prefer: gce-worker-1
```
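The `default_preference` walk reduces to a first-fit over healthy outposts with free capacity, which also gives the graceful degradation described later for free. A minimal sketch; the `outpostInfo` shape and `pick` helper are assumptions, not the real config structs:

```go
package main

import "fmt"

// outpostInfo is a simplified view of an outpost's live state.
type outpostInfo struct {
    max, active int
    healthy     bool
}

// pick walks the configured preference list and returns the first
// outpost that is healthy and below its worker cap.
func pick(preference []string, outposts map[string]outpostInfo) (string, bool) {
    for _, name := range preference {
        o, ok := outposts[name]
        if ok && o.healthy && o.active < o.max {
            return name, true
        }
    }
    return "", false // no capacity anywhere
}

func main() {
    outposts := map[string]outpostInfo{
        "local":          {max: 4, active: 4, healthy: true},  // full
        "gce-worker-1":   {max: 8, active: 3, healthy: false}, // VM down
        "cloudrun-burst": {max: 20, active: 0, healthy: true},
    }
    name, ok := pick([]string{"local", "gce-worker-1", "cloudrun-burst"}, outposts)
    fmt.Println(name, ok)
}
```

With local full and the VM unhealthy, work falls through to `cloudrun-burst`; the `overrides` conditions would simply swap in a different preference list before calling this.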
## Implementation Phases

### Phase 1: Outpost Abstraction (Local)
- Define Outpost/Worker interfaces
- Implement LocalOutpost (refactor current polecat spawning)
- Configuration file for outposts
- `gt outpost list`, `gt outpost status`

### Phase 2: SSH Outpost (VMs)
- Implement SSHConnection (extends existing Connection interface)
- Implement SSHOutpost
- VM provisioning docs (Terraform examples)
- `gt outpost add ssh ...`
- Test with actual GCE VM

### Phase 3: Cloud Run Outpost
- Define worker container image
- Implement CloudRunOutpost
- HTTP/2 work dispatch protocol
- Persistent connection management
- Cost tracking/limits
- `gt outpost add cloudrun ...`

### Phase 4: Policy & Intelligence
- Smart assignment based on workload characteristics
- Cost optimization (prefer free capacity)
- Auto-scaling policies
- Dashboard for cross-outpost visibility
## Key Design Decisions

### 1. Outpost as First-Class Concept
Rather than baking in specific platforms (SSH, Cloud Run), model the abstraction. This gives flexibility for future platforms (K8s, bare metal, other clouds).

### 2. Workers Are Ephemeral
Whether local tmux, VM process, or Cloud Run container - workers are spawned for work and can be terminated. Don't assume persistence.

### 3. Git as Source of Truth
Code and beads always sync via git. This works regardless of where workers run. Even Cloud Run workers clone/pull from git.

### 4. HTTP for Cloud Run Control Plane
For Cloud Run specifically, use HTTP for work dispatch. Don't try to make filesystem mail work across containers. Keep it simple.

### 5. Local-First Default
Always try local workers first. Remote outposts are for overflow/burst, not primary capacity. This keeps latency low and costs down.

### 6. Graceful Degradation
If Cloud Run is unavailable, fall back to VM. If VM is down, use local only. The system works with any subset of outposts.
## Open Questions

1. **Long-running sessions**: Cloud Run has request timeout limits (configurable up to 60 min, maybe longer now?). How does this interact with long Claude sessions?

2. **Context handoff**: If a Cloud Run worker's container restarts mid-task, how do we resume? Mail-to-self? Checkpoint to storage?

3. **Refinery in Cloud Run**: Could the Refinery itself run as a Cloud Run service? Long-running connection for merge queue processing?

4. **Witness in Cloud Run**: Worker monitoring from Cloud Run? Or does Witness need to be local/VM?

5. **Multi-region**: Cloud Run in multiple regions for geographic distribution? How to coordinate?
## Summary

The Outpost abstraction lets Gas Town scale flexibly:

| Outpost Type | Best For | Cost Model | Scaling |
|--------------|----------|------------|---------|
| Local | Development, primary work | Free (your machine) | Fixed |
| SSH/VM | Long-running, full autonomy | Always-on VM cost | Manual |
| Cloud Run | Burst, background, elastic | Pay-per-use | Auto |

Cloud Run's persistent connections solve the cold start problem, making it viable for interactive-ish work. Combined with VMs for heavier work and local for development, this gives a flexible spectrum of compute options.

The key insight: **don't pick one model, support both.** Let users configure their outposts based on their needs, budget, and scale requirements.