diff --git a/docs/federation-design.md b/docs/federation-design.md new file mode 100644 index 00000000..e6d388a6 --- /dev/null +++ b/docs/federation-design.md @@ -0,0 +1,470 @@ +# Federation Architecture: Ultrathink + +## The Problem + +Gas Town needs to scale beyond a single machine: +- More workers than one machine can handle (RAM, CPU, context windows) +- Geographic distribution (workers close to data/services) +- Cost efficiency (pay-per-use vs always-on VMs) +- Platform flexibility (support various deployment targets) + +## Two Deployment Models + +### Model A: "Town Clone" (VMs) + +Clone the entire `~/ai` workspace to a remote VM. It runs like a regular Gas Town: + +``` +┌─────────────────────────────────────────┐ +│ GCE VM (or any Linux box) │ +│ │ +│ ~/ai/ # Full town clone │ +│ ├── config/ # Town config │ +│ ├── mayor/ # Mayor (or none) │ +│ ├── gastown/ # Rig with agents │ +│ │ ├── polecats/ # Workers here │ +│ │ ├── refinery/ │ +│ │ └── witness/ │ +│ └── beads/ # Another rig │ +│ │ +│ Runs autonomously, syncs via git │ +└─────────────────────────────────────────┘ +``` + +**Characteristics:** +- Full autonomy if disconnected +- Familiar model - it's just another Gas Town +- VM overhead (cost, management, always-on) +- Coarse-grained scaling (spin up whole VMs) +- Good for: always-on capacity, long-running work, full independence + +**Federation via:** +- Git sync for beads (already works) +- Extended mail routing (`vm1:gastown/polecat`) +- SSH for remote commands + +### Model B: "Cloud Run Workers" (Containers) + +Workers are stateless containers that wake on demand: + +``` +┌─────────────────────────────────────────┐ +│ Cloud Run Service: gastown-worker │ +│ │ +│ ┌────────────────────────────────┐ │ +│ │ Container Instance │ │ +│ │ - Claude Code + git │ │ +│ │ - HTTP endpoint for work │ │ +│ │ - Persistent volume mount │ │ +│ │ - Scales 0→N automatically │ │ +│ └────────────────────────────────┘ │ +│ │ +│ Zero cost when idle │ +│ Persistent connections keep warm │ +└─────────────────────────────────────────┘ +``` + +**Characteristics:** +- Pay-per-use (nearly free when idle) +- Scales elastically (0 to many workers) +- No VM management +- Stateless(ish) - needs fast bootstrap or persistent storage +- Good for: burst capacity, background work, elastic scaling + +**Key insight from your friend:** +Persistent connections solve the "zero to one" problem. Keep the connection open, container stays warm, subsequent requests are fast. This transforms Cloud Run from "cold functions" to "elastic workers." + +## Unified Abstraction: Outposts + +To support "however people want to do it," we need an abstraction that covers both models (and future ones like K8s, bare metal, etc.). + +### The Outpost Concept + +An **Outpost** is a remote compute environment that can run workers. + +```go +type Outpost interface { + // Identity + Name() string + Type() OutpostType // local, ssh, cloudrun, k8s + + // Capacity + MaxWorkers() int + ActiveWorkers() int + + // Worker lifecycle + Spawn(issue string, config WorkerConfig) (Worker, error) + Workers() []Worker + + // Health + Ping() error + + // Optional: Direct communication (VM outposts) + SendMail(worker string, msg Message) error +} + +type OutpostType string +const ( + OutpostLocal OutpostType = "local" + OutpostSSH OutpostType = "ssh" // Full VM clone + OutpostCloudRun OutpostType = "cloudrun" // Container workers + OutpostK8s OutpostType = "k8s" // Future +) +``` + +### Worker Interface + +```go +type Worker interface { + ID() string + Outpost() string + Status() WorkerStatus // idle, working, done, failed + Issue() string // Current issue being worked + + // For interactive outposts (local, SSH) + Attach() error // Connect to worker session + + // For all outposts + Logs() (io.Reader, error) + Stop() error +} +``` + +### Outpost Implementations + +#### LocalOutpost +- Current model: tmux panes on localhost +- Uses existing Connection interface (LocalConnection) +- Workers are tmux sessions + +#### SSHOutpost +- Full Gas Town clone on remote VM +- Uses SSHConnection for remote ops +- Workers are remote tmux sessions +- Town config replicated to VM + +#### CloudRunOutpost +- Workers are container instances +- HTTP/gRPC for work dispatch +- No tmux (stateless containers) +- Persistent connections for warmth + +## Cloud Run Deep Dive + +### Container Design + +```dockerfile +FROM golang:1.21 AS builder +# Build gt binary... + +FROM ubuntu:22.04 +# Install Claude Code +RUN npm install -g @anthropic-ai/claude-code + +# Install git, common tools +RUN apt-get update && apt-get install -y git + +# Copy gt binary +COPY --from=builder /app/gt /usr/local/bin/gt + +# Entrypoint accepts work via HTTP +COPY worker-entrypoint.sh /entrypoint.sh +ENTRYPOINT ["/entrypoint.sh"] +``` + +### HTTP Work Protocol + +``` +POST /work +{ + "issue_id": "gt-abc123", + "rig_url": "https://github.com/steveyegge/gastown", + "beads_url": "https://github.com/steveyegge/gastown", + "context": { /* optional hints */ } +} + +Response (streaming): +{ + "status": "working|done|failed", + "branch": "polecat/gt-abc123", + "logs": "...", + "pr_url": "..." // if created +} +``` + +### Persistent Connections + +The "zero to one" solution: +1. Mayor opens HTTP/2 connection to Cloud Run +2. Connection stays open (Cloud Run keeps container warm) +3. Send work requests over same connection +4. Container processes work, streams results back +5. On idle timeout, connection closes, container scales down +6. Next request: small cold start, but acceptable + +``` +┌──────────┐ ┌────────────────┐ +│ Mayor │────HTTP/2 stream───▶│ Cloud Run │ +│ │◀───results stream───│ Container │ +└──────────┘ └────────────────┘ + │ │ + │ Connection persists │ Container stays + │ for hours if needed │ warm while + │ │ connection open + ▼ ▼ + [New work requests go over same connection] +``` + +### Git in Cloud Run + +Options for code access: + +1. **Clone on startup** (slow, ~30s+ for large repos) + - Simple but adds latency + - Acceptable if persistent connection keeps container warm + +2. **Cloud Storage FUSE mount** (read-only) + - Mount bucket with repo snapshot + - Fast startup + - Read-only limits usefulness + +3. **Persistent volume** (Cloud Run now supports!) + - Attach Cloud Storage or Filestore volume + - Git clone persists across container restarts + - Best of both worlds + +4. **Shallow clone with depth** + - `git clone --depth 1` for speed + - Sufficient for most worker tasks + - Can fetch more history if needed + +**Recommendation:** Persistent volume with shallow clone. Container starts, checks if clone exists, pulls if yes, shallow clones if no. + +### Beads Sync in Cloud Run + +Workers need beads access. Options: + +1. **Clone beads repo at startup** + - Same as code: persistent volume helps + - `bd sync` before and after work + +2. **Beads as API** (future) + - Central beads server + - Workers query/update via HTTP + - More complex but cleaner for distributed + +3. **Beads in git (current)** + - Works today + - Worker clones .beads, does work, pushes + - Git handles conflicts + +**Recommendation:** Start with git-based beads. It works today and Cloud Run workers can push to the beads repo just like local workers. + +### Mail in Cloud Run + +For VM outposts, mail is filesystem-based. For Cloud Run: + +**Option A: No mail needed** +- Cloud Run workers are "fire and forget" +- Mayor pushes work via HTTP, gets results via HTTP +- Simpler model for stateless workers + +**Option B: Mail via git** +- Worker checks `mail/inbox.jsonl` in repo +- Rare for workers to need incoming mail +- Mostly they just do work and report results + +**Recommendation:** Start with Option A. Cloud Run workers receive work via HTTP, report via HTTP. Mail is for long-running stateful agents (Witness, Refinery), not burst workers. + +### Cost Model + +Cloud Run pricing (as of late 2024): +- CPU: ~$0.00002400/vCPU-second +- Memory: ~$0.00000250/GiB-second +- Requests: ~$0.40/million + +For a worker running 5 minutes (300s) with 2 vCPU, 4GB RAM: +- CPU: 300 × 2 × $0.000024 = $0.0144 +- Memory: 300 × 4 × $0.0000025 = $0.003 +- **Total: ~$0.017 per worker session** + +50 workers × 5 minutes each = ~$0.85 + +**Key insight:** When idle (connection closed, scaled to zero): **$0** + +Compare to a VM running 24/7: ~$50-200/month + +Cloud Run makes burst capacity essentially free when not in use. + +### Claude API in Cloud Run + +Workers need Claude API access: + +1. **API key in Secret Manager** + - Cloud Run mounts secret as env var + - Standard pattern + +2. **Workload Identity** (if using Vertex AI) + - Service account with Claude access + - No keys to manage + +3. **Rate limiting concerns** + - Many concurrent workers = many API calls + - May need to coordinate or queue + - Could use Mayor as API proxy (future) + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────┐ +│ MAYOR │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ Outpost Manager │ │ +│ │ - Tracks all registered outposts │ │ +│ │ - Routes work to appropriate outpost │ │ +│ │ - Monitors worker status across outposts │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ +│ │ Local │ │ SSH │ │ CloudRun │ │ +│ │ Outpost │ │ Outpost │ │ Outpost │ │ +│ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │ +└───────┼──────────────┼──────────────────┼───────────────────┘ + │ │ │ + ▼ ▼ ▼ + ┌─────────┐ ┌─────────┐ ┌─────────────┐ + │ tmux │ │ SSH │ │ HTTP/2 │ + │ panes │ │sessions │ │ connections │ + └─────────┘ └─────────┘ └─────────────┘ + │ │ │ + ▼ ▼ ▼ + ┌─────────┐ ┌─────────┐ ┌─────────────┐ + │ Workers │ │ Workers │ │ Workers │ + │ (local) │ │ (VM) │ │ (containers)│ + └─────────┘ └─────────┘ └─────────────┘ + │ │ │ + └──────────────┼──────────────────┘ + ▼ + ┌─────────────────┐ + │ Git Repos │ + │ (beads sync) │ + │ (code repos) │ + └─────────────────┘ +``` + +## Configuration + +```yaml +# ~/ai/config/outposts.yaml +outposts: + # Always present - the local machine + - name: local + type: local + max_workers: 4 + + # VM with full Gas Town clone + - name: gce-worker-1 + type: ssh + host: 10.0.0.5 + user: steve + ssh_key: ~/.ssh/gce_worker + town_path: /home/steve/ai + max_workers: 8 + + # Cloud Run for burst capacity + - name: cloudrun-burst + type: cloudrun + project: my-gcp-project + region: us-central1 + service: gastown-worker + max_workers: 20 # Or unlimited with cost cap + cost_cap_hourly: 5.00 # Optional spending limit + +# Work assignment policy +policy: + # Try local first, then VM, then Cloud Run + default_preference: [local, gce-worker-1, cloudrun-burst] + + # Override for specific scenarios + overrides: + - condition: "priority >= P3" # Background work + prefer: cloudrun-burst + - condition: "estimated_duration > 30m" # Long tasks + prefer: gce-worker-1 +``` + +## Implementation Phases + +### Phase 1: Outpost Abstraction (Local) +- Define Outpost/Worker interfaces +- Implement LocalOutpost (refactor current polecat spawning) +- Configuration file for outposts +- `gt outpost list`, `gt outpost status` + +### Phase 2: SSH Outpost (VMs) +- Implement SSHConnection (extends existing Connection interface) +- Implement SSHOutpost +- VM provisioning docs (Terraform examples) +- `gt outpost add ssh ...` +- Test with actual GCE VM + +### Phase 3: Cloud Run Outpost +- Define worker container image +- Implement CloudRunOutpost +- HTTP/2 work dispatch protocol +- Persistent connection management +- Cost tracking/limits +- `gt outpost add cloudrun ...` + +### Phase 4: Policy & Intelligence +- Smart assignment based on workload characteristics +- Cost optimization (prefer free capacity) +- Auto-scaling policies +- Dashboard for cross-outpost visibility + +## Key Design Decisions + +### 1. Outpost as First-Class Concept +Rather than baking in specific platforms (SSH, Cloud Run), model the abstraction. This gives flexibility for future platforms (K8s, bare metal, other clouds). + +### 2. Workers Are Ephemeral +Whether local tmux, VM process, or Cloud Run container - workers are spawned for work and can be terminated. Don't assume persistence. + +### 3. Git as Source of Truth +Code and beads always sync via git. This works regardless of where workers run. Even Cloud Run workers clone/pull from git. + +### 4. HTTP for Cloud Run Control Plane +For Cloud Run specifically, use HTTP for work dispatch. Don't try to make filesystem mail work across containers. Keep it simple. + +### 5. Local-First Default +Always try local workers first. Remote outposts are for overflow/burst, not primary capacity. This keeps latency low and costs down. + +### 6. Graceful Degradation +If Cloud Run is unavailable, fall back to VM. If VM is down, use local only. System works with any subset of outposts. + +## Open Questions + +1. **Long-running sessions**: Cloud Run has request timeout limits (configurable up to 60 min, maybe longer now?). How does this interact with long Claude sessions? + +2. **Context handoff**: If a Cloud Run worker's container restarts mid-task, how do we resume? Mail-to-self? Checkpoint to storage? + +3. **Refinery in Cloud Run**: Could the Refinery itself run as a Cloud Run service? Long-running connection for merge queue processing? + +4. **Witness in Cloud Run**: Worker monitoring from Cloud Run? Or does Witness need to be local/VM? + +5. **Multi-region**: Cloud Run in multiple regions for geographic distribution? How to coordinate? + +## Summary + +The Outpost abstraction lets Gas Town scale flexibly: + +| Outpost Type | Best For | Cost Model | Scaling | +|--------------|----------|------------|---------| +| Local | Development, primary work | Free (your machine) | Fixed | +| SSH/VM | Long-running, full autonomy | Always-on VM cost | Manual | +| Cloud Run | Burst, background, elastic | Pay-per-use | Auto | + +Cloud Run's persistent connections solve the cold start problem, making it viable for interactive-ish work. Combined with VMs for heavier work and local for development, this gives a flexible spectrum of compute options. + +The key insight: **don't pick one model, support both.** Let users configure their outposts based on their needs, budget, and scale requirements.