docs: add federation architecture design document
Comprehensive analysis of Gas Town federation via "Outposts" abstraction:
- LocalOutpost: current tmux model
- SSHOutpost: full Gas Town clone on VM
- CloudRunOutpost: elastic container workers

Key insights:
- Persistent HTTP/2 connections solve Cloud Run cold start
- ~$0.017 per 5-min worker session vs $50-200/mo VM
- Git remains source of truth for code and beads
- Local-first, remote for overflow/burst

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/federation-design.md
# Federation Architecture: Ultrathink

## The Problem

Gas Town needs to scale beyond a single machine:

- More workers than one machine can handle (RAM, CPU, context windows)
- Geographic distribution (workers close to data/services)
- Cost efficiency (pay-per-use vs always-on VMs)
- Platform flexibility (support various deployment targets)
## Two Deployment Models

### Model A: "Town Clone" (VMs)

Clone the entire `~/ai` workspace to a remote VM. It runs like a regular Gas Town:

```
┌─────────────────────────────────────────┐
│ GCE VM (or any Linux box)               │
│                                         │
│ ~/ai/              # Full town clone    │
│ ├── config/        # Town config        │
│ ├── mayor/         # Mayor (or none)    │
│ ├── gastown/       # Rig with agents    │
│ │   ├── polecats/  # Workers here       │
│ │   ├── refinery/                       │
│ │   └── witness/                        │
│ └── beads/         # Another rig        │
│                                         │
│ Runs autonomously, syncs via git        │
└─────────────────────────────────────────┘
```

**Characteristics:**

- Full autonomy if disconnected
- Familiar model - it's just another Gas Town
- VM overhead (cost, management, always-on)
- Coarse-grained scaling (spin up whole VMs)
- Good for: always-on capacity, long-running work, full independence

**Federation via:**

- Git sync for beads (already works)
- Extended mail routing (`vm1:gastown/polecat`)
- SSH for remote commands
### Model B: "Cloud Run Workers" (Containers)

Workers are stateless containers that wake on demand:

```
┌─────────────────────────────────────────┐
│ Cloud Run Service: gastown-worker       │
│                                         │
│ ┌────────────────────────────────┐      │
│ │ Container Instance             │      │
│ │ - Claude Code + git            │      │
│ │ - HTTP endpoint for work       │      │
│ │ - Persistent volume mount      │      │
│ │ - Scales 0→N automatically     │      │
│ └────────────────────────────────┘      │
│                                         │
│ Zero cost when idle                     │
│ Persistent connections keep warm        │
└─────────────────────────────────────────┘
```

**Characteristics:**

- Pay-per-use (nearly free when idle)
- Scales elastically (0 to many workers)
- No VM management
- Stateless(ish) - needs fast bootstrap or persistent storage
- Good for: burst capacity, background work, elastic scaling

**Key insight from your friend:**
Persistent connections solve the "zero to one" problem. Keep the connection open, container stays warm, subsequent requests are fast. This transforms Cloud Run from "cold functions" to "elastic workers."
## Unified Abstraction: Outposts

To support "however people want to do it," we need an abstraction that covers both models (and future ones like K8s, bare metal, etc.).

### The Outpost Concept

An **Outpost** is a remote compute environment that can run workers.
```go
type Outpost interface {
    // Identity
    Name() string
    Type() OutpostType // local, ssh, cloudrun, k8s

    // Capacity
    MaxWorkers() int
    ActiveWorkers() int

    // Worker lifecycle
    Spawn(issue string, config WorkerConfig) (Worker, error)
    Workers() []Worker

    // Health
    Ping() error

    // Optional: Direct communication (VM outposts)
    SendMail(worker string, msg Message) error
}

type OutpostType string

const (
    OutpostLocal    OutpostType = "local"
    OutpostSSH      OutpostType = "ssh"      // Full VM clone
    OutpostCloudRun OutpostType = "cloudrun" // Container workers
    OutpostK8s      OutpostType = "k8s"      // Future
)
```
### Worker Interface

```go
type Worker interface {
    ID() string
    Outpost() string
    Status() WorkerStatus // idle, working, done, failed
    Issue() string        // Current issue being worked

    // For interactive outposts (local, SSH)
    Attach() error // Connect to worker session

    // For all outposts
    Logs() (io.Reader, error)
    Stop() error
}
```
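To make the shape concrete, here is a compile-ready toy that satisfies the `Worker` interface and sketches a capacity-checked `Spawn`. The `toyWorker`/`toyOutpost` names and the simplified `WorkerConfig`/`WorkerStatus` stand-ins are illustrative only; the real types and a full `Outpost` implementation live elsewhere.

```go
package main

import (
    "fmt"
    "io"
    "strings"
)

// Simplified stand-ins for types the interfaces reference.
type WorkerConfig struct{ Agent string }
type WorkerStatus string

const (
    StatusIdle    WorkerStatus = "idle"
    StatusWorking WorkerStatus = "working"
)

// toyWorker satisfies the Worker interface with in-memory state.
type toyWorker struct {
    id, outpost, issue string
    status             WorkerStatus
}

func (w *toyWorker) ID() string           { return w.id }
func (w *toyWorker) Outpost() string      { return w.outpost }
func (w *toyWorker) Status() WorkerStatus { return w.status }
func (w *toyWorker) Issue() string        { return w.issue }
func (w *toyWorker) Attach() error        { return nil } // no-op in the sketch
func (w *toyWorker) Logs() (io.Reader, error) {
    return strings.NewReader("(no logs)"), nil
}
func (w *toyWorker) Stop() error { w.status = StatusIdle; return nil }

// toyOutpost sketches the Spawn half of the Outpost interface: a real
// LocalOutpost would create a tmux pane; here we just record the worker.
type toyOutpost struct {
    name    string
    max     int
    workers []*toyWorker
}

func (o *toyOutpost) Spawn(issue string, _ WorkerConfig) (*toyWorker, error) {
    if len(o.workers) >= o.max {
        return nil, fmt.Errorf("outpost %s at capacity (%d)", o.name, o.max)
    }
    w := &toyWorker{
        id:      fmt.Sprintf("%s-w%d", o.name, len(o.workers)+1),
        outpost: o.name,
        issue:   issue,
        status:  StatusWorking,
    }
    o.workers = append(o.workers, w)
    return w, nil
}

func main() {
    o := &toyOutpost{name: "local", max: 2}
    w, _ := o.Spawn("gt-abc123", WorkerConfig{Agent: "polecat"})
    fmt.Println(w.ID(), w.Status(), w.Issue())
    if _, err := o.Spawn("gt-xyz789", WorkerConfig{}); err != nil {
        fmt.Println("capacity error:", err)
    }
}
```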
### Outpost Implementations

#### LocalOutpost
- Current model: tmux panes on localhost
- Uses existing Connection interface (LocalConnection)
- Workers are tmux sessions

#### SSHOutpost
- Full Gas Town clone on remote VM
- Uses SSHConnection for remote ops
- Workers are remote tmux sessions
- Town config replicated to VM

#### CloudRunOutpost
- Workers are container instances
- HTTP/gRPC for work dispatch
- No tmux (stateless containers)
- Persistent connections for warmth
## Cloud Run Deep Dive

### Container Design

```dockerfile
FROM golang:1.21 AS builder
# Build gt binary...

FROM ubuntu:22.04

# Install git, common tools, and Node.js/npm (required for the
# Claude Code install below; a current Node LTS is assumed)
RUN apt-get update && apt-get install -y git curl nodejs npm

# Install Claude Code
RUN npm install -g @anthropic-ai/claude-code

# Copy gt binary
COPY --from=builder /app/gt /usr/local/bin/gt

# Entrypoint accepts work via HTTP
COPY worker-entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
### HTTP Work Protocol

```
POST /work
{
  "issue_id": "gt-abc123",
  "rig_url": "https://github.com/steveyegge/gastown",
  "beads_url": "https://github.com/steveyegge/gastown",
  "context": { /* optional hints */ }
}

Response (streaming):
{
  "status": "working|done|failed",
  "branch": "polecat/gt-abc123",
  "logs": "...",
  "pr_url": "..."  // if created
}
```
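A minimal Go sketch of both ends of this protocol, using an `httptest` server as a stand-in for the Cloud Run worker. The `WorkRequest`/`WorkUpdate` field sets and the `dispatch`/`newFakeWorker` helpers are assumptions for illustration, not a frozen schema:

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "net/http/httptest"
)

// WorkRequest/WorkUpdate mirror the JSON sketch above.
type WorkRequest struct {
    IssueID  string `json:"issue_id"`
    RigURL   string `json:"rig_url"`
    BeadsURL string `json:"beads_url"`
}

type WorkUpdate struct {
    Status string `json:"status"`
    Branch string `json:"branch,omitempty"`
    PRURL  string `json:"pr_url,omitempty"`
}

// newFakeWorker stands in for the Cloud Run container: it accepts a
// work request and streams newline-delimited JSON updates back.
func newFakeWorker() *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        var req WorkRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        enc := json.NewEncoder(w)
        enc.Encode(WorkUpdate{Status: "working"})
        enc.Encode(WorkUpdate{Status: "done", Branch: "polecat/" + req.IssueID})
    }))
}

// dispatch is the Mayor side: POST the work, then read streamed
// updates until the response body ends.
func dispatch(url string, req WorkRequest) ([]WorkUpdate, error) {
    body, err := json.Marshal(req)
    if err != nil {
        return nil, err
    }
    resp, err := http.Post(url+"/work", "application/json", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var updates []WorkUpdate
    dec := json.NewDecoder(resp.Body)
    for {
        var u WorkUpdate
        if err := dec.Decode(&u); err != nil {
            break // io.EOF once the stream ends
        }
        updates = append(updates, u)
    }
    return updates, nil
}

func main() {
    srv := newFakeWorker()
    defer srv.Close()
    updates, _ := dispatch(srv.URL, WorkRequest{IssueID: "gt-abc123"})
    for _, u := range updates {
        fmt.Println(u.Status, u.Branch)
    }
}
```

Newline-delimited JSON keeps the streaming side trivial on both ends; a real worker would flush after each update so the Mayor sees progress live.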
### Persistent Connections

The "zero to one" solution:

1. Mayor opens HTTP/2 connection to Cloud Run
2. Connection stays open (Cloud Run keeps container warm)
3. Send work requests over same connection
4. Container processes work, streams results back
5. On idle timeout, connection closes, container scales down
6. Next request: small cold start, but acceptable

```
┌──────────┐                     ┌────────────────┐
│  Mayor   │───HTTP/2 stream───▶│   Cloud Run    │
│          │◀──results stream───│   Container    │
└──────────┘                     └────────────────┘
      │                                 │
      │ Connection persists             │ Container stays
      │ for hours if needed             │ warm while
      │                                 │ connection open
      ▼                                 ▼
  [New work requests go over same connection]
```
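The warm-connection effect is easy to see from the client side. The sketch below counts real TCP dials across repeated requests to a test server; with keep-alives (Go's default) only the first request pays the dial, which is the same idea the persistent HTTP/2 stream exploits against Cloud Run. Shown here over plain HTTP/1.1 keep-alive for simplicity; the `warmRequests` helper is illustrative.

```go
package main

import (
    "context"
    "fmt"
    "io"
    "net"
    "net/http"
    "net/http/httptest"
)

// warmRequests issues n sequential requests over one http.Client and
// reports how many TCP connections were actually dialed.
func warmRequests(n int) (dials int) {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, "ok")
    }))
    defer srv.Close()

    // Default transport behavior plus a dial counter.
    tr := &http.Transport{
        DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
            dials++ // count real TCP dials
            return (&net.Dialer{}).DialContext(ctx, network, addr)
        },
    }
    client := &http.Client{Transport: tr}

    for i := 0; i < n; i++ {
        resp, err := client.Get(srv.URL)
        if err != nil {
            panic(err)
        }
        io.Copy(io.Discard, resp.Body) // drain so the connection is reusable
        resp.Body.Close()
    }
    return dials
}

func main() {
    fmt.Println("3 requests,", warmRequests(3), "dial(s)")
}
```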
### Git in Cloud Run

Options for code access:

1. **Clone on startup** (slow, ~30s+ for large repos)
   - Simple but adds latency
   - Acceptable if persistent connection keeps container warm

2. **Cloud Storage FUSE mount** (read-only)
   - Mount bucket with repo snapshot
   - Fast startup
   - Read-only limits usefulness

3. **Persistent volume** (Cloud Run now supports!)
   - Attach Cloud Storage or Filestore volume
   - Git clone persists across container restarts
   - Best of both worlds

4. **Shallow clone with depth**
   - `git clone --depth 1` for speed
   - Sufficient for most worker tasks
   - Can fetch more history if needed

**Recommendation:** Persistent volume with shallow clone. Container starts, checks if clone exists, pulls if yes, shallow clones if no.
### Beads Sync in Cloud Run

Workers need beads access. Options:

1. **Clone beads repo at startup**
   - Same as code: persistent volume helps
   - `bd sync` before and after work

2. **Beads as API** (future)
   - Central beads server
   - Workers query/update via HTTP
   - More complex but cleaner for distributed setups

3. **Beads in git (current)**
   - Works today
   - Worker clones .beads, does work, pushes
   - Git handles conflicts

**Recommendation:** Start with git-based beads. It works today, and Cloud Run workers can push to the beads repo just like local workers.
### Mail in Cloud Run

For VM outposts, mail is filesystem-based. For Cloud Run:

**Option A: No mail needed**
- Cloud Run workers are "fire and forget"
- Mayor pushes work via HTTP, gets results via HTTP
- Simpler model for stateless workers

**Option B: Mail via git**
- Worker checks `mail/inbox.jsonl` in repo
- Rare for workers to need incoming mail
- Mostly they just do work and report results

**Recommendation:** Start with Option A. Cloud Run workers receive work via HTTP, report via HTTP. Mail is for long-running stateful agents (Witness, Refinery), not burst workers.
### Cost Model

Cloud Run pricing (as of late 2024):

- CPU: ~$0.00002400/vCPU-second
- Memory: ~$0.00000250/GiB-second
- Requests: ~$0.40/million

For a worker running 5 minutes (300s) with 2 vCPU, 4 GiB RAM:

- CPU: 300 × 2 × $0.000024 = $0.0144
- Memory: 300 × 4 × $0.0000025 = $0.003
- **Total: ~$0.017 per worker session**

50 workers × 5 minutes each = ~$0.85

**Key insight:** When idle (connection closed, scaled to zero): **$0**

Compare to a VM running 24/7: ~$50-200/month

Cloud Run makes burst capacity essentially free when not in use.
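The arithmetic above as a small helper, useful for the cost-cap enforcement sketched in the configuration later; `sessionCost` and the constant names are illustrative:

```go
package main

import "fmt"

// Per-second Cloud Run rates from the section above (late-2024 ballpark).
const (
    cpuPerVCPUSecond = 0.000024
    memPerGiBSecond  = 0.0000025
)

// sessionCost estimates compute cost for one worker session
// (request pricing, at ~$0.40/million, is negligible here).
func sessionCost(seconds, vcpu, gib float64) float64 {
    return seconds*vcpu*cpuPerVCPUSecond + seconds*gib*memPerGiBSecond
}

func main() {
    c := sessionCost(300, 2, 4) // 5-minute worker, 2 vCPU, 4 GiB
    fmt.Printf("per session: $%.4f\n", c)
    fmt.Printf("50 sessions: $%.2f\n", 50*c)
}
```

The exact total is $0.0174 per session, which the section rounds to ~$0.017 (and ~$0.85 for 50).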
### Claude API in Cloud Run

Workers need Claude API access:

1. **API key in Secret Manager**
   - Cloud Run mounts secret as env var
   - Standard pattern

2. **Workload Identity** (if using Vertex AI)
   - Service account with Claude access
   - No keys to manage

3. **Rate limiting concerns**
   - Many concurrent workers = many API calls
   - May need to coordinate or queue
   - Could use Mayor as API proxy (future)
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                           MAYOR                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Outpost Manager                                      │   │
│  │ - Tracks all registered outposts                     │   │
│  │ - Routes work to appropriate outpost                 │   │
│  │ - Monitors worker status across outposts             │   │
│  └──────────────────────────────────────────────────────┘   │
│        │              │                  │                  │
│        ▼              ▼                  ▼                  │
│  ┌──────────┐   ┌──────────┐    ┌──────────────┐            │
│  │  Local   │   │   SSH    │    │   CloudRun   │            │
│  │ Outpost  │   │ Outpost  │    │   Outpost    │            │
│  └────┬─────┘   └────┬─────┘    └──────┬───────┘            │
└───────┼──────────────┼─────────────────┼────────────────────┘
        │              │                 │
        ▼              ▼                 ▼
   ┌─────────┐    ┌─────────┐    ┌─────────────┐
   │  tmux   │    │   SSH   │    │   HTTP/2    │
   │  panes  │    │sessions │    │ connections │
   └─────────┘    └─────────┘    └─────────────┘
        │              │                 │
        ▼              ▼                 ▼
   ┌─────────┐    ┌─────────┐    ┌─────────────┐
   │ Workers │    │ Workers │    │   Workers   │
   │ (local) │    │  (VM)   │    │ (containers)│
   └─────────┘    └─────────┘    └─────────────┘
        │              │                 │
        └──────────────┼─────────────────┘
                       ▼
              ┌─────────────────┐
              │    Git Repos    │
              │  (beads sync)   │
              │  (code repos)   │
              └─────────────────┘
```
## Configuration

```yaml
# ~/ai/config/outposts.yaml
outposts:
  # Always present - the local machine
  - name: local
    type: local
    max_workers: 4

  # VM with full Gas Town clone
  - name: gce-worker-1
    type: ssh
    host: 10.0.0.5
    user: steve
    ssh_key: ~/.ssh/gce_worker
    town_path: /home/steve/ai
    max_workers: 8

  # Cloud Run for burst capacity
  - name: cloudrun-burst
    type: cloudrun
    project: my-gcp-project
    region: us-central1
    service: gastown-worker
    max_workers: 20        # Or unlimited with cost cap
    cost_cap_hourly: 5.00  # Optional spending limit

# Work assignment policy
policy:
  # Try local first, then VM, then Cloud Run
  default_preference: [local, gce-worker-1, cloudrun-burst]

  # Override for specific scenarios
  overrides:
    - condition: "priority >= P3"            # Background work
      prefer: cloudrun-burst
    - condition: "estimated_duration > 30m"  # Long tasks
      prefer: gce-worker-1
```
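The `default_preference` walk reduces to a first-fit over healthy outposts with free capacity, which also gives the graceful degradation described later for free. A minimal sketch; the `outpostInfo` shape and `pick` helper are assumptions, not the real config structs:

```go
package main

import "fmt"

// outpostInfo is a simplified view of an outpost's live state.
type outpostInfo struct {
    max, active int
    healthy     bool
}

// pick walks the configured preference list and returns the first
// outpost that is healthy and below its worker cap.
func pick(preference []string, outposts map[string]outpostInfo) (string, bool) {
    for _, name := range preference {
        o, ok := outposts[name]
        if ok && o.healthy && o.active < o.max {
            return name, true
        }
    }
    return "", false // no capacity anywhere
}

func main() {
    outposts := map[string]outpostInfo{
        "local":          {max: 4, active: 4, healthy: true},  // full
        "gce-worker-1":   {max: 8, active: 3, healthy: false}, // VM down
        "cloudrun-burst": {max: 20, active: 0, healthy: true},
    }
    name, ok := pick([]string{"local", "gce-worker-1", "cloudrun-burst"}, outposts)
    fmt.Println(name, ok)
}
```

With local full and the VM unhealthy, work falls through to `cloudrun-burst`; the `overrides` conditions would simply swap in a different preference list before calling this.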
## Implementation Phases

### Phase 1: Outpost Abstraction (Local)
- Define Outpost/Worker interfaces
- Implement LocalOutpost (refactor current polecat spawning)
- Configuration file for outposts
- `gt outpost list`, `gt outpost status`

### Phase 2: SSH Outpost (VMs)
- Implement SSHConnection (extends existing Connection interface)
- Implement SSHOutpost
- VM provisioning docs (Terraform examples)
- `gt outpost add ssh ...`
- Test with actual GCE VM

### Phase 3: Cloud Run Outpost
- Define worker container image
- Implement CloudRunOutpost
- HTTP/2 work dispatch protocol
- Persistent connection management
- Cost tracking/limits
- `gt outpost add cloudrun ...`

### Phase 4: Policy & Intelligence
- Smart assignment based on workload characteristics
- Cost optimization (prefer free capacity)
- Auto-scaling policies
- Dashboard for cross-outpost visibility
## Key Design Decisions

### 1. Outpost as First-Class Concept
Rather than baking in specific platforms (SSH, Cloud Run), model the abstraction. This gives flexibility for future platforms (K8s, bare metal, other clouds).

### 2. Workers Are Ephemeral
Whether local tmux, VM process, or Cloud Run container - workers are spawned for work and can be terminated. Don't assume persistence.

### 3. Git as Source of Truth
Code and beads always sync via git. This works regardless of where workers run. Even Cloud Run workers clone/pull from git.

### 4. HTTP for Cloud Run Control Plane
For Cloud Run specifically, use HTTP for work dispatch. Don't try to make filesystem mail work across containers. Keep it simple.

### 5. Local-First Default
Always try local workers first. Remote outposts are for overflow/burst, not primary capacity. This keeps latency low and costs down.

### 6. Graceful Degradation
If Cloud Run is unavailable, fall back to VM. If VM is down, use local only. The system works with any subset of outposts.
## Open Questions

1. **Long-running sessions**: Cloud Run has request timeout limits (configurable up to 60 min, maybe longer now?). How does this interact with long Claude sessions?

2. **Context handoff**: If a Cloud Run worker's container restarts mid-task, how do we resume? Mail-to-self? Checkpoint to storage?

3. **Refinery in Cloud Run**: Could the Refinery itself run as a Cloud Run service? Long-running connection for merge queue processing?

4. **Witness in Cloud Run**: Worker monitoring from Cloud Run? Or does Witness need to be local/VM?

5. **Multi-region**: Cloud Run in multiple regions for geographic distribution? How to coordinate?
## Summary

The Outpost abstraction lets Gas Town scale flexibly:

| Outpost Type | Best For | Cost Model | Scaling |
|--------------|----------|------------|---------|
| Local | Development, primary work | Free (your machine) | Fixed |
| SSH/VM | Long-running, full autonomy | Always-on VM cost | Manual |
| Cloud Run | Burst, background, elastic | Pay-per-use | Auto |

Cloud Run's persistent connections solve the cold start problem, making it viable for interactive-ish work. Combined with VMs for heavier work and local for development, this gives a flexible spectrum of compute options.

The key insight: **don't pick one model, support both.** Let users configure their outposts based on their needs, budget, and scale requirements.