Comprehensive analysis of Gas Town federation via "Outposts" abstraction: - LocalOutpost: current tmux model - SSHOutpost: full Gas Town clone on VM - CloudRunOutpost: elastic container workers Key insights: - Persistent HTTP/2 connections solve Cloud Run cold start - ~$0.017 per 5-min worker session vs $50-200/mo VM - Git remains source of truth for code and beads - Local-first, remote for overflow/burst 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Federation Architecture: Ultrathink

## The Problem

Gas Town needs to scale beyond a single machine:
- More workers than one machine can handle (RAM, CPU, context windows)
- Geographic distribution (workers close to data/services)
- Cost efficiency (pay-per-use vs always-on VMs)
- Platform flexibility (support various deployment targets)

## Two Deployment Models

### Model A: "Town Clone" (VMs)

Clone the entire `~/ai` workspace to a remote VM. It runs like a regular Gas Town:

```
┌─────────────────────────────────────────┐
│  GCE VM (or any Linux box)              │
│                                         │
│  ~/ai/                # Full town clone │
│  ├── config/          # Town config     │
│  ├── mayor/           # Mayor (or none) │
│  ├── gastown/         # Rig with agents │
│  │   ├── polecats/    # Workers here    │
│  │   ├── refinery/                      │
│  │   └── witness/                       │
│  └── beads/           # Another rig     │
│                                         │
│  Runs autonomously, syncs via git       │
└─────────────────────────────────────────┘
```

**Characteristics:**
- Full autonomy if disconnected
- Familiar model - it's just another Gas Town
- VM overhead (cost, management, always-on)
- Coarse-grained scaling (spin up whole VMs)
- Good for: always-on capacity, long-running work, full independence

**Federation via:**
- Git sync for beads (already works)
- Extended mail routing (`vm1:gastown/polecat`)
- SSH for remote commands

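The extended mail address above could be split into its parts along these lines. This is a minimal sketch, not existing Gas Town code: the `Address` type, its field names, and the rule that an address without an `outpost:` prefix refers to the local town are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Address is a hypothetical parsed form of "[outpost:]rig/agent".
type Address struct {
	Outpost string // "" means the local town (assumption)
	Rig     string
	Agent   string
}

// ParseAddress splits "vm1:gastown/polecat" into outpost, rig and agent.
// An address without an "outpost:" prefix is treated as local.
func ParseAddress(s string) (Address, error) {
	var a Address
	rest := s
	if i := strings.Index(s, ":"); i >= 0 {
		a.Outpost, rest = s[:i], s[i+1:]
		if a.Outpost == "" {
			return Address{}, fmt.Errorf("empty outpost in %q", s)
		}
	}
	rig, agent, ok := strings.Cut(rest, "/")
	if !ok || rig == "" || agent == "" {
		return Address{}, fmt.Errorf("want [outpost:]rig/agent, got %q", s)
	}
	a.Rig, a.Agent = rig, agent
	return a, nil
}

func main() {
	a, err := ParseAddress("vm1:gastown/polecat")
	fmt.Println(a.Outpost, a.Rig, a.Agent, err)
}
```

Keeping the outpost prefix optional would let existing local addresses keep working unchanged while remote ones opt in.
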
### Model B: "Cloud Run Workers" (Containers)

Workers are stateless containers that wake on demand:

```
┌─────────────────────────────────────────┐
│  Cloud Run Service: gastown-worker      │
│                                         │
│  ┌────────────────────────────────┐     │
│  │  Container Instance            │     │
│  │  - Claude Code + git           │     │
│  │  - HTTP endpoint for work      │     │
│  │  - Persistent volume mount     │     │
│  │  - Scales 0→N automatically    │     │
│  └────────────────────────────────┘     │
│                                         │
│  Zero cost when idle                    │
│  Persistent connections keep warm       │
└─────────────────────────────────────────┘
```

**Characteristics:**
- Pay-per-use (nearly free when idle)
- Scales elastically (0 to many workers)
- No VM management
- Stateless(ish) - needs fast bootstrap or persistent storage
- Good for: burst capacity, background work, elastic scaling

**Key insight from your friend:**
Persistent connections solve the "zero to one" problem. Keep the connection open, container stays warm, subsequent requests are fast. This transforms Cloud Run from "cold functions" to "elastic workers."

## Unified Abstraction: Outposts

To support "however people want to do it," we need an abstraction that covers both models (and future ones like K8s, bare metal, etc.).

### The Outpost Concept

An **Outpost** is a remote compute environment that can run workers.

```go
type Outpost interface {
	// Identity
	Name() string
	Type() OutpostType // local, ssh, cloudrun, k8s

	// Capacity
	MaxWorkers() int
	ActiveWorkers() int

	// Worker lifecycle
	Spawn(issue string, config WorkerConfig) (Worker, error)
	Workers() []Worker

	// Health
	Ping() error

	// Optional: Direct communication (VM outposts)
	SendMail(worker string, msg Message) error
}

type OutpostType string

const (
	OutpostLocal    OutpostType = "local"
	OutpostSSH      OutpostType = "ssh"      // Full VM clone
	OutpostCloudRun OutpostType = "cloudrun" // Container workers
	OutpostK8s      OutpostType = "k8s"      // Future
)
```

### Worker Interface

```go
type Worker interface {
	ID() string
	Outpost() string
	Status() WorkerStatus // idle, working, done, failed
	Issue() string        // Current issue being worked

	// For interactive outposts (local, SSH)
	Attach() error // Connect to worker session

	// For all outposts
	Logs() (io.Reader, error)
	Stop() error
}
```

### Outpost Implementations

#### LocalOutpost
- Current model: tmux panes on localhost
- Uses existing Connection interface (LocalConnection)
- Workers are tmux sessions

#### SSHOutpost
- Full Gas Town clone on remote VM
- Uses SSHConnection for remote ops
- Workers are remote tmux sessions
- Town config replicated to VM

#### CloudRunOutpost
- Workers are container instances
- HTTP/gRPC for work dispatch
- No tmux (stateless containers)
- Persistent connections for warmth

## Cloud Run Deep Dive

### Container Design

```dockerfile
FROM golang:1.21 AS builder
# Build gt binary...

# Node base image, so npm is available for the Claude Code install
FROM node:20-bookworm-slim

# Install git, common tools
RUN apt-get update \
    && apt-get install -y --no-install-recommends git ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install Claude Code
RUN npm install -g @anthropic-ai/claude-code

# Copy gt binary
COPY --from=builder /app/gt /usr/local/bin/gt

# Entrypoint accepts work via HTTP
COPY worker-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```

### HTTP Work Protocol

```
POST /work
{
  "issue_id": "gt-abc123",
  "rig_url": "https://github.com/steveyegge/gastown",
  "beads_url": "https://github.com/steveyegge/gastown",
  "context": { /* optional hints */ }
}

Response (streaming):
{
  "status": "working|done|failed",
  "branch": "polecat/gt-abc123",
  "logs": "...",
  "pr_url": "..."  // if created
}
```

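The request and response bodies above map naturally onto Go structs. The following is a sketch under assumptions: the type names `WorkRequest` and `WorkUpdate`, the helper functions, and the choice to omit empty `context` and `pr_url` fields are illustrative and not part of any existing Gas Town API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WorkRequest mirrors the POST /work body.
type WorkRequest struct {
	IssueID  string         `json:"issue_id"`
	RigURL   string         `json:"rig_url"`
	BeadsURL string         `json:"beads_url"`
	Context  map[string]any `json:"context,omitempty"` // optional hints
}

// WorkUpdate mirrors one object in the streamed response.
type WorkUpdate struct {
	Status string `json:"status"` // "working", "done" or "failed"
	Branch string `json:"branch"`
	Logs   string `json:"logs"`
	PRURL  string `json:"pr_url,omitempty"` // set only if a PR was created
}

// Terminal reports whether this update ends the stream.
func (u WorkUpdate) Terminal() bool {
	return u.Status == "done" || u.Status == "failed"
}

// EncodeRequest serializes a request to its JSON wire form.
func EncodeRequest(r WorkRequest) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

// DecodeUpdate parses one streamed JSON object.
func DecodeUpdate(s string) (WorkUpdate, error) {
	var u WorkUpdate
	err := json.Unmarshal([]byte(s), &u)
	return u, err
}

func main() {
	body, _ := EncodeRequest(WorkRequest{
		IssueID:  "gt-abc123",
		RigURL:   "https://github.com/steveyegge/gastown",
		BeadsURL: "https://github.com/steveyegge/gastown",
	})
	fmt.Println(body)

	u, _ := DecodeUpdate(`{"status":"done","branch":"polecat/gt-abc123"}`)
	fmt.Println(u.Status, u.Terminal())
}
```
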
### Persistent Connections

The "zero to one" solution:
1. Mayor opens HTTP/2 connection to Cloud Run
2. Connection stays open (Cloud Run keeps container warm)
3. Send work requests over same connection
4. Container processes work, streams results back
5. On idle timeout, connection closes, container scales down
6. Next request: small cold start, but acceptable

```
┌──────────┐                     ┌────────────────┐
│  Mayor   │────HTTP/2 stream───▶│   Cloud Run    │
│          │◀───results stream───│   Container    │
└──────────┘                     └────────────────┘
     │                                   │
     │  Connection persists              │  Container stays
     │  for hours if needed              │  warm while
     │                                   │  connection open
     ▼                                   ▼
       [New work requests go over same connection]
```

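The client side of this flow can be sketched with the Go standard library. The example below starts a local TLS test server with HTTP/2 enabled, sends two work items through one `http.Client` (which reuses the underlying connection), and reads the streamed status updates. It does not model Cloud Run's instance lifecycle at all. The newline-delimited JSON framing, the `update` type, and the handler are assumptions for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// update is one streamed status line (newline-delimited JSON is an assumption).
type update struct {
	Status string `json:"status"`
}

// workerHandler stands in for the container: it streams "working", then "done".
func workerHandler(w http.ResponseWriter, r *http.Request) {
	enc := json.NewEncoder(w)
	for _, s := range []string{"working", "done"} {
		_ = enc.Encode(update{Status: s})
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // push each update to the client immediately
		}
	}
}

// dispatch posts one work item and collects the streamed statuses.
// It also returns the HTTP major version used for the response.
func dispatch(c *http.Client, baseURL, issue string) ([]string, int, error) {
	body, err := json.Marshal(map[string]string{"issue_id": issue})
	if err != nil {
		return nil, 0, err
	}
	resp, err := c.Post(baseURL+"/work", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()

	var statuses []string
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		var u update
		if err := json.Unmarshal(sc.Bytes(), &u); err != nil {
			return nil, 0, err
		}
		statuses = append(statuses, u.Status)
	}
	return statuses, resp.ProtoMajor, sc.Err()
}

// runDemo sends two work items through a single client and server.
func runDemo() ([][]string, int, error) {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(workerHandler))
	srv.EnableHTTP2 = true
	srv.StartTLS()
	defer srv.Close()

	client := srv.Client() // trusts the test certificate
	var all [][]string
	proto := 0
	for _, issue := range []string{"gt-abc123", "gt-def456"} {
		statuses, p, err := dispatch(client, srv.URL, issue)
		if err != nil {
			return nil, 0, err
		}
		all = append(all, statuses)
		proto = p
	}
	return all, proto, nil
}

func main() {
	all, proto, err := runDemo()
	fmt.Println(all, proto, err)
}
```

A real Mayor would keep the client alive between dispatches and treat a dropped connection as a signal that the next request may hit a cold start.
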
### Git in Cloud Run

Options for code access:

1. **Clone on startup** (slow, ~30s+ for large repos)
   - Simple but adds latency
   - Acceptable if persistent connection keeps container warm

2. **Cloud Storage FUSE mount** (read-only)
   - Mount bucket with repo snapshot
   - Fast startup
   - Read-only limits usefulness

3. **Persistent volume** (now supported by Cloud Run)
   - Attach Cloud Storage or Filestore volume
   - Git clone persists across container restarts
   - Best of both worlds

4. **Shallow clone with depth**
   - `git clone --depth 1` for speed
   - Sufficient for most worker tasks
   - Can fetch more history if needed

**Recommendation:** Persistent volume with shallow clone. Container starts, checks if clone exists, pulls if yes, shallow clones if no.

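The recommended startup check (pull if a clone exists, shallow clone otherwise) might look like the sketch below. It only builds the git command line rather than running it, so the decision logic can be exercised without network access. The function names and the use of `--ff-only` on pull are assumptions, not part of the design above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// bootstrapCommand returns the git command a worker would run to get a
// usable checkout in dir: pull if a clone already exists on the
// persistent volume, otherwise make a shallow clone.
func bootstrapCommand(repoURL, dir string) []string {
	if info, err := os.Stat(filepath.Join(dir, ".git")); err == nil && info.IsDir() {
		return []string{"git", "-C", dir, "pull", "--ff-only"}
	}
	return []string{"git", "clone", "--depth", "1", repoURL, dir}
}

// demo builds the command for an empty volume and then for a volume that
// already contains a .git directory.
func demo() (fresh, existing []string, err error) {
	const repoURL = "https://github.com/steveyegge/gastown"
	root, err := os.MkdirTemp("", "outpost-volume-")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(root)

	dir := filepath.Join(root, "gastown")
	fresh = bootstrapCommand(repoURL, dir)

	// Simulate a clone left behind by a previous container instance.
	if err := os.MkdirAll(filepath.Join(dir, ".git"), 0o755); err != nil {
		return nil, nil, err
	}
	existing = bootstrapCommand(repoURL, dir)
	return fresh, existing, nil
}

func main() {
	fresh, existing, err := demo()
	fmt.Println(fresh)
	fmt.Println(existing)
	fmt.Println(err)
}
```
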
### Beads Sync in Cloud Run

Workers need beads access. Options:

1. **Clone beads repo at startup**
   - Same as code: persistent volume helps
   - `bd sync` before and after work

2. **Beads as API** (future)
   - Central beads server
   - Workers query/update via HTTP
   - More complex but cleaner for distributed

3. **Beads in git (current)**
   - Works today
   - Worker clones .beads, does work, pushes
   - Git handles conflicts

**Recommendation:** Start with git-based beads. It works today and Cloud Run workers can push to the beads repo just like local workers.

### Mail in Cloud Run

For VM outposts, mail is filesystem-based. For Cloud Run:

**Option A: No mail needed**
- Cloud Run workers are "fire and forget"
- Mayor pushes work via HTTP, gets results via HTTP
- Simpler model for stateless workers

**Option B: Mail via git**
- Worker checks `mail/inbox.jsonl` in repo
- Rare for workers to need incoming mail
- Mostly they just do work and report results

**Recommendation:** Start with Option A. Cloud Run workers receive work via HTTP, report via HTTP. Mail is for long-running stateful agents (Witness, Refinery), not burst workers.

### Cost Model

Cloud Run pricing (as of late 2024):
- CPU: ~$0.00002400/vCPU-second
- Memory: ~$0.00000250/GiB-second
- Requests: ~$0.40/million

For a worker running 5 minutes (300s) with 2 vCPU, 4 GiB RAM:
- CPU: 300 × 2 × $0.000024 = $0.0144
- Memory: 300 × 4 × $0.0000025 = $0.003
- **Total: ~$0.017 per worker session** (request charges are negligible at this scale)

50 workers × 5 minutes each = ~$0.87

**Key insight:** When idle (connection closed, scaled to zero): **$0**

Compare to a VM running 24/7: ~$50-200/month

Cloud Run makes burst capacity essentially free when not in use.

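The arithmetic above is easy to wrap in a small helper for trying other worker sizes. This is a sketch using the rates quoted in this section; the function name is hypothetical, and per-request fees are left out because they are negligible at this scale.

```go
package main

import "fmt"

// Rates quoted above (late 2024); treat them as illustrative.
const (
	cpuPerVCPUSecond = 0.000024  // USD per vCPU-second
	memPerGiBSecond  = 0.0000025 // USD per GiB-second
)

// SessionCost returns the compute cost in USD of one worker session.
func SessionCost(seconds, vcpus, memGiB float64) float64 {
	return seconds*vcpus*cpuPerVCPUSecond + seconds*memGiB*memPerGiBSecond
}

func main() {
	one := SessionCost(300, 2, 4)
	fmt.Printf("one session: $%.4f\n", one)    // one session: $0.0174
	fmt.Printf("50 sessions: $%.2f\n", 50*one) // 50 sessions: $0.87
}
```
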
### Claude API in Cloud Run

Workers need Claude API access:

1. **API key in Secret Manager**
   - Cloud Run mounts secret as env var
   - Standard pattern

2. **Workload Identity** (if using Vertex AI)
   - Service account with Claude access
   - No keys to manage

3. **Rate limiting concerns**
   - Many concurrent workers = many API calls
   - May need to coordinate or queue
   - Could use Mayor as API proxy (future)

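If calls were funneled through a single process (for example the Mayor acting as proxy, which is only a future idea here), a counting semaphore is one simple way to cap concurrency. The sketch below is an assumption-laden illustration: `Limiter` and `peakConcurrency` are hypothetical names, the API call is a placeholder, and nothing here coordinates across separate containers.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter caps in-flight calls using a buffered channel as a counting semaphore.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter allows at most max calls to run at the same time.
func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// Do blocks until a slot is free, runs call, then releases the slot.
func (l *Limiter) Do(call func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	call()
}

// peakConcurrency runs n placeholder calls through a limiter and reports
// the highest number that were in flight at once.
func peakConcurrency(n, max int) int {
	lim := NewLimiter(max)
	var (
		mu       sync.Mutex
		inFlight int
		peak     int
		wg       sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lim.Do(func() {
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				// A real worker would make its API call here.

				mu.Lock()
				inFlight--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak in flight:", peakConcurrency(50, 3))
}
```
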
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                            MAYOR                            │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   Outpost Manager                    │   │
│  │  - Tracks all registered outposts                    │   │
│  │  - Routes work to appropriate outpost                │   │
│  │  - Monitors worker status across outposts            │   │
│  └──────────────────────────────────────────────────────┘   │
│       │              │                  │                   │
│       ▼              ▼                  ▼                   │
│  ┌──────────┐   ┌──────────┐     ┌──────────────┐           │
│  │  Local   │   │   SSH    │     │   CloudRun   │           │
│  │ Outpost  │   │ Outpost  │     │   Outpost    │           │
│  └────┬─────┘   └────┬─────┘     └──────┬───────┘           │
└───────┼──────────────┼──────────────────┼───────────────────┘
        │              │                  │
        ▼              ▼                  ▼
   ┌─────────┐    ┌─────────┐      ┌─────────────┐
   │  tmux   │    │   SSH   │      │   HTTP/2    │
   │  panes  │    │sessions │      │ connections │
   └─────────┘    └─────────┘      └─────────────┘
        │              │                  │
        ▼              ▼                  ▼
   ┌─────────┐    ┌─────────┐      ┌─────────────┐
   │ Workers │    │ Workers │      │   Workers   │
   │ (local) │    │  (VM)   │      │ (containers)│
   └─────────┘    └─────────┘      └─────────────┘
        │              │                  │
        └──────────────┼──────────────────┘
                       ▼
              ┌─────────────────┐
              │    Git Repos    │
              │  (beads sync)   │
              │  (code repos)   │
              └─────────────────┘
```

## Configuration

```yaml
# ~/ai/config/outposts.yaml
outposts:
  # Always present - the local machine
  - name: local
    type: local
    max_workers: 4

  # VM with full Gas Town clone
  - name: gce-worker-1
    type: ssh
    host: 10.0.0.5
    user: steve
    ssh_key: ~/.ssh/gce_worker
    town_path: /home/steve/ai
    max_workers: 8

  # Cloud Run for burst capacity
  - name: cloudrun-burst
    type: cloudrun
    project: my-gcp-project
    region: us-central1
    service: gastown-worker
    max_workers: 20  # Or unlimited with cost cap
    cost_cap_hourly: 5.00  # Optional spending limit

# Work assignment policy
policy:
  # Try local first, then VM, then Cloud Run
  default_preference: [local, gce-worker-1, cloudrun-burst]

  # Override for specific scenarios
  overrides:
    - condition: "priority >= P3"  # Background work
      prefer: cloudrun-burst
    - condition: "estimated_duration > 30m"  # Long tasks
      prefer: gce-worker-1
```

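One way the assignment policy could be evaluated is sketched below. Several things are assumed: conditions are written as Go predicates rather than parsed from the YAML strings, priorities are plain integers (3 for P3), the first matching override wins, and the remaining outposts keep their default order as fallbacks. None of these names exist in Gas Town today.

```go
package main

import "fmt"

// Task carries the two attributes used by the example overrides.
type Task struct {
	Priority     int // 3 means P3; the numeric form is an assumption
	EstimatedMin int // estimated duration in minutes
}

// Override pairs a predicate with the outpost to try first.
type Override struct {
	Match  func(Task) bool
	Prefer string
}

// Order returns outpost names in the order they should be tried.
// The first matching override moves its outpost to the front.
func Order(defaults []string, overrides []Override, t Task) []string {
	for _, o := range overrides {
		if !o.Match(t) {
			continue
		}
		out := []string{o.Prefer}
		for _, name := range defaults {
			if name != o.Prefer {
				out = append(out, name)
			}
		}
		return out
	}
	return append([]string(nil), defaults...)
}

// Values mirroring the example outposts.yaml policy.
var (
	exampleDefaults  = []string{"local", "gce-worker-1", "cloudrun-burst"}
	exampleOverrides = []Override{
		{Match: func(t Task) bool { return t.Priority >= 3 }, Prefer: "cloudrun-burst"},
		{Match: func(t Task) bool { return t.EstimatedMin > 30 }, Prefer: "gce-worker-1"},
	}
)

func main() {
	fmt.Println(Order(exampleDefaults, exampleOverrides, Task{Priority: 3, EstimatedMin: 5}))
	fmt.Println(Order(exampleDefaults, exampleOverrides, Task{Priority: 1, EstimatedMin: 45}))
	fmt.Println(Order(exampleDefaults, exampleOverrides, Task{Priority: 1, EstimatedMin: 5}))
}
```
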
## Implementation Phases

### Phase 1: Outpost Abstraction (Local)
- Define Outpost/Worker interfaces
- Implement LocalOutpost (refactor current polecat spawning)
- Configuration file for outposts
- `gt outpost list`, `gt outpost status`

### Phase 2: SSH Outpost (VMs)
- Implement SSHConnection (extends existing Connection interface)
- Implement SSHOutpost
- VM provisioning docs (Terraform examples)
- `gt outpost add ssh ...`
- Test with actual GCE VM

### Phase 3: Cloud Run Outpost
- Define worker container image
- Implement CloudRunOutpost
- HTTP/2 work dispatch protocol
- Persistent connection management
- Cost tracking/limits
- `gt outpost add cloudrun ...`

### Phase 4: Policy & Intelligence
- Smart assignment based on workload characteristics
- Cost optimization (prefer free capacity)
- Auto-scaling policies
- Dashboard for cross-outpost visibility

## Key Design Decisions

### 1. Outpost as First-Class Concept
Rather than baking in specific platforms (SSH, Cloud Run), model the abstraction. This gives flexibility for future platforms (K8s, bare metal, other clouds).

### 2. Workers Are Ephemeral
Whether local tmux, VM process, or Cloud Run container - workers are spawned for work and can be terminated. Don't assume persistence.

### 3. Git as Source of Truth
Code and beads always sync via git. This works regardless of where workers run. Even Cloud Run workers clone/pull from git.

### 4. HTTP for Cloud Run Control Plane
For Cloud Run specifically, use HTTP for work dispatch. Don't try to make filesystem mail work across containers. Keep it simple.

### 5. Local-First Default
Always try local workers first. Remote outposts are for overflow/burst, not primary capacity. This keeps latency low and costs down.

### 6. Graceful Degradation
If Cloud Run is unavailable, fall back to VM. If VM is down, use local only. System works with any subset of outposts.

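Graceful degradation can be sketched as a walk over the outposts in preference order, skipping any whose `Ping` fails. The `Pinger` interface below is just the slice of `Outpost` this example needs, and `fakeOutpost` is a stand-in invented for the demonstration; a real implementation would likely also check capacity.

```go
package main

import (
	"errors"
	"fmt"
)

// Pinger is the part of the Outpost interface this sketch relies on.
type Pinger interface {
	Name() string
	Ping() error
}

// fakeOutpost is a stand-in used only for this example.
type fakeOutpost struct {
	name string
	up   bool
}

func (f fakeOutpost) Name() string { return f.name }

func (f fakeOutpost) Ping() error {
	if f.up {
		return nil
	}
	return errors.New(f.name + " unreachable")
}

// FirstHealthy returns the first outpost, in the given order, whose Ping succeeds.
func FirstHealthy(outposts []Pinger) (Pinger, error) {
	for _, o := range outposts {
		if err := o.Ping(); err == nil {
			return o, nil
		}
	}
	return nil, errors.New("no outpost available")
}

func main() {
	order := []Pinger{
		fakeOutpost{name: "cloudrun-burst", up: false},
		fakeOutpost{name: "gce-worker-1", up: false},
		fakeOutpost{name: "local", up: true},
	}
	o, err := FirstHealthy(order)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", o.Name())
}
```
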
## Open Questions

1. **Long-running sessions**: Cloud Run has request timeout limits (configurable up to 60 min, maybe longer now?). How does this interact with long Claude sessions?

2. **Context handoff**: If a Cloud Run worker's container restarts mid-task, how do we resume? Mail-to-self? Checkpoint to storage?

3. **Refinery in Cloud Run**: Could the Refinery itself run as a Cloud Run service? Long-running connection for merge queue processing?

4. **Witness in Cloud Run**: Worker monitoring from Cloud Run? Or does Witness need to be local/VM?

5. **Multi-region**: Cloud Run in multiple regions for geographic distribution? How to coordinate?

## Summary

The Outpost abstraction lets Gas Town scale flexibly:

| Outpost Type | Best For | Cost Model | Scaling |
|--------------|----------|------------|---------|
| Local | Development, primary work | Free (your machine) | Fixed |
| SSH/VM | Long-running, full autonomy | Always-on VM cost | Manual |
| Cloud Run | Burst, background, elastic | Pay-per-use | Auto |

Cloud Run's persistent connections solve the cold start problem, making it viable for interactive-ish work. Combined with VMs for heavier work and local for development, this gives a flexible spectrum of compute options.

The key insight: **don't pick one model, support both.** Let users configure their outposts based on their needs, budget, and scale requirements.