docs: design Dog pool architecture for concurrent shutdown dances (gt-fsld8)
Key decisions:
- Fixed pool of 5 goroutines (not Claude sessions)
- State file persistence for crash recovery
- Warrant queuing when pool exhausted
- Dogs are lightweight state machine executors
- New internal/shutdown/ package (separate from existing dog package)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
495
dog-pool-architecture.md
Normal file
@@ -0,0 +1,495 @@
# Dog Pool Architecture for Concurrent Shutdown Dances

> Design document for gt-fsld8

## Problem Statement

Boot needs to run multiple shutdown-dance molecules concurrently when multiple death
warrants are issued. The current hook design only allows one molecule per agent.

Example scenario:
- Warrant 1: Kill stuck polecat Toast (60s into interrogation)
- Warrant 2: Kill stuck polecat Shadow (just started)
- Warrant 3: Kill stuck witness (120s into interrogation)

All three need concurrent tracking, independent timeouts, and separate outcomes.

## Design Decision: Lightweight State Machines

After analyzing the options, the shutdown-dance does NOT need Claude sessions.
The dance is a deterministic state machine:

```
WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE
```

Each step is mechanical:
1. Send a tmux message (no LLM needed)
2. Wait for timeout or response (timer)
3. Check tmux output for ALIVE keyword (string match)
4. Repeat or terminate

**Decision**: Dogs are lightweight goroutines, not Claude sessions.
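
Because every step is mechanical, the whole dance reduces to a pure transition function. The following is a minimal sketch under stated assumptions: `next`, `maxAttempts`, and the lowercase state names are hypothetical and only mirror the flow above, not any existing code.

```go
package main

import "fmt"

// state names mirror the WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE flow.
type state string

const (
	interrogate state = "interrogate"
	evaluate    state = "evaluate"
	pardon      state = "pardon"
	execute     state = "execute"
)

// maxAttempts is an assumed limit on interrogation rounds.
const maxAttempts = 3

// next returns the state and attempt number that follow (s, attempt).
// alive reports whether the ALIVE keyword was seen; it only matters in evaluate.
func next(s state, attempt int, alive bool) (state, int) {
	switch s {
	case interrogate:
		return evaluate, attempt
	case evaluate:
		if alive {
			return pardon, attempt
		}
		if attempt < maxAttempts {
			return interrogate, attempt + 1
		}
		return execute, attempt
	default:
		return s, attempt // pardon and execute are terminal
	}
}

func main() {
	s, attempt := interrogate, 1
	for s != pardon && s != execute {
		s, attempt = next(s, attempt, false) // the target never answers
	}
	fmt.Println(s, attempt)
}
```

Keeping the transition logic free of I/O means it can be unit tested without tmux or timers; the goroutine only has to supply the `alive` input.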

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                                BOOT                                │
│                      (Claude session in tmux)                      │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                         Dog Manager                          │  │
│  │                                                              │  │
│  │  Pool: [Dog1, Dog2, Dog3, ...] (goroutines + state files)    │  │
│  │                                                              │  │
│  │  allocate() → Dog                                            │  │
│  │  release(Dog)                                                │  │
│  │  status() → []DogStatus                                      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  Boot's job:                                                       │
│  - Watch for warrants (file or event)                              │
│  - Allocate dog from pool                                          │
│  - Monitor dog progress                                            │
│  - Handle dog completion/failure                                   │
│  - Report results                                                  │
└────────────────────────────────────────────────────────────────────┘
```

## Dog Structure

```go
// Dog represents a shutdown-dance executor
type Dog struct {
	ID        string   // Unique ID (e.g., "dog-1704567890123")
	Warrant   *Warrant // The death warrant being processed
	State     ShutdownDanceState
	Attempt   int // Current interrogation attempt (1-3)
	StartedAt time.Time
	StateFile string // Persistent state: ~/gt/deacon/dogs/active/<id>.json
}

type ShutdownDanceState string

const (
	StateIdle          ShutdownDanceState = "idle"
	StateInterrogating ShutdownDanceState = "interrogating" // Sent message, waiting
	StateEvaluating    ShutdownDanceState = "evaluating"    // Checking response
	StatePardoned      ShutdownDanceState = "pardoned"      // Session responded
	StateExecuting     ShutdownDanceState = "executing"     // Killing session
	StateComplete      ShutdownDanceState = "complete"      // Done, ready for cleanup
	StateFailed        ShutdownDanceState = "failed"        // Dog crashed/errored
)

type Warrant struct {
	ID        string // Bead ID for the warrant
	Target    string // Session to interrogate (e.g., "gt-gastown-Toast")
	Reason    string // Why warrant was issued
	Requester string // Who filed the warrant
	FiledAt   time.Time
}
```

## Pool Design

### Fixed Pool Size

**Decision**: Fixed pool of 5 dogs, configurable via the `GT_DOG_POOL_SIZE` environment variable.

Rationale:
- Dynamic sizing adds complexity without clear benefit
- 5 concurrent shutdown dances cover the expected worst case
- If the pool is exhausted, warrants queue (better than unbounded dog spawning)
- Memory footprint is negligible (goroutines + small state files)

```go
const (
	DefaultPoolSize = 5
	MaxPoolSize     = 20
)

type DogPool struct {
	mu       sync.Mutex
	dogs     []*Dog          // All dogs in pool
	idle     chan *Dog       // Channel of available dogs (buffered to pool size)
	active   map[string]*Dog // ID -> Dog for active dogs
	stateDir string          // ~/gt/deacon/dogs/active/
}

func (p *DogPool) Allocate(warrant *Warrant) (*Dog, error) {
	select {
	case dog := <-p.idle:
		dog.Warrant = warrant
		dog.State = StateInterrogating
		dog.Attempt = 1
		dog.StartedAt = time.Now()

		// Guard the active map; Release mutates it under the same lock.
		p.mu.Lock()
		p.active[dog.ID] = dog
		p.mu.Unlock()
		return dog, nil
	default:
		return nil, ErrPoolExhausted
	}
}

func (p *DogPool) Release(dog *Dog) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.active, dog.ID)
	dog.Reset()
	// Never blocks as long as idle is buffered to the pool size.
	p.idle <- dog
}
```
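
The pool size is described as configurable via the environment, and the summary table names `GT_DOG_POOL_SIZE`. How invalid or oversized values are treated is not specified; the sketch below assumes a fallback to `DefaultPoolSize` and a clamp at `MaxPoolSize`, and `poolSizeFromEnv` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const (
	DefaultPoolSize = 5
	MaxPoolSize     = 20
)

// poolSizeFromEnv interprets a raw GT_DOG_POOL_SIZE value.
// Assumption: empty, non-numeric, or non-positive values fall back to
// DefaultPoolSize, and values above MaxPoolSize are clamped.
func poolSizeFromEnv(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return DefaultPoolSize
	}
	if n > MaxPoolSize {
		return MaxPoolSize
	}
	return n
}

func main() {
	fmt.Println(poolSizeFromEnv(os.Getenv("GT_DOG_POOL_SIZE")))
}
```

The resulting size would also be the natural capacity for the `idle` channel, so that `Release` can return a dog without blocking.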

### Why Not Dynamic Pool?

Considered but rejected:
- Adding dogs on demand increases complexity
- No clear benefit - warrants rarely exceed 5 concurrent
- If needed, raise DefaultPoolSize
- Simpler to reason about fixed resources

## Communication: State Files + Events

### State Persistence

Each active dog writes state to `~/gt/deacon/dogs/active/<id>.json`:

```json
{
  "id": "dog-1704567890123",
  "warrant": {
    "id": "gt-abc123",
    "target": "gt-gastown-Toast",
    "reason": "no_response_health_check",
    "requester": "deacon",
    "filed_at": "2026-01-07T20:15:00Z"
  },
  "state": "interrogating",
  "attempt": 2,
  "started_at": "2026-01-07T20:15:00Z",
  "last_message_at": "2026-01-07T20:16:00Z",
  "next_timeout": "2026-01-07T20:18:00Z"
}
```

### Boot Monitoring

Boot monitors dogs via:
1. **Polling**: `gt dog status --active` every tick
2. **Completion files**: Dogs write `<id>.done` when complete

```go
type DogResult struct {
	DogID    string
	Warrant  *Warrant
	Outcome  DogOutcome // pardoned | executed | failed
	Duration time.Duration
	Details  string
}

type DogOutcome string

const (
	OutcomePardoned DogOutcome = "pardoned" // Session responded
	OutcomeExecuted DogOutcome = "executed" // Session killed
	OutcomeFailed   DogOutcome = "failed"   // Dog crashed
)
```

### Why Not Mail?

Considered but rejected for dog<->boot communication:
- Mail is async, poll-based - adds latency
- State files are simpler for local coordination
- Dogs don't need complex inter-agent communication
- Keep mail for external coordination (Witness, Mayor)

## Shutdown Dance State Machine

Each dog executes this state machine:

```
                   ┌─────────────────────────────────────────┐
                   │                                         │
                   ▼                                         │
    ┌───────────────────────────┐                            │
    │       INTERROGATING       │                            │
    │                           │                            │
    │  1. Send health check     │                            │
    │  2. Start timeout timer   │                            │
    └───────────┬───────────────┘                            │
                │                                            │
                │ timeout or response                        │
                ▼                                            │
    ┌───────────────────────────┐                            │
    │        EVALUATING         │                            │
    │                           │                            │
    │  Check tmux output for    │                            │
    │  ALIVE keyword            │                            │
    └───────────┬───────────────┘                            │
                │                                            │
        ┌───────┴───────┐                                    │
        │               │                                    │
        ▼               ▼                                    │
  [ALIVE found]    [No ALIVE]                                │
        │               │                                    │
        │               │ attempt < 3?                       │
        │               ├──────────────────────────────────→─┘
        │               │ yes: attempt++, longer timeout
        │               │
        │               │ no: attempt == 3
        ▼               ▼
   ┌─────────┐   ┌─────────────┐
   │ PARDONED│   │  EXECUTING  │
   │         │   │             │
   │ Cancel  │   │ Kill tmux   │
   │ warrant │   │ session     │
   └────┬────┘   └──────┬──────┘
        │               │
        └───────┬───────┘
                │
                ▼
        ┌────────────────┐
        │    COMPLETE    │
        │                │
        │  Write result  │
        │  Release dog   │
        └────────────────┘
```

### Timeout Gates

| Attempt | Timeout | Cumulative Wait |
|---------|---------|-----------------|
| 1       | 60s     | 60s             |
| 2       | 120s    | 180s (3 min)    |
| 3       | 240s    | 420s (7 min)    |

### Health Check Message

```
[DOG] HEALTH CHECK: Session {target}, respond ALIVE within {timeout}s or face termination.
Warrant reason: {reason}
Filed by: {requester}
Attempt: {attempt}/3
```
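
Filling the template is a plain string format. The sketch below assumes a hypothetical `healthCheckMessage` helper with illustrative parameter names; the wording is taken from the template above.

```go
package main

import "fmt"

// healthCheckMessage fills the health check template.
// The function and parameter names are illustrative, not an existing API.
func healthCheckMessage(target, reason, requester string, attempt, timeoutSecs int) string {
	return fmt.Sprintf(
		"[DOG] HEALTH CHECK: Session %s, respond ALIVE within %ds or face termination.\n"+
			"Warrant reason: %s\n"+
			"Filed by: %s\n"+
			"Attempt: %d/3",
		target, timeoutSecs, reason, requester, attempt)
}

func main() {
	fmt.Println(healthCheckMessage("gt-gastown-Toast", "no_response_health_check", "deacon", 2, 120))
}
```

Note that the message itself contains the word ALIVE, so anything that scans the pane for that keyword has to ignore the dog's own text.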

### Response Detection

```go
func (d *Dog) CheckForResponse() bool {
	tm := tmux.NewTmux()
	output, err := tm.CapturePane(d.Warrant.Target, 50) // Last 50 lines
	if err != nil {
		return false
	}

	// The health check message itself contains "ALIVE", so only look at
	// output after the last line of the most recent health check.
	// Assumes the message format from "Health Check Message" above.
	marker := fmt.Sprintf("Attempt: %d/3", d.Attempt)
	if i := strings.LastIndex(output, marker); i >= 0 {
		output = output[i+len(marker):]
	}

	// An explicit ALIVE keyword counts as a response
	return strings.Contains(output, "ALIVE")
}
```

## Dog Implementation

### Not Reusing Polecat Infrastructure

**Decision**: Dogs do NOT reuse polecat infrastructure.

Rationale:
- Polecats are Claude sessions with molecules, hooks, sandboxes
- Dogs are simple state machine executors
- Polecats have 3-layer lifecycle (session/sandbox/slot)
- Dogs have single-layer lifecycle (just state)
- Different resource profiles, different management

What dogs DO share:
- tmux utilities for message sending/capture
- State file patterns
- Pool allocation pattern

### Dog Execution Loop

```go
func (d *Dog) Run(ctx context.Context) DogResult {
	d.State = StateInterrogating
	d.saveState()

	for d.Attempt <= 3 {
		// Send interrogation message
		if err := d.sendHealthCheck(); err != nil {
			return d.fail(err)
		}

		// Wait for timeout or context cancellation
		timeout := d.timeoutForAttempt(d.Attempt)
		select {
		case <-ctx.Done():
			return d.fail(ctx.Err())
		case <-time.After(timeout):
			// Timeout reached
		}

		// Evaluate response
		d.State = StateEvaluating
		d.saveState()

		if d.CheckForResponse() {
			// Session is alive
			return d.pardon()
		}

		// No response - try again or execute
		d.Attempt++
		if d.Attempt <= 3 {
			d.State = StateInterrogating
			d.saveState()
		}
	}

	// All attempts exhausted - execute warrant
	return d.execute()
}
```

## Failure Handling

### Dog Crashes Mid-Dance

If a dog crashes (Boot process restarts, system crash):

1. State files persist in `~/gt/deacon/dogs/active/`
2. On Boot restart, scan for orphaned state files
3. Resume or restart based on state:

| State             | Recovery Action              |
|-------------------|------------------------------|
| interrogating     | Restart from current attempt |
| evaluating        | Check response, continue     |
| executing         | Verify kill, mark complete   |
| pardoned/complete | Already done, clean up       |

```go
func (p *DogPool) RecoverOrphans() error {
	files, err := filepath.Glob(filepath.Join(p.stateDir, "*.json"))
	if err != nil {
		return err
	}
	for _, f := range files {
		state := loadDogState(f)
		// Pardoned and complete dances only need cleanup; resume the rest
		if state.State != StateComplete && state.State != StatePardoned {
			dog := p.allocateForRecovery(state)
			go dog.Resume()
		}
	}
	return nil
}
```
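
The recovery table can also be expressed as a lookup, which makes states the table does not cover explicit. This is a sketch under assumptions: `recoveryAction` and `recoveryFor` are hypothetical names, and reporting `failed` (or any unknown state) as unhandled is a choice made here, since the table is silent on it.

```go
package main

import "fmt"

// recoveryAction restates the right-hand column of the recovery table.
type recoveryAction string

const (
	restartAttempt recoveryAction = "restart from current attempt"
	checkResponse  recoveryAction = "check response, continue"
	verifyKill     recoveryAction = "verify kill, mark complete"
	cleanUp        recoveryAction = "already done, clean up"
)

// recoveryFor maps a persisted state name to its recovery action.
// The boolean is false for states the table does not list.
func recoveryFor(state string) (recoveryAction, bool) {
	switch state {
	case "interrogating":
		return restartAttempt, true
	case "evaluating":
		return checkResponse, true
	case "executing":
		return verifyKill, true
	case "pardoned", "complete":
		return cleanUp, true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{"interrogating", "executing", "failed"} {
		action, ok := recoveryFor(s)
		fmt.Printf("%s -> %q (known=%v)\n", s, action, ok)
	}
}
```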

### Handling Pool Exhaustion

If all dogs are busy when a new warrant arrives:

```go
func (b *Boot) HandleWarrant(warrant *Warrant) error {
	dog, err := b.pool.Allocate(warrant)
	if errors.Is(err, ErrPoolExhausted) {
		// Queue the warrant for later processing
		b.warrantQueue.Push(warrant)
		b.log("Warrant %s queued (pool exhausted)", warrant.ID)
		return nil
	}
	if err != nil {
		// Any other allocation error leaves dog nil; do not start a dance
		return err
	}

	go func() {
		result := dog.Run(b.ctx)
		b.handleResult(result)
		b.pool.Release(dog)

		// Check queue for pending warrants
		if next := b.warrantQueue.Pop(); next != nil {
			b.HandleWarrant(next)
		}
	}()

	return nil
}
```
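
`HandleWarrant` relies on a queue whose `Pop` returns nil when empty and which is touched from several goroutines. A minimal in-memory sketch of such a queue is below; the `WarrantQueue` type is an assumption (the directory layout suggests pending warrants may also be persisted under `warrants/`), and `Warrant` is reduced to its ID for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// Warrant is reduced to its ID for this sketch.
type Warrant struct{ ID string }

// WarrantQueue is a FIFO that is safe for concurrent use.
type WarrantQueue struct {
	mu    sync.Mutex
	items []*Warrant
}

// Push appends a warrant to the back of the queue.
func (q *WarrantQueue) Push(w *Warrant) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, w)
}

// Pop removes and returns the oldest warrant, or nil when the queue is empty.
func (q *WarrantQueue) Pop() *Warrant {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil
	}
	w := q.items[0]
	q.items[0] = nil // drop the reference so it can be collected
	q.items = q.items[1:]
	return w
}

func main() {
	q := &WarrantQueue{}
	q.Push(&Warrant{ID: "gt-abc"})
	q.Push(&Warrant{ID: "gt-def"})
	for w := q.Pop(); w != nil; w = q.Pop() {
		fmt.Println(w.ID)
	}
}
```

FIFO order means the oldest queued warrant is the first to get a dog when one is released.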

## Directory Structure

```
~/gt/deacon/dogs/
├── boot/                    # Boot's working directory
│   ├── CLAUDE.md            # Boot context
│   └── .boot-status.json    # Boot execution status
├── active/                  # Active dog state files
│   ├── dog-123.json         # Dog 1 state
│   ├── dog-456.json         # Dog 2 state
│   └── ...
├── completed/               # Completed dance records (for audit)
│   ├── dog-789.json         # Historical record
│   └── ...
└── warrants/                # Pending warrant queue
    ├── warrant-abc.json
    └── ...
```

## Command Interface

```bash
# Pool status
gt dog pool status
# Output:
#   Dog Pool: 2/5 active
#     dog-123: interrogating Toast (attempt 2, 45s remaining)
#     dog-456: executing Shadow
#     dog-789: idle

# Manual dog operations (for debugging)
gt dog pool allocate <warrant-id>
gt dog pool release <dog-id>

# View active dances
gt dog dances
# Output:
#   Active Shutdown Dances:
#     dog-123 → Toast: Interrogating (2/3), timeout in 45s
#     dog-456 → Shadow: Executing warrant

# View warrant queue
gt dog warrants
# Output:
#   Pending Warrants: 2
#     1. gt-abc: witness-gastown (stuck_no_progress)
#     2. gt-def: polecat-Copper (crash_loop)
```

## Integration with Existing Dogs

The existing `dog` package (`internal/dog/`) manages Deacon's multi-rig helper dogs.
Those are different from shutdown-dance dogs:

| Aspect    | Helper Dogs (existing)   | Dance Dogs (new)         |
|-----------|--------------------------|--------------------------|
| Purpose   | Cross-rig infrastructure | Shutdown dance execution |
| Sessions  | Claude sessions          | Goroutines (no Claude)   |
| Worktrees | One per rig              | None                     |
| Lifecycle | Long-lived, reusable     | Ephemeral per warrant    |
| State     | idle/working             | Dance state machine      |

**Recommendation**: Use a different package to avoid confusion:
- `internal/dog/` - existing helper dogs
- `internal/shutdown/` - shutdown dance pool

## Summary: Answers to Design Questions

| Question | Answer |
|----------|--------|
| How many Dogs in pool? | Fixed: 5 (configurable via GT_DOG_POOL_SIZE) |
| How do Dogs communicate with Boot? | State files + completion markers |
| Are Dogs tmux sessions? | No - goroutines with state machine |
| Reuse polecat infrastructure? | No - too heavyweight, different model |
| What if Dog dies mid-dance? | State file recovery on Boot restart |

## Acceptance Criteria

- [x] Architecture document for Dog pool
- [x] Clear allocation/deallocation protocol
- [x] Failure handling for Dog crashes