docs: Add watchdog chain documentation for Boot/Deacon lifecycle (gt-1847v)

Creates docs/watchdog-chain.md explaining the Daemon/Boot/Deacon architecture:
- Why two agents (Boot is ephemeral triage, Deacon is persistent patrol)
- Session ownership (gt-deacon-boot vs gt-deacon)
- Heartbeat mechanics and freshness thresholds
- Boot decision matrix (start/wake/nudge/nothing)
- Design decision: keep separation, fix implementation bugs

Cross-references added to operational-state.md and understanding-gas-town.md.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
morsov
2026-01-02 18:54:32 -08:00
committed by Steve Yegge
parent 27618e5c2e
commit ea7f434e81
3 changed files with 308 additions and 1 deletions


@@ -102,6 +102,9 @@ bd list --type=role --label=mode:degraded
## Boot: The Deacon's Watchdog
> See [Watchdog Chain](watchdog-chain.md) for the complete Daemon/Boot/Deacon
> architecture and design rationale.
Boot is a dog (Deacon helper) that triages the Deacon's health. The daemon pokes
Boot instead of the Deacon directly, centralizing the "when to wake" decision in
an agent that can reason about it.


@@ -27,7 +27,7 @@ These roles manage the Gas Town system itself:
| Role | Description | Lifecycle |
|------|-------------|-----------|
| **Mayor** | Global coordinator at town root | Singleton, persistent |
| **Deacon** | Background supervisor daemon | Singleton, persistent |
| **Deacon** | Background supervisor daemon ([watchdog chain](watchdog-chain.md)) | Singleton, persistent |
| **Witness** | Per-rig polecat lifecycle manager | One per rig, persistent |
| **Refinery** | Per-rig merge queue processor | One per rig, persistent |

docs/watchdog-chain.md

@@ -0,0 +1,304 @@
# Daemon/Boot/Deacon Watchdog Chain
> Autonomous health monitoring and recovery in Gas Town.
## Overview
Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
```
Daemon (Go process)                      ← Dumb transport, 3-min heartbeat
└─► Boot (AI agent)                      ← Intelligent triage, fresh each tick
    └─► Deacon (AI agent)                ← Continuous patrol, long-running
        └─► Witnesses & Refineries       ← Per-rig agents
```
**Key insight**: The daemon is mechanical (can't reason), but health decisions need
intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
## Design Rationale: Why Two Agents?
### The Problem
The daemon needs to ensure the Deacon is healthy, but:
1. **Daemon can't reason** - It's Go code following the ZFC principle (don't reason
about other agents). It can check "is session alive?" but not "is agent stuck?"
2. **Waking costs context** - Each time you spawn an AI agent, you consume context
tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
3. **Observation requires intelligence** - Distinguishing "agent composing large
artifact" from "agent hung on tool prompt" requires reasoning.
### The Solution: Boot as Triage
Boot is a narrow, ephemeral AI agent that:
- Runs fresh each daemon tick (no accumulated context debt)
- Makes a single decision: should Deacon wake?
- Exits immediately after deciding
This gives us intelligent triage without the cost of keeping a full AI running.
### Why Not Merge Boot into Deacon?
We could have Deacon handle its own "should I be awake?" logic, but:
1. **Deacon can't observe itself** - A hung Deacon can't detect it's hung
2. **Context accumulation** - Deacon runs continuously; Boot restarts fresh
3. **Cost in idle towns** - Boot only costs tokens when it runs; Deacon costs
tokens constantly if kept alive
### Why Not Replace with Go Code?
The daemon could directly monitor agents without AI, but:
1. **Can't observe panes** - Go code can't interpret tmux output semantically
2. **Can't distinguish stuck vs working** - No reasoning about agent state
3. **Escalation is complex** - When to notify? When to force-restart? AI handles
nuanced decisions better than hardcoded thresholds
## Session Ownership
| Agent | Session Name | Location | Lifecycle |
|-------|--------------|----------|-----------|
| Daemon | (Go process) | `~/gt/daemon/` | Persistent, auto-restart |
| Boot | `gt-deacon-boot` | `~/gt/deacon/dogs/boot/` | Ephemeral, fresh each tick |
| Deacon | `gt-deacon` | `~/gt/deacon/` | Long-running, handoff loop |
**Critical**: Boot runs in `gt-deacon-boot`, NOT `gt-deacon`. This prevents Boot
from conflicting with a running Deacon session.
## Heartbeat Mechanics
### Daemon Heartbeat (3 minutes)
The daemon runs a heartbeat tick every 3 minutes:
```go
func (d *Daemon) heartbeatTick() {
	d.ensureBootRunning()        // 1. Spawn Boot for triage
	d.checkDeaconHeartbeat()     // 2. Belt-and-suspenders fallback
	d.ensureWitnessesRunning()   // 3. Witness health
	d.triggerPendingSpawns()     // 4. Bootstrap polecats
	d.processLifecycleRequests() // 5. Cycle/restart requests
	d.checkStaleAgents()         // 6. Timeout detection
	// ... more checks
}
```
### Deacon Heartbeat (continuous)
The Deacon updates `~/gt/deacon/heartbeat.json` at the start of each patrol cycle:
```json
{
  "timestamp": "2026-01-02T18:30:00Z",
  "cycle": 42,
  "last_action": "health-scan",
  "healthy_agents": 3,
  "unhealthy_agents": 0
}
```
### Heartbeat Freshness
| Age | State | Boot Action |
|-----|-------|-------------|
| < 5 min | Fresh | Nothing (Deacon active) |
| 5-15 min | Stale | Nudge if pending mail |
| > 15 min | Very stale | Wake (Deacon may be stuck) |
## Boot Decision Matrix
When Boot runs, it observes:
- Is Deacon session alive?
- How old is Deacon's heartbeat?
- Is there pending mail for Deacon?
- What's in Deacon's tmux pane?
Then decides:
| Condition | Action | Command |
|-----------|--------|---------|
| Session dead | START | Exit; daemon calls `ensureDeaconRunning()` |
| Heartbeat > 15 min | WAKE | `gt nudge deacon "Boot wake: check your inbox"` |
| Heartbeat 5-15 min + mail | NUDGE | `gt nudge deacon "Boot check-in: pending work"` |
| Heartbeat fresh | NOTHING | Exit silently |
## Handoff Flow
### Deacon Handoff
The Deacon runs continuous patrol cycles. After N cycles, or when its context usage is high, it hands off:
```
End of patrol cycle:
├─ Squash wisp to digest (ephemeral → permanent)
├─ Write summary to molecule state
└─ gt handoff -s "Routine cycle" -m "Details"
   └─ Creates mail for next session
```
Next daemon tick:
```
Daemon → ensureDeaconRunning()
└─ Spawns fresh Deacon in gt-deacon
   └─ SessionStart hook: gt mail check --inject
      └─ Previous handoff mail injected
         └─ Deacon reads and continues
```
### Boot Handoff (Rare)
Boot is ephemeral - it exits after each tick. No persistent handoff needed.
However, Boot uses a marker file to prevent double-spawning:
- Marker: `~/gt/deacon/dogs/boot/.boot-running` (TTL: 5 minutes)
- Status: `~/gt/deacon/dogs/boot/.boot-status.json` (last action/result)
If the marker exists and is recent, daemon skips Boot spawn for that tick.
## Degraded Mode
When tmux is unavailable, Gas Town enters degraded mode:
| Capability | Normal | Degraded |
|------------|--------|----------|
| Boot runs | As AI in tmux | As Go code (mechanical) |
| Observe panes | Yes | No |
| Nudge agents | Yes | No |
| Start agents | tmux sessions | Direct spawn |
Degraded Boot triage is purely mechanical:
- Session dead → start
- Heartbeat stale → restart
- No reasoning, just thresholds
## Fallback Chain
Multiple layers ensure recovery:
1. **Boot triage** - Intelligent observation, first line
2. **Daemon checkDeaconHeartbeat()** - Belt-and-suspenders if Boot fails
3. **Daemon checkStaleAgents()** - Timeout-based detection
4. **Human escalation** - Mail to overseer for unrecoverable states
## State Files
| File | Purpose | Updated By |
|------|---------|-----------|
| `deacon/heartbeat.json` | Deacon freshness | Deacon (each cycle) |
| `deacon/dogs/boot/.boot-running` | Boot in-progress marker | Boot spawn |
| `deacon/dogs/boot/.boot-status.json` | Boot last action | Boot triage |
| `deacon/health-check-state.json` | Agent health tracking | `gt deacon health-check` |
| `daemon/daemon.log` | Daemon activity | Daemon |
| `daemon/daemon.pid` | Daemon process ID | Daemon startup |
## Debugging
```bash
# Check Deacon heartbeat
jq . ~/gt/deacon/heartbeat.json
# Check Boot status
jq . ~/gt/deacon/dogs/boot/.boot-status.json
# View daemon log
tail -f ~/gt/daemon/daemon.log
# Manual Boot run
gt boot triage
# Manual Deacon health check
gt deacon health-check
```
## Common Issues
### Boot Spawns in Wrong Session
**Symptom**: Boot runs in `gt-deacon` instead of `gt-deacon-boot`
**Cause**: Session name confusion in spawn code
**Fix**: Ensure `gt boot triage` specifies `--session=gt-deacon-boot`
### Zombie Sessions Block Restart
**Symptom**: tmux session exists but Claude is dead
**Cause**: Daemon checks session existence, not process health
**Fix**: Kill zombie sessions before recreating: `gt session kill gt-deacon`
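The distinction between "session exists" and "agent process is alive" can be made explicit in code. The sketch below is illustrative and not the daemon's implementation. `classifySession` and `ensureSteps` are hypothetical names, and the two boolean probes are assumed to come from separate tmux and process checks that are not shown here.

```go
package main

import "fmt"

// SessionState is an illustrative three-way classification of a session.
type SessionState string

const (
	Absent  SessionState = "absent"
	Healthy SessionState = "healthy"
	Zombie  SessionState = "zombie"
)

// classifySession combines two independent probes: whether the tmux session
// exists, and whether the agent process inside it is alive.
func classifySession(sessionExists, agentAlive bool) SessionState {
	switch {
	case !sessionExists:
		return Absent
	case agentAlive:
		return Healthy
	default:
		return Zombie
	}
}

// ensureSteps lists what an "ensure running" call would do for each state.
// A zombie is killed before starting, so the stale session cannot block it.
func ensureSteps(s SessionState) []string {
	switch s {
	case Healthy:
		return nil
	case Zombie:
		return []string{"kill", "start"}
	default:
		return []string{"start"}
	}
}

func main() {
	probes := [][2]bool{{false, false}, {true, true}, {true, false}}
	for _, p := range probes {
		state := classifySession(p[0], p[1])
		fmt.Println(state, ensureSteps(state))
	}
}
```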
### Status Shows Wrong State
**Symptom**: `gt status` shows "stopped" for running agents
**Cause**: Bead state and tmux state diverged
**Fix**: Reconcile with `gt sync-status` or restart agent
## Design Decision: Keep Separation
The issue [gt-1847v] considered three options:
### Option A: Keep Boot/Deacon Separation (CHOSEN)
- Boot is ephemeral, spawns fresh each heartbeat
- Boot runs in `gt-deacon-boot`, exits after triage
- Deacon runs in `gt-deacon`, continuous patrol
- Clear session boundaries, clear lifecycle
**Verdict**: This is the correct design. The implementation needs fixing, not the architecture.
### Option B: Merge Boot into Deacon (Rejected)
- Single `gt-deacon` session handles everything
- Deacon checks "should I be awake?" internally
**Why rejected**:
- Deacon can't observe itself (hung Deacon can't detect hang)
- Context accumulates even when idle (cost in quiet towns)
- No external watchdog means no recovery from Deacon failure
### Option C: Replace with Go Watchdog (Rejected)
- Daemon directly monitors witness/refinery
- No Boot, no Deacon AI for health checks
- AI agents only for complex decisions
**Why rejected**:
- Go code can't interpret tmux pane output semantically
- Can't distinguish "stuck" from "thinking deeply"
- Loses the intelligent triage that makes the system resilient
- Escalation decisions are nuanced (when to notify? force-restart?)
### Implementation Fixes Needed
The separation is correct; these bugs need fixing:
1. **Session confusion** (gt-sgzsb): Boot spawns in wrong session
2. **Zombie blocking** (gt-j1i0r): Daemon can't kill zombie sessions
3. **Status mismatch** (gt-doih4): Bead vs tmux state divergence
4. **Ensure semantics** (gt-ekc5u): Start should kill zombies first
## Summary
The watchdog chain provides autonomous recovery:
- **Daemon**: Mechanical heartbeat, spawns Boot
- **Boot**: Intelligent triage, decides Deacon fate
- **Deacon**: Continuous patrol, monitors workers
Boot exists because the daemon can't reason and Deacon can't observe itself.
The separation costs complexity but enables:
1. **Intelligent triage** without constant AI cost
2. **Fresh context** for each triage decision
3. **Graceful degradation** when tmux is unavailable
4. **Multiple fallback layers** for reliability