From ea7f434e813c82f97fc81f3f19d396b20e700fff Mon Sep 17 00:00:00 2001
From: morsov
Date: Fri, 2 Jan 2026 18:54:32 -0800
Subject: [PATCH] docs: Add watchdog chain documentation for Boot/Deacon lifecycle (gt-1847v)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Creates docs/watchdog-chain.md explaining the Daemon/Boot/Deacon architecture:
- Why two agents (Boot is ephemeral triage, Deacon is persistent patrol)
- Session ownership (gt-deacon-boot vs gt-deacon)
- Heartbeat mechanics and freshness thresholds
- Boot decision matrix (start/wake/nudge/nothing)
- Design decision: keep separation, fix implementation bugs

Cross-references added to operational-state.md and understanding-gas-town.md.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 docs/operational-state.md      |   3 +
 docs/understanding-gas-town.md |   2 +-
 docs/watchdog-chain.md         | 304 +++++++++++++++++++++++++++++++++
 3 files changed, 308 insertions(+), 1 deletion(-)
 create mode 100644 docs/watchdog-chain.md

diff --git a/docs/operational-state.md b/docs/operational-state.md
index 61dc28bf..f786d9d2 100644
--- a/docs/operational-state.md
+++ b/docs/operational-state.md
@@ -102,6 +102,9 @@ bd list --type=role --label=mode:degraded
 
 ## Boot: The Deacon's Watchdog
 
+> See [Watchdog Chain](watchdog-chain.md) for the complete Daemon/Boot/Deacon
+> architecture and design rationale.
+
 Boot is a dog (Deacon helper) that triages the Deacon's health. The daemon pokes Boot
 instead of the Deacon directly, centralizing the "when to wake" decision in an agent
 that can reason about it.
diff --git a/docs/understanding-gas-town.md b/docs/understanding-gas-town.md
index 84267486..f73b4540 100644
--- a/docs/understanding-gas-town.md
+++ b/docs/understanding-gas-town.md
@@ -27,7 +27,7 @@ These roles manage the Gas Town system itself:
 | Role | Description | Lifecycle |
 |------|-------------|-----------|
 | **Mayor** | Global coordinator at town root | Singleton, persistent |
-| **Deacon** | Background supervisor daemon | Singleton, persistent |
+| **Deacon** | Background supervisor daemon ([watchdog chain](watchdog-chain.md)) | Singleton, persistent |
 | **Witness** | Per-rig polecat lifecycle manager | One per rig, persistent |
 | **Refinery** | Per-rig merge queue processor | One per rig, persistent |
 
diff --git a/docs/watchdog-chain.md b/docs/watchdog-chain.md
new file mode 100644
index 00000000..0480a9e9
--- /dev/null
+++ b/docs/watchdog-chain.md
@@ -0,0 +1,304 @@
+# Daemon/Boot/Deacon Watchdog Chain
+
+> Autonomous health monitoring and recovery in Gas Town.
+
+## Overview
+
+Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
+
+```
+Daemon (Go process)          ← Dumb transport, 3-min heartbeat
+    │
+    └─► Boot (AI agent)       ← Intelligent triage, fresh each tick
+            │
+            └─► Deacon (AI agent)  ← Continuous patrol, long-running
+                    │
+                    └─► Witnesses & Refineries  ← Per-rig agents
+```
+
+**Key insight**: The daemon is mechanical (can't reason), but health decisions need
+intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
+
+## Design Rationale: Why Two Agents?
+
+### The Problem
+
+The daemon needs to ensure the Deacon is healthy, but:
+
+1. **Daemon can't reason** - It's Go code following the ZFC principle (don't reason
+   about other agents). It can check "is session alive?" but not "is agent stuck?"
+
+2. **Waking costs context** - Each time you spawn an AI agent, you consume context
+   tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
+
+3. **Observation requires intelligence** - Distinguishing "agent composing large
+   artifact" from "agent hung on tool prompt" requires reasoning.
+
+### The Solution: Boot as Triage
+
+Boot is a narrow, ephemeral AI agent that:
+- Runs fresh each daemon tick (no accumulated context debt)
+- Makes a single decision: should Deacon wake?
+- Exits immediately after deciding
+
+This gives us intelligent triage without the cost of keeping a full AI running.
+
+### Why Not Merge Boot into Deacon?
+
+We could have Deacon handle its own "should I be awake?" logic, but:
+
+1. **Deacon can't observe itself** - A hung Deacon can't detect it's hung
+2. **Context accumulation** - Deacon runs continuously; Boot restarts fresh
+3. **Cost in idle towns** - Boot only costs tokens when it runs; Deacon costs
+   tokens constantly if kept alive
+
+### Why Not Replace with Go Code?
+
+The daemon could directly monitor agents without AI, but:
+
+1. **Can't observe panes** - Go code can't interpret tmux output semantically
+2. **Can't distinguish stuck vs working** - No reasoning about agent state
+3. **Escalation is complex** - When to notify? When to force-restart? AI handles
+   nuanced decisions better than hardcoded thresholds
+
+## Session Ownership
+
+| Agent | Session Name | Location | Lifecycle |
+|-------|--------------|----------|-----------|
+| Daemon | (Go process) | `~/gt/daemon/` | Persistent, auto-restart |
+| Boot | `gt-deacon-boot` | `~/gt/deacon/dogs/boot/` | Ephemeral, fresh each tick |
+| Deacon | `gt-deacon` | `~/gt/deacon/` | Long-running, handoff loop |
+
+**Critical**: Boot runs in `gt-deacon-boot`, NOT `gt-deacon`. This prevents Boot
+from conflicting with a running Deacon session.
+
+## Heartbeat Mechanics
+
+### Daemon Heartbeat (3 minutes)
+
+The daemon runs a heartbeat tick every 3 minutes:
+
+```go
+func (d *Daemon) heartbeatTick() {
+    d.ensureBootRunning()          // 1. Spawn Boot for triage
+    d.checkDeaconHeartbeat()       // 2. Belt-and-suspenders fallback
+    d.ensureWitnessesRunning()     // 3. Witness health
+    d.triggerPendingSpawns()       // 4. Bootstrap polecats
+    d.processLifecycleRequests()   // 5. Cycle/restart requests
+    d.checkStaleAgents()           // 6. Timeout detection
+    // ... more checks
+}
+```
+
+### Deacon Heartbeat (continuous)
+
+The Deacon updates `~/gt/deacon/heartbeat.json` at the start of each patrol cycle:
+
+```json
+{
+  "timestamp": "2026-01-02T18:30:00Z",
+  "cycle": 42,
+  "last_action": "health-scan",
+  "healthy_agents": 3,
+  "unhealthy_agents": 0
+}
+```
+
+### Heartbeat Freshness
+
+| Age | State | Boot Action |
+|-----|-------|-------------|
+| < 5 min | Fresh | Nothing (Deacon active) |
+| 5-15 min | Stale | Nudge if pending mail |
+| > 15 min | Very stale | Wake (Deacon may be stuck) |
+
+## Boot Decision Matrix
+
+When Boot runs, it observes:
+- Is Deacon session alive?
+- How old is Deacon's heartbeat?
+- Is there pending mail for Deacon?
+- What's in Deacon's tmux pane?
+
+Then decides:
+
+| Condition | Action | Command |
+|-----------|--------|---------|
+| Session dead | START | Exit; daemon calls `ensureDeaconRunning()` |
+| Heartbeat > 15 min | WAKE | `gt nudge deacon "Boot wake: check your inbox"` |
+| Heartbeat 5-15 min + mail | NUDGE | `gt nudge deacon "Boot check-in: pending work"` |
+| Heartbeat fresh | NOTHING | Exit silently |
+
+## Handoff Flow
+
+### Deacon Handoff
+
+The Deacon runs continuous patrol cycles. After N cycles or high context:
+
+```
+End of patrol cycle:
+    │
+    ├─ Squash wisp to digest (ephemeral → permanent)
+    ├─ Write summary to molecule state
+    └─ gt handoff -s "Routine cycle" -m "Details"
+        │
+        └─ Creates mail for next session
+```
+
+Next daemon tick:
+```
+Daemon → ensureDeaconRunning()
+    │
+    └─ Spawns fresh Deacon in gt-deacon
+        │
+        └─ SessionStart hook: gt mail check --inject
+            │
+            └─ Previous handoff mail injected
+                │
+                └─ Deacon reads and continues
+```
+
+### Boot Handoff (Rare)
+
+Boot is ephemeral - it exits after each tick. No persistent handoff needed.
+
+However, Boot uses a marker file to prevent double-spawning:
+- Marker: `~/gt/deacon/dogs/boot/.boot-running` (TTL: 5 minutes)
+- Status: `~/gt/deacon/dogs/boot/.boot-status.json` (last action/result)
+
+If the marker exists and is recent, daemon skips Boot spawn for that tick.
+
+## Degraded Mode
+
+When tmux is unavailable, Gas Town enters degraded mode:
+
+| Capability | Normal | Degraded |
+|------------|--------|----------|
+| Boot runs | As AI in tmux | As Go code (mechanical) |
+| Observe panes | Yes | No |
+| Nudge agents | Yes | No |
+| Start agents | tmux sessions | Direct spawn |
+
+Degraded Boot triage is purely mechanical:
+- Session dead → start
+- Heartbeat stale → restart
+- No reasoning, just thresholds
+
+## Fallback Chain
+
+Multiple layers ensure recovery:
+
+1. **Boot triage** - Intelligent observation, first line
+2. **Daemon checkDeaconHeartbeat()** - Belt-and-suspenders if Boot fails
+3. **Daemon checkStaleAgents()** - Timeout-based detection
+4. **Human escalation** - Mail to overseer for unrecoverable states
+
+## State Files
+
+| File | Purpose | Updated By |
+|------|---------|-----------|
+| `deacon/heartbeat.json` | Deacon freshness | Deacon (each cycle) |
+| `deacon/dogs/boot/.boot-running` | Boot in-progress marker | Boot spawn |
+| `deacon/dogs/boot/.boot-status.json` | Boot last action | Boot triage |
+| `deacon/health-check-state.json` | Agent health tracking | `gt deacon health-check` |
+| `daemon/daemon.log` | Daemon activity | Daemon |
+| `daemon/daemon.pid` | Daemon process ID | Daemon startup |
+
+## Debugging
+
+```bash
+# Check Deacon heartbeat
+cat ~/gt/deacon/heartbeat.json | jq .
+
+# Check Boot status
+cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
+
+# View daemon log
+tail -f ~/gt/daemon/daemon.log
+
+# Manual Boot run
+gt boot triage
+
+# Manual Deacon health check
+gt deacon health-check
+```
+
+## Common Issues
+
+### Boot Spawns in Wrong Session
+
+**Symptom**: Boot runs in `gt-deacon` instead of `gt-deacon-boot`
+**Cause**: Session name confusion in spawn code
+**Fix**: Ensure `gt boot triage` specifies `--session=gt-deacon-boot`
+
+### Zombie Sessions Block Restart
+
+**Symptom**: tmux session exists but Claude is dead
+**Cause**: Daemon checks session existence, not process health
+**Fix**: Kill zombie sessions before recreating: `gt session kill gt-deacon`
+
+### Status Shows Wrong State
+
+**Symptom**: `gt status` shows "stopped" for running agents
+**Cause**: Bead state and tmux state diverged
+**Fix**: Reconcile with `gt sync-status` or restart agent
+
+## Design Decision: Keep Separation
+
+The issue [gt-1847v] considered three options:
+
+### Option A: Keep Boot/Deacon Separation (CHOSEN)
+
+- Boot is ephemeral, spawns fresh each heartbeat
+- Boot runs in `gt-deacon-boot`, exits after triage
+- Deacon runs in `gt-deacon`, continuous patrol
+- Clear session boundaries, clear lifecycle
+
+**Verdict**: This is the correct design. The implementation needs fixing, not the architecture.
+
+### Option B: Merge Boot into Deacon (Rejected)
+
+- Single `gt-deacon` session handles everything
+- Deacon checks "should I be awake?" internally
+
+**Why rejected**:
+- Deacon can't observe itself (hung Deacon can't detect hang)
+- Context accumulates even when idle (cost in quiet towns)
+- No external watchdog means no recovery from Deacon failure
+
+### Option C: Replace with Go Watchdog (Rejected)
+
+- Daemon directly monitors witness/refinery
+- No Boot, no Deacon AI for health checks
+- AI agents only for complex decisions
+
+**Why rejected**:
+- Go code can't interpret tmux pane output semantically
+- Can't distinguish "stuck" from "thinking deeply"
+- Loses the intelligent triage that makes the system resilient
+- Escalation decisions are nuanced (when to notify? force-restart?)
+
+### Implementation Fixes Needed
+
+The separation is correct; these bugs need fixing:
+
+1. **Session confusion** (gt-sgzsb): Boot spawns in wrong session
+2. **Zombie blocking** (gt-j1i0r): Daemon can't kill zombie sessions
+3. **Status mismatch** (gt-doih4): Bead vs tmux state divergence
+4. **Ensure semantics** (gt-ekc5u): Start should kill zombies first
+
+## Summary
+
+The watchdog chain provides autonomous recovery:
+
+- **Daemon**: Mechanical heartbeat, spawns Boot
+- **Boot**: Intelligent triage, decides Deacon fate
+- **Deacon**: Continuous patrol, monitors workers
+
+Boot exists because the daemon can't reason and Deacon can't observe itself.
+The separation costs complexity but enables:
+
+1. **Intelligent triage** without constant AI cost
+2. **Fresh context** for each triage decision
+3. **Graceful degradation** when tmux unavailable
+4. **Multiple fallback** layers for reliability