docs: Add watchdog chain documentation for Boot/Deacon lifecycle (gt-1847v)
Creates docs/watchdog-chain.md explaining the Daemon/Boot/Deacon architecture: - Why two agents (Boot is ephemeral triage, Deacon is persistent patrol) - Session ownership (gt-deacon-boot vs gt-deacon) - Heartbeat mechanics and freshness thresholds - Boot decision matrix (start/wake/nudge/nothing) - Design decision: keep separation, fix implementation bugs Cross-references added to operational-state.md and understanding-gas-town.md. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -102,6 +102,9 @@ bd list --type=role --label=mode:degraded
|
||||
|
||||
## Boot: The Deacon's Watchdog
|
||||
|
||||
> See [Watchdog Chain](watchdog-chain.md) for the complete Daemon/Boot/Deacon
|
||||
> architecture and design rationale.
|
||||
|
||||
Boot is a dog (Deacon helper) that triages the Deacon's health. The daemon pokes
|
||||
Boot instead of the Deacon directly, centralizing the "when to wake" decision in
|
||||
an agent that can reason about it.
|
||||
|
||||
@@ -27,7 +27,7 @@ These roles manage the Gas Town system itself:
|
||||
| Role | Description | Lifecycle |
|
||||
|------|-------------|-----------|
|
||||
| **Mayor** | Global coordinator at town root | Singleton, persistent |
|
||||
| **Deacon** | Background supervisor daemon | Singleton, persistent |
|
||||
| **Deacon** | Background supervisor daemon ([watchdog chain](watchdog-chain.md)) | Singleton, persistent |
|
||||
| **Witness** | Per-rig polecat lifecycle manager | One per rig, persistent |
|
||||
| **Refinery** | Per-rig merge queue processor | One per rig, persistent |
|
||||
|
||||
|
||||
304
docs/watchdog-chain.md
Normal file
304
docs/watchdog-chain.md
Normal file
@@ -0,0 +1,304 @@
|
||||
# Daemon/Boot/Deacon Watchdog Chain
|
||||
|
||||
> Autonomous health monitoring and recovery in Gas Town.
|
||||
|
||||
## Overview
|
||||
|
||||
Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
|
||||
|
||||
```
|
||||
Daemon (Go process) ← Dumb transport, 3-min heartbeat
|
||||
│
|
||||
└─► Boot (AI agent) ← Intelligent triage, fresh each tick
|
||||
│
|
||||
└─► Deacon (AI agent) ← Continuous patrol, long-running
|
||||
│
|
||||
└─► Witnesses & Refineries ← Per-rig agents
|
||||
```
|
||||
|
||||
**Key insight**: The daemon is mechanical (can't reason), but health decisions need
|
||||
intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
|
||||
|
||||
## Design Rationale: Why Two Agents?
|
||||
|
||||
### The Problem
|
||||
|
||||
The daemon needs to ensure the Deacon is healthy, but:
|
||||
|
||||
1. **Daemon can't reason** - It's Go code following the ZFC principle (don't reason
|
||||
about other agents). It can check "is session alive?" but not "is agent stuck?"
|
||||
|
||||
2. **Waking costs context** - Each time you spawn an AI agent, you consume context
|
||||
tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
|
||||
|
||||
3. **Observation requires intelligence** - Distinguishing "agent composing large
|
||||
artifact" from "agent hung on tool prompt" requires reasoning.
|
||||
|
||||
### The Solution: Boot as Triage
|
||||
|
||||
Boot is a narrow, ephemeral AI agent that:
|
||||
- Runs fresh each daemon tick (no accumulated context debt)
|
||||
- Makes a single decision: should Deacon wake?
|
||||
- Exits immediately after deciding
|
||||
|
||||
This gives us intelligent triage without the cost of keeping a full AI running.
|
||||
|
||||
### Why Not Merge Boot into Deacon?
|
||||
|
||||
We could have Deacon handle its own "should I be awake?" logic, but:
|
||||
|
||||
1. **Deacon can't observe itself** - A hung Deacon can't detect it's hung
|
||||
2. **Context accumulation** - Deacon runs continuously; Boot restarts fresh
|
||||
3. **Cost in idle towns** - Boot only costs tokens when it runs; Deacon costs
|
||||
tokens constantly if kept alive
|
||||
|
||||
### Why Not Replace with Go Code?
|
||||
|
||||
The daemon could directly monitor agents without AI, but:
|
||||
|
||||
1. **Can't observe panes** - Go code can't interpret tmux output semantically
|
||||
2. **Can't distinguish stuck vs working** - No reasoning about agent state
|
||||
3. **Escalation is complex** - When to notify? When to force-restart? AI handles
|
||||
nuanced decisions better than hardcoded thresholds
|
||||
|
||||
## Session Ownership
|
||||
|
||||
| Agent | Session Name | Location | Lifecycle |
|
||||
|-------|--------------|----------|-----------|
|
||||
| Daemon | (Go process) | `~/gt/daemon/` | Persistent, auto-restart |
|
||||
| Boot | `gt-deacon-boot` | `~/gt/deacon/dogs/boot/` | Ephemeral, fresh each tick |
|
||||
| Deacon | `gt-deacon` | `~/gt/deacon/` | Long-running, handoff loop |
|
||||
|
||||
**Critical**: Boot runs in `gt-deacon-boot`, NOT `gt-deacon`. This prevents Boot
|
||||
from conflicting with a running Deacon session.
|
||||
|
||||
## Heartbeat Mechanics
|
||||
|
||||
### Daemon Heartbeat (3 minutes)
|
||||
|
||||
The daemon runs a heartbeat tick every 3 minutes:
|
||||
|
||||
```go
|
||||
func (d *Daemon) heartbeatTick() {
|
||||
d.ensureBootRunning() // 1. Spawn Boot for triage
|
||||
d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback
|
||||
d.ensureWitnessesRunning() // 3. Witness health
|
||||
d.triggerPendingSpawns() // 4. Bootstrap polecats
|
||||
d.processLifecycleRequests() // 5. Cycle/restart requests
|
||||
d.checkStaleAgents() // 6. Timeout detection
|
||||
// ... more checks
|
||||
}
|
||||
```
|
||||
|
||||
### Deacon Heartbeat (continuous)
|
||||
|
||||
The Deacon updates `~/gt/deacon/heartbeat.json` at the start of each patrol cycle:
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2026-01-02T18:30:00Z",
|
||||
"cycle": 42,
|
||||
"last_action": "health-scan",
|
||||
"healthy_agents": 3,
|
||||
"unhealthy_agents": 0
|
||||
}
|
||||
```
|
||||
|
||||
### Heartbeat Freshness
|
||||
|
||||
| Age | State | Boot Action |
|
||||
|-----|-------|-------------|
|
||||
| < 5 min | Fresh | Nothing (Deacon active) |
|
||||
| 5-15 min | Stale | Nudge if pending mail |
|
||||
| > 15 min | Very stale | Wake (Deacon may be stuck) |
|
||||
|
||||
## Boot Decision Matrix
|
||||
|
||||
When Boot runs, it observes:
|
||||
- Is Deacon session alive?
|
||||
- How old is Deacon's heartbeat?
|
||||
- Is there pending mail for Deacon?
|
||||
- What's in Deacon's tmux pane?
|
||||
|
||||
Then decides:
|
||||
|
||||
| Condition | Action | Command |
|
||||
|-----------|--------|---------|
|
||||
| Session dead | START | Exit; daemon calls `ensureDeaconRunning()` |
|
||||
| Heartbeat > 15 min | WAKE | `gt nudge deacon "Boot wake: check your inbox"` |
|
||||
| Heartbeat 5-15 min + mail | NUDGE | `gt nudge deacon "Boot check-in: pending work"` |
|
||||
| Heartbeat fresh | NOTHING | Exit silently |
|
||||
|
||||
## Handoff Flow
|
||||
|
||||
### Deacon Handoff
|
||||
|
||||
The Deacon runs continuous patrol cycles. After N cycles or high context:
|
||||
|
||||
```
|
||||
End of patrol cycle:
|
||||
│
|
||||
├─ Squash wisp to digest (ephemeral → permanent)
|
||||
├─ Write summary to molecule state
|
||||
└─ gt handoff -s "Routine cycle" -m "Details"
|
||||
│
|
||||
└─ Creates mail for next session
|
||||
```
|
||||
|
||||
Next daemon tick:
|
||||
```
|
||||
Daemon → ensureDeaconRunning()
|
||||
│
|
||||
└─ Spawns fresh Deacon in gt-deacon
|
||||
│
|
||||
└─ SessionStart hook: gt mail check --inject
|
||||
│
|
||||
└─ Previous handoff mail injected
|
||||
│
|
||||
└─ Deacon reads and continues
|
||||
```
|
||||
|
||||
### Boot Handoff (Rare)
|
||||
|
||||
Boot is ephemeral - it exits after each tick. No persistent handoff needed.
|
||||
|
||||
However, Boot uses a marker file to prevent double-spawning:
|
||||
- Marker: `~/gt/deacon/dogs/boot/.boot-running` (TTL: 5 minutes)
|
||||
- Status: `~/gt/deacon/dogs/boot/.boot-status.json` (last action/result)
|
||||
|
||||
If the marker exists and is recent, daemon skips Boot spawn for that tick.
|
||||
|
||||
## Degraded Mode
|
||||
|
||||
When tmux is unavailable, Gas Town enters degraded mode:
|
||||
|
||||
| Capability | Normal | Degraded |
|
||||
|------------|--------|----------|
|
||||
| Boot runs | As AI in tmux | As Go code (mechanical) |
|
||||
| Observe panes | Yes | No |
|
||||
| Nudge agents | Yes | No |
|
||||
| Start agents | tmux sessions | Direct spawn |
|
||||
|
||||
Degraded Boot triage is purely mechanical:
|
||||
- Session dead → start
|
||||
- Heartbeat stale → restart
|
||||
- No reasoning, just thresholds
|
||||
|
||||
## Fallback Chain
|
||||
|
||||
Multiple layers ensure recovery:
|
||||
|
||||
1. **Boot triage** - Intelligent observation, first line
|
||||
2. **Daemon checkDeaconHeartbeat()** - Belt-and-suspenders if Boot fails
|
||||
3. **Daemon checkStaleAgents()** - Timeout-based detection
|
||||
4. **Human escalation** - Mail to overseer for unrecoverable states
|
||||
|
||||
## State Files
|
||||
|
||||
| File | Purpose | Updated By |
|
||||
|------|---------|-----------|
|
||||
| `deacon/heartbeat.json` | Deacon freshness | Deacon (each cycle) |
|
||||
| `deacon/dogs/boot/.boot-running` | Boot in-progress marker | Boot spawn |
|
||||
| `deacon/dogs/boot/.boot-status.json` | Boot last action | Boot triage |
|
||||
| `deacon/health-check-state.json` | Agent health tracking | `gt deacon health-check` |
|
||||
| `daemon/daemon.log` | Daemon activity | Daemon |
|
||||
| `daemon/daemon.pid` | Daemon process ID | Daemon startup |
|
||||
|
||||
## Debugging
|
||||
|
||||
```bash
|
||||
# Check Deacon heartbeat
|
||||
cat ~/gt/deacon/heartbeat.json | jq .
|
||||
|
||||
# Check Boot status
|
||||
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
|
||||
|
||||
# View daemon log
|
||||
tail -f ~/gt/daemon/daemon.log
|
||||
|
||||
# Manual Boot run
|
||||
gt boot triage
|
||||
|
||||
# Manual Deacon health check
|
||||
gt deacon health-check
|
||||
```
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Boot Spawns in Wrong Session
|
||||
|
||||
**Symptom**: Boot runs in `gt-deacon` instead of `gt-deacon-boot`
|
||||
**Cause**: Session name confusion in spawn code
|
||||
**Fix**: Ensure `gt boot triage` specifies `--session=gt-deacon-boot`
|
||||
|
||||
### Zombie Sessions Block Restart
|
||||
|
||||
**Symptom**: tmux session exists but Claude is dead
|
||||
**Cause**: Daemon checks session existence, not process health
|
||||
**Fix**: Kill zombie sessions before recreating: `gt session kill gt-deacon`
|
||||
|
||||
### Status Shows Wrong State
|
||||
|
||||
**Symptom**: `gt status` shows "stopped" for running agents
|
||||
**Cause**: Bead state and tmux state diverged
|
||||
**Fix**: Reconcile with `gt sync-status` or restart agent
|
||||
|
||||
## Design Decision: Keep Separation
|
||||
|
||||
The issue [gt-1847v] considered three options:
|
||||
|
||||
### Option A: Keep Boot/Deacon Separation (CHOSEN)
|
||||
|
||||
- Boot is ephemeral, spawns fresh each heartbeat
|
||||
- Boot runs in `gt-deacon-boot`, exits after triage
|
||||
- Deacon runs in `gt-deacon`, continuous patrol
|
||||
- Clear session boundaries, clear lifecycle
|
||||
|
||||
**Verdict**: This is the correct design. The implementation needs fixing, not the architecture.
|
||||
|
||||
### Option B: Merge Boot into Deacon (Rejected)
|
||||
|
||||
- Single `gt-deacon` session handles everything
|
||||
- Deacon checks "should I be awake?" internally
|
||||
|
||||
**Why rejected**:
|
||||
- Deacon can't observe itself (hung Deacon can't detect hang)
|
||||
- Context accumulates even when idle (cost in quiet towns)
|
||||
- No external watchdog means no recovery from Deacon failure
|
||||
|
||||
### Option C: Replace with Go Watchdog (Rejected)
|
||||
|
||||
- Daemon directly monitors witness/refinery
|
||||
- No Boot, no Deacon AI for health checks
|
||||
- AI agents only for complex decisions
|
||||
|
||||
**Why rejected**:
|
||||
- Go code can't interpret tmux pane output semantically
|
||||
- Can't distinguish "stuck" from "thinking deeply"
|
||||
- Loses the intelligent triage that makes the system resilient
|
||||
- Escalation decisions are nuanced (when to notify? force-restart?)
|
||||
|
||||
### Implementation Fixes Needed
|
||||
|
||||
The separation is correct; these bugs need fixing:
|
||||
|
||||
1. **Session confusion** (gt-sgzsb): Boot spawns in wrong session
|
||||
2. **Zombie blocking** (gt-j1i0r): Daemon can't kill zombie sessions
|
||||
3. **Status mismatch** (gt-doih4): Bead vs tmux state divergence
|
||||
4. **Ensure semantics** (gt-ekc5u): Start should kill zombies first
|
||||
|
||||
## Summary
|
||||
|
||||
The watchdog chain provides autonomous recovery:
|
||||
|
||||
- **Daemon**: Mechanical heartbeat, spawns Boot
|
||||
- **Boot**: Intelligent triage, decides Deacon fate
|
||||
- **Deacon**: Continuous patrol, monitors workers
|
||||
|
||||
Boot exists because the daemon can't reason and Deacon can't observe itself.
|
||||
The separation costs complexity but enables:
|
||||
|
||||
1. **Intelligent triage** without constant AI cost
|
||||
2. **Fresh context** for each triage decision
|
||||
3. **Graceful degradation** when tmux unavailable
|
||||
4. **Multiple fallback** layers for reliability
|
||||
Reference in New Issue
Block a user