Additional cleanup from the agent_state refactoring: - Remove dead code: checkStaleAgents(), markAgentDead() in lifecycle.go - Remove dead code: reportAgentState(), getAgentFields() in prime.go - Update getAgentBeadState() comment to clarify non-observable states only - Update mol-witness-patrol.formula.toml to use tmux discovery - Update mol-polecat-lease.formula.toml to use POLECAT_DONE mail - Update docs/watchdog-chain.md to reflect new architecture 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
307 lines
9.8 KiB
Markdown
307 lines
9.8 KiB
Markdown
# Daemon/Boot/Deacon Watchdog Chain
|
|
|
|
> Autonomous health monitoring and recovery in Gas Town.
|
|
|
|
## Overview
|
|
|
|
Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
|
|
|
|
```
|
|
Daemon (Go process) ← Dumb transport, 3-min heartbeat
|
|
│
|
|
└─► Boot (AI agent) ← Intelligent triage, fresh each tick
|
|
│
|
|
└─► Deacon (AI agent) ← Continuous patrol, long-running
|
|
│
|
|
└─► Witnesses & Refineries ← Per-rig agents
|
|
```
|
|
|
|
**Key insight**: The daemon is mechanical (can't reason), but health decisions need
|
|
intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
|
|
|
|
## Design Rationale: Why Two Agents?
|
|
|
|
### The Problem
|
|
|
|
The daemon needs to ensure the Deacon is healthy, but:
|
|
|
|
1. **Daemon can't reason** - It's Go code following the ZFC principle (don't reason
|
|
about other agents). It can check "is session alive?" but not "is agent stuck?"
|
|
|
|
2. **Waking costs context** - Each time you spawn an AI agent, you consume context
|
|
tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
|
|
|
|
3. **Observation requires intelligence** - Distinguishing "agent composing large
|
|
artifact" from "agent hung on tool prompt" requires reasoning.
|
|
|
|
### The Solution: Boot as Triage
|
|
|
|
Boot is a narrow, ephemeral AI agent that:
|
|
- Runs fresh each daemon tick (no accumulated context debt)
|
|
- Makes a single decision: should Deacon wake?
|
|
- Exits immediately after deciding
|
|
|
|
This gives us intelligent triage without the cost of keeping a full AI running.
|
|
|
|
### Why Not Merge Boot into Deacon?
|
|
|
|
We could have Deacon handle its own "should I be awake?" logic, but:
|
|
|
|
1. **Deacon can't observe itself** - A hung Deacon can't detect it's hung
|
|
2. **Context accumulation** - Deacon runs continuously; Boot restarts fresh
|
|
3. **Cost in idle towns** - Boot only costs tokens when it runs; Deacon costs
|
|
tokens constantly if kept alive
|
|
|
|
### Why Not Replace with Go Code?
|
|
|
|
The daemon could directly monitor agents without AI, but:
|
|
|
|
1. **Can't observe panes** - Go code can't interpret tmux output semantically
|
|
2. **Can't distinguish stuck vs working** - No reasoning about agent state
|
|
3. **Escalation is complex** - When to notify? When to force-restart? AI handles
|
|
nuanced decisions better than hardcoded thresholds
|
|
|
|
## Session Ownership
|
|
|
|
| Agent | Session Name | Location | Lifecycle |
|
|
|-------|--------------|----------|-----------|
|
|
| Daemon | (Go process) | `~/gt/daemon/` | Persistent, auto-restart |
|
|
| Boot | `gt-boot` | `~/gt/deacon/dogs/boot/` | Ephemeral, fresh each tick |
|
|
| Deacon | `hq-deacon` | `~/gt/deacon/` | Long-running, handoff loop |
|
|
|
|
**Critical**: Boot runs in `gt-boot`, NOT `hq-deacon`. This prevents Boot
|
|
from conflicting with a running Deacon session.
|
|
|
|
## Heartbeat Mechanics
|
|
|
|
### Daemon Heartbeat (3 minutes)
|
|
|
|
The daemon runs a heartbeat tick every 3 minutes:
|
|
|
|
```go
|
|
func (d *Daemon) heartbeatTick() {
|
|
d.ensureBootRunning() // 1. Spawn Boot for triage
|
|
d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback
|
|
d.ensureWitnessesRunning() // 3. Witness health (checks tmux directly)
|
|
d.ensureRefineriesRunning() // 4. Refinery health (checks tmux directly)
|
|
d.triggerPendingSpawns() // 5. Bootstrap polecats
|
|
d.processLifecycleRequests() // 6. Cycle/restart requests
|
|
// Agent state derived from tmux, not recorded in beads (gt-zecmc)
|
|
}
|
|
```
|
|
|
|
### Deacon Heartbeat (continuous)
|
|
|
|
The Deacon updates `~/gt/deacon/heartbeat.json` at the start of each patrol cycle:
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2026-01-02T18:30:00Z",
|
|
"cycle": 42,
|
|
"last_action": "health-scan",
|
|
"healthy_agents": 3,
|
|
"unhealthy_agents": 0
|
|
}
|
|
```
|
|
|
|
### Heartbeat Freshness
|
|
|
|
| Age | State | Boot Action |
|
|
|-----|-------|-------------|
|
|
| < 5 min | Fresh | Nothing (Deacon active) |
|
|
| 5-15 min | Stale | Nudge if pending mail |
|
|
| > 15 min | Very stale | Wake (Deacon may be stuck) |
|
|
|
|
## Boot Decision Matrix
|
|
|
|
When Boot runs, it observes:
|
|
- Is Deacon session alive?
|
|
- How old is Deacon's heartbeat?
|
|
- Is there pending mail for Deacon?
|
|
- What's in Deacon's tmux pane?
|
|
|
|
Then decides:
|
|
|
|
| Condition | Action | Command |
|
|
|-----------|--------|---------|
|
|
| Session dead | START | Exit; daemon calls `ensureDeaconRunning()` |
|
|
| Heartbeat > 15 min | WAKE | `gt nudge deacon "Boot wake: check your inbox"` |
|
|
| Heartbeat 5-15 min + mail | NUDGE | `gt nudge deacon "Boot check-in: pending work"` |
|
|
| Heartbeat fresh | NOTHING | Exit silently |
|
|
|
|
## Handoff Flow
|
|
|
|
### Deacon Handoff
|
|
|
|
The Deacon runs continuous patrol cycles. After N cycles or high context:
|
|
|
|
```
|
|
End of patrol cycle:
|
|
│
|
|
├─ Squash wisp to digest (ephemeral → permanent)
|
|
├─ Write summary to molecule state
|
|
└─ gt handoff -s "Routine cycle" -m "Details"
|
|
│
|
|
└─ Creates mail for next session
|
|
```
|
|
|
|
Next daemon tick:
|
|
```
|
|
Daemon → ensureDeaconRunning()
|
|
│
|
|
└─ Spawns fresh Deacon in gt-deacon
|
|
│
|
|
└─ SessionStart hook: gt mail check --inject
|
|
│
|
|
└─ Previous handoff mail injected
|
|
│
|
|
└─ Deacon reads and continues
|
|
```
|
|
|
|
### Boot Handoff (Rare)
|
|
|
|
Boot is ephemeral - it exits after each tick. No persistent handoff needed.
|
|
|
|
However, Boot uses a marker file to prevent double-spawning:
|
|
- Marker: `~/gt/deacon/dogs/boot/.boot-running` (TTL: 5 minutes)
|
|
- Status: `~/gt/deacon/dogs/boot/.boot-status.json` (last action/result)
|
|
|
|
If the marker exists and is recent, daemon skips Boot spawn for that tick.
|
|
|
|
## Degraded Mode
|
|
|
|
When tmux is unavailable, Gas Town enters degraded mode:
|
|
|
|
| Capability | Normal | Degraded |
|
|
|------------|--------|----------|
|
|
| Boot runs | As AI in tmux | As Go code (mechanical) |
|
|
| Observe panes | Yes | No |
|
|
| Nudge agents | Yes | No |
|
|
| Start agents | tmux sessions | Direct spawn |
|
|
|
|
Degraded Boot triage is purely mechanical:
|
|
- Session dead → start
|
|
- Heartbeat stale → restart
|
|
- No reasoning, just thresholds
|
|
|
|
## Fallback Chain
|
|
|
|
Multiple layers ensure recovery:
|
|
|
|
1. **Boot triage** - Intelligent observation, first line
|
|
2. **Daemon checkDeaconHeartbeat()** - Belt-and-suspenders if Boot fails
|
|
3. **Tmux-based discovery** - Daemon checks tmux sessions directly (no bead state)
|
|
4. **Human escalation** - Mail to overseer for unrecoverable states
|
|
|
|
## State Files
|
|
|
|
| File | Purpose | Updated By |
|
|
|------|---------|-----------|
|
|
| `deacon/heartbeat.json` | Deacon freshness | Deacon (each cycle) |
|
|
| `deacon/dogs/boot/.boot-running` | Boot in-progress marker | Boot spawn |
|
|
| `deacon/dogs/boot/.boot-status.json` | Boot last action | Boot triage |
|
|
| `deacon/health-check-state.json` | Agent health tracking | `gt deacon health-check` |
|
|
| `daemon/daemon.log` | Daemon activity | Daemon |
|
|
| `daemon/daemon.pid` | Daemon process ID | Daemon startup |
|
|
|
|
## Debugging
|
|
|
|
```bash
|
|
# Check Deacon heartbeat
|
|
cat ~/gt/deacon/heartbeat.json | jq .
|
|
|
|
# Check Boot status
|
|
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
|
|
|
|
# View daemon log
|
|
tail -f ~/gt/daemon/daemon.log
|
|
|
|
# Manual Boot run
|
|
gt boot triage
|
|
|
|
# Manual Deacon health check
|
|
gt deacon health-check
|
|
```
|
|
|
|
## Common Issues
|
|
|
|
### Boot Spawns in Wrong Session
|
|
|
|
**Symptom**: Boot runs in `hq-deacon` instead of `gt-boot`
|
|
**Cause**: Session name confusion in spawn code
|
|
**Fix**: Ensure `gt boot triage` specifies `--session=gt-boot`
|
|
|
|
### Zombie Sessions Block Restart
|
|
|
|
**Symptom**: tmux session exists but Claude is dead
|
|
**Cause**: Daemon checks session existence, not process health
|
|
**Fix**: Kill zombie sessions before recreating: `gt session kill hq-deacon`
|
|
|
|
### Status Shows Wrong State
|
|
|
|
**Symptom**: `gt status` shows wrong state for agents
|
|
**Cause**: Previously bead state and tmux state could diverge
|
|
**Fix**: As of gt-zecmc, status derives state from tmux directly (no bead state for
|
|
observable conditions like running/stopped). Non-observable states (stuck, awaiting-gate)
|
|
are still stored in beads.
|
|
|
|
## Design Decision: Keep Separation
|
|
|
|
The issue [gt-1847v] considered three options:
|
|
|
|
### Option A: Keep Boot/Deacon Separation (CHOSEN)
|
|
|
|
- Boot is ephemeral, spawns fresh each heartbeat
|
|
- Boot runs in `gt-boot`, exits after triage
|
|
- Deacon runs in `hq-deacon`, continuous patrol
|
|
- Clear session boundaries, clear lifecycle
|
|
|
|
**Verdict**: This is the correct design. The implementation needs fixing, not the architecture.
|
|
|
|
### Option B: Merge Boot into Deacon (Rejected)
|
|
|
|
- Single `hq-deacon` session handles everything
|
|
- Deacon checks "should I be awake?" internally
|
|
|
|
**Why rejected**:
|
|
- Deacon can't observe itself (hung Deacon can't detect hang)
|
|
- Context accumulates even when idle (cost in quiet towns)
|
|
- No external watchdog means no recovery from Deacon failure
|
|
|
|
### Option C: Replace with Go Watchdog (Rejected)
|
|
|
|
- Daemon directly monitors witness/refinery
|
|
- No Boot, no Deacon AI for health checks
|
|
- AI agents only for complex decisions
|
|
|
|
**Why rejected**:
|
|
- Go code can't interpret tmux pane output semantically
|
|
- Can't distinguish "stuck" from "thinking deeply"
|
|
- Loses the intelligent triage that makes the system resilient
|
|
- Escalation decisions are nuanced (when to notify? force-restart?)
|
|
|
|
### Implementation Fixes Needed
|
|
|
|
The separation is correct; these bugs need fixing:
|
|
|
|
1. **Session confusion** (gt-sgzsb): Boot spawns in wrong session
|
|
2. **Zombie blocking** (gt-j1i0r): Daemon can't kill zombie sessions
|
|
3. ~~**Status mismatch** (gt-doih4): Bead vs tmux state divergence~~ → FIXED in gt-zecmc
|
|
4. **Ensure semantics** (gt-ekc5u): Start should kill zombies first
|
|
|
|
## Summary
|
|
|
|
The watchdog chain provides autonomous recovery:
|
|
|
|
- **Daemon**: Mechanical heartbeat, spawns Boot
|
|
- **Boot**: Intelligent triage, decides Deacon fate
|
|
- **Deacon**: Continuous patrol, monitors workers
|
|
|
|
Boot exists because the daemon can't reason and Deacon can't observe itself.
|
|
The separation costs complexity but enables:
|
|
|
|
1. **Intelligent triage** without constant AI cost
|
|
2. **Fresh context** for each triage decision
|
|
3. **Graceful degradation** when tmux unavailable
|
|
4. **Multiple fallback** layers for reliability
|