Files
gastown/docs/watchdog-chain.md
gastown/crew/jack 6b8c897e37 feat: use hq- prefix for Mayor and Deacon session names
Town-level services (Mayor, Deacon) now use hq- prefix instead of gt-:
- hq-mayor (was gt-mayor)
- hq-deacon (was gt-deacon)

This distinguishes town-level sessions from rig-level sessions which
continue to use gt- prefix (gt-gastown-witness, gt-gastown-crew-max, etc).

Changes:
- session.MayorSessionName() returns "hq-mayor"
- session.DeaconSessionName() returns "hq-deacon"
- ParseSessionName() handles both hq- and gt- prefixes
- categorizeSession() handles both prefixes
- categorizeSessions() accepts both prefixes
- Updated all tests and documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-05 00:42:24 -08:00

9.5 KiB

Daemon/Boot/Deacon Watchdog Chain

Autonomous health monitoring and recovery in Gas Town.

Overview

Gas Town uses a three-tier watchdog chain for autonomous health monitoring:

Daemon (Go process)          ← Dumb transport, 3-min heartbeat
    │
    └─► Boot (AI agent)       ← Intelligent triage, fresh each tick
            │
            └─► Deacon (AI agent)  ← Continuous patrol, long-running
                    │
                    └─► Witnesses & Refineries  ← Per-rig agents

Key insight: The daemon is mechanical (can't reason), but health decisions need intelligence (is the agent stuck or just thinking?). Boot bridges this gap.

Design Rationale: Why Two Agents?

The Problem

The daemon needs to ensure the Deacon is healthy, but:

  1. Daemon can't reason - It's Go code following the ZFC principle (don't reason about other agents). It can check "is session alive?" but not "is agent stuck?"

  2. Waking costs context - Each time you spawn an AI agent, you consume context tokens. In idle towns, waking Deacon every 3 minutes wastes resources.

  3. Observation requires intelligence - Distinguishing "agent composing large artifact" from "agent hung on tool prompt" requires reasoning.

The Solution: Boot as Triage

Boot is a narrow, ephemeral AI agent that:

  • Runs fresh each daemon tick (no accumulated context debt)
  • Makes a single decision: should Deacon wake?
  • Exits immediately after deciding

This gives us intelligent triage without the cost of keeping a full AI running.

Why Not Merge Boot into Deacon?

We could have Deacon handle its own "should I be awake?" logic, but:

  1. Deacon can't observe itself - A hung Deacon can't detect it's hung
  2. Context accumulation - Deacon runs continuously; Boot restarts fresh
  3. Cost in idle towns - Boot only costs tokens when it runs; Deacon costs tokens constantly if kept alive

Why Not Replace with Go Code?

The daemon could directly monitor agents without AI, but:

  1. Can't observe panes - Go code can't interpret tmux output semantically
  2. Can't distinguish stuck vs working - No reasoning about agent state
  3. Escalation is complex - When to notify? When to force-restart? AI handles nuanced decisions better than hardcoded thresholds

Session Ownership

Agent Session Name Location Lifecycle
Daemon (Go process) ~/gt/daemon/ Persistent, auto-restart
Boot gt-boot ~/gt/deacon/dogs/boot/ Ephemeral, fresh each tick
Deacon hq-deacon ~/gt/deacon/ Long-running, handoff loop

Critical: Boot runs in gt-boot, NOT hq-deacon. This prevents Boot from conflicting with a running Deacon session.

Heartbeat Mechanics

Daemon Heartbeat (3 minutes)

The daemon runs a heartbeat tick every 3 minutes:

func (d *Daemon) heartbeatTick() {
    d.ensureBootRunning()           // 1. Spawn Boot for triage
    d.checkDeaconHeartbeat()        // 2. Belt-and-suspenders fallback
    d.ensureWitnessesRunning()      // 3. Witness health
    d.triggerPendingSpawns()        // 4. Bootstrap polecats
    d.processLifecycleRequests()    // 5. Cycle/restart requests
    d.checkStaleAgents()            // 6. Timeout detection
    // ... more checks
}

Deacon Heartbeat (continuous)

The Deacon updates ~/gt/deacon/heartbeat.json at the start of each patrol cycle:

{
  "timestamp": "2026-01-02T18:30:00Z",
  "cycle": 42,
  "last_action": "health-scan",
  "healthy_agents": 3,
  "unhealthy_agents": 0
}

Heartbeat Freshness

Age State Boot Action
< 5 min Fresh Nothing (Deacon active)
5-15 min Stale Nudge if pending mail
> 15 min Very stale Wake (Deacon may be stuck)

Boot Decision Matrix

When Boot runs, it observes:

  • Is Deacon session alive?
  • How old is Deacon's heartbeat?
  • Is there pending mail for Deacon?
  • What's in Deacon's tmux pane?

Then decides:

Condition Action Command
Session dead START Exit; daemon calls ensureDeaconRunning()
Heartbeat > 15 min WAKE gt nudge deacon "Boot wake: check your inbox"
Heartbeat 5-15 min + mail NUDGE gt nudge deacon "Boot check-in: pending work"
Heartbeat fresh NOTHING Exit silently

Handoff Flow

Deacon Handoff

The Deacon runs continuous patrol cycles. After N cycles or high context:

End of patrol cycle:
    │
    ├─ Squash wisp to digest (ephemeral → permanent)
    ├─ Write summary to molecule state
    └─ gt handoff -s "Routine cycle" -m "Details"
        │
        └─ Creates mail for next session

Next daemon tick:

Daemon → ensureDeaconRunning()
    │
    └─ Spawns fresh Deacon in gt-deacon
        │
        └─ SessionStart hook: gt mail check --inject
            │
            └─ Previous handoff mail injected
                │
                └─ Deacon reads and continues

Boot Handoff (Rare)

Boot is ephemeral - it exits after each tick. No persistent handoff needed.

However, Boot uses a marker file to prevent double-spawning:

  • Marker: ~/gt/deacon/dogs/boot/.boot-running (TTL: 5 minutes)
  • Status: ~/gt/deacon/dogs/boot/.boot-status.json (last action/result)

If the marker exists and is recent, daemon skips Boot spawn for that tick.

Degraded Mode

When tmux is unavailable, Gas Town enters degraded mode:

Capability Normal Degraded
Boot runs As AI in tmux As Go code (mechanical)
Observe panes Yes No
Nudge agents Yes No
Start agents tmux sessions Direct spawn

Degraded Boot triage is purely mechanical:

  • Session dead → start
  • Heartbeat stale → restart
  • No reasoning, just thresholds

Fallback Chain

Multiple layers ensure recovery:

  1. Boot triage - Intelligent observation, first line
  2. Daemon checkDeaconHeartbeat() - Belt-and-suspenders if Boot fails
  3. Daemon checkStaleAgents() - Timeout-based detection
  4. Human escalation - Mail to overseer for unrecoverable states

State Files

File Purpose Updated By
deacon/heartbeat.json Deacon freshness Deacon (each cycle)
deacon/dogs/boot/.boot-running Boot in-progress marker Boot spawn
deacon/dogs/boot/.boot-status.json Boot last action Boot triage
deacon/health-check-state.json Agent health tracking gt deacon health-check
daemon/daemon.log Daemon activity Daemon
daemon/daemon.pid Daemon process ID Daemon startup

Debugging

# Check Deacon heartbeat
cat ~/gt/deacon/heartbeat.json | jq .

# Check Boot status
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .

# View daemon log
tail -f ~/gt/daemon/daemon.log

# Manual Boot run
gt boot triage

# Manual Deacon health check
gt deacon health-check

Common Issues

Boot Spawns in Wrong Session

Symptom: Boot runs in hq-deacon instead of gt-boot Cause: Session name confusion in spawn code Fix: Ensure gt boot triage specifies --session=gt-boot

Zombie Sessions Block Restart

Symptom: tmux session exists but Claude is dead Cause: Daemon checks session existence, not process health Fix: Kill zombie sessions before recreating: gt session kill hq-deacon

Status Shows Wrong State

Symptom: gt status shows "stopped" for running agents Cause: Bead state and tmux state diverged Fix: Reconcile with gt sync-status or restart agent

Design Decision: Keep Separation

The issue [gt-1847v] considered three options:

Option A: Keep Boot/Deacon Separation (CHOSEN)

  • Boot is ephemeral, spawns fresh each heartbeat
  • Boot runs in gt-boot, exits after triage
  • Deacon runs in hq-deacon, continuous patrol
  • Clear session boundaries, clear lifecycle

Verdict: This is the correct design. The implementation needs fixing, not the architecture.

Option B: Merge Boot into Deacon (Rejected)

  • Single hq-deacon session handles everything
  • Deacon checks "should I be awake?" internally

Why rejected:

  • Deacon can't observe itself (hung Deacon can't detect hang)
  • Context accumulates even when idle (cost in quiet towns)
  • No external watchdog means no recovery from Deacon failure

Option C: Replace with Go Watchdog (Rejected)

  • Daemon directly monitors witness/refinery
  • No Boot, no Deacon AI for health checks
  • AI agents only for complex decisions

Why rejected:

  • Go code can't interpret tmux pane output semantically
  • Can't distinguish "stuck" from "thinking deeply"
  • Loses the intelligent triage that makes the system resilient
  • Escalation decisions are nuanced (when to notify? force-restart?)

Implementation Fixes Needed

The separation is correct; these bugs need fixing:

  1. Session confusion (gt-sgzsb): Boot spawns in wrong session
  2. Zombie blocking (gt-j1i0r): Daemon can't kill zombie sessions
  3. Status mismatch (gt-doih4): Bead vs tmux state divergence
  4. Ensure semantics (gt-ekc5u): Start should kill zombies first

Summary

The watchdog chain provides autonomous recovery:

  • Daemon: Mechanical heartbeat, spawns Boot
  • Boot: Intelligent triage, decides Deacon fate
  • Deacon: Continuous patrol, monitors workers

Boot exists because the daemon can't reason and Deacon can't observe itself. The separation costs complexity but enables:

  1. Intelligent triage without constant AI cost
  2. Fresh context for each triage decision
  3. Graceful degradation when tmux unavailable
  4. Multiple fallback layers for reliability