* Add hung-session-detection step to deacon patrol Detects and surgically recovers Gas Town sessions where Claude API call is stuck indefinitely. These appear "running" (tmux session exists) but aren't processing work. Safety checks (ALL must pass before recovery): 1. Session matches Gas Town pattern exactly (gt-*-witness, etc) 2. Session shows waiting state (Clauding/Deciphering/etc) 3. Duration >30min AND (zero tokens OR duration >2hrs) 4. NOT showing active tool execution (⏺ markers) This closes a gap where existing zombie-scan only catches processes not in tmux sessions. Co-Authored-By: Claude <noreply@anthropic.com> * fix(orphan): protect all tmux sessions, not just Gas Town ones The orphan cleanup was killing Claude processes in user's personal tmux sessions (e.g., "loomtown", "yaad") because only sessions with gt-* or hq-* prefixes were protected. Changes: - Renamed getGasTownSessionPIDs() to getTmuxSessionPIDs() - Now protects ALL tmux sessions regardless of name prefix - Updated variable names for clarity (gasTownPIDs -> protectedPIDs) The TTY="?" check is not reliable during certain operations (startup, session transitions), so explicit protection of all tmux sessions is necessary to prevent killing user's personal Claude instances. Fixes #923 Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: mayor <ec2-user@ip-172-31-43-79.ec2.internal> Co-authored-by: Claude <noreply@anthropic.com>
985 lines
29 KiB
TOML
985 lines
29 KiB
TOML
description = """
|
|
Mayor's daemon patrol loop.
|
|
|
|
The Deacon is the Mayor's background process that runs continuously, handling callbacks, monitoring rig health, and performing cleanup. Each patrol cycle runs these steps in sequence, then loops or exits.
|
|
|
|
## Idle Town Principle
|
|
|
|
**The Deacon should be silent/invisible when the town is healthy and idle.**
|
|
|
|
- Skip HEALTH_CHECK nudges when no active work exists
|
|
- Sleep 60+ seconds between patrol cycles (longer when idle)
|
|
- Let the feed subscription wake agents on actual events
|
|
- The daemon (10-minute heartbeat) is the safety net for dead sessions
|
|
|
|
This prevents flooding idle agents with health checks every few seconds.
|
|
|
|
## Second-Order Monitoring
|
|
|
|
Witnesses send WITNESS_PING messages to verify the Deacon is alive. This
|
|
prevents the "who watches the watchers" problem - if the Deacon dies,
|
|
Witnesses detect it and escalate to the Mayor.
|
|
|
|
The Deacon's agent bead last_activity timestamp is updated during each patrol
|
|
cycle. Witnesses check this timestamp to verify health."""
|
|
formula = "mol-deacon-patrol"
|
|
version = 8
|
|
|
|
[[steps]]
|
|
id = "inbox-check"
|
|
title = "Handle callbacks from agents"
|
|
description = """
|
|
Handle callbacks from agents.
|
|
|
|
Check the Mayor's inbox for messages from:
|
|
- Witnesses reporting polecat status
|
|
- Refineries reporting merge results
|
|
- Polecats requesting help or escalation
|
|
- External triggers (webhooks, timers)
|
|
|
|
```bash
|
|
gt mail inbox
|
|
# For each message:
|
|
gt mail read <id>
|
|
# Handle based on message type
|
|
```
|
|
|
|
**WITNESS_PING**:
|
|
Witnesses periodically ping to verify Deacon is alive. Simply acknowledge
|
|
and archive - the fact that you're processing mail proves you're running.
|
|
Your agent bead last_activity is updated automatically during patrol.
|
|
```bash
|
|
gt mail archive <message-id>
|
|
```
|
|
|
|
**HELP / Escalation**:
|
|
Assess and handle or forward to Mayor.
|
|
Archive after handling:
|
|
```bash
|
|
gt mail archive <message-id>
|
|
```
|
|
|
|
**LIFECYCLE messages**:
|
|
Polecats reporting completion, refineries reporting merge results.
|
|
Archive after processing:
|
|
```bash
|
|
gt mail archive <message-id>
|
|
```
|
|
|
|
**DOG_DONE messages**:
|
|
Dogs report completion after infrastructure tasks (orphan-scan, session-gc, etc.).
|
|
Subject format: `DOG_DONE <hostname>`
|
|
Body contains: task name, counts, status.
|
|
```bash
|
|
# Parse the report, log metrics if needed
|
|
gt mail read <id>
|
|
# Archive after noting completion
|
|
gt mail archive <message-id>
|
|
```
|
|
Dogs return to idle automatically. The report is informational - no action needed
|
|
unless the dog reports errors that require escalation.
|
|
|
|
Callbacks may spawn new polecats, update issue state, or trigger other actions.
|
|
|
|
**Hygiene principle**: Archive messages after they're fully processed.
|
|
Keep inbox near-empty - only unprocessed items should remain."""
|
|
|
|
[[steps]]
|
|
id = "orphan-process-cleanup"
|
|
title = "Clean up orphaned claude subagent processes"
|
|
needs = ["inbox-check"]
|
|
description = """
|
|
Clean up orphaned claude subagent processes.
|
|
|
|
Claude Code's Task tool spawns subagent processes that sometimes don't clean up
|
|
properly after completion. These accumulate and consume significant memory.
|
|
|
|
**Detection method:**
|
|
Orphaned processes have no controlling terminal (TTY = "?"). Legitimate claude
|
|
instances in terminals have a TTY like "pts/0".
|
|
|
|
**Run cleanup:**
|
|
```bash
|
|
gt deacon cleanup-orphans
|
|
```
|
|
|
|
This command:
|
|
1. Lists all claude/codex processes with `ps -eo pid,tty,comm`
|
|
2. Filters for TTY = "?" (no controlling terminal)
|
|
3. Sends SIGTERM to each orphaned process
|
|
4. Reports how many were killed
|
|
|
|
**Why this is safe:**
|
|
- Processes in terminals (your personal sessions) have a TTY - they won't be touched
|
|
- Only kills processes that have no controlling terminal
|
|
- These orphans are children of the tmux server with no TTY, indicating they're
|
|
detached subagents that failed to exit
|
|
|
|
**If cleanup fails:**
|
|
Log the error but continue patrol - this is best-effort cleanup.
|
|
|
|
**Exit criteria:** Orphan cleanup attempted (success or logged failure)."""
|
|
|
|
[[steps]]
|
|
id = "trigger-pending-spawns"
|
|
title = "Nudge newly spawned polecats"
|
|
needs = ["orphan-process-cleanup"]
|
|
description = """
|
|
Nudge newly spawned polecats that are ready for input.
|
|
|
|
When polecats are spawned, their Claude session takes 10-20 seconds to initialize. The spawn command returns immediately without waiting. This step finds spawned polecats that are now ready and sends them a trigger to start working.
|
|
|
|
**ZFC-Compliant Observation** (AI observes AI):
|
|
|
|
```bash
|
|
# View pending spawns with captured terminal output
|
|
gt deacon pending
|
|
```
|
|
|
|
For each pending session, analyze the captured output:
|
|
- Look for Claude's prompt indicator "> " at the start of a line
|
|
- If prompt is visible, Claude is ready for input
|
|
- Make the judgment call yourself - you're the AI observer
|
|
|
|
For each ready polecat:
|
|
```bash
|
|
# 1. Trigger the polecat
|
|
gt nudge <session> "Begin."
|
|
|
|
# 2. Clear from pending list
|
|
gt deacon pending <session>
|
|
```
|
|
|
|
This triggers the UserPromptSubmit hook, which injects mail so the polecat sees its assignment.
|
|
|
|
**Bootstrap mode** (daemon-only, no AI available):
|
|
The daemon uses `gt deacon trigger-pending` with regex detection. This ZFC violation is acceptable during cold startup when no AI agent is running yet."""
|
|
|
|
[[steps]]
|
|
id = "gate-evaluation"
|
|
title = "Evaluate pending async gates"
|
|
needs = ["inbox-check"]
|
|
description = """
|
|
Evaluate pending async gates.
|
|
|
|
Gates are async coordination primitives that block until conditions are met.
|
|
The Deacon is responsible for monitoring gates and closing them when ready.
|
|
|
|
**Timer gates** (await_type: timer):
|
|
Check if elapsed time since creation exceeds the timeout duration.
|
|
|
|
```bash
|
|
# List all open gates
|
|
bd gate list --json
|
|
|
|
# For each timer gate, check if elapsed:
|
|
# - CreatedAt + Timeout < Now → gate is ready to close
|
|
# - Close with: bd gate close <id> --reason "Timer elapsed"
|
|
```
|
|
|
|
**GitHub gates** (await_type: gh:run, gh:pr) - handled in separate step.
|
|
|
|
**Human/Mail gates** - require external input, skip here.
|
|
|
|
After closing a gate, the Waiters field contains mail addresses to notify.
|
|
Send a brief notification to each waiter that the gate has cleared."""
|
|
|
|
[[steps]]
|
|
id = "dispatch-gated-molecules"
|
|
title = "Dispatch molecules with resolved gates"
|
|
needs = ["gate-evaluation"]
|
|
description = """
|
|
Find molecules blocked on gates that have now closed and dispatch them.
|
|
|
|
This completes the async resume cycle without explicit waiter tracking.
|
|
The molecule state IS the waiter - patrol discovers reality each cycle.
|
|
|
|
**Step 1: Find gate-ready molecules**
|
|
```bash
|
|
bd mol ready --gated --json
|
|
```
|
|
|
|
This returns molecules where:
|
|
- Status is in_progress
|
|
- Current step has a gate dependency
|
|
- The gate bead is now closed
|
|
- No polecat currently has it hooked
|
|
|
|
**Step 2: For each ready molecule, dispatch to the appropriate rig**
|
|
```bash
|
|
# Determine target rig from molecule metadata
|
|
bd mol show <mol-id> --json
|
|
# Look for rig field or infer from prefix
|
|
|
|
# Dispatch to that rig's polecat pool
|
|
gt sling <mol-id> <rig>/polecats
|
|
```
|
|
|
|
**Step 3: Log dispatch**
|
|
Note which molecules were dispatched for observability:
|
|
```bash
|
|
# Molecule <mol-id> dispatched to <rig>/polecats (gate <gate-id> cleared)
|
|
```
|
|
|
|
**If no gate-ready molecules:**
|
|
Skip - nothing to dispatch. Gates haven't closed yet or molecules
|
|
already have active polecats working on them.
|
|
|
|
**Exit criteria:** All gate-ready molecules dispatched to polecats."""
|
|
|
|
[[steps]]
|
|
id = "check-convoy-completion"
|
|
title = "Check convoy completion"
|
|
needs = ["inbox-check"]
|
|
description = """
|
|
Check convoy completion status.
|
|
|
|
Convoys are coordination beads that track multiple issues across rigs. When all tracked issues close, the convoy auto-closes.
|
|
|
|
**Step 1: Find open convoys**
|
|
```bash
|
|
bd list --type=convoy --status=open
|
|
```
|
|
|
|
**Step 2: For each open convoy, check tracked issues**
|
|
```bash
|
|
bd show <convoy-id>
|
|
# Look for 'tracks' or 'dependencies' field listing tracked issues
|
|
```
|
|
|
|
**Step 3: If all tracked issues are closed, close the convoy**
|
|
```bash
|
|
# Check each tracked issue
|
|
for issue in tracked_issues:
|
|
bd show <issue-id>
|
|
# If status is open/in_progress, convoy stays open
|
|
# If all are closed (completed, wontfix, etc.), convoy is complete
|
|
|
|
# Close convoy when all tracked issues are done
|
|
bd close <convoy-id> --reason "All tracked issues completed"
|
|
```
|
|
|
|
**Note**: Convoys support cross-prefix tracking (e.g., hq-* convoy can track gt-*, bd-* issues). Use full IDs when checking."""
|
|
|
|
[[steps]]
|
|
id = "resolve-external-deps"
|
|
title = "Resolve external dependencies"
|
|
needs = ["check-convoy-completion"]
|
|
description = """
|
|
Resolve external dependencies across rigs.
|
|
|
|
When an issue in one rig closes, any dependencies in other rigs should be notified. This enables cross-rig coordination without tight coupling.
|
|
|
|
**Step 1: Check recent closures from feed**
|
|
```bash
|
|
gt feed --since 10m --plain | grep "✓"
|
|
# Look for recently closed issues
|
|
```
|
|
|
|
**Step 2: For each closed issue, check cross-rig dependents**
|
|
```bash
|
|
bd show <closed-issue>
|
|
# Look at 'blocks' field - these are issues that were waiting on this one
|
|
# If any blocked issue is in a different rig/prefix, it may now be unblocked
|
|
```
|
|
|
|
**Step 3: Update blocked status**
|
|
For blocked issues in other rigs, the closure should automatically unblock them (beads handles this). But verify:
|
|
```bash
|
|
bd blocked
|
|
# Should no longer show the previously-blocked issue if dependency is met
|
|
```
|
|
|
|
**Cross-rig scenarios:**
|
|
- bd-xxx closes → gt-yyy that depended on it is unblocked
|
|
- External issue closes → internal convoy step can proceed
|
|
- Rig A issue closes → Rig B issue waiting on it proceeds
|
|
|
|
No manual intervention needed if dependencies are properly tracked - this step just validates the propagation occurred."""
|
|
|
|
[[steps]]
|
|
id = "fire-notifications"
|
|
title = "Fire notifications"
|
|
needs = ["resolve-external-deps"]
|
|
description = """
|
|
Fire notifications for convoy and cross-rig events.
|
|
|
|
After convoy completion or cross-rig dependency resolution, notify relevant parties.
|
|
|
|
**Convoy completion notifications:**
|
|
When a convoy closes (all tracked issues done), notify the Overseer:
|
|
```bash
|
|
# Convoy gt-convoy-xxx just completed
|
|
gt mail send mayor/ -s "Convoy complete: <convoy-title>" \\
|
|
-m "Convoy <id> has completed. All tracked issues closed.
|
|
Duration: <start to end>
|
|
Issues: <count>
|
|
|
|
Summary: <brief description of what was accomplished>"
|
|
```
|
|
|
|
**Cross-rig resolution notifications:**
|
|
When a cross-rig dependency resolves, notify the affected rig:
|
|
```bash
|
|
# Issue bd-xxx closed, unblocking gt-yyy
|
|
gt mail send gastown/witness -s "Dependency resolved: <bd-xxx>" \\
|
|
-m "External dependency bd-xxx has closed.
|
|
Unblocked: gt-yyy (<title>)
|
|
This issue may now proceed."
|
|
```
|
|
|
|
**Notification targets:**
|
|
- Convoy complete → mayor/ (for strategic visibility)
|
|
- Cross-rig dep resolved → <rig>/witness (for operational awareness)
|
|
|
|
Keep notifications brief and actionable. The recipient can run bd show for details."""
|
|
|
|
[[steps]]
|
|
id = "health-scan"
|
|
title = "Check Witness and Refinery health"
|
|
needs = ["trigger-pending-spawns", "dispatch-gated-molecules", "fire-notifications"]
|
|
description = """
|
|
Check Witness and Refinery health for each rig.
|
|
|
|
**IMPORTANT: Idle Town Protocol**
|
|
Before sending health check nudges, check if the town is idle:
|
|
```bash
|
|
# Check for active work
|
|
bd list --status=in_progress --limit=5
|
|
```
|
|
|
|
If NO active work (empty result or only patrol molecules):
|
|
- **Skip HEALTH_CHECK nudges** - don't disturb idle agents
|
|
- Just verify sessions exist via status commands
|
|
- The town should be silent when healthy and idle
|
|
|
|
If ACTIVE work exists:
|
|
- Proceed with health check nudges below
|
|
|
|
**ZFC Principle**: You (Claude) make the judgment call about what is "stuck" or "unresponsive" - there are no hardcoded thresholds in Go. Read the signals, consider context, and decide.
|
|
|
|
For each rig, run:
|
|
```bash
|
|
gt witness status <rig>
|
|
gt refinery status <rig>
|
|
|
|
# ONLY if active work exists - health ping (clears backoff as side effect)
|
|
gt nudge <rig>/witness 'HEALTH_CHECK from deacon'
|
|
gt nudge <rig>/refinery 'HEALTH_CHECK from deacon'
|
|
```
|
|
|
|
**Health Ping Benefit**: The nudge commands serve dual purposes:
|
|
1. **Liveness verification** - Agent responds to prove it's alive
|
|
2. **Backoff reset** - Any nudge resets agent's backoff to base interval
|
|
|
|
This ensures patrol agents remain responsive during active work periods.
|
|
|
|
**Signals to assess:**
|
|
|
|
| Component | Healthy Signals | Concerning Signals |
|
|
|-----------|-----------------|-------------------|
|
|
| Witness | State: running, recent activity | State: not running, no heartbeat |
|
|
| Refinery | State: running, queue processing | Queue stuck, merge failures |
|
|
|
|
**Tracking unresponsive cycles:**
|
|
|
|
Maintain in your patrol state (persisted across cycles):
|
|
```
|
|
health_state:
|
|
<rig>:
|
|
witness:
|
|
unresponsive_cycles: 0
|
|
last_seen_healthy: <timestamp>
|
|
refinery:
|
|
unresponsive_cycles: 0
|
|
last_seen_healthy: <timestamp>
|
|
```
|
|
|
|
**Decision matrix** (you decide the thresholds based on context):
|
|
|
|
| Cycles Unresponsive | Suggested Action |
|
|
|---------------------|------------------|
|
|
| 1-2 | Note it, check again next cycle |
|
|
| 3-4 | Attempt restart: gt witness restart <rig> |
|
|
| 5+ | Escalate to Mayor with context |
|
|
|
|
**Restart commands:**
|
|
```bash
|
|
gt witness restart <rig>
|
|
gt refinery restart <rig>
|
|
```
|
|
|
|
**Escalation:**
|
|
```bash
|
|
gt mail send mayor/ -s "Health: <rig> <component> unresponsive" \\
|
|
-m "Component has been unresponsive for N cycles. Restart attempts failed.
|
|
Last healthy: <timestamp>
|
|
Error signals: <details>"
|
|
```
|
|
|
|
Reset unresponsive_cycles to 0 when component responds normally."""
|
|
|
|
[[steps]]
|
|
id = "hung-session-detection"
|
|
title = "Detect and recover hung Gas Town sessions (SURGICAL)"
|
|
needs = ["health-scan"]
|
|
description = """
|
|
Detect and surgically recover hung Gas Town sessions where the Claude API call is stuck.
|
|
|
|
A hung session appears "running" (tmux session exists, Claude process exists) but
|
|
the API call has been stuck indefinitely. This breaks patrol chains - if witness
|
|
hangs, refinery never gets nudged about new MRs.
|
|
|
|
**Why existing checks miss this:**
|
|
- zombie-scan only catches processes not in tmux sessions
|
|
- gt status shows "running" if tmux session exists
|
|
- Nudges queue but never get processed (Claude can't respond)
|
|
|
|
## SURGICAL TARGETING
|
|
|
|
**ONLY these session patterns are valid targets:**
|
|
- `gt-<rig>-witness` (e.g., gt-kalshi-witness, gt-horizon-witness)
|
|
- `gt-<rig>-refinery` (e.g., gt-kalshi-refinery)
|
|
- `hq-deacon`
|
|
|
|
**NEVER touch sessions that don't match these patterns exactly.**
|
|
|
|
## DETECTION (All checks must pass)
|
|
|
|
For each Gas Town session, capture output and verify ALL of these:
|
|
|
|
```bash
|
|
# Step 1: Get session output
|
|
output=$(tmux capture-pane -t <session-name> -p 2>/dev/null | tail -10)
|
|
```
|
|
|
|
**Check 1: Session is in waiting state**
|
|
Must see one of: `Clauding`, `Deciphering`, `Marinating`, `Finagling`, `thinking`
|
|
```bash
|
|
echo "$output" | grep -qiE 'Clauding|Deciphering|Marinating|Finagling|thinking'
|
|
```
|
|
|
|
**Check 2: Duration exceeds threshold (30+ minutes)**
|
|
Parse duration from output like "21h 35m 20s" or "45m 30s":
|
|
```bash
|
|
# Extract hours and minutes
|
|
hours=$(echo "$output" | grep -oE '[0-9]+h' | head -1 | tr -d 'h')
|
|
minutes=$(echo "$output" | grep -oE '[0-9]+m' | head -1 | tr -d 'm')
|
|
total_minutes=$((${hours:-0} * 60 + ${minutes:-0}))
|
|
# Threshold: 30 minutes minimum
|
|
[ "$total_minutes" -ge 30 ]
|
|
```
|
|
|
|
**Check 3: Zero tokens received (definite hang) OR very long duration (>2 hours)**
|
|
```bash
|
|
# Definite hang: zero tokens received
|
|
echo "$output" | grep -qE '↓ 0 tokens'
|
|
# OR extremely long duration (>2 hours = 120 minutes)
|
|
[ "$total_minutes" -ge 120 ]
|
|
```
|
|
|
|
**Check 4: NOT showing active tool execution**
|
|
Active sessions show tool markers (⏺). If present, session is actually working:
|
|
```bash
|
|
# If tool markers present in recent output, DO NOT kill
|
|
echo "$output" | grep -qE '⏺|Read|Write|Bash|Edit' && continue
|
|
```
|
|
|
|
## RECOVERY (Only after ALL checks pass)
|
|
|
|
**Log the action first:**
|
|
```bash
|
|
echo "[$(date)] RECOVERING HUNG: <session-name> (${hours}h ${minutes}m, waiting state)" >> $GT_ROOT/logs/hung-sessions.log
|
|
```
|
|
|
|
**Kill and restart based on session type:**
|
|
|
|
For witness:
|
|
```bash
|
|
tmux kill-session -t gt-<rig>-witness 2>/dev/null
|
|
gt witness start <rig>
|
|
```
|
|
|
|
For refinery:
|
|
```bash
|
|
tmux kill-session -t gt-<rig>-refinery 2>/dev/null
|
|
gt refinery restart <rig>
|
|
```
|
|
|
|
For deacon (self-recovery - use with caution):
|
|
```bash
|
|
# Deacon detecting itself is hung is a paradox
|
|
# Only kill if another deacon instance exists or human confirmed
|
|
gt mail send mayor/ -s "DEACON SELF-HUNG DETECTED" -m "Deacon appears hung. Human intervention required."
|
|
```
|
|
|
|
## VERIFICATION
|
|
|
|
After restart, verify new session is healthy:
|
|
```bash
|
|
sleep 5
|
|
tmux has-session -t <session-name> && echo "Session restarted successfully"
|
|
```
|
|
|
|
**Exit criteria:** All hung Gas Town sessions detected and recovered (or escalated if recovery failed)."""
|
|
|
|
[[steps]]
|
|
id = "zombie-scan"
|
|
title = "Detect zombie polecats (NO KILL AUTHORITY)"
|
|
needs = ["hung-session-detection"]
|
|
description = """
|
|
Defense-in-depth DETECTION of zombie polecats that Witness should have cleaned.
|
|
|
|
**⚠️ CRITICAL: The Deacon has NO kill authority.**
|
|
|
|
These are workers with context, mid-task progress, unsaved state. Every kill
|
|
destroys work. File the warrant and let Boot handle interrogation and execution.
|
|
You do NOT have kill authority.
|
|
|
|
**Why this exists:**
|
|
The Witness is responsible for cleaning up polecats after they complete work.
|
|
This step provides backup DETECTION in case the Witness fails to clean up.
|
|
Detection only - Boot handles termination.
|
|
|
|
**Zombie criteria:**
|
|
- State: idle or done (no active work assigned)
|
|
- Session: not running (tmux session dead)
|
|
- No hooked work (nothing pending for this polecat)
|
|
- Last activity: older than 10 minutes
|
|
|
|
**Run the zombie scan (DRY RUN ONLY):**
|
|
```bash
|
|
gt deacon zombie-scan --dry-run
|
|
```
|
|
|
|
**NEVER run:**
|
|
- `gt deacon zombie-scan` (without --dry-run)
|
|
- `tmux kill-session`
|
|
- `gt polecat nuke`
|
|
- Any command that terminates a session
|
|
|
|
**If zombies detected:**
|
|
1. Review the output to confirm they are truly abandoned
|
|
2. File a death warrant for each detected zombie:
|
|
```bash
|
|
gt warrant file <polecat> --reason "Zombie detected: no session, no hook, idle >10m"
|
|
```
|
|
3. Boot will handle interrogation and execution
|
|
4. Notify the Mayor about Witness failure:
|
|
```bash
|
|
gt mail send mayor/ -s "Witness cleanup failure" \
|
|
-m "Filed death warrant for <polecat>. Witness failed to clean up."
|
|
```
|
|
|
|
**If no zombies:**
|
|
No action needed - Witness is doing its job.
|
|
|
|
**Note:** This is a backup mechanism. If you frequently detect zombies,
|
|
investigate why the Witness isn't cleaning up properly."""
|
|
|
|
[[steps]]
|
|
id = "plugin-run"
|
|
title = "Execute registered plugins"
|
|
needs = ["zombie-scan"]
|
|
description = """
|
|
Execute registered plugins.
|
|
|
|
Scan $GT_ROOT/plugins/ for plugin directories. Each plugin has a plugin.md with TOML frontmatter defining its gate (when to run) and instructions (what to do).
|
|
|
|
See docs/deacon-plugins.md for full documentation.
|
|
|
|
Gate types:
|
|
- cooldown: Time since last run (e.g., 24h)
|
|
- cron: Schedule-based (e.g., "0 9 * * *")
|
|
- condition: Metric threshold (e.g., wisp count > 50)
|
|
- event: Trigger-based (e.g., startup, heartbeat)
|
|
|
|
For each plugin:
|
|
1. Read plugin.md frontmatter to check gate
|
|
2. Compare against state.json (last run, etc.)
|
|
3. If gate is open, execute the plugin
|
|
|
|
Plugins marked parallel: true can run concurrently using Task tool subagents. Sequential plugins run one at a time in directory order.
|
|
|
|
Skip this step if $GT_ROOT/plugins/ does not exist or is empty."""
|
|
|
|
[[steps]]
|
|
id = "dog-pool-maintenance"
|
|
title = "Maintain dog pool"
|
|
needs = ["health-scan"]
|
|
description = """
|
|
Ensure dog pool has available workers for dispatch.
|
|
|
|
**Step 1: Check dog pool status**
|
|
```bash
|
|
gt dog status
|
|
# Shows idle/working counts
|
|
```
|
|
|
|
**Step 2: Ensure minimum idle dogs**
|
|
If idle count is 0 and working count is at capacity, consider spawning:
|
|
```bash
|
|
# If no idle dogs available
|
|
gt dog add <name>
|
|
# Names: alpha, bravo, charlie, delta, etc.
|
|
```
|
|
|
|
**Step 3: Retire stale dogs (optional)**
|
|
Dogs that have been idle for >24 hours can be removed to save resources:
|
|
```bash
|
|
gt dog status <name>
|
|
# Check last_active timestamp
|
|
# If idle > 24h: gt dog remove <name>
|
|
```
|
|
|
|
**Pool sizing guidelines:**
|
|
- Minimum: 1 idle dog always available
|
|
- Maximum: 4 dogs total (balance resources vs throughput)
|
|
- Spawn on demand when pool is empty
|
|
|
|
**Exit criteria:** Pool has at least 1 idle dog."""
|
|
|
|
[[steps]]
|
|
id = "dog-health-check"
|
|
title = "Check for stuck dogs"
|
|
needs = ["dog-pool-maintenance"]
|
|
description = """
|
|
Check for dogs that have been working too long (stuck).
|
|
|
|
Dogs dispatched via `gt dog dispatch --plugin` are marked as "working" with
|
|
a work description like "plugin:rebuild-gt". If a dog hangs, crashes, or
|
|
takes too long, it needs intervention.
|
|
|
|
**Step 1: List working dogs**
|
|
```bash
|
|
gt dog list --json
|
|
# Filter for state: "working"
|
|
```
|
|
|
|
**Step 2: Check work duration**
|
|
For each working dog:
|
|
```bash
|
|
gt dog status <name> --json
|
|
# Check: work_started_at, current_work
|
|
```
|
|
|
|
Compare against timeout:
|
|
- If plugin has [execution] timeout in plugin.md, use that
|
|
- Default timeout: 10 minutes for infrastructure tasks
|
|
|
|
**Duration calculation:**
|
|
```
|
|
stuck_threshold = plugin_timeout or 10m
|
|
duration = now - work_started_at
|
|
is_stuck = duration > stuck_threshold
|
|
```
|
|
|
|
**Step 3: Handle stuck dogs**
|
|
|
|
For dogs working > timeout:
|
|
```bash
|
|
# Option A: File death warrant (Boot handles termination)
|
|
gt warrant file deacon/dogs/<name> --reason "Stuck: working on <work> for <duration>"
|
|
|
|
# Option B: Force clear work and notify
|
|
gt dog clear <name> --force
|
|
gt mail send deacon/ -s "DOG_TIMEOUT <name>" -m "Dog <name> timed out on <work> after <duration>"
|
|
```
|
|
|
|
**Decision matrix:**
|
|
|
|
| Duration over timeout | Action |
|
|
|----------------------|--------|
|
|
| < 2x timeout | Log warning, check next cycle |
|
|
| 2x - 5x timeout | File death warrant |
|
|
| > 5x timeout | Force clear + escalate to Mayor |
|
|
|
|
**Step 4: Track chronic failures**
|
|
If same dog gets stuck repeatedly:
|
|
```bash
|
|
gt mail send mayor/ -s "Dog <name> chronic failures" \
|
|
-m "Dog has timed out N times in last 24h. Consider removing from pool."
|
|
```
|
|
|
|
**Exit criteria:** All stuck dogs handled (warrant filed or cleared)."""
|
|
|
|
[[steps]]
|
|
id = "orphan-check"
|
|
title = "Detect abandoned work"
|
|
needs = ["dog-health-check"]
|
|
description = """
|
|
**DETECT ONLY** - Check for orphaned state and dispatch to dog if found.
|
|
|
|
**Step 1: Quick orphan scan**
|
|
```bash
|
|
# Check for in_progress issues with dead assignees
|
|
bd list --status=in_progress --json | head -20
|
|
```
|
|
|
|
For each in_progress issue, check if assignee session exists:
|
|
```bash
|
|
tmux has-session -t <session> 2>/dev/null && echo "alive" || echo "orphan"
|
|
```
|
|
|
|
**Step 2: If orphans detected, dispatch to dog**
|
|
```bash
|
|
# Sling orphan-scan formula to an idle dog
|
|
gt sling mol-orphan-scan deacon/dogs --var scope=town
|
|
```
|
|
|
|
**Important:** Do NOT fix orphans inline. Dogs handle recovery.
|
|
The Deacon's job is detection and dispatch, not execution.
|
|
|
|
**Step 3: If no orphans detected**
|
|
Skip dispatch - nothing to do.
|
|
|
|
**Exit criteria:** Orphan scan dispatched to dog (if needed)."""
|
|
|
|
[[steps]]
|
|
id = "session-gc"
|
|
title = "Detect cleanup needs"
|
|
needs = ["orphan-check"]
|
|
description = """
|
|
**DETECT ONLY** - Check if cleanup is needed and dispatch to dog.
|
|
|
|
**Step 1: Preview cleanup needs**
|
|
```bash
|
|
gt doctor -v
|
|
# Check output for issues that need cleaning
|
|
```
|
|
|
|
**Step 2: If cleanup needed, dispatch to dog**
|
|
```bash
|
|
# Sling session-gc formula to an idle dog
|
|
gt sling mol-session-gc deacon/dogs --var mode=conservative
|
|
```
|
|
|
|
**Important:** Do NOT run `gt doctor --fix` inline. Dogs handle cleanup.
|
|
The Deacon stays lightweight - detection only.
|
|
|
|
**Step 3: If nothing to clean**
|
|
Skip dispatch - system is healthy.
|
|
|
|
**Cleanup types (for reference):**
|
|
- orphan-sessions: Dead tmux sessions
|
|
- orphan-processes: Orphaned Claude processes
|
|
- wisp-gc: Old wisps past retention
|
|
|
|
**Exit criteria:** Session GC dispatched to dog (if needed)."""
|
|
|
|
[[steps]]
|
|
id = "costs-digest"
|
|
title = "Aggregate daily costs [DISABLED]"
|
|
needs = ["session-gc"]
|
|
description = """
|
|
**⚠️ DISABLED** - Skip this step entirely.
|
|
|
|
Cost tracking is temporarily disabled because Claude Code does not expose
|
|
session costs in a way that can be captured programmatically.
|
|
|
|
**Why disabled:**
|
|
- The `gt costs` command uses tmux capture-pane to find costs
|
|
- Claude Code displays costs in the TUI status bar, not in scrollback
|
|
- All sessions show $0.00 because capture-pane can't see TUI chrome
|
|
- The infrastructure is sound but has no data source
|
|
|
|
**What we need from Claude Code:**
|
|
- Stop hook env var (e.g., `$CLAUDE_SESSION_COST`)
|
|
- Or queryable file/API endpoint
|
|
|
|
**Re-enable when:** Claude Code exposes cost data via API or environment.
|
|
|
|
See: GH#24, gt-7awfj
|
|
|
|
**Exit criteria:** Skip this step - proceed to next."""
|
|
|
|
[[steps]]
|
|
id = "patrol-digest"
|
|
title = "Aggregate daily patrol digests"
|
|
needs = ["costs-digest"]
|
|
description = """
|
|
**DAILY DIGEST** - Aggregate yesterday's patrol cycle digests.
|
|
|
|
Patrol cycles (Deacon, Witness, Refinery) create ephemeral per-cycle digests
|
|
to avoid JSONL pollution. This step aggregates them into a single permanent
|
|
"Patrol Report YYYY-MM-DD" bead for audit purposes.
|
|
|
|
**Step 1: Check if digest is needed**
|
|
```bash
|
|
# Preview yesterday's patrol digests (dry run)
|
|
gt patrol digest --yesterday --dry-run
|
|
```
|
|
|
|
If output shows "No patrol digests found", skip to Step 3.
|
|
|
|
**Step 2: Create the digest**
|
|
```bash
|
|
gt patrol digest --yesterday
|
|
```
|
|
|
|
This:
|
|
- Queries all ephemeral patrol digests from yesterday
|
|
- Creates a single "Patrol Report YYYY-MM-DD" bead with aggregated data
|
|
- Deletes the source digests
|
|
|
|
**Step 3: Verify**
|
|
Daily patrol digests preserve audit trail without per-cycle pollution.
|
|
|
|
**Timing**: Run once per morning patrol cycle. The --yesterday flag ensures
|
|
we don't try to digest today's incomplete data.
|
|
|
|
**Exit criteria:** Yesterday's patrol digests aggregated (or none to aggregate)."""
|
|
|
|
[[steps]]
|
|
id = "log-maintenance"
|
|
title = "Rotate logs and prune state"
|
|
needs = ["patrol-digest"]
|
|
description = """
|
|
Maintain daemon logs and state files.
|
|
|
|
**Step 1: Check daemon.log size**
|
|
```bash
|
|
# Get log file size
|
|
ls -la ~/.beads/daemon*.log 2>/dev/null || ls -la $GT_ROOT/.beads/daemon*.log 2>/dev/null
|
|
```
|
|
|
|
If daemon.log exceeds 10MB:
|
|
```bash
|
|
# Rotate with date suffix and gzip
|
|
LOGFILE="$GT_ROOT/.beads/daemon.log"
|
|
if [ -f "$LOGFILE" ] && [ $(stat -f%z "$LOGFILE" 2>/dev/null || stat -c%s "$LOGFILE") -gt 10485760 ]; then
|
|
DATE=$(date +%Y-%m-%dT%H-%M-%S)
|
|
mv "$LOGFILE" "${LOGFILE%.log}-${DATE}.log"
|
|
gzip "${LOGFILE%.log}-${DATE}.log"
|
|
fi
|
|
```
|
|
|
|
**Step 2: Archive old daemon logs**
|
|
|
|
Clean up daemon logs older than 7 days:
|
|
```bash
|
|
find $GT_ROOT/.beads/ -name "daemon-*.log.gz" -mtime +7 -delete
|
|
```
|
|
|
|
**Step 3: Prune state.json of dead sessions**
|
|
|
|
The state.json tracks active sessions. Prune entries for sessions that no longer exist:
|
|
```bash
|
|
# Check for stale session entries
|
|
gt daemon status --json 2>/dev/null
|
|
```
|
|
|
|
If state.json references sessions not in tmux:
|
|
- Remove the stale entries
|
|
- The daemon's internal cleanup should handle this, but verify
|
|
|
|
**Note**: Log rotation prevents disk bloat from long-running daemons.
|
|
State pruning keeps runtime state accurate."""
|
|
|
|
[[steps]]
|
|
id = "patrol-cleanup"
|
|
title = "End-of-cycle inbox hygiene"
|
|
needs = ["log-maintenance"]
|
|
description = """
|
|
Verify inbox hygiene before ending patrol cycle.
|
|
|
|
**Step 1: Check inbox state**
|
|
```bash
|
|
gt mail inbox
|
|
```
|
|
|
|
Inbox should be EMPTY or contain only just-arrived unprocessed messages.
|
|
|
|
**Step 2: Archive any remaining processed messages**
|
|
|
|
All message types should have been archived during inbox-check processing:
|
|
- WITNESS_PING → archived after acknowledging
|
|
- HELP/Escalation → archived after handling
|
|
- LIFECYCLE → archived after processing
|
|
|
|
If any were missed:
|
|
```bash
|
|
# For each stale message found:
|
|
gt mail archive <message-id>
|
|
```
|
|
|
|
**Goal**: Inbox should have ≤2 active messages at end of cycle.
|
|
Deacon mail should flow through quickly - no accumulation."""
|
|
|
|
[[steps]]
|
|
id = "context-check"
|
|
title = "Check own context limit"
|
|
needs = ["patrol-cleanup"]
|
|
description = """
|
|
Check own context limit.
|
|
|
|
The Deacon runs in a Claude session with finite context. Check if approaching the limit:
|
|
|
|
```bash
|
|
gt context --usage
|
|
```
|
|
|
|
If context is high (>80%), prepare for handoff:
|
|
- Summarize current state
|
|
- Note any pending work
|
|
- Write handoff to molecule state
|
|
|
|
This enables the Deacon to burn and respawn cleanly."""
|
|
|
|
[[steps]]
|
|
id = "loop-or-exit"
|
|
title = "Burn and respawn or loop"
|
|
needs = ["context-check"]
|
|
description = """
|
|
Burn and let daemon respawn, or exit if context high.
|
|
|
|
Decision point at end of patrol cycle:
|
|
|
|
If context is LOW:
|
|
Use await-signal with exponential backoff to wait for activity:
|
|
|
|
```bash
|
|
gt mol step await-signal --agent-bead hq-deacon \
|
|
--backoff-base 60s --backoff-mult 2 --backoff-max 10m
|
|
```
|
|
|
|
This command:
|
|
1. Subscribes to `bd activity --follow` (beads activity feed)
|
|
2. Returns IMMEDIATELY when any beads activity occurs
|
|
3. If no activity, times out with exponential backoff:
|
|
- First timeout: 60s
|
|
- Second timeout: 120s
|
|
- Third timeout: 240s
|
|
- ...capped at 10 minutes max
|
|
4. Tracks `idle:N` label on hq-deacon bead for backoff state
|
|
|
|
**On signal received** (activity detected):
|
|
Reset the idle counter and start next patrol cycle:
|
|
```bash
|
|
gt agent state hq-deacon --set idle=0
|
|
```
|
|
Then return to inbox-check step.
|
|
|
|
**On timeout** (no activity):
|
|
The idle counter was auto-incremented. Continue to next patrol cycle
|
|
(the longer backoff will apply next time). Return to inbox-check step.
|
|
|
|
**Why this approach?**
|
|
- Any `gt` or `bd` command triggers beads activity, waking the Deacon
|
|
- Idle towns let the Deacon sleep longer (up to 10 min between patrols)
|
|
- Active work wakes the Deacon immediately via the feed
|
|
- No polling or fixed sleep intervals
|
|
|
|
If context is HIGH:
|
|
- Write state to persistent storage
|
|
- Exit cleanly
|
|
- Let the daemon orchestrator respawn a fresh Deacon
|
|
|
|
The daemon ensures Deacon is always running:
|
|
```bash
|
|
# Daemon respawns on exit
|
|
gt daemon status
|
|
```
|
|
|
|
This enables infinite patrol duration via context-aware respawning."""
|