fix: complete removal of agent_state observable tracking (gt-zecmc)

Additional cleanup from the agent_state refactoring:

- Remove dead code: checkStaleAgents(), markAgentDead() in lifecycle.go
- Remove dead code: reportAgentState(), getAgentFields() in prime.go
- Update getAgentBeadState() comment to clarify non-observable states only
- Update mol-witness-patrol.formula.toml to use tmux discovery
- Update mol-polecat-lease.formula.toml to use POLECAT_DONE mail
- Update docs/watchdog-chain.md to reflect new architecture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
gastown/crew/joe
2026-01-06 20:42:11 -08:00
committed by Steve Yegge
parent 6e84489ca3
commit 87169a3fc7
5 changed files with 28 additions and 198 deletions


@@ -62,7 +62,8 @@ Polecat is actively working. Monitor for stuck or completion.
 **Periodic checks:**
 - Use standard nudge protocol from Witness CLAUDE.md
-- Watch for POLECAT_DONE mail or agent_state=done
+- Watch for POLECAT_DONE mail (primary completion signal)
+- Check tmux session: `gt session status {{rig}}/polecats/{{polecat}}`
 **Signs of progress:**
 - Git commits appearing
@@ -73,11 +74,12 @@ Polecat is actively working. Monitor for stuck or completion.
 - Idle >15 minutes
 - Repeated errors
 - Explicit "I'm stuck" messages
+- Agent bead shows stuck state: `bd show <agent-bead-id>`
-**If POLECAT_DONE received or agent_state=done:**
+**If POLECAT_DONE mail received:**
 Proceed to verifying step.
-**Exit criteria:** Polecat signals completion (POLECAT_DONE mail or state=done)."""
+**Exit criteria:** Polecat signals completion via POLECAT_DONE mail."""
 [[steps]]
 id = "verifying"


@@ -20,7 +20,7 @@ needs = ['process-cleanups']
 title = 'Ensure refinery is alive'
 [[steps]]
-description = "Survey all polecats using agent beads (ZFC: trust what agents report).\n\n**Step 1: List polecat agent beads**\n\n```bash\nbd list --type=agent --json\n```\n\nFilter the JSON output for entries where description contains `role_type: polecat`.\nEach polecat agent bead has fields in its description:\n- `role_type: polecat`\n- `rig: <rig-name>`\n- `agent_state: running|idle|stuck|done`\n- `hook_bead: <current-work-id>`\n\n**Step 2: For each polecat, check agent_state**\n\n| agent_state | Meaning | Action |\n|-------------|---------|--------|\n| running | Actively working | Check progress (Step 3) |\n| idle | No work assigned | Auto-nuke if clean (Step 3a) |\n| stuck | Self-reported stuck | Handle stuck protocol |\n| done | Work complete | Verify cleanup triggered (see Step 4a) |\n\n**Step 3: For running polecats, assess progress**\n\nCheck the hook_bead field to see what they're working on:\n```bash\nbd show <hook_bead> # See current step/issue\n```\n\nYou can also verify they're responsive:\n```bash\ntmux capture-pane -t gt-<rig>-<name> -p | tail -20\n```\n\nLook for:\n- Recent tool activity → making progress\n- Idle at prompt → may need nudge\n- Error messages → may need help\n\n**Step 3a: For idle polecats, auto-nuke if clean**\n\nWhen agent_state=idle, the polecat has no work assigned. Check if it's safe to nuke:\n\n```bash\n# Check git status in the polecat's worktree\ncd polecats/<name>\ngit status --porcelain # Should be empty (clean)\ngit log origin/main..HEAD # Should have no unpushed commits\n```\n\n**If clean** (no uncommitted changes, no unpushed commits):\n```bash\n# Safe to nuke - no work to lose\ngt polecat nuke <name>\n```\nLog the auto-nuke for audit purposes. No escalation needed.\n\n**If dirty** (uncommitted or unpushed work):\n```bash\n# Escalate to Mayor - polecat has work that might be valuable\ngt mail send mayor/ -s \\\"IDLE_DIRTY: <polecat> has uncommitted work\\\" \\\n -m \\\"Polecat: <name>\nState: idle (no hook_bead)\nGit status: <uncommitted-files>\nUnpushed commits: <count>\n\nPlease advise: recover work or discard?\\\"\n```\n\n**Rationale**: Idle polecats with clean git state are pure overhead. They have\nno work and no state worth preserving. Nuking them immediately frees resources\nand reduces noise. Only escalate when there's actual work at risk.\n\n**Step 4: Decide action**\n\n| Observation | Action |\n|-------------|--------|\n| agent_state=running, recent activity | None |\n| agent_state=running, idle 5-15 min | Gentle nudge |\n| agent_state=running, idle 15+ min | Direct nudge with deadline |\n| agent_state=stuck | Assess and help or escalate |\n| agent_state=done | Verify cleanup triggered (see Step 4a) |\n\n**Step 4a: Handle agent_state=done**\n\nIn the ephemeral model, polecats with agent_state=done and cleanup_status=clean\nshould already be nuked by HandlePolecatDone. Finding one here indicates:\n\n1. **Stale agent bead** - polecat was nuked but bead remains\n ```bash\n # Verify polecat doesn't exist anymore\n ls polecats/<name> 2>/dev/null || echo \"Already nuked\"\n ```\n If nuked, the agent bead is stale. Clean it up or ignore.\n\n2. **Cleanup wisp exists** - polecat has dirty state needing intervention\n ```bash\n bd list --wisp --labels=polecat:<name> --status=open\n ```\n Process in process-cleanups step.\n\n3. **No wisp, polecat exists** - POLECAT_DONE mail was missed\n Try auto-nuke directly (ephemeral model):\n ```bash\n # Check cleanup_status and nuke if clean\n gt polecat nuke <name> # Will fail if dirty\n ```\n If nuke fails (dirty state), create cleanup wisp for investigation.\n\n**Step 5: Execute nudges**\n```bash\ngt nudge <rig>/polecats/<name> \"How's progress? Need help?\"\n```\n\n**Step 6: Escalate if needed**\n```bash\ngt mail send mayor/ -s \"Escalation: <polecat> stuck\" \\\n -m \"Polecat <name> reports stuck. Please intervene.\"\n```\n\n**Parallelism**: Use Task tool subagents to inspect multiple polecats concurrently.\n\n**ZFC Principle**: Trust agent_state from beads. Don't infer state from PID/tmux."
+description = "Survey all polecats using tmux (discover, don't track).\n\n**Principle**: Agent liveness is discovered from tmux, not recorded in beads.\nOnly non-observable states like 'stuck' are stored in beads.\n\n**Step 1: List polecat sessions**\n\n```bash\n# Find all polecat tmux sessions for this rig\ntmux list-sessions -F '#{session_name}' 2>/dev/null | grep \"^gt-<rig>-\"\n```\n\nFor each session, check if Claude is actively running:\n```bash\n# Check if Claude process is running in the session\ngt session status <rig>/polecats/<name>\n```\n\n**Step 2: For each polecat, assess state**\n\n| Observation | Meaning | Action |\n|-------------|---------|--------|\n| Session exists, Claude running | Actively working | Check progress (Step 3) |\n| Session exists, Claude not running | Zombie session | Kill and respawn if work hooked |\n| No session | Not running | Check if should exist |\n\nAlso check the agent bead for non-observable state:\n```bash\nbd show <agent-bead-id> # Check for stuck, awaiting-gate, etc.\n```\n\n**Step 3: For running polecats, assess progress**\n\nCheck hook_bead to see what they're working on:\n```bash\nbd show <agent-bead-id> # Look at hook_bead field\n```\n\nCapture pane to assess activity:\n```bash\ntmux capture-pane -t gt-<rig>-<name> -p | tail -20\n```\n\nLook for:\n- Recent tool activity → making progress\n- Idle at prompt → may need nudge\n- Error messages → may need help\n\n**Step 3a: For idle polecats (no hook_bead), auto-nuke if clean**\n\nWhen a polecat has no hook_bead, it has no assigned work. Check if safe to nuke:\n\n```bash\n# Check git status in the polecat's worktree\ncd polecats/<name>\ngit status --porcelain # Should be empty (clean)\ngit log origin/main..HEAD # Should have no unpushed commits\n```\n\n**If clean** (no uncommitted changes, no unpushed commits):\n```bash\n# Safe to nuke - no work to lose\ngt polecat nuke <name>\n```\nLog the auto-nuke for audit purposes. No escalation needed.\n\n**If dirty** (uncommitted or unpushed work):\n```bash\n# Escalate to Mayor - polecat has work that might be valuable\ngt mail send mayor/ -s \\\"IDLE_DIRTY: <polecat> has uncommitted work\\\" \\\n -m \\\"Polecat: <name>\nState: no work hooked\nGit status: <uncommitted-files>\nUnpushed commits: <count>\n\nPlease advise: recover work or discard?\\\"\n```\n\n**Rationale**: Idle polecats with clean git state are pure overhead. They have\nno work and no state worth preserving. Nuking them immediately frees resources\nand reduces noise. Only escalate when there's actual work at risk.\n\n**Step 4: Decide action**\n\n| Observation | Action |\n|-------------|--------|\n| Running, recent activity | None |\n| Running, idle 5-15 min | Gentle nudge |\n| Running, idle 15+ min | Direct nudge with deadline |\n| Bead shows stuck | Assess and help or escalate |\n| POLECAT_DONE mail received | Verify cleanup (see Step 4a) |\n\n**Step 4a: Handle POLECAT_DONE**\n\nIn the ephemeral model, POLECAT_DONE triggers immediate cleanup:\n\n1. **Check cleanup_status** in agent bead\n2. **If clean**: Auto-nuke immediately\n ```bash\n gt polecat nuke <name>\n ```\n3. **If dirty**: Create cleanup wisp for manual intervention\n\n**Step 5: Execute nudges**\n```bash\ngt nudge <rig>/polecats/<name> \"How's progress? Need help?\"\n```\n\n**Step 6: Escalate if needed**\n```bash\ngt mail send mayor/ -s \"Escalation: <polecat> stuck\" \\\n -m \"Polecat <name> reports stuck. Please intervene.\"\n```\n\n**Parallelism**: Use Task tool subagents to inspect multiple polecats concurrently.\n\n**Discovery Principle**: Derive state from tmux. Only trust bead state for\nnon-observable conditions like 'stuck' or 'awaiting-gate'."
 id = 'survey-workers'
 needs = ['check-refinery']
 title = 'Inspect all active polecats'