diff --git a/docs/concepts/polecat-lifecycle.md b/docs/concepts/polecat-lifecycle.md
index ddcdf158..4ea3ec36 100644
--- a/docs/concepts/polecat-lifecycle.md
+++ b/docs/concepts/polecat-lifecycle.md
@@ -8,6 +8,27 @@ Polecats have three distinct lifecycle layers that operate independently. Confus
 these layers leads to bugs like "idle polecats" and misunderstanding when
 recycling occurs.
 
+## The Three Operating States
+
+Polecats have exactly three operating states. There is **no idle pool**.
+
+| State | Description | How it happens |
+|-------|-------------|----------------|
+| **Working** | Actively doing assigned work | Normal operation |
+| **Stalled** | Session stopped mid-work | Interrupted, crashed, or timed out without being nudged |
+| **Zombie** | Completed work but failed to die | `gt done` failed during cleanup |
+
+**The key distinction:** Zombies completed their work; stalled polecats did not.
+
+- **Stalled** = supposed to be working, but stopped. The polecat was interrupted or
+  crashed and was never nudged back to life. Work is incomplete.
+- **Zombie** = finished work, tried to exit via `gt done`, but cleanup failed. The
+  session should have shut down but didn't. Work is complete, just stuck in limbo.
+
+There is no "idle" state. Polecats don't wait around between tasks. When work is
+done, `gt done` shuts down the session. If you see a non-working polecat, something
+is broken.
+
 ## The Self-Cleaning Polecat Model
 
 **Polecats are responsible for their own cleanup.** When a polecat completes its
@@ -23,7 +44,7 @@ never sit idle. The simple model: **sandbox dies with session**.
 ### Why Self-Cleaning?
 
 - **No idle polecats** - There's no state where a polecat exists without work
-- **Reduced watchdog overhead** - Deacon doesn't need to patrol for zombies
+- **Reduced watchdog overhead** - Deacon patrols for stalled/zombie polecats, not idle ones
 - **Faster turnover** - Resources freed immediately on completion
 - **Simpler mental model** - Done means gone
 
@@ -158,19 +179,24 @@ during normal operation.
 
 ## Anti-Patterns
 
-### Idle Polecats
+### "Idle" Polecats (They Don't Exist)
 
-**Myth:** Polecats wait between tasks in an idle state.
+**Myth:** Polecats wait between tasks in an idle pool.
 
-**Reality:** Polecats don't exist without work. The lifecycle is:
+**Reality:** There is no idle state. Polecats don't exist without work:
 1. Work assigned → polecat spawned
-2. Work done → polecat nuked
-3. There is no idle state
+2. Work done → `gt done` → session exits → polecat nuked
+3. There is no step 3 where they wait around
 
-If you see a polecat without work, something is broken. Either:
-- The hook was lost (bug)
-- The session crashed before loading context
-- Manual intervention corrupted state
+If you see a non-working polecat, it's in a **failure state**:
+
+| What you see | What it is | What went wrong |
+|--------------|------------|-----------------|
+| Session exists but not working | **Stalled** | Interrupted/crashed, never nudged |
+| Session done but didn't exit | **Zombie** | `gt done` failed during cleanup |
+
+Don't call these "idle" - that implies they're waiting for work. They're not.
+A stalled polecat is *supposed* to be working. A zombie is *supposed* to be dead.
 
 ### Manual State Transitions
 
@@ -192,20 +218,23 @@ gt polecat nuke Toast # (from Witness, after verification)
 Polecats manage their own session lifecycle. The Witness manages sandbox lifecycle.
 External manipulation bypasses verification.
 
-### Sandboxes Without Work
+### Sandboxes Without Work (Stalled Polecats)
 
-**Anti-pattern:** A sandbox exists but no molecule is hooked.
+**Anti-pattern:** A sandbox exists but no molecule is hooked, or the session isn't running.
 
-This means:
-- The polecat was spawned incorrectly
-- The hook was lost during crash
+This is a **stalled** polecat. It means:
+- The session crashed and wasn't nudged back to life
+- The hook was lost during a crash
 - State corruption occurred
 
+This is NOT an "idle" polecat waiting for work. It's stalled - supposed to be
+working but stopped unexpectedly.
+
 **Recovery:**
 ```bash
 # From Witness:
-gt polecat nuke Toast # Clean slate
-gt sling gt-abc gastown # Respawn with work
+gt polecat nuke Toast # Clean up the stalled polecat
+gt sling gt-abc gastown # Respawn with fresh polecat
 ```
 
 ### Confusing Session with Sandbox
@@ -244,10 +273,10 @@ The Witness monitors polecats but does NOT:
 - Nuke polecats (polecats self-nuke via `gt done`)
 
 The Witness DOES:
+- Detect and nudge stalled polecats (sessions that stopped unexpectedly)
+- Clean up zombie polecats (sessions where `gt done` failed)
 - Respawn crashed sessions
-- Nudge stuck polecats
-- Handle escalations
-- Clean up orphaned polecats (crash before `gt done`)
+- Handle escalations from stuck polecats (polecats that explicitly asked for help)
 
 ## Polecat Identity
 
diff --git a/docs/design/operational-state.md b/docs/design/operational-state.md
index c0f602ef..2c25b60f 100644
--- a/docs/design/operational-state.md
+++ b/docs/design/operational-state.md
@@ -67,7 +67,12 @@ Events capture the full history. Labels cache the current state for fast queries
 Labels use `:` format:
 - `patrol:muted` / `patrol:active`
 - `mode:degraded` / `mode:normal`
-- `status:idle` / `status:working`
+- `status:idle` / `status:working` (for persistent agents only - see note)
+
+**Note on polecats:** The `status:idle` label does NOT apply to polecats. Polecats
+have no idle state - they're either working, stalled (stopped unexpectedly), or
+zombie (`gt done` failed). This label is for persistent agents like Deacon, Witness,
+and Crew members who can legitimately be idle between tasks.
 
 ### State Change Flow
 
diff --git a/internal/polecat/types.go b/internal/polecat/types.go
index 48bfc709..1fd4e7f1 100644
--- a/internal/polecat/types.go
+++ b/internal/polecat/types.go
@@ -3,20 +3,41 @@ package polecat
 
 import "time"
 
-// State represents the current state of a polecat.
-// In the transient model, polecats exist only while working.
+// State represents the current session state of a polecat.
+//
+// IMPORTANT: There is NO idle state. Polecats have three operating conditions:
+//
+// - Working: Session active, doing assigned work (normal operation)
+// - Stalled: Session stopped unexpectedly, was never nudged back to life
+// - Zombie: Session called 'gt done' but cleanup failed - tried to die but couldn't
+//
+// The distinction matters: zombies completed their work; stalled polecats did not.
+// Neither is "idle" - stalled polecats are SUPPOSED to be working, zombies are
+// SUPPOSED to be dead. There is no idle pool where polecats wait for work.
+//
+// Note: These are SESSION states. The polecat IDENTITY (CV chain, mailbox, work
+// history) persists across sessions. A stalled or zombie session doesn't destroy
+// the polecat's identity - it just means the session needs intervention.
+//
+// "Stalled" and "zombie" are detected conditions, not stored states. The Witness
+// detects them through monitoring (tmux state, age in StateDone, etc.).
 type State string
 
 const (
-	// StateWorking means the polecat is actively working on an issue.
+	// StateWorking means the polecat session is actively working on an issue.
 	// This is the initial and primary state for transient polecats.
+	// Working is the ONLY healthy operating state - there is no idle pool.
 	StateWorking State = "working"
 
-	// StateDone means the polecat has completed its assigned work
-	// and is ready for cleanup by the Witness.
+	// StateDone means the polecat has completed its assigned work and called
+	// 'gt done'. This is normally a transient state - the session should exit
+	// immediately after. If a polecat remains in StateDone, it's a "zombie":
+	// the cleanup failed and the session is stuck.
 	StateDone State = "done"
 
-	// StateStuck means the polecat needs assistance.
+	// StateStuck means the polecat has explicitly signaled it needs assistance.
+	// This is an intentional request for help from the polecat itself.
+	// Different from "stalled" (detected externally when session stops working).
 	StateStuck State = "stuck"
 
 	// StateActive is deprecated: use StateWorking.
diff --git a/templates/polecat-CLAUDE.md b/templates/polecat-CLAUDE.md
index debac2f3..784f6ec6 100644
--- a/templates/polecat-CLAUDE.md
+++ b/templates/polecat-CLAUDE.md
@@ -55,7 +55,12 @@ You:
 - Nuke your own sandbox and session
 - Exit immediately
 
-There is no idle state. Done means gone.
+**There is no idle state.** Polecats have exactly three operating states:
+- **Working** - actively doing assigned work (normal)
+- **Stalled** - session stopped mid-work (failure: should be working)
+- **Zombie** - `gt done` failed during cleanup (failure: should be dead)
+
+Done means gone. If `gt done` succeeds, you cease to exist.
 
 **Important:** Your molecule already has step beads. Use `bd ready` to find them.
 Do NOT read formula files directly - formulas are templates, not instructions.
@@ -167,9 +172,10 @@ The `gt done` command (self-cleaning):
 - Nukes your sandbox (worktree cleanup)
 - Exits your session immediately
 
-**You are gone after `gt done`.** No idle waiting. The Refinery will merge
-your work from the MQ. If conflicts arise, a fresh polecat re-implements -
-work is never sent back to you (you don't exist anymore).
+**You are gone after `gt done`.** The session shuts down - there's no idle state
+where you wait for more work. The Refinery will merge your work from the MQ.
+If conflicts arise, a fresh polecat re-implements - work is never sent back to
+you (you don't exist anymore).
 
 ### No PRs in Maintainer Repos
 
@@ -236,8 +242,10 @@ If you forget to handoff:
 - Work continues from hook (molecule state preserved)
 - No work is lost
 
-**The Witness role**: Witness monitors for stalled polecats (long idle on same step)
-but does NOT force recycle between steps. You manage your own session lifecycle.
+**The Witness role**: Witness monitors for stalled polecats (sessions that stopped
+unexpectedly) but does NOT force recycle between steps. You manage your own session
+lifecycle. Note: "stalled" means you stopped when you should be working - it's not
+an idle state.
 
 ---