docs: clarify polecat three-state model (working/stalled/zombie)

Polecats have exactly three operating conditions - there is no idle pool:
- Working: session active, doing assigned work
- Stalled: session stopped unexpectedly, never nudged back
- Zombie: gt done called but cleanup failed

Key clarifications:
- These are SESSION states; polecat identity persists across sessions
- "Stalled" and "zombie" are detected conditions, not stored states
- The status:idle label only applies to persistent agents, not polecats

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
gastown/crew/joe
2026-01-12 02:20:30 -08:00
committed by Steve Yegge
parent 3247b57926
commit 98b11eda3c
4 changed files with 96 additions and 33 deletions


@@ -8,6 +8,27 @@ Polecats have three distinct lifecycle layers that operate independently. Confus
these layers leads to bugs like "idle polecats" and misunderstanding when
recycling occurs.
## The Three Operating States
Polecats have exactly three operating states. There is **no idle pool**.
| State | Description | How it happens |
|-------|-------------|----------------|
| **Working** | Actively doing assigned work | Normal operation |
| **Stalled** | Session stopped mid-work | Interrupted, crashed, or timed out without being nudged |
| **Zombie** | Completed work but failed to die | `gt done` failed during cleanup |
**The key distinction:** Zombies completed their work; stalled polecats did not.
- **Stalled** = supposed to be working, but stopped. The polecat was interrupted or
crashed and was never nudged back to life. Work is incomplete.
- **Zombie** = finished work, tried to exit via `gt done`, but cleanup failed. The
session should have shut down but didn't. Work is complete, just stuck in limbo.
There is no "idle" state. Polecats don't wait around between tasks. When work is
done, `gt done` shuts down the session. If you see a non-working polecat, something
is broken.
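The three conditions can be sketched as a pure function over what an observer sees. This is a minimal illustration, not the project's detector: `Condition`, `Classify`, and the `sessionActive` flag are hypothetical names, and a real check would presumably allow a short grace period after `gt done` before calling a polecat a zombie.

```go
package main

import "fmt"

// State mirrors the stored polecat states named in this commit.
type State string

const (
	StateWorking State = "working"
	StateDone    State = "done"
)

// Condition is a detected condition, derived on demand and never stored.
// The type and its values are hypothetical names for this sketch.
type Condition string

const (
	Working Condition = "working"
	Stalled Condition = "stalled"
	Zombie  Condition = "zombie"
)

// Classify derives a condition for a polecat that still exists.
// sessionActive is an assumed signal (for example, from tmux monitoring)
// that the session is alive and making progress.
func Classify(stored State, sessionActive bool) Condition {
	switch {
	case stored == StateDone:
		// gt done was called, yet the polecat is still here: cleanup failed.
		return Zombie
	case stored == StateWorking && sessionActive:
		return Working
	default:
		// Supposed to be working, but the session stopped.
		return Stalled
	}
}

func main() {
	fmt.Println(Classify(StateWorking, true))
	fmt.Println(Classify(StateWorking, false))
	fmt.Println(Classify(StateDone, true))
}
```

Note that no branch returns anything like "idle": every input maps to the one healthy condition or to one of the two failure conditions.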
## The Self-Cleaning Polecat Model
**Polecats are responsible for their own cleanup.** When a polecat completes its
@@ -23,7 +44,7 @@ never sit idle. The simple model: **sandbox dies with session**.
### Why Self-Cleaning?
- **No idle polecats** - There's no state where a polecat exists without work
- **Reduced watchdog overhead** - Deacon doesn't need to patrol for zombies
- **Reduced watchdog overhead** - Deacon patrols for stalled/zombie polecats, not idle ones
- **Faster turnover** - Resources freed immediately on completion
- **Simpler mental model** - Done means gone
@@ -158,19 +179,24 @@ during normal operation.
## Anti-Patterns
### Idle Polecats
### "Idle" Polecats (They Don't Exist)
**Myth:** Polecats wait between tasks in an idle state.
**Myth:** Polecats wait between tasks in an idle pool.
**Reality:** Polecats don't exist without work. The lifecycle is:
**Reality:** There is no idle state. Polecats don't exist without work:
1. Work assigned → polecat spawned
2. Work done → polecat nuked
3. There is no idle state
2. Work done → `gt done` → session exits → polecat nuked
3. There is no step 3 where they wait around
If you see a polecat without work, something is broken. Either:
- The hook was lost (bug)
- The session crashed before loading context
- Manual intervention corrupted state
If you see a non-working polecat, it's in a **failure state**:
| What you see | What it is | What went wrong |
|--------------|------------|-----------------|
| Session exists but not working | **Stalled** | Interrupted/crashed, never nudged |
| Session done but didn't exit | **Zombie** | `gt done` failed during cleanup |
Don't call these "idle" - that implies they're waiting for work. They're not.
A stalled polecat is *supposed* to be working. A zombie is *supposed* to be dead.
### Manual State Transitions
@@ -192,20 +218,23 @@ gt polecat nuke Toast # (from Witness, after verification)
Polecats manage their own session lifecycle. The Witness manages sandbox lifecycle.
External manipulation bypasses verification.
### Sandboxes Without Work
### Sandboxes Without Work (Stalled Polecats)
**Anti-pattern:** A sandbox exists but no molecule is hooked.
**Anti-pattern:** A sandbox exists but no molecule is hooked, or the session isn't running.
This means:
- The polecat was spawned incorrectly
- The hook was lost during crash
This is a **stalled** polecat. It means:
- The session crashed and wasn't nudged back to life
- The hook was lost during a crash
- State corruption occurred
This is NOT an "idle" polecat waiting for work. It's stalled - supposed to be
working but stopped unexpectedly.
**Recovery:**
```bash
# From Witness:
gt polecat nuke Toast # Clean slate
gt sling gt-abc gastown # Respawn with work
gt polecat nuke Toast # Clean up the stalled polecat
gt sling gt-abc gastown # Respawn with fresh polecat
```
### Confusing Session with Sandbox
@@ -244,10 +273,10 @@ The Witness monitors polecats but does NOT:
- Nuke polecats (polecats self-nuke via `gt done`)
The Witness DOES:
- Detect and nudge stalled polecats (sessions that stopped unexpectedly)
- Clean up zombie polecats (sessions where `gt done` failed)
- Respawn crashed sessions
- Nudge stuck polecats
- Handle escalations
- Clean up orphaned polecats (crash before `gt done`)
- Handle escalations from stuck polecats (polecats that explicitly asked for help)
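The split of responsibilities above can be read as a mapping from what the Witness finds to what it does. The sketch below assumes hypothetical `Finding` and `Action` types; it does not reflect the Witness's actual code and only restates the list as a lookup.

```go
package main

import "fmt"

// Finding is what the Witness observes about a polecat (hypothetical type).
type Finding string

const (
	FindingWorking Finding = "working" // healthy, nothing to do
	FindingStalled Finding = "stalled" // detected: session stopped unexpectedly
	FindingZombie  Finding = "zombie"  // detected: gt done failed during cleanup
	FindingStuck   Finding = "stuck"   // stored: polecat explicitly asked for help
)

// Action is the Witness's response (hypothetical type).
type Action string

const (
	ActionNone     Action = "none"
	ActionNudge    Action = "nudge"
	ActionCleanUp  Action = "clean-up"
	ActionEscalate Action = "handle-escalation"
)

// ActionFor restates the "Witness DOES" list as a lookup.
func ActionFor(f Finding) Action {
	switch f {
	case FindingStalled:
		return ActionNudge
	case FindingZombie:
		return ActionCleanUp
	case FindingStuck:
		return ActionEscalate
	default:
		return ActionNone
	}
}

func main() {
	for _, f := range []Finding{FindingWorking, FindingStalled, FindingZombie, FindingStuck} {
		fmt.Printf("%s -> %s\n", f, ActionFor(f))
	}
}
```

Keeping "stuck" separate from "stalled" matters here: the first is a request the polecat made itself, the second is a condition found from outside.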
## Polecat Identity


@@ -67,7 +67,12 @@ Events capture the full history. Labels cache the current state for fast queries
Labels use `<dimension>:<value>` format:
- `patrol:muted` / `patrol:active`
- `mode:degraded` / `mode:normal`
- `status:idle` / `status:working`
- `status:idle` / `status:working` (for persistent agents only - see note)
**Note on polecats:** The `status:idle` label does NOT apply to polecats. Polecats
have no idle state - they're either working, stalled (stopped unexpectedly), or
zombie (`gt done` failed). This label is for persistent agents like Deacon, Witness,
and Crew members who can legitimately be idle between tasks.
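As an illustration of the `<dimension>:<value>` format and the polecat restriction, here is a small sketch. `Label`, `ParseLabel`, `AllowedFor`, and the role string `"polecat"` are assumptions made for this example, not names from the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// Label is a parsed "<dimension>:<value>" pair (hypothetical type).
type Label struct {
	Dimension string
	Value     string
}

// ParseLabel splits a label at the first colon and rejects empty parts.
func ParseLabel(s string) (Label, error) {
	dim, val, ok := strings.Cut(s, ":")
	if !ok || dim == "" || val == "" {
		return Label{}, fmt.Errorf("label %q is not in <dimension>:<value> form", s)
	}
	return Label{Dimension: dim, Value: val}, nil
}

// AllowedFor reports whether a label makes sense for a role.
// The only rule encoded is the one from the note: status:idle never
// applies to polecats. The role names are assumed.
func AllowedFor(role string, l Label) bool {
	return !(role == "polecat" && l.Dimension == "status" && l.Value == "idle")
}

func main() {
	l, err := ParseLabel("status:idle")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("witness:", AllowedFor("witness", l))
	fmt.Println("polecat:", AllowedFor("polecat", l))
}
```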
### State Change Flow


@@ -3,20 +3,41 @@ package polecat
import "time"
// State represents the current state of a polecat.
// In the transient model, polecats exist only while working.
// State represents the current session state of a polecat.
//
// IMPORTANT: There is NO idle state. Polecats have three operating conditions:
//
// - Working: Session active, doing assigned work (normal operation)
// - Stalled: Session stopped unexpectedly, was never nudged back to life
// - Zombie: Session called 'gt done' but cleanup failed - tried to die but couldn't
//
// The distinction matters: zombies completed their work; stalled polecats did not.
// Neither is "idle" - stalled polecats are SUPPOSED to be working, zombies are
// SUPPOSED to be dead. There is no idle pool where polecats wait for work.
//
// Note: These are SESSION states. The polecat IDENTITY (CV chain, mailbox, work
// history) persists across sessions. A stalled or zombie session doesn't destroy
// the polecat's identity - it just means the session needs intervention.
//
// "Stalled" and "zombie" are detected conditions, not stored states. The Witness
// detects them through monitoring (tmux state, age in StateDone, etc.).
type State string
const (
// StateWorking means the polecat is actively working on an issue.
// StateWorking means the polecat session is actively working on an issue.
// This is the initial and primary state for transient polecats.
// Working is the ONLY healthy operating state - there is no idle pool.
StateWorking State = "working"
// StateDone means the polecat has completed its assigned work
// and is ready for cleanup by the Witness.
// StateDone means the polecat has completed its assigned work and called
// 'gt done'. This is normally a transient state - the session should exit
// immediately after. If a polecat remains in StateDone, it's a "zombie":
// the cleanup failed and the session is stuck.
StateDone State = "done"
// StateStuck means the polecat needs assistance.
// StateStuck means the polecat has explicitly signaled it needs assistance.
// This is an intentional request for help from the polecat itself.
// Different from "stalled" (detected externally when session stops working).
StateStuck State = "stuck"
// StateActive is deprecated: use StateWorking.


@@ -55,7 +55,12 @@ You:
- Nuke your own sandbox and session
- Exit immediately
There is no idle state. Done means gone.
**There is no idle state.** Polecats have exactly three operating states:
- **Working** - actively doing assigned work (normal)
- **Stalled** - session stopped mid-work (failure: should be working)
- **Zombie** - `gt done` failed during cleanup (failure: should be dead)
Done means gone. If `gt done` succeeds, you cease to exist.
**Important:** Your molecule already has step beads. Use `bd ready` to find them.
Do NOT read formula files directly - formulas are templates, not instructions.
@@ -167,9 +172,10 @@ The `gt done` command (self-cleaning):
- Nukes your sandbox (worktree cleanup)
- Exits your session immediately
**You are gone after `gt done`.** No idle waiting. The Refinery will merge
your work from the MQ. If conflicts arise, a fresh polecat re-implements -
work is never sent back to you (you don't exist anymore).
**You are gone after `gt done`.** The session shuts down - there's no idle state
where you wait for more work. The Refinery will merge your work from the MQ.
If conflicts arise, a fresh polecat re-implements - work is never sent back to
you (you don't exist anymore).
### No PRs in Maintainer Repos
@@ -236,8 +242,10 @@ If you forget to handoff:
- Work continues from hook (molecule state preserved)
- No work is lost
**The Witness role**: Witness monitors for stuck polecats (long idle on same step)
but does NOT force recycle between steps. You manage your own session lifecycle.
**The Witness role**: Witness monitors for stalled polecats (sessions that stopped
unexpectedly) but does NOT force recycle between steps. You manage your own session
lifecycle. Note: "stalled" means you stopped when you should be working - it's not
an idle state.
---