# Swarm Shutdown Design
Design for graceful swarm shutdown, worker cleanup, and session cycling.
Epic: gt-82y (Design: Swarm shutdown and worker cleanup)
## Key Decisions (from ultrathink)
- **Pre-kill verification uses model intelligence**: the Witness assesses `git status` output itself rather than applying framework rules.
- **Witness can request a restart**: it mails itself handoff notes and exits cleanly when its context is filling up.
- **Mayor is NOT involved in per-worker cleanup**: that is the Witness's domain.
- **Polecats verify themselves first**: the decommission checklist lives in their prompting, and the Witness double-checks.
## Responsibility Boundaries (gt-gl2)
### Mayor Responsibilities
- Swarm dispatch and strategic planning
- Cross-rig coordination
- Escalation handling (when Witness reports blocked workers)
- Final integration decisions
- NOT: Per-worker cleanup, session killing, nudging
### Witness Responsibilities
- Monitor worker health and progress
- Nudge workers toward completion
- Pre-kill verification (capture & assess git status)
- Session lifecycle (kill, restart workers)
- Self session cycling (mail handoff, exit)
- Report blocked workers to Mayor for escalation
- NOT: Implementation work, cross-rig coordination
### Polecat Responsibilities
- Complete assigned work
- Self-verify before signaling done (decommission checklist)
- Respond to Witness nudges
- NOT: Killing own session, coordinating with other polecats directly
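The boundaries above can be read as a simple permission table. A minimal sketch, assuming hypothetical action labels (`kill_session`, `nudge`, and so on) that are illustrative only and not Gas Town identifiers; only the split between roles comes from the lists above:

```python
# Hypothetical action labels; the role split mirrors the responsibility lists.
ALLOWED = {
    "mayor": {
        "dispatch_swarm", "cross_rig_coordination",
        "handle_escalation", "integration_decision",
    },
    "witness": {
        "monitor", "nudge", "pre_kill_verification", "kill_session",
        "restart_worker", "self_cycle", "report_blocked",
    },
    "polecat": {"implement", "self_verify", "respond_to_nudge"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if `role` owns `action` under the boundaries above."""
    return action in ALLOWED.get(role, set())


print(is_allowed("witness", "kill_session"))  # True
print(is_allowed("mayor", "kill_session"))    # False
```

Keeping the table explicit makes the "NOT" entries checkable: anything absent from a role's set is outside its authority.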
## Subtask Designs
### gt-sd6: Enhanced Polecat Decommission Prompting
Add to polecat CLAUDE.md template (AGENTS.md.template):
````markdown
## Decommission Checklist
**CRITICAL**: Before signaling you are done, you MUST complete this checklist.
The Witness will verify each item and bounce you back if anything is dirty.
### Pre-Done Verification
Run these commands and verify ALL are clean:
```bash
# 1. Git status - must be clean (no uncommitted changes)
git status
# Expected: "nothing to commit, working tree clean"
# 2. Stash list - must be empty (no forgotten stashes)
git stash list
# Expected: (empty output)
# 3. Beads sync - must be up to date
bd sync --status
# Expected: "Up to date" or "Nothing to sync"
# 4. Branch merged - your work must be on main
git log main --oneline -1
git log HEAD --oneline -1
# Expected: Same commit (your branch is merged)
```
### If Any Check Fails
- **Uncommitted changes**: Commit them or discard if truly unnecessary
- **Stashes**: Pop and commit, or drop if obsolete
- **Beads out of sync**: Run `bd sync`
- **Branch not merged**: Complete the merge workflow
### Signaling Done
Only after ALL checks pass:
```bash
# Close your issue
bd close <issue-id>
# Final sync
bd sync
# Signal ready for decommission
town mail send <rig>/witness -s "Work Complete" -m "Issue <id> done. Checklist verified."
```
The Witness will capture your git state and verify before killing your session. If anything is dirty, you'll receive a nudge with specific issues to fix.
````
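The checklist lends itself to a small self-check a polecat could run before signaling done. A minimal sketch, assuming the outputs of the four commands have already been captured as strings; the function name `checklist_issues` and its parameters are hypothetical, and the expected phrases are the ones quoted in the checklist:

```python
from typing import List


def checklist_issues(git_status: str, stash_list: str, sync_status: str,
                     main_commit: str, head_commit: str) -> List[str]:
    """Return the failed checklist items; an empty list means all checks pass."""
    issues = []
    if "nothing to commit, working tree clean" not in git_status:
        issues.append("uncommitted changes")
    if stash_list.strip():
        issues.append("stash not empty")
    if not any(ok in sync_status for ok in ("Up to date", "Nothing to sync")):
        issues.append("beads out of sync")
    if main_commit.strip() != head_commit.strip():
        issues.append("branch not merged")
    return issues


clean = checklist_issues(
    git_status="On branch main\nnothing to commit, working tree clean\n",
    stash_list="",
    sync_status="Up to date",
    main_commit="abc1234 Fix parser",
    head_commit="abc1234 Fix parser",
)
print(clean)  # []
```

This only covers the polecat's own pass; the Witness still assesses the captured output with its own judgment.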
---
### gt-f8v: Witness Pre-Kill Verification Protocol
Add to Witness CLAUDE.md template:
````markdown
## Pre-Kill Verification Protocol
Before killing any worker session, you MUST verify their workspace is clean.
Use your judgment on the output - don't rely on pattern matching.
### Verification Steps
When a worker signals done:
1. **Capture worker state**:
   ```bash
   # Attach and capture git status
   town capture <polecat> "git status && git stash list && git log --oneline -3"
   ```
2. **Assess the output** (use your judgment):
   - Is working tree clean? (no modified/untracked files that matter)
   - Is stash list empty? (or only contains intentional stashes)
   - Does recent history show their work is committed?
3. **Decision**:
   - **CLEAN**: Proceed to kill session
   - **DIRTY**: Send nudge with specific issues
### Nudge Templates
**Uncommitted Changes**:
```bash
town inject <polecat> "WITNESS CHECK: You have uncommitted changes. Please commit or discard: <list files>. Signal done again when clean."
```
**Stash Not Empty**:
```bash
town inject <polecat> "WITNESS CHECK: You have stashed changes. Please pop and commit, or drop if obsolete: <stash list>. Signal done again when clean."
```
**Work Not Merged**:
```bash
town inject <polecat> "WITNESS CHECK: Your commits are not on main. Please complete merge workflow. Signal done again when merged."
```
**Multiple Issues**:
```bash
town inject <polecat> "WITNESS CHECK: Multiple issues found:
1. <issue 1>
2. <issue 2>
Please resolve all and signal done again."
```
### Kill Sequence
Only after verification passes:
```bash
# Log the verification
echo "[$(date)] Verified clean: <polecat>" >> witness/verification.log
# Kill the session
town kill <polecat>
# Update state
town sleep <polecat>
```
### Escalation
If a worker fails verification 3+ times or becomes unresponsive:
```bash
town mail send mayor/ -s "Escalation: <polecat> stuck" -m "Worker <polecat> cannot complete cleanup after 3 attempts. Issues: <list>. Requesting guidance."
```
````
---
### gt-eu9: Witness Session Cycling and Handoff
Add to Witness CLAUDE.md template:
````markdown
## Session Cycling
Your context will fill over long swarms. When you notice significant context usage
or feel you're losing track of state, proactively cycle your session.
### Recognizing When to Cycle
Signs you should cycle:
- You've been running for many hours
- You're losing track of which workers you've checked
- Responses are getting slower or less coherent
- You're about to start a complex operation
### Handoff Protocol
1. **Capture current state**:
   ```bash
   # Check all worker states
   town list .
   # Check pending verifications
   town all beads
   # Check your inbox for unprocessed messages
   town inbox
   ```
2. **Compose handoff note**:
   ```bash
   # The heredoc delimiter is unquoted so that $(date -Iseconds) expands
   town mail send <rig>/witness -s "Session Handoff" -m "$(cat <<EOF
   [HANDOFF_TYPE]: witness_cycle
   [TIMESTAMP]: $(date -Iseconds)
   [RIG]: <rig>
   ## Active Workers
   <list workers and their current status>
   ## Pending Verifications
   <workers who signaled done but not yet verified>
   ## Recent Actions
   <last 3-5 actions taken>
   ## Warnings/Notes
   <anything the next session should know>
   ## Next Steps
   <what should happen next>
   EOF
   )"
   ```
3. **Exit cleanly**:
   ```bash
   # Ensure no pending operations
   # Then simply end your session - the daemon will spawn a fresh one
   ```
### Handoff Note Format
The handoff note uses metadata format for parseability:
```
[HANDOFF_TYPE]: witness_cycle
[TIMESTAMP]: 2024-01-15T10:30:00Z
[RIG]: gastown
## Active Workers
- Furiosa: working on gt-abc1 (spawned 2h ago)
- Toast: idle, awaiting assignment
- Capable: signaled done, pending verification
## Pending Verifications
- Capable: signaled done at 10:25, not yet verified
## Recent Actions
1. Verified and killed Nux (gt-xyz9 complete)
2. Spawned Furiosa on gt-abc1
3. Received done signal from Capable
## Warnings/Notes
- Furiosa has been quiet for 30min, may need nudge
- Integration branch has 3 merged PRs
## Next Steps
1. Verify Capable's workspace
2. Check on Furiosa's progress
3. Report status to Refinery if all workers done
```
### On Fresh Session Start
When you start (or restart after cycling):
1. **Check for handoff**:
   ```bash
   town inbox | grep "Session Handoff"
   ```
2. **If handoff exists**, read it:
   ```bash
   town read <handoff-msg-id>
   ```
3. **Resume from handoff state** - pick up pending verifications, check noted workers
4. **If no handoff** - do full status check:
   ```bash
   town list .
   town all beads
   ```
````
---
### gt-gl2: Mayor vs Witness Cleanup Documentation
This goes in the main Gas Town documentation or CLAUDE.md templates.
```markdown
## Cleanup Authority Model
Gas Town uses a clear separation of cleanup responsibilities:
### The Rule
**Witness handles ALL per-worker cleanup. Mayor is never involved.**
### Why This Matters
1. **Separation of concerns**: Mayor thinks strategically, Witness thinks operationally
2. **Reduced coordination overhead**: No back-and-forth for routine cleanup
3. **Faster shutdown**: Witness can kill workers immediately upon verification
4. **Cleaner escalation**: Mayor only hears about problems, not routine operations
### What "Cleanup" Means
Witness handles:
- Verifying worker git state before kill
- Nudging workers to fix dirty state
- Killing worker sessions
- Updating worker state (sleep/wake)
- Logging verification results
Mayor handles:
- Receiving "swarm complete" notifications
- Deciding whether to start new swarms
- Handling escalations (stuck workers after multiple retries)
- Cross-rig coordination if workers need to hand off
### Escalation Path
Worker stuck -> Witness nudges (up to 3x) -> Witness escalates to Mayor -> Mayor decides: force kill, reassign, or human intervention
### Anti-Patterns
**DON'T**: Have Mayor ask Witness "is worker X clean?"
**DO**: Have Witness report "swarm complete, all workers verified and killed"
**DON'T**: Have Mayor kill worker sessions directly
**DO**: Have Mayor tell Witness "abort swarm" and let Witness handle cleanup
**DON'T**: Have workers report done to Mayor
**DO**: Have workers report done to Witness, Witness aggregates and reports to Refinery/Mayor
```
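The escalation path can be sketched as a small decision function. This is an illustration only: `next_cleanup_step` is a hypothetical name, and it assumes the Witness escalates once a worker has failed verification three times, which is one reading of "up to 3x".

```python
MAX_NUDGES = 3  # from "Witness nudges (up to 3x)"; the exact cutoff is an assumption


def next_cleanup_step(is_clean: bool, failed_attempts: int) -> str:
    """Decide what the Witness does after a verification.

    failed_attempts counts failed verifications so far, including this one.
    """
    if is_clean:
        return "kill"
    if failed_attempts >= MAX_NUDGES:
        return "escalate_to_mayor"
    return "nudge"


print(next_cleanup_step(True, 0))   # kill
print(next_cleanup_step(False, 1))  # nudge
print(next_cleanup_step(False, 3))  # escalate_to_mayor
```

Note that the Mayor never appears as an actor here until the escalation branch, which is the point of the authority model.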
## Mail Templates (additions to templates.py)
### WORKER_DONE (Worker -> Witness)
```python
def worker_done(
    sender: str,
    rig: str,
    issue_id: str,
    checklist_verified: bool = True,
) -> Message:
    """Worker signals completion to Witness."""
    metadata = {
        "template": "WORKER_DONE",
        "rig": rig,
        "issue": issue_id,
        "checklist_verified": checklist_verified,
    }
    body = f"""Work complete on {issue_id}.
{_format_metadata(metadata)}
Decommission checklist {'verified' if checklist_verified else 'NOT verified - please check'}.
Ready for verification and session termination.
"""
    return Message.create(
        sender=sender,
        recipient=f"{rig}/witness",
        subject=f"Work Complete: {issue_id}",
        body=body,
    )
```
### VERIFICATION_FAILED (Witness -> Worker, via inject)
```python
from typing import List  # if templates.py does not already import it


def verification_failed(
    worker: str,
    issues: List[str],
) -> str:
    """Generate nudge text for failed verification (injected, not mailed)."""
    issues_text = "\n".join(f" - {issue}" for issue in issues)
    return f"""WITNESS VERIFICATION FAILED
The following issues must be resolved before decommission:
{issues_text}
Please fix these issues and signal done again.
"""
```
### WITNESS_HANDOFF (Witness -> Witness)
```python
# Imports needed if templates.py does not already provide them
from datetime import datetime, timezone
from typing import Dict, List, Optional


def witness_handoff(
    sender: str,
    rig: str,
    active_workers: List[Dict],
    pending_verifications: List[str],
    recent_actions: List[str],
    warnings: Optional[str] = None,
    next_steps: Optional[List[str]] = None,
) -> Message:
    """Witness session handoff note."""
    metadata = {
        "template": "WITNESS_HANDOFF",
        "rig": rig,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "active_worker_count": len(active_workers),
        "pending_verification_count": len(pending_verifications),
    }
    # Format workers
    workers_text = "\n".join(
        f"- {w['name']}: {w['status']}" for w in active_workers
    ) or "None"
    # Format pending
    pending_text = "\n".join(f"- {p}" for p in pending_verifications) or "None"
    # Format actions (last five only)
    actions_text = "\n".join(
        f"{i+1}. {a}" for i, a in enumerate(recent_actions[-5:])
    ) or "None"
    # Format next steps, with a default when none are given
    steps_text = "\n".join(
        f"- {s}" for s in (next_steps or ["Check pending verifications"])
    )
    body = f"""Session handoff for {rig} Witness.
{_format_metadata(metadata)}
## Active Workers
{workers_text}
## Pending Verifications
{pending_text}
## Recent Actions
{actions_text}
## Warnings
{warnings or "None"}
## Next Steps
{steps_text}
"""
    return Message.create(
        sender=sender,
        recipient=f"{rig}/witness",
        subject="Session Handoff",
        body=body,
    )
```
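On the receiving side, a fresh Witness session (or a tool acting for it) may want to recover the header fields from a handoff note. A minimal sketch, assuming the `[KEY]: value` header lines shown in the handoff note format; `parse_handoff_header` is a hypothetical helper and not part of templates.py:

```python
import re
from typing import Dict

# Matches header lines such as "[RIG]: gastown"
HEADER_LINE = re.compile(r"^\[([A-Z_]+)\]:\s*(.*)$")


def parse_handoff_header(note: str) -> Dict[str, str]:
    """Collect leading [KEY]: value lines, stopping at the first '## ' section."""
    fields = {}
    for line in note.splitlines():
        if line.startswith("## "):
            break
        match = HEADER_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


note = """[HANDOFF_TYPE]: witness_cycle
[TIMESTAMP]: 2024-01-15T10:30:00Z
[RIG]: gastown
## Active Workers
- Furiosa: working on gt-abc1 (spawned 2h ago)
"""
print(parse_handoff_header(note)["RIG"])  # gastown
```

Stopping at the first `## ` heading keeps free-form section text from being mistaken for metadata.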
### ESCALATION (Witness -> Mayor)
```python
def worker_escalation(
    sender: str,
    rig: str,
    worker: str,
    issue_id: str,
    attempts: int,
    unresolved_issues: List[str],
) -> Message:
    """Witness escalates stuck worker to Mayor."""
    metadata = {
        "template": "WORKER_ESCALATION",
        "rig": rig,
        "worker": worker,
        "issue": issue_id,
        "verification_attempts": attempts,
    }
    issues_text = "\n".join(f" - {i}" for i in unresolved_issues)
    body = f"""Worker {worker} cannot complete cleanup.
{_format_metadata(metadata)}
After {attempts} verification attempts, the following issues remain:
{issues_text}
Requesting guidance:
1. Force kill and abandon changes?
2. Reassign to another worker?
3. Escalate to human?
"""
    return Message.create(
        sender=sender,
        recipient="mayor/",
        subject=f"Escalation: {worker} stuck on {issue_id}",
        body=body,
        priority="high",
    )
```
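The templates above rely on `Message` and `_format_metadata`, which this document does not define. To try a template in isolation, stand-ins along these lines would do. Both are assumptions: the real `Message` class may carry more fields, the `[KEY]: value` layout is borrowed from the handoff note format rather than confirmed for `_format_metadata`, and the sender address is made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Message:
    """Hypothetical stand-in for the real Message class."""
    sender: str
    recipient: str
    subject: str
    body: str
    priority: str = "normal"

    @classmethod
    def create(cls, **fields: Any) -> "Message":
        return cls(**fields)


def _format_metadata(metadata: Dict[str, Any]) -> str:
    """Hypothetical stand-in: one [KEY]: value line per entry."""
    return "\n".join(f"[{key.upper()}]: {value}" for key, value in metadata.items())


msg = Message.create(
    sender="gastown/Furiosa",  # illustrative address
    recipient="gastown/witness",
    subject="Work Complete: gt-abc1",
    body=_format_metadata({"template": "WORKER_DONE", "issue": "gt-abc1"}),
)
print(msg.body)
# [TEMPLATE]: WORKER_DONE
# [ISSUE]: gt-abc1
```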
## Implementation Notes
### Verification State Tracking
Witness should track verification attempts in memory (or state.json):
```json
{
  "pending_verifications": {
    "Furiosa": {
      "issue_id": "gt-abc1",
      "signaled_at": "2024-01-15T10:25:00Z",
      "attempts": 1,
      "last_issues": ["uncommitted changes in src/foo.py"]
    }
  }
}
```
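A minimal sketch of how this structure might be updated after a failed verification. The helper name `record_failed_verification` is hypothetical; the field names follow the JSON above.

```python
import json
from typing import Dict, List


def record_failed_verification(state: Dict, worker: str, issue_id: str,
                               signaled_at: str, issues: List[str]) -> int:
    """Record one failed verification and return the updated attempt count."""
    pending = state.setdefault("pending_verifications", {})
    entry = pending.setdefault(
        worker,
        {"issue_id": issue_id, "signaled_at": signaled_at,
         "attempts": 0, "last_issues": []},
    )
    entry["signaled_at"] = signaled_at
    entry["attempts"] += 1
    entry["last_issues"] = list(issues)
    return entry["attempts"]


state: Dict = {}
record_failed_verification(state, "Furiosa", "gt-abc1", "2024-01-15T10:25:00Z",
                           ["uncommitted changes in src/foo.py"])
attempts = record_failed_verification(state, "Furiosa", "gt-abc1",
                                      "2024-01-15T10:40:00Z", ["stash not empty"])
print(attempts)  # 2
print(json.dumps(state, indent=2))
```

Returning the attempt count lets the caller compare it against the escalation threshold without reading the state back.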
### Nudge vs Mail
- **Nudge (inject)**: For immediate attention - verification failures, progress checks
- **Mail**: For async communication - handoffs, escalations, status reports
### Timeout Handling
If a worker doesn't respond to a nudge within a reasonable time:
1. **First**: Re-nudge with more urgency
2. **Second**: Capture their session state for diagnostics
3. **Third**: Escalate to Mayor
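The three-step ladder can be sketched as a lookup on how many nudges have gone unanswered. `timeout_action` and the action labels are hypothetical names; what counts as a "reasonable time" is left to the Witness.

```python
def timeout_action(unanswered_nudges: int) -> str:
    """Map the number of nudges without a reply to the next step in the ladder."""
    if unanswered_nudges <= 1:
        return "re_nudge_with_urgency"
    if unanswered_nudges == 2:
        return "capture_session_state"
    return "escalate_to_mayor"


for n in (1, 2, 3):
    print(n, timeout_action(n))
```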
## Checklist for Implementation
- Update AGENTS.md.template with decommission checklist (gt-sd6)
- Create WITNESS_CLAUDE.md template with verification protocol (gt-f8v)
- Add session cycling to Witness prompting (gt-eu9)
- Document cleanup authority in main docs (gt-gl2)
- Add mail templates to templates.py
- Add verification state to Witness state.json schema