fix(orphan): protect all tmux sessions, not just Gas Town ones (#924)

* Add hung-session-detection step to deacon patrol

Detects and surgically recovers Gas Town sessions where the Claude API
call is stuck indefinitely. These sessions appear "running" (the tmux
session exists) but aren't processing work.

Safety checks (ALL must pass before recovery):
1. Session matches Gas Town pattern exactly (gt-*-witness, etc)
2. Session shows waiting state (Clauding/Deciphering/etc)
3. Duration >30min AND (zero tokens OR duration >2hrs)
4. NOT showing active tool execution (⏺ markers)

This closes a gap where existing zombie-scan only catches processes
not in tmux sessions.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(orphan): protect all tmux sessions, not just Gas Town ones

The orphan cleanup was killing Claude processes in users' personal tmux
sessions (e.g., "loomtown", "yaad") because only sessions with gt-* or
hq-* prefixes were protected.

Changes:
- Renamed getGasTownSessionPIDs() to getTmuxSessionPIDs()
- Now protects ALL tmux sessions regardless of name prefix
- Updated variable names for clarity (gasTownPIDs -> protectedPIDs)

The TTY="?" check is not reliable during certain operations (startup,
session transitions), so explicit protection of all tmux sessions is
necessary to avoid killing users' personal Claude instances.
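
The protection rule can be sketched in shell as follows. This is only an illustration: the real change lives in Go (getTmuxSessionPIDs()), whose body is not shown here, and `filter_unprotected` is a hypothetical helper. It assumes the protected set is the pane PIDs reported by `tmux list-panes -a`; whether the actual code also walks child processes is not stated, and the process name `claude` in the usage comment is likewise an assumption.

```bash
# Hypothetical sketch: print only the candidate PIDs that are NOT protected.
# $1 = newline-separated protected PIDs, stdin = candidate PIDs (one per line).
filter_unprotected() {
  local protected="$1" pid
  while IFS= read -r pid; do
    [ -n "$pid" ] || continue
    # -x matches whole lines only, -F treats the PID as a fixed string.
    grep -qxF -- "$pid" <<<"$protected" || printf '%s\n' "$pid"
  done
}

# Usage (needs a running tmux server): every pane in every session is
# protected, regardless of the session name.
# protected=$(tmux list-panes -a -F '#{pane_pid}' 2>/dev/null)
# pgrep -x claude | filter_unprotected "$protected"
```

Matching whole lines matters here: a substring match would let PID 12 protect PID 123.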

Fixes #923

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: mayor <ec2-user@ip-172-31-43-79.ec2.internal>
Co-authored-by: Claude <noreply@anthropic.com>
Basit Mustafa
2026-01-24 22:45:12 -07:00
committed by GitHub
parent f276b9d28a
commit 3442471a93
2 changed files with 124 additions and 16 deletions

@@ -419,10 +419,114 @@ gt mail send mayor/ -s "Health: <rig> <component> unresponsive" \\
Reset unresponsive_cycles to 0 when component responds normally."""
[[steps]]
id = "hung-session-detection"
title = "Detect and recover hung Gas Town sessions (SURGICAL)"
needs = ["health-scan"]
description = """
Detect and surgically recover hung Gas Town sessions where the Claude API call is stuck.
A hung session appears "running" (tmux session exists, Claude process exists) but
the API call has been stuck indefinitely. This breaks patrol chains - if witness
hangs, refinery never gets nudged about new MRs.
**Why existing checks miss this:**
- zombie-scan only catches processes not in tmux sessions
- gt status shows "running" if tmux session exists
- Nudges queue but never get processed (Claude can't respond)
## SURGICAL TARGETING
**ONLY these session patterns are valid targets:**
- `gt-<rig>-witness` (e.g., gt-kalshi-witness, gt-horizon-witness)
- `gt-<rig>-refinery` (e.g., gt-kalshi-refinery)
- `hq-deacon`
**NEVER touch sessions that don't match these patterns exactly.**
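A minimal sketch of the exact-match rule, assuming rig names consist of lowercase letters, digits, `_` and `-` (the rig-name alphabet is not defined here, so treat that character class as an assumption). `is_valid_target` is a hypothetical helper name:
```bash
# Hypothetical helper: succeed only when the whole session name matches one
# of the three allowed patterns. The rig-name character class is an assumption.
is_valid_target() {
  local name="$1"
  [[ "$name" =~ ^gt-[a-z0-9][a-z0-9_-]*-(witness|refinery)$ ]] || [[ "$name" == "hq-deacon" ]]
}
```
Anchoring the pattern at both ends is what keeps names such as `loomtown` or `gt-kalshi-witness-old` out of scope.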
## DETECTION (All checks must pass)
For each Gas Town session, capture output and verify ALL of these:
```bash
# Step 1: Get session output
output=$(tmux capture-pane -t <session-name> -p 2>/dev/null | tail -10)
```
**Check 1: Session is in waiting state**
Must see one of: `Clauding`, `Deciphering`, `Marinating`, `Finagling`, `thinking`
```bash
echo "$output" | grep -qiE 'Clauding|Deciphering|Marinating|Finagling|thinking'
```
**Check 2: Duration exceeds threshold (30+ minutes)**
Parse duration from output like "21h 35m 20s" or "45m 30s":
```bash
# Extract hours and minutes
hours=$(echo "$output" | grep -oE '[0-9]+h' | head -1 | tr -d 'h')
minutes=$(echo "$output" | grep -oE '[0-9]+m' | head -1 | tr -d 'm')
# Force base 10 so a zero-padded value such as "08" is not parsed as octal
total_minutes=$((10#${hours:-0} * 60 + 10#${minutes:-0}))
# Threshold: 30 minutes minimum
[ "$total_minutes" -ge 30 ]
```
**Check 3: Zero tokens received (definite hang) OR very long duration (>2 hours)**
```bash
# Definite hang (zero tokens received) OR extremely long duration (2 hours = 120 minutes)
echo "$output" | grep -qE '↓ 0 tokens' || [ "$total_minutes" -ge 120 ]
```
**Check 4: NOT showing active tool execution**
Active sessions show tool markers (⏺). If present, session is actually working:
```bash
# If tool markers present in recent output, DO NOT kill
echo "$output" | grep -qE '⏺|Read|Write|Bash|Edit' && continue
```
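Taken together, the four checks could be combined into a single predicate. The following is a sketch only: `looks_hung` is a hypothetical helper, and it assumes the captured pane text contains a duration like "45m 30s" and a token counter like "↓ 0 tokens", as in the examples in this step.
```bash
# Hypothetical helper: read captured pane text on stdin and succeed only
# if all four checks pass.
looks_hung() {
  local output hours minutes total_minutes
  output=$(cat)

  # Check 4 first: any tool marker means the session is actually working.
  grep -qE '⏺|Read|Write|Bash|Edit' <<<"$output" && return 1

  # Check 1: a waiting-state indicator must be present.
  grep -qiE 'Clauding|Deciphering|Marinating|Finagling|thinking' <<<"$output" || return 1

  # Check 2: parse "21h 35m 20s" or "45m 30s"; 10# avoids octal parsing of "08".
  hours=$(grep -oE '[0-9]+h' <<<"$output" | head -1 | tr -d 'h')
  minutes=$(grep -oE '[0-9]+m' <<<"$output" | head -1 | tr -d 'm')
  total_minutes=$((10#${hours:-0} * 60 + 10#${minutes:-0}))
  [ "$total_minutes" -ge 30 ] || return 1

  # Check 3: zero tokens received, or waiting for two hours or more.
  grep -qE '↓ 0 tokens' <<<"$output" || [ "$total_minutes" -ge 120 ]
}
```
A possible call site, with `$session` standing in for an already validated session name: `tmux capture-pane -t "$session" -p 2>/dev/null | tail -10 | looks_hung && echo "$session looks hung"`. Running the tool-marker check first makes the predicate fail closed, so a working session is never reported as hung.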
## RECOVERY (Only after ALL checks pass)
**Log the action first:**
```bash
echo "[$(date)] RECOVERING HUNG: <session-name> (${hours:-0}h ${minutes:-0}m, waiting state)" >> "$GT_ROOT/logs/hung-sessions.log"
```
**Kill and restart based on session type:**
For witness:
```bash
tmux kill-session -t gt-<rig>-witness 2>/dev/null
gt witness start <rig>
```
For refinery:
```bash
tmux kill-session -t gt-<rig>-refinery 2>/dev/null
gt refinery restart <rig>
```
For deacon (self-recovery - use with caution):
```bash
# A deacon detecting that it is itself hung is a paradox.
# Only kill if another deacon instance exists or a human has confirmed; otherwise escalate:
gt mail send mayor/ -s "DEACON SELF-HUNG DETECTED" -m "Deacon appears hung. Human intervention required."
```
## VERIFICATION
After restart, verify new session is healthy:
```bash
sleep 5
tmux has-session -t <session-name> && echo "Session restarted successfully"
```
**Exit criteria:** All hung Gas Town sessions detected and recovered (or escalated if recovery failed)."""
[[steps]]
id = "zombie-scan"
title = "Detect zombie polecats (NO KILL AUTHORITY)"
-needs = ["health-scan"]
+needs = ["hung-session-detection"]
description = """
Defense-in-depth DETECTION of zombie polecats that Witness should have cleaned.