Commit Graph

4 Commits

Author SHA1 Message Date
gastown/crew/dennis
0db2bda6e6 feat(deacon): add zombie-scan command for tmux-verified process cleanup
Unlike cleanup-orphans (which uses TTY="?" detection), zombie-scan uses
tmux verification: it checks if each Claude process is in an active
tmux session by comparing against actual pane PIDs.

A process is a zombie if:
- It's a Claude/codex process
- It's NOT the pane PID of any active tmux session
- It's NOT a child of any pane PID
- It's older than 60 seconds
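
The four criteria above can be sketched as a single predicate. This is a minimal illustration, not the code from this commit: the `proc` type, the `isZombie` name, and the two lookup maps are hypothetical, and "child of any pane PID" is assumed to mean any descendant.

```go
package main

// proc is a hypothetical snapshot of one process; the real scanner's
// types may differ.
type proc struct {
	PID        int
	Command    string // e.g. "claude" or "codex"
	AgeSeconds int    // elapsed time since the process started
}

const minZombieAgeSeconds = 60

// isZombie applies the four criteria from the commit message.
// panePIDs holds the pane PID of every active tmux session;
// parentOf maps each PID to its parent PID.
func isZombie(p proc, panePIDs map[int]bool, parentOf map[int]int) bool {
	if p.Command != "claude" && p.Command != "codex" {
		return false
	}
	if p.AgeSeconds <= minZombieAgeSeconds {
		return false // too young; may still be attaching
	}
	// Walk up the ancestry. If the process or any ancestor is a pane
	// PID, it belongs to a live session. The seen set guards against
	// cycles in a stale process table.
	seen := map[int]bool{}
	for pid := p.PID; pid > 1 && !seen[pid]; pid = parentOf[pid] {
		if panePIDs[pid] {
			return false
		}
		seen[pid] = true
	}
	return true
}
```

Walking the whole ancestry instead of only the direct parent is a design choice of this sketch; it also covers grandchildren of a pane, such as subagents spawned by Claude.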

Also refactors:
- getChildPIDs() with ps fallback when pgrep unavailable
- State file handling with file locking for concurrent access
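
A pgrep-with-ps-fallback lookup might look roughly like the sketch below. The function names `childPIDs` and `parseChildPIDs` are hypothetical (the commit's `getChildPIDs()` is not shown here), and the exact `ps` invocation is an assumption chosen because `-A` and `-o name=` are POSIX options.

```go
package main

import (
	"os/exec"
	"strconv"
	"strings"
)

// parseChildPIDs extracts the children of parent from "PID PPID" lines,
// the shape produced by `ps -A -o pid= -o ppid=`.
func parseChildPIDs(psOutput string, parent int) []int {
	var kids []int
	for _, line := range strings.Split(psOutput, "\n") {
		f := strings.Fields(line)
		if len(f) != 2 {
			continue // blank or unexpected line
		}
		pid, err1 := strconv.Atoi(f[0])
		ppid, err2 := strconv.Atoi(f[1])
		if err1 == nil && err2 == nil && ppid == parent {
			kids = append(kids, pid)
		}
	}
	return kids
}

// childPIDs prefers `pgrep -P` and falls back to ps when pgrep is not
// on PATH.
func childPIDs(parent int) []int {
	if _, err := exec.LookPath("pgrep"); err == nil {
		// pgrep exits with status 1 when nothing matches, so the error
		// is ignored and an empty output simply yields no children.
		out, _ := exec.Command("pgrep", "-P", strconv.Itoa(parent)).Output()
		var kids []int
		for _, f := range strings.Fields(string(out)) {
			if pid, err := strconv.Atoi(f); err == nil {
				kids = append(kids, pid)
			}
		}
		return kids
	}
	out, err := exec.Command("ps", "-A", "-o", "pid=", "-o", "ppid=").Output()
	if err != nil {
		return nil
	}
	return parseChildPIDs(string(out), parent)
}
```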

Usage:
  gt deacon zombie-scan           # Find and kill zombies
  gt deacon zombie-scan --dry-run # Just list zombies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 14:19:29 -08:00
mayor
2b3f287f02 fix(orphan): prevent killing Claude processes in valid tmux sessions
The orphan cleanup was killing witness/refinery/deacon Claude processes
during startup because they temporarily show TTY "?" before fully
attaching to the tmux session.

Added getGasTownSessionPIDs() to discover all PIDs belonging to valid
gt-* and hq-* tmux sessions (including child processes). The orphan
cleanup now skips these PIDs, only killing truly orphaned processes
from dead sessions.

This fixes the race condition where:
1. Daemon starts a witness/refinery session
2. Claude starts but takes time to show a prompt
3. Startup detection times out
4. Orphan cleanup sees Claude with TTY "?" and kills it

Now processes in valid sessions are protected regardless of TTY state.
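
The protection described above amounts to building a set of PIDs from valid sessions and subtracting it from the kill candidates. The sketch below is illustrative only: `protectedPIDs`, `filterOrphans`, and the two input maps are hypothetical stand-ins for whatever getGasTownSessionPIDs() actually gathers from tmux.

```go
package main

import "strings"

// isGasTownSession reports whether a tmux session name carries one of
// the prefixes named in the commit message.
func isGasTownSession(name string) bool {
	return strings.HasPrefix(name, "gt-") || strings.HasPrefix(name, "hq-")
}

// protectedPIDs collects the pane PIDs of valid sessions plus all of
// their descendants. sessions maps session name to pane PIDs; children
// maps a PID to its direct children.
func protectedPIDs(sessions map[string][]int, children map[int][]int) map[int]bool {
	protected := map[int]bool{}
	var visit func(pid int)
	visit = func(pid int) {
		if protected[pid] {
			return
		}
		protected[pid] = true
		for _, c := range children[pid] {
			visit(c)
		}
	}
	for name, panes := range sessions {
		if !isGasTownSession(name) {
			continue
		}
		for _, pid := range panes {
			visit(pid)
		}
	}
	return protected
}

// filterOrphans drops every candidate that belongs to a protected
// session, regardless of what its TTY column shows.
func filterOrphans(candidates []int, protected map[int]bool) []int {
	var out []int
	for _, pid := range candidates {
		if !protected[pid] {
			out = append(out, pid)
		}
	}
	return out
}
```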

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 12:46:49 -08:00
mayor
2feefd1731 fix(orphan): prevent Claude Code session leaks on macOS
Three bugs were causing orphaned Claude processes to accumulate:

1. TTY comparison in orphan.go checked for "?" but macOS shows "??"
   - Orphan cleanup never found anything on macOS
   - Changed to check for both "?" and "??"

2. selfKillSession in done.go used basic tmux kill-session
   - Claude Code can survive SIGHUP
   - Now uses KillSessionWithProcesses for proper cleanup

3. Crew stop commands used basic KillSession
   - Same issue as item 2 above
   - Updated runCrewRemove, runCrewStop, runCrewStopAll

Root cause of 383 accumulated sessions: every gt done and crew stop
left orphans, and the cleanup never worked on macOS.
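
The TTY comparison from the first fix can be illustrated with a small helper. The name `hasNoControllingTTY` is hypothetical; the sketch only assumes that the value comes from the TTY column of `ps` output.

```go
package main

import "strings"

// hasNoControllingTTY reports whether a ps TTY column value means the
// process has no controlling terminal. Linux prints "?" while macOS
// prints "??", so both spellings are accepted.
func hasNoControllingTTY(tty string) bool {
	tty = strings.TrimSpace(tty)
	return tty == "?" || tty == "??"
}
```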

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 03:49:18 -08:00
aleiby
22064b0730 feat: Add automatic orphaned claude process cleanup (#588)
* feat: Add automatic orphaned claude process cleanup

Claude Code's Task tool spawns subagent processes that sometimes don't clean up
properly after completion. These accumulate and consume significant memory
(observed: 17 processes using ~6GB RAM).

This change adds automatic cleanup in two places:

1. **Deacon patrol** (primary): New patrol step "orphan-process-cleanup" runs
   `gt deacon cleanup-orphans` early in each cycle. More responsive (~30s).

2. **Daemon heartbeat** (fallback): Runs cleanup every 3 minutes as safety net
   when deacon is down.

Detection uses TTY column - processes with TTY "?" have no controlling terminal.
This is safe because:
- Processes in terminals (user sessions) have a TTY like "pts/0" - untouched
- Only kills processes with no controlling terminal
- Orphaned subagents are children of tmux server with no TTY

New files:
- internal/util/orphan.go: FindOrphanedClaudeProcesses, CleanupOrphanedClaudeProcesses
- internal/util/orphan_test.go: Tests for orphan detection

New command:
- `gt deacon cleanup-orphans`: Manual/patrol-triggered cleanup

Fixes #587

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(orphan): add Windows build tag and minimum age check

Addresses review feedback on PR #588:

1. Add //go:build !windows to orphan.go and orphan_test.go
   - The code uses Unix-specific syscalls (SIGTERM, ESRCH) and
     ps command options that don't exist on Windows

2. Add minimum age check (60 seconds) to prevent false positives
   - Prevents race conditions with newly spawned subagents
   - Addresses reviewer concern about cron/systemd processes
   - Uses portable etime format instead of Linux-only etimes

3. Add parseEtime helper with comprehensive tests
   - Parses [[DD-]HH:]MM:SS format (works on both Linux and macOS)
   - etimes (seconds) is Linux-specific, etime is portable
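
For illustration, a parser for the `[[DD-]HH:]MM:SS` format could be written as follows. This is a sketch that reuses the helper's name, not the implementation from the PR; the error handling and the choice to return whole seconds are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEtime converts a ps etime value in [[DD-]HH:]MM:SS form into a
// number of seconds.
func parseEtime(s string) (int, error) {
	s = strings.TrimSpace(s)
	days := 0
	// An optional "DD-" prefix carries the day count.
	if i := strings.Index(s, "-"); i >= 0 {
		d, err := strconv.Atoi(s[:i])
		if err != nil {
			return 0, fmt.Errorf("bad day count in %q: %w", s, err)
		}
		days, s = d, s[i+1:]
	}
	parts := strings.Split(s, ":")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, fmt.Errorf("unexpected etime format %q", s)
	}
	// Fold HH:MM:SS (or MM:SS) left to right, multiplying by 60 per field.
	total := 0
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return 0, fmt.Errorf("bad field in %q: %w", s, err)
		}
		total = total*60 + n
	}
	return days*86400 + total, nil
}
```

With this sketch, `parseEtime("01:30")` yields 90 and `parseEtime("1-00:00:05")` yields 86405, so the 60-second minimum age becomes a plain integer comparison.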

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(orphan): add proper SIGTERM→SIGKILL escalation with state tracking

The previous approach relied on process age, which does not work: a Task
subagent runs without a TTY from birth, so a long-running legitimate subagent
that later fails to exit would be SIGKILLed immediately, without first trying
SIGTERM.

New approach uses a state file to track signal history:

1. First encounter → SIGTERM, record PID + timestamp in state file
2. Next cycle (after 60s grace period) → if still alive, SIGKILL
3. Next cycle → if survived SIGKILL, log as unkillable and remove
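
The three steps above form a small state machine keyed on the last signal recorded for a PID. A hedged sketch follows; the `nextAction` name and the "WAIT" result are hypothetical additions for illustration, while the SIGTERM, SIGKILL, and UNKILLABLE labels and the 60-second grace period come from this commit message.

```go
package main

const gracePeriodSeconds = 60

// nextAction decides what to do with an orphan that is still alive.
// lastSignal is the signal recorded in the state file for this PID
// ("" when there is no entry yet); sinceLastSeconds is the time since
// that signal was sent.
func nextAction(lastSignal string, sinceLastSeconds int) string {
	switch lastSignal {
	case "":
		return "SIGTERM" // first encounter: ask politely
	case "SIGTERM":
		if sinceLastSeconds >= gracePeriodSeconds {
			return "SIGKILL" // grace period over and still alive
		}
		return "WAIT" // hypothetical: still inside the grace period
	case "SIGKILL":
		return "UNKILLABLE" // survived SIGKILL: log it and drop the entry
	default:
		return "WAIT" // unknown record: take no action in this sketch
	}
}
```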

State file: $XDG_RUNTIME_DIR/gastown-orphan-state (or /tmp/)
Format: "<pid> <signal> <unix_timestamp>" per line
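
Reading and writing one line of that format might look like the sketch below. The `signalRecord` type and both function names are hypothetical; only the three-field layout is taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// signalRecord is a hypothetical in-memory form of one state file line.
type signalRecord struct {
	PID    int
	Signal string // e.g. "SIGTERM" or "SIGKILL"
	SentAt time.Time
}

// parseStateLine parses "<pid> <signal> <unix_timestamp>".
func parseStateLine(line string) (signalRecord, error) {
	var (
		pid int
		sig string
		ts  int64
	)
	if _, err := fmt.Sscanf(line, "%d %s %d", &pid, &sig, &ts); err != nil {
		return signalRecord{}, fmt.Errorf("malformed state line %q: %w", line, err)
	}
	return signalRecord{PID: pid, Signal: sig, SentAt: time.Unix(ts, 0)}, nil
}

// formatStateLine renders a record back into the on-disk format.
func formatStateLine(r signalRecord) string {
	return fmt.Sprintf("%d %s %d", r.PID, r.Signal, r.SentAt.Unix())
}
```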

The state file is automatically cleaned up:
- Dead processes removed on load
- Unkillable processes removed after logging

Also updates callers to use new CleanupResult type which includes
the signal sent (SIGTERM, SIGKILL, or UNKILLABLE).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 15:35:48 -08:00