gastown/internal
Roland Tritsch f4072e58cc fix(shutdown): fix session counter bug and add --cleanup-orphans flag (#759)
## Problems Fixed

1. **False reporting**: `gt shutdown` reported "0 sessions stopped" even when
   all 5 sessions were successfully terminated
2. **Orphaned processes**: No way to clean up Claude processes left behind by
   crashed/interrupted sessions

## Root Causes

1. **Counter bug**: `killSessionsInOrder()` only incremented the counter when
   `KillSessionWithProcesses()` returned no error. However, this function can
   return an error even after successfully killing all processes (e.g., when
   the session auto-closes after its processes die, the final `kill-session`
   command fails with "session not found").

2. **No orphan cleanup**: While `internal/util/orphan.go` provides orphan
   detection infrastructure, it wasn't integrated into the shutdown workflow.
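
The first root cause can be made concrete with a small sketch. It is not the code from this commit: `sessionKiller`, `HasSession`, `fakeTmux`, and both counting helpers are hypothetical names introduced for illustration, and only `KillSessionWithProcesses` is taken from the description above. The fake simulates a session that closes itself once its processes die, so the kill call reports "session not found" although the session is gone.

```go
package main

import (
	"errors"
	"fmt"
)

// sessionKiller is a hypothetical stand-in for the tmux wrapper used by gt.
type sessionKiller interface {
	KillSessionWithProcesses(name string) error
	HasSession(name string) (bool, error)
}

// fakeTmux simulates sessions that auto-close when their processes die.
type fakeTmux struct {
	alive     map[string]bool // sessions that currently exist
	autoClose map[string]bool // sessions whose final kill-session step fails
}

func newFakeTmux(names []string) *fakeTmux {
	f := &fakeTmux{alive: map[string]bool{}, autoClose: map[string]bool{}}
	for _, n := range names {
		f.alive[n] = true
		f.autoClose[n] = true
	}
	return f
}

func (f *fakeTmux) KillSessionWithProcesses(name string) error {
	delete(f.alive, name) // the processes die and the session disappears
	if f.autoClose[name] {
		return errors.New("session not found")
	}
	return nil
}

func (f *fakeTmux) HasSession(name string) (bool, error) {
	return f.alive[name], nil
}

// countByError mirrors the described pre-fix behavior: only error-free kills count.
func countByError(t sessionKiller, names []string) int {
	stopped := 0
	for _, n := range names {
		if err := t.KillSessionWithProcesses(n); err == nil {
			stopped++
		}
	}
	return stopped
}

// countByExistence counts a session as stopped when the kill succeeded
// or when the session no longer exists after the kill attempt.
func countByExistence(t sessionKiller, names []string) int {
	stopped := 0
	for _, n := range names {
		killErr := t.KillSessionWithProcesses(n)
		exists, checkErr := t.HasSession(n)
		if killErr == nil || (checkErr == nil && !exists) {
			stopped++
		}
	}
	return stopped
}

func main() {
	names := []string{"hq-deacon", "gt-boot"}
	fmt.Println("counted by error:", countByError(newFakeTmux(names), names))
	fmt.Println("counted by existence:", countByExistence(newFakeTmux(names), names))
}
```

With both sessions auto-closing, the error-based count is 0 and the existence-based count is 2, which matches the symptom and the fix described in this commit.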

## Solutions

1. **Fix counter logic**: Modified `killSessionsInOrder()` to verify session
   termination by checking if the session still exists after the kill attempt,
   rather than relying solely on the error return value. This correctly counts
   sessions that were terminated even if the kill command returned an error.

2. **Add `--cleanup-orphans` flag**: Integrated orphan cleanup with a simple
   synchronous approach:
   - Finds Claude/codex processes without a controlling terminal (TTY)
   - Filters out processes younger than 60 seconds (avoids race conditions)
   - Excludes processes belonging to active Gas Town tmux sessions
   - Sends SIGTERM to all orphans
   - Waits for a configurable grace period (default 60s)
   - Sends SIGKILL to any that survived SIGTERM

3. **Add `--cleanup-orphans-grace-secs` flag**: Allows users to configure the
   grace period between SIGTERM and SIGKILL (default 60 seconds).
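
The selection rules listed under item 2 can be sketched as a pure filter. This is an illustration under assumptions, not the code in `internal/util/orphan.go`: the `procInfo` type, the `"?"` marker for a missing TTY (borrowed from `ps` output), the exact command names, and the idea of passing active-session PIDs as a set are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// procInfo is a hypothetical snapshot of one process.
type procInfo struct {
	PID     int
	Command string
	TTY     string // "?" means no controlling terminal, as in ps output
	Age     time.Duration
}

// minOrphanAge matches the 60-second threshold from the description.
const minOrphanAge = 60 * time.Second

// selectOrphans keeps Claude/codex processes that have no TTY, are at least
// minOrphanAge old, and do not belong to an active session.
func selectOrphans(procs []procInfo, sessionPIDs map[int]bool) []procInfo {
	var orphans []procInfo
	for _, p := range procs {
		if p.Command != "claude" && p.Command != "codex" {
			continue // not a process type we manage
		}
		if p.TTY != "?" {
			continue // still attached to a terminal
		}
		if p.Age < minOrphanAge {
			continue // too young, may still be starting up
		}
		if sessionPIDs[p.PID] {
			continue // owned by an active tmux session
		}
		orphans = append(orphans, p)
	}
	return orphans
}

// sampleProcs returns made-up data that exercises every filter rule.
func sampleProcs() []procInfo {
	return []procInfo{
		{PID: 101, Command: "claude", TTY: "?", Age: 5 * time.Minute},
		{PID: 102, Command: "claude", TTY: "pts/3", Age: 5 * time.Minute},
		{PID: 103, Command: "codex", TTY: "?", Age: 10 * time.Second},
		{PID: 104, Command: "claude", TTY: "?", Age: 2 * time.Hour},
		{PID: 105, Command: "bash", TTY: "?", Age: time.Hour},
	}
}

func main() {
	active := map[int]bool{104: true} // PID 104 belongs to a live session
	for _, p := range selectOrphans(sampleProcs(), active) {
		fmt.Printf("orphan: PID %d (%s)\n", p.PID, p.Command)
	}
}
```

Only PID 101 passes all four checks in the sample data. Keeping the filter free of side effects makes each rule easy to test separately from the signalling step.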

## Design Choice: Synchronous vs. Persistent State

The orphan cleanup uses a **synchronous wait approach** rather than the
persistent state machine approach in `util.CleanupOrphanedClaudeProcesses()`:

**Synchronous approach (this PR):**
- Send SIGTERM → Wait N seconds → Send SIGKILL (all in one invocation)
- Simpler to understand and debug
- User sees immediate results
- No persistent state file to manage

**Persistent state approach (util.CleanupOrphanedClaudeProcesses):**
- First run: SIGTERM → save state
- Second run (60s later): Check state → SIGKILL
- Requires multiple invocations
- Persists state in `/tmp/gastown-orphan-state`

The synchronous approach is more appropriate for `gt shutdown` where users
expect immediate cleanup, while the persistent approach is better suited for
periodic cleanup daemons.
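
A minimal sketch of the synchronous flow (SIGTERM, one grace-period wait, then SIGKILL for survivors) is shown below. It is an illustration only: `terminateSync`, `signalFunc`, `aliveFunc`, and `fakeProcessTable` are hypothetical names, and signalling is injected so the example runs without touching real processes. On a Unix system, a function wrapping `syscall.Kill` could be passed in instead.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

type signalFunc func(pid int, sig syscall.Signal) error
type aliveFunc func(pid int) bool

// terminateSync sends SIGTERM to every PID, waits once for the grace period,
// then sends SIGKILL to whatever is still alive. It returns how many
// processes exited gracefully and how many had to be killed.
func terminateSync(pids []int, grace time.Duration, send signalFunc, alive aliveFunc) (int, int) {
	var signaled []int
	for _, pid := range pids {
		if err := send(pid, syscall.SIGTERM); err != nil {
			continue // most likely already gone
		}
		signaled = append(signaled, pid)
	}
	if len(signaled) == 0 {
		return 0, 0
	}

	time.Sleep(grace) // single shared wait, all within one invocation

	graceful, killed := 0, 0
	for _, pid := range signaled {
		if !alive(pid) {
			graceful++
			continue
		}
		if err := send(pid, syscall.SIGKILL); err == nil {
			killed++
		}
	}
	return graceful, killed
}

// fakeProcessTable simulates processes; a true value means the process
// ignores SIGTERM and only dies on SIGKILL.
func fakeProcessTable(ignoresTerm map[int]bool) (signalFunc, aliveFunc) {
	running := make(map[int]bool, len(ignoresTerm))
	for pid := range ignoresTerm {
		running[pid] = true
	}
	send := func(pid int, sig syscall.Signal) error {
		if !running[pid] {
			return errors.New("no such process")
		}
		if sig == syscall.SIGKILL || !ignoresTerm[pid] {
			delete(running, pid)
		}
		return nil
	}
	alive := func(pid int) bool { return running[pid] }
	return send, alive
}

func main() {
	send, alive := fakeProcessTable(map[int]bool{101: false, 102: true})
	graceful, killed := terminateSync([]int{101, 102}, 10*time.Millisecond, send, alive)
	fmt.Printf("graceful=%d killed=%d\n", graceful, killed) // prints "graceful=1 killed=1"
}
```

Because the wait happens inside the call, no state has to survive between invocations, which is the trade-off this section describes.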

## Testing

Before fix:
```
Sessions to stop: gt-boot, gt-pgqueue-refinery, gt-pgqueue-witness, hq-deacon, hq-mayor
✓ Gas Town shutdown complete (0 sessions stopped)  ← Bug
```

After fix:
```
Sessions to stop: gt-boot, gt-pgqueue-refinery, gt-pgqueue-witness, hq-deacon, hq-mayor
✓ hq-deacon stopped
✓ gt-boot stopped
✓ gt-pgqueue-refinery stopped
✓ gt-pgqueue-witness stopped
✓ hq-mayor stopped
Cleaning up orphaned Claude processes...
→ PID 267916: sent SIGTERM (waiting 60s before SIGKILL)
 Waiting 60 seconds for processes to terminate gracefully...
✓ 1 process(es) terminated gracefully from SIGTERM
✓ All processes cleaned up successfully
✓ Gas Town shutdown complete (5 sessions stopped)  ← Fixed
```

All sessions were verified as terminated via `tmux ls`.

Co-authored-by: Roland Tritsch <roland@ailtir.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-20 20:23:30 -08:00