feat: Add automatic orphaned claude process cleanup (#588)

* feat: Add automatic orphaned claude process cleanup

Claude Code's Task tool spawns subagent processes that sometimes don't clean up
properly after completion. These accumulate and consume significant memory
(observed: 17 processes using ~6GB RAM).

This change adds automatic cleanup in two places:

1. **Deacon patrol** (primary): New patrol step "orphan-process-cleanup" runs
   `gt deacon cleanup-orphans` early in each cycle. More responsive (~30s).

2. **Daemon heartbeat** (fallback): Runs cleanup every 3 minutes as safety net
   when deacon is down.

Detection uses TTY column - processes with TTY "?" have no controlling terminal.
This is safe because:
- Processes in terminals (user sessions) have a TTY like "pts/0" - untouched
- Only kills processes with no controlling terminal
- Orphaned subagents are children of tmux server with no TTY

New files:
- internal/util/orphan.go: FindOrphanedClaudeProcesses, CleanupOrphanedClaudeProcesses
- internal/util/orphan_test.go: Tests for orphan detection

New command:
- `gt deacon cleanup-orphans`: Manual/patrol-triggered cleanup

Fixes #587

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(orphan): add Windows build tag and minimum age check

Addresses review feedback on PR #588:

1. Add //go:build !windows to orphan.go and orphan_test.go
   - The code uses Unix-specific syscalls (SIGTERM, ESRCH) and
     ps command options that don't exist on Windows

2. Add minimum age check (60 seconds) to prevent false positives
   - Prevents race conditions with newly spawned subagents
   - Addresses reviewer concern about cron/systemd processes
   - Uses portable etime format instead of Linux-only etimes

3. Add parseEtime helper with comprehensive tests
   - Parses [[DD-]HH:]MM:SS format (works on both Linux and macOS)
   - etimes (seconds) is Linux-specific, etime is portable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(orphan): add proper SIGTERM→SIGKILL escalation with state tracking

Previous approach used process age which doesn't work: a Task subagent
runs without TTY from birth, so a long-running legitimate subagent that
later fails to exit would be immediately SIGKILLed without trying SIGTERM.

New approach uses a state file to track signal history:

1. First encounter → SIGTERM, record PID + timestamp in state file
2. Next cycle (after 60s grace period) → if still alive, SIGKILL
3. Next cycle → if survived SIGKILL, log as unkillable and remove

State file: $XDG_RUNTIME_DIR/gastown-orphan-state (or /tmp/)
Format: "<pid> <signal> <unix_timestamp>" per line

The state file is automatically cleaned up:
- Dead processes removed on load
- Unkillable processes removed after logging

Also updates callers to use new CleanupResult type which includes
the signal sent (SIGTERM, SIGKILL, or UNKILLABLE).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
aleiby
2026-01-16 15:35:48 -08:00
committed by GitHub
parent 5a56525655
commit 22064b0730
5 changed files with 553 additions and 1 deletions

View File

@@ -84,10 +84,46 @@ Callbacks may spawn new polecats, update issue state, or trigger other actions.
**Hygiene principle**: Archive messages after they're fully processed.
Keep inbox near-empty - only unprocessed items should remain."""
[[steps]]
id = "orphan-process-cleanup"
title = "Clean up orphaned claude subagent processes"
needs = ["inbox-check"]
description = """
Clean up orphaned claude subagent processes.
Claude Code's Task tool spawns subagent processes that sometimes don't clean up
properly after completion. These accumulate and consume significant memory.
**Detection method:**
Orphaned processes have no controlling terminal (TTY = "?"). Legitimate claude
instances in terminals have a TTY like "pts/0".
**Run cleanup:**
```bash
gt deacon cleanup-orphans
```
This command:
1. Lists all claude/codex processes with `ps -eo pid,tty,comm`
2. Filters for TTY = "?" (no controlling terminal)
3. Sends SIGTERM to each orphaned process
4. Reports how many were killed
**Why this is safe:**
- Processes in terminals (your personal sessions) have a TTY - they won't be touched
- Only kills processes that have no controlling terminal
- These orphans are children of the tmux server with no TTY, indicating they're
detached subagents that failed to exit
**If cleanup fails:**
Log the error but continue patrol - this is best-effort cleanup.
**Exit criteria:** Orphan cleanup attempted (success or logged failure)."""
[[steps]]
id = "trigger-pending-spawns"
title = "Nudge newly spawned polecats"
needs = ["inbox-check"]
needs = ["orphan-process-cleanup"]
description = """
Nudge newly spawned polecats that are ready for input.