* Add hung-session-detection step to deacon patrol
Detects and surgically recovers Gas Town sessions where a Claude API
call is stuck indefinitely. These sessions appear "running" (the tmux
session exists) but aren't processing work.
Safety checks (ALL must pass before recovery):
1. Session matches Gas Town pattern exactly (gt-*-witness, etc)
2. Session shows waiting state (Clauding/Deciphering/etc)
3. Duration >30min AND (zero tokens OR duration >2hrs)
4. NOT showing active tool execution (⏺ markers)
This closes a gap where the existing zombie-scan only catches processes
that are not in tmux sessions.
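
The four safety checks above combine into a single predicate. The following is only a minimal sketch: the `SessionSnapshot` type, its field names, and `shouldRecover` are hypothetical and are not taken from the patrol step itself.

```go
package main

import (
	"fmt"
	"time"
)

// SessionSnapshot is a hypothetical summary of what a patrol step might
// observe about one tmux session. Field names are illustrative only.
type SessionSnapshot struct {
	MatchesGasTownPattern bool          // check 1: name matches a Gas Town pattern
	ShowsWaitingState     bool          // check 2: e.g. "Clauding", "Deciphering"
	WaitingFor            time.Duration // how long the waiting state has lasted
	Tokens                int           // tokens reported for the current call
	ToolActive            bool          // check 4: an active tool marker is visible
}

// shouldRecover returns true only when all four safety checks pass.
func shouldRecover(s SessionSnapshot) bool {
	// Check 3: duration >30min AND (zero tokens OR duration >2hrs).
	longEnough := s.WaitingFor > 30*time.Minute &&
		(s.Tokens == 0 || s.WaitingFor > 2*time.Hour)
	return s.MatchesGasTownPattern && s.ShowsWaitingState && longEnough && !s.ToolActive
}

func main() {
	stuck := SessionSnapshot{MatchesGasTownPattern: true, ShowsWaitingState: true, WaitingFor: 45 * time.Minute}
	busy := stuck
	busy.Tokens = 1200 // tokens are flowing, so 45 minutes is not enough
	fmt.Println(shouldRecover(stuck), shouldRecover(busy))
}
```

With this shape, a session that has produced tokens is left alone until it has waited more than two hours.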
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(orphan): protect all tmux sessions, not just Gas Town ones
The orphan cleanup was killing Claude processes in users' personal tmux
sessions (e.g., "loomtown", "yaad") because only sessions with gt-* or
hq-* prefixes were protected.
Changes:
- Renamed getGasTownSessionPIDs() to getTmuxSessionPIDs()
- Now protects ALL tmux sessions regardless of name prefix
- Updated variable names for clarity (gasTownPIDs -> protectedPIDs)
The TTY="?" check is not reliable during certain operations (startup,
session transitions), so explicit protection of all tmux sessions is
necessary to prevent killing users' personal Claude instances.
Fixes #923
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: mayor <ec2-user@ip-172-31-43-79.ec2.internal>
Co-authored-by: Claude <noreply@anthropic.com>
The existing PPID=1 detection misses orphaned Claude processes that get
reparented to something other than init/launchd. The new --aggressive
flag cross-references Claude processes against active tmux sessions to
find ALL orphans not in any gt-* or hq-* session.
Testing shows this catches ~3x more orphans (117 vs 39 in one sample).
Usage:
gt orphans procs --aggressive # List ALL orphans
gt orphans procs kill --aggressive # Kill ALL orphans
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Unlike cleanup-orphans (which uses TTY="?" detection), zombie-scan uses
tmux verification: it checks if each Claude process is in an active
tmux session by comparing against actual pane PIDs.
A process is a zombie if:
- It's a Claude/codex process
- It's NOT the pane PID of any active tmux session
- It's NOT a child of any pane PID
- It's older than 60 seconds
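
As a rough illustration of the criteria above, the check can be written as a pure function over a process snapshot. This is a sketch under assumptions: the `Proc` type, the `parentOf` map, and both function names are hypothetical, and the ancestor walk protects all descendants of a pane PID, which may be broader than what zombie-scan actually does.

```go
package main

import (
	"fmt"
	"time"
)

// Proc is a hypothetical process record; it is not a type from the codebase.
type Proc struct {
	PID  int
	Name string
	Age  time.Duration
}

// underPane reports whether pid is a pane PID or descends from one.
// parentOf maps each PID to its parent PID (assumed to come from ps output).
func underPane(pid int, parentOf map[int]int, panePIDs map[int]bool) bool {
	seen := map[int]bool{}
	for cur := pid; cur > 1 && !seen[cur]; cur = parentOf[cur] {
		if panePIDs[cur] {
			return true
		}
		seen[cur] = true // guards against cycles in a stale snapshot
	}
	return false
}

// isZombie applies the listed criteria to one process.
func isZombie(p Proc, parentOf map[int]int, panePIDs map[int]bool) bool {
	isAgent := p.Name == "claude" || p.Name == "codex"
	return isAgent && !underPane(p.PID, parentOf, panePIDs) && p.Age > 60*time.Second
}

func main() {
	parentOf := map[int]int{200: 100, 300: 1}
	panes := map[int]bool{100: true}
	fmt.Println(isZombie(Proc{200, "claude", 5 * time.Minute}, parentOf, panes))
	fmt.Println(isZombie(Proc{300, "claude", 5 * time.Minute}, parentOf, panes))
}
```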
Also refactors:
- getChildPIDs() with ps fallback when pgrep unavailable
- State file handling with file locking for concurrent access
Usage:
gt deacon zombie-scan # Find and kill zombies
gt deacon zombie-scan --dry-run # Just list zombies
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add Windows stub for orphan cleanup
* Fix account switch tests on Windows
* Make query session events test portable
* Disable beads daemon in query session events test
* Add Windows bd stubs for sling tests
* Make expandOutputPath test OS-agnostic
* Make role_agents test Windows-friendly
* Make config path tests OS-agnostic
* Make HealthCheckStateFile test OS-agnostic
* Skip orphan process check on Windows
* Normalize sparse checkout detail paths
* Make dog path tests OS-agnostic
* Fix bare repo refspec config on Windows
* Add Windows process detection for locks
* Add Windows CI workflow
* Make mail path tests OS-agnostic
* Skip plugin file mode test on Windows
* Skip tmux-dependent polecat tests on Windows
* Normalize polecat paths and AGENTS.md content
* Make beads init failure test Windows-friendly
* Skip rig agent bead init test on Windows
* Make XDG path tests OS-agnostic
* Make exec tests portable on Windows
* Adjust atomic write tests for Windows
* Make wisp tests Windows-friendly
* Make workspace find tests OS-agnostic
* Fix Windows rig add integration test
* Make sling var logging Windows-friendly
* Fix sling attached molecule update ordering
---------
Co-authored-by: Johann Dirry <johann.dirry@microsea.at>
The orphan cleanup was killing witness/refinery/deacon Claude processes
during startup because they temporarily show TTY "?" before fully
attaching to the tmux session.
Added getGasTownSessionPIDs() to discover all PIDs belonging to valid
gt-* and hq-* tmux sessions (including child processes). The orphan
cleanup now skips these PIDs, only killing truly orphaned processes
from dead sessions.
This fixes the race condition where:
1. Daemon starts a witness/refinery session
2. Claude starts but takes time to show a prompt
3. Startup detection times out
4. Orphan cleanup sees Claude with TTY "?" and kills it
Now processes in valid sessions are protected regardless of TTY state.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three bugs were causing orphaned Claude processes to accumulate:
1. TTY comparison in orphan.go checked for "?" but macOS shows "??"
- Orphan cleanup never found anything on macOS
- Changed to check for both "?" and "??"
2. selfKillSession in done.go used basic tmux kill-session
- Claude Code can survive SIGHUP
- Now uses KillSessionWithProcesses for proper cleanup
3. Crew stop commands used basic KillSession
- Same issue as #2
- Updated runCrewRemove, runCrewStop, runCrewStopAll
Root cause of 383 accumulated sessions: every gt done and crew stop
left orphans, and the cleanup never worked on macOS.
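
The TTY fix in item 1 amounts to accepting both spellings of "no controlling terminal". A minimal sketch, with a hypothetical helper name that is not necessarily what orphan.go uses:

```go
package main

import (
	"fmt"
	"strings"
)

// hasNoTTY reports whether a ps TTY column value means "no controlling
// terminal". Linux ps prints "?", macOS ps prints "??".
func hasNoTTY(tty string) bool {
	t := strings.TrimSpace(tty)
	return t == "?" || t == "??"
}

func main() {
	for _, tty := range []string{"?", "??", "pts/0", "ttys003"} {
		fmt.Printf("%-8s %v\n", tty, hasNoTTY(tty))
	}
}
```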
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat: Add automatic orphaned claude process cleanup
Claude Code's Task tool spawns subagent processes that sometimes don't clean up
properly after completion. These accumulate and consume significant memory
(observed: 17 processes using ~6GB RAM).
This change adds automatic cleanup in two places:
1. **Deacon patrol** (primary): New patrol step "orphan-process-cleanup" runs
`gt deacon cleanup-orphans` early in each cycle. More responsive (~30s).
2. **Daemon heartbeat** (fallback): Runs cleanup every 3 minutes as safety net
when deacon is down.
Detection uses TTY column - processes with TTY "?" have no controlling terminal.
This is safe because:
- Processes in terminals (user sessions) have a TTY like "pts/0" - untouched
- Only kills processes with no controlling terminal
- Orphaned subagents are children of tmux server with no TTY
New files:
- internal/util/orphan.go: FindOrphanedClaudeProcesses, CleanupOrphanedClaudeProcesses
- internal/util/orphan_test.go: Tests for orphan detection
New command:
- `gt deacon cleanup-orphans`: Manual/patrol-triggered cleanup
Fixes #587
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(orphan): add Windows build tag and minimum age check
Addresses review feedback on PR #588:
1. Add //go:build !windows to orphan.go and orphan_test.go
- The code uses Unix-specific syscalls (SIGTERM, ESRCH) and
ps command options that don't exist on Windows
2. Add minimum age check (60 seconds) to prevent false positives
- Prevents race conditions with newly spawned subagents
- Addresses reviewer concern about cron/systemd processes
- Uses portable etime format instead of Linux-only etimes
3. Add parseEtime helper with comprehensive tests
- Parses [[DD-]HH:]MM:SS format (works on both Linux and macOS)
- etimes (seconds) is Linux-specific, etime is portable
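
For illustration, parsing the `[[DD-]HH:]MM:SS` format could look roughly like this. It is a sketch written from the format description only, not the parseEtime helper added in this commit; the error messages and the rule that a day prefix requires an hour field are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEtime converts ps "etime" output ([[DD-]HH:]MM:SS) to seconds.
func parseEtime(s string) (int, error) {
	s = strings.TrimSpace(s)
	days, hasDays := 0, false
	if i := strings.IndexByte(s, '-'); i >= 0 {
		d, err := strconv.Atoi(s[:i])
		if err != nil {
			return 0, fmt.Errorf("invalid days in %q: %w", s, err)
		}
		days, hasDays = d, true
		s = s[i+1:]
	}
	parts := strings.Split(s, ":")
	if len(parts) < 2 || len(parts) > 3 || (hasDays && len(parts) != 3) {
		return 0, fmt.Errorf("unexpected etime format %q", s)
	}
	total := 0
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return 0, fmt.Errorf("invalid field %q: %w", p, err)
		}
		total = total*60 + n // shift hours/minutes left by one base-60 digit
	}
	return days*86400 + total, nil
}

func main() {
	for _, in := range []string{"01:30", "02:03:04", "1-02:03:04"} {
		secs, err := parseEtime(in)
		fmt.Println(in, secs, err)
	}
}
```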
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(orphan): add proper SIGTERM→SIGKILL escalation with state tracking
Previous approach used process age which doesn't work: a Task subagent
runs without TTY from birth, so a long-running legitimate subagent that
later fails to exit would be immediately SIGKILLed without trying SIGTERM.
New approach uses a state file to track signal history:
1. First encounter → SIGTERM, record PID + timestamp in state file
2. Next cycle (after 60s grace period) → if still alive, SIGKILL
3. Next cycle → if survived SIGKILL, log as unkillable and remove
State file: $XDG_RUNTIME_DIR/gastown-orphan-state (or /tmp/)
Format: "<pid> <signal> <unix_timestamp>" per line
The state file is automatically cleaned up:
- Dead processes removed on load
- Unkillable processes removed after logging
Also updates callers to use new CleanupResult type which includes
the signal sent (SIGTERM, SIGKILL, or UNKILLABLE).
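
The escalation described above can be sketched as a small state machine keyed on the previous state-file entry. The type and function names are hypothetical, the empty string for "still inside the grace period" is an assumption, and the sketch only decides which action comes next; it does not send signals.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// signalRecord mirrors one "<pid> <signal> <unix_timestamp>" state-file line.
type signalRecord struct {
	PID    int
	Signal string
	SentAt int64
}

// parseStateLine parses a single state-file line.
func parseStateLine(line string) (signalRecord, error) {
	f := strings.Fields(line)
	if len(f) != 3 {
		return signalRecord{}, fmt.Errorf("want 3 fields, got %d in %q", len(f), line)
	}
	pid, err := strconv.Atoi(f[0])
	if err != nil {
		return signalRecord{}, fmt.Errorf("bad pid: %w", err)
	}
	ts, err := strconv.ParseInt(f[2], 10, 64)
	if err != nil {
		return signalRecord{}, fmt.Errorf("bad timestamp: %w", err)
	}
	return signalRecord{PID: pid, Signal: f[1], SentAt: ts}, nil
}

const graceSeconds = 60 // grace period between SIGTERM and SIGKILL

// nextAction returns the step to take for a still-alive orphan, given its
// previous record (nil on first encounter). "" means keep waiting.
func nextAction(prev *signalRecord, now int64) string {
	switch {
	case prev == nil:
		return "SIGTERM"
	case prev.Signal == "SIGTERM" && now-prev.SentAt >= graceSeconds:
		return "SIGKILL"
	case prev.Signal == "SIGKILL":
		return "UNKILLABLE"
	default:
		return ""
	}
}

func main() {
	rec, err := parseStateLine("4242 SIGTERM 1700000000")
	fmt.Println(rec, err)
	fmt.Println(nextAction(nil, 1700000000), nextAction(&rec, 1700000090))
}
```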
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* test(util): add comprehensive tests for atomic write functions
Add tests for:
- File permissions
- Empty data handling
- Various JSON types (string, int, float, bool, null, array, nested)
- Unmarshallable types error handling
- Read-only directory permission errors
- Concurrent writes
- Original content preservation on failure
- Struct serialization/deserialization
- Large data (1MB)
* test(connection): add edge case tests for address parsing
Add comprehensive test coverage for ParseAddress edge cases:
- Empty/whitespace/slash-only inputs
- Leading/trailing slash handling
- Machine prefix edge cases (colons, empty machine)
- Multiple slashes in polecat name (SplitN behavior)
- Unicode and emoji support
- Very long addresses
- Special characters (hyphens, underscores, dots)
- Whitespace in components
Also adds tests for MustParseAddress panic behavior and RigPath method.
Closes: gt-xgjyp
* test(checkpoint): add comprehensive test coverage for checkpoint package
Tests all public functions: Read, Write, Remove, Capture, WithMolecule,
WithHookedBead, WithNotes, Age, IsStale, Summary, Path.
Edge cases covered: missing file, corrupted JSON, stale detection.
Closes: gt-09yn1
* test(lock): add comprehensive tests for lock package
Add lock_test.go with tests covering:
- LockInfo.IsStale() with valid/invalid PIDs
- Lock.Acquire/Release lifecycle
- Re-acquiring own lock (session refresh)
- Stale lock cleanup during Acquire
- Lock.Read() for missing/invalid/valid files
- Lock.Check() for unlocked/owned/stale scenarios
- Lock.Status() string formatting
- Lock.ForceRelease()
- processExists() helper
- FindAllLocks() directory scanning
- CleanStaleLocks() with mocked tmux
- getActiveTmuxSessions() parsing
- splitOnColon() and splitLines() helpers
- DetectCollisions() for stale/orphaned locks
Coverage: 84.4%
* test(keepalive): add example tests demonstrating usage patterns
Add ExampleTouchInWorkspace, ExampleRead, and ExampleState_Age to
serve as documentation for how to use the keepalive package.
* fix(test): correct boundary test timing race in checkpoint_test.go
The 'exactly threshold' test case was flaky due to timing: by the time
time.Since() runs after setting Timestamp, microseconds have passed,
making age > threshold. Changed expectation to true since at-threshold
is effectively stale.
---------
Co-authored-by: slit <gt@gastown.local>
Replace ProcessExists() checks in witness and refinery managers with
tmux session detection. Agent liveness should be derived from tmux
session state, not PID probing (per ZFC tracking principles).
- Remove util.ProcessExists() from witness/manager.go and refinery/manager.go
- Delete internal/util/process.go and process_test.go (now unused)
- Foreground mode and Stop() now rely solely on tmux HasSession/KillSession
Closes: hq-yxkdr (recentDeaths already removed)
Closes: hq-1sd4o (ProcessExists removed)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Create util.ExecWithOutput and util.ExecRun to consolidate repeated
exec.Command patterns across witness/handlers.go and refinery/manager.go.
Changes:
- Add internal/util/exec.go with ExecWithOutput (returns stdout) and
ExecRun (runs command without output)
- Refactor witness/handlers.go to use utility functions (7 call sites)
- Refactor refinery/manager.go, removing unused gitRun/gitOutput methods
- Add comprehensive tests in exec_test.go
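
A sketch of what such helpers might look like. The signatures below (working directory first, then command and arguments) are an assumption made for illustration and may differ from internal/util/exec.go.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// execWithOutput runs a command in workDir and returns trimmed stdout.
// On failure, any stderr text is folded into the returned error.
func execWithOutput(workDir, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	cmd.Dir = workDir // empty string means the current directory
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		if msg := strings.TrimSpace(stderr.String()); msg != "" {
			return "", fmt.Errorf("%s: %w", msg, err)
		}
		return "", err
	}
	return strings.TrimSpace(stdout.String()), nil
}

// execRun runs a command and discards its output, returning only the error.
func execRun(workDir, name string, args ...string) error {
	_, err := execWithOutput(workDir, name, args...)
	return err
}

func main() {
	out, err := execWithOutput("", "go", "version")
	fmt.Println(out, err)
}
```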
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Inverse of 'gt rig shutdown'. Starts rig patrol agents:
- Checks tmux sessions to avoid duplicates
- Starts witness if not running
- Starts refinery if not running
- Reports what was started vs skipped
Also adds ProcessExists util function needed by witness/refinery managers.
Move duplicated processExists function to shared util package:
- Create internal/util/process.go with ProcessExists function
- Add internal/util/process_test.go with basic tests
- Update witness/manager.go to use util.ProcessExists
- Update refinery/manager.go to use util.ProcessExists
- Remove local processExists functions from both files
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents data loss from concurrent/interrupted state file writes by using
atomic write pattern (write to .tmp, then rename).
Changes:
- Add internal/util package with AtomicWriteJSON/AtomicWriteFile helpers
- Update witness/manager.go saveState to use atomic writes
- Update refinery/manager.go saveState to use atomic writes
- Update crew/manager.go saveState to use atomic writes
- Update daemon/types.go SaveState to use atomic writes
- Update polecat/namepool.go Save to use atomic writes
- Add comprehensive tests for atomic write utilities
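
The write-to-.tmp-then-rename pattern can be sketched as follows. This is an illustrative version, not the AtomicWriteFile helper itself: the function names are hypothetical, and a production helper would likely also sync the file and use a unique temp name to cope with concurrent writers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to path+".tmp" and then renames it over path,
// so readers see either the old contents or the new, never a partial file.
func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, perm); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		os.Remove(tmp) // best-effort cleanup; the original file is untouched
		return err
	}
	return nil
}

// writeTwiceAndRead overwrites a state file in a scratch directory and
// returns its final contents.
func writeTwiceAndRead() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state.json")
	for _, body := range []string{`{"n":1}`, `{"n":2}`} {
		if err := atomicWriteFile(path, []byte(body), 0o644); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(writeTwiceAndRead())
}
```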
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>