* fix(down): add refinery shutdown to gt down
Refineries were not being stopped by gt down, causing them to continue
running after shutdown. This adds a refinery shutdown loop before
witnesses, fixing problem P3 from the v2.4 proposal.
Changes:
- Add Phase 1: Stop refineries (gt-<rig>-refinery sessions)
- Renumber existing phases (witnesses now Phase 2, etc.)
- Include refineries in halt event logging
* feat(beads): add StopAllBdProcesses for shutdown
Add functions to stop bd daemon and bd activity processes:
- StopAllBdProcesses(dryRun, force) - main entry point
- CountBdDaemons() - count running bd daemons
- CountBdActivityProcesses() - count running bd activity processes
- stopBdDaemons() - uses bd daemon killall
- stopBdActivityProcesses() - SIGTERM->wait->SIGKILL pattern
This solves problems P1 (bd daemon respawns sessions) and P2 (bd activity
causes instant wakeups) from the v2.4 proposal.
* feat(down): rename --all to --nuke, add new --all and --dry-run flags
BREAKING CHANGE: --all now stops bd processes instead of killing tmux server.
Use --nuke for the old --all behavior (killing the entire tmux server).
New flags:
- --all: Stop bd daemons/activity processes and verify shutdown
- --nuke: Kill entire tmux server (DESTRUCTIVE, with warning)
- --dry-run: Preview what would be stopped without taking action
This solves problem P4 (old --all was too destructive) from the v2.4 proposal.
The --nuke flag now requires the GT_NUKE_ACKNOWLEDGED=1 environment variable
to suppress the warning about destroying all tmux sessions.
* feat(down): add shutdown lock to prevent concurrent runs
Add Phase 0 that acquires a file lock before shutdown to prevent race
conditions when multiple gt down commands are run concurrently.
- Uses gofrs/flock for cross-platform file locking
- Lock file stored at ~/gt/daemon/shutdown.lock
- 5 second timeout with 100ms retry interval
- Lock released via defer on successful acquisition
- Dry-run mode skips lock acquisition
This solves problem P6 (concurrent shutdown race) from the v2.4 proposal.
* feat(down): add verification phase for respawn detection
Add Phase 5 that verifies shutdown was complete after stopping all services:
- Waits 500ms for processes to fully terminate
- Checks for respawned bd daemons
- Checks for respawned bd activity processes
- Checks for remaining gt-*/hq-* tmux sessions
- Checks if daemon PID is still running
If anything respawned, warns user and suggests checking systemd/launchd.
This solves problem P5 (no verification) from the v2.4 proposal.
* test(down): add unit tests for shutdown functionality
Add tests for:
- parseBdDaemonCount() - array, object with count, object with daemons, empty, invalid
- CountBdActivityProcesses() - integration test
- CountBdDaemons() - integration test (skipped if bd not installed)
- StopAllBdProcesses() - dry-run mode test
- isProcessRunning() - current process, invalid PID, max PID
These tests cover the core parsing and process detection logic added
in the v2.4 shutdown enhancement.
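The parsing cases listed above can be illustrated with a small table-driven check. The parser body mirrors parseBdDaemonCount from this change; the JSON payloads are made-up shapes for illustration and are not taken from real bd output. The actual tests would use the testing package; a plain main is used here so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseBdDaemonCount accepts either a bare JSON array of daemons or an
// object carrying a "count" and/or a "daemons" array; anything else is 0.
func parseBdDaemonCount(output []byte) int {
	if len(output) == 0 {
		return 0
	}
	var daemons []any
	if err := json.Unmarshal(output, &daemons); err == nil {
		return len(daemons)
	}
	var wrapper struct {
		Daemons []any `json:"daemons"`
		Count   int   `json:"count"`
	}
	if err := json.Unmarshal(output, &wrapper); err == nil {
		if wrapper.Count > 0 {
			return wrapper.Count
		}
		return len(wrapper.Daemons)
	}
	return 0
}

func main() {
	// Hypothetical inputs, one per case named in the commit message.
	cases := []struct {
		name  string
		input string
		want  int
	}{
		{"array", `[{"pid": 1}, {"pid": 2}]`, 2},
		{"object with count", `{"count": 3}`, 3},
		{"object with daemons", `{"daemons": [{"pid": 1}]}`, 1},
		{"empty", ``, 0},
		{"invalid", `not json`, 0},
	}
	for _, c := range cases {
		got := parseBdDaemonCount([]byte(c.input))
		fmt.Printf("%-20s got=%d want=%d\n", c.name, got, c.want)
	}
}
```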
* fix(review): add tmux check and pkill fallback for bd shutdown
Address review gaps against proposal v2.4 AC:
- AC1: Add tmux availability check BEFORE acquiring shutdown lock
- AC2: Add pkill fallback for bd daemon when killall incomplete
- AC2: Return remaining count from stop functions for error reporting
- Style: interface{} → any (Go 1.18+)
* fix(prime): add validation for --state flag combination
The --state flag should be standalone and not combined with other flags.
Add validation at start of runPrime to enforce this.
Fixes TestPrimeFlagCombinations test failures.
* fix(review): address bot review critical issues
- isProcessRunning: handle pid<=0 as invalid (return false)
- isProcessRunning: handle EPERM as process exists (return true)
- stopBdDaemons: prevent negative killed count from race conditions
- stopBdActivityProcesses: prevent negative killed count from race conditions
* fix(review): critical fixes from deep review
Platform fixes:
- CountBdActivityProcesses: use sh -c "pgrep | wc -l" for macOS compatibility
(pgrep -c flag not available on BSD/macOS)
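As a portable alternative to both `pgrep -c` and a shell pipeline, the PIDs that pgrep prints can be counted in Go. This is a sketch, not what this change does (it uses sh -c with wc -l); `countPIDs` and `countMatching` are hypothetical names. Calling pgrep directly may also avoid a known `pgrep -f` pitfall, where a wrapper shell whose own command line contains the pattern gets counted as a match.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// countPIDs counts the whitespace-separated entries in pgrep output,
// which prints one PID per line.
func countPIDs(out []byte) int {
	return len(strings.Fields(string(out)))
}

// countMatching (hypothetical) returns how many processes have a command
// line matching pattern, or 0 if pgrep finds nothing or cannot be run.
func countMatching(pattern string) int {
	out, err := exec.Command("pgrep", "-f", pattern).Output()
	if err != nil {
		// pgrep exits non-zero when nothing matches.
		return 0
	}
	return countPIDs(out)
}

func main() {
	fmt.Println(countPIDs([]byte("101\n202\n"))) // two PIDs in this sample
	fmt.Println(countMatching("bd activity"))
}
```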
Correctness fixes:
- stopSession: return (wasRunning, error) to distinguish "stopped" vs "not running"
- daemon.IsRunning: handle error instead of ignoring with blank identifier
- stopBdDaemons/stopBdActivityProcesses: guard against negative killed counts
Safety fixes:
- --nuke: require GT_NUKE_ACKNOWLEDGED=1, don't just warn and proceed
- pkill patterns: document limitation about broad matching
Code cleanup:
- EnsureBdDaemonHealth: remove unused issues variable
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
273 lines
7.4 KiB
Go
package beads

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

const (
	gracefulTimeout = 2 * time.Second
)

// BdDaemonInfo represents the status of a single bd daemon instance.
type BdDaemonInfo struct {
	Workspace       string `json:"workspace"`
	SocketPath      string `json:"socket_path"`
	PID             int    `json:"pid"`
	Version         string `json:"version"`
	Status          string `json:"status"`
	Issue           string `json:"issue,omitempty"`
	VersionMismatch bool   `json:"version_mismatch,omitempty"`
}

// BdDaemonHealth represents the overall health of bd daemons.
type BdDaemonHealth struct {
	Total        int            `json:"total"`
	Healthy      int            `json:"healthy"`
	Stale        int            `json:"stale"`
	Mismatched   int            `json:"mismatched"`
	Unresponsive int            `json:"unresponsive"`
	Daemons      []BdDaemonInfo `json:"daemons"`
}

// CheckBdDaemonHealth checks the health of all bd daemons.
// Returns nil if no daemons are running (which is fine, bd will use direct mode).
func CheckBdDaemonHealth() (*BdDaemonHealth, error) {
	cmd := exec.Command("bd", "daemon", "health", "--json")
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	if err != nil {
		// bd daemon health may fail if bd not installed or other issues
		// Return nil to indicate we can't check (not an error for status display)
		return nil, nil
	}

	var health BdDaemonHealth
	if err := json.Unmarshal(stdout.Bytes(), &health); err != nil {
		return nil, fmt.Errorf("parsing daemon health: %w", err)
	}

	return &health, nil
}

// EnsureBdDaemonHealth checks if bd daemons are healthy and attempts to restart if needed.
// Returns a warning message if there were issues, or empty string if everything is fine.
// This is non-blocking - it will not fail if daemons can't be started.
func EnsureBdDaemonHealth(workDir string) string {
	health, err := CheckBdDaemonHealth()
	if err != nil || health == nil {
		// Can't check daemon health - proceed without warning
		return ""
	}

	// No daemons running is fine - bd will use direct mode
	if health.Total == 0 {
		return ""
	}

	// Check if any daemons need attention
	needsRestart := false
	for _, d := range health.Daemons {
		switch d.Status {
		case "healthy":
			// Good
		case "version_mismatch", "stale", "unresponsive":
			needsRestart = true
		}
	}

	if !needsRestart {
		return ""
	}

	// Attempt to restart daemons
	if restartErr := restartBdDaemons(); restartErr != nil {
		return fmt.Sprintf("bd daemons unhealthy (restart failed: %v)", restartErr)
	}

	// Verify restart worked
	time.Sleep(500 * time.Millisecond)
	newHealth, err := CheckBdDaemonHealth()
	if err != nil || newHealth == nil {
		return "bd daemons restarted but status unknown"
	}

	if newHealth.Healthy < newHealth.Total {
		return fmt.Sprintf("bd daemons partially healthy (%d/%d)", newHealth.Healthy, newHealth.Total)
	}

	return "" // Successfully restarted
}

// restartBdDaemons restarts all bd daemons.
func restartBdDaemons() error { //nolint:unparam // error return kept for future use
	// Stop all daemons first
	stopCmd := exec.Command("bd", "daemon", "killall")
	_ = stopCmd.Run() // Ignore errors - daemons might not be running

	// Give time for cleanup
	time.Sleep(200 * time.Millisecond)

	// Start daemons for known locations
	// The daemon will auto-start when bd commands are run in those directories
	// Just running any bd command will trigger daemon startup if configured
	return nil
}

// StartBdDaemonIfNeeded starts the bd daemon for a specific workspace if not running.
// This is a best-effort operation - failures are logged but don't block execution.
func StartBdDaemonIfNeeded(workDir string) error {
	cmd := exec.Command("bd", "daemon", "--start")
	cmd.Dir = workDir
	return cmd.Run()
}

// StopAllBdProcesses stops all bd daemon and activity processes.
// Returns (daemonsKilled, activityKilled, error).
// If dryRun is true, returns counts without stopping anything.
func StopAllBdProcesses(dryRun, force bool) (int, int, error) {
	if _, err := exec.LookPath("bd"); err != nil {
		return 0, 0, nil
	}

	daemonsBefore := CountBdDaemons()
	activityBefore := CountBdActivityProcesses()

	if dryRun {
		return daemonsBefore, activityBefore, nil
	}

	daemonsKilled, daemonsRemaining := stopBdDaemons(force)
	activityKilled, activityRemaining := stopBdActivityProcesses(force)

	if daemonsRemaining > 0 {
		return daemonsKilled, activityKilled, fmt.Errorf("bd daemon shutdown incomplete: %d still running", daemonsRemaining)
	}
	if activityRemaining > 0 {
		return daemonsKilled, activityKilled, fmt.Errorf("bd activity shutdown incomplete: %d still running", activityRemaining)
	}

	return daemonsKilled, activityKilled, nil
}

// CountBdDaemons returns count of running bd daemons.
func CountBdDaemons() int {
	listCmd := exec.Command("bd", "daemon", "list", "--json")
	output, err := listCmd.Output()
	if err != nil {
		return 0
	}
	return parseBdDaemonCount(output)
}

// parseBdDaemonCount parses bd daemon list --json output.
func parseBdDaemonCount(output []byte) int {
	if len(output) == 0 {
		return 0
	}

	var daemons []any
	if err := json.Unmarshal(output, &daemons); err == nil {
		return len(daemons)
	}

	var wrapper struct {
		Daemons []any `json:"daemons"`
		Count   int   `json:"count"`
	}
	if err := json.Unmarshal(output, &wrapper); err == nil {
		if wrapper.Count > 0 {
			return wrapper.Count
		}
		return len(wrapper.Daemons)
	}

	return 0
}

// stopBdDaemons stops bd daemons with `bd daemon killall`, falling back to
// pkill if any remain. Returns (killed, remaining).
func stopBdDaemons(force bool) (int, int) {
	before := CountBdDaemons()
	if before == 0 {
		return 0, 0
	}

	killCmd := exec.Command("bd", "daemon", "killall")
	_ = killCmd.Run()

	time.Sleep(100 * time.Millisecond)

	after := CountBdDaemons()
	if after == 0 {
		return before, 0
	}

	// Note: pkill -f pattern may match unintended processes in rare cases
	// (e.g., editors with "bd daemon" in file content). This is acceptable
	// as a fallback when bd daemon killall fails.
	if force {
		_ = exec.Command("pkill", "-9", "-f", "bd daemon").Run()
	} else {
		_ = exec.Command("pkill", "-TERM", "-f", "bd daemon").Run()
		time.Sleep(gracefulTimeout)
		if remaining := CountBdDaemons(); remaining > 0 {
			_ = exec.Command("pkill", "-9", "-f", "bd daemon").Run()
		}
	}

	time.Sleep(100 * time.Millisecond)

	final := CountBdDaemons()
	killed := before - final
	if killed < 0 {
		killed = 0 // Race condition: more processes spawned than we killed
	}
	return killed, final
}

// CountBdActivityProcesses returns count of running `bd activity` processes.
func CountBdActivityProcesses() int {
	// Use pgrep -f with wc -l for cross-platform compatibility
	// (macOS pgrep doesn't support -c flag)
	cmd := exec.Command("sh", "-c", "pgrep -f 'bd activity' 2>/dev/null | wc -l")
	output, err := cmd.Output()
	if err != nil {
		return 0
	}
	count, _ := strconv.Atoi(strings.TrimSpace(string(output)))
	return count
}

// stopBdActivityProcesses stops `bd activity` processes with pkill: SIGKILL
// immediately if force is set, otherwise SIGTERM, a wait of gracefulTimeout,
// then SIGKILL for any survivors. Returns (killed, remaining).
func stopBdActivityProcesses(force bool) (int, int) {
	before := CountBdActivityProcesses()
	if before == 0 {
		return 0, 0
	}

	if force {
		_ = exec.Command("pkill", "-9", "-f", "bd activity").Run()
	} else {
		_ = exec.Command("pkill", "-TERM", "-f", "bd activity").Run()
		time.Sleep(gracefulTimeout)
		if remaining := CountBdActivityProcesses(); remaining > 0 {
			_ = exec.Command("pkill", "-9", "-f", "bd activity").Run()
		}
	}

	time.Sleep(100 * time.Millisecond)

	after := CountBdActivityProcesses()
	killed := before - after
	if killed < 0 {
		killed = 0 // Race condition: more processes spawned than we killed
	}
	return killed, after
}