* fix(down): add refinery shutdown to gt down
Refineries were not being stopped by gt down, causing them to continue
running after shutdown. This adds a refinery shutdown loop before
witnesses, fixing problem P3 from the v2.4 proposal.
Changes:
- Add Phase 1: Stop refineries (gt-<rig>-refinery sessions)
- Renumber existing phases (witnesses now Phase 2, etc.)
- Include refineries in halt event logging
* feat(beads): add StopAllBdProcesses for shutdown
Add functions to stop bd daemon and bd activity processes:
- StopAllBdProcesses(dryRun, force) - main entry point
- CountBdDaemons() - count running bd daemons
- CountBdActivityProcesses() - count running bd activity processes
- stopBdDaemons() - uses bd daemon killall
- stopBdActivityProcesses() - SIGTERM->wait->SIGKILL pattern
This solves problems P1 (bd daemon respawns sessions) and P2 (bd activity
causes instant wakeups) from the v2.4 proposal.
* feat(down): rename --all to --nuke, add new --all and --dry-run flags
BREAKING CHANGE: --all now stops bd processes instead of killing tmux server.
Use --nuke for the old --all behavior (killing the entire tmux server).
New flags:
- --all: Stop bd daemons/activity processes and verify shutdown
- --nuke: Kill entire tmux server (DESTRUCTIVE, with warning)
- --dry-run: Preview what would be stopped without taking action
This solves problem P4 (old --all was too destructive) from the v2.4 proposal.
The --nuke flag now requires the GT_NUKE_ACKNOWLEDGED=1 environment variable;
without it, the command prints a warning about destroying all tmux sessions
and refuses to proceed.
* feat(down): add shutdown lock to prevent concurrent runs
Add Phase 0 that acquires a file lock before shutdown to prevent race
conditions when multiple gt down commands are run concurrently.
- Uses gofrs/flock for cross-platform file locking
- Lock file stored at ~/gt/daemon/shutdown.lock
- 5 second timeout with 100ms retry interval
- Lock released via defer on successful acquisition
- Dry-run mode skips lock acquisition
This solves problem P6 (concurrent shutdown race) from the v2.4 proposal.
* feat(down): add verification phase for respawn detection
Add Phase 5 that verifies shutdown was complete after stopping all services:
- Waits 500ms for processes to fully terminate
- Checks for respawned bd daemons
- Checks for respawned bd activity processes
- Checks for remaining gt-*/hq-* tmux sessions
- Checks if daemon PID is still running
If anything respawned, warns user and suggests checking systemd/launchd.
This solves problem P5 (no verification) from the v2.4 proposal.
* test(down): add unit tests for shutdown functionality
Add tests for:
- parseBdDaemonCount() - array, object with count, object with daemons, empty, invalid
- CountBdActivityProcesses() - integration test
- CountBdDaemons() - integration test (skipped if bd not installed)
- StopAllBdProcesses() - dry-run mode test
- isProcessRunning() - current process, invalid PID, max PID
These tests cover the core parsing and process detection logic added
in the v2.4 shutdown enhancement.
* fix(review): add tmux check and pkill fallback for bd shutdown
Address review gaps against proposal v2.4 AC:
- AC1: Add tmux availability check BEFORE acquiring shutdown lock
- AC2: Add pkill fallback for bd daemon when killall incomplete
- AC2: Return remaining count from stop functions for error reporting
- Style: interface{} → any (Go 1.18+)
* fix(prime): add validation for --state flag combination
The --state flag should be standalone and not combined with other flags.
Add validation at start of runPrime to enforce this.
Fixes TestPrimeFlagCombinations test failures.
* fix(review): address bot review critical issues
- isProcessRunning: handle pid<=0 as invalid (return false)
- isProcessRunning: handle EPERM as process exists (return true)
- stopBdDaemons: prevent negative killed count from race conditions
- stopBdActivityProcesses: prevent negative killed count from race conditions
* fix(review): critical fixes from deep review
Platform fixes:
- CountBdActivityProcesses: use sh -c "pgrep | wc -l" for macOS compatibility
(pgrep -c flag not available on BSD/macOS)
Correctness fixes:
- stopSession: return (wasRunning, error) to distinguish "stopped" vs "not running"
- daemon.IsRunning: handle error instead of ignoring with blank identifier
- stopBdDaemons/stopBdActivityProcesses: guard against negative killed counts
Safety fixes:
- --nuke: require GT_NUKE_ACKNOWLEDGED=1, don't just warn and proceed
- pkill patterns: document limitation about broad matching
Code cleanup:
- EnsureBdDaemonHealth: remove unused issues variable
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
+148
-10
@@ -5,9 +5,15 @@ import (
 	"encoding/json"
 	"fmt"
 	"os/exec"
+	"strconv"
+	"strings"
 	"time"
 )
 
+const (
+	gracefulTimeout = 2 * time.Second
+)
+
 // BdDaemonInfo represents the status of a single bd daemon instance.
 type BdDaemonInfo struct {
 	Workspace string `json:"workspace"`
@@ -69,21 +75,12 @@ func EnsureBdDaemonHealth(workDir string) string {
 
 	// Check if any daemons need attention
 	needsRestart := false
-	var issues []string
 
 	for _, d := range health.Daemons {
 		switch d.Status {
 		case "healthy":
 			// Good
-		case "version_mismatch":
+		case "version_mismatch", "stale", "unresponsive":
 			needsRestart = true
-			issues = append(issues, fmt.Sprintf("%s: version mismatch", d.Workspace))
-		case "stale":
-			needsRestart = true
-			issues = append(issues, fmt.Sprintf("%s: stale", d.Workspace))
-		case "unresponsive":
-			needsRestart = true
-			issues = append(issues, fmt.Sprintf("%s: unresponsive", d.Workspace))
 		}
 	}
 
@@ -132,3 +129,144 @@ func StartBdDaemonIfNeeded(workDir string) error {
 	cmd.Dir = workDir
 	return cmd.Run()
 }
+
+// StopAllBdProcesses stops all bd daemon and activity processes.
+// Returns (daemonsKilled, activityKilled, error).
+// If dryRun is true, returns counts without stopping anything.
+func StopAllBdProcesses(dryRun, force bool) (int, int, error) {
+	if _, err := exec.LookPath("bd"); err != nil {
+		return 0, 0, nil
+	}
+
+	daemonsBefore := CountBdDaemons()
+	activityBefore := CountBdActivityProcesses()
+
+	if dryRun {
+		return daemonsBefore, activityBefore, nil
+	}
+
+	daemonsKilled, daemonsRemaining := stopBdDaemons(force)
+	activityKilled, activityRemaining := stopBdActivityProcesses(force)
+
+	if daemonsRemaining > 0 {
+		return daemonsKilled, activityKilled, fmt.Errorf("bd daemon shutdown incomplete: %d still running", daemonsRemaining)
+	}
+	if activityRemaining > 0 {
+		return daemonsKilled, activityKilled, fmt.Errorf("bd activity shutdown incomplete: %d still running", activityRemaining)
+	}
+
+	return daemonsKilled, activityKilled, nil
+}
+
+// CountBdDaemons returns count of running bd daemons.
+func CountBdDaemons() int {
+	listCmd := exec.Command("bd", "daemon", "list", "--json")
+	output, err := listCmd.Output()
+	if err != nil {
+		return 0
+	}
+	return parseBdDaemonCount(output)
+}
+
+// parseBdDaemonCount parses bd daemon list --json output.
+func parseBdDaemonCount(output []byte) int {
+	if len(output) == 0 {
+		return 0
+	}
+
+	var daemons []any
+	if err := json.Unmarshal(output, &daemons); err == nil {
+		return len(daemons)
+	}
+
+	var wrapper struct {
+		Daemons []any `json:"daemons"`
+		Count   int   `json:"count"`
+	}
+	if err := json.Unmarshal(output, &wrapper); err == nil {
+		if wrapper.Count > 0 {
+			return wrapper.Count
+		}
+		return len(wrapper.Daemons)
+	}
+
+	return 0
+}
+
+func stopBdDaemons(force bool) (int, int) {
+	before := CountBdDaemons()
+	if before == 0 {
+		return 0, 0
+	}
+
+	killCmd := exec.Command("bd", "daemon", "killall")
+	_ = killCmd.Run()
+
+	time.Sleep(100 * time.Millisecond)
+
+	after := CountBdDaemons()
+	if after == 0 {
+		return before, 0
+	}
+
+	// Note: pkill -f pattern may match unintended processes in rare cases
+	// (e.g., editors with "bd daemon" in file content). This is acceptable
+	// as a fallback when bd daemon killall fails.
+	if force {
+		_ = exec.Command("pkill", "-9", "-f", "bd daemon").Run()
+	} else {
+		_ = exec.Command("pkill", "-TERM", "-f", "bd daemon").Run()
+		time.Sleep(gracefulTimeout)
+		if remaining := CountBdDaemons(); remaining > 0 {
+			_ = exec.Command("pkill", "-9", "-f", "bd daemon").Run()
+		}
+	}
+
+	time.Sleep(100 * time.Millisecond)
+
+	final := CountBdDaemons()
+	killed := before - final
+	if killed < 0 {
+		killed = 0 // Race condition: more processes spawned than we killed
+	}
+	return killed, final
+}
+
+// CountBdActivityProcesses returns count of running `bd activity` processes.
+func CountBdActivityProcesses() int {
+	// Use pgrep -f with wc -l for cross-platform compatibility
+	// (macOS pgrep doesn't support -c flag)
+	cmd := exec.Command("sh", "-c", "pgrep -f 'bd activity' 2>/dev/null | wc -l")
+	output, err := cmd.Output()
+	if err != nil {
+		return 0
+	}
+	count, _ := strconv.Atoi(strings.TrimSpace(string(output)))
+	return count
+}
+
+func stopBdActivityProcesses(force bool) (int, int) {
+	before := CountBdActivityProcesses()
+	if before == 0 {
+		return 0, 0
+	}
+
+	if force {
+		_ = exec.Command("pkill", "-9", "-f", "bd activity").Run()
+	} else {
+		_ = exec.Command("pkill", "-TERM", "-f", "bd activity").Run()
+		time.Sleep(gracefulTimeout)
+		if remaining := CountBdActivityProcesses(); remaining > 0 {
+			_ = exec.Command("pkill", "-9", "-f", "bd activity").Run()
+		}
+	}
+
+	time.Sleep(100 * time.Millisecond)
+
+	after := CountBdActivityProcesses()
+	killed := before - after
+	if killed < 0 {
+		killed = 0 // Race condition: more processes spawned than we killed
+	}
+	return killed, after
+}

@@ -0,0 +1,73 @@
+package beads
+
+import (
+	"os/exec"
+	"testing"
+)
+
+func TestParseBdDaemonCount_Array(t *testing.T) {
+	input := []byte(`[{"pid":1234},{"pid":5678}]`)
+	count := parseBdDaemonCount(input)
+	if count != 2 {
+		t.Errorf("expected 2, got %d", count)
+	}
+}
+
+func TestParseBdDaemonCount_ObjectWithCount(t *testing.T) {
+	input := []byte(`{"count":3,"daemons":[{},{},{}]}`)
+	count := parseBdDaemonCount(input)
+	if count != 3 {
+		t.Errorf("expected 3, got %d", count)
+	}
+}
+
+func TestParseBdDaemonCount_ObjectWithDaemons(t *testing.T) {
+	input := []byte(`{"daemons":[{},{}]}`)
+	count := parseBdDaemonCount(input)
+	if count != 2 {
+		t.Errorf("expected 2, got %d", count)
+	}
+}
+
+func TestParseBdDaemonCount_Empty(t *testing.T) {
+	input := []byte(``)
+	count := parseBdDaemonCount(input)
+	if count != 0 {
+		t.Errorf("expected 0, got %d", count)
+	}
+}
+
+func TestParseBdDaemonCount_Invalid(t *testing.T) {
+	input := []byte(`not json`)
+	count := parseBdDaemonCount(input)
+	if count != 0 {
+		t.Errorf("expected 0 for invalid JSON, got %d", count)
+	}
+}
+
+func TestCountBdActivityProcesses(t *testing.T) {
+	count := CountBdActivityProcesses()
+	if count < 0 {
+		t.Errorf("count should be non-negative, got %d", count)
+	}
+}
+
+func TestCountBdDaemons(t *testing.T) {
+	if _, err := exec.LookPath("bd"); err != nil {
+		t.Skip("bd not installed")
+	}
+	count := CountBdDaemons()
+	if count < 0 {
+		t.Errorf("count should be non-negative, got %d", count)
+	}
+}
+
+func TestStopAllBdProcesses_DryRun(t *testing.T) {
+	daemonsKilled, activityKilled, err := StopAllBdProcesses(true, false)
+	if err != nil {
+		t.Errorf("unexpected error: %v", err)
+	}
+	if daemonsKilled < 0 || activityKilled < 0 {
+		t.Errorf("counts should be non-negative: daemons=%d, activity=%d", daemonsKilled, activityKilled)
+	}
+}
+252
-27
@@ -1,10 +1,17 @@
 package cmd
 
 import (
+	"context"
 	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+	"syscall"
 	"time"
 
+	"github.com/gofrs/flock"
 	"github.com/spf13/cobra"
+	"github.com/steveyegge/gastown/internal/beads"
 	"github.com/steveyegge/gastown/internal/daemon"
 	"github.com/steveyegge/gastown/internal/events"
 	"github.com/steveyegge/gastown/internal/session"
@@ -13,6 +20,11 @@ import (
 	"github.com/steveyegge/gastown/internal/workspace"
 )
 
+const (
+	shutdownLockFile    = "daemon/shutdown.lock"
+	shutdownLockTimeout = 5 * time.Second
+)
+
 var downCmd = &cobra.Command{
 	Use:     "down",
 	GroupID: GroupServices,
@@ -21,19 +33,21 @@ var downCmd = &cobra.Command{
 
 This gracefully shuts down all infrastructure agents:
 
+  • Refineries - Per-rig work processors
   • Witnesses - Per-rig polecat managers
   • Mayor - Global work coordinator
   • Boot - Deacon's watchdog
   • Deacon - Health orchestrator
   • Daemon - Go background process
 
-Polecats are NOT stopped by this command - use 'gt swarm stop' or
-kill individual polecats with 'gt polecat kill'.
+With --all, also stops resurrection layer (bd daemon/activity) and verifies
+shutdown. Polecats are NOT stopped - use 'gt swarm stop' for that.
 
-This is useful for:
-  • Taking a break (stop token consumption)
-  • Clean shutdown before system maintenance
-  • Resetting the town to a clean state`,
+Flags:
+  --all      Stop bd daemons/activity, verify complete shutdown
+  --nuke     Kill entire tmux server (DESTRUCTIVE!)
+  --dry-run  Preview what would be stopped
+  --force    Skip graceful shutdown, use SIGKILL`,
 	RunE: runDown,
 }
 
@@ -41,12 +55,16 @@ var (
 	downQuiet  bool
 	downForce  bool
 	downAll    bool
+	downNuke   bool
+	downDryRun bool
 )
 
 func init() {
 	downCmd.Flags().BoolVarP(&downQuiet, "quiet", "q", false, "Only show errors")
 	downCmd.Flags().BoolVarP(&downForce, "force", "f", false, "Force kill without graceful shutdown")
-	downCmd.Flags().BoolVarP(&downAll, "all", "a", false, "Also kill the tmux server")
+	downCmd.Flags().BoolVarP(&downAll, "all", "a", false, "Stop bd daemons/activity and verify shutdown")
+	downCmd.Flags().BoolVar(&downNuke, "nuke", false, "Kill entire tmux server (DESTRUCTIVE - kills non-GT sessions!)")
+	downCmd.Flags().BoolVar(&downDryRun, "dry-run", false, "Preview what would be stopped without taking action")
	rootCmd.AddCommand(downCmd)
 }
 
@@ -57,24 +75,103 @@ func runDown(cmd *cobra.Command, args []string) error {
 	}
 
 	t := tmux.NewTmux()
+	if !t.IsAvailable() {
+		return fmt.Errorf("tmux not available (is tmux installed and on PATH?)")
+	}
+
+	// Phase 0: Acquire shutdown lock (skip for dry-run)
+	if !downDryRun {
+		lock, err := acquireShutdownLock(townRoot)
+		if err != nil {
+			return fmt.Errorf("cannot proceed: %w", err)
+		}
+		defer lock.Unlock()
+	}
 	allOK := true
 
-	// Stop in reverse order of startup
+	if downDryRun {
+		fmt.Println("═══ DRY RUN: Preview of shutdown actions ═══")
+		fmt.Println()
+	}
 
-	// 1. Stop witnesses first
-	rigs := discoverRigs(townRoot)
-	for _, rigName := range rigs {
-		sessionName := fmt.Sprintf("gt-%s-witness", rigName)
-		if err := stopSession(t, sessionName); err != nil {
-			printDownStatus(fmt.Sprintf("Witness (%s)", rigName), false, err.Error())
+	// Phase 1: Stop bd resurrection layer (--all only)
+	if downAll {
+		daemonsKilled, activityKilled, err := beads.StopAllBdProcesses(downDryRun, downForce)
+		if err != nil {
+			printDownStatus("bd processes", false, err.Error())
 			allOK = false
 		} else {
-			printDownStatus(fmt.Sprintf("Witness (%s)", rigName), true, "stopped")
+			if downDryRun {
+				if daemonsKilled > 0 || activityKilled > 0 {
+					printDownStatus("bd daemon", true, fmt.Sprintf("%d would stop", daemonsKilled))
+					printDownStatus("bd activity", true, fmt.Sprintf("%d would stop", activityKilled))
+				} else {
+					printDownStatus("bd processes", true, "none running")
+				}
+			} else {
+				if daemonsKilled > 0 {
+					printDownStatus("bd daemon", true, fmt.Sprintf("%d stopped", daemonsKilled))
+				}
+				if activityKilled > 0 {
+					printDownStatus("bd activity", true, fmt.Sprintf("%d stopped", activityKilled))
+				}
+				if daemonsKilled == 0 && activityKilled == 0 {
+					printDownStatus("bd processes", true, "none running")
+				}
+			}
 		}
 	}
 
-	// 2. Stop town-level sessions (Mayor, Boot, Deacon) in correct order
+	rigs := discoverRigs(townRoot)
+
+	// Phase 2a: Stop refineries
+	for _, rigName := range rigs {
+		sessionName := fmt.Sprintf("gt-%s-refinery", rigName)
+		if downDryRun {
+			if running, _ := t.HasSession(sessionName); running {
+				printDownStatus(fmt.Sprintf("Refinery (%s)", rigName), true, "would stop")
+			}
+			continue
+		}
+		wasRunning, err := stopSession(t, sessionName)
+		if err != nil {
+			printDownStatus(fmt.Sprintf("Refinery (%s)", rigName), false, err.Error())
+			allOK = false
+		} else if wasRunning {
+			printDownStatus(fmt.Sprintf("Refinery (%s)", rigName), true, "stopped")
+		} else {
+			printDownStatus(fmt.Sprintf("Refinery (%s)", rigName), true, "not running")
+		}
+	}
+
+	// Phase 2b: Stop witnesses
+	for _, rigName := range rigs {
+		sessionName := fmt.Sprintf("gt-%s-witness", rigName)
+		if downDryRun {
+			if running, _ := t.HasSession(sessionName); running {
+				printDownStatus(fmt.Sprintf("Witness (%s)", rigName), true, "would stop")
+			}
+			continue
+		}
+		wasRunning, err := stopSession(t, sessionName)
+		if err != nil {
+			printDownStatus(fmt.Sprintf("Witness (%s)", rigName), false, err.Error())
+			allOK = false
+		} else if wasRunning {
+			printDownStatus(fmt.Sprintf("Witness (%s)", rigName), true, "stopped")
+		} else {
+			printDownStatus(fmt.Sprintf("Witness (%s)", rigName), true, "not running")
+		}
+	}
+
+	// Phase 3: Stop town-level sessions (Mayor, Boot, Deacon)
 	for _, ts := range session.TownSessions() {
+		if downDryRun {
+			if running, _ := t.HasSession(ts.SessionID); running {
+				printDownStatus(ts.Name, true, "would stop")
+			}
+			continue
+		}
 		stopped, err := session.StopTownSession(t, ts, downForce)
 		if err != nil {
 			printDownStatus(ts.Name, false, err.Error())
@@ -86,38 +183,88 @@ func runDown(cmd *cobra.Command, args []string) error {
 		}
 	}
 
-	// 3. Stop Daemon last
-	running, _, _ := daemon.IsRunning(townRoot)
+	// Phase 4: Stop Daemon
+	running, pid, daemonErr := daemon.IsRunning(townRoot)
+	if daemonErr != nil {
+		printDownStatus("Daemon", false, fmt.Sprintf("status check failed: %v", daemonErr))
+		allOK = false
+	} else if downDryRun {
+		if running {
+			printDownStatus("Daemon", true, fmt.Sprintf("would stop (PID %d)", pid))
+		}
+	} else {
 	if running {
 		if err := daemon.StopDaemon(townRoot); err != nil {
 			printDownStatus("Daemon", false, err.Error())
 			allOK = false
 		} else {
-			printDownStatus("Daemon", true, "stopped")
+			printDownStatus("Daemon", true, fmt.Sprintf("stopped (was PID %d)", pid))
 		}
 	} else {
 		printDownStatus("Daemon", true, "not running")
 	}
+	}
 
-	// 4. Kill tmux server if --all
-	if downAll {
+	// Phase 5: Verification (--all only)
+	if downAll && !downDryRun {
+		time.Sleep(500 * time.Millisecond)
+		respawned := verifyShutdown(t, townRoot)
+		if len(respawned) > 0 {
+			fmt.Println()
+			fmt.Printf("%s Warning: Some processes may have respawned:\n", style.Bold.Render("⚠"))
+			for _, r := range respawned {
+				fmt.Printf(" • %s\n", r)
+			}
+			fmt.Println()
+			fmt.Printf("This may indicate systemd/launchd is managing bd.\n")
+			fmt.Printf("Check with:\n")
+			fmt.Printf(" %s\n", style.Dim.Render("systemctl status bd-daemon # Linux"))
+			fmt.Printf(" %s\n", style.Dim.Render("launchctl list | grep bd # macOS"))
+			allOK = false
+		}
+	}
+
+	// Phase 6: Nuke tmux server (--nuke only, DESTRUCTIVE)
+	if downNuke {
+		if downDryRun {
+			printDownStatus("Tmux server", true, "would kill (DESTRUCTIVE)")
+		} else if os.Getenv("GT_NUKE_ACKNOWLEDGED") == "" {
+			// Require explicit acknowledgement for destructive operation
+			fmt.Println()
+			fmt.Printf("%s The --nuke flag kills ALL tmux sessions, not just Gas Town.\n",
+				style.Bold.Render("⚠ BLOCKED:"))
+			fmt.Printf("This includes vim sessions, running builds, SSH connections, etc.\n")
+			fmt.Println()
+			fmt.Printf("To proceed, run with: %s\n", style.Bold.Render("GT_NUKE_ACKNOWLEDGED=1 gt down --nuke"))
+			allOK = false
+		} else {
 		if err := t.KillServer(); err != nil {
 			printDownStatus("Tmux server", false, err.Error())
 			allOK = false
 		} else {
-			printDownStatus("Tmux server", true, "killed")
+			printDownStatus("Tmux server", true, "killed (all tmux sessions destroyed)")
+		}
 		}
 	}
 
+	// Summary
 	fmt.Println()
+	if downDryRun {
+		fmt.Println("═══ DRY RUN COMPLETE (no changes made) ═══")
+		return nil
+	}
+
 	if allOK {
 		fmt.Printf("%s All services stopped\n", style.Bold.Render("✓"))
-		// Log halt event with stopped services
 		stoppedServices := []string{"daemon", "deacon", "boot", "mayor"}
 		for _, rigName := range rigs {
+			stoppedServices = append(stoppedServices, fmt.Sprintf("%s/refinery", rigName))
 			stoppedServices = append(stoppedServices, fmt.Sprintf("%s/witness", rigName))
 		}
 		if downAll {
+			stoppedServices = append(stoppedServices, "bd-processes")
+		}
+		if downNuke {
 			stoppedServices = append(stoppedServices, "tmux-server")
 		}
 		_ = events.LogFeed(events.TypeHalt, "gt", events.HaltPayload(stoppedServices))
@@ -141,13 +288,14 @@ func printDownStatus(name string, ok bool, detail string) {
 }
 
 // stopSession gracefully stops a tmux session.
-func stopSession(t *tmux.Tmux, sessionName string) error {
+// Returns (wasRunning, error) - wasRunning is true if session existed and was stopped.
+func stopSession(t *tmux.Tmux, sessionName string) (bool, error) {
 	running, err := t.HasSession(sessionName)
 	if err != nil {
-		return err
+		return false, err
 	}
 	if !running {
-		return nil // Already stopped
+		return false, nil // Already stopped
 	}
 
 	// Try graceful shutdown first (Ctrl-C, best-effort interrupt)
@@ -157,5 +305,82 @@ func stopSession(t *tmux.Tmux, sessionName string) error {
 	}
 
 	// Kill the session
-	return t.KillSession(sessionName)
+	return true, t.KillSession(sessionName)
+}
+
+// acquireShutdownLock prevents concurrent shutdowns.
+// Returns the lock (caller must defer Unlock()) or error if lock held.
+func acquireShutdownLock(townRoot string) (*flock.Flock, error) {
+	lockPath := filepath.Join(townRoot, shutdownLockFile)
+
+	if err := os.MkdirAll(filepath.Dir(lockPath), 0755); err != nil {
+		return nil, fmt.Errorf("creating lock directory: %w", err)
+	}
+
+	lock := flock.New(lockPath)
+
+	ctx, cancel := context.WithTimeout(context.Background(), shutdownLockTimeout)
+	defer cancel()
+
+	locked, err := lock.TryLockContext(ctx, 100*time.Millisecond)
+	if err != nil {
+		return nil, fmt.Errorf("lock acquisition failed: %w", err)
+	}
+
+	if !locked {
+		return nil, fmt.Errorf("another shutdown is in progress (lock held: %s)", lockPath)
+	}
+
+	return lock, nil
+}
+
+// verifyShutdown checks for respawned processes after shutdown.
+// Returns list of things that are still running or respawned.
+func verifyShutdown(t *tmux.Tmux, townRoot string) []string {
+	var respawned []string
+
+	if count := beads.CountBdDaemons(); count > 0 {
+		respawned = append(respawned, fmt.Sprintf("bd daemon (%d running)", count))
+	}
+
+	if count := beads.CountBdActivityProcesses(); count > 0 {
+		respawned = append(respawned, fmt.Sprintf("bd activity (%d running)", count))
+	}
+
+	sessions, err := t.ListSessions()
+	if err == nil {
+		for _, sess := range sessions {
+			if strings.HasPrefix(sess, "gt-") || strings.HasPrefix(sess, "hq-") {
+				respawned = append(respawned, fmt.Sprintf("tmux session %s", sess))
+			}
+		}
+	}
+
+	pidFile := filepath.Join(townRoot, "daemon", "daemon.pid")
+	if pidData, err := os.ReadFile(pidFile); err == nil {
+		var pid int
+		if _, err := fmt.Sscanf(string(pidData), "%d", &pid); err == nil {
+			if isProcessRunning(pid) {
+				respawned = append(respawned, fmt.Sprintf("gt daemon (PID %d)", pid))
+			}
+		}
+	}
+
+	return respawned
+}
+
+// isProcessRunning checks if a process with the given PID exists.
+func isProcessRunning(pid int) bool {
+	if pid <= 0 {
+		return false // Invalid PID
+	}
+	err := syscall.Kill(pid, 0)
+	if err == nil {
+		return true
+	}
+	// EPERM means process exists but we don't have permission to signal it
+	if err == syscall.EPERM {
+		return true
+	}
+	return false
 }

@@ -0,0 +1,24 @@
+package cmd
+
+import (
+	"os"
+	"testing"
+)
+
+func TestIsProcessRunning_CurrentProcess(t *testing.T) {
+	if !isProcessRunning(os.Getpid()) {
+		t.Error("current process should be detected as running")
+	}
+}
+
+func TestIsProcessRunning_InvalidPID(t *testing.T) {
+	if isProcessRunning(99999999) {
+		t.Error("invalid PID should not be detected as running")
+	}
+}
+
+func TestIsProcessRunning_MaxPID(t *testing.T) {
+	if isProcessRunning(2147483647) {
+		t.Error("max PID should not be running")
+	}
+}
||||||
Reference in New Issue
Block a user