diff --git a/docs/GH353_INVESTIGATION.md b/docs/GH353_INVESTIGATION.md deleted file mode 100644 index 573a6ead..00000000 --- a/docs/GH353_INVESTIGATION.md +++ /dev/null @@ -1,302 +0,0 @@ -# Investigation: GH #353 - Daemon Locking Issues in Codex Sandbox - -## Problem Summary - -When running `bd` inside the Codex sandbox (macOS host), users encounter persistent "Database out of sync with JSONL" errors that cannot be resolved through normal means (`bd import`). The root cause is a daemon process that the sandbox cannot signal or kill, creating a deadlock situation. - -## Root Cause Analysis - -### The Daemon Locking Mechanism - -The daemon uses three mechanisms to claim a database: - -1. **File lock (`flock`)** on `.beads/daemon.lock` - exclusive lock held while daemon is running -2. **PID file** at `.beads/daemon.pid` - contains daemon process ID (Windows compatibility) -3. **Lock metadata** in `daemon.lock` - JSON containing PID, database path, version, start time - -**Source:** `cmd/bd/daemon_lock.go` - -### Process Verification Issue - -On Unix systems, `isProcessRunning()` uses `syscall.Kill(pid, 0)` to check if a process exists. In sandboxed environments: - -- The daemon PID exists in the lock file -- `syscall.Kill(pid, 0)` returns EPERM (operation not permitted) -- The CLI can't verify if the daemon is actually running -- The CLI can't send signals to stop the daemon - -**Source:** `cmd/bd/daemon_unix.go:26-28` - -### Staleness Check Flow - -When running `bd ready` or other read commands: - -1. **With daemon connected:** - - Command → Daemon RPC → `checkAndAutoImportIfStale()` - - Daemon checks JSONL mtime vs `last_import_time` metadata - - Daemon auto-imports if stale (with safeguards) - - **Source:** `internal/rpc/server_export_import_auto.go:171-303` - -2. **Without daemon (direct mode):** - - Command → `ensureDatabaseFresh(ctx)` check - - Compares JSONL mtime vs `last_import_time` metadata - - **Refuses to proceed** if stale, shows error message - - **Source:** `cmd/bd/staleness.go:20-51` - -### The Deadlock Scenario - -1. Daemon is running outside sandbox with database lock -2. User (in sandbox) runs `bd ready` -3. CLI tries to connect to daemon → connection fails or daemon is unreachable -4. CLI falls back to direct mode -5. Direct mode checks staleness → JSONL is newer than metadata -6. Error: "Database out of sync with JSONL. Run 'bd import' first." -7. User runs `bd import -i .beads/beads.jsonl` -8. Import updates metadata in database file -9. **But daemon still running with OLD metadata cached in memory** -10. User runs `bd ready` again → CLI connects to daemon -11. Daemon checks staleness using **cached metadata** → still stale! -12. **Infinite loop:** Can't fix because can't restart daemon - -### Why `--no-daemon` Doesn't Always Work - -The `--no-daemon` flag should work by setting `daemonClient = nil` and skipping daemon connection (**source:** `cmd/bd/main.go:287-289`). However: - -1. If JSONL is genuinely newer than database (e.g., after `git pull`), the staleness check in direct mode will still fail -2. If the user doesn't specify `--no-daemon` consistently, the CLI will reconnect to the stale daemon -3. The daemon may still hold file locks that interfere with direct operations - -## Existing Workarounds - -### The `--sandbox` Flag - -Already exists! Sets: -- `noDaemon = true` (skip daemon) -- `noAutoFlush = true` (skip auto-flush) -- `noAutoImport = true` (skip auto-import) - -**Source:** `cmd/bd/main.go:201-206` - -**Issue:** Still runs staleness check in direct mode, which fails if JSONL is actually newer. - -## Proposed Solutions - -### Solution 1: Force-Import Flag (Quick Fix) ⭐ **Recommended** - -Add `--force` flag to `bd import` that: -- Updates `last_import_time` and `last_import_hash` metadata even when 0 issues imported -- Explicitly touches database file to update mtime -- Prints clear message: "Metadata updated (database already in sync)" - -**Pros:** -- Minimal code change -- Solves immediate problem -- User can manually fix stuck state - -**Cons:** -- Requires user to know about --force flag -- Doesn't prevent the problem from occurring - -**Implementation location:** `cmd/bd/import.go` around line 349 - -### Solution 2: Skip-Staleness Flag (Escape Hatch) ⭐ **Recommended** - -Add `--allow-stale` or `--no-staleness-check` global flag that: -- Bypasses `ensureDatabaseFresh()` check entirely -- Allows operations on potentially stale data -- Prints warning: "⚠️ Staleness check skipped, data may be out of sync" - -**Pros:** -- Emergency escape hatch when stuck -- Minimal invasive change -- Works with `--sandbox` mode - -**Cons:** -- User can accidentally work with stale data -- Should be well-documented as last resort - -**Implementation location:** `cmd/bd/staleness.go:20` and callers - -### Solution 3: Sandbox Detection (Automatic) ⭐⭐ **Best Long-term** - -Auto-detect sandbox environment and adjust behavior: - -```go -func isSandboxed() bool { - // Try to signal a known process (e.g., our own parent) - // If we get EPERM, we're likely sandboxed - if syscall.Kill(os.Getppid(), 0) != nil { - if err == syscall.EPERM { - return true - } - } - return false -} - -// In PersistentPreRun: -if isSandboxed() { - sandboxMode = true // Auto-enable sandbox mode - fmt.Fprintf(os.Stderr, "ℹ️ Sandbox detected, using direct mode\n") -} -``` - -Additionally, when daemon connection fails with permission errors: -- Automatically set `noDaemon = true` for subsequent operations -- Skip daemon health checks that require process signals - -**Pros:** -- Zero configuration for users -- Prevents the problem entirely -- Graceful degradation - -**Cons:** -- More complex heuristic -- May have false positives -- Requires testing in various environments - -**Implementation locations:** -- `cmd/bd/main.go` (detection) -- `cmd/bd/daemon_unix.go` (process checks) - -### Solution 4: Better Daemon Health Checks (Robust) - -Enhance daemon health check to detect unreachable daemons: - -1. When `daemonClient.Health()` fails, check why: - - Connection refused → daemon not running - - Timeout → daemon unreachable (sandbox?) - - Permission denied → sandbox detected - -2. On sandbox detection, automatically: - - Set `noDaemon = true` - - Clear cached daemon client - - Proceed in direct mode - -**Pros:** -- Automatic recovery -- Better error messages -- Handles edge cases - -**Cons:** -- Requires careful timeout tuning -- More complex state management - -**Implementation location:** `cmd/bd/main.go` around lines 300-367 - -### Solution 5: Daemon Metadata Refresh (Prevents Staleness) - -Make daemon periodically refresh metadata from disk: - -```go -// In daemon event loop, check metadata every N seconds -if time.Since(lastMetadataCheck) > 5*time.Second { - lastImportTime, _ := store.GetMetadata(ctx, "last_import_time") - // Update cached value -} -``` - -**Pros:** -- Daemon picks up external import operations -- Reduces stale metadata issues -- Works for other scenarios too - -**Cons:** -- Doesn't solve sandbox permission issues -- Adds I/O overhead -- Still requires daemon restart eventually - -**Implementation location:** `cmd/bd/daemon_event_loop.go` - -## Recommended Implementation Plan - -### Phase 1: Immediate Relief (1-2 hours) -1. ✅ Add `--force` flag to `bd import` (Solution 1) -2. ✅ Add `--allow-stale` global flag (Solution 2) -3. ✅ Update error message to suggest these flags - -### Phase 2: Better UX (3-4 hours) -1. ✅ Implement sandbox detection heuristic (Solution 3) -2. ✅ Auto-enable `--sandbox` mode when detected -3. ✅ Update docs with sandbox troubleshooting - -### Phase 3: Robustness (5-6 hours) -1. Enhance daemon health checks (Solution 4) -2. Add daemon metadata refresh (Solution 5) -3. Comprehensive testing in sandbox environments - -## Testing Strategy - -### Manual Testing in Codex Sandbox -1. Start daemon outside sandbox -2. Run `bd ready` inside sandbox → should detect sandbox -3. Run `bd import --force` → should update metadata -4. Run `bd ready --allow-stale` → should work despite staleness - -### Automated Testing -1. Mock sandboxed environment (permission denied on signals) -2. Test daemon connection failure scenarios -3. Test metadata update in import with 0 changes -4. Test staleness check bypass flag - -## Documentation Updates Needed - -1. **TROUBLESHOOTING.md** - Add sandbox section with: - - Symptoms of daemon lock issues - - `--sandbox` flag usage - - `--force` and `--allow-stale` as escape hatches - -2. **CLI_REFERENCE.md** - Document new flags: - - `--allow-stale` / `--no-staleness-check` - - `bd import --force` - -3. **Error message** in `staleness.go` - Add: - ``` - If you're in a sandboxed environment (e.g., Codex): - bd --sandbox ready - bd import --force -i .beads/beads.jsonl - ``` - -## Files to Modify - -### Critical Path (Phase 1) -- [ ] `cmd/bd/import.go` - Add --force flag -- [ ] `cmd/bd/staleness.go` - Add staleness bypass, update error message -- [ ] `cmd/bd/main.go` - Add --allow-stale flag - -### Enhancement (Phase 2-3) -- [ ] `cmd/bd/main.go` - Sandbox detection -- [ ] `cmd/bd/daemon_unix.go` - Permission-aware process checks -- [ ] `cmd/bd/daemon_event_loop.go` - Metadata refresh -- [ ] `internal/rpc/server_export_import_auto.go` - Better import handling - -### Documentation -- [ ] `docs/TROUBLESHOOTING.md` -- [ ] `docs/CLI_REFERENCE.md` -- [ ] Issue #353 comment with workaround - -## Open Questions - -1. Should `--sandbox` auto-detect, or require explicit flag? - - **Recommendation:** Start with explicit, add auto-detect in Phase 2 - -2. Should `--allow-stale` be per-command or global? - - **Recommendation:** Global flag (less repetition) - -3. What should happen to daemon lock files when daemon is unreachable? - - **Recommendation:** Leave them (don't force-break locks), use direct mode - -4. Should we add a `--force-direct` that ignores daemon locks entirely? - - **Recommendation:** Not needed if sandbox detection works well - -## Success Metrics - -- Users in Codex can run `bd ready` without errors -- No false positives in sandbox detection -- Clear error messages guide users to solutions -- `bd import --force` always updates metadata -- `--sandbox` mode works reliably - ---- - -**Investigation completed:** 2025-11-21 -**Next steps:** Implement Phase 1 solutions diff --git a/docs/GH353_NEXT_SESSION.md b/docs/GH353_NEXT_SESSION.md deleted file mode 100644 index 356335d8..00000000 --- a/docs/GH353_NEXT_SESSION.md +++ /dev/null @@ -1,128 +0,0 @@ -# Next Session Prompt: Implement GH #353 Fixes - -## Context -We've investigated GH #353 (daemon locking issues in Codex sandbox). Full analysis in `docs/GH353_INVESTIGATION.md`. - -**TL;DR:** Users in sandboxed environments (Codex) get stuck with "Database out of sync" errors because: -1. Running daemon has cached metadata -2. `bd import` updates database but daemon never sees it -3. Sandbox can't signal/kill the daemon -4. User is stuck in infinite loop - -## Task: Implement Phase 1 Solutions - -Implement three quick fixes that give users escape hatches: - -### 1. Add `--force` flag to `bd import` -**File:** `cmd/bd/import.go` - -**What to do:** -- Add `--force` flag to importCmd.Flags() (around line 692) -- When `--force` is true, ALWAYS update metadata (lines 310-346) even if `created == 0 && updated == 0` -- Print message: "Metadata updated (database already in sync with JSONL)" -- Ensure `TouchDatabaseFile()` is called to update mtime - -**Why:** Allows users to manually force metadata sync when stuck - -### 2. Add `--allow-stale` global flag -**File:** `cmd/bd/main.go` - -**What to do:** -- Add global var: `allowStale bool` -- Add to rootCmd.PersistentFlags(): `--allow-stale` (around line 111) -- Description: "Allow operations on potentially stale data (skip staleness check)" - -**File:** `cmd/bd/staleness.go` - -**What to do:** -- At top of `ensureDatabaseFresh()` function (line 20), add: - ```go - if allowStale { - fmt.Fprintf(os.Stderr, "⚠️ Staleness check skipped (--allow-stale), data may be out of sync\n") - return nil - } - ``` - -**Why:** Emergency escape hatch when staleness check blocks operations - -### 3. Improve error message in staleness.go -**File:** `cmd/bd/staleness.go` - -**What to do:** -- Update the error message (lines 41-50) to add sandbox guidance: - ```go - return fmt.Errorf( - "Database out of sync with JSONL. Run 'bd import' first.\n\n"+ - "The JSONL file has been updated (e.g., after 'git pull') but the database\n"+ - "hasn't been imported yet. This would cause you to see stale/incomplete data.\n\n"+ - "To fix:\n"+ - " bd import -i .beads/beads.jsonl # Import JSONL updates to database\n\n"+ - "If in a sandboxed environment (e.g., Codex) where daemon can't be stopped:\n"+ - " bd --sandbox ready # Use direct mode (no daemon)\n"+ - " bd import --force # Force metadata update\n"+ - " bd ready --allow-stale # Skip staleness check (use with caution)\n\n"+ - "Or use daemon mode (auto-imports on every operation):\n"+ - " bd daemon start\n"+ - " bd # Will auto-import before executing", - ) - ``` - -**Why:** Guides users to the right solution based on their environment - -## Testing Checklist - -After implementation: - -- [ ] `bd import --force -i .beads/beads.jsonl` updates metadata even with 0 changes -- [ ] `bd import --force` without `-i` flag shows appropriate error (needs input file) -- [ ] `bd ready --allow-stale` bypasses staleness check and shows warning -- [ ] Error message displays correctly and includes sandbox guidance -- [ ] `--sandbox` mode still works as before -- [ ] Flags appear in `bd --help` and `bd import --help` - -## Quick Start Commands - -```bash -# 1. Review the investigation -cat docs/GH353_INVESTIGATION.md - -# 2. Check current import.go implementation -grep -A 5 "func init()" cmd/bd/import.go - -# 3. Check current staleness.go -head -60 cmd/bd/staleness.go - -# 4. Run existing tests to establish baseline -go test ./cmd/bd/... -run TestImport -go test ./cmd/bd/... -run TestStaleness - -# 5. Implement changes (see sections above) - -# 6. Test manually -bd import --help | grep force -bd --help | grep allow-stale -``` - -## Expected Outcome - -Users stuck in Codex sandbox can: -1. Run `bd import --force -i .beads/beads.jsonl` to fix metadata -2. Run `bd --sandbox ready` to use direct mode -3. Run `bd ready --allow-stale` as last resort -4. See helpful error message explaining their options - -## References - -- **Investigation:** `docs/GH353_INVESTIGATION.md` -- **Issue:** https://github.com/steveyegge/beads/issues/353 -- **Key files:** - - `cmd/bd/import.go` (import command) - - `cmd/bd/staleness.go` (staleness check) - - `cmd/bd/main.go` (global flags) - -## Estimated Time -~1-2 hours for implementation + testing - ---- - -**Ready to implement?** Start with adding the flags, then update the error message, then test thoroughly.