Implements three quick fixes for users stuck in sandboxed environments (e.g., Codex) where the daemon cannot be stopped:

1. **--force flag for bd import**
   - Forces metadata update even when DB is synced with JSONL
   - Fixes stuck state caused by stale daemon cache
   - Shows: "Metadata updated (database already in sync with JSONL)"
2. **--allow-stale global flag**
   - Emergency escape hatch to bypass staleness check
   - Shows warning: "⚠️ Staleness check skipped (--allow-stale)"
   - Allows operations on potentially stale data
3. **Improved error message**
   - Added sandbox-specific guidance to staleness error
   - Suggests --sandbox, --force, and --allow-stale flags
   - Provides clear fix steps for different scenarios

Also fixed:
- Removed unused import in cmd/bd/duplicates_test.go

Follow-up work filed:
- bd-u3t: Phase 2 - Sandbox auto-detection
- bd-e0o: Phase 3 - Daemon robustness enhancements
- bd-9nw: Documentation updates

Fixes #353 (Phase 1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Investigation: GH #353 - Daemon Locking Issues in Codex Sandbox
Problem Summary
When running bd inside the Codex sandbox (macOS host), users encounter persistent "Database out of sync with JSONL" errors that cannot be resolved through normal means (bd import). The root cause is a daemon process that the sandbox cannot signal or kill, creating a deadlock situation.
Root Cause Analysis
The Daemon Locking Mechanism
The daemon uses three mechanisms to claim a database:
- File lock (flock) on .beads/daemon.lock - exclusive lock held while daemon is running
- PID file at .beads/daemon.pid - contains daemon process ID (Windows compatibility)
- Lock metadata in daemon.lock - JSON containing PID, database path, version, start time
Source: cmd/bd/daemon_lock.go
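As a rough illustration of the first and third mechanisms, the sketch below takes an exclusive, non-blocking flock on a lock file and writes JSON metadata into it. It is not the code in cmd/bd/daemon_lock.go: the lockInfo field names, acquireLock, and secondAcquireFails are hypothetical, and syscall.Flock is Unix-only.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// lockInfo mirrors the metadata listed above; the field names are assumed.
type lockInfo struct {
	PID       int       `json:"pid"`
	Database  string    `json:"database"`
	Version   string    `json:"version"`
	StartedAt time.Time `json:"started_at"`
}

// acquireLock takes an exclusive, non-blocking flock on path and then writes
// info into the file. The lock is held until the returned file is closed.
func acquireLock(path string, info lockInfo) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err // EWOULDBLOCK when another holder exists
	}
	// Only the lock holder rewrites the metadata.
	if err := f.Truncate(0); err != nil {
		f.Close()
		return nil, err
	}
	if err := json.NewEncoder(f).Encode(info); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// secondAcquireFails reports whether a second acquire on the same file is
// rejected while the first lock is still held.
func secondAcquireFails() (bool, error) {
	dir, err := os.MkdirTemp("", "bd-lock-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "daemon.lock")
	info := lockInfo{PID: os.Getpid(), Database: "beads.db", Version: "dev", StartedAt: time.Now()}

	first, err := acquireLock(path, info)
	if err != nil {
		return false, err
	}
	defer first.Close()

	second, err := acquireLock(path, info)
	if err == nil {
		second.Close()
		return false, nil
	}
	return errors.Is(err, syscall.EWOULDBLOCK), nil
}

func main() {
	rejected, err := secondAcquireFails()
	fmt.Println("second acquire rejected:", rejected, "error:", err)
}
```

Because the lock belongs to the open file, it disappears when the holder exits, which is why a CLI that cannot reach or signal the daemon has no way to release it.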
Process Verification Issue
On Unix systems, isProcessRunning() uses syscall.Kill(pid, 0) to check if a process exists. In sandboxed environments:
- The daemon PID exists in the lock file
- syscall.Kill(pid, 0) returns EPERM (operation not permitted)
- The CLI can't verify if the daemon is actually running
- The CLI can't send signals to stop the daemon
Source: cmd/bd/daemon_unix.go:26-28
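The three outcomes of a signal-0 probe can be made explicit. The sketch below is a hypothetical variant, not the existing isProcessRunning(): procState, probe, and the fake kill functions are invented names, and the kill function is injectable so a sandbox can be imitated in tests.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// procState is a hypothetical three-way result for a liveness probe.
type procState int

const (
	procRunning      procState = iota // signal 0 accepted: process exists
	procGone                          // ESRCH: no such process
	procUnverifiable                  // EPERM or any other error: cannot tell from here
)

// killFunc matches the signature of syscall.Kill so a fake can be substituted.
type killFunc func(pid int, sig syscall.Signal) error

// probe sends signal 0, which runs the existence and permission checks
// without delivering a signal.
func probe(pid int, kill killFunc) procState {
	err := kill(pid, 0)
	switch {
	case err == nil:
		return procRunning
	case errors.Is(err, syscall.ESRCH):
		return procGone
	default:
		return procUnverifiable
	}
}

// sandboxedKill imitates a sandbox that denies every signal.
func sandboxedKill(int, syscall.Signal) error { return syscall.EPERM }

// goneKill imitates signalling a PID that no longer exists.
func goneKill(int, syscall.Signal) error { return syscall.ESRCH }

func main() {
	fmt.Println(probe(os.Getpid(), syscall.Kill) == procRunning)
	fmt.Println(probe(4242, sandboxedKill) == procUnverifiable)
	fmt.Println(probe(4242, goneKill) == procGone)
}
```

Treating EPERM as "unverifiable" rather than "not running" keeps the CLI from assuming a stale lock when it simply lacks permission to check.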
Staleness Check Flow
When running bd ready or other read commands:
- With daemon connected:
  - Command → Daemon RPC → checkAndAutoImportIfStale()
  - Daemon checks JSONL mtime vs last_import_time metadata
  - Daemon auto-imports if stale (with safeguards)
  - Source: internal/rpc/server_export_import_auto.go:171-303
- Without daemon (direct mode):
  - Command → ensureDatabaseFresh(ctx) check
  - Compares JSONL mtime vs last_import_time metadata
  - Refuses to proceed if stale, shows error message
  - Source: cmd/bd/staleness.go:20-51
The Deadlock Scenario
- Daemon is running outside sandbox with database lock
- User (in sandbox) runs bd ready
- CLI tries to connect to daemon → connection fails or daemon is unreachable
- CLI falls back to direct mode
- Direct mode checks staleness → JSONL is newer than metadata
- Error: "Database out of sync with JSONL. Run 'bd import' first."
- User runs bd import -i .beads/beads.jsonl
- Import updates metadata in database file
- But daemon still running with OLD metadata cached in memory
- User runs bd ready again → CLI connects to daemon
- Daemon checks staleness using cached metadata → still stale!
- Infinite loop: Can't fix because can't restart daemon
Why --no-daemon Doesn't Always Work
The --no-daemon flag should work by setting daemonClient = nil and skipping daemon connection (source: cmd/bd/main.go:287-289). However:
- If JSONL is genuinely newer than database (e.g., after git pull), the staleness check in direct mode will still fail
- If the user doesn't specify --no-daemon consistently, the CLI will reconnect to the stale daemon
- The daemon may still hold file locks that interfere with direct operations
Existing Workarounds
The --sandbox Flag
Already exists! Sets:
- noDaemon = true (skip daemon)
- noAutoFlush = true (skip auto-flush)
- noAutoImport = true (skip auto-import)
Source: cmd/bd/main.go:201-206
Issue: Still runs staleness check in direct mode, which fails if JSONL is actually newer.
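In outline, the flag is a shorthand for three switches and nothing more. The sketch below uses a hypothetical options struct and applySandbox helper; the real code in cmd/bd/main.go sets package-level variables.

```go
package main

import "fmt"

// options groups the switches named above; the struct itself is hypothetical.
type options struct {
	sandbox      bool
	noDaemon     bool
	noAutoFlush  bool
	noAutoImport bool
}

// applySandbox turns on the three switches when sandbox mode is requested.
// Nothing here touches the staleness check, which is the gap noted above.
func applySandbox(o options) options {
	if o.sandbox {
		o.noDaemon = true
		o.noAutoFlush = true
		o.noAutoImport = true
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", applySandbox(options{sandbox: true}))
}
```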
Proposed Solutions
Solution 1: Force-Import Flag (Quick Fix) ⭐ Recommended
Add --force flag to bd import that:
- Updates last_import_time and last_import_hash metadata even when 0 issues imported
- Explicitly touches database file to update mtime
- Prints clear message: "Metadata updated (database already in sync)"
Pros:
- Minimal code change
- Solves immediate problem
- User can manually fix stuck state
Cons:
- Requires user to know about --force flag
- Doesn't prevent the problem from occurring
Implementation location: cmd/bd/import.go around line 349
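The intended behaviour can be sketched as a small decision function. This assumes that without --force an import with zero changes leaves the metadata untouched (implied above, not confirmed); metadata and finishImport are hypothetical names, and the RFC 3339 timestamp format is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// metadata stands in for the database metadata table.
type metadata map[string]string

// finishImport records import metadata and returns the message to print.
// With force set, metadata is refreshed even when nothing was imported.
func finishImport(meta metadata, imported int, force bool, jsonlHash string) string {
	if imported == 0 && !force {
		return "No changes"
	}
	meta["last_import_time"] = time.Now().UTC().Format(time.RFC3339Nano)
	meta["last_import_hash"] = jsonlHash
	if imported == 0 {
		return "Metadata updated (database already in sync with JSONL)"
	}
	return fmt.Sprintf("Imported %d issues", imported)
}

func main() {
	meta := metadata{}
	fmt.Println(finishImport(meta, 0, false, "abc123")) // metadata stays empty
	fmt.Println(finishImport(meta, 0, true, "abc123"))  // metadata refreshed
	fmt.Println("keys written:", len(meta))
}
```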
Solution 2: Skip-Staleness Flag (Escape Hatch) ⭐ Recommended
Add --allow-stale or --no-staleness-check global flag that:
- Bypasses ensureDatabaseFresh() check entirely
- Allows operations on potentially stale data
- Prints warning: "⚠️ Staleness check skipped, data may be out of sync"
Pros:
- Emergency escape hatch when stuck
- Minimal invasive change
- Works with --sandbox mode
Cons:
- User can accidentally work with stale data
- Should be well-documented as last resort
Implementation location: cmd/bd/staleness.go:20 and callers
Solution 3: Sandbox Detection (Automatic) ⭐⭐ Best Long-term
Auto-detect sandbox environment and adjust behavior:
    import (
        "errors"
        "fmt"
        "os"
        "syscall"
    )

    // isSandboxed reports whether signalling is blocked, which suggests a sandbox.
    func isSandboxed() bool {
        // Probe a known process (our own parent) with signal 0.
        // If we get EPERM, we're likely sandboxed.
        err := syscall.Kill(os.Getppid(), 0)
        return errors.Is(err, syscall.EPERM)
    }

    // In PersistentPreRun:
    if isSandboxed() {
        sandboxMode = true // Auto-enable sandbox mode
        fmt.Fprintf(os.Stderr, "ℹ️ Sandbox detected, using direct mode\n")
    }
Additionally, when daemon connection fails with permission errors:
- Automatically set noDaemon = true for subsequent operations
- Skip daemon health checks that require process signals
Pros:
- Zero configuration for users
- Prevents the problem entirely
- Graceful degradation
Cons:
- More complex heuristic
- May have false positives
- Requires testing in various environments
Implementation locations:
- cmd/bd/main.go (detection)
- cmd/bd/daemon_unix.go (process checks)
Solution 4: Better Daemon Health Checks (Robust)
Enhance daemon health check to detect unreachable daemons:
- When daemonClient.Health() fails, check why:
  - Connection refused → daemon not running
  - Timeout → daemon unreachable (sandbox?)
  - Permission denied → sandbox detected
- On sandbox detection, automatically:
  - Set noDaemon = true
  - Clear cached daemon client
  - Proceed in direct mode
Pros:
- Automatic recovery
- Better error messages
- Handles edge cases
Cons:
- Requires careful timeout tuning
- More complex state management
Implementation location: cmd/bd/main.go around lines 300-367
Solution 5: Daemon Metadata Refresh (Prevents Staleness)
Make daemon periodically refresh metadata from disk:
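    // In daemon event loop, check metadata every N seconds
    if time.Since(lastMetadataCheck) > 5*time.Second {
        lastMetadataCheck = time.Now()
        // cachedLastImportTime is a placeholder name for the daemon's in-memory copy
        if v, err := store.GetMetadata(ctx, "last_import_time"); err == nil {
            cachedLastImportTime = v // Update cached value
        }
    }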
Pros:
- Daemon picks up external import operations
- Reduces stale metadata issues
- Works for other scenarios too
Cons:
- Doesn't solve sandbox permission issues
- Adds I/O overhead
- Still requires daemon restart eventually
Implementation location: cmd/bd/daemon_event_loop.go
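A self-contained version of the same idea, using a ticker and a mutex-guarded cache. The metadataCache type and the load callback are hypothetical; in the daemon, load would read last_import_time from the store.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// metadataCache holds the daemon's in-memory copy of last_import_time.
type metadataCache struct {
	mu             sync.RWMutex
	lastImportTime string
}

// get returns the cached value.
func (c *metadataCache) get() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.lastImportTime
}

// refresh reloads the value; on error the old value is kept.
func (c *metadataCache) refresh(load func() (string, error)) error {
	v, err := load()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.lastImportTime = v
	c.mu.Unlock()
	return nil
}

// run refreshes the cache on every tick until ctx is cancelled.
func (c *metadataCache) run(ctx context.Context, every time.Duration, load func() (string, error)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = c.refresh(load) // a failed read leaves the cache unchanged
		}
	}
}

func main() {
	onDisk := "2025-11-21T10:00:00Z" // value written by an external bd import
	cache := &metadataCache{lastImportTime: "2025-11-21T09:00:00Z"}

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	cache.run(ctx, 10*time.Millisecond, func() (string, error) { return onDisk, nil })

	fmt.Println("cached after refresh:", cache.get())
}
```

The interval trades I/O against how long the daemon can keep reporting staleness after an external import.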
Recommended Implementation Plan
Phase 1: Immediate Relief (1-2 hours)
- ✅ Add --force flag to bd import (Solution 1)
- ✅ Add --allow-stale global flag (Solution 2)
- ✅ Update error message to suggest these flags
Phase 2: Better UX (3-4 hours)
- ✅ Implement sandbox detection heuristic (Solution 3)
- ✅ Auto-enable --sandbox mode when detected
- ✅ Update docs with sandbox troubleshooting
Phase 3: Robustness (5-6 hours)
- Enhance daemon health checks (Solution 4)
- Add daemon metadata refresh (Solution 5)
- Comprehensive testing in sandbox environments
Testing Strategy
Manual Testing in Codex Sandbox
- Start daemon outside sandbox
- Run bd ready inside sandbox → should detect sandbox
- Run bd import --force → should update metadata
- Run bd ready --allow-stale → should work despite staleness
Automated Testing
- Mock sandboxed environment (permission denied on signals)
- Test daemon connection failure scenarios
- Test metadata update in import with 0 changes
- Test staleness check bypass flag
Documentation Updates Needed
- TROUBLESHOOTING.md - Add sandbox section with:
  - Symptoms of daemon lock issues
  - --sandbox flag usage
  - --force and --allow-stale as escape hatches
- CLI_REFERENCE.md - Document new flags:
  - --allow-stale / --no-staleness-check
  - bd import --force
- Error message in staleness.go - Add:

      If you're in a sandboxed environment (e.g., Codex):
        bd --sandbox ready
        bd import --force -i .beads/beads.jsonl
Files to Modify
Critical Path (Phase 1)
- cmd/bd/import.go - Add --force flag
- cmd/bd/staleness.go - Add staleness bypass, update error message
- cmd/bd/main.go - Add --allow-stale flag
Enhancement (Phase 2-3)
- cmd/bd/main.go - Sandbox detection
- cmd/bd/daemon_unix.go - Permission-aware process checks
- cmd/bd/daemon_event_loop.go - Metadata refresh
- internal/rpc/server_export_import_auto.go - Better import handling
Documentation
- docs/TROUBLESHOOTING.md
- docs/CLI_REFERENCE.md
- Issue #353 comment with workaround
Open Questions
- Should --sandbox auto-detect, or require explicit flag?
  - Recommendation: Start with explicit, add auto-detect in Phase 2
- Should --allow-stale be per-command or global?
  - Recommendation: Global flag (less repetition)
- What should happen to daemon lock files when daemon is unreachable?
  - Recommendation: Leave them (don't force-break locks), use direct mode
- Should we add a --force-direct that ignores daemon locks entirely?
  - Recommendation: Not needed if sandbox detection works well
Success Metrics
- Users in Codex can run bd ready without errors
- No false positives in sandbox detection
- Clear error messages guide users to solutions
- bd import --force always updates metadata
- --sandbox mode works reliably
Investigation completed: 2025-11-21
Next steps: Implement Phase 1 solutions