Files
beads/docs/GH353_INVESTIGATION.md
Steve Yegge b3e76ff1f6 feat: Add sandbox escape hatches for daemon lock issues (GH #353 Phase 1)
Implements three quick fixes for users stuck in sandboxed environments
(e.g., Codex) where daemon cannot be stopped:

1. **--force flag for bd import**
   - Forces metadata update even when DB is synced with JSONL
   - Fixes stuck state caused by stale daemon cache
   - Shows: "Metadata updated (database already in sync with JSONL)"

2. **--allow-stale global flag**
   - Emergency escape hatch to bypass staleness check
   - Shows warning: "⚠️  Staleness check skipped (--allow-stale)"
   - Allows operations on potentially stale data

3. **Improved error message**
   - Added sandbox-specific guidance to staleness error
   - Suggests --sandbox, --force, and --allow-stale flags
   - Provides clear fix steps for different scenarios

Also fixed:
- Removed unused import in cmd/bd/duplicates_test.go

Follow-up work filed:
- bd-u3t: Phase 2 - Sandbox auto-detection
- bd-e0o: Phase 3 - Daemon robustness enhancements
- bd-9nw: Documentation updates

Fixes #353 (Phase 1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 18:53:34 -05:00

303 lines
9.7 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Investigation: GH #353 - Daemon Locking Issues in Codex Sandbox
## Problem Summary
When running `bd` inside the Codex sandbox (macOS host), users encounter persistent "Database out of sync with JSONL" errors that cannot be resolved through normal means (`bd import`). The root cause is a daemon process that the sandbox cannot signal or kill, creating a deadlock situation.
## Root Cause Analysis
### The Daemon Locking Mechanism
The daemon uses three mechanisms to claim a database:
1. **File lock (`flock`)** on `.beads/daemon.lock` - exclusive lock held while daemon is running
2. **PID file** at `.beads/daemon.pid` - contains daemon process ID (Windows compatibility)
3. **Lock metadata** in `daemon.lock` - JSON containing PID, database path, version, start time
**Source:** `cmd/bd/daemon_lock.go`
### Process Verification Issue
On Unix systems, `isProcessRunning()` uses `syscall.Kill(pid, 0)` to check if a process exists. In sandboxed environments:
- The daemon PID exists in the lock file
- `syscall.Kill(pid, 0)` returns EPERM (operation not permitted)
- The CLI can't verify if the daemon is actually running
- The CLI can't send signals to stop the daemon
**Source:** `cmd/bd/daemon_unix.go:26-28`
### Staleness Check Flow
When running `bd ready` or other read commands:
1. **With daemon connected:**
- Command → Daemon RPC → `checkAndAutoImportIfStale()`
- Daemon checks JSONL mtime vs `last_import_time` metadata
- Daemon auto-imports if stale (with safeguards)
- **Source:** `internal/rpc/server_export_import_auto.go:171-303`
2. **Without daemon (direct mode):**
- Command → `ensureDatabaseFresh(ctx)` check
- Compares JSONL mtime vs `last_import_time` metadata
- **Refuses to proceed** if stale, shows error message
- **Source:** `cmd/bd/staleness.go:20-51`
### The Deadlock Scenario
1. Daemon is running outside sandbox with database lock
2. User (in sandbox) runs `bd ready`
3. CLI tries to connect to daemon → connection fails or daemon is unreachable
4. CLI falls back to direct mode
5. Direct mode checks staleness → JSONL is newer than metadata
6. Error: "Database out of sync with JSONL. Run 'bd import' first."
7. User runs `bd import -i .beads/beads.jsonl`
8. Import updates metadata in database file
9. **But daemon still running with OLD metadata cached in memory**
10. User runs `bd ready` again → CLI connects to daemon
11. Daemon checks staleness using **cached metadata** → still stale!
12. **Infinite loop:** Can't fix because can't restart daemon
### Why `--no-daemon` Doesn't Always Work
The `--no-daemon` flag should work by setting `daemonClient = nil` and skipping daemon connection (**source:** `cmd/bd/main.go:287-289`). However:
1. If JSONL is genuinely newer than database (e.g., after `git pull`), the staleness check in direct mode will still fail
2. If the user doesn't specify `--no-daemon` consistently, the CLI will reconnect to the stale daemon
3. The daemon may still hold file locks that interfere with direct operations
## Existing Workarounds
### The `--sandbox` Flag
Already exists! Sets:
- `noDaemon = true` (skip daemon)
- `noAutoFlush = true` (skip auto-flush)
- `noAutoImport = true` (skip auto-import)
**Source:** `cmd/bd/main.go:201-206`
**Issue:** Still runs staleness check in direct mode, which fails if JSONL is actually newer.
## Proposed Solutions
### Solution 1: Force-Import Flag (Quick Fix) ⭐ **Recommended**
Add `--force` flag to `bd import` that:
- Updates `last_import_time` and `last_import_hash` metadata even when 0 issues imported
- Explicitly touches database file to update mtime
- Prints clear message: "Metadata updated (database already in sync)"
**Pros:**
- Minimal code change
- Solves immediate problem
- User can manually fix stuck state
**Cons:**
- Requires user to know about --force flag
- Doesn't prevent the problem from occurring
**Implementation location:** `cmd/bd/import.go` around line 349
### Solution 2: Skip-Staleness Flag (Escape Hatch) ⭐ **Recommended**
Add `--allow-stale` or `--no-staleness-check` global flag that:
- Bypasses `ensureDatabaseFresh()` check entirely
- Allows operations on potentially stale data
- Prints warning: "⚠️ Staleness check skipped, data may be out of sync"
**Pros:**
- Emergency escape hatch when stuck
- Minimal invasive change
- Works with `--sandbox` mode
**Cons:**
- User can accidentally work with stale data
- Should be well-documented as last resort
**Implementation location:** `cmd/bd/staleness.go:20` and callers
### Solution 3: Sandbox Detection (Automatic) ⭐⭐ **Best Long-term**
Auto-detect sandbox environment and adjust behavior:
```go
func isSandboxed() bool {
// Try to signal a known process (e.g., our own parent)
// If we get EPERM, we're likely sandboxed
if syscall.Kill(os.Getppid(), 0) != nil {
if err == syscall.EPERM {
return true
}
}
return false
}
// In PersistentPreRun:
if isSandboxed() {
sandboxMode = true // Auto-enable sandbox mode
fmt.Fprintf(os.Stderr, " Sandbox detected, using direct mode\n")
}
```
Additionally, when daemon connection fails with permission errors:
- Automatically set `noDaemon = true` for subsequent operations
- Skip daemon health checks that require process signals
**Pros:**
- Zero configuration for users
- Prevents the problem entirely
- Graceful degradation
**Cons:**
- More complex heuristic
- May have false positives
- Requires testing in various environments
**Implementation locations:**
- `cmd/bd/main.go` (detection)
- `cmd/bd/daemon_unix.go` (process checks)
### Solution 4: Better Daemon Health Checks (Robust)
Enhance daemon health check to detect unreachable daemons:
1. When `daemonClient.Health()` fails, check why:
- Connection refused → daemon not running
- Timeout → daemon unreachable (sandbox?)
- Permission denied → sandbox detected
2. On sandbox detection, automatically:
- Set `noDaemon = true`
- Clear cached daemon client
- Proceed in direct mode
**Pros:**
- Automatic recovery
- Better error messages
- Handles edge cases
**Cons:**
- Requires careful timeout tuning
- More complex state management
**Implementation location:** `cmd/bd/main.go` around lines 300-367
### Solution 5: Daemon Metadata Refresh (Prevents Staleness)
Make daemon periodically refresh metadata from disk:
```go
// In daemon event loop, check metadata every N seconds
if time.Since(lastMetadataCheck) > 5*time.Second {
lastImportTime, _ := store.GetMetadata(ctx, "last_import_time")
// Update cached value
}
```
**Pros:**
- Daemon picks up external import operations
- Reduces stale metadata issues
- Works for other scenarios too
**Cons:**
- Doesn't solve sandbox permission issues
- Adds I/O overhead
- Still requires daemon restart eventually
**Implementation location:** `cmd/bd/daemon_event_loop.go`
## Recommended Implementation Plan
### Phase 1: Immediate Relief (1-2 hours)
1. ✅ Add `--force` flag to `bd import` (Solution 1)
2. ✅ Add `--allow-stale` global flag (Solution 2)
3. ✅ Update error message to suggest these flags
### Phase 2: Better UX (3-4 hours)
1. ✅ Implement sandbox detection heuristic (Solution 3)
2. ✅ Auto-enable `--sandbox` mode when detected
3. ✅ Update docs with sandbox troubleshooting
### Phase 3: Robustness (5-6 hours)
1. Enhance daemon health checks (Solution 4)
2. Add daemon metadata refresh (Solution 5)
3. Comprehensive testing in sandbox environments
## Testing Strategy
### Manual Testing in Codex Sandbox
1. Start daemon outside sandbox
2. Run `bd ready` inside sandbox → should detect sandbox
3. Run `bd import --force` → should update metadata
4. Run `bd ready --allow-stale` → should work despite staleness
### Automated Testing
1. Mock sandboxed environment (permission denied on signals)
2. Test daemon connection failure scenarios
3. Test metadata update in import with 0 changes
4. Test staleness check bypass flag
## Documentation Updates Needed
1. **TROUBLESHOOTING.md** - Add sandbox section with:
- Symptoms of daemon lock issues
- `--sandbox` flag usage
- `--force` and `--allow-stale` as escape hatches
2. **CLI_REFERENCE.md** - Document new flags:
- `--allow-stale` / `--no-staleness-check`
- `bd import --force`
3. **Error message** in `staleness.go` - Add:
```
If you're in a sandboxed environment (e.g., Codex):
bd --sandbox ready
bd import --force -i .beads/beads.jsonl
```
## Files to Modify
### Critical Path (Phase 1)
- [ ] `cmd/bd/import.go` - Add --force flag
- [ ] `cmd/bd/staleness.go` - Add staleness bypass, update error message
- [ ] `cmd/bd/main.go` - Add --allow-stale flag
### Enhancement (Phase 2-3)
- [ ] `cmd/bd/main.go` - Sandbox detection
- [ ] `cmd/bd/daemon_unix.go` - Permission-aware process checks
- [ ] `cmd/bd/daemon_event_loop.go` - Metadata refresh
- [ ] `internal/rpc/server_export_import_auto.go` - Better import handling
### Documentation
- [ ] `docs/TROUBLESHOOTING.md`
- [ ] `docs/CLI_REFERENCE.md`
- [ ] Issue #353 comment with workaround
## Open Questions
1. Should `--sandbox` auto-detect, or require explicit flag?
- **Recommendation:** Start with explicit, add auto-detect in Phase 2
2. Should `--allow-stale` be per-command or global?
- **Recommendation:** Global flag (less repetition)
3. What should happen to daemon lock files when daemon is unreachable?
- **Recommendation:** Leave them (don't force-break locks), use direct mode
4. Should we add a `--force-direct` that ignores daemon locks entirely?
- **Recommendation:** Not needed if sandbox detection works well
## Success Metrics
- Users in Codex can run `bd ready` without errors
- No false positives in sandbox detection
- Clear error messages guide users to solutions
- `bd import --force` always updates metadata
- `--sandbox` mode works reliably
---
**Investigation completed:** 2025-11-21
**Next steps:** Implement Phase 1 solutions