Files
beads/docs/GH353_INVESTIGATION.md
Steve Yegge b3e76ff1f6 feat: Add sandbox escape hatches for daemon lock issues (GH #353 Phase 1)
Implements three quick fixes for users stuck in sandboxed environments
(e.g., Codex) where daemon cannot be stopped:

1. **--force flag for bd import**
   - Forces metadata update even when DB is synced with JSONL
   - Fixes stuck state caused by stale daemon cache
   - Shows: "Metadata updated (database already in sync with JSONL)"

2. **--allow-stale global flag**
   - Emergency escape hatch to bypass staleness check
   - Shows warning: "⚠️  Staleness check skipped (--allow-stale)"
   - Allows operations on potentially stale data

3. **Improved error message**
   - Added sandbox-specific guidance to staleness error
   - Suggests --sandbox, --force, and --allow-stale flags
   - Provides clear fix steps for different scenarios

Also fixed:
- Removed unused import in cmd/bd/duplicates_test.go

Follow-up work filed:
- bd-u3t: Phase 2 - Sandbox auto-detection
- bd-e0o: Phase 3 - Daemon robustness enhancements
- bd-9nw: Documentation updates

Fixes #353 (Phase 1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 18:53:34 -05:00

9.7 KiB
Raw Blame History

Investigation: GH #353 - Daemon Locking Issues in Codex Sandbox

Problem Summary

When running bd inside the Codex sandbox (macOS host), users encounter persistent "Database out of sync with JSONL" errors that cannot be resolved through normal means (bd import). The root cause is a daemon process that the sandbox cannot signal or kill, creating a deadlock situation.

Root Cause Analysis

The Daemon Locking Mechanism

The daemon uses three mechanisms to claim a database:

  1. File lock (flock) on .beads/daemon.lock - exclusive lock held while daemon is running
  2. PID file at .beads/daemon.pid - contains daemon process ID (Windows compatibility)
  3. Lock metadata in daemon.lock - JSON containing PID, database path, version, start time

Source: cmd/bd/daemon_lock.go

Process Verification Issue

On Unix systems, isProcessRunning() uses syscall.Kill(pid, 0) to check if a process exists. In sandboxed environments:

  • The daemon PID exists in the lock file
  • syscall.Kill(pid, 0) returns EPERM (operation not permitted)
  • The CLI can't verify if the daemon is actually running
  • The CLI can't send signals to stop the daemon

Source: cmd/bd/daemon_unix.go:26-28

Staleness Check Flow

When running bd ready or other read commands:

  1. With daemon connected:

    • Command → Daemon RPC → checkAndAutoImportIfStale()
    • Daemon checks JSONL mtime vs last_import_time metadata
    • Daemon auto-imports if stale (with safeguards)
    • Source: internal/rpc/server_export_import_auto.go:171-303
  2. Without daemon (direct mode):

    • Command → ensureDatabaseFresh(ctx) check
    • Compares JSONL mtime vs last_import_time metadata
    • Refuses to proceed if stale, shows error message
    • Source: cmd/bd/staleness.go:20-51

The Deadlock Scenario

  1. Daemon is running outside sandbox with database lock
  2. User (in sandbox) runs bd ready
  3. CLI tries to connect to daemon → connection fails or daemon is unreachable
  4. CLI falls back to direct mode
  5. Direct mode checks staleness → JSONL is newer than metadata
  6. Error: "Database out of sync with JSONL. Run 'bd import' first."
  7. User runs bd import -i .beads/beads.jsonl
  8. Import updates metadata in database file
  9. But daemon still running with OLD metadata cached in memory
  10. User runs bd ready again → CLI connects to daemon
  11. Daemon checks staleness using cached metadata → still stale!
  12. Infinite loop: Can't fix because can't restart daemon

Why --no-daemon Doesn't Always Work

The --no-daemon flag should work by setting daemonClient = nil and skipping daemon connection (source: cmd/bd/main.go:287-289). However:

  1. If JSONL is genuinely newer than database (e.g., after git pull), the staleness check in direct mode will still fail
  2. If the user doesn't specify --no-daemon consistently, the CLI will reconnect to the stale daemon
  3. The daemon may still hold file locks that interfere with direct operations

Existing Workarounds

The --sandbox Flag

Already exists! Sets:

  • noDaemon = true (skip daemon)
  • noAutoFlush = true (skip auto-flush)
  • noAutoImport = true (skip auto-import)

Source: cmd/bd/main.go:201-206

Issue: Still runs staleness check in direct mode, which fails if JSONL is actually newer.

Proposed Solutions

Add --force flag to bd import that:

  • Updates last_import_time and last_import_hash metadata even when 0 issues imported
  • Explicitly touches database file to update mtime
  • Prints clear message: "Metadata updated (database already in sync)"

Pros:

  • Minimal code change
  • Solves immediate problem
  • User can manually fix stuck state

Cons:

  • Requires user to know about --force flag
  • Doesn't prevent the problem from occurring

Implementation location: cmd/bd/import.go around line 349

Add --allow-stale or --no-staleness-check global flag that:

  • Bypasses ensureDatabaseFresh() check entirely
  • Allows operations on potentially stale data
  • Prints warning: "⚠️ Staleness check skipped, data may be out of sync"

Pros:

  • Emergency escape hatch when stuck
  • Minimal invasive change
  • Works with --sandbox mode

Cons:

  • User can accidentally work with stale data
  • Should be well-documented as last resort

Implementation location: cmd/bd/staleness.go:20 and callers

Solution 3: Sandbox Detection (Automatic) Best Long-term

Auto-detect sandbox environment and adjust behavior:

func isSandboxed() bool {
    // Try to signal a known process (e.g., our own parent)
    // If we get EPERM, we're likely sandboxed
    if syscall.Kill(os.Getppid(), 0) != nil {
        if err == syscall.EPERM {
            return true
        }
    }
    return false
}

// In PersistentPreRun:
if isSandboxed() {
    sandboxMode = true  // Auto-enable sandbox mode
    fmt.Fprintf(os.Stderr, "  Sandbox detected, using direct mode\n")
}

Additionally, when daemon connection fails with permission errors:

  • Automatically set noDaemon = true for subsequent operations
  • Skip daemon health checks that require process signals

Pros:

  • Zero configuration for users
  • Prevents the problem entirely
  • Graceful degradation

Cons:

  • More complex heuristic
  • May have false positives
  • Requires testing in various environments

Implementation locations:

  • cmd/bd/main.go (detection)
  • cmd/bd/daemon_unix.go (process checks)

Solution 4: Better Daemon Health Checks (Robust)

Enhance daemon health check to detect unreachable daemons:

  1. When daemonClient.Health() fails, check why:

    • Connection refused → daemon not running
    • Timeout → daemon unreachable (sandbox?)
    • Permission denied → sandbox detected
  2. On sandbox detection, automatically:

    • Set noDaemon = true
    • Clear cached daemon client
    • Proceed in direct mode

Pros:

  • Automatic recovery
  • Better error messages
  • Handles edge cases

Cons:

  • Requires careful timeout tuning
  • More complex state management

Implementation location: cmd/bd/main.go around lines 300-367

Solution 5: Daemon Metadata Refresh (Prevents Staleness)

Make daemon periodically refresh metadata from disk:

// In daemon event loop, check metadata every N seconds
if time.Since(lastMetadataCheck) > 5*time.Second {
    lastImportTime, _ := store.GetMetadata(ctx, "last_import_time")
    // Update cached value
}

Pros:

  • Daemon picks up external import operations
  • Reduces stale metadata issues
  • Works for other scenarios too

Cons:

  • Doesn't solve sandbox permission issues
  • Adds I/O overhead
  • Still requires daemon restart eventually

Implementation location: cmd/bd/daemon_event_loop.go

Phase 1: Immediate Relief (1-2 hours)

  1. Add --force flag to bd import (Solution 1)
  2. Add --allow-stale global flag (Solution 2)
  3. Update error message to suggest these flags

Phase 2: Better UX (3-4 hours)

  1. Implement sandbox detection heuristic (Solution 3)
  2. Auto-enable --sandbox mode when detected
  3. Update docs with sandbox troubleshooting

Phase 3: Robustness (5-6 hours)

  1. Enhance daemon health checks (Solution 4)
  2. Add daemon metadata refresh (Solution 5)
  3. Comprehensive testing in sandbox environments

Testing Strategy

Manual Testing in Codex Sandbox

  1. Start daemon outside sandbox
  2. Run bd ready inside sandbox → should detect sandbox
  3. Run bd import --force → should update metadata
  4. Run bd ready --allow-stale → should work despite staleness

Automated Testing

  1. Mock sandboxed environment (permission denied on signals)
  2. Test daemon connection failure scenarios
  3. Test metadata update in import with 0 changes
  4. Test staleness check bypass flag

Documentation Updates Needed

  1. TROUBLESHOOTING.md - Add sandbox section with:

    • Symptoms of daemon lock issues
    • --sandbox flag usage
    • --force and --allow-stale as escape hatches
  2. CLI_REFERENCE.md - Document new flags:

    • --allow-stale / --no-staleness-check
    • bd import --force
  3. Error message in staleness.go - Add:

    If you're in a sandboxed environment (e.g., Codex):
      bd --sandbox ready
      bd import --force -i .beads/beads.jsonl
    

Files to Modify

Critical Path (Phase 1)

  • cmd/bd/import.go - Add --force flag
  • cmd/bd/staleness.go - Add staleness bypass, update error message
  • cmd/bd/main.go - Add --allow-stale flag

Enhancement (Phase 2-3)

  • cmd/bd/main.go - Sandbox detection
  • cmd/bd/daemon_unix.go - Permission-aware process checks
  • cmd/bd/daemon_event_loop.go - Metadata refresh
  • internal/rpc/server_export_import_auto.go - Better import handling

Documentation

  • docs/TROUBLESHOOTING.md
  • docs/CLI_REFERENCE.md
  • Issue #353 comment with workaround

Open Questions

  1. Should --sandbox auto-detect, or require explicit flag?

    • Recommendation: Start with explicit, add auto-detect in Phase 2
  2. Should --allow-stale be per-command or global?

    • Recommendation: Global flag (less repetition)
  3. What should happen to daemon lock files when daemon is unreachable?

    • Recommendation: Leave them (don't force-break locks), use direct mode
  4. Should we add a --force-direct that ignores daemon locks entirely?

    • Recommendation: Not needed if sandbox detection works well

Success Metrics

  • Users in Codex can run bd ready without errors
  • No false positives in sandbox detection
  • Clear error messages guide users to solutions
  • bd import --force always updates metadata
  • --sandbox mode works reliably

Investigation completed: 2025-11-21 Next steps: Implement Phase 1 solutions