# Internals

This document describes internal implementation details of bd, with particular focus on concurrency guarantees and data consistency.

For the overall architecture (data model, sync mechanism, component overview), see [ARCHITECTURE.md](ARCHITECTURE.md).

## Auto-Flush Architecture

### Problem Statement (Issue bd-52)

The original auto-flush implementation suffered from a critical race condition when multiple concurrent operations accessed shared state:

- **Concurrent access points:**
  - Auto-flush timer goroutine (5s debounce)
  - Daemon sync goroutine
  - Concurrent CLI commands
  - Git hook execution
  - PersistentPostRun cleanup

- **Shared mutable state:**
  - `isDirty` flag
  - `needsFullExport` flag
  - `flushTimer` instance
  - `storeActive` flag

- **Impact:**
  - Potential data loss under concurrent load
  - Corruption when multiple agents/commands run simultaneously
  - Race conditions during rapid commits
  - Flush operations could access closed storage

### Solution: Event-Driven FlushManager

The race condition was eliminated by replacing timer-based shared state with an event-driven architecture using a single-owner pattern.

#### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      Command/Agent                      │
│                                                         │
│  markDirtyAndScheduleFlush() ──────────┐                │
│  markDirtyAndScheduleFullExport() ─┐   │                │
└────────────────────────────────────┼───┼────────────────┘
                                     │   │
                                     v   v
            ┌────────────────────────────────────┐
            │            FlushManager            │
            │       (Single-Owner Pattern)       │
            │                                    │
            │  Channels (buffered):              │
            │  - markDirtyCh                     │
            │  - timerFiredCh                    │
            │  - flushNowCh                      │
            │  - shutdownCh                      │
            │                                    │
            │  State (owned by run() goroutine): │
            │  - isDirty                         │
            │  - needsFullExport                 │
            │  - debounceTimer                   │
            └────────────────────────────────────┘
                              │
                              v
            ┌────────────────────────────────────┐
            │      flushToJSONLWithState()       │
            │                                    │
            │  - Validates store is active       │
            │  - Checks JSONL integrity          │
            │  - Performs incremental/full export│
            │  - Updates export hashes           │
            └────────────────────────────────────┘
```

#### Key Design Principles

**1. Single Owner Pattern**

All flush state (`isDirty`, `needsFullExport`, `debounceTimer`) is owned by a single background goroutine (`FlushManager.run()`). This eliminates the need for mutexes to protect this state.

**2. Channel-Based Communication**

External code communicates with the FlushManager via buffered channels:

- `markDirtyCh`: Request to mark the DB dirty (incremental or full export)
- `timerFiredCh`: Debounce timer expired notification
- `flushNowCh`: Synchronous flush request (returns error)
- `shutdownCh`: Graceful shutdown with final flush

**3. No Shared Mutable State**

The only shared state is accessed via atomic operations (channel sends/receives). The `storeActive` flag and `store` pointer still use a mutex, but only to coordinate with the store lifecycle, not flush logic.

**4. Debouncing Without Locks**

The timer callback sends to `timerFiredCh` instead of directly manipulating state. The `run()` goroutine processes timer events in its select loop, eliminating timer-related races.
#### Concurrency Guarantees

**Thread-Safety:**

- `MarkDirty(fullExport bool)` - Safe from any goroutine, non-blocking
- `FlushNow() error` - Safe from any goroutine, blocks until the flush completes
- `Shutdown() error` - Idempotent, safe to call multiple times

**Debouncing Guarantees:**

- Multiple `MarkDirty()` calls within the debounce window → single flush
- Timer resets on each mark; the flush occurs after the last modification
- `FlushNow()` bypasses the debounce and forces an immediate flush

**Shutdown Guarantees:**

- Final flush performed if the database is dirty
- Background goroutine cleanly exits
- Idempotent via `sync.Once` - safe for multiple calls
- Subsequent operations after shutdown are no-ops
**Store Lifecycle:**

- FlushManager checks the `storeActive` flag before every flush
- Store closure is coordinated via `storeMutex`
- A flush safely aborts if the store closes mid-operation

#### Migration Path

The implementation maintains backward compatibility:

1. **Legacy path (tests):** If `flushManager == nil`, falls back to the old timer-based logic
2. **New path (production):** Uses the FlushManager event-driven architecture
3. **Wrapper functions:** `markDirtyAndScheduleFlush()` and `markDirtyAndScheduleFullExport()` delegate to the FlushManager when available

This allows existing tests to pass without modification while fixing the race condition in production.

## Testing

### Race Detection

Comprehensive race detector tests ensure concurrency safety:

- `TestFlushManagerConcurrentMarkDirty` - Many goroutines marking dirty
- `TestFlushManagerConcurrentFlushNow` - Concurrent immediate flushes
- `TestFlushManagerMarkDirtyDuringFlush` - Interleaved mark/flush operations
- `TestFlushManagerShutdownDuringOperation` - Shutdown while operations are ongoing
- `TestMarkDirtyAndScheduleFlushConcurrency` - Integration test with the legacy API

Run with: `go test -race -run TestFlushManager ./cmd/bd`

### In-Process Test Compatibility

The FlushManager is designed to work correctly when commands run multiple times in the same process (common in tests):

- Each command execution in `PersistentPreRun` creates a new FlushManager
- `PersistentPostRun` shuts down the manager
- `Shutdown()` is idempotent via `sync.Once`
- Old managers are garbage collected when replaced

## Related Subsystems

### Daemon Mode

When running in daemon mode (`--no-daemon=false`), the CLI delegates to an RPC server. The FlushManager is NOT used in daemon mode - the daemon process has its own flush coordination.

The `daemonClient != nil` check in `PersistentPostRun` ensures FlushManager shutdown only occurs in direct mode.

### Auto-Import

Auto-import runs in `PersistentPreRun` before the FlushManager is used. It may call `markDirtyAndScheduleFlush()` or `markDirtyAndScheduleFullExport()` if JSONL changes are detected.

Hash-based comparison (not mtime) prevents git pull false positives (issue bd-84).
### JSONL Integrity

`flushToJSONLWithState()` validates the JSONL file hash before each flush:

- Compares the stored hash with the actual file hash
- If a mismatch is detected, clears `export_hashes` and forces a full re-export (issue bd-160)
- Prevents staleness when the JSONL is modified outside bd

### Export Modes

**Incremental export (default):**

- Exports only dirty issues (tracked in the `dirty_issues` table)
- Merges with the existing JSONL file
- Faster for small changesets

**Full export (after ID changes):**

- Exports all issues from the database
- Rebuilds the JSONL from scratch
- Required after operations like `rename-prefix` that change issue IDs
- Triggered by `markDirtyAndScheduleFullExport()`
## Performance Characteristics

- **Debounce window:** Configurable via `getDebounceDuration()` (default 5s)
- **Channel buffer sizes:**
  - `markDirtyCh`: 10 events (prevents blocking during bursts)
  - `timerFiredCh`: 1 event (timer notifications coalesce naturally)
  - `flushNowCh`: 1 request (synchronous, one at a time)
  - `shutdownCh`: 1 request (one-shot operation)
- **Memory overhead:** One goroutine + minimal channel buffers per command execution
- **Flush latency:** Debounce duration + JSONL write time (typically <100ms for incremental)

## Blocked Issues Cache (bd-5qim)

### Problem Statement

The `bd ready` command originally computed blocked issues using a recursive CTE on every query. On a 10K-issue database, each query took ~752ms, making the command feel sluggish and impractical for large projects.

### Solution: Materialized Cache Table

The `blocked_issues_cache` table materializes the blocking computation, storing issue IDs for all currently blocked issues. Queries now use a simple `NOT EXISTS` check against this cache, completing in ~29ms (a 25x speedup).

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   GetReadyWork Query                    │
│                                                         │
│  SELECT ... FROM issues WHERE status IN (...)           │
│  AND NOT EXISTS (                                       │
│    SELECT 1 FROM blocked_issues_cache                   │
│    WHERE issue_id = issues.id                           │
│  )                                                      │
│                                                         │
│  Performance: 29ms (was 752ms with recursive CTE)       │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│               Cache Invalidation Triggers               │
│                                                         │
│  1. AddDependency (blocks/parent-child only)            │
│  2. RemoveDependency (blocks/parent-child only)         │
│  3. UpdateIssue (on any status change)                  │
│  4. CloseIssue (changes status to closed)               │
│                                                         │
│  NOT triggered by: related, discovered-from deps        │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                  Cache Rebuild Process                  │
│                                                         │
│  1. DELETE FROM blocked_issues_cache                    │
│  2. INSERT INTO blocked_issues_cache                    │
│     WITH RECURSIVE CTE:                                 │
│       - Find directly blocked issues (blocks deps)      │
│       - Propagate to children (parent-child deps)       │
│  3. Happens in same transaction as triggering change    │
│                                                         │
│  Performance: <50ms full rebuild on 10K database        │
└─────────────────────────────────────────────────────────┘
```

### Blocking Semantics

An issue is blocked if:

1. **Direct blocking**: It has a `blocks` dependency on an open/in_progress/blocked issue
2. **Transitive blocking**: Its parent is blocked and the issue is connected via a `parent-child` dependency

Closed issues never block others. Related and discovered-from dependencies don't affect blocking.
### Cache Invalidation Strategy

**Full rebuild on every change**

Instead of incremental updates, the cache is completely rebuilt (DELETE + INSERT) on any triggering change. This approach was chosen because:

- Rebuild is fast (<50ms even on 10K issues) thanks to an optimized CTE
- Simpler implementation with no risk of partial/stale updates
- Dependency changes are rare compared to reads
- Guarantees consistency - the cache matches database state exactly

**Transaction safety**

All cache operations happen within the same transaction as the triggering change:

- Uses the transaction if provided, otherwise a direct db connection
- The cache can never be in an inconsistent state visible to queries
- Foreign key CASCADE ensures cache entries are deleted when issues are deleted

**Selective invalidation**

Only `blocks` and `parent-child` dependencies trigger rebuilds, since they affect blocking semantics. Related and discovered-from dependencies don't trigger invalidation, avoiding unnecessary work.

### Performance Characteristics

**Query performance (GetReadyWork):**

- Before cache: ~752ms (recursive CTE)
- With cache: ~29ms (NOT EXISTS)
- Speedup: 25x

**Write overhead:**

- Cache rebuild: <50ms
- Only triggered on dependency/status changes (rare operations)
- Trade-off: slower writes for much faster reads

### Edge Cases

1. **Parent-child transitive blocking**
   - Children of blocked parents are automatically marked as blocked
   - Propagates through hierarchies of arbitrary depth (limited to depth 50 for safety)

2. **Multiple blockers**
   - An issue blocked by multiple open issues stays blocked until all are closed
   - DISTINCT in the CTE ensures an issue appears once in the cache

3. **Status changes**
   - Closing a blocker removes all blocked descendants from the cache
   - Reopening a blocker adds them back

4. **Dependency removal**
   - Removing the last blocker unblocks the issue
   - Removing a parent-child link unblocks the orphaned subtree

5. **Foreign key cascades**
   - Cache entries are automatically deleted when an issue is deleted
   - No manual cleanup needed

### Testing

Comprehensive test coverage in `blocked_cache_test.go`:

- Cache invalidation on dependency add/remove
- Cache updates on status changes
- Multiple blockers
- Deep hierarchies
- Transitive blocking via parent-child
- Related dependencies (should NOT affect the cache)

Run tests: `go test -v ./internal/storage/sqlite -run TestCache`

### Implementation Files

- `internal/storage/sqlite/blocked_cache.go` - Cache rebuild and invalidation
- `internal/storage/sqlite/ready.go` - Uses the cache in GetReadyWork queries
- `internal/storage/sqlite/dependencies.go` - Invalidates on dependency changes
- `internal/storage/sqlite/queries.go` - Invalidates on status changes
- `internal/storage/sqlite/migrations/015_blocked_issues_cache.go` - Schema and initial population

### Future Optimizations

If rebuilds become a bottleneck on very large databases (>100K issues):

- Consider incremental updates for specific dependency types
- Add indexes to the dependencies table for CTE performance
- Implement dirty tracking to avoid rebuilds when the cache is unchanged

However, current performance is excellent for realistic workloads.

## Future Improvements

Potential enhancements for multi-agent scenarios:

1. **Flush coordination across agents:**
   - Shared lock file to prevent concurrent JSONL writes
   - Detection of external JSONL modifications during flush

2. **Adaptive debounce window:**
   - Shorter debounce during interactive sessions
   - Longer debounce during batch operations

3. **Flush progress tracking:**
   - Expose flush queue depth via a status API
   - Allow clients to wait for flush completion

4. **Per-issue dirty tracking optimization:**
   - Currently tracks full vs. incremental
   - Could track specific issue IDs for surgical updates