# Internals
This document describes internal implementation details of bd, with particular focus on concurrency guarantees and data consistency.
For the overall architecture (data model, sync mechanism, component overview), see [ARCHITECTURE.md](ARCHITECTURE.md).
## Auto-Flush Architecture
### Problem Statement (Issue bd-52)
The original auto-flush implementation suffered from a critical race condition when multiple concurrent operations accessed shared state:
- **Concurrent access points:**
  - Auto-flush timer goroutine (5s debounce)
  - Daemon sync goroutine
  - Concurrent CLI commands
  - Git hook execution
  - PersistentPostRun cleanup
- **Shared mutable state:**
  - `isDirty` flag
  - `needsFullExport` flag
  - `flushTimer` instance
  - `storeActive` flag
- **Impact:**
  - Potential data loss under concurrent load
  - Corruption when multiple agents/commands run simultaneously
  - Race conditions during rapid commits
  - Flush operations could access closed storage
### Solution: Event-Driven FlushManager
The race condition was eliminated by replacing timer-based shared state with an event-driven architecture using a single-owner pattern.
#### Architecture
```
┌─────────────────────────────────────────────────────────┐
│  Command/Agent                                          │
│                                                         │
│  markDirtyAndScheduleFlush() ──────────┐                │
│  markDirtyAndScheduleFullExport() ─┐   │                │
└────────────────────────────────────┼───┼────────────────┘
                                     │   │
                                     v   v
          ┌────────────────────────────────────┐
          │            FlushManager            │
          │       (Single-Owner Pattern)       │
          │                                    │
          │  Channels (buffered):              │
          │   - markDirtyCh                    │
          │   - timerFiredCh                   │
          │   - flushNowCh                     │
          │   - shutdownCh                     │
          │                                    │
          │  State (owned by run() goroutine): │
          │   - isDirty                        │
          │   - needsFullExport                │
          │   - debounceTimer                  │
          └────────────────────────────────────┘
                             │
                             v
          ┌────────────────────────────────────┐
          │      flushToJSONLWithState()       │
          │                                    │
          │  - Validates store is active       │
          │  - Checks JSONL integrity          │
          │  - Performs incremental/full export│
          │  - Updates export hashes           │
          └────────────────────────────────────┘
```
#### Key Design Principles
**1. Single Owner Pattern**
All flush state (`isDirty`, `needsFullExport`, `debounceTimer`) is owned by a single background goroutine (`FlushManager.run()`). This eliminates the need for mutexes to protect this state.
**2. Channel-Based Communication**
External code communicates with FlushManager via buffered channels:
- `markDirtyCh`: Request to mark DB dirty (incremental or full export)
- `timerFiredCh`: Debounce timer expired notification
- `flushNowCh`: Synchronous flush request (returns error)
- `shutdownCh`: Graceful shutdown with final flush
**3. No Shared Mutable State**
Flush state is never touched by more than one goroutine; all coordination happens through channel sends and receives. The `storeActive` flag and `store` pointer still use a mutex, but only to coordinate the store lifecycle, not flush logic.
**4. Debouncing Without Locks**
The timer callback sends to `timerFiredCh` instead of directly manipulating state. The run() goroutine processes timer events in its select loop, eliminating timer-related races.
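The pattern is easiest to see in code. The sketch below is illustrative rather than the actual `cmd/bd` source: channel names, buffer sizes, and state fields follow this document, while `getDebounceDuration()` and `flushToJSONLWithState()` stand in for the functions described elsewhere in this document (their exact signatures here are assumptions).
```go
// FlushManager sketch: all flush state lives inside run(); callers only
// ever touch the channels.
type FlushManager struct {
	markDirtyCh  chan bool       // true = full export requested
	timerFiredCh chan struct{}   // debounce timer expiry notification
	flushNowCh   chan chan error // caller waits on the reply channel
	shutdownCh   chan chan error // one-shot shutdown with final flush
	shutdownOnce sync.Once
}

func newFlushManager() *FlushManager {
	fm := &FlushManager{
		markDirtyCh:  make(chan bool, 10),      // absorbs bursts of marks
		timerFiredCh: make(chan struct{}, 1),   // timer notifications coalesce
		flushNowCh:   make(chan chan error, 1), // one synchronous flush at a time
		shutdownCh:   make(chan chan error, 1), // one-shot operation
	}
	go fm.run()
	return fm
}

// run owns isDirty, needsFullExport, and debounceTimer exclusively,
// so no mutex is needed for flush state.
func (fm *FlushManager) run() {
	var (
		isDirty         bool
		needsFullExport bool
		debounceTimer   *time.Timer
	)
	for {
		select {
		case full := <-fm.markDirtyCh:
			isDirty = true
			needsFullExport = needsFullExport || full
			if debounceTimer != nil {
				debounceTimer.Stop()
			}
			// The timer callback only signals; it never touches state.
			debounceTimer = time.AfterFunc(getDebounceDuration(), func() {
				select {
				case fm.timerFiredCh <- struct{}{}:
				default: // a notification is already pending; coalesce
				}
			})
		case <-fm.timerFiredCh:
			if isDirty {
				_ = flushToJSONLWithState(needsFullExport)
				isDirty, needsFullExport = false, false
			}
		case reply := <-fm.flushNowCh:
			reply <- flushToJSONLWithState(needsFullExport)
			isDirty, needsFullExport = false, false
		case reply := <-fm.shutdownCh:
			if isDirty {
				reply <- flushToJSONLWithState(needsFullExport)
			} else {
				reply <- nil
			}
			return
		}
	}
}
```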
#### Concurrency Guarantees
**Thread-Safety:**
- `MarkDirty(fullExport bool)` - Safe from any goroutine, non-blocking
- `FlushNow() error` - Safe from any goroutine, blocks until flush completes
- `Shutdown() error` - Idempotent, safe to call multiple times
**Debouncing Guarantees:**
- Multiple `MarkDirty()` calls within the debounce window → single flush
- Timer resets on each mark, flush occurs after last modification
- `FlushNow()` bypasses the debounce and forces an immediate flush
**Shutdown Guarantees:**
- Final flush performed if database is dirty
- Background goroutine cleanly exits
- Idempotent via `sync.Once` - safe for multiple calls
- Subsequent operations after shutdown are no-ops
**Store Lifecycle:**
- FlushManager checks `storeActive` flag before every flush
- Store closure is coordinated via `storeMutex`
- Flush safely aborts if store closes mid-operation
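Against that loop, the guarantees above reduce to plain channel operations. The methods below are a sketch under the same assumptions as the previous listing, not the real signatures:
```go
// MarkDirty is safe from any goroutine. markDirtyCh is buffered (10), so
// this effectively never blocks: run() drains it faster than commands
// produce marks.
func (fm *FlushManager) MarkDirty(fullExport bool) {
	fm.markDirtyCh <- fullExport
}

// FlushNow bypasses the debounce and blocks until the flush completes.
func (fm *FlushManager) FlushNow() error {
	reply := make(chan error, 1)
	fm.flushNowCh <- reply
	return <-reply
}

// Shutdown performs a final flush if dirty and stops run(); sync.Once
// makes repeated calls no-ops.
func (fm *FlushManager) Shutdown() error {
	var err error
	fm.shutdownOnce.Do(func() {
		reply := make(chan error, 1)
		fm.shutdownCh <- reply
		err = <-reply
	})
	return err
}
```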
#### Migration Path
The implementation maintains backward compatibility:
1. **Legacy path (tests):** If `flushManager == nil`, falls back to old timer-based logic
2. **New path (production):** Uses FlushManager event-driven architecture
3. **Wrapper functions:** `markDirtyAndScheduleFlush()` and `markDirtyAndScheduleFullExport()` delegate to FlushManager when available
This allows existing tests to pass without modification while fixing the race condition in production.
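In outline, the wrappers look like the sketch below; the legacy fallback name is invented for illustration:
```go
// markDirtyAndScheduleFlush delegates to the FlushManager when one exists,
// otherwise falls back to the old timer-based path (kept for tests).
func markDirtyAndScheduleFlush() {
	if flushManager != nil {
		flushManager.MarkDirty(false) // incremental export
		return
	}
	legacyScheduleFlush(false) // hypothetical name for the legacy path
}

func markDirtyAndScheduleFullExport() {
	if flushManager != nil {
		flushManager.MarkDirty(true) // full export, e.g. after ID changes
		return
	}
	legacyScheduleFlush(true)
}
```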
## Testing
### Race Detection
Comprehensive race detector tests ensure concurrency safety:
- `TestFlushManagerConcurrentMarkDirty` - Many goroutines marking dirty
- `TestFlushManagerConcurrentFlushNow` - Concurrent immediate flushes
- `TestFlushManagerMarkDirtyDuringFlush` - Interleaved mark/flush operations
- `TestFlushManagerShutdownDuringOperation` - Shutdown while operations ongoing
- `TestMarkDirtyAndScheduleFlushConcurrency` - Integration test with legacy API
Run with: `go test -race -run TestFlushManager ./cmd/bd`
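A typical shape for one of these tests, as a hypothetical sketch that assumes the `newFlushManager()` constructor from the earlier listing:
```go
// Many goroutines mark the manager dirty concurrently; the race detector
// verifies that no flush state is shared unsafely.
func TestFlushManagerConcurrentMarkDirty(t *testing.T) {
	fm := newFlushManager()
	defer fm.Shutdown()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fm.MarkDirty(i%2 == 0) // mix incremental and full-export marks
		}(i)
	}
	wg.Wait()

	if err := fm.FlushNow(); err != nil {
		t.Fatalf("flush after concurrent marks: %v", err)
	}
}
```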
### In-Process Test Compatibility
The FlushManager is designed to work correctly when commands run multiple times in the same process (common in tests):
- Each command execution in `PersistentPreRun` creates a new FlushManager
- `PersistentPostRun` shuts down the manager
- `Shutdown()` is idempotent via `sync.Once`
- Old managers are garbage collected when replaced
## Related Subsystems
### Daemon Mode
When running in daemon mode (`--no-daemon=false`), the CLI delegates to an RPC server. The FlushManager is NOT used in this mode; the daemon process has its own flush coordination.
The `daemonClient != nil` check in `PersistentPostRun` ensures FlushManager shutdown only occurs in direct mode.
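A minimal sketch of that guard (variable and function names are assumptions, not the actual `cmd/bd` identifiers):
```go
// persistentPostRun sketch: only shut down the FlushManager in direct mode.
func persistentPostRun() {
	if daemonClient != nil {
		return // daemon mode: the daemon owns flush coordination
	}
	if flushManager != nil {
		_ = flushManager.Shutdown() // direct mode: final flush + cleanup
	}
}
```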
### Auto-Import
Auto-import runs in `PersistentPreRun` before FlushManager is used. It may call `markDirtyAndScheduleFlush()` or `markDirtyAndScheduleFullExport()` if JSONL changes are detected.
Hash-based comparison (not mtime) prevents git pull false positives (issue bd-84).
### JSONL Integrity
`flushToJSONLWithState()` validates JSONL file hash before flush:
- Compares stored hash with actual file hash
- If a mismatch is detected, clears `export_hashes` and forces a full re-export (issue bd-160)
- Prevents staleness when JSONL is modified outside bd
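A sketch of the comparison, assuming a SHA-256 digest over the JSONL file; `loadStoredJSONLHash()` is a hypothetical stand-in for wherever bd records the hash of the last export:
```go
// jsonlModifiedExternally reports whether the JSONL file on disk no longer
// matches the hash recorded at the last export.
func jsonlModifiedExternally(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	actual := hex.EncodeToString(h.Sum(nil))

	stored, err := loadStoredJSONLHash() // hypothetical helper
	if err != nil {
		return false, err
	}
	// A mismatch means the JSONL was edited outside bd: clear export_hashes
	// and fall back to a full re-export (issue bd-160).
	return stored != actual, nil
}
```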
### Export Modes
**Incremental export (default):**
- Exports only dirty issues (tracked in `dirty_issues` table)
- Merges with existing JSONL file
- Faster for small changesets
**Full export (after ID changes):**
- Exports all issues from database
- Rebuilds JSONL from scratch
- Required after operations like `rename-prefix` that change issue IDs
- Triggered by `markDirtyAndScheduleFullExport()`
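The decision between the two modes is simple; in the sketch below, helper names other than the `dirty_issues` table are invented for illustration:
```go
// exportJSONL chooses between a full rebuild and an incremental merge.
func exportJSONL(fullExport bool) error {
	if fullExport {
		// Rebuild the JSONL from scratch using every issue in the database
		// (required after rename-prefix and similar ID changes).
		return writeAllIssues()
	}
	// Incremental: export only issues recorded in the dirty_issues table
	// and merge them into the existing JSONL file.
	ids, err := loadDirtyIssueIDs()
	if err != nil {
		return err
	}
	return mergeIssuesIntoJSONL(ids)
}
```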
## Performance Characteristics
- **Debounce window:** Configurable via `getDebounceDuration()` (default 5s)
- **Channel buffer sizes:**
  - `markDirtyCh`: 10 events (prevents blocking during bursts)
  - `timerFiredCh`: 1 event (timer notifications coalesce naturally)
  - `flushNowCh`: 1 request (synchronous, one at a time)
  - `shutdownCh`: 1 request (one-shot operation)
- **Memory overhead:** One goroutine + minimal channel buffers per command execution
- **Flush latency:** Debounce duration + JSONL write time (typically <100ms for incremental)
## Blocked Issues Cache (bd-5qim)
### Problem Statement
The `bd ready` command originally computed blocked issues using a recursive CTE on every query. On a 10K issue database, each query took ~752ms, making the command feel sluggish and impractical for large projects.
### Solution: Materialized Cache Table
The `blocked_issues_cache` table materializes the blocking computation, storing issue IDs for all currently blocked issues. Queries now use a simple `NOT EXISTS` check against this cache, completing in ~29ms (25x speedup).
### Architecture
```
┌─────────────────────────────────────────────────────────┐
│                   GetReadyWork Query                    │
│                                                         │
│  SELECT ... FROM issues WHERE status IN (...)           │
│    AND NOT EXISTS (                                     │
│      SELECT 1 FROM blocked_issues_cache                 │
│      WHERE issue_id = issues.id                         │
│    )                                                    │
│                                                         │
│  Performance: 29ms (was 752ms with recursive CTE)       │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│               Cache Invalidation Triggers               │
│                                                         │
│  1. AddDependency (blocks/parent-child only)            │
│  2. RemoveDependency (blocks/parent-child only)         │
│  3. UpdateIssue (on any status change)                  │
│  4. CloseIssue (changes status to closed)               │
│                                                         │
│  NOT triggered by: related, discovered-from deps        │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                  Cache Rebuild Process                  │
│                                                         │
│  1. DELETE FROM blocked_issues_cache                    │
│  2. INSERT INTO blocked_issues_cache                    │
│     WITH RECURSIVE CTE:                                 │
│       - Find directly blocked issues (blocks deps)      │
│       - Propagate to children (parent-child deps)       │
│  3. Happens in same transaction as triggering change    │
│                                                         │
│  Performance: <50ms full rebuild on 10K database        │
└─────────────────────────────────────────────────────────┘
```
### Blocking Semantics
An issue is blocked if:
1. **Direct blocking**: Has a `blocks` dependency on an open/in_progress/blocked issue
2. **Transitive blocking**: Parent is blocked and issue is connected via `parent-child` dependency
Closed issues never block others. Related and discovered-from dependencies don't affect blocking.
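These semantics map directly onto the rebuild query. The sketch below shows the shape of that rebuild run inside the caller's transaction; column names (`issue_id`, `depends_on_id`, `dep_type`) and the dependency direction are assumptions, since the real schema lives in the sqlite migrations:
```go
// rebuildBlockedCache repopulates blocked_issues_cache inside tx:
// directly blocked issues first, then children of blocked parents,
// with a depth cap as a safety limit on hierarchy propagation.
func rebuildBlockedCache(tx *sql.Tx) error {
	if _, err := tx.Exec(`DELETE FROM blocked_issues_cache`); err != nil {
		return err
	}
	_, err := tx.Exec(`
		INSERT INTO blocked_issues_cache (issue_id)
		WITH RECURSIVE blocked(id, depth) AS (
			-- Directly blocked: a 'blocks' dependency on a still-open issue
			SELECT d.issue_id, 1
			FROM dependencies d
			JOIN issues b ON b.id = d.depends_on_id
			WHERE d.dep_type = 'blocks'
			  AND b.status IN ('open', 'in_progress', 'blocked')
			UNION
			-- Transitively blocked: children connected to a blocked parent
			SELECT d.issue_id, blocked.depth + 1
			FROM dependencies d
			JOIN blocked ON blocked.id = d.depends_on_id
			WHERE d.dep_type = 'parent-child'
			  AND blocked.depth < 50 -- safety cap on hierarchy depth
		)
		SELECT DISTINCT id FROM blocked`)
	return err
}
```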
### Cache Invalidation Strategy
**Full rebuild on every change**
Instead of incremental updates, the cache is completely rebuilt (DELETE + INSERT) on any triggering change. This approach is chosen because:
- Rebuild is fast (<50ms even on 10K issues) due to optimized CTE
- Simpler implementation with no risk of partial/stale updates
- Dependency changes are rare compared to reads
- Guarantees consistency - cache matches database state exactly
**Transaction safety**
All cache operations happen within the same transaction as the triggering change:
- Uses transaction if provided, otherwise direct db connection
- Cache can never be in an inconsistent state visible to queries
- Foreign key CASCADE ensures cache entries deleted when issues are deleted
**Selective invalidation**
Only `blocks` and `parent-child` dependencies trigger rebuilds since they affect blocking semantics. Related and discovered-from dependencies don't trigger invalidation, avoiding unnecessary work.
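That dispatch is a one-line check; a hypothetical helper, reusing `rebuildBlockedCache` from the sketch above:
```go
// maybeRebuildCache rebuilds the cache only for dependency types that
// affect blocking semantics.
func maybeRebuildCache(tx *sql.Tx, depType string) error {
	switch depType {
	case "blocks", "parent-child":
		return rebuildBlockedCache(tx)
	default:
		// 'related' and 'discovered-from' never affect blocking.
		return nil
	}
}
```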
### Performance Characteristics
**Query performance (GetReadyWork):**
- Before cache: ~752ms (recursive CTE)
- With cache: ~29ms (NOT EXISTS)
- Speedup: 25x
**Write overhead:**
- Cache rebuild: <50ms
- Only triggered on dependency/status changes (rare operations)
- Trade-off: slower writes for much faster reads
### Edge Cases
1. **Parent-child transitive blocking**
   - Children of blocked parents are automatically marked as blocked
   - Propagates through arbitrarily deep hierarchies (capped at depth 50 for safety)
2. **Multiple blockers**
   - An issue blocked by multiple open issues stays blocked until all of them are closed
   - DISTINCT in the CTE ensures each issue appears only once in the cache
3. **Status changes**
   - Closing a blocker removes all blocked descendants from the cache
   - Reopening a blocker adds them back
4. **Dependency removal**
   - Removing the last blocker unblocks the issue
   - Removing a parent-child link unblocks the orphaned subtree
5. **Foreign key cascades**
   - Cache entries are automatically deleted when an issue is deleted
   - No manual cleanup needed
### Testing
Comprehensive test coverage in `blocked_cache_test.go`:
- Cache invalidation on dependency add/remove
- Cache updates on status changes
- Multiple blockers
- Deep hierarchies
- Transitive blocking via parent-child
- Related dependencies (should NOT affect cache)
Run tests: `go test -v ./internal/storage/sqlite -run TestCache`
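A hypothetical sketch of the test shape; the storage calls here (`CreateIssue`, `AddDependency`, `CloseIssue`, `GetReadyWork`) stand in for whatever the sqlite package actually exposes, and `newTestStore`/`containsIssue` are invented helpers:
```go
// Adding a 'blocks' dependency should remove the blocked issue from ready
// work; closing the blocker should bring it back.
func TestCacheInvalidationOnDependencyAdd(t *testing.T) {
	store := newTestStore(t) // assumed helper that opens a temp DB

	blocker := store.CreateIssue(t, "blocker") // open issue
	blocked := store.CreateIssue(t, "blocked")

	store.AddDependency(t, blocked, blocker, "blocks")
	if containsIssue(store.GetReadyWork(t), blocked) {
		t.Fatalf("issue %s should be blocked after adding a blocks dep", blocked)
	}

	store.CloseIssue(t, blocker)
	if !containsIssue(store.GetReadyWork(t), blocked) {
		t.Fatalf("issue %s should be ready after its blocker is closed", blocked)
	}
}
```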
### Implementation Files
- `internal/storage/sqlite/blocked_cache.go` - Cache rebuild and invalidation
- `internal/storage/sqlite/ready.go` - Uses cache in GetReadyWork queries
- `internal/storage/sqlite/dependencies.go` - Invalidates on dep changes
- `internal/storage/sqlite/queries.go` - Invalidates on status changes
- `internal/storage/sqlite/migrations/015_blocked_issues_cache.go` - Schema and initial population
### Future Optimizations
If rebuild becomes a bottleneck in very large databases (>100K issues):
- Consider incremental updates for specific dependency types
- Add indexes to dependencies table for CTE performance
- Implement dirty tracking to avoid rebuilds when cache is unchanged
However, current performance is excellent for realistic workloads.
## Future Improvements
Potential enhancements for multi-agent scenarios:
1. **Flush coordination across agents:**
   - Shared lock file to prevent concurrent JSONL writes
   - Detection of external JSONL modifications during flush
2. **Adaptive debounce window:**
   - Shorter debounce during interactive sessions
   - Longer debounce during batch operations
3. **Flush progress tracking:**
   - Expose flush queue depth via status API
   - Allow clients to wait for flush completion
4. **Per-issue dirty tracking optimization:**
   - Currently tracks full vs. incremental
   - Could track specific issue IDs for surgical updates