Files

Steve Yegge a9b2f9f553 Fix race condition in auto-flush mechanism (issue bd-52)

Critical fixes to code review findings:

1. Remove global state access from flushToJSONLWithState
   - FlushManager now has true single ownership of flush state
   - No more race conditions from concurrent global state access
   - flushToJSONLWithState trusts only the flushState parameter
   - Legacy wrapper handles success detection via failure count

2. Fix shutdown timeout data loss risk
   - Increased timeout from 5s → 30s to prevent data loss
   - Added detailed comments explaining the timeout rationale
   - Better error message indicates potential data loss scenario

Implementation details:
- New FlushManager uses event-driven single-owner pattern
- Channels eliminate shared mutable state (markDirtyCh, flushNowCh, etc.)
- Comprehensive race detector tests verify concurrency safety
- Backward compatible with existing tests via legacy code path
- ARCHITECTURE.md documents design principles and guarantees

Test results:
- All race detector tests pass (TestFlushManager*)
- Legacy API compatibility verified (TestMarkDirtyAndScheduleFlush*)
- No race conditions detected under concurrent load

Future improvements tracked as beads:
- bd-gdn: Add functional tests for flush correctness verification
- bd-5xt: Log errors from timer-triggered flushes
- bd-i00: Convert magic numbers to named constants

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 21:24:31 -05:00

9.2 KiB

Raw Blame History

Architecture

This document describes the internal architecture of the bd issue tracker, with particular focus on concurrency guarantees and data consistency.

Auto-Flush Architecture

Problem Statement (Issue bd-52)

The original auto-flush implementation suffered from a critical race condition when multiple concurrent operations accessed shared state:

Concurrent access points:
- Auto-flush timer goroutine (5s debounce)
- Daemon sync goroutine
- Concurrent CLI commands
- Git hook execution
- PersistentPostRun cleanup
Shared mutable state:
- isDirty flag
- needsFullExport flag
- flushTimer instance
- storeActive flag
Impact:
- Potential data loss under concurrent load
- Corruption when multiple agents/commands run simultaneously
- Race conditions during rapid commits
- Flush operations could access closed storage

Solution: Event-Driven FlushManager

The race condition was eliminated by replacing timer-based shared state with an event-driven architecture using a single-owner pattern.

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Command/Agent                        │
│                                                          │
│  markDirtyAndScheduleFlush() ─┐                         │
│  markDirtyAndScheduleFullExport() ─┐                    │
└────────────────────────────────────┼───┼────────────────┘
                                     │   │
                                     v   v
                    ┌────────────────────────────────────┐
                    │        FlushManager                │
                    │  (Single-Owner Pattern)            │
                    │                                    │
                    │  Channels (buffered):              │
                    │    - markDirtyCh                   │
                    │    - timerFiredCh                  │
                    │    - flushNowCh                    │
                    │    - shutdownCh                    │
                    │                                    │
                    │  State (owned by run() goroutine): │
                    │    - isDirty                       │
                    │    - needsFullExport               │
                    │    - debounceTimer                 │
                    └────────────────────────────────────┘
                                     │
                                     v
                    ┌────────────────────────────────────┐
                    │      flushToJSONLWithState()       │
                    │                                    │
                    │  - Validates store is active       │
                    │  - Checks JSONL integrity          │
                    │  - Performs incremental/full export│
                    │  - Updates export hashes           │
                    └────────────────────────────────────┘

Key Design Principles

1. Single Owner Pattern

All flush state (isDirty, needsFullExport, debounceTimer) is owned by a single background goroutine (FlushManager.run()). This eliminates the need for mutexes to protect this state.

2. Channel-Based Communication

External code communicates with FlushManager via buffered channels:

markDirtyCh: Request to mark DB dirty (incremental or full export)
timerFiredCh: Debounce timer expired notification
flushNowCh: Synchronous flush request (returns error)
shutdownCh: Graceful shutdown with final flush

3. No Shared Mutable State

The only shared state is accessed via atomic operations (channel sends/receives). The storeActive flag and store pointer still use a mutex, but only to coordinate with store lifecycle, not flush logic.

4. Debouncing Without Locks

The timer callback sends to timerFiredCh instead of directly manipulating state. The run() goroutine processes timer events in its select loop, eliminating timer-related races.

Concurrency Guarantees

Thread-Safety:

MarkDirty(fullExport bool) - Safe from any goroutine, non-blocking
FlushNow() error - Safe from any goroutine, blocks until flush completes
Shutdown() error - Idempotent, safe to call multiple times

Debouncing Guarantees:

Multiple MarkDirty() calls within the debounce window → single flush
Timer resets on each mark, flush occurs after last modification
FlushNow() bypasses debounce, forces immediate flush

Shutdown Guarantees:

Final flush performed if database is dirty
Background goroutine cleanly exits
Idempotent via sync.Once - safe for multiple calls
Subsequent operations after shutdown are no-ops

Store Lifecycle:

FlushManager checks storeActive flag before every flush
Store closure is coordinated via storeMutex
Flush safely aborts if store closes mid-operation

Migration Path

The implementation maintains backward compatibility:

Legacy path (tests): If flushManager == nil, falls back to old timer-based logic
New path (production): Uses FlushManager event-driven architecture
Wrapper functions: markDirtyAndScheduleFlush() and markDirtyAndScheduleFullExport() delegate to FlushManager when available

This allows existing tests to pass without modification while fixing the race condition in production.

Testing

Race Detection

Comprehensive race detector tests ensure concurrency safety:

TestFlushManagerConcurrentMarkDirty - Many goroutines marking dirty
TestFlushManagerConcurrentFlushNow - Concurrent immediate flushes
TestFlushManagerMarkDirtyDuringFlush - Interleaved mark/flush operations
TestFlushManagerShutdownDuringOperation - Shutdown while operations ongoing
TestMarkDirtyAndScheduleFlushConcurrency - Integration test with legacy API

Run with: go test -race -run TestFlushManager ./cmd/bd

In-Process Test Compatibility

The FlushManager is designed to work correctly when commands run multiple times in the same process (common in tests):

Each command execution in PersistentPreRun creates a new FlushManager
PersistentPostRun shuts down the manager
Shutdown() is idempotent via sync.Once
Old managers are garbage collected when replaced

Daemon Mode

When running with daemon mode (--no-daemon=false), the CLI delegates to an RPC server. The FlushManager is NOT used in daemon mode - the daemon process has its own flush coordination.

The daemonClient != nil check in PersistentPostRun ensures FlushManager shutdown only occurs in direct mode.

Auto-Import

Auto-import runs in PersistentPreRun before FlushManager is used. It may call markDirtyAndScheduleFlush() or markDirtyAndScheduleFullExport() if JSONL changes are detected.

Hash-based comparison (not mtime) prevents git pull false positives (issue bd-84).

JSONL Integrity

flushToJSONLWithState() validates JSONL file hash before flush:

Compares stored hash with actual file hash
If mismatch detected, clears export_hashes and forces full re-export (issue bd-160)
Prevents staleness when JSONL is modified outside bd

Export Modes

Incremental export (default):

Exports only dirty issues (tracked in dirty_issues table)
Merges with existing JSONL file
Faster for small changesets

Full export (after ID changes):

Exports all issues from database
Rebuilds JSONL from scratch
Required after operations like rename-prefix that change issue IDs
Triggered by markDirtyAndScheduleFullExport()

Performance Characteristics

Debounce window: Configurable via getDebounceDuration() (default 5s)
Channel buffer sizes:
- markDirtyCh: 10 events (prevents blocking during bursts)
- timerFiredCh: 1 event (timer notifications coalesce naturally)
- flushNowCh: 1 request (synchronous, one at a time)
- shutdownCh: 1 request (one-shot operation)
Memory overhead: One goroutine + minimal channel buffers per command execution
Flush latency: Debounce duration + JSONL write time (typically <100ms for incremental)

Future Improvements

Potential enhancements for multi-agent scenarios:

Flush coordination across agents:
- Shared lock file to prevent concurrent JSONL writes
- Detection of external JSONL modifications during flush
Adaptive debounce window:
- Shorter debounce during interactive sessions
- Longer debounce during batch operations
Flush progress tracking:
- Expose flush queue depth via status API
- Allow clients to wait for flush completion
Per-issue dirty tracking optimization:
- Currently tracks full vs. incremental
- Could track specific issue IDs for surgical updates

9.2 KiB Raw Blame History

Architecture

Auto-Flush Architecture

Problem Statement (Issue bd-52)

Solution: Event-Driven FlushManager

Architecture

Key Design Principles

Concurrency Guarantees

Migration Path

Testing

Race Detection

In-Process Test Compatibility

Related Subsystems

Daemon Mode

Auto-Import

JSONL Integrity

Export Modes

Performance Characteristics

Future Improvements

9.2 KiB

Raw Blame History