Critical fixes to code review findings: 1. Remove global state access from flushToJSONLWithState - FlushManager now has true single ownership of flush state - No more race conditions from concurrent global state access - flushToJSONLWithState trusts only the flushState parameter - Legacy wrapper handles success detection via failure count 2. Fix shutdown timeout data loss risk - Increased timeout from 5s → 30s to prevent data loss - Added detailed comments explaining the timeout rationale - Better error message indicates potential data loss scenario Implementation details: - New FlushManager uses event-driven single-owner pattern - Channels eliminate shared mutable state (markDirtyCh, flushNowCh, etc.) - Comprehensive race detector tests verify concurrency safety - Backward compatible with existing tests via legacy code path - ARCHITECTURE.md documents design principles and guarantees Test results: - All race detector tests pass (TestFlushManager*) - Legacy API compatibility verified (TestMarkDirtyAndScheduleFlush*) - No race conditions detected under concurrent load Future improvements tracked as beads: - bd-gdn: Add functional tests for flush correctness verification - bd-5xt: Log errors from timer-triggered flushes - bd-i00: Convert magic numbers to named constants 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
214 lines
9.2 KiB
Markdown
214 lines
9.2 KiB
Markdown
# Architecture
|
|
|
|
This document describes the internal architecture of the `bd` issue tracker, with particular focus on concurrency guarantees and data consistency.
|
|
|
|
## Auto-Flush Architecture
|
|
|
|
### Problem Statement (Issue bd-52)
|
|
|
|
The original auto-flush implementation suffered from a critical race condition when multiple concurrent operations accessed shared state:
|
|
|
|
- **Concurrent access points:**
|
|
- Auto-flush timer goroutine (5s debounce)
|
|
- Daemon sync goroutine
|
|
- Concurrent CLI commands
|
|
- Git hook execution
|
|
- PersistentPostRun cleanup
|
|
|
|
- **Shared mutable state:**
|
|
- `isDirty` flag
|
|
- `needsFullExport` flag
|
|
- `flushTimer` instance
|
|
- `storeActive` flag
|
|
|
|
- **Impact:**
|
|
- Potential data loss under concurrent load
|
|
- Corruption when multiple agents/commands run simultaneously
|
|
- Race conditions during rapid commits
|
|
- Flush operations could access closed storage
|
|
|
|
### Solution: Event-Driven FlushManager
|
|
|
|
The race condition was eliminated by replacing timer-based shared state with an event-driven architecture using a single-owner pattern.
|
|
|
|
#### Architecture
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Command/Agent │
|
|
│ │
|
|
│ markDirtyAndScheduleFlush() ─┐ │
|
|
│ markDirtyAndScheduleFullExport() ─┐ │
|
|
└────────────────────────────────────┼───┼────────────────┘
|
|
│ │
|
|
v v
|
|
┌────────────────────────────────────┐
|
|
│ FlushManager │
|
|
│ (Single-Owner Pattern) │
|
|
│ │
|
|
│ Channels (buffered): │
|
|
│ - markDirtyCh │
|
|
│ - timerFiredCh │
|
|
│ - flushNowCh │
|
|
│ - shutdownCh │
|
|
│ │
|
|
│ State (owned by run() goroutine): │
|
|
│ - isDirty │
|
|
│ - needsFullExport │
|
|
│ - debounceTimer │
|
|
└────────────────────────────────────┘
|
|
│
|
|
v
|
|
┌────────────────────────────────────┐
|
|
│ flushToJSONLWithState() │
|
|
│ │
|
|
│ - Validates store is active │
|
|
│ - Checks JSONL integrity │
|
|
│ - Performs incremental/full export│
|
|
│ - Updates export hashes │
|
|
└────────────────────────────────────┘
|
|
```
|
|
|
|
#### Key Design Principles
|
|
|
|
**1. Single Owner Pattern**
|
|
|
|
All flush state (`isDirty`, `needsFullExport`, `debounceTimer`) is owned by a single background goroutine (`FlushManager.run()`). This eliminates the need for mutexes to protect this state.
|
|
|
|
**2. Channel-Based Communication**
|
|
|
|
External code communicates with FlushManager via buffered channels:
|
|
- `markDirtyCh`: Request to mark DB dirty (incremental or full export)
|
|
- `timerFiredCh`: Debounce timer expired notification
|
|
- `flushNowCh`: Synchronous flush request (returns error)
|
|
- `shutdownCh`: Graceful shutdown with final flush
|
|
|
|
**3. No Shared Mutable State**
|
|
|
|
The only shared state is accessed via atomic operations (channel sends/receives). The `storeActive` flag and `store` pointer still use a mutex, but only to coordinate with store lifecycle, not flush logic.
|
|
|
|
**4. Debouncing Without Locks**
|
|
|
|
The timer callback sends to `timerFiredCh` instead of directly manipulating state. The run() goroutine processes timer events in its select loop, eliminating timer-related races.
|
|
|
|
#### Concurrency Guarantees
|
|
|
|
**Thread-Safety:**
|
|
- `MarkDirty(fullExport bool)` - Safe from any goroutine, non-blocking
|
|
- `FlushNow() error` - Safe from any goroutine, blocks until flush completes
|
|
- `Shutdown() error` - Idempotent, safe to call multiple times
|
|
|
|
**Debouncing Guarantees:**
|
|
- Multiple `MarkDirty()` calls within the debounce window → single flush
|
|
- Timer resets on each mark, flush occurs after last modification
|
|
- FlushNow() bypasses debounce, forces immediate flush
|
|
|
|
**Shutdown Guarantees:**
|
|
- Final flush performed if database is dirty
|
|
- Background goroutine cleanly exits
|
|
- Idempotent via `sync.Once` - safe for multiple calls
|
|
- Subsequent operations after shutdown are no-ops
|
|
|
|
**Store Lifecycle:**
|
|
- FlushManager checks `storeActive` flag before every flush
|
|
- Store closure is coordinated via `storeMutex`
|
|
- Flush safely aborts if store closes mid-operation
|
|
|
|
#### Migration Path
|
|
|
|
The implementation maintains backward compatibility:
|
|
|
|
1. **Legacy path (tests):** If `flushManager == nil`, falls back to old timer-based logic
|
|
2. **New path (production):** Uses FlushManager event-driven architecture
|
|
3. **Wrapper functions:** `markDirtyAndScheduleFlush()` and `markDirtyAndScheduleFullExport()` delegate to FlushManager when available
|
|
|
|
This allows existing tests to pass without modification while fixing the race condition in production.
|
|
|
|
## Testing
|
|
|
|
### Race Detection
|
|
|
|
Comprehensive race detector tests ensure concurrency safety:
|
|
|
|
- `TestFlushManagerConcurrentMarkDirty` - Many goroutines marking dirty
|
|
- `TestFlushManagerConcurrentFlushNow` - Concurrent immediate flushes
|
|
- `TestFlushManagerMarkDirtyDuringFlush` - Interleaved mark/flush operations
|
|
- `TestFlushManagerShutdownDuringOperation` - Shutdown while operations ongoing
|
|
- `TestMarkDirtyAndScheduleFlushConcurrency` - Integration test with legacy API
|
|
|
|
Run with: `go test -race -run TestFlushManager ./cmd/bd`
|
|
|
|
### In-Process Test Compatibility
|
|
|
|
The FlushManager is designed to work correctly when commands run multiple times in the same process (common in tests):
|
|
|
|
- Each command execution in `PersistentPreRun` creates a new FlushManager
|
|
- `PersistentPostRun` shuts down the manager
|
|
- `Shutdown()` is idempotent via `sync.Once`
|
|
- Old managers are garbage collected when replaced
|
|
|
|
## Related Subsystems
|
|
|
|
### Daemon Mode
|
|
|
|
When running with daemon mode (`--no-daemon=false`), the CLI delegates to an RPC server. The FlushManager is NOT used in daemon mode - the daemon process has its own flush coordination.
|
|
|
|
The `daemonClient != nil` check in `PersistentPostRun` ensures FlushManager shutdown only occurs in direct mode.
|
|
|
|
### Auto-Import
|
|
|
|
Auto-import runs in `PersistentPreRun` before FlushManager is used. It may call `markDirtyAndScheduleFlush()` or `markDirtyAndScheduleFullExport()` if JSONL changes are detected.
|
|
|
|
Hash-based comparison (not mtime) prevents git pull false positives (issue bd-84).
|
|
|
|
### JSONL Integrity
|
|
|
|
`flushToJSONLWithState()` validates JSONL file hash before flush:
|
|
- Compares stored hash with actual file hash
|
|
- If mismatch detected, clears export_hashes and forces full re-export (issue bd-160)
|
|
- Prevents staleness when JSONL is modified outside bd
|
|
|
|
### Export Modes
|
|
|
|
**Incremental export (default):**
|
|
- Exports only dirty issues (tracked in `dirty_issues` table)
|
|
- Merges with existing JSONL file
|
|
- Faster for small changesets
|
|
|
|
**Full export (after ID changes):**
|
|
- Exports all issues from database
|
|
- Rebuilds JSONL from scratch
|
|
- Required after operations like `rename-prefix` that change issue IDs
|
|
- Triggered by `markDirtyAndScheduleFullExport()`
|
|
|
|
## Performance Characteristics
|
|
|
|
- **Debounce window:** Configurable via `getDebounceDuration()` (default 5s)
|
|
- **Channel buffer sizes:**
|
|
- markDirtyCh: 10 events (prevents blocking during bursts)
|
|
- timerFiredCh: 1 event (timer notifications coalesce naturally)
|
|
- flushNowCh: 1 request (synchronous, one at a time)
|
|
- shutdownCh: 1 request (one-shot operation)
|
|
- **Memory overhead:** One goroutine + minimal channel buffers per command execution
|
|
- **Flush latency:** Debounce duration + JSONL write time (typically <100ms for incremental)
|
|
|
|
## Future Improvements
|
|
|
|
Potential enhancements for multi-agent scenarios:
|
|
|
|
1. **Flush coordination across agents:**
|
|
- Shared lock file to prevent concurrent JSONL writes
|
|
- Detection of external JSONL modifications during flush
|
|
|
|
2. **Adaptive debounce window:**
|
|
- Shorter debounce during interactive sessions
|
|
- Longer debounce during batch operations
|
|
|
|
3. **Flush progress tracking:**
|
|
- Expose flush queue depth via status API
|
|
- Allow clients to wait for flush completion
|
|
|
|
4. **Per-issue dirty tracking optimization:**
|
|
- Currently tracks full vs. incremental
|
|
- Could track specific issue IDs for surgical updates
|