
BD Daemon Architecture for Concurrent Access

Problem Statement

Multiple AI agents running concurrently (via beads-mcp) cause:

  • SQLite write corruption: stuck ID counter, UNIQUE constraint failures
  • Git index.lock contention: all agents auto-export → all try to commit simultaneously
  • Data loss risk: concurrent SQLite writers with no coordination
  • Poor performance: redundant exports, 4x git operations for the same changes

Current Architecture (Broken)

Agent 1 → beads-mcp 1 → bd CLI → SQLite DB (direct write)
Agent 2 → beads-mcp 2 → bd CLI → SQLite DB (direct write)  ← RACE CONDITIONS
Agent 3 → beads-mcp 3 → bd CLI → SQLite DB (direct write)
Agent 4 → beads-mcp 4 → bd CLI → SQLite DB (direct write)
                                     ↓
                                 4x concurrent git export/commit

Proposed Architecture (Daemon-Mediated)

Agent 1 → beads-mcp 1 → bd client ──┐
Agent 2 → beads-mcp 2 → bd client ──┼──> bd daemon → SQLite DB
Agent 3 → beads-mcp 3 → bd client ──┤   (single writer)     ↓
Agent 4 → beads-mcp 4 → bd client ──┘                   git export
                                                        (batched,
                                                         serialized)

Key Changes

  1. bd daemon becomes mandatory for multi-agent scenarios
  2. All bd commands become RPC clients when daemon is running
  3. Daemon owns SQLite - single writer, no races
  4. Daemon batches git operations - one export cycle per interval
  5. Unix socket IPC - simple, fast, local-only

Implementation Plan

Phase 1: RPC Infrastructure

New files:

  • internal/rpc/protocol.go - Request/response types
  • internal/rpc/server.go - Unix socket server in daemon
  • internal/rpc/client.go - Client detection & dispatch

Operations to support:

type Request struct {
    Operation string          // "create", "update", "list", "close", etc.
    Args      json.RawMessage // Operation-specific args
}

type Response struct {
    Success bool
    Data    json.RawMessage // Operation result
    Error   string
}

Socket location: ~/.beads/bd.sock or .beads/bd.sock (per-repo)
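
As a concrete framing (an assumption, not yet specified), the client could write one JSON-encoded Request per line over the socket and read one Response back. A minimal client-side sketch using encoding/json and net:

func sendRequest(socketPath string, req Request) (*Response, error) {
    // Connect to the daemon's Unix socket.
    conn, err := net.Dial("unix", socketPath)
    if err != nil {
        return nil, err
    }
    defer conn.Close()

    // json.Encoder writes the request followed by a newline (newline-delimited framing).
    if err := json.NewEncoder(conn).Encode(req); err != nil {
        return nil, err
    }

    // Read the single JSON response for this request.
    var resp Response
    if err := json.NewDecoder(conn).Decode(&resp); err != nil {
        return nil, err
    }
    return &resp, nil
}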

Phase 2: Client Auto-Detection

bd command behavior:

  1. Check whether the daemon socket exists and is responsive
  2. If yes: send the RPC request and print the response
  3. If no: run the command directly (backward compatible)

Example:

func main() {
    if client := rpc.TryConnect(); client != nil {
        // Daemon is running - use RPC
        resp := client.Execute(cmd, args)
        fmt.Println(resp)
        return
    }

    // No daemon - run directly (current behavior)
    executeLocally(cmd, args)
}
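
rpc.TryConnect is not specified yet; one plausible sketch (the socket path, dial timeout, and Client shape below are assumptions) is:

type Client struct {
    conn net.Conn
}

func TryConnect() *Client {
    // Per-repo socket; see the socket-location discussion above.
    sock := filepath.Join(".beads", "bd.sock")
    if _, err := os.Stat(sock); err != nil {
        return nil // no socket, no daemon
    }
    // Treat the daemon as available only if the socket actually accepts connections.
    conn, err := net.DialTimeout("unix", sock, 500*time.Millisecond)
    if err != nil {
        return nil // stale socket or unresponsive daemon
    }
    return &Client{conn: conn}
}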

Phase 3: Daemon SQLite Ownership

Daemon startup:

  1. Open SQLite connection (exclusive)
  2. Start RPC server on Unix socket
  3. Start git sync loop (existing functionality)
  4. Process RPC requests serially

Git operations:

  • Batch exports every 5 seconds (not per-operation)
  • Single commit with all changes
  • Prevent concurrent git operations entirely
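
A sketch of that sync loop, assuming d.dirty is an atomic.Bool set by write handlers and exportAndCommit is an illustrative helper name:

func (d *Daemon) gitSyncLoop(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // Export only if RPC handlers changed the DB since the last cycle;
            // a single commit then covers everything from this interval.
            if d.dirty.Swap(false) {
                d.exportAndCommit()
            }
        }
    }
}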

Phase 4: Atomic Operations

ID generation:

// In daemon process only
func (d *Daemon) generateID(prefix string) (string, error) {
    d.mu.Lock()
    defer d.mu.Unlock()

    // No races - daemon is single writer
    return d.storage.NextID(prefix)
}

Transaction support:

// RPC can request multi-operation transactions
type BatchRequest struct {
    Operations []Request
    Atomic     bool // All-or-nothing
}
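
A sketch of how the daemon could execute such a batch with stop-on-failure semantics; dispatch is a placeholder for routing to the normal per-operation handlers:

func (d *Daemon) handleBatch(req BatchRequest) []Response {
    // Batches run under the same lock as single operations, so the whole
    // batch is serialized against other writers.
    d.mu.Lock()
    defer d.mu.Unlock()

    results := make([]Response, 0, len(req.Operations))
    for _, op := range req.Operations {
        resp := d.dispatch(op)
        results = append(results, resp)
        if req.Atomic && !resp.Success {
            // Stop on the first failure; remaining operations are skipped.
            break
        }
    }
    return results
}

Full all-or-nothing semantics would additionally require wrapping the loop in a SQLite transaction and rolling back on the first failure.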

Migration Strategy

Stage 1: Opt-In (v0.10.0)

  • Daemon RPC code implemented
  • bd commands detect daemon, fall back to direct
  • Users can bd daemon start for multi-agent scenarios
  • No breaking changes - direct mode still works
  • Document that multi-agent workflows require the daemon
  • MCP server README says "start daemon for concurrent agents"
  • Detection warning: "Multiple bd processes detected, consider using daemon"

Stage 3: Required for Multi-Agent (v1.0.0)

  • bd detects concurrent access patterns
  • Refuses to run without daemon if lock contention detected
  • Error: "Concurrent access detected. Start daemon: bd daemon start"

Benefits

  • No SQLite corruption - single writer
  • No git lock contention - batched, serialized operations
  • Atomic ID generation - no counter corruption
  • Better performance - fewer redundant exports
  • Backward compatible - graceful fallback to direct mode
  • Simple protocol - Unix sockets, JSON payloads

Trade-offs

  ⚠️ Daemon must be running for multi-agent workflows
  ⚠️ One more process to manage (bd daemon start/stop)
  ⚠️ Complexity - RPC layer adds code & maintenance
  ⚠️ Single point of failure - if daemon crashes, all agents blocked

Open Questions

  1. Per-repo or global daemon?

    • Per-repo: .beads/bd.sock (supports multiple repos)
    • Global: ~/.beads/bd.sock (simpler, but only one repo at a time)
    • Recommendation: Per-repo, use --db path to determine socket location (see the sketch after this list)
  2. Daemon crash recovery?

    • Client auto-starts daemon if socket missing?
    • Or require manual bd daemon start?
    • Recommendation: Auto-start with exponential backoff
  3. Concurrent read optimization?

    • Reads could bypass daemon (SQLite supports concurrent readers)
    • But complex: need to detect read-only vs read-write commands
    • Recommendation: Start simple, all ops through daemon
  4. Transaction API for clients?

    • MCP tools often do multi-step operations
    • Would benefit from BEGIN/COMMIT style transactions
    • Recommendation: Phase 4 feature, not MVP
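
For question 1, deriving a per-repo socket path from the database path could be as simple as the following sketch (assuming path/filepath):

func socketPathForDB(dbPath string) string {
    // e.g. .beads/beads.db -> .beads/bd.sock, so each repo gets its own daemon.
    return filepath.Join(filepath.Dir(dbPath), "bd.sock")
}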

Success Metrics

  • 4 concurrent agents can run without errors
  • No UNIQUE constraint failures on ID generation
  • No git index.lock errors
  • SQLite counter stays in sync with actual issues
  • Graceful fallback when daemon not running

Related Issues

  • bd-668: Git index.lock contention (root cause)
  • bd-670: ID generation retry on UNIQUE constraint
  • bd-654: Concurrent tmp file collisions (already fixed)
  • bd-477: Phase 1 daemon command (git sync only - now expanded)
  • bd-279: Tests for concurrent scenarios
  • bd-271: Epic for multi-device support

Next Steps

  1. Ultrathink: Validate this design with user
  2. File epic: Create bd-??? for daemon RPC architecture
  3. Break down work: Phase 1 subtasks (protocol, server, client)
  4. Start implementation: Begin with protocol.go

Phase 4: Atomic Operations and Stress Testing (COMPLETED - bd-114)

Status: Complete

Implementation:

  • Batch/transaction API for multi-step operations
  • Request timeout and cancellation support
  • Connection management optimization
  • Comprehensive stress tests (4-10 concurrent agents)
  • Performance benchmarks vs direct mode

Results:

  • Daemon mode is roughly 2x faster than direct mode (create: 2.4ms vs 4.7ms; update and list show similar gains)
  • Zero ID collisions in 1000+ concurrent creates
  • All acceptance criteria validated
  • Full test coverage with stress tests

Documentation: See DAEMON_STRESS_TEST.md for details.

Files Added:

  • internal/rpc/stress_test.go - Stress tests with 4-10 agents
  • internal/rpc/bench_test.go - Performance benchmarks
  • DAEMON_STRESS_TEST.md - Full documentation

Files Modified:

  • internal/rpc/protocol.go - Added OpBatch and batch types
  • internal/rpc/server.go - Implemented batch handler
  • internal/rpc/client.go - Added timeout support and Batch method
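
For reference, a hypothetical call site for the new client methods; the exact signatures of SetTimeout and Batch are not reproduced here, so treat this as an illustration rather than the committed API:

func createPair(c *rpc.Client) error {
    // Tighten the per-request deadline below the 30s default.
    c.SetTimeout(10 * time.Second)

    // Two creates submitted as one batch; with the atomic flag set, the
    // daemon stops at the first failure (see Phase 4 above).
    ops := []rpc.Request{
        {Operation: "create", Args: json.RawMessage(`{"title":"parent task"}`)},
        {Operation: "create", Args: json.RawMessage(`{"title":"child task"}`)},
    }
    _, err := c.Batch(ops, true)
    return err
}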