Commit Graph

14 Commits

Author SHA1 Message Date
Steve Yegge
34cf361b2b Add telemetry and observability to daemon (bd-153)
Implement comprehensive metrics collection for the daemon with zero-overhead design:

Features:
- Request metrics: counts, latency percentiles (p50, p95, p99), error rates
- Cache metrics: hit/miss ratios, eviction counts, database connections
- Connection metrics: total, active, rejected connections
- System metrics: memory usage, goroutine count, uptime

Implementation:
- New internal/rpc/metrics.go with Metrics collector
- OpMetrics RPC operation for programmatic access
- 'bd daemon --metrics' command (human-readable and JSON output)
- Lock-free atomic operations for cache/connection metrics
- Copy-and-compute pattern in Snapshot to minimize lock contention
- Deferred metrics recording ensures all requests are tracked

Improvements from code review:
- JSON types use float64 for ms/seconds (not time.Duration)
- Snapshot copies data under short lock, computes outside
- Union of operations from counts and errors maps
- Defensive clamping in percentile calculation
- Defer pattern ensures metrics recorded even on early returns

Documentation updated in README.md with usage examples.

Closes bd-153

Amp-Thread-ID: https://ampcode.com/threads/T-20213187-65c7-47f7-ba21-5234c9e52e26
Co-authored-by: Amp <amp@ampcode.com>
2025-10-19 15:55:55 -07:00
Steve Yegge
c71da4c267 Add resource limits to daemon (bd-152)
Implemented connection limiting, request timeouts, and memory pressure detection:

- Connection limiting with semaphore pattern (default 100 max connections)
- Request timeout enforcement on read/write (default 30s)
- Memory pressure detection with aggressive cache eviction (default 500MB threshold)
- Configurable via environment variables:
  - BEADS_DAEMON_MAX_CONNS
  - BEADS_DAEMON_REQUEST_TIMEOUT
  - BEADS_DAEMON_MEMORY_THRESHOLD_MB
- Health endpoint now exposes active/max connections and memory usage
- Comprehensive test coverage for all limits

This prevents resource exhaustion under heavy load or attack scenarios.

Amp-Thread-ID: https://ampcode.com/threads/T-44d1817a-3709-4f1d-a27a-78bb2fa4d3dc
Co-authored-by: Amp <amp@ampcode.com>
2025-10-19 13:22:23 -07:00
Steve Yegge
5aa7658433 Fix race condition in TestSocketCleanup by protecting listener access with mutex
Fixes bd-160

The race was between Start() writing s.listener and Stop() reading it.
Now all listener access is protected by the server mutex:
- Start() stores listener under lock after creation
- Accept loop reads listener under RLock
- Stop() closes listener under lock

All RPC tests now pass with -race flag.
2025-10-19 09:14:37 -07:00
Steve Yegge
22daa12665 Add daemon fallback visibility and version compatibility checks
Implemented bd-150: Improve daemon fallback visibility and user feedback
- Added DaemonStatus struct to track connection state
- Enhanced BD_DEBUG logging with detailed diagnostics and timing
- Added BD_VERBOSE mode with actionable warnings when falling back
- Implemented health checks before using daemon
- Clear fallback reasons: connect_failed, health_failed, auto_start_disabled, auto_start_failed, flag_no_daemon
- Updated documentation

Implemented bd-151: Add version compatibility checks for daemon RPC protocol
- Added ClientVersion field to RPC Request struct
- Client sends version (0.9.10) in all requests
- Server validates version compatibility using semver:
  - Major version must match
  - Daemon >= client for backward compatibility
  - Clear error messages with directional hints (upgrade daemon vs upgrade client)
- Added ClientVersion and Compatible fields to HealthResponse
- Implemented 'bd version --daemon' command to check compatibility
- Fixed batch operations to propagate ClientVersion for proper checks
- Updated documentation with version compatibility section

Code review improvements:
- Propagate ClientVersion in batch sub-requests
- Directional error messages based on which side is older
- Made ServerVersion a var for future unification

Amp-Thread-ID: https://ampcode.com/threads/T-b5fe36b8-c065-44a9-a55b-582573671609
Co-authored-by: Amp <amp@ampcode.com>
2025-10-19 08:04:48 -07:00
Steve Yegge
bb13f4634b Fix daemon crash recovery race conditions (bd-147)
Improvements based on oracle code review:
- Move socket cleanup AFTER lock acquisition (prevents unlinking live sockets)
- Add PID liveness check before removing stale socket
- Add stale lock detection with retry mechanism
- Tighten directory permissions to 0700 for security
- Improve socket readiness probing with shorter timeouts
- Make removeOldSocket() ignore ENOENT errors

Fixes race condition where socket could be removed during daemon startup window,
potentially orphaning a running daemon process.

Amp-Thread-ID: https://ampcode.com/threads/T-63542c60-b5b9-4a34-9f22-415d9d7e8223
Co-authored-by: Amp <amp@ampcode.com>
2025-10-18 13:59:06 -07:00
Steve Yegge
9e2ee1889f Add daemon health check endpoint (bd-146)
- Add OpHealth RPC operation to protocol
- Implement handleHealth() with DB ping and 1s timeout
- Returns status (healthy/degraded/unhealthy), uptime, cache metrics
- Update TryConnect() to use health check instead of ping
- Add 'bd daemon --health' CLI command with JSON output
- Track cache hits/misses for metrics
- Unhealthy daemon triggers automatic fallback to direct mode
- Health check completes in <2 seconds

Amp-Thread-ID: https://ampcode.com/threads/T-1a4889f3-77cf-433a-a704-e1c383929f48
Co-authored-by: Amp <amp@ampcode.com>
2025-10-18 13:41:06 -07:00
Steve Yegge
f987722f96 Code review fixes for cache eviction (bd-145)
Oracle-recommended improvements:

**Thread Safety:**
- Fix lastAccess race: updates now under Write lock
- Make Stop() idempotent with sync.Once
- Close stores synchronously (not in goroutine)

**Performance:**
- Replace O(n²) sort with sort.Slice (O(n log n))
- Enforce LRU immediately on insert (prevents FD spikes)

**Correctness:**
- Canonicalize cache key to repo root (not cwd)
  - Prevents duplicate connections for same repo
  - Multiple subdirs → single cache entry
- Validate env vars: TTL <= 0 falls back to default

**Tests (6 new edge cases):**
- Canonical key behavior across subdirectories
- Immediate LRU enforcement without periodic cleanup
- Invalid TTL handling
- Re-open after eviction
- Stop idempotency
- All tests pass with -race flag

This addresses potential data races, resource spikes, and duplicate
connections identified during code review.
2025-10-18 13:24:59 -07:00
Steve Yegge
259e994522 Add storage cache eviction policy to daemon (bd-145)
Implemented TTL-based and LRU cache eviction for daemon storage connections:

- Add StorageCacheEntry with lastAccess timestamp tracking
- Cleanup goroutine runs every 5 minutes to evict stale entries
- TTL-based eviction: remove entries idle >30min (configurable)
- LRU eviction: enforce max cache size (default: 50 repos)
- Configurable via BEADS_DAEMON_MAX_CACHE_SIZE and BEADS_DAEMON_CACHE_TTL
- Proper cleanup on server shutdown
- Update lastAccess on cache hits
- Comprehensive tests for eviction logic

Fixes memory leaks and file descriptor exhaustion for multi-repo users.

Amp-Thread-ID: https://ampcode.com/threads/T-1148d8b3-b8a8-45fc-af9c-b5be14c4834d
Co-authored-by: Amp <amp@ampcode.com>
2025-10-18 13:17:07 -07:00
Steve Yegge
fb9b5864af feat: Add bd repos multi-repo commands and fix bd ready for in_progress issues
- Add 'bd repos' command for multi-repository management (bd-123)
  - bd repos list: show all cached repositories
  - bd repos ready: aggregate ready work across repos
  - bd repos stats: combined statistics across repos
  - bd repos clear-cache: clear repository cache
  - Requires global daemon (bd daemon --global)

- Fix bd ready to show in_progress issues (bd-165)
  - bd ready now shows both 'open' and 'in_progress' issues with no blockers
  - Allows epics/tasks ready to close to appear in ready work
  - Critical P0 bug fix for workflow

- Apply code review improvements to repos implementation
  - Use strongly typed RPC responses (remove interface{})
  - Fix clear-cache lock handling (close connections outside lock)
  - Add error collection for per-repo failures
  - Add context timeouts (1-2s) to prevent hangs
  - Add lock strategy comments

- Update documentation (README.md, AGENTS.md)
- Add comprehensive tests for both features

Amp-Thread-ID: https://ampcode.com/threads/T-1de989a1-1890-492c-9847-a34144259e0f
Co-authored-by: Amp <amp@ampcode.com>
2025-10-18 00:37:27 -07:00
Steve Yegge
0dac4b9003 Fix global daemon implementation and improve security
- bd-159: Global daemon now runs in routing mode without opening DB
- bd-158: Set socket permissions to 0600 for security
- bd-160: Reject --auto-commit/--auto-push with --global
- bd-157: Verified stale socket cleanup (already working)
- bd-56: Closed as won't-do (cycle prevention is better)
- bd-73: Multi-repo support complete
2025-10-17 23:17:22 -07:00
Steve Yegge
b40de9bc41 Implement daemon RPC with per-request context routing (bd-115)
- Added per-request storage routing in daemon server
  - Server now supports Cwd field in requests for database discovery
  - Tree-walking to find .beads/*.db from any working directory
  - Storage caching for performance across requests

- Created Python daemon client (bd_daemon_client.py)
  - RPC over Unix socket communication
  - Implements full BdClientBase interface
  - Auto-discovery of daemon socket from working directory

- Refactored bd_client.py with abstract interface
  - BdClientBase abstract class for common interface
  - BdCliClient for CLI-based operations (renamed from BdClient)
  - create_bd_client() factory with daemon/CLI fallback
  - Backwards-compatible BdClient alias

Next: Update MCP server to use daemon client when available
2025-10-17 16:28:29 -07:00
Steve Yegge
15b60b4ad0 Phase 4: Atomic operations and stress testing (bd-114, bd-110)
Completes daemon architecture implementation:

Features:
- Batch/transaction API (OpBatch) for multi-step atomic operations
- Request timeout and cancellation support (30s default, configurable)
- Comprehensive stress tests (4-10 concurrent agents, 800-1000 ops)
- Performance benchmarks (daemon 2x faster than direct mode)

Results:
- Zero ID collisions across 1000+ concurrent creates
- All acceptance criteria validated for bd-110
- Create: 2.4ms (daemon) vs 4.7ms (direct)
- Update/List: similar 2x improvement

Tests Added:
- TestStressConcurrentAgents (8 agents, 800 creates)
- TestStressBatchOperations (4 agents, 400 batch ops)
- TestStressMixedOperations (6 agents, mixed read/write)
- TestStressNoUniqueConstraintViolations (10 agents, 1000 creates)
- BenchmarkDaemonCreate/Update/List/Latency
- Fixed flaky TestConcurrentRequests (shared client issue)

Files:
- internal/rpc/protocol.go - Added OpBatch, BatchArgs, BatchResponse
- internal/rpc/server.go - Implemented handleBatch with stop-on-failure
- internal/rpc/client.go - Added SetTimeout and Batch methods
- internal/rpc/stress_test.go - All stress tests
- internal/rpc/bench_test.go - Performance benchmarks
- DAEMON_STRESS_TEST.md - Complete documentation

Closes bd-114, bd-110

Amp-Thread-ID: https://ampcode.com/threads/T-1c07c140-0420-49fe-add1-b0b83b1bdff5
Co-authored-by: Amp <amp@ampcode.com>
2025-10-16 23:46:12 -07:00
Steve Yegge
5c0fac6e17 feat: Phase 1 RPC protocol infrastructure for daemon architecture (bd-111)
Implemented Unix socket RPC foundation to enable daemon-based concurrent access:

New files:
- internal/rpc/protocol.go: Request/Response types with 13 operations
- internal/rpc/server.go: Unix socket server with storage adapter
- internal/rpc/client.go: Client with auto-detection and typed methods
- internal/rpc/rpc_test.go: Integration tests

Features:
- JSON-based protocol over Unix sockets
- Adapter pattern for context/actor propagation to storage API
- Ping/health checks for daemon detection
- All core operations: create, update, close, list, show, ready, stats, deps, labels
- Graceful socket cleanup and signal handling
- Concurrent request support

Tests: 49.3% coverage, all passing

Related issues:
- bd-110: Daemon architecture epic
- bd-111: Phase 1 (completed)
- bd-112: Phase 2 (client auto-detection)
- bd-113: Phase 3 (daemon command)
- bd-114: Phase 4 (atomic operations)

Amp-Thread-ID: https://ampcode.com/threads/T-796c62e6-93b6-41c7-9cb5-8acc4a35ba9a
Co-authored-by: Amp <amp@ampcode.com>
2025-10-16 22:49:19 -07:00
Steve Yegge
872f203c57 Add RPC infrastructure and updated database
- RPC Phase 1: Protocol, server, client implementation
- Updated renumber.go with proper text reference updates (3-phase approach)
- Clean database exported: 344 issues (bd-1 to bd-344)
- Added DAEMON_DESIGN.md documentation
- Updated go.mod/go.sum for RPC dependencies

Amp-Thread-ID: https://ampcode.com/threads/T-456af77c-8b7f-4004-9027-c37b95e10ea5
Co-authored-by: Amp <amp@ampcode.com>
2025-10-16 20:36:23 -07:00