Files
beads/docs/pr-752-chaos-testing-review.md
Steve Yegge 1611f16751 refactor: remove unused bd pin/unpin/hook commands (bd-x0zl)
Analysis found these commands are dead code:
- gt never calls `bd pin` - uses `bd update --status=pinned` instead
- Beads.Pin() wrapper exists but is never called
- bd hook functionality duplicated by gt mol status
- Code comment says "pinned field is cosmetic for bd hook visibility"

Removed:
- cmd/bd/pin.go
- cmd/bd/unpin.go
- cmd/bd/hook.go

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-27 16:02:15 -08:00

205 lines
7.7 KiB
Markdown

# PR #752 Chaos Testing Review
**PR**: https://github.com/steveyegge/beads/pull/752
**Author**: jordanhubbard
**Bead**: bd-kx1j
**Status**: Under Review
## Summary
Jordan proposes adding chaos testing and E2E test coverage to beads. The PR:
- Adds 4849 lines, removes 511 lines
- Introduces chaos testing framework (random corruption, disk space exhaustion, NFS-like failures)
- Creates side databases for testing recovery scenarios
- Adds E2E tests tracking documented user scenarios
- Brings code coverage to ~48%
## Key Question from Jordan
> "Is this level of testing something you actually want with the current pace of progress?
> It comes with an implied obligation to update and add to the tests as well as follow
> the CICD feedback in github (very spammy if your tests don't pass!)"
## Files Changed (Major Categories)
### Chaos/Doctor Infrastructure
- `cmd/bd/doctor_repair_chaos_test.go` (378 lines) - Core chaos testing
- `cmd/bd/doctor/fix/database_integrity.go` (116 lines) - DB integrity fixes
- `cmd/bd/doctor/fix/jsonl_integrity.go` (87 lines) - JSONL integrity fixes
- `cmd/bd/doctor/fix/fs.go` (57 lines) - Filesystem fault injection
- `cmd/bd/doctor/fix/sqlite_open.go` (52 lines) - SQLite open handling
- `cmd/bd/doctor/jsonl_integrity.go` (123 lines) - JSONL checks
- `cmd/bd/doctor/git.go` (168 additions) - Git hygiene checks
### Test Coverage Additions
- `internal/storage/memory/memory_more_coverage_test.go` (921 lines) - Memory storage tests
- `cmd/bd/cli_coverage_show_test.go` (426 lines) - CLI show command tests
- `cmd/bd/daemon_autostart_unit_test.go` (331 lines) - Daemon autostart tests
- `internal/rpc/client_gate_shutdown_test.go` (107 lines) - RPC client tests
- Various other test files
### Bug Fixes Discovered During Testing
- `internal/storage/sqlite/migrations/021_migrate_edge_fields.go` - Major migration fix
- `internal/storage/sqlite/migrations/022_drop_edge_columns.go` - Column cleanup
- `internal/storage/sqlite/migrations_template_pinned_regression_test.go` - Regression test
## Tradeoffs
### Costs
1. **Maintenance burden**: Must keep coverage above 48% (or whatever threshold is set)
2. **CI noise**: Failed tests = spam until fixed
3. **Velocity tax**: Every change needs test updates
4. **Complexity**: Chaos testing framework itself needs maintenance
### Benefits
1. **Robustness validation**: Proves beads can recover from corruption
2. **Bug discovery**: Already found migration bugs (021, 022)
3. **Confidence**: If chaos tests pass, beads is more robust than feared
4. **Documentation**: E2E tests document expected user scenarios
5. **Regression prevention**: Future changes caught before release
## Initial Assessment
**Implementation Quality: HIGH**
The chaos testing code is well-structured. Key observations:
### What the Chaos Tests Actually Cover
From `doctor_repair_chaos_test.go`:
1. **Complete DB corruption** - Writes "not a database" garbage, verifies recovery from JSONL
2. **Truncated DB without JSONL** - Tests graceful failure when no recovery source exists
3. **Sidecar file backup** - Ensures -wal, -shm, -journal files are preserved during repair
4. **Repair with running daemon** - Tests recovery while daemon holds locks
5. **JSONL integrity** - Malformed lines, re-export from DB
Each test:
- Uses isolated temp directories
- Builds a fresh `bd` binary for testing
- Uses "side databases" (separate from real data)
- Has proper cleanup
### Bug Fixes Already Discovered
The PR includes fixes for bugs found during testing:
- Migration 021/022: `pinned` and `is_template` columns were being clobbered
- Regression test added to prevent recurrence
### Test Coverage Structure
Tests are organized by build tags:
- `//go:build chaos` - Chaos/corruption tests (run separately)
- `//go:build e2e` - End-to-end CLI tests
- Regular unit tests - No build tag required
This means chaos tests only run when explicitly requested, not on every `go test`.
---
## Deep Analysis (Ultrathink)
### The Core Question
Is the testing worth the ongoing maintenance cost?
### Argument FOR Merging
1. **Beads is more robust than feared**. If Jordan got these tests passing, it means:
- `bd doctor` actually recovers from corruption
- JSONL/DB sync is working correctly
- Migration edge cases are handled
This validates the core design: SQLite + JSONL + git backstop.
2. **Bugs already found**. The migration 021/022 bugs are exactly the kind of subtle
issues that would cause data loss in production. Finding them now is worth something.
3. **Build tag isolation**. Chaos tests won't slow down regular development:
```bash
go test ./... # Normal tests only
go test -tags=chaos ./... # Include chaos tests
go test -tags=e2e ./... # Include E2E tests
```
4. **48% coverage is a floor, not a target**. The PR doesn't enforce maintaining 48%.
Jordan is asking: "Is this level worth it?" We can always add more later, or let
coverage drift if priorities change.
5. **Documentation value**. E2E tests document expected user scenarios. When an AI agent
asks "what should happen when X?", the tests provide executable answers.
### Argument AGAINST Merging
1. **Velocity tax is real**. Every behavior change needs test updates. This is especially
painful during rapid iteration phases.
2. **CI noise**. Failed tests block merges. With multiple agents working, flaky tests
become coordination bottlenecks.
3. **Framework maintenance**. The chaos testing framework itself (side databases, build
tags, test helpers) becomes another thing to maintain.
4. **False confidence**. Tests passing doesn't mean beads is production-ready. It means
tested scenarios work. Edge cases not covered still fail silently.
### The Real Question: What Phase Are We In?
**If beads is still in "rapid prototype" phase**: The testing overhead is premature.
Focus on features, fix crashes as they happen, lean on git backstop.
**If beads is approaching "reliable tool" phase**: Testing is essential. Multi-agent
workflows amplify bugs. Corruption during a 10-agent batch is expensive.
**Current reality**: Beads is being dogfooded seriously. Multiple agents, real work,
real data loss when things break. We're closer to "reliable tool" than "prototype."
### ROI Calculation
**Cost of NOT testing**: When corruption happens:
- Agent loses context (30-60 min recovery)
- Human has to debug (variable, often 15-60 min)
- Trust erosion (hard to quantify)
**Cost of testing**:
- Review this PR (1-2 hours, one time)
- Update tests when behavior changes (5-15 min per change)
- Fix flaky tests when they appear (variable)
If corruption happens once a month, testing ROI is marginal.
If corruption happens weekly (or with each new feature), testing pays for itself.
---
## Recommendation
**MERGE WITH MODIFICATIONS**
### Why Merge
1. The implementation quality is high
2. Bugs already found justify the effort
3. Build tag isolation minimizes velocity impact
4. Beads is past the prototype phase
### Suggested Modifications
1. **No hard coverage threshold in CI**. Let coverage drift naturally. The value is in
the chaos tests catching corruption, not in hitting a percentage.
2. **Chaos tests optional in CI**. Run chaos tests on release branches, not every PR.
This reduces CI noise during active development.
3. **Clear ownership**. Jordan should document how to add new chaos scenarios. Future
contributors need to know when to add vs skip tests.
### Decision Framework for User
If you answer YES to 2+ of these, merge:
- [ ] Are you dogfooding beads for real work?
- [ ] Has corruption caused you to lose time in the last month?
- [ ] Do you expect multiple agents using beads concurrently?
- [ ] Is beads approaching a "v1.0" milestone?
If you answer NO to all, defer the PR until beads stabilizes.