From dc890308bb2e2fbb76d9265b7a10e5da871c0e85 Mon Sep 17 00:00:00 2001 From: Steve Yegge Date: Fri, 26 Dec 2025 17:25:20 -0800 Subject: [PATCH] Add chaos testing PR #752 review (bd-kx1j) --- docs/pr-752-chaos-testing-review.md | 204 ++++++++++++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 docs/pr-752-chaos-testing-review.md diff --git a/docs/pr-752-chaos-testing-review.md b/docs/pr-752-chaos-testing-review.md new file mode 100644 index 00000000..b6315182 --- /dev/null +++ b/docs/pr-752-chaos-testing-review.md @@ -0,0 +1,204 @@ +# PR #752 Chaos Testing Review + +**PR**: https://github.com/steveyegge/beads/pull/752 +**Author**: jordanhubbard +**Bead**: bd-kx1j +**Status**: Under Review + +## Summary + +Jordan proposes adding chaos testing and E2E test coverage to beads. The PR: +- Adds 4849 lines, removes 511 lines +- Introduces chaos testing framework (random corruption, disk space exhaustion, NFS-like failures) +- Creates side databases for testing recovery scenarios +- Adds E2E tests tracking documented user scenarios +- Brings code coverage to ~48% + +## Key Question from Jordan + +> "Is this level of testing something you actually want with the current pace of progress? +> It comes with an implied obligation to update and add to the tests as well as follow +> the CICD feedback in github (very spammy if your tests don't pass!)" + +## Files Changed (Major Categories) + +### Chaos/Doctor Infrastructure +- `cmd/bd/doctor_repair_chaos_test.go` (378 lines) - Core chaos testing +- `cmd/bd/doctor/fix/database_integrity.go` (116 lines) - DB integrity fixes +- `cmd/bd/doctor/fix/jsonl_integrity.go` (87 lines) - JSONL integrity fixes +- `cmd/bd/doctor/fix/fs.go` (57 lines) - Filesystem fault injection +- `cmd/bd/doctor/fix/sqlite_open.go` (52 lines) - SQLite open handling +- `cmd/bd/doctor/jsonl_integrity.go` (123 lines) - JSONL checks +- `cmd/bd/doctor/git.go` (168 additions) - Git hygiene checks + +### Test Coverage Additions +- `internal/storage/memory/memory_more_coverage_test.go` (921 lines) - Memory storage tests +- `cmd/bd/cli_coverage_show_test.go` (426 lines) - CLI show command tests +- `cmd/bd/daemon_autostart_unit_test.go` (331 lines) - Daemon autostart tests +- `internal/rpc/client_gate_shutdown_test.go` (107 lines) - RPC client tests +- Various other test files + +### Bug Fixes Discovered During Testing +- `internal/storage/sqlite/migrations/021_migrate_edge_fields.go` - Major migration fix +- `internal/storage/sqlite/migrations/022_drop_edge_columns.go` - Column cleanup +- `internal/storage/sqlite/migrations_template_pinned_regression_test.go` - Regression test + +## Tradeoffs + +### Costs +1. **Maintenance burden**: Must keep coverage above 48% (or whatever threshold is set) +2. **CI noise**: Failed tests = spam until fixed +3. **Velocity tax**: Every change needs test updates +4. **Complexity**: Chaos testing framework itself needs maintenance + +### Benefits +1. **Robustness validation**: Proves beads can recover from corruption +2. **Bug discovery**: Already found migration bugs (021, 022) +3. **Confidence**: If chaos tests pass, beads is more robust than feared +4. **Documentation**: E2E tests document expected user scenarios +5. **Regression prevention**: Future changes caught before release + +## Initial Assessment + +**Implementation Quality: HIGH** + +The chaos testing code is well-structured. Key observations: + +### What the Chaos Tests Actually Cover + +From `doctor_repair_chaos_test.go`: + +1. **Complete DB corruption** - Writes "not a database" garbage, verifies recovery from JSONL +2. **Truncated DB without JSONL** - Tests graceful failure when no recovery source exists +3. **Sidecar file backup** - Ensures -wal, -shm, -journal files are preserved during repair +4. **Repair with running daemon** - Tests recovery while daemon holds locks +5. **JSONL integrity** - Malformed lines, re-export from DB + +Each test: +- Uses isolated temp directories +- Builds a fresh `bd` binary for testing +- Uses "side databases" (separate from real data) +- Has proper cleanup + +### Bug Fixes Already Discovered + +The PR includes fixes for bugs found during testing: +- Migration 021/022: `pinned` and `is_template` columns were being clobbered +- Regression test added to prevent recurrence + +### Test Coverage Structure + +Tests are organized by build tags: +- `//go:build chaos` - Chaos/corruption tests (run separately) +- `//go:build e2e` - End-to-end CLI tests +- Regular unit tests - No build tag required + +This means chaos tests only run when explicitly requested, not on every `go test`. + +--- + +## Deep Analysis (Ultrathink) + +### The Core Question + +Is the testing worth the ongoing maintenance cost? + +### Argument FOR Merging + +1. **Beads is more robust than feared**. If Jordan got these tests passing, it means: + - `bd doctor` actually recovers from corruption + - JSONL/DB sync is working correctly + - Migration edge cases are handled + + This validates the core design: SQLite + JSONL + git backstop. + +2. **Bugs already found**. The migration 021/022 bugs are exactly the kind of subtle + issues that would cause data loss in production. Finding them now is worth something. + +3. **Build tag isolation**. Chaos tests won't slow down regular development: + ```bash + go test ./... # Normal tests only + go test -tags=chaos ./... # Include chaos tests + go test -tags=e2e ./... # Include E2E tests + ``` + +4. **48% coverage is a floor, not a target**. The PR doesn't enforce maintaining 48%. + Jordan is asking: "Is this level worth it?" We can always add more later, or let + coverage drift if priorities change. + +5. **Documentation value**. E2E tests document expected user scenarios. When an AI agent + asks "what should happen when X?", the tests provide executable answers. + +### Argument AGAINST Merging + +1. **Velocity tax is real**. Every behavior change needs test updates. This is especially + painful during rapid iteration phases. + +2. **CI noise**. Failed tests block merges. With multiple agents working, flaky tests + become coordination bottlenecks. + +3. **Framework maintenance**. The chaos testing framework itself (side databases, build + tags, test helpers) becomes another thing to maintain. + +4. **False confidence**. Tests passing doesn't mean beads is production-ready. It means + tested scenarios work. Edge cases not covered still fail silently. + +### The Real Question: What Phase Are We In? + +**If beads is still in "rapid prototype" phase**: The testing overhead is premature. +Focus on features, fix crashes as they happen, lean on git backstop. + +**If beads is approaching "reliable tool" phase**: Testing is essential. Multi-agent +workflows amplify bugs. Corruption during a 10-agent batch is expensive. + +**Current reality**: Beads is being dogfooded seriously. Multiple agents, real work, +real data loss when things break. We're closer to "reliable tool" than "prototype." + +### ROI Calculation + +**Cost of NOT testing**: When corruption happens: +- Agent loses context (30-60 min recovery) +- Human has to debug (variable, often 15-60 min) +- Trust erosion (hard to quantify) + +**Cost of testing**: +- Review this PR (1-2 hours, one time) +- Update tests when behavior changes (5-15 min per change) +- Fix flaky tests when they appear (variable) + +If corruption happens once a month, testing ROI is marginal. +If corruption happens weekly (or with each new feature), testing pays for itself. + +--- + +## Recommendation + +**MERGE WITH MODIFICATIONS** + +### Why Merge + +1. The implementation quality is high +2. Bugs already found justify the effort +3. Build tag isolation minimizes velocity impact +4. Beads is past the prototype phase + +### Suggested Modifications + +1. **No hard coverage threshold in CI**. Let coverage drift naturally. The value is in + the chaos tests catching corruption, not in hitting a percentage. + +2. **Chaos tests optional in CI**. Run chaos tests on release branches, not every PR. + This reduces CI noise during active development. + +3. **Clear ownership**. Jordan should document how to add new chaos scenarios. Future + contributors need to know when to add vs skip tests. + +### Decision Framework for User + +If you answer YES to 2+ of these, merge: +- [ ] Are you dogfooding beads for real work? +- [ ] Has corruption caused you to lose time in the last month? +- [ ] Do you expect multiple agents using beads concurrently? +- [ ] Is beads approaching a "v1.0" milestone? + +If you answer NO to all, defer the PR until beads stabilizes.