beads/docs/pr-752-chaos-testing-review.md

# PR #752 Chaos Testing Review

**PR**: https://github.com/steveyegge/beads/pull/752
**Author**: jordanhubbard
**Bead**: bd-kx1j
**Status**: Under Review

## Summary

Jordan proposes adding chaos testing and E2E test coverage to beads. The PR:
- Adds 4849 lines, removes 511 lines
- Introduces chaos testing framework (random corruption, disk space exhaustion, NFS-like failures)
- Creates side databases for testing recovery scenarios
- Adds E2E tests tracking documented user scenarios
- Brings code coverage to ~48%

## Key Question from Jordan

> "Is this level of testing something you actually want with the current pace of progress?
> It comes with an implied obligation to update and add to the tests as well as follow
> the CICD feedback in github (very spammy if your tests don't pass!)"

## Files Changed (Major Categories)

### Chaos/Doctor Infrastructure
- `cmd/bd/doctor_repair_chaos_test.go` (378 lines) - Core chaos testing
- `cmd/bd/doctor/fix/database_integrity.go` (116 lines) - DB integrity fixes
- `cmd/bd/doctor/fix/jsonl_integrity.go` (87 lines) - JSONL integrity fixes
- `cmd/bd/doctor/fix/fs.go` (57 lines) - Filesystem fault injection
- `cmd/bd/doctor/fix/sqlite_open.go` (52 lines) - SQLite open handling
- `cmd/bd/doctor/jsonl_integrity.go` (123 lines) - JSONL checks
- `cmd/bd/doctor/git.go` (168 additions) - Git hygiene checks

### Test Coverage Additions
- `internal/storage/memory/memory_more_coverage_test.go` (921 lines) - Memory storage tests
- `cmd/bd/cli_coverage_show_test.go` (426 lines) - CLI show command tests
- `cmd/bd/daemon_autostart_unit_test.go` (331 lines) - Daemon autostart tests
- `internal/rpc/client_gate_shutdown_test.go` (107 lines) - RPC client tests
- Various other test files

### Bug Fixes Discovered During Testing
- `internal/storage/sqlite/migrations/021_migrate_edge_fields.go` - Major migration fix
- `internal/storage/sqlite/migrations/022_drop_edge_columns.go` - Column cleanup
- `internal/storage/sqlite/migrations_template_pinned_regression_test.go` - Regression test

## Tradeoffs

### Costs
1. **Maintenance burden**: Must keep coverage above 48% (or whatever threshold is set)
2. **CI noise**: Failed tests = spam until fixed
3. **Velocity tax**: Every change needs test updates
4. **Complexity**: Chaos testing framework itself needs maintenance

### Benefits
1. **Robustness validation**: Proves beads can recover from corruption
2. **Bug discovery**: Already found migration bugs (021, 022)
3. **Confidence**: If chaos tests pass, beads is more robust than feared
4. **Documentation**: E2E tests document expected user scenarios
5. **Regression prevention**: Future changes caught before release

## Initial Assessment

**Implementation Quality: HIGH**

The chaos testing code is well-structured. Key observations:

### What the Chaos Tests Actually Cover

From `doctor_repair_chaos_test.go`:

1. **Complete DB corruption** - Writes "not a database" garbage, verifies recovery from JSONL
2. **Truncated DB without JSONL** - Tests graceful failure when no recovery source exists
3. **Sidecar file backup** - Ensures -wal, -shm, -journal files are preserved during repair
4. **Repair with running daemon** - Tests recovery while daemon holds locks
5. **JSONL integrity** - Malformed lines, re-export from DB

Each test:
- Uses isolated temp directories
- Builds a fresh `bd` binary for testing
- Uses "side databases" (separate from real data)
- Has proper cleanup

### Bug Fixes Already Discovered

The PR includes fixes for bugs found during testing:
- Migration 021/022: `pinned` and `is_template` columns were being clobbered
- Regression test added to prevent recurrence

### Test Coverage Structure

Tests are organized by build tags:
- `//go:build chaos` - Chaos/corruption tests (run separately)
- `//go:build e2e` - End-to-end CLI tests
- Regular unit tests - No build tag required

This means chaos tests only run when explicitly requested, not on every `go test`.

---

## Deep Analysis (Ultrathink)

### The Core Question

Is the testing worth the ongoing maintenance cost?

### Argument FOR Merging

1. **Beads is more robust than feared**. If Jordan got these tests passing, it means:
   - `bd doctor` actually recovers from corruption
   - JSONL/DB sync is working correctly
   - Migration edge cases are handled

   This validates the core design: SQLite + JSONL + git backstop.

2. **Bugs already found**. The migration 021/022 bugs are exactly the kind of subtle
   issues that would cause data loss in production. Finding them now is worth something.

3. **Build tag isolation**. Chaos tests won't slow down regular development:
   ```bash
   go test ./...                    # Normal tests only
   go test -tags=chaos ./...        # Include chaos tests
   go test -tags=e2e ./...          # Include E2E tests
   ```

4. **48% coverage is a floor, not a target**. The PR doesn't enforce maintaining 48%.
   Jordan is asking: "Is this level worth it?" We can always add more later, or let
   coverage drift if priorities change.

5. **Documentation value**. E2E tests document expected user scenarios. When an AI agent
   asks "what should happen when X?", the tests provide executable answers.

### Argument AGAINST Merging

1. **Velocity tax is real**. Every behavior change needs test updates. This is especially
   painful during rapid iteration phases.

2. **CI noise**. Failed tests block merges. With multiple agents working, flaky tests
   become coordination bottlenecks.

3. **Framework maintenance**. The chaos testing framework itself (side databases, build
   tags, test helpers) becomes another thing to maintain.

4. **False confidence**. Tests passing doesn't mean beads is production-ready. It means
   tested scenarios work. Edge cases not covered still fail silently.

### The Real Question: What Phase Are We In?

**If beads is still in "rapid prototype" phase**: The testing overhead is premature.
Focus on features, fix crashes as they happen, lean on git backstop.

**If beads is approaching "reliable tool" phase**: Testing is essential. Multi-agent
workflows amplify bugs. Corruption during a 10-agent batch is expensive.

**Current reality**: Beads is being dogfooded seriously. Multiple agents, real work,
real data loss when things break. We're closer to "reliable tool" than "prototype."

### ROI Calculation

**Cost of NOT testing**: When corruption happens:
- Agent loses context (30-60 min recovery)
- Human has to debug (variable, often 15-60 min)
- Trust erosion (hard to quantify)

**Cost of testing**:
- Review this PR (1-2 hours, one time)
- Update tests when behavior changes (5-15 min per change)
- Fix flaky tests when they appear (variable)

If corruption happens once a month, testing ROI is marginal.
If corruption happens weekly (or with each new feature), testing pays for itself.

---

## Recommendation

**MERGE WITH MODIFICATIONS**

### Why Merge

1. The implementation quality is high
2. Bugs already found justify the effort
3. Build tag isolation minimizes velocity impact
4. Beads is past the prototype phase

### Suggested Modifications

1. **No hard coverage threshold in CI**. Let coverage drift naturally. The value is in
   the chaos tests catching corruption, not in hitting a percentage.

2. **Chaos tests optional in CI**. Run chaos tests on release branches, not every PR.
   This reduces CI noise during active development.

3. **Clear ownership**. Jordan should document how to add new chaos scenarios. Future
   contributors need to know when to add vs skip tests.

### Decision Framework for User

If you answer YES to 2+ of these, merge:
- [ ] Are you dogfooding beads for real work?
- [ ] Has corruption caused you to lose time in the last month?
- [ ] Do you expect multiple agents using beads concurrently?
- [ ] Is beads approaching a "v1.0" milestone?

If you answer NO to all, defer the PR until beads stabilizes.