diff --git a/n-way-collision-convergence.md b/n-way-collision-convergence.md
new file mode 100644
index 00000000..553ab70c
--- /dev/null
+++ b/n-way-collision-convergence.md
@@ -0,0 +1,95 @@
+# N-Way Collision Convergence Problem
+
+## Summary
+
+The current collision resolution implementation (`--resolve-collisions`) works correctly for 2-way collisions but **does not converge** for 3-way (and by extension N-way) collisions. This is a critical limitation for parallel worker scenarios where multiple agents file issues simultaneously.
+
+## Test Evidence
+
+`TestThreeCloneCollision` in `beads_twoclone_test.go` demonstrates the problem with 3 clones creating the same issue ID (`test-1`) with different content.
+
+### Observed Behavior
+
+**Sync Order A→B→C:**
+- Clone A: 0 issues (empty database after final pull)
+- Clone B: 2 issues (missing "Issue from clone C")
+- Clone C: 3 issues (has all issues)
+
+**Sync Order C→A→B:**
+- Clone A: 2 issues (missing "Issue from clone B")
+- Clone B: 3 issues (has all issues)
+- Clone C: 0 issues (empty database after final pull)
+
+**Pattern:** The last clone in the sync order gets all issues, the middle clone is missing the last clone's issue, and the first clone ends up with an empty database. This behavior is **100% reproducible** across all test runs.
+
+## Root Cause Analysis
+
+When the third clone pulls and resolves collisions:
+1. It correctly remaps its conflicting issue to a new ID (e.g., `test-1` → `test-3`)
+2. It imports the issues from the other two clones
+3. It pushes the merged state
+
+However, when the first clone pulls this merged state:
+1. The import sees new issues that collide with its local database
+2. The resolution logic doesn't properly handle issues that were already remapped upstream
+3. The database ends up in an inconsistent state (often empty or partially populated)
+
+## Why This Matters
+
+This prevents reliable N-way parallel worker scenarios:
+- Multiple AI agents filing issues simultaneously
+- Distributed teams working on different clones
+- CI/CD systems creating issues in parallel builds
+
+**Current workaround:** Limit usage to 2 workers or create issues sequentially; only those setups sync reliably.
+
+## What Needs To Be Fixed
+
+### 1. Import Logic Enhancement
+The `--resolve-collisions` import needs to:
+- Detect when incoming issues were already remapped upstream
+- Preserve the remapping chain (track `test-1` → `test-2` → `test-3`)
+- Not re-remap already-remapped issues
+
+### 2. Convergence Algorithm
+Implement a proper convergence algorithm that ensures:
+- All clones eventually have the same complete set of issues
+- Idempotent imports (importing the same JSONL multiple times is safe)
+- Transitive collision resolution (if A remaps to B, and B exists, handle gracefully)
+
+### 3. Test Requirements
+The fix should make `TestThreeCloneCollision` pass without skipping:
+- All three clones must have all three issues (by title)
+- Content must match across all clones (ignoring timestamps and specific ID assignments)
+- Must work for both sync orders (A→B→C and C→A→B)
+
+### 4. Extend to N-Way
+Once 3-way works, verify it generalizes to N workers:
+- Test with 5+ clones
+- Test with different sync order permutations
+- Ensure convergence time is bounded
+
+## Files To Examine
+
+- **`beads_twoclone_test.go`**: Contains `TestThreeCloneCollision` that reproduces the issue
+- **`cmd/bd/import.go`**: Import logic with `--resolve-collisions` flag
+- **`internal/storage/sqlite/sqlite.go`**: Database operations for collision detection
+- **`cmd/bd/sync.go`**: Sync workflow that calls import/export
+
+## Success Criteria
+
+1. `TestThreeCloneCollision` passes without skipping
+2. All clones converge to identical content after final pull
+3. No data loss (all issues present in all clones)
+4. ID assignments can be non-deterministic, but content must match
+5. Works for N workers (extend test to 5+ clones)
+
+## Current Test Status
+
+```bash
+go test -v -run TestThreeCloneCollision
+# Both subtests SKIP with message:
+# "KNOWN LIMITATION: 3-way collisions may require additional resolution logic"
+```
+
+The test is designed to skip when convergence fails, so it won't break CI, but it documents the limitation clearly.
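
The convergence condition in the success criteria (identical content in every clone, with ID assignments allowed to differ) can be made concrete with a small check. The following is a minimal sketch, not code from the beads repository: the `Issue` type, its fields, and the `contentKeys` and `converged` helpers are hypothetical names introduced for illustration, and the description strings in `main` are made-up sample data. Only the issue titles and per-clone counts mirror the A→B→C results reported in the document.

```go
package main

import (
	"fmt"
	"sort"
)

// Issue is a hypothetical, simplified stand-in for an issue record.
// The real beads type is not shown in this document.
type Issue struct {
	ID          string
	Title       string
	Description string
}

// contentKeys returns a sorted list of content fingerprints for one clone.
// IDs are deliberately left out, because collision remapping may assign
// different IDs to the same issue in different clones.
func contentKeys(issues []Issue) []string {
	keys := make([]string, 0, len(issues))
	for _, is := range issues {
		keys = append(keys, is.Title+"\x00"+is.Description)
	}
	sort.Strings(keys)
	return keys
}

// converged reports whether every clone holds the same multiset of
// issue content. An empty set of clones counts as converged.
func converged(clones map[string][]Issue) bool {
	var ref []string
	first := true
	for _, issues := range clones {
		keys := contentKeys(issues)
		if first {
			ref, first = keys, false
			continue
		}
		if len(keys) != len(ref) {
			return false
		}
		for i := range keys {
			if keys[i] != ref[i] {
				return false
			}
		}
	}
	return true
}

func main() {
	a := Issue{ID: "test-1", Title: "Issue from clone A", Description: "a"}
	b := Issue{ID: "test-2", Title: "Issue from clone B", Description: "b"}
	c := Issue{ID: "test-3", Title: "Issue from clone C", Description: "c"}

	// Shape of the reported A→B→C outcome: A is empty, B lacks C's issue.
	observed := map[string][]Issue{
		"A": {},
		"B": {a, b},
		"C": {a, b, c},
	}
	// Desired outcome: same content everywhere, IDs free to differ.
	cRemapped := Issue{ID: "test-9", Title: c.Title, Description: c.Description}
	desired := map[string][]Issue{
		"A": {a, b, c},
		"B": {b, cRemapped, a},
		"C": {c, a, b},
	}

	fmt.Println("observed converged:", converged(observed)) // false
	fmt.Println("desired converged:", converged(desired))   // true
}
```

Comparing sorted fingerprints rather than IDs keeps the check aligned with criterion 4: two clones that disagree only on which ID an issue received still count as converged. A test for 5+ clones could reuse the same helper after each sync-order permutation.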