Add bd duplicates command for automated duplicate detection (bd-203)

- New 'bd duplicates' command finds content duplicates across database - Groups by content hash (title, description, design, acceptance criteria) - Chooses merge target by reference count or smallest ID - Supports --auto-merge and --dry-run flags - Added --dedupe-after flag to 'bd import' for post-import detection - Comprehensive test coverage for duplicate detection - Updated AGENTS.md and ADVANCED.md with usage examples Amp-Thread-ID: https://ampcode.com/threads/T-6f99566f-c979-43ed-bd8f-5aa38b0f6191 Co-authored-by: Amp <amp@ampcode.com>
2025-10-24 13:45:04 -07:00
parent 9b2c551923
commit 3195b8062b
6 changed files with 686 additions and 2 deletions
--- a/.beads/beads.jsonl
+++ b/.beads/beads.jsonl
@@ -86,7 +86,7 @@
 {"id":"bd-200","title":"Write comprehensive tests for external_ref import scenarios","description":"Create test suite covering all external_ref import behaviors.\n\n## Test Cases\n1. Import with external_ref → creates new issue\n2. Re-import same external_ref with changes → updates existing\n3. Re-import same external_ref unchanged → idempotent (no update)\n4. Import external issue with ID collision → auto-remaps external issue\n5. Import without external_ref → uses current ID-based logic\n6. Mixed batch: some with external_ref, some without\n7. External_ref match + ID mismatch → updates by external_ref\n8. Local issue (no external_ref) never matched by external import\n9. Performance test with external_ref index\n\nAdd tests to:\n- cmd/bd/import_shared_test.go\n- internal/storage/sqlite/collision_test.go","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-24T13:08:46.376432-07:00","updated_at":"2025-10-24T13:35:26.748892-07:00","closed_at":"2025-10-24T13:35:26.748892-07:00"}
 {"id":"bd-201","title":"Update documentation for external_ref import behavior","description":"Document the new external_ref-based import matching behavior.\n\n## Files to Update\n- README.md: Add external_ref import section\n- QUICKSTART.md: Add example of Jira/GitHub integration workflow\n- AGENTS.md: Document external_ref import behavior for AI agents\n- FAQ.md: Add Q\u0026A about external system integration\n- Add examples/ directory entry showing Jira/GitHub/Linear integration\n\n## Key Points to Document\n- How external_ref matching works\n- That local issues are protected from external imports\n- Example workflow: import → add local tasks → re-import updates\n- ID conflict resolution behavior","status":"closed","priority":2,"issue_type":"task","created_at":"2025-10-24T13:08:46.390992-07:00","updated_at":"2025-10-24T13:35:26.749081-07:00","closed_at":"2025-10-24T13:35:26.749081-07:00"}
 {"id":"bd-202","title":"Code review: external_ref import feature","description":"Comprehensive code review of external_ref import implementation.\n\n## Review Checklist\n- [ ] Storage layer changes are correct and efficient\n- [ ] Index created properly on external_ref column\n- [ ] Collision detection logic handles all edge cases\n- [ ] Import flow correctly distinguishes external vs local issues\n- [ ] No breaking changes to existing workflows\n- [ ] Tests cover all scenarios (see bd-200)\n- [ ] Documentation is clear and complete (see bd-201)\n- [ ] Performance impact is acceptable\n- [ ] Error messages are helpful\n- [ ] Logging is appropriate for debugging\n\n## Oracle Review\nUse oracle tool to analyze implementation for correctness, edge cases, and potential issues.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-24T13:08:46.395433-07:00","updated_at":"2025-10-24T13:35:26.749271-07:00","closed_at":"2025-10-24T13:35:26.749271-07:00"}
-{"id":"bd-203","title":"Add automated duplicate detection tool for post-import cleanup","description":"After collision resolution with --resolve-collisions, we can end up with content duplicates (different IDs, identical content) when parallel work creates the same issues.\n\n## Problem\nCurrent situation:\n- `deduplicateIncomingIssues()` only deduplicates within the import batch\n- Doesn't detect duplicates between DB and incoming issues\n- Doesn't detect duplicates across the entire database post-import\n- Manual detection: `bd list --json | jq 'group_by(.title) | map(select(length \u003e 1))'`\n\n## Real Example\nAfter import with collision resolution, we had 7 duplicate pairs:\n--196/bd-196: Same external_ref epic\n--196-196: Same findByExternalRef task\n--196-196: Same collision detection task\n--196-196: Same import flow task\n--196-196: Same test writing task\n--196-196: Same documentation task\n--196-196: Same code review task\n\n## Proposed Solution\nAdd `bd duplicates` command that:\n1. Groups all issues by content hash (title + description + design + acceptance_criteria)\n2. Reports duplicate groups with suggested merge target (lowest ID or most references)\n3. Optionally auto-merge with `--auto-merge` flag\n4. Respects status (don't merge open with closed)\n\n## Example Output\n```\n🔍 Found 7 duplicate groups:\n\nGroup 1: \"Feature: Use external_ref as primary matching key\"\n  --196 (open, P1, 0 references)\n  - bd-196 (open, P1, 0 references)\n  Suggested merge: bd merge-196 --into bd-196\n\nGroup 2: \"Add findByExternalRef query\"\n  --196 (open, P1, 0 references)  \n  --196 (open, P1, 0 references)\n  Suggested merge: bd merge-196 --into-196\n...\n\nRun with --auto-merge to execute all suggested merges\n```\n\n## Implementation Notes\n- Use same content hashing as deduplicateIncomingIssues\n- Consider reference counts when choosing merge target\n- Skip duplicates with different status (open vs closed)\n- Add --dry-run mode\n- Integration with import: `bd import --resolve-collisions --dedupe-after`","status":"open","priority":2,"issue_type":"feature","created_at":"2025-10-24T13:35:14.97041-07:00","updated_at":"2025-10-24T13:35:26.729886-07:00"}
+{"id":"bd-203","title":"Add automated duplicate detection tool for post-import cleanup","description":"After collision resolution with --resolve-collisions, we can end up with content duplicates (different IDs, identical content) when parallel work creates the same issues.\n\n## Problem\nCurrent situation:\n- `deduplicateIncomingIssues()` only deduplicates within the import batch\n- Doesn't detect duplicates between DB and incoming issues\n- Doesn't detect duplicates across the entire database post-import\n- Manual detection: `bd list --json | jq 'group_by(.title) | map(select(length \u003e 1))'`\n\n## Real Example\nAfter import with collision resolution, we had 7 duplicate pairs:\n--196/bd-196: Same external_ref epic\n--196-196: Same findByExternalRef task\n--196-196: Same collision detection task\n--196-196: Same import flow task\n--196-196: Same test writing task\n--196-196: Same documentation task\n--196-196: Same code review task\n\n## Proposed Solution\nAdd `bd duplicates` command that:\n1. Groups all issues by content hash (title + description + design + acceptance_criteria)\n2. Reports duplicate groups with suggested merge target (lowest ID or most references)\n3. Optionally auto-merge with `--auto-merge` flag\n4. Respects status (don't merge open with closed)\n\n## Example Output\n```\n🔍 Found 7 duplicate groups:\n\nGroup 1: \"Feature: Use external_ref as primary matching key\"\n  --196 (open, P1, 0 references)\n  - bd-196 (open, P1, 0 references)\n  Suggested merge: bd merge-196 --into bd-196\n\nGroup 2: \"Add findByExternalRef query\"\n  --196 (open, P1, 0 references)  \n  --196 (open, P1, 0 references)\n  Suggested merge: bd merge-196 --into-196\n...\n\nRun with --auto-merge to execute all suggested merges\n```\n\n## Implementation Notes\n- Use same content hashing as deduplicateIncomingIssues\n- Consider reference counts when choosing merge target\n- Skip duplicates with different status (open vs closed)\n- Add --dry-run mode\n- Integration with import: `bd import --resolve-collisions --dedupe-after`","status":"in_progress","priority":2,"issue_type":"feature","created_at":"2025-10-24T13:35:14.97041-07:00","updated_at":"2025-10-24T13:37:39.187698-07:00"}
 {"id":"bd-204","title":"Optimize auto-flush to use incremental updates","description":"Every flush exports ALL issues and ALL dependencies, even if only one issue changed. For large projects (1000+ issues), this could be expensive. Current approach guarantees consistency, which is fine for MVP, but future optimization could track which issues changed and use incremental updates. Located in cmd/bd/main.go:255-276.","status":"closed","priority":3,"issue_type":"feature","created_at":"2025-10-24T13:35:23.09858-07:00","updated_at":"2025-10-24T13:35:23.09858-07:00","closed_at":"2025-10-14T02:51:52.200141-07:00"}
 {"id":"bd-205","title":"Implement dependency migration for merge","description":"Migrate all dependencies from source issue(s) to target issue during merge, removing duplicates and preserving graph integrity","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-24T13:35:23.098981-07:00","updated_at":"2025-10-24T13:35:23.098981-07:00","closed_at":"2025-10-22T01:07:04.720032-07:00"}
 {"id":"bd-206","title":"Add merged_into field to database schema","description":"Add merged_into field to Issue struct and update database schema to support merge tracking","notes":"Simplified: no schema field needed. Close merged issues with reason 'Merged into bd-X'. See bd-79 design.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-24T13:35:23.099266-07:00","updated_at":"2025-10-24T13:35:23.099266-07:00","closed_at":"2025-10-22T01:07:14.145014-07:00"}