Hash-Based ID Generation Design
Status: Implemented (bd-166)
Version: 2.0
Last Updated: 2025-10-30
Overview
bd v2.0 replaces sequential auto-increment IDs (bd-1, bd-2) with content-hash based IDs (bd-af78e9a2) and hierarchical sequential children (bd-af78e9a2.1, .2, .3).
This eliminates ID collisions in distributed workflows while maintaining human-friendly IDs for related work.
ID Format
Top-Level IDs (Hash-Based)
Format: {prefix}-{6-8-char-hex} (progressive on collision)
Examples:
bd-a3f2dd (6 chars, common case ~97%)
bd-a3f2dda (7 chars, rare collision ~3%)
bd-a3f2dda8 (8 chars, very rare double collision)
- Prefix: Configurable (bd, ticket, bug, etc.)
- Hash: First 6 characters of SHA256 hash (extends to 7-8 on collision)
- Total length: 9-11 chars for "bd-" prefix
Hierarchical Child IDs (Sequential)
Format: {parent-id}.{child-number}
Examples:
bd-a3f2dd.1 (depth 1, 6-char parent)
bd-a3f2dda.1.2 (depth 2, 7-char parent on collision)
bd-a3f2dd.1.2.3 (depth 3, max depth)
- Max depth: 3 levels (prevents over-decomposition)
- Max breadth: Unlimited (tested up to 347 children)
- Max ID length: ~15 chars at depth 3 with single-digit child numbers (6-char parent + .N.N.N); ~17 chars with an 8-char parent
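The child-ID rules above can be sketched as follows. This is a minimal illustration, not the project's implementation: `childID` and `maxDepth` are hypothetical names, and the sketch assumes the configured prefix contains no dots, so that counting dots gives the depth.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxDepth mirrors the 3-level limit described above.
const maxDepth = 3

// childID is a hypothetical helper: it appends ".N" to a parent ID and
// rejects children that would exceed maxDepth.
func childID(parent string, n int) (string, error) {
	if n < 1 {
		return "", errors.New("child number must be >= 1")
	}
	// The parent's depth equals the number of dots in its ID
	// (assumes the prefix itself contains no dots).
	if strings.Count(parent, ".")+1 > maxDepth {
		return "", fmt.Errorf("%s is already at max depth %d", parent, maxDepth)
	}
	return fmt.Sprintf("%s.%d", parent, n), nil
}

func main() {
	id, err := childID("bd-a3f2dd", 1)
	fmt.Println(id, err) // bd-a3f2dd.1 <nil>

	_, err = childID("bd-a3f2dd.1.2.3", 1)
	fmt.Println(err != nil) // true
}
```

Deriving the depth from the ID string keeps the check stateless; a real implementation would also need a per-parent counter to pick the next child number.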
Hash Generation Algorithm
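import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// GenerateHashID derives a top-level ID from the issue's immutable creation inputs.
func GenerateHashID(prefix, title, description string, created time.Time, workspaceID string) string {
	h := sha256.New()
	h.Write([]byte(title))
	h.Write([]byte(description))
	h.Write([]byte(created.Format(time.RFC3339Nano)))
	h.Write([]byte(workspaceID))
	hash := hex.EncodeToString(h.Sum(nil))
	return fmt.Sprintf("%s-%s", prefix, hash[:8]) // first 8 hex chars of the digest
}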
Hash Inputs
- Title - Primary identifier for the issue
- Description - Additional context for uniqueness
- Created timestamp - RFC3339Nano format for nanosecond precision
- Workspace ID - Prevents collisions across databases/teams
Design Decisions
Why include timestamp?
- Ensures different issues with identical title+description get unique IDs
- Nanosecond precision makes simultaneous creation unlikely
Why include workspace ID?
- Prevents collisions when merging databases from different teams
- Can be hostname, UUID, or team identifier
Why NOT include priority/type?
- These fields are mutable and shouldn't affect identity
- Changing priority shouldn't change the issue ID
Content Hash (Collision Detection)
Separate from ID generation, bd uses content hashing for collision detection during import. See internal/storage/sqlite/collision.go:hashIssueContent().
Content Hash Fields
The content hash includes ALL semantically meaningful fields:
- title, description, status, priority, issue_type
- assignee, design, acceptance_criteria, notes
- external_ref ⚠️ (important: see below)
External Ref in Content Hash
IMPORTANT: external_ref is included in the content hash. This has subtle implications:
Local issue (no external_ref) → content hash A
Same issue + external_ref → content hash B (different!)
Why include external_ref?
- Linkage to external systems (Jira, GitHub, Linear) is semantically meaningful
- Changing external_ref represents a real content change
- Ensures external system changes are tracked properly
Implications:
- Rename detection won't match issues before/after adding external_ref
- Collision detection treats external_ref changes as updates
- Idempotent import requires identical external_ref
- Import by external_ref still works (checked before content hash)
Example scenario:
# 1. Create local issue
bd create "Fix auth bug" -p 1
# → ID: bd-a3f2dd, content_hash: abc123
# 2. Link to Jira
bd update bd-a3f2dd --external-ref JIRA-456
# → ID: bd-a3f2dd (same), content_hash: def789 (changed!)
# 3. Re-import from Jira
bd import -i jira-export.jsonl
# → Matches by external_ref first (JIRA-456)
# → Content hash different, triggers update
# → Idempotent on subsequent imports
Design rationale: External system linkage is tracked as substantive content, not just metadata. This ensures proper audit trails and collision resolution.
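To make the effect of external_ref concrete, here is a small sketch. It is not the code in collision.go: the `issue` struct and `contentHash` function are hypothetical, cover only a subset of the fields listed above, and use a NUL separator as an assumed encoding.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// issue holds a subset of the hashed fields; the struct is illustrative only.
type issue struct {
	Title, Description, Status, ExternalRef string
	Priority                                int
}

// contentHash is a hypothetical stand-in for hashIssueContent. Each field is
// followed by a NUL byte so adjacent values cannot run together.
func contentHash(i issue) string {
	h := sha256.New()
	fields := []string{i.Title, i.Description, i.Status, fmt.Sprint(i.Priority), i.ExternalRef}
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	local := issue{Title: "Fix auth bug", Status: "open", Priority: 1}
	linked := local
	linked.ExternalRef = "JIRA-456"

	fmt.Println(contentHash(local) == contentHash(linked))  // false
	fmt.Println(contentHash(linked) == contentHash(linked)) // true
}
```

The first comparison corresponds to step 2 of the scenario (linking changes the content hash); the second corresponds to the idempotent re-import in step 3.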
Why 6 chars (with progressive extension)?
- 6 chars (24 bits) = ~16 million possible IDs
- Progressive collision handling: extend to 7-8 chars only when needed
- Optimizes for the common case: a 1,000-issue database has a ~97% chance of needing no extension at all, so IDs stay short and readable
- Rare collisions get slightly longer but still reasonable IDs
- Inspired by Git's abbreviated commit SHAs
Collision Analysis
Birthday Paradox Probability
For 6-character hex IDs (24-bit space = 2^24 = 16,777,216):
| # Issues | 6-char Collision | 7-char Collision | 8-char Collision |
|---|---|---|---|
| 100 | ~0.03% | ~0.002% | ~0.0001% |
| 1,000 | 2.94% | 0.19% | 0.01% |
| 10,000 | 94.9% | 17.0% | 1.16% |
Formula: P(collision) ≈ 1 - e^(-n²/2N), where n is the number of issues and N is the size of the ID space (16^k for k hex characters)
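The table values can be reproduced directly from the formula. The sketch below is only a calculator for the approximation; `collisionProb` is a name chosen here for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates the probability of at least one collision among
// n IDs drawn uniformly from 16^hexChars values: 1 - e^(-n²/2N).
func collisionProb(n float64, hexChars int) float64 {
	space := math.Pow(16, float64(hexChars))
	// -Expm1(x) equals 1 - e^x and stays accurate when x is close to zero.
	return -math.Expm1(-n * n / (2 * space))
}

func main() {
	for _, n := range []float64{100, 1000, 10000} {
		fmt.Printf("%6.0f issues: 6-char %.4f%%, 7-char %.4f%%, 8-char %.4f%%\n",
			n,
			100*collisionProb(n, 6),
			100*collisionProb(n, 7),
			100*collisionProb(n, 8))
	}
}
```

Note that this is the probability that a database contains any collision at all, not the share of individual IDs that collide.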
Progressive Strategy: Start with 6 chars. On INSERT collision, try 7 chars from the same hash. On a second collision, try 8 chars. A 1,000-issue database has a ~97% chance of containing no 6-char collision at all, so nearly every ID stays at 6 chars.
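A minimal sketch of the progressive strategy, under stated assumptions: `pickID` is a hypothetical function, and the `taken` map stands in for the database's uniqueness check, which in practice would be the failed INSERT itself.

```go
package main

import (
	"errors"
	"fmt"
)

// pickID tries the first 6, then 7, then 8 hex characters of fullHash and
// returns the first candidate that is not already taken.
func pickID(prefix, fullHash string, taken map[string]bool) (string, error) {
	if len(fullHash) < 8 {
		return "", errors.New("hash must have at least 8 hex characters")
	}
	for n := 6; n <= 8; n++ {
		id := prefix + "-" + fullHash[:n]
		if !taken[id] {
			return id, nil
		}
	}
	return "", errors.New("6-, 7- and 8-char candidates all collide")
}

func main() {
	taken := map[string]bool{"bd-a3f2dd": true}
	id, err := pickID("bd", "a3f2dda8c0ffee", taken)
	fmt.Println(id, err) // bd-a3f2dda <nil>
}
```

Because every candidate is a prefix of the same hash, a longer ID still begins with the short form the user may have seen earlier.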
Real-World Risk Assessment
Low Risk (<10,000 issues):
- Single team projects: ~1% chance of any 8-char collision over the project lifetime (1.16% at 10,000 issues in the table above)
- Mitigation: Workspace ID prevents cross-team collisions
- Fallback: If collision detected, append counter (bd-af78e9a2-2)
Medium Risk (10,000-50,000 issues):
- Large enterprise projects
- Recommendation: Monitor collision rate
- Consider 16-char IDs in v3 if collisions occur
High Risk (>50,000 issues):
- Multi-team platforms with shared database
- Recommendation: Use 16-char IDs (64 bits) for 2^64 space
- Implementation: Change hash[:8] to hash[:16]
Collision Detection
The database schema enforces uniqueness via PRIMARY KEY constraint. If a hash collision occurs:
- INSERT fails with UNIQUE constraint violation
- Client detects error and retries with modified input
- Options:
- Append counter to description: "Fix auth (2)"
- Wait 1ns and regenerate (different timestamp)
- Use 16-char hash mode
Performance
Benchmark Results (Apple M1 Max):
BenchmarkGenerateHashID-10 3758022 317.4 ns/op
BenchmarkGenerateChildID-10 19689157 60.96 ns/op
- Hash ID generation: ~317ns (well under 1μs requirement) ✅
- Child ID generation: ~61ns (trivial string concat)
- No performance concerns for interactive CLI use
Comparison to Sequential IDs
| Aspect | Sequential (v1) | Hash-Based (v2) |
|---|---|---|
| Collision risk | HIGH (offline work) | NONE (top-level) |
| ID length | 5-8 chars | 9-11 chars (avg ~9) |
| Predictability | Predictable (bd-1, bd-2) | Unpredictable |
| Offline-first | ❌ Requires coordination | ✅ Fully offline |
| Merge conflicts | ❌ Same ID, different content | ✅ Different IDs |
| Human-friendly | ✅ Easy to remember | ⚠️ Harder to remember |
| Code complexity | ~2,100 LOC collision resolution | <100 LOC |
CLI Usage
Prefix Handling
- Storage: Always includes prefix (bd-a3f2dd)
- CLI Input: Prefix optional (both bd-a3f2dd and a3f2dd accepted)
- CLI Output: Always shows prefix (copy-paste clarity)
- External refs: Always use prefix (git commits, docs, Slack)
# All of these work (prefix optional in input):
bd show a3f2dd
bd show bd-a3f2dd
bd show a3f2dd.1
bd show bd-a3f2dd.1.2
# Output always shows prefix:
bd-a3f2dd [epic] Auth System
Status: open
...
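Accepting both forms on input comes down to normalizing before lookup. The following is a sketch only; `normalizeID` is a hypothetical helper, not a function from the bd codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeID adds "<prefix>-" when the user typed a bare hash and leaves
// already-prefixed IDs unchanged, so storage always sees the full form.
func normalizeID(prefix, input string) string {
	input = strings.TrimSpace(input)
	if strings.HasPrefix(input, prefix+"-") {
		return input
	}
	return prefix + "-" + input
}

func main() {
	fmt.Println(normalizeID("bd", "a3f2dd"))        // bd-a3f2dd
	fmt.Println(normalizeID("bd", "bd-a3f2dd.1.2")) // bd-a3f2dd.1.2
}
```

Since hex hashes never contain a hyphen, checking for the `prefix-` form cannot misread a bare hash as already prefixed.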
Git-Style Prefix Matching
Like Git commit SHAs, bd accepts abbreviated IDs:
bd show af78 # Matches bd-af78e9a2 if unique
bd show af7 # ERROR: ambiguous (matches bd-af78e9a2 and bd-af78e9a2.1)
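Abbreviated lookup can be sketched as a prefix scan that succeeds only on a unique match. The `resolve` function and the sample IDs are hypothetical; how bd treats child IDs during matching is not specified here, so the sketch is given top-level IDs only.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve returns the single ID in ids that starts with the abbreviation,
// or an error when no ID or more than one ID matches.
func resolve(prefix, abbrev string, ids []string) (string, error) {
	// Accept input with or without the "<prefix>-" part.
	want := prefix + "-" + strings.TrimPrefix(abbrev, prefix+"-")
	var matches []string
	for _, id := range ids {
		if strings.HasPrefix(id, want) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no issue matches %q", abbrev)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("ambiguous %q: matches %s", abbrev, strings.Join(matches, ", "))
	}
}

func main() {
	ids := []string{"bd-af78e9a2", "bd-af71c003", "bd-03b1d2"}

	id, err := resolve("bd", "af78", ids)
	fmt.Println(id, err) // bd-af78e9a2 <nil>

	_, err = resolve("bd", "af7", ids)
	fmt.Println(err != nil) // true: two IDs start with bd-af7
}
```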
Migration Strategy
Database Migration
# Preview migration
bd migrate --hash-ids --dry-run
# Execute migration
bd migrate --hash-ids
# What it does:
# 1. Create child_counters table
# 2. For each existing issue:
# - Generate hash ID from content
# - Update all references in dependencies
# - Update all text mentions in descriptions/notes
# 3. Drop issue_counters table
# 4. Update config to hash_id_mode=true
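The "update all text mentions" step can be sketched as a token-aware replacement. This is an assumption about how such a rewrite might look, not the migration's actual code: `rewriteMentions`, the hard-coded `bd` prefix, and the sample mapping are all illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// seqID matches sequential IDs such as bd-12 as whole tokens, so bd-1 is not
// rewritten inside bd-12 and hash IDs like bd-12ab34 are left alone.
var seqID = regexp.MustCompile(`\bbd-\d+\b`)

// rewriteMentions replaces each old sequential ID found in text with its new
// hash ID; IDs missing from the mapping are left untouched.
func rewriteMentions(text string, mapping map[string]string) string {
	return seqID.ReplaceAllStringFunc(text, func(old string) string {
		if id, ok := mapping[old]; ok {
			return id
		}
		return old
	})
}

func main() {
	mapping := map[string]string{"bd-1": "bd-a3f2dd", "bd-12": "bd-03b1d2"}
	fmt.Println(rewriteMentions("Blocked by bd-1 and bd-12; see bd-99.", mapping))
	// Blocked by bd-a3f2dd and bd-03b1d2; see bd-99.
}
```

Matching whole tokens matters here: a plain string replace of bd-1 would corrupt every mention of bd-10 through bd-19.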
Backward Compatibility
- Sequential IDs continue working in v1.x
- Hash IDs are opt-in until v2.0
- Migration is one-way (no rollback)
- Export to JSONL preserves both old and new IDs during transition
Workspace ID Generation
Recommended approach:
- First run: Generate UUID and store in config table
- Subsequent runs: Reuse stored workspace ID
- Collision: If two databases have same workspace ID, collisions possible but rare
Alternative approaches:
- Hostname: Simple but not unique (multiple DBs on same machine)
- Git remote URL: Requires git repository
- Manual config: User sets team identifier (e.g., "team-auth")
Implementation:
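import (
	"context"
	"database/sql"
	"errors"

	"github.com/google/uuid" // assumed UUID package; the original does not name one
)

// getWorkspaceID returns the stored workspace ID, creating one on first run.
func (s *SQLiteStorage) getWorkspaceID(ctx context.Context) (string, error) {
	var id string
	err := s.db.QueryRowContext(ctx,
		`SELECT value FROM config WHERE key = ?`,
		"workspace_id").Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		// First run: generate a new UUID and persist it.
		id = uuid.New().String()
		_, err = s.db.ExecContext(ctx,
			`INSERT INTO config (key, value) VALUES (?, ?)`,
			"workspace_id", id)
	}
	return id, err
}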
Future Considerations
16-Character Hash IDs (v3.0)
If collision rates become problematic:
// Change from:
return fmt.Sprintf("%s-%s", prefix, hash[:8])
// To:
return fmt.Sprintf("%s-%s", prefix, hash[:16])
// Example: bd-af78e9a2c4d5e6f7
Tradeoffs:
- ✅ Collision probability: ~0.03% even at 100M issues (64-bit space)
- ❌ Longer IDs: 19 chars vs 11 chars
- ❌ Less human-friendly
Custom Hash Algorithms
For specialized use cases:
- BLAKE3: Faster than SHA256 (not needed for interactive CLI)
- xxHash: Non-cryptographic but faster (collision resistance?)
- MurmurHash: Used by Jira (consider for compatibility)
References
- Epic: bd-165 (Hash-based IDs with hierarchical children)
- Implementation: internal/types/id_generator.go
- Tests: internal/types/id_generator_test.go
- Related: bd-168 (CreateIssue integration), bd-169 (JSONL format)
Summary
Hash-based IDs eliminate distributed ID collision problems at the cost of slightly longer, less memorable IDs. Hierarchical children provide human-friendly sequential IDs within naturally-coordinated contexts (epic ownership).
This design enables true offline-first workflows and eliminates ~2,100 lines of complex collision resolution code.