Files

Steve Yegge 773aa736e4 Document external_ref in content hash behavior (bd-9f4a)

- Added comprehensive code comments in collision.go explaining external_ref inclusion
- Documented content hash behavior in HASH_ID_DESIGN.md with examples
- Enhanced test documentation in collision_test.go
- Closes bd-9f4a, bd-df11, bd-537e

Amp-Thread-ID: https://ampcode.com/threads/T-47525168-d51c-4f56-b598-18402e5ea389
Co-authored-by: Amp <amp@ampcode.com>

2025-11-08 02:22:15 -08:00

10 KiB

Raw Blame History

Hash-Based ID Generation Design

Status: Implemented (bd-166)
Version: 2.0
Last Updated: 2025-10-30

Overview

bd v2.0 replaces sequential auto-increment IDs (bd-1, bd-2) with content-hash based IDs (bd-af78e9a2) and hierarchical sequential children (bd-af78e9a2.1, .2, .3).

This eliminates ID collisions in distributed workflows while maintaining human-friendly IDs for related work.

ID Format

Top-Level IDs (Hash-Based)

Format: {prefix}-{6-8-char-hex} (progressive on collision)
Examples: 
  bd-a3f2dd   (6 chars, common case ~97%)
  bd-a3f2dda  (7 chars, rare collision ~3%)
  bd-a3f2dda8 (8 chars, very rare double collision)

Prefix: Configurable (bd, ticket, bug, etc.)
Hash: First 6 characters of SHA256 hash (extends to 7-8 on collision)
Total length: 9-11 chars for "bd-" prefix

Hierarchical Child IDs (Sequential)

Format: {parent-id}.{child-number}
Examples:
  bd-a3f2dd.1       (depth 1, 6-char parent)
  bd-a3f2dda.1.2    (depth 2, 7-char parent on collision)
  bd-a3f2dd.1.2.3   (depth 3, max depth)

Max depth: 3 levels (prevents over-decomposition)
Max breadth: Unlimited (tested up to 347 children)
Max ID length: ~17 chars at depth 3 (6-char parent + .N.N.N)

Hash Generation Algorithm

func GenerateHashID(prefix, title, description string, created time.Time, workspaceID string) string {
    h := sha256.New()
    h.Write([]byte(title))
    h.Write([]byte(description))
    h.Write([]byte(created.Format(time.RFC3339Nano)))
    h.Write([]byte(workspaceID))
    hash := hex.EncodeToString(h.Sum(nil))
    return fmt.Sprintf("%s-%s", prefix, hash[:8])
}

Hash Inputs

Title - Primary identifier for the issue
Description - Additional context for uniqueness
Created timestamp - RFC3339Nano format for nanosecond precision
Workspace ID - Prevents collisions across databases/teams

Design Decisions

Why include timestamp?

Ensures different issues with identical title+description get unique IDs
Nanosecond precision makes simultaneous creation unlikely

Why include workspace ID?

Prevents collisions when merging databases from different teams
Can be hostname, UUID, or team identifier

Why NOT include priority/type?

These fields are mutable and shouldn't affect identity
Changing priority shouldn't change the issue ID

Content Hash (Collision Detection)

Separate from ID generation, bd uses content hashing for collision detection during import. See internal/storage/sqlite/collision.go:hashIssueContent().

Content Hash Fields

The content hash includes ALL semantically meaningful fields:

title, description, status, priority, issue_type
assignee, design, acceptance_criteria, notes
external_ref ⚠️ (important: see below)

External Ref in Content Hash

IMPORTANT: external_ref is included in the content hash. This has subtle implications:

Local issue (no external_ref)    → content hash A
Same issue + external_ref         → content hash B  (different!)

Why include external_ref?

Linkage to external systems (Jira, GitHub, Linear) is semantically meaningful
Changing external_ref represents a real content change
Ensures external system changes are tracked properly

Implications:

Rename detection won't match issues before/after adding external_ref
Collision detection treats external_ref changes as updates
Idempotent import requires identical external_ref
Import by external_ref still works (checked before content hash)

Example scenario:

# 1. Create local issue
bd create "Fix auth bug" -p 1
# → ID: bd-a3f2dd, content_hash: abc123

# 2. Link to Jira
bd update bd-a3f2dd --external-ref JIRA-456
# → ID: bd-a3f2dd (same), content_hash: def789 (changed!)

# 3. Re-import from Jira
bd import -i jira-export.jsonl
# → Matches by external_ref first (JIRA-456)
# → Content hash different, triggers update
# → Idempotent on subsequent imports

Design rationale: External system linkage is tracked as substantive content, not just metadata. This ensures proper audit trails and collision resolution.

Why 6 chars (with progressive extension)?

6 chars (24 bits) = ~16 million possible IDs
Progressive collision handling: extend to 7-8 chars only when needed
Optimizes for common case: 97% get short, readable 6-char IDs
Rare collisions get slightly longer but still reasonable IDs
Inspired by Git's abbreviated commit SHAs

Collision Analysis

Birthday Paradox Probability

For 6-character hex IDs (24-bit space = 2^24 = 16,777,216):

# Issues	6-char Collision	7-char Collision	8-char Collision
100	~0.03%	~0.002%	~0.0001%
1,000	2.94%	0.19%	0.01%
10,000	94.9%	17.0%	1.16%

Formula: P(collision) ≈ 1 - e^(-n²/2N)

Progressive Strategy: Start with 6 chars. On INSERT collision, try 7 chars from same hash. On second collision, try 8 chars. This means ~97% of IDs in a 1,000 issue database stay at 6 chars.

Real-World Risk Assessment

Low Risk (<10,000 issues):

Single team projects: ~1% chance over lifetime
Mitigation: Workspace ID prevents cross-team collisions
Fallback: If collision detected, append counter (bd-af78e9a2-2)

Medium Risk (10,000-50,000 issues):

Large enterprise projects
Recommendation: Monitor collision rate
Consider 16-char IDs in v3 if collisions occur

High Risk (>50,000 issues):

Multi-team platforms with shared database
Recommendation: Use 16-char IDs (64 bits) for 2^64 space
Implementation: Change hash[:8] to hash[:16]

Collision Detection

The database schema enforces uniqueness via PRIMARY KEY constraint. If a hash collision occurs:

INSERT fails with UNIQUE constraint violation
Client detects error and retries with modified input
Options:
- Append counter to description: "Fix auth (2)"
- Wait 1ns and regenerate (different timestamp)
- Use 16-char hash mode

Performance

Benchmark Results (Apple M1 Max):

BenchmarkGenerateHashID-10     3758022    317.4 ns/op
BenchmarkGenerateChildID-10   19689157     60.96 ns/op

Hash ID generation: ~317ns (well under 1μs requirement) ✅
Child ID generation: ~61ns (trivial string concat)
No performance concerns for interactive CLI use

Comparison to Sequential IDs

Aspect	Sequential (v1)	Hash-Based (v2)
Collision risk	HIGH (offline work)	NONE (top-level)
ID length	5-8 chars	9-11 chars (avg ~9)
Predictability	Predictable (bd-1, bd-2)	Unpredictable
Offline-first	❌ Requires coordination	✅ Fully offline
Merge conflicts	❌ Same ID, different content	✅ Different IDs
Human-friendly	✅ Easy to remember	⚠️ Harder to remember
Code complexity	~2,100 LOC collision resolution	<100 LOC

CLI Usage

Prefix Handling

Storage: Always includes prefix (bd-a3f2dd) CLI Input: Prefix optional (both bd-a3f2dd AND a3f2dd accepted) CLI Output: Always shows prefix (copy-paste clarity) External refs: Always use prefix (git commits, docs, Slack)

# All of these work (prefix optional in input):
bd show a3f2dd
bd show bd-a3f2dd
bd show a3f2dd.1
bd show bd-a3f2dd.1.2

# Output always shows prefix:
bd-a3f2dd [epic] Auth System
  Status: open
  ...

Git-Style Prefix Matching

Like Git commit SHAs, bd accepts abbreviated IDs:

bd show af78      # Matches bd-af78e9a2 if unique
bd show af7       # ERROR: ambiguous (matches bd-af78e9a2 and bd-af78e9a2.1)

Migration Strategy

Database Migration

# Preview migration
bd migrate --hash-ids --dry-run

# Execute migration
bd migrate --hash-ids

# What it does:
# 1. Create child_counters table
# 2. For each existing issue:
#    - Generate hash ID from content
#    - Update all references in dependencies
#    - Update all text mentions in descriptions/notes
# 3. Drop issue_counters table
# 4. Update config to hash_id_mode=true

Backward Compatibility

Sequential IDs continue working in v1.x
Hash IDs are opt-in until v2.0
Migration is one-way (no rollback)
Export to JSONL preserves both old and new IDs during transition

Workspace ID Generation

Recommended approach:

First run: Generate UUID and store in config table
Subsequent runs: Reuse stored workspace ID
Collision: If two databases have same workspace ID, collisions possible but rare

Alternative approaches:

Hostname: Simple but not unique (multiple DBs on same machine)
Git remote URL: Requires git repository
Manual config: User sets team identifier (e.g., "team-auth")

Implementation:

func (s *SQLiteStorage) getWorkspaceID(ctx context.Context) (string, error) {
    var id string
    err := s.db.QueryRowContext(ctx, 
        `SELECT value FROM config WHERE key = ?`, 
        "workspace_id").Scan(&id)
    if err == sql.ErrNoRows {
        // Generate new UUID
        id = uuid.New().String()
        _, err = s.db.ExecContext(ctx,
            `INSERT INTO config (key, value) VALUES (?, ?)`,
            "workspace_id", id)
    }
    return id, err
}

Future Considerations

16-Character Hash IDs (v3.0)

If collision rates become problematic:

// Change from:
return fmt.Sprintf("%s-%s", prefix, hash[:8])

// To:
return fmt.Sprintf("%s-%s", prefix, hash[:16])

// Example: bd-af78e9a2c4d5e6f7

Tradeoffs:

✅ Collision probability: ~0% even at 100M issues
❌ Longer IDs: 19 chars vs 11 chars
❌ Less human-friendly

Custom Hash Algorithms

For specialized use cases:

BLAKE3: Faster than SHA256 (not needed for interactive CLI)
xxHash: Non-cryptographic but faster (collision resistance?)
MurmurHash: Used by Jira (consider for compatibility)

References

Epic: bd-165 (Hash-based IDs with hierarchical children)
Implementation: internal/types/id_generator.go
Tests: internal/types/id_generator_test.go
Related: bd-168 (CreateIssue integration), bd-169 (JSONL format)

Summary

Hash-based IDs eliminate distributed ID collision problems at the cost of slightly longer, less memorable IDs. Hierarchical children provide human-friendly sequential IDs within naturally-coordinated contexts (epic ownership).

This design enables true offline-first workflows and eliminates ~2,100 lines of complex collision resolution code.

10 KiB Raw Blame History