Add GitHub Issues migration script (bd-68)

- New gh2jsonl.py script supports GitHub API and JSON file import - Maps GitHub labels to bd priority/type/status - Preserves metadata, assignees, timestamps, external refs - Auto-detects cross-references and creates dependencies - Production-ready: User-Agent, rate limit handling, UTF-8 support - Comprehensive README with examples and troubleshooting - Tested and reviewed Amp-Thread-ID: https://ampcode.com/threads/T-2fc85f05-302b-4fc9-8cac-63ac0e03c9af Co-authored-by: Amp <amp@ampcode.com>
2025-10-17 23:55:51 -07:00
parent 9fb46d41b8
commit 56a379dc5a
5 changed files with 744 additions and 6 deletions
@@ -0,0 +1,303 @@
+# GitHub Issues to bd Importer
+
+Import issues from GitHub repositories into `bd`.
+
+## Overview
+
+This tool converts GitHub Issues to bd's JSONL format, supporting both:
+1. **GitHub API** - Fetch issues directly from a repository
+2. **JSON Export** - Parse manually exported GitHub issues
+
+## Features
+
+- ✅ **Fetch from GitHub API** - Direct import from any public/private repo
+- ✅ **JSON file import** - Parse exported GitHub issues JSON
+- ✅ **Label mapping** - Auto-map GitHub labels to bd priority/type
+- ✅ **Preserve metadata** - Keep assignees, timestamps, descriptions
+- ✅ **Cross-references** - Convert `#123` references to dependencies
+- ✅ **External links** - Preserve URLs back to original GitHub issues
+- ✅ **Filter PRs** - Automatically excludes pull requests
+
+## Installation
+
+No dependencies required! Uses Python 3 standard library.
+
+For API access, set up a GitHub token:
+
+```bash
+# Create token at: https://github.com/settings/tokens
+# Permissions needed: public_repo (or repo for private repos)
+
+export GITHUB_TOKEN=ghp_your_token_here
+```
+
+**Security Note:** Use the `GITHUB_TOKEN` environment variable instead of `--token` flag when possible. The `--token` flag may appear in shell history and process listings.
+
+## Usage
+
+### From GitHub API
+
+```bash
+# Fetch all issues from a repository
+python gh2jsonl.py --repo owner/repo | bd import
+
+# Save to file first (recommended)
+python gh2jsonl.py --repo owner/repo > issues.jsonl
+bd import -i issues.jsonl --dry-run  # Preview
+bd import -i issues.jsonl             # Import
+
+# Fetch only open issues
+python gh2jsonl.py --repo owner/repo --state open
+
+# Fetch only closed issues
+python gh2jsonl.py --repo owner/repo --state closed
+```
+
+### From JSON File
+
+Export issues from GitHub (via API or manually), then:
+
+```bash
+# Single issue
+curl -H "Authorization: token $GITHUB_TOKEN" \
+  https://api.github.com/repos/owner/repo/issues/123 > issue.json
+
+python gh2jsonl.py --file issue.json | bd import
+
+# Multiple issues
+curl -H "Authorization: token $GITHUB_TOKEN" \
+  https://api.github.com/repos/owner/repo/issues > issues.json
+
+python gh2jsonl.py --file issues.json | bd import
+```
+
+### Custom Options
+
+```bash
+# Use custom prefix (instead of 'bd')
+python gh2jsonl.py --repo owner/repo --prefix myproject
+
+# Start numbering from specific ID
+python gh2jsonl.py --repo owner/repo --start-id 100
+
+# Pass token directly (instead of env var)
+python gh2jsonl.py --repo owner/repo --token ghp_...
+```
+
+## Label Mapping
+
+The script maps GitHub labels to bd fields:
+
+### Priority Mapping
+
+| GitHub Labels | bd Priority |
+|--------------|-------------|
+| `critical`, `p0`, `urgent` | 0 (Critical) |
+| `high`, `p1`, `important` | 1 (High) |
+| (default) | 2 (Medium) |
+| `low`, `p3`, `minor` | 3 (Low) |
+| `backlog`, `p4`, `someday` | 4 (Backlog) |
+
+### Type Mapping
+
+| GitHub Labels | bd Type |
+|--------------|---------|
+| `bug`, `defect` | bug |
+| `feature`, `enhancement` | feature |
+| `epic`, `milestone` | epic |
+| `chore`, `maintenance`, `dependencies` | chore |
+| (default) | task |
+
+### Status Mapping
+
+| GitHub State | GitHub Labels | bd Status |
+|-------------|---------------|-----------|
+| closed | (any) | closed |
+| open | `in progress`, `in-progress`, `wip` | in_progress |
+| open | `blocked` | blocked |
+| open | (default) | open |
+
+### Labels
+
+All other labels are preserved in the `labels` field. Labels used for mapping (priority, type, status) are filtered out to avoid duplication.
+
+## Field Mapping
+
+| GitHub Field | bd Field | Notes |
+|--------------|----------|-------|
+| `number` | (internal mapping) | GH#123 → bd-1, etc. |
+| `title` | `title` | Direct copy |
+| `body` | `description` | Direct copy |
+| `state` | `status` | See status mapping |
+| `labels` | `priority`, `issue_type`, `labels` | See label mapping |
+| `assignee.login` | `assignee` | First assignee only |
+| `created_at` | `created_at` | ISO 8601 timestamp |
+| `updated_at` | `updated_at` | ISO 8601 timestamp |
+| `closed_at` | `closed_at` | ISO 8601 timestamp |
+| `html_url` | `external_ref` | Link back to GitHub |
+
+## Cross-References
+
+Issue references in the body text are converted to dependencies:
+
+**GitHub:**
+```markdown
+This depends on #123 and fixes #456.
+See also owner/other-repo#789.
+```
+
+**Result:**
+- If GH#123 was imported, creates `related` dependency to its bd ID
+- If GH#456 was imported, creates `related` dependency to its bd ID
+- Cross-repo references (#789) are ignored (unless those issues were also imported)
+
+**Note:** Dependency records use `"issue_id": ""` format, which the bd importer automatically fills. This matches the behavior of the markdown-to-jsonl converter.
+
+## Examples
+
+### Example 1: Import Active Issues
+
+```bash
+# Import only open issues for active work
+export GITHUB_TOKEN=ghp_...
+python gh2jsonl.py --repo mycompany/myapp --state open > open-issues.jsonl
+
+# Preview
+cat open-issues.jsonl | jq .
+
+# Import
+bd import -i open-issues.jsonl
+bd ready  # See what's ready to work on
+```
+
+### Example 2: Full Repository Migration
+
+```bash
+# Import all issues (open and closed)
+python gh2jsonl.py --repo mycompany/myapp > all-issues.jsonl
+
+# Preview import (check for collisions)
+bd import -i all-issues.jsonl --dry-run
+
+# Import with collision resolution if needed
+bd import -i all-issues.jsonl --resolve-collisions
+
+# View stats
+bd stats
+```
+
+### Example 3: Partial Import from JSON
+
+```bash
+# Manually export specific issues via GitHub API
+gh api repos/owner/repo/issues?labels=p1,bug > high-priority-bugs.json
+
+# Import
+python gh2jsonl.py --file high-priority-bugs.json | bd import
+```
+
+## Customization
+
+The script is intentionally simple to customize for your workflow:
+
+### 1. Adjust Label Mappings
+
+Edit `map_priority()`, `map_issue_type()`, and `map_status()` to match your label conventions:
+
+```python
+def map_priority(self, labels: List[str]) -> int:
+    label_names = [label.get("name", "").lower() if isinstance(label, dict) else label.lower() for label in labels]
+    
+    # Add your custom mappings
+    if any(l in label_names for l in ["sev1", "emergency"]):
+        return 0
+    # ... etc
+```
+
+### 2. Add Custom Fields
+
+Map additional GitHub fields to bd:
+
+```python
+def convert_issue(self, gh_issue: Dict[str, Any]) -> Dict[str, Any]:
+    # ... existing code ...
+    
+    # Add milestone to design field
+    if gh_issue.get("milestone"):
+        issue["design"] = f"Milestone: {gh_issue['milestone']['title']}"
+    
+    return issue
+```
+
+### 3. Enhanced Dependency Detection
+
+Parse more dependency patterns from body text:
+
+```python
+def extract_dependencies_from_body(self, body: str) -> List[str]:
+    # ... existing code ...
+    
+    # Add: "Blocks: #123, #456"
+    blocks_pattern = r'Blocks:\s*((?:#\d+(?:\s*,\s*)?)+)'
+    # ... etc
+```
+
+## Limitations
+
+- **Single assignee**: GitHub supports multiple assignees, bd supports one
+- **No milestones**: GitHub milestones aren't mapped (consider using design field)
+- **Simple cross-refs**: Only basic `#123` patterns detected
+- **No comments**: Issue comments aren't imported (only the body)
+- **No reactions**: GitHub reactions/emoji aren't imported
+- **No projects**: GitHub project board info isn't imported
+
+## API Rate Limits
+
+GitHub API has rate limits:
+- **Authenticated**: 5,000 requests/hour
+- **Unauthenticated**: 60 requests/hour
+
+This script uses 1 request per 100 issues (pagination), so:
+- Can fetch ~500,000 issues/hour (authenticated)
+- Can fetch ~6,000 issues/hour (unauthenticated)
+
+For large repositories (>1000 issues), authentication is recommended.
+
+**Note:** The script automatically includes a `User-Agent` header (required by GitHub) and provides actionable error messages when rate limits are exceeded, including the reset timestamp.
+
+## Troubleshooting
+
+### "GitHub token required"
+
+Set the `GITHUB_TOKEN` environment variable:
+```bash
+export GITHUB_TOKEN=ghp_your_token_here
+```
+
+Or pass directly:
+```bash
+python gh2jsonl.py --repo owner/repo --token ghp_...
+```
+
+### "GitHub API error: 404"
+
+- Check repository name format: `owner/repo`
+- Check repository exists and is accessible
+- For private repos, ensure token has `repo` scope
+
+### "GitHub API error: 403"
+
+- Rate limit exceeded (wait or use authentication)
+- Token doesn't have required permissions
+- Repository requires different permissions
+
+### Issue numbers don't match
+
+This is expected! GitHub issue numbers (e.g., #123) are mapped to bd IDs (e.g., bd-1) based on import order. The original GitHub URL is preserved in `external_ref`.
+
+## See Also
+
+- [bd README](../../README.md) - Main documentation
+- [Markdown Import Example](../markdown-to-jsonl/) - Import from markdown
+- [TEXT_FORMATS.md](../../TEXT_FORMATS.md) - Understanding bd's JSONL format
+- [JSONL Import Guide](../../README.md#import) - Import collision handling
@@ -0,0 +1,52 @@
+[
+  {
+    "number": 42,
+    "title": "Add user authentication",
+    "body": "Implement JWT-based authentication.\n\nThis blocks #43 and is related to #44.",
+    "state": "open",
+    "labels": [
+      {"name": "feature"},
+      {"name": "high"},
+      {"name": "security"}
+    ],
+    "assignee": {
+      "login": "alice"
+    },
+    "created_at": "2025-01-15T10:00:00Z",
+    "updated_at": "2025-01-16T14:30:00Z",
+    "html_url": "https://github.com/example/repo/issues/42"
+  },
+  {
+    "number": 43,
+    "title": "Add API rate limiting",
+    "body": "Implement rate limiting for API endpoints.\n\nDepends on authentication (#42) being completed first.",
+    "state": "open",
+    "labels": [
+      {"name": "feature"},
+      {"name": "p1"}
+    ],
+    "assignee": {
+      "login": "bob"
+    },
+    "created_at": "2025-01-15T11:00:00Z",
+    "updated_at": "2025-01-15T11:00:00Z",
+    "html_url": "https://github.com/example/repo/issues/43"
+  },
+  {
+    "number": 44,
+    "title": "Fix login redirect bug",
+    "body": "Login page redirects to wrong URL after authentication.",
+    "state": "closed",
+    "labels": [
+      {"name": "bug"},
+      {"name": "critical"}
+    ],
+    "assignee": {
+      "login": "charlie"
+    },
+    "created_at": "2025-01-10T09:00:00Z",
+    "updated_at": "2025-01-12T16:00:00Z",
+    "closed_at": "2025-01-12T16:00:00Z",
+    "html_url": "https://github.com/example/repo/issues/44"
+  }
+]
@@ -0,0 +1,386 @@
+#!/usr/bin/env python3
+"""
+Convert GitHub Issues to bd JSONL format.
+
+Supports two input modes:
+1. GitHub API - Fetch issues directly from a repository
+2. JSON Export - Parse exported GitHub issues JSON
+
+Usage:
+    # From GitHub API
+    export GITHUB_TOKEN=ghp_your_token_here
+    python gh2jsonl.py --repo owner/repo | bd import
+    
+    # From exported JSON file
+    python gh2jsonl.py --file issues.json | bd import
+    
+    # Save to file first
+    python gh2jsonl.py --repo owner/repo > issues.jsonl
+"""
+
+import json
+import os
+import re
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import List, Dict, Any, Optional
+from urllib.request import Request, urlopen
+from urllib.error import HTTPError, URLError
+
+
+class GitHubToBeads:
+    """Convert GitHub Issues to bd JSONL format."""
+
+    def __init__(self, prefix: str = "bd", start_id: int = 1):
+        self.prefix = prefix
+        self.issue_counter = start_id
+        self.issues: List[Dict[str, Any]] = []
+        self.gh_id_to_bd_id: Dict[int, str] = {}
+
+    def fetch_from_api(self, repo: str, token: Optional[str] = None, state: str = "all"):
+        """Fetch issues from GitHub API."""
+        if not token:
+            token = os.getenv("GITHUB_TOKEN")
+            if not token:
+                raise ValueError(
+                    "GitHub token required. Set GITHUB_TOKEN env var or pass --token"
+                )
+
+        # Parse repo
+        if "/" not in repo:
+            raise ValueError("Repository must be in format: owner/repo")
+
+        # Fetch all issues (paginated)
+        page = 1
+        per_page = 100
+        all_issues = []
+
+        while True:
+            url = f"https://api.github.com/repos/{repo}/issues?state={state}&per_page={per_page}&page={page}"
+            headers = {
+                "Authorization": f"token {token}",
+                "Accept": "application/vnd.github.v3+json",
+                "User-Agent": "bd-gh-import/1.0",
+            }
+
+            try:
+                req = Request(url, headers=headers)
+                with urlopen(req) as response:
+                    data = json.loads(response.read().decode())
+
+                    if not data:
+                        break
+
+                    # Filter out pull requests (they appear in issues endpoint)
+                    issues = [issue for issue in data if "pull_request" not in issue]
+                    all_issues.extend(issues)
+
+                    if len(data) < per_page:
+                        break
+
+                    page += 1
+
+            except HTTPError as e:
+                error_body = e.read().decode(errors="replace")
+                remaining = e.headers.get("X-RateLimit-Remaining")
+                reset = e.headers.get("X-RateLimit-Reset")
+                msg = f"GitHub API error: {e.code} - {error_body}"
+                if e.code == 403 and remaining == "0":
+                    msg += f"\nRate limit exceeded. Resets at Unix timestamp: {reset}"
+                raise RuntimeError(msg)
+            except URLError as e:
+                raise RuntimeError(f"Network error calling GitHub: {e.reason}")
+
+        print(f"Fetched {len(all_issues)} issues from {repo}", file=sys.stderr)
+        return all_issues
+
+    def parse_json_file(self, filepath: Path) -> List[Dict[str, Any]]:
+        """Parse GitHub issues from JSON file."""
+        with open(filepath, 'r', encoding='utf-8') as f:
+            try:
+                data = json.load(f)
+            except json.JSONDecodeError as e:
+                raise ValueError(f"Invalid JSON in {filepath}: {e}")
+
+        # Handle both single issue and array of issues
+        if isinstance(data, dict):
+            # Filter out PRs
+            if "pull_request" in data:
+                return []
+            return [data]
+        elif isinstance(data, list):
+            # Filter out PRs
+            return [issue for issue in data if "pull_request" not in issue]
+        else:
+            raise ValueError("JSON must be a single issue object or array of issues")
+
+    def map_priority(self, labels: List[str]) -> int:
+        """Map GitHub labels to bd priority."""
+        label_names = [label.get("name", "").lower() if isinstance(label, dict) else label.lower() for label in labels]
+
+        # Priority labels (customize for your repo)
+        if any(l in label_names for l in ["critical", "p0", "urgent"]):
+            return 0
+        elif any(l in label_names for l in ["high", "p1", "important"]):
+            return 1
+        elif any(l in label_names for l in ["low", "p3", "minor"]):
+            return 3
+        elif any(l in label_names for l in ["backlog", "p4", "someday"]):
+            return 4
+        else:
+            return 2  # Default medium
+
+    def map_issue_type(self, labels: List[str]) -> str:
+        """Map GitHub labels to bd issue type."""
+        label_names = [label.get("name", "").lower() if isinstance(label, dict) else label.lower() for label in labels]
+
+        # Type labels (customize for your repo)
+        if any(l in label_names for l in ["bug", "defect"]):
+            return "bug"
+        elif any(l in label_names for l in ["feature", "enhancement"]):
+            return "feature"
+        elif any(l in label_names for l in ["epic", "milestone"]):
+            return "epic"
+        elif any(l in label_names for l in ["chore", "maintenance", "dependencies"]):
+            return "chore"
+        else:
+            return "task"
+
+    def map_status(self, state: str, labels: List[str]) -> str:
+        """Map GitHub state to bd status."""
+        label_names = [label.get("name", "").lower() if isinstance(label, dict) else label.lower() for label in labels]
+
+        if state == "closed":
+            return "closed"
+        elif any(l in label_names for l in ["in progress", "in-progress", "wip"]):
+            return "in_progress"
+        elif any(l in label_names for l in ["blocked"]):
+            return "blocked"
+        else:
+            return "open"
+
+    def extract_labels(self, gh_labels: List) -> List[str]:
+        """Extract label names from GitHub label objects."""
+        labels = []
+        for label in gh_labels:
+            if isinstance(label, dict):
+                name = label.get("name", "")
+            else:
+                name = str(label)
+
+            # Filter out labels we use for mapping
+            skip_labels = {
+                "bug", "feature", "epic", "chore", "enhancement", "defect",
+                "critical", "high", "low", "p0", "p1", "p2", "p3", "p4",
+                "urgent", "important", "minor", "backlog", "someday",
+                "in progress", "in-progress", "wip", "blocked"
+            }
+
+            if name.lower() not in skip_labels:
+                labels.append(name)
+
+        return labels
+
+    def extract_dependencies_from_body(self, body: str) -> List[str]:
+        """Extract issue references from body text."""
+        if not body:
+            return []
+
+        refs = []
+
+        # Pattern: #123 or owner/repo#123
+        issue_pattern = r'(?:^|\s)#(\d+)|(?:[\w-]+/[\w-]+)#(\d+)'
+
+        for match in re.finditer(issue_pattern, body):
+            issue_num = match.group(1) or match.group(2)
+            if issue_num:
+                refs.append(int(issue_num))
+
+        return list(set(refs))  # Deduplicate
+
+    def convert_issue(self, gh_issue: Dict[str, Any]) -> Dict[str, Any]:
+        """Convert a single GitHub issue to bd format."""
+        gh_id = gh_issue["number"]
+        bd_id = f"{self.prefix}-{self.issue_counter}"
+        self.issue_counter += 1
+
+        # Store mapping
+        self.gh_id_to_bd_id[gh_id] = bd_id
+
+        labels = gh_issue.get("labels", [])
+
+        # Build bd issue
+        issue = {
+            "id": bd_id,
+            "title": gh_issue["title"],
+            "description": gh_issue.get("body") or "",
+            "status": self.map_status(gh_issue["state"], labels),
+            "priority": self.map_priority(labels),
+            "issue_type": self.map_issue_type(labels),
+            "created_at": gh_issue["created_at"],
+            "updated_at": gh_issue["updated_at"],
+        }
+
+        # Add external reference
+        issue["external_ref"] = gh_issue["html_url"]
+
+        # Add assignee if present
+        if gh_issue.get("assignee"):
+            issue["assignee"] = gh_issue["assignee"]["login"]
+
+        # Add labels (filtered)
+        bd_labels = self.extract_labels(labels)
+        if bd_labels:
+            issue["labels"] = bd_labels
+
+        # Add closed timestamp if closed
+        if gh_issue.get("closed_at"):
+            issue["closed_at"] = gh_issue["closed_at"]
+
+        return issue
+
+    def add_dependencies(self):
+        """Add dependencies based on issue references in body text."""
+        for gh_issue_data in getattr(self, '_gh_issues', []):
+            gh_id = gh_issue_data["number"]
+            bd_id = self.gh_id_to_bd_id.get(gh_id)
+
+            if not bd_id:
+                continue
+
+            body = gh_issue_data.get("body") or ""
+            referenced_gh_ids = self.extract_dependencies_from_body(body)
+
+            dependencies = []
+            for ref_gh_id in referenced_gh_ids:
+                ref_bd_id = self.gh_id_to_bd_id.get(ref_gh_id)
+                if ref_bd_id:
+                    dependencies.append({
+                        "issue_id": "",
+                        "depends_on_id": ref_bd_id,
+                        "type": "related"
+                    })
+
+            # Find the bd issue and add dependencies
+            if dependencies:
+                for issue in self.issues:
+                    if issue["id"] == bd_id:
+                        issue["dependencies"] = dependencies
+                        break
+
+    def convert(self, gh_issues: List[Dict[str, Any]]):
+        """Convert all GitHub issues to bd format."""
+        # Store for dependency processing
+        self._gh_issues = gh_issues
+
+        # Sort by issue number for consistent ID assignment
+        sorted_issues = sorted(gh_issues, key=lambda x: x["number"])
+
+        # Convert each issue
+        for gh_issue in sorted_issues:
+            bd_issue = self.convert_issue(gh_issue)
+            self.issues.append(bd_issue)
+
+        # Add cross-references
+        self.add_dependencies()
+
+        print(
+            f"Converted {len(self.issues)} issues. Mapping: GH #{min(self.gh_id_to_bd_id.keys())} -> {self.gh_id_to_bd_id[min(self.gh_id_to_bd_id.keys())]}",
+            file=sys.stderr
+        )
+
+    def to_jsonl(self) -> str:
+        """Convert issues to JSONL format."""
+        lines = []
+        for issue in self.issues:
+            lines.append(json.dumps(issue, ensure_ascii=False))
+        return '\n'.join(lines)
+
+
+def main():
+    """Main entry point."""
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description="Convert GitHub Issues to bd JSONL format",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # From GitHub API
+  export GITHUB_TOKEN=ghp_...
+  python gh2jsonl.py --repo owner/repo | bd import
+  
+  # From JSON file
+  python gh2jsonl.py --file issues.json > issues.jsonl
+  
+  # Fetch only open issues
+  python gh2jsonl.py --repo owner/repo --state open
+  
+  # Custom prefix and start ID
+  python gh2jsonl.py --repo owner/repo --prefix myproject --start-id 100
+        """
+    )
+
+    parser.add_argument(
+        "--repo",
+        help="GitHub repository (owner/repo)"
+    )
+    parser.add_argument(
+        "--file",
+        type=Path,
+        help="JSON file containing GitHub issues export"
+    )
+    parser.add_argument(
+        "--token",
+        help="GitHub personal access token (or set GITHUB_TOKEN env var)"
+    )
+    parser.add_argument(
+        "--state",
+        choices=["open", "closed", "all"],
+        default="all",
+        help="Issue state to fetch (default: all)"
+    )
+    parser.add_argument(
+        "--prefix",
+        default="bd",
+        help="Issue ID prefix (default: bd)"
+    )
+    parser.add_argument(
+        "--start-id",
+        type=int,
+        default=1,
+        help="Starting issue number (default: 1)"
+    )
+
+    args = parser.parse_args()
+
+    # Validate inputs
+    if not args.repo and not args.file:
+        parser.error("Either --repo or --file is required")
+
+    if args.repo and args.file:
+        parser.error("Cannot use both --repo and --file")
+
+    # Create converter
+    converter = GitHubToBeads(prefix=args.prefix, start_id=args.start_id)
+
+    # Load issues
+    if args.repo:
+        gh_issues = converter.fetch_from_api(args.repo, args.token, args.state)
+    else:
+        gh_issues = converter.parse_json_file(args.file)
+
+    if not gh_issues:
+        print("No issues found", file=sys.stderr)
+        sys.exit(0)
+
+    # Convert
+    converter.convert(gh_issues)
+
+    # Output JSONL
+    print(converter.to_jsonl())
+
+
+if __name__ == "__main__":
+    main()