feat(deacon): add dog-health-check step to patrol

Adds supervision for dispatched dogs that may get stuck.

The new step (between dog-pool-maintenance and orphan-check):
- Lists dogs in "working" state
- Checks work duration vs plugin timeout (default 10m)
- Decision matrix based on how long overdue:
  - < 2x timeout: log warning, check next cycle
  - 2x-5x timeout: file death warrant
  - > 5x timeout: force clear + escalate to Mayor
- Tracks chronic failures for repeat offenders

This closes the supervision gap where dogs could hang forever
after being dispatched via `gt dog dispatch --plugin`.

Closes: gt-s4dp3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
gus
2026-01-12 18:12:28 -08:00
committed by Steve Yegge
parent 1902182f3a
commit 6b2a7438e1
2 changed files with 130 additions and 2 deletions

View File

@@ -499,10 +499,74 @@ gt dog status <name>
**Exit criteria:** Pool has at least 1 idle dog."""
[[steps]]
id = "dog-health-check"
title = "Check for stuck dogs"
needs = ["dog-pool-maintenance"]
description = """
Check for dogs that have been working too long (stuck).
Dogs dispatched via `gt dog dispatch --plugin` are marked as "working" with
a work description like "plugin:rebuild-gt". If a dog hangs, crashes, or
takes too long, it needs intervention.
**Step 1: List working dogs**
```bash
gt dog list --json
# Filter for state: "working"
```
**Step 2: Check work duration**
For each working dog:
```bash
gt dog status <name> --json
# Check: work_started_at, current_work
```
Compare against timeout:
- If plugin has [execution] timeout in plugin.md, use that
- Default timeout: 10 minutes for infrastructure tasks
**Duration calculation:**
```
stuck_threshold = plugin_timeout or 10m
duration = now - work_started_at
is_stuck = duration > stuck_threshold
```
**Step 3: Handle stuck dogs**
For dogs working > timeout:
```bash
# Option A: File death warrant (Boot handles termination)
gt warrant file deacon/dogs/<name> --reason "Stuck: working on <work> for <duration>"
# Option B: Force clear work and notify
gt dog clear <name> --force
gt mail send deacon/ -s "DOG_TIMEOUT <name>" -m "Dog <name> timed out on <work> after <duration>"
```
**Decision matrix:**
| Duration over timeout | Action |
|----------------------|--------|
| < 2x timeout | Log warning, check next cycle |
| 2x - 5x timeout | File death warrant |
| > 5x timeout | Force clear + escalate to Mayor |
**Step 4: Track chronic failures**
If same dog gets stuck repeatedly:
```bash
gt mail send mayor/ -s "Dog <name> chronic failures" \
-m "Dog has timed out N times in last 24h. Consider removing from pool."
```
**Exit criteria:** All stuck dogs handled (warrant filed or cleared)."""
[[steps]]
id = "orphan-check"
title = "Detect abandoned work"
needs = ["dog-pool-maintenance"]
needs = ["dog-health-check"]
description = """
**DETECT ONLY** - Check for orphaned state and dispatch to dog if found.

View File

@@ -499,10 +499,74 @@ gt dog status <name>
**Exit criteria:** Pool has at least 1 idle dog."""
[[steps]]
id = "dog-health-check"
title = "Check for stuck dogs"
needs = ["dog-pool-maintenance"]
description = """
Check for dogs that have been working too long (stuck).
Dogs dispatched via `gt dog dispatch --plugin` are marked as "working" with
a work description like "plugin:rebuild-gt". If a dog hangs, crashes, or
takes too long, it needs intervention.
**Step 1: List working dogs**
```bash
gt dog list --json
# Filter for state: "working"
```
**Step 2: Check work duration**
For each working dog:
```bash
gt dog status <name> --json
# Check: work_started_at, current_work
```
Compare against timeout:
- If plugin has [execution] timeout in plugin.md, use that
- Default timeout: 10 minutes for infrastructure tasks
**Duration calculation:**
```
stuck_threshold = plugin_timeout or 10m
duration = now - work_started_at
is_stuck = duration > stuck_threshold
```
**Step 3: Handle stuck dogs**
For dogs working > timeout:
```bash
# Option A: File death warrant (Boot handles termination)
gt warrant file deacon/dogs/<name> --reason "Stuck: working on <work> for <duration>"
# Option B: Force clear work and notify
gt dog clear <name> --force
gt mail send deacon/ -s "DOG_TIMEOUT <name>" -m "Dog <name> timed out on <work> after <duration>"
```
**Decision matrix:**
| Duration over timeout | Action |
|----------------------|--------|
| < 2x timeout | Log warning, check next cycle |
| 2x - 5x timeout | File death warrant |
| > 5x timeout | Force clear + escalate to Mayor |
**Step 4: Track chronic failures**
If same dog gets stuck repeatedly:
```bash
gt mail send mayor/ -s "Dog <name> chronic failures" \
-m "Dog has timed out N times in last 24h. Consider removing from pool."
```
**Exit criteria:** All stuck dogs handled (warrant filed or cleared)."""
[[steps]]
id = "orphan-check"
title = "Detect abandoned work"
needs = ["dog-pool-maintenance"]
needs = ["dog-health-check"]
description = """
**DETECT ONLY** - Check for orphaned state and dispatch to dog if found.