Incident Response
name: incident-response
by chunhualiao · published 2026-03-22
$ claw add gh:chunhualiao/chunhualiao-incident-response
---
name: incident-response
version: 1.0.0
description: Structured incident response for OpenClaw system failures. Use when a user reports something broken, missing, changed, or misbehaving — config loss, agent routing failures, binding changes, gateway crashes, missing settings, or any system regression. Follows a strict 7-phase loop: Triage → Evidence → 5 Whys → Restore → Prevent → Monitor → Document. Triggers on: "investigate", "why did X stop working", "something changed", "bindings lost", "gateway down", "gateway crashed", "setting disappeared", "something disappeared", "fix this", "who changed X", "root cause", "audit", "misconfigured", "agent not responding".
owner_agent: any agent with exec access to the affected system
---
# Incident Response
Seven phases, in order. Never skip. Never assume — follow the evidence.
**Outputs produced by this skill:**
## Phase 0: Triage (2 min)
**Check current state FIRST before investigating history.**
```bash
# Is it actually broken right now?
openclaw status
ssh "<remote-host>" "launchctl list | grep openclaw"
# Test with correct protocol (check source: HTTP vs HTTPS?)
```
If currently working → report "recovered, investigating cause." If still broken → proceed.
## Phase 1: Evidence Collection
Gather hard evidence from four sources:
### 1a. Config backups timeline
```bash
# See binding/setting counts over time
ssh "<remote-host>" "python3 << 'EOF'
import datetime, glob, json, os

# glob does not expand '~', so expand it explicitly
pattern = os.path.expanduser('~/.openclaw/config-backups/openclaw-*.json')
for f in sorted(glob.glob(pattern), key=os.path.getmtime):
    with open(f) as fh:
        d = json.load(fh)
    dt = datetime.datetime.fromtimestamp(os.path.getmtime(f)).strftime('%Y-%m-%d %H:%M')
    # Customize: bindings, agents, channels, etc.
    count = len(d.get('bindings', []))
    ids = [b.get('agentId') for b in d.get('bindings', [])]
    print(f'{dt} [{count}] {ids}')
EOF"
```
### 1b. Git audit trail
```bash
ssh "<remote-host>" "cd ~/.openclaw && git log --oneline -20"
ssh "<remote-host>" "cd ~/.openclaw && git diff <commit-a> <commit-b> -- openclaw.json | grep '^[+-]' | grep -v '^---\|^+++'"
```
### 1c. Session logs (who did what)
```bash
# Find sessions that touched the broken config key
# (-l lists matching files; in rg, -r means --replace, not recursive)
ssh "<remote-host>" "rg -l 'keyword' ~/.openclaw/agents/*/sessions/*.jsonl | head -5"
# Extract tool calls from a session
ssh "<remote-host>" "python3 << 'EOF'
import json

with open('SESSION.jsonl') as fh:
    for line in fh:
        obj = json.loads(line)
        if obj.get('type') != 'message':
            continue
        for block in obj.get('message', {}).get('content', []):
            if block.get('type') == 'toolCall' and block.get('name') in ['Write', 'Edit', 'gateway', 'exec']:
                print(obj['timestamp'], block['name'], str(block.get('input', ''))[:200])
EOF"
```
### 1d. Config backup diff (find the exact moment of change)
```bash
# Compare before/after a suspicious backup
python3 -c "
import json
a = json.load(open('backup-before.json'))
b = json.load(open('backup-after.json'))
# Compare specific field
print('Before:', a.get('bindings'))
print('After:', b.get('bindings'))
"
```
**Stop and document:** Who changed what, when, which session, which tool call.
## Phase 2: 5 Whys Analysis
Write each "why" as a statement of fact backed by evidence from Phase 1.
```
Why 1: [Symptom], e.g. "Bindings dropped from 17 to 1"
  Evidence: backup timestamp + count
Why 2: [Immediate cause], e.g. "A full config replacement was written at 09:38 PST"
  Evidence: backup mtime + content diff
Why 3: [Mechanism], e.g. "The agent wrote a new config from scratch, not from current config"
  Evidence: session log tool call + content
Why 4: [System gap], e.g. "config-validate.sh --merge had no guard against binding count drops"
  Evidence: script inspection showing no such check
Why 5: [Root cause], e.g. "No automated detection existed between when the config was written and the next user report"
  Evidence: no monitoring cron, no git at the time
```
**Rule:** Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.
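
One way to enforce the rule mechanically is to record each "why" as structured data and refuse to proceed while any entry lacks evidence. This is a minimal sketch; the `Why` class and `unsupported_whys` function are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Why:
    level: int      # 1 (symptom) through 5 (root cause)
    statement: str  # statement of fact
    evidence: str   # file, log entry, timestamp, or command output


def unsupported_whys(chain: list[Why]) -> list[int]:
    """Return the levels whose evidence field is empty or only whitespace."""
    return [w.level for w in chain if not w.evidence.strip()]


if __name__ == "__main__":
    chain = [
        Why(1, "Bindings dropped from 17 to 1", "backup timestamp + count"),
        Why(2, "A full config replacement was written", ""),
    ]
    missing = unsupported_whys(chain)
    if missing:
        print("Missing evidence for why level(s):", missing)
```

The check only confirms that something was cited; whether the citation actually supports the statement still needs a human or agent review.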
## Phase 3: Restore
Restore to last known-good state using backup timeline from Phase 1.
```bash
# Restore specific fields (always merge, never replace)
PATCH=$(python3 -c "
import json
good = json.load(open('/path/to/good-backup.json'))
patch = {'bindings': good['bindings']}  # customize field
print(json.dumps(patch))
")
echo "$PATCH" | ssh "<remote-host>" "~/.openclaw/scripts/config-validate.sh --merge"
# Restart gateway
ssh "<remote-host>" "launchctl stop ai.openclaw.gateway && sleep 2 && launchctl start ai.openclaw.gateway"
ssh "<remote-host>" "launchctl list | grep ai.openclaw.gateway"  # verify exit code 0
```
**Verify restore:** Check that the restored value matches the good backup. Re-run the user's original failing action.
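
The "always merge, never replace" rule can be illustrated with a short sketch: only the named fields are taken from the good backup, and every other key in the current config is left alone. This is an illustration of the principle, not the actual behavior of `config-validate.sh --merge`, which this document does not show; `merge_fields` is a hypothetical name.

```python
import copy


def merge_fields(current: dict, good_backup: dict, fields: list[str]) -> dict:
    """Return a copy of `current` with only `fields` taken from `good_backup`."""
    merged = copy.deepcopy(current)
    for name in fields:
        if name in good_backup:
            merged[name] = copy.deepcopy(good_backup[name])
    return merged


if __name__ == "__main__":
    # Hypothetical configs: the current one lost a binding but gained a newer setting.
    current = {"bindings": [{"agentId": "main"}], "channels": {"signal": True}}
    good = {"bindings": [{"agentId": "main"}, {"agentId": "ops"}], "channels": {"signal": False}}
    restored = merge_fields(current, good, ["bindings"])
    print(restored["channels"])       # unchanged: still the current value
    print(len(restored["bindings"]))  # restored from the good backup
```

Replacing the whole file with the backup would also roll back `channels`, silently undoing any legitimate change made after the backup was taken.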
## Phase 4: Prevention
Add guards proportional to the severity and recurrence risk. See `references/prevention-patterns.md` for full patterns. Quick reference:
**For config fields that must not decrease:**
Add guard to `config-validate.sh --merge` (see references for template)
**For agent behavior rules:**
Add to `~/.openclaw/agents/<id>/agent/SOUL.md` as a Hard Rule (HR-NNN)
**For recurring mistakes:**
Add to `~/.openclaw/learnings/rules.md` with category and date
**For schema validation gaps:**
Update `config-validate.sh` valid_keys list after verifying against DeepWiki
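
For the first pattern (fields that must not decrease), the guard logic might look like the sketch below. It is not the template from `references/prevention-patterns.md`; the function name `check_no_decrease` and the choice to compare list lengths are assumptions made for illustration.

```python
def check_no_decrease(current: dict, proposed: dict, fields: list[str]) -> list[str]:
    """Return one error message per list-valued field that would shrink."""
    errors = []
    for name in fields:
        old_n = len(current.get(name, []))
        new_n = len(proposed.get(name, []))
        if new_n < old_n:
            errors.append(f"{name}: count would drop from {old_n} to {new_n}")
    return errors


if __name__ == "__main__":
    current = {"bindings": [{}] * 17}
    proposed = {"bindings": [{}]}
    for err in check_no_decrease(current, proposed, ["bindings"]):
        print("REJECTED:", err)  # prints "REJECTED: bindings: count would drop from 17 to 1"
```

Here `proposed` is meant to be the full config as it would look after the merge, so a field missing from it counts as a drop to zero. A real guard would likely need an explicit override for intentional removals.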
Always commit prevention changes to git:
```bash
ssh "<remote-host>" "cd ~/.openclaw && git add -A && git commit -m 'prevention: <what was added> after <incident>'"
```
## Phase 5: Monitor
Set a recurring cron job that runs until the user confirms "good enough" (minimum 7 days; 30 days for recurring incidents).
Cron job structure:
- Schedule: every 24h (or every N hours for high-severity)
- Task: check specific metric → compare to baseline → if degraded: restore + 5-why → report
- Report channel: sessions_send to your preferred channel (Signal, Telegram, Discord)
- Auto-escalate: if same fix needed 3+ days in a row → upgrade prevention measure
- Termination: user explicitly says "stop monitoring" or N days without incident

See `references/cron-template.md` for the full cron job prompt template.
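
The auto-escalate and termination rules in the cron job structure reduce to two small decisions. The sketch below shows one possible reading of them; the function names and the list-of-booleans history format are hypothetical, and the real prompt template lives in `references/cron-template.md`.

```python
def should_escalate(fix_needed_by_day: list[bool], threshold: int = 3) -> bool:
    """True if the most recent `threshold` daily checks all needed the same fix."""
    if len(fix_needed_by_day) < threshold:
        return False
    return all(fix_needed_by_day[-threshold:])


def should_stop(user_said_stop: bool, clean_days: int, required_clean_days: int) -> bool:
    """True once the user ends monitoring or enough incident-free days have passed."""
    return user_said_stop or clean_days >= required_clean_days


if __name__ == "__main__":
    history = [False, True, True, True]  # oldest first; True = fix was re-applied that day
    if should_escalate(history):
        print("Same fix needed 3 days in a row: upgrade the prevention measure")
    if should_stop(user_said_stop=False, clean_days=7, required_clean_days=7):
        print("Monitoring window complete")
```

`required_clean_days` corresponds to the N in the termination rule: 7 by default, or 30 for recurring incidents.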
## Phase 6: Document
Write to `~/.openclaw/learnings/rules.md` if a Hard Rule should be added.
Update `MEMORY.md` with incident summary if it's systemic.
---
## Configuration
No persistent configuration required. Adapt the following to your environment:
| Variable | Description | Example |
|----------|-------------|---------|
| Remote host | SSH target for remote investigations | `<remote-host>` → your Titan/server hostname |
| Config backup path | Where OpenClaw stores automatic config backups | `~/.openclaw/config-backups/` |
| Session key | Your messaging session key for cron reports | `agent:main-signal:signal:<your-number>` |
| Learnings path | Where rules are persisted | `~/.openclaw/learnings/rules.md` |
See `references/cron-template.md` for full cron report configuration.
---
## Quick Diagnosis Checklists
See `references/checklists.md` for the checklists.