Initial import of watchdog-discord-route skill

2026-04-22 08:33:51 +08:00
commit 8138fb011d
22 changed files with 2447 additions and 0 deletions
--- a/references/owner-report-operator-manual.md
+++ b/references/owner-report-operator-manual.md
@@ -0,0 +1,242 @@
+# Owner Report Operator Manual
+
+## Purpose
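+The owner-report system turns every checkpoint that requires a proactive report into a notification chain that is observable, verifiable, and re-sendable.
+
+It is a good fit for:
+- long-running, multi-step tasks
+- progress reports from ClawTeam, workers, and background processes
+- situations where a verbal promise to "report back later" is not enough
+
+It does not aim to be a complex event platform; the goal is to be **simple, reliable, and never report a failure as a success**.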
+
+## Architecture
+Minimal chain:
+
+General queue drain:
+`producer -> pending/*.md -> watchdog -> driver -> Discord -> sent/*.md`
+
+watchdog-b single-shot direct send:
+`producer -> pending/*.md -> driver -> Discord channel:1480577550445969541 -> sent/*.md`
+
+Components:
+- `scripts/owner_report_producer.py`
+  - writes explicit checkpoint fields to `~/.clawteam/owner-reports/pending/<report_id>.md`
+- `scripts/owner_report_consumer.py`
+  - reads a pending report and converts it to standard JSON
+- `scripts/owner_report_driver.py`
+  - calls the external send command; **moves the report to `sent/` only after a successful send**
+- `scripts/owner_report_watchdog.py`
+  - scans pending once per invocation; by default handles 1 report, oldest first
+- `scripts/run_owner_report_watchdog.sh`
+  - local wrapper that pins the send command / target / max-count, for use from cron
+
+Directories:
+- pending: `~/.clawteam/owner-reports/pending/`
+- sent: `~/.clawteam/owner-reports/sent/`
+- cron log: `/opt/workspace_auditing_report/logs/owner_report_watchdog_cron.out`
+
+## When to use
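+Use it for:
+- multi-step tasks that span time
+- work with clear checkpoints / status changes
+- work that uses ClawTeam, subagents, watchdog, or cron
+- tasks where Eric has explicitly asked that no report be missed
+
+Do not use it for:
+- short Q&A that is finished in a single reply
+- small edits and low-risk single-step operations
+- ordinary work that needs no proactive notification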
+
+## Normal flow
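+1. A task reaches a checkpoint worth reporting
+2. `owner_report_producer.py` creates one pending report
+3. cron runs `run_owner_report_watchdog.sh` every minute
+4. The watchdog picks the oldest report from `pending/`
+5. The driver actually sends it using `OWNER_REPORT_SEND_CMD`
+6. On success: the report moves to `sent/`
+7. On failure: the report stays in `pending/` and this run stops
+
+### Failure semantics
+- **A failed send is never archived**
+- **The backlog is processed oldest-first**
+- **Stop at the first failure**, so that later successes are not mistaken for the whole chain being healthy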
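
The three failure-semantics rules above can be sketched as a single drain loop. This is a minimal illustration, not the actual `owner_report_watchdog.py`: the function name `drain_pending`, the `send` callback, and the use of file modification time to define "oldest" are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Callable


def drain_pending(
    pending: Path,
    sent: Path,
    send: Callable[[Path], bool],
    max_count: int = 1,
) -> list[str]:
    """Hypothetical drain loop: oldest-first, stop on failure, archive on success only."""
    sent.mkdir(parents=True, exist_ok=True)
    # Assumption: "oldest" means smallest mtime; the file name breaks ties.
    queue = sorted(pending.glob("*.md"), key=lambda p: (p.stat().st_mtime, p.name))
    archived: list[str] = []
    for report in queue[:max_count]:
        if not send(report):
            # A failed send is never archived, and nothing behind it is attempted.
            break
        shutil.move(str(report), str(sent / report.name))
        archived.append(report.name)
    return archived
```

Because the loop breaks on the first failure, a report that keeps failing blocks everything queued behind it, which is the intended trade-off: the queue never looks healthier than it is.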
+
+## Common commands
+### 1) Inspect the queue
+```bash
+ls -l ~/.clawteam/owner-reports/pending
+ls -l ~/.clawteam/owner-reports/sent
+```
+
+### 2) Manually produce a report
+```bash
+cd /home/chchang/.openclaw/workspace/owner-reporting-system/scripts
+uv run python owner_report_producer.py \
+  --team clawteam \
+  --worker backend-a \
+  --task-id example-task \
+  --progress 80% \
+  --done 'export complete' \
+  --next 'wait for aggregation' \
+  --status normal \
+  --source checkpoint-complete
+```
+
+### 3) producer dry-run
+```bash
+uv run python owner_report_producer.py \
+  --team clawteam \
+  --worker backend-a \
+  --task-id example-task \
+  --progress 80% \
+  --done 'export complete' \
+  --next 'wait for aggregation' \
+  --status normal \
+  --dry-run
+```
+
+### 4) watchdog dry-run
+```bash
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh --dry-run
+```
+
+### 5) Run the watchdog manually right now
+```bash
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh
+```
+
+### 6) Drain more of the backlog in one run
+```bash
+OWNER_REPORT_MAX_COUNT=20 \
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh
+```
+
+### 7) Check cron
+```bash
+crontab -l
+```
+
+### 8) View the watchdog log
+```bash
+tail -n 200 /opt/workspace_auditing_report/logs/owner_report_watchdog_cron.out
+```
+
+## Debugging checklist
+When a report "was not sent", check in this order:
+
+1. **Is there a file in pending?**
+   ```bash
+   ls -l ~/.clawteam/owner-reports/pending
+   ```
+2. **Does the report content look reasonable?**
+   ```bash
+   cd /home/chchang/.openclaw/workspace/owner-reporting-system/scripts
+   uv run python owner_report_consumer.py <report_id_or_path>
+   ```
+3. **Does the driver dry-run work?**
+   ```bash
+   uv run python owner_report_driver.py <report_id_or_path> --dry-run
+   ```
+4. **Which report does the watchdog dry-run pick?**
+   ```bash
+   ./run_owner_report_watchdog.sh --dry-run
+   ```
+5. **Does the cron entry exist?**
+   ```bash
+   crontab -l
+   ```
+6. **Does the cron log show errors?**
+   ```bash
+   tail -n 200 /opt/workspace_auditing_report/logs/owner_report_watchdog_cron.out
+   ```
+7. **Do Node and the openclaw entry exist?**
+   - `run_owner_report_watchdog.sh` checks:
+     - `NODE_BIN`
+     - `OPENCLAW_ENTRY`
+8. **Can the send command actually deliver?**
+   - both the wrapper and the watchdog-b direct send rely on `OWNER_REPORT_SEND_CMD` / `--send-cmd`
+   - the current owner-facing default target is Discord `channel:1480577550445969541`
+   - for a watchdog-b single-shot verification you can directly use `owner_report_driver.py <report_id> --send-cmd '...message send --channel discord --target '\''channel:1480577550445969541'\'' --message "$OWNER_REPORT_MESSAGE"'`
+9. **If the backlog is stuck**
+   - fix the oldest failing report first; the watchdog is oldest-first and stops at the first failure
+
+## Cron / watchdog behavior
+Current local setup:
+- schedule: `* * * * *`
+- command: `run_owner_report_watchdog.sh`
+- target: Discord `channel:1480577550445969541` by default
+- max backlog per run: `5` by default
+
+The wrapper:
+1. pins `OWNER_REPORT_SEND_CMD`
+2. pins the owner-facing channel/target (defaults: `OWNER_REPORT_CHANNEL=discord`, `OWNER_REPORT_TARGET=channel:1480577550445969541`; both can be overridden)
+3. calls `owner_report_watchdog.py --max-count "$OWNER_REPORT_MAX_COUNT"`
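
The wrapper's default handling can be sketched as follows. This is an illustrative Python model of a shell wrapper, assuming shell-style `${VAR:-default}` semantics where an empty variable counts as unset; the helper names `resolve_wrapper_settings` and `watchdog_argv`, and the forwarding of extra arguments, are hypothetical.

```python
from typing import Mapping, Sequence

# Defaults taken from the manual above; everything else here is illustrative.
DEFAULTS = {
    "OWNER_REPORT_CHANNEL": "discord",
    "OWNER_REPORT_TARGET": "channel:1480577550445969541",
    "OWNER_REPORT_MAX_COUNT": "5",
}


def resolve_wrapper_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Use the caller's value when set and non-empty, otherwise the default."""
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


def watchdog_argv(settings: Mapping[str, str], extra_args: Sequence[str] = ()) -> list[str]:
    """Build the watchdog invocation; passing extra_args through is an assumption."""
    return [
        "owner_report_watchdog.py",
        "--max-count",
        settings["OWNER_REPORT_MAX_COUNT"],
        *extra_args,
    ]
```

Under this model, `OWNER_REPORT_MAX_COUNT=20` in the caller's environment changes only the `--max-count` value, while the channel and target keep their defaults.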
+
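+Properties of the watchdog itself:
+- not a daemon
+- not resident
+- no retry / backoff
+- one scan pass per invocation
+- handles 1 report by default; the wrapper raises the default to 5
+- `--all` scans the entire current backlog
+
+## Caveats
+- This is not a message bus, nor a full job system
+- There is no built-in retry / dedupe / database
+- If the oldest pending report keeps failing, everything behind it is blocked
+- `sent/` means "the send command succeeded"; it does not guarantee that a human has read the message
+- The producer depends on explicit field input; it does not interpret arbitrary logs
+- cron / PATH / node version problems are mitigated as far as possible by having the wrapper pin Node and the OpenClaw entry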
+
+## Examples
+### Example A: general task checkpoint
+```bash
+cd /home/chchang/.openclaw/workspace/owner-reporting-system/scripts
+uv run python owner_report_producer.py \
+  --team general-task \
+  --worker alice \
+  --task-id manual-checkpoint-1 \
+  --progress 50% \
+  --done 'phase 1 complete' \
+  --next 'waiting for phase 2 results' \
+  --status normal \
+  --source manual-checkpoint
+```
+
+### Example B: verify only, send nothing
+```bash
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh --dry-run
+```
+
+### Example C: watchdog-b single-shot direct send to the owner-facing Discord
+First use the probe to locate the local runtime:
+```bash
+python3 /home/chchang/.openclaw/workspace/skills/watchdog-discord-route/scripts/openclaw_runtime_probe.py --pretty
+```
+
+Then pass in the detected `node` / `dist/entry.js`:
+```bash
+cd /home/chchang/.openclaw/workspace/owner-reporting-system/scripts
+python3 owner_report_driver.py <report_id> \
+  --send-cmd '"<detected-node>" "<detected-entry.js>" message send --channel discord --target '\''channel:1480577550445969541'\'' --message "$OWNER_REPORT_MESSAGE"'
+```
+
+### Example D: temporarily send to a different channel / target
+```bash
+OWNER_REPORT_CHANNEL=telegram \
+OWNER_REPORT_TARGET=123456789 \
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh
+```
+
+### Example E: quickly drain the backlog
+```bash
+OWNER_REPORT_MAX_COUNT=10 \
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh
+```
+
+## Operator summary
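+If you remember only three things:
+1. **A report must land in `pending/` before there is anything to send**
+2. **A report moves to `sent/` only after a successful send**
+3. **When a notification is missing, check pending, the cron log, and the oldest failing report first**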
--- a/references/owner-reporting-system.md
+++ b/references/owner-reporting-system.md
@@ -0,0 +1,80 @@
+# Owner Reporting System
+
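+This is a global, owner-facing proactive reporting flow; it does not belong to any single project.
+
+Its purpose is to turn long-running, multi-step work whose reports must not be missed into a notification chain that is observable, verifiable, and never reports a failure as a success.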
+
+## Core flow
+
+General queue drain path:
+`producer -> pending/*.md -> watchdog -> driver -> Discord -> sent/*.md`
+
+Watchdog-b single-shot direct path:
+`producer -> pending/*.md -> driver -> Discord channel:1480577550445969541 -> sent/*.md`
+
+Components:
+- `scripts/owner_report_producer.py`
+- `scripts/owner_report_consumer.py`
+- `scripts/owner_report_driver.py`
+- `scripts/owner_report_watchdog.py`
+- `scripts/run_owner_report_watchdog.sh`
+- `OWNER_REPORT_OPERATOR_MANUAL.md`
+
+## Scope
+
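+Applies to:
+- ClawTeam / subagent / background-process checkpoints
+- multi-step technical tasks
+- assignments with an explicit requirement that no report be missed
+- notification chains that need oldest-first / success-only archive / stop-on-failure semantics
+
+Does not apply to:
+- one-off short Q&A
+- small edits that need no proactive notification
+- low-risk tasks that are finished in a single reply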
+
+## Queue paths
+
+- pending: `~/.clawteam/owner-reports/pending/`
+- sent: `~/.clawteam/owner-reports/sent/`
+
+## Local integration
+
+On this machine the user crontab currently runs the watchdog wrapper once per minute:
+
+- wrapper: `/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh`
+- log: `/opt/workspace_auditing_report/logs/owner_report_watchdog_cron.out`
+- default target: `OWNER_REPORT_CHANNEL=discord` + `OWNER_REPORT_TARGET=channel:1480577550445969541`
+- backlog per run: `OWNER_REPORT_MAX_COUNT=5` by default
+
+In addition, watchdog-b owner-facing single-shot verification can now go directly through `owner_report_driver.py`, without relying on the wrapper watchdog's target / display semantics.
+
+## Common commands
+
+```bash
+# produce one checkpoint report
+cd /home/chchang/.openclaw/workspace/owner-reporting-system/scripts
+uv run python owner_report_producer.py \
+  --team general-task \
+  --worker alice \
+  --task-id example-task \
+  --progress 50% \
+  --done 'phase 1 complete' \
+  --next 'waiting for phase 2 results' \
+  --status normal \
+  --source manual-checkpoint
+
+# dry-run watchdog
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh --dry-run
+
+# process backlog immediately
+OWNER_REPORT_MAX_COUNT=20 \
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh
+
+# temporarily override destination
+OWNER_REPORT_CHANNEL=telegram \
+OWNER_REPORT_TARGET=864811879 \
+/home/chchang/.openclaw/workspace/owner-reporting-system/scripts/run_owner_report_watchdog.sh --dry-run
+```
+
+For fuller operating, debugging, and failure-semantics details, see `OWNER_REPORT_OPERATOR_MANUAL.md`.
--- a/references/watchdog-b-readme.md
+++ b/references/watchdog-b-readme.md
@@ -0,0 +1,192 @@
+# Watchdog B v3 notification layer
+
+Single source of truth for owner-facing policy: `~/.config/openclaw/watchdog-b.env`
+
+Runtime auto-detection source: `scripts/openclaw_runtime_probe.py`
+
+This directory now contains:
+
+- `check_openclaw_state.sh` — tri-state checker (`running` / `stalled` / `idle`)
+- `run_watchdog_b.sh` — dispatcher + notification runner
+- `notify_watchdog_b.py` — minimal notification integration layer
+
+## Configuration source
+
+Priority order is now:
+1. process env already set by caller/systemd
+2. `WATCHDOG_B_CONFIG_FILE` if set
+3. fallback `~/.config/openclaw/watchdog-b.env`
+4. code defaults
+
+This means both `run_watchdog_b.sh` and `notify_watchdog_b.py` can be invoked manually and still resolve the same owner-facing channel / target / mode / wording.
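
The priority order above can be sketched as a layered lookup. This is a minimal illustration, not the bundled implementation: `resolve_config` and `parse_env_file` are hypothetical names, the env-file parser handles only plain `KEY=value` lines, and the sketch assumes the fallback file is read only when `WATCHDOG_B_CONFIG_FILE` is unset.

```python
from pathlib import Path
from typing import Mapping

# Code defaults named in this README; a real script would carry more keys.
CODE_DEFAULTS = {
    "WATCHDOG_B_RUNNING_REPORT_MODE": "manual",
    "WATCHDOG_B_OWNER_DELIVERY_MODE": "enqueue-only",
}
FALLBACK_FILE = Path.home() / ".config" / "openclaw" / "watchdog-b.env"


def parse_env_file(path: Path) -> dict[str, str]:
    """Read simple KEY=value lines; blank lines and '#' comments are skipped."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(
    process_env: Mapping[str, str],
    fallback_file: Path = FALLBACK_FILE,
) -> dict[str, str]:
    """Layer the sources so that higher-priority ones overwrite lower ones."""
    explicit = process_env.get("WATCHDOG_B_CONFIG_FILE")
    config_file = Path(explicit) if explicit else fallback_file
    resolved = dict(CODE_DEFAULTS)                    # 4. code defaults
    resolved.update(parse_env_file(config_file))      # 2./3. config file
    resolved.update(                                  # 1. process env wins
        {k: v for k, v in process_env.items() if k.startswith("WATCHDOG_B_")}
    )
    return resolved
```

Applying the layers from lowest to highest priority keeps the precedence rule in one place, so a manual invocation and a systemd invocation resolve the same values from the same file.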
+
+For Node/OpenClaw runtime paths, the bundled scripts now resolve in this order:
+1. explicit env overrides: `WATCHDOG_B_NODE_BIN`, `WATCHDOG_B_OPENCLAW_MJS`, `WATCHDOG_B_OPENCLAW_ENTRY`
+2. PATH lookup for `node` and `openclaw`
+3. common install roots scan: nvm, pnpm global, npm-global, `/usr/local`, `/usr`, Volta-style trees
+4. fail with an operator-facing error that tells you which env vars to set manually
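
The runtime lookup order above might look roughly like this for the Node binary alone. It is a sketch under assumptions, not `openclaw_runtime_probe.py` itself: `resolve_node_bin` and its `install_roots` parameter are hypothetical, and the `<root>/bin/node` layout is assumed for the install-root scan.

```python
import os
import shutil
from typing import Mapping, Sequence


def resolve_node_bin(env: Mapping[str, str], install_roots: Sequence[str] = ()) -> str:
    """Resolve node via: explicit override, PATH lookup, install-root scan, error."""
    # 1. An explicit override wins without further checks.
    override = env.get("WATCHDOG_B_NODE_BIN")
    if override:
        return override
    # 2. PATH lookup (falls back to the process PATH when env has no PATH key).
    found = shutil.which("node", path=env.get("PATH"))
    if found:
        return found
    # 3. Scan candidate install roots; the bin/node layout is an assumption.
    for root in install_roots:
        candidate = os.path.join(root, "bin", "node")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # 4. Fail with a message that names the variable the operator should set.
    raise RuntimeError("node not found; set WATCHDOG_B_NODE_BIN to the node binary path")
```

The same four-step shape would apply to `WATCHDOG_B_OPENCLAW_MJS` and `WATCHDOG_B_OPENCLAW_ENTRY`, with a different file name in step 3.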
+
+Repo template to copy from:
+- `ops/systemd/user/watchdog-b.env.example`
+
+Install example:
+```bash
+mkdir -p ~/.config/openclaw
+cp ~/.openclaw/workspace/ops/systemd/user/watchdog-b.env.example ~/.config/openclaw/watchdog-b.env
+$EDITOR ~/.config/openclaw/watchdog-b.env
+```
+
+## Notification strategy
+
+### 1) `running`
+Default: **manual / queue-ready only**.
+
+Why:
+- a healthy runtime every 10 minutes should not spam Eric
+- owner-facing reporting should remain explicit and auditable
+
+Behavior:
+- `WATCHDOG_B_RUNNING_REPORT_MODE=manual` (default)
+  - does not create external messages
+  - returns a concrete hint for how to enable queue creation
+- `WATCHDOG_B_RUNNING_REPORT_MODE=enqueue`
+  - creates a real pending owner report in `~/.clawteam/owner-reports/pending/`
+  - does **not** auto-send it
+- `WATCHDOG_B_RUNNING_REPORT_MODE=enqueue-and-drain`
+  - creates a pending owner report and immediately delivers that exact pending item through `owner_report_driver.py`
+  - send path is direct OpenClaw Discord send using the env-configured owner-facing destination
+  - this keeps queue/audit semantics but avoids depending on the wrapper watchdog's destination/visibility behavior
+
+Throttle:
+- `WATCHDOG_B_RUNNING_REPORT_MIN_INTERVAL_SECONDS` default `3600`
+
+### 2) `stalled`
+Default: **nudge main agent first**, then escalate to Eric only after repetition.
+
+Behavior:
+- call internal OpenClaw agent route:
+  - `node .../openclaw.mjs agent --agent main --message ...`
+- maintain local notify state under `state/watchdog-b/notify-state.json`
+- after repeated observations (`WATCHDOG_B_STALLED_OWNER_ESCALATION_AFTER`, default `2`), enqueue an owner report
+
+Throttle:
+- `WATCHDOG_B_STALLED_NUDGE_MIN_INTERVAL_SECONDS` default `900`
+
+Owner escalation mode:
+- `WATCHDOG_B_STALLED_OWNER_MODE=escalate` (default)
+- `WATCHDOG_B_STALLED_OWNER_MODE=always`
+- `WATCHDOG_B_STALLED_OWNER_MODE=never`
+
+Owner delivery mode after enqueue:
+- `WATCHDOG_B_OWNER_DELIVERY_MODE=enqueue-only` (default)
+- `WATCHDOG_B_OWNER_DELIVERY_MODE=direct-discord`
+
+When `direct-discord` is enabled, watchdog-b still enqueues first, then directly delivers that same pending report via `owner_report_driver.py` to the env-configured Discord target.
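
The stalled-state policy (throttled nudge first, owner escalation after repetition) can be sketched as a small decision function. The state fields, the `decide_stalled` name, and the choice of `>=` for the escalation threshold are assumptions for illustration; the real `notify-state.json` schema is not shown in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StalledState:
    """Hypothetical in-memory stand-in for the persisted notify state."""
    consecutive_observations: int = 0
    last_nudge_at: Optional[float] = None


def decide_stalled(
    state: StalledState,
    now: float,
    nudge_min_interval: float = 900,   # WATCHDOG_B_STALLED_NUDGE_MIN_INTERVAL_SECONDS
    escalation_after: int = 2,         # WATCHDOG_B_STALLED_OWNER_ESCALATION_AFTER
    owner_mode: str = "escalate",      # WATCHDOG_B_STALLED_OWNER_MODE
) -> dict[str, bool]:
    """Record one stalled observation and decide which notifications fire."""
    state.consecutive_observations += 1
    # Nudge the main agent unless the previous nudge is still inside the throttle window.
    nudge = state.last_nudge_at is None or now - state.last_nudge_at >= nudge_min_interval
    if nudge:
        state.last_nudge_at = now
    if owner_mode == "always":
        enqueue = True
    elif owner_mode == "never":
        enqueue = False
    else:
        # Assumption: escalate once the observation count reaches the threshold.
        enqueue = state.consecutive_observations >= escalation_after
    return {"nudge_main_agent": nudge, "enqueue_owner_report": enqueue}
```

With the defaults, a first stalled observation nudges the main agent only; a second one ten minutes later is throttled as a nudge but enqueues an owner report.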
+
+## Owner-facing message style
+
+Owner-facing Discord message is now compact and conclusion-first:
+- first line: headline (`🔔 [watchdog-b] <worker>`)
+- second line: concise conclusion with emoji, e.g. `✅ 主程序仍在運行` ("main process is still running")
+- third line: actionable next step, prefixed with `→`
+- last line: compact technical metadata (`task=... | status=... | progress=... | source=...`)
+
+Style knobs live in `watchdog-b.env`:
+- `WATCHDOG_B_RUNNING_EMOJI`, `WATCHDOG_B_RUNNING_SUMMARY`
+- `WATCHDOG_B_STALLED_EMOJI`, `WATCHDOG_B_STALLED_SUMMARY`
+- `WATCHDOG_B_IDLE_EMOJI`, `WATCHDOG_B_IDLE_SUMMARY`
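
The four-line layout described above could be rendered along these lines. The function name `render_owner_message` and its parameters are hypothetical; only the line order and the literal prefixes come from this README.

```python
def render_owner_message(
    worker: str,
    emoji: str,
    summary: str,
    next_step: str,
    task: str,
    status: str,
    progress: str,
    source: str,
) -> str:
    """Build the compact, conclusion-first owner message as four lines."""
    return "\n".join(
        [
            f"🔔 [watchdog-b] {worker}",   # headline
            f"{emoji} {summary}",          # concise conclusion
            f"→ {next_step}",              # actionable next step
            f"task={task} | status={status} | progress={progress} | source={source}",
        ]
    )
```

In this sketch the emoji and summary would come from the per-state style knobs, for example `WATCHDOG_B_RUNNING_EMOJI` and `WATCHDOG_B_RUNNING_SUMMARY` for the `running` state.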
+
+### 3) `idle`
+Default: same pattern as stalled, but slower.
+
+Behavior:
+- nudge main agent first
+- only escalate to Eric after repeated idle detections
+
+Throttle:
+- `WATCHDOG_B_IDLE_NUDGE_MIN_INTERVAL_SECONDS` default `1800`
+
+Owner escalation threshold:
+- `WATCHDOG_B_IDLE_OWNER_ESCALATION_AFTER` default `2`
+
+## Safety defaults
+
+- `WATCHDOG_B_NOTIFY_DRY_RUN=1` by default in `run_watchdog_b.sh`
+- owner-facing send path keeps the existing `owner-reporting-system` queue/artifact/audit flow, but direct Discord delivery no longer depends on the cron wrapper's default destination semantics
+- local state tracks last send time / count to reduce spam
+- a failed notifier does not crash the dispatcher; it emits warning + preserves artifacts
+
+## Key artifacts
+
+Under `state/watchdog-b/`:
+
+- `last-output.txt` — rendered dispatcher output
+- `last-notify-output.txt` — notifier JSON result
+- `last-state.txt` — last state
+- `history.tsv` — state history
+- `notify-state.json` — throttle / repetition tracking
+
+## Manual test examples
+
+### Dry-run running path
+```bash
+WATCHDOG_B_NOTIFY_DRY_RUN=1 \
+WATCHDOG_B_RUNNING_REPORT_MODE=manual \
+./scripts/watchdog-b/notify_watchdog_b.py --state running --dry-run
+```
+
+### Real queue creation for running (no send)
+```bash
+WATCHDOG_B_NOTIFY_DRY_RUN=0 \
+WATCHDOG_B_RUNNING_REPORT_MODE=enqueue \
+./scripts/watchdog-b/notify_watchdog_b.py --state running
+ls -l ~/.clawteam/owner-reports/pending
+```
+
+### Runtime probe only
+```bash
+python3 ./scripts/watchdog-b/openclaw_runtime_probe.py --pretty
+```
+
+### Single-shot enqueue + direct Discord delivery
+```bash
+WATCHDOG_B_NOTIFY_DRY_RUN=0 \
+WATCHDOG_B_RUNNING_REPORT_MODE=enqueue-and-drain \
+./scripts/watchdog-b/notify_watchdog_b.py --state running
+```
+
+### Dry-run stalled nudge
+```bash
+WATCHDOG_B_NOTIFY_DRY_RUN=1 \
+./scripts/watchdog-b/notify_watchdog_b.py --state stalled --dry-run
+```
+
+If runtime auto-detection fails on a host with a custom install layout, set one or more of:
+- `WATCHDOG_B_NODE_BIN`
+- `WATCHDOG_B_OPENCLAW_MJS`
+- `WATCHDOG_B_OPENCLAW_ENTRY`
+
+### Full dispatcher dry-run with fixture overrides
+```bash
+OPENCLAW_PID_FILE=$PWD/tests/fixtures/watchdog-b/running/host-runtime/openclaw.pid \
+OPENCLAW_LOG_FILE=$PWD/tests/fixtures/watchdog-b/running/logs/openclaw.log \
+WATCHDOG_B_ARTIFACT_DIR=$PWD/state/watchdog-b-test-running-v3 \
+WATCHDOG_B_NOTIFY_DRY_RUN=1 \
+./scripts/watchdog-b/run_watchdog_b.sh
+```
+
+## What is truly wired vs not
+
+### Truly wired now
+- state detection
+- notifier invocation from dispatcher
+- main-agent internal nudge command construction and execution path
+- owner-report queue creation via existing producer
+- optional direct delivery of an enqueued owner report through `owner_report_driver.py`
+- throttling / repetition state persisted locally
+
+### Still conditional / not claimed as universally proven
+- successful main-agent wake-up depends on local OpenClaw CLI/runtime being callable from this environment
+- successful owner-facing delivery still depends on valid local OpenClaw Discord routing on the host
+- direct watchdog-b owner delivery now targets `channel:1480577550445969541` by default and bypasses the wrapper watchdog destination logic
+- the cron wrapper remains available for generic queue draining, but watchdog-b no longer needs to rely on it for single-shot owner-facing verification