Merge feat/T-027-node-verification into main

commit: 8f8e97f
parent: dca7129
author: im_wower
date: 2026-03-22 02:16:10 +0800 CST

Merge feat/T-027-node-verification into main

6 files changed, +732, -12

M coordination/tasks/T-027-node-verification.md

M docs/runtime/README.md

M docs/runtime/launchd.md

A docs/runtime/node-verification.md

A scripts/runtime/check-node.sh

M scripts/runtime/common.sh

M coordination/tasks/T-027-node-verification.md

+21, -8

 1@@ -1,10 +1,10 @@
 2 ---
 3 task_id: T-027
 4 title: launchd node verification and on-node checks
 5-status: todo
 6+status: review
 7 branch: feat/T-027-node-verification
 8 repo: /Users/george/code/baa-conductor
 9-base_ref: main
10+base_ref: main@6505a31
11 depends_on:
12   - T-019
13   - T-020
14@@ -12,7 +12,7 @@ depends_on:
15 write_scope:
16   - docs/runtime/**
17   - scripts/runtime/**
18-updated_at: 2026-03-22
19+updated_at: 2026-03-22T01:46:45+08:00
20 ---
21 
22 # T-027 launchd node verification and on-node checks
23@@ -62,20 +62,33 @@ updated_at: 2026-03-22
24 
25 ## files_changed
26 
27-- To be filled in
28+- `coordination/tasks/T-027-node-verification.md`
29+- `docs/runtime/README.md`
30+- `docs/runtime/launchd.md`
31+- `docs/runtime/node-verification.md`
32+- `scripts/runtime/common.sh`
33+- `scripts/runtime/check-node.sh`
34 
35 ## commands_run
36 
37-- To be filled in
38+- `npx --yes pnpm install`
39+- `bash -n scripts/runtime/*.sh`
40+- `git diff --check`
41+- `scripts/runtime/check-node.sh --help`
42+- `scripts/runtime/check-node.sh --repo-dir <tmp_repo> --node mini --service conductor --service status-api --skip-static-check --local-api-base http://127.0.0.1:4417 --status-api-base http://127.0.0.1:4418 --expected-rolez leader`
43 
44 ## result
45 
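46-- To be filled in
47+- Added `scripts/runtime/check-node.sh`, which extends node verification to the runtime layer: it reuses the runtime static checks and covers local ports, conductor `/healthz` `/readyz` `/rolez`, the status-api host process plus `/healthz` `/v1/status` `/v1/status/ui`, and the existence of launchd log files.
48+- `scripts/runtime/common.sh` gains the status-api default address, the default on-node service set, and a process-matching helper, so node conventions are not scattered across multiple scripts.
49+- `docs/runtime/README.md`, `docs/runtime/launchd.md`, and `docs/runtime/node-verification.md` now spell out the static check order for mini/mac, the on-node check order, and the steady-state `rolez` expectation of `mini=leader`, `mac=standby`.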
50 
51 ## risks
52 
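53-- To be filled in
54+- `--check-loaded` was not run on a real `mini` / `mac` node, so `launchctl print gui/...` / `launchctl print system/...` is only wired up at the script level and has not been verified on real hardware.
55+- `status-api /v1/status` is currently asserted by checking that the response body contains `"ok": true`; if the response later switches to a different JSON serialization style, the script may need a matching adjustment.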
56 
57 ## next_handoff
58 
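59-- To be filled in
60+- Run `scripts/runtime/check-node.sh --check-loaded ...` once on each real `mini` and `mac` node to confirm that the launchd domain, log files, ports, and probes all match the docs.
61+- If a node does not keep `status-api` resident, pass `--service conductor` explicitly when rolling out, and confirm whether the ops docs need to distinguish "conductor-only nodes" from "conductor + status-api nodes".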
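
The serialization risk noted for `status-api /v1/status` could be softened by tolerating whitespace around the colon. A minimal sketch, assuming a hypothetical helper `body_has_ok_true` that is not part of this commit. It is still a text match rather than a JSON parse, so a nested `"ok"` key would also satisfy it:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this commit): succeed when the body contains an
# "ok" key followed by the literal true, regardless of whitespace or line
# breaks around the colon.
body_has_ok_true() {
  local body="$1"

  # Flatten line breaks first so pretty-printed JSON is matched as well.
  printf '%s' "$body" | tr -d '\r\n' | grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}
```

With this shape, both `{"ok":true}` and `{"ok": true}` pass, while `{"ok": false}` fails.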

M docs/runtime/README.md

+7, -2

 1@@ -11,6 +11,7 @@
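 2 - [`layout.md`](./layout.md): how `runs/`, `worktrees/`, `logs/`, `tmp/`, and `state/` are initialized, and their lifecycle
 3 - [`environment.md`](./environment.md): environment variables that must be written explicitly under `launchd`, and how the install script overrides defaults
 4 - [`launchd.md`](./launchd.md): scripted install steps for `mini` and `mac`, and the differences between `LaunchAgents` and `LaunchDaemons`
 5+- [`node-verification.md`](./node-verification.md): on-node verification order for `mini` / `mac`, expected probe results, and log/process checkpoints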
 6 
 7 ## Shared conventions
 8 
 9@@ -23,14 +24,18 @@
10 
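11 1. First follow [`layout.md`](./layout.md) and run `./scripts/runtime/bootstrap.sh` to initialize the runtime root directory.
12 2. Then follow [`environment.md`](./environment.md) to prepare the shared and per-node variables, especially `BAA_SHARED_TOKEN`.
13-3. Follow [`launchd.md`](./launchd.md) and run `install-launchd.sh` to generate the installed copies, then use `check-launchd.sh` / `reload-launchd.sh` to validate and reload.
14-4. Each time before running the dist check of `check-launchd.sh` or an actual `launchctl bootstrap`, run `npx --yes pnpm -r build` once from the repo root to confirm that the target app's `dist/index.js` is up to date.
15+3. Follow [`launchd.md`](./launchd.md) and run `install-launchd.sh` to generate the installed copies, then run the static check with `check-launchd.sh` first.
16+4. Once real processes exist on the node, follow [`node-verification.md`](./node-verification.md) and run `check-node.sh` to check ports, probes, the status-api host process, and log paths.
17+5. Each time before running the dist check of `check-launchd.sh`, the process check of `check-node.sh`, or an actual `launchctl bootstrap`, run `npx --yes pnpm -r build` once from the repo root to confirm that the target app's `dist/index.js` is up to date.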
18 
19 ## Current script set
20 
21 - `scripts/runtime/bootstrap.sh`: pre-creates `state/`, `runs/`, `worktrees/`, `logs/launchd/`, and `tmp/`
22 - `scripts/runtime/install-launchd.sh`: renders the actual installed copies from `ops/launchd/*.plist`
23 - `scripts/runtime/check-launchd.sh`: validates the source templates, runtime directories, build artifacts, and installed plist copies
24+- `scripts/runtime/check-node.sh`: validates on-node runtime signals beyond the launchd copies, such as local ports, HTTP probes, the status-api host process, and launchd log files
25 - `scripts/runtime/reload-launchd.sh`: runs or dry-runs `launchctl bootout/bootstrap/kickstart`
26 
27 The default install/check/reload set contains only `conductor`. To bring other templates into the flow later, add `--service worker-runner` or `--service status-api` explicitly, or use `--all-services` directly.
28+
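29+`check-node.sh` has a different default set: it checks both `conductor` and `status-api` by default, because a real node verification must cover at least the local control plane and the status plane. If the node is not running `status-api` for now, narrow the set explicitly with `--service conductor`.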
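
The difference between the two default sets can be illustrated with a simplified, standalone sketch. The function name `resolve_check_set` and its `kind` argument are hypothetical; the real scripts resolve their sets through helpers in `common.sh`, and this only mirrors the documented behavior (explicit services win, otherwise the per-script default applies):

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the documented defaults, not code from the repo.
# Usage: resolve_check_set <script-kind> [service...]
resolve_check_set() {
  local kind="$1"
  shift

  # Explicit --service values always take precedence over the defaults.
  if [[ $# -gt 0 ]]; then
    printf '%s\n' "$@"
    return 0
  fi

  case "$kind" in
    check-node)
      # On-node verification covers the control plane and the status plane.
      printf '%s\n' conductor status-api
      ;;
    *)
      # install / check-launchd / reload default to conductor only.
      printf '%s\n' conductor
      ;;
  esac
}
```

For example, `resolve_check_set check-node` prints `conductor` and `status-api` on separate lines, while `resolve_check_set check-node conductor` prints only `conductor`.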

M docs/runtime/launchd.md

+52, -2

 1@@ -16,6 +16,11 @@ The repo keeps three source templates:
 2 
 3 This lets the services already wired up as CLI/runtime entry points run first, with the other templates added as needed.
 4 
 5+Note that the two groups of scripts use different default sets:
 6+
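 7+- `install-launchd.sh` / `check-launchd.sh` / `reload-launchd.sh` handle only `conductor` by default
 8+- `check-node.sh` handles both `conductor` and `status-api` by default, because node verification must cover at least the local control plane and the status plane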
 9+
10 All three plist source templates in the repo default to the canonical `mini` configuration:
11 
12 - repo root: `/Users/george/code/baa-conductor`
13@@ -90,7 +95,25 @@ AGENTS_DIR="$HOME/Library/LaunchAgents"
14   --install-dir "$AGENTS_DIR"
15 ```
16 
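17-### 5. Preview or run a reload
18+### 5. On-node verification
19+
20+`check-launchd.sh` covers only the static layer. Once real processes exist on the node, follow up with a round of on-node checks: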
21+
22+```bash
23+REPO_DIR=/Users/george/code/baa-conductor
24+AGENTS_DIR="$HOME/Library/LaunchAgents"
25+
26+./scripts/runtime/check-node.sh \
27+  --repo-dir "$REPO_DIR" \
28+  --node mini \
29+  --all-services \
30+  --install-dir "$AGENTS_DIR" \
31+  --expected-rolez leader
32+```
33+
34+If `status-api` is not resident on this node for now, replace `--all-services` with `--service conductor`.
35+
36+### 6. Preview or run a reload
37 
38 ```bash
39 ./scripts/runtime/reload-launchd.sh --dry-run
40@@ -122,7 +145,25 @@ export BAA_SHARED_TOKEN='replace-with-real-token'
41   --node mac
42 ```
43 
44-### 3. Load the services
45+### 3. On-node verification
46+
47+The steady-state expectation for `mac` is standby, so set the expected `rolez` value to `standby`:
48+
49+```bash
50+REPO_DIR=/Users/george/code/baa-conductor
51+AGENTS_DIR="$HOME/Library/LaunchAgents"
52+
53+./scripts/runtime/check-node.sh \
54+  --repo-dir "$REPO_DIR" \
55+  --node mac \
56+  --all-services \
57+  --install-dir "$AGENTS_DIR" \
58+  --expected-rolez standby
59+```
60+
61+If this is a failover rehearsal, change `--expected-rolez standby` to `--expected-rolez leader`.
62+
63+### 4. Load the services
64 
65 The static check and reload commands are the same as for `mini`; just replace `--node mini` with `--node mac`.
66 
67@@ -181,6 +222,13 @@ plutil -lint ops/launchd/so.makefile.baa-status-api.plist
68   --repo-dir /Users/george/code/baa-conductor \
69   --node mini \
70   --install-dir "$HOME/Library/LaunchAgents"
71+
72+./scripts/runtime/check-node.sh \
73+  --repo-dir /Users/george/code/baa-conductor \
74+  --node mini \
75+  --all-services \
76+  --install-dir "$HOME/Library/LaunchAgents" \
77+  --expected-rolez leader
78 ```
79 
80 Common commands for runtime troubleshooting:
81@@ -188,8 +236,10 @@ plutil -lint ops/launchd/so.makefile.baa-status-api.plist
82 ```bash
83 launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
84 launchctl print "gui/$(id -u)/so.makefile.baa-worker-runner"
85+launchctl print "gui/$(id -u)/so.makefile.baa-status-api"
86 tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-conductor.err.log
87 tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-worker-runner.err.log
88+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-status-api.err.log
89 ```
90 
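91 The repo can now generate a basic `dist/index.js` artifact for each app, so launchd no longer depends on "an entry file that will only appear someday". Before running the dist check of `check-launchd.sh` or an actual `launchctl bootstrap`, run the following once from the repo root: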

A docs/runtime/node-verification.md

+141, -0

  1@@ -0,0 +1,141 @@
  2+# node verification
  3+
  4+This page describes the verification order to follow once real processes already exist on the node.
  5+
  6+It does not load services, does not run `launchctl bootstrap`, and does not modify Nginx / DNS.
  7+
  8+## Check coverage
  9+
 10+`scripts/runtime/check-node.sh` splits node verification into two layers:
 11+
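 12+| Layer | Covered by default | Purpose |
 13+| --- | --- | --- |
 14+| Static layer | runtime directories, `dist/index.js`, installed plist copies, shared token, log path configuration | Confirm that the rendered launchd output matches the repo/runtime root directory |
 15+| Runtime layer | local ports, conductor `/healthz` `/readyz` `/rolez`, status-api `/healthz` `/v1/status` `/v1/status/ui`, status-api host process, launchd log files | Confirm that the processes actually running on the node match the expected service surface |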
 16+
 17+The default check set is `conductor + status-api`. If the node only runs conductor for now, narrow it explicitly with `--service conductor`. To also include `worker-runner` in the same node verification pass, add `--all-services`.
 18+
 19+## Prerequisites
 20+
 21+Before starting the on-node checks, make sure these steps are complete:
 22+
 23+1. `./scripts/runtime/bootstrap.sh --repo-dir /Users/george/code/baa-conductor`
 24+2. `cd /Users/george/code/baa-conductor && npx --yes pnpm -r build`
 25+3. `./scripts/runtime/install-launchd.sh ...` has rendered the installed copies for the target node
 26+4. Real processes already exist on the node
 27+
 28+Step 4 can come from an existing launchd load or from processes started manually beforehand; `check-node.sh` only verifies and does not start anything.
 29+
 30+## mini verification order
 31+
 32+### 1. Static check
 33+
 34+```bash
 35+REPO_DIR=/Users/george/code/baa-conductor
 36+AGENTS_DIR="$HOME/Library/LaunchAgents"
 37+
 38+./scripts/runtime/check-launchd.sh \
 39+  --repo-dir "$REPO_DIR" \
 40+  --node mini \
 41+  --all-services \
 42+  --install-dir "$AGENTS_DIR"
 43+```
 44+
 45+### 2. On-node check
 46+
 47+In steady state, `mini` should present itself as leader, so set the expected `rolez` value to `leader`:
 48+
 49+```bash
 50+REPO_DIR=/Users/george/code/baa-conductor
 51+AGENTS_DIR="$HOME/Library/LaunchAgents"
 52+
 53+./scripts/runtime/check-node.sh \
 54+  --repo-dir "$REPO_DIR" \
 55+  --node mini \
 56+  --all-services \
 57+  --install-dir "$AGENTS_DIR" \
 58+  --expected-rolez leader
 59+```
 60+
 61+If the node is already kept resident by launchd, add `--check-loaded` to include `launchctl print gui/<uid>/<label>` in the checks; this still only reads state and does not trigger a load:
 62+
 63+```bash
 64+./scripts/runtime/check-node.sh \
 65+  --repo-dir "$REPO_DIR" \
 66+  --node mini \
 67+  --all-services \
 68+  --install-dir "$AGENTS_DIR" \
 69+  --expected-rolez leader \
 70+  --check-loaded
 71+```
 72+
 73+### 3. Manual spot checks on failure
 74+
 75+```bash
 76+lsof -nP -iTCP:4317 -sTCP:LISTEN
 77+lsof -nP -iTCP:4318 -sTCP:LISTEN
 78+curl -sS http://127.0.0.1:4317/healthz
 79+curl -sS http://127.0.0.1:4317/readyz
 80+curl -sS http://127.0.0.1:4317/rolez
 81+curl -sS http://127.0.0.1:4318/healthz
 82+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-conductor.err.log
 83+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-status-api.err.log
 84+```
 85+
 86+## mac verification order
 87+
 88+### 1. Static check
 89+
 90+```bash
 91+REPO_DIR=/Users/george/code/baa-conductor
 92+AGENTS_DIR="$HOME/Library/LaunchAgents"
 93+
 94+./scripts/runtime/check-launchd.sh \
 95+  --repo-dir "$REPO_DIR" \
 96+  --node mac \
 97+  --all-services \
 98+  --install-dir "$AGENTS_DIR"
 99+```
100+
101+### 2. On-node check
102+
103+In steady state, `mac` should remain standby, so by default set the expected `rolez` value to `standby`:
104+
105+```bash
106+REPO_DIR=/Users/george/code/baa-conductor
107+AGENTS_DIR="$HOME/Library/LaunchAgents"
108+
109+./scripts/runtime/check-node.sh \
110+  --repo-dir "$REPO_DIR" \
111+  --node mac \
112+  --all-services \
113+  --install-dir "$AGENTS_DIR" \
114+  --expected-rolez standby
115+```
116+
117+If this is `mac` during a failover rehearsal, change only `--expected-rolez standby` to `--expected-rolez leader`; all other steps stay the same.
118+
119+### 3. `LaunchDaemons` scenario
120+
121+If `mac` uses `/Library/LaunchDaemons`, switch both the install path and the launchd domain to the daemon variants:
122+
123+```bash
124+sudo ./scripts/runtime/check-node.sh \
125+  --repo-dir /Users/george/code/baa-conductor \
126+  --node mac \
127+  --scope daemon \
128+  --install-dir /Library/LaunchDaemons \
129+  --username george \
130+  --expected-rolez standby \
131+  --check-loaded
132+```
133+
134+Here `--check-loaded` reads `launchctl print system/<label>` and does not re-bootstrap the service.
135+
136+## Common failure signals
137+
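138+- `conductor /readyz` returns `503`: the node process is alive, but the runtime has not reached the ready state yet; check the conductor stderr log first.
139+- `conductor /rolez` does not match the expectation: the node identity is fine, but the current lease role differs from the expected one; first confirm whether the node is in a failover or standby scenario.
140+- `status-api` port is not listening: the `status-api` host process did not start, or its listen address/port differs from the default `127.0.0.1:4318`.
141+- `status-api` process match fails: the node may be running from an old path or old worktree, or launchd still points at the wrong `dist/index.js`.
142+- launchd log files are missing: the installed copy exists, but launchd has not yet actually opened the corresponding `StandardOutPath` / `StandardErrorPath` for the service.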
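
A `/rolez` mismatch caused by passing the wrong expectation is easier to avoid if the steady-state value is derived from the node name. A minimal sketch, assuming a hypothetical helper `steady_state_rolez` that does not exist in the repo; it encodes only the `mini=leader`, `mac=standby` convention documented on this page and must be overridden by hand during a failover rehearsal:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this commit): map a node name to the rolez value
# expected in steady state, following the documented mini/mac convention.
steady_state_rolez() {
  case "$1" in
    mini)
      printf '%s\n' leader
      ;;
    mac)
      printf '%s\n' standby
      ;;
    *)
      printf 'unknown node: %s\n' "$1" >&2
      return 1
      ;;
  esac
}
```

It could then feed the existing flag, for example `--expected-rolez "$(steady_state_rolez mac)"`.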

A scripts/runtime/check-node.sh

+473, -0

  1@@ -0,0 +1,473 @@
  2+#!/usr/bin/env bash
  3+set -euo pipefail
  4+
  5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
  6+# shellcheck source=./common.sh
  7+source "${SCRIPT_DIR}/common.sh"
  8+
  9+usage() {
 10+  cat <<'EOF'
 11+Usage:
 12+  scripts/runtime/check-node.sh [options]
 13+
 14+Options:
 15+  --node mini|mac              Select node defaults. Defaults to mini.
 16+  --scope agent|daemon         Expected launchd scope. Defaults to agent.
 17+  --service NAME               Add one service to the runtime check set. Repeatable.
 18+  --all-services               Check conductor, worker-runner, and status-api.
 19+  --repo-dir PATH              Repo root used to derive runtime paths.
 20+  --home-dir PATH              HOME value expected in installed plist files.
 21+  --install-dir PATH           Validate installed copies under this directory.
 22+  --shared-token TOKEN         Expect this exact token in installed copies.
 23+  --shared-token-file PATH     Read the expected token from a file.
 24+  --control-api-base URL       Expected BAA_CONTROL_API_BASE in installed copies.
 25+  --local-api-base URL         Conductor local API base URL. Defaults to http://127.0.0.1:4317.
 26+  --status-api-base URL        Status API base URL. Defaults to http://127.0.0.1:4318.
 27+  --username NAME              Expected UserName for LaunchDaemons.
 28+  --domain TARGET              launchctl domain target for --check-loaded.
 29+  --check-loaded               Also require launchctl print to succeed for each service.
 30+  --expected-rolez VALUE       Expected conductor /rolez body: leader, standby, or any.
 31+  --skip-static-check          Skip the underlying check-launchd.sh pass.
 32+  --skip-port-check            Skip local TCP LISTEN checks.
 33+  --skip-process-check         Skip host process command-line checks.
 34+  --skip-http-check            Skip conductor/status-api HTTP probes.
 35+  --skip-log-check             Skip launchd stdout/stderr file checks.
 36+  --help                       Show this help text.
 37+
 38+Notes:
 39+  The default runtime check set is conductor + status-api, because that is the
 40+  minimum on-node surface for a realistic node verification pass. Use
 41+  --service to narrow the scope or --all-services to include worker-runner.
 42+EOF
 43+}
 44+
 45+require_command awk
 46+require_command curl
 47+require_command lsof
 48+require_command ps
 49+
 50+node="mini"
 51+scope="agent"
 52+repo_dir="${BAA_RUNTIME_REPO_DIR_DEFAULT}"
 53+home_dir="$(default_home_dir)"
 54+install_dir=""
 55+shared_token=""
 56+shared_token_file=""
 57+control_api_base="${BAA_RUNTIME_DEFAULT_CONTROL_API_BASE}"
 58+local_api_base="${BAA_RUNTIME_DEFAULT_LOCAL_API}"
 59+status_api_base="${BAA_RUNTIME_DEFAULT_STATUS_API}"
 60+username="$(default_username)"
 61+domain_target=""
 62+check_loaded="0"
 63+expected_rolez="any"
 64+skip_static_check="0"
 65+skip_port_check="0"
 66+skip_process_check="0"
 67+skip_http_check="0"
 68+skip_log_check="0"
 69+services=()
 70+
 71+while [[ $# -gt 0 ]]; do
 72+  case "$1" in
 73+    --node)
 74+      node="$2"
 75+      shift 2
 76+      ;;
 77+    --scope)
 78+      scope="$2"
 79+      shift 2
 80+      ;;
 81+    --service)
 82+      validate_service "$2"
 83+      if ! contains_value "$2" "${services[@]-}"; then
 84+        services+=("$2")
 85+      fi
 86+      shift 2
 87+      ;;
 88+    --all-services)
 89+      while IFS= read -r service; do
 90+        if ! contains_value "$service" "${services[@]-}"; then
 91+          services+=("$service")
 92+        fi
 93+      done < <(all_services)
 94+      shift
 95+      ;;
 96+    --repo-dir)
 97+      repo_dir="$2"
 98+      shift 2
 99+      ;;
100+    --home-dir)
101+      home_dir="$2"
102+      shift 2
103+      ;;
104+    --install-dir)
105+      install_dir="$2"
106+      shift 2
107+      ;;
108+    --shared-token)
109+      shared_token="$2"
110+      shift 2
111+      ;;
112+    --shared-token-file)
113+      shared_token_file="$2"
114+      shift 2
115+      ;;
116+    --control-api-base)
117+      control_api_base="$2"
118+      shift 2
119+      ;;
120+    --local-api-base)
121+      local_api_base="$2"
122+      shift 2
123+      ;;
124+    --status-api-base)
125+      status_api_base="$2"
126+      shift 2
127+      ;;
128+    --username)
129+      username="$2"
130+      shift 2
131+      ;;
132+    --domain)
133+      domain_target="$2"
134+      shift 2
135+      ;;
136+    --check-loaded)
137+      check_loaded="1"
138+      shift
139+      ;;
140+    --expected-rolez)
141+      expected_rolez="$2"
142+      shift 2
143+      ;;
144+    --skip-static-check)
145+      skip_static_check="1"
146+      shift
147+      ;;
148+    --skip-port-check)
149+      skip_port_check="1"
150+      shift
151+      ;;
152+    --skip-process-check)
153+      skip_process_check="1"
154+      shift
155+      ;;
156+    --skip-http-check)
157+      skip_http_check="1"
158+      shift
159+      ;;
160+    --skip-log-check)
161+      skip_log_check="1"
162+      shift
163+      ;;
164+    --help)
165+      usage
166+      exit 0
167+      ;;
168+    *)
169+      die "Unknown option: $1"
170+      ;;
171+  esac
172+done
173+
174+validate_node "$node"
175+validate_scope "$scope"
176+
177+case "$expected_rolez" in
178+  any | leader | standby) ;;
179+  *)
180+    die "Unsupported --expected-rolez value: ${expected_rolez}"
181+    ;;
182+esac
183+
184+if [[ "${#services[@]}" -eq 0 ]]; then
185+  while IFS= read -r service; do
186+    services+=("$service")
187+  done < <(default_node_verification_services)
188+fi
189+
190+if [[ -z "$install_dir" ]]; then
191+  install_dir="$(default_install_dir "$scope" "$home_dir")"
192+fi
193+
194+if [[ "$check_loaded" == "1" && -z "$domain_target" ]]; then
195+  domain_target="$(default_domain_target "$scope")"
196+fi
197+
198+set -- $(resolve_node_defaults "$node")
199+conductor_host="$1"
200+conductor_role="$2"
201+node_id="$3"
202+
203+logs_launchd_dir="${repo_dir}/logs/launchd"
204+HTTP_STATUS=""
205+HTTP_BODY=""
206+
207+normalize_base_url() {
208+  local value="$1"
209+
210+  while [[ "$value" == */ ]]; do
211+    value="${value%/}"
212+  done
213+
214+  printf '%s\n' "$value"
215+}
216+
217+extract_port_from_url() {
218+  local service="$1"
219+  local url="$2"
220+  local authority
221+  local default_port
222+
223+  authority="${url#*://}"
224+  authority="${authority%%/*}"
225+
226+  if [[ "$authority" == *:* ]]; then
227+    printf '%s\n' "${authority##*:}"
228+    return 0
229+  fi
230+
231+  default_port="$(service_default_port "$service")"
232+  if [[ -n "$default_port" ]]; then
233+    printf '%s\n' "$default_port"
234+    return 0
235+  fi
236+
237+  case "$url" in
238+    https://*)
239+      printf '%s\n' "443"
240+      ;;
241+    http://*)
242+      printf '%s\n' "80"
243+      ;;
244+    *)
245+      die "Could not derive a TCP port from URL: ${url}"
246+      ;;
247+  esac
248+}
249+
250+http_get() {
251+  local url="$1"
252+  local body_file
253+
254+  body_file="$(mktemp "${TMPDIR:-/tmp}/baa-runtime-http.XXXXXX")"
255+
256+  HTTP_STATUS="$(curl -sS --max-time 5 -o "$body_file" -w '%{http_code}' "$url")" || {
257+    rm -f "$body_file"
258+    die "Request failed: ${url}"
259+  }
260+
261+  HTTP_BODY="$(tr -d '\r' <"$body_file")"
262+  rm -f "$body_file"
263+
264+  while [[ "$HTTP_BODY" == *$'\n' ]]; do
265+    HTTP_BODY="${HTTP_BODY%$'\n'}"
266+  done
267+}
268+
269+assert_http_equals() {
270+  local name="$1"
271+  local url="$2"
272+  local expected_status="$3"
273+  local expected_body="$4"
274+
275+  http_get "$url"
276+
277+  if [[ "$HTTP_STATUS" != "$expected_status" ]]; then
278+    die "${name} returned HTTP ${HTTP_STATUS}, expected ${expected_status}"
279+  fi
280+
281+  if [[ "$HTTP_BODY" != "$expected_body" ]]; then
282+    die "${name} body mismatch: expected '${expected_body}', got '${HTTP_BODY}'"
283+  fi
284+
285+  runtime_log "${name} ok"
286+}
287+
288+assert_http_contains() {
289+  local name="$1"
290+  local url="$2"
291+  local expected_status="$3"
292+  local expected_substring="$4"
293+
294+  http_get "$url"
295+
296+  if [[ "$HTTP_STATUS" != "$expected_status" ]]; then
297+    die "${name} returned HTTP ${HTTP_STATUS}, expected ${expected_status}"
298+  fi
299+
300+  if [[ "$HTTP_BODY" != *"$expected_substring"* ]]; then
301+    die "${name} body does not contain expected text: ${expected_substring}"
302+  fi
303+
304+  runtime_log "${name} ok"
305+}
306+
307+check_loaded_services() {
308+  require_command launchctl
309+
310+  for service in "${services[@]}"; do
311+    launchctl print "${domain_target}/$(service_label "$service")" >/dev/null
312+    runtime_log "launchctl loaded: $(service_label "$service")"
313+  done
314+}
315+
316+run_static_checks() {
317+  local static_args=()
318+
319+  static_args+=(
320+    --node "$node"
321+    --scope "$scope"
322+    --repo-dir "$repo_dir"
323+    --home-dir "$home_dir"
324+    --install-dir "$install_dir"
325+    --control-api-base "$control_api_base"
326+    --local-api-base "$local_api_base"
327+    --username "$username"
328+  )
329+
330+  for service in "${services[@]}"; do
331+    static_args+=(--service "$service")
332+  done
333+
334+  if [[ -n "$shared_token" ]]; then
335+    static_args+=(--shared-token "$shared_token")
336+  fi
337+
338+  if [[ -n "$shared_token_file" ]]; then
339+    static_args+=(--shared-token-file "$shared_token_file")
340+  fi
341+
342+  if [[ "$check_loaded" == "1" ]]; then
343+    static_args+=(--check-loaded --domain "$domain_target")
344+  fi
345+
346+  "${SCRIPT_DIR}/check-launchd.sh" "${static_args[@]}"
347+}
348+
349+check_service_process() {
350+  local service="$1"
351+  local process_pattern="$2"
352+  local process_lines
353+
354+  process_lines="$(ps -axo pid=,command= | awk -v pattern="$process_pattern" 'index($0, pattern) > 0 { print }')"
355+
356+  if [[ -z "$process_lines" ]]; then
357+    die "${service} process not found for pattern: ${process_pattern}"
358+  fi
359+
360+  runtime_log "${service} process ok: $(printf '%s\n' "$process_lines" | sed -n '1p')"
361+}
362+
363+check_listen_port() {
364+  local service="$1"
365+  local port="$2"
366+  local socket_lines
367+
368+  socket_lines="$(lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null || true)"
369+
370+  if [[ -z "$socket_lines" ]]; then
371+    die "${service} is not listening on TCP port ${port}"
372+  fi
373+
374+  runtime_log "${service} listening on TCP ${port}"
375+}
376+
377+check_service_logs() {
378+  local service="$1"
379+  local stdout_path
380+  local stderr_path
381+
382+  stdout_path="$(service_stdout_path "$logs_launchd_dir" "$service")"
383+  stderr_path="$(service_stderr_path "$logs_launchd_dir" "$service")"
384+
385+  assert_directory "$(dirname -- "$stdout_path")"
386+  assert_file "$stdout_path"
387+  assert_file "$stderr_path"
388+
389+  runtime_log "${service} log files present"
390+}
391+
392+check_conductor_runtime() {
393+  local conductor_base_url="$1"
394+  local port
395+
396+  if [[ "$skip_port_check" != "1" ]]; then
397+    port="$(extract_port_from_url "conductor" "$conductor_base_url")"
398+    check_listen_port "conductor" "$port"
399+  fi
400+
401+  if [[ "$skip_http_check" != "1" ]]; then
402+    assert_http_equals "conductor /healthz" "${conductor_base_url}/healthz" "200" "ok"
403+    assert_http_equals "conductor /readyz" "${conductor_base_url}/readyz" "200" "ready"
404+
405+    http_get "${conductor_base_url}/rolez"
406+    if [[ "$HTTP_STATUS" != "200" ]]; then
407+      die "conductor /rolez returned HTTP ${HTTP_STATUS}, expected 200"
408+    fi
409+
410+    case "$expected_rolez" in
411+      any)
412+        case "$HTTP_BODY" in
413+          leader | standby) ;;
414+          *)
415+            die "conductor /rolez must be leader or standby, got '${HTTP_BODY}'"
416+            ;;
417+        esac
418+        ;;
419+      *)
420+        if [[ "$HTTP_BODY" != "$expected_rolez" ]]; then
421+          die "conductor /rolez mismatch: expected '${expected_rolez}', got '${HTTP_BODY}'"
422+        fi
423+        ;;
424+    esac
425+
426+    runtime_log "conductor /rolez ok: ${HTTP_BODY}"
427+  fi
428+}
429+
430+check_status_api_runtime() {
431+  local status_base_url="$1"
432+  local port
433+
434+  if [[ "$skip_port_check" != "1" ]]; then
435+    port="$(extract_port_from_url "status-api" "$status_base_url")"
436+    check_listen_port "status-api" "$port"
437+  fi
438+
439+  if [[ "$skip_http_check" != "1" ]]; then
440+    assert_http_equals "status-api /healthz" "${status_base_url}/healthz" "200" "ok"
441+    assert_http_contains "status-api /v1/status" "${status_base_url}/v1/status" "200" "\"ok\": true"
442+    assert_http_contains "status-api /v1/status/ui" "${status_base_url}/v1/status/ui" "200" "BAA Conductor Status"
443+  fi
444+}
445+
446+local_api_base="$(normalize_base_url "$local_api_base")"
447+status_api_base="$(normalize_base_url "$status_api_base")"
448+
449+if [[ "$skip_static_check" != "1" ]]; then
450+  run_static_checks
451+elif [[ "$check_loaded" == "1" ]]; then
452+  check_loaded_services
453+fi
454+
455+for service in "${services[@]}"; do
456+  if [[ "$skip_process_check" != "1" ]]; then
457+    check_service_process "$service" "$(service_process_match "$repo_dir" "$service" "$conductor_host" "$conductor_role")"
458+  fi
459+
460+  if [[ "$skip_log_check" != "1" ]]; then
461+    check_service_logs "$service"
462+  fi
463+
464+  case "$service" in
465+    conductor)
466+      check_conductor_runtime "$local_api_base"
467+      ;;
468+    status-api)
469+      check_status_api_runtime "$status_api_base"
470+      ;;
471+  esac
472+done
473+
474+runtime_log "node checks passed for ${node} (${node_id})"
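
One edge case in `extract_port_from_url` as written: any colon in the authority is treated as a port separator, so a bracketed IPv6 host without an explicit port (for example `http://[::1]`) would yield `1]` instead of falling back to the service default. The default base URLs use `127.0.0.1`, so this only matters for custom `--local-api-base` / `--status-api-base` values. A possible variant is sketched below; `explicit_port_from_url` is a hypothetical name, and it returns an empty string when no port is present so the caller can apply the existing default-port fallback:

```bash
#!/usr/bin/env bash
# Hypothetical variant (not in this commit): print the explicit port of a URL,
# or an empty line when the URL carries none. Bracketed IPv6 hosts are handled
# by discarding the [..] part before looking for the port separator.
explicit_port_from_url() {
  local url="$1"
  local authority

  authority="${url#*://}"      # drop the scheme
  authority="${authority%%/*}" # drop the path
  authority="${authority##*@}" # drop userinfo, if any

  if [[ "$authority" == \[* ]]; then
    # Keep only what follows the closing bracket, e.g. ":4318" or "".
    authority="${authority#*\]}"
  fi

  if [[ "$authority" == *:* ]]; then
    printf '%s\n' "${authority##*:}"
  else
    printf '\n'
  fi
}
```

With this shape, `http://[::1]:4318` gives `4318`, while `http://[::1]/healthz` and `http://localhost` both give an empty result.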

M scripts/runtime/common.sh

+38, -0

 1@@ -9,6 +9,7 @@ readonly BAA_RUNTIME_SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &&
 2 readonly BAA_RUNTIME_REPO_DIR_DEFAULT="$(cd -- "${BAA_RUNTIME_SCRIPT_DIR}/../.." && pwd)"
 3 readonly BAA_RUNTIME_DEFAULT_CONTROL_API_BASE="https://control-api.makefile.so"
 4 readonly BAA_RUNTIME_DEFAULT_LOCAL_API="http://127.0.0.1:4317"
 5+readonly BAA_RUNTIME_DEFAULT_STATUS_API="http://127.0.0.1:4318"
 6 readonly BAA_RUNTIME_DEFAULT_LOCALE="en_US.UTF-8"
 7 
 8 runtime_log() {
 9@@ -75,6 +76,10 @@ default_services() {
10   printf '%s\n' conductor
11 }
12 
13+default_node_verification_services() {
14+  printf '%s\n' conductor status-api
15+}
16+
17 all_services() {
18   printf '%s\n' conductor worker-runner status-api
19 }
20@@ -107,6 +112,39 @@ service_dist_entry_relative() {
21   esac
22 }
23 
24+service_default_port() {
25+  case "$1" in
26+    conductor)
27+      printf '%s\n' "4317"
28+      ;;
29+    status-api)
30+      printf '%s\n' "4318"
31+      ;;
32+    worker-runner)
33+      printf '%s\n' ""
34+      ;;
35+  esac
36+}
37+
38+service_process_match() {
39+  local repo_dir="$1"
40+  local service="$2"
41+  local conductor_host="${3:-}"
42+  local conductor_role="${4:-}"
43+  local dist_entry
44+
45+  dist_entry="${repo_dir}/$(service_dist_entry_relative "$service")"
46+
47+  case "$service" in
48+    conductor)
49+      printf '%s --host %s --role %s\n' "$dist_entry" "$conductor_host" "$conductor_role"
50+      ;;
51+    *)
52+      printf '%s\n' "$dist_entry"
53+      ;;
54+  esac
55+}
56+
57 service_template_path() {
58   local repo_dir="$1"
59   local service="$2"