- commit: 7b74cf7
- parent: 6505a31
- author: im_wower
- date: 2026-03-22 01:47:42 +0800 CST
feat(runtime): add on-node node verification checks
6 files changed, +732, -12
1@@ -1,10 +1,10 @@
2 ---
3 task_id: T-027
4 title: launchd node verification and on-node checks
5-status: todo
6+status: review
7 branch: feat/T-027-node-verification
8 repo: /Users/george/code/baa-conductor
9-base_ref: main
10+base_ref: main@6505a31
11 depends_on:
12 - T-019
13 - T-020
14@@ -12,7 +12,7 @@ depends_on:
15 write_scope:
16 - docs/runtime/**
17 - scripts/runtime/**
18-updated_at: 2026-03-22
19+updated_at: 2026-03-22T01:46:45+08:00
20 ---
21
22 # T-027 launchd node verification and on-node checks
23@@ -62,20 +62,33 @@ updated_at: 2026-03-22
24
25 ## files_changed
26
27-- TBD
28+- `coordination/tasks/T-027-node-verification.md`
29+- `docs/runtime/README.md`
30+- `docs/runtime/launchd.md`
31+- `docs/runtime/node-verification.md`
32+- `scripts/runtime/common.sh`
33+- `scripts/runtime/check-node.sh`
34
35 ## commands_run
36
37-- TBD
38+- `npx --yes pnpm install`
39+- `bash -n scripts/runtime/*.sh`
40+- `git diff --check`
41+- `scripts/runtime/check-node.sh --help`
42+- `scripts/runtime/check-node.sh --repo-dir <tmp_repo> --node mini --service conductor --service status-api --skip-static-check --local-api-base http://127.0.0.1:4417 --status-api-base http://127.0.0.1:4418 --expected-rolez leader`
43
44 ## result
45
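46-- TBD
47+- Added `scripts/runtime/check-node.sh`, which extends node verification to the runtime layer: it reuses the runtime static checks and adds checks for local ports, the conductor `/healthz`, `/readyz`, and `/rolez` probes, the status-api host process and its `/healthz`, `/v1/status`, and `/v1/status/ui` endpoints, and the existence of the launchd log files.
48+- `scripts/runtime/common.sh` gains the status-api default address, the default on-node service set, and a process-matching helper, so node conventions are not scattered across multiple scripts.
49+- `docs/runtime/README.md`, `docs/runtime/launchd.md`, and `docs/runtime/node-verification.md` now spell out the static check order and the on-node check order for mini/mac, plus the steady-state `rolez` expectations of `mini=leader` and `mac=standby`.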
50
51 ## risks
52
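53-- TBD
54+- `--check-loaded` was not run on real `mini` / `mac` nodes, so `launchctl print gui/...` / `launchctl print system/...` is only wired up at the script level and has not been verified on actual machines.
55+- `status-api /v1/status` is currently asserted by checking that the response body contains `"ok": true`; if the response later switches to a different JSON serialization style (for example compact `"ok":true`), the script may need a matching tweak.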
56
57 ## next_handoff
58
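59-- TBD
60+- Run `scripts/runtime/check-node.sh --check-loaded ...` once each on the real `mini` and `mac` nodes to confirm that the launchd domain, log files, ports, and probes all match the docs.
61+- If a node does not keep `status-api` resident, pass `--service conductor` explicitly at rollout, and decide whether the ops docs need to distinguish "conductor-only nodes" from "conductor + status-api nodes".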
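The `status-api /v1/status` risk listed above comes from matching the literal substring `"ok": true`. A minimal sketch of a whitespace-tolerant match follows. `body_has_ok_true` is a hypothetical helper that is not part of this commit, and it still treats the body as plain text rather than parsing JSON, so a nested `"ok": true` would also match.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): succeed when the body contains
# "ok": true with any amount of whitespace around the colon.
body_has_ok_true() {
  local body="$1"
  printf '%s' "$body" | grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}

# Usage: both serialization styles are accepted.
body_has_ok_true '{"ok": true}' && echo "pretty style matched"
body_has_ok_true '{"ok":true}' && echo "compact style matched"
```

If a JSON parser is available on the node, parsing the body would be stricter than any substring or regex match; the sketch only removes the dependence on one specific spacing.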
+7, -2
1@@ -11,6 +11,7 @@
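2 - [`layout.md`](./layout.md): how `runs/`, `worktrees/`, `logs/`, `tmp/`, and `state/` are initialized, and their lifecycle
3 - [`environment.md`](./environment.md): environment variables that must be set explicitly under `launchd`, and how the install script overrides the defaults
4 - [`launchd.md`](./launchd.md): scripted install steps for `mini` and `mac`, and the differences between `LaunchAgents` and `LaunchDaemons`
5+- [`node-verification.md`](./node-verification.md): on-node verification order for `mini` / `mac`, expected probe results, and log/process checkpoints
6
7 ## Shared conventions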
8
9@@ -23,14 +24,18 @@
10
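11 1. First, follow [`layout.md`](./layout.md) and run `./scripts/runtime/bootstrap.sh` to initialize the runtime root directory.
12 2. Then follow [`environment.md`](./environment.md) to prepare the shared and per-node variables, in particular `BAA_SHARED_TOKEN`.
13-3. Follow [`launchd.md`](./launchd.md) and run `install-launchd.sh` to generate the installed copies, then use `check-launchd.sh` / `reload-launchd.sh` to validate and reload.
14-4. Before each `check-launchd.sh` dist check or an actual `launchctl bootstrap`, run `npx --yes pnpm -r build` once from the repo root to confirm that the target app's `dist/index.js` is up to date.
15+3. Follow [`launchd.md`](./launchd.md) and run `install-launchd.sh` to generate the installed copies, then run `check-launchd.sh` for static validation first.
16+4. Once real processes are running on the node, follow [`node-verification.md`](./node-verification.md) and run `check-node.sh` to check ports, probes, the status-api host process, and log paths.
17+5. Before each `check-launchd.sh` dist check, `check-node.sh` process check, or actual `launchctl bootstrap`, run `npx --yes pnpm -r build` once from the repo root to confirm that the target app's `dist/index.js` is up to date.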
18
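19 ## Current script set
20
21 - `scripts/runtime/bootstrap.sh`: pre-creates `state/`, `runs/`, `worktrees/`, `logs/launchd/`, and `tmp/`
22 - `scripts/runtime/install-launchd.sh`: renders the actual installed copies from `ops/launchd/*.plist`
23 - `scripts/runtime/check-launchd.sh`: validates the source templates, runtime directories, build artifacts, and the installed plist copies
24+- `scripts/runtime/check-node.sh`: validates on-node runtime signals beyond the launchd copies, such as local ports, HTTP probes, the status-api host process, and launchd log files
25 - `scripts/runtime/reload-launchd.sh`: runs, or dry-runs, `launchctl bootout/bootstrap/kickstart`
26
27 The default install/check/reload set contains only `conductor`. To bring other templates into the flow later, add `--service worker-runner` or `--service status-api` explicitly, or use `--all-services`.
28+
29+`check-node.sh` uses a different default set: it checks both `conductor` and `status-api` by default, because a real node verification must cover at least the local control plane and the status plane. If a node temporarily does not run `status-api`, narrow the set explicitly with `--service conductor`.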
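The two default sets described above can be illustrated with a small standalone sketch. `default_services` and `default_node_verification_services` mirror the functions in `scripts/runtime/common.sh`; `resolve_services` is a hypothetical helper written only for this illustration and is not how the scripts are structured.

```bash
#!/usr/bin/env bash
# Default sets as defined in scripts/runtime/common.sh.
default_services() { printf '%s\n' conductor; }
default_node_verification_services() { printf '%s\n' conductor status-api; }

# Hypothetical helper: print the explicit selection if one was given,
# otherwise fall back to the default set produced by the function named in $1.
resolve_services() {
  local default_fn="$1"
  shift
  if [[ $# -gt 0 ]]; then
    printf '%s\n' "$@"
  else
    "$default_fn"
  fi
}

resolve_services default_services                               # conductor
resolve_services default_node_verification_services             # conductor, status-api
resolve_services default_node_verification_services conductor   # conductor only
```

An explicit `--service` always replaces the default rather than adding to it, which is why narrowing `check-node.sh` to `--service conductor` drops `status-api`.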
+52, -2
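1@@ -16,6 +16,11 @@ The repo keeps three source templates:
2
3 This lets the services that are already wired up as CLI/runtime entry points run first, with the other templates added as needed.
4
5+Note that the scripts use two different default service sets:
6+
7+- `install-launchd.sh` / `check-launchd.sh` / `reload-launchd.sh` handle only `conductor` by default
8+- `check-node.sh` handles both `conductor` and `status-api` by default, because node verification must cover at least the local control plane and the status plane
9+
10 All three plist source templates in the repo default to the canonical `mini` configuration:
11
12 - repo root: `/Users/george/code/baa-conductor`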
13@@ -90,7 +95,25 @@ AGENTS_DIR="$HOME/Library/LaunchAgents"
14 --install-dir "$AGENTS_DIR"
15 ```
16
17-### 5. Preview or run the reload
18+### 5. On-node verification
19+
20+`check-launchd.sh` covers only the static layer. Once real processes are running on the node, add a round of on-node checks:
21+
22+```bash
23+REPO_DIR=/Users/george/code/baa-conductor
24+AGENTS_DIR="$HOME/Library/LaunchAgents"
25+
26+./scripts/runtime/check-node.sh \
27+ --repo-dir "$REPO_DIR" \
28+ --node mini \
29+ --all-services \
30+ --install-dir "$AGENTS_DIR" \
31+ --expected-rolez leader
32+```
33+
34+If `status-api` is temporarily not resident on this node, change `--all-services` to `--service conductor`.
35+
36+### 6. Preview or run the reload
37
38 ```bash
39 ./scripts/runtime/reload-launchd.sh --dry-run
40@@ -122,7 +145,25 @@ export BAA_SHARED_TOKEN='replace-with-real-token'
41 --node mac
42 ```
43
44-### 3. Load the services
45+### 3. On-node verification
46+
47+The steady-state expectation for `mac` is standby, so change the expected `rolez` value to `standby`:
48+
49+```bash
50+REPO_DIR=/Users/george/code/baa-conductor
51+AGENTS_DIR="$HOME/Library/LaunchAgents"
52+
53+./scripts/runtime/check-node.sh \
54+ --repo-dir "$REPO_DIR" \
55+ --node mac \
56+ --all-services \
57+ --install-dir "$AGENTS_DIR" \
58+ --expected-rolez standby
59+```
60+
61+If this is a failover rehearsal, change `--expected-rolez standby` to `--expected-rolez leader`.
62+
63+### 4. Load the services
64
65 The static validation and reload commands are the same as for `mini`; just replace `--node mini` with `--node mac`.
66
67@@ -181,6 +222,13 @@ plutil -lint ops/launchd/so.makefile.baa-status-api.plist
68 --repo-dir /Users/george/code/baa-conductor \
69 --node mini \
70 --install-dir "$HOME/Library/LaunchAgents"
71+
72+./scripts/runtime/check-node.sh \
73+ --repo-dir /Users/george/code/baa-conductor \
74+ --node mini \
75+ --all-services \
76+ --install-dir "$HOME/Library/LaunchAgents" \
77+ --expected-rolez leader
78 ```
79
80 Common commands for runtime troubleshooting:
81@@ -188,8 +236,10 @@ plutil -lint ops/launchd/so.makefile.baa-status-api.plist
82 ```bash
83 launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
84 launchctl print "gui/$(id -u)/so.makefile.baa-worker-runner"
85+launchctl print "gui/$(id -u)/so.makefile.baa-status-api"
86 tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-conductor.err.log
87 tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-worker-runner.err.log
88+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-status-api.err.log
89 ```
90
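91 The repo can now produce a basic `dist/index.js` artifact for each app, so launchd no longer depends on an entry file that might only appear at some point in the future. Before running the `check-launchd.sh` dist check or an actual `launchctl bootstrap`, run this once from the repo root: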
+141, -0
1@@ -0,0 +1,141 @@
2+# node verification
3+
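4+This page describes the verification order for when real processes are already running on the node.
5+
6+It does not load services, does not run `launchctl bootstrap`, and does not modify Nginx / DNS.
7+
8+## Check surface
9+
10+`scripts/runtime/check-node.sh` splits node verification into two layers:
11+
12+| Layer | Covered by default | Purpose |
13+| --- | --- | --- |
14+| Static layer | runtime directories, `dist/index.js`, installed plist copies, shared token, log path configuration | Confirm that the rendered launchd output matches the repo/runtime root directory |
15+| Runtime layer | local ports, conductor `/healthz` `/readyz` `/rolez`, status-api `/healthz` `/v1/status` `/v1/status/ui`, the status-api host process, launchd log files | Confirm that the processes actually running on the node match the expected service surface |
16+
17+The default check set is `conductor + status-api`. If the node temporarily runs only conductor, narrow it explicitly with `--service conductor`. To include `worker-runner` in the same node verification pass as well, add `--all-services`.
18+
19+## Prerequisites
20+
21+Before starting the on-node checks, make sure these steps are complete: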
22+
23+1. `./scripts/runtime/bootstrap.sh --repo-dir /Users/george/code/baa-conductor`
24+2. `cd /Users/george/code/baa-conductor && npx --yes pnpm -r build`
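25+3. `./scripts/runtime/install-launchd.sh ...` has already rendered the installed copies for the target node
26+4. Real processes are already running on the node
27+
28+Step 4 can come from an existing launchd load or from processes started manually beforehand; `check-node.sh` only verifies and does not start anything.
29+
30+## mini verification order
31+
32+### 1. Static validation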
33+
34+```bash
35+REPO_DIR=/Users/george/code/baa-conductor
36+AGENTS_DIR="$HOME/Library/LaunchAgents"
37+
38+./scripts/runtime/check-launchd.sh \
39+ --repo-dir "$REPO_DIR" \
40+ --node mini \
41+ --all-services \
42+ --install-dir "$AGENTS_DIR"
43+```
44+
45+### 2. On-node validation
46+
47+In steady state, `mini` should present itself as leader, so the expected `rolez` value is set to `leader`:
48+
49+```bash
50+REPO_DIR=/Users/george/code/baa-conductor
51+AGENTS_DIR="$HOME/Library/LaunchAgents"
52+
53+./scripts/runtime/check-node.sh \
54+ --repo-dir "$REPO_DIR" \
55+ --node mini \
56+ --all-services \
57+ --install-dir "$AGENTS_DIR" \
58+ --expected-rolez leader
59+```
60+
61+If the node is already kept resident by launchd, also add `--check-loaded` so that `launchctl print gui/<uid>/<label>` is included in the checks; this still only reads state and does not trigger a load:
62+
63+```bash
64+./scripts/runtime/check-node.sh \
65+ --repo-dir "$REPO_DIR" \
66+ --node mini \
67+ --all-services \
68+ --install-dir "$AGENTS_DIR" \
69+ --expected-rolez leader \
70+ --check-loaded
71+```
72+
73+### 3. Manual spot checks on failure
74+
75+```bash
76+lsof -nP -iTCP:4317 -sTCP:LISTEN
77+lsof -nP -iTCP:4318 -sTCP:LISTEN
78+curl -sS http://127.0.0.1:4317/healthz
79+curl -sS http://127.0.0.1:4317/readyz
80+curl -sS http://127.0.0.1:4317/rolez
81+curl -sS http://127.0.0.1:4318/healthz
82+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-conductor.err.log
83+tail -n 50 /Users/george/code/baa-conductor/logs/launchd/so.makefile.baa-status-api.err.log
84+```
85+
86+## mac verification order
87+
88+### 1. Static validation
89+
90+```bash
91+REPO_DIR=/Users/george/code/baa-conductor
92+AGENTS_DIR="$HOME/Library/LaunchAgents"
93+
94+./scripts/runtime/check-launchd.sh \
95+ --repo-dir "$REPO_DIR" \
96+ --node mac \
97+ --all-services \
98+ --install-dir "$AGENTS_DIR"
99+```
100+
101+### 2. On-node validation
102+
103+In steady state, `mac` should remain standby, so the expected `rolez` value is normally set to `standby`:
104+
105+```bash
106+REPO_DIR=/Users/george/code/baa-conductor
107+AGENTS_DIR="$HOME/Library/LaunchAgents"
108+
109+./scripts/runtime/check-node.sh \
110+ --repo-dir "$REPO_DIR" \
111+ --node mac \
112+ --all-services \
113+ --install-dir "$AGENTS_DIR" \
114+ --expected-rolez standby
115+```
116+
117+If this is `mac` during a failover rehearsal, change only `--expected-rolez standby` to `--expected-rolez leader`; the remaining steps stay the same.
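The `rolez` expectations for `mini` and `mac` can be summarized as a lookup. The sketch below is illustrative only: `expected_rolez_for` and the scenario labels `steady` / `failover` are hypothetical names that do not exist in the scripts. Combinations the docs do not describe fall back to `any`, the permissive value that `check-node.sh` accepts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a node and scenario to an --expected-rolez value.
expected_rolez_for() {
  local node="$1"
  local scenario="${2:-steady}" # "steady" or "failover" (labels assumed here)

  case "${node}:${scenario}" in
    mini:steady | mac:failover) printf '%s\n' leader ;;
    mac:steady) printf '%s\n' standby ;;
    *) printf '%s\n' any ;; # not covered by the docs, so do not pin a role
  esac
}

# Usage: build the flag for a failover rehearsal on mac.
echo "--expected-rolez $(expected_rolez_for mac failover)"
```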
118+
119+### 3. `LaunchDaemons` scenario
120+
121+If `mac` uses `/Library/LaunchDaemons`, switch both the install path and the launchd domain to their daemon versions:
122+
123+```bash
124+sudo ./scripts/runtime/check-node.sh \
125+ --repo-dir /Users/george/code/baa-conductor \
126+ --node mac \
127+ --scope daemon \
128+ --install-dir /Library/LaunchDaemons \
129+ --username george \
130+ --expected-rolez standby \
131+ --check-loaded
132+```
133+
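134+With `--check-loaded` here, the script reads `launchctl print system/<label>`; it does not re-bootstrap the service.
135+
136+## Common failure signals
137+
138+- `conductor /readyz` returns `503`: the node process is alive, but the runtime has not reached the ready state yet; check the conductor stderr log first.
139+- `conductor /rolez` does not match the expectation: the node identity is fine, but the current lease role differs from the expected one; first confirm whether the node is in a failover or standby scenario.
140+- `status-api` port is not listening: the `status-api` host process is not up, or its listen address/port differs from the default `127.0.0.1:4318`.
141+- `status-api` process match fails: the node may be running from an old path or an old worktree, or launchd still points at the wrong `dist/index.js`.
142+- launchd log files are missing: the installed copy path exists, but launchd has not yet actually opened the corresponding `StandardOutPath` / `StandardErrorPath` for the service.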
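For manual triage, the `/readyz` signal above can be turned into a hint from the HTTP status alone. This is a sketch under assumptions: `readyz_hint` is a hypothetical helper, and the `000` branch relies on curl printing `000` for `%{http_code}` when no HTTP response was received.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a conductor /readyz HTTP status into a hint.
readyz_hint() {
  case "$1" in
    200) printf '%s\n' "ready" ;;
    503) printf '%s\n' "process alive but runtime not ready; check the conductor stderr log" ;;
    000) printf '%s\n' "no HTTP response; check whether the port is listening" ;;
    *) printf '%s\n' "unexpected status $1" ;;
  esac
}

# Usage on a node (performs a real request, so it is left commented out here):
# readyz_hint "$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' http://127.0.0.1:4317/readyz)"
readyz_hint 503
```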
+473, -0
1@@ -0,0 +1,473 @@
2+#!/usr/bin/env bash
3+set -euo pipefail
4+
5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
6+# shellcheck source=./common.sh
7+source "${SCRIPT_DIR}/common.sh"
8+
9+usage() {
10+ cat <<'EOF'
11+Usage:
12+ scripts/runtime/check-node.sh [options]
13+
14+Options:
15+ --node mini|mac Select node defaults. Defaults to mini.
16+ --scope agent|daemon Expected launchd scope. Defaults to agent.
17+ --service NAME Add one service to the runtime check set. Repeatable.
18+ --all-services Check conductor, worker-runner, and status-api.
19+ --repo-dir PATH Repo root used to derive runtime paths.
20+ --home-dir PATH HOME value expected in installed plist files.
21+ --install-dir PATH Validate installed copies under this directory.
22+ --shared-token TOKEN Expect this exact token in installed copies.
23+ --shared-token-file PATH Read the expected token from a file.
24+ --control-api-base URL Expected BAA_CONTROL_API_BASE in installed copies.
25+ --local-api-base URL Conductor local API base URL. Defaults to 127.0.0.1:4317.
26+ --status-api-base URL Status API base URL. Defaults to 127.0.0.1:4318.
27+ --username NAME Expected UserName for LaunchDaemons.
28+ --domain TARGET launchctl domain target for --check-loaded.
29+ --check-loaded Also require launchctl print to succeed for each service.
30+ --expected-rolez VALUE Expected conductor /rolez body: leader, standby, or any.
31+ --skip-static-check Skip the underlying check-launchd.sh pass.
32+ --skip-port-check Skip local TCP LISTEN checks.
33+ --skip-process-check Skip host process command-line checks.
34+ --skip-http-check Skip conductor/status-api HTTP probes.
35+ --skip-log-check Skip launchd stdout/stderr file checks.
36+ --help Show this help text.
37+
38+Notes:
39+ The default runtime check set is conductor + status-api, because that is the
40+ minimum on-node surface for a realistic node verification pass. Use
41+ --service to narrow the scope or --all-services to include worker-runner.
42+EOF
43+}
44+
45+require_command awk
46+require_command curl
47+require_command lsof
48+require_command ps
49+
50+node="mini"
51+scope="agent"
52+repo_dir="${BAA_RUNTIME_REPO_DIR_DEFAULT}"
53+home_dir="$(default_home_dir)"
54+install_dir=""
55+shared_token=""
56+shared_token_file=""
57+control_api_base="${BAA_RUNTIME_DEFAULT_CONTROL_API_BASE}"
58+local_api_base="${BAA_RUNTIME_DEFAULT_LOCAL_API}"
59+status_api_base="${BAA_RUNTIME_DEFAULT_STATUS_API}"
60+username="$(default_username)"
61+domain_target=""
62+check_loaded="0"
63+expected_rolez="any"
64+skip_static_check="0"
65+skip_port_check="0"
66+skip_process_check="0"
67+skip_http_check="0"
68+skip_log_check="0"
69+services=()
70+
71+while [[ $# -gt 0 ]]; do
72+ case "$1" in
73+ --node)
74+ node="$2"
75+ shift 2
76+ ;;
77+ --scope)
78+ scope="$2"
79+ shift 2
80+ ;;
81+ --service)
82+ validate_service "$2"
83+ if ! contains_value "$2" "${services[@]-}"; then
84+ services+=("$2")
85+ fi
86+ shift 2
87+ ;;
88+ --all-services)
89+ while IFS= read -r service; do
90+ if ! contains_value "$service" "${services[@]-}"; then
91+ services+=("$service")
92+ fi
93+ done < <(all_services)
94+ shift
95+ ;;
96+ --repo-dir)
97+ repo_dir="$2"
98+ shift 2
99+ ;;
100+ --home-dir)
101+ home_dir="$2"
102+ shift 2
103+ ;;
104+ --install-dir)
105+ install_dir="$2"
106+ shift 2
107+ ;;
108+ --shared-token)
109+ shared_token="$2"
110+ shift 2
111+ ;;
112+ --shared-token-file)
113+ shared_token_file="$2"
114+ shift 2
115+ ;;
116+ --control-api-base)
117+ control_api_base="$2"
118+ shift 2
119+ ;;
120+ --local-api-base)
121+ local_api_base="$2"
122+ shift 2
123+ ;;
124+ --status-api-base)
125+ status_api_base="$2"
126+ shift 2
127+ ;;
128+ --username)
129+ username="$2"
130+ shift 2
131+ ;;
132+ --domain)
133+ domain_target="$2"
134+ shift 2
135+ ;;
136+ --check-loaded)
137+ check_loaded="1"
138+ shift
139+ ;;
140+ --expected-rolez)
141+ expected_rolez="$2"
142+ shift 2
143+ ;;
144+ --skip-static-check)
145+ skip_static_check="1"
146+ shift
147+ ;;
148+ --skip-port-check)
149+ skip_port_check="1"
150+ shift
151+ ;;
152+ --skip-process-check)
153+ skip_process_check="1"
154+ shift
155+ ;;
156+ --skip-http-check)
157+ skip_http_check="1"
158+ shift
159+ ;;
160+ --skip-log-check)
161+ skip_log_check="1"
162+ shift
163+ ;;
164+ --help)
165+ usage
166+ exit 0
167+ ;;
168+ *)
169+ die "Unknown option: $1"
170+ ;;
171+ esac
172+done
173+
174+validate_node "$node"
175+validate_scope "$scope"
176+
177+case "$expected_rolez" in
178+ any | leader | standby) ;;
179+ *)
180+ die "Unsupported --expected-rolez value: ${expected_rolez}"
181+ ;;
182+esac
183+
184+if [[ "${#services[@]}" -eq 0 ]]; then
185+ while IFS= read -r service; do
186+ services+=("$service")
187+ done < <(default_node_verification_services)
188+fi
189+
190+if [[ -z "$install_dir" ]]; then
191+ install_dir="$(default_install_dir "$scope" "$home_dir")"
192+fi
193+
194+if [[ "$check_loaded" == "1" && -z "$domain_target" ]]; then
195+ domain_target="$(default_domain_target "$scope")"
196+fi
197+
198+set -- $(resolve_node_defaults "$node")
199+conductor_host="$1"
200+conductor_role="$2"
201+node_id="$3"
202+
203+logs_launchd_dir="${repo_dir}/logs/launchd"
204+HTTP_STATUS=""
205+HTTP_BODY=""
206+
207+normalize_base_url() {
208+ local value="$1"
209+
210+ while [[ "$value" == */ ]]; do
211+ value="${value%/}"
212+ done
213+
214+ printf '%s\n' "$value"
215+}
216+
217+extract_port_from_url() {
218+ local service="$1"
219+ local url="$2"
220+ local authority
221+ local default_port
222+
223+ authority="${url#*://}"
224+ authority="${authority%%/*}"
225+
226+ if [[ "$authority" == *:* ]]; then
227+ printf '%s\n' "${authority##*:}"
228+ return 0
229+ fi
230+
231+ default_port="$(service_default_port "$service")"
232+ if [[ -n "$default_port" ]]; then
233+ printf '%s\n' "$default_port"
234+ return 0
235+ fi
236+
237+ case "$url" in
238+ https://*)
239+ printf '%s\n' "443"
240+ ;;
241+ http://*)
242+ printf '%s\n' "80"
243+ ;;
244+ *)
245+ die "Could not derive a TCP port from URL: ${url}"
246+ ;;
247+ esac
248+}
249+
250+http_get() {
251+ local url="$1"
252+ local body_file
253+
254+ body_file="$(mktemp "${TMPDIR:-/tmp}/baa-runtime-http.XXXXXX")"
255+
256+ HTTP_STATUS="$(curl -sS --max-time 5 -o "$body_file" -w '%{http_code}' "$url")" || {
257+ rm -f "$body_file"
258+ die "Request failed: ${url}"
259+ }
260+
261+ HTTP_BODY="$(tr -d '\r' <"$body_file")"
262+ rm -f "$body_file"
263+
264+ while [[ "$HTTP_BODY" == *$'\n' ]]; do
265+ HTTP_BODY="${HTTP_BODY%$'\n'}"
266+ done
267+}
268+
269+assert_http_equals() {
270+ local name="$1"
271+ local url="$2"
272+ local expected_status="$3"
273+ local expected_body="$4"
274+
275+ http_get "$url"
276+
277+ if [[ "$HTTP_STATUS" != "$expected_status" ]]; then
278+ die "${name} returned HTTP ${HTTP_STATUS}, expected ${expected_status}"
279+ fi
280+
281+ if [[ "$HTTP_BODY" != "$expected_body" ]]; then
282+ die "${name} body mismatch: expected '${expected_body}', got '${HTTP_BODY}'"
283+ fi
284+
285+ runtime_log "${name} ok"
286+}
287+
288+assert_http_contains() {
289+ local name="$1"
290+ local url="$2"
291+ local expected_status="$3"
292+ local expected_substring="$4"
293+
294+ http_get "$url"
295+
296+ if [[ "$HTTP_STATUS" != "$expected_status" ]]; then
297+ die "${name} returned HTTP ${HTTP_STATUS}, expected ${expected_status}"
298+ fi
299+
300+ if [[ "$HTTP_BODY" != *"$expected_substring"* ]]; then
301+ die "${name} body does not contain expected text: ${expected_substring}"
302+ fi
303+
304+ runtime_log "${name} ok"
305+}
306+
307+check_loaded_services() {
308+ require_command launchctl
309+
310+ for service in "${services[@]}"; do
311+ launchctl print "${domain_target}/$(service_label "$service")" >/dev/null
312+ runtime_log "launchctl loaded: $(service_label "$service")"
313+ done
314+}
315+
316+run_static_checks() {
317+ local static_args=()
318+
319+ static_args+=(
320+ --node "$node"
321+ --scope "$scope"
322+ --repo-dir "$repo_dir"
323+ --home-dir "$home_dir"
324+ --install-dir "$install_dir"
325+ --control-api-base "$control_api_base"
326+ --local-api-base "$local_api_base"
327+ --username "$username"
328+ )
329+
330+ for service in "${services[@]}"; do
331+ static_args+=(--service "$service")
332+ done
333+
334+ if [[ -n "$shared_token" ]]; then
335+ static_args+=(--shared-token "$shared_token")
336+ fi
337+
338+ if [[ -n "$shared_token_file" ]]; then
339+ static_args+=(--shared-token-file "$shared_token_file")
340+ fi
341+
342+ if [[ "$check_loaded" == "1" ]]; then
343+ static_args+=(--check-loaded --domain "$domain_target")
344+ fi
345+
346+ "${SCRIPT_DIR}/check-launchd.sh" "${static_args[@]}"
347+}
348+
349+check_service_process() {
350+ local service="$1"
351+ local process_pattern="$2"
352+ local process_lines
353+
354+ process_lines="$(ps -axo pid=,command= | awk -v pattern="$process_pattern" 'index($0, pattern) > 0 { print }')"
355+
356+ if [[ -z "$process_lines" ]]; then
357+ die "${service} process not found for pattern: ${process_pattern}"
358+ fi
359+
360+ runtime_log "${service} process ok: $(printf '%s\n' "$process_lines" | sed -n '1p')"
361+}
362+
363+check_listen_port() {
364+ local service="$1"
365+ local port="$2"
366+ local socket_lines
367+
368+ socket_lines="$(lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null || true)"
369+
370+ if [[ -z "$socket_lines" ]]; then
371+ die "${service} is not listening on TCP port ${port}"
372+ fi
373+
374+ runtime_log "${service} listening on TCP ${port}"
375+}
376+
377+check_service_logs() {
378+ local service="$1"
379+ local stdout_path
380+ local stderr_path
381+
382+ stdout_path="$(service_stdout_path "$logs_launchd_dir" "$service")"
383+ stderr_path="$(service_stderr_path "$logs_launchd_dir" "$service")"
384+
385+ assert_directory "$(dirname -- "$stdout_path")"
386+ assert_file "$stdout_path"
387+ assert_file "$stderr_path"
388+
389+ runtime_log "${service} log files present"
390+}
391+
392+check_conductor_runtime() {
393+ local conductor_base_url="$1"
394+ local port
395+
396+ if [[ "$skip_port_check" != "1" ]]; then
397+ port="$(extract_port_from_url "conductor" "$conductor_base_url")"
398+ check_listen_port "conductor" "$port"
399+ fi
400+
401+ if [[ "$skip_http_check" != "1" ]]; then
402+ assert_http_equals "conductor /healthz" "${conductor_base_url}/healthz" "200" "ok"
403+ assert_http_equals "conductor /readyz" "${conductor_base_url}/readyz" "200" "ready"
404+
405+ http_get "${conductor_base_url}/rolez"
406+ if [[ "$HTTP_STATUS" != "200" ]]; then
407+ die "conductor /rolez returned HTTP ${HTTP_STATUS}, expected 200"
408+ fi
409+
410+ case "$expected_rolez" in
411+ any)
412+ case "$HTTP_BODY" in
413+ leader | standby) ;;
414+ *)
415+ die "conductor /rolez must be leader or standby, got '${HTTP_BODY}'"
416+ ;;
417+ esac
418+ ;;
419+ *)
420+ if [[ "$HTTP_BODY" != "$expected_rolez" ]]; then
421+ die "conductor /rolez mismatch: expected '${expected_rolez}', got '${HTTP_BODY}'"
422+ fi
423+ ;;
424+ esac
425+
426+ runtime_log "conductor /rolez ok: ${HTTP_BODY}"
427+ fi
428+}
429+
430+check_status_api_runtime() {
431+ local status_base_url="$1"
432+ local port
433+
434+ if [[ "$skip_port_check" != "1" ]]; then
435+ port="$(extract_port_from_url "status-api" "$status_base_url")"
436+ check_listen_port "status-api" "$port"
437+ fi
438+
439+ if [[ "$skip_http_check" != "1" ]]; then
440+ assert_http_equals "status-api /healthz" "${status_base_url}/healthz" "200" "ok"
441+ assert_http_contains "status-api /v1/status" "${status_base_url}/v1/status" "200" "\"ok\": true"
442+ assert_http_contains "status-api /v1/status/ui" "${status_base_url}/v1/status/ui" "200" "BAA Conductor Status"
443+ fi
444+}
445+
446+local_api_base="$(normalize_base_url "$local_api_base")"
447+status_api_base="$(normalize_base_url "$status_api_base")"
448+
449+if [[ "$skip_static_check" != "1" ]]; then
450+ run_static_checks
451+elif [[ "$check_loaded" == "1" ]]; then
452+ check_loaded_services
453+fi
454+
455+for service in "${services[@]}"; do
456+ if [[ "$skip_process_check" != "1" ]]; then
457+ check_service_process "$service" "$(service_process_match "$repo_dir" "$service" "$conductor_host" "$conductor_role")"
458+ fi
459+
460+ if [[ "$skip_log_check" != "1" ]]; then
461+ check_service_logs "$service"
462+ fi
463+
464+ case "$service" in
465+ conductor)
466+ check_conductor_runtime "$local_api_base"
467+ ;;
468+ status-api)
469+ check_status_api_runtime "$status_api_base"
470+ ;;
471+ esac
472+done
473+
474+runtime_log "node checks passed for ${node} (${node_id})"
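The port derivation in `extract_port_from_url` can be exercised in isolation. The sketch below is a simplified stand-in, not the committed function: `port_from_url` is a hypothetical name, and it omits the `service_default_port` fallback, so a URL without an explicit port goes straight to the scheme default.

```bash
#!/usr/bin/env bash
# Simplified stand-in for extract_port_from_url (no per-service default port).
port_from_url() {
  local url="$1"
  local authority

  authority="${url#*://}"      # drop the scheme, if any
  authority="${authority%%/*}" # drop any path

  if [[ "$authority" == *:* ]]; then
    printf '%s\n' "${authority##*:}" # text after the last colon
    return 0
  fi

  case "$url" in
    https://*) printf '%s\n' 443 ;;
    http://*) printf '%s\n' 80 ;;
    *) return 1 ;;
  esac
}

port_from_url "http://127.0.0.1:4417"              # prints 4417
port_from_url "https://control-api.makefile.so/v1" # prints 443
```

Because the port is taken as the text after the last colon in the authority, a bracketed IPv6 literal without an explicit port would be misread; the loopback URLs used in this commit are not affected.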
+38, -0
1@@ -9,6 +9,7 @@ readonly BAA_RUNTIME_SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &&
2 readonly BAA_RUNTIME_REPO_DIR_DEFAULT="$(cd -- "${BAA_RUNTIME_SCRIPT_DIR}/../.." && pwd)"
3 readonly BAA_RUNTIME_DEFAULT_CONTROL_API_BASE="https://control-api.makefile.so"
4 readonly BAA_RUNTIME_DEFAULT_LOCAL_API="http://127.0.0.1:4317"
5+readonly BAA_RUNTIME_DEFAULT_STATUS_API="http://127.0.0.1:4318"
6 readonly BAA_RUNTIME_DEFAULT_LOCALE="en_US.UTF-8"
7
8 runtime_log() {
9@@ -75,6 +76,10 @@ default_services() {
10 printf '%s\n' conductor
11 }
12
13+default_node_verification_services() {
14+ printf '%s\n' conductor status-api
15+}
16+
17 all_services() {
18 printf '%s\n' conductor worker-runner status-api
19 }
20@@ -107,6 +112,39 @@ service_dist_entry_relative() {
21 esac
22 }
23
24+service_default_port() {
25+ case "$1" in
26+ conductor)
27+ printf '%s\n' "4317"
28+ ;;
29+ status-api)
30+ printf '%s\n' "4318"
31+ ;;
32+ worker-runner)
33+ printf '%s\n' ""
34+ ;;
35+ esac
36+}
37+
38+service_process_match() {
39+ local repo_dir="$1"
40+ local service="$2"
41+ local conductor_host="${3:-}"
42+ local conductor_role="${4:-}"
43+ local dist_entry
44+
45+ dist_entry="${repo_dir}/$(service_dist_entry_relative "$service")"
46+
47+ case "$service" in
48+ conductor)
49+ printf '%s --host %s --role %s\n' "$dist_entry" "$conductor_host" "$conductor_role"
50+ ;;
51+ *)
52+ printf '%s\n' "$dist_entry"
53+ ;;
54+ esac
55+}
56+
57 service_template_path() {
58 local repo_dir="$1"
59 local service="$2"
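The pattern printed by `service_process_match` is consumed in `check-node.sh` by `awk` with `index()`, which is a literal substring match rather than a regex match, so the dots in `dist/index.js` are not treated as wildcards. A minimal sketch on canned input follows; `match_process_lines` is a hypothetical helper, and the sample paths, PIDs, and role value are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep the "pid command" lines read from stdin that
# contain the pattern as a literal substring.
match_process_lines() {
  local pattern="$1"
  awk -v pattern="$pattern" 'index($0, pattern) > 0 { print }'
}

# Made-up sample of `ps -axo pid=,command=` output.
sample='101 node /repo/apps/conductor/dist/index.js --host mini --role primary
102 node /repo/apps/status-api/dist/index.js'

printf '%s\n' "$sample" | match_process_lines '/repo/apps/status-api/dist/index.js'
```

One caveat: `awk -v` interprets backslash escapes in the assigned value, so a pattern containing backslashes would not be matched verbatim. Ordinary repo paths do not contain them.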