- commit
- dca7129
- parent
- ebd09f3
- author
- im_wower
- date
- 2026-03-22 02:16:10 +0800 CST
Merge feat/T-025-failover-rehearsal into main
10 files changed,
+1548,
-7
1@@ -1,10 +1,10 @@
2 ---
3 task_id: T-025
4 title: Failover rehearsal and Runbook
5-status: todo
6+status: review
7 branch: feat/T-025-failover-rehearsal
8 repo: /Users/george/code/baa-conductor
9-base_ref: main
10+base_ref: main@6505a31
11 depends_on:
12 - T-019
13 - T-021
14@@ -61,20 +61,40 @@ updated_at: 2026-03-22
15
16 ## files_changed
17
18-- TBD
19+- `coordination/tasks/T-025-failover-rehearsal.md`
20+- `docs/ops/README.md`
21+- `docs/ops/failover-topology.md`
22+- `docs/ops/planned-failover.md`
23+- `docs/ops/emergency-failover.md`
24+- `docs/ops/switchback.md`
25+- `scripts/failover/common.sh`
26+- `scripts/failover/print-topology.sh`
27+- `scripts/failover/rehearsal-check.sh`
28+- `scripts/failover/print-checklist.sh`
29
30 ## commands_run
31
32-- TBD
33+- `npx --yes pnpm install`
34+- `chmod +x scripts/failover/*.sh`
35+- `bash -n scripts/failover/*.sh`
36+- `./scripts/failover/print-topology.sh --env scripts/ops/baa-conductor.env.example`
37+- `./scripts/failover/print-checklist.sh --scenario planned --env scripts/ops/baa-conductor.env.example`
38+- `./scripts/failover/rehearsal-check.sh --env scripts/ops/baa-conductor.env.example --skip-public --skip-control-api`
39+- `git diff --check`
40
41 ## result
42
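43-- TBD
44+- Added read-only helper scripts under `scripts/failover` that print the primary/standby topology, generate per-scenario checklists, and run rehearsal checks against the public, direct, and control-api endpoints.
45+- Added four documents under `docs/ops`: failover topology, planned failover, emergency failover, and switchback.
46+- The runbooks spell out that Cloudflare DNS stays pinned to the VPS, that VPS Nginx proxies to the Tailscale `100.x` addresses, that each node's `launchd` controls whether the local conductor stays alive, and that Nginx currently fails over on connectivity alone, with no awareness of the lease.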
47
48 ## risks
49
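50-- TBD
51+- `GET /v1/system/state` still returns only the flat fields `holder_id/mode/term/lease_expires_at`, so the scripts can only infer the leader node from the `holder_id` prefix.
52+- There is no real `promote/demote` maintenance API yet; planned failover and switchback still rely on `pause/drain` plus local `launchctl bootout/reload`.
53+- If mini is still reachable but logically should not take traffic, the public ingress may still hit mini; the emergency runbook documents that this case requires a hotfix to VPS Nginx.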
54
55 ## next_handoff
56
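57-- TBD
58+- Run one dry-run / rehearsal on real nodes or staging by following the runbooks, and confirm that the commands, permissions, and paths on `mini`, `mac`, and the VPS match the docs.
59+- As a follow-up, add a leader host field to `system.state` and a real maintenance `promote/demote` API to reduce reliance on manual `launchctl` and Nginx hotfixes.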
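The first risk above notes that the scripts can only infer the leader node from the `holder_id` prefix. A minimal sketch of that inference follows, assuming holder IDs keep the `mini-` / `mac-` prefixes used by the default node IDs. The function name `leader_from_holder_id` is hypothetical and is not taken from `rehearsal-check.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a lease holder_id to a node name by prefix.
# Assumes holder IDs look like "mini-main" or "mac-standby".
leader_from_holder_id() {
  local holder_id="$1"

  case "$holder_id" in
    mini-*) printf '%s\n' "mini" ;;
    mac-*)  printf '%s\n' "mac" ;;
    *)
      # Unknown prefix: report it and signal failure to the caller.
      printf '%s\n' "unknown"
      return 1
      ;;
  esac
}
```

A dedicated leader host field in `system.state` would make this prefix matching unnecessary, which is why it is listed under next_handoff.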
+22,
-0
1@@ -10,6 +10,28 @@
2
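3 The Worker custom domain `control-api.makefile.so` is still managed by the Cloudflare Worker / D1 tasks and is outside the scope of the scripts here.
4
5+## Failover and Rehearsal
6+
7+Wave 4 put the public ingress and the node runtime in place. Starting with wave 5, how to rehearse and how to perform a primary/standby switchover is consolidated here: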
8+
9+- [`failover-topology.md`](./failover-topology.md)
10+- [`planned-failover.md`](./planned-failover.md)
11+- [`emergency-failover.md`](./emergency-failover.md)
12+- [`switchback.md`](./switchback.md)
13+
14+The companion read-only / semi-automated scripts are:
15+
16+- [`../../scripts/failover/print-topology.sh`](../../scripts/failover/print-topology.sh)
17+- [`../../scripts/failover/rehearsal-check.sh`](../../scripts/failover/rehearsal-check.sh)
18+- [`../../scripts/failover/print-checklist.sh`](../../scripts/failover/print-checklist.sh)
19+
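20+These documents and scripts make the constraints of the current design explicit:
21+
22+- Cloudflare DNS does not take part in failover; the public records stay pinned to the VPS
23+- Switching `conductor.makefile.so` depends on VPS Nginx's connectivity checks against the Tailscale `100.x` upstreams
24+- `launchd` decides whether the node-local conductor keeps listening on `127.0.0.1:4317` and on Tailscale
25+- Nginx does not know who holds the leader lease, so "the logical leader has moved" does not mean "public traffic has moved"
26+
27 ## Single-source inventory
28
29 This task consolidates the public domains, the VPS public IP, the internal Tailscale `100.x` addresses, and the Nginx install paths into a single inventory: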
+145,
-0
1@@ -0,0 +1,145 @@
2+# Emergency Failover Runbook
3+
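4+Applies when:
5+
6+- `mini` is down, cannot be logged in to, or its conductor behavior can no longer be trusted
7+- The public ingress needs to be stabilized on `mac` as quickly as possible
8+
9+The goal of this runbook is to restore availability first and worry about a tidy switchback later.
10+
11+## 1. Determine which kind of failure this is
12+
13+Start with a read-only snapshot: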
14+
15+```bash
16+./scripts/failover/rehearsal-check.sh \
17+ --env ../baa-conductor.ops.env \
18+ --basic-auth 'conductor-ops:REPLACE_ME' \
19+ --bearer-token 'REPLACE_ME' \
20+ --skip-node mini \
21+ --expect-leader mac
22+```
23+
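24+Read the result as one of three cases:
25+
26+1. `mac` direct already reports `leader` and the public ingress is healthy
27+ The VPS has already fallen over from mini to mac automatically; keep observing and recording.
28+2. `mac` direct is unhealthy
29+ The standby node itself is not ready; fix `mac` first.
30+3. `mac` is already `leader`, but the public ingress is still unstable or public `/rolez` is not `leader`
31+ This usually means the VPS is still hitting mini, or the VPS-to-mac upstream path has a problem.
32+
33+## 2. Get mac into a serviceable state first
34+
35+On `mac`, check launchd: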
36+
37+```bash
38+cd /Users/george/code/baa-conductor
39+./scripts/runtime/check-launchd.sh \
40+ --repo-dir /Users/george/code/baa-conductor \
41+ --node mac \
42+ --install-dir "$HOME/Library/LaunchAgents"
43+```
44+
45+Reload if necessary:
46+
47+```bash
48+cd /Users/george/code/baa-conductor
49+./scripts/runtime/reload-launchd.sh
50+launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
51+```
52+
53+If the installed copy is missing or has drifted from the current repo, reinstall it first:
54+
55+```bash
56+cd /Users/george/code/baa-conductor
57+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac
58+./scripts/runtime/reload-launchd.sh
59+```
60+
61+## 3. Check the VPS-to-mac upstream path
62+
63+On the VPS, confirm:
64+
65+- `100.112.239.13:4317` is reachable
66+- Nginx has no syntax errors
67+- The TLS certificates and includes are still intact
68+
69+Minimal check:
70+
71+```bash
72+curl -sS http://100.112.239.13:4317/healthz
73+curl -sS http://100.112.239.13:4317/rolez
74+sudo nginx -t
75+```
76+
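77+If all of these pass but public traffic still does not land on mac, mini may still be running a process that responds but should not be taking traffic.
78+
79+## 4. Emergency Nginx hotfix
80+
81+The current canonical config is "mini primary, mac backup". That is the simplest config for steady state, but it has a cost in an emergency:
82+
83+- As long as mini still answers HTTP, public traffic goes to mini first
84+- Even if mini's `/rolez` is no longer leader, Nginx will not step aside on its own
85+
86+In that case, apply a temporary hotfix to the deployed config on the VPS to move mac to the front.
87+
88+Back up first: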
89+
90+```bash
91+sudo cp \
92+ /etc/nginx/sites-available/baa-conductor.conf \
93+ /etc/nginx/sites-available/baa-conductor.conf.bak.$(date +%Y%m%d%H%M%S)
94+```
95+
96+Temporarily change `conductor_primary` to:
97+
98+```nginx
99+upstream conductor_primary {
100+ server 100.112.239.13:4317 max_fails=2 fail_timeout=5s;
101+ server 100.71.210.78:4317 backup;
102+ keepalive 32;
103+}
104+```
105+
106+Then:
107+
108+```bash
109+sudo nginx -t
110+sudo systemctl reload nginx
111+```
112+
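113+Be clear about the nature of this step:
114+
115+- It is an emergency hotfix to the deployed file, not a repo change
116+- Switchback must restore the VPS to the canonical bundle generated from the repo
117+
118+## 5. Verify the emergency failover has settled
119+
120+Run again: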
121+
122+```bash
123+./scripts/failover/rehearsal-check.sh \
124+ --env ../baa-conductor.ops.env \
125+ --basic-auth 'conductor-ops:REPLACE_ME' \
126+ --bearer-token 'REPLACE_ME' \
127+ --skip-node mini \
128+ --expect-leader mac
129+```
130+
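131+Success criteria:
132+
133+- `conductor.makefile.so` returns `leader`
134+- `mac-conductor.makefile.so` returns `leader`
135+- `holder_id` from `GET /v1/system/state` starts with `mac-`
136+
137+## 6. Post-incident record
138+
139+After the emergency failover ends, record at least:
140+
141+- Whether mini was completely unreachable or still reachable with the wrong role
142+- Whether mac needed a manual launchd reload to recover
143+- Whether the VPS received a temporary Nginx hotfix
144+- Which backup file holds the pre-hotfix `baa-conductor.conf`
145+
146+This information directly determines how complex the later switchback will be.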
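The three-way triage in section 1 of this runbook can be written down as a small decision function. This is only a sketch: the function name, its arguments, and the returned labels are hypothetical, and the inputs are values an operator reads off the snapshot by hand rather than anything `rehearsal-check.sh` is known to emit in this form.

```bash
#!/usr/bin/env bash
# Hypothetical classifier for the three cases in section 1.
#   $1 = mac direct health ("ok" or anything else)
#   $2 = mac direct /rolez value
#   $3 = public /rolez value
classify_emergency() {
  local mac_health="$1"
  local mac_role="$2"
  local public_role="$3"

  if [[ "$mac_health" != "ok" ]]; then
    # Case 2: the standby itself is not ready.
    printf '%s\n' "fix-mac-first"
  elif [[ "$mac_role" == "leader" && "$public_role" == "leader" ]]; then
    # Case 1: the VPS already fell over to mac.
    printf '%s\n' "observe-and-record"
  elif [[ "$mac_role" == "leader" ]]; then
    # Case 3: mac leads, but public ingress disagrees.
    printf '%s\n' "check-vps-upstream"
  else
    # Not covered by the runbook's three cases.
    printf '%s\n' "undetermined"
  fi
}
```

For example, `classify_emergency ok leader standby` prints `check-vps-upstream`, which corresponds to sections 3 and 4.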
+151,
-0
1@@ -0,0 +1,151 @@
2+# Failover Topology
3+
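4+This page pins down the four layers of the current primary/standby design that are most easily confused:
5+
6+1. Cloudflare DNS
7+2. Nginx on the VPS
8+3. The Tailscale `100.x` addresses of the two nodes
9+4. The node-local `launchd`
10+
11+The goal is not to introduce a new automatic failover mechanism, but to organize the existing conventions into an operating model that can be rehearsed.
12+
13+## 1. Fixed topology
14+
15+### Public ingress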
16+
17+- `conductor.makefile.so`
18+- `mini-conductor.makefile.so`
19+- `mac-conductor.makefile.so`
20+
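21+All three hosts resolve to the same VPS public IP. Failover is not done by changing DNS; Cloudflare only delivers public traffic to the VPS.
22+
23+`control-api.makefile.so` is a separate Cloudflare Worker custom domain and continues to serve as the control plane: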
24+
25+- `GET /v1/system/state`
26+- `POST /v1/system/drain`
27+- `POST /v1/system/pause`
28+- `POST /v1/system/resume`
29+
30+It does not pass through VPS Nginx, so during a conductor switchover it can keep handling "freeze/resume automation" and "read lease state".
31+
32+### VPS Nginx
33+
34+The canonical config in the repo is:
35+
36+- [`ops/nginx/baa-conductor.conf`](../../ops/nginx/baa-conductor.conf)
37+
38+The current upstream mapping is:
39+
40+| Public host | VPS upstream | Actual backend |
41+| --- | --- | --- |
42+| `conductor.makefile.so` | `conductor_primary` | `mini 100.71.210.78:4317` primary, `mac 100.112.239.13:4317` backup |
43+| `mini-conductor.makefile.so` | `mini_conductor_direct` | `mini 100.71.210.78:4317` |
44+| `mac-conductor.makefile.so` | `mac_conductor_direct` | `mac 100.112.239.13:4317` |
45+
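46+Key points:
47+
48+- `conductor.makefile.so` falls back to mac, at the TCP level, only when the mini upstream cannot be connected to.
49+- Nginx does not look at the lease and does not know who the logical leader is.
50+- So as long as mini still returns HTTP on `100.71.210.78:4317`, public traffic keeps going to mini first.
51+
52+This means:
53+
54+- A planned failover cannot just let mac acquire the lease; mini's conductor must also be stopped or made unreachable from the VPS.
55+- In an emergency failover, if mini is still alive but no longer leader, a temporary hotfix to the deployed Nginx config on the VPS may be needed.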
56+
57+### Tailscale `100.x`
58+
59+The repo is explicitly pinned to Tailscale IPv4 addresses and does not depend on MagicDNS:
60+
61+- `mini`: `100.71.210.78`
62+- `mac`: `100.112.239.13`
63+- port: `4317`
64+
65+These values come from the inventory:
66+
67+- [`scripts/ops/baa-conductor.env.example`](../../scripts/ops/baa-conductor.env.example)
68+
69+### launchd
70+
71+Whether a node process exists, and under which identity, is determined by the installed `launchd` copy.
72+
73+The default node identities are documented in:
74+
75+- [`docs/runtime/environment.md`](../runtime/environment.md)
76+
77+| Node | `BAA_CONDUCTOR_HOST` | `BAA_CONDUCTOR_ROLE` | `BAA_NODE_ID` |
78+| --- | --- | --- | --- |
79+| `mini` | `mini` | `primary` | `mini-main` |
80+| `mac` | `mac` | `standby` | `mac-standby` |
81+
82+The install / check / reload scripts are:
83+
84+- [`scripts/runtime/install-launchd.sh`](../../scripts/runtime/install-launchd.sh)
85+- [`scripts/runtime/check-launchd.sh`](../../scripts/runtime/check-launchd.sh)
86+- [`scripts/runtime/reload-launchd.sh`](../../scripts/runtime/reload-launchd.sh)
87+
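88+`reload-launchd.sh` is suited to re-bootstrapping/kickstarting the installed copy. When a planned failover needs to stop only the conductor without restarting it right away, calling `launchctl bootout` directly is a better fit.
89+
90+## 2. Current failover semantics
91+
92+The real behavior of the current design can be compressed into one sentence:
93+
94+> The logical leader is decided by the lease; whether the public ingress moves is decided by whether the VPS can still connect to the current primary upstream.
95+
96+So the three scenarios differ as follows:
97+
98+- planned failover: freeze automation first, then explicitly stop the mini conductor so that the VPS falls through to the mac backup and the lease moves to mac along with it.
99+- emergency failover: mini is down or untrusted, so the priority is making mac serviceable; apply a temporary Nginx hotfix on the VPS if necessary.
100+- switchback: once mini is repaired, bring mini back to health first, then stop the mac conductor and restore the VPS config to the canonical "mini primary, mac backup" form.
101+
102+## 3. Rehearsal helper scripts
103+
104+The new scripts are: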
105+
106+- [`scripts/failover/common.sh`](../../scripts/failover/common.sh)
107+- [`scripts/failover/print-topology.sh`](../../scripts/failover/print-topology.sh)
108+- [`scripts/failover/rehearsal-check.sh`](../../scripts/failover/rehearsal-check.sh)
109+- [`scripts/failover/print-checklist.sh`](../../scripts/failover/print-checklist.sh)
110+
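111+Their boundaries:
112+
113+- They only perform read-only GET checks or print a checklist
114+- They never execute a real failover directly
115+- They never change DNS
116+- They never change the Nginx templates in the repo
117+
118+Start by looking at the topology: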
119+
120+```bash
121+./scripts/failover/print-topology.sh --env ../baa-conductor.ops.env
122+```
123+
124+Then run a baseline probe:
125+
126+```bash
127+./scripts/failover/rehearsal-check.sh \
128+ --env ../baa-conductor.ops.env \
129+ --basic-auth 'conductor-ops:REPLACE_ME' \
130+ --bearer-token 'REPLACE_ME' \
131+ --expect-leader mini
132+```
133+
134+To get a command skeleton organized by scenario:
135+
136+```bash
137+./scripts/failover/print-checklist.sh \
138+ --scenario planned \
139+ --env ../baa-conductor.ops.env
140+```
141+
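142+## 4. Prerequisites for using these runbooks
143+
144+Before starting any rehearsal, have at least the following ready:
145+
146+- Shell access to `mini`, `mac`, and the VPS
147+- Basic Auth credentials for the direct domains
148+- A readonly or browser_admin token for `GET /v1/system/state`
149+- A browser_admin token for `POST /v1/system/drain|pause|resume`
150+- An inventory env file kept outside the repo
151+
152+If these prerequisites are not met, limit this round to a documentation walkthrough or single-machine read-only checks; do not risk stopping the real primary node.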
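The failover semantics in section 2 of this page can be modeled in a few lines. The sketch below is not Nginx logic itself; it is a hypothetical model of the canonical "mini primary, mac backup" upstream, written to show that lease state never enters the routing decision. Both function names are made up for illustration.

```bash
#!/usr/bin/env bash
# Model of where public traffic lands under the canonical upstream.
# Lease state is deliberately not an input, because Nginx does not consult it.
#   $1 = "up" if mini accepts connections on its 100.x address
#   $2 = "up" if mac accepts connections on its 100.x address
public_ingress_target() {
  local mini_state="$1"
  local mac_state="$2"

  if [[ "$mini_state" == "up" ]]; then
    printf '%s\n' "mini"
  elif [[ "$mac_state" == "up" ]]; then
    printf '%s\n' "mac"
  else
    printf '%s\n' "none"
  fi
}

# Exit status 0 when public traffic lands on a node other than the leader.
#   $1 = node currently holding the lease ("mini" or "mac")
#   $2, $3 = same as public_ingress_target
ingress_mismatch() {
  local leader="$1"
  local target

  target="$(public_ingress_target "$2" "$3")"
  [[ "$target" != "$leader" ]]
}
```

`ingress_mismatch mac up up` succeeds, which is the "mini still answers but is no longer leader" situation that forces either stopping mini or hotfixing the VPS config.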
+184,
-0
1@@ -0,0 +1,184 @@
2+# Planned Failover Runbook
3+
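4+Applies when:
5+
6+- `mini` is currently leader
7+- `mac` has the standby conductor installed and loaded
8+- Planned maintenance requires moving the public ingress smoothly from `mini` to `mac`
9+
10+Non-goals:
11+
12+- No DNS changes
13+- No app code changes
14+- `baa-firefox` stays out of the switchover logic
15+
16+## 1. Prerequisites
17+
18+Prepare the following:
19+
20+- An inventory outside the repo, for example `../baa-conductor.ops.env`
21+- Basic Auth for the direct domains
22+- A readonly token for `GET /v1/system/state`
23+- A browser_admin token for `POST /v1/system/drain|pause|resume`
24+- Shell access to `mini` / `mac`
25+
26+First confirm the current topology and baseline: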
27+
28+```bash
29+./scripts/failover/print-topology.sh --env ../baa-conductor.ops.env
30+
31+./scripts/failover/rehearsal-check.sh \
32+ --env ../baa-conductor.ops.env \
33+ --basic-auth 'conductor-ops:REPLACE_ME' \
34+ --bearer-token 'REPLACE_ME' \
35+ --expect-leader mini
36+```
37+
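38+Expected:
39+
40+- `/rolez` on `conductor.makefile.so` returns `leader`
41+- `/rolez` on `mini-conductor.makefile.so` returns `leader`
42+- `/rolez` on `mac-conductor.makefile.so` returns `standby`
43+- `holder_id` from `GET /v1/system/state` starts with `mini-`
44+
45+## 2. Freeze automation
46+
47+A planned failover drains first, then pauses.
48+
49+`drain` serves to:
50+
51+- Stop starting new work
52+- Give currently running tasks a window to finish naturally
53+
54+The repo currently has no dedicated script that waits for active runs to reach zero, so watch the status panel, task board, or logs manually and confirm that a safe switchover window has been reached.
55+
56+Run: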
57+
58+```bash
59+export CONTROL_API_BASE='https://control-api.makefile.so'
60+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
61+
62+curl -sS -X POST \
63+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
64+ -H 'Content-Type: application/json' \
65+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_rehearsal"}' \
66+ "${CONTROL_API_BASE%/}/v1/system/drain"
67+
68+curl -sS -X POST \
69+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
70+ -H 'Content-Type: application/json' \
71+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_cutover"}' \
72+ "${CONTROL_API_BASE%/}/v1/system/pause"
73+```
74+
75+## 3. Confirm the mac standby can take over
76+
77+On `mac`, run the launchd static/load checks first:
78+
79+```bash
80+cd /Users/george/code/baa-conductor
81+./scripts/runtime/check-launchd.sh \
82+ --repo-dir /Users/george/code/baa-conductor \
83+ --node mac \
84+ --install-dir "$HOME/Library/LaunchAgents"
85+
86+launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
87+```
88+
89+If the installed copy has drifted, re-render it before continuing:
90+
91+```bash
92+cd /Users/george/code/baa-conductor
93+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac
94+./scripts/runtime/reload-launchd.sh
95+```
96+
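97+## 4. Move traffic off mini
98+
99+This is the most critical step in this design.
100+
101+Why:
102+
103+- `mac` acquiring the lease is not enough
104+- As long as mini keeps returning HTTP on `100.71.210.78:4317`, VPS Nginx still sends public traffic to mini first
105+
106+So a planned failover must explicitly stop mini's conductor process rather than only doing a logical promote/demote.
107+
108+With the default `LaunchAgents` setup: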
109+
110+```bash
111+launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"
112+```
113+
114+If the node uses `LaunchDaemons`:
115+
116+```bash
117+sudo launchctl bootout system /Library/LaunchDaemons/so.makefile.baa-conductor.plist
118+```
119+
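120+Notes:
121+
122+- Only `so.makefile.baa-conductor` is stopped here
123+- Whether `worker-runner` and `status-api` are stopped as well is up to the maintenance window
124+- By default this runbook switches only the conductor and does not widen the outage
125+
126+## 5. Verify the cutover
127+
128+After the mini conductor is stopped, probe again: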
129+
130+```bash
131+./scripts/failover/rehearsal-check.sh \
132+ --env ../baa-conductor.ops.env \
133+ --basic-auth 'conductor-ops:REPLACE_ME' \
134+ --bearer-token 'REPLACE_ME' \
135+ --skip-node mini \
136+ --expect-leader mac
137+```
138+
139+Expected:
140+
141+- `conductor.makefile.so` still reports `healthz=ok`, `readyz=ready`, `rolez=leader`
142+- `/rolez` on `mac-conductor.makefile.so` becomes `leader`
143+- `holder_id` from `GET /v1/system/state` starts with `mac-`
144+
145+If `mac` is already leader but public `/rolez` is still wrong:
146+
147+- First check connectivity from the VPS to `100.112.239.13:4317`
148+- Then check whether mini is actually still serving on `100.71.210.78:4317`
149+- A planned failover should not be fixed by changing DNS
150+
151+## 6. Resume automation
152+
153+After confirming that both the public ingress and mac direct are healthy, `resume`:
154+
155+```bash
156+curl -sS -X POST \
157+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
158+ -H 'Content-Type: application/json' \
159+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_complete"}' \
160+ "${CONTROL_API_BASE%/}/v1/system/resume"
161+```
162+
163+## 7. Abort / rollback
164+
165+If mac cannot take over stably during the verification window:
166+
167+1. Keep automation `paused`
168+2. Restart the conductor on `mini`:
169+
170+```bash
171+cd /Users/george/code/baa-conductor
172+./scripts/runtime/reload-launchd.sh
173+```
174+
175+3. Confirm again that the baseline is back on `mini`:
176+
177+```bash
178+./scripts/failover/rehearsal-check.sh \
179+ --env ../baa-conductor.ops.env \
180+ --basic-auth 'conductor-ops:REPLACE_ME' \
181+ --bearer-token 'REPLACE_ME' \
182+ --expect-leader mini
183+```
184+
185+4. Once recovery is confirmed, decide whether to `resume`
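The `drain`, `pause`, and `resume` calls in this runbook share one URL shape and one JSON body shape. The sketch below factors those out so they can be inspected before anything is sent; it makes no network call. The function names are hypothetical, and the body builder assumes the reason string contains no quotes or backslashes.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helpers: build the control API URL and JSON body
# used in this runbook without sending a request.
control_api_url() {
  local base="$1"
  local action="$2"

  case "$action" in
    drain | pause | resume) ;;
    *)
      printf 'unsupported action: %s\n' "$action" >&2
      return 1
      ;;
  esac

  # ${base%/} strips one trailing slash so the path never contains "//".
  printf '%s/v1/system/%s\n' "${base%/}" "$action"
}

# Assumes $1 needs no JSON escaping (no quotes or backslashes).
control_api_body() {
  local reason="$1"
  printf '{"requested_by":"ops_runbook","reason":"%s"}\n' "$reason"
}
```

Printing `control_api_url "$CONTROL_API_BASE" drain` before the maintenance window is a cheap way to confirm the base URL is the one intended.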
+145,
-0
1@@ -0,0 +1,145 @@
2+# Switchback Runbook
3+
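4+Applies when:
5+
6+- After an emergency or planned failover, the current leader is on `mac`
7+- `mini` has been repaired and the system is ready to return to the canonical "mini primary, mac backup" form
8+
9+The point of switchback is not to get mini online as fast as possible, but to clean up the temporary state and return to a repeatable default shape.
10+
11+## 1. Repair mini first, without moving traffic
12+
13+On `mini`, finish validating the basics first: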
14+
15+```bash
16+cd /Users/george/code/baa-conductor
17+./scripts/runtime/bootstrap.sh --repo-dir /Users/george/code/baa-conductor
18+npx --yes pnpm -r build
19+./scripts/runtime/check-launchd.sh \
20+ --repo-dir /Users/george/code/baa-conductor \
21+ --node mini \
22+ --install-dir "$HOME/Library/LaunchAgents"
23+```
24+
25+If needed, re-render and reload the installed copy:
26+
27+```bash
28+cd /Users/george/code/baa-conductor
29+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mini
30+./scripts/runtime/reload-launchd.sh
31+```
32+
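33+Until mac is stopped, mini may still be only `standby`, or temporarily unable to acquire the lease, even after it has recovered. This is normal.
34+
35+## 2. Pause first, then hand over
36+
37+Pause automation before the switchback: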
38+
39+```bash
40+export CONTROL_API_BASE='https://control-api.makefile.so'
41+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
42+
43+curl -sS -X POST \
44+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
45+ -H 'Content-Type: application/json' \
46+ -d '{"requested_by":"ops_runbook","reason":"switchback_prepare"}' \
47+ "${CONTROL_API_BASE%/}/v1/system/pause"
48+```
49+
50+## 3. Stop the mac conductor
51+
52+For both the lease and the public ingress to return to mini, the mac conductor has to step down first.
53+
54+With the default `LaunchAgents` setup:
55+
56+```bash
57+launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"
58+```
59+
60+If using `LaunchDaemons`:
61+
62+```bash
63+sudo launchctl bootout system /Library/LaunchDaemons/so.makefile.baa-conductor.plist
64+```
65+
66+Then make sure the conductor is bootstrapped again on mini:
67+
68+```bash
69+cd /Users/george/code/baa-conductor
70+./scripts/runtime/reload-launchd.sh
71+```
72+
73+## 4. Restore the canonical Nginx config
74+
75+If the VPS was hotfixed during an emergency failover, switchback must overwrite it with the repo's canonical config.
76+
77+Generate the bundle:
78+
79+```bash
80+scripts/ops/nginx-sync-plan.sh \
81+ --env ../baa-conductor.ops.env \
82+ --bundle-dir .tmp/ops/baa-conductor-nginx
83+```
84+
85+Distribute and reload:
86+
87+```bash
88+rsync -av .tmp/ops/baa-conductor-nginx/ root@YOUR_VPS:/tmp/baa-conductor-nginx/
89+ssh root@YOUR_VPS 'cd /tmp/baa-conductor-nginx && sudo ./deploy-on-vps.sh --reload'
90+```
91+
92+This restores the VPS config to:
93+
94+- mini `100.71.210.78:4317` as primary
95+- mac `100.112.239.13:4317` as backup
96+
97+## 5. Verify traffic is back on mini
98+
99+Run:
100+
101+```bash
102+./scripts/failover/rehearsal-check.sh \
103+ --env ../baa-conductor.ops.env \
104+ --basic-auth 'conductor-ops:REPLACE_ME' \
105+ --bearer-token 'REPLACE_ME' \
106+ --skip-node mac \
107+ --expect-leader mini
108+```
109+
110+Success criteria:
111+
112+- `conductor.makefile.so` returns `leader`
113+- `mini-conductor.makefile.so` returns `leader`
114+- `holder_id` from `GET /v1/system/state` starts with `mini-`
115+
116+If this step fails, do not rush to `resume`; check first:
117+
118+- `launchctl print` on mini
119+- `logs/launchd/so.makefile.baa-conductor.err.log` on mini
120+- Whether any emergency hotfix remnants are left on the VPS
121+
122+## 6. Resume automation
123+
124+After confirming the switchback is complete, resume:
125+
126+```bash
127+curl -sS -X POST \
128+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
129+ -H 'Content-Type: application/json' \
130+ -d '{"requested_by":"ops_runbook","reason":"switchback_complete"}' \
131+ "${CONTROL_API_BASE%/}/v1/system/resume"
132+```
133+
134+## 7. Final checks
135+
136+Right after the switchback, run one more full baseline:
137+
138+```bash
139+./scripts/failover/rehearsal-check.sh \
140+ --env ../baa-conductor.ops.env \
141+ --basic-auth 'conductor-ops:REPLACE_ME' \
142+ --bearer-token 'REPLACE_ME' \
143+ --expect-leader mini
144+```
145+
146+If this result matches the normal baseline, the system is back to the default topology.
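Section 5 of this runbook asks whether emergency hotfix remnants are still on the VPS. One way to check is to read the first `server` entry of the `conductor_primary` upstream in the deployed file. The helper below is a sketch with a hypothetical name; it assumes the deployed config keeps the `upstream conductor_primary { ... }` layout shown in the emergency runbook, one directive per line.

```bash
#!/usr/bin/env bash
# Hypothetical check: print the first "server" address inside the
# conductor_primary upstream block of an Nginx config file.
first_primary_upstream() {
  local conf_path="$1"

  awk '
    # Enter the block on the "upstream conductor_primary {" line.
    /^[[:space:]]*upstream[[:space:]]+conductor_primary[[:space:]]*[{]/ {
      in_block = 1
      next
    }
    # Leave at the closing brace if no server line was seen.
    in_block && /[}]/ { exit }
    # Print the address of the first server directive, without ";".
    in_block && $1 == "server" {
      addr = $2
      sub(/;$/, "", addr)
      print addr
      exit
    }
  ' "$conf_path"
}
```

On the VPS, `first_primary_upstream /etc/nginx/sites-available/baa-conductor.conf` printing the mac address `100.112.239.13:4317` would indicate that the hotfix is still in place; the canonical form lists mini `100.71.210.78:4317` first.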
+176,
-0
1@@ -0,0 +1,176 @@
2+#!/usr/bin/env bash
3+
4+if [[ -n "${BAA_FAILOVER_COMMON_SH_LOADED:-}" ]]; then
5+ return 0
6+fi
7+
8+readonly BAA_FAILOVER_COMMON_SH_LOADED=1
9+readonly BAA_FAILOVER_SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
10+readonly BAA_FAILOVER_REPO_DIR_DEFAULT="$(cd -- "${BAA_FAILOVER_SCRIPT_DIR}/../.." && pwd)"
11+readonly BAA_FAILOVER_DEFAULT_ENV_PATH="${BAA_FAILOVER_REPO_DIR_DEFAULT}/scripts/ops/baa-conductor.env.example"
12+readonly BAA_FAILOVER_DEFAULT_CONTROL_API_BASE="https://control-api.makefile.so"
13+
14+failover_log() {
15+ printf '[failover] %s\n' "$*"
16+}
17+
18+failover_warn() {
19+ printf '[failover] warning: %s\n' "$*" >&2
20+}
21+
22+failover_error() {
23+ printf '[failover] error: %s\n' "$*" >&2
24+}
25+
26+die() {
27+ failover_error "$*"
28+ exit 1
29+}
30+
31+require_command() {
32+ if ! command -v "$1" >/dev/null 2>&1; then
33+ die "Missing required command: $1"
34+ fi
35+}
36+
37+contains_value() {
38+ local needle="$1"
39+ shift
40+
41+ local value
42+ for value in "$@"; do
43+ if [[ "$value" == "$needle" ]]; then
44+ return 0
45+ fi
46+ done
47+
48+ return 1
49+}
50+
51+validate_node() {
52+ case "$1" in
53+ mini | mac) ;;
54+ *)
55+ die "Unsupported node: $1"
56+ ;;
57+ esac
58+}
59+
60+validate_scenario() {
61+ case "$1" in
62+ planned | emergency | switchback) ;;
63+ *)
64+ die "Unsupported scenario: $1"
65+ ;;
66+ esac
67+}
68+
69+shell_quote() {
70+ printf '%q' "$1"
71+}
72+
73+require_value() {
74+ local key="$1"
75+ local value="${!key:-}"
76+
77+ if [[ -z "$value" ]]; then
78+ die "Missing required value in inventory: ${key}"
79+ fi
80+
81+ printf '%s\n' "$value"
82+}
83+
84+validate_tailscale_ipv4() {
85+ local value="$1"
86+ local key="$2"
87+
88+ if [[ ! "$value" =~ ^100\.([0-9]{1,3}\.){2}[0-9]{1,3}$ ]]; then
89+ die "${key} must be a Tailscale 100.x IPv4 address: ${value}"
90+ fi
91+}
92+
93+load_env_file() {
94+ local env_path="$1"
95+
96+ if [[ ! -f "$env_path" ]]; then
97+ die "Inventory file not found: ${env_path}"
98+ fi
99+
100+ set -a
101+ # shellcheck disable=SC1090
102+ source "$env_path"
103+ set +a
104+}
105+
106+load_inventory() {
107+ local env_path="$1"
108+
109+ load_env_file "$env_path"
110+
111+ FAILOVER_ENV_PATH="$env_path"
112+ FAILOVER_APP_NAME="${BAA_APP_NAME:-baa-conductor}"
113+ FAILOVER_PUBLIC_IPV4="${BAA_PUBLIC_IPV4:-}"
114+ FAILOVER_PUBLIC_IPV6="${BAA_PUBLIC_IPV6:-}"
115+ FAILOVER_CONDUCTOR_HOST="$(require_value BAA_CONDUCTOR_HOST)"
116+ FAILOVER_MINI_DIRECT_HOST="$(require_value BAA_MINI_DIRECT_HOST)"
117+ FAILOVER_MAC_DIRECT_HOST="$(require_value BAA_MAC_DIRECT_HOST)"
118+ FAILOVER_MINI_TAILSCALE_IP="$(require_value BAA_MINI_TAILSCALE_IP)"
119+ FAILOVER_MAC_TAILSCALE_IP="$(require_value BAA_MAC_TAILSCALE_IP)"
120+ FAILOVER_CONDUCTOR_PORT="${BAA_CONDUCTOR_PORT:-4317}"
121+ FAILOVER_CONTROL_API_BASE="${BAA_CONTROL_API_BASE:-$BAA_FAILOVER_DEFAULT_CONTROL_API_BASE}"
122+
123+ validate_tailscale_ipv4 "$FAILOVER_MINI_TAILSCALE_IP" "BAA_MINI_TAILSCALE_IP"
124+ validate_tailscale_ipv4 "$FAILOVER_MAC_TAILSCALE_IP" "BAA_MAC_TAILSCALE_IP"
125+}
126+
127+node_direct_host() {
128+ validate_node "$1"
129+
130+ case "$1" in
131+ mini)
132+ printf '%s\n' "$FAILOVER_MINI_DIRECT_HOST"
133+ ;;
134+ mac)
135+ printf '%s\n' "$FAILOVER_MAC_DIRECT_HOST"
136+ ;;
137+ esac
138+}
139+
140+node_tailscale_ip() {
141+ validate_node "$1"
142+
143+ case "$1" in
144+ mini)
145+ printf '%s\n' "$FAILOVER_MINI_TAILSCALE_IP"
146+ ;;
147+ mac)
148+ printf '%s\n' "$FAILOVER_MAC_TAILSCALE_IP"
149+ ;;
150+ esac
151+}
152+
153+node_default_role() {
154+ validate_node "$1"
155+
156+ case "$1" in
157+ mini)
158+ printf '%s\n' "primary"
159+ ;;
160+ mac)
161+ printf '%s\n' "standby"
162+ ;;
163+ esac
164+}
165+
166+node_default_id() {
167+ validate_node "$1"
168+
169+ case "$1" in
170+ mini)
171+ printf '%s\n' "mini-main"
172+ ;;
173+ mac)
174+ printf '%s\n' "mac-standby"
175+ ;;
176+ esac
177+}
+240,
-0
1@@ -0,0 +1,240 @@
2+#!/usr/bin/env bash
3+set -euo pipefail
4+
5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
6+# shellcheck source=./common.sh
7+source "${SCRIPT_DIR}/common.sh"
8+
9+usage() {
10+ cat <<'EOF'
11+Usage:
12+ scripts/failover/print-checklist.sh --scenario planned|emergency|switchback [options]
13+
14+Options:
15+ --scenario NAME planned, emergency, or switchback.
16+ --env PATH Inventory file to load.
17+ --control-api-base URL Override the control API base URL.
18+ --help Show this help text.
19+EOF
20+}
21+
22+print_common_exports() {
23+ local env_q=""
24+ local control_api_q=""
25+
26+ env_q="$(shell_quote "$FAILOVER_ENV_PATH")"
27+ control_api_q="$(shell_quote "$control_api_base")"
28+
29+ cat <<EOF
30+Suggested operator variables
31+----------------------------
32+export FAILOVER_ENV=${env_q}
33+export DIRECT_BASIC_AUTH='conductor-ops:REPLACE_ME'
34+export READONLY_TOKEN='REPLACE_ME'
35+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
36+export MINI_SSH='<mini-admin-shell>'
37+export MAC_SSH='<mac-admin-shell>'
38+export VPS_SSH='root@<vps>'
39+export CONTROL_API_BASE=${control_api_q}
40+
41+Common baseline commands
42+------------------------
43+./scripts/failover/print-topology.sh --env "\$FAILOVER_ENV"
44+./scripts/failover/rehearsal-check.sh \\
45+ --env "\$FAILOVER_ENV" \\
46+ --basic-auth "\$DIRECT_BASIC_AUTH" \\
47+ --bearer-token "\$READONLY_TOKEN" \\
48+ --control-api-base "\$CONTROL_API_BASE"
49+EOF
50+}
51+
52+print_planned_checklist() {
53+ cat <<'EOF'
54+
55+Planned failover checklist
56+--------------------------
57+1. Confirm the baseline is mini leader:
58+./scripts/failover/rehearsal-check.sh \
59+ --env "$FAILOVER_ENV" \
60+ --basic-auth "$DIRECT_BASIC_AUTH" \
61+ --bearer-token "$READONLY_TOKEN" \
62+ --control-api-base "$CONTROL_API_BASE" \
63+ --expect-leader mini
64+
65+2. Drain and then pause automation:
66+curl -sS -X POST \
67+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
68+ -H 'Content-Type: application/json' \
69+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_rehearsal"}' \
70+ "${CONTROL_API_BASE%/}/v1/system/drain"
71+
72+curl -sS -X POST \
73+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
74+ -H 'Content-Type: application/json' \
75+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_cutover"}' \
76+ "${CONTROL_API_BASE%/}/v1/system/pause"
77+
78+3. On mac, confirm launchd has a healthy standby install:
79+ssh "$MAC_SSH" \
80+ 'cd /Users/george/code/baa-conductor && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac --install-dir "$HOME/Library/LaunchAgents"'
81+
82+4. On mini, stop only the conductor service so VPS Nginx falls through to mac:
83+ssh "$MINI_SSH" \
84+ 'launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"'
85+
86+5. Verify mac is now leader and public ingress still returns leader:
87+./scripts/failover/rehearsal-check.sh \
88+ --env "$FAILOVER_ENV" \
89+ --basic-auth "$DIRECT_BASIC_AUTH" \
90+ --bearer-token "$READONLY_TOKEN" \
91+ --control-api-base "$CONTROL_API_BASE" \
92+ --skip-node mini \
93+ --expect-leader mac
94+
95+6. Resume automation only after mac direct host and public host are both healthy:
96+curl -sS -X POST \
97+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
98+ -H 'Content-Type: application/json' \
99+ -d '{"requested_by":"ops_runbook","reason":"planned_failover_complete"}' \
100+ "${CONTROL_API_BASE%/}/v1/system/resume"
101+EOF
102+}
103+
104+print_emergency_checklist() {
105+ cat <<'EOF'
106+
107+Emergency failover checklist
108+----------------------------
109+1. Snapshot public state first. If mini is already gone, skip its direct checks:
110+./scripts/failover/rehearsal-check.sh \
111+ --env "$FAILOVER_ENV" \
112+ --basic-auth "$DIRECT_BASIC_AUTH" \
113+ --bearer-token "$READONLY_TOKEN" \
114+ --control-api-base "$CONTROL_API_BASE" \
115+ --skip-node mini \
116+ --expect-leader mac
117+
118+2. On mac, repair or restart the local conductor service if needed:
119+ssh "$MAC_SSH" \
120+ 'cd /Users/george/code/baa-conductor && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac --install-dir "$HOME/Library/LaunchAgents"'
121+
122+ssh "$MAC_SSH" \
123+ 'cd /Users/george/code/baa-conductor && ./scripts/runtime/reload-launchd.sh'
124+
125+3. If public ingress still lands on mini while mini is reachable but no longer leader, hotfix the VPS config so mac becomes the first upstream:
126+ssh "$VPS_SSH" 'sudo cp /etc/nginx/sites-available/baa-conductor.conf /etc/nginx/sites-available/baa-conductor.conf.bak.$(date +%Y%m%d%H%M%S)'
127+ssh "$VPS_SSH" 'sudo editor /etc/nginx/sites-available/baa-conductor.conf'
128+ssh "$VPS_SSH" 'sudo nginx -t && sudo systemctl reload nginx'
129+
130+4. Re-run the snapshot until public /rolez=leader and mac direct /rolez=leader:
131+./scripts/failover/rehearsal-check.sh \
132+ --env "$FAILOVER_ENV" \
133+ --basic-auth "$DIRECT_BASIC_AUTH" \
134+ --bearer-token "$READONLY_TOKEN" \
135+ --control-api-base "$CONTROL_API_BASE" \
136+ --skip-node mini \
137+ --expect-leader mac
138+
139+5. Record whether the VPS carried an emergency Nginx hotfix. Switchback must restore the canonical repo-rendered bundle later.
140+EOF
141+}
142+
143+print_switchback_checklist() {
144+ cat <<'EOF'
145+
146+Switchback checklist
147+--------------------
148+1. Rebuild and validate mini before touching traffic:
149+ssh "$MINI_SSH" \
150+ 'cd /Users/george/code/baa-conductor && npx --yes pnpm -r build && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mini --install-dir "$HOME/Library/LaunchAgents"'
151+
152+2. Pause automation so lease ownership can move cleanly back to mini:
153+curl -sS -X POST \
154+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
155+ -H 'Content-Type: application/json' \
156+ -d '{"requested_by":"ops_runbook","reason":"switchback_prepare"}' \
157+ "${CONTROL_API_BASE%/}/v1/system/pause"
158+
159+3. Stop mac conductor and restart mini conductor:
160+ssh "$MAC_SSH" \
161+ 'launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"'
162+
163+ssh "$MINI_SSH" \
164+ 'cd /Users/george/code/baa-conductor && ./scripts/runtime/reload-launchd.sh'
165+
166+4. If emergency hotfixes changed the deployed VPS config, restore the canonical mini-primary bundle from the repo:
167+scripts/ops/nginx-sync-plan.sh --env "$FAILOVER_ENV" --bundle-dir .tmp/ops/baa-conductor-nginx
168+rsync -av .tmp/ops/baa-conductor-nginx/ "$VPS_SSH":/tmp/baa-conductor-nginx/
169+ssh "$VPS_SSH" 'cd /tmp/baa-conductor-nginx && sudo ./deploy-on-vps.sh --reload'
170+
171+5. Verify leadership moved back to mini:
172+./scripts/failover/rehearsal-check.sh \
173+ --env "$FAILOVER_ENV" \
174+ --basic-auth "$DIRECT_BASIC_AUTH" \
175+ --bearer-token "$READONLY_TOKEN" \
176+ --control-api-base "$CONTROL_API_BASE" \
177+ --skip-node mac \
178+ --expect-leader mini
179+
180+6. Resume automation after public and mini direct hosts are both healthy:
181+curl -sS -X POST \
182+ -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
183+ -H 'Content-Type: application/json' \
184+ -d '{"requested_by":"ops_runbook","reason":"switchback_complete"}' \
185+ "${CONTROL_API_BASE%/}/v1/system/resume"
186+EOF
187+}
188+
189+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
190+scenario=""
191+control_api_base=""
192+
193+while [[ $# -gt 0 ]]; do
194+ case "$1" in
195+ --scenario)
196+ validate_scenario "$2"
197+ scenario="$2"
198+ shift 2
199+ ;;
200+ --env)
201+ env_path="$2"
202+ shift 2
203+ ;;
204+ --control-api-base)
205+ control_api_base="$2"
206+ shift 2
207+ ;;
208+ --help)
209+ usage
210+ exit 0
211+ ;;
212+ *)
213+ die "Unknown option: $1"
214+ ;;
215+ esac
216+done
217+
218+if [[ -z "$scenario" ]]; then
219+ die "--scenario is required"
220+fi
221+
222+load_inventory "$env_path"
223+
224+if [[ -z "$control_api_base" ]]; then
225+ control_api_base="$FAILOVER_CONTROL_API_BASE"
226+fi
227+
228+printf 'Scenario: %s\n' "$scenario"
229+print_common_exports
230+
231+case "$scenario" in
232+ planned)
233+ print_planned_checklist
234+ ;;
235+ emergency)
236+ print_emergency_checklist
237+ ;;
238+ switchback)
239+ print_switchback_checklist
240+ ;;
241+esac
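The option loop above reads `$2` directly, so under `set -u` an invocation that ends with `--scenario` or `--env` and no value stops with bash's generic unbound-variable error. A small guard could give a clearer message. This is a sketch only; `require_option_value` is a hypothetical name and is not part of `common.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical guard for option parsing: fail with a clear message when
# an option that needs a value is the last argument.
#   $1 = option name being parsed
#   $2 = number of arguments still left, including the option itself
require_option_value() {
  local option="$1"
  local remaining="$2"

  if (( remaining < 2 )); then
    printf 'Missing value for %s\n' "$option" >&2
    return 1
  fi
}
```

Inside the `case` arms it would be called as `require_option_value "$1" "$#" || exit 1` before `$2` is read.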
+74,
-0
1@@ -0,0 +1,74 @@
2+#!/usr/bin/env bash
3+set -euo pipefail
4+
5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
6+# shellcheck source=./common.sh
7+source "${SCRIPT_DIR}/common.sh"
8+
9+usage() {
10+ cat <<'EOF'
11+Usage:
12+ scripts/failover/print-topology.sh [options]
13+
14+Options:
15+ --env PATH Inventory file to load. Defaults to scripts/ops/baa-conductor.env.example.
16+ --help Show this help text.
17+EOF
18+}
19+
20+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
21+
22+while [[ $# -gt 0 ]]; do
23+ case "$1" in
24+ --env)
25+ env_path="$2"
26+ shift 2
27+ ;;
28+ --help)
29+ usage
30+ exit 0
31+ ;;
32+ *)
33+ die "Unknown option: $1"
34+ ;;
35+ esac
36+done
37+
38+load_inventory "$env_path"
39+
40+public_targets="${FAILOVER_PUBLIC_IPV4:-<unset>}"
41+if [[ -n "${FAILOVER_PUBLIC_IPV6:-}" ]]; then
42+ public_targets="${public_targets}, ${FAILOVER_PUBLIC_IPV6}"
43+fi
44+
45+cat <<EOF
46+Failover Topology
47+=================
48+
49+Inventory: ${FAILOVER_ENV_PATH}
50+Control API: ${FAILOVER_CONTROL_API_BASE}
51+
52+Public ingress
53+--------------
54+- Cloudflare DNS keeps conductor hosts pinned to the VPS public address: ${public_targets}
55+- https://${FAILOVER_CONDUCTOR_HOST} -> VPS Nginx upstream conductor_primary
56+- conductor_primary -> mini ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} (primary), mac ${FAILOVER_MAC_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} (backup)
57+
58+Direct node hosts
59+-----------------
60+- https://${FAILOVER_MINI_DIRECT_HOST} -> Basic Auth -> mini ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT}
61+- https://${FAILOVER_MAC_DIRECT_HOST} -> Basic Auth -> mac ${FAILOVER_MAC_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT}
62+
63+launchd defaults
64+----------------
65+- mini: BAA_CONDUCTOR_HOST=mini, BAA_CONDUCTOR_ROLE=primary, BAA_NODE_ID=mini-main
66+- mac: BAA_CONDUCTOR_HOST=mac, BAA_CONDUCTOR_ROLE=standby, BAA_NODE_ID=mac-standby
67+- Both nodes keep the same repo/runtime root: /Users/george/code/baa-conductor
68+
69+Operational notes
70+-----------------
71+- Cloudflare DNS is not part of failover or switchback. Public traffic stays on the VPS.
72+- Nginx failover is transport-based only. It reacts when a 100.x upstream stops accepting traffic.
73+- Nginx does not inspect leader lease state. If mini still answers on ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} but /rolez says standby, public ingress can still land on mini until mini is stopped or the VPS config is hotfixed.
74+- launchd decides whether each node keeps serving 127.0.0.1:4317 and its Tailscale listener. Control API is only for drain/pause/resume and lease observation.
75+EOF
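
The operational notes printed by print-topology.sh describe a primary/backup upstream in which Nginx looks only at connectivity, never at lease state. A minimal sketch of that selection rule follows. It is a simplified model rather than the actual Nginx behaviour, and `ingress_target` and `ingress_warning` are hypothetical names that do not appear in this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical model of the primary/backup upstream: mini is used whenever it
# accepts connections; mac is used only when mini does not.
# Arguments are "1" (accepting traffic) or "0" (not accepting traffic).
ingress_target() {
  local mini_up="$1" mac_up="$2"
  if [[ "$mini_up" == "1" ]]; then
    printf 'mini'
  elif [[ "$mac_up" == "1" ]]; then
    printf 'mac'
  else
    printf 'none'
  fi
}

# Flags the case called out in the notes: traffic lands on a node whose
# /rolez body is not "leader".
ingress_warning() {
  local target="$1" target_role="$2"
  if [[ "$target" != "none" && "$target_role" != "leader" ]]; then
    printf 'WARN: public ingress lands on %s but /rolez=%s\n' "$target" "$target_role"
  fi
}

# Both nodes reachable, but mini has already given up the lease.
target="$(ingress_target 1 1)"
ingress_warning "$target" "standby"
```

In this model the warning can only be cleared by making mini unreachable or by changing the upstream definition, which matches the "stop mini or hotfix the VPS config" guidance in the notes.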
scripts/failover/rehearsal-check.sh (+384, -0)
1@@ -0,0 +1,384 @@
2+#!/usr/bin/env bash
3+set -euo pipefail
4+
5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
6+# shellcheck source=./common.sh
7+source "${SCRIPT_DIR}/common.sh"
8+
9+usage() {
10+ cat <<'EOF'
11+Usage:
12+ scripts/failover/rehearsal-check.sh [options]
13+
14+Options:
15+ --env PATH Inventory file to load.
16+ --basic-auth USER:PASS Basic Auth for mini/mac direct domains.
17+ --bearer-token TOKEN Bearer token for GET /v1/system/state.
18+ --bearer-token-file PATH Read the bearer token from a file.
19+ --control-api-base URL Override the control API base URL.
20+ --expect-leader NODE Assert that mini or mac is the active leader.
21+ --skip-node NODE Skip direct checks for one node. Repeatable.
22+ --skip-public Skip public conductor host checks.
23+ --skip-control-api Skip GET /v1/system/state even when a token is available.
24+ --timeout SEC Per-request curl timeout. Defaults to 5.
25+ --help Show this help text.
26+
27+Notes:
28+ - Public and direct probes are read-only GET requests against /healthz, /readyz, and /rolez.
29+ - Direct-node checks are skipped automatically when Basic Auth is not provided.
30+ - Control API checks are skipped automatically when a bearer token is not provided.
31+EOF
32+}
33+
34+require_command curl
35+require_command node
36+
37+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
38+basic_auth="${BAA_FAILOVER_BASIC_AUTH:-}"
39+bearer_token="${BAA_CONTROL_API_TOKEN:-}"
40+bearer_token_file=""
41+control_api_base=""
42+expect_leader=""
43+timeout_sec="5"
44+skip_public="0"
45+skip_control_api="0"
46+skip_nodes=()
47+failures=0
48+
49+record_failure() {
50+ failover_error "$*"
51+ failures=$((failures + 1))
52+}
53+
54+probe_endpoint() {
55+ local status_var="$1"
56+ local body_var="$2"
57+ local error_var="$3"
58+ local url="$4"
59+ shift 4
60+
61+ local tmp_body=""
62+ local tmp_err=""
63+ local http_code=""
64+ local error_message=""
65+ local body=""
66+
67+ tmp_body="$(mktemp)"
68+ tmp_err="$(mktemp)"
69+
70+ if ! http_code="$(curl -sS -L --max-time "$timeout_sec" -o "$tmp_body" -w '%{http_code}' "$@" "$url" 2>"$tmp_err")"; then
71+ error_message="$(tr '\n' ' ' < "$tmp_err")"
72+ error_message="${error_message%" "}"
73+ printf -v "$status_var" '%s' "curl_error"
74+ printf -v "$body_var" '%s' ""
75+ printf -v "$error_var" '%s' "$error_message"
76+ rm -f "$tmp_body" "$tmp_err"
77+ return 1
78+ fi
79+
80+ body="$(tr -d '\r' < "$tmp_body")"
81+ body="${body%$'\n'}"
82+
83+ printf -v "$status_var" '%s' "$http_code"
84+ printf -v "$body_var" '%s' "$body"
85+ printf -v "$error_var" '%s' ""
86+
87+ rm -f "$tmp_body" "$tmp_err"
88+}
89+
90+format_probe_result() {
91+ local status="$1"
92+ local body="$2"
93+ local error="$3"
94+
95+ if [[ "$status" == "curl_error" ]]; then
96+ if [[ -n "$error" ]]; then
97+ printf 'ERROR(%s)' "$error"
98+ else
99+ printf 'ERROR'
100+ fi
101+ return 0
102+ fi
103+
104+ if [[ -z "$body" ]]; then
105+ printf '%s(<empty>)' "$status"
106+ return 0
107+ fi
108+
109+ printf '%s(%s)' "$status" "$body"
110+}
111+
112+probe_surface() {
113+ local label="$1"
114+ local base_url="$2"
115+ shift 2
116+
117+ probe_endpoint "${label}_health_status" "${label}_health_body" "${label}_health_error" "${base_url}/healthz" "$@" || true
118+ probe_endpoint "${label}_ready_status" "${label}_ready_body" "${label}_ready_error" "${base_url}/readyz" "$@" || true
119+ probe_endpoint "${label}_role_status" "${label}_role_body" "${label}_role_error" "${base_url}/rolez" "$@" || true
120+}
121+
122+print_surface_summary() {
123+ local label="$1"
124+ local base_url="$2"
125+ local health_status_var="${label}_health_status"
126+ local health_body_var="${label}_health_body"
127+ local health_error_var="${label}_health_error"
128+ local ready_status_var="${label}_ready_status"
129+ local ready_body_var="${label}_ready_body"
130+ local ready_error_var="${label}_ready_error"
131+ local role_status_var="${label}_role_status"
132+ local role_body_var="${label}_role_body"
133+ local role_error_var="${label}_role_error"
134+
135+ printf '%-10s %s healthz=%s readyz=%s rolez=%s\n' \
136+ "${label}" \
137+ "${base_url}" \
138+ "$(format_probe_result "${!health_status_var:-n/a}" "${!health_body_var:-}" "${!health_error_var:-}")" \
139+ "$(format_probe_result "${!ready_status_var:-n/a}" "${!ready_body_var:-}" "${!ready_error_var:-}")" \
140+ "$(format_probe_result "${!role_status_var:-n/a}" "${!role_body_var:-}" "${!role_error_var:-}")"
141+}
142+
143+assert_text_response() {
144+ local label="$1"
145+ local expected_status="$2"
146+ local expected_body="$3"
147+ local actual_status="$4"
148+ local actual_body="$5"
149+ local actual_error="$6"
150+
151+ if [[ "$actual_status" == "curl_error" ]]; then
152+ record_failure "${label} request failed: ${actual_error}"
153+ return 0
154+ fi
155+
156+ if [[ "$actual_status" != "$expected_status" || "$actual_body" != "$expected_body" ]]; then
157+ record_failure "${label} expected ${expected_status}(${expected_body}), got ${actual_status}(${actual_body})"
158+ fi
159+}
160+
161+assert_surface() {
162+ local label="$1"
163+ local expected_role="$2"
164+ local health_status_var="${label}_health_status"
165+ local health_body_var="${label}_health_body"
166+ local health_error_var="${label}_health_error"
167+ local ready_status_var="${label}_ready_status"
168+ local ready_body_var="${label}_ready_body"
169+ local ready_error_var="${label}_ready_error"
170+ local role_status_var="${label}_role_status"
171+ local role_body_var="${label}_role_body"
172+ local role_error_var="${label}_role_error"
173+
174+ assert_text_response "${label} /healthz" "200" "ok" "${!health_status_var:-}" "${!health_body_var:-}" "${!health_error_var:-}"
175+ assert_text_response "${label} /readyz" "200" "ready" "${!ready_status_var:-}" "${!ready_body_var:-}" "${!ready_error_var:-}"
176+ assert_text_response "${label} /rolez" "200" "$expected_role" "${!role_status_var:-}" "${!role_body_var:-}" "${!role_error_var:-}"
177+}
178+
179+parse_system_state_json() {
180+ node -e 'const fs = require("fs");
181+const payload = JSON.parse(fs.readFileSync(0, "utf8"));
182+const pick = (...values) => values.find((value) => value !== undefined && value !== null);
183+const mode = pick(payload.data && payload.data.mode, payload.mode, payload.automation && payload.automation.mode, "");
184+const holder = pick(payload.data && payload.data.holder_id, payload.holder_id, payload.leader && payload.leader.controller_id, "");
185+const term = pick(payload.data && payload.data.term, payload.term, payload.leader && payload.leader.term, "");
186+const lease = pick(payload.data && payload.data.lease_expires_at, payload.lease_expires_at, payload.leader && payload.leader.lease_expires_at, "");
187+process.stdout.write([mode, holder, term, lease].map((value) => value == null ? "" : String(value)).join("\t"));'
188+}
189+
190+while [[ $# -gt 0 ]]; do
191+ case "$1" in
192+ --env)
193+ env_path="$2"
194+ shift 2
195+ ;;
196+ --basic-auth)
197+ basic_auth="$2"
198+ shift 2
199+ ;;
200+ --bearer-token)
201+ bearer_token="$2"
202+ shift 2
203+ ;;
204+ --bearer-token-file)
205+ bearer_token_file="$2"
206+ shift 2
207+ ;;
208+ --control-api-base)
209+ control_api_base="$2"
210+ shift 2
211+ ;;
212+ --expect-leader)
213+ validate_node "$2"
214+ expect_leader="$2"
215+ shift 2
216+ ;;
217+ --skip-node)
218+ validate_node "$2"
219+ if ! contains_value "$2" "${skip_nodes[@]-}"; then
220+ skip_nodes+=("$2")
221+ fi
222+ shift 2
223+ ;;
224+ --skip-public)
225+ skip_public="1"
226+ shift
227+ ;;
228+ --skip-control-api)
229+ skip_control_api="1"
230+ shift
231+ ;;
232+ --timeout)
233+ timeout_sec="$2"
234+ shift 2
235+ ;;
236+ --help)
237+ usage
238+ exit 0
239+ ;;
240+ *)
241+ die "Unknown option: $1"
242+ ;;
243+ esac
244+done
245+
246+load_inventory "$env_path"
247+
248+if [[ -n "$bearer_token_file" ]]; then
249+ if [[ ! -f "$bearer_token_file" ]]; then
250+ die "Bearer token file not found: ${bearer_token_file}"
251+ fi
252+ bearer_token="$(tr -d '\r\n' < "$bearer_token_file")"
253+fi
254+
255+if [[ -z "$control_api_base" ]]; then
256+ control_api_base="$FAILOVER_CONTROL_API_BASE"
257+fi
258+
259+if [[ -z "$basic_auth" ]]; then
260+ if ! contains_value mini "${skip_nodes[@]-}"; then
261+ skip_nodes+=("mini")
262+ fi
263+ if ! contains_value mac "${skip_nodes[@]-}"; then
264+ skip_nodes+=("mac")
265+ fi
266+ failover_warn "No direct-node Basic Auth configured; skipping mini/mac direct probes."
267+fi
268+
269+if [[ -n "$expect_leader" ]]; then
270+ if [[ -z "$bearer_token" || "$skip_control_api" == "1" ]] && contains_value mini "${skip_nodes[@]-}" && contains_value mac "${skip_nodes[@]-}"; then
271+ die "Cannot verify --expect-leader without direct-node auth or a control API bearer token."
272+ fi
273+fi
274+
275+basic_auth_args=()
276+if [[ -n "$basic_auth" ]]; then
277+ basic_auth_args=(-u "$basic_auth")
278+fi
279+
280+printf 'Failover rehearsal snapshot\n'
281+printf 'inventory %s\n' "$FAILOVER_ENV_PATH"
282+
283+if [[ "$skip_public" != "1" ]]; then
284+ public_base_url="https://${FAILOVER_CONDUCTOR_HOST}"
285+ probe_surface "public" "$public_base_url"
286+ print_surface_summary "public" "$public_base_url"
287+ assert_surface "public" "leader"
288+else
289+ printf '%-10s skipped\n' "public"
290+fi
291+
292+if ! contains_value mini "${skip_nodes[@]-}"; then
293+ mini_base_url="https://${FAILOVER_MINI_DIRECT_HOST}"
294+ probe_surface "mini" "$mini_base_url" "${basic_auth_args[@]}"
295+ print_surface_summary "mini" "$mini_base_url"
296+else
297+ printf '%-10s skipped\n' "mini"
298+fi
299+
300+if ! contains_value mac "${skip_nodes[@]-}"; then
301+ mac_base_url="https://${FAILOVER_MAC_DIRECT_HOST}"
302+ probe_surface "mac" "$mac_base_url" "${basic_auth_args[@]}"
303+ print_surface_summary "mac" "$mac_base_url"
304+else
305+ printf '%-10s skipped\n' "mac"
306+fi
307+
308+control_mode=""
309+control_holder=""
310+control_term=""
311+control_lease_expires_at=""
312+
313+if [[ "$skip_control_api" != "1" && -n "$bearer_token" ]]; then
314+ control_state_url="${control_api_base%/}/v1/system/state"
315+ probe_endpoint "control_status" "control_body" "control_error" "$control_state_url" \
316+ -H "Authorization: Bearer ${bearer_token}" \
317+ -H "Accept: application/json" || true
318+
319+ if [[ "${control_status:-}" == "curl_error" ]]; then
320+ printf '%-10s %s %s\n' "control" "$control_state_url" "$(format_probe_result "$control_status" "" "$control_error")"
321+ record_failure "control API /v1/system/state request failed: ${control_error}"
322+ elif [[ "${control_status:-}" != "200" ]]; then
323+ printf '%-10s %s %s\n' "control" "$control_state_url" "$(format_probe_result "$control_status" "$control_body" "$control_error")"
324+ record_failure "control API /v1/system/state expected 200, got ${control_status}(${control_body})"
325+ else
326+ parsed_control_state="$(printf '%s' "$control_body" | parse_system_state_json 2>/dev/null || true)"
327+ if [[ -z "$parsed_control_state" ]]; then
328+ printf '%-10s %s 200(raw=%s)\n' "control" "$control_state_url" "$control_body"
329+ record_failure "control API /v1/system/state returned JSON that could not be normalized"
330+ else
331+ IFS=$'\t' read -r control_mode control_holder control_term control_lease_expires_at <<<"$parsed_control_state"
332+ printf '%-10s %s mode=%s holder_id=%s term=%s lease_expires_at=%s\n' \
333+ "control" \
334+ "$control_state_url" \
335+ "${control_mode:-<empty>}" \
336+ "${control_holder:-<empty>}" \
337+ "${control_term:-<empty>}" \
338+ "${control_lease_expires_at:-<empty>}"
339+ fi
340+ fi
341+else
342+ printf '%-10s skipped\n' "control"
343+fi
344+
345+if [[ -z "$expect_leader" ]]; then
346+ if ! contains_value mini "${skip_nodes[@]-}" && ! contains_value mac "${skip_nodes[@]-}"; then
347+ mini_role="${mini_role_body:-}"
348+ mac_role="${mac_role_body:-}"
349+ case "${mini_role}:${mac_role}" in
350+ leader:standby | standby:leader) ;;
351+ *)
352+ record_failure "Expected exactly one direct node leader, got mini=${mini_role:-<empty>} mac=${mac_role:-<empty>}"
353+ ;;
354+ esac
355+ fi
356+else
357+ if ! contains_value "$expect_leader" "${skip_nodes[@]-}"; then
358+ assert_surface "$expect_leader" "leader"
359+ fi
360+
361+ other_node="mini"
362+ if [[ "$expect_leader" == "mini" ]]; then
363+ other_node="mac"
364+ fi
365+
366+ if ! contains_value "$other_node" "${skip_nodes[@]-}"; then
367+ assert_surface "$other_node" "standby"
368+ fi
369+
370+ if [[ -n "$control_holder" ]]; then
371+ case "$control_holder" in
372+ "${expect_leader}"-*) ;;
373+ *)
374+ record_failure "control API holder_id expected prefix ${expect_leader}-, got ${control_holder}"
375+ ;;
376+ esac
377+ fi
378+fi
379+
380+if [[ "$failures" -gt 0 ]]; then
381+ failover_error "rehearsal checks failed with ${failures} issue(s)"
382+ exit 1
383+fi
384+
385+failover_log "rehearsal checks passed"
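
The leader checks at the end of rehearsal-check.sh reduce to two string comparisons: exactly one direct node reports `leader` on `/rolez`, and the control API `holder_id` starts with the expected node name followed by `-`. A standalone sketch of that logic is shown below. `role_pair_ok` and `holder_matches_node` are hypothetical names, not functions from this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds only when one node is leader and the other is standby.
# $1 is mini's /rolez body, $2 is mac's /rolez body.
role_pair_ok() {
  case "$1:$2" in
    leader:standby | standby:leader) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeeds when holder_id begins with "<node>-".
holder_matches_node() {
  local node="$1" holder_id="$2"
  case "$holder_id" in
    "${node}"-*) return 0 ;;
    *) return 1 ;;
  esac
}

role_pair_ok leader standby && echo "pair ok"
role_pair_ok leader leader || echo "two leaders or bad state"
holder_matches_node mini mini-main && echo "mini holds the lease"
holder_matches_node mini mac-standby || echo "holder mismatch"
```

The prefix match depends on node IDs following the `<node>-<suffix>` naming shown in the launchd defaults (`mini-main`, `mac-standby`). An ID that breaks that convention would be reported as a mismatch even when the lease holder is correct.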