baa-conductor

commit b67c568
parent 6505a31
author im_wower
date 2026-03-22 01:48:52 +0800 CST
feat: add failover rehearsal runbooks
10 files changed,  +1548, -7
M coordination/tasks/T-025-failover-rehearsal.md
+27, -7
 1@@ -1,10 +1,10 @@
 2 ---
 3 task_id: T-025
 4 title: Failover rehearsal and Runbook
 5-status: todo
 6+status: review
 7 branch: feat/T-025-failover-rehearsal
 8 repo: /Users/george/code/baa-conductor
 9-base_ref: main
10+base_ref: main@6505a31
11 depends_on:
12   - T-019
13   - T-021
14@@ -61,20 +61,40 @@ updated_at: 2026-03-22
15 
16 ## files_changed
17 
18-- TBD
19+- `coordination/tasks/T-025-failover-rehearsal.md`
20+- `docs/ops/README.md`
21+- `docs/ops/failover-topology.md`
22+- `docs/ops/planned-failover.md`
23+- `docs/ops/emergency-failover.md`
24+- `docs/ops/switchback.md`
25+- `scripts/failover/common.sh`
26+- `scripts/failover/print-topology.sh`
27+- `scripts/failover/rehearsal-check.sh`
28+- `scripts/failover/print-checklist.sh`
29 
30 ## commands_run
31 
32-- TBD
33+- `npx --yes pnpm install`
34+- `chmod +x scripts/failover/*.sh`
35+- `bash -n scripts/failover/*.sh`
36+- `./scripts/failover/print-topology.sh --env scripts/ops/baa-conductor.env.example`
37+- `./scripts/failover/print-checklist.sh --scenario planned --env scripts/ops/baa-conductor.env.example`
38+- `./scripts/failover/rehearsal-check.sh --env scripts/ops/baa-conductor.env.example --skip-public --skip-control-api`
39+- `git diff --check`
40 
41 ## result
42 
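43-- TBD
44+- Added read-only helper scripts under `scripts/failover` that print the primary/standby topology, generate per-scenario checklists, and run rehearsal checks against the public, direct, and control-api endpoints.
45+- Added four documents under `docs/ops`: failover topology, planned failover, emergency failover, and switchback.
46+- The runbooks state explicitly that Cloudflare DNS is pinned to the VPS, that VPS Nginx proxies to the Tailscale `100.x` addresses, that `launchd` on each node controls whether the local conductor stays alive, and that Nginx currently switches on connectivity only, with no awareness of the lease.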
47 
48 ## risks
49 
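50-- TBD
51+- `GET /v1/system/state` still returns only the flat fields `holder_id/mode/term/lease_expires_at`, so the scripts can only infer the leader node from the `holder_id` prefix.
52+- There is no real `promote/demote` maintenance API yet; planned failover and switchback still rely on `pause/drain` plus a local `launchctl bootout/reload`.
53+- If mini is still reachable but logically should not take traffic, the public entry point may still hit mini; the emergency runbook documents that this case needs a hotfix to VPS Nginx.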
54 
55 ## next_handoff
56 
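57-- TBD
58+- Run one dry-run / rehearsal on the real nodes or on staging, following the runbooks, to confirm that the commands, permissions, and paths on `mini`, `mac`, and the VPS match the documentation.
59+- As follow-up work, add a leader host field to `system.state` and a real maintenance `promote/demote` API to reduce the reliance on manual `launchctl` and Nginx hotfixes.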
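The first risk in the task file notes that the scripts can only infer the leader node from the `holder_id` prefix. A minimal sketch of that inference, assuming the default node IDs `mini-main` and `mac-standby` listed in the topology document. The function name `leader_node_from_holder_id` is hypothetical and is not taken from the repo's scripts:

```bash
# Hypothetical helper: map a holder_id to a node name by its prefix.
# Assumes holder IDs start with "mini-" or "mac-", as the runbooks describe.
leader_node_from_holder_id() {
  case "$1" in
    mini-*) printf '%s\n' "mini" ;;
    mac-*)  printf '%s\n' "mac" ;;
    *)
      # Anything else cannot be attributed to a node by prefix alone.
      printf '%s\n' "unknown"
      return 1
      ;;
  esac
}
```

Prefix matching like this stays fragile until `system.state` carries an explicit leader host field, which is why the handoff notes list that field as follow-up work.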
M docs/ops/README.md
+22, -0
 1@@ -10,6 +10,28 @@
 2 
 3 The Worker custom domain `control-api.makefile.so` is still managed by the Cloudflare Worker / D1 tasks and is outside the scope of the scripts here.
 4 
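 5+## Failover and Rehearsal
 6+
 7+Wave 4 put the public entry point and the node runtime in place. Starting with wave 5, how to rehearse and carry out a primary/standby switch is collected here: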
 8+
 9+- [`failover-topology.md`](./failover-topology.md)
10+- [`planned-failover.md`](./planned-failover.md)
11+- [`emergency-failover.md`](./emergency-failover.md)
12+- [`switchback.md`](./switchback.md)
13+
14+The accompanying read-only / semi-automated scripts are:
15+
16+- [`../../scripts/failover/print-topology.sh`](../../scripts/failover/print-topology.sh)
17+- [`../../scripts/failover/rehearsal-check.sh`](../../scripts/failover/rehearsal-check.sh)
18+- [`../../scripts/failover/print-checklist.sh`](../../scripts/failover/print-checklist.sh)
19+
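20+These documents and scripts pin down the constraints of the current design:
21+
22+- Cloudflare DNS takes no part in failover; the public records stay pointed at the VPS
23+- Switching `conductor.makefile.so` depends on VPS Nginx judging connectivity to the Tailscale `100.x` upstreams
24+- `launchd` decides whether a node's local conductor keeps listening on `127.0.0.1:4317` and on Tailscale
25+- Nginx does not know who holds the leader lease, so "the logical leader has moved" does not mean "public traffic has moved"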
26+
27 ## Single-source inventory
28 
29 This task consolidates the public domains, the VPS public IP, the internal Tailscale `100.x` addresses, and the Nginx install paths into a single inventory:
A docs/ops/emergency-failover.md
+145, -0
  1@@ -0,0 +1,145 @@
  2+# Emergency Failover Runbook
  3+
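  4+When to use:
  5+
  6+- `mini` is down, cannot be logged in to, or its conductor can no longer be trusted
  7+- The public entry point needs to be stabilized on `mac` as quickly as possible
  8+
  9+The goal of this runbook is to restore availability first and worry about a tidy switchback later.
 10+
 11+## 1. Determine which kind of failure this is
 12+
 13+Take a read-only snapshot first: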
 14+
 15+```bash
 16+./scripts/failover/rehearsal-check.sh \
 17+  --env ../baa-conductor.ops.env \
 18+  --basic-auth 'conductor-ops:REPLACE_ME' \
 19+  --bearer-token 'REPLACE_ME' \
 20+  --skip-node mini \
 21+  --expect-leader mac
 22+```
 23+
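 24+Read the result as one of three cases:
 25+
 26+1. The `mac` direct host is already `leader` and the public entry point is healthy
 27+   The VPS has already fallen over from mini to mac automatically; just keep observing and recording.
 28+2. The `mac` direct host is unhealthy
 29+   The standby node itself is not ready yet; fix `mac` first.
 30+3. `mac` is already `leader`, but the public entry point is still unstable, or the public `/rolez` is not `leader`
 31+   This usually means the VPS is still hitting mini, or the VPS-to-mac upstream path has a problem.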
 32+
 33+## 2. Get mac into a serviceable state first
 34+
 35+Check launchd on `mac`:
 36+
 37+```bash
 38+cd /Users/george/code/baa-conductor
 39+./scripts/runtime/check-launchd.sh \
 40+  --repo-dir /Users/george/code/baa-conductor \
 41+  --node mac \
 42+  --install-dir "$HOME/Library/LaunchAgents"
 43+```
 44+
 45+Reload if necessary:
 46+
 47+```bash
 48+cd /Users/george/code/baa-conductor
 49+./scripts/runtime/reload-launchd.sh
 50+launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
 51+```
 52+
 53+If the installed copy is missing or has drifted from the current repo, reinstall it first:
 54+
 55+```bash
 56+cd /Users/george/code/baa-conductor
 57+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac
 58+./scripts/runtime/reload-launchd.sh
 59+```
 60+
 61+## 3. Check the VPS-to-mac upstream path
 62+
 63+On the VPS, confirm:
 64+
 65+- `100.112.239.13:4317` is reachable
 66+- Nginx has no syntax errors
 67+- The TLS certificates and includes are still intact
 68+
 69+Minimal check:
 70+
 71+```bash
 72+curl -sS http://100.112.239.13:4317/healthz
 73+curl -sS http://100.112.239.13:4317/rolez
 74+sudo nginx -t
 75+```
 76+
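 77+If all of these are fine but public traffic still does not land on mac, mini may still be running a process that responds but should not be taking traffic.
 78+
 79+## 4. Emergency Nginx hotfix
 80+
 81+The current canonical config is "mini primary, mac backup". It is the simplest config for normal operation, but it has a cost during an emergency:
 82+
 83+- As long as mini still answers HTTP, public traffic hits mini first
 84+- Even if mini's `/rolez` is no longer leader, Nginx will not step aside on its own
 85+
 86+In that case, apply a temporary hotfix to the deployed config on the VPS to move mac to the front.
 87+
 88+Back up first: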
 89+
 90+```bash
 91+sudo cp \
 92+  /etc/nginx/sites-available/baa-conductor.conf \
 93+  /etc/nginx/sites-available/baa-conductor.conf.bak.$(date +%Y%m%d%H%M%S)
 94+```
 95+
 96+Temporarily change `conductor_primary` to:
 97+
 98+```nginx
 99+upstream conductor_primary {
100+    server 100.112.239.13:4317 max_fails=2 fail_timeout=5s;
101+    server 100.71.210.78:4317 backup;
102+    keepalive 32;
103+}
104+```
105+
106+Then:
107+
108+```bash
109+sudo nginx -t
110+sudo systemctl reload nginx
111+```
112+
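113+Be clear about what this step is:
114+
115+- It is an emergency hotfix to the deployed file, not a repo change
116+- At switchback, the VPS must be restored to the canonical bundle generated from the repo
117+
118+## 5. Verify that the emergency failover has settled
119+
120+Run again: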
121+
122+```bash
123+./scripts/failover/rehearsal-check.sh \
124+  --env ../baa-conductor.ops.env \
125+  --basic-auth 'conductor-ops:REPLACE_ME' \
126+  --bearer-token 'REPLACE_ME' \
127+  --skip-node mini \
128+  --expect-leader mac
129+```
130+
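131+Success criteria:
132+
133+- `conductor.makefile.so` returns `leader`
134+- `mac-conductor.makefile.so` returns `leader`
135+- The `holder_id` from `GET /v1/system/state` starts with `mac-`
136+
137+## 6. Post-incident record
138+
139+After the emergency failover, record at least:
140+
141+- Whether mini was completely unreachable or still reachable but in the wrong role
142+- Whether mac needed a manual launchd reload to recover
143+- Whether a temporary Nginx hotfix was applied on the VPS
144+- Which backup file holds the pre-hotfix `baa-conductor.conf`
145+
146+This information directly determines how complex the later switchback will be.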
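The three-way triage in section 1 of the emergency runbook can be written down as a small decision function. This is only a sketch: the function name, its three string arguments, and the labels it prints are all hypothetical, and it assumes the operator has already read the mac direct `/healthz`, the mac direct `/rolez`, and the public `/rolez` from a snapshot.

```bash
# Hypothetical triage helper for the emergency runbook.
#   $1: mac direct health ("ok" when healthy)
#   $2: mac direct role   (for example "leader" or "standby")
#   $3: public role       (what conductor.makefile.so reports)
classify_emergency_state() {
  local mac_health="$1" mac_role="$2" public_role="$3"

  if [[ "$mac_health" != "ok" ]]; then
    # Case 2: the standby itself is not ready.
    printf '%s\n' "fix-mac-first"
  elif [[ "$mac_role" == "leader" && "$public_role" == "leader" ]]; then
    # Case 1: the VPS already fell through to mac.
    printf '%s\n' "already-failed-over"
  elif [[ "$mac_role" == "leader" ]]; then
    # Case 3: mac leads, but public traffic is not following it.
    printf '%s\n' "check-vps-upstream"
  else
    # Not covered by the runbook's three cases; needs manual inspection.
    printf '%s\n' "undetermined"
  fi
}
```

The health check comes first on purpose: the runbook treats an unhealthy `mac` as the blocker, whatever the role endpoints say.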
A docs/ops/failover-topology.md
+151, -0
  1@@ -0,0 +1,151 @@
  2+# Failover Topology
  3+
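  4+This page pins down the four layers of the current primary/standby design that are easiest to confuse:
  5+
  6+1. Cloudflare DNS
  7+2. Nginx on the VPS
  8+3. The Tailscale `100.x` addresses of the two nodes
  9+4. `launchd` on each node
 10+
 11+The goal is not to introduce a new automatic failover mechanism, but to turn the existing conventions into an operating model that can be rehearsed.
 12+
 13+## 1. Fixed topology
 14+
 15+### Public entry points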
 16+
 17+- `conductor.makefile.so`
 18+- `mini-conductor.makefile.so`
 19+- `mac-conductor.makefile.so`
 20+
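 21+All three hosts resolve to the same VPS public IP. Failover is never done by changing DNS; Cloudflare only delivers public traffic to the VPS.
 22+
 23+`control-api.makefile.so` is a separate Cloudflare Worker custom domain and remains the control plane: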
 24+
 25+- `GET /v1/system/state`
 26+- `POST /v1/system/drain`
 27+- `POST /v1/system/pause`
 28+- `POST /v1/system/resume`
 29+
 30+It does not go through VPS Nginx, so during a conductor failover it can keep handling "freeze/resume automation" and "read lease state".
 31+
 32+### VPS Nginx
 33+
 34+The canonical config in the repo is:
 35+
 36+- [`ops/nginx/baa-conductor.conf`](../../ops/nginx/baa-conductor.conf)
 37+
 38+The current upstream mapping is:
 39+
 40+| Public host | VPS upstream | Actual backend |
 41+| --- | --- | --- |
 42+| `conductor.makefile.so` | `conductor_primary` | `mini 100.71.210.78:4317` primary, `mac 100.112.239.13:4317` backup |
 43+| `mini-conductor.makefile.so` | `mini_conductor_direct` | `mini 100.71.210.78:4317` |
 44+| `mac-conductor.makefile.so` | `mac_conductor_direct` | `mac 100.112.239.13:4317` |
 45+
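 46+Key points:
 47+
 48+- `conductor.makefile.so` falls back to mac, at the TCP level, only when the mini upstream cannot be reached.
 49+- Nginx does not look at the lease and does not know who the logical leader is.
 50+- So as long as mini still returns HTTP on `100.71.210.78:4317`, public traffic keeps going to mini first.
 51+
 52+This means:
 53+
 54+- A planned failover cannot just let mac acquire the lease; mini's conductor must also be stopped or made unreachable from the VPS.
 55+- In an emergency failover, if mini is still alive but no longer leader, a temporary hotfix to the deployed Nginx config on the VPS may be needed.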
 56+
 57+### Tailscale `100.x`
 58+
 59+The repo currently pins Tailscale IPv4 addresses explicitly and does not depend on MagicDNS:
 60+
 61+- `mini`: `100.71.210.78`
 62+- `mac`: `100.112.239.13`
 63+- port: `4317`
 64+
 65+These values come from the inventory:
 66+
 67+- [`scripts/ops/baa-conductor.env.example`](../../scripts/ops/baa-conductor.env.example)
 68+
 69+### launchd
 70+
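 71+Whether a node process exists, and under which identity, is determined by the installed `launchd` copy.
 72+
 73+The default node identities are listed in: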
 74+
 75+- [`docs/runtime/environment.md`](../runtime/environment.md)
 76+
 77+| Node | `BAA_CONDUCTOR_HOST` | `BAA_CONDUCTOR_ROLE` | `BAA_NODE_ID` |
 78+| --- | --- | --- | --- |
 79+| `mini` | `mini` | `primary` | `mini-main` |
 80+| `mac` | `mac` | `standby` | `mac-standby` |
 81+
 82+The install / check / reload scripts are:
 83+
 84+- [`scripts/runtime/install-launchd.sh`](../../scripts/runtime/install-launchd.sh)
 85+- [`scripts/runtime/check-launchd.sh`](../../scripts/runtime/check-launchd.sh)
 86+- [`scripts/runtime/reload-launchd.sh`](../../scripts/runtime/reload-launchd.sh)
 87+
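 88+`reload-launchd.sh` is suited to re-bootstrapping/kickstarting the installed copy. When a planned failover needs to stop only the conductor without restarting it right away, using `launchctl bootout` directly is the better fit.
 89+
 90+## 2. Current failover semantics
 91+
 92+The real behavior of the current design fits in one sentence:
 93+
 94+> The logical leader is decided by the lease; whether public traffic moves is decided by whether the VPS can still connect to the current primary upstream.
 95+
 96+So the three scenarios differ as follows:
 97+
 98+- planned failover: freeze automation first, then explicitly stop the mini conductor so that the VPS falls through to the mac backup and the lease moves to mac as well.
 99+- emergency failover: mini is down or untrusted; make mac serviceable first, and apply a temporary Nginx hotfix on the VPS if necessary.
100+- switchback: once mini is repaired, bring mini back to health first, then stop the mac conductor and restore the VPS config to the canonical "mini primary, mac backup".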
101+
102+## 3. Rehearsal helper scripts
103+
104+The new scripts are:
105+
106+- [`scripts/failover/common.sh`](../../scripts/failover/common.sh)
107+- [`scripts/failover/print-topology.sh`](../../scripts/failover/print-topology.sh)
108+- [`scripts/failover/rehearsal-check.sh`](../../scripts/failover/rehearsal-check.sh)
109+- [`scripts/failover/print-checklist.sh`](../../scripts/failover/print-checklist.sh)
110+
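111+Their boundaries are:
112+
113+- They only run read-only GET checks or print checklists
114+- They do not perform a real failover
115+- They do not change DNS
116+- They do not modify the Nginx templates in the repo
117+
118+Start by looking at the topology: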
119+
120+```bash
121+./scripts/failover/print-topology.sh --env ../baa-conductor.ops.env
122+```
123+
124+Then run a baseline probe:
125+
126+```bash
127+./scripts/failover/rehearsal-check.sh \
128+  --env ../baa-conductor.ops.env \
129+  --basic-auth 'conductor-ops:REPLACE_ME' \
130+  --bearer-token 'REPLACE_ME' \
131+  --expect-leader mini
132+```
133+
134+To get a command skeleton organized by scenario:
135+
136+```bash
137+./scripts/failover/print-checklist.sh \
138+  --scenario planned \
139+  --env ../baa-conductor.ops.env
140+```
141+
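142+## 4. Prerequisites for using these runbooks
143+
144+Before starting any rehearsal, have at least the following ready:
145+
146+- Shell access to `mini`, `mac`, and the VPS
147+- Basic Auth credentials for the direct domains
148+- A readonly or browser_admin token for `GET /v1/system/state`
149+- A browser_admin token for `POST /v1/system/drain|pause|resume`
150+- An inventory env file kept outside the repo
151+
152+If these prerequisites are not met, limit this round to a documentation walkthrough or single-machine read-only checks; do not risk stopping the real primary node.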
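Section 2 of the topology page says that public traffic follows connectivity to the primary upstream, not the lease. A toy model of that rule, as a sketch only: `public_upstream_for` is a hypothetical name, and the real decision is made inside Nginx, not in a shell function. The lease holder is deliberately not a parameter, because Nginx never sees it.

```bash
# Toy model of the "mini primary, mac backup" upstream choice.
#   $1: "yes" if the VPS can reach mini on its Tailscale address
#   $2: "yes" if the VPS can reach mac on its Tailscale address
public_upstream_for() {
  local mini_reachable="$1" mac_reachable="$2"

  if [[ "$mini_reachable" == "yes" ]]; then
    # mini wins whenever it answers, even if it no longer holds the lease.
    printf '%s\n' "mini"
  elif [[ "$mac_reachable" == "yes" ]]; then
    printf '%s\n' "mac"
  else
    printf '%s\n' "none"
    return 1
  fi
}
```

Calling `public_upstream_for yes yes` yields `mini` whoever holds the lease, which is the situation the emergency hotfix exists to work around.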
A docs/ops/planned-failover.md
+184, -0
  1@@ -0,0 +1,184 @@
  2+# Planned Failover Runbook
  3+
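  4+When to use:
  5+
  6+- `mini` is currently leader
  7+- `mac` has the standby conductor installed and loaded
  8+- Planned maintenance requires moving the public entry point smoothly from `mini` to `mac`
  9+
 10+Non-goals:
 11+
 12+- No DNS changes
 13+- No app code changes
 14+- `baa-firefox` does not take part in the switching logic
 15+
 16+## 1. Prerequisites
 17+
 18+Have these ready:
 19+
 20+- An inventory outside the repo, for example `../baa-conductor.ops.env`
 21+- Basic Auth for the direct domains
 22+- A readonly token for `GET /v1/system/state`
 23+- A browser_admin token for `POST /v1/system/drain|pause|resume`
 24+- Shell access to `mini` / `mac`
 25+
 26+First confirm the current topology and baseline: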
 27+
 28+```bash
 29+./scripts/failover/print-topology.sh --env ../baa-conductor.ops.env
 30+
 31+./scripts/failover/rehearsal-check.sh \
 32+  --env ../baa-conductor.ops.env \
 33+  --basic-auth 'conductor-ops:REPLACE_ME' \
 34+  --bearer-token 'REPLACE_ME' \
 35+  --expect-leader mini
 36+```
 37+
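 38+Expected:
 39+
 40+- `/rolez` on `conductor.makefile.so` returns `leader`
 41+- `/rolez` on `mini-conductor.makefile.so` returns `leader`
 42+- `/rolez` on `mac-conductor.makefile.so` returns `standby`
 43+- The `holder_id` from `GET /v1/system/state` starts with `mini-`
 44+
 45+## 2. Freeze automation
 46+
 47+A planned failover drains first, then pauses.
 48+
 49+The purpose of `drain`:
 50+
 51+- No new work is started
 52+- Running tasks get a window to finish naturally
 53+
 54+The repo currently has no separate script that waits for active runs to reach zero, so an operator has to watch the status panel, the task board, or the logs to confirm that the switch window has been reached.
 55+
 56+Run: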
 57+
 58+```bash
 59+export CONTROL_API_BASE='https://control-api.makefile.so'
 60+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
 61+
 62+curl -sS -X POST \
 63+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 64+  -H 'Content-Type: application/json' \
 65+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_rehearsal"}' \
 66+  "${CONTROL_API_BASE%/}/v1/system/drain"
 67+
 68+curl -sS -X POST \
 69+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 70+  -H 'Content-Type: application/json' \
 71+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_cutover"}' \
 72+  "${CONTROL_API_BASE%/}/v1/system/pause"
 73+```
 74+
 75+## 3. Confirm first that the mac standby can take over
 76+
 77+On `mac`, run the launchd static/load checks first:
 78+
 79+```bash
 80+cd /Users/george/code/baa-conductor
 81+./scripts/runtime/check-launchd.sh \
 82+  --repo-dir /Users/george/code/baa-conductor \
 83+  --node mac \
 84+  --install-dir "$HOME/Library/LaunchAgents"
 85+
 86+launchctl print "gui/$(id -u)/so.makefile.baa-conductor"
 87+```
 88+
 89+If the installed copy has drifted, re-render it before continuing:
 90+
 91+```bash
 92+cd /Users/george/code/baa-conductor
 93+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac
 94+./scripts/runtime/reload-launchd.sh
 95+```
 96+
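 97+## 4. Move traffic off mini
 98+
 99+This is the most critical step in this design.
100+
101+Why:
102+
103+- It is not enough for `mac` to acquire the lease
104+- As long as mini keeps returning HTTP on `100.71.210.78:4317`, VPS Nginx still sends public traffic to mini first
105+
106+So a planned failover must explicitly stop mini's conductor process rather than only doing a logical promote/demote.
107+
108+With the default `LaunchAgents` setup: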
109+
110+```bash
111+launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"
112+```
113+
114+If the node uses `LaunchDaemons`:
115+
116+```bash
117+sudo launchctl bootout system /Library/LaunchDaemons/so.makefile.baa-conductor.plist
118+```
119+
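120+Notes:
121+
122+- Only `so.makefile.baa-conductor` is stopped here
123+- Whether `worker-runner` and `status-api` are stopped as well is up to the maintenance window
124+- By default this runbook switches only the conductor and does not widen the outage
125+
126+## 5. Verify the cutover
127+
128+After the mini conductor is stopped, probe again: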
129+
130+```bash
131+./scripts/failover/rehearsal-check.sh \
132+  --env ../baa-conductor.ops.env \
133+  --basic-auth 'conductor-ops:REPLACE_ME' \
134+  --bearer-token 'REPLACE_ME' \
135+  --skip-node mini \
136+  --expect-leader mac
137+```
138+
139+Expected:
140+
141+- `conductor.makefile.so` still reports `healthz=ok`, `readyz=ready`, `rolez=leader`
142+- `/rolez` on `mac-conductor.makefile.so` becomes `leader`
143+- The `holder_id` from `GET /v1/system/state` starts with `mac-`
144+
145+If `mac` is already leader but the public `/rolez` is still wrong:
146+
147+- Check connectivity from the VPS to `100.112.239.13:4317` first
148+- Then check whether mini is in fact still serving on `100.71.210.78:4317`
149+- A planned failover should not be fixed by changing DNS
150+
151+## 6. Resume automation
152+
153+Once the public entry point and the mac direct host are both healthy, `resume`:
154+
155+```bash
156+curl -sS -X POST \
157+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
158+  -H 'Content-Type: application/json' \
159+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_complete"}' \
160+  "${CONTROL_API_BASE%/}/v1/system/resume"
161+```
162+
163+## 7. Abort / rollback
164+
165+If mac turns out to be unable to take over stably during the verification window:
166+
167+1. Keep automation `paused`
168+2. Restart the conductor on `mini`:
169+
170+```bash
171+cd /Users/george/code/baa-conductor
172+./scripts/runtime/reload-launchd.sh
173+```
174+
175+3. Confirm again that the baseline is back on `mini`:
176+
177+```bash
178+./scripts/failover/rehearsal-check.sh \
179+  --env ../baa-conductor.ops.env \
180+  --basic-auth 'conductor-ops:REPLACE_ME' \
181+  --bearer-token 'REPLACE_ME' \
182+  --expect-leader mini
183+```
184+
185+4. Once things are confirmed back to normal, decide whether to `resume`
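The cutover check in section 5 may need a few attempts, because the lease and the Nginx fallback do not necessarily settle in the same instant. A generic retry wrapper is one way to avoid re-running the probe by hand. This is a sketch: `retry_until` is a hypothetical helper that does not exist in the repo, and the usage comment assumes that `/rolez` returns the bare word `leader` and that the direct host accepts Basic Auth via `curl -u`, neither of which this commit confirms.

```bash
# Hypothetical helper: re-run a command until its stdout equals an expected value.
#   $1: maximum number of attempts
#   $2: seconds to sleep between attempts
#   $3: expected output
#   $4...: the command to run
retry_until() {
  local attempts="$1" delay="$2" expected="$3"
  shift 3

  local i output
  for ((i = 1; i <= attempts; i++)); do
    # A failing command counts as "no output" instead of aborting the loop.
    output="$("$@" 2>/dev/null)" || output=""
    if [[ "$output" == "$expected" ]]; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done

  return 1
}

# Example (assumed response shape, see above):
#   retry_until 10 3 leader \
#     curl -fsS -u "$DIRECT_BASIC_AUTH" https://mac-conductor.makefile.so/rolez
```

If the wrapper gives up, the Abort / rollback steps apply and automation should stay paused.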
A docs/ops/switchback.md
+145, -0
  1@@ -0,0 +1,145 @@
  2+# Switchback Runbook
  3+
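  4+When to use:
  5+
  6+- After an emergency or planned failover, the current leader is on `mac`
  7+- `mini` has been repaired and the system is ready to return to the canonical "mini primary, mac backup"
  8+
  9+The point of a switchback is not to get mini online as fast as possible, but to clean up the temporary state and return to a repeatable default shape.
 10+
 11+## 1. Repair mini first, without moving traffic
 12+
 13+On `mini`, finish validating the basics first: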
 14+
 15+```bash
 16+cd /Users/george/code/baa-conductor
 17+./scripts/runtime/bootstrap.sh --repo-dir /Users/george/code/baa-conductor
 18+npx --yes pnpm -r build
 19+./scripts/runtime/check-launchd.sh \
 20+  --repo-dir /Users/george/code/baa-conductor \
 21+  --node mini \
 22+  --install-dir "$HOME/Library/LaunchAgents"
 23+```
 24+
 25+If needed, re-render and reload the installed copy:
 26+
 27+```bash
 28+cd /Users/george/code/baa-conductor
 29+./scripts/runtime/install-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mini
 30+./scripts/runtime/reload-launchd.sh
 31+```
 32+
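 33+Until mac is stopped, mini may remain `standby` or temporarily fail to acquire the lease even after it has recovered; this is normal.
 34+
 35+## 2. Pause first, then hand over
 36+
 37+Pause automation before the switchback: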
 38+
 39+```bash
 40+export CONTROL_API_BASE='https://control-api.makefile.so'
 41+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
 42+
 43+curl -sS -X POST \
 44+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 45+  -H 'Content-Type: application/json' \
 46+  -d '{"requested_by":"ops_runbook","reason":"switchback_prepare"}' \
 47+  "${CONTROL_API_BASE%/}/v1/system/pause"
 48+```
 49+
 50+## 3. Stop the mac conductor
 51+
 52+For both the lease and the public entry point to return to mini, the mac conductor has to step down first.
 53+
 54+With the default `LaunchAgents`:
 55+
 56+```bash
 57+launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"
 58+```
 59+
 60+If `LaunchDaemons` is used:
 61+
 62+```bash
 63+sudo launchctl bootout system /Library/LaunchDaemons/so.makefile.baa-conductor.plist
 64+```
 65+
 66+Then make sure the conductor is re-bootstrapped on mini:
 67+
 68+```bash
 69+cd /Users/george/code/baa-conductor
 70+./scripts/runtime/reload-launchd.sh
 71+```
 72+
 73+## 4. Restore the canonical Nginx config
 74+
 75+If the VPS was hotfixed during an emergency failover, the switchback must overwrite it with the canonical config from the repo.
 76+
 77+Generate the bundle:
 78+
 79+```bash
 80+scripts/ops/nginx-sync-plan.sh \
 81+  --env ../baa-conductor.ops.env \
 82+  --bundle-dir .tmp/ops/baa-conductor-nginx
 83+```
 84+
 85+Distribute and reload:
 86+
 87+```bash
 88+rsync -av .tmp/ops/baa-conductor-nginx/ root@YOUR_VPS:/tmp/baa-conductor-nginx/
 89+ssh root@YOUR_VPS 'cd /tmp/baa-conductor-nginx && sudo ./deploy-on-vps.sh --reload'
 90+```
 91+
 92+This restores the VPS config to:
 93+
 94+- mini `100.71.210.78:4317` as primary
 95+- mac `100.112.239.13:4317` as backup
 96+
 97+## 5. Verify that traffic is back on mini
 98+
 99+Run:
100+
101+```bash
102+./scripts/failover/rehearsal-check.sh \
103+  --env ../baa-conductor.ops.env \
104+  --basic-auth 'conductor-ops:REPLACE_ME' \
105+  --bearer-token 'REPLACE_ME' \
106+  --skip-node mac \
107+  --expect-leader mini
108+```
109+
110+Success criteria:
111+
112+- `conductor.makefile.so` returns `leader`
113+- `mini-conductor.makefile.so` returns `leader`
114+- The `holder_id` from `GET /v1/system/state` starts with `mini-`
115+
116+If this step fails, do not rush to `resume`; check first:
117+
118+- `launchctl print` on mini
119+- `logs/launchd/so.makefile.baa-conductor.err.log` on mini
120+- Whether any emergency hotfix residue remains on the VPS
121+
122+## 6. Resume automation
123+
124+After confirming the switchback is complete, resume:
125+
126+```bash
127+curl -sS -X POST \
128+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
129+  -H 'Content-Type: application/json' \
130+  -d '{"requested_by":"ops_runbook","reason":"switchback_complete"}' \
131+  "${CONTROL_API_BASE%/}/v1/system/resume"
132+```
133+
134+## 7. Final checks
135+
136+Right after the switchback, run one more full baseline:
137+
138+```bash
139+./scripts/failover/rehearsal-check.sh \
140+  --env ../baa-conductor.ops.env \
141+  --basic-auth 'conductor-ops:REPLACE_ME' \
142+  --bearer-token 'REPLACE_ME' \
143+  --expect-leader mini
144+```
145+
146+If this result matches the normal baseline, the system is back on the default topology.
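The switchback runbook asks the operator to check whether any emergency hotfix residue remains on the VPS. One read-only way to do that is to look at which server is listed first in the `conductor_primary` upstream. The sketch below assumes the deployed file uses the same `upstream conductor_primary { ... }` layout as the hotfix snippet in the emergency runbook; `first_primary_server` is a hypothetical helper, not a script from the repo.

```bash
# Hypothetical helper: print the first "server" address inside the
# conductor_primary upstream block of an Nginx config file.
first_primary_server() {
  awk '
    # Enter the block on its opening line.
    /^[[:space:]]*upstream[[:space:]]+conductor_primary[[:space:]]*[{]/ { in_block = 1; next }
    # The first server line decides who receives traffic first.
    in_block && /^[[:space:]]*server[[:space:]]/ {
      sub(/;$/, "", $2)   # drop a trailing semicolon when there are no extra parameters
      print $2
      exit
    }
    # Leave if the block closes without any server line.
    in_block && /[}]/ { exit }
  ' "$1"
}

# Example, run on the VPS:
#   first_primary_server /etc/nginx/sites-available/baa-conductor.conf
```

With the canonical config, the printed address should be mini's `100.71.210.78:4317`; seeing mac's `100.112.239.13:4317` suggests the hotfix is still in place.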
A scripts/failover/common.sh
+176, -0
  1@@ -0,0 +1,176 @@
  2+#!/usr/bin/env bash
  3+
  4+if [[ -n "${BAA_FAILOVER_COMMON_SH_LOADED:-}" ]]; then
  5+  return 0
  6+fi
  7+
  8+readonly BAA_FAILOVER_COMMON_SH_LOADED=1
  9+readonly BAA_FAILOVER_SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
 10+readonly BAA_FAILOVER_REPO_DIR_DEFAULT="$(cd -- "${BAA_FAILOVER_SCRIPT_DIR}/../.." && pwd)"
 11+readonly BAA_FAILOVER_DEFAULT_ENV_PATH="${BAA_FAILOVER_REPO_DIR_DEFAULT}/scripts/ops/baa-conductor.env.example"
 12+readonly BAA_FAILOVER_DEFAULT_CONTROL_API_BASE="https://control-api.makefile.so"
 13+
 14+failover_log() {
 15+  printf '[failover] %s\n' "$*"
 16+}
 17+
 18+failover_warn() {
 19+  printf '[failover] warning: %s\n' "$*" >&2
 20+}
 21+
 22+failover_error() {
 23+  printf '[failover] error: %s\n' "$*" >&2
 24+}
 25+
 26+die() {
 27+  failover_error "$*"
 28+  exit 1
 29+}
 30+
 31+require_command() {
 32+  if ! command -v "$1" >/dev/null 2>&1; then
 33+    die "Missing required command: $1"
 34+  fi
 35+}
 36+
 37+contains_value() {
 38+  local needle="$1"
 39+  shift
 40+
 41+  local value
 42+  for value in "$@"; do
 43+    if [[ "$value" == "$needle" ]]; then
 44+      return 0
 45+    fi
 46+  done
 47+
 48+  return 1
 49+}
 50+
 51+validate_node() {
 52+  case "$1" in
 53+    mini | mac) ;;
 54+    *)
 55+      die "Unsupported node: $1"
 56+      ;;
 57+  esac
 58+}
 59+
 60+validate_scenario() {
 61+  case "$1" in
 62+    planned | emergency | switchback) ;;
 63+    *)
 64+      die "Unsupported scenario: $1"
 65+      ;;
 66+  esac
 67+}
 68+
 69+shell_quote() {
 70+  printf '%q' "$1"
 71+}
 72+
 73+require_value() {
 74+  local key="$1"
 75+  local value="${!key:-}"
 76+
 77+  if [[ -z "$value" ]]; then
 78+    die "Missing required value in inventory: ${key}"
 79+  fi
 80+
 81+  printf '%s\n' "$value"
 82+}
 83+
 84+validate_tailscale_ipv4() {
 85+  local value="$1"
 86+  local key="$2"
 87+
 88+  if [[ ! "$value" =~ ^100\.([0-9]{1,3}\.){2}[0-9]{1,3}$ ]]; then
 89+    die "${key} must be a Tailscale 100.x IPv4 address: ${value}"
 90+  fi
 91+}
 92+
 93+load_env_file() {
 94+  local env_path="$1"
 95+
 96+  if [[ ! -f "$env_path" ]]; then
 97+    die "Inventory file not found: ${env_path}"
 98+  fi
 99+
100+  set -a
101+  # shellcheck disable=SC1090
102+  source "$env_path"
103+  set +a
104+}
105+
106+load_inventory() {
107+  local env_path="$1"
108+
109+  load_env_file "$env_path"
110+
111+  FAILOVER_ENV_PATH="$env_path"
112+  FAILOVER_APP_NAME="${BAA_APP_NAME:-baa-conductor}"
113+  FAILOVER_PUBLIC_IPV4="${BAA_PUBLIC_IPV4:-}"
114+  FAILOVER_PUBLIC_IPV6="${BAA_PUBLIC_IPV6:-}"
115+  FAILOVER_CONDUCTOR_HOST="$(require_value BAA_CONDUCTOR_HOST)"
116+  FAILOVER_MINI_DIRECT_HOST="$(require_value BAA_MINI_DIRECT_HOST)"
117+  FAILOVER_MAC_DIRECT_HOST="$(require_value BAA_MAC_DIRECT_HOST)"
118+  FAILOVER_MINI_TAILSCALE_IP="$(require_value BAA_MINI_TAILSCALE_IP)"
119+  FAILOVER_MAC_TAILSCALE_IP="$(require_value BAA_MAC_TAILSCALE_IP)"
120+  FAILOVER_CONDUCTOR_PORT="${BAA_CONDUCTOR_PORT:-4317}"
121+  FAILOVER_CONTROL_API_BASE="${BAA_CONTROL_API_BASE:-$BAA_FAILOVER_DEFAULT_CONTROL_API_BASE}"
122+
123+  validate_tailscale_ipv4 "$FAILOVER_MINI_TAILSCALE_IP" "BAA_MINI_TAILSCALE_IP"
124+  validate_tailscale_ipv4 "$FAILOVER_MAC_TAILSCALE_IP" "BAA_MAC_TAILSCALE_IP"
125+}
126+
127+node_direct_host() {
128+  validate_node "$1"
129+
130+  case "$1" in
131+    mini)
132+      printf '%s\n' "$FAILOVER_MINI_DIRECT_HOST"
133+      ;;
134+    mac)
135+      printf '%s\n' "$FAILOVER_MAC_DIRECT_HOST"
136+      ;;
137+  esac
138+}
139+
140+node_tailscale_ip() {
141+  validate_node "$1"
142+
143+  case "$1" in
144+    mini)
145+      printf '%s\n' "$FAILOVER_MINI_TAILSCALE_IP"
146+      ;;
147+    mac)
148+      printf '%s\n' "$FAILOVER_MAC_TAILSCALE_IP"
149+      ;;
150+  esac
151+}
152+
153+node_default_role() {
154+  validate_node "$1"
155+
156+  case "$1" in
157+    mini)
158+      printf '%s\n' "primary"
159+      ;;
160+    mac)
161+      printf '%s\n' "standby"
162+      ;;
163+  esac
164+}
165+
166+node_default_id() {
167+  validate_node "$1"
168+
169+  case "$1" in
170+    mini)
171+      printf '%s\n' "mini-main"
172+      ;;
173+    mac)
174+      printf '%s\n' "mac-standby"
175+      ;;
176+  esac
177+}
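`validate_tailscale_ipv4` in `common.sh` checks only the `100.` prefix and the digit counts, so a value such as `100.999.1.1` passes. Tailscale assigns IPv4 addresses from the CGNAT range `100.64.0.0/10` (`100.64.0.0` to `100.127.255.255`), so a stricter check is possible. The following is a sketch under that assumption; `is_tailscale_cgnat_ipv4` is a hypothetical name and is not part of this commit.

```bash
# Hypothetical stricter check: succeed only for addresses in 100.64.0.0/10.
is_tailscale_cgnat_ipv4() {
  local ip="$1"

  # Four dot-separated groups of one to three digits.
  [[ "$ip" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1

  # The 10# prefix forces base 10 so that groups like "08" are not read as octal.
  local a=$((10#${BASH_REMATCH[1]}))
  local b=$((10#${BASH_REMATCH[2]}))
  local c=$((10#${BASH_REMATCH[3]}))
  local d=$((10#${BASH_REMATCH[4]}))

  [[ "$a" -eq 100 ]] || return 1
  [[ "$b" -ge 64 && "$b" -le 127 ]] || return 1
  [[ "$c" -le 255 && "$d" -le 255 ]] || return 1
}
```

Both inventory addresses used in the runbooks, `100.71.210.78` and `100.112.239.13`, fall inside this range.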
A scripts/failover/print-checklist.sh
+240, -0
  1@@ -0,0 +1,240 @@
  2+#!/usr/bin/env bash
  3+set -euo pipefail
  4+
  5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
  6+# shellcheck source=./common.sh
  7+source "${SCRIPT_DIR}/common.sh"
  8+
  9+usage() {
 10+  cat <<'EOF'
 11+Usage:
 12+  scripts/failover/print-checklist.sh --scenario planned|emergency|switchback [options]
 13+
 14+Options:
 15+  --scenario NAME         planned, emergency, or switchback.
 16+  --env PATH              Inventory file to load.
 17+  --control-api-base URL  Override the control API base URL.
 18+  --help                  Show this help text.
 19+EOF
 20+}
 21+
 22+print_common_exports() {
 23+  local env_q=""
 24+  local control_api_q=""
 25+
 26+  env_q="$(shell_quote "$FAILOVER_ENV_PATH")"
 27+  control_api_q="$(shell_quote "$control_api_base")"
 28+
 29+  cat <<EOF
 30+Suggested operator variables
 31+----------------------------
 32+export FAILOVER_ENV=${env_q}
 33+export DIRECT_BASIC_AUTH='conductor-ops:REPLACE_ME'
 34+export READONLY_TOKEN='REPLACE_ME'
 35+export BROWSER_ADMIN_TOKEN='REPLACE_ME'
 36+export MINI_SSH='<mini-admin-shell>'
 37+export MAC_SSH='<mac-admin-shell>'
 38+export VPS_SSH='root@<vps>'
 39+export CONTROL_API_BASE=${control_api_q}
 40+
 41+Common baseline commands
 42+------------------------
 43+./scripts/failover/print-topology.sh --env "\$FAILOVER_ENV"
 44+./scripts/failover/rehearsal-check.sh \\
 45+  --env "\$FAILOVER_ENV" \\
 46+  --basic-auth "\$DIRECT_BASIC_AUTH" \\
 47+  --bearer-token "\$READONLY_TOKEN" \\
 48+  --control-api-base "\$CONTROL_API_BASE"
 49+EOF
 50+}
 51+
 52+print_planned_checklist() {
 53+  cat <<'EOF'
 54+
 55+Planned failover checklist
 56+--------------------------
 57+1. Confirm the baseline is mini leader:
 58+./scripts/failover/rehearsal-check.sh \
 59+  --env "$FAILOVER_ENV" \
 60+  --basic-auth "$DIRECT_BASIC_AUTH" \
 61+  --bearer-token "$READONLY_TOKEN" \
 62+  --control-api-base "$CONTROL_API_BASE" \
 63+  --expect-leader mini
 64+
 65+2. Drain and then pause automation:
 66+curl -sS -X POST \
 67+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 68+  -H 'Content-Type: application/json' \
 69+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_rehearsal"}' \
 70+  "${CONTROL_API_BASE%/}/v1/system/drain"
 71+
 72+curl -sS -X POST \
 73+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 74+  -H 'Content-Type: application/json' \
 75+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_cutover"}' \
 76+  "${CONTROL_API_BASE%/}/v1/system/pause"
 77+
 78+3. On mac, confirm launchd has a healthy standby install:
 79+ssh "$MAC_SSH" \
 80+  'cd /Users/george/code/baa-conductor && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac --install-dir "$HOME/Library/LaunchAgents"'
 81+
 82+4. On mini, stop only the conductor service so VPS Nginx falls through to mac:
 83+ssh "$MINI_SSH" \
 84+  'launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"'
 85+
 86+5. Verify mac is now leader and public ingress still returns leader:
 87+./scripts/failover/rehearsal-check.sh \
 88+  --env "$FAILOVER_ENV" \
 89+  --basic-auth "$DIRECT_BASIC_AUTH" \
 90+  --bearer-token "$READONLY_TOKEN" \
 91+  --control-api-base "$CONTROL_API_BASE" \
 92+  --skip-node mini \
 93+  --expect-leader mac
 94+
 95+6. Resume automation only after mac direct host and public host are both healthy:
 96+curl -sS -X POST \
 97+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
 98+  -H 'Content-Type: application/json' \
 99+  -d '{"requested_by":"ops_runbook","reason":"planned_failover_complete"}' \
100+  "${CONTROL_API_BASE%/}/v1/system/resume"
101+EOF
102+}
103+
104+print_emergency_checklist() {
105+  cat <<'EOF'
106+
107+Emergency failover checklist
108+----------------------------
109+1. Snapshot public state first. If mini is already gone, skip its direct checks:
110+./scripts/failover/rehearsal-check.sh \
111+  --env "$FAILOVER_ENV" \
112+  --basic-auth "$DIRECT_BASIC_AUTH" \
113+  --bearer-token "$READONLY_TOKEN" \
114+  --control-api-base "$CONTROL_API_BASE" \
115+  --skip-node mini \
116+  --expect-leader mac
117+
118+2. On mac, repair or restart the local conductor service if needed:
119+ssh "$MAC_SSH" \
120+  'cd /Users/george/code/baa-conductor && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mac --install-dir "$HOME/Library/LaunchAgents"'
121+
122+ssh "$MAC_SSH" \
123+  'cd /Users/george/code/baa-conductor && ./scripts/runtime/reload-launchd.sh'
124+
125+3. If public ingress still lands on mini while mini is reachable but no longer leader, hotfix the VPS config so mac becomes the first upstream:
126+ssh "$VPS_SSH" 'sudo cp /etc/nginx/sites-available/baa-conductor.conf /etc/nginx/sites-available/baa-conductor.conf.bak.$(date +%Y%m%d%H%M%S)'
127+ssh "$VPS_SSH" 'sudo editor /etc/nginx/sites-available/baa-conductor.conf'
128+ssh "$VPS_SSH" 'sudo nginx -t && sudo systemctl reload nginx'
129+
130+4. Re-run the snapshot until public /rolez=leader and mac direct /rolez=leader:
131+./scripts/failover/rehearsal-check.sh \
132+  --env "$FAILOVER_ENV" \
133+  --basic-auth "$DIRECT_BASIC_AUTH" \
134+  --bearer-token "$READONLY_TOKEN" \
135+  --control-api-base "$CONTROL_API_BASE" \
136+  --skip-node mini \
137+  --expect-leader mac
138+
139+5. Record whether the VPS carried an emergency Nginx hotfix. Switchback must restore the canonical repo-rendered bundle later.
140+EOF
141+}
142+
143+print_switchback_checklist() {
144+  cat <<'EOF'
145+
146+Switchback checklist
147+--------------------
148+1. Rebuild and validate mini before touching traffic:
149+ssh "$MINI_SSH" \
150+  'cd /Users/george/code/baa-conductor && npx --yes pnpm -r build && ./scripts/runtime/check-launchd.sh --repo-dir /Users/george/code/baa-conductor --node mini --install-dir "$HOME/Library/LaunchAgents"'
151+
152+2. Pause automation so lease ownership can move cleanly back to mini:
153+curl -sS -X POST \
154+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
155+  -H 'Content-Type: application/json' \
156+  -d '{"requested_by":"ops_runbook","reason":"switchback_prepare"}' \
157+  "${CONTROL_API_BASE%/}/v1/system/pause"
158+
159+3. Stop mac conductor and restart mini conductor:
160+ssh "$MAC_SSH" \
161+  'launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/so.makefile.baa-conductor.plist"'
162+
163+ssh "$MINI_SSH" \
164+  'cd /Users/george/code/baa-conductor && ./scripts/runtime/reload-launchd.sh'
165+
166+4. If emergency hotfixes changed the deployed VPS config, restore the canonical mini-primary bundle from the repo:
167+scripts/ops/nginx-sync-plan.sh --env "$FAILOVER_ENV" --bundle-dir .tmp/ops/baa-conductor-nginx
168+rsync -av .tmp/ops/baa-conductor-nginx/ "$VPS_SSH":/tmp/baa-conductor-nginx/
169+ssh "$VPS_SSH" 'cd /tmp/baa-conductor-nginx && sudo ./deploy-on-vps.sh --reload'
170+
171+5. Verify leadership moved back to mini:
172+./scripts/failover/rehearsal-check.sh \
173+  --env "$FAILOVER_ENV" \
174+  --basic-auth "$DIRECT_BASIC_AUTH" \
175+  --bearer-token "$READONLY_TOKEN" \
176+  --control-api-base "$CONTROL_API_BASE" \
177+  --skip-node mac \
178+  --expect-leader mini
179+
180+6. Resume automation after public and mini direct hosts are both healthy:
181+curl -sS -X POST \
182+  -H "Authorization: Bearer ${BROWSER_ADMIN_TOKEN}" \
183+  -H 'Content-Type: application/json' \
184+  -d '{"requested_by":"ops_runbook","reason":"switchback_complete"}' \
185+  "${CONTROL_API_BASE%/}/v1/system/resume"
186+EOF
187+}
188+
189+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
190+scenario=""
191+control_api_base=""
192+
193+while [[ $# -gt 0 ]]; do
194+  case "$1" in
195+    --scenario)
196+      validate_scenario "$2"
197+      scenario="$2"
198+      shift 2
199+      ;;
200+    --env)
201+      env_path="$2"
202+      shift 2
203+      ;;
204+    --control-api-base)
205+      control_api_base="$2"
206+      shift 2
207+      ;;
208+    --help)
209+      usage
210+      exit 0
211+      ;;
212+    *)
213+      die "Unknown option: $1"
214+      ;;
215+  esac
216+done
217+
218+if [[ -z "$scenario" ]]; then
219+  die "--scenario is required"
220+fi
221+
222+load_inventory "$env_path"
223+
224+if [[ -z "$control_api_base" ]]; then
225+  control_api_base="$FAILOVER_CONTROL_API_BASE"
226+fi
227+
228+printf 'Scenario: %s\n' "$scenario"
229+print_common_exports
230+
231+case "$scenario" in
232+  planned)
233+    print_planned_checklist
234+    ;;
235+  emergency)
236+    print_emergency_checklist
237+    ;;
238+  switchback)
239+    print_switchback_checklist
240+    ;;
241+esac
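The option loop above reads `$2` directly, so under `set -u` a flag passed without a value aborts with bash's generic unbound-variable message rather than a usage error. A minimal sketch of a guard, assuming a hypothetical helper named `require_value` and a stand-in parser `parse_env_flag` (neither is part of this commit):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only when an option is followed by a usable value.
# $1 = option name, $2 = count of remaining args (option included), $3 = candidate value
require_value() {
  local opt="$1" remaining="$2" value="${3-}"
  if (( remaining < 2 )) || [[ -z "$value" || "$value" == --* ]]; then
    printf 'Missing value for %s\n' "$opt" >&2
    return 1
  fi
}

# Stand-in for the real option loop; handles only --env for illustration.
parse_env_flag() {
  local env_path="default.env"
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --env)
        require_value "$1" "$#" "${2-}" || return 1
        env_path="$2"
        shift 2
        ;;
      *)
        printf 'Unknown option: %s\n' "$1" >&2
        return 1
        ;;
    esac
  done
  printf '%s\n' "$env_path"
}
```

Passing `"${2-}"` instead of `"$2"` is what keeps `set -u` from firing before the helper can report the problem. Rejecting values that start with `--` is a design choice of this sketch, not something the scripts in this commit do.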
A scripts/failover/print-topology.sh
+74, -0
 1@@ -0,0 +1,74 @@
 2+#!/usr/bin/env bash
 3+set -euo pipefail
 4+
 5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
 6+# shellcheck source=./common.sh
 7+source "${SCRIPT_DIR}/common.sh"
 8+
 9+usage() {
10+  cat <<'EOF'
11+Usage:
12+  scripts/failover/print-topology.sh [options]
13+
14+Options:
15+  --env PATH   Inventory file to load. Defaults to scripts/ops/baa-conductor.env.example.
16+  --help       Show this help text.
17+EOF
18+}
19+
20+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
21+
22+while [[ $# -gt 0 ]]; do
23+  case "$1" in
24+    --env)
25+      env_path="$2"
26+      shift 2
27+      ;;
28+    --help)
29+      usage
30+      exit 0
31+      ;;
32+    *)
33+      die "Unknown option: $1"
34+      ;;
35+  esac
36+done
37+
38+load_inventory "$env_path"
39+
40+public_targets="${FAILOVER_PUBLIC_IPV4:-<unset>}"
41+if [[ -n "${FAILOVER_PUBLIC_IPV6:-}" ]]; then
42+  public_targets="${public_targets}, ${FAILOVER_PUBLIC_IPV6}"
43+fi
44+
45+cat <<EOF
46+Failover Topology
47+=================
48+
49+Inventory: ${FAILOVER_ENV_PATH}
50+Control API: ${FAILOVER_CONTROL_API_BASE}
51+
52+Public ingress
53+--------------
54+- Cloudflare DNS keeps conductor hosts pinned to the VPS public address: ${public_targets}
55+- https://${FAILOVER_CONDUCTOR_HOST} -> VPS Nginx upstream conductor_primary
56+- conductor_primary -> mini ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} (primary), mac ${FAILOVER_MAC_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} (backup)
57+
58+Direct node hosts
59+-----------------
60+- https://${FAILOVER_MINI_DIRECT_HOST} -> Basic Auth -> mini ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT}
61+- https://${FAILOVER_MAC_DIRECT_HOST} -> Basic Auth -> mac ${FAILOVER_MAC_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT}
62+
63+launchd defaults
64+----------------
65+- mini: BAA_CONDUCTOR_HOST=mini, BAA_CONDUCTOR_ROLE=primary, BAA_NODE_ID=mini-main
66+- mac: BAA_CONDUCTOR_HOST=mac, BAA_CONDUCTOR_ROLE=standby, BAA_NODE_ID=mac-standby
67+- Both nodes keep the same repo/runtime root: /Users/george/code/baa-conductor
68+
69+Operational notes
70+-----------------
71+- Cloudflare DNS is not part of failover or switchback. Public traffic stays on the VPS.
72+- Nginx failover is transport-based only. It reacts when a 100.x upstream stops accepting traffic.
73+- Nginx does not inspect leader lease state. If mini still answers on ${FAILOVER_MINI_TAILSCALE_IP}:${FAILOVER_CONDUCTOR_PORT} but /rolez says standby, public ingress can still land on mini until mini is stopped or the VPS config is hotfixed.
74+- launchd decides whether each node keeps serving 127.0.0.1:4317 and its Tailscale listener. Control API is only for drain/pause/resume and lease observation.
75+EOF
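The operational notes above describe a split state: mini still accepts connections, so Nginx keeps routing public traffic to it, while `/rolez` on mini reports standby. A rough sketch of how an operator script could label that state, assuming a hypothetical function `classify_public_ingress` that receives mini's reachability and `/rolez` body as plain arguments (it does not exist in `common.sh`):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical classifier for where public ingress is expected to land.
# $1 = "yes" if mini's Tailscale upstream accepts traffic, anything else otherwise
# $2 = body returned by mini's /rolez endpoint
classify_public_ingress() {
  local mini_reachable="$1"
  local mini_role="$2"

  if [[ "$mini_reachable" != "yes" ]]; then
    # Transport-level failure: Nginx falls through to the mac backup upstream.
    printf 'backup\n'
  elif [[ "$mini_role" == "leader" ]]; then
    # Normal case: mini answers and holds the lease.
    printf 'primary\n'
  else
    # mini answers but is not leader; Nginx cannot see this on its own.
    printf 'misrouted\n'
  fi
}
```

A `misrouted` result corresponds to the case where mini has to be stopped or the VPS config hotfixed before public traffic reaches the actual leader.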
A scripts/failover/rehearsal-check.sh
+384, -0
  1@@ -0,0 +1,384 @@
  2+#!/usr/bin/env bash
  3+set -euo pipefail
  4+
  5+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
  6+# shellcheck source=./common.sh
  7+source "${SCRIPT_DIR}/common.sh"
  8+
  9+usage() {
 10+  cat <<'EOF'
 11+Usage:
 12+  scripts/failover/rehearsal-check.sh [options]
 13+
 14+Options:
 15+  --env PATH                 Inventory file to load.
 16+  --basic-auth USER:PASS     Basic Auth for mini/mac direct domains.
 17+  --bearer-token TOKEN       Bearer token for GET /v1/system/state.
 18+  --bearer-token-file PATH   Read the bearer token from a file.
 19+  --control-api-base URL     Override the control API base URL.
 20+  --expect-leader NODE       Assert that mini or mac is the active leader.
 21+  --skip-node NODE           Skip direct checks for one node. Repeatable.
 22+  --skip-public              Skip public conductor host checks.
 23+  --skip-control-api         Skip GET /v1/system/state even when a token is available.
 24+  --timeout SEC              Per-request curl timeout. Defaults to 5.
 25+  --help                     Show this help text.
 26+
 27+Notes:
 28+  - Public and direct probes are read-only GET requests against /healthz, /readyz, and /rolez.
 29+  - Direct-node checks are skipped automatically when Basic Auth is not provided.
 30+  - Control API checks are skipped automatically when a bearer token is not provided.
 31+EOF
 32+}
 33+
 34+require_command curl
 35+require_command node
 36+
 37+env_path="${BAA_FAILOVER_DEFAULT_ENV_PATH}"
 38+basic_auth="${BAA_FAILOVER_BASIC_AUTH:-}"
 39+bearer_token="${BAA_CONTROL_API_TOKEN:-}"
 40+bearer_token_file=""
 41+control_api_base=""
 42+expect_leader=""
 43+timeout_sec="5"
 44+skip_public="0"
 45+skip_control_api="0"
 46+skip_nodes=()
 47+failures=0
 48+
 49+record_failure() {
 50+  failover_error "$*"
 51+  failures=$((failures + 1))
 52+}
 53+
 54+probe_endpoint() {
 55+  local status_var="$1"
 56+  local body_var="$2"
 57+  local error_var="$3"
 58+  local url="$4"
 59+  shift 4
 60+
 61+  local tmp_body=""
 62+  local tmp_err=""
 63+  local http_code=""
 64+  local error_message=""
 65+  local body=""
 66+
 67+  tmp_body="$(mktemp)"
 68+  tmp_err="$(mktemp)"
 69+
 70+  if ! http_code="$(curl -sS -L --max-time "$timeout_sec" -o "$tmp_body" -w '%{http_code}' "$@" "$url" 2>"$tmp_err")"; then
 71+    error_message="$(tr '\n' ' ' < "$tmp_err")"
 72+    error_message="${error_message%" "}"
 73+    printf -v "$status_var" '%s' "curl_error"
 74+    printf -v "$body_var" '%s' ""
 75+    printf -v "$error_var" '%s' "$error_message"
 76+    rm -f "$tmp_body" "$tmp_err"
 77+    return 1
 78+  fi
 79+
 80+  body="$(tr -d '\r' < "$tmp_body")"
 81+  body="${body%$'\n'}"
 82+
 83+  printf -v "$status_var" '%s' "$http_code"
 84+  printf -v "$body_var" '%s' "$body"
 85+  printf -v "$error_var" '%s' ""
 86+
 87+  rm -f "$tmp_body" "$tmp_err"
 88+}
 89+
 90+format_probe_result() {
 91+  local status="$1"
 92+  local body="$2"
 93+  local error="$3"
 94+
 95+  if [[ "$status" == "curl_error" ]]; then
 96+    if [[ -n "$error" ]]; then
 97+      printf 'ERROR(%s)' "$error"
 98+    else
 99+      printf 'ERROR'
100+    fi
101+    return 0
102+  fi
103+
104+  if [[ -z "$body" ]]; then
105+    printf '%s(<empty>)' "$status"
106+    return 0
107+  fi
108+
109+  printf '%s(%s)' "$status" "$body"
110+}
111+
112+probe_surface() {
113+  local label="$1"
114+  local base_url="$2"
115+  shift 2
116+
117+  probe_endpoint "${label}_health_status" "${label}_health_body" "${label}_health_error" "${base_url}/healthz" "$@" || true
118+  probe_endpoint "${label}_ready_status" "${label}_ready_body" "${label}_ready_error" "${base_url}/readyz" "$@" || true
119+  probe_endpoint "${label}_role_status" "${label}_role_body" "${label}_role_error" "${base_url}/rolez" "$@" || true
120+}
121+
122+print_surface_summary() {
123+  local label="$1"
124+  local base_url="$2"
125+  local health_status_var="${label}_health_status"
126+  local health_body_var="${label}_health_body"
127+  local health_error_var="${label}_health_error"
128+  local ready_status_var="${label}_ready_status"
129+  local ready_body_var="${label}_ready_body"
130+  local ready_error_var="${label}_ready_error"
131+  local role_status_var="${label}_role_status"
132+  local role_body_var="${label}_role_body"
133+  local role_error_var="${label}_role_error"
134+
135+  printf '%-10s %s healthz=%s readyz=%s rolez=%s\n' \
136+    "${label}" \
137+    "${base_url}" \
138+    "$(format_probe_result "${!health_status_var:-n/a}" "${!health_body_var:-}" "${!health_error_var:-}")" \
139+    "$(format_probe_result "${!ready_status_var:-n/a}" "${!ready_body_var:-}" "${!ready_error_var:-}")" \
140+    "$(format_probe_result "${!role_status_var:-n/a}" "${!role_body_var:-}" "${!role_error_var:-}")"
141+}
142+
143+assert_text_response() {
144+  local label="$1"
145+  local expected_status="$2"
146+  local expected_body="$3"
147+  local actual_status="$4"
148+  local actual_body="$5"
149+  local actual_error="$6"
150+
151+  if [[ "$actual_status" == "curl_error" ]]; then
152+    record_failure "${label} request failed: ${actual_error}"
153+    return 0
154+  fi
155+
156+  if [[ "$actual_status" != "$expected_status" || "$actual_body" != "$expected_body" ]]; then
157+    record_failure "${label} expected ${expected_status}(${expected_body}), got ${actual_status}(${actual_body})"
158+  fi
159+}
160+
161+assert_surface() {
162+  local label="$1"
163+  local expected_role="$2"
164+  local health_status_var="${label}_health_status"
165+  local health_body_var="${label}_health_body"
166+  local health_error_var="${label}_health_error"
167+  local ready_status_var="${label}_ready_status"
168+  local ready_body_var="${label}_ready_body"
169+  local ready_error_var="${label}_ready_error"
170+  local role_status_var="${label}_role_status"
171+  local role_body_var="${label}_role_body"
172+  local role_error_var="${label}_role_error"
173+
174+  assert_text_response "${label} /healthz" "200" "ok" "${!health_status_var:-}" "${!health_body_var:-}" "${!health_error_var:-}"
175+  assert_text_response "${label} /readyz" "200" "ready" "${!ready_status_var:-}" "${!ready_body_var:-}" "${!ready_error_var:-}"
176+  assert_text_response "${label} /rolez" "200" "$expected_role" "${!role_status_var:-}" "${!role_body_var:-}" "${!role_error_var:-}"
177+}
178+
179+parse_system_state_json() {
180+  node -e 'const fs = require("fs");
181+const payload = JSON.parse(fs.readFileSync(0, "utf8"));
182+const pick = (...values) => values.find((value) => value !== undefined && value !== null);
183+const mode = pick(payload.data && payload.data.mode, payload.mode, payload.automation && payload.automation.mode, "");
184+const holder = pick(payload.data && payload.data.holder_id, payload.holder_id, payload.leader && payload.leader.controller_id, "");
185+const term = pick(payload.data && payload.data.term, payload.term, payload.leader && payload.leader.term, "");
186+const lease = pick(payload.data && payload.data.lease_expires_at, payload.lease_expires_at, payload.leader && payload.leader.lease_expires_at, "");
187+process.stdout.write([mode, holder, term, lease].map((value) => value == null ? "" : String(value)).join("\t"));'
188+}
189+
190+while [[ $# -gt 0 ]]; do
191+  case "$1" in
192+    --env)
193+      env_path="$2"
194+      shift 2
195+      ;;
196+    --basic-auth)
197+      basic_auth="$2"
198+      shift 2
199+      ;;
200+    --bearer-token)
201+      bearer_token="$2"
202+      shift 2
203+      ;;
204+    --bearer-token-file)
205+      bearer_token_file="$2"
206+      shift 2
207+      ;;
208+    --control-api-base)
209+      control_api_base="$2"
210+      shift 2
211+      ;;
212+    --expect-leader)
213+      validate_node "$2"
214+      expect_leader="$2"
215+      shift 2
216+      ;;
217+    --skip-node)
218+      validate_node "$2"
219+      if ! contains_value "$2" "${skip_nodes[@]-}"; then
220+        skip_nodes+=("$2")
221+      fi
222+      shift 2
223+      ;;
224+    --skip-public)
225+      skip_public="1"
226+      shift
227+      ;;
228+    --skip-control-api)
229+      skip_control_api="1"
230+      shift
231+      ;;
232+    --timeout)
233+      timeout_sec="$2"
234+      shift 2
235+      ;;
236+    --help)
237+      usage
238+      exit 0
239+      ;;
240+    *)
241+      die "Unknown option: $1"
242+      ;;
243+  esac
244+done
245+
246+load_inventory "$env_path"
247+
248+if [[ -n "$bearer_token_file" ]]; then
249+  if [[ ! -f "$bearer_token_file" ]]; then
250+    die "Bearer token file not found: ${bearer_token_file}"
251+  fi
252+  bearer_token="$(tr -d '\r\n' < "$bearer_token_file")"
253+fi
254+
255+if [[ -z "$control_api_base" ]]; then
256+  control_api_base="$FAILOVER_CONTROL_API_BASE"
257+fi
258+
259+if [[ -z "$basic_auth" ]]; then
260+  if ! contains_value mini "${skip_nodes[@]-}"; then
261+    skip_nodes+=("mini")
262+  fi
263+  if ! contains_value mac "${skip_nodes[@]-}"; then
264+    skip_nodes+=("mac")
265+  fi
266+  failover_warn "No direct-node Basic Auth configured; skipping mini/mac direct probes."
267+fi
268+
269+if [[ -n "$expect_leader" ]]; then
270+  if [[ -z "$bearer_token" || "$skip_control_api" == "1" ]] && contains_value mini "${skip_nodes[@]-}" && contains_value mac "${skip_nodes[@]-}"; then
271+    die "Cannot verify --expect-leader when both direct-node probes and the control API check are skipped."
272+  fi
273+fi
274+
275+basic_auth_args=()
276+if [[ -n "$basic_auth" ]]; then
277+  basic_auth_args=(-u "$basic_auth")
278+fi
279+
280+printf 'Failover rehearsal snapshot\n'
281+printf 'inventory   %s\n' "$FAILOVER_ENV_PATH"
282+
283+if [[ "$skip_public" != "1" ]]; then
284+  public_base_url="https://${FAILOVER_CONDUCTOR_HOST}"
285+  probe_surface "public" "$public_base_url"
286+  print_surface_summary "public" "$public_base_url"
287+  assert_surface "public" "leader"
288+else
289+  printf '%-10s skipped\n' "public"
290+fi
291+
292+if ! contains_value mini "${skip_nodes[@]-}"; then
293+  mini_base_url="https://${FAILOVER_MINI_DIRECT_HOST}"
294+  probe_surface "mini" "$mini_base_url" "${basic_auth_args[@]}"
295+  print_surface_summary "mini" "$mini_base_url"
296+else
297+  printf '%-10s skipped\n' "mini"
298+fi
299+
300+if ! contains_value mac "${skip_nodes[@]-}"; then
301+  mac_base_url="https://${FAILOVER_MAC_DIRECT_HOST}"
302+  probe_surface "mac" "$mac_base_url" "${basic_auth_args[@]}"
303+  print_surface_summary "mac" "$mac_base_url"
304+else
305+  printf '%-10s skipped\n' "mac"
306+fi
307+
308+control_mode=""
309+control_holder=""
310+control_term=""
311+control_lease_expires_at=""
312+
313+if [[ "$skip_control_api" != "1" && -n "$bearer_token" ]]; then
314+  control_state_url="${control_api_base%/}/v1/system/state"
315+  probe_endpoint "control_status" "control_body" "control_error" "$control_state_url" \
316+    -H "Authorization: Bearer ${bearer_token}" \
317+    -H "Accept: application/json" || true
318+
319+  if [[ "${control_status:-}" == "curl_error" ]]; then
320+    printf '%-10s %s %s\n' "control" "$control_state_url" "$(format_probe_result "$control_status" "" "$control_error")"
321+    record_failure "control API /v1/system/state request failed: ${control_error}"
322+  elif [[ "${control_status:-}" != "200" ]]; then
323+    printf '%-10s %s %s\n' "control" "$control_state_url" "$(format_probe_result "$control_status" "$control_body" "$control_error")"
324+    record_failure "control API /v1/system/state expected 200, got ${control_status}(${control_body})"
325+  else
326+    parsed_control_state="$(printf '%s' "$control_body" | parse_system_state_json 2>/dev/null || true)"
327+    if [[ -z "$parsed_control_state" ]]; then
328+      printf '%-10s %s 200(raw=%s)\n' "control" "$control_state_url" "$control_body"
329+      record_failure "control API /v1/system/state returned JSON that could not be normalized"
330+    else
331+      IFS=$'\t' read -r control_mode control_holder control_term control_lease_expires_at <<<"$parsed_control_state"
332+      printf '%-10s %s mode=%s holder_id=%s term=%s lease_expires_at=%s\n' \
333+        "control" \
334+        "$control_state_url" \
335+        "${control_mode:-<empty>}" \
336+        "${control_holder:-<empty>}" \
337+        "${control_term:-<empty>}" \
338+        "${control_lease_expires_at:-<empty>}"
339+    fi
340+  fi
341+else
342+  printf '%-10s skipped\n' "control"
343+fi
344+
345+if [[ -z "$expect_leader" ]]; then
346+  if ! contains_value mini "${skip_nodes[@]-}" && ! contains_value mac "${skip_nodes[@]-}"; then
347+    mini_role="${mini_role_body:-}"
348+    mac_role="${mac_role_body:-}"
349+    case "${mini_role}:${mac_role}" in
350+      leader:standby | standby:leader) ;;
351+      *)
352+        record_failure "Expected exactly one direct node leader, got mini=${mini_role:-<empty>} mac=${mac_role:-<empty>}"
353+        ;;
354+    esac
355+  fi
356+else
357+  if ! contains_value "$expect_leader" "${skip_nodes[@]-}"; then
358+    assert_surface "$expect_leader" "leader"
359+  fi
360+
361+  other_node="mini"
362+  if [[ "$expect_leader" == "mini" ]]; then
363+    other_node="mac"
364+  fi
365+
366+  if ! contains_value "$other_node" "${skip_nodes[@]-}"; then
367+    assert_surface "$other_node" "standby"
368+  fi
369+
370+  if [[ -n "$control_holder" ]]; then
371+    case "$control_holder" in
372+      "${expect_leader}"-*) ;;
373+      *)
374+        record_failure "control API holder_id expected prefix ${expect_leader}-, got ${control_holder}"
375+        ;;
376+    esac
377+  fi
378+fi
379+
380+if [[ "$failures" -gt 0 ]]; then
381+  failover_error "rehearsal checks failed with ${failures} issue(s)"
382+  exit 1
383+fi
384+
385+failover_log "rehearsal checks passed"
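The `--expect-leader` check above infers the leader node from the `holder_id` prefix. A small sketch of that inference in isolation, assuming node IDs follow the `mini-…` / `mac-…` naming shown in the launchd defaults (`mini-main`, `mac-standby`); `leader_from_holder_id` is a hypothetical name, not a function from this commit:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map a control API holder_id to a node name by prefix.
# Anything that does not start with "mini-" or "mac-" is reported as unknown.
leader_from_holder_id() {
  case "$1" in
    mini-*) printf 'mini\n' ;;
    mac-*)  printf 'mac\n' ;;
    *)      printf 'unknown\n' ;;
  esac
}
```

The trailing hyphen in each pattern matters: a bare `mini*` glob would also match an unrelated ID such as `minimal`, which is why the script's own check compares against `"${expect_leader}"-*`.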