Merge branch 'master' of ssh://git.joshuabell.xyz:3032/ringofstorms/dotfiles

This commit is contained in:
Joshua Bell 2026-02-02 09:40:40 -06:00
commit 7087c8034c
7 changed files with 1824 additions and 62 deletions

docs/migrating_services.md Normal file

@ -0,0 +1,425 @@
# Migrating Services Between Hosts
## Overview
This document covers procedures for migrating services between NixOS hosts with minimal downtime.
## General Migration Strategy
### Pre-Migration Checklist
- [ ] New host is configured in flake with identical service config
- [ ] New host has required secrets (agenix/sops)
- [ ] Network connectivity verified (Tailscale IP assigned)
- [ ] Disk space sufficient on new host
- [ ] Backup of current state completed
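Most of these checks can be scripted. A minimal sketch of the disk-space check, operating on local paths (in practice the `du` would run on the old host over ssh and the `df` on the new one):

```shell
# Verify the destination mount has room for the service state (sizes in KiB).
check_space() {
  local src_dir=$1 dest_mount=$2
  local need avail
  need=$(du -sk "$src_dir" | cut -f1)
  avail=$(df -Pk "$dest_mount" | awk 'NR==2 {print $4}')
  if [ "$avail" -gt "$need" ]; then
    echo "ok: need ${need}K, have ${avail}K"
  else
    echo "insufficient: need ${need}K, have ${avail}K" >&2
    return 1
  fi
}

# e.g. check_space /var/lib/myservice /var/lib
```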
### Migration Types
| Type | Downtime | Complexity | Use When |
|------|----------|------------|----------|
| Cold migration | 5-30 min | Low | Simple services, maintenance windows |
| Warm migration | 2-5 min | Medium | Most services |
| Hot migration | <1 min | High | Databases with replication |
---
## Cold Migration (Simple)
Best for: Stateless or rarely-accessed services.
### Steps
```bash
# 1. Stop service on old host
ssh oldhost 'systemctl stop myservice'
# 2. Copy state to new host (run rsync on the old host; rsync cannot copy remote-to-remote)
ssh oldhost 'rsync -avz --progress /var/lib/myservice/ newhost:/var/lib/myservice/'
# 3. Start on new host
ssh newhost 'systemctl start myservice'
# 4. Update reverse proxy (if applicable)
# Edit nginx config: proxyPass = "http://<new-tailscale-ip>"
# Rebuild: ssh proxy 'nixos-rebuild switch'
# 5. Verify service works
# 6. Clean up old host (after verification period)
ssh oldhost 'rm -rf /var/lib/myservice'
```
**Downtime:** Duration of rsync + service start + proxy update.
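The rsync term dominates, scaling with state size over link speed; a back-of-the-envelope estimate (sketch; ignores compression and rsync overhead):

```shell
# Rough transfer time: bytes / (bytes per second).
est_seconds() { echo $(( $1 / $2 )); }

# 10 GiB of state over a 100 Mbit/s uplink:
est_seconds $(( 10 * 1024**3 )) $(( 100 * 1000 * 1000 / 8 ))  # → 858 (≈ 14 min)
```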
---
## Warm Migration (Recommended)
Best for: Most services with moderate state.
### Strategy
1. Sync state while service is running (initial sync)
2. Stop service briefly for final sync
3. Start on new host
4. Update routing
### Steps
```bash
# 1. Initial sync (service still running)
ssh oldhost 'rsync -avz --progress /var/lib/myservice/ newhost:/var/lib/myservice/'
# 2. Stop service on old host
ssh oldhost 'systemctl stop myservice'
# 3. Final sync (quick - only changes since initial sync)
ssh oldhost 'rsync -avz --progress /var/lib/myservice/ newhost:/var/lib/myservice/'
# 4. Start on new host
ssh newhost 'systemctl start myservice'
# 5. Update reverse proxy immediately
ssh proxy 'nixos-rebuild switch'
# 6. Verify
curl https://myservice.joshuabell.xyz
```
**Downtime:** 2-5 minutes (final rsync + start + proxy switch).
---
## Hot Migration (Database Services)
Best for: PostgreSQL, critical services requiring near-zero downtime.
### PostgreSQL Logical Replication
#### On Source (Old Host)
```nix
services.postgresql = {
settings = {
wal_level = "logical";
max_replication_slots = 4;
max_wal_senders = 4;
};
};
# Add replication user (it also needs the REPLICATION role attribute, granted via SQL)
services.postgresql.ensureUsers = [{
name = "replicator";
ensurePermissions."ALL TABLES IN SCHEMA public" = "SELECT";
}];
```
#### Set Up Replication
```sql
-- On source: Create publication
CREATE PUBLICATION my_pub FOR ALL TABLES;
-- On target: Create subscription
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=oldhost dbname=mydb user=replicator'
PUBLICATION my_pub;
```
#### Cutover
```bash
# 1. Verify replication is caught up
# Check lag on target:
psql -d mydb -c 'SELECT * FROM pg_stat_subscription;'
# 2. Stop writes on source (maintenance mode)
# 3. Wait for the final changes to replicate
# 4. Promote target (drop subscription)
psql -d mydb -c 'DROP SUBSCRIPTION my_sub;'
# 5. Update application connection strings
# 6. Update reverse proxy
```
**Downtime:** <1 minute (just the cutover).
---
## Service-Specific Procedures
### Forgejo (Git Server)
**State locations:**
- `/var/lib/forgejo/data/` - Git repositories, LFS
- `/var/lib/forgejo/postgres/` - PostgreSQL database
- `/var/lib/forgejo/backups/` - Existing backups
**Procedure (Warm Migration):**
```bash
# 1. Put Forgejo in maintenance mode (optional)
ssh h001 'touch /var/lib/forgejo/data/maintenance'
# 2. Backup database inside container
ssh h001 'nixos-container run forgejo -- pg_dumpall -U forgejo > /var/lib/forgejo/backups/pre-migration.sql'
# 3. Initial sync
ssh h001 'rsync -avz --progress /var/lib/forgejo/ newhost:/var/lib/forgejo/'
# 4. Stop container
ssh h001 'systemctl stop container@forgejo'
# 5. Final sync
ssh h001 'rsync -avz --progress /var/lib/forgejo/ newhost:/var/lib/forgejo/'
# 6. Start on new host
ssh newhost 'systemctl start container@forgejo'
# 7. Update O001 nginx
# Change: proxyPass = "http://100.64.0.13" → "http://<new-ip>"
ssh o001 'nixos-rebuild switch'
# 8. Verify
git clone https://git.joshuabell.xyz/test/repo.git
# 9. Remove maintenance mode
ssh newhost 'rm /var/lib/forgejo/data/maintenance'
```
**Downtime:** ~5 minutes.
### Zitadel (SSO)
**State locations:**
- `/var/lib/zitadel/postgres/` - PostgreSQL database
- `/var/lib/zitadel/backups/` - Backups
**Critical notes:**
- SSO is used by other services - coordinate downtime
- Test authentication after migration
- May need to clear client caches
**Procedure:** Same as Forgejo.
### Vaultwarden (Password Manager)
**State locations:**
- `/var/lib/vaultwarden/` - SQLite database, attachments
**Critical notes:**
- MOST CRITICAL SERVICE - users depend on this constantly
- Prefer hot migration or schedule during low-usage time
- Verify emergency access works after migration
**Procedure:**
```bash
# 1. Enable read-only mode (if supported)
# 2. Sync while running
ssh o001 'rsync -avz --progress /var/lib/vaultwarden/ newhost:/var/lib/vaultwarden/'
# 3. Quick cutover
ssh o001 'systemctl stop vaultwarden'
ssh o001 'rsync -avz --progress /var/lib/vaultwarden/ newhost:/var/lib/vaultwarden/'
ssh newhost 'systemctl start vaultwarden'
# 4. Update DNS/proxy immediately
# 5. Verify with mobile app and browser extension
```
**Downtime:** 2-3 minutes (coordinate with users).
### Headscale
**State locations:**
- `/var/lib/headscale/` - SQLite database with node registrations
**Critical notes:**
- ALL mesh connectivity depends on this
- Existing connections continue during migration
- New connections will fail during downtime
**Procedure:**
```bash
# 1. Backup current state
ssh l001 'restic -r /backup/l001 backup /var/lib/headscale --tag pre-migration'
# 2. Sync to new VPS
ssh l001 'rsync -avz --progress /var/lib/headscale/ newvps:/var/lib/headscale/'
# 3. Stop on old host
ssh l001 'systemctl stop headscale'
# 4. Final sync
ssh l001 'rsync -avz --progress /var/lib/headscale/ newvps:/var/lib/headscale/'
# 5. Start on new host
ssh newvps 'systemctl start headscale'
# 6. Update DNS
# headscale.joshuabell.xyz → new IP
# 7. Verify
headscale nodes list
tailscale status
# 8. Test new device joining
```
**Downtime:** 5-10 minutes (include DNS propagation time).
### AdGuard Home
**State locations:**
- `/var/lib/AdGuardHome/` - Config, query logs, filters
**Critical notes:**
- LAN DNS will fail during migration
- Configure backup DNS on clients first
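The backup DNS can be handed out ahead of time via DHCP option 6. A sketch assuming H003's DHCP is served by dnsmasq (its state dir is listed under H003 elsewhere in these docs); IPs are placeholders:

```nix
# dnsmasq dhcp-option syntax: "option:dns-server,<primary>,<fallback>"
services.dnsmasq.settings.dhcp-option = [
  "option:dns-server,192.168.1.2,1.1.1.1"
];
```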
**Procedure:**
```bash
# 1. Add temporary DNS to DHCP (e.g., 1.1.1.1)
# Or have clients use secondary DNS server
# 2. Quick migration
ssh h003 'systemctl stop adguardhome'
ssh h003 'rsync -avz --progress /var/lib/AdGuardHome/ newhost:/var/lib/AdGuardHome/'
ssh newhost 'systemctl start adguardhome'
# 3. Update DHCP to point to new host
# 4. Verify DNS resolution
dig @new-host-ip google.com
```
**Downtime:** 2-3 minutes (clients use backup DNS).
---
## Reverse Proxy Updates
When migrating services proxied through O001:
### Current Proxy Mappings (O001 nginx.nix)
| Domain | Backend |
|--------|---------|
| chat.joshuabell.xyz | 100.64.0.13 (H001) |
| git.joshuabell.xyz | 100.64.0.13 (H001) |
| notes.joshuabell.xyz | 100.64.0.13 (H001) |
| sec.joshuabell.xyz | 100.64.0.13 (H001) |
| sso.joshuabell.xyz | 100.64.0.13 (H001) |
| llm.joshuabell.xyz | 100.64.0.13:8095 (H001) |
### Updating Proxy
1. Edit `hosts/oracle/o001/nginx.nix`
2. Change `proxyPass` to new Tailscale IP
3. Commit and push
4. `ssh o001 'cd /etc/nixos && git pull && nixos-rebuild switch'`
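Step 2 is typically a one-line change per domain. A sketch — the exact attribute path depends on how nginx.nix is structured:

```nix
services.nginx.virtualHosts."git.joshuabell.xyz".locations."/" = {
  proxyPass = "http://100.64.0.XX";  # new host's Tailscale IP (placeholder)
};
```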
Or for faster updates without commit:
```bash
# Quick test (non-persistent)
ssh o001 'sed -i "s/100.64.0.13/100.64.0.XX/g" /etc/nginx/nginx.conf && nginx -s reload'
# Then update flake and rebuild properly
```
---
## Rollback Procedures
If migration fails:
### Quick Rollback
```bash
# 1. Stop on new host
ssh newhost 'systemctl stop myservice'
# 2. Start on old host (state should still be there)
ssh oldhost 'systemctl start myservice'
# 3. Revert proxy changes
ssh proxy 'nixos-rebuild switch --rollback'
```
### If Old State Was Deleted
```bash
# Restore from backup
restic -r /backup/oldhost restore latest --target / --include /var/lib/myservice
# Start service
systemctl start myservice
# Revert proxy
```
---
## Post-Migration Checklist
- [ ] Service responds correctly
- [ ] Authentication works (if applicable)
- [ ] Data integrity verified
- [ ] Monitoring updated to new host
- [ ] DNS/proxy pointing to new location
- [ ] Old host state cleaned up (after grace period)
- [ ] Backup job updated for new location
- [ ] Documentation updated
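"Data integrity verified" can be spot-checked by fingerprinting both directory trees and comparing. A sketch that works on any two locally visible copies (run the same command on each host over ssh and compare the outputs):

```shell
# Content fingerprint of a tree: hash of sorted per-file sha256 sums,
# with paths taken relative to the tree root so both sides are comparable.
tree_hash() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)
}

# tree_hash /var/lib/myservice should print the same value on old and new hosts.
```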
---
## Common Issues
### "Permission denied" on New Host
```bash
# Ensure correct ownership
chown -R serviceuser:servicegroup /var/lib/myservice
# Check SELinux/AppArmor if applicable
```
### Service Can't Connect to Database
```bash
# Verify PostgreSQL is running
systemctl status postgresql
# Check connection settings
grep -i database /var/lib/myservice/config.yaml
```
### SSL Certificate Issues
```bash
# Certificates are tied to domain, not host
# Should work automatically if domain unchanged
# If issues, force ACME renewal
systemctl restart acme-myservice.joshuabell.xyz.service
```
### Tailscale IP Changed
```bash
# Get new Tailscale IP
tailscale ip -4
# Update all references to old IP
grep -r "100.64.0.XX" /etc/nixos/
```


@ -72,13 +72,6 @@ in
"t" = {
identityFile = lib.mkIf (hasSecret "nix2t") osConfig.age.secrets.nix2t.path;
user = "joshua.bell";
localForwards = [
{
bind.port = 3002;
host.port = 3002;
host.address = "localhost";
}
];
setEnv = {
TERM = "vt100";
};
@ -87,13 +80,6 @@ in
identityFile = lib.mkIf (hasSecret "nix2t") osConfig.age.secrets.nix2t.path;
hostname = "10.12.14.181";
user = "joshua.bell";
localForwards = [
{
bind.port = 3002;
host.port = 3002;
host.address = "localhost";
}
];
setEnv = {
TERM = "vt100";
};

hosts/lio/flake.lock generated

@ -31,11 +31,11 @@
},
"locked": {
"dir": "flakes/beszel",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -64,11 +64,11 @@
"common": {
"locked": {
"dir": "flakes/common",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -138,11 +138,11 @@
},
"locked": {
"dir": "flakes/de_plasma",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -212,11 +212,11 @@
},
"locked": {
"dir": "flakes/flatpaks",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -253,11 +253,11 @@
"nixpkgs": "nixpkgs_2"
},
"locked": {
"lastModified": 1768949235,
"narHash": "sha256-TtjKgXyg1lMfh374w5uxutd6Vx2P/hU81aEhTxrO2cg=",
"lastModified": 1769580047,
"narHash": "sha256-tNqCP/+2+peAXXQ2V8RwsBkenlfWMERb+Uy6xmevyhM=",
"owner": "rycee",
"repo": "home-manager",
"rev": "75ed713570ca17427119e7e204ab3590cc3bf2a5",
"rev": "366d78c2856de6ab3411c15c1cb4fb4c2bf5c826",
"type": "github"
},
"original": {
@ -324,11 +324,11 @@
},
"nixpkgs-unstable": {
"locked": {
"lastModified": 1769170682,
"narHash": "sha256-oMmN1lVQU0F0W2k6OI3bgdzp2YOHWYUAw79qzDSjenU=",
"lastModified": 1769789167,
"narHash": "sha256-kKB3bqYJU5nzYeIROI82Ef9VtTbu4uA3YydSk/Bioa8=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "c5296fdd05cfa2c187990dd909864da9658df755",
"rev": "62c8382960464ceb98ea593cb8321a2cf8f9e3e5",
"type": "github"
},
"original": {
@ -356,11 +356,11 @@
},
"nixpkgs_3": {
"locked": {
"lastModified": 1769318308,
"narHash": "sha256-Mjx6p96Pkefks3+aA+72lu1xVehb6mv2yTUUqmSet6Q=",
"lastModified": 1769741972,
"narHash": "sha256-RxSg1EioTWNpoLaykiT1UQKTo/K0PPdLqCyQgNjNqWs=",
"owner": "nixos",
"repo": "nixpkgs",
"rev": "1cd347bf3355fce6c64ab37d3967b4a2cb4b878c",
"rev": "63590ac958a8af30ebd52c7a0309d8c52a94dd77",
"type": "github"
},
"original": {
@ -1303,11 +1303,11 @@
"nixpkgs": "nixpkgs_4"
},
"locked": {
"lastModified": 1769556375,
"narHash": "sha256-Ne2wFTs2fPyyDUIqy/XiYUmnqs6aaNE8/JA6BVBP+Ow=",
"lastModified": 1769964763,
"narHash": "sha256-ADWyyRulN7Bi+LN9pMO5JjcKqFnKWtCi4CCqMXlsHE4=",
"owner": "anomalyco",
"repo": "opencode",
"rev": "15ffd3cba1d3bd7d4d84c6911623a9c1d19e6647",
"rev": "2c82e6c6ae2cd7472de3598c2bf3bb2a4e9560d2",
"type": "github"
},
"original": {
@ -1341,11 +1341,11 @@
"nixpkgs": "nixpkgs_5"
},
"locked": {
"lastModified": 1769538000,
"narHash": "sha256-viIWuCjUqDk/QuIhUK9nOvDHCDHEO+VQRTnI/xhi/Ok=",
"lastModified": 1769967358,
"narHash": "sha256-6NmYfCgne1N3wUTxA686yyZSte+HfNEQM1Z0B0rjWD4=",
"ref": "refs/heads/main",
"rev": "221d0ca596b994f1942a42ee53a8b4508e23e7af",
"revCount": 20,
"rev": "acca9fbf50cd79ef6a23e6d617b847bbde558715",
"revCount": 23,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/qvm"
},
@ -1452,11 +1452,11 @@
"rust-overlay": "rust-overlay"
},
"locked": {
"lastModified": 1769125736,
"narHash": "sha256-anQb65WdwbW+r/elOicrhDAhF+pjZBnur5ei9/rhq2s=",
"lastModified": 1769668724,
"narHash": "sha256-0mJUN5IrXTeDjRadmhVEhOviRnM6MklCe3R2P61/JLA=",
"ref": "refs/heads/master",
"rev": "fedaece7199f49d1317856fe22f20b7467639409",
"revCount": 332,
"rev": "7affddc1b8b52ae11687e1537b7a2b8d57183fdd",
"revCount": 333,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/nvim"
},
@ -1514,11 +1514,11 @@
},
"locked": {
"dir": "flakes/secrets",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -1531,11 +1531,11 @@
"secrets-bao": {
"locked": {
"dir": "flakes/secrets-bao",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},
@ -1553,11 +1553,11 @@
},
"locked": {
"dir": "flakes/stt_ime",
"lastModified": 1769556519,
"narHash": "sha256-5A0LZ0YScpuZeVHNisjleTD+ln2kY0MJTjsKQlC0cMM=",
"lastModified": 1769709893,
"narHash": "sha256-498MHCXjghP2lXVSolRMegYBiA0lFyK8Tj7YcXCcA2M=",
"ref": "refs/heads/master",
"rev": "15fccd2ff439cf7474eb45228e5333335ba5fada",
"revCount": 1179,
"rev": "51ec48148acfb31c74d1f293b98416b8a3235f7d",
"revCount": 1188,
"type": "git",
"url": "https://git.joshuabell.xyz/ringofstorms/dotfiles"
},


@ -0,0 +1,456 @@
# Impermanence Rollout Strategy
## Overview
This document covers rolling out impermanence (ephemeral root filesystem) to all hosts, using Juni as the template.
## What is Impermanence?
**Philosophy:** Root filesystem (`/`) is wiped on every boot (tmpfs or reset subvolume), forcing you to explicitly declare what state to persist.
**Benefits:**
- Clean system by default - no accumulated cruft
- Forces documentation of important state
- Easy rollback (just reboot)
- Security (ephemeral root limits persistence of compromises)
- Reproducible server state
## Current State
| Host | Impermanence | Notes |
|------|--------------|-------|
| Juni | ✅ Implemented | bcachefs with @root/@persist subvolumes |
| H001 | ❌ Traditional | Most complex - many services |
| H002 | ❌ Traditional | NAS - may not need impermanence |
| H003 | ❌ Traditional | Router - good candidate |
| O001 | ❌ Traditional | Gateway - good candidate |
| L001 | ❌ Traditional | Headscale - good candidate |
## Juni's Implementation (Reference)
### Filesystem Layout
```
bcachefs (5 devices, 2x replication)
├── @root # Ephemeral - reset each boot
├── @nix # Persistent - Nix store
├── @persist # Persistent - bind mounts for state
└── @snapshots # Automatic snapshots
```
### Boot Process
1. Create snapshot of @root before reset
2. Reset @root subvolume (or recreate)
3. Boot into clean system
4. Bind mount persisted paths from @persist
### Persisted Paths (Juni)
```nix
environment.persistence."/persist" = {
hideMounts = true;
directories = [
"/var/log"
"/var/lib/nixos"
"/var/lib/systemd"
"/var/lib/tailscale"
"/var/lib/flatpak"
"/etc/NetworkManager/system-connections"
];
files = [
"/etc/machine-id"
"/etc/ssh/ssh_host_ed25519_key"
"/etc/ssh/ssh_host_ed25519_key.pub"
"/etc/ssh/ssh_host_rsa_key"
"/etc/ssh/ssh_host_rsa_key.pub"
];
users.josh = {
directories = [
".ssh"
".gnupg"
"projects"
".config"
".local/share"
];
};
};
```
### Custom Tooling
Juni has `bcache-impermanence` with commands:
- `ls` - List snapshots
- `gc` - Garbage collect old snapshots
- `diff` - Show changes since last boot (auto-excludes persisted paths)
Retention policy: 5 recent + 1/week for 4 weeks + 1/month
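The "keep N most recent" part of that policy is a one-liner; a simplified sketch (the real `bcache-impermanence gc` also handles the weekly and monthly buckets):

```shell
# Given snapshot names that sort lexicographically by timestamp,
# print everything beyond the 5 most recent as GC candidates.
KEEP_RECENT=5
gc_candidates() {
  printf '%s\n' "$@" | sort -r | tail -n +"$(( KEEP_RECENT + 1 ))"
}
```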
---
## Common Pain Point: Finding What Needs Persistence
> "I often have issues adding new persistent layers and knowing what I need to add"
### Discovery Workflow
#### Method 1: Use the Diff Tool
Before rebooting after installing new software:
```bash
# On Juni
bcache-impermanence diff
```
This shows files created/modified outside persisted paths.
#### Method 2: Boot and Observe Failures
```bash
# After reboot, check for failures
journalctl -b | grep -i "no such file"
journalctl -b | grep -i "failed to"
journalctl -b | grep -i "permission denied"
```
#### Method 3: Monitor File Changes
```bash
# Before making changes
find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -n > /tmp/before.txt
# After running services
find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -n > /tmp/after.txt
# Compare
diff /tmp/before.txt /tmp/after.txt
```
#### Method 4: Service-Specific Patterns
Most services follow predictable patterns:
| Pattern | Example | Usually Needs Persistence |
|---------|---------|---------------------------|
| `/var/lib/${service}` | `/var/lib/postgresql` | Yes |
| `/var/cache/${service}` | `/var/cache/nginx` | Usually no |
| `/var/log/${service}` | `/var/log/nginx` | Optional |
| `/etc/${service}` | `/etc/nginx` | Only if runtime-generated |
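These patterns are mechanical enough to encode. A trivial helper mirroring the table (heuristic only; the caveats above still apply):

```shell
# Heuristic mapping from path to persistence advice, per the table above.
persist_hint() {
  case "$1" in
    /var/lib/*)   echo "yes" ;;
    /var/cache/*) echo "usually no" ;;
    /var/log/*)   echo "optional" ;;
    /etc/*)       echo "only if runtime-generated" ;;
    *)            echo "unknown" ;;
  esac
}

persist_hint /var/lib/postgresql   # → yes
persist_hint /var/cache/nginx      # → usually no
```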
---
## Server Impermanence Template
### Minimal Server Persistence
```nix
environment.persistence."/persist" = {
hideMounts = true;
directories = [
# Core system
"/var/lib/nixos" # NixOS state DB
"/var/lib/systemd/coredump"
"/var/log"
# Network
"/var/lib/tailscale"
"/etc/NetworkManager/system-connections"
# ACME certificates
"/var/lib/acme"
];
files = [
"/etc/machine-id"
"/etc/ssh/ssh_host_ed25519_key"
"/etc/ssh/ssh_host_ed25519_key.pub"
"/etc/ssh/ssh_host_rsa_key"
"/etc/ssh/ssh_host_rsa_key.pub"
];
};
```
### Per-Host Additions
#### H001 (Services)
```nix
environment.persistence."/persist".directories = [
# Add to minimal template:
"/var/lib/forgejo"
"/var/lib/zitadel"
"/var/lib/openbao"
"/bao-keys"
"/var/lib/trilium"
"/var/lib/opengist"
"/var/lib/open-webui"
"/var/lib/n8n"
"/var/lib/nixarr/state"
"/var/lib/containers" # Podman/container state
];
```
#### O001 (Gateway)
```nix
environment.persistence."/persist".directories = [
# Add to minimal template:
"/var/lib/vaultwarden"
"/var/lib/postgresql"
"/var/lib/fail2ban"
];
```
#### L001 (Headscale)
```nix
environment.persistence."/persist".directories = [
# Add to minimal template:
"/var/lib/headscale"
];
```
#### H003 (Router)
```nix
environment.persistence."/persist".directories = [
# Add to minimal template:
"/var/lib/AdGuardHome"
"/var/lib/dnsmasq"
];
environment.persistence."/persist".files = [
# Add to minimal template:
"/boot/keyfile_nvme0n1p1" # LUKS key - CRITICAL
];
```
---
## Rollout Strategy
### Phase 1: Lowest Risk (VPS Hosts)
Start with L001 and O001:
- Easy to rebuild from scratch if something goes wrong
- Smaller state footprint
- Good practice before tackling complex hosts
**L001 Steps:**
1. Back up `/var/lib/headscale/`
2. Add impermanence module
3. Test on spare VPS first
4. Migrate
**O001 Steps:**
1. Back up Vaultwarden and PostgreSQL
2. Add impermanence module
3. Test carefully (Vaultwarden is critical!)
### Phase 2: Router (H003)
H003 is medium complexity:
- Relatively small state
- But critical for network (test during maintenance window)
- LUKS keyfile needs special handling
### Phase 3: Complex Host (H001)
H001 is most complex due to:
- Multiple containerized services
- Database state in containers
- Many stateful applications
**Approach:**
1. Inventory all state paths (see backup docs)
2. Test with snapshot before committing
3. Gradual rollout with extensive persistence list
4. May need to persist more than expected initially
### Phase 4: NAS (H002) - Maybe Skip
H002 may not benefit from impermanence:
- Primary purpose is persistent data storage
- bcachefs replication already provides redundancy
- Impermanence adds complexity without clear benefit
---
## Filesystem Options
### Option A: bcachefs with Subvolumes (Like Juni)
**Pros:**
- Flexible, modern
- Built-in snapshots
- Replication support
**Setup:**
```nix
fileSystems = {
"/" = {
device = "/dev/disk/by-label/nixos";
fsType = "bcachefs";
options = [ "subvol=@root" ];
};
"/nix" = {
device = "/dev/disk/by-label/nixos";
fsType = "bcachefs";
options = [ "subvol=@nix" ];
};
"/persist" = {
device = "/dev/disk/by-label/nixos";
fsType = "bcachefs";
options = [ "subvol=@persist" ];
neededForBoot = true;
};
};
```
### Option B: BTRFS with Subvolumes
Similar to bcachefs but more mature:
```nix
# Reset @root on boot
boot.initrd.postDeviceCommands = lib.mkAfter ''
mkdir -p /mnt
mount -o subvol=/ /dev/disk/by-label/nixos /mnt
btrfs subvolume delete /mnt/@root
btrfs subvolume create /mnt/@root
umount /mnt
'';
```
### Option C: tmpfs Root
Simplest but uses RAM:
```nix
fileSystems."/" = {
device = "none";
fsType = "tmpfs";
options = [ "defaults" "size=2G" "mode=755" ];
};
```
**Best for:** VPS hosts with limited disk but adequate RAM.
---
## Troubleshooting
### Service Fails After Reboot
```bash
# Check what's missing
journalctl -xeu servicename
# Common fixes:
# 1. Add /var/lib/servicename to persistence
# 2. Ensure directory permissions are correct
# 3. Check if service expects specific files in /etc
```
### "No such file or directory" Errors
```bash
# Find what's missing
journalctl -b | grep "No such file"
# Add missing paths to persistence
```
### Slow Boot (Too Many Bind Mounts)
If you have many persisted paths, consider:
1. Consolidating related paths
2. Using symlinks instead of bind mounts for some paths
3. Persisting parent directories instead of many children
### Container State Issues
Containers may have their own state directories:
```nix
# For NixOS containers
environment.persistence."/persist".directories = [
"/var/lib/nixos-containers"
];
# For Podman
environment.persistence."/persist".directories = [
"/var/lib/containers/storage/volumes"
# NOT overlay - that's regenerated
];
```
---
## Tooling Improvements
### Automated Discovery Script
Create a helper that runs periodically to detect unpersisted changes:
```bash
#!/usr/bin/env bash
# /usr/local/bin/impermanence-check
# Get list of persisted paths as JSON (`nix eval --raw` cannot print a list; requires jq)
PERSISTED=$(nix eval --json '.#nixosConfigurations.hostname.config.environment.persistence."/persist".directories' 2>/dev/null | jq -r '.[]')
# Find modified files outside persisted paths
find / -xdev -type f -mmin -60 2>/dev/null | while read -r file; do
is_persisted=false
for path in $PERSISTED; do
if [[ "$file" == "$path"* ]]; then
is_persisted=true
break
fi
done
if ! $is_persisted; then
echo "UNPERSISTED: $file"
fi
done
```
### Pre-Reboot Check
Add to your workflow:
```bash
# Before rebooting
bcache-impermanence diff # or custom script
# Review changes, add to persistence if needed, then reboot
```
---
## Action Items
### Immediate
- [ ] Document all state paths for each host (see backup docs)
- [ ] Create shared impermanence module in flake
### Phase 1 (L001/O001)
- [ ] Back up current state
- [ ] Add impermanence to L001
- [ ] Test thoroughly
- [ ] Roll out to O001
### Phase 2 (H003)
- [ ] Plan maintenance window
- [ ] Add impermanence to H003
- [ ] Verify LUKS key persistence
### Phase 3 (H001)
- [ ] Complete state inventory
- [ ] Test with extensive persistence list
- [ ] Gradual rollout


@ -0,0 +1,208 @@
# OpenBao Secrets Migration
## Overview
This document covers migrating from ragenix (age-encrypted secrets) to OpenBao for centralized secret management, enabling zero-config machine onboarding.
## Goals
1. **Zero-config machine onboarding**: New machine = install NixOS + add Zitadel machine key + done
2. **Eliminate re-keying workflow**: No more updating secrets.nix and re-encrypting .age files for each new machine
3. **Runtime secret dependencies**: Services wait for secrets via systemd, not build-time conditionals
4. **Consolidated SSH keys**: Use single `nix2nix` key for all NixOS machine SSH (keep `nix2t` for work)
5. **Declarative policy management**: OpenBao policies auto-applied after unseal with reconciliation
6. **Directional Tailscale ACLs**: Restrict work machine from reaching NixOS hosts (one-way access)
7. **Per-host variable registry**: `_variables.nix` pattern for ports/UIDs/GIDs to prevent conflicts
## Current State
### Ragenix Secrets in Use (21 active)
**SSH Keys (for client auth):**
- nix2github, nix2bitbucket, nix2gitforgejo
- nix2nix (shared), nix2t (work - keep separate)
- nix2lio (remote builds), nix2oren, nix2gpdPocket3
- nix2h001, nix2h003, nix2linode, nix2oracle
**API Tokens:**
- github_read_token (Nix private repo access)
- linode_rw_domains (ACME DNS challenge)
- litellm_public_api_key (nginx auth)
**VPN:**
- headscale_auth (Tailscale auth)
- us_chi_wg (NixArr WireGuard)
**Application Secrets:**
- oauth2_proxy_key_file
- openwebui_env
- zitadel_master_key
- vaultwarden_env
**Skipping (unused):**
- nix2h002, nix2joe, nix2l002, nix2gitjosh, obsidian_sync_env
### Already Migrated to OpenBao (juni)
- headscale_auth, atuin-key-josh, 12 SSH keys
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ New Machine Onboarding │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Install NixOS with full config │
│ - All services defined but waiting on secrets │
│ │
│ 2. Create Zitadel machine user + copy key │
│ - /machine-key.json → JWT auth to OpenBao │
│ │
│ 3. vault-agent fetches secrets │
│ - kv/data/machines/home_roaming/* → /var/lib/openbao-secrets│
│ │
│ 4. systemd dependencies resolve │
│ - secret-watcher completes → hardDepend services start │
│ │
│ 5. Machine fully operational │
│ │
└─────────────────────────────────────────────────────────────────┘
```
## Key Design Decisions
### Secret Path Convention
```
kv/data/machines/
├── home_roaming/ # Shared across all NixOS machines
│ ├── nix2nix # SSH key
│ ├── nix2github # SSH key
│ ├── headscale_auth # Tailscale auth
│ └── ...
├── home/ # h001-specific (not roaming)
│ ├── linode_rw_domains
│ ├── zitadel_master_key
│ └── ...
└── oracle/ # o001-specific
├── vaultwarden_env
└── ...
```
### Runtime Dependencies vs Build-Time Conditionals
**Before (ragenix pattern - bad for onboarding):**
```nix
let hasSecret = name: (config.age.secrets or {}) ? ${name};
in {
config = lib.mkIf (hasSecret "openwebui_env") {
services.open-webui.enable = true;
};
}
```
**After (OpenBao pattern - zero-config onboarding):**
```nix
ringofstorms.secretsBao.secrets.openwebui_env = {
kvPath = "kv/data/machines/home_roaming/openwebui_env";
hardDepend = [ "open-webui" ]; # Service waits for secret at runtime
configChanges.services.open-webui = {
enable = true;
environmentFile = "$SECRET_PATH";
};
};
```
### Per-Host File Structure
```
hosts/h001/
├── _variables.nix # Ports, UIDs, GIDs - single source of truth
├── secrets.nix # All secrets + their configChanges
├── flake.nix # Imports, basic host config
├── nginx.nix # Pure config (no conditionals)
└── mods/
├── openbao-policies.nix # Auto-apply after unseal
└── ...
```
### OpenBao Policy Management
Policies auto-apply after unseal with full reconciliation:
```nix
# openbao-policies.nix
let
policies = {
machines = ''
path "kv/data/machines/home_roaming/*" {
capabilities = ["read", "list"]
}
'';
};
reservedPolicies = [ "default" "root" ];
in {
systemd.services.openbao-apply-policies = {
after = [ "openbao-auto-unseal.service" ];
requires = [ "openbao-auto-unseal.service" ];
wantedBy = [ "multi-user.target" ];
# Script: apply all policies, delete orphans not in config
};
}
```
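The "delete orphans not in config" step is a set difference between live and declared policies. A sketch of the core logic, with the `bao policy list` / `bao policy delete` calls stubbed out:

```shell
# Policies declared in config, plus reserved ones that must never be deleted.
desired="machines"
reserved="default root"

# Print live policies (passed as arguments) that are neither desired nor reserved.
orphans() {
  local p
  for p in "$@"; do
    case " $desired $reserved " in
      *" $p "*) : ;;          # declared or reserved: keep
      *) echo "$p" ;;         # orphan: would run `bao policy delete $p`
    esac
  done
}

orphans machines default stale-policy   # → stale-policy
```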
### Headscale ACL Policy
Directional access control:
```nix
# nix machines: full mesh access
{ action = "accept"; src = ["group:nix-machines"]; dst = ["group:nix-machines:*"]; }
# nix machines → work: full access
{ action = "accept"; src = ["group:nix-machines"]; dst = ["tag:work:*"]; }
# work → nix machines: LIMITED (only specific ports)
{ action = "accept"; src = ["tag:work"]; dst = ["h001:22,443"]; }
```
## Implementation Phases
### Phase 1: SSH Key Preparation
- [ ] Add nix2nix SSH key to all hosts authorized_keys (alongside existing)
- [ ] Deploy with `nh os switch` to all hosts
### Phase 2: Infrastructure
- [ ] Create `_variables.nix` pattern for h001 (pilot)
- [ ] Create `openbao-policies.nix` with auto-apply + reconciliation
- [ ] Create `headscale-policy.nix` with directional ACLs
- [ ] Create per-host `secrets.nix` pattern
### Phase 3: Secret Migration
- [ ] Migrate h001 secrets (linode_rw_domains, us_chi_wg, oauth2_proxy_key_file, openwebui_env, zitadel_master_key)
- [ ] Migrate o001 secrets (vaultwarden_env, litellm_public_api_key)
- [ ] Migrate common modules (tailnet, ssh, nix_options)
- [ ] Migrate SSH client keys
### Phase 4: Consumer Updates
- [ ] Update ssh.nix to use OpenBao paths
- [ ] Remove hasSecret conditionals from all modules
- [ ] Remove ragenix imports and secrets flake
### Phase 5: Testing & Finalization
- [ ] Populate all secrets in OpenBao KV store
- [ ] Test onboarding workflow on fresh VM
- [ ] Document new machine onboarding process
## Related Ideas
- `impermanence_everywhere.md` - Impermanence persists `/var/lib/openbao-secrets` and `/machine-key.json`
- `resilience.md` - OpenBao server (h001) is a SPOF; consider backup/failover
- `service_backups.md` - `/var/lib/openbao` and `/bao-keys` need backup
## Notes
- OpenBao hosted on h001 at sec.joshuabell.xyz
- JWT auth via Zitadel machine users
- vault-agent on each host fetches secrets
- `sec` CLI tool available for manual lookups

---

`ideas/resilience.md`
# Infrastructure Resilience & Failover
## Overview
This document covers strategies for eliminating single points of failure and improving infrastructure resilience.
## Current Architecture
```
INTERNET
┌─────────┴─────────┐
│ │
┌─────▼─────┐ ┌──────▼──────┐
│ O001 │ │ L001 │
│ (Oracle) │ │ (Linode) │
│ nginx │ │ Headscale │
│ +vault │ │ (SPOF!) │
│ +atuin │ └──────┬──────┘
│ (SPOF!) │ │
└─────┬─────┘ │
│ Tailscale Mesh
│ ┌───────────┴───────────┐
│ │ │
┌─────▼───────▼─────┐ ┌──────▼──────┐
│ H001 │ │ H003 │
│ (Service Host) │ │ (Router) │
│ Forgejo,Zitadel, │ │ AdGuard, │
│ LiteLLM,Trilium, │ │ DHCP,NAT │
│ NixArr,OpenWebUI │ │ (SPOF!) │
└─────────┬─────────┘ └─────────────┘
│ NFS
┌─────────▼─────────┐
│ H002 │
│ (NAS - bcachefs)│
│ Media, Data │
└───────────────────┘
```
## Critical Single Points of Failure
| Host | Service | Impact if Down | Recovery Effort |
|------|---------|----------------|---------------|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| **H001** | All services | Services down but recoverable | MEDIUM |
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |
---
## Reverse Proxy Resilience (O001)
### Current Problem
O001 is a single point of failure for all public traffic:
- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host
### Solution Options
#### Option A: Cloudflare Tunnel (Recommended Quick Win)
**Pros:**
- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed
**Cons:**
- Cannot stream media (Jellyfin) - violates Cloudflare ToS
- Adds latency
- Vendor dependency
**Implementation:**
```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
enable = true;
tunnels."joshuabell" = {
credentialsFile = config.age.secrets.cloudflared.path;
ingress = {
"chat.joshuabell.xyz" = "http://100.64.0.13:80";
"git.joshuabell.xyz" = "http://100.64.0.13:80";
"notes.joshuabell.xyz" = "http://100.64.0.13:80";
"sec.joshuabell.xyz" = "http://100.64.0.13:80";
"sso.joshuabell.xyz" = "http://100.64.0.13:80";
"n8n.joshuabell.xyz" = "http://100.64.0.13:80";
"blog.joshuabell.xyz" = "http://100.64.0.13:80";
};
};
};
```
Cloudflare automatically load balances across all active tunnel replicas.
#### Option B: DNS Failover with Health Checks
Use Cloudflare DNS with health checks:
- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires Cloudflare paid plan for load balancing
#### Option C: Tailscale Funnel
Expose services directly without O001:
```bash
# On H001
tailscale funnel 443
```
Exposes H001 directly at `https://h001.<tailnet>.ts.net` (the MagicDNS name on Tailscale's `ts.net` domain)
**Pros:**
- No proxy needed
- Per-service granularity
- Automatic HTTPS
**Cons:**
- Uses `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, 10000
#### Option D: Manual Failover with Shared Config
Keep H001 ready to take over O001's role:
1. Same nginx config via shared NixOS module
2. Use DNS-01 ACME challenge (certs work on any host)
3. Update DNS when O001 fails
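Step 2 can be sketched with the NixOS ACME module (the `"cloudflare"` provider, contact email, and secret path are assumptions):

```nix
# DNS-01 challenge: certs can be issued from any host, no inbound HTTP needed.
security.acme = {
  acceptTerms = true;
  defaults = {
    email = "admin@joshuabell.xyz";
    dnsProvider = "cloudflare";
    # API token with DNS edit rights for the zone
    credentialsFile = config.age.secrets.cloudflare-dns.path;
  };
};
```

With identical cert config on O001 and H001, either host can serve TLS immediately after the DNS cutover.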
### Recommended Hybrid Approach
```
┌─────────────────────────────────────────────────────────────┐
│ RECOMMENDED TOPOLOGY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Cloudflare DNS (health-checked failover) │
│ │ │
│ ┌──────┴──────┐ │
│ │ │ │
│ ▼ ▼ │
│ O001 ──OR── H001 (via Cloudflare Tunnel) │
│ nginx cloudflared backup │
│ │
│ Jellyfin: Direct via Tailscale Funnel (bypasses O001) │
│ Vaultwarden: Cloudflare Tunnel (survives O001 failure) │
│ │
└─────────────────────────────────────────────────────────────┘
```
**Key Changes:**
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
3. Other services via Cloudflare Tunnel with H001 as backup
---
## Headscale HA (L001)
### The Problem
L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades
### Solution Options
#### Option 1: Frequent Backups (Minimum Viable)
```nix
my.backup = {
enable = true;
paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```
**Recovery time:** ~30 minutes to spin up new VPS + restore
#### Option 2: Warm Standby
- Run second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to standby
- Update DNS to point to standby if primary fails
```bash
# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```
**Recovery time:** ~5 minutes (start service, update DNS)
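The daily sync can run as a systemd unit on the standby, pulling from the primary (a sketch, assuming `pkgs` is in module scope and SSH keys are already deployed):

```nix
# Pull Headscale state from the primary once a day.
systemd.services.headscale-standby-sync = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.rsync}/bin/rsync -az --delete l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-standby-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true;
  };
};
```

`--delete` keeps the standby an exact mirror, which matters because Headscale must see a consistent SQLite database.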
#### Option 3: Headscale HA with LiteFS
Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover
See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/
**Recovery time:** ~15 seconds automatic failover
#### Option 4: Use Tailscale Commercial
Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial
### Recommendation
Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.
---
## Router HA (H003)
### The Problem
H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, entire LAN loses connectivity
### Solution Options
#### Option 1: Secondary DNS/DHCP
Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (ISC DHCP supported failover pairs but is end-of-life; its successor Kea provides a high-availability hook)
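A secondary instance is a few lines with the NixOS module (the upstream resolvers shown are assumptions):

```nix
# On H001: secondary AdGuard Home as a fallback resolver.
services.adguardhome = {
  enable = true;
  settings = {
    dns = {
      bind_hosts = [ "0.0.0.0" ];
      upstream_dns = [ "1.1.1.1" "9.9.9.9" ];
    };
  };
};
```

Hand clients both resolvers via DHCP option 6 so they fail over automatically.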
#### Option 2: Keepalived for Router Failover
If you have two devices that could be routers:
```nix
services.keepalived = {
enable = true;
vrrpInstances.router = {
state = "MASTER"; # or "BACKUP" on secondary
interface = "eth0";
virtualRouterId = 1;
priority = 200; # 1-254; use a lower value on the backup (255 is reserved for the address owner)
virtualIps = [{ addr = "10.12.14.1/24"; }];
};
};
```
#### Option 3: Router Redundancy via ISP
- Use ISP router as fallback gateway
- Clients get two gateways via DHCP
- Less control but automatic failover
### Recommendation
Run secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for homelab.
---
## NFS HA (H002)
### Current State
H002 uses bcachefs with 2x replication across 5 disks. Single host failure still causes data unavailability.
### Options
#### Option 1: NFS Client Resilience
Configure NFS clients to handle server unavailability gracefully:
```nix
fileSystems."/nfs/h002" = {
device = "100.64.0.3:/data";
fsType = "nfs4";
options = [
"soft" # Don't hang forever
"timeo=50" # 5 second timeout
"retrans=3" # 3 retries
"nofail" # Don't fail boot if unavailable
];
};
```
#### Option 2: Second NAS with GlusterFS
For true HA, run two NAS nodes with GlusterFS replication:
```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```
**Overkill for homelab**, but an option for critical data.
### Recommendation
Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
---
## Recommended Implementation Order
### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale
### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home
### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation
---
## Monitoring & Alerting
Essential for knowing when to failover:
```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
enable = true;
# Monitor: Headscale, nginx, Vaultwarden, AdGuard
};
# Or use external monitoring (BetterStack, Uptime Robot)
```
Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails
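The alert conditions above can be driven by a small shell helper; this is a hypothetical sketch, and the commented probes use hostnames and IPs from this document as assumptions (the runnable part only demonstrates the reporting logic):

```shell
#!/usr/bin/env bash
# Minimal health-probe sketch: run each check, count failures.
failures=0

check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    failures=$((failures + 1))
  fi
}

# Real probes would look like (endpoints are assumptions):
#   check "headscale-api" curl -fsS https://net.joshuabell.xyz/health
#   check "dns"           dig +short @10.12.14.1 joshuabell.xyz
#   check "nfs-mount"     mountpoint -q /nfs/h002
check "demo-up"   true
check "demo-down" false

echo "$failures probe(s) failed"
```

Wire the failure count into ntfy or Uptime Kuma's push monitor to get the actual alert.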

---

`ideas/service_backups.md`
# Service Backup Strategy
## Overview
This document outlines the backup strategy for the NixOS fleet, covering critical data paths, backup tools, and recovery procedures.
## Current State
**No automated backups are running today.** This is a critical gap.
## Backup Topology
```
┌─────────────────────────────────────────────────────────────┐
│ BACKUP TOPOLOGY │
├─────────────────────────────────────────────────────────────┤
│ │
│ H001,H003,O001,L001 ──────► H002:/data/backups (primary) │
│ └────► B2/S3 (offsite) │
│ │
│ H002 (NAS) ───────────────► B2/S3 (offsite only) │
│ │
└─────────────────────────────────────────────────────────────┘
```
## Critical Paths by Host
### L001 (Headscale) - HIGHEST PRIORITY
| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/headscale/` | SQLite DB with all node registrations | Small | CRITICAL |
| `/var/lib/acme/` | SSL certificates | Small | High |
**Impact if lost:** ALL mesh connectivity fails - new connections fail, devices can't rejoin.
### O001 (Oracle Gateway)
| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/vaultwarden/` | Password vault (encrypted) | ~41MB | CRITICAL |
| `/var/lib/postgresql/` | Atuin shell history | ~226MB | Medium |
| `/var/lib/acme/` | SSL certificates | Small | High |
**Impact if lost:** All public access down, password manager lost.
### H001 (Services)
| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/forgejo/` | Git repos + PostgreSQL | Large | CRITICAL |
| `/var/lib/zitadel/` | SSO database + config | Medium | CRITICAL |
| `/var/lib/openbao/` | Secrets vault | Small | CRITICAL |
| `/bao-keys/` | Vault unseal keys | Tiny | CRITICAL |
| `/var/lib/trilium/` | Notes database | Medium | High |
| `/var/lib/opengist/` | Gist data | Small | Medium |
| `/var/lib/open-webui/` | AI chat history | Medium | Low |
| `/var/lib/n8n/` | Workflows | Medium | Medium |
| `/var/lib/acme/` | SSL certificates | Small | High |
| `/var/lib/nixarr/state/` | Media manager configs | Small | Medium |
**Note:** A 154GB backup exists at `/var/lib/forgejo.tar.gz` - this is manual and should be automated.
### H003 (Router)
| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/AdGuardHome/` | DNS filtering config + stats | Medium | High |
| `/boot/keyfile_nvme0n1p1` | LUKS encryption key | Tiny | CRITICAL |
**WARNING:** The LUKS keyfile must be stored separately in a secure location (e.g., Vaultwarden).
### H002 (NAS)
| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/data/nixarr/media/` | Movies, TV, music, books | Very Large | Low (replaceable) |
| `/data/pinchflat/` | YouTube downloads | Large | Low |
**Note:** bcachefs already has 2x replication. Offsite backup is optional but recommended for irreplaceable data.
## Recommended Backup Tool: Restic
### Why Restic?
- Modern, encrypted, deduplicated backups
- Native NixOS module: `services.restic.backups`
- Multiple backend support (local, S3, B2, SFTP)
- Incremental backups with deduplication
- Easy pruning/retention policies
### Shared Backup Module
Create a shared module at `modules/backup.nix`:
```nix
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.my.backup;
in {
options.my.backup = {
enable = mkEnableOption "restic backups";
paths = mkOption { type = types.listOf types.str; default = []; };
exclude = mkOption { type = types.listOf types.str; default = []; };
postgresBackup = mkOption { type = types.bool; default = false; };
};
config = mkIf cfg.enable {
# PostgreSQL dumps before backup
services.postgresqlBackup = mkIf cfg.postgresBackup {
enable = true;
location = "/var/backup/postgresql";
compression = "zstd";
startAt = "02:00:00";
};
services.restic.backups = {
daily = {
paths = cfg.paths ++ (optional cfg.postgresBackup "/var/backup/postgresql");
exclude = cfg.exclude ++ [
"**/cache/**"
"**/Cache/**"
"**/.cache/**"
"**/tmp/**"
];
# Primary: NFS to H002
repository = "/nfs/h002/backups/${config.networking.hostName}";
passwordFile = config.age.secrets.restic-password.path;
initialize = true;
pruneOpts = [
"--keep-daily 7"
"--keep-weekly 4"
"--keep-monthly 6"
];
timerConfig = {
OnCalendar = "03:00:00";
RandomizedDelaySec = "1h";
Persistent = true;
};
backupPrepareCommand = ''
# Ensure NFS is mounted
mountpoint -q /nfs/h002 || mount /nfs/h002
'';
};
# Offsite to B2/S3 (less frequent)
offsite = {
paths = cfg.paths;
repository = "b2:joshuabell-backups:${config.networking.hostName}";
passwordFile = config.age.secrets.restic-password.path;
environmentFile = config.age.secrets.b2-credentials.path;
pruneOpts = [
"--keep-daily 3"
"--keep-weekly 2"
"--keep-monthly 3"
];
timerConfig = {
OnCalendar = "weekly";
Persistent = true;
};
};
};
};
}
```
### Per-Host Configuration
#### L001 (Headscale)
```nix
my.backup = {
enable = true;
paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```
#### O001 (Oracle)
```nix
my.backup = {
enable = true;
paths = [ "/var/lib/vaultwarden" "/var/lib/acme" ];
postgresBackup = true; # For Atuin
};
```
#### H001 (Services)
```nix
my.backup = {
enable = true;
paths = [
"/var/lib/forgejo"
"/var/lib/zitadel"
"/var/lib/openbao"
"/bao-keys"
"/var/lib/trilium"
"/var/lib/opengist"
"/var/lib/open-webui"
"/var/lib/n8n"
"/var/lib/acme"
"/var/lib/nixarr/state"
];
};
```
#### H003 (Router)
```nix
my.backup = {
enable = true;
paths = [ "/var/lib/AdGuardHome" ];
# LUKS key backed up separately to Vaultwarden
};
```
## Database Backup Best Practices
### For Containerized PostgreSQL (Forgejo/Zitadel)
```nix
systemd.services.container-forgejo-backup = {
script = ''
nixos-container run forgejo -- pg_dumpall -U forgejo \
| ${pkgs.zstd}/bin/zstd > /var/lib/forgejo/backups/db-$(date +%Y%m%d).sql.zst
'';
startAt = "02:30:00"; # Before restic runs at 03:00
};
```
### For Direct PostgreSQL
```nix
services.postgresqlBackup = {
enable = true;
backupAll = true;
location = "/var/backup/postgresql";
compression = "zstd";
startAt = "*-*-* 02:00:00";
};
```
## Recovery Procedures
### Restoring from Restic
```bash
# List snapshots
restic -r /path/to/repo snapshots
# Restore specific snapshot
restic -r /path/to/repo restore abc123 --target /restore
# Restore latest
restic -r /path/to/repo restore latest --target /restore
# Restore specific path
restic -r /path/to/repo restore latest \
--target /restore \
--include /var/lib/postgresql
# Mount for browsing
mkdir /mnt/restic
restic -r /path/to/repo mount /mnt/restic
```
### PostgreSQL Recovery
```bash
# Stop PostgreSQL
systemctl stop postgresql
# Restore from restic
restic restore latest --target / --include /var/lib/postgresql
# Or from SQL dump
sudo -u postgres psql < /restore/all-databases.sql
# Start PostgreSQL
systemctl start postgresql
```
## Backup Verification
Add automated verification:
```nix
systemd.timers.restic-verify = {
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "weekly";
Persistent = true;
};
};
systemd.services.restic-verify = {
script = ''
${pkgs.restic}/bin/restic -r /path/to/repo check --read-data-subset=5%
'';
};
```
## Monitoring & Alerting
```nix
# Alert on backup failure
systemd.services."restic-backups-daily".onFailure = [ "notify-failure@%n.service" ];
systemd.services."notify-failure@" = {
  serviceConfig.Type = "oneshot";
  scriptArgs = "%i"; # instance name is passed as $1 (specifiers don't expand inside script bodies)
  script = ''
    ${pkgs.curl}/bin/curl -X POST https://ntfy.sh/joshuabell-backups \
      -H "Title: Backup Failed" \
      -d "Service: $1 on ${config.networking.hostName}"
  '';
};
```
## Action Items
### Immediate (This Week)
- [ ] Set up restic backups for L001 (Headscale) - most critical
- [ ] Back up H003's LUKS keyfile to Vaultwarden
- [ ] Create `/data/backups/` directory on H002
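The directory creation can live in H002's config instead of being a manual step (mode and ownership are assumptions):

```nix
# On H002: declaratively ensure the restic target directory exists.
systemd.tmpfiles.rules = [
  "d /data/backups 0750 root root -"
];
```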
### Short-Term (This Month)
- [ ] Implement shared backup module
- [ ] Deploy to all hosts
- [ ] Set up offsite B2 bucket
### Medium-Term
- [ ] Automated backup verification
- [ ] Monitoring/alerting integration
- [ ] Test recovery procedures