diff --git a/docs/migrating_services.md b/docs/migrating_services.md new file mode 100644 index 00000000..b9d8300c --- /dev/null +++ b/docs/migrating_services.md @@ -0,0 +1,425 @@ +# Migrating Services Between Hosts + +## Overview + +This document covers procedures for migrating services between NixOS hosts with minimal downtime. + +## General Migration Strategy + +### Pre-Migration Checklist + +- [ ] New host is configured in flake with identical service config +- [ ] New host has required secrets (agenix/sops) +- [ ] Network connectivity verified (Tailscale IP assigned) +- [ ] Disk space sufficient on new host +- [ ] Backup of current state completed + +### Migration Types + +| Type | Downtime | Complexity | Use When | +|------|----------|------------|----------| +| Cold migration | 5-30 min | Low | Simple services, maintenance windows | +| Warm migration | 2-5 min | Medium | Most services | +| Hot migration | <1 min | High | Databases with replication | + +--- + +## Cold Migration (Simple) + +Best for: Stateless or rarely-accessed services. + +### Steps + +```bash +# 1. Stop service on old host +ssh oldhost 'systemctl stop myservice' + +# 2. Copy state to new host +rsync -avz --progress oldhost:/var/lib/myservice/ newhost:/var/lib/myservice/ + +# 3. Start on new host +ssh newhost 'systemctl start myservice' + +# 4. Update reverse proxy (if applicable) +# Edit nginx config: proxyPass = "http://" +# Rebuild: ssh proxy 'nixos-rebuild switch' + +# 5. Verify service works + +# 6. Clean up old host (after verification period) +ssh oldhost 'rm -rf /var/lib/myservice' +``` + +**Downtime:** Duration of rsync + service start + proxy update. + +--- + +## Warm Migration (Recommended) + +Best for: Most services with moderate state. + +### Strategy + +1. Sync state while service is running (initial sync) +2. Stop service briefly for final sync +3. Start on new host +4. Update routing + +### Steps + +```bash +# 1. 
Initial sync (service still running) +rsync -avz --progress oldhost:/var/lib/myservice/ newhost:/var/lib/myservice/ + +# 2. Stop service on old host +ssh oldhost 'systemctl stop myservice' + +# 3. Final sync (quick - only changes since initial sync) +rsync -avz --progress oldhost:/var/lib/myservice/ newhost:/var/lib/myservice/ + +# 4. Start on new host +ssh newhost 'systemctl start myservice' + +# 5. Update reverse proxy immediately +ssh proxy 'nixos-rebuild switch' + +# 6. Verify +curl https://myservice.joshuabell.xyz +``` + +**Downtime:** 2-5 minutes (final rsync + start + proxy switch). + +--- + +## Hot Migration (Database Services) + +Best for: PostgreSQL, critical services requiring near-zero downtime. + +### PostgreSQL Logical Replication + +#### On Source (Old Host) + +```nix +services.postgresql = { + settings = { + wal_level = "logical"; + max_replication_slots = 4; + max_wal_senders = 4; + }; +}; + +# Add replication user +services.postgresql.ensureUsers = [{ + name = "replicator"; + ensurePermissions."ALL TABLES IN SCHEMA public" = "SELECT"; +}]; +``` + +#### Set Up Replication + +```sql +-- On source: Create publication +CREATE PUBLICATION my_pub FOR ALL TABLES; + +-- On target: Create subscription +CREATE SUBSCRIPTION my_sub + CONNECTION 'host=oldhost dbname=mydb user=replicator' + PUBLICATION my_pub; +``` + +#### Cutover + +```bash +# 1. Verify replication is caught up +# Check lag on target: +SELECT * FROM pg_stat_subscription; + +# 2. Stop writes on source (maintenance mode) + +# 3. Wait for final sync + +# 4. Promote target (drop subscription) +DROP SUBSCRIPTION my_sub; + +# 5. Update application connection strings + +# 6. Update reverse proxy +``` + +**Downtime:** <1 minute (just the cutover). 
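Before dropping the subscription, it is worth gating the cutover on replication actually being caught up. A minimal sketch of that check — the host names, database name, and helper are illustrative assumptions; it compares `pg_current_wal_lsn()` on the source against `latest_end_lsn` from `pg_stat_subscription` on the target, after writes have stopped:

```shell
#!/usr/bin/env bash
# cutover-check.sh -- sketch of a pre-cutover lag check (hypothetical helper;
# assumes passwordless psql access on both hosts).
set -euo pipefail

# Once writes are stopped on the source, the two LSNs match exactly
# when the subscription has applied everything.
lsn_caught_up() {
  [ "$1" = "$2" ]
}

# Example usage (commented out -- requires live hosts):
# SRC=$(ssh oldhost "psql -Atc 'SELECT pg_current_wal_lsn()' mydb")
# TGT=$(ssh newhost "psql -Atc 'SELECT latest_end_lsn FROM pg_stat_subscription' mydb")
# if lsn_caught_up "$SRC" "$TGT"; then
#   ssh newhost "psql -Atc 'DROP SUBSCRIPTION my_sub' mydb"
# else
#   echo "replication still lagging: source=$SRC target=$TGT" >&2
#   exit 1
# fi
```

Running the comparison only after the source is in maintenance mode avoids a race where new writes land between the check and the `DROP SUBSCRIPTION`.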
+ +--- + +## Service-Specific Procedures + +### Forgejo (Git Server) + +**State locations:** +- `/var/lib/forgejo/data/` - Git repositories, LFS +- `/var/lib/forgejo/postgres/` - PostgreSQL database +- `/var/lib/forgejo/backups/` - Existing backups + +**Procedure (Warm Migration):** + +```bash +# 1. Put Forgejo in maintenance mode (optional) +ssh h001 'touch /var/lib/forgejo/data/maintenance' + +# 2. Backup database inside container +ssh h001 'nixos-container run forgejo -- pg_dumpall -U forgejo > /var/lib/forgejo/backups/pre-migration.sql' + +# 3. Initial sync +rsync -avz --progress h001:/var/lib/forgejo/ newhost:/var/lib/forgejo/ + +# 4. Stop container +ssh h001 'systemctl stop container@forgejo' + +# 5. Final sync +rsync -avz --progress h001:/var/lib/forgejo/ newhost:/var/lib/forgejo/ + +# 6. Start on new host +ssh newhost 'systemctl start container@forgejo' + +# 7. Update O001 nginx +# Change: proxyPass = "http://100.64.0.13" → "http://" +ssh o001 'nixos-rebuild switch' + +# 8. Verify +git clone https://git.joshuabell.xyz/test/repo.git + +# 9. Remove maintenance mode +ssh newhost 'rm /var/lib/forgejo/data/maintenance' +``` + +**Downtime:** ~5 minutes. + +### Zitadel (SSO) + +**State locations:** +- `/var/lib/zitadel/postgres/` - PostgreSQL database +- `/var/lib/zitadel/backups/` - Backups + +**Critical notes:** +- SSO is used by other services - coordinate downtime +- Test authentication after migration +- May need to clear client caches + +**Procedure:** Same as Forgejo. + +### Vaultwarden (Password Manager) + +**State locations:** +- `/var/lib/vaultwarden/` - SQLite database, attachments + +**Critical notes:** +- MOST CRITICAL SERVICE - users depend on this constantly +- Prefer hot migration or schedule during low-usage time +- Verify emergency access works after migration + +**Procedure:** + +```bash +# 1. Enable read-only mode (if supported) + +# 2. Sync while running +rsync -avz --progress o001:/var/lib/vaultwarden/ newhost:/var/lib/vaultwarden/ + +# 3. 
Quick cutover +ssh o001 'systemctl stop vaultwarden' +rsync -avz --progress o001:/var/lib/vaultwarden/ newhost:/var/lib/vaultwarden/ +ssh newhost 'systemctl start vaultwarden' + +# 4. Update DNS/proxy immediately + +# 5. Verify with mobile app and browser extension +``` + +**Downtime:** 2-3 minutes (coordinate with users). + +### Headscale + +**State locations:** +- `/var/lib/headscale/` - SQLite database with node registrations + +**Critical notes:** +- ALL mesh connectivity depends on this +- Existing connections continue during migration +- New connections will fail during downtime + +**Procedure:** + +```bash +# 1. Backup current state +restic -r /backup/l001 backup /var/lib/headscale --tag pre-migration + +# 2. Sync to new VPS +rsync -avz --progress l001:/var/lib/headscale/ newvps:/var/lib/headscale/ + +# 3. Stop on old host +ssh l001 'systemctl stop headscale' + +# 4. Final sync +rsync -avz --progress l001:/var/lib/headscale/ newvps:/var/lib/headscale/ + +# 5. Start on new host +ssh newvps 'systemctl start headscale' + +# 6. Update DNS +# headscale.joshuabell.xyz → new IP + +# 7. Verify +headscale nodes list +tailscale status + +# 8. Test new device joining +``` + +**Downtime:** 5-10 minutes (include DNS propagation time). + +### AdGuard Home + +**State locations:** +- `/var/lib/AdGuardHome/` - Config, query logs, filters + +**Critical notes:** +- LAN DNS will fail during migration +- Configure backup DNS on clients first + +**Procedure:** + +```bash +# 1. Add temporary DNS to DHCP (e.g., 1.1.1.1) +# Or have clients use secondary DNS server + +# 2. Quick migration +ssh h003 'systemctl stop adguardhome' +rsync -avz --progress h003:/var/lib/AdGuardHome/ newhost:/var/lib/AdGuardHome/ +ssh newhost 'systemctl start adguardhome' + +# 3. Update DHCP to point to new host + +# 4. Verify DNS resolution +dig @new-host-ip google.com +``` + +**Downtime:** 2-3 minutes (clients use backup DNS). 
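The service-specific procedures above all share the same shape: initial sync, stop, final sync, start. A hedged sketch of a generic helper — host names, unit names, and paths are placeholders; the function only prints the command sequence so it can be reviewed before anything is run:

```shell
#!/usr/bin/env bash
# migrate.sh -- print the warm-migration command sequence for review (sketch).
set -euo pipefail

migration_plan() {
  local old="$1" new="$2" unit="$3" path="$4"
  cat <<EOF
rsync -avz --progress ${old}:${path}/ ${new}:${path}/
ssh ${old} 'systemctl stop ${unit}'
rsync -avz --progress ${old}:${path}/ ${new}:${path}/
ssh ${new} 'systemctl start ${unit}'
EOF
}

# Example: migration_plan h003 newhost adguardhome /var/lib/AdGuardHome
```

Printing rather than executing keeps the helper safe to run anywhere; pipe the output to `bash` (or copy-paste line by line) once it looks right.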
+ +--- + +## Reverse Proxy Updates + +When migrating services proxied through O001: + +### Current Proxy Mappings (O001 nginx.nix) + +| Domain | Backend | +|--------|---------| +| chat.joshuabell.xyz | 100.64.0.13 (H001) | +| git.joshuabell.xyz | 100.64.0.13 (H001) | +| notes.joshuabell.xyz | 100.64.0.13 (H001) | +| sec.joshuabell.xyz | 100.64.0.13 (H001) | +| sso.joshuabell.xyz | 100.64.0.13 (H001) | +| llm.joshuabell.xyz | 100.64.0.13:8095 (H001) | + +### Updating Proxy + +1. Edit `hosts/oracle/o001/nginx.nix` +2. Change `proxyPass` to new Tailscale IP +3. Commit and push +4. `ssh o001 'cd /etc/nixos && git pull && nixos-rebuild switch'` + +Or for faster updates without commit: + +```bash +# Quick test (non-persistent) +ssh o001 'sed -i "s/100.64.0.13/100.64.0.XX/g" /etc/nginx/nginx.conf && nginx -s reload' + +# Then update flake and rebuild properly +``` + +--- + +## Rollback Procedures + +If migration fails: + +### Quick Rollback + +```bash +# 1. Stop on new host +ssh newhost 'systemctl stop myservice' + +# 2. Start on old host (state should still be there) +ssh oldhost 'systemctl start myservice' + +# 3. 
Revert proxy changes +ssh proxy 'nixos-rebuild switch --rollback' +``` + +### If Old State Was Deleted + +```bash +# Restore from backup +restic -r /backup/oldhost restore latest --target / --include /var/lib/myservice + +# Start service +systemctl start myservice + +# Revert proxy +``` + +--- + +## Post-Migration Checklist + +- [ ] Service responds correctly +- [ ] Authentication works (if applicable) +- [ ] Data integrity verified +- [ ] Monitoring updated to new host +- [ ] DNS/proxy pointing to new location +- [ ] Old host state cleaned up (after grace period) +- [ ] Backup job updated for new location +- [ ] Documentation updated + +--- + +## Common Issues + +### "Permission denied" on New Host + +```bash +# Ensure correct ownership +chown -R serviceuser:servicegroup /var/lib/myservice + +# Check SELinux/AppArmor if applicable +``` + +### Service Can't Connect to Database + +```bash +# Verify PostgreSQL is running +systemctl status postgresql + +# Check connection settings +cat /var/lib/myservice/config.yaml | grep -i database +``` + +### SSL Certificate Issues + +```bash +# Certificates are tied to domain, not host +# Should work automatically if domain unchanged + +# If issues, force ACME renewal +systemctl restart acme-myservice.joshuabell.xyz.service +``` + +### Tailscale IP Changed + +```bash +# Get new Tailscale IP +tailscale ip -4 + +# Update all references to old IP +grep -r "100.64.0.XX" /etc/nixos/ +``` diff --git a/hosts/lio/flake.lock b/hosts/lio/flake.lock index 7059e985..4a6ef8e8 100644 --- a/hosts/lio/flake.lock +++ b/hosts/lio/flake.lock @@ -1303,11 +1303,11 @@ "nixpkgs": "nixpkgs_4" }, "locked": { - "lastModified": 1769556375, - "narHash": "sha256-Ne2wFTs2fPyyDUIqy/XiYUmnqs6aaNE8/JA6BVBP+Ow=", + "lastModified": 1769663859, + "narHash": "sha256-5otcBVNOhDtHjHoDzRSf1iN2/wToLGRgNppx44tlRj4=", "owner": "anomalyco", "repo": "opencode", - "rev": "15ffd3cba1d3bd7d4d84c6911623a9c1d19e6647", + "rev": "41ea4694db7636ba184d238fd2a00deb770f9c0b", "type": 
"github" }, "original": { diff --git a/ideas/impermanence_everywhere.md b/ideas/impermanence_everywhere.md new file mode 100644 index 00000000..17715134 --- /dev/null +++ b/ideas/impermanence_everywhere.md @@ -0,0 +1,456 @@ +# Impermanence Rollout Strategy + +## Overview + +This document covers rolling out impermanence (ephemeral root filesystem) to all hosts, using Juni as the template. + +## What is Impermanence? + +**Philosophy:** Root filesystem (`/`) is wiped on every boot (tmpfs or reset subvolume), forcing you to explicitly declare what state to persist. + +**Benefits:** +- Clean system by default - no accumulated cruft +- Forces documentation of important state +- Easy rollback (just reboot) +- Security (ephemeral root limits persistence of compromises) +- Reproducible server state + +## Current State + +| Host | Impermanence | Notes | +|------|--------------|-------| +| Juni | ✅ Implemented | bcachefs with @root/@persist subvolumes | +| H001 | ❌ Traditional | Most complex - many services | +| H002 | ❌ Traditional | NAS - may not need impermanence | +| H003 | ❌ Traditional | Router - good candidate | +| O001 | ❌ Traditional | Gateway - good candidate | +| L001 | ❌ Traditional | Headscale - good candidate | + +## Juni's Implementation (Reference) + +### Filesystem Layout + +``` +bcachefs (5 devices, 2x replication) +├── @root # Ephemeral - reset each boot +├── @nix # Persistent - Nix store +├── @persist # Persistent - bind mounts for state +└── @snapshots # Automatic snapshots +``` + +### Boot Process + +1. Create snapshot of @root before reset +2. Reset @root subvolume (or recreate) +3. Boot into clean system +4. 
Bind mount persisted paths from @persist + +### Persisted Paths (Juni) + +```nix +environment.persistence."/persist" = { + hideMounts = true; + + directories = [ + "/var/log" + "/var/lib/nixos" + "/var/lib/systemd" + "/var/lib/tailscale" + "/var/lib/flatpak" + "/etc/NetworkManager/system-connections" + ]; + + files = [ + "/etc/machine-id" + "/etc/ssh/ssh_host_ed25519_key" + "/etc/ssh/ssh_host_ed25519_key.pub" + "/etc/ssh/ssh_host_rsa_key" + "/etc/ssh/ssh_host_rsa_key.pub" + ]; + + users.josh = { + directories = [ + ".ssh" + ".gnupg" + "projects" + ".config" + ".local/share" + ]; + }; +}; +``` + +### Custom Tooling + +Juni has `bcache-impermanence` with commands: +- `ls` - List snapshots +- `gc` - Garbage collect old snapshots +- `diff` - Show changes since last boot (auto-excludes persisted paths) + +Retention policy: 5 recent + 1/week for 4 weeks + 1/month + +--- + +## Common Pain Point: Finding What Needs Persistence + +> "I often have issues adding new persistent layers and knowing what I need to add" + +### Discovery Workflow + +#### Method 1: Use the Diff Tool + +Before rebooting after installing new software: + +```bash +# On Juni +bcache-impermanence diff +``` + +This shows files created/modified outside persisted paths. 
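On hosts that don't have `bcache-impermanence` yet, the same idea can be approximated by filtering a file listing against the persisted prefixes. A small sketch — the prefix list here is illustrative only; substitute the host's actual `environment.persistence` directories:

```shell
#!/usr/bin/env bash
# unpersisted.sh -- drop paths that live under a persisted prefix (sketch).
set -euo pipefail

# Illustrative prefixes; replace with the host's real persistence list.
PERSISTED="/nix /persist /var/log /var/lib/nixos /var/lib/tailscale"

unpersisted() {
  while IFS= read -r file; do
    keep=true
    for p in $PERSISTED; do
      case "$file" in "$p"/*|"$p") keep=false; break ;; esac
    done
    if $keep; then printf '%s\n' "$file"; fi
  done
}

# Example: find /var /etc -xdev -type f -mmin -60 2>/dev/null | unpersisted
```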
+ +#### Method 2: Boot and Observe Failures + +```bash +# After reboot, check for failures +journalctl -b | grep -i "no such file" +journalctl -b | grep -i "failed to" +journalctl -b | grep -i "permission denied" +``` + +#### Method 3: Monitor File Changes + +```bash +# Before making changes +find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -n > /tmp/before.txt + +# After running services +find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -n > /tmp/after.txt + +# Compare +diff /tmp/before.txt /tmp/after.txt +``` + +#### Method 4: Service-Specific Patterns + +Most services follow predictable patterns: + +| Pattern | Example | Usually Needs Persistence | +|---------|---------|---------------------------| +| `/var/lib/${service}` | `/var/lib/postgresql` | Yes | +| `/var/cache/${service}` | `/var/cache/nginx` | Usually no | +| `/var/log/${service}` | `/var/log/nginx` | Optional | +| `/etc/${service}` | `/etc/nginx` | Only if runtime-generated | + +--- + +## Server Impermanence Template + +### Minimal Server Persistence + +```nix +environment.persistence."/persist" = { + hideMounts = true; + + directories = [ + # Core system + "/var/lib/nixos" # NixOS state DB + "/var/lib/systemd/coredump" + "/var/log" + + # Network + "/var/lib/tailscale" + "/etc/NetworkManager/system-connections" + + # ACME certificates + "/var/lib/acme" + ]; + + files = [ + "/etc/machine-id" + "/etc/ssh/ssh_host_ed25519_key" + "/etc/ssh/ssh_host_ed25519_key.pub" + "/etc/ssh/ssh_host_rsa_key" + "/etc/ssh/ssh_host_rsa_key.pub" + ]; +}; +``` + +### Per-Host Additions + +#### H001 (Services) + +```nix +environment.persistence."/persist".directories = [ + # Add to minimal template: + "/var/lib/forgejo" + "/var/lib/zitadel" + "/var/lib/openbao" + "/bao-keys" + "/var/lib/trilium" + "/var/lib/opengist" + "/var/lib/open-webui" + "/var/lib/n8n" + "/var/lib/nixarr/state" + "/var/lib/containers" # Podman/container state +]; +``` + +#### O001 (Gateway) + +```nix 
+environment.persistence."/persist".directories = [ + # Add to minimal template: + "/var/lib/vaultwarden" + "/var/lib/postgresql" + "/var/lib/fail2ban" +]; +``` + +#### L001 (Headscale) + +```nix +environment.persistence."/persist".directories = [ + # Add to minimal template: + "/var/lib/headscale" +]; +``` + +#### H003 (Router) + +```nix +environment.persistence."/persist".directories = [ + # Add to minimal template: + "/var/lib/AdGuardHome" + "/var/lib/dnsmasq" +]; + +environment.persistence."/persist".files = [ + # Add to minimal template: + "/boot/keyfile_nvme0n1p1" # LUKS key - CRITICAL +]; +``` + +--- + +## Rollout Strategy + +### Phase 1: Lowest Risk (VPS Hosts) + +Start with L001 and O001: +- Easy to rebuild from scratch if something goes wrong +- Smaller state footprint +- Good practice before tackling complex hosts + +**L001 Steps:** +1. Back up `/var/lib/headscale/` +2. Add impermanence module +3. Test on spare VPS first +4. Migrate + +**O001 Steps:** +1. Back up Vaultwarden and PostgreSQL +2. Add impermanence module +3. Test carefully (Vaultwarden is critical!) + +### Phase 2: Router (H003) + +H003 is medium complexity: +- Relatively small state +- But critical for network (test during maintenance window) +- LUKS keyfile needs special handling + +### Phase 3: Complex Host (H001) + +H001 is most complex due to: +- Multiple containerized services +- Database state in containers +- Many stateful applications + +**Approach:** +1. Inventory all state paths (see backup docs) +2. Test with snapshot before committing +3. Gradual rollout with extensive persistence list +4. 
May need to persist more than expected initially + +### Phase 4: NAS (H002) - Maybe Skip + +H002 may not benefit from impermanence: +- Primary purpose is persistent data storage +- bcachefs replication already provides redundancy +- Impermanence adds complexity without clear benefit + +--- + +## Filesystem Options + +### Option A: bcachefs with Subvolumes (Like Juni) + +**Pros:** +- Flexible, modern +- Built-in snapshots +- Replication support + +**Setup:** +```nix +fileSystems = { + "/" = { + device = "/dev/disk/by-label/nixos"; + fsType = "bcachefs"; + options = [ "subvol=@root" ]; + }; + "/nix" = { + device = "/dev/disk/by-label/nixos"; + fsType = "bcachefs"; + options = [ "subvol=@nix" ]; + }; + "/persist" = { + device = "/dev/disk/by-label/nixos"; + fsType = "bcachefs"; + options = [ "subvol=@persist" ]; + neededForBoot = true; + }; +}; +``` + +### Option B: BTRFS with Subvolumes + +Similar to bcachefs but more mature: + +```nix +# Reset @root on boot +boot.initrd.postDeviceCommands = lib.mkAfter '' + mkdir -p /mnt + mount -o subvol=/ /dev/disk/by-label/nixos /mnt + btrfs subvolume delete /mnt/@root + btrfs subvolume create /mnt/@root + umount /mnt +''; +``` + +### Option C: tmpfs Root + +Simplest but uses RAM: + +```nix +fileSystems."/" = { + device = "none"; + fsType = "tmpfs"; + options = [ "defaults" "size=2G" "mode=755" ]; +}; +``` + +**Best for:** VPS hosts with limited disk but adequate RAM. + +--- + +## Troubleshooting + +### Service Fails After Reboot + +```bash +# Check what's missing +journalctl -xeu servicename + +# Common fixes: +# 1. Add /var/lib/servicename to persistence +# 2. Ensure directory permissions are correct +# 3. Check if service expects specific files in /etc +``` + +### "No such file or directory" Errors + +```bash +# Find what's missing +journalctl -b | grep "No such file" + +# Add missing paths to persistence +``` + +### Slow Boot (Too Many Bind Mounts) + +If you have many persisted paths, consider: +1. 
Consolidating related paths +2. Using symlinks instead of bind mounts for some paths +3. Persisting parent directories instead of many children + +### Container State Issues + +Containers may have their own state directories: + +```nix +# For NixOS containers +environment.persistence."/persist".directories = [ + "/var/lib/nixos-containers" +]; + +# For Podman +environment.persistence."/persist".directories = [ + "/var/lib/containers/storage/volumes" + # NOT overlay - that's regenerated +]; +``` + +--- + +## Tooling Improvements + +### Automated Discovery Script + +Create a helper that runs periodically to detect unpersisted changes: + +```bash +#!/usr/bin/env bash +# /usr/local/bin/impermanence-check + +# Get list of persisted paths +PERSISTED=$(nix eval --raw '.#nixosConfigurations.hostname.config.environment.persistence."/persist".directories' 2>/dev/null | tr -d '[]"' | tr ' ' '\n') + +# Find modified files outside persisted paths +find / -xdev -type f -mmin -60 2>/dev/null | while read -r file; do + is_persisted=false + for path in $PERSISTED; do + if [[ "$file" == "$path"* ]]; then + is_persisted=true + break + fi + done + if ! 
$is_persisted; then + echo "UNPERSISTED: $file" + fi +done +``` + +### Pre-Reboot Check + +Add to your workflow: + +```bash +# Before rebooting +bcache-impermanence diff # or custom script + +# Review changes, add to persistence if needed, then reboot +``` + +--- + +## Action Items + +### Immediate +- [ ] Document all state paths for each host (see backup docs) +- [ ] Create shared impermanence module in flake + +### Phase 1 (L001/O001) +- [ ] Back up current state +- [ ] Add impermanence to L001 +- [ ] Test thoroughly +- [ ] Roll out to O001 + +### Phase 2 (H003) +- [ ] Plan maintenance window +- [ ] Add impermanence to H003 +- [ ] Verify LUKS key persistence + +### Phase 3 (H001) +- [ ] Complete state inventory +- [ ] Test with extensive persistence list +- [ ] Gradual rollout diff --git a/ideas/openbao_secrets_migration.md b/ideas/openbao_secrets_migration.md new file mode 100644 index 00000000..4eed19c0 --- /dev/null +++ b/ideas/openbao_secrets_migration.md @@ -0,0 +1,208 @@ +# OpenBao Secrets Migration + +## Overview + +This document covers migrating from ragenix (age-encrypted secrets) to OpenBao for centralized secret management, enabling zero-config machine onboarding. + +## Goals + +1. **Zero-config machine onboarding**: New machine = install NixOS + add Zitadel machine key + done +2. **Eliminate re-keying workflow**: No more updating secrets.nix and re-encrypting .age files for each new machine +3. **Runtime secret dependencies**: Services wait for secrets via systemd, not build-time conditionals +4. **Consolidated SSH keys**: Use single `nix2nix` key for all NixOS machine SSH (keep `nix2t` for work) +5. **Declarative policy management**: OpenBao policies auto-applied after unseal with reconciliation +6. **Directional Tailscale ACLs**: Restrict work machine from reaching NixOS hosts (one-way access) +7. 
**Per-host variable registry**: `_variables.nix` pattern for ports/UIDs/GIDs to prevent conflicts + +## Current State + +### Ragenix Secrets in Use (21 active) + +**SSH Keys (for client auth):** +- nix2github, nix2bitbucket, nix2gitforgejo +- nix2nix (shared), nix2t (work - keep separate) +- nix2lio (remote builds), nix2oren, nix2gpdPocket3 +- nix2h001, nix2h003, nix2linode, nix2oracle + +**API Tokens:** +- github_read_token (Nix private repo access) +- linode_rw_domains (ACME DNS challenge) +- litellm_public_api_key (nginx auth) + +**VPN:** +- headscale_auth (Tailscale auth) +- us_chi_wg (NixArr WireGuard) + +**Application Secrets:** +- oauth2_proxy_key_file +- openwebui_env +- zitadel_master_key +- vaultwarden_env + +**Skipping (unused):** +- nix2h002, nix2joe, nix2l002, nix2gitjosh, obsidian_sync_env + +### Already Migrated to OpenBao (juni) +- headscale_auth, atuin-key-josh, 12 SSH keys + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ New Machine Onboarding │ +├─────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. Install NixOS with full config │ +│ - All services defined but waiting on secrets │ +│ │ +│ 2. Create Zitadel machine user + copy key │ +│ - /machine-key.json → JWT auth to OpenBao │ +│ │ +│ 3. vault-agent fetches secrets │ +│ - kv/data/machines/home_roaming/* → /var/lib/openbao-secrets│ +│ │ +│ 4. systemd dependencies resolve │ +│ - secret-watcher completes → hardDepend services start │ +│ │ +│ 5. Machine fully operational │ +│ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## Key Design Decisions + +### Secret Path Convention + +``` +kv/data/machines/ +├── home_roaming/ # Shared across all NixOS machines +│ ├── nix2nix # SSH key +│ ├── nix2github # SSH key +│ ├── headscale_auth # Tailscale auth +│ └── ... +├── home/ # h001-specific (not roaming) +│ ├── linode_rw_domains +│ ├── zitadel_master_key +│ └── ... 
+└── oracle/ # o001-specific + ├── vaultwarden_env + └── ... +``` + +### Runtime Dependencies vs Build-Time Conditionals + +**Before (ragenix pattern - bad for onboarding):** +```nix +let hasSecret = name: (config.age.secrets or {}) ? ${name}; +in { + config = lib.mkIf (hasSecret "openwebui_env") { + services.open-webui.enable = true; + }; +} +``` + +**After (OpenBao pattern - zero-config onboarding):** +```nix +ringofstorms.secretsBao.secrets.openwebui_env = { + kvPath = "kv/data/machines/home_roaming/openwebui_env"; + hardDepend = [ "open-webui" ]; # Service waits for secret at runtime + configChanges.services.open-webui = { + enable = true; + environmentFile = "$SECRET_PATH"; + }; +}; +``` + +### Per-Host File Structure + +``` +hosts/h001/ +├── _variables.nix # Ports, UIDs, GIDs - single source of truth +├── secrets.nix # All secrets + their configChanges +├── flake.nix # Imports, basic host config +├── nginx.nix # Pure config (no conditionals) +└── mods/ + ├── openbao-policies.nix # Auto-apply after unseal + └── ... 
+``` + +### OpenBao Policy Management + +Policies auto-apply after unseal with full reconciliation: + +```nix +# openbao-policies.nix +let + policies = { + machines = '' + path "kv/data/machines/home_roaming/*" { + capabilities = ["read", "list"] + } + ''; + }; + reservedPolicies = [ "default" "root" ]; +in { + systemd.services.openbao-apply-policies = { + after = [ "openbao-auto-unseal.service" ]; + requires = [ "openbao-auto-unseal.service" ]; + wantedBy = [ "multi-user.target" ]; + # Script: apply all policies, delete orphans not in config + }; +} +``` + +### Headscale ACL Policy + +Directional access control: + +```nix +# nix machines: full mesh access +{ action = "accept"; src = ["group:nix-machines"]; dst = ["group:nix-machines:*"]; } + +# nix machines → work: full access +{ action = "accept"; src = ["group:nix-machines"]; dst = ["tag:work:*"]; } + +# work → nix machines: LIMITED (only specific ports) +{ action = "accept"; src = ["tag:work"]; dst = ["h001:22,443"]; } +``` + +## Implementation Phases + +### Phase 1: SSH Key Preparation +- [ ] Add nix2nix SSH key to all hosts authorized_keys (alongside existing) +- [ ] Deploy with `nh os switch` to all hosts + +### Phase 2: Infrastructure +- [ ] Create `_variables.nix` pattern for h001 (pilot) +- [ ] Create `openbao-policies.nix` with auto-apply + reconciliation +- [ ] Create `headscale-policy.nix` with directional ACLs +- [ ] Create per-host `secrets.nix` pattern + +### Phase 3: Secret Migration +- [ ] Migrate h001 secrets (linode_rw_domains, us_chi_wg, oauth2_proxy_key_file, openwebui_env, zitadel_master_key) +- [ ] Migrate o001 secrets (vaultwarden_env, litellm_public_api_key) +- [ ] Migrate common modules (tailnet, ssh, nix_options) +- [ ] Migrate SSH client keys + +### Phase 4: Consumer Updates +- [ ] Update ssh.nix to use OpenBao paths +- [ ] Remove hasSecret conditionals from all modules +- [ ] Remove ragenix imports and secrets flake + +### Phase 5: Testing & Finalization +- [ ] Populate all secrets in 
OpenBao KV store +- [ ] Test onboarding workflow on fresh VM +- [ ] Document new machine onboarding process + +## Related Ideas + +- `impermanence_everywhere.md` - Impermanence persists `/var/lib/openbao-secrets` and `/machine-key.json` +- `resilience.md` - OpenBao server (h001) is a SPOF; consider backup/failover +- `service_backups.md` - `/var/lib/openbao` and `/bao-keys` need backup + +## Notes + +- OpenBao hosted on h001 at sec.joshuabell.xyz +- JWT auth via Zitadel machine users +- vault-agent on each host fetches secrets +- `sec` CLI tool available for manual lookups diff --git a/ideas/resilience.md b/ideas/resilience.md new file mode 100644 index 00000000..f158cf83 --- /dev/null +++ b/ideas/resilience.md @@ -0,0 +1,347 @@ +# Infrastructure Resilience & Failover + +## Overview + +This document covers strategies for eliminating single points of failure and improving infrastructure resilience. + +## Current Architecture + +``` + INTERNET + │ + ┌─────────┴─────────┐ + │ │ + ┌─────▼─────┐ ┌──────▼──────┐ + │ O001 │ │ L001 │ + │ (Oracle) │ │ (Linode) │ + │ nginx │ │ Headscale │ + │ +vault │ │ (SPOF!) │ + │ +atuin │ └──────┬──────┘ + │ (SPOF!) │ │ + └─────┬─────┘ │ + │ Tailscale Mesh + │ ┌───────────┴───────────┐ + │ │ │ + ┌─────▼───────▼─────┐ ┌──────▼──────┐ + │ H001 │ │ H003 │ + │ (Service Host) │ │ (Router) │ + │ Forgejo,Zitadel, │ │ AdGuard, │ + │ LiteLLM,Trilium, │ │ DHCP,NAT │ + │ NixArr,OpenWebUI │ │ (SPOF!) 
│ + └─────────┬─────────┘ └─────────────┘ + │ NFS + ┌─────────▼─────────┐ + │ H002 │ + │ (NAS - bcachefs)│ + │ Media, Data │ + └───────────────────┘ +``` + +## Critical Single Points of Failure + +| Host | Service | Impact if Down | Recovery Time | +|------|---------|----------------|---------------| +| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly | +| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM | +| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM | +| **H001** | All services | Services down but recoverable | MEDIUM | +| **H002** | NFS | Media unavailable | LOW - bcachefs has replication | + +--- + +## Reverse Proxy Resilience (O001) + +### Current Problem + +O001 is a single point of failure for all public traffic: +- No public access to any service if it dies +- DNS still points to it after failure +- ACME certs are only on that host + +### Solution Options + +#### Option A: Cloudflare Tunnel (Recommended Quick Win) + +**Pros:** +- No single server dependency +- Run `cloudflared` on multiple hosts (H001 as backup) +- Automatic failover between tunnel replicas +- Built-in DDoS protection +- No inbound ports needed + +**Cons:** +- Cannot stream media (Jellyfin) - violates Cloudflare ToS +- Adds latency +- Vendor dependency + +**Implementation:** + +```nix +# On BOTH O001 (primary) AND H001 (backup) +services.cloudflared = { + enable = true; + tunnels."joshuabell" = { + credentialsFile = config.age.secrets.cloudflared.path; + ingress = { + "chat.joshuabell.xyz" = "http://100.64.0.13:80"; + "git.joshuabell.xyz" = "http://100.64.0.13:80"; + "notes.joshuabell.xyz" = "http://100.64.0.13:80"; + "sec.joshuabell.xyz" = "http://100.64.0.13:80"; + "sso.joshuabell.xyz" = "http://100.64.0.13:80"; + "n8n.joshuabell.xyz" = "http://100.64.0.13:80"; + "blog.joshuabell.xyz" = "http://100.64.0.13:80"; + }; + }; +}; +``` + +Cloudflare automatically load balances across all active tunnel replicas. 
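Whichever ingress path is active, an external probe makes a failed replica visible quickly. A minimal sketch — the domain list is taken from the ingress block above, and `is_healthy` treating 2xx/3xx as "up" is an assumption:

```shell
#!/usr/bin/env bash
# check-endpoints.sh -- crude external health probe (sketch).
set -euo pipefail

# Assumption: any 2xx/3xx response counts as healthy.
is_healthy() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# Example (commented out -- requires network access):
# for host in chat git notes sec sso; do
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://${host}.joshuabell.xyz" || echo 000)
#   is_healthy "$code" || echo "DOWN: ${host}.joshuabell.xyz (HTTP ${code})"
# done
```

Run it from a machine outside the tunnel (or a cheap external monitor) so the probe actually exercises the public path rather than the Tailscale mesh.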
+ +#### Option B: DNS Failover with Health Checks + +Use Cloudflare DNS with health checks: +- Point `joshuabell.xyz` to both O001 and a backup +- Cloudflare removes unhealthy IPs automatically +- Requires Cloudflare paid plan for load balancing + +#### Option C: Tailscale Funnel + +Expose services directly without O001: + +```bash +# On H001 +tailscale funnel 443 +``` + +Exposes H001 directly at `https://h001.net.joshuabell.xyz` + +**Pros:** +- No proxy needed +- Per-service granularity +- Automatic HTTPS + +**Cons:** +- Uses `ts.net` domain (no custom domain) +- Limited to ports 443, 8443, 10000 + +#### Option D: Manual Failover with Shared Config + +Keep H001 ready to take over O001's role: +1. Same nginx config via shared NixOS module +2. Use DNS-01 ACME challenge (certs work on any host) +3. Update DNS when O001 fails + +### Recommended Hybrid Approach + +``` +┌─────────────────────────────────────────────────────────────┐ +│ RECOMMENDED TOPOLOGY │ +├─────────────────────────────────────────────────────────────┤ +│ │ +│ Cloudflare DNS (health-checked failover) │ +│ │ │ +│ ┌──────┴──────┐ │ +│ │ │ │ +│ ▼ ▼ │ +│ O001 ──OR── H001 (via Cloudflare Tunnel) │ +│ nginx cloudflared backup │ +│ │ +│ Jellyfin: Direct via Tailscale Funnel (bypasses O001) │ +│ Vaultwarden: Cloudflare Tunnel (survives O001 failure) │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Key Changes:** +1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage) +2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues) +3. 
   Other services via Cloudflare Tunnel with H001 as backup

---

## Headscale HA (L001)

### The Problem

L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades

### Solution Options

#### Option 1: Frequent Backups (Minimum Viable)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```

**Recovery time:** ~30 minutes to spin up a new VPS + restore

#### Option 2: Warm Standby

- Run a second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to the standby
- Update DNS to point to the standby if the primary fails

```bash
# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```

**Recovery time:** ~5 minutes (start service, update DNS)

#### Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

**Recovery time:** ~15 seconds automatic failover

#### Option 4: Use Tailscale Commercial

Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial

### Recommendation

Start with Option 1 (backups) immediately, then work toward Option 2 (warm standby) within a month.
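The daily sync in Option 2 can be automated on the standby with a small systemd timer. A sketch: the `l001` SSH alias and the assumption that key-based SSH from the standby is already set up are mine.

```nix
# Sketch: pull Headscale state from the primary once a day (runs on standby).
# Assumes passwordless SSH from the standby to l001 is already configured.
systemd.services.headscale-standby-sync = {
  description = "Pull Headscale state from primary";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.rsync pkgs.openssh ];
  script = ''
    # Caveat: rsync can catch SQLite mid-write; for a guaranteed-consistent
    # copy, dump with `sqlite3 ... ".backup"` on the primary first.
    rsync -az --delete l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-standby-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = { OnCalendar = "daily"; Persistent = true; };
};
```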

---

## Router HA (H003)

### The Problem

H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, the entire LAN loses connectivity

### Solution Options

#### Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (consider ISC DHCP with failover)

#### Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";   # or "BACKUP" on the secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;     # use a lower value on the backup
    virtualIps = [{ addr = "10.12.14.1/24"; }];
  };
};
```

#### Option 3: Router Redundancy via ISP

- Use the ISP router as a fallback gateway
- Clients get two gateways via DHCP
- Less control but automatic failover

### Recommendation

Run a secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.

---

## NFS HA (H002)

### Current State

H002 uses bcachefs with 2x replication across 5 disks. That protects against disk failure, but a single-host failure still makes the data unavailable.

### Options

#### Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:

```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"       # fail I/O after retries instead of hanging forever
    "timeo=50"   # 5-second timeout (timeo is in tenths of a second)
    "retrans=3"  # 3 retries before a soft mount gives up
    "nofail"     # don't fail boot if unavailable
  ];
};
```

#### Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```

**Overkill for a homelab**, but an option for critical data.

### Recommendation

The current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
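Building on the client options in Option 1, a systemd automount lets clients boot while H002 is down and re-mount transparently on the next access after it returns. A sketch; the mount details are copied from the example above and the idle timeout is an assumption:

```nix
# Sketch: same NFS client mount, but lazy via a systemd automount so a dead
# H002 never blocks boot and recovery happens automatically on next access.
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft" "timeo=50" "retrans=3" "nofail"
    "x-systemd.automount"         # mount on first access, not at boot
    "x-systemd.idle-timeout=600"  # unmount after 10 minutes idle
  ];
};
```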

---

## Recommended Implementation Order

### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale

### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home

### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation

---

## Monitoring & Alerting

Essential for knowing when to fail over:

```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)
```

Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails

diff --git a/ideas/service_backups.md b/ideas/service_backups.md
new file mode 100644
index 00000000..39893287
--- /dev/null
+++ b/ideas/service_backups.md
@@ -0,0 +1,340 @@

# Service Backup Strategy

## Overview

This document outlines the backup strategy for the NixOS fleet, covering critical data paths, backup tools, and recovery procedures.

## Current State

**No automated backups are running today.** This is a critical gap.
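A quick way to confirm that on each host (and to re-check after rollout) is to count backup-related timers. A hedged sketch; the unit-name patterns are assumptions:

```bash
# Audit sketch: count restic/borg timers on this host and warn when none
# are configured. Unit name patterns are assumptions.
count=$(systemctl list-timers --all --no-legend 'restic-*' 'borgbackup-*' 2>/dev/null | wc -l)
if [ "$count" -eq 0 ]; then
  echo "WARNING: no backup timers configured"
else
  echo "found $count backup timer(s)"
fi
```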

## Backup Topology

```
┌─────────────────────────────────────────────────────────────┐
│                     BACKUP TOPOLOGY                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  H001,H003,O001,L001 ──────► H002:/data/backups (primary)   │
│                       └────► B2/S3 (offsite)                │
│                                                             │
│  H002 (NAS) ───────────────► B2/S3 (offsite only)           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Critical Paths by Host

### L001 (Headscale) - HIGHEST PRIORITY

| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/headscale/` | SQLite DB with all node registrations | Small | CRITICAL |
| `/var/lib/acme/` | SSL certificates | Small | High |

**Impact if lost:** ALL mesh connectivity fails - new connections fail, devices can't rejoin.

### O001 (Oracle Gateway)

| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/vaultwarden/` | Password vault (encrypted) | ~41MB | CRITICAL |
| `/var/lib/postgresql/` | Atuin shell history | ~226MB | Medium |
| `/var/lib/acme/` | SSL certificates | Small | High |

**Impact if lost:** All public access down, password manager lost.
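Given the priorities above, L001 doesn't need to wait for the shared module below - a single hand-written restic job covers its two critical paths. A sketch; the repository location and the agenix secret name are assumptions:

```nix
# Stopgap sketch for l001: one restic job for the two critical paths.
# Repository path and secret name are assumptions.
services.restic.backups.headscale = {
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
  repository = "sftp:h002:/data/backups/l001";
  passwordFile = config.age.secrets.restic-password.path;
  initialize = true;
  pruneOpts = [ "--keep-daily 7" "--keep-weekly 4" ];
  timerConfig = { OnCalendar = "03:00:00"; Persistent = true; };
};
```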

### H001 (Services)

| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/forgejo/` | Git repos + PostgreSQL | Large | CRITICAL |
| `/var/lib/zitadel/` | SSO database + config | Medium | CRITICAL |
| `/var/lib/openbao/` | Secrets vault | Small | CRITICAL |
| `/bao-keys/` | Vault unseal keys | Tiny | CRITICAL |
| `/var/lib/trilium/` | Notes database | Medium | High |
| `/var/lib/opengist/` | Gist data | Small | Medium |
| `/var/lib/open-webui/` | AI chat history | Medium | Low |
| `/var/lib/n8n/` | Workflows | Medium | Medium |
| `/var/lib/acme/` | SSL certificates | Small | High |
| `/var/lib/nixarr/state/` | Media manager configs | Small | Medium |

**Note:** A 154GB backup exists at `/var/lib/forgejo.tar.gz` - this is manual and should be automated.

### H003 (Router)

| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/var/lib/AdGuardHome/` | DNS filtering config + stats | Medium | High |
| `/boot/keyfile_nvme0n1p1` | LUKS encryption key | Tiny | CRITICAL |

**WARNING:** The LUKS keyfile must be stored separately in a secure location (e.g., Vaultwarden).

### H002 (NAS)

| Path | Description | Size | Priority |
|------|-------------|------|----------|
| `/data/nixarr/media/` | Movies, TV, music, books | Very Large | Low (replaceable) |
| `/data/pinchflat/` | YouTube downloads | Large | Low |

**Note:** bcachefs already has 2x replication. Offsite backup is optional but recommended for irreplaceable data.

## Recommended Backup Tool: Restic

### Why Restic?

- Modern, encrypted, deduplicated backups
- Native NixOS module: `services.restic.backups`
- Multiple backend support (local, S3, B2, SFTP)
- Incremental backups with deduplication
- Easy pruning/retention policies

### Shared Backup Module

Create a shared module at `modules/backup.nix`:

```nix
{ config, lib, pkgs, ...
}:

with lib;
let
  cfg = config.my.backup;
in {
  options.my.backup = {
    enable = mkEnableOption "restic backups";
    paths = mkOption { type = types.listOf types.str; default = []; };
    exclude = mkOption { type = types.listOf types.str; default = []; };
    postgresBackup = mkOption { type = types.bool; default = false; };
  };

  config = mkIf cfg.enable {
    # PostgreSQL dumps before backup
    services.postgresqlBackup = mkIf cfg.postgresBackup {
      enable = true;
      location = "/var/backup/postgresql";
      compression = "zstd";
      startAt = "02:00:00";
    };

    services.restic.backups = {
      daily = {
        paths = cfg.paths ++ (optional cfg.postgresBackup "/var/backup/postgresql");
        exclude = cfg.exclude ++ [
          "**/cache/**"
          "**/Cache/**"
          "**/.cache/**"
          "**/tmp/**"
        ];

        # Primary: NFS to H002
        repository = "/nfs/h002/backups/${config.networking.hostName}";

        passwordFile = config.age.secrets.restic-password.path;
        initialize = true;

        pruneOpts = [
          "--keep-daily 7"
          "--keep-weekly 4"
          "--keep-monthly 6"
        ];

        timerConfig = {
          OnCalendar = "03:00:00";
          RandomizedDelaySec = "1h";
          Persistent = true;
        };

        backupPrepareCommand = ''
          # Ensure NFS is mounted
          mount | grep -q "/nfs/h002" || mount /nfs/h002
        '';
      };

      # Offsite to B2/S3 (less frequent)
      offsite = {
        paths = cfg.paths;
        repository = "b2:joshuabell-backups:${config.networking.hostName}";
        passwordFile = config.age.secrets.restic-password.path;
        environmentFile = config.age.secrets.b2-credentials.path;

        pruneOpts = [
          "--keep-daily 3"
          "--keep-weekly 2"
          "--keep-monthly 3"
        ];

        timerConfig = {
          OnCalendar = "weekly";
          Persistent = true;
        };
      };
    };
  };
}
```

### Per-Host Configuration

#### L001 (Headscale)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```

#### O001 (Oracle)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/vaultwarden" "/var/lib/acme" ];
  postgresBackup = true;
  # For Atuin
};
```

#### H001 (Services)

```nix
my.backup = {
  enable = true;
  paths = [
    "/var/lib/forgejo"
    "/var/lib/zitadel"
    "/var/lib/openbao"
    "/bao-keys"
    "/var/lib/trilium"
    "/var/lib/opengist"
    "/var/lib/open-webui"
    "/var/lib/n8n"
    "/var/lib/acme"
    "/var/lib/nixarr/state"
  ];
};
```

#### H003 (Router)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/AdGuardHome" ];
  # LUKS key backed up separately to Vaultwarden
};
```

## Database Backup Best Practices

### For Containerized PostgreSQL (Forgejo/Zitadel)

```nix
systemd.services.container-forgejo-backup = {
  script = ''
    nixos-container run forgejo -- pg_dumpall -U forgejo \
      | ${pkgs.zstd}/bin/zstd > /var/lib/forgejo/backups/db-$(date +%Y%m%d).sql.zst
  '';
  startAt = "02:30:00"; # Before restic runs at 03:00
};
```

### For Direct PostgreSQL

```nix
services.postgresqlBackup = {
  enable = true;
  backupAll = true;
  location = "/var/backup/postgresql";
  compression = "zstd";
  startAt = "*-*-* 02:00:00";
};
```

## Recovery Procedures

### Restoring from Restic

```bash
# List snapshots
restic -r /path/to/repo snapshots

# Restore a specific snapshot
restic -r /path/to/repo restore abc123 --target /restore

# Restore the latest snapshot
restic -r /path/to/repo restore latest --target /restore

# Restore a specific path
restic -r /path/to/repo restore latest \
  --target /restore \
  --include /var/lib/postgresql

# Mount for browsing
mkdir /mnt/restic
restic -r /path/to/repo mount /mnt/restic
```

### PostgreSQL Recovery

```bash
# Stop PostgreSQL
systemctl stop postgresql

# Restore from restic
restic -r /path/to/repo restore latest --target / --include /var/lib/postgresql

# Or from a SQL dump
sudo -u postgres psql < /restore/all-databases.sql

# Start PostgreSQL
systemctl start postgresql
```

## Backup Verification

Add automated verification:

```nix
systemd.timers.restic-verify = {
  wantedBy = [ "timers.target" ];
  timerConfig
= {
    OnCalendar = "weekly";
    Persistent = true;
  };
};

systemd.services.restic-verify = {
  script = ''
    ${pkgs.restic}/bin/restic -r /path/to/repo check --read-data-subset=5%
  '';
};
```

## Monitoring & Alerting

```nix
# Alert on backup failure. OnFailure is a [Unit] directive, so it belongs in
# unitConfig (putting it in serviceConfig lands it in the wrong section).
systemd.services."restic-backups-daily".unitConfig.OnFailure = "notify-failure@%n.service";

systemd.services."notify-failure@" = {
  serviceConfig.Type = "oneshot";
  # Specifiers like %i are not expanded inside the script body, so pass the
  # failed unit's instance name in as an argument instead.
  scriptArgs = "%i";
  script = ''
    ${pkgs.curl}/bin/curl -X POST https://ntfy.sh/joshuabell-backups \
      -H "Title: Backup Failed" \
      -d "Service: $1 on ${config.networking.hostName}"
  '';
};
```

## Action Items

### Immediate (This Week)
- [ ] Set up restic backups for L001 (Headscale) - most critical
- [ ] Back up H003's LUKS keyfile to Vaultwarden
- [ ] Create `/data/backups/` directory on H002

### Short-Term (This Month)
- [ ] Implement shared backup module
- [ ] Deploy to all hosts
- [ ] Set up offsite B2 bucket

### Medium-Term
- [ ] Automated backup verification
- [ ] Monitoring/alerting integration
- [ ] Test recovery procedures
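For the "test recovery procedures" item, the drill itself can be a timer, so restores get exercised on a schedule rather than only during incidents. A sketch; the unit names, target directory, and repository path are assumptions:

```nix
# Sketch: quarterly restore drill - prove snapshots are restorable, not just
# present. Unit names, paths, and the repository location are assumptions.
systemd.services.restic-restore-drill = {
  serviceConfig.Type = "oneshot";
  path = [ pkgs.restic ];
  environment.RESTIC_PASSWORD_FILE = config.age.secrets.restic-password.path;
  environment.RESTIC_REPOSITORY = "/nfs/h002/backups/${config.networking.hostName}";
  script = ''
    rm -rf /var/tmp/restore-drill
    restic restore latest --target /var/tmp/restore-drill
    # Fail loudly if the restore produced nothing
    test -n "$(ls -A /var/tmp/restore-drill)"
  '';
};
systemd.timers.restic-restore-drill = {
  wantedBy = [ "timers.target" ];
  timerConfig = { OnCalendar = "quarterly"; Persistent = true; };
};
```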