Add migrating services guide and idea drafts, update flake.lock
# Infrastructure Resilience & Failover

## Overview

This document covers strategies for eliminating single points of failure and improving infrastructure resilience.

## Current Architecture

```
            INTERNET
                │
      ┌─────────┴─────────┐
      │                   │
┌─────▼─────┐      ┌──────▼──────┐
│   O001    │      │    L001     │
│ (Oracle)  │      │  (Linode)   │
│  nginx    │      │  Headscale  │
│  +vault   │      │   (SPOF!)   │
│  +atuin   │      └──────┬──────┘
│  (SPOF!)  │             │
└─────┬─────┘             │
      │              Tailscale Mesh
      │      ┌───────────┴───────────┐
      │      │                       │
┌─────▼───────▼─────┐          ┌──────▼──────┐
│       H001        │          │    H003     │
│  (Service Host)   │          │  (Router)   │
│ Forgejo,Zitadel,  │          │  AdGuard,   │
│ LiteLLM,Trilium,  │          │  DHCP,NAT   │
│ NixArr,OpenWebUI  │          │   (SPOF!)   │
└─────────┬─────────┘          └─────────────┘
          │ NFS
┌─────────▼─────────┐
│       H002        │
│ (NAS - bcachefs)  │
│    Media, Data    │
└───────────────────┘
```

## Critical Single Points of Failure

| Host | Service | Impact if Down | Recovery Time |
|------|---------|----------------|---------------|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| **H001** | All services | Services down but recoverable | MEDIUM |
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |

---

## Reverse Proxy Resilience (O001)

### Current Problem

O001 is a single point of failure for all public traffic:
- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host

### Solution Options

#### Option A: Cloudflare Tunnel (Recommended Quick Win)

**Pros:**
- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed

**Cons:**
- Cannot stream media (Jellyfin) through the proxy without violating Cloudflare's ToS
- Adds latency
- Vendor dependency

**Implementation:**

```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    default = "http_status:404"; # cloudflared requires a catch-all rule
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
  };
};
```

Cloudflare automatically load balances across all active tunnel replicas.

#### Option B: DNS Failover with Health Checks

Use Cloudflare DNS with health checks:
- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires a paid Cloudflare plan for load balancing

#### Option C: Tailscale Funnel

Expose services directly without O001:

```bash
# On H001
tailscale funnel 443
```

Exposes H001 directly at its `ts.net` Funnel hostname

**Pros:**
- No proxy needed
- Per-service granularity
- Automatic HTTPS

**Cons:**
- Uses `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, and 10000

#### Option D: Manual Failover with Shared Config

Keep H001 ready to take over O001's role:
1. Same nginx config via shared NixOS module
2. Use DNS-01 ACME challenge (certs work on any host)
3. Update DNS when O001 fails

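Step 2 could look like the following sketch on both hosts. It assumes Cloudflare hosts the DNS zone and that an agenix secret (here called `cloudflare-dns`, an illustrative name) holds an API token with DNS-edit rights:

```nix
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@joshuabell.xyz"; # illustrative contact address
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    dnsProvider = "cloudflare";
    # env file containing CLOUDFLARE_DNS_API_TOKEN=...; secret name is illustrative
    credentialsFile = config.age.secrets.cloudflare-dns.path;
  };
};
```

Because the challenge is answered via DNS records rather than inbound HTTP, either host can renew and serve the same certs, so failover reduces to the DNS update in step 3.
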
### Recommended Hybrid Approach

```
┌─────────────────────────────────────────────────────────────┐
│                    RECOMMENDED TOPOLOGY                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│        Cloudflare DNS (health-checked failover)             │
│                       │                                     │
│                ┌──────┴──────┐                              │
│                │             │                              │
│                ▼             ▼                              │
│              O001  ──OR──  H001 (via Cloudflare Tunnel)     │
│             nginx          cloudflared backup               │
│                                                             │
│  Jellyfin:    Direct via Tailscale Funnel (bypasses O001)   │
│  Vaultwarden: Cloudflare Tunnel (survives O001 failure)     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Key Changes:**
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
3. Other services via Cloudflare Tunnel with H001 as backup

---

## Headscale HA (L001)

### The Problem

L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades

### Solution Options

#### Option 1: Frequent Backups (Minimum Viable)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```

**Recovery time:** ~30 minutes to spin up a new VPS and restore

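One caveat: copying the SQLite file while Headscale is writing can produce a corrupt backup. A small pre-backup snapshot service sidesteps that; the sketch below assumes the default `/var/lib/headscale/db.sqlite` path (yours may differ) and an illustrative unit name:

```nix
systemd.services.headscale-db-snapshot = {
  description = "Consistent snapshot of the Headscale SQLite DB";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.sqlite ];
  script = ''
    # .backup takes a consistent copy even while headscale is running
    sqlite3 /var/lib/headscale/db.sqlite \
      ".backup /var/lib/headscale/db-snapshot.sqlite"
  '';
};
```

Order the backup job after this unit so the snapshot is always fresh.
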
#### Option 2: Warm Standby

- Run a second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to the standby
- Update DNS to point to the standby if the primary fails

```bash
# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```

**Recovery time:** ~5 minutes (start service, update DNS)

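On NixOS the daily sync can live on the standby as a systemd timer instead of a cron entry. A minimal sketch, assuming root SSH key access to `l001` and that Headscale stays stopped on the standby (unit names are illustrative):

```nix
systemd.services.headscale-sync = {
  description = "Pull Headscale state from l001 to warm standby";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.rsync pkgs.openssh ];
  script = ''
    rsync -avz --delete l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};

systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true; # run missed syncs after downtime
  };
};
```
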
#### Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

**Recovery time:** ~15 seconds automatic failover

#### Option 4: Use Tailscale Commercial

Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial

### Recommendation

Start with Option 1 (backups) immediately, then work toward Option 2 (warm standby) within a month.

---

## Router HA (H003)

### The Problem

H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, the entire LAN loses connectivity

### Solution Options

#### Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (consider ISC DHCP with failover)

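The secondary instance can be a few lines on the backup host. A sketch with illustrative upstreams (real blocklists and DNS rewrites would need to be mirrored from H003):

```nix
services.adguardhome = {
  enable = true;
  settings.dns = {
    bind_hosts = [ "0.0.0.0" ];             # listen on the LAN interface
    upstream_dns = [ "9.9.9.9" "1.1.1.1" ]; # illustrative upstreams
  };
};
# Allow LAN clients to reach the resolver
networking.firewall.allowedUDPPorts = [ 53 ];
networking.firewall.allowedTCPPorts = [ 53 ];
```

Hand both resolver IPs to clients via DHCP and they will fail over between servers automatically.
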
#### Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";  # or "BACKUP" on secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;    # lower on backup
    virtualIps = [ { addr = "10.12.14.1/24"; } ];
  };
};
```

#### Option 3: Router Redundancy via ISP

- Use the ISP router as a fallback gateway
- Clients get two gateways via DHCP
- Less control, but automatic failover

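On the DHCP side, dnsmasq on H003 could advertise both gateways; a sketch with illustrative addresses (`10.12.14.254` standing in for the ISP router):

```nix
services.dnsmasq = {
  enable = true;
  settings = {
    dhcp-option = [
      # Option 3 (router): primary gateway first, ISP router as fallback
      "option:router,10.12.14.1,10.12.14.254"
      # Option 6 (DNS): primary and secondary resolvers
      "option:dns-server,10.12.14.1,10.12.14.2"
    ];
  };
};
```

Note that many clients only honor the first router offered, so gateway failover via DHCP is best-effort rather than guaranteed.
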
### Recommendation

Run a secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.

---

## NFS HA (H002)

### Current State

H002 uses bcachefs with 2x replication across 5 disks. A single host failure still causes data unavailability.

### Options

#### Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:

```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"       # don't hang forever on server loss
    "timeo=50"   # 5-second timeout (timeo is in tenths of a second)
    "retrans=3"  # 3 retries before the request fails
    "nofail"     # don't fail boot if the mount is unavailable
  ];
};
```

#### Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```

**Overkill for homelab**, but an option for critical data.

### Recommendation

Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.

---

## Recommended Implementation Order

### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale

### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home

### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation

---

## Monitoring & Alerting

Essential for knowing when to fail over:

```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)
```

Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails