# Infrastructure Resilience & Failover
## Overview
This document covers strategies for eliminating single points of failure and improving infrastructure resilience.
## Current Architecture
```
              INTERNET
          ┌───────┴───────┐
          │               │
    ┌─────▼─────┐   ┌─────▼─────┐
    │   O001    │   │   L001    │
    │ (Oracle)  │   │ (Linode)  │
    │  nginx    │   │ Headscale │
    │  +vault   │   │  (SPOF!)  │
    │  +atuin   │   └─────┬─────┘
    │  (SPOF!)  │         │
    └─────┬─────┘         │
          │        Tailscale Mesh
          │   ┌───────────┴───────────┐
          │   │                       │
    ┌─────▼───▼─────────┐      ┌──────▼──────┐
    │       H001        │      │    H003     │
    │  (Service Host)   │      │  (Router)   │
    │ Forgejo, Zitadel, │      │  AdGuard,   │
    │ LiteLLM, Trilium, │      │  DHCP, NAT  │
    │ NixArr, OpenWebUI │      │   (SPOF!)   │
    └─────────┬─────────┘      └─────────────┘
              │ NFS
    ┌─────────▼─────────┐
    │       H002        │
    │  (NAS - bcachefs) │
    │    Media, Data    │
    └───────────────────┘
```
## Critical Single Points of Failure
| Host | Service | Impact if Down | Recovery Difficulty |
|------|---------|----------------|---------------|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| **H001** | All services | Services down but recoverable | MEDIUM |
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |
---
## Reverse Proxy Resilience (O001)
### Current Problem
O001 is a single point of failure for all public traffic:
- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host
### Solution Options
#### Option A: Cloudflare Tunnel (Recommended Quick Win)
**Pros:**
- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed
**Cons:**
- Cannot stream media (Jellyfin) - violates Cloudflare ToS
- Adds latency
- Vendor dependency
**Implementation:**
```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    default = "http_status:404"; # required catch-all for unmatched hostnames
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
  };
};
```
Cloudflare automatically load balances across all active tunnel replicas.
#### Option B: DNS Failover with Health Checks
Use Cloudflare DNS with health checks:
- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires Cloudflare paid plan for load balancing
#### Option C: Tailscale Funnel
Expose services directly without O001:
```bash
# On H001
tailscale funnel 443
```
Exposes H001 directly at `https://h001.net.joshuabell.xyz`
**Pros:**
- No proxy needed
- Per-service granularity
- Automatic HTTPS
**Cons:**
- Uses `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, 10000
#### Option D: Manual Failover with Shared Config
Keep H001 ready to take over O001's role:
1. Same nginx config via shared NixOS module
2. Use DNS-01 ACME challenge (certs work on any host)
3. Update DNS when O001 fails
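Step 2 could be sketched with the NixOS ACME module's DNS-01 support, so the same wildcard certificate can be issued on either host. The email, secret path, and Cloudflare provider below are illustrative assumptions:

```nix
# Hedged sketch: DNS-01 validation happens in DNS rather than over
# port 80, so any host with the API credentials can obtain the cert.
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@example.com"; # placeholder
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    dnsProvider = "cloudflare";
    environmentFile = config.age.secrets.cloudflare-dns.path; # assumed secret
  };
};
```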
### Recommended Hybrid Approach
```
┌─────────────────────────────────────────────────────────────┐
│                    RECOMMENDED TOPOLOGY                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│          Cloudflare DNS (health-checked failover)           │
│                         │                                   │
│                  ┌──────┴──────┐                            │
│                  │             │                            │
│                  ▼             ▼                            │
│                O001   ──OR──  H001 (via Cloudflare Tunnel)  │
│               nginx           cloudflared backup            │
│                                                             │
│   Jellyfin:    Direct via Tailscale Funnel (bypasses O001)  │
│   Vaultwarden: Cloudflare Tunnel (survives O001 failure)    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
**Key Changes:**
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
3. Other services via Cloudflare Tunnel with H001 as backup
---
## Headscale HA (L001)
### The Problem
L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades
### Solution Options
#### Option 1: Frequent Backups (Minimum Viable)
```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```
**Recovery time:** ~30 minutes to spin up new VPS + restore
#### Option 2: Warm Standby
- Run second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to standby
- Update DNS to point to standby if primary fails
```bash
# Daily sync to standby. Stop headscale first (or use `sqlite3 .backup`)
# so the SQLite database isn't copied mid-write.
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```
**Recovery time:** ~5 minutes (start service, update DNS)
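The daily sync could be automated on the standby with a systemd timer; the `l001` SSH alias and oneshot shape are assumptions:

```nix
# Hypothetical sketch: pull Headscale state from L001 once a day
systemd.services.headscale-sync = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.rsync}/bin/rsync -avz l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "daily";
};
```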
#### Option 3: Headscale HA with LiteFS
Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover
See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/
**Recovery time:** ~15 seconds automatic failover
#### Option 4: Use Tailscale Commercial
Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial
### Recommendation
Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.
---
## Router HA (H003)
### The Problem
H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, entire LAN loses connectivity
### Solution Options
#### Option 1: Secondary DNS/DHCP
Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (consider ISC DHCP with failover)
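A minimal sketch of the secondary instance on H001, leaving filter lists to be mirrored by hand via the web UI:

```nix
# Secondary resolver: clients list both H003 and H001 as DNS servers
services.adguardhome = {
  enable = true;
  mutableSettings = true; # configure via the web UI to mirror H003
};
```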
#### Option 2: Keepalived for Router Failover
If you have two devices that could be routers:
```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER"; # "BACKUP" on the secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255; # lower on the backup
    virtualIps = [ { addr = "10.12.14.1/24"; } ];
  };
};
```
#### Option 3: Router Redundancy via ISP
- Use ISP router as fallback gateway
- Clients get two gateways via DHCP
- Less control but automatic failover
### Recommendation
Run a secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.
---
## NFS HA (H002)
### Current State
H002 uses bcachefs with 2x replication across 5 disks, which protects against disk failure, but a failure of the host itself still makes all data unavailable.
### Options
#### Option 1: NFS Client Resilience
Configure NFS clients to handle server unavailability gracefully:
```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"      # return errors instead of hanging forever
    "timeo=50"  # 5-second timeout (timeo is in tenths of a second)
    "retrans=3" # 3 retries before reporting an error
    "nofail"    # don't fail boot if unavailable
  ];
};
```
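An alternative sketch is to automount the share, so boot never blocks on H002 and the mount only appears on first access (same export as above):

```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "x-systemd.automount"        # mount lazily on first access
    "x-systemd.idle-timeout=600" # unmount after 10 minutes idle
    "soft"
    "nofail"
  ];
};
```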
#### Option 2: Second NAS with GlusterFS
For true HA, run two NAS nodes with GlusterFS replication:
```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```
**Overkill for homelab**, but an option for critical data.
### Recommendation
Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
---
## Recommended Implementation Order
### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale
### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home
### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation
---
## Monitoring & Alerting
Essential for knowing when to fail over:
```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)
```
Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails
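
Those alerts could start as a simple systemd-timer probe before reaching for a full monitoring stack. The URLs below are placeholders, and Headscale's `/health` endpoint is an assumption to verify against its docs:

```nix
# Hypothetical sketch: the unit fails (visible in journalctl and to any
# alerting hooked on unit failure) if a critical endpoint stops answering
systemd.services.spof-healthcheck = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.curl}/bin/curl -fsS --max-time 5 https://headscale.example/health
    ${pkgs.curl}/bin/curl -fsS --max-time 5 https://vault.example/alive
  '';
};
systemd.timers.spof-healthcheck = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "*:0/5"; # every five minutes
};
```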