# Infrastructure Resilience & Failover

## Overview

This document covers strategies for eliminating single points of failure and improving infrastructure resilience.

## Current Architecture
```
                INTERNET
                    │
          ┌─────────┴─────────┐
          │                   │
    ┌─────▼─────┐      ┌──────▼──────┐
    │   O001    │      │    L001     │
    │ (Oracle)  │      │  Headscale  │
    │   nginx   │      │   (SPOF!)   │
    │  +vault   │      └──────┬──────┘
    │  +atuin   │             │
    │  (SPOF!)  │             │
    └─────┬─────┘             │
          │             Tailscale Mesh
          │       ┌───────────┴───────────┐
          │       │                       │
    ┌─────▼───────▼─────┐          ┌──────▼──────┐
    │       H001        │          │    H003     │
    │  (Service Host)   │          │  (Router)   │
    │ Forgejo, Zitadel, │          │  AdGuard,   │
    │ LiteLLM, Trilium, │          │  DHCP, NAT  │
    │ NixArr, OpenWebUI │          │   (SPOF!)   │
    └─────────┬─────────┘          └─────────────┘
              │ NFS
    ┌─────────▼─────────┐
    │       H002        │
    │  (NAS - bcachefs) │
    │    Media, Data    │
    └───────────────────┘
```
## Critical Single Points of Failure
| Host | Service | Impact if Down | Recovery Time |
|---|---|---|---|
| L001 | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| O001 | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| H003 | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| H001 | All services | Services down but recoverable | MEDIUM |
| H002 | NFS | Media unavailable | LOW - bcachefs has replication |
## Reverse Proxy Resilience (O001)

### Current Problem

O001 is a single point of failure for all public traffic:

- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host

### Solution Options
#### Option A: Cloudflare Tunnel (Recommended Quick Win)

Pros:

- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed

Cons:

- Cannot stream media (Jellyfin) without violating the Cloudflare ToS
- Adds latency
- Vendor dependency
Implementation:
```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    default = "http_status:404"; # catch-all for unmatched hostnames
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
  };
};
```
Cloudflare automatically load balances across all active tunnel replicas.
#### Option B: DNS Failover with Health Checks

Use Cloudflare DNS with health checks:

- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires a Cloudflare paid plan for load balancing
#### Option C: Tailscale Funnel

Expose services directly without O001:

```bash
# On H001
tailscale funnel 443
```

Exposes H001 directly at its `ts.net` hostname, e.g. `https://h001.<tailnet>.ts.net`
Pros:

- No proxy needed
- Per-service granularity
- Automatic HTTPS

Cons:

- Uses the `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, and 10000
#### Option D: Manual Failover with Shared Config

Keep H001 ready to take over O001's role:

- Same nginx config via a shared NixOS module
- Use the DNS-01 ACME challenge (certs can be issued on any host)
- Update DNS when O001 fails
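On NixOS, a shared DNS-01 setup might look like the following sketch (the contact email and secret path are assumptions; any lego-supported DNS provider works):

```nix
# Shared module imported by BOTH O001 and H001.
# DNS-01 proves ownership via a TXT record, so certs can be
# issued without the host being reachable on port 80/443.
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@joshuabell.xyz"; # assumed contact address
  certs."joshuabell.xyz" = {
    domain = "joshuabell.xyz";
    extraDomainNames = [ "*.joshuabell.xyz" ];
    dnsProvider = "cloudflare";
    # Environment file carrying the Cloudflare API token (Zone:DNS:Edit)
    credentialsFile = config.age.secrets.cloudflare-dns.path;
  };
};
```

With the wildcard in place, the same certificate directory can be backed up and restored on whichever host currently serves traffic.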
### Recommended Hybrid Approach

```
┌──────────────────────────────────────────────────────────┐
│                   RECOMMENDED TOPOLOGY                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│         Cloudflare DNS (health-checked failover)         │
│                        │                                 │
│                 ┌──────┴──────┐                          │
│                 │             │                          │
│                 ▼             ▼                          │
│           O001 ──OR── H001 (via Cloudflare Tunnel)       │
│           nginx       cloudflared backup                 │
│                                                          │
│  Jellyfin: Direct via Tailscale Funnel (bypasses O001)   │
│  Vaultwarden: Cloudflare Tunnel (survives O001 failure)  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
Key Changes:
- Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
- Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
- Other services via Cloudflare Tunnel with H001 as backup
## Headscale HA (L001)

### The Problem

L001 running Headscale is the **most critical SPOF**:

- If Headscale dies, existing connections keep working temporarily
- No new devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades
### Solution Options
#### Option 1: Frequent Backups (Minimum Viable)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```

Recovery time: ~30 minutes to spin up a new VPS + restore
#### Option 2: Warm Standby

- Run a second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to the standby
- Update DNS to point to the standby if the primary fails

```bash
# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```

Recovery time: ~5 minutes (start service, update DNS)
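On NixOS the daily sync could be wired up as a systemd timer on the standby; the unit name and the assumption of key-based root rsync over the mesh are illustrative:

```nix
# Runs on the standby host, pulling Headscale state from L001 daily.
systemd.services.headscale-sync = {
  description = "Pull Headscale state from primary";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.rsync pkgs.openssh ];
  script = ''
    rsync -avz l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true; # run at boot if a scheduled sync was missed
  };
};
```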
#### Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can combine:

- LiteFS for SQLite replication
- Consul for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

Recovery time: ~15 seconds, automatic failover
#### Option 4: Use Tailscale Commercial

Let Tailscale handle control plane HA:

- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial

### Recommendation

Start with Option 1 (backups) immediately, then work toward Option 2 (warm standby) within a month.
## Router HA (H003)

### The Problem

H003 is the network gateway, running:

- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall

If it dies, the entire LAN loses connectivity.
### Solution Options

#### Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):

- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (ISC DHCP has failover support but is end-of-life; its successor Kea also supports HA)
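A secondary resolver on H001 could be sketched as below; the bind address and upstreams are assumptions, and clients would list both H003 and H001 as DNS servers:

```nix
# On H001: a backup AdGuard Home instance.
services.adguardhome = {
  enable = true;
  settings = {
    dns = {
      bind_hosts = [ "0.0.0.0" ];
      upstream_dns = [ "9.9.9.9" "1.1.1.1" ];
    };
  };
};
```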
#### Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";   # "BACKUP" on the secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;     # use a lower value on the backup
    virtualIps = [ { addr = "10.12.14.1/24"; } ];
  };
};
```
#### Option 3: Router Redundancy via ISP

- Use the ISP router as a fallback gateway
- Clients get two gateways via DHCP
- Less control, but automatic failover

### Recommendation

Run a secondary AdGuard Home instance on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.
## NFS HA (H002)

### Current State

H002 uses bcachefs with 2x replication across 5 disks. That protects against disk failure, but a failure of the host itself still makes all data unavailable.

### Options

#### Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:
```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"      # return errors instead of hanging forever
    "timeo=50"  # 5-second timeout (timeo is in tenths of a second)
    "retrans=3" # 3 retries before giving up
    "nofail"    # don't block boot if the server is unavailable
  ];
};
```
#### Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```

Overkill for a homelab, but an option for critical data.
### Recommendation

Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
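An offsite backup on H002 could use restic via the NixOS module; the repository URL, paths, and secret names below are placeholders, not the existing setup:

```nix
# Nightly offsite backup of irreplaceable data to an assumed B2 bucket.
services.restic.backups.offsite = {
  paths = [ "/data/documents" "/data/photos" ];
  repository = "b2:my-bucket:/h002";
  passwordFile = config.age.secrets.restic-password.path;
  environmentFile = config.age.secrets.b2-credentials.path;
  timerConfig = {
    OnCalendar = "03:00";
    Persistent = true;
  };
  pruneOpts = [ "--keep-daily 7" "--keep-weekly 4" "--keep-monthly 6" ];
};
```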
## Recommended Implementation Order

### Phase 1: Quick Wins (This Week)

- Set up Cloudflare Tunnel on O001 AND H001
- Enable Tailscale Funnel for Jellyfin
- Automated backups for L001 Headscale

### Phase 2: Core Resilience (This Month)

- DNS-01 ACME for shared certs
- Warm standby for Headscale
- Secondary AdGuard Home

### Phase 3: Full Resilience (Next Quarter)

- Headscale HA with LiteFS (if needed)
- Automated failover testing
- Runbook documentation
## Monitoring & Alerting

Essential for knowing when to fail over:

```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};
# Or use external monitoring (BetterStack, Uptime Robot)
```
Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails
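Whichever monitor is used, alerts benefit from hysteresis so a single dropped probe doesn't trigger a failover page. A minimal illustrative sketch of that logic (the class name and threshold are made up for illustration, not part of any monitoring tool):

```python
# Alert only after N consecutive failed probes, and clear the alert
# on the first success, so a one-off network blip doesn't page.
class FlapGuard:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def observe(self, ok: bool) -> bool:
        """Feed one probe result; return True while in the alert state."""
        if ok:
            self.failures = 0
            self.alerting = False
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.alerting = True
        return self.alerting

if __name__ == "__main__":
    guard = FlapGuard(threshold=3)
    probes = [True, False, True, False, False, False, True]
    print([guard.observe(p) for p in probes])
    # one blip (index 1) stays quiet; three failures in a row alert
```

Uptime Kuma's "retries" setting expresses the same idea; the point is to tune it per service so failover decisions are based on sustained outages.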