347 lines
10 KiB
Markdown
347 lines
10 KiB
Markdown
# Infrastructure Resilience & Failover
|
|
|
|
## Overview
|
|
|
|
This document covers strategies for eliminating single points of failure and improving infrastructure resilience.
|
|
|
|
## Current Architecture
|
|
|
|
```
|
|
INTERNET
|
|
│
|
|
┌─────────┴─────────┐
|
|
│ │
|
|
┌─────▼─────┐ ┌──────▼──────┐
|
|
│ O001 │ │ L001 │
|
|
│ (Oracle) │ │ (Linode) │
|
|
│ nginx │ │ Headscale │
|
|
│ +vault │ │ (SPOF!) │
|
|
│ +atuin │ └──────┬──────┘
|
|
│ (SPOF!) │ │
|
|
└─────┬─────┘ │
|
|
│ Tailscale Mesh
|
|
│ ┌───────────┴───────────┐
|
|
│ │ │
|
|
┌─────▼───────▼─────┐ ┌──────▼──────┐
|
|
│ H001 │ │ H003 │
|
|
│ (Service Host) │ │ (Router) │
|
|
│ Forgejo,Zitadel, │ │ AdGuard, │
|
|
│ LiteLLM,Trilium, │ │ DHCP,NAT │
|
|
│ NixArr,OpenWebUI │ │ (SPOF!) │
|
|
└─────────┬─────────┘ └─────────────┘
|
|
│ NFS
|
|
┌─────────▼─────────┐
|
|
│ H002 │
|
|
│ (NAS - bcachefs)│
|
|
│ Media, Data │
|
|
└───────────────────┘
|
|
```
|
|
|
|
## Critical Single Points of Failure
|
|
|
|
| Host | Service | Impact if Down | Recovery Time |
|
|
|------|---------|----------------|---------------|
|
|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
|
|
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
|
|
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
|
|
| **H001** | All services | Services down but recoverable | MEDIUM |
|
|
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |
|
|
|
|
---
|
|
|
|
## Reverse Proxy Resilience (O001)
|
|
|
|
### Current Problem
|
|
|
|
O001 is a single point of failure for all public traffic:
|
|
- No public access to any service if it dies
|
|
- DNS still points to it after failure
|
|
- ACME certs are only on that host
|
|
|
|
### Solution Options
|
|
|
|
#### Option A: Cloudflare Tunnel (Recommended Quick Win)
|
|
|
|
**Pros:**
|
|
- No single server dependency
|
|
- Run `cloudflared` on multiple hosts (H001 as backup)
|
|
- Automatic failover between tunnel replicas
|
|
- Built-in DDoS protection
|
|
- No inbound ports needed
|
|
|
|
**Cons:**
|
|
- Cannot stream media (Jellyfin) - violates Cloudflare ToS
|
|
- Adds latency
|
|
- Vendor dependency
|
|
|
|
**Implementation:**
|
|
|
|
```nix
|
|
# On BOTH O001 (primary) AND H001 (backup)
|
|
services.cloudflared = {
|
|
enable = true;
|
|
tunnels."joshuabell" = {
|
|
credentialsFile = config.age.secrets.cloudflared.path;
|
|
ingress = {
|
|
"chat.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"git.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"notes.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"sec.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"sso.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"n8n.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
"blog.joshuabell.xyz" = "http://100.64.0.13:80";
|
|
};
|
|
};
|
|
};
|
|
```
|
|
|
|
Cloudflare automatically load balances across all active tunnel replicas.
|
|
|
|
#### Option B: DNS Failover with Health Checks
|
|
|
|
Use Cloudflare DNS with health checks:
|
|
- Point `joshuabell.xyz` to both O001 and a backup
|
|
- Cloudflare removes unhealthy IPs automatically
|
|
- Requires Cloudflare paid plan for load balancing
|
|
|
|
#### Option C: Tailscale Funnel
|
|
|
|
Expose services directly without O001:
|
|
|
|
```bash
|
|
# On H001
|
|
tailscale funnel 443
|
|
```
|
|
|
|
Exposes H001 directly at `https://h001.net.joshuabell.xyz`
|
|
|
|
**Pros:**
|
|
- No proxy needed
|
|
- Per-service granularity
|
|
- Automatic HTTPS
|
|
|
|
**Cons:**
|
|
- Uses `ts.net` domain (no custom domain)
|
|
- Limited to ports 443, 8443, 10000
|
|
|
|
#### Option D: Manual Failover with Shared Config
|
|
|
|
Keep H001 ready to take over O001's role:
|
|
1. Same nginx config via shared NixOS module
|
|
2. Use DNS-01 ACME challenge (certs work on any host)
|
|
3. Update DNS when O001 fails
|
|
|
|
### Recommended Hybrid Approach
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────┐
|
|
│ RECOMMENDED TOPOLOGY │
|
|
├─────────────────────────────────────────────────────────────┤
|
|
│ │
|
|
│ Cloudflare DNS (health-checked failover) │
|
|
│ │ │
|
|
│ ┌──────┴──────┐ │
|
|
│ │ │ │
|
|
│ ▼ ▼ │
|
|
│ O001 ──OR── H001 (via Cloudflare Tunnel) │
|
|
│ nginx cloudflared backup │
|
|
│ │
|
|
│ Jellyfin: Direct via Tailscale Funnel (bypasses O001) │
|
|
│ Vaultwarden: Cloudflare Tunnel (survives O001 failure) │
|
|
│ │
|
|
└─────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
**Key Changes:**
|
|
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
|
|
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
|
|
3. Other services via Cloudflare Tunnel with H001 as backup
|
|
|
|
---
|
|
|
|
## Headscale HA (L001)
|
|
|
|
### The Problem
|
|
|
|
L001 running Headscale is the MOST CRITICAL SPOF:
|
|
- If Headscale dies, existing connections keep working temporarily
|
|
- NO NEW devices can connect
|
|
- Devices that reboot cannot rejoin the mesh
|
|
- Eventually all mesh connectivity degrades
|
|
|
|
### Solution Options
|
|
|
|
#### Option 1: Frequent Backups (Minimum Viable)
|
|
|
|
```nix
|
|
my.backup = {
|
|
enable = true;
|
|
paths = [ "/var/lib/headscale" "/var/lib/acme" ];
|
|
};
|
|
```
|
|
|
|
**Recovery time:** ~30 minutes to spin up new VPS + restore
|
|
|
|
#### Option 2: Warm Standby
|
|
|
|
- Run second Linode/VPS with Headscale configured but stopped
|
|
- Daily rsync of `/var/lib/headscale/` to standby
|
|
- Update DNS to point to standby if primary fails
|
|
|
|
```bash
|
|
# Daily sync to standby
|
|
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
|
|
```
|
|
|
|
**Recovery time:** ~5 minutes (start service, update DNS)
|
|
|
|
#### Option 3: Headscale HA with LiteFS
|
|
|
|
Headscale doesn't natively support HA, but you can use:
|
|
- **LiteFS** for SQLite replication
|
|
- **Consul** for leader election and failover
|
|
|
|
See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/
|
|
|
|
**Recovery time:** ~15 seconds automatic failover
|
|
|
|
#### Option 4: Use Tailscale Commercial
|
|
|
|
Let Tailscale handle the control plane HA:
|
|
- They manage availability
|
|
- Keep Headscale for learning/experimentation
|
|
- Critical services use Tailscale commercial
|
|
|
|
### Recommendation
|
|
|
|
Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.
|
|
|
|
---
|
|
|
|
## Router HA (H003)
|
|
|
|
### The Problem
|
|
|
|
H003 is the network gateway:
|
|
- AdGuard Home (DNS filtering)
|
|
- dnsmasq (DHCP)
|
|
- NAT firewall
|
|
- If it dies, entire LAN loses connectivity
|
|
|
|
### Solution Options
|
|
|
|
#### Option 1: Secondary DNS/DHCP
|
|
|
|
Run backup DNS on another host (H001 or H002):
|
|
- Secondary AdGuard Home instance
|
|
- Clients configured with both DNS servers
|
|
- DHCP failover is trickier (consider ISC DHCP with failover)
|
|
|
|
#### Option 2: Keepalived for Router Failover
|
|
|
|
If you have two devices that could be routers:
|
|
|
|
```nix
|
|
services.keepalived = {
|
|
enable = true;
|
|
vrrpInstances.router = {
|
|
state = "MASTER"; # or "BACKUP" on secondary
|
|
interface = "eth0";
|
|
virtualRouterId = 1;
|
|
priority = 255; # Lower on backup
|
|
virtualIps = [{ addr = "10.12.14.1/24"; }];
|
|
};
|
|
};
|
|
```
|
|
|
|
#### Option 3: Router Redundancy via ISP
|
|
|
|
- Use ISP router as fallback gateway
|
|
- Clients get two gateways via DHCP
|
|
- Less control but automatic failover
|
|
|
|
### Recommendation
|
|
|
|
Run secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for homelab.
|
|
|
|
---
|
|
|
|
## NFS HA (H002)
|
|
|
|
### Current State
|
|
|
|
H002 uses bcachefs with 2x replication across 5 disks. Single host failure still causes data unavailability.
|
|
|
|
### Options
|
|
|
|
#### Option 1: NFS Client Resilience
|
|
|
|
Configure NFS clients to handle server unavailability gracefully:
|
|
|
|
```nix
|
|
fileSystems."/nfs/h002" = {
|
|
device = "100.64.0.3:/data";
|
|
fsType = "nfs4";
|
|
options = [
|
|
"soft" # Don't hang forever
|
|
"timeo=50" # 5 second timeout
|
|
"retrans=3" # 3 retries
|
|
"nofail" # Don't fail boot if unavailable
|
|
];
|
|
};
|
|
```
|
|
|
|
#### Option 2: Second NAS with GlusterFS
|
|
|
|
For true HA, run two NAS nodes with GlusterFS replication:
|
|
|
|
```
|
|
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
|
|
```
|
|
|
|
**Overkill for homelab**, but an option for critical data.
|
|
|
|
### Recommendation
|
|
|
|
Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
|
|
|
|
---
|
|
|
|
## Recommended Implementation Order
|
|
|
|
### Phase 1: Quick Wins (This Week)
|
|
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
|
|
2. [ ] Enable Tailscale Funnel for Jellyfin
|
|
3. [ ] Automated backups for L001 Headscale
|
|
|
|
### Phase 2: Core Resilience (This Month)
|
|
4. [ ] DNS-01 ACME for shared certs
|
|
5. [ ] Warm standby for Headscale
|
|
6. [ ] Secondary AdGuard Home
|
|
|
|
### Phase 3: Full Resilience (Next Quarter)
|
|
7. [ ] Headscale HA with LiteFS (if needed)
|
|
8. [ ] Automated failover testing
|
|
9. [ ] Runbook documentation
|
|
|
|
---
|
|
|
|
## Monitoring & Alerting
|
|
|
|
Essential for knowing when to failover:
|
|
|
|
```nix
|
|
# Uptime monitoring for critical services
|
|
services.uptime-kuma = {
|
|
enable = true;
|
|
# Monitor: Headscale, nginx, Vaultwarden, AdGuard
|
|
};
|
|
|
|
# Or use external monitoring (BetterStack, Uptime Robot)
|
|
```
|
|
|
|
Alert on:
|
|
- Headscale API unreachable
|
|
- nginx health check fails
|
|
- DNS resolution fails
|
|
- NFS mount fails
|