Add migrating services guide and idea drafts, update flake.lock
# Infrastructure Resilience & Failover

## Overview

This document covers strategies for eliminating single points of failure and improving infrastructure resilience.

## Current Architecture

```
            INTERNET
                │
      ┌─────────┴─────────┐
      │                   │
┌─────▼─────┐      ┌──────▼──────┐
│   O001    │      │    L001     │
│ (Oracle)  │      │  (Linode)   │
│  nginx    │      │  Headscale  │
│  +vault   │      │   (SPOF!)   │
│  +atuin   │      └──────┬──────┘
│  (SPOF!)  │             │
└─────┬─────┘             │
      │              Tailscale Mesh
      │      ┌───────────┴───────────┐
      │      │                       │
┌─────▼───────▼─────┐          ┌──────▼──────┐
│       H001        │          │    H003     │
│  (Service Host)   │          │  (Router)   │
│ Forgejo,Zitadel,  │          │  AdGuard,   │
│ LiteLLM,Trilium,  │          │  DHCP,NAT   │
│ NixArr,OpenWebUI  │          │   (SPOF!)   │
└─────────┬─────────┘          └─────────────┘
          │ NFS
┌─────────▼─────────┐
│       H002        │
│ (NAS - bcachefs)  │
│    Media, Data    │
└───────────────────┘
```

## Critical Single Points of Failure

| Host | Service | Impact if Down | Recovery Time |
|------|---------|----------------|---------------|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| **H001** | All services | Services down but recoverable | MEDIUM |
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |

---

## Reverse Proxy Resilience (O001)

### Current Problem

O001 is a single point of failure for all public traffic:
- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host

### Solution Options

#### Option A: Cloudflare Tunnel (Recommended Quick Win)

**Pros:**
- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed

**Cons:**
- Cannot stream media (Jellyfin) through the proxy without violating Cloudflare's ToS
- Adds latency
- Vendor dependency

**Implementation:**

```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    default = "http_status:404"; # cloudflared requires a catch-all rule
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
  };
};
```

Cloudflare automatically load balances across all active tunnel replicas.

#### Option B: DNS Failover with Health Checks

Use Cloudflare DNS with health checks:
- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires a paid Cloudflare plan for load balancing

#### Option C: Tailscale Funnel

Expose services directly without O001:

```bash
# On H001
tailscale funnel 443
```

Exposes H001 directly at its `ts.net` Funnel hostname

**Pros:**
- No proxy needed
- Per-service granularity
- Automatic HTTPS

**Cons:**
- Uses `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, and 10000

#### Option D: Manual Failover with Shared Config

Keep H001 ready to take over O001's role:
1. Same nginx config via shared NixOS module
2. Use DNS-01 ACME challenge (certs work on any host)
3. Update DNS when O001 fails

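Step 2 could look like the following sketch on both hosts. It assumes Cloudflare hosts the DNS zone and that an agenix secret (here called `cloudflare-dns`, an illustrative name) holds an API token with DNS-edit rights:

```nix
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@joshuabell.xyz"; # illustrative contact address
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    dnsProvider = "cloudflare";
    # env file containing CLOUDFLARE_DNS_API_TOKEN=...; secret name is illustrative
    credentialsFile = config.age.secrets.cloudflare-dns.path;
  };
};
```

Because the challenge is answered via DNS records rather than inbound HTTP, either host can renew and serve the same certs, so failover reduces to the DNS update in step 3.
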
### Recommended Hybrid Approach

```
┌─────────────────────────────────────────────────────────────┐
│                    RECOMMENDED TOPOLOGY                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│        Cloudflare DNS (health-checked failover)             │
│                       │                                     │
│                ┌──────┴──────┐                              │
│                │             │                              │
│                ▼             ▼                              │
│              O001  ──OR──  H001 (via Cloudflare Tunnel)     │
│             nginx          cloudflared backup               │
│                                                             │
│  Jellyfin:    Direct via Tailscale Funnel (bypasses O001)   │
│  Vaultwarden: Cloudflare Tunnel (survives O001 failure)     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Key Changes:**
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
3. Other services via Cloudflare Tunnel with H001 as backup

---

## Headscale HA (L001)

### The Problem

L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades

### Solution Options

#### Option 1: Frequent Backups (Minimum Viable)

```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```

**Recovery time:** ~30 minutes to spin up a new VPS and restore

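One caveat: copying the SQLite file while Headscale is writing can produce a corrupt backup. A small pre-backup snapshot service sidesteps that; the sketch below assumes the default `/var/lib/headscale/db.sqlite` path (yours may differ) and an illustrative unit name:

```nix
systemd.services.headscale-db-snapshot = {
  description = "Consistent snapshot of the Headscale SQLite DB";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.sqlite ];
  script = ''
    # .backup takes a consistent copy even while headscale is running
    sqlite3 /var/lib/headscale/db.sqlite \
      ".backup /var/lib/headscale/db-snapshot.sqlite"
  '';
};
```

Order the backup job after this unit so the snapshot is always fresh.
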
#### Option 2: Warm Standby

- Run a second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to the standby
- Update DNS to point to the standby if the primary fails

```bash
# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```

**Recovery time:** ~5 minutes (start service, update DNS)

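On NixOS the daily sync can live on the standby as a systemd timer instead of a cron entry. A minimal sketch, assuming root SSH key access to `l001` and that Headscale stays stopped on the standby (unit names are illustrative):

```nix
systemd.services.headscale-sync = {
  description = "Pull Headscale state from l001 to warm standby";
  serviceConfig.Type = "oneshot";
  path = [ pkgs.rsync pkgs.openssh ];
  script = ''
    rsync -avz --delete l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};

systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true; # run missed syncs after downtime
  };
};
```
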
#### Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

**Recovery time:** ~15 seconds automatic failover

#### Option 4: Use Tailscale Commercial

Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial

### Recommendation

Start with Option 1 (backups) immediately, then work toward Option 2 (warm standby) within a month.

---

## Router HA (H003)

### The Problem

H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, the entire LAN loses connectivity

### Solution Options

#### Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (consider ISC DHCP with failover)

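The secondary instance can be a few lines on the backup host. A sketch with illustrative upstreams (real blocklists and DNS rewrites would need to be mirrored from H003):

```nix
services.adguardhome = {
  enable = true;
  settings.dns = {
    bind_hosts = [ "0.0.0.0" ];             # listen on the LAN interface
    upstream_dns = [ "9.9.9.9" "1.1.1.1" ]; # illustrative upstreams
  };
};
# Allow LAN clients to reach the resolver
networking.firewall.allowedUDPPorts = [ 53 ];
networking.firewall.allowedTCPPorts = [ 53 ];
```

Hand both resolver IPs to clients via DHCP and they will fail over between servers automatically.
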
#### Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";  # or "BACKUP" on secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;    # lower on backup
    virtualIps = [ { addr = "10.12.14.1/24"; } ];
  };
};
```

#### Option 3: Router Redundancy via ISP

- Use the ISP router as a fallback gateway
- Clients get two gateways via DHCP
- Less control, but automatic failover

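On the DHCP side, dnsmasq on H003 could advertise both gateways; a sketch with illustrative addresses (`10.12.14.254` standing in for the ISP router):

```nix
services.dnsmasq = {
  enable = true;
  settings = {
    dhcp-option = [
      # Option 3 (router): primary gateway first, ISP router as fallback
      "option:router,10.12.14.1,10.12.14.254"
      # Option 6 (DNS): primary and secondary resolvers
      "option:dns-server,10.12.14.1,10.12.14.2"
    ];
  };
};
```

Note that many clients only honor the first router offered, so gateway failover via DHCP is best-effort rather than guaranteed.
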
### Recommendation

Run a secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.

---

## NFS HA (H002)

### Current State

H002 uses bcachefs with 2x replication across 5 disks. A single host failure still causes data unavailability.

### Options

#### Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:

```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"       # don't hang forever on server loss
    "timeo=50"   # 5-second timeout (timeo is in tenths of a second)
    "retrans=3"  # 3 retries before the request fails
    "nofail"     # don't fail boot if the mount is unavailable
  ];
};
```

#### Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```

**Overkill for homelab**, but an option for critical data.

### Recommendation

Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.

---

## Recommended Implementation Order

### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale

### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home

### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation

---

## Monitoring & Alerting

Essential for knowing when to fail over:

```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)
```

Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails