dotfiles/ideas/resilience.md

Infrastructure Resilience & Failover

Overview

This document covers strategies for eliminating single points of failure and improving infrastructure resilience.

Current Architecture

                    INTERNET
                        │
              ┌─────────┴─────────┐
              │                   │
        ┌─────▼─────┐      ┌──────▼──────┐
        │   O001    │      │    L001     │
        │  (Oracle) │      │   (Linode)  │
        │  nginx    │      │  Headscale  │
        │  +vault   │      │   (SPOF!)   │
        │  +atuin   │      └──────┬──────┘
        │  (SPOF!)  │             │
        └─────┬─────┘             │
              │         Tailscale Mesh
              │       ┌───────────┴───────────┐
              │       │                       │
        ┌─────▼───────▼─────┐          ┌──────▼──────┐
        │       H001        │          │    H003     │
        │  (Service Host)   │          │   (Router)  │
        │  Forgejo,Zitadel, │          │  AdGuard,   │
        │  LiteLLM,Trilium, │          │  DHCP,NAT   │
        │  NixArr,OpenWebUI │          │   (SPOF!)   │
        └─────────┬─────────┘          └─────────────┘
                  │ NFS
        ┌─────────▼─────────┐
        │       H002        │
        │   (NAS - bcachefs)│
        │  Media, Data      │
        └───────────────────┘

Critical Single Points of Failure

Host   Service             Impact if Down                       Recovery Time
L001   Headscale           ALL mesh connectivity                HIGH - must restore SQLite exactly
O001   nginx/Vaultwarden   All public access, password manager  MEDIUM
H003   DNS/DHCP/NAT        Entire LAN offline                   MEDIUM
H001   All services        Services down but recoverable        MEDIUM
H002   NFS                 Media unavailable                    LOW - bcachefs has replication

Reverse Proxy Resilience (O001)

Current Problem

O001 is a single point of failure for all public traffic:

  • No public access to any service if it dies
  • DNS still points to it after failure
  • ACME certs are only on that host

Solution Options

Option A: Cloudflare Tunnel

Pros:

  • No single server dependency
  • Run cloudflared on multiple hosts (H001 as backup)
  • Automatic failover between tunnel replicas
  • Built-in DDoS protection
  • No inbound ports needed

Cons:

  • Cannot stream media (Jellyfin) - violates Cloudflare ToS
  • Adds latency
  • Vendor dependency

Implementation:

# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
    # cloudflared requires a catch-all rule; unmatched hosts get a 404
    default = "http_status:404";
  };
};

Cloudflare automatically load balances across all active tunnel replicas.

Option B: DNS Failover with Health Checks

Use Cloudflare DNS with health checks:

  • Point joshuabell.xyz to both O001 and a backup
  • Cloudflare removes unhealthy IPs automatically
  • Requires Cloudflare paid plan for load balancing

Option C: Tailscale Funnel

Expose services directly without O001:

# On H001
tailscale funnel 443

Exposes H001 directly at its Tailscale-issued hostname, e.g. https://h001.<tailnet>.ts.net

Pros:

  • No proxy needed
  • Per-service granularity
  • Automatic HTTPS

Cons:

  • Uses ts.net domain (no custom domain)
  • Limited to ports 443, 8443, 10000
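To keep Funnel enabled declaratively rather than via a one-off command, a oneshot unit can re-assert it at boot. This is a sketch: the `--bg` flag assumes a reasonably recent tailscale CLI, and note that Funnel is brokered by Tailscale's control plane, so verify it actually works against a Headscale coordinator before relying on it.

```nix
# Sketch: re-assert Funnel on H001 at every boot.
# Assumes a tailscale CLI new enough to support "funnel --bg".
systemd.services.tailscale-funnel = {
  description = "Enable Tailscale Funnel for local port 443";
  after = [ "tailscaled.service" ];
  wants = [ "tailscaled.service" ];
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = "${pkgs.tailscale}/bin/tailscale funnel --bg 443";
};
```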

Option D: Manual Failover with Shared Config

Keep H001 ready to take over O001's role:

  1. Same nginx config via shared NixOS module
  2. Use DNS-01 ACME challenge (certs work on any host)
  3. Update DNS when O001 fails
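Step 2 can be sketched with the NixOS ACME module's lego-based DNS-01 support. The provider name, contact address, and secrets path below are assumptions for illustration; the key point is that a wildcard cert obtained this way works identically on O001 and H001.

```nix
# Sketch: DNS-01 wildcard cert deployable on either host.
# dnsProvider, email, and the secrets path are placeholders.
security.acme = {
  acceptTerms = true;
  defaults = {
    email = "admin@joshuabell.xyz";  # hypothetical contact address
    dnsProvider = "cloudflare";
    environmentFile = config.age.secrets.cloudflare-api.path;
  };
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    group = "nginx";  # let nginx read the cert
  };
};
```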
┌─────────────────────────────────────────────────────────────┐
│                   RECOMMENDED TOPOLOGY                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Cloudflare DNS (health-checked failover)                 │
│          │                                                  │
│   ┌──────┴──────┐                                          │
│   │             │                                          │
│   ▼             ▼                                          │
│  O001   ──OR── H001 (via Cloudflare Tunnel)               │
│  nginx         cloudflared backup                          │
│                                                             │
│   Jellyfin: Direct via Tailscale Funnel (bypasses O001)   │
│   Vaultwarden: Cloudflare Tunnel (survives O001 failure)  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Changes:

  1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
  2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
  3. Other services via Cloudflare Tunnel with H001 as backup

Headscale HA (L001)

The Problem

L001 running Headscale is the MOST CRITICAL SPOF:

  • If Headscale dies, existing connections keep working temporarily
  • NO NEW devices can connect
  • Devices that reboot cannot rejoin the mesh
  • Eventually all mesh connectivity degrades

Solution Options

Option 1: Frequent Backups (Minimum Viable)

my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};

Recovery time: ~30 minutes to spin up new VPS + restore

Option 2: Warm Standby

  • Run second Linode/VPS with Headscale configured but stopped
  • Daily rsync of /var/lib/headscale/ to standby
  • Update DNS to point to standby if primary fails

# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/

Recovery time: ~5 minutes (start service, update DNS)
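The daily rsync above can be made declarative on the standby with a systemd timer. A minimal sketch, assuming root SSH access to l001 is already set up:

```nix
# Sketch: pull Headscale state onto the standby once a day.
# "--delete" mirrors the primary exactly, including removed files.
systemd.services.headscale-sync = {
  description = "Sync Headscale state from primary";
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.rsync}/bin/rsync -avz --delete \
      l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true;  # run a missed sync after downtime
  };
};
```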

Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can use:

  • LiteFS for SQLite replication
  • Consul for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

Recovery time: ~15 seconds automatic failover

Option 4: Use Tailscale Commercial

Let Tailscale handle the control plane HA:

  • They manage availability
  • Keep Headscale for learning/experimentation
  • Critical services use Tailscale commercial

Recommendation

Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.


Router HA (H003)

The Problem

H003 is the network gateway:

  • AdGuard Home (DNS filtering)
  • dnsmasq (DHCP)
  • NAT firewall
  • If it dies, entire LAN loses connectivity

Solution Options

Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):

  • Secondary AdGuard Home instance
  • Clients configured with both DNS servers
  • DHCP failover is trickier (consider Kea, ISC dhcpd's successor, which has a built-in HA hook)
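A minimal sketch of the secondary resolver on H001, plus handing both resolvers to clients via DHCP option 6 on H003. The upstream servers and IP addresses are placeholders, and `services.dnsmasq.settings` assumes a recent NixOS; mirror H003's real AdGuard settings in practice.

```nix
# Sketch (H001): standalone secondary AdGuard Home instance.
# Upstream is a placeholder; copy H003's filtering config.
services.adguardhome = {
  enable = true;
  settings = {
    dns = {
      bind_hosts = [ "0.0.0.0" ];
      port = 53;
      upstream_dns = [ "https://dns.quad9.net/dns-query" ];
    };
  };
};

# Sketch (H003): advertise both resolvers via DHCP.
# Addresses are hypothetical; use H003's and H001's LAN IPs.
services.dnsmasq.settings.dhcp-option = [
  "option:dns-server,10.12.14.1,10.12.14.2"
];
```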

Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";  # or "BACKUP" on secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;  # Lower on backup
    virtualIps = [{ addr = "10.12.14.1/24"; }];
  };
};

Option 3: Router Redundancy via ISP

  • Use ISP router as fallback gateway
  • Clients get two gateways via DHCP
  • Less control but automatic failover
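If dnsmasq is the DHCP server, the dual-gateway idea can be sketched as below (both addresses are assumptions for your actual gateways):

```nix
# Sketch: advertise H003 first, the ISP router as fallback (DHCP option 3).
services.dnsmasq.settings.dhcp-option = [
  "option:router,10.12.14.1,10.12.14.254"
];
```

Note that client behavior with multiple default gateways varies; many OSes use only the first entry, so treat this as best-effort failover.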

Recommendation

Run secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for homelab.


NFS HA (H002)

Current State

H002 uses bcachefs with 2x replication across 5 disks. Replication protects against disk failure, but a failure of the host itself still makes the data unavailable.

Options

Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:

fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"           # Fail I/O instead of hanging forever (soft mounts can
                     # surface write errors to apps; prefer "hard" for
                     # critical writes)
    "timeo=50"       # Timeout: 50 deciseconds = 5 seconds
    "retrans=3"      # 3 retries before reporting an error
    "nofail"         # Don't fail boot if unavailable
  ];
};
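A complementary tweak, assuming systemd-based clients: mount on demand, so an unreachable H002 stalls only processes that actually touch the share, not boot or logins. These options merge into the same `fileSystems` entry:

```nix
# Sketch: on-demand automount for the NFS share.
fileSystems."/nfs/h002".options = [
  "x-systemd.automount"         # mount on first access, not at boot
  "x-systemd.idle-timeout=600"  # unmount after 10 minutes idle
];
```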

Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)

Overkill for homelab, but an option for critical data.

Recommendation

Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.


Implementation Roadmap

Phase 1: Quick Wins (This Week)

  1. Set up Cloudflare Tunnel on O001 AND H001
  2. Enable Tailscale Funnel for Jellyfin
  3. Automated backups for L001 Headscale

Phase 2: Core Resilience (This Month)

  1. DNS-01 ACME for shared certs
  2. Warm standby for Headscale
  3. Secondary AdGuard Home

Phase 3: Full Resilience (Next Quarter)

  1. Headscale HA with LiteFS (if needed)
  2. Automated failover testing
  3. Runbook documentation

Monitoring & Alerting

Essential for knowing when to failover:

# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)

Alert on:

  • Headscale API unreachable
  • nginx health check fails
  • DNS resolution fails
  • NFS mount fails
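Until full monitoring is in place, the first alert condition above can be approximated with a simple probe. A sketch, assuming Headscale's /health endpoint; the URL is a placeholder for the real control server:

```nix
# Sketch: poll the Headscale health endpoint every 5 minutes and log
# failures to the journal for later alert wiring.
systemd.services.headscale-probe = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.curl}/bin/curl -fsS --max-time 10 https://headscale.example.com/health \
      || echo "ALERT: headscale health check failed" >&2
  '';
};
systemd.timers.headscale-probe = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnUnitActiveSec = "5m";
};
```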