dotfiles/ideas/resilience.md

Infrastructure Resilience & Failover

Overview

This document covers strategies for eliminating single points of failure and improving infrastructure resilience.

Current Architecture

                    INTERNET
                        │
              ┌─────────┴─────────┐
              │                   │
        ┌─────▼─────┐      ┌──────▼──────┐
        │   O001    │      │    L001     │
        │  (Oracle) │      │   (Linode)  │
        │  nginx    │      │  Headscale  │
        │  +vault   │      │   (SPOF!)   │
        │  +atuin   │      └──────┬──────┘
        │  (SPOF!)  │             │
        └─────┬─────┘             │
              │         Tailscale Mesh
              │       ┌───────────┴───────────┐
              │       │                       │
        ┌─────▼───────▼─────┐          ┌──────▼──────┐
        │       H001        │          │    H003     │
        │  (Service Host)   │          │   (Router)  │
        │  Forgejo,Zitadel, │          │  AdGuard,   │
        │  LiteLLM,Trilium, │          │  DHCP,NAT   │
        │  NixArr,OpenWebUI │          │   (SPOF!)   │
        └─────────┬─────────┘          └─────────────┘
                  │ NFS
        ┌─────────▼─────────┐
        │       H002        │
        │   (NAS - bcachefs)│
        │  Media, Data      │
        └───────────────────┘

Critical Single Points of Failure

Host   Service             Impact if Down                       Recovery Time
L001   Headscale           ALL mesh connectivity                HIGH - must restore SQLite exactly
O001   nginx/Vaultwarden   All public access, password manager  MEDIUM
H003   DNS/DHCP/NAT        Entire LAN offline                   MEDIUM
H001   All services        Services down but recoverable        MEDIUM
H002   NFS                 Media unavailable                    LOW - bcachefs has replication

Reverse Proxy Resilience (O001)

Current Problem

O001 is a single point of failure for all public traffic:

  • No public access to any service if it dies
  • DNS still points to it after failure
  • ACME certs are only on that host

Solution Options

Option A: Cloudflare Tunnel

Pros:

  • No single server dependency
  • Run cloudflared on multiple hosts (H001 as backup)
  • Automatic failover between tunnel replicas
  • Built-in DDoS protection
  • No inbound ports needed

Cons:

  • Cannot stream media (Jellyfin) - violates Cloudflare ToS
  • Adds latency
  • Vendor dependency

Implementation:

# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
    # cloudflared requires a catch-all rule; unmatched hosts get a 404
    default = "http_status:404";
  };
};

Cloudflare automatically load balances across all active tunnel replicas.

Option B: DNS Failover with Health Checks

Use Cloudflare DNS with health checks:

  • Point joshuabell.xyz to both O001 and a backup
  • Cloudflare removes unhealthy IPs automatically
  • Requires Cloudflare paid plan for load balancing

Option C: Tailscale Funnel

Expose services directly without O001:

# On H001
tailscale funnel 443

Exposes H001 directly at its Tailscale-issued hostname, e.g. https://h001.<tailnet>.ts.net

Pros:

  • No proxy needed
  • Per-service granularity
  • Automatic HTTPS

Cons:

  • Uses ts.net domain (no custom domain)
  • Limited to ports 443, 8443, 10000
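To keep Funnel enabled declaratively rather than via a one-off command, a oneshot unit can re-assert it at boot. This is a sketch: the `--bg` flag assumes a reasonably recent tailscale CLI, and note that Funnel is brokered by Tailscale's control plane, so verify it actually works against a Headscale coordinator before relying on it.

```nix
# Sketch: re-assert Funnel on H001 at every boot.
# Assumes a tailscale CLI new enough to support "funnel --bg".
systemd.services.tailscale-funnel = {
  description = "Enable Tailscale Funnel for local port 443";
  after = [ "tailscaled.service" ];
  wants = [ "tailscaled.service" ];
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = "${pkgs.tailscale}/bin/tailscale funnel --bg 443";
};
```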

Option D: Manual Failover with Shared Config

Keep H001 ready to take over O001's role:

  1. Same nginx config via shared NixOS module
  2. Use DNS-01 ACME challenge (certs work on any host)
  3. Update DNS when O001 fails
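Step 2 can be sketched with the NixOS ACME module's lego-based DNS-01 support. The provider name, contact address, and secrets path below are assumptions for illustration; the key point is that a wildcard cert obtained this way works identically on O001 and H001.

```nix
# Sketch: DNS-01 wildcard cert deployable on either host.
# dnsProvider, email, and the secrets path are placeholders.
security.acme = {
  acceptTerms = true;
  defaults = {
    email = "admin@joshuabell.xyz";  # hypothetical contact address
    dnsProvider = "cloudflare";
    environmentFile = config.age.secrets.cloudflare-api.path;
  };
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    group = "nginx";  # let nginx read the cert
  };
};
```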
┌─────────────────────────────────────────────────────────────┐
│                   RECOMMENDED TOPOLOGY                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Cloudflare DNS (health-checked failover)                 │
│          │                                                  │
│   ┌──────┴──────┐                                          │
│   │             │                                          │
│   ▼             ▼                                          │
│  O001   ──OR── H001 (via Cloudflare Tunnel)               │
│  nginx         cloudflared backup                          │
│                                                             │
│   Jellyfin: Direct via Tailscale Funnel (bypasses O001)   │
│   Vaultwarden: Cloudflare Tunnel (survives O001 failure)  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Changes:

  1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
  2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
  3. Other services via Cloudflare Tunnel with H001 as backup

Headscale HA (L001)

The Problem

L001 running Headscale is the MOST CRITICAL SPOF:

  • If Headscale dies, existing connections keep working temporarily
  • NO NEW devices can connect
  • Devices that reboot cannot rejoin the mesh
  • Eventually all mesh connectivity degrades

Solution Options

Option 1: Frequent Backups (Minimum Viable)

my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};

Recovery time: ~30 minutes to spin up new VPS + restore

Option 2: Warm Standby

  • Run second Linode/VPS with Headscale configured but stopped
  • Daily rsync of /var/lib/headscale/ to standby
  • Update DNS to point to standby if primary fails

# Daily sync to standby
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/

Recovery time: ~5 minutes (start service, update DNS)
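The daily rsync above can be made declarative on the standby with a systemd timer. A minimal sketch, assuming root SSH access to l001 is already set up:

```nix
# Sketch: pull Headscale state onto the standby once a day.
# "--delete" mirrors the primary exactly, including removed files.
systemd.services.headscale-sync = {
  description = "Sync Headscale state from primary";
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.rsync}/bin/rsync -avz --delete \
      l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "daily";
    Persistent = true;  # run a missed sync after downtime
  };
};
```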

Option 3: Headscale HA with LiteFS

Headscale doesn't natively support HA, but you can use:

  • LiteFS for SQLite replication
  • Consul for leader election and failover

See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

Recovery time: ~15 seconds automatic failover

Option 4: Use Tailscale Commercial

Let Tailscale handle the control plane HA:

  • They manage availability
  • Keep Headscale for learning/experimentation
  • Critical services use Tailscale commercial

Recommendation

Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.


Router HA (H003)

The Problem

H003 is the network gateway:

  • AdGuard Home (DNS filtering)
  • dnsmasq (DHCP)
  • NAT firewall
  • If it dies, entire LAN loses connectivity

Solution Options

Option 1: Secondary DNS/DHCP

Run backup DNS on another host (H001 or H002):

  • Secondary AdGuard Home instance
  • Clients configured with both DNS servers
  • DHCP failover is trickier (consider Kea, ISC dhcpd's successor, which has a built-in HA hook)
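A minimal sketch of the secondary resolver on H001, plus handing both resolvers to clients via DHCP option 6 on H003. The upstream servers and IP addresses are placeholders, and `services.dnsmasq.settings` assumes a recent NixOS; mirror H003's real AdGuard settings in practice.

```nix
# Sketch (H001): standalone secondary AdGuard Home instance.
# Upstream is a placeholder; copy H003's filtering config.
services.adguardhome = {
  enable = true;
  settings = {
    dns = {
      bind_hosts = [ "0.0.0.0" ];
      port = 53;
      upstream_dns = [ "https://dns.quad9.net/dns-query" ];
    };
  };
};

# Sketch (H003): advertise both resolvers via DHCP.
# Addresses are hypothetical; use H003's and H001's LAN IPs.
services.dnsmasq.settings.dhcp-option = [
  "option:dns-server,10.12.14.1,10.12.14.2"
];
```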

Option 2: Keepalived for Router Failover

If you have two devices that could be routers:

services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER";  # or "BACKUP" on secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255;  # Lower on backup
    virtualIps = [{ addr = "10.12.14.1/24"; }];
  };
};

Option 3: Router Redundancy via ISP

  • Use ISP router as fallback gateway
  • Clients get two gateways via DHCP
  • Less control but automatic failover
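If dnsmasq is the DHCP server, the dual-gateway idea can be sketched as below (both addresses are assumptions for your actual gateways):

```nix
# Sketch: advertise H003 first, the ISP router as fallback (DHCP option 3).
services.dnsmasq.settings.dhcp-option = [
  "option:router,10.12.14.1,10.12.14.254"
];
```

Note that client behavior with multiple default gateways varies; many OSes use only the first entry, so treat this as best-effort failover.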

Recommendation

Run secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for homelab.


NFS HA (H002)

Current State

H002 uses bcachefs with 2x replication across 5 disks. Replication protects against disk failure, but a failure of the host itself still makes the data unavailable.

Options

Option 1: NFS Client Resilience

Configure NFS clients to handle server unavailability gracefully:

fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"           # Fail I/O instead of hanging forever (soft mounts can
                     # surface write errors to apps; prefer "hard" for
                     # critical writes)
    "timeo=50"       # Timeout: 50 deciseconds = 5 seconds
    "retrans=3"      # 3 retries before reporting an error
    "nofail"         # Don't fail boot if unavailable
  ];
};
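A complementary tweak, assuming systemd-based clients: mount on demand, so an unreachable H002 stalls only processes that actually touch the share, not boot or logins. These options merge into the same `fileSystems` entry:

```nix
# Sketch: on-demand automount for the NFS share.
fileSystems."/nfs/h002".options = [
  "x-systemd.automount"         # mount on first access, not at boot
  "x-systemd.idle-timeout=600"  # unmount after 10 minutes idle
];
```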

Option 2: Second NAS with GlusterFS

For true HA, run two NAS nodes with GlusterFS replication:

H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)

Overkill for homelab, but an option for critical data.

Recommendation

Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.


Implementation Roadmap

Phase 1: Quick Wins (This Week)

  1. Set up Cloudflare Tunnel on O001 AND H001
  2. Enable Tailscale Funnel for Jellyfin
  3. Automated backups for L001 Headscale

Phase 2: Core Resilience (This Month)

  1. DNS-01 ACME for shared certs
  2. Warm standby for Headscale
  3. Secondary AdGuard Home

Phase 3: Full Resilience (Next Quarter)

  1. Headscale HA with LiteFS (if needed)
  2. Automated failover testing
  3. Runbook documentation

Monitoring & Alerting

Essential for knowing when to failover:

# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)

Alert on:

  • Headscale API unreachable
  • nginx health check fails
  • DNS resolution fails
  • NFS mount fails
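Until full monitoring is in place, the first alert condition above can be approximated with a simple probe. A sketch, assuming Headscale's /health endpoint; the URL is a placeholder for the real control server:

```nix
# Sketch: poll the Headscale health endpoint every 5 minutes and log
# failures to the journal for later alert wiring.
systemd.services.headscale-probe = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.curl}/bin/curl -fsS --max-time 10 https://headscale.example.com/health \
      || echo "ALERT: headscale health check failed" >&2
  '';
};
systemd.timers.headscale-probe = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnUnitActiveSec = "5m";
};
```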