# Infrastructure Resilience & Failover
## Overview
This document covers strategies for eliminating single points of failure and improving infrastructure resilience.
## Current Architecture
```
              INTERNET
          ┌───────┴───────┐
          │               │
    ┌─────▼─────┐   ┌─────▼─────┐
    │   O001    │   │   L001    │
    │ (Oracle)  │   │ (Linode)  │
    │  nginx    │   │ Headscale │
    │  +vault   │   │  (SPOF!)  │
    │  +atuin   │   └─────┬─────┘
    │  (SPOF!)  │         │
    └─────┬─────┘         │
          │        Tailscale Mesh
          │   ┌───────────┴───────────┐
          │   │                       │
    ┌─────▼───▼─────────┐      ┌──────▼──────┐
    │       H001        │      │    H003     │
    │  (Service Host)   │      │  (Router)   │
    │ Forgejo, Zitadel, │      │  AdGuard,   │
    │ LiteLLM, Trilium, │      │  DHCP, NAT  │
    │ NixArr, OpenWebUI │      │   (SPOF!)   │
    └─────────┬─────────┘      └─────────────┘
              │ NFS
    ┌─────────▼─────────┐
    │       H002        │
    │  (NAS - bcachefs) │
    │    Media, Data    │
    └───────────────────┘
```
## Critical Single Points of Failure
| Host | Service | Impact if Down | Recovery Difficulty |
|------|---------|----------------|---------------|
| **L001** | Headscale | ALL mesh connectivity | HIGH - must restore SQLite exactly |
| **O001** | nginx/Vaultwarden | All public access, password manager | MEDIUM |
| **H003** | DNS/DHCP/NAT | Entire LAN offline | MEDIUM |
| **H001** | All services | Services down but recoverable | MEDIUM |
| **H002** | NFS | Media unavailable | LOW - bcachefs has replication |
---
## Reverse Proxy Resilience (O001)
### Current Problem
O001 is a single point of failure for all public traffic:
- No public access to any service if it dies
- DNS still points to it after failure
- ACME certs are only on that host
### Solution Options
#### Option A: Cloudflare Tunnel (Recommended Quick Win)
**Pros:**
- No single server dependency
- Run `cloudflared` on multiple hosts (H001 as backup)
- Automatic failover between tunnel replicas
- Built-in DDoS protection
- No inbound ports needed
**Cons:**
- Cannot stream media (Jellyfin) - violates Cloudflare ToS
- Adds latency
- Vendor dependency
**Implementation:**
```nix
# On BOTH O001 (primary) AND H001 (backup)
services.cloudflared = {
  enable = true;
  tunnels."joshuabell" = {
    credentialsFile = config.age.secrets.cloudflared.path;
    default = "http_status:404"; # required catch-all for unmatched hostnames
    ingress = {
      "chat.joshuabell.xyz" = "http://100.64.0.13:80";
      "git.joshuabell.xyz" = "http://100.64.0.13:80";
      "notes.joshuabell.xyz" = "http://100.64.0.13:80";
      "sec.joshuabell.xyz" = "http://100.64.0.13:80";
      "sso.joshuabell.xyz" = "http://100.64.0.13:80";
      "n8n.joshuabell.xyz" = "http://100.64.0.13:80";
      "blog.joshuabell.xyz" = "http://100.64.0.13:80";
    };
  };
};
```
Cloudflare automatically load balances across all active tunnel replicas.
#### Option B: DNS Failover with Health Checks
Use Cloudflare DNS with health checks:
- Point `joshuabell.xyz` to both O001 and a backup
- Cloudflare removes unhealthy IPs automatically
- Requires Cloudflare paid plan for load balancing
#### Option C: Tailscale Funnel
Expose services directly without O001:
```bash
# On H001
tailscale funnel 443
```
Exposes H001 directly at `https://h001.net.joshuabell.xyz`
**Pros:**
- No proxy needed
- Per-service granularity
- Automatic HTTPS
**Cons:**
- Uses `ts.net` domain (no custom domain)
- Limited to ports 443, 8443, 10000
#### Option D: Manual Failover with Shared Config
Keep H001 ready to take over O001's role:
1. Same nginx config via shared NixOS module
2. Use DNS-01 ACME challenge (certs work on any host)
3. Update DNS when O001 fails
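Step 2 could be sketched with the NixOS ACME module's DNS-01 support, so the same wildcard certificate can be issued on either host. The email, secret path, and Cloudflare provider below are illustrative assumptions:

```nix
# Hedged sketch: DNS-01 validation happens in DNS rather than over
# port 80, so any host with the API credentials can obtain the cert.
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@example.com"; # placeholder
  certs."joshuabell.xyz" = {
    extraDomainNames = [ "*.joshuabell.xyz" ];
    dnsProvider = "cloudflare";
    environmentFile = config.age.secrets.cloudflare-dns.path; # assumed secret
  };
};
```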
### Recommended Hybrid Approach
```
┌─────────────────────────────────────────────────────────────┐
│                    RECOMMENDED TOPOLOGY                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│          Cloudflare DNS (health-checked failover)           │
│                         │                                   │
│                  ┌──────┴──────┐                            │
│                  │             │                            │
│                  ▼             ▼                            │
│                O001   ──OR──  H001 (via Cloudflare Tunnel)  │
│               nginx           cloudflared backup            │
│                                                             │
│   Jellyfin:    Direct via Tailscale Funnel (bypasses O001)  │
│   Vaultwarden: Cloudflare Tunnel (survives O001 failure)    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
**Key Changes:**
1. Move Vaultwarden to Cloudflare Tunnel (survives O001 outage)
2. Jellyfin via Tailscale Funnel (no Cloudflare ToS issues)
3. Other services via Cloudflare Tunnel with H001 as backup
---
## Headscale HA (L001)
### The Problem
L001 running Headscale is the MOST CRITICAL SPOF:
- If Headscale dies, existing connections keep working temporarily
- NO NEW devices can connect
- Devices that reboot cannot rejoin the mesh
- Eventually all mesh connectivity degrades
### Solution Options
#### Option 1: Frequent Backups (Minimum Viable)
```nix
my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};
```
**Recovery time:** ~30 minutes to spin up new VPS + restore
#### Option 2: Warm Standby
- Run second Linode/VPS with Headscale configured but stopped
- Daily rsync of `/var/lib/headscale/` to standby
- Update DNS to point to standby if primary fails
```bash
# Daily sync to standby. Stop headscale first (or use `sqlite3 .backup`)
# so the SQLite database isn't copied mid-write.
rsync -avz l001:/var/lib/headscale/ standby:/var/lib/headscale/
```
**Recovery time:** ~5 minutes (start service, update DNS)
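The daily sync could be automated on the standby with a systemd timer; the `l001` SSH alias and oneshot shape are assumptions:

```nix
# Hypothetical sketch: pull Headscale state from L001 once a day
systemd.services.headscale-sync = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.rsync}/bin/rsync -avz l001:/var/lib/headscale/ /var/lib/headscale/
  '';
};
systemd.timers.headscale-sync = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "daily";
};
```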
#### Option 3: Headscale HA with LiteFS
Headscale doesn't natively support HA, but you can use:
- **LiteFS** for SQLite replication
- **Consul** for leader election and failover
See: https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/
**Recovery time:** ~15 seconds automatic failover
#### Option 4: Use Tailscale Commercial
Let Tailscale handle the control plane HA:
- They manage availability
- Keep Headscale for learning/experimentation
- Critical services use Tailscale commercial
### Recommendation
Start with Option 1 (backups) immediately, work toward Option 2 (warm standby) within a month.
---
## Router HA (H003)
### The Problem
H003 is the network gateway:
- AdGuard Home (DNS filtering)
- dnsmasq (DHCP)
- NAT firewall
- If it dies, entire LAN loses connectivity
### Solution Options
#### Option 1: Secondary DNS/DHCP
Run backup DNS on another host (H001 or H002):
- Secondary AdGuard Home instance
- Clients configured with both DNS servers
- DHCP failover is trickier (consider ISC DHCP with failover)
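A minimal sketch of the secondary instance on H001, leaving filter lists to be mirrored by hand via the web UI:

```nix
# Secondary resolver: clients list both H003 and H001 as DNS servers
services.adguardhome = {
  enable = true;
  mutableSettings = true; # configure via the web UI to mirror H003
};
```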
#### Option 2: Keepalived for Router Failover
If you have two devices that could be routers:
```nix
services.keepalived = {
  enable = true;
  vrrpInstances.router = {
    state = "MASTER"; # "BACKUP" on the secondary
    interface = "eth0";
    virtualRouterId = 1;
    priority = 255; # lower on the backup
    virtualIps = [ { addr = "10.12.14.1/24"; } ];
  };
};
```
#### Option 3: Router Redundancy via ISP
- Use ISP router as fallback gateway
- Clients get two gateways via DHCP
- Less control but automatic failover
### Recommendation
Run a secondary AdGuard Home on H001/H002 as minimum redundancy. Full router HA is complex for a homelab.
---
## NFS HA (H002)
### Current State
H002 uses bcachefs with 2x replication across 5 disks, which protects against disk failure, but a failure of the host itself still makes all data unavailable.
### Options
#### Option 1: NFS Client Resilience
Configure NFS clients to handle server unavailability gracefully:
```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "soft"      # return errors instead of hanging forever
    "timeo=50"  # 5-second timeout (timeo is in tenths of a second)
    "retrans=3" # 3 retries before reporting an error
    "nofail"    # don't fail boot if unavailable
  ];
};
```
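An alternative sketch is to automount the share, so boot never blocks on H002 and the mount only appears on first access (same export as above):

```nix
fileSystems."/nfs/h002" = {
  device = "100.64.0.3:/data";
  fsType = "nfs4";
  options = [
    "x-systemd.automount"        # mount lazily on first access
    "x-systemd.idle-timeout=600" # unmount after 10 minutes idle
    "soft"
    "nofail"
  ];
};
```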
#### Option 2: Second NAS with GlusterFS
For true HA, run two NAS nodes with GlusterFS replication:
```
H002 (bcachefs) ◄──── GlusterFS ────► H00X (bcachefs)
```
**Overkill for homelab**, but an option for critical data.
### Recommendation
Current bcachefs replication is adequate. Focus on offsite backups for truly irreplaceable data.
---
## Recommended Implementation Order
### Phase 1: Quick Wins (This Week)
1. [ ] Set up Cloudflare Tunnel on O001 AND H001
2. [ ] Enable Tailscale Funnel for Jellyfin
3. [ ] Automated backups for L001 Headscale
### Phase 2: Core Resilience (This Month)
4. [ ] DNS-01 ACME for shared certs
5. [ ] Warm standby for Headscale
6. [ ] Secondary AdGuard Home
### Phase 3: Full Resilience (Next Quarter)
7. [ ] Headscale HA with LiteFS (if needed)
8. [ ] Automated failover testing
9. [ ] Runbook documentation
---
## Monitoring & Alerting
Essential for knowing when to fail over:
```nix
# Uptime monitoring for critical services
services.uptime-kuma = {
  enable = true;
  # Monitor: Headscale, nginx, Vaultwarden, AdGuard
};

# Or use external monitoring (BetterStack, Uptime Robot)
```
Alert on:
- Headscale API unreachable
- nginx health check fails
- DNS resolution fails
- NFS mount fails
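
Those alerts could start as a simple systemd-timer probe before reaching for a full monitoring stack. The URLs below are placeholders, and Headscale's `/health` endpoint is an assumption to verify against its docs:

```nix
# Hypothetical sketch: the unit fails (visible in journalctl and to any
# alerting hooked on unit failure) if a critical endpoint stops answering
systemd.services.spof-healthcheck = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.curl}/bin/curl -fsS --max-time 5 https://headscale.example/health
    ${pkgs.curl}/bin/curl -fsS --max-time 5 https://vault.example/alive
  '';
};
systemd.timers.spof-healthcheck = {
  wantedBy = [ "timers.target" ];
  timerConfig.OnCalendar = "*:0/5"; # every five minutes
};
```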