dotfiles/ideas/service_backups.md

9.2 KiB

Service Backup Strategy

Overview

This document outlines the backup strategy for the NixOS fleet, covering critical data paths, backup tools, and recovery procedures.

Current State

No automated backups are running today. This is a critical gap.

Backup Topology

┌─────────────────────────────────────────────────────────────┐
│                    BACKUP TOPOLOGY                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   H001,H003,O001,L001 ──────► H002:/data/backups (primary) │
│                        └────► B2/S3 (offsite)              │
│                                                             │
│   H002 (NAS) ───────────────► B2/S3 (offsite only)         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Critical Paths by Host

L001 (Headscale) - HIGHEST PRIORITY

Path Description Size Priority
/var/lib/headscale/ SQLite DB with all node registrations Small CRITICAL
/var/lib/acme/ SSL certificates Small High

Impact if lost: ALL mesh connectivity fails - new connections fail, devices can't rejoin.

O001 (Oracle Gateway)

Path Description Size Priority
/var/lib/vaultwarden/ Password vault (encrypted) ~41MB CRITICAL
/var/lib/postgresql/ Atuin shell history ~226MB Medium
/var/lib/acme/ SSL certificates Small High

Impact if lost: All public access down, password manager lost.

H001 (Services)

Path Description Size Priority
/var/lib/forgejo/ Git repos + PostgreSQL Large CRITICAL
/var/lib/zitadel/ SSO database + config Medium CRITICAL
/var/lib/openbao/ Secrets vault Small CRITICAL
/bao-keys/ Vault unseal keys Tiny CRITICAL
/var/lib/trilium/ Notes database Medium High
/var/lib/opengist/ Gist data Small Medium
/var/lib/open-webui/ AI chat history Medium Low
/var/lib/n8n/ Workflows Medium Medium
/var/lib/acme/ SSL certificates Small High
/var/lib/nixarr/state/ Media manager configs Small Medium

Note: A 154GB backup exists at /var/lib/forgejo.tar.gz - this is manual and should be automated.

H003 (Router)

Path Description Size Priority
/var/lib/AdGuardHome/ DNS filtering config + stats Medium High
/boot/keyfile_nvme0n1p1 LUKS encryption key Tiny CRITICAL

WARNING: The LUKS keyfile must be stored separately in a secure location (e.g., Vaultwarden).

H002 (NAS)

Path Description Size Priority
/data/nixarr/media/ Movies, TV, music, books Very Large Low (replaceable)
/data/pinchflat/ YouTube downloads Large Low

Note: bcachefs already has 2x replication. Offsite backup is optional but recommended for irreplaceable data.

Why Restic?

  • Modern, encrypted, deduplicated backups
  • Native NixOS module: services.restic.backups
  • Multiple backend support (local, S3, B2, SFTP)
  • Incremental backups with deduplication
  • Easy pruning/retention policies

Shared Backup Module

Create a shared module at modules/backup.nix:

{ config, lib, pkgs, ... }:

with lib;
let
  cfg = config.my.backup;
in {
  options.my.backup = {
    enable = mkEnableOption "restic backups";
    paths = mkOption { type = types.listOf types.str; default = []; };
    exclude = mkOption { type = types.listOf types.str; default = []; };
    postgresBackup = mkOption { type = types.bool; default = false; };
  };

  config = mkIf cfg.enable {
    # PostgreSQL dumps before backup
    services.postgresqlBackup = mkIf cfg.postgresBackup {
      enable = true;
      location = "/var/backup/postgresql";
      compression = "zstd";
      startAt = "02:00:00";
    };

    services.restic.backups = {
      daily = {
        paths = cfg.paths ++ (optional cfg.postgresBackup "/var/backup/postgresql");
        exclude = cfg.exclude ++ [
          "**/cache/**"
          "**/Cache/**"
          "**/.cache/**"
          "**/tmp/**"
        ];
        
        # Primary: NFS to H002
        repository = "/nfs/h002/backups/${config.networking.hostName}";
        
        passwordFile = config.age.secrets.restic-password.path;
        initialize = true;
        
        pruneOpts = [
          "--keep-daily 7"
          "--keep-weekly 4"
          "--keep-monthly 6"
        ];
        
        timerConfig = {
          OnCalendar = "03:00:00";
          RandomizedDelaySec = "1h";
          Persistent = true;
        };
        
        backupPrepareCommand = ''
          # Ensure NFS is mounted
          mount | grep -q "/nfs/h002" || mount /nfs/h002
        '';
      };
      
      # Offsite to B2/S3 (less frequent)
      offsite = {
        paths = cfg.paths;
        repository = "b2:joshuabell-backups:${config.networking.hostName}";
        passwordFile = config.age.secrets.restic-password.path;
        environmentFile = config.age.secrets.b2-credentials.path;
        
        pruneOpts = [
          "--keep-daily 3"
          "--keep-weekly 2"
          "--keep-monthly 3"
        ];
        
        timerConfig = {
          OnCalendar = "weekly";
          Persistent = true;
        };
      };
    };
  };
}

Per-Host Configuration

L001 (Headscale)

my.backup = {
  enable = true;
  paths = [ "/var/lib/headscale" "/var/lib/acme" ];
};

O001 (Oracle)

my.backup = {
  enable = true;
  paths = [ "/var/lib/vaultwarden" "/var/lib/acme" ];
  postgresBackup = true;  # For Atuin
};

H001 (Services)

my.backup = {
  enable = true;
  paths = [
    "/var/lib/forgejo"
    "/var/lib/zitadel"
    "/var/lib/openbao"
    "/bao-keys"
    "/var/lib/trilium"
    "/var/lib/opengist"
    "/var/lib/open-webui"
    "/var/lib/n8n"
    "/var/lib/acme"
    "/var/lib/nixarr/state"
  ];
};

H003 (Router)

my.backup = {
  enable = true;
  paths = [ "/var/lib/AdGuardHome" ];
  # LUKS key backed up separately to Vaultwarden
};

Database Backup Best Practices

For Containerized PostgreSQL (Forgejo/Zitadel)

systemd.services.container-forgejo-backup = {
  script = ''
    nixos-container run forgejo -- pg_dumpall -U forgejo \
      | ${pkgs.zstd}/bin/zstd > /var/lib/forgejo/backups/db-$(date +%Y%m%d).sql.zst
  '';
  startAt = "02:30:00";  # Before restic runs at 03:00
};

For Direct PostgreSQL

services.postgresqlBackup = {
  enable = true;
  backupAll = true;
  location = "/var/backup/postgresql";
  compression = "zstd";
  startAt = "*-*-* 02:00:00";
};

Recovery Procedures

Restoring from Restic

# List snapshots
restic -r /path/to/repo snapshots

# Restore specific snapshot
restic -r /path/to/repo restore abc123 --target /restore

# Restore latest
restic -r /path/to/repo restore latest --target /restore

# Restore specific path
restic -r /path/to/repo restore latest \
  --target /restore \
  --include /var/lib/postgresql

# Mount for browsing
mkdir /mnt/restic
restic -r /path/to/repo mount /mnt/restic

PostgreSQL Recovery

# Stop PostgreSQL
systemctl stop postgresql

# Restore from restic
restic restore latest --target / --include /var/lib/postgresql

# Or from SQL dump
sudo -u postgres psql < /restore/all-databases.sql

# Start PostgreSQL
systemctl start postgresql

Backup Verification

Add automated verification:

systemd.timers.restic-verify = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "weekly";
    Persistent = true;
  };
};

systemd.services.restic-verify = {
  script = ''
    ${pkgs.restic}/bin/restic -r /path/to/repo check --read-data-subset=5%
  '';
};

Monitoring & Alerting

# Alert on backup failure
systemd.services."restic-backups-daily".serviceConfig.OnFailure = "notify-failure@%n.service";

systemd.services."notify-failure@" = {
  serviceConfig.Type = "oneshot";
  script = ''
    ${pkgs.curl}/bin/curl -X POST https://ntfy.sh/joshuabell-backups \
      -H "Title: Backup Failed" \
      -d "Service: %i on ${config.networking.hostName}"
  '';
};

Action Items

Immediate (This Week)

  • Set up restic backups for L001 (Headscale) - most critical
  • Back up H003's LUKS keyfile to Vaultwarden
  • Create /data/backups/ directory on H002

Short-Term (This Month)

  • Implement shared backup module
  • Deploy to all hosts
  • Set up offsite B2 bucket

Medium-Term

  • Automated backup verification
  • Monitoring/alerting integration
  • Test recovery procedures