Add migrating services guide and idea drafts, update flake.lock
parent 05f31b80c2
commit 78469896c8
6 changed files with 1779 additions and 3 deletions

ideas/impermanence_everywhere.md (new file, 456 lines)

# Impermanence Rollout Strategy

## Overview

This document covers rolling out impermanence (ephemeral root filesystem) to all hosts, using Juni as the template.

## What is Impermanence?

**Philosophy:** Root filesystem (`/`) is wiped on every boot (tmpfs or reset subvolume), forcing you to explicitly declare what state to persist.

**Benefits:**
- Clean system by default - no accumulated cruft
- Forces documentation of important state
- Easy rollback (just reboot)
- Security (ephemeral root limits persistence of compromises)
- Reproducible server state

## Current State

| Host | Impermanence | Notes |
|------|--------------|-------|
| Juni | ✅ Implemented | bcachefs with @root/@persist subvolumes |
| H001 | ❌ Traditional | Most complex - many services |
| H002 | ❌ Traditional | NAS - may not need impermanence |
| H003 | ❌ Traditional | Router - good candidate |
| O001 | ❌ Traditional | Gateway - good candidate |
| L001 | ❌ Traditional | Headscale - good candidate |

## Juni's Implementation (Reference)

### Filesystem Layout

```
bcachefs (5 devices, 2x replication)
├── @root       # Ephemeral - reset each boot
├── @nix        # Persistent - Nix store
├── @persist    # Persistent - bind mounts for state
└── @snapshots  # Automatic snapshots
```

### Boot Process

1. Create snapshot of @root before reset
2. Reset @root subvolume (or recreate)
3. Boot into clean system
4. Bind mount persisted paths from @persist
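
The first two steps above can be sketched as a small shell function. This is an illustration only, not Juni's actual boot script: the `POOL` mount point, the `root-<stamp>` snapshot naming, and the `DRY_RUN` switch are all assumptions made for the example. It defaults to printing the commands instead of running them.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the boot-time reset; not Juni's actual script.
# POOL is the mounted top level of the filesystem (an assumed path).
POOL="${POOL:-/mnt/pool}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

reset_root() {
  local stamp="$1"   # e.g. $(date +%Y%m%d-%H%M%S)
  # 1. Snapshot the old root so its contents can still be inspected later
  run bcachefs subvolume snapshot "$POOL/@root" "$POOL/@snapshots/root-$stamp"
  # 2. Drop the old root and recreate it empty
  run bcachefs subvolume delete "$POOL/@root"
  run bcachefs subvolume create "$POOL/@root"
  # Steps 3 and 4 (booting, bind mounts from @persist) are left to NixOS
}
```

Calling `reset_root "$(date +%Y%m%d-%H%M%S)"` with the default `DRY_RUN=1` only prints the three commands, which makes it safe to review the paths before wiring anything into the initrd.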

### Persisted Paths (Juni)

```nix
environment.persistence."/persist" = {
  hideMounts = true;

  directories = [
    "/var/log"
    "/var/lib/nixos"
    "/var/lib/systemd"
    "/var/lib/tailscale"
    "/var/lib/flatpak"
    "/etc/NetworkManager/system-connections"
  ];

  files = [
    "/etc/machine-id"
    "/etc/ssh/ssh_host_ed25519_key"
    "/etc/ssh/ssh_host_ed25519_key.pub"
    "/etc/ssh/ssh_host_rsa_key"
    "/etc/ssh/ssh_host_rsa_key.pub"
  ];

  users.josh = {
    directories = [
      ".ssh"
      ".gnupg"
      "projects"
      ".config"
      ".local/share"
    ];
  };
};
```

### Custom Tooling

Juni has `bcache-impermanence` with commands:
- `ls` - List snapshots
- `gc` - Garbage collect old snapshots
- `diff` - Show changes since last boot (auto-excludes persisted paths)

Retention policy: 5 recent + 1/week for 4 weeks + 1/month
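
One possible reading of that retention policy, sketched as a filter. This is not the `gc` implementation; it assumes snapshots are identified by `YYYY-MM-DD` dates, that "1/week for 4 weeks" means the newest snapshot in each of the four most recent ISO weeks that have any, and that "1/month" means the newest snapshot of every month. It needs bash 4+ and GNU `date`.

```bash
# Hypothetical helper: reads snapshot dates (YYYY-MM-DD, one per line) on
# stdin and prints the ones to keep, newest first.
select_snapshots_to_keep() {
  local day week month keep recent=0 weeks=0
  local -A seen_week=() seen_month=()
  sort -r -u | while IFS= read -r day; do
    keep=0
    # Rule 1: the 5 most recent snapshots are always kept
    if (( recent < 5 )); then
      keep=1
      recent=$((recent + 1))
    fi
    # Rule 2: newest snapshot of each of the 4 most recent ISO weeks
    week=$(date -d "$day" +%G-W%V)
    if [[ -z "${seen_week[$week]:-}" ]]; then
      seen_week[$week]=1
      if (( weeks < 4 )); then
        keep=1
        weeks=$((weeks + 1))
      fi
    fi
    # Rule 3: newest snapshot of every month
    month=${day:0:7}
    if [[ -z "${seen_month[$month]:-}" ]]; then
      seen_month[$month]=1
      keep=1
    fi
    if (( keep )); then
      echo "$day"
    fi
  done
}
```

Anything not printed is a candidate for deletion. Because the rules overlap (a recent snapshot also counts as its week's and month's representative), the kept set is usually smaller than the sum of the three quotas.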

---

## Common Pain Point: Finding What Needs Persistence

> "I often have issues adding new persistent layers and knowing what I need to add"

### Discovery Workflow

#### Method 1: Use the Diff Tool

Before rebooting after installing new software:

```bash
# On Juni
bcache-impermanence diff
```

This shows files created/modified outside persisted paths.

#### Method 2: Boot and Observe Failures

```bash
# After reboot, check for failures
journalctl -b | grep -i "no such file"
journalctl -b | grep -i "failed to"
journalctl -b | grep -i "permission denied"
```
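
The three greps can be folded into one pass so matching lines stay in journal order. A minimal sketch; the function name is made up for this example.

```bash
# Hypothetical helper: filter log text on stdin for the failure patterns
# that typically point at missing persisted state.
find_persistence_failures() {
  grep -Ei 'no such file|failed to|permission denied'
}
```

Usage would be `journalctl -b --no-pager | find_persistence_failures`. Keeping the filter separate from `journalctl` also lets it run against a saved log file.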

#### Method 3: Monitor File Changes

```bash
# Before making changes (sort by path so the two lists line up entry by entry)
find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -k2 > /tmp/before.txt

# After running services
find /var /etc -type f -printf '%T@ %p\n' 2>/dev/null | sort -k2 > /tmp/after.txt

# Compare: differing lines are new files or files with a new mtime
diff /tmp/before.txt /tmp/after.txt
```
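
Raw `diff` output mixes new and modified files. The sketch below compares two such listings by path and labels each entry; the function name and the `NEW`/`CHANGED` labels are inventions for this example, and it assumes the `<mtime> <path>` line format produced by the `find -printf '%T@ %p\n'` commands above.

```bash
# Hypothetical helper: $1 = before listing, $2 = after listing.
snapshot_changes() {
  awk '
    {
      mtime = $1
      path = substr($0, length($1) + 2)   # keep spaces inside the path
    }
    # First file: remember the mtime recorded for each path
    FILENAME == ARGV[1] { before[path] = mtime; next }
    # Second file: report paths that are new or whose mtime differs
    !(path in before) { print "NEW " path; next }
    (before[path] "") != (mtime "") { print "CHANGED " path }
  ' "$1" "$2"
}
```

For example, `snapshot_changes /tmp/before.txt /tmp/after.txt` lists only the paths worth reviewing. Mtimes are compared as strings so sub-second differences are not lost to floating-point rounding.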

#### Method 4: Service-Specific Patterns

Most services follow predictable patterns:

| Pattern | Example | Usually Needs Persistence |
|---------|---------|---------------------------|
| `/var/lib/${service}` | `/var/lib/postgresql` | Yes |
| `/var/cache/${service}` | `/var/cache/nginx` | Usually no |
| `/var/log/${service}` | `/var/log/nginx` | Optional |
| `/etc/${service}` | `/etc/nginx` | Only if runtime-generated |
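
The table can be encoded as a first-pass classifier for paths reported by the discovery methods. A minimal sketch; the function name and the one-word verdicts are made up here, and the result is a starting point for review, not a decision.

```bash
# Hypothetical helper: map a path to the table's recommendation.
classify_state_path() {
  case "$1" in
    /var/lib/*)   echo "persist" ;;    # service state: usually needed
    /var/cache/*) echo "skip" ;;       # caches: usually regenerated
    /var/log/*)   echo "optional" ;;   # logs: keep if history matters
    /etc/*)       echo "review" ;;     # only if generated at runtime
    *)            echo "unknown" ;;
  esac
}
```

For instance, `classify_state_path /var/lib/postgresql` prints `persist`, while anything outside the four known prefixes is reported as `unknown` and needs a manual look.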

---

## Server Impermanence Template

### Minimal Server Persistence

```nix
environment.persistence."/persist" = {
  hideMounts = true;

  directories = [
    # Core system
    "/var/lib/nixos" # NixOS state DB
    "/var/lib/systemd/coredump"
    "/var/log"

    # Network
    "/var/lib/tailscale"
    "/etc/NetworkManager/system-connections"

    # ACME certificates
    "/var/lib/acme"
  ];

  files = [
    "/etc/machine-id"
    "/etc/ssh/ssh_host_ed25519_key"
    "/etc/ssh/ssh_host_ed25519_key.pub"
    "/etc/ssh/ssh_host_rsa_key"
    "/etc/ssh/ssh_host_rsa_key.pub"
  ];
};
```

### Per-Host Additions

#### H001 (Services)

```nix
environment.persistence."/persist".directories = [
  # Add to minimal template:
  "/var/lib/forgejo"
  "/var/lib/zitadel"
  "/var/lib/openbao"
  "/bao-keys"
  "/var/lib/trilium"
  "/var/lib/opengist"
  "/var/lib/open-webui"
  "/var/lib/n8n"
  "/var/lib/nixarr/state"
  "/var/lib/containers" # Podman/container state
];
```

#### O001 (Gateway)

```nix
environment.persistence."/persist".directories = [
  # Add to minimal template:
  "/var/lib/vaultwarden"
  "/var/lib/postgresql"
  "/var/lib/fail2ban"
];
```

#### L001 (Headscale)

```nix
environment.persistence."/persist".directories = [
  # Add to minimal template:
  "/var/lib/headscale"
];
```

#### H003 (Router)

```nix
environment.persistence."/persist".directories = [
  # Add to minimal template:
  "/var/lib/AdGuardHome"
  "/var/lib/dnsmasq"
];

environment.persistence."/persist".files = [
  # Add to minimal template:
  "/boot/keyfile_nvme0n1p1" # LUKS key - CRITICAL
];
```

---

## Rollout Strategy

### Phase 1: Lowest Risk (VPS Hosts)

Start with L001 and O001:
- Easy to rebuild from scratch if something goes wrong
- Smaller state footprint
- Good practice before tackling complex hosts

**L001 Steps:**
1. Back up `/var/lib/headscale/`
2. Add impermanence module
3. Test on spare VPS first
4. Migrate

**O001 Steps:**
1. Back up Vaultwarden and PostgreSQL
2. Add impermanence module
3. Test carefully (Vaultwarden is critical!)
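
The backup step in both lists can be as simple as a verified tarball taken before the migration. The sketch below is one way to do it; the function name, the archive naming, and the destination directory are assumptions for the example.

```bash
# Hypothetical helper: archive one state directory and verify the result.
# $1 = directory to back up, $2 = destination directory, $3 = optional stamp
backup_state_dir() {
  local src="$1" dest_dir="$2" stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
  local name archive
  name=$(basename "$src")
  archive="$dest_dir/$name-$stamp.tar.gz"
  mkdir -p "$dest_dir"
  # -C keeps paths in the archive relative to the parent directory
  tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
  # Listing the archive is a cheap integrity check
  tar -tzf "$archive" > /dev/null || return 1
  echo "$archive"
}
```

On L001 this might be run as `backup_state_dir /var/lib/headscale /root/backups`. For PostgreSQL on O001, a file-level copy is only consistent while the service is stopped, so a logical dump (for example with `pg_dumpall`) is probably the safer companion to this.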

### Phase 2: Router (H003)

H003 is medium complexity:
- Relatively small state
- But critical for network (test during maintenance window)
- LUKS keyfile needs special handling

### Phase 3: Complex Host (H001)

H001 is most complex due to:
- Multiple containerized services
- Database state in containers
- Many stateful applications

**Approach:**
1. Inventory all state paths (see backup docs)
2. Test with snapshot before committing
3. Gradual rollout with extensive persistence list
4. May need to persist more than expected initially
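
For the inventory step, a quick listing of what already lives under `/var/lib` gives a first draft of the persistence list. A minimal sketch with a made-up function name; it only looks one level deep, so nested state (for example inside container volumes) still needs manual review.

```bash
# Hypothetical helper: list top-level state directories with their size.
# $1 = directory to inspect (defaults to /var/lib)
inventory_state_dirs() {
  local root="${1:-/var/lib}"
  # One line per directory: "<size in KiB><TAB><path>", largest first
  find "$root" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} + 2>/dev/null | sort -rn
}
```

Running `inventory_state_dirs` on H001 and comparing the result against the H001 list above should surface any service directory that has been missed.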

### Phase 4: NAS (H002) - Maybe Skip

H002 may not benefit from impermanence:
- Primary purpose is persistent data storage
- bcachefs replication already provides redundancy
- Impermanence adds complexity without clear benefit

---

## Filesystem Options

### Option A: bcachefs with Subvolumes (Like Juni)

**Pros:**
- Flexible, modern
- Built-in snapshots
- Replication support

**Setup:**
```nix
fileSystems = {
  "/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "bcachefs";
    options = [ "subvol=@root" ];
  };
  "/nix" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "bcachefs";
    options = [ "subvol=@nix" ];
  };
  "/persist" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "bcachefs";
    options = [ "subvol=@persist" ];
    neededForBoot = true;
  };
};
```

### Option B: BTRFS with Subvolumes

Similar to bcachefs but more mature:

```nix
# Reset @root on boot.
# postDeviceCommands runs in the scripted initrd only; a systemd-based
# initrd needs an equivalent unit instead.
boot.initrd.postDeviceCommands = lib.mkAfter ''
  mkdir -p /mnt
  mount -t btrfs -o subvol=/ /dev/disk/by-label/nixos /mnt
  # Delete fails if @root contains nested subvolumes; remove those first
  btrfs subvolume delete /mnt/@root
  btrfs subvolume create /mnt/@root
  umount /mnt
'';
```

### Option C: tmpfs Root

Simplest but uses RAM:

```nix
fileSystems."/" = {
  device = "none";
  fsType = "tmpfs";
  options = [ "defaults" "size=2G" "mode=755" ];
};
```

**Best for:** VPS hosts with limited disk but adequate RAM.

---

## Troubleshooting

### Service Fails After Reboot

```bash
# Check what's missing
journalctl -xeu servicename

# Common fixes:
# 1. Add /var/lib/servicename to persistence
# 2. Ensure directory permissions are correct
# 3. Check if service expects specific files in /etc
```
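
A quick way to confirm the first fix is to check whether the paths a service needs actually exist under the persistent store. The sketch below assumes the usual layout where a persisted `/var/lib/foo` is stored at `/persist/var/lib/foo`; the function name is made up for the example.

```bash
# Hypothetical helper: report paths that have no counterpart in the store.
# $1 = persistent store root (e.g. /persist), remaining args = paths to check
missing_in_persist() {
  local persist_root="$1"
  shift
  local path
  for path in "$@"; do
    if [ ! -e "$persist_root$path" ]; then
      echo "MISSING: $path"
    fi
  done
}
```

For example, `missing_in_persist /persist /var/lib/headscale /etc/machine-id` prints nothing when both are present in the store.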

### "No such file or directory" Errors

```bash
# Find what's missing
journalctl -b | grep "No such file"

# Add missing paths to persistence
```

### Slow Boot (Too Many Bind Mounts)

If you have many persisted paths, consider:
1. Consolidating related paths
2. Using symlinks instead of bind mounts for some paths
3. Persisting parent directories instead of many children
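
Options 1 and 3 amount to dropping every path that already sits under another persisted path. A minimal sketch of that consolidation, with a made-up function name; it reads one path per line and assumes absolute paths without embedded newlines.

```bash
# Hypothetical helper: remove paths that are nested under another listed path.
collapse_nested_paths() {
  local path kept=""
  # A trailing slash makes each parent sort directly before its children
  # and prevents /var/lib/foo from matching /var/lib/foo-bar.
  sed 's:/*$:/:' | LC_ALL=C sort -u | while IFS= read -r path; do
    if [ -n "$kept" ] && [[ "$path" == "$kept"* ]]; then
      continue   # already covered by the last kept parent
    fi
    kept="$path"
    echo "${path%/}"
  done
}
```

Feeding it the current `directories` list shows which entries are redundant because a parent is already persisted.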

### Container State Issues

Containers may have their own state directories:

```nix
# Within one module this list can only be defined once, so both cases share it
environment.persistence."/persist".directories = [
  # For NixOS containers
  "/var/lib/nixos-containers"

  # For Podman
  "/var/lib/containers/storage/volumes"
  # NOT overlay - that's regenerated
];
```

---

## Tooling Improvements

### Automated Discovery Script

Create a helper that runs periodically to detect unpersisted changes:

```bash
#!/usr/bin/env bash
# /usr/local/bin/impermanence-check

# Get list of persisted paths. `--raw` cannot print a list, so emit JSON.
# Entries may be plain strings or attrsets depending on the impermanence
# version; the `dirPath`/`directory` field names are an assumption to verify.
mapfile -t PERSISTED < <(
  nix eval --json \
    '.#nixosConfigurations.hostname.config.environment.persistence."/persist".directories' \
    --apply 'map (d: d.dirPath or d.directory or d)' 2>/dev/null | jq -r '.[]'
)

# Find modified files outside persisted paths
find / -xdev -type f -mmin -60 2>/dev/null | while IFS= read -r file; do
  is_persisted=false
  for path in "${PERSISTED[@]}"; do
    # Match on a directory boundary so /var/log does not cover /var/logfoo
    if [[ "$file" == "$path"/* ]]; then
      is_persisted=true
      break
    fi
  done
  if ! $is_persisted; then
    echo "UNPERSISTED: $file"
  fi
done
```

### Pre-Reboot Check

Add to your workflow:

```bash
# Before rebooting
bcache-impermanence diff # or custom script

# Review changes, add to persistence if needed, then reboot
```
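
The check can be turned into a gate that refuses to proceed while unpersisted changes remain. This sketch assumes the diff command prints nothing and exits 0 when there is nothing to report, which has not been verified for `bcache-impermanence diff`; the function name and return codes are made up for the example.

```bash
# Hypothetical helper: returns 0 if clean, 1 if changes exist, 2 on error.
# Any arguments replace the default diff command (useful for testing).
pre_reboot_check() {
  local changes
  if [ "$#" -eq 0 ]; then
    set -- bcache-impermanence diff
  fi
  changes=$("$@") || {
    echo "diff command failed; not safe to reboot" >&2
    return 2
  }
  if [ -n "$changes" ]; then
    printf 'Unpersisted changes found:\n%s\n' "$changes"
    return 1
  fi
  echo "No unpersisted changes; safe to reboot"
}
```

A typical use would be `pre_reboot_check && systemctl reboot`, so the reboot only happens once the list has been reviewed and emptied.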

---

## Action Items

### Immediate
- [ ] Document all state paths for each host (see backup docs)
- [ ] Create shared impermanence module in flake

### Phase 1 (L001/O001)
- [ ] Back up current state
- [ ] Add impermanence to L001
- [ ] Test thoroughly
- [ ] Roll out to O001

### Phase 2 (H003)
- [ ] Plan maintenance window
- [ ] Add impermanence to H003
- [ ] Verify LUKS key persistence

### Phase 3 (H001)
- [ ] Complete state inventory
- [ ] Test with extensive persistence list
- [ ] Gradual rollout