# Backup scheduling
Runbook for configuring the backup schedule across all production servers. The schedule follows Cheery’s requirements: daily incrementals at 23:00, weekly fulls every Sunday at 02:00, 90 days of daily retention, two years of weekly retention. “Unacceptable” was her word for what came before. This runbook is what acceptable looks like.
## Schedule overview

| Type | Time | Day | Retention |
|---|---|---|---|
| Daily incremental | 23:00 UTC | Every day | 90 days |
| Weekly full | 02:00 UTC | Sunday | 104 weeks (2 years) |
| Repository check | 04:00 UTC | First of month | n/a |
| DR sync to Nuremberg | 03:00 UTC | Monday | mirrors primary |
“23:00 Ankh-Morpork time” is UTC in practice; Ankh-Morpork does not observe summer time, a decision that Cheery noted was “at least consistent.”
## Cron configuration

Add the following to root’s crontab on each production server using `crontab -e`. (If you place the entries in `/etc/cron.d` instead, each line needs an additional user field between the schedule and the command.)
```shell
# Daily incremental backup at 23:00 UTC
0 23 * * * /opt/backup/restic-backup.sh >> /var/log/restic-backup.log 2>&1

# Weekly full backup on Sunday at 02:00 UTC
# Restic does not distinguish "full" from "incremental" in its internal model,
# but the Sunday run uses a different tag and longer retention policy.
0 2 * * 0 /opt/backup/restic-backup-weekly.sh >> /var/log/restic-backup.log 2>&1

# Monthly repository integrity check
0 4 1 * * /opt/backup/restic-check.sh >> /var/log/restic-backup.log 2>&1
```
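If `/etc/cron.d/restic-backup` is preferred over root’s crontab, the entries must use the system crontab format, which inserts a user field between the schedule and the command. A sketch of that variant (the file layout is an assumption, not confirmed by this runbook):

```
# /etc/cron.d/restic-backup -- system crontab format; note the "root"
# user field, which per-user crontabs (crontab -e) do not have.
0 23 * * * root /opt/backup/restic-backup.sh >> /var/log/restic-backup.log 2>&1
0 2  * * 0 root /opt/backup/restic-backup-weekly.sh >> /var/log/restic-backup.log 2>&1
0 4  1 * * root /opt/backup/restic-check.sh >> /var/log/restic-backup.log 2>&1
```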
## Weekly backup script
The weekly backup differs from the daily in its tags and retention settings. Create /opt/backup/restic-backup-weekly.sh:
```shell
#!/bin/bash
set -euo pipefail

export VAULT_ADDR="https://vault.golemtrust.am:8200"

# Assign before exporting so that `set -e` catches a failed Vault login;
# `export VAR=$(...)` would mask the command's exit status.
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$(cat /etc/vault/role-id)" \
    secret_id="$(cat /etc/vault/secret-id)")
export VAULT_TOKEN

export RESTIC_REPOSITORY="sftp://u123456@u123456.your-storagebox.de:23/$(hostname -s)"
RESTIC_PASSWORD=$(vault kv get -field=restic_backup_password kv/golemtrust/backup)
export RESTIC_PASSWORD
export RESTIC_SFTP_COMMAND="ssh -i /root/.ssh/backup_key -p 23 -o StrictHostKeyChecking=accept-new"

echo "$(date): Starting weekly backup of $(hostname -s)" >> /var/log/restic-backup.log

restic backup \
    --tag "$(hostname -s)" \
    --tag "weekly" \
    --exclude="/proc" \
    --exclude="/sys" \
    --exclude="/dev" \
    --exclude="/run" \
    --exclude="/tmp" \
    --exclude="/var/cache" \
    --exclude="/opt/vault/data" \
    / \
    >> /var/log/restic-backup.log 2>&1

restic forget \
    --tag weekly \
    --keep-weekly 104 \
    --prune \
    >> /var/log/restic-backup.log 2>&1

echo "$(date): Weekly backup complete" >> /var/log/restic-backup.log
```
Note that the weekly backup includes `/` as the root path rather than individual directories. This captures everything, including paths that the daily backup may exclude for space reasons. The `/opt/vault/data` exclusion remains in both scripts.
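The daily script is not reproduced in this runbook, but its retention step is assumed to mirror the weekly one with the 90-day window from the schedule table. A sketch under that assumption (the `daily` tag name is not confirmed by this runbook):

```shell
# Hypothetical retention step for the daily script: keep 90 daily
# snapshots, matching the 90-day retention in the schedule overview.
restic forget \
    --tag daily \
    --keep-daily 90 \
    --prune \
    >> /var/log/restic-backup.log 2>&1
```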
## Repository check script
Create /opt/backup/restic-check.sh:
```shell
#!/bin/bash
set -euo pipefail

export VAULT_ADDR="https://vault.golemtrust.am:8200"

VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$(cat /etc/vault/role-id)" \
    secret_id="$(cat /etc/vault/secret-id)")
export VAULT_TOKEN

export RESTIC_REPOSITORY="sftp://u123456@u123456.your-storagebox.de:23/$(hostname -s)"
RESTIC_PASSWORD=$(vault kv get -field=restic_backup_password kv/golemtrust/backup)
export RESTIC_PASSWORD
export RESTIC_SFTP_COMMAND="ssh -i /root/.ssh/backup_key -p 23 -o StrictHostKeyChecking=accept-new"

echo "$(date): Starting repository check for $(hostname -s)" >> /var/log/restic-backup.log

# Capture the exit code without tripping `set -e`: a bare `restic check`
# followed by `CHECK_EXIT=$?` would abort the script on failure, and the
# else branch below would never run.
CHECK_EXIT=0
restic check --read-data-subset=10% >> /var/log/restic-backup.log 2>&1 || CHECK_EXIT=$?

if [ "$CHECK_EXIT" -eq 0 ]; then
    vault kv put "kv/golemtrust/backup-status/$(hostname -s)" \
        last_check="$(date -Iseconds)" \
        check_status="ok"
    echo "$(date): Repository check passed" >> /var/log/restic-backup.log
else
    vault kv put "kv/golemtrust/backup-status/$(hostname -s)" \
        last_check="$(date -Iseconds)" \
        check_status="failed"
    echo "$(date): Repository check FAILED" >> /var/log/restic-backup.log
fi
```
`--read-data-subset=10%` reads and verifies a random 10% of the repository's pack files on each run. Because each month's sample is drawn independently, ten runs do not guarantee full coverage: some packs are read repeatedly while others are never touched. For deterministic coverage, restic also accepts the `n/t` form (e.g. `--read-data-subset=1/10` through `10/10`, rotated monthly). A full verification (`restic check --read-data`) is run annually during the restore testing exercise.
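A back-of-envelope check of why random sampling leaves gaps (a sketch, not part of the runbook): the chance a given pack is never read is (1 − 0.10) raised to the number of runs.

```shell
# Probability that a given pack file is never read when each monthly
# check independently samples a random 10% of the repository.
miss=$(awk 'BEGIN { printf "%.3f", (1 - 0.10) ^ 10 }')
echo "chance a pack is never read after 10 runs: ${miss}"  # 0.349
```

Roughly a third of the repository can remain unverified after ten random runs, which is why the annual `--read-data` pass still matters.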
## DR sync to Nuremberg
The Nuremberg Storage Box receives a weekly copy of all primary Storage Box contents. This sync runs from the metrics.golemtrust.am instance, which has SSH keys authorised on both Storage Boxes.
Create /opt/backup/dr-sync.sh on metrics.golemtrust.am:
```shell
#!/bin/bash
set -euo pipefail

PRIMARY_USER="u123456"
PRIMARY_HOST="u123456.your-storagebox.de"
DR_USER="u789012"
DR_HOST="u789012.your-storagebox.de"
# Local staging path on metrics.golemtrust.am; must be on a volume large
# enough to hold a full copy of the primary Storage Box contents.
STAGING_DIR="/var/backup-staging"

echo "$(date): Starting DR sync to Nuremberg" >> /var/log/backup.log

# rsync cannot copy directly between two remote hosts, so the sync runs
# in two stages: pull from the primary Storage Box, then push to Nuremberg.
mkdir -p "${STAGING_DIR}"

rsync -az --delete \
    -e "ssh -i /root/.ssh/backup_key -p 23" \
    "${PRIMARY_USER}@${PRIMARY_HOST}:/" \
    "${STAGING_DIR}/" \
    >> /var/log/backup.log 2>&1

rsync -az --delete \
    -e "ssh -i /root/.ssh/dr_sync_key -p 23" \
    "${STAGING_DIR}/" \
    "${DR_USER}@${DR_HOST}:/" \
    >> /var/log/backup.log 2>&1

echo "$(date): DR sync complete" >> /var/log/backup.log
```
Schedule it from root’s crontab on metrics.golemtrust.am:

```shell
# Weekly DR sync on Monday at 03:00 UTC
0 3 * * 1 /opt/backup/dr-sync.sh >> /var/log/backup.log 2>&1
```
## Staggering backups across servers
All servers’ cron entries fire at 23:00, but the actual start times are staggered by a short sleep at the top of each server’s backup script, based on the server’s position in the infrastructure list. This prevents all servers from saturating the Storage Box SFTP connection simultaneously. The delays are:

- auth.golemtrust.am: no delay (runs first)
- db.golemtrust.am: 5 minutes (`sleep 300`)
- graylog-1 through graylog-3: 10, 15, 20 minutes
- vault-1 through vault-3: 25, 30, 35 minutes
- Other servers: 40 minutes onwards, 5 minutes apart
At current backup sizes, each server’s daily incremental takes between two and ten minutes. The stagger is conservative; adjust it downwards once the actual backup durations are known from the first month of operation.
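The stagger table above can be sketched as a small helper sourced at the top of each backup script. The short hostnames and the catch-all delay are this sketch’s assumptions; adjust them to match the real infrastructure list.

```shell
# Hypothetical stagger helper: map this host's short name to a delay in
# seconds, following the stagger list in this section.
stagger_delay() {
    case "$1" in
        auth)      echo 0    ;;  # runs first
        db)        echo 300  ;;  # 5 minutes
        graylog-1) echo 600  ;;
        graylog-2) echo 900  ;;
        graylog-3) echo 1200 ;;
        vault-1)   echo 1500 ;;
        vault-2)   echo 1800 ;;
        vault-3)   echo 2100 ;;
        *)         echo 2400 ;;  # other servers: 40 minutes onwards
    esac
}

# At the top of the backup script:
#   sleep "$(stagger_delay "$(hostname -s)")"
```

Keeping the mapping in one function makes the later tuning pass a one-file change per server.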