Monitoring setup

Runbook for monitoring backup health. A backup that runs silently and fails silently is not a backup; it is a comfortable illusion. Cheery Littlebottom was specific: alert if a backup has not run in 24 hours, alert if a backup fails, alert if a restore test fails. This runbook implements all three.

Backup status in Vault KV

Each backup script writes its status to Vault’s KV engine after completing (see the restic deployment and scheduling runbooks). The key path is kv/golemtrust/backup-status/<hostname> with fields last_success, last_failure, and status.

A Prometheus exporter script reads these values and exposes them as metrics. Create /opt/backup/backup-metrics.sh:

#!/bin/bash
set -euo pipefail

export VAULT_ADDR="https://vault.golemtrust.am:8200"
# Assign before exporting so a failed AppRole login aborts under set -e
# (export VAR=$(cmd) would mask the command's exit status).
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
  role_id="$(cat /etc/vault/role-id)" \
  secret_id="$(cat /etc/vault/secret-id)")
export VAULT_TOKEN

METRICS_FILE="/var/lib/node_exporter/textfile_collector/backup.prom"
TMP_FILE="${METRICS_FILE}.tmp"
TIMESTAMP=$(date +%s)

for HOST in auth db graylog-1 graylog-2 graylog-3 \
            vault-1 vault-2 vault-3 metrics; do
  STATUS=$(vault kv get -field=status "kv/golemtrust/backup-status/${HOST}" 2>/dev/null || echo "unknown")
  LAST_SUCCESS=$(vault kv get -field=last_success "kv/golemtrust/backup-status/${HOST}" 2>/dev/null || echo "1970-01-01T00:00:00+00:00")
  LAST_SUCCESS_TS=$(date -d "$LAST_SUCCESS" +%s 2>/dev/null || echo "0")
  AGE_SECONDS=$((TIMESTAMP - LAST_SUCCESS_TS))

  echo "backup_last_success_age_seconds{host=\"${HOST}\"} ${AGE_SECONDS}"
  echo "backup_status{host=\"${HOST}\",status=\"${STATUS}\"} 1"
done > "$TMP_FILE"

# Rename atomically so node_exporter never scrapes a half-written file.
mv "$TMP_FILE" "$METRICS_FILE"

This script is run every 15 minutes via cron on metrics.golemtrust.am:

*/15 * * * * /opt/backup/backup-metrics.sh >> /var/log/backup-metrics.log 2>&1

The metrics are written to the node_exporter textfile collector directory, which node_exporter picks up and exposes to Prometheus automatically. Ensure node_exporter is configured with --collector.textfile.directory=/var/lib/node_exporter/textfile_collector on metrics.golemtrust.am.
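After a run, backup.prom contains one age series and one status series per host. The values below are illustrative, and the "ok" status string is an assumption about what the backup scripts write on success; substitute whatever value your scripts actually record:

```
backup_last_success_age_seconds{host="db"} 4312
backup_status{host="db",status="ok"} 1
backup_last_success_age_seconds{host="auth"} 5107
backup_status{host="auth",status="ok"} 1
```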

Prometheus alert rules

Create /etc/prometheus/rules/backup.yml:

groups:
  - name: backup
    interval: 15m
    rules:

      - alert: BackupNotRunIn24Hours
        expr: backup_last_success_age_seconds > 86400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Backup overdue for {{ $labels.host }}"
          description: >
            No successful backup for {{ $labels.host }} in the last 24 hours.
            Last success was {{ $value | humanizeDuration }} ago.
            Check /var/log/restic-backup.log on the affected host.

      - alert: BackupFailed
        expr: backup_status{status="failed"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.host }}"
          description: >
            The most recent backup attempt for {{ $labels.host }} failed.
            Check /var/log/restic-backup.log on the affected host.

      - alert: BackupStatusUnknown
        expr: backup_status{status="unknown"} == 1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup status unknown for {{ $labels.host }}"
          description: >
            No backup status has been recorded for {{ $labels.host }}.
            The backup script may never have run, or Vault KV may be unreachable.

Validate the rule file, then reload Prometheus:

promtool check rules /etc/prometheus/rules/backup.yml
systemctl reload prometheus

These alerts route through Alertmanager to Slack #infrastructure-alerts and to Cheery's email address. Cheery requested a direct email, independent of Slack, so add her address to the Alertmanager receiver for backup alerts.
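A sketch of the corresponding Alertmanager routing. This assumes the Slack webhook and global SMTP settings are already configured elsewhere in alertmanager.yml; the receiver name and email address are placeholders to adapt:

```yaml
# alertmanager.yml (fragment) -- route backup alerts to Slack and email.
route:
  routes:
    - matchers:
        - alertname =~ "Backup.*|RestoreTestOverdue"
      receiver: backup-alerts

receivers:
  - name: backup-alerts
    slack_configs:
      - channel: "#infrastructure-alerts"
    email_configs:
      - to: "cheery@golemtrust.am"   # placeholder address
```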

Grafana dashboard

Add a backup status panel to the existing Grafana infrastructure dashboard. The panel shows, for each server, the age of the last successful backup and a colour indicator (green for under 25 hours, amber for 25-48 hours, red for over 48 hours).

Query for the panel:

backup_last_success_age_seconds

Use the Stat visualisation with thresholds:

  • Green: 0 to 90000 (25 hours)

  • Amber: 90000 to 172800 (48 hours)

  • Red: 172800 and above
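The threshold values are just the hour marks converted to seconds:

```shell
# 25 hours and 48 hours in seconds -- the amber and red boundaries.
echo $((25 * 3600))   # 90000
echo $((48 * 3600))   # 172800
```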

Adora Belle reviews this panel on Monday mornings alongside the rest of the infrastructure status. Cheery reviews it daily.

Log rotation for backup logs

Backup logs accumulate on each server. Configure logrotate at /etc/logrotate.d/restic-backup:

/var/log/restic-backup.log {
    daily
    rotate 30
    compress
    missingok
    notifempty
    create 0640 root root
}

Thirty days of compressed backup logs are sufficient for debugging purposes. Failures are also recorded in the Vault KV status, which is more durable.

Alerting on restore test failures

Restore tests are run quarterly by Cheery (see the restore procedures and testing protocols runbooks). The test result is recorded in Vault:

vault kv put kv/golemtrust/backup-status/restore-test \
  last_test="$(date -Iseconds)" \
  result="passed" \
  host_tested="db.golemtrust.am" \
  time_to_restore_minutes="47" \
  notes="Merchants Guild database restored successfully to staging"

Add a Prometheus metric for the restore test age by appending the following to /opt/backup/backup-metrics.sh, after the host loop:

RESTORE_TEST=$(vault kv get -field=last_test kv/golemtrust/backup-status/restore-test 2>/dev/null || echo "1970-01-01T00:00:00+00:00")
RESTORE_TEST_TS=$(date -d "$RESTORE_TEST" +%s 2>/dev/null || echo "0")
RESTORE_AGE=$((TIMESTAMP - RESTORE_TEST_TS))
# Append to the metrics file rather than stdout, so the series is scraped.
echo "backup_restore_test_age_seconds ${RESTORE_AGE}" >> "$METRICS_FILE"

Add a Prometheus alert rule to the rules: list in /etc/prometheus/rules/backup.yml, indented to match the existing alerts:

      - alert: RestoreTestOverdue
        expr: backup_restore_test_age_seconds > 7776000
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Quarterly restore test is overdue"
          description: >
            The last restore test was {{ $value | humanizeDuration }} ago.
            Restore tests should be conducted quarterly. Cheery Littlebottom manages this.

7,776,000 seconds is 90 days. The alert fires when the most recent restore test is more than 90 days old.
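The arithmetic can be checked in the shell:

```shell
# 90 days in seconds -- the RestoreTestOverdue threshold.
echo $((90 * 86400))   # 7776000
```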