Prometheus deployment¶
Runbook for deploying Prometheus for metrics collection. Prometheus runs on a dedicated CX21 instance, metrics.golemtrust.am, and scrapes node exporters and application exporters from all production servers. Dr. Crucible deployed it alongside Graylog; Adora Belle reviewed the default dashboards and asked for more. There are now more.
Prerequisites¶
- A Hetzner CX21 instance running Debian 12 at metrics.golemtrust.am (10.0.2.10)
- DNS A records for metrics.golemtrust.am and grafana.golemtrust.am pointing to this instance (Grafana runs here too; see the Grafana dashboards runbook)
- Node exporter installed on all production servers (covered in this runbook)
- TLS certificate for grafana.golemtrust.am (Certbot with Cloudflare DNS)
Prometheus itself does not need a public-facing domain. It listens on 127.0.0.1 only; Grafana queries it locally. Prometheus’s own web UI is accessible via SSH tunnel if needed.
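When the Prometheus UI is needed, a local port forward is enough; for example (assuming SSH access to metrics.golemtrust.am):

```shell
# Forward local port 9090 to Prometheus on the metrics host,
# then browse http://localhost:9090 on the workstation.
ssh -L 9090:127.0.0.1:9090 metrics.golemtrust.am
```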
Prometheus installation¶
On metrics.golemtrust.am:
PROMETHEUS_VERSION="2.52.0"
wget "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
tar xzf "prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus" /usr/local/bin/
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool" /usr/local/bin/
rm -rf "prometheus-${PROMETHEUS_VERSION}.linux-amd64"*
useradd --system --no-create-home --shell /bin/false prometheus
mkdir -p /etc/prometheus /var/lib/prometheus
chown prometheus:prometheus /var/lib/prometheus
Create the configuration file at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    organisation: golemtrust

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets:
          - 'auth.golemtrust.am:9100'
          - 'db.golemtrust.am:9100'
          - 'graylog-1.golemtrust.am:9100'
          - 'graylog-2.golemtrust.am:9100'
          - 'graylog-3.golemtrust.am:9100'
          - 'vault-1.golemtrust.am:9100'
          - 'vault-2.golemtrust.am:9100'
          - 'vault-3.golemtrust.am:9100'
          - 'vault-transit.golemtrust.am:9100'
          - 'metrics.golemtrust.am:9100'

  - job_name: postgres
    static_configs:
      - targets:
          - 'db.golemtrust.am:9187'

  - job_name: keycloak
    metrics_path: /auth/realms/master/metrics
    static_configs:
      - targets: ['auth.golemtrust.am:8080']
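Before starting the service, it is worth validating the file with promtool (installed alongside the prometheus binary above):

```shell
# Exits non-zero and reports the offending section on a syntax error.
promtool check config /etc/prometheus/prometheus.yml
```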
Create the systemd unit at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network-online.target
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=90d \
--web.listen-address=127.0.0.1:9090
Restart=on-failure
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
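The configuration loads alert rules from /etc/prometheus/rules/, which starts empty. As a minimal sketch (the alert name, threshold, and wording here are illustrative, not from this runbook):

```yaml
# /etc/prometheus/rules/node.yml -- example only; adjust to taste
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        annotations:
          description: '{{ $labels.instance }} has been unreachable for 5 minutes.'
```

Validate rule files with promtool check rules and restart Prometheus (or send it SIGHUP) after adding them.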
Node exporter installation¶
Install on every production server. Run the following on each host:
NODE_EXPORTER_VERSION="1.8.0"
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xzf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
mv "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
rm -rf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64"*
useradd --system --no-create-home --shell /bin/false node_exporter
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
After=network-online.target
Wants=network-online.target
[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=10.0.0.X:9100
Restart=on-failure
[Install]
WantedBy=multi-user.target
Replace 10.0.0.X with the node’s private IP. Node exporter should listen only on the private network interface, not on the public IP.
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
Repeat on every server listed in the Prometheus node scrape config.
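A quick check that an exporter is answering on the private interface (substitute the host's private IP, as in the unit file above):

```shell
# Should print the node_exporter_build_info metric if the exporter is up.
curl -s http://10.0.0.X:9100/metrics | grep node_exporter_build_info
```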
PostgreSQL exporter¶
Install on db.golemtrust.am:
PG_EXPORTER_VERSION="0.15.0"
wget "https://github.com/prometheus-community/postgres_exporter/releases/download/v${PG_EXPORTER_VERSION}/postgres_exporter-${PG_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xzf "postgres_exporter-${PG_EXPORTER_VERSION}.linux-amd64.tar.gz"
mv "postgres_exporter-${PG_EXPORTER_VERSION}.linux-amd64/postgres_exporter" /usr/local/bin/
Create a PostgreSQL user for the exporter:
sudo -u postgres psql -c "CREATE USER prometheus WITH PASSWORD '<generate and store in Vaultwarden>';"
sudo -u postgres psql -c "GRANT pg_monitor TO prometheus;"
Create /etc/systemd/system/postgres_exporter.service:
[Unit]
Description=PostgreSQL Exporter
After=network-online.target postgresql.service
Wants=network-online.target
[Service]
User=postgres
Environment=DATA_SOURCE_NAME=postgresql://prometheus:<password>@localhost:5432/postgres?sslmode=disable
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=10.0.0.3:9187
Restart=on-failure
[Install]
WantedBy=multi-user.target
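Enable and start the exporter the same way as the other services, then confirm it can reach PostgreSQL:

```shell
systemctl daemon-reload
systemctl enable postgres_exporter
systemctl start postgres_exporter

# pg_up should be 1 if the exporter can connect to PostgreSQL.
curl -s http://10.0.0.3:9187/metrics | grep '^pg_up'
```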
Alertmanager¶
Alertmanager handles routing of Prometheus alerts to notification channels. Install on metrics.golemtrust.am:
AM_VERSION="0.27.0"
wget "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xzf "alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
mv "alertmanager-${AM_VERSION}.linux-amd64/alertmanager" /usr/local/bin/
mv "alertmanager-${AM_VERSION}.linux-amd64/amtool" /usr/local/bin/
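Create a service user and directories, following the same pattern as Prometheus above (the user and directory names here are assumptions mirroring that layout):

```shell
useradd --system --no-create-home --shell /bin/false alertmanager
mkdir -p /etc/alertmanager /var/lib/alertmanager
chown alertmanager:alertmanager /var/lib/alertmanager
```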
Create /etc/alertmanager/alertmanager.yml:
route:
  receiver: slack-default
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack-default
    slack_configs:
      - api_url: '<Slack webhook URL from Vaultwarden>'
        channel: '#infrastructure-alerts'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
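The runbook does not show a unit file for Alertmanager; a sketch mirroring the Prometheus unit, assuming an alertmanager user and the directories above:

```ini
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target
Wants=network-online.target

[Service]
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Validate the config with amtool check-config /etc/alertmanager/alertmanager.yml, then systemctl daemon-reload, systemctl enable alertmanager, systemctl start alertmanager, as with the other services.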
Prometheus infrastructure alerts route to #infrastructure-alerts. Security alerts from Graylog route to #security-alerts. These are separate channels so that infrastructure noise does not drown out security signals.
Verification¶
Check that Prometheus can reach all targets:
curl -s http://127.0.0.1:9090/api/v1/targets | python3 -m json.tool | grep '"health"'
All targets should show "health": "up". A target showing "health": "down" indicates a scrape failure; check that the exporter is running on that host and that the firewall permits the exporter's port (TCP 9100 for node exporter, 9187 for the PostgreSQL exporter, 8080 for Keycloak) from metrics.golemtrust.am to the target server's private IP.
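To list only the failing targets together with their scrape errors, rather than grepping for health, something like this works (assuming python3 is present on the host, as used above):

```shell
curl -s http://127.0.0.1:9090/api/v1/targets \
  | python3 -c 'import json,sys; [print(t["scrapeUrl"], "-", t.get("lastError","")) for t in json.load(sys.stdin)["data"]["activeTargets"] if t["health"] != "up"]'
```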