Grafana dashboards

Runbook for deploying Grafana and configuring the dashboards that Adora Belle wants. Grafana runs on the same metrics.golemtrust.am instance as Prometheus. It is accessible at https://grafana.golemtrust.am. Adora Belle checks it every morning before her first cigarette. The dashboards should be ready before she arrives; they always are.

Installation

On metrics.golemtrust.am:

wget -q -O - https://packages.grafana.com/gpg.key | \
  gpg --dearmor | tee /usr/share/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] \
  https://packages.grafana.com/oss/deb stable main" \
  | tee /etc/apt/sources.list.d/grafana.list
apt update && apt install -y grafana

Configuration

Edit /etc/grafana/grafana.ini. Key settings:

[server]
domain = grafana.golemtrust.am
root_url = https://grafana.golemtrust.am/
serve_from_sub_path = false

[security]
admin_user = admin
admin_password = <generate and store in Vaultwarden>
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict

[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer

[auth.anonymous]
enabled = false

[smtp]
enabled = true
host = smtp.fastmail.com:465
user = alerts@golemtrust.am
password = <retrieve from Vaultwarden, collection: Infrastructure>
from_address = alerts@golemtrust.am

allow_sign_up = false means only the admin can create accounts. New team members request access from Adora Belle or Carrot.

systemctl enable grafana-server
systemctl start grafana-server

Nginx reverse proxy

Create /etc/nginx/sites-available/grafana:

server {
    listen 80;
    server_name grafana.golemtrust.am;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name grafana.golemtrust.am;

    ssl_certificate /etc/letsencrypt/live/grafana.golemtrust.am/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.golemtrust.am/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
certbot certonly --dns-cloudflare --dns-cloudflare-credentials /etc/cloudflare.ini \
  -d grafana.golemtrust.am --email crucible@golemtrust.am --agree-tos --non-interactive
ln -s /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

Adding Prometheus as a data source

Log in at https://grafana.golemtrust.am with the admin credentials. Navigate to Connections, then Data sources, then Add data source.

  • Type: Prometheus

  • Name: Prometheus

  • URL: http://127.0.0.1:9090

  • Scrape interval: 15s

  • Save and test

The test should return a green confirmation. If it fails, check that Prometheus is running and listening on 127.0.0.1:9090.

Dashboards

Import dashboards from the Grafana dashboard library using the dashboard IDs below. Navigate to Dashboards, then Import, enter the ID, and load.

Node overview (ID 1860): system-level metrics for all servers. CPU, memory, disk, and network for each host. Adora Belle uses this to watch overall infrastructure health.

PostgreSQL (ID 9628): query throughput, connection counts, replication lag, index usage. Dr. Crucible added this after the first slow-query incident.

Graylog (use the dashboard exported from a running Graylog instance via System, then Content Packs): message throughput per stream, processing time, journal fill level. Useful for knowing whether Graylog is keeping up with log ingestion.

After importing, configure each dashboard’s data source to use the Prometheus data source added above. Set the default time range to Last 24 hours and the refresh interval to 1m.

Custom dashboards

Security overview

This dashboard was built by Angua and does not have a Grafana library ID; it was created directly. To recreate it:

Create a new dashboard and add the following panels.

Panel: Failed authentication rate

  • Query: rate(graylog_stream_messages_total{stream="authentication-events"}[5m])

  • Visualisation: Time series

  • Title: Authentication events per minute

Panel: Active HTTP 5xx errors

  • Query: sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) by (host)

  • Visualisation: Time series

Panel: Vault token issuance rate

  • Query: rate(vault_token_creation_total[5m])

  • Visualisation: Stat

Panel: Top source IPs (last hour)

  • This panel uses a Graylog data source rather than Prometheus. Add Graylog as a data source using the Graylog Grafana plugin and configure it to query the Web access stream. Dr. Crucible installed the plugin; if it is absent, grafana-cli plugins install graylog-datasource.

Save the dashboard and pin it to the home screen. Adora Belle should see it immediately on login.

Alerting from Grafana

Grafana alerting is used for threshold-based infrastructure alerts that complement Graylog’s log-based security alerts. Navigate to Alerting, then Alert rules.

Create an alert for high CPU utilisation:

  • Name: High CPU utilisation

  • Query A: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  • Condition: is above 90

  • For: 5m (must be above threshold for 5 minutes before alerting)

  • Contact point: create a Slack contact point using the #infrastructure-alerts webhook from Vaultwarden

Create an alert for low disk space:

  • Name: Low disk space

  • Query A: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

  • Condition: is below 15

  • For: 10m

  • Contact point: Slack #infrastructure-alerts and email

These alert on infrastructure problems. Security alerts live in Graylog. Keep the two systems focused on their respective domains; mixing them creates noise and confusion about which system to check when something is wrong.