Alert configuration¶

Runbook for configuring Graylog alerts. An alert is a notification triggered when a condition on a stream is met. Angua wrote the first set of alert conditions during the week after the Seamstresses’ Guild breach. The attacker returned seven days later and was caught in 90 seconds. These are those alerts.

Notification channels¶

Alerts are sent via three channels in order of escalation:

Slack, to the #security-alerts channel, immediately
Email, to security@golemtrust.am, immediately
SMS, to Angua’s and Carrot’s phones, as an escalation after 10 minutes (the timer runs inside Graylog and does not read Slack acknowledgements; see Acknowledging alerts)

Slack notification¶

Navigate to Alerts, then Notifications, then Create Notification.

Type: Slack Notification
Title: Security alerts Slack
Webhook URL: retrieve from Vaultwarden, collection: Infrastructure, item: Graylog Slack webhook
Channel: #security-alerts
Icon emoji: leave blank
Message: use the default template; it includes the alert title, description, and a link to the Graylog search
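Before pasting the webhook URL into Graylog, it can help to confirm that it posts to #security-alerts at all. The following is a minimal sketch, assuming the URL is a standard Slack incoming webhook that accepts a JSON body with a `text` field. The function names are illustrative and not part of any Golem Trust tooling.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_test(webhook_url: str) -> int:
    """Post a test message and return the HTTP status code."""
    req = build_slack_request(webhook_url, "Graylog webhook test, please ignore")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Call `send_slack_test(url)` from an interactive session with the URL retrieved from Vaultwarden, so the secret never lands in a file or shell history.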

Email notification¶

Type: Email Notification
Title: Security alerts email
Recipients: security@golemtrust.am
Subject: [Graylog] ${event_definition_title}
Body: use the default template

Email requires Graylog’s mail configuration to be set. In /etc/graylog/server/server.conf:

transport_email_enabled = true
transport_email_hostname = smtp.fastmail.com
transport_email_port = 465
transport_email_use_ssl = true
transport_email_use_tls = false
transport_email_auth_username = alerts@golemtrust.am
transport_email_auth_password = <retrieve from Vaultwarden, collection: Infrastructure>
transport_email_from_email = alerts@golemtrust.am

Restart Graylog after changing this configuration: systemctl restart graylog-server
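If no email arrives after the restart, it is useful to rule out the SMTP settings separately from Graylog. The sketch below is one way to do that under the assumption that the values above are correct: it opens an implicit-TLS connection on port 465 (the equivalent of `transport_email_use_ssl = true`), logs in, and sends one message. The function names are hypothetical.

```python
import smtplib
import ssl
from email.message import EmailMessage

SMTP_HOST = "smtp.fastmail.com"
SMTP_PORT = 465
SENDER = "alerts@golemtrust.am"
RECIPIENT = "security@golemtrust.am"


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Create a plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "[Graylog] SMTP configuration test"
    msg.set_content("Sent outside Graylog to verify the SMTP settings.")
    return msg


def send_test_message(password: str) -> None:
    """Log in over implicit TLS and send the test message."""
    msg = build_test_message(SENDER, RECIPIENT)
    context = ssl.create_default_context()
    # Port 465 expects TLS from the first byte, not STARTTLS.
    with smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT, context=context, timeout=15) as smtp:
        smtp.login(SENDER, password)
        smtp.send_message(msg)
```

If this succeeds but Graylog still sends nothing, the problem is more likely in the notification definition or in server.conf not having been reloaded.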

SMS escalation¶

SMS escalation is handled by a Graylog HTTP (webhook) notification that calls a small Flask service running on each Graylog node, at http://127.0.0.1:9999/sms-escalate. The service sends the SMS through the Fastmail SMS API. Dr. Crucible wrote it; the code is in src/graylog-sms-escalate/ in the internal repository.
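For orientation only, here is a sketch of the kind of step such a service has to perform: turning the JSON body that Graylog posts into a short SMS text. This is not Dr. Crucible's code. The field names (`event_definition_title`, `event.message`, `event.priority`) are assumptions about the HTTP notification payload and should be checked against a captured request, and `sms_text_from_payload` is a hypothetical helper.

```python
def sms_text_from_payload(payload: dict, max_len: int = 160) -> str:
    """Format an SMS body from an assumed Graylog HTTP notification payload."""
    title = payload.get("event_definition_title", "Unknown alert")
    event = payload.get("event") or {}
    message = event.get("message", "")
    priority = event.get("priority", "?")
    text = f"[Graylog P{priority}] {title}: {message}"
    # Keep to a single SMS segment; truncate with an ellipsis if needed.
    if len(text) <= max_len:
        return text
    return text[: max_len - 3] + "..."
```

A Flask route for /sms-escalate would parse the request JSON, pass it to a function like this, and hand the result to whatever sends the SMS.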

Alert conditions¶

Navigate to Alerts, then Event Definitions, then Create Event Definition.

Failed authentication spike¶

Triggers when more than 10 failed authentication attempts arrive from a single source IP within 5 minutes.

Title: Failed authentication spike
Priority: High
Condition type: Aggregation
Stream: Authentication events
Search query: message:("authentication failure" OR "invalid credentials")
Group by: client_ip
Aggregation: Count, greater than 10
Time range: 5 minutes
Execute every: 1 minute
Grace period: 5 minutes (do not re-alert for the same IP within 5 minutes of the last alert)
Notifications: Slack, Email
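The aggregation above amounts to a grouped count over a sliding window. The sketch below restates that logic in Python so the threshold semantics are unambiguous ("greater than 10" means the eleventh failure triggers). It is an illustration of the rule, not of how Graylog evaluates it internally, and `spiking_ips` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def spiking_ips(events, now, window=timedelta(minutes=5), threshold=10):
    """Return {client_ip: count} for IPs with more than `threshold`
    failures in the `window` ending at `now`.

    `events` is an iterable of (timestamp, client_ip) tuples.
    """
    start = now - window
    counts = Counter(ip for ts, ip in events if start < ts <= now)
    return {ip: n for ip, n in counts.items() if n > threshold}
```

Because the rule runs every minute over a 5-minute window, the same burst is seen by up to five consecutive runs; the grace period is what prevents five alerts for it.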

Unusual hours portal access¶

Triggers on any access to customer portals between 02:00 and 04:00 UTC with a non-browser user agent.

Title: Out-of-hours portal access with scripted user agent
Priority: High
Condition type: Filter and Count
Stream: Customer portals
Search query: user_agent:("curl" OR "wget" OR "python" OR "Go-http-client") AND hour:[2 TO 3]
Count, greater than 0
Time range: 10 minutes
Execute every: 5 minutes
Grace period: 30 minutes
Notifications: Slack, Email, SMS

This is the rule that would have caught the Seamstresses’ Guild attacker on day one instead of day three. The attacker’s access was consistent: between 02:30 and 03:45 UTC, using curl. Angua noticed it in the raw log file. The rule notices it automatically.
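The rule's intent can be written as a small predicate, which is handy for checking old log lines against it by hand. This is a sketch under two assumptions: the window is treated as half-open (02:00 inclusive, 04:00 exclusive), and user agents are matched by case-insensitive substring, which is looser than the Graylog query. The names are illustrative.

```python
from datetime import datetime, time, timezone

SCRIPTED_AGENTS = ("curl", "wget", "python", "go-http-client")


def is_scripted_out_of_hours(ts: datetime, user_agent: str) -> bool:
    """True if a timezone-aware timestamp falls in 02:00-04:00 UTC
    and the user agent looks like a script rather than a browser."""
    utc_time = ts.astimezone(timezone.utc).time()
    in_window = time(2, 0) <= utc_time < time(4, 0)
    agent = user_agent.lower()
    scripted = any(token in agent for token in SCRIPTED_AGENTS)
    return in_window and scripted
```

The attacker's pattern described above (curl, 02:30 to 03:45 UTC) satisfies both conditions for every request.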

Source country anomaly¶

Triggers when an IP from outside the expected countries appears in the Authentication events stream.

Title: Authentication from unexpected country
Priority: Medium
Condition type: Filter and Count
Stream: Authentication events
Search query: _exists_:gl2_remote_ip_country AND NOT gl2_remote_ip_country:(GB OR DE OR NL OR AT OR CH OR AM)
Count, greater than 0
Time range: 5 minutes
Execute every: 5 minutes
Grace period: 60 minutes
Notifications: Slack, Email

The country list reflects Golem Trust’s customer base. Update it when new customers in other countries are onboarded. The gl2_remote_ip_country field is populated by Graylog’s GeoIP lookup; see the alert tuning runbook for how to enable GeoIP.
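When reading or adjusting the query, keep in mind that a message can carry no country value at all (for example a private address, or a lookup that failed). The sketch below separates that case from a genuine mismatch. It assumes messages are available as dictionaries keyed by field name, and `classify_country` is a hypothetical helper.

```python
EXPECTED_COUNTRIES = frozenset({"GB", "DE", "NL", "AT", "CH", "AM"})


def classify_country(message: dict) -> str:
    """Return 'expected', 'unexpected', or 'unknown' for one log message."""
    code = message.get("gl2_remote_ip_country")
    if not code:
        # No GeoIP result: neither on the list nor off it.
        return "unknown"
    return "expected" if code.upper() in EXPECTED_COUNTRIES else "unexpected"
```

Only the "unexpected" case corresponds to what this alert is meant to report.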

High error rate¶

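Triggers when the Web access stream records more than 50 HTTP 5xx responses within 5 minutes. The count is taken across all web-facing services combined; there is no per-service grouping.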

Title: Elevated HTTP 5xx rate
Priority: Medium
Condition type: Aggregation
Stream: Web access
Search query: response_code:[500 TO 599]
Aggregation: Count, greater than 50
Time range: 5 minutes
Execute every: 2 minutes
Grace period: 10 minutes
Notifications: Slack

Security event flood¶

Triggers when the Security events stream receives an unusually high number of messages, which may indicate an attack or a failing system generating log noise.

Title: Security event flood
Priority: High
Condition type: Aggregation
Stream: Security events
Aggregation: Count, greater than 500
Time range: 5 minutes
Execute every: 1 minute
Grace period: 15 minutes
Notifications: Slack, Email, SMS

Testing alerts¶

After creating each alert, trigger it manually to confirm the notification channels are working.

For the failed authentication spike, there are two options in the Graylog UI. Navigate to Search, find a recent authentication failure event, and use the Send to Event Definition option if your version offers it. Otherwise, temporarily lower the threshold to 1 event, wait for a natural occurrence, and restore the threshold afterwards.

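Alternatively, Angua’s preferred method: trigger intentional failed logins against the Keycloak staging environment (not production) and observe whether the alert fires and notifications arrive. The rule needs more than 10 failures from one source IP within 5 minutes, so send at least 11 unless the threshold has been lowered. Check every channel attached to the alert: Slack and Email for this rule, plus SMS for the rules that list it.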
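The failed logins can be scripted rather than typed. The following is a sketch only, with several assumptions that must be checked first: that staging exposes the standard OpenID Connect token endpoint at `/realms/<realm>/protocol/openid-connect/token` (older Keycloak releases prefix this with `/auth`), that the chosen client permits the password grant, and that the base URL, realm, client, and username passed in are ones you supply. None of these names come from the runbook.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_failed_login_request(base_url, realm, client_id, username):
    """Build a password-grant request that is guaranteed to be rejected."""
    token_url = (
        f"{base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    )
    form = urllib.parse.urlencode(
        {
            "grant_type": "password",
            "client_id": client_id,
            "username": username,
            "password": "deliberately-wrong",
        }
    ).encode("ascii")
    return urllib.request.Request(token_url, data=form, method="POST")


def fire_failed_logins(base_url, realm, client_id, username, attempts=11):
    """Send `attempts` bad logins and return the HTTP status of each."""
    statuses = []
    for _ in range(attempts):
        req = build_failed_login_request(base_url, realm, client_id, username)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                statuses.append(resp.status)
        except urllib.error.HTTPError as err:
            # A rejection is the expected outcome here.
            statuses.append(err.code)
    return statuses
```

Use a dedicated test account: if brute-force protection is enabled on the staging realm, the account may be locked temporarily by this.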

Acknowledging alerts¶

When an alert fires, acknowledge it in Slack by reacting with the checkmark emoji. This is convention rather than enforcement: Graylog does not currently read Slack acknowledgements. The 10-minute SMS escalation timer is driven by Graylog’s own grace period and notification sequence, not by Slack activity, so acknowledging in Slack does not by itself stop an SMS from being sent.

A more formal acknowledgement workflow is on Dr. Crucible’s list for when the team grows. For now, convention is sufficient: acknowledge in Slack, investigate, post a brief outcome note in the thread.