Litmus deployment
Ponder’s objection to chaos engineering was not that breaking things is bad. Ponder understands very well that things break. His objection was that breaking things without a plan is merely vandalism, and that breaking things in production without understanding the blast radius is the kind of activity that ends careers. Dr. Crucible’s response was to show him the Litmus ChaosCenter: a controlled, observable, reversible framework for introducing failures. Ponder spent thirty minutes reading the documentation, then said “all right, start with staging.” That was six months ago. The production chaos schedule has been running for four months without an unplanned outage, which is itself evidence that the approach works.
Installation
Litmus is installed via Helm in the litmus namespace:
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace \
  --version 3.x \
  --values /opt/infrastructure/helm/litmus/values.yaml
The values file:
# /opt/infrastructure/helm/litmus/values.yaml
portal:
  server:
    authServer:
      enabled: true
    graphqlServer:
      enabled: true
  frontend:
    enabled: true
mongodb:
  persistence:
    size: 20Gi
    storageClass: hcloud-volumes
ingress:
  enabled: true
  ingressClassName: nginx
  host: litmus.golemtrust.am
  tls:
    - secretName: litmus-tls
      hosts:
        - litmus.golemtrust.am
adminConfig:
  DBUSER: "admin"
  DBPASSWORD: "CHANGE_ME_FROM_VAULT"
After installation, retrieve the initial admin credentials and change them immediately:
kubectl get secret litmus-admin-secret \
  -n litmus \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
# Change via ChaosCenter UI at https://litmus.golemtrust.am
Litmus components
ChaosCenter: the web UI for creating and monitoring chaos experiments. Access via Keycloak OIDC (see OIDC configuration below). URL: https://litmus.golemtrust.am.
ChaosOperator: the Kubernetes operator that watches for ChaosEngine custom resources and executes experiments. It runs as a Deployment in the litmus namespace and has cluster-wide RBAC permissions to manipulate pods, nodes, and network policies in all namespaces it is authorised for.
ChaosExporter: exports experiment metrics to Prometheus. Metrics include litmuschaos_experiment_verdict (1 for pass, 0 for fail), litmuschaos_experiment_count, and litmuschaos_cluster_scoped_experiments_runs_total.
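To illustrate what the ChaosOperator acts on, the following is a minimal sketch of a ChaosEngine that runs the pod-delete experiment from the public hub. The namespace payments-staging, the app=ledger-api label, and the pod-delete-sa service account are hypothetical placeholders rather than values from this deployment, and the tunables shown are only a small subset of what the experiment accepts.

```yaml
# Hypothetical example: the namespace, label, and service account below
# are placeholders, not values taken from the Golem Trust deployment.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ledger-api-pod-delete
  namespace: payments-staging
spec:
  engineState: active            # patch to "stop" to abort a running experiment
  appinfo:
    appns: payments-staging      # namespace of the application under test
    applabel: app=ledger-api     # label selector for the target pods
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total duration in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between pod deletions
              value: "10"
            - name: FORCE                  # "false" requests graceful deletion
              value: "false"
```

Experiments launched from ChaosCenter are normally generated for you, so a hand-written manifest like this is mostly useful for understanding what the operator is watching for.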
ChaosHub configuration
Litmus connects to two ChaosHubs. The public hub provides the standard experiment library (pod-delete, node-drain, network-chaos, and so on). The private Golem Trust hub in GitLab provides custom experiments for Golem Trust-specific scenarios.
Private hub configuration in ChaosCenter:
Name: Golem Trust Custom Hub
Hub Type: Remote
Repo URL: https://gitlab.golemtrust.am/infrastructure/chaos-hub.git
Branch: main
Auth Type: Token
Access Token: [GitLab deploy token, stored in Vaultwarden]
Access control
Access to create ChaosEngine resources in production namespaces is restricted to Dr. Crucible and Ponder, enforced by an OPA Gatekeeper policy:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sChaosEngineCreator
metadata:
  name: restrict-chaosengine-creators
spec:
  match:
    kinds:
      - apiGroups: ["litmuschaos.io"]
        kinds: ["ChaosEngine"]
    namespaces:
      - payments
      - royal-bank
      - core-services
      - infrastructure
  parameters:
    allowedServiceAccounts:
      - system:serviceaccount:litmus:litmus-admin
    allowedUsers:
      - dr-crucible
      - ponder-stibbons
Staging namespaces (*-staging) have a more permissive policy that allows any member of the chaos-engineers Keycloak group to create experiments.
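K8sChaosEngineCreator is a custom constraint kind, so Gatekeeper needs a matching ConstraintTemplate before the production constraint can be applied. That template is not reproduced on this page. The following is a minimal sketch of what it could look like, assuming the policy simply compares the requesting username against the two parameter lists; the Rego is illustrative and should not be read as the policy actually deployed.

```yaml
# Sketch only: the Rego below is an assumed implementation, not the
# template in use at Golem Trust.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8schaosenginecreator      # must be the lowercased constraint kind
spec:
  crd:
    spec:
      names:
        kind: K8sChaosEngineCreator
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedServiceAccounts:
              type: array
              items:
                type: string
            allowedUsers:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8schaosenginecreator

        # Reject CREATE requests from anyone not on either allow-list.
        violation[{"msg": msg}] {
          input.review.operation == "CREATE"
          username := input.review.userInfo.username
          not is_allowed(username)
          msg := sprintf("%v is not permitted to create ChaosEngine resources in namespace %v", [username, input.review.namespace])
        }

        is_allowed(username) {
          username == input.parameters.allowedUsers[_]
        }

        # Service accounts authenticate as system:serviceaccount:<ns>:<name>.
        is_allowed(username) {
          username == input.parameters.allowedServiceAccounts[_]
        }
```

Under this sketch a missing or empty allow-list denies everyone, which is the safer default for production namespaces.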
Keycloak OIDC configuration
ChaosCenter authenticates via Keycloak OIDC. The client configuration in Keycloak:
Client ID: litmus-chaoscenter
Client Protocol: openid-connect
Access Type: confidential
Valid Redirect URIs: https://litmus.golemtrust.am/*
Root URL: https://litmus.golemtrust.am
Roles:
- chaos-viewer (read-only access to ChaosCenter)
- chaos-engineer (create experiments in staging)
- chaos-admin (create experiments in production; assigned to Dr. Crucible and Ponder only)
Prometheus alerts
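Prometheus scrapes the metrics that the ChaosExporter exposes. Two alerting rules are configured; the second relies on a kube_customresource_* metric, which is exposed by kube-state-metrics (custom resource state metrics) rather than by the ChaosExporter: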
groups:
  - name: litmus-chaos
    rules:
      - alert: ChaosExperimentFailed
        expr: litmuschaos_experiment_verdict == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment failed: {{ $labels.chaosresult_name }}"
          description: >
            Experiment {{ $labels.chaosresult_name }} in namespace
            {{ $labels.chaosresult_namespace }} did not pass.
            This may indicate the application did not recover as expected.
      - alert: ChaosEngineStuck
        expr: |
          kube_customresource_status_phase{customresource_kind="ChaosEngine",
            customresource_phase!~"completed|stopped"} > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "ChaosEngine stuck in non-terminal state"
          description: >
            ChaosEngine {{ $labels.customresource_name }} has been in
            {{ $labels.customresource_phase }} state for more than 30 minutes.
            Manual intervention may be required.
The ChaosExperimentFailed alert fires when an experiment’s verdict is “fail”, meaning the application’s steady-state hypothesis was not met during the experiment. This is a signal that the application needs improvement, not that Litmus has failed; however, Ponder and Dr. Crucible review all failures before the next production experiment of the same type.
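The ChaosExperimentFailed rule only works if Prometheus is actually scraping the ChaosExporter. As a minimal sketch, a scrape job for it could look like the following. The Service name chaos-exporter and the job name are assumptions, and clusters managed by the Prometheus Operator would express the same intent as a ServiceMonitor instead.

```yaml
# Sketch only: the Service name "chaos-exporter" and the job name are
# assumptions; adjust them to match the installed release.
scrape_configs:
  - job_name: litmus-chaos-exporter
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - litmus
    relabel_configs:
      # Keep only endpoints that belong to the exporter's Service.
      - source_labels: [__meta_kubernetes_service_name]
        regex: chaos-exporter
        action: keep
```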