Troubleshooting¶
Falco is a low-level security tool that sits close to the kernel and consumes a meaningful share of system resources. When something goes wrong, the symptoms range from a pod that fails to start through to alert floods that obscure genuine threats. This runbook collects the failure modes Dr. Crucible and Cheery have encountered in production, along with the resolution steps. It also documents the developer debugging shell incident from the first week of operation, which is now used as a reference case for what a Falco alert actually looks like in practice.
Falco pod not starting: driver loading failure¶
If a Falco pod is in CrashLoopBackOff or Error state, the most common cause is that the eBPF probe failed to load.
Check the pod logs first:
kubectl logs -n falco <pod-name> --previous
Then check dmesg on the node the pod was scheduled to:
kubectl get pod -n falco <pod-name> -o wide
# note the node name, then SSH to it
ssh <node>
dmesg | tail -100 | grep -i bpf
BPF verifier errors appear as bpf: Permission denied or bpf: Invalid argument. The most common causes and their fixes are:
- Kernel below 5.8: re-provision the node with Ubuntu 22.04 LTS
- BTF not available: confirm /sys/kernel/btf/vmlinux exists; if absent, the kernel was built without CONFIG_DEBUG_INFO_BTF=y
- Missing capabilities: check that no Gatekeeper policy stripped CAP_BPF or CAP_PERFMON from the pod; see the eBPF configuration runbook for required capabilities
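The kernel and BTF checks above can be scripted as a quick node pre-flight. A minimal sketch, assuming a POSIX shell on the node; the 5.8 minimum comes from this runbook, and kernel_at_least is a helper defined here, not a standard tool:

```shell
#!/bin/sh
# Pre-flight checks for the eBPF probe on a node (run after SSHing in).

# kernel_at_least MIN ACTUAL: exit 0 if ACTUAL >= MIN under version sort.
kernel_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -1)" = "$1" ]
}

if kernel_at_least 5.8 "$(uname -r)"; then
  echo "kernel OK: $(uname -r)"
else
  echo "kernel too old: $(uname -r); re-provision with Ubuntu 22.04 LTS"
fi

if [ -f /sys/kernel/btf/vmlinux ]; then
  echo "BTF available"
else
  echo "BTF missing: kernel built without CONFIG_DEBUG_INFO_BTF=y"
fi
```

If either check fails, fix the node before restarting the Falco pod; the probe will keep crash-looping otherwise.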
High CPU from Falco¶
Falco watching high-throughput workloads (build systems, CI runners, log aggregators) can consume significant CPU due to the volume of syscalls processed. Tune the buffer size preset downwards first:
syscall_buf_size_preset: 2
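When Falco is deployed via the Helm chart, this setting is normally carried in falco-values.yaml rather than edited on the pod. A hedged sketch, assuming the chart forwards keys under a top-level falco: block into falco.yaml (check your chart version's values reference):

```yaml
falco:
  # Assumption: the chart nests falco.yaml settings under `falco:`.
  syscall_buf_size_preset: 2
```

Apply it with the same helm upgrade command used in the upgrade procedure below so the DaemonSet pods restart with the new buffer size.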
Then identify which rules are firing most frequently. Noisy rules inflate CPU not because of the detection itself but because of output serialisation. Query the Falcosidekick metrics:
kubectl port-forward -n falco svc/falcosidekick 2801:2801
curl -s http://localhost:2801/metrics | grep falcosidekick_inputs_total | awk '{print $NF, $0}' | sort -rn | head -20
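The same ranking can be done in Python when a shell pipeline is awkward, since Prometheus exposition lines always end in the sample value. A minimal sketch; the sample lines below stand in for the real curl output, and the rule label is illustrative:

```python
def rank_metrics(lines, top=20):
    """Sort 'name{labels} value' lines by their trailing value, descending."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        # The value is always after the last space, even if labels contain spaces.
        name, _, value = line.rpartition(" ")
        parsed.append((float(value), name))
    parsed.sort(reverse=True)
    return [name for _, name in parsed[:top]]

sample = [
    'falcosidekick_inputs_total{rule="Chatty rule"} 4321',
    'falcosidekick_inputs_total{rule="Rare rule"} 3',
]
print(rank_metrics(sample))
```

Splitting on the last space avoids mis-ranking metrics whose rule labels contain spaces, which most Falco rule names do.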
For rules generating thousands of events per hour with no operational value, either add a macro exception or set the priority to DEBUG so the event is suppressed:
- rule: Some chatty default rule
  condition: and not k8s.ns.name = ci-runners
  override:
    condition: append
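For the priority route mentioned above, the override form is similar. A sketch, assuming (as the text implies) that the deployment's minimum priority filters out DEBUG events; the rule name is a placeholder:

```yaml
- rule: Some chatty default rule
  priority: DEBUG
  override:
    priority: replace
```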
False positive rules¶
False positives are almost always caused by a legitimate process that matches a condition written for a narrower scope. The fix is always to extend the condition via override: append rather than disabling the rule outright.
To identify the fields of a specific event, enable JSON output and watch the log file:
kubectl exec -n falco <pod-name> -- tail -f /var/log/falco/events.log | python3 -m json.tool --json-lines
Find the false positive event, note the distinguishing field values (for example proc.name, container.image.repository, or k8s.ns.name), then add an exception:
- macro: golem_trust_allowed_write_etc
  condition: proc.name = my-init-container

- rule: Write below etc
  condition: and not golem_trust_allowed_write_etc
  override:
    condition: append
Test the change with falco --dry-run before deploying. Submit the change as a merge request to golem-trust/falco-rules; it requires approval from either Dr. Crucible or Cheery.
Testing whether a specific syscall is detected¶
To confirm that a rule catches the intended behaviour, use a test container with nsenter to run a command in the target container’s namespaces, or simply exec into an existing container:
kubectl run falco-test \
--image=harbor.golems.internal/library/ubuntu:22.04 \
--restart=Never \
--rm -it \
-- /bin/bash
Inside the container, perform the action the rule targets (for example, reading /etc/shadow). Watch Falco logs on the node running the test pod:
kubectl logs -n falco <falco-pod-on-same-node> -f | grep "falco-test"
If no alert fires, confirm the pod is running on the same node as the Falco pod you are watching, and that the rule is loaded:
kubectl exec -n falco <falco-pod> -- falco -L | grep "your rule name"
Upgrade procedure¶
Falco upgrades follow the DaemonSet rolling update pattern. Before upgrading, check the driver compatibility matrix in the Falco release notes: a new Falco version sometimes requires a driver update that is not backwards compatible with the previous Helm chart.
Update the image tag in falco-values.yaml, then apply:
helm upgrade falco falcosecurity/falco \
--namespace falco \
--values falco-values.yaml \
--version <new-chart-version>
Monitor the rollout. There is a brief detection gap on each node as its pod restarts:
kubectl rollout status daemonset/falco -n falco --timeout=10m
If the rollout stalls, check the pod events on the new pod version:
kubectl describe pod -n falco <new-pod-name>
A stalled rollout most commonly means the new driver version failed to load on some node types. Roll back immediately if more than one node loses coverage for longer than two minutes:
helm rollback falco -n falco
The developer debugging shell incident¶
During the first week of Falco's operation, a developer left an active bash process running inside a production payments container. The container image had been built with bash installed, and the developer had used kubectl exec to investigate a query performance issue, then left the session running.
The alert that fired was:
{
"output": "Shell spawned in production database container (ns=production pod=payments-db-6f7c9d-mn4xp image=harbor.golems.internal/golem-trust/payments-db:2.1.4 cmd=/bin/bash user=root)",
"priority": "CRITICAL",
"rule": "Shell spawned in production database container",
"time": "2025-10-02T11:42:03Z",
"output_fields": {
"k8s.ns.name": "production",
"k8s.pod.name": "payments-db-6f7c9d-mn4xp",
"container.image.repository": "harbor.golems.internal/golem-trust/payments-db",
"proc.cmdline": "/bin/bash",
"user.name": "root"
}
}
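Alerts in this JSON shape can also be consumed programmatically, which is how a response controller keys its decisions. A minimal Python sketch, using an abbreviated copy of the alert above:

```python
import json

# Abbreviated copy of the incident alert shown above.
raw = """{
  "priority": "CRITICAL",
  "rule": "Shell spawned in production database container",
  "time": "2025-10-02T11:42:03Z",
  "output_fields": {
    "k8s.ns.name": "production",
    "proc.cmdline": "/bin/bash",
    "user.name": "root"
  }
}"""

alert = json.loads(raw)
fields = alert["output_fields"]

# A controller would typically route on rule, namespace, and priority.
print(alert["rule"], fields["k8s.ns.name"], alert["priority"])
```

The output_fields map carries the same values interpolated into the human-readable output string, so tooling should read the structured fields rather than parse the string.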
The response controller killed the pod within three seconds of alert generation. The pod was replaced by its ReplicaSet controller within eight seconds. The developer was contacted by Cheery and reminded that kubectl exec sessions in production containers must be approved in advance and must use a dedicated debug image rather than the production image. The production database image was subsequently rebuilt without a shell binary, and a Dockerfile linting rule was added to the CI pipeline to catch shell installation in database images at build time.