Cause Crash Loop | Steadybit Reliability Hub

Introduction

You can use this step to repeatedly kill all containers, or a single named container, in a selected pod.

Use Cases

Simulate failing container startups and Kubernetes backing off from restarting the container
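
To observe this use case from the outside, one option is to inspect the pod status that Kubernetes reports, for example the JSON printed by `kubectl get pod <name> -o json`. The following is a minimal sketch, not part of the Steadybit action itself. It assumes the standard `status.containerStatuses[].state.waiting.reason` field; the function name and the sample pod are hypothetical.

```python
from typing import Any


def crash_looping_containers(pod: dict[str, Any]) -> list[str]:
    """Return the names of containers currently waiting in CrashLoopBackOff."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # "waiting" is only present while the container is not running.
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            names.append(status["name"])
    return names


# Hypothetical, heavily trimmed pod object for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 5,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
            },
        ]
    }
}

print(crash_looping_containers(sample_pod))
```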

Known limitations

Pods using hostPID=true are currently unsupported
Containers that have neither a shell nor a kill binary are currently unsupported

Rollback

No rollback necessary.

Parameters

Parameter	Required	Description	Default
Duration	true	How long should the attack run?	60s
Container	false	Name of a container which should be killed. By default all containers are killed.
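
The two parameters can be modelled as a small data structure. This is an illustrative sketch only: the class, the helper names, and the set of accepted duration units are assumptions, not the Steadybit implementation. It shows the documented defaults, a duration of 60s and all containers targeted when no container name is given.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed unit suffixes; the real action may accept a different set.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a string such as '60s' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


@dataclass(frozen=True)
class CrashLoopParameters:
    duration: str = "60s"            # required, documented default
    container: Optional[str] = None  # optional, None means all containers

    def duration_seconds(self) -> float:
        return parse_duration(self.duration)

    def targets(self, pod_containers: list[str]) -> list[str]:
        """Select the containers the attack would kill in a pod."""
        if self.container is None:
            return list(pod_containers)
        return [name for name in pod_containers if name == self.container]


params = CrashLoopParameters()
print(params.duration_seconds(), params.targets(["app", "sidecar"]))
```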

Useful Templates (4 of 8)

AppDynamics alerts when a Kubernetes pod is in crash loop

Verify that an AppDynamics health violation alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes uses a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps failing, Kubernetes restarts it in the hope that the pod eventually becomes ready. If the restarts do not help, Kubernetes backs off and waits increasingly longer between restart attempts, and the Kubernetes resource remains non-functional.
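
The back-off behaviour can be made concrete with a small calculation. The sketch below assumes the kubelet behaviour described in the Kubernetes documentation: the restart delay starts at 10 seconds, doubles after each failed restart, and is capped at 300 seconds. The function name is hypothetical, and individual clusters may be configured differently.

```python
def restart_delays(restarts: int, base: float = 10.0, cap: float = 300.0) -> list[float]:
    """Return the assumed delay in seconds before each of the first `restarts` restarts."""
    # Exponential growth (base, 2*base, 4*base, ...) limited by the cap.
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


print(restart_delays(7))
# [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

Under these assumptions, a container that keeps failing reaches the maximum delay after its sixth restart, which is why an alert that only evaluates over a short window may not notice a crash loop in time.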

Structure

First, check that the AppDynamics health rule responsible for tracking non-ready containers is in a non-violating state. As soon as the crash loop attack causes one of the containers to crash loop, AppDynamics should raise a health violation and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Tags: AppDynamics, Crash loop, Harden Observability, Restart, Kubernetes, AppDynamics applications, AppDynamics health rules, Kubernetes cluster, Kubernetes pods

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes uses a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps failing, Kubernetes restarts it in the hope that the pod eventually becomes ready. If the restarts do not help, Kubernetes backs off and waits increasingly longer between restart attempts, and the Kubernetes resource remains non-functional.

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as the crash loop attack causes one of the containers to crash loop, the Datadog monitor should alert and escalate to your on-call team.
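
The structure above boils down to two checks: the monitor is healthy before the attack, and it alerts while the attack runs. The following sketch evaluates a recorded timeline of monitor states against those two checks. It is illustrative only: the `Observation` type, the `evaluate` function, and the state names `"OK"` and `"Alert"` are assumptions, and it does not call Datadog or Steadybit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    seconds: float  # time since the experiment started
    state: str      # monitor state observed at that time


def evaluate(
    observations: list[Observation],
    attack_start: float,
    attack_end: float,
    ok_state: str = "OK",
    alert_state: str = "Alert",
) -> tuple[bool, str]:
    """Check the monitor was OK before the attack and alerted during it."""
    before = [o for o in observations if o.seconds < attack_start]
    during = [o for o in observations if attack_start <= o.seconds <= attack_end]
    if not before or any(o.state != ok_state for o in before):
        return False, "monitor was not in the OK state before the attack"
    if not any(o.state == alert_state for o in during):
        return False, "monitor did not alert while the attack was running"
    return True, "monitor alerted during the attack"


# Hypothetical timeline: attack runs from second 30 to second 90.
timeline = [
    Observation(0, "OK"),
    Observation(20, "OK"),
    Observation(45, "OK"),
    Observation(75, "Alert"),
]
print(evaluate(timeline, attack_start=30, attack_end=90))
```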

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Tags: Crash loop, Harden Observability, Datadog, Restart, Kubernetes, Datadog monitors, Kubernetes cluster, Kubernetes pods

Dynatrace should detect a crash loop as a problem

Verify that Dynatrace alerts you on pods not being ready to accept traffic for a certain amount of time.

Motivation

Kubernetes uses a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps failing, Kubernetes restarts it in the hope that the pod eventually becomes ready. If the restarts do not help, Kubernetes backs off and waits increasingly longer between restart attempts, and the Kubernetes resource remains non-functional.

Structure

First, check that Dynatrace reports no problems for the entity and is not already alerting on non-ready containers. As soon as the Steadybit crash loop attack causes one of the containers to crash loop, Dynatrace should detect the problem and alert your on-call team so that they can take action.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Tags: Crash loop, Dynatrace, Harden Observability, Kubernetes, Kubernetes cluster, Kubernetes pods

Grafana alert rule fires when a Kubernetes pod is in crash loop

Verify that a Grafana alert rule alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes uses a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps failing, Kubernetes restarts it in the hope that the pod eventually becomes ready. If the restarts do not help, Kubernetes backs off and waits increasingly longer between restart attempts, and the Kubernetes resource remains non-functional.

Structure

First, check that the Grafana alert rule responsible for tracking non-ready containers is in an 'okay' state. As soon as the crash loop attack causes one of the containers to crash loop, the Grafana alert rule should fire and escalate to your on-call team.