Templates

Use templates to kick-start your reliability journey.

You have an idea for a missing template?Create a new one and share it with the community!

Filter by

AppDynamics alerts when a Kubernetes pod is in crash loop

Verify that an AppDynamics health violation alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If it isn't becoming ready, Kubernetes tries to solve it by restarting the underlying container and hoping to achieve its readiness eventually. If this isn't working, Kubernetes will eventually back off to restart the container, and the Kubernetes resource remains non-functional.

Structure

First, check that the AppDynamics health violation responsible for tracking non-ready containers is in a non-violating state. As soon as one of the containers is crash looping, caused by the crash loop attack, the AppDynamics health violation should notify and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

AppDynamics

Crash loop

Harden Observability

Restart

Kubernetes

AppDynamics applications

AppDynamics health rules

Kubernetes cluster

Kubernetes pods

AWS ECS Service Is Scaled up Within Reasonable Time

Verify that your ECS service is scaled up on increased CPU usage.

Motivation

Important ECS services should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.

Structure

First, we ensure that all ECS service's tasks are ready to serve traffic. Afterward, we inject high CPU usage into the ECS task and expect that within a reasonable amount of time, ECS increases the number of ECS tasks and they become ready to handle incoming traffic.

Scalability

CPU

AWS ECS

AWS

ECS Services

ECS Tasks

Block Traffic of Kubernetes DaemonSet

Learn how easy you can block traffic of an entire Kubernetes DaemonSet.

Block Traffic of Kubernetes Deployment

Learn how easy you can block traffic of an entire Kubernetes deployment.

Block Traffic of Kubernetes Pod

Learn how easy you can block traffic of a random pod of a Kubernetes workload (deployment, StatefulSet, DaemonSet).

Block Traffic of Kubernetes StatefulSet

Learn how easy you can block traffic of an entire Kubernetes StatefulSet.

Certificate TLS/SSL expiry for Kubernetes deployment

Turn time forward and check whether your TLS/SSL certificates are valid.

Motivation

Noticing the TLS/SSL certification expiry too late is one problem you can easily avoid by frequently checking your expiry dates. While observability tools already handle this job nicely, you can't know whether they are working in your environment. With this experiment, you can turn the time forward to check whether your HTTPS endpoint works at a given date in the future. Additionally, you can configure one of the observability integrations to validate your observability tool's alerting.

Structure

First, we validate that the given HTTPS endpoint is working today. Next, we will travel with the host in time to validate that the HTTPS endpoint continues to work on a given date. If the TLS/SSL certificate has already expired at that date, the HTTP check will throw failures.

Warning

Please be aware that we will manipulate the time for a given Kubernetes node. Containers running at that host may struggle to deal with the change in the clock correctly, and you may experience other side effects.

Certificate Expiry

Kubernetes cluster

Linux Hosts

Certificate TLS/SSL expiry for Linux Hosts

Turn time forward and check whether your TLS/SSL certificates are valid.

Motivation

Structure

First, we validate that the given HTTPS endpoint is working today. Next, we will travel with the Linux host in time to validate that the HTTPS endpoint continues to work on a given date. If the TLS/SSL certificate has already expired at that date, the HTTP check will throw failures.

Warning

Please be aware that we will manipulate the time for a given Linux host. Applications running at that Linux host may struggle to deal with the change in the clock correctly, and you may experience other side effects.

Certificate Expiry

Host

Linux

Linux Hosts

Certificate TLS/SSL expiry for Windows Hosts

Turn time forward and check whether your TLS/SSL certificates are valid.

Motivation

Structure

First, we validate that the given HTTPS endpoint is working today. Next, we will travel with the Windows host in time to validate that the HTTPS endpoint continues to work on a given date. If the TLS/SSL certificate has already expired at that date, the HTTP check will throw failures.

Warning

Please be aware that we will manipulate the time for a given Windows host. Applications running at that Windows host may struggle to deal with the change in the clock correctly, and you may experience other side effects.

Host

Windows

Certificate Expiry

Windows Hosts

Check Kafka consumer's reaction to record loss

Intent

Intentionally deny access to the topic for consumers and during this time where consumption is stopped, delete records.

We can check the logs of the consumers to see how they handle the loss of records and also authorization access issues.

Message Queue

Kafka

Recoverability

Kafka consumers

Kafka topics

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Datadog monitor should alert and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Harden Observability

Datadog

Restart

Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes pods

Draining a node should reschedule pods quickly

When draining a node, Kubernetes should reschedule running pods on other nodes without hiccups to ease, e.g., node maintenance.

Motivation

Draining a node may be necessary for, e.g., maintenance of a node. If that happens, Kubernetes should be able to reschedule the pods running on that node within the expected time and without user-noticeable failures.

Structure

For the entire duration of the experiment, a user-facing endpoint should work within expected success rates. At the beginning of the experiment, all pods should be ready to accept traffic. As soon as the node is drained, Kubernetes will evict the pods, but we still expect the pod's redundancy to be able to serve the user-facing endpoint. Eventually, after 120 seconds, all pods should be rescheduled and ready again to recover after the maintenance.

Elasticity

Kubernetes

Kubernetes cluster

Kubernetes deployments

Kubernetes nodes

Dynatrace should detect a crash looping as problem

Verify that Dynatrace alerts you on pods not being ready to accept traffic for a certain amount of time.

Motivation

Structure

First, check that Dynatrace has no problems for an entity and doesn't alert already on non-ready containers. As soon as one of the containers is crash looping, caused by the Steadybit attack crash loop, Dynatrace should detect the problem and alert to ensure your on-call team is taking action.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Dynatrace

Harden Observability

Kubernetes

Kubernetes cluster

Kubernetes pods

Faultless redundancy during rolling update

Kubernetes features a rolling update strategy to deploy new releases without downtime. When being under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up-to-date.

Motivation

The Kubernetes rolling update strategy ensures that a minimum number of pods remain available when a new release is deployed. This implies that a new pod with a new release is started and needs to be ready before an old pod is evicted. Even so, this process may result in degraded performance and user-facing errors, e.g., Kubernetes sending requests to pods indicated as ready but not able to respond properly or evicted pods are still retained in the load balancer.

Structure

Before performing the rolling update all desirable pods of the deployment need to be in the “ready”-state, and a load-balanced user-facing HTTP endpoint is expected to respond successfully while under load. As soon as the rolling update takes place, the HTTP endpoint under load may suffer from a degraded performance (e.g. lower success rate or higher response time). Even so, this should be within the boundaries of your SLA. After the rolling update, the number of desirable pods matches the actual pods of the deployment and the performance of the user-facing HTTP endpoint is similar to before the update.

Solution Sketch

Kubernetes liveness, readiness, and startup probes
Kubernetes deployment strategy

Rolling Update

Restart

Kubernetes

Kubernetes cluster

Kubernetes deployments

Faultless scaling of Kubernetes Deployment

Ensure that you can scale your deployment in a reasonable time without noticeable errors.

Motivation

For an elastic and resilient cloud infrastructure, ensure that you can scale your deployments without user-visible errors and within a reasonable amount of time. Long startup times, hiccups in the load balancer, or resource misallocation are undesirable but sometimes unnoticed and unexpected.

Structure

For the duration of the experiment and the deployment's upscaling, verify that a user-visible endpoint offered is responding within expected success rates and that no monitors are alerting. As soon as the deployment is scaled up, the newly scheduled pod should be ready to receive traffic within a reasonable time, e.g., 60 seconds.

Scalability

Elasticity

Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Fill Memory progressively of a Linux Host

Fill up the memory of a Host progressively to see at which percentage it becomes unstable. Additionally, you may want to add one of our observability checks.

Structure

We start by filling 50% of the host's memory for 30 seconds. Next, we stepwise fill the memory to 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Linux

Memory

DaemonSet

Host

Legacy

Snippet

Linux Hosts

Fill Memory progressively of a Windows Host

Fill up the memory of a Host progressively to see at which percentage it becomes unstable. Additionally, you may want to add one of our observability checks.

Structure

Progressive

Memory

Legacy

DaemonSet

Host

Snippet

Windows

Windows Hosts

Fill Memory progressively of Kubernetes DaemonSet

Fill up the memory of a Kubernetes DaemonSet progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by filling 50% of the Kubernetes DaemonSet's memory for 30 seconds. Next, we stepwise fill the memory to 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Memory

DaemonSet

Snippet

Kubernetes

Containers

Fill Memory progressively of Kubernetes Deployment

Fill up the memory of a Kubernetes Deployment progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by filling 50% of the Kubernetes Deployment's memory for 30 seconds. Next, we stepwise fill the memory to 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Memory

Deployment

Snippet

Kubernetes

Containers

Fill Memory progressively of Kubernetes StatefulSet

Fill up the memory of a Kubernetes StatefulSet progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by filling 50% of the Kubernetes StatefulSet's memory for 30 seconds. Next, we stepwise fill the memory to 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Memory

Snippet

StatefulSet

Kubernetes

Containers

Graceful degradation and Datadog alerts when Postgres database can not be reached

An unavailable database should be handled by your application gracefully and indicated appropriately Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage. You can address a potential impact on your system by implementing, e.g., a failover or caching mechanism.

Motivation

Database outages can occur for various reasons, including hardware failures, software bugs, network connectivity issues, or even intentional attacks. Such outages can severely affect your application, such as lost revenue, dissatisfied customers, and reputational damage. By testing your application's resilience to a database outage, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate an unavailable PostgreSQL database by blocking the PostgreSQL database client connection on a given hostname. During the outage, we will monitor the system and ensure that the user-facing endpoint indicates unavailability by responding with a "Service unavailable" status. We will also verify that at least one monitor in Datadog is alerting us to the database outage. Once the database becomes available again, we will verify that the endpoint automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to database outages and take appropriate measures to minimize their impact.

RDS

Postgres

Recoverability

Datadog

Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation and Datadog alerts when Postgres suffers latency

Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.

Motivation

Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we will monitor the system's behavior to ensure the service remains operational and can deliver its purpose. We will also analyze the performance metrics to identify any request types most affected by the latency and optimize them accordingly. Finally, we will end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can gain valuable insights into our application's resilience to database latencies and make informed decisions to optimize its performance under stress.

RDS

Postgres

Recoverability

Datadog

Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while Kafka is unavailable

An unavailable Kafka broker or even an entire cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.

Motivation

Kafka unavailability can occur for various reasons, such as hardware failure, network connectivity issues, or even intentional attacks. Such unavailability can severely affect your application, causing lost messages, data inconsistencies, and degraded performance. By testing the resilience of your system to Kafka unavailability, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then simulate an unavailable Kafka cluster by shutting down one or more Kafka brokers or the entire Kafka cluster. During the outage, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput. We will also verify that the system can handle the failure of a Kafka broker or a complete Kafka cluster outage without losing messages or data inconsistencies. Once the Kafka cluster becomes available again, we will verify that the system automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to Kafka unavailability and take appropriate measures to minimize their impact.

Kafka

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while Kafka suffers a high latency

Verify that your application handles an increased latency in your Kafka message delivery properly, allowing for increased processing time while maintaining the throughput.

Motivation

Latency in Kafka can occur for various reasons, such as network congestion, increased load, or insufficient resources. Such latency can impact your application's performance, causing delays in processing messages and affecting overall throughput. By testing your system's resilience to Kafka latency, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then induce latency on Kafka by introducing a delay on all incoming and outgoing messages. During the experiment, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput despite the increased processing time. We will also analyze the monitoring data to identify any potential bottlenecks or inefficiencies in the system and take appropriate measures to address them. Once the experiment is complete, we will remove the latency and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can identify any potential weaknesses in our system's resilience to Kafka latency and take appropriate measures to improve its performance and reliability.

Kafka

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while RabbitMQ is down

An unavailable RabbitMQ cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.

Motivation

RabbitMQ downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to RabbitMQ downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. To simulate downtime, we can shut down the RabbitMQ instance or cluster. The experiment aims to ensure your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the RabbitMQ instance or cluster is available again.

RabbitMQ

Datadog

Recoverability

Kubernetes

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while RabbitMQ suffers high latency

Verify that your application handles an increased latency in your RabbitMQ message delivery properly, allowing for increased processing time while maintaining the throughput.

Motivation

Latency issues in RabbitMQ can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing the resilience of your system to RabbitMQ latency, you can ensure that your system can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ latency, we expect the system to maintain its throughput and indicate unavailability appropriately. To simulate latency, we can introduce delays in message delivery. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.

RabbitMQ

Datadog

Recoverability

Kubernetes

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation when database can not be reached

An unavailable database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately

Motivation

Depending on your context, an unavailable database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the database is reachable again.

RDS

Recoverability

Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Microsoft SQL Server database can not be reached

An unavailable Microsoft SQL Server database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately

Motivation

Depending on your context, an unavailable Microsoft SQL Server database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Microsoft SQL Server database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Microsoft SQL Server database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Microsoft SQL Server database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Microsoft SQL Server database is reachable again.

RDS

Microsoft SQL Server

Recoverability

Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Oracle database can not be reached

An unavailable Oracle database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately

Motivation

Depending on your context, an unavailable Oracle database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Oracle database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Oracle database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Oracle database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Oracle database is reachable again.

RDS

Oracle

Recoverability

Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Postgres database can not be reached

An unavailable Postgres database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately

Motivation

Depending on your context, an unavailable Postgres database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Postgres database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Postgres database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Postgres database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Postgres database is reachable again.

RDS

Postgres

Recoverability

Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation while Kafka is unavailable

An unavailable Kafka is not user-visible as it leads to graceful degradation and retries as soon as the Kafka is back available again.

Motivation

In case of an unavailable Kafka message broker, your application should still work successfully. To decouple your system parts from each other, each Kafka client should take care of appropriate caching and retry mechanisms and shouldn't make the failed Kafka message broker visible to the end user. Instead, your system should fail gracefully, so retry the submission as soon as the Kafka message broker is back again.

Structure

We will use two separate Postman collections to decouple request submissions and check business functionality. The first Postman collection runs while Kafka is unavailable. We expect the Postman collection to run without errors and the system to somehow save all requests. After Kafka is available again, we will check with another Postman collection to see whether all requests have been received and processed. In between, we allow for some processing time.

Kafka

Recoverability

Postman

Kubernetes

Containers

Kubernetes cluster

Postman Collections

Grafana alert rule fires when a Kubernetes pod is in crash loop

Verify that a Grafana alert rule alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Structure

First, check that the Grafana alert rule responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Grafana alert rule should fire and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Harden Observability

Restart

Grafana

Kubernetes

Grafana alert rules

Kubernetes cluster

Kubernetes pods

Instana should detect a crash looping as incident

Intent

Verify that Instana alerts you that pods are not ready to accept traffic for some time.

Motivation

Structure

First, check that Instana has no critical events for an application perspective. As soon as one of the containers is crash looping, caused by the Steadybit attack crash loop, Instana should detect this via a critical event to ensure your on-call team is taking action.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Instana

Harden Observability

Kubernetes

Instana application perspectives

Kubernetes cluster

Kubernetes pods

Keep Deployment's pods down

Check what happens when all pods of a Kubernetes deployment aren't coming up again.

Motivation

Typically, Kubernetes tries to keep as many pods running as desired for a Kubernetes deployment. However, some circumstances may prevent Kubernetes from achieving this, like missing resources in the cluster, problems with the deployment's probes, or a CrashLoopBackOff. You should validate what happens to your upstream services that are using the given deployment as a downstream service.

Structure

To keep the pods down for a given deployment, we first kill all the pods in the deployment. Simultaneously, we will scale down the Kubernetes deployment to 0 to simulate that these pods can't be scheduled again. At the of the experiment, we automatically roll back the deployment's scale to the initial value.

Deployment

Downstream Service

Kubernetes

Kubernetes cluster

Kubernetes deployments

Kubernetes pods

Keep StatefulSet's pods down

Check what happens when all pods of a Kubernetes StatefulSet aren't coming up again.

Motivation

Typically, Kubernetes tries to keep as many pods running as desired for a Kubernetes StatefulSet. However, some circumstances may prevent Kubernetes from achieving this, like missing resources in the cluster, problems with the StatefulSet's probes, or a CrashLoopBackOff. You should validate what happens to your upstream services that are using the given StatefulSet as a downstream service.

Structure

To keep the pods down for a given StatefulSet, we first kill all the pods in the StatefulSet. Simultaneously, we will scale down the Kubernetes StatefulSet to 0 to simulate that these pods can't be scheduled again. At the of the experiment, we automatically roll back the StatefulSet's scale to the initial value.

Downstream Service

StatefulSet

Kubernetes

Kubernetes cluster

Kubernetes pods

Kubernetes statefulsets

Kubernetes deployment survives Redis downtime

Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by an unavailable Redis instance or a complete cluster.

Motivation

Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. We can block the traffic to the Redis instance to simulate downtime. The experiment aims to ensure that your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the Redis instance is available again.

Redis

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes deployment survives Redis latency

Verify that your application handles an increased latency in a Redis cache properly, allowing for increased processing time while maintaining throughput.

Motivation

Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis latency, we expect the system to maintain its throughput and indicate unavailability appropriately. We can introduce delays in Redis operations to simulate latency. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.

Redis

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes Horizontal Pod Autoscaler Scales up Within Reasonable Time

Verify that your horizontal pod autoscaler scales up your Kubernetes deployment on increased CPU usage.

Motivation

Important deployments should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.

Structure

First, we ensure that all pods are ready to serve traffic. Afterward, we inject high CPU usage into the pods' container and expect that within a reasonable amount of time, the horizontal pod auto scaler will increase the number of pods and become ready to handle incoming traffic.

Scalability

Horizontal Pod Autoscaler

CPU

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Kubernetes node shutdown results in new node startup

A resilient Kubernetes cluster can cope with a crashing node and simply starts a new one.

Motivation

A changing number of nodes in your Kubernetes cluster is expected, as you may update your nodes from time to time or simply scale the cluster depending on traffic peaks. This is especially true when using spot instances in a Cloud environment. This requires the deployments to be node-independent and properly configured to be rescheduled on a newly started node or a node that still has free resources.

Structure

Before restarting a node, we verify that the cluster is healthy and that the deployments are ready. Afterward, we trigger the shutdown of the node of a specific Kubernetes deployment and expect the deployment to be rescheduled on any other node and a new node to start up within a reasonable amount of time.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Warning

Please be aware that we will shut down a node. Please ensure this is fine and your node is either virtual or can somehow be started up afterward.

Elasticity

Kubernetes

Kubernetes cluster

Kubernetes deployments

Linux Hosts

Latency progressively increases for a Linux Host

Latency of a Host progressively increases to analyse at which point the communication breaks. Additionally, you may want to add one of our observability checks.

Structure

We start by adding a 250ms latency on the host's outgoing traffic for 30 seconds. Next, we stepwise increase the latency to 500ms, 750ms, and 1s - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Linux

Host

Legacy

Snippet

Latency

Linux Hosts

Latency progressively increases for a Windows Host

Latency of a Host progressively increases to analyse at which point the communication breaks. Additionally, you may want to add one of our observability checks.

Structure

Progressive

Host

Legacy

Snippet

Windows

Latency

Windows Hosts

Latency progressively increases for Kubernetes DaemonSet

Latency of a Kubernetes DaemonSet progressively increases to analyse at which point the communication breaks.

Structure

We start by adding a 250ms latency on the Kubernetes DaemonSet's outgoing traffic for 30 seconds. Next, we stepwise increase the latency to 500ms, 750ms, and 1s - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

DaemonSet

Snippet

Kubernetes

Latency

Containers

Latency progressively increases for Kubernetes Deployment

Latency of a Kubernetes Deployment progressively increases to analyse at which point the communication breaks.

Structure

We start by adding a 250ms latency on the Kubernetes Deployment's outgoing traffic for 30 seconds. Next, we stepwise increase the latency to 500ms, 750ms, and 1s - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Deployment

Snippet

Kubernetes

Latency

Containers

Latency progressively increases for Kubernetes StatefulSet

Latency of a Kubernetes StatefulSet progressively increases to analyse at which point the communication breaks.

Structure

We start by adding a 250ms latency on the Kubernetes StatefulSet's outgoing traffic for 30 seconds. Next, we stepwise increase the latency to 500ms, 750ms, and 1s - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Snippet

StatefulSet

Kubernetes

Latency

Containers

Linux host losing network connection is detected by Datadog

When a host suddenly loses connection to the network and your system, Datadog should alert about this. Eventually, everything should recover once the network is back again.

Motivation

When you're working in a less volatile system environment, a loss of network can be crucial as there is likely no backup host that will enable faster recovery. Thus, you should check your observability tools to catch this.

Structure

Before blocking a host from the network, we verify that the Datadog monitor is in an ok state Afterward, we block all traffic to and from a host and expect Datadog to alert about the isolated host. Eventually, when the host is online again, we expect Datadog to turn into an OK state again. While experimenting, we create a downtime for the Monitor so that it will not escalate due to the ongoing alert.

Legacy

Host

Linux

Datadog

Datadog monitors

Linux Hosts

Linux Host reboot is alerted by Datadog

When a Linux host is suddenly missing from your system, Datadog should alert you to this. Eventually, everything should recover when only rebooting the host.

Motivation

When you're working in a less volatile system environment, where you expect hosts always to run, you should validate whether you notice whenever a host is rebooting.

Structure

Before restarting a host, we verify that the Datadog monitor is in an ok state Afterward, we trigger the shutdown of a host and expect Datadog to alert about the missing host. Eventually, the host should come back and Datadog turn into an OK state again. While experimenting, we create a downtime for the Monitor so that it will not escalate due to the ongoing alert.

Linux

Legacy

Host

Datadog

Datadog monitors

Linux Hosts

Load balancer covers a zone outage of a Kubernetes workload

Applications hosted in cloud providers achieve high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.

Motivation

Cloud provider can host your deployments and services across multiple locations worldwide. From a reliability standpoint, cloud provider's regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zone is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.

Structure

We leverage the Container block traffic attack to simulate an availability zone outage just for a specific deployment. Before the simulated outage, we ensure that a load-balanced user-facing endpoint works appropriately. During an availability zone's unavailability, the HTTP endpoint must continue operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover as soon as the zone is back again.

Solution Sketch

AWS regions and zones
Azure regions
Azure availability zones
GCP regions and Zones
Kubernetes liveness, readiness, and startup probes

Azure

GCP

Redundancy

AWS

Availability Zone

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Load balancer covers an AWS EC2 restart

EC2 is part of the AWS Elastic Compute Cloud, which acquires and releases resources depending on the traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.

Motivation

Depending on your traffic demand, you can use AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, do that automatically, while others, such as EC2, integrate with AWS Auto Scaling. Once configured, it boils down to fluctuating EC2 instances starting or shutting down frequently. Even when not using AWS Autoscaling, your EC2 instances may need to be restarted occasionally for maintenance and updating purposes. Thus, it is best practice to validate your application's behavior.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all EC2 instances available. While restarting an EC2 instance, the HTTP endpoint continues operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover to a 100% success rate once all EC2 instances are back.

Solution Sketch

AWS Well-Architected Framework
Kubernetes liveness, readiness, and startup probes

Scalability

Redundancy

Elasticity

AWS

EC2-instances

Load balancer covers an AWS zone outage

AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.

Motivation

AWS hosts your deployments and services across multiple locations worldwide. From a reliability standpoint, AWS regions and Availability Zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across AWS availability zone is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.

Structure

We leverage the AWS blackhole attack to simulate an AWS availability zone outage. Before the simulated outage, we ensure that a load-balanced user-facing endpoint works appropriately. During an AWS availability zone's unavailability, the HTTP endpoint must continue operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover as soon as the zone is back again.

Solution Sketch

Regions and Zones
Kubernetes liveness, readiness, and startup probes

Redundancy

AWS

Availability Zone

Zones

Load balancing hides a single container failure for end users

If a pod becomes temporarily unavailable, you want to ensure that Kubernetes is properly reacting, excluding that pod from the Service and restarting it.

Motivation

If configured properly, Kubernetes can detect a non-responding pod and try to fix it by simply restarting the unresponsive pod. Even so, the exact configuration requires careful consideration to avoid killing your pods too early or flooding your cluster's traffic with liveness probes.

Structure

Before killing a container of a Kubernetes pod, we verify that a load-balanced user-facing endpoint is working properly and that all Kubernetes deployment's pods are marked as ready. As soon as one container crashes, Kubernetes should detect the crashed container via a failing liveness probe and mark the related pod as not ready. Now, Kubernetes is expected to restart the container so the pod becomes ready within a certain time. The user-facing HTTP endpoint may suffer from degraded performance when being under load (e.g., lower success rate or higher response time). Even so, this is expected to be within the SLA boundaries.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Redundancy

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Load-balanced endpoint covers exceeding ephemeral storage of Kubernetes deployment

Ensure that all containers of Kubernetes deployment resources have proper ephemeral storage limits configured to prevent the instability of other containers.

Motivation

For an elastic and resilient cloud infrastructure, ensure that the over-usage of ephemeral storage of one container doesn't affect any others. Furthermore, if one container exceeds its configured limits, Kubernetes should kill it and eventually prepare it within a given timeframe.

Structure

Verify that a user-visible endpoint responds within the expected success rates while exceeding the ephemeral storage. As soon as one container exceeds the ephemeral storage, by filling the disk in a /tmp directory, Kubernetes should evict the container, decreasing ready pods. Within 60 seconds, the evicted container should run again, and the pod should be ready.

Elasticity

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Network loss for Kubernetes node's outgoing traffic in an availability zone

Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones suffers from a network loss.

Motivation

Cloud provider host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zone is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.

Structure

We leverage the drop outgoing traffic to simulate network loss in an availability. If you want to test for a full outage of the zone, configure it to 100% loss. While the network loss happens, we observe changes of a Kubernetes cluster with Steadybit's built-in visibility. Once the network loss is over, we expect that all deployments will recover again within a specified time.

Solution Sketch

AWS Regions and Zones
Azure Regions and Zones
GCP Regions and Zones
Kubernetes liveness, readiness, and startup probes

AWS

Azure

GCP

Redundancy

Kubernetes

Availability Zone

Kubernetes cluster

Kubernetes deployments

Linux Hosts

Network outage for Kubernetes nodes in an availability zone

Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones is down.

Motivation

Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.

Structure

We leverage the block traffic attack to simulate a full network loss in an availability zone. While the zone outage happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the zone outage is over, we expect that all deployments will recover again within a specified time.

Solution Sketch

AWS Regions and Zones
Azure Regions and Zones
GCP Regions and Zones
Kubernetes liveness, readiness, and startup probes

Azure

GCP

Redundancy

AWS

Availability Zone

Kubernetes cluster

Kubernetes deployments

Linux Hosts

New Relic detects an incident for CPU spikes in an ECS task

Validate your observability to detect a CPU spike in your AWS ECS cluster

Motivation

When you have New Relic configured to detect CPU spikes in your AWS ECS cluster, you can easily validate your observability strategy with this experiment template.

Structure

First, we validate whether New Relic has no ongoing incident. After that, we inject the CPU spike for an ECS service and expected that New Relic detect this as an incident within the given time frame of 3 minutes.

New Relic

AWS ECS

CPU

ECS Tasks

New Relic Accounts

New Relic should detect a crash looping as problem

Verify that New Relic alerts you that pods are not ready to accept traffic for some time.

Motivation

Structure

First, check that New Relic has no critical events for related entities. As soon as one of the containers is crash looping, caused by the Steadybit attack crash loop, New Relic should detect this via an incident to ensure your on-call team is taking action.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

New Relic

Harden Observability

Kubernetes

Kubernetes cluster

Kubernetes pods

New Relic Accounts

New Relic should detect a disrupted workflow when a workload is unavailable

Verify that New Relic alerts you to disruptions in your workflow, such as a critical deployment without pods ready to serve traffic.

Motivation

Kubernetes features a liveness probe to determine whether your pod is healthy and can accept traffic. If Kubernetes cannot probe a pod, it restarts it in the hope that it will eventually be ready. In case it is a critical deployment, New Relic workflow should alert on this disruption

Structure

First, check that the New Relic Workflow is marked as operational As soon as all pods of a workload aren't reachable, caused by the block traffic attack, New Relic should detect this by marking the workflow as disrupted and ensuring your on-call team is taking action.

Solution Sketch

Kubernetes liveness, readiness, and startup probes
New Relic Workflow

New Relic

Harden Observability

Kubernetes

Containers

Kubernetes cluster

New Relic Accounts

New Relic Workloads

Prometheus detect unhealthy deployments

Verify that your Prometheus metrics are catching unready pods in a Kubernetes deployment.

Motivation

Whenever important deployments aren't available your Prometheus metric should catch this in order to alert properly.

Structure

By blocking the traffic to all containers of the deployment, we expected failing probes and eventually, Kubernetes detecting the pods as being down. After a short time frame, Prometheus should detect this as well in the specified Prometheus metric.

Deployment

Prometheus

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Prometheus instances

Reasonable recovery time in case of container failures

Quick startup times are favorable in Cloud environments to enable fast recovery and improve scaling.

Motivation

In Cloud environments, it is accepted that a pod or container may crash - the more important principle is that it should recover quickly. A faster startup time is beneficial in that case as it results in a smaller Mean Time To Recover (MTTR) and reduces user-facing downtime. Also, in case of request peaks, a reasonably short startup time allows scaling the deployment properly.

Structure

We simply stop a container of one of the pods to measure the time until it is marked as ready again. Therefore, before stopping the container, we ensure that the deployment is ready. If so, we stop the container and expect the number of ready pods to drop. Within a reasonable time (e.g., 60 seconds), the container should start up again, and all desirable pods should be marked as ready.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Scalability

Recoverability

Kubernetes

Starter

Containers

Kubernetes cluster

Kubernetes deployments

Reasonable recovery time when losing a pod

When deleting a pod, Kubernetes should bring up a new pod to ensure system stability.

Motivation

Deleting a pod simulates situations in which, for any reason, a pod stops working properly and needs to be rescheduled. This experiment makes sure that rescheduling works as expected and newly scheduled pods become ready within the expected timeframe.

Structure

All pods should be ready to accept traffic at the beginning of the experiment. Rescheduling should start as soon as a pod is deleted. Eventually, after the allotted time, all pods should be ready again.

Elasticity

Recoverability

Kubernetes

Kubernetes cluster

Kubernetes deployments

Kubernetes pods

Scaling up of ECS Service Within Given Time

Ensure that you can scale up your ECS service in a reasonable time.

Motivation

For an elastic and resilient cloud infrastructure, ensure you can scale up your ECS services within a reasonable time. Long startup times are undesirable but sometimes unnoticed and unexpected.

Structure

Validate that all ECS tasks of an ECS service are running. Once we scale the ECS service up, the newly scheduled task should be ready within a reasonable time.

Scalability

Elasticity

AWS ECS

AWS

ECS Services

Simulate Kafka brokers downtime and see how consumers are handling topic lag

An experiment to block access to brokers for consumers while producing records in the topic. The consumers should get back to consume when the kafka brokers are available again and the accumulated lag must be deal with quickly.

Splunk Detector fires when a Kubernetes pod is in crash loop

Verify that a Splunk detector alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Structure

First, check that the Splunk detector responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Splunk alert rule should fire and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Harden Observability

Restart

Splunk

Kubernetes

Kubernetes cluster

Kubernetes pods

Splunk detectors

Splunk platform alerts when a Kubernetes pod is in crash loop

Verify that Splunk platform is firing an alert when pods are not ready to accept traffic for a certain time.

Motivation

Structure

First, check that the Splunk platform alert responsible for tracking non-ready containers is not in a firing state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Splunk platform alert should fire and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Harden Observability

Restart

Kubernetes

Splunk Platform

Kubernetes cluster

Kubernetes pods

Splunk Alerts

Stress CPU of Kubernetes DaemonSet

Learn how easy you can stress CPU of an entire Kubernetes DaemonSet.

Stress CPU of Kubernetes Deployment

Learn how easy you can stress CPU of an entire Kubernetes deployment.

Stress CPU of Kubernetes StatefulSet

Learn how easy you can stress CPU of an entire Kubernetes StatefulSet.

Stress CPU progressively of a Linux Host

Stress the CPU of a host progressively to see at which percentage it becomes unstable. Additionally, you may want to add one of our observability checks.

Structure

We start by stressing 50% of the host's CPU for 30 seconds. Next, we stepwise stress the CPU by 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Linux

CPU

Host

Legacy

Snippet

Linux Hosts

Stress CPU progressively of a Windows Host

Stress the CPU of a host progressively to see at which percentage it becomes unstable. Additionally, you may want to add one of our observability checks.

Structure

Progressive

Legacy

CPU

Host

Snippet

Windows

Windows Hosts

Stress CPU progressively of Kubernetes DaemonSet

Stress the CPU of a Kubernetes DaemonSet progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by stressing 50% of the Kubernetes DaemonSet's CPU for 30 seconds. Next, we stepwise stress the CPU by 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

CPU

DaemonSet

Snippet

Kubernetes

Containers

Stress CPU progressively of Kubernetes Deployment

Stress the CPU of a Kubernetes Deployment progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by stressing 50% of the Kubernetes Deployment's CPU for 30 seconds. Next, we stepwise stress the CPU by 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

Deployment

CPU

Snippet

Kubernetes

Containers

Stress CPU progressively of Kubernetes StatefulSet

Stress the CPU of a Kubernetes StatefulSet progressively to see at which percentage it will be killed by Kubernetes.

Structure

We start by stressing 50% of the Kubernetes StatefulSet's CPU for 30 seconds. Next, we stepwise stress the CPU by 75%, 90%, and 100% - each for 30 seconds. In between, we have small wait steps to ease analysis in external observability tools for each phase.

Progressive

CPU

Snippet

StatefulSet

Kubernetes

Containers

Stress Memory of Kubernetes DaemonSet

Learn how easy you can stress memory of an entire Kubernetes DaemonSet.

Stress Memory of Kubernetes Deployment

Learn how easy you can stress memory of an entire Kubernetes deployment.

Stress Memory of Kubernetes StatefulSet

Learn how easy you can stress memory of an entire Kubernetes StatefulSet.

Third-party service is unavailable for a Kubernetes Deployment

Identify the effect of an unavailable third-party service on your deployment's service's success metrics.

Motivation

When you provide a synchronous service via HTTP that requires the availability of other downstream third-party services, you absolutely should check how your service behaves in case the third-party service is unavailable. Also, you want to validate whether your service is working again as soon as the third-party service is working again.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service being unavailable, we expect the user-facing endpoint to work within specified HTTP success rates.. To simulate the unavailability, we can block the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service is reachable again.

Third-party

Downstream Service

Recoverability

Containers

Kubernetes cluster

Kubernetes deployments

Third-party service suffers high latency for a Kubernetes Deployment

Identify the effect of high latency of a third-party service on your deployment's service's success metrics.

Motivation

When you provide a synchronous service via HTTP that requires the availability of other downstream third-party services, you absolutely should check how your service behaves in case the third-party service suffers high latency. Also, you want to validate whether your service is working again as soon as the third-party service is working again.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service's high latency, we expect the user-facing endpoint to work within specified HTTP success rates.. To simulate the high latency, we can delay the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service is reachable again.

Third-party

Downstream Service

Recoverability

Containers

Kubernetes cluster

Kubernetes deployments

Unavailable downstream service doesn't result in user-visible errors

Verify that an unavailable downstream service doesn't result in user-visible errors.

Motivation

When offering a service that is dependent on downstream services, you should ensure that the offered service also works fine whenever one of the downstream services can't be reached. This is especially true when multiple downstream services are involved and the responses of each downstream service are considered optional.

Structure

First, we validate that the HTTP endpoint of the upstream service is working as expected. Then, we block the traffic of the downstream service and expect that the HTTP endpoint will still work within the expected success rate.

This experiment template is used in our quick start on running an experiment and is especially useful for the shopping demo example. To learn more, check out the quick start in the Steadybit docs.

Shopping Demo Quick Start

Block Traffic

Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Validate consumer's behavior when new leader is elected

Verify that your application handles a change of the leader properly.

Motivation

By testing your system's resilience to Kafka leader changes in a partition, you can identify potential weaknesses and take appropriate measures to improve its performance.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then elect a new leader for one partition in Kafka and expect the system to work fine.

Message Queue

Kafka

Recoverability

Kafka consumers

Kafka topics

Validate Kafka election process when the controller is unavailable

Verify a Kafka's broker re-election when the current controller is isolated

Motivation

Testing your Zookeeper re-election process of a new controller broker helps to ensure high availability in your Kafka cluster. In addition, verify the behavior of your producers and consumers and verify what happens, e.g., to your message throughput while the re-election is happening or whether messages are lost.

Structure

We enforce the re-election of Kafka controller broker by blocking the network traffic to the current controller. By leveraging the 'check brokers'-step, we can validate that the new controller is elected and the old controller is detected as being down.

Message Queue

Zookeeper

Kafka

Recoverability

Containers

Validate Kubernetes probes for an unavailable downstream service

Failing downstream service (e.g., message broker, database, or cache) shouldn't cause liveness or readiness probe failures in Kubernetes to avoid cascading restarts.

Motivation

In Kubernetes, liveness and readiness probes indicate whether a container is alive and able to serve incoming requests. These are especially helpful for load balancers. However, it is best practice in Kubernetes not to include downstream services in the probes. Otherwise, as soon as, e.g., a Kubernetes deployment's downstream service has issues, the deployment restarts a well, which may cause a cascade of failures in the Kubernetes cluster.

Structure

While blocking traffic from a deployment's container to a downstream service, we explicitly check the HTTP liveness and readiness probes of the Kubernetes deployment. Following best practices, we expect them not to be affected by the unavailable downstream service.

References

Kubernetes liveness, readiness, and startup probes
Readiness and Liveness Probes best practices by kube-score

Kubernetes Probes

Third-party

Downstream Service

Recoverability

Kubernetes

Containers

Kubernetes cluster

Windows host losing network connection is detected by Datadog

When a host suddenly loses connection to the network and your system, Datadog should alert about this. Eventually, everything should recover once the network is back again.

Motivation

Structure

Legacy

Host

Windows

Datadog

Datadog monitors

Windows Hosts

Windows host reboot is alerted by Datadog

When a windows host is suddenly missing from your system, Datadog should alert you to this. Eventually, everything should recover when only rebooting the host.

Motivation

When you're working in a less volatile system environment, where you expect hosts always to run, you should validate whether you notice whenever a host is rebooting.

Structure

Legacy

Host

Windows

Datadog

Datadog monitors

Windows Hosts

>_ Learn how to author a templateContribute Template

Steadybit covers many out-of-the-box needs, but sometimes your organization may need proprietary or niche solutions. Leverage our templates to gain flexibility and address those needs!