
Templates

Use templates to kick-start your reliability journey.
Do you have an idea for a missing template? Create a new one and share it with the community!
AWS ECS Service Is Scaled up Within Reasonable Time

Verify that your ECS service is scaled up on increased CPU usage.

Motivation

Important ECS services should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.

Structure

First, we ensure that all of the ECS service's tasks are ready to serve traffic. Afterward, we inject high CPU usage into the ECS tasks and expect that, within a reasonable amount of time, ECS increases the number of tasks and that they become ready to handle incoming traffic.
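
As a rough illustration, assuming the AWS SDK for Python (boto3) with credentials configured in the environment and placeholder cluster and service names, the scale-up expectation could be checked like this:

```python
# Sketch: poll the ECS service's desired vs. running task count while the CPU attack runs.
import time
import boto3  # assumes AWS credentials are available in the environment

ecs = boto3.client("ecs", region_name="eu-central-1")

def wait_for_scale_up(cluster: str, service: str, baseline: int, timeout: int = 300) -> bool:
    """Return True if ECS adds tasks beyond the baseline within the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
        print(f"desired={svc['desiredCount']} running={svc['runningCount']}")
        if svc["runningCount"] > baseline:
            return True
        time.sleep(10)
    return False

# Example: expect more than 2 tasks to be running within 5 minutes.
print(wait_for_scale_up("demo-cluster", "checkout-service", baseline=2))
```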

Scalability
CPU
AWS ECS
AWS

ECS Services

ECS Tasks

Check Kafka consumer's reaction to record loss
Intent

Intentionally deny consumers access to the topic and, while consumption is stopped, delete records from it.

We can then check the consumers' logs to see how they handle both the loss of records and the authorization errors.
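
A minimal sketch of the consumer-side handling we want to observe in the logs, assuming the confluent-kafka Python client and placeholder broker, topic, and group names:

```python
# Sketch of the logging we want to inspect during the experiment.
import logging
from confluent_kafka import Consumer, KafkaError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders-consumer")

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-consumer",
    # Decides where to resume if the committed offset points at deleted records.
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError.TOPIC_AUTHORIZATION_FAILED:
                log.warning("access to topic denied: %s", msg.error())
            else:
                log.error("consumer error: %s", msg.error())
            continue
        log.info("processed record at offset %d", msg.offset())
finally:
    consumer.close()
```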

Message Queue
Kafka
Recoverability

Kafka consumers

Kafka topics

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to fix this by restarting the underlying container, hoping it will eventually become ready. If that doesn't work, Kubernetes backs off on restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Datadog monitor should alert and escalate it to your on-call team.
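
A minimal sketch of how the monitor state could be asserted around the attack, assuming the datadog-api-client Python package, API/app keys in the environment, and a hypothetical monitor id:

```python
# Sketch: assert the Datadog monitor state before and after the crash loop attack.
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi

MONITOR_ID = 1234567  # hypothetical id of the "pods not ready" monitor

def monitor_state(monitor_id: int) -> str:
    configuration = Configuration()  # reads DD_API_KEY / DD_APP_KEY from the environment
    with ApiClient(configuration) as api_client:
        monitor = MonitorsApi(api_client).get_monitor(monitor_id)
        return str(monitor.overall_state)  # e.g. "OK" or "Alert"

assert monitor_state(MONITOR_ID) == "OK"   # before the attack
# ... run the crash loop attack, then expect:
# assert monitor_state(MONITOR_ID) == "Alert"
```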

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Crash loop
Harden Observability
Datadog
Restart
Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes pods

Draining a node should reschedule pods quickly

When draining a node, Kubernetes should reschedule running pods on other nodes without hiccups to ease, e.g., node maintenance.

Motivation

Draining a node may be necessary for, e.g., maintenance of a node. If that happens, Kubernetes should be able to reschedule the pods running on that node within the expected time and without user-noticeable failures.

Structure

For the entire duration of the experiment, a user-facing endpoint should work within expected success rates. At the beginning of the experiment, all pods should be ready to accept traffic. As soon as the node is drained, Kubernetes will evict the pods, but we still expect the pod's redundancy to be able to serve the user-facing endpoint. Eventually, after 120 seconds, all pods should be rescheduled and ready again to recover after the maintenance.
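
A minimal sketch of the readiness check around the drain, assuming the official Kubernetes Python client and placeholder deployment and namespace names:

```python
# Sketch: watch the deployment's ready replicas while the node is drained.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def ready_replicas(name: str, namespace: str) -> int:
    dep = apps.read_namespaced_deployment(name, namespace)
    return dep.status.ready_replicas or 0

# Expect full readiness again within 120 seconds after the drain starts.
deadline = time.time() + 120
while time.time() < deadline:
    print("ready replicas:", ready_replicas("checkout", "shop"))
    time.sleep(5)
```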

Elasticity
Kubernetes

Kubernetes cluster

Kubernetes deployments

Kubernetes nodes

Dynatrace should detect a crash looping pod as a problem

Verify that Dynatrace alerts you on pods not being ready to accept traffic for a certain amount of time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to fix this by restarting the underlying container, hoping it will eventually become ready. If that doesn't work, Kubernetes backs off on restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that Dynatrace reports no problems for the entity and isn't already alerting on non-ready containers. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, Dynatrace should detect the problem and alert to ensure your on-call team takes action.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Crash loop
Dynatrace
Harden Observability
Kubernetes

Kubernetes cluster

Kubernetes pods

Faultless redundancy during rolling update

Kubernetes features a rolling update strategy to deploy new releases without downtime. Under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up to date.

Motivation

The Kubernetes rolling update strategy ensures that a minimum number of pods remain available while a new release is deployed. This implies that a new pod with the new release is started and needs to be ready before an old pod is evicted. Even so, this process may result in degraded performance and user-facing errors, e.g., Kubernetes sending requests to pods that are marked ready but unable to respond properly, or evicted pods still being retained in the load balancer.

Structure

Before performing the rolling update, all desired pods of the deployment need to be in the “ready” state, and a load-balanced user-facing HTTP endpoint is expected to respond successfully while under load. As soon as the rolling update takes place, the HTTP endpoint under load may suffer from degraded performance (e.g., lower success rate or higher response time). Even so, this should be within the boundaries of your SLA. After the rolling update, the number of desired pods matches the number of actual pods of the deployment, and the performance of the user-facing HTTP endpoint is similar to before the update.
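
As one possible configuration supporting this behavior, a rolling-update strategy that never drops below the desired replica count could be applied via the official Kubernetes Python client; the deployment and namespace names below are placeholders:

```python
# Sketch: keep all desired pods available during the rollout.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

strategy_patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxUnavailable": 0,  # never drop below the desired replica count
                "maxSurge": 1,        # start one extra pod with the new release first
            },
        }
    }
}

apps.patch_namespaced_deployment(name="checkout", namespace="shop", body=strategy_patch)
```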

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
  • Kubernetes deployment strategy
Rolling Update
Restart
Kubernetes

Kubernetes cluster

Kubernetes deployments

Faultless scaling of Kubernetes Deployment

Ensure that you can scale your deployment in a reasonable time without noticeable errors.

Motivation

For an elastic and resilient cloud infrastructure, ensure that you can scale your deployments without user-visible errors and within a reasonable amount of time. Long startup times, hiccups in the load balancer, or resource misallocation are undesirable but sometimes unnoticed and unexpected.

Structure

For the duration of the experiment and the deployment's upscaling, verify that a user-facing endpoint responds within expected success rates and that no monitors are alerting. As soon as the deployment is scaled up, the newly scheduled pod should be ready to receive traffic within a reasonable time, e.g., 60 seconds.
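
A minimal sketch of the scale-up and readiness check, assuming the official Kubernetes Python client and placeholder deployment and namespace names:

```python
# Sketch: trigger the scale-up and wait for the new pod to become ready.
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Scale from the current replica count to one more.
scale = apps.read_namespaced_deployment_scale("checkout", "shop")
apps.patch_namespaced_deployment_scale(
    "checkout", "shop", {"spec": {"replicas": scale.spec.replicas + 1}}
)

# Expect readiness within ~60 seconds.
deadline = time.time() + 60
while time.time() < deadline:
    dep = apps.read_namespaced_deployment("checkout", "shop")
    if (dep.status.ready_replicas or 0) >= scale.spec.replicas + 1:
        print("scaled up and ready")
        break
    time.sleep(5)
```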

Scalability
Elasticity
Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation and Datadog alerts when Postgres database can not be reached

An unavailable database should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage. You can address a potential impact on your system by implementing, e.g., a failover or caching mechanism.

Motivation

Database outages can occur for various reasons, including hardware failures, software bugs, network connectivity issues, or even intentional attacks. Such outages can severely affect your application, such as lost revenue, dissatisfied customers, and reputational damage. By testing your application's resilience to a database outage, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate an unavailable PostgreSQL database by blocking the PostgreSQL database client connection on a given hostname. During the outage, we will monitor the system and ensure that the user-facing endpoint indicates unavailability by responding with a "Service unavailable" status. We will also verify that at least one monitor in Datadog is alerting us to the database outage. Once the database becomes available again, we will verify that the endpoint automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to database outages and take appropriate measures to minimize their impact.
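
A minimal sketch of the graceful-degradation behavior the endpoint is expected to show, assuming a Flask service using psycopg2 and placeholder connection settings:

```python
# Sketch: respond with 503 while Postgres is unreachable instead of failing with a 500.
import psycopg2
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/orders")
def orders():
    try:
        conn = psycopg2.connect(
            host="postgres.internal", dbname="shop", user="app",
            password="secret", connect_timeout=2,
        )
    except psycopg2.OperationalError:
        # Database unreachable: degrade gracefully with "Service unavailable".
        return jsonify(error="database unavailable, please retry later"), 503
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT id FROM orders LIMIT 10")
            return jsonify(orders=[row[0] for row in cur.fetchall()])
    finally:
        conn.close()
```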

RDS
Postgres
Recoverability
Datadog
Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation and Datadog alerts when Postgres suffers latency

Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.

Motivation

Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we will monitor the system's behavior to ensure the service remains operational and can deliver its purpose. We will also analyze the performance metrics to identify any request types most affected by the latency and optimize them accordingly. Finally, we will end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can gain valuable insights into our application's resilience to database latencies and make informed decisions to optimize its performance under stress.
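
A minimal sketch of client-side timeouts that bound the impact of such latency, assuming psycopg2 and placeholder connection settings; the exact values depend on your SLOs:

```python
# Sketch: fail fast on slow connections and cancel long-running queries.
import psycopg2

conn = psycopg2.connect(
    host="postgres.internal",
    dbname="shop",
    user="app",
    password="secret",
    connect_timeout=2,                   # fail fast if the handshake is slow
    options="-c statement_timeout=500",  # cancel queries running longer than 500 ms
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM orders")
    print(cur.fetchone()[0])
```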

RDS
Postgres
Recoverability
Datadog
Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while Kafka is unavailable

An unavailable Kafka broker or even an entire cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.

Motivation

Kafka unavailability can occur for various reasons, such as hardware failure, network connectivity issues, or even intentional attacks. Such unavailability can severely affect your application, causing lost messages, data inconsistencies, and degraded performance. By testing the resilience of your system to Kafka unavailability, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then simulate an unavailable Kafka cluster by shutting down one or more Kafka brokers or the entire Kafka cluster. During the outage, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput. We will also verify that the system can handle the failure of a Kafka broker or a complete Kafka cluster outage without losing messages or data inconsistencies. Once the Kafka cluster becomes available again, we will verify that the system automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to Kafka unavailability and take appropriate measures to minimize their impact.
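
A minimal sketch of producer settings that tolerate a temporary broker outage without silently dropping messages, assuming the confluent-kafka Python client and a placeholder broker address:

```python
# Sketch: durable producer configuration with a bounded retry window.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",                  # wait for all in-sync replicas
    "enable.idempotence": True,     # avoid duplicates on retries
    "delivery.timeout.ms": 120000,  # keep retrying for up to 2 minutes
})

def on_delivery(err, msg):
    # Only report an error once the delivery timeout is exhausted.
    if err is not None:
        print(f"message permanently failed: {err}")

producer.produce("orders", value=b'{"id": 42}', on_delivery=on_delivery)
producer.flush(10)
```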

Kafka
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while Kafka suffers a high latency

Verify that your application handles an increased latency in your Kafka message delivery properly, allowing for increased processing time while maintaining the throughput.

Motivation

Latency in Kafka can occur for various reasons, such as network congestion, increased load, or insufficient resources. Such latency can impact your application's performance, causing delays in processing messages and affecting overall throughput. By testing your system's resilience to Kafka latency, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then induce latency on Kafka by introducing a delay on all incoming and outgoing messages. During the experiment, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput despite the increased processing time. We will also analyze the monitoring data to identify any potential bottlenecks or inefficiencies in the system and take appropriate measures to address them. Once the experiment is complete, we will remove the latency and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can identify any potential weaknesses in our system's resilience to Kafka latency and take appropriate measures to improve its performance and reliability.

Kafka
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while RabbitMQ is down

An unavailable RabbitMQ cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.

Motivation

RabbitMQ downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to RabbitMQ downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. To simulate downtime, we can shut down the RabbitMQ instance or cluster. The experiment aims to ensure your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the RabbitMQ instance or cluster is available again.
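
A minimal sketch of connection retries and a graceful fallback when the broker is down, assuming the pika client and placeholder host and queue names:

```python
# Sketch: retry the connection a few times, then degrade instead of crashing the request.
import pika
from pika.exceptions import AMQPConnectionError

def publish_event(body: bytes) -> bool:
    params = pika.ConnectionParameters(
        host="rabbitmq.internal",
        connection_attempts=3,  # retry the TCP connection a few times
        retry_delay=2,          # seconds between attempts
    )
    try:
        connection = pika.BlockingConnection(params)
    except AMQPConnectionError:
        # Broker unreachable: signal degradation to the caller.
        return False
    try:
        channel = connection.channel()
        channel.queue_declare(queue="orders", durable=True)
        channel.basic_publish(exchange="", routing_key="orders", body=body)
        return True
    finally:
        connection.close()
```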

RabbitMQ
Datadog
Recoverability
Kubernetes

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation of Kubernetes deployment while RabbitMQ suffers high latency

Verify that your application handles an increased latency in your RabbitMQ message delivery properly, allowing for increased processing time while maintaining the throughput.

Motivation

Latency issues in RabbitMQ can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing the resilience of your system to RabbitMQ latency, you can ensure that your system can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ latency, we expect the system to maintain its throughput and indicate unavailability appropriately. To simulate latency, we can introduce delays in message delivery. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.

RabbitMQ
Datadog
Recoverability
Kubernetes

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation when database can not be reached

An unavailable database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.

Motivation

Depending on your context, an unavailable database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the database is reachable again.

RDS
Recoverability
Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Microsoft SQL Server database can not be reached

An unavailable Microsoft SQL Server database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.

Motivation

Depending on your context, an unavailable Microsoft SQL Server database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Microsoft SQL Server database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Microsoft SQL Server database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Microsoft SQL Server database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Microsoft SQL Server database is reachable again.

RDS
Microsoft SQL Server
Recoverability
Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Oracle database can not be reached

An unavailable Oracle database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.

Motivation

Depending on your context, an unavailable Oracle database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Oracle database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Oracle database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Oracle database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Oracle database is reachable again.

RDS
Oracle
Recoverability
Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation when Postgres database can not be reached

An unavailable Postgres database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.

Motivation

Depending on your context, an unavailable Postgres database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Postgres database returns, your system should recover automatically.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Postgres database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Postgres database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Postgres database is reachable again.

RDS
Postgres
Recoverability
Database

Containers

Kubernetes cluster

Kubernetes deployments

Graceful degradation while Kafka is unavailable

An unavailable Kafka cluster is not user-visible, as it leads to graceful degradation and retries as soon as Kafka is available again.

Motivation

In case of an unavailable Kafka message broker, your application should still work successfully. To decouple your system's parts from each other, each Kafka client should take care of appropriate caching and retry mechanisms and shouldn't make the failed Kafka message broker visible to the end user. Instead, your system should fail gracefully and retry the submission as soon as the Kafka message broker is back again.

Structure

We will use two separate Postman collections to decouple request submission from checking business functionality. The first Postman collection runs while Kafka is unavailable. We expect the Postman collection to run without errors and the system to persist all requests in some way. After Kafka is available again, we will check with another Postman collection whether all requests have been received and processed. In between, we allow for some processing time.

Kafka
Recoverability
Postman
Kubernetes

Containers

Kubernetes cluster

Postman Collections

Grafana alert rule fires when a Kubernetes pod is in crash loop

Verify that a Grafana alert rule alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to fix this by restarting the underlying container, hoping it will eventually become ready. If that doesn't work, Kubernetes backs off on restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that the Grafana alert rule responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Grafana alert rule should fire and escalate it to your on-call team.
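
Grafana alert rules for this case are often backed by a Prometheus query over kube-state-metrics; a hypothetical query, checked here against the Prometheus HTTP API with a placeholder URL, could look like this:

```python
# Sketch: the kind of kube-state-metrics expression a Grafana alert rule could evaluate.
import requests

PROMETHEUS = "http://prometheus.monitoring:9090"
QUERY = 'sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=5)
resp.raise_for_status()
result = resp.json()["data"]["result"]

# An empty result means no container is crash looping, i.e. the alert rule stays OK.
print("crash looping containers detected" if result else "all containers healthy")
```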

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Crash loop
Harden Observability
Restart
Grafana
Kubernetes

Grafana alert rules

Kubernetes cluster

Kubernetes pods

Instana should detect a crash looping pod as an incident

Intent

Verify that Instana alerts you when pods are not ready to accept traffic for some time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to fix this by restarting the underlying container, hoping it will eventually become ready. If that doesn't work, Kubernetes backs off on restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that Instana has no critical events for the application perspective. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, Instana should detect this via a critical event to ensure your on-call team takes action.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Crash loop
Instana
Harden Observability
Kubernetes

Instana application perspectives

Kubernetes cluster

Kubernetes pods

Kubernetes deployment survives Redis downtime

Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by an unavailable Redis instance or a complete cluster.

Motivation

Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. We can block the traffic to the Redis instance to simulate downtime. The experiment aims to ensure that your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the Redis instance is available again.
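
A minimal sketch of the cache fallback the application is expected to implement, assuming redis-py, placeholder connection settings, and a hypothetical load_product_from_database helper:

```python
# Sketch: skip the cache and fall back to the database when Redis is unreachable.
import redis

cache = redis.Redis(host="redis.internal", port=6379, socket_timeout=0.5)

def get_product(product_id: str) -> bytes:
    try:
        cached = cache.get(f"product:{product_id}")
        if cached is not None:
            return cached
    except redis.exceptions.ConnectionError:
        pass  # Redis down: skip the cache instead of failing the request
    value = load_product_from_database(product_id)  # hypothetical DB helper
    try:
        cache.setex(f"product:{product_id}", 300, value)
    except redis.exceptions.ConnectionError:
        pass  # best-effort write-back; ignore while Redis is unavailable
    return value
```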

Redis
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes deployment survives Redis latency

Verify that your application handles an increased latency in a Redis cache properly, allowing for increased processing time while maintaining throughput.

Motivation

Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis latency, we expect the system to maintain its throughput and indicate unavailability appropriately. We can introduce delays in Redis operations to simulate latency. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.

Redis
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes Horizontal Pod Autoscaler Scales up Within Reasonable Time

Verify that your horizontal pod autoscaler scales up your Kubernetes deployment on increased CPU usage.

Motivation

Important deployments should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.

Structure

First, we ensure that all pods are ready to serve traffic. Afterward, we inject high CPU usage into the pods' containers and expect that, within a reasonable amount of time, the Horizontal Pod Autoscaler increases the number of pods and that they become ready to handle incoming traffic.
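
A minimal sketch of observing the autoscaler's reaction during the CPU attack, assuming the official Kubernetes Python client and placeholder HPA and namespace names:

```python
# Sketch: watch current vs. desired replicas while the CPU attack runs.
import time
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

for _ in range(30):  # watch for ~5 minutes
    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler("checkout", "shop")
    print(
        "current:", hpa.status.current_replicas,
        "desired:", hpa.status.desired_replicas,
        "cpu%:", hpa.status.current_cpu_utilization_percentage,
    )
    time.sleep(10)
```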

Scalability
Horizontal Pod Autoscaler
CPU
Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Kubernetes node shutdown results in new node startup

A resilient Kubernetes cluster can cope with a crashing node and simply starts a new one.

Motivation

A changing number of nodes in your Kubernetes cluster is expected, as you may update your nodes from time to time or simply scale the cluster depending on traffic peaks. This is especially true when using spot instances in a Cloud environment. This requires the deployments to be node-independent and properly configured to be rescheduled on a newly started node or a node that still has free resources.

Structure

Before restarting a node, we verify that the cluster is healthy and that the deployments are ready. Afterward, we trigger the shutdown of the node of a specific Kubernetes deployment and expect the deployment to be rescheduled on any other node and a new node to start up within a reasonable amount of time.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes

Warning

Please be aware that this experiment shuts down a node. Ensure this is acceptable and that the node is either virtual or can be started up again afterward.

Elasticity
Kubernetes

Hosts

Kubernetes cluster

Kubernetes deployments

Load balancer covers an AWS EC2 restart

EC2 instances are part of AWS Elastic Compute Cloud, which acquires and releases resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.

Motivation

Depending on your traffic demand, you can use the AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, do that automatically, while others, such as EC2, integrate with AWS Auto Scaling. Once configured, this boils down to EC2 instances frequently starting up or shutting down. Even when not using AWS Auto Scaling, your EC2 instances may need to be restarted occasionally for maintenance and updates. Thus, it is best practice to validate your application's behavior.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all EC2 instances available. While restarting an EC2 instance, the HTTP endpoint continues operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover to a 100% success rate once all EC2 instances are back.

Solution Sketch

  • AWS Well-Architected Framework
  • Kubernetes liveness, readiness, and startup probes
Scalability
Redundancy
Elasticity
AWS

EC2-instances

Load balancer covers an AWS zone outage

AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.

Motivation

AWS hosts your deployments and services across multiple locations worldwide. From a reliability standpoint, AWS regions and Availability Zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, spreading deployments across AWS Availability Zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications still work in case of an outage.

Structure

We leverage the AWS blackhole attack to simulate an AWS availability zone outage. Before the simulated outage, we ensure that a load-balanced user-facing endpoint works appropriately. During an AWS availability zone's unavailability, the HTTP endpoint must continue operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover as soon as the zone is back again.

Solution Sketch

  • Regions and Zones
  • Kubernetes liveness, readiness, and startup probes
Redundancy
AWS
Availability Zone

Zones

Load balancing hides a single container failure for end users

If a pod becomes temporarily unavailable, you want to ensure that Kubernetes is properly reacting, excluding that pod from the Service and restarting it.

Motivation

If configured properly, Kubernetes can detect a non-responding pod and try to fix it by simply restarting the unresponsive pod. Even so, the exact configuration requires careful consideration to avoid killing your pods too early or flooding your cluster's traffic with liveness probes.

Structure

Before killing a container of a Kubernetes pod, we verify that a load-balanced user-facing endpoint is working properly and that all Kubernetes deployment's pods are marked as ready. As soon as one container crashes, Kubernetes should detect the crashed container via a failing liveness probe and mark the related pod as not ready. Now, Kubernetes is expected to restart the container so the pod becomes ready within a certain time. The user-facing HTTP endpoint may suffer from degraded performance when being under load (e.g., lower success rate or higher response time). Even so, this is expected to be within the SLA boundaries.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Redundancy
Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Load-balanced endpoint covers exceeding ephemeral storage of Kubernetes deployment

Ensure that all containers of Kubernetes deployment resources have proper ephemeral storage limits configured to prevent the instability of other containers.

Motivation

For an elastic and resilient cloud infrastructure, ensure that the over-usage of ephemeral storage by one container doesn't affect any others. Furthermore, if one container exceeds its configured limits, Kubernetes should kill it and bring it back up within a given timeframe.

Structure

Verify that a user-visible endpoint responds within the expected success rates while the ephemeral storage is being exceeded. As soon as one container exceeds its ephemeral storage limit, by filling the disk in the /tmp directory, Kubernetes should evict the container, decreasing the number of ready pods. Within 60 seconds, the evicted container should run again, and the pod should be ready.
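
A minimal sketch of configuring such ephemeral-storage requests and limits, applied as a patch via the official Kubernetes Python client with placeholder deployment and container names:

```python
# Sketch: bound each container's ephemeral storage so over-usage gets the pod evicted
# instead of destabilizing its neighbors.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "checkout",
                    "resources": {
                        "requests": {"ephemeral-storage": "512Mi"},
                        "limits": {"ephemeral-storage": "1Gi"},  # exceeding this evicts the pod
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="checkout", namespace="shop", body=patch)
```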

Elasticity
Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Network loss for Kubernetes node's outgoing traffic in an availability zone

Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones suffers from a network loss.

Motivation

Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, spreading deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications still work in case of an outage.

Structure

We leverage the drop outgoing traffic attack to simulate network loss in an availability zone. If you want to test a full outage of the zone, configure it to 100% loss. While the network loss happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the network loss is over, we expect all deployments to recover within a specified time.

Solution Sketch

  • AWS Regions and Zones
  • Azure Regions and Zones
  • GCP Regions and Zones
  • Kubernetes liveness, readiness, and startup probes
AWS
Azure
GCP
Redundancy
Kubernetes
Availability Zone

Hosts

Kubernetes cluster

Kubernetes deployments

Network outage for Kubernetes nodes in an availability zone

Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones is down.

Motivation

Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.

Structure

We leverage the block traffic attack to simulate a full network loss in an availability zone. While the zone outage happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the zone outage is over, we expect that all deployments will recover again within a specified time.

Solution Sketch

  • AWS Regions and Zones
  • Azure Regions and Zones
  • GCP Regions and Zones
  • Kubernetes liveness, readiness, and startup probes
Azure
GCP
Redundancy
AWS
Availability Zone

Hosts

Kubernetes cluster

Kubernetes deployments

New Relic should detect a crash looping pod as a problem

Verify that New Relic alerts you when pods are not ready to accept traffic for some time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to fix this by restarting the underlying container, hoping it will eventually become ready. If that doesn't work, Kubernetes backs off on restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that New Relic has no critical events for the related entities. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, New Relic should detect this via an incident to ensure your on-call team takes action.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Crash loop
New Relic
Harden Observability
Kubernetes

Kubernetes cluster

Kubernetes pods

New Relic Accounts

New Relic should detect a disrupted workflow when a workload is unavailable

Verify that New Relic alerts you to disruptions in your workflow, such as a critical deployment without pods ready to serve traffic.

Motivation

Kubernetes features a liveness probe to determine whether your pod is healthy and can accept traffic. If Kubernetes cannot probe a pod, it restarts it in the hope that it will eventually become ready. For a critical deployment, the New Relic workflow should alert on this disruption.

Structure

First, check that the New Relic workflow is marked as operational. As soon as all pods of a workload become unreachable, caused by the block traffic attack, New Relic should detect this by marking the workflow as disrupted, ensuring your on-call team takes action.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
  • New Relic Workflow
New Relic
Harden Observability
Kubernetes

Containers

Kubernetes cluster

New Relic Accounts

New Relic Workloads

Reasonable recovery time in case of container failures

Quick startup times are favorable in Cloud environments to enable fast recovery and improve scaling.

Motivation

In Cloud environments, it is accepted that a pod or container may crash - the more important principle is that it should recover quickly. A faster startup time is beneficial in that case as it results in a smaller Mean Time To Recover (MTTR) and reduces user-facing downtime. Also, in case of request peaks, a reasonably short startup time allows scaling the deployment properly.

Structure

We simply stop a container of one of the pods to measure the time until it is marked as ready again. Therefore, before stopping the container, we ensure that the deployment is ready. If so, we stop the container and expect the number of ready pods to drop. Within a reasonable time (e.g., 60 seconds), the container should start up again, and all desired pods should be marked as ready.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
Scalability
Recoverability
Kubernetes
Starter

Containers

Kubernetes cluster

Kubernetes deployments

Reasonable recovery time when losing a pod

When deleting a pod, Kubernetes should bring up a new pod to ensure system stability.

Motivation

Deleting a pod simulates situations in which, for any reason, a pod stops working properly and needs to be rescheduled. This experiment makes sure that rescheduling works as expected and newly scheduled pods become ready within the expected timeframe.

Structure

All pods should be ready to accept traffic at the beginning of the experiment. Rescheduling should start as soon as a pod is deleted. Eventually, after the allotted time, all pods should be ready again.

Elasticity
Recoverability
Kubernetes

Kubernetes cluster

Kubernetes deployments

Kubernetes pods

Scaling up of ECS Service Within Given Time

Ensure that you can scale up your ECS service in a reasonable time.

Motivation

For an elastic and resilient cloud infrastructure, ensure you can scale up your ECS services within a reasonable time. Long startup times are undesirable but sometimes unnoticed and unexpected.

Structure

Validate that all ECS tasks of an ECS service are running. Once we scale the ECS service up, the newly scheduled task should be ready within a reasonable time.

Scalability
Elasticity
AWS ECS
AWS

ECS Services

Third-party service is unavailable for a Kubernetes Deployment

Identify the effect of an unavailable third-party service on your deployment's success metrics.

Motivation

When you provide a synchronous service via HTTP that requires the availability of other upstream third-party services, you absolutely should check how your service behaves in case the third-party service is unavailable. Also, you want to validate whether your service is working again as soon as the third-party service is working again.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service being unavailable, we expect the user-facing endpoint to keep working within the specified HTTP success rates. To simulate the unavailability, we block the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service is reachable again.
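
A minimal sketch of the client-side behavior that keeps the endpoint within its success rates, assuming the requests library and a placeholder third-party URL:

```python
# Sketch: a short timeout plus a fallback so a blocked third-party call
# doesn't break the user-facing response.
import requests

def fetch_recommendations(product_id: str) -> list:
    try:
        resp = requests.get(
            "https://thirdparty.example.com/recommendations",
            params={"product": product_id},
            timeout=(1, 2),  # 1 s connect, 2 s read
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return []  # optional feature: degrade to an empty list instead of an error
```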

Third-party
Upstream Service
Recoverability

Containers

Kubernetes cluster

Kubernetes deployments

Third-party service suffers high latency for a Kubernetes Deployment

Identify the effect of high latency in a third-party service on your deployment's success metrics.

Motivation

When you provide a synchronous service via HTTP that requires the availability of other upstream third-party services, you absolutely should check how your service behaves in case the third-party service suffers high latency. Also, you want to validate whether your service is working again as soon as the third-party service is working again.

Structure

We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service's high latency, we expect the user-facing endpoint to keep working within the specified HTTP success rates. To simulate the high latency, we delay the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service's latency returns to normal.

Third-party
Upstream Service
Recoverability

Containers

Kubernetes cluster

Kubernetes deployments

Unavailable upstream service doesn't result in user-visible errors

Verify that an unavailable upstream service doesn't result in user-visible errors.

Motivation

When offering a service that is dependent on upstream services, you should ensure that the offered service also works fine whenever one of the upstream services can't be reached. This is especially true when multiple upstream services are involved and the responses of each upstream service are considered optional.

Structure

For the duration of the experiment, verify that the user-visible endpoint responds within expected success rates, even while traffic to one of the optional upstream services is blocked. Once the upstream service is reachable again, the endpoint should behave exactly as before.

Read more

This experiment template is used in our quick start on running an experiment and is especially useful for the shopping demo example. To learn more, check out the quick start in the Steadybit docs.

Shopping Demo Quick Start
Block Traffic
Kubernetes

Containers

Kubernetes cluster

Kubernetes deployments

Validate Kubernetes probes for an unavailable upstream service

A failing upstream service (e.g., message broker, database, or cache) shouldn't cause liveness or readiness probe failures in Kubernetes, to avoid cascading restarts.

Motivation

In Kubernetes, liveness and readiness probes indicate whether a container is alive and able to serve incoming requests. These are especially helpful for load balancers. However, it is best practice in Kubernetes not to include upstream services in the probes. Otherwise, as soon as, e.g., a Kubernetes deployment's upstream service has issues, the deployment restarts as well, which may cause a cascade of failures in the Kubernetes cluster.

Structure

While blocking traffic from a deployment's container to an upstream service, we explicitly check the HTTP liveness and readiness probes of the Kubernetes deployment. Following best practices, we expect them not to be affected by the unavailable upstream service.
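
A minimal sketch of probe endpoints following this best practice, assuming a Flask service; note that the readiness handler deliberately does not call the upstream service:

```python
# Sketch: liveness and readiness reflect only the container itself,
# not the upstream broker, database, or cache.
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/healthz")   # liveness: is this process alive?
def liveness():
    return jsonify(status="up"), 200

@app.get("/ready")     # readiness: can this instance serve requests?
def readiness():
    # Deliberately no upstream call here; an unavailable upstream should surface
    # as degraded responses, not as probe failures and cascading restarts.
    return jsonify(status="ready"), 200
```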

References
  • Kubernetes liveness, readiness, and startup probes
  • Readiness and Liveness Probes best practices by kube-score
Kubernetes Probes
Third-party
Upstream Service
Recoverability
Kubernetes

Containers

Kubernetes cluster

Learn how to author a template and contribute it to the Resilience Hub.

Steadybit covers many out-of-the-box needs, but sometimes your organization may need proprietary or niche solutions. Leverage our templates to gain flexibility and address those needs!
