Templates
AWS ECS Service Is Scaled up Within Reasonable Time
Verify that your ECS service is scaled up on increased CPU usage.
Motivation
Important ECS services should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.
Structure
First, we ensure that all of the ECS service's tasks are ready to serve traffic. Afterward, we inject high CPU usage into the ECS task and expect that, within a reasonable amount of time, ECS increases the number of ECS tasks and they become ready to handle incoming traffic.
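The "within a reasonable amount of time" expectation can be sketched as a polling check with a deadline. This is a minimal illustration, not how the template is implemented: `get_running_tasks` is a hypothetical callable that you would back with your own lookup of the service's ready task count, and the clock and sleep functions are injectable only so the check can be exercised without waiting.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def scaled_up_in_time(
    get_running_tasks: Callable[[], int],  # hypothetical: current ready task count
    baseline: int,
    timeout_s: float,
    **kwargs,
) -> bool:
    """True if the task count exceeds `baseline` before the timeout."""
    return wait_until(lambda: get_running_tasks() > baseline, timeout_s, **kwargs)
```

The timeout is the part worth agreeing on up front: it is your definition of "reasonable" for this service.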
Datadog alerts when a Kubernetes pod is in crash loop
Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to resolve this by restarting the underlying container, hoping it eventually becomes ready. If that doesn't work, Kubernetes eventually backs off from restarting the container, and the Kubernetes resource remains non-functional.
Structure
First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Datadog monitor should alert and escalate it to your on-call team.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Draining a node should reschedule pods quickly
When draining a node, Kubernetes should reschedule running pods on other nodes without hiccups to ease, e.g., node maintenance.
Motivation
Draining a node may be necessary for, e.g., maintenance of a node. If that happens, Kubernetes should be able to reschedule the pods running on that node within the expected time and without user-noticeable failures.
Structure
For the entire duration of the experiment, a user-facing endpoint should work within expected success rates. At the beginning of the experiment, all pods should be ready to accept traffic. As soon as the node is drained, Kubernetes will evict the pods, but we still expect the pod's redundancy to be able to serve the user-facing endpoint. Eventually, after 120 seconds, all pods should be rescheduled and ready again to recover after the maintenance.
Dynatrace should detect crash looping as a problem
Verify that Dynatrace alerts you on pods not being ready to accept traffic for a certain amount of time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to resolve this by restarting the underlying container, hoping it eventually becomes ready. If that doesn't work, Kubernetes eventually backs off from restarting the container, and the Kubernetes resource remains non-functional.
Structure
First, check that Dynatrace reports no problems for the entity and isn't already alerting on non-ready containers. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, Dynatrace should detect the problem and alert to ensure your on-call team is taking action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Faultless redundancy during rolling update
Kubernetes features a rolling update strategy to deploy new releases without downtime. When being under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up-to-date.
Motivation
The Kubernetes rolling update strategy ensures that a minimum number of pods remain available when a new release is deployed. This implies that a new pod with a new release is started and needs to be ready before an old pod is evicted. Even so, this process may result in degraded performance and user-facing errors, e.g., Kubernetes sending requests to pods that are indicated as ready but unable to respond properly, or evicted pods still being retained in the load balancer.
Structure
Before performing the rolling update, all desired pods of the deployment need to be in the "ready" state, and a load-balanced user-facing HTTP endpoint is expected to respond successfully while under load. As soon as the rolling update takes place, the HTTP endpoint under load may suffer from degraded performance (e.g., lower success rate or higher response time). Even so, this should be within the boundaries of your SLA. After the rolling update, the number of desired pods matches the actual pods of the deployment, and the performance of the user-facing HTTP endpoint is similar to before the update.
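To make "within the boundaries of your SLA" concrete, the check can be reduced to a success rate computed over the HTTP status codes collected during the update. The sketch below is an illustration under assumptions: it counts only 2xx responses as successful, and the function names and the threshold are hypothetical and should come from your own SLA.

```python
from typing import Iterable


def success_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses with a 2xx status code."""
    codes = list(status_codes)
    if not codes:
        # An empty sample should not silently pass the check.
        raise ValueError("no responses recorded")
    ok = sum(1 for code in codes if 200 <= code < 300)
    return ok / len(codes)


def within_sla(status_codes: Iterable[int], min_success_rate: float) -> bool:
    """True if the observed success rate meets the assumed SLA threshold."""
    return success_rate(status_codes) >= min_success_rate
```

For example, `within_sla(codes, 0.95)` tolerates up to 5% failed requests during the update window.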
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
- Kubernetes deployment strategy
Faultless scaling of Kubernetes Deployment
Ensure that you can scale your deployment in a reasonable time without noticeable errors.
Motivation
For an elastic and resilient cloud infrastructure, ensure that you can scale your deployments without user-visible errors and within a reasonable amount of time. Long startup times, hiccups in the load balancer, or resource misallocation are undesirable but sometimes unnoticed and unexpected.
Structure
For the duration of the experiment and the deployment's upscaling, verify that a user-visible endpoint offered is responding within expected success rates and that no monitors are alerting. As soon as the deployment is scaled up, the newly scheduled pod should be ready to receive traffic within a reasonable time, e.g., 60 seconds.
Graceful degradation and Datadog alerts when Postgres database can not be reached
An unavailable database should be handled by your application gracefully and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage. You can address a potential impact on your system by implementing, e.g., a failover or caching mechanism.
Motivation
Database outages can occur for various reasons, including hardware failures, software bugs, network connectivity issues, or even intentional attacks. Such outages can severely affect your application, such as lost revenue, dissatisfied customers, and reputational damage. By testing your application's resilience to a database outage, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.
Structure
To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate an unavailable PostgreSQL database by blocking the PostgreSQL database client connection on a given hostname. During the outage, we will monitor the system and ensure that the user-facing endpoint indicates unavailability by responding with a "Service unavailable" status. We will also verify that at least one monitor in Datadog is alerting us to the database outage. Once the database becomes available again, we will verify that the endpoint automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to database outages and take appropriate measures to minimize their impact.
Graceful degradation and Datadog alerts when Postgres suffers latency
Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.
Motivation
Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.
Structure
To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we will monitor the system's behavior to ensure the service remains operational and can deliver its purpose. We will also analyze the performance metrics to identify any request types most affected by the latency and optimize them accordingly. Finally, we will end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can gain valuable insights into our application's resilience to database latencies and make informed decisions to optimize its performance under stress.
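Before running the attack, a rough back-of-the-envelope projection can hint at which request types are likely to hit their timeouts. The sketch below assumes that the added delay is paid once per sequential database round trip, which is a simplification; the `RequestProfile` fields and the example names are hypothetical and not part of the template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestProfile:
    name: str
    baseline_ms: float   # typical response time without the attack
    db_round_trips: int  # sequential database round trips per request
    timeout_ms: float    # client or gateway timeout for this request


def projected_latency_ms(profile: RequestProfile, added_delay_ms: float = 100.0) -> float:
    """Baseline latency plus the assumed delay for every database round trip."""
    return profile.baseline_ms + profile.db_round_trips * added_delay_ms


def at_risk(profiles: List[RequestProfile], added_delay_ms: float = 100.0) -> List[str]:
    """Names of request types whose projected latency exceeds their timeout."""
    return [
        p.name
        for p in profiles
        if projected_latency_ms(p, added_delay_ms) > p.timeout_ms
    ]
```

Requests with many sequential round trips are the ones to watch first; the experiment itself then confirms or refutes the projection.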
Graceful degradation of Kubernetes deployment while Kafka is unavailable
An unavailable Kafka broker or even an entire cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.
Motivation
Kafka unavailability can occur for various reasons, such as hardware failure, network connectivity issues, or even intentional attacks. Such unavailability can severely affect your application, causing lost messages, data inconsistencies, and degraded performance. By testing the resilience of your system to Kafka unavailability, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.
Structure
To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then simulate an unavailable Kafka cluster by shutting down one or more Kafka brokers or the entire Kafka cluster. During the outage, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput. We will also verify that the system can handle the failure of a Kafka broker or a complete Kafka cluster outage without losing messages or data inconsistencies. Once the Kafka cluster becomes available again, we will verify that the system automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to Kafka unavailability and take appropriate measures to minimize their impact.
Graceful degradation of Kubernetes deployment while Kafka suffers a high latency
Verify that your application handles an increased latency in your Kafka message delivery properly, allowing for increased processing time while maintaining the throughput.
Motivation
Latency in Kafka can occur for various reasons, such as network congestion, increased load, or insufficient resources. Such latency can impact your application's performance, causing delays in processing messages and affecting overall throughput. By testing your system's resilience to Kafka latency, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance.
Structure
To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then induce latency on Kafka by introducing a delay on all incoming and outgoing messages. During the experiment, we will monitor the system to ensure it continues delivering its intended functionality and maintaining its throughput despite the increased processing time. We will also analyze the monitoring data to identify any potential bottlenecks or inefficiencies in the system and take appropriate measures to address them. Once the experiment is complete, we will remove the latency and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can identify any potential weaknesses in our system's resilience to Kafka latency and take appropriate measures to improve its performance and reliability.
Graceful degradation of Kubernetes deployment while RabbitMQ is down
An unavailable RabbitMQ cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.
Motivation
RabbitMQ downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to RabbitMQ downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.
Structure
We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. To simulate downtime, we can shut down the RabbitMQ instance or cluster. The experiment aims to ensure your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the RabbitMQ instance or cluster is available again.
Graceful degradation of Kubernetes deployment while RabbitMQ suffers high latency
Verify that your application handles an increased latency in your RabbitMQ message delivery properly, allowing for increased processing time while maintaining the throughput.
Motivation
Latency issues in RabbitMQ can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing the resilience of your system to RabbitMQ latency, you can ensure that your system can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.
Structure
We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate RabbitMQ latency, we expect the system to maintain its throughput and indicate unavailability appropriately. To simulate latency, we can introduce delays in message delivery. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.
Graceful degradation when database can not be reached
An unavailable database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
Motivation
Depending on your context, an unavailable database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the database returns, your system should recover automatically.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the database is reachable again.
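The expectation per experiment phase can be written down as a small table and checked against the recorded responses. This is a sketch under assumptions: the phase names are hypothetical, a healthy endpoint is assumed to answer with 200, and "Service unavailable" is taken to mean HTTP status 503.

```python
from typing import Dict, List

# Assumed expectation per phase: healthy before the attack, 503 during the
# outage, healthy again once the database is reachable.
EXPECTED_STATUS = {"before": 200, "outage": 503, "after": 200}


def unexpected_responses(observations: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return, per phase, the status codes that differ from the expectation."""
    violations: Dict[str, List[int]] = {}
    for phase, expected in EXPECTED_STATUS.items():
        wrong = [code for code in observations.get(phase, []) if code != expected]
        if wrong:
            violations[phase] = wrong
    return violations
```

An empty result means the endpoint behaved as expected in every phase; a 500 during the outage, for instance, would show up under `"outage"`.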
Graceful degradation when Microsoft SQL Server database can not be reached
An unavailable Microsoft SQL Server database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
Motivation
Depending on your context, an unavailable Microsoft SQL Server database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Microsoft SQL Server database returns, your system should recover automatically.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Microsoft SQL Server database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Microsoft SQL Server database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Microsoft SQL Server database is reachable again.
Graceful degradation when Oracle database can not be reached
An unavailable Oracle database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
Motivation
Depending on your context, an unavailable Oracle database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Oracle database returns, your system should recover automatically.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Oracle database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Oracle database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Oracle database is reachable again.
Graceful degradation when Postgres database can not be reached
An unavailable Postgres database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
Motivation
Depending on your context, an unavailable Postgres database may be considered so severe that there are no suitable fallbacks. In this case, ensuring that your system indicates an appropriate error message is essential. After the Postgres database returns, your system should recover automatically.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate an unavailable Postgres database, we expect the user-facing endpoint to indicate unavailability by responding with a "Service unavailable" status. To simulate the unavailability, we can block the Postgres database client connection on its hostname so that no incoming or outgoing traffic goes through. The endpoint should recover automatically once the Postgres database is reachable again.
Graceful degradation while Kafka is unavailable
An unavailable Kafka broker should not be visible to users: the application degrades gracefully and retries as soon as Kafka is available again.
Motivation
In case of an unavailable Kafka message broker, your application should still work successfully. To decouple your system parts from each other, each Kafka client should take care of appropriate caching and retry mechanisms and shouldn't make the failed Kafka message broker visible to the end user. Instead, your system should fail gracefully and retry the submission as soon as the Kafka message broker is available again.
Structure
We will use two separate Postman collections to decouple request submissions and check business functionality. The first Postman collection runs while Kafka is unavailable. We expect the Postman collection to run without errors and the system to somehow save all requests. After Kafka is available again, we will check with another Postman collection to see whether all requests have been received and processed. In between, we allow for some processing time.
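The comparison between what was submitted during the outage and what was processed afterward boils down to a set difference over request identifiers. The sketch below assumes that each request carries a unique ID that both collections can report; the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def reconcile(submitted: Iterable[str], processed: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) request IDs.

    missing:    submitted during the outage but never processed
    unexpected: processed although never submitted
    """
    submitted_ids, processed_ids = set(submitted), set(processed)
    return submitted_ids - processed_ids, processed_ids - submitted_ids
```

The experiment passes when both sets are empty after the allowed processing time.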
Grafana alert rule fires when a Kubernetes pod is in crash loop
Verify that a Grafana alert rule alerts you when pods are not ready to accept traffic for a certain time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to resolve this by restarting the underlying container, hoping it eventually becomes ready. If that doesn't work, Kubernetes eventually backs off from restarting the container, and the Kubernetes resource remains non-functional.
Structure
First, check that the Grafana alert rule responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Grafana alert rule should fire and escalate it to your on-call team.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Instana should detect crash looping as an incident
Intent
Verify that Instana alerts you that pods are not ready to accept traffic for some time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to resolve this by restarting the underlying container, hoping it eventually becomes ready. If that doesn't work, Kubernetes eventually backs off from restarting the container, and the Kubernetes resource remains non-functional.
Structure
First, check that Instana has no critical events for an application perspective. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, Instana should detect this via a critical event to ensure your on-call team is taking action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Keep Deployment's pods down
Check what happens when all pods of a Kubernetes deployment aren't coming up again.
Motivation
Typically, Kubernetes tries to keep as many pods running as desired for a Kubernetes deployment. However, some circumstances may prevent Kubernetes from achieving this, like missing resources in the cluster, problems with the deployment's probes, or a CrashLoopBackOff. You should validate what happens to your overall provided service when a given deployment is directly affected by this or one of the upstream services used by your deployment.
Structure
To keep the pods down for a given deployment, we first kill all the pods in the deployment. Simultaneously, we scale down the Kubernetes deployment to 0 to simulate that these pods can't be scheduled again. At the end of the experiment, we automatically roll back the deployment's scale to the initial value.
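The scale-down with automatic rollback follows a familiar pattern: remember the original value, apply the change, and restore it no matter how the experiment ends. The sketch below illustrates only that pattern; `InMemoryDeployment` is a hypothetical stand-in that holds a replica count and does not talk to a Kubernetes cluster.

```python
from contextlib import contextmanager


class InMemoryDeployment:
    """Hypothetical stand-in for a real deployment; holds only a replica count."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas


@contextmanager
def scaled_to_zero(deployment: InMemoryDeployment):
    """Scale the deployment to 0 and restore the original scale on exit."""
    original = deployment.replicas
    deployment.replicas = 0
    try:
        yield deployment
    finally:
        # Roll back even if the experiment body raised an exception.
        deployment.replicas = original
```

Putting the rollback in `finally` mirrors the intent of the experiment: the initial scale is restored even when a check fails midway.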
Keep StatefulSet's pods down
Check what happens when all pods of a Kubernetes StatefulSet aren't coming up again.
Motivation
Typically, Kubernetes tries to keep as many pods running as desired for a Kubernetes StatefulSet. However, some circumstances may prevent Kubernetes from achieving this, like missing resources in the cluster, problems with the StatefulSet's probes, or a CrashLoopBackOff. You should validate what happens to your overall provided service when a given StatefulSet is directly affected by this or one of the upstream services used by your StatefulSet.
Structure
To keep the pods down for a given StatefulSet, we first kill all the pods in the StatefulSet. Simultaneously, we scale down the Kubernetes StatefulSet to 0 to simulate that these pods can't be scheduled again. At the end of the experiment, we automatically roll back the StatefulSet's scale to the initial value.
Kubernetes deployment survives Redis downtime
Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by an unavailable Redis instance or a complete cluster.
Motivation
Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.
Structure
We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. We can block the traffic to the Redis instance to simulate downtime. The experiment aims to ensure that your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the Redis instance is available again.
Kubernetes deployment survives Redis latency
Verify that your application handles an increased latency in a Redis cache properly, allowing for increased processing time while maintaining throughput.
Motivation
Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.
Structure
We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis latency, we expect the system to maintain its throughput and indicate unavailability appropriately. We can introduce delays in Redis operations to simulate latency. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.
Kubernetes Horizontal Pod Autoscaler Scales up Within Reasonable Time
Verify that your horizontal pod autoscaler scales up your Kubernetes deployment on increased CPU usage.
Motivation
Important deployments should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.
Structure
First, we ensure that all pods are ready to serve traffic. Afterward, we inject high CPU usage into the pods' containers and expect that, within a reasonable amount of time, the horizontal pod autoscaler increases the number of pods and the new pods become ready to handle incoming traffic.
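To estimate how many pods to expect after the CPU attack, the scaling rule documented for the Horizontal Pod Autoscaler can be applied by hand: the desired replica count is the current count multiplied by the ratio of the current to the target metric value, rounded up. The sketch below is a simplification that ignores the autoscaler's tolerance, stabilization windows, and pods that are not yet ready; the function and parameter names are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """ceil(current * current_metric / target_metric), clamped to [min, max]."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    raw = math.ceil(current_replicas * (current_cpu_pct / target_cpu_pct))
    return max(min_replicas, min(max_replicas, raw))
```

With 2 pods at 90% average CPU utilization and a 50% target, the estimate is 4 pods, which gives the experiment a concrete number to wait for.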
Kubernetes node shutdown results in new node startup
A resilient Kubernetes cluster can cope with a crashing node and simply starts a new one.
Motivation
A changing number of nodes in your Kubernetes cluster is expected, as you may update your nodes from time to time or simply scale the cluster depending on traffic peaks. This is especially true when using spot instances in a Cloud environment. This requires the deployments to be node-independent and properly configured to be rescheduled on a newly started node or a node that still has free resources.
Structure
Before restarting a node, we verify that the cluster is healthy and that the deployments are ready. Afterward, we trigger the shutdown of the node of a specific Kubernetes deployment and expect the deployment to be rescheduled on any other node and a new node to start up within a reasonable amount of time.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Warning
Please be aware that we will shut down a node. Please ensure this is fine and your node is either virtual or can somehow be started up afterward.
Load balancer covers an AWS EC2 restart
EC2 (Amazon Elastic Compute Cloud) can acquire and release resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.
Motivation
Depending on your traffic demand, you can use the AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, do that automatically, while others, such as EC2, integrate with AWS Auto Scaling. Once configured, this results in EC2 instances starting or shutting down frequently. Even when not using AWS Auto Scaling, your EC2 instances may need to be restarted occasionally for maintenance and updating purposes. Thus, it is best practice to validate your application's behavior.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all EC2 instances available. While restarting an EC2 instance, the HTTP endpoint continues operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover to a 100% success rate once all EC2 instances are back.
Solution Sketch
- AWS Well-Architected Framework
- Kubernetes liveness, readiness, and startup probes
Load balancer covers an AWS zone outage
AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.
Motivation
AWS hosts your deployments and services across multiple locations worldwide. From a reliability standpoint, AWS Regions and Availability Zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a Region. For most use cases, distributing deployments across AWS Availability Zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.
Structure
We leverage the AWS blackhole attack to simulate an AWS availability zone outage. Before the simulated outage, we ensure that a load-balanced user-facing endpoint works appropriately. During an AWS availability zone's unavailability, the HTTP endpoint must continue operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover as soon as the zone is back again.
Solution Sketch
- Regions and Zones
- Kubernetes liveness, readiness, and startup probes
Load balancing hides a single container failure for end users
If a pod becomes temporarily unavailable, you want to ensure that Kubernetes is properly reacting, excluding that pod from the Service and restarting it.
Motivation
If configured properly, Kubernetes can detect a non-responding pod and try to fix it by simply restarting the unresponsive pod. Even so, the exact configuration requires careful consideration to avoid killing your pods too early or flooding your cluster's traffic with liveness probes.
Structure
Before killing a container of a Kubernetes pod, we verify that a load-balanced user-facing endpoint is working properly and that all Kubernetes deployment's pods are marked as ready. As soon as one container crashes, Kubernetes should detect the crashed container via a failing liveness probe and mark the related pod as not ready. Now, Kubernetes is expected to restart the container so the pod becomes ready within a certain time. The user-facing HTTP endpoint may suffer from degraded performance when being under load (e.g., lower success rate or higher response time). Even so, this is expected to be within the SLA boundaries.
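How long Kubernetes needs to notice a crashed container depends on the liveness probe settings. As a rough estimate, a restart is triggered after `failureThreshold` consecutive failed probes, which are `periodSeconds` apart, plus the `timeoutSeconds` of the last probe. The sketch below encodes only this arithmetic; the defaults mirror the Kubernetes probe defaults, while the function name is hypothetical and the result is an approximation, not a guarantee.

```python
def worst_case_restart_trigger_s(
    period_s: int = 10,          # periodSeconds
    failure_threshold: int = 3,  # failureThreshold
    timeout_s: int = 1,          # timeoutSeconds
) -> int:
    """Approximate upper bound, in seconds, until a restart is triggered."""
    return period_s * failure_threshold + timeout_s
```

With the defaults, this yields about 31 seconds, a useful lower bound for the "certain time" you allow the pod to become ready again, since the container startup time comes on top.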
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Load-balanced endpoint covers exceeding ephemeral storage of Kubernetes deployment
Ensure that all containers of Kubernetes deployment resources have proper ephemeral storage limits configured to prevent the instability of other containers.
Motivation
For an elastic and resilient cloud infrastructure, ensure that the over-usage of ephemeral storage of one container doesn't affect any others. Furthermore, if one container exceeds its configured limits, Kubernetes should kill it and eventually bring it back to a ready state within a given timeframe.
Structure
Verify that a user-visible endpoint responds within the expected success rates while exceeding the ephemeral storage.
As soon as one container exceeds the ephemeral storage, by filling the disk in a /tmp directory, Kubernetes should evict the container, decreasing ready pods. Within 60 seconds, the evicted container should run again, and the pod should be ready.
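The "ready again within 60 seconds" expectation amounts to polling a readiness condition until a deadline. A minimal sketch, assuming a caller-supplied `condition` callable; the helper name `wait_until` and the stand-in readiness check are hypothetical and not part of any Steadybit or Kubernetes API.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float,
               interval_s: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` seconds have passed.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a real readiness query; here the pod "recovers" on the third poll.
polls = iter([False, False, True])
print(wait_until(lambda: next(polls), timeout_s=60, interval_s=0.01))
```

In a real check, `condition` would compare the number of ready pods against the desired replica count.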
Network loss for Kubernetes node's outgoing traffic in an availability zone
Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones suffers from a network loss.
Motivation
Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.
Structure
We leverage the drop outgoing traffic attack to simulate network loss in an availability zone. If you want to test for a full outage of the zone, configure it to 100% loss. While the network loss happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the network loss is over, we expect that all deployments will recover again within a specified time.
Solution Sketch
- AWS Regions and Zones
- Azure Regions and Zones
- GCP Regions and Zones
- Kubernetes liveness, readiness, and startup probes
Network outage for Kubernetes nodes in an availability zone
Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones is down.
Motivation
Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, applying deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications are still working in case of an outage.
Structure
We leverage the block traffic attack to simulate a full network loss in an availability zone. While the zone outage happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the zone outage is over, we expect that all deployments will recover again within a specified time.
Solution Sketch
- AWS Regions and Zones
- Azure Regions and Zones
- GCP Regions and Zones
- Kubernetes liveness, readiness, and startup probes
New Relic should detect crash looping as a problem
Verify that New Relic alerts you that pods are not ready to accept traffic for some time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If it isn't becoming ready, Kubernetes tries to solve it by restarting the underlying container and hoping to achieve its readiness eventually. If this isn't working, Kubernetes will eventually back off to restart the container, and the Kubernetes resource remains non-functional.
Structure
First, check that New Relic has no critical events for related entities. As soon as one of the containers is crash looping, caused by Steadybit's crash loop attack, New Relic should detect this via an incident to ensure your on-call team is taking action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
New Relic should detect a disrupted workflow when a workload is unavailable
Verify that New Relic alerts you to disruptions in your workflow, such as a critical deployment without pods ready to serve traffic.
Motivation
Kubernetes features a liveness probe to determine whether your pod is healthy and can accept traffic. If Kubernetes cannot probe a pod, it restarts it in the hope that it will eventually be ready. If it is a critical deployment, the New Relic workflow should alert on this disruption.
Structure
First, check that the New Relic Workflow is marked as operational. As soon as all pods of a workload aren't reachable, caused by the block traffic attack, New Relic should detect this by marking the workflow as disrupted and ensuring your on-call team is taking action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
- New Relic Workflow
Reasonable recovery time in case of container failures
Quick startup times are favorable in Cloud environments to enable fast recovery and improve scaling.
Motivation
In Cloud environments, it is accepted that a pod or container may crash - the more important principle is that it should recover quickly. A faster startup time is beneficial in that case as it results in a smaller Mean Time To Recover (MTTR) and reduces user-facing downtime. Also, in case of request peaks, a reasonably short startup time allows scaling the deployment properly.
Structure
We simply stop a container of one of the pods to measure the time until it is marked as ready again. Therefore, before stopping the container, we ensure that the deployment is ready. If so, we stop the container and expect the number of ready pods to drop. Within a reasonable time (e.g., 60 seconds), the container should start up again, and all desirable pods should be marked as ready.
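The recovery time measured here can be derived from a series of readiness samples. The following is a sketch under assumptions: the `(timestamp, ready_pods)` sample format and the function name are made up for illustration, and the sample values are invented.

```python
from typing import Optional, Sequence, Tuple

# (seconds since experiment start, number of ready pods)
Sample = Tuple[float, int]


def recovery_time(samples: Sequence[Sample], desired: int) -> Optional[float]:
    """Seconds between the first drop below `desired` and the next full recovery.

    Returns None if readiness never dropped or never recovered in the samples.
    """
    drop_at = None
    for timestamp, ready in samples:
        if drop_at is None:
            if ready < desired:
                drop_at = timestamp  # first sample with a missing pod
        elif ready >= desired:
            return timestamp - drop_at  # all desired pods are ready again
    return None


# Hypothetical timeline: one of three pods is unready from t=5 until t=42.
timeline = [(0, 3), (5, 2), (20, 2), (42, 3)]
elapsed = recovery_time(timeline, desired=3)
print(elapsed is not None and elapsed <= 60)
```

Comparing the result against a budget such as 60 seconds turns the measurement into a pass/fail expectation.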
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
Reasonable recovery time when losing a pod
When deleting a pod, Kubernetes should bring up a new pod to ensure system stability.
Motivation
Deleting a pod simulates situations in which, for any reason, a pod stops working properly and needs to be rescheduled. This experiment makes sure that rescheduling works as expected and newly scheduled pods become ready within the expected timeframe.
Structure
All pods should be ready to accept traffic at the beginning of the experiment. Rescheduling should start as soon as a pod is deleted. Eventually, after the allotted time, all pods should be ready again.
Scaling up of ECS Service Within Given Time
Ensure that you can scale up your ECS service in a reasonable time.
Motivation
For an elastic and resilient cloud infrastructure, ensure you can scale up your ECS services within a reasonable time. Long startup times are undesirable but sometimes unnoticed and unexpected.
Structure
Validate that all ECS tasks of an ECS service are running. Once we scale the ECS service up, the newly scheduled task should be ready within a reasonable time.
Third-party service is unavailable for a Kubernetes Deployment
Identify the effect of an unavailable third-party service on the success metrics of your deployment's service.
Motivation
When you provide a synchronous service via HTTP that requires the availability of other upstream third-party services, you absolutely should check how your service behaves in case the third-party service is unavailable. Also, you want to validate whether your service is working again as soon as the third-party service is working again.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service being unavailable, we expect the user-facing endpoint to work within specified HTTP success rates. To simulate the unavailability, we can block the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service is reachable again.
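One common way for a service to stay within its success rates while a dependency is blocked is to degrade gracefully instead of propagating the error. A minimal sketch, assuming the third-party response is optional; `call_with_fallback` and `fetch_recommendations` are hypothetical names, not part of any referenced product.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_fallback(
    call: Callable[[], T],
    fallback: T,
    errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Run `call`; if it fails with a connectivity error, return `fallback`."""
    try:
        return call()
    except errors:
        return fallback


def fetch_recommendations() -> list:
    """Stand-in for a third-party call whose traffic is currently blocked."""
    raise ConnectionError("third-party service unreachable")


# The user-facing response is still served, just without the optional data.
print(call_with_fallback(fetch_recommendations, fallback=[]))
```

Only connectivity errors are swallowed; other exceptions still surface, so genuine bugs are not hidden by the fallback.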
Third-party service suffers high latency for a Kubernetes Deployment
Identify the effect of high latency of a third-party service on the success metrics of your deployment's service.
Motivation
When you provide a synchronous service via HTTP that requires the availability of other upstream third-party services, you absolutely should check how your service behaves in case the third-party service suffers high latency. Also, you want to validate whether your service is working again as soon as the third-party service is working again.
Structure
We ensure that a load-balanced user-facing endpoint fully works while having all pods ready. When we simulate the third-party service's high latency, we expect the user-facing endpoint to work within specified HTTP success rates. To simulate the high latency, we can delay the traffic to the third-party service on the client side using its hostname. The endpoint should recover automatically once the third-party service is reachable again.
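A slow dependency is typically handled with a time budget on the caller's side. The sketch below is one possible approach using Python's standard `concurrent.futures`; the helper name, the budget, and the simulated delay are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_deadline(call: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Return the result of `call`, or `fallback` if it takes longer than `timeout_s`.

    Note: the slow call is not cancelled; it keeps running in a background thread.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


def slow_third_party() -> str:
    """Stand-in for a third-party call delayed by injected latency."""
    time.sleep(0.3)
    return "late answer"


print(call_with_deadline(slow_third_party, timeout_s=0.05, fallback="default answer"))
```

In practice, most HTTP clients offer their own timeout settings, which are usually preferable to wrapping calls in threads.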
Unavailable upstream service doesn't result in user-visible errors
Verify that an unavailable upstream service doesn't result in user-visible errors.
Motivation
When offering a service that is dependent on upstream services, you should ensure that the offered service also works fine whenever one of the upstream services can't be reached. This is especially true when multiple upstream services are involved and the responses of each upstream service are considered optional.
Structure
For the duration of the experiment and the deployment's upscaling, verify that a user-visible endpoint responds within expected success rates and that no monitors are alerting. As soon as the deployment is scaled up, the newly scheduled pod should be ready to receive traffic within a reasonable time, e.g., 60 seconds.
Read more
This experiment template is used in our quick start on running an experiment and is especially useful for the shopping demo example. To learn more, check out the quick start in the Steadybit docs.
Validate consumer's behavior when new leader is elected
Verify that your application handles a change of the leader properly.
Motivation
By testing your system's resilience to Kafka leader changes in a partition, you can identify potential weaknesses and take appropriate measures to improve its performance.
Structure
To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then elect a new leader for one partition in Kafka and expect the system to work fine.
Validate Kubernetes probes for an unavailable upstream service
A failing upstream service (e.g., message broker, database, or cache) shouldn't cause liveness or readiness probe failures in Kubernetes, so that cascading restarts are avoided.
Motivation
In Kubernetes, liveness and readiness probes indicate whether a container is alive and able to serve incoming requests. These are especially helpful for load balancers. However, it is best practice in Kubernetes not to include upstream services in the probes. Otherwise, as soon as, e.g., a Kubernetes deployment's upstream service has issues, the deployment restarts as well, which may cause a cascade of failures in the Kubernetes cluster.
Structure
While blocking traffic from a deployment's container to an upstream service, we explicitly check the HTTP liveness and readiness probes of the Kubernetes deployment. Following best practices, we expect them not to be affected by the unavailable upstream service.
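The best practice above can be illustrated with probe handlers that look only at the process's own state. This is a simplified model, not a real HTTP server: `ServiceState` and the three functions are hypothetical, and the status codes are one reasonable choice.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    started: bool = True             # the process finished its own startup
    upstream_reachable: bool = True  # e.g., database or message broker (not probed)


def liveness_status(state: ServiceState) -> int:
    """HTTP status for the liveness probe: depends only on the process itself."""
    return 200 if state.started else 500


def readiness_status(state: ServiceState) -> int:
    """HTTP status for the readiness probe: deliberately ignores the upstream."""
    return 200 if state.started else 503


def handle_request(state: ServiceState) -> int:
    """Business endpoint: the upstream failure surfaces here, not in the probes."""
    return 200 if state.upstream_reachable else 503


# With the upstream blocked, both probes still pass, so nothing is restarted.
blocked = ServiceState(upstream_reachable=False)
print(liveness_status(blocked), readiness_status(blocked), handle_request(blocked))
```

Keeping the upstream out of the probes means a broker or database outage degrades individual requests instead of triggering restarts across every dependent deployment.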
References
- Kubernetes liveness, readiness, and startup probes
- Readiness and Liveness Probes best practices by kube-score