HTTP Check Periodically
Execute HTTP calls and verify responses periodically.
Load balancer covers an AWS EC2 restart
Amazon EC2 (Elastic Compute Cloud) lets you acquire and release compute resources depending on traffic demand. Check whether your application is equally elastic by rebooting an EC2 instance.
Motivation
Depending on your traffic demand, you can use the AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, scale on their own, while others, such as EC2, integrate with AWS Auto Scaling. Once Auto Scaling is configured, EC2 instances start and shut down frequently as demand fluctuates. Even without AWS Auto Scaling, your EC2 instances may need to be restarted occasionally for maintenance and updates. It is therefore best practice to validate how your application behaves when an instance restarts.
Structure
Before the restart, we verify that a load-balanced, user-facing endpoint works fully while all EC2 instances are available. While an EC2 instance restarts, the HTTP endpoint must continue operating but may show degraded performance (e.g., a lower success rate or a higher response time). Once all EC2 instances are back, the success rate should recover to 100%.
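The success-rate check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: the function names `http_ok` and `success_rate`, the 2xx success criterion, and the timeout default are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx answers (HTTPError) count as failures.
        return False


def success_rate(probe: Callable[[], bool], checks: int, interval_s: float) -> float:
    """Call `probe` `checks` times, `interval_s` seconds apart; return the success percentage."""
    if checks < 1:
        raise ValueError("checks must be at least 1")
    successes = 0
    for i in range(checks):
        if probe():
            successes += 1
        if i < checks - 1:
            time.sleep(interval_s)  # wait between checks, but not after the last one
    return 100.0 * successes / checks
```

A call such as `success_rate(lambda: http_ok("https://example.com/health"), checks=30, interval_s=1.0)` would probe a placeholder endpoint once per second for about 30 seconds. Passing the probe as a callable keeps the counting logic testable without network access.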
Solution Sketch
- AWS Well-Architected Framework
- Kubernetes liveness, readiness, and startup probes
EC2-instances
Load balancer covers an AWS zone outage
AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating an Availability Zone outage.
Motivation
AWS hosts your deployments and services across multiple locations worldwide. From a reliability standpoint, AWS Regions and Availability Zones are the most interesting. A Region is a separate geographic area, while an Availability Zone is an isolated location within a Region. For most use cases, distributing deployments across multiple Availability Zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications keep working during a zone outage.
Structure
We use the AWS blackhole attack to simulate an Availability Zone outage. Before the simulated outage, we ensure that a load-balanced, user-facing endpoint works appropriately. While the Availability Zone is unavailable, the HTTP endpoint must continue operating but may show degraded performance (e.g., a lower success rate or a higher response time). Performance should recover as soon as the zone is available again.
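The three expectations above (healthy before, degraded but operating during, recovered after) can be written down as a small pass/fail evaluation. The sketch below is illustrative only: the function name `evaluate_zone_outage` and the 90% floor during the outage are assumptions, since the text does not state how much degradation is acceptable.

```python
from typing import List


def evaluate_zone_outage(before: float, during: float, after: float,
                         min_during: float = 90.0) -> List[str]:
    """Compare HTTP success rates (0-100) per phase against the expectations.

    Returns the list of violated expectations; an empty list means the
    experiment passed. `min_during` is an assumed tolerance, not a value
    taken from the experiment description.
    """
    violations = []
    if before < 100.0:
        violations.append(f"endpoint not fully healthy before the outage: {before:.1f}%")
    if during < min_during:
        violations.append(f"success rate during the outage below {min_during:.1f}%: {during:.1f}%")
    if after < 100.0:
        violations.append(f"endpoint did not recover after the outage: {after:.1f}%")
    return violations
```

Keeping the threshold as a parameter makes the tolerated degradation an explicit decision per service rather than a hidden constant.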
Solution Sketch
- Regions and Zones
- Kubernetes liveness, readiness, and startup probes
Zones
Kubernetes deployment survives Redis latency
Verify that your application properly handles increased latency in a Redis cache, tolerating longer processing times while maintaining throughput.
Motivation
Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it tolerates longer processing times and maintains its throughput. You can also identify potential bottlenecks or inefficiencies and take appropriate measures to improve performance and reliability.
Structure
We first verify that a load-balanced, user-facing endpoint works fully while all pods are ready. We then introduce delays in Redis operations to simulate latency. While the latency is active, we expect the system to maintain its throughput and to indicate unavailability appropriately. Performance should return to normal once the latency has ended.
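One common way for an application to keep its throughput while the cache is slow is to bound how long it waits for the cache and then read from the source of truth instead. The sketch below shows that pattern with the standard library only; `cached_or_fallback`, the injected `cache_get` and `load_from_source` callables, and the 50 ms default are hypothetical and stand in for a real Redis client and data store.

```python
import concurrent.futures
from typing import Callable, Optional

# Shared worker pool so a slow cache call is waited on with a deadline
# instead of blocking the request thread for the full cache latency.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def cached_or_fallback(cache_get: Callable[[str], Optional[str]],
                       load_from_source: Callable[[str], str],
                       key: str,
                       timeout_s: float = 0.05) -> str:
    """Read `key` from the cache, waiting at most `timeout_s` seconds.

    On a timeout or a cache miss (None), load the value from the source
    of truth instead.
    """
    future = _POOL.submit(cache_get, key)
    try:
        value = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Cache is too slow: stop waiting and treat it like a miss.
        value = None
    if value is None:
        value = load_from_source(key)
    return value
```

The abandoned cache call keeps running in its worker thread, so under sustained latency the pool can fill up. A production setup would more likely rely on the client's own socket timeout; this version only illustrates the behavior the experiment is meant to verify.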
Containers
Datadog monitors
Kubernetes cluster
Kubernetes deployments
Kubernetes deployment survives Redis downtime
Check that your application gracefully handles Redis cache downtime and continues to deliver its intended functionality. The downtime may be caused by a single unavailable Redis instance or by an outage of the entire cluster.
Motivation
Redis downtime can lead to degraded system performance, lost data, and potentially long recovery times. By testing your system's resilience to Redis downtime, you can ensure that it handles the outage gracefully and continues to deliver its intended functionality. You can also identify potential weaknesses and take appropriate measures to improve performance and resilience.
Structure
We first verify that a load-balanced, user-facing endpoint works fully while all pods are ready. We then block traffic to the Redis instance to simulate downtime. While Redis is unreachable, we expect the system to indicate unavailability appropriately, maintain its throughput, and continue delivering its intended functionality. Performance should return to normal once the Redis instance is available again.
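What "indicating unavailability appropriately" looks like depends on the application. One plausible reading is sketched below: an unreachable cache is treated as a cache miss, and HTTP 503 is returned only when the source of truth fails as well. The handler name, the injected callables, and the status-code choices are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def handle_request(cache_get: Callable[[str], Optional[str]],
                   load_from_source: Callable[[str], str],
                   key: str) -> Tuple[int, str]:
    """Return (HTTP status, body) for a read of `key`.

    A cache outage degrades to a direct read from the source of truth;
    503 is returned only if that read fails too.
    """
    try:
        cached = cache_get(key)
    except (ConnectionError, TimeoutError):
        # Built-in exceptions are used here for the sketch. A real client
        # raises its own types (redis-py, for example, has
        # redis.exceptions.ConnectionError), which would be caught instead.
        cached = None
    if cached is not None:
        return 200, cached
    try:
        return 200, load_from_source(key)
    except Exception:
        # Nothing left to serve from: signal unavailability explicitly.
        return 503, "Service Unavailable"
```

Treating the outage as a miss keeps the endpoint answering, at the cost of extra load on the backing store while the cache is down.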
Containers
Datadog monitors
Kubernetes cluster
Kubernetes deployments