A resilient Kubernetes cluster is able to cope with a changing number of hosts and avoid user-facing reliability issues.
Hosts
An unavailable Kafka cluster should not be visible to users: the application degrades gracefully, and downstream retries resume as soon as Kafka is available again.
Containers
If configured properly, Kubernetes can detect an unresponsive pod and try to fix it by restarting that pod. Even so, the exact configuration requires careful consideration to avoid killing your pods too early or flooding your cluster with liveness probe traffic.
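The trade-off between restarting too early and probing too often can be estimated with simple arithmetic. The sketch below is only an approximation: the field names mirror the Kubernetes probe settings `periodSeconds`, `timeoutSeconds` and `failureThreshold`, while the class and function names are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivenessProbe:
    # Defaults follow the Kubernetes probe defaults; tune them per workload.
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3


def seconds_until_restart(probe: LivenessProbe) -> int:
    """Rough upper bound on how long a hung container keeps running.

    The kubelet needs `failure_threshold` consecutive failed probes, one per
    period, and the last failed probe may take up to `timeout_seconds`.
    """
    return probe.failure_threshold * probe.period_seconds + probe.timeout_seconds


def probes_per_second(pod_count: int, probe: LivenessProbe) -> float:
    """Approximate steady-state probe requests per second across all pods."""
    return pod_count / probe.period_seconds
```

A shorter period detects a hung pod sooner but raises the probe rate proportionally, so both numbers are worth checking together before running the experiment.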
Amazon EC2 (Elastic Compute Cloud) lets you acquire and release compute resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.
EC2-instances
AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.
Zones
Your application should handle an unavailable Kafka broker, or even an unavailable cluster, gracefully and indicate the outage appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage.
Datadog monitors
Verify that your application handles increased latency in Kafka message delivery properly: processing time may increase, but throughput should be maintained.
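Whether throughput can be maintained under higher latency depends on how many messages the application can keep in flight. As a rough sizing aid, Little's Law relates the two; the function below is a hypothetical helper, not part of any Kafka client.

```python
import math


def required_concurrency(throughput_per_second: int, latency_ms: int) -> int:
    """Little's Law: messages in flight = throughput * time per message."""
    return math.ceil(throughput_per_second * latency_ms / 1000)
```

For example, sustaining 200 messages per second needs about 10 messages in flight at 50 ms per message, and about 50 once latency rises to 250 ms. If the consumer cannot scale its concurrency that far, throughput will drop during the experiment.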
Your application should handle an unavailable database gracefully and indicate the outage appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage. You can reduce the potential impact on your system by implementing, for example, a failover or caching mechanism.
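One way to picture a failover mechanism is a loop over candidate endpoints, sketched below under simplifying assumptions. Each endpoint is modeled as a plain callable, and the built-in `ConnectionError` stands in for whatever exception your database driver actually raises; all names here are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when no endpoint could serve the query."""


def query_with_failover(endpoints: Sequence[Callable[[], T]]) -> T:
    """Try each endpoint in order and return the first successful result."""
    errors: list[Exception] = []
    for run_query in endpoints:
        try:
            return run_query()
        except ConnectionError as exc:
            # Assumption: connection failures surface as ConnectionError.
            errors.append(exc)
    last_error = errors[-1] if errors else None
    raise AllEndpointsFailed(f"{len(errors)} endpoint(s) failed") from last_error
```

When every endpoint fails, the distinct `AllEndpointsFailed` error gives the application a single place to signal unavailability, which is also what a monitor would key on.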
Your application should continue to function properly and indicate unavailability appropriately when connection latency to PostgreSQL increases. Additionally, this experiment can highlight requests whose timeouts need tuning to prevent dropped requests.
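To illustrate what a tuned timeout could look like, the following sketch bounds a slow call and returns a fallback value instead of dropping the request. It uses only the standard library; `call_with_timeout` and the pool size are assumptions made for this example, and a real PostgreSQL driver usually offers its own connect and statement timeouts that are preferable.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared worker pool; the size of 4 is an arbitrary illustrative value.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func: Callable[[], T], timeout_seconds: float, fallback: T) -> T:
    """Run `func` in a worker thread and give up waiting after the timeout."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # task keeps occupying its worker until it finishes on its own.
        future.cancel()
        return fallback
```

Because a timed-out call still occupies a worker, sustained latency can exhaust the pool, which is exactly the kind of effect this experiment is meant to surface.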
Your application should handle an unavailable RabbitMQ cluster gracefully and indicate the outage appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage.
Verify that your application handles increased latency in RabbitMQ message delivery properly: processing time may increase, but throughput should be maintained.
Check that your application handles a Redis cache downtime gracefully and continues to deliver its intended functionality. The cache downtime may be an unavailable Redis instance or a complete cluster.
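Graceful handling of a cache outage often means falling back to the source of truth. The sketch below shows one possible read-through pattern. The `Cache` protocol is a hypothetical stand-in for a Redis client, and the built-in `ConnectionError` is an assumption: real Redis clients raise their own exception types, which you would catch instead.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def read_through(cache: Cache, key: str, load: Callable[[str], str]) -> str:
    """Serve from the cache when possible, otherwise from the backing store."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        # Cache is down: answer from the backing store and skip the write-back.
        return load(key)
    value = load(key)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # Losing the cache write must not fail the request.
    return value
```

Note that during an outage every request reaches the backing store, so the experiment should also confirm that the store can absorb the additional load.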
Verify that your application handles increased latency in a Redis cache properly: processing time may increase, but throughput should be maintained.
An unavailable Microsoft SQL Server database might be too severe for suitable fallbacks, in which case your system must indicate unavailability appropriately.
An unavailable Oracle database might be too severe for suitable fallbacks, in which case your system must indicate unavailability appropriately.
An unavailable PostgreSQL database might be too severe for suitable fallbacks, in which case your system must indicate unavailability appropriately.
Kubernetes features a rolling update strategy to deploy new releases without downtime. Under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up to date.
Kubernetes deployments
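For the rolling update experiment above, it can help to know how many ready pods Kubernetes guarantees during a rollout. The sketch below computes those bounds from the Deployment settings `maxUnavailable` and `maxSurge`, assuming both are given as percentages (the default is 25% each). Kubernetes rounds `maxUnavailable` down and `maxSurge` up; the function name is a hypothetical helper.

```python
def rollout_bounds(
    replicas: int, max_unavailable_pct: int = 25, max_surge_pct: int = 25
) -> tuple[int, int]:
    """Return (minimum ready pods, maximum total pods) during a rolling update."""
    # maxUnavailable percentages are rounded down to an absolute pod count.
    max_unavailable = (replicas * max_unavailable_pct) // 100
    # maxSurge percentages are rounded up (ceiling division on integers).
    max_surge = -((-replicas * max_surge_pct) // 100)
    return replicas - max_unavailable, replicas + max_surge
```

With 10 replicas and the defaults, at least 8 pods stay ready and at most 13 exist at once. Only pods that pass their readiness probe count as ready, which is why a misconfigured probe can break the zero-downtime guarantee under load.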
Quick startup times are favorable in cloud environments because they enable fast recovery and improve scaling.
Steadybit covers many out-of-the-box needs, but sometimes your organization may need proprietary or niche solutions. Leverage our recipes to gain flexibility and address those needs!