AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating Zone outages.
Zones
When one of your containers has problems starting, it may result in a crash loop and, eventually, Kubernetes backing off to restart this container. Verify that Datadog notices a crash loop and will alert you to take action.
Datadog monitors
Kubernetes pods
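To judge how quickly a monitor has to fire, it can help to sketch the restart back-off schedule. The sketch below assumes the delays described in the Kubernetes documentation (10 seconds, doubling with each restart, capped at five minutes); the function name `crash_loop_delays` and its parameters are illustrative only, not part of any tool.

```python
def crash_loop_delays(restarts, base_seconds=10, cap_seconds=300):
    """Return the assumed wait (in seconds) before each of the first `restarts` restarts."""
    # Exponential back-off: 10s, 20s, 40s, ... capped at cap_seconds.
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(restarts)]


delays = crash_loop_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910
```

Under these assumptions, seven restarts already span about 15 minutes, so an alert window shorter than that is a reasonable starting point.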
When draining a node, Kubernetes should reschedule the pods on other nodes to achieve elasticity.
Kubernetes cluster
Kubernetes deployments
Kubernetes nodes
When one of your containers has problems starting, it may result in a crash loop and, eventually, Kubernetes backing off to restart this container. Verify that Dynatrace notices a crash loop and will alert you to take action.
Amazon EC2 (Elastic Compute Cloud) lets you acquire and release compute resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.
EC2-instances
When one of your containers has problems starting, it may result in a crash loop and, eventually, Kubernetes backing off to restart this container. Verify that Instana notices a crash loop and will alert you to take action.
Instana application perspectives
Ensure that your pods become ready again when your containers exceed ephemeral storage.
Containers
If configured properly, Kubernetes is able to detect a non-responding pod and tries to fix it by simply restarting the unresponsive pod. Even so, the exact configuration requires careful consideration to avoid killing your pods too early or flooding your cluster’s traffic with liveness probes.
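The trade-off between restarting too early and probing too often can be estimated with simple arithmetic. The following is a rough sketch, not a statement of exact kubelet behavior: the field names mirror the Kubernetes probe settings `periodSeconds`, `timeoutSeconds`, and `failureThreshold` with their documented defaults, while the class and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivenessProbe:
    # Defaults follow the documented Kubernetes probe defaults.
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3


def approx_detection_seconds(probe):
    """Rough upper bound on the time from a pod hanging to the restart decision."""
    # The pod may hang right after a successful probe, so each of the
    # failure_threshold failing probes costs up to one full period,
    # and the last one still has to time out.
    return probe.period_seconds * probe.failure_threshold + probe.timeout_seconds


def probes_per_second(probe, pod_count):
    """Approximate probe request rate across pod_count pods."""
    return pod_count / probe.period_seconds


default_probe = LivenessProbe()
print(approx_detection_seconds(default_probe))                 # 31
print(probes_per_second(LivenessProbe(period_seconds=5), 200)) # 40.0
```

Shortening the period speeds up detection but raises the probe rate proportionally, which is the tension the experiment is meant to expose.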
When one of your containers has problems starting, it may result in a crash loop and, eventually, Kubernetes backing off to restart this container. Verify that New Relic notices a crash loop and will alert you to take action.
New Relic Accounts
When a deployment has no pods ready, New Relic Workloads should notice this and mark the workload as disrupted.
New Relic Workloads
Kubernetes features a rolling update strategy to deploy new releases without downtime. Under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up to date.
When you need to scale your Kubernetes deployment, ensure that users don't notice any hiccups and that it seamlessly integrates into your load balancing.
A resilient Kubernetes cluster is able to cope with a changing number of hosts and avoid user-facing reliability issues.
Hosts
An unavailable Kafka cluster should not be user-visible: your application degrades gracefully, and downstream services retry as soon as Kafka is available again.
Postman Collections
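One common way to get retries once a broker is reachable again is exponential back-off around the send call. This is a minimal sketch under stated assumptions: `send` stands in for whatever producer call your application uses, it is assumed to raise `ConnectionError` while the broker is unavailable, and `send_with_retry` is a hypothetical helper rather than part of any Kafka client.

```python
import time


def send_with_retry(send, message, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(message), retrying with exponential back-off on ConnectionError."""
    for attempt in range(attempts):
        try:
            return send(message)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Give up after the final attempt so callers can degrade.
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in sender that fails twice before succeeding.
calls = {"count": 0}

def flaky_send(message):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("broker unavailable")
    return "sent:" + message

waits = []
print(send_with_retry(flaky_send, "order-1", sleep=waits.append))  # sent:order-1
print(waits)                                                        # [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the example testable without real waiting; in production code the default `time.sleep` applies.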
An unavailable Kafka broker, or even an entire cluster, should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage.
Verify that your application handles increased latency in Kafka message delivery properly: processing time per message may increase, but throughput should be maintained.
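Whether throughput can survive higher latency mostly depends on how many messages your application can keep in flight at once. A back-of-the-envelope estimate follows from Little's law (in-flight work = throughput × latency); the function name and the numbers below are illustrative assumptions, not measurements.

```python
def required_in_flight(throughput_per_second, latency_ms):
    """Messages that must be in flight concurrently to sustain the throughput."""
    # Little's law with integer arithmetic; -(-a // b) is ceiling division.
    return -(-throughput_per_second * latency_ms // 1000)


print(required_in_flight(200, 50))   # 10
print(required_in_flight(200, 250))  # 50
```

In this example, adding 200 ms of latency raises the required concurrency from 10 to 50 in-flight messages, so a consumer limited to 10 would drop to roughly a fifth of its throughput.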
An unavailable database should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage. You can address a potential impact on your system by implementing, for example, a failover or caching mechanism.
Your application should continue to function properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.
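A simple way to indicate unavailability instead of hanging is to bound each database call with a timeout. The sketch below uses only the Python standard library; `query_with_timeout` and `slow_query` are hypothetical names, and the sleep merely simulates a slow PostgreSQL round trip. Note that this wrapper does not cancel the underlying work, so in a real application a driver-level or statement-level timeout would also be needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def query_with_timeout(run_query, timeout_seconds):
    """Run run_query() and report ("unavailable", None) if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_query)
    try:
        return ("ok", future.result(timeout=timeout_seconds))
    except FutureTimeout:
        return ("unavailable", None)
    finally:
        # Do not block the caller; the worker thread finishes in the background.
        pool.shutdown(wait=False)


def slow_query():
    time.sleep(0.3)  # Simulated latency, standing in for a slow database call.
    return "rows"


print(query_with_timeout(lambda: "rows", 1.0))  # ('ok', 'rows')
print(query_with_timeout(slow_query, 0.05))     # ('unavailable', None)
```

Running the latency experiment against code like this shows which timeout values are too tight (requests dropped unnecessarily) or too loose (callers blocked for too long).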
An unavailable RabbitMQ cluster should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage.
Verify that your application handles increased latency in RabbitMQ message delivery properly: processing time per message may increase, but throughput should be maintained.
Check that your application handles a Redis cache downtime gracefully and continues to deliver its intended functionality. The downtime may affect a single Redis instance or the complete cluster.
Verify that your application handles increased latency in a Redis cache properly: processing time per request may increase, but throughput should be maintained.
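Handling cache downtime gracefully often means treating a cache error like a cache miss and reading from the source of truth instead. This is a minimal sketch of that idea: `cache_get` and `load_from_source` are placeholders for your own cache and database access, and the cache is assumed to raise `ConnectionError` when Redis is unreachable.

```python
def get_with_cache_fallback(key, cache_get, load_from_source):
    """Return the cached value, falling back to the source if the cache fails or misses."""
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # Cache is down: treat it as a miss rather than failing the request.
    return load_from_source(key)


def broken_cache(key):
    raise ConnectionError("cache unreachable")


source = {"user:1": "Ada"}
print(get_with_cache_fallback("user:1", broken_cache, source.get))           # Ada
print(get_with_cache_fallback("user:1", {"user:1": "Ada (cached)"}.get, source.get))  # Ada (cached)
```

The fallback keeps the feature working but shifts load to the source, so it is worth watching database load while the experiment runs.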
An unavailable Microsoft SQL Server database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
An unavailable Oracle database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
An unavailable PostgreSQL database might be too severe for suitable fallbacks and requires your system to indicate unavailability appropriately.
Quick startup times are favorable in cloud environments because they enable fast recovery and improve scaling.
Steadybit covers many out-of-the-box needs, but sometimes your organization may need proprietary or niche solutions. Leverage our recipes to gain flexibility and address those needs!