Extendable Platform. Endless Possibilities.
All open-source.
Contribute Templates

Make it easy to get started with Chaos Engineering by authoring your own template!

AppDynamics alerts when a Kubernetes pod is in crash loop

Verify that an AppDynamics health violation alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to solve this by restarting the underlying container, hoping it eventually reaches readiness. If that keeps failing, Kubernetes backs off between restarts, and the Kubernetes resource remains non-functional.

Structure

First, check that the AppDynamics health rule responsible for tracking non-ready containers is not in a violating state. As soon as one of the containers starts crash looping, caused by the crash loop attack, the resulting AppDynamics health violation should notify and escalate to your on-call team.
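Besides watching the health rule in AppDynamics, you can verify the crash-loop state directly against the Kubernetes API. The sketch below is a minimal example using the official kubernetes Python client; the namespace is a placeholder and authentication is assumed to come from your local kubeconfig.

```python
# Minimal sketch: list pods whose containers are in CrashLoopBackOff.
# Assumes the "kubernetes" Python client and a kubeconfig-based connection.
from kubernetes import client, config

def pods_in_crash_loop(namespace="default"):
    """Return names of pods with at least one container in CrashLoopBackOff."""
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    crashing = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                crashing.append(pod.metadata.name)
    return crashing

if __name__ == "__main__":
    print(pods_in_crash_loop("default"))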

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
  • AppDynamics
Tags: Crash loop, Harden Observability, Restart, Kubernetes
Faultless redundancy during rolling update

Kubernetes features a rolling update strategy to deploy new releases without downtime. Under load, this only works reliably when your load balancer and the Kubernetes readiness probe are configured properly and DNS caches are up to date.

Motivation

The Kubernetes rolling update strategy ensures that a minimum number of pods remain available while a new release is deployed. This implies that a new pod with the new release is started and needs to be ready before an old pod is evicted. Even so, this process may result in degraded performance and user-facing errors, e.g., Kubernetes sending requests to pods that report ready but cannot respond properly, or evicted pods still being retained in the load balancer.

Structure

Before performing the rolling update, all desired pods of the deployment need to be in the "ready" state, and a load-balanced, user-facing HTTP endpoint is expected to respond successfully while under load. As soon as the rolling update takes place, the HTTP endpoint under load may suffer degraded performance (e.g., a lower success rate or higher response time); even so, this should stay within the boundaries of your SLA. After the rolling update, the number of desired pods matches the actual pods of the deployment, and the performance of the user-facing HTTP endpoint is similar to before the update.
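One way to check the HTTP endpoint against such SLA boundaries during the rolling update is a small load-and-measure script. The sketch below is a hypothetical example using the requests library; the endpoint URL, duration, and thresholds are assumptions you would replace with your own SLA values.

```python
# Minimal sketch: drive load against an endpoint during the rolling update
# and assert that error rate and p95 latency stay within example SLA limits.
import time
import requests

ENDPOINT = "https://shop.example.com/products"   # hypothetical endpoint
DURATION_S = 120          # roughly the duration of the rolling update
MAX_ERROR_RATE = 0.02     # example SLA: at most 2% failed requests
MAX_P95_LATENCY_S = 0.5   # example SLA: 95th percentile under 500 ms

def measure():
    latencies, errors, total = [], 0, 0
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        total += 1
        start = time.time()
        try:
            resp = requests.get(ENDPOINT, timeout=2)
            latencies.append(time.time() - start)
            if resp.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)] if latencies else float("inf")
    error_rate = errors / total
    assert error_rate <= MAX_ERROR_RATE, f"error rate {error_rate:.2%} breaks SLA"
    assert p95 <= MAX_P95_LATENCY_S, f"p95 latency {p95:.3f}s breaks SLA"

if __name__ == "__main__":
    measure()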

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes
  • Kubernetes deployment strategy
Tags: Rolling Update, Restart, Kubernetes
Graceful degradation and Datadog alerts when Postgres database cannot be reached

An unavailable database should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog alerts us to the outage. You can address a potential impact on your system by implementing, e.g., a failover or caching mechanism, as sketched below.
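As an illustration of such a caching fallback, the sketch below shows a hypothetical Flask handler using psycopg2 that serves the last known good value when the database is unreachable and otherwise answers with "Service unavailable"; the hostname, credentials, table, and route are placeholders.

```python
# Minimal sketch of a read-through cache fallback; assumes the "flask" and
# "psycopg2" packages. All connection details and the schema are hypothetical.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)
_cache = {}  # last known good rows, keyed by product id

def load_product(product_id):
    conn = psycopg2.connect(host="postgres.internal.example.com", dbname="shop",
                            user="app", password="secret", connect_timeout=2)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT name, price FROM products WHERE id = %s", (product_id,))
        return cur.fetchone()

@app.get("/products/<int:product_id>")
def get_product(product_id):
    try:
        row = load_product(product_id)
        if row is None:
            return jsonify(error="Not found"), 404
        _cache[product_id] = row              # refresh the cache on success
    except psycopg2.OperationalError:         # database unreachable
        row = _cache.get(product_id)          # fall back to the last known value
        if row is None:
            return jsonify(error="Service unavailable"), 503
    name, price = row
    return jsonify(name=name, price=price)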

Motivation

Database outages can occur for various reasons, including hardware failures, software bugs, network connectivity issues, or even intentional attacks. Such outages can severely affect your application, leading to lost revenue, dissatisfied customers, and reputational damage. By testing your application's resilience to a database outage, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced, user-facing endpoint is fully functional. We will then simulate an unavailable PostgreSQL database by blocking the PostgreSQL client connection for a given hostname. During the outage, we will monitor the system and ensure that the user-facing endpoint indicates unavailability by responding with a "Service unavailable" status, and we will verify that at least one monitor in Datadog is alerting us to the database outage. Once the database becomes available again, we will verify that the endpoint recovers automatically and resumes normal operation. Finally, we will analyze the monitoring data to identify any weaknesses in the system's resilience to database outages and take appropriate measures to minimize their impact.
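Both success criteria can be checked from a small script. The sketch below is a rough example using the requests library and the datadog Python package; the endpoint URL, monitor name filter, and API credentials are assumptions.

```python
# Minimal sketch: verify graceful degradation (HTTP 503) and that at least one
# matching Datadog monitor is alerting. Endpoint and monitor name are hypothetical.
import os
import requests
from datadog import initialize, api

ENDPOINT = "https://shop.example.com/products"   # hypothetical endpoint

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

def endpoint_degrades_gracefully():
    """During the outage the endpoint should answer 503, not hang or crash."""
    resp = requests.get(ENDPOINT, timeout=5)
    return resp.status_code == 503

def datadog_is_alerting(name_filter="postgres"):
    """At least one monitor matching the name filter should be in Alert state."""
    monitors = api.Monitor.get_all(name=name_filter)
    return any(m.get("overall_state") == "Alert" for m in monitors)

if __name__ == "__main__":
    print("graceful degradation:", endpoint_degrades_gracefully())
    print("datadog alerting:", datadog_is_alerting())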

Tags: RDS, Postgres, Recoverability, Datadog, Database
Graceful degradation and Datadog alerts when Postgres suffers latency

Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests whose timeouts need optimization to prevent dropped requests.

Motivation

Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we will monitor the system's behavior to ensure the service remains operational and can deliver its purpose. We will also analyze the performance metrics to identify any request types most affected by the latency and optimize them accordingly. Finally, we will end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can gain valuable insights into our application's resilience to database latencies and make informed decisions to optimize its performance under stress.
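Timeout optimization usually means failing fast instead of letting requests pile up behind a slow database. The sketch below is a minimal psycopg2 example; the connection details, table, and timeout values are placeholders, not recommendations.

```python
# Minimal sketch: bound both connection setup and query execution time so that
# injected database latency surfaces as fast, handleable errors.
import psycopg2

conn = psycopg2.connect(
    host="postgres.internal.example.com",   # hypothetical hostname
    dbname="shop",
    user="app",
    password="secret",
    connect_timeout=3,                       # fail fast if the TCP handshake stalls
    options="-c statement_timeout=2000",     # abort queries running longer than 2 s
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM orders")  # hypothetical table
    print(cur.fetchone())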

Tags: RDS, Postgres, Recoverability, Datadog, Database
Graceful degradation of Kubernetes deployment while Kafka is unavailable

An unavailable Kafka broker or even an entire cluster should be handled gracefully and indicated appropriately by your application. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage.

Motivation

Kafka unavailability can occur for various reasons, such as hardware failure, network connectivity issues, or even intentional attacks. Such unavailability can severely affect your application, causing lost messages, data inconsistencies, and degraded performance. By testing the resilience of your system to Kafka unavailability, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then simulate an unavailable Kafka cluster by shutting down one or more Kafka brokers or the entire cluster. During the outage, we will monitor the system to ensure it continues delivering its intended functionality and maintains its throughput, and we will verify that it handles the failure of a Kafka broker or a complete cluster outage without losing messages or introducing data inconsistencies. Once the Kafka cluster becomes available again, we will verify that the system recovers automatically and resumes normal operation. Finally, we will analyze the monitoring data to identify any weaknesses in the system's resilience to Kafka unavailability and take appropriate measures to minimize their impact.
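On the producer side, avoiding message loss during a broker outage typically comes down to acknowledgment and retry settings plus handling delivery failures explicitly. The sketch below is a hypothetical example using the kafka-python package; broker addresses, topic name, and retry values are assumptions.

```python
# Minimal sketch: producer settings aimed at not losing messages during a
# broker outage, with delivery failures surfaced to the caller.
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # hypothetical brokers
    acks="all",              # wait for all in-sync replicas before confirming
    retries=10,              # retry transient broker failures instead of dropping
    retry_backoff_ms=500,
)

def send_or_buffer(topic, payload):
    """Send a message; surface delivery errors so the caller can buffer and retry."""
    future = producer.send(topic, payload)
    try:
        metadata = future.get(timeout=30)
        return metadata.offset
    except KafkaError as exc:
        # e.g. append to a local retry queue or dead-letter store
        print(f"delivery failed, buffering for retry: {exc}")
        return None

send_or_buffer("orders", b'{"order_id": 42}')  # hypothetical topic and payload
producer.flush()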

Tags: Kafka, Recoverability, Datadog
Graceful degradation of Kubernetes deployment while Kafka suffers a high latency

Verify that your application properly handles increased latency in Kafka message delivery, allowing for longer processing time while maintaining throughput.

Motivation

Latency in Kafka can occur for various reasons, such as network congestion, increased load, or insufficient resources. Such latency can impact your application's performance, causing delays in processing messages and affecting overall throughput. By testing your system's resilience to Kafka latency, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance.

Structure

To conduct this experiment, we will ensure that all Kafka topics and producers are ready and that the consumer receives and processes messages correctly. We will then induce latency on Kafka by introducing a delay on all incoming and outgoing messages. During the experiment, we will monitor the system to ensure it continues delivering its intended functionality and maintains its throughput despite the increased processing time, and we will analyze the monitoring data to identify potential bottlenecks or inefficiencies and address them. Once the experiment is complete, we will remove the latency and monitor the system's recovery time to ensure it returns to its normal state promptly. This gives us the insight needed to improve the application's performance and reliability under Kafka latency.
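To tolerate the injected delay, consumer timeouts need enough headroom, and throughput should be measured while the experiment runs. The sketch below is a rough example using the kafka-python package; broker address, topic, group id, and timeout values are assumptions.

```python
# Minimal sketch: consume with generous timeouts and report throughput and
# producer-to-consumer delay in 10-second windows during the latency attack.
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                                # hypothetical topic
    bootstrap_servers=["kafka-1:9092"],      # hypothetical broker
    group_id="latency-experiment",
    request_timeout_ms=60000,    # allow for the injected network delay
    session_timeout_ms=30000,    # keep group membership despite slow heartbeats
)

window_start, processed = time.time(), 0
for message in consumer:
    delay = time.time() - message.timestamp / 1000.0  # producer-to-consumer delay
    processed += 1
    if time.time() - window_start >= 10:
        print(f"throughput: {processed / 10:.1f} msg/s, last delay: {delay:.2f}s")
        window_start, processed = time.time(), 0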

Tags: Kafka, Recoverability, Datadog