Kubernetes Event Logs
Collect event logs from a Kubernetes.
Kubernetes deployment survives Redis latency
Verify that your application handles increased latency in a Redis cache properly, tolerating longer processing times while maintaining throughput.
Motivation
Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it tolerates longer processing times without losing throughput. You can also identify bottlenecks or inefficiencies in your system and take appropriate measures to improve its performance and reliability.
Structure
We will verify that a load-balanced user-facing endpoint fully works while all pods are ready. We then introduce delays in Redis operations to simulate latency. While the latency is active, we expect the system to maintain its throughput and indicate unavailability appropriately. Performance should return to normal after the latency has ended.
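The pass/fail judgment described above can be sketched in Python. This is only an illustration of one way to evaluate HTTP check results, not how Steadybit evaluates them: the `Sample` type, the function names, and both thresholds (`min_success_rate`, `max_slowdown`) are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One HTTP check result (hypothetical shape, not a Steadybit type)."""
    ok: bool            # True if the response status was acceptable
    duration_ms: float  # end-to-end response time in milliseconds


def success_rate(samples: Sequence[Sample]) -> float:
    """Fraction of successful samples; 0.0 for an empty window."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.ok) / len(samples)


def survives_latency(
    baseline: Sequence[Sample],
    during: Sequence[Sample],
    min_success_rate: float = 0.95,  # assumed threshold
    max_slowdown: float = 3.0,       # assumed threshold
) -> bool:
    """Compare the window with injected latency against the baseline window.

    Passes when enough requests still succeed and the average response
    time has not grown by more than `max_slowdown` times the baseline.
    """
    if not baseline or not during:
        return False
    baseline_avg = mean(s.duration_ms for s in baseline)
    during_avg = mean(s.duration_ms for s in during)
    return (
        success_rate(during) >= min_success_rate
        and during_avg <= baseline_avg * max_slowdown
    )
```

Collect one window of samples before the latency is injected and one while it is active, then pass both to `survives_latency`. Suitable thresholds depend on your own service-level objectives.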
Kubernetes deployment survives Redis downtime
Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by an unavailable Redis instance or by an outage of the complete cluster.
Motivation
Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.
Structure
We will verify that a load-balanced user-facing endpoint fully works while all pods are ready. We then block traffic to the Redis instance to simulate downtime. During the outage, we expect the system to indicate unavailability appropriately, maintain its throughput, and continue delivering its intended functionality. Performance should return to normal once the Redis instance is available again.
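What "gracefully handling" a cache outage can look like on the application side is sketched below in Python. It is a minimal read-through pattern under stated assumptions: the `Cache` protocol and the `read_through` and `load_from_source` names are hypothetical, and the sketch catches Python's built-in `ConnectionError` and `TimeoutError`, whereas a real Redis client library raises its own exception types that you would catch instead.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Minimal cache interface assumed for this sketch."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


# Errors treated as "cache unavailable". Assumption: swap in the
# connection and timeout exceptions of the client library you use.
CACHE_ERRORS = (ConnectionError, TimeoutError)


def read_through(
    cache: Cache,
    key: str,
    load_from_source: Callable[[str], str],
) -> str:
    """Return the value for `key`, surviving an unreachable cache."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CACHE_ERRORS:
        # Cache is down: treat it as a miss and fall through.
        pass

    value = load_from_source(key)

    try:
        cache.set(key, value)
    except CACHE_ERRORS:
        # Failing to repopulate the cache must not fail the request.
        pass
    return value
```

With this shape, a Redis outage costs extra load on the backing source instead of failed requests, which is the behavior the experiment is meant to confirm. Whether the backing source can absorb that load is part of what the throughput check reveals.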
Certificate TLS/SSL expiry for Kubernetes deployment
Turn time forward and check whether your TLS/SSL certificates are still valid.
Motivation
Noticing a TLS/SSL certificate expiry too late is a problem you can easily avoid by frequently checking your expiry dates. While observability tools already handle this job nicely, you can't be sure that they are working in your environment. With this experiment, you can turn the time forward to check whether your HTTPS endpoint works at a given date in the future. Additionally, you can configure one of the observability integrations to validate your observability tool's alerting.
Structure
First, we validate that the given HTTPS endpoint is working today. Next, we shift the host's clock forward to validate that the HTTPS endpoint continues to work on a given date. If the TLS/SSL certificate has already expired by that date, the HTTP check will fail.
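As a complement to shifting the clock, the same question ("is the certificate still valid on date X?") can be answered by reading the certificate's expiry date directly. The Python sketch below uses only the standard library. The function names are made up for this example, and it is not what the experiment itself runs, since it checks the certificate only and not the behavior of the workloads on the node.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    The handshake itself fails if the certificate is already invalid today.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_valid_on(not_after: str, when: datetime) -> bool:
    """True if a certificate expiring at `not_after` is still valid at `when`.

    `when` must be timezone-aware; comparing a naive datetime raises TypeError.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return when <= expires
```

For example, `cert_valid_on(fetch_not_after("example.com"), datetime(2030, 1, 1, tzinfo=timezone.utc))` reports whether the endpoint's current certificate would still be accepted on that date; the host name and the date are placeholders.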
Warning
Please be aware that we will manipulate the time for a given Kubernetes node. Containers running on that host may struggle to handle the clock change correctly, and you may experience other side effects.
Network outage for Kubernetes nodes in an availability zone
Achieve high availability of your Kubernetes cluster via redundancy across different Availability Zones. Check what happens to your Kubernetes cluster when one of the zones is down.
Motivation
Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones are most interesting. While the former refers to separate geographic areas spread worldwide, the latter refers to an isolated location within a region. For most use cases, spreading deployments across availability zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications still work in case of a zone outage.
Structure
We leverage the block traffic attack to simulate a full network loss in an availability zone. While the zone outage happens, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the zone outage is over, we expect that all deployments will recover again within a specified time.
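The final expectation, recovery within a specified time, amounts to polling a readiness condition until a deadline. The Python sketch below shows that logic in isolation. The `recovered_within` name and the `all_ready` callback are hypothetical; how `all_ready` is implemented is left open (it could, for instance, compare each deployment's ready replica count with its desired replica count).

```python
import time
from typing import Callable


def recovered_within(
    all_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
) -> bool:
    """Poll `all_ready` until it returns True or `timeout_s` has elapsed.

    `all_ready` is a caller-supplied check (hypothetical in this sketch)
    that reports whether every deployment is fully ready again.
    """
    # Monotonic time is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout_s
    while True:
        if all_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Start the poll when the zone outage ends. The condition is always checked at least once, so a cluster that is already healthy passes even with a timeout of zero.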
Solution Sketch
- AWS Regions and Zones
- Azure Regions and Zones
- GCP Regions and Zones
- Kubernetes liveness, readiness, and startup probes