Block Traffic
Network outage for Kubernetes nodes in an availability zone
Achieve high availability of your Kubernetes cluster through redundancy across different Availability Zones, and check what happens when one of the zones goes down.
Motivation
Cloud providers host your deployments and services across multiple locations worldwide. From a reliability standpoint, regions and availability zones matter most: a region is a separate geographic area, while an availability zone is an isolated location within a region. For most use cases, spreading deployments across availability zones is sufficient. Since failures at this level happen relatively frequently, you should verify that your applications keep working during an outage.
Structure
We leverage the block traffic attack to simulate a full network loss in an availability zone. During the simulated outage, we observe changes in the Kubernetes cluster with Steadybit's built-in visibility. Once the outage is over, we expect all deployments to recover within a specified time.
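The recovery expectation can also be checked with a script. Below is a minimal sketch using the official Kubernetes Python client; the namespace name and timeout are placeholder assumptions, not part of the experiment above, and the actual zone outage is injected by the block traffic attack, not by this code.

```python
import time
from kubernetes import client, config

def wait_for_deployments_ready(namespace: str, timeout_s: int = 300) -> None:
    """Poll until every deployment in the namespace reports all replicas ready."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()
    deadline = time.time() + timeout_s
    not_ready = []
    while time.time() < deadline:
        deployments = apps.list_namespaced_deployment(namespace).items
        not_ready = [
            d.metadata.name
            for d in deployments
            if (d.status.ready_replicas or 0) < (d.spec.replicas or 0)
        ]
        if not not_ready:
            print(f"All deployments in '{namespace}' recovered.")
            return
        print(f"Still waiting for: {', '.join(not_ready)}")
        time.sleep(10)
    raise TimeoutError(f"Deployments did not recover within {timeout_s}s: {not_ready}")

if __name__ == "__main__":
    # 'shop' is a hypothetical namespace; adjust to the workloads under test.
    wait_for_deployments_ready("shop", timeout_s=300)
```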
Solution Sketch
- AWS Regions and Zones
- Azure Regions and Zones
- GCP Regions and Zones
- Kubernetes liveness, readiness, and startup probes
Linux host losing network connection is detected by Datadog
When a host suddenly loses its network connection and drops out of your system, Datadog should alert on it. Once the network is back, everything should recover again.
Motivation
When you're working in a less volatile system environment, losing the network can be critical, as there is likely no backup host to enable a faster recovery. You should therefore verify that your observability tools catch this.
Structure
Before blocking a host from the network, we verify that the Datadog monitor is in an OK state. Afterward, we block all traffic to and from the host and expect Datadog to alert on the isolated host. Once the host is back online, we expect the monitor to return to an OK state. While the experiment is running, we create a downtime for the monitor so that the ongoing alert does not escalate.
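As a rough illustration, the monitor pre-check, the downtime, and the post-check could be scripted with the datadogpy client. The monitor ID, host scope, and environment variable names below are assumptions for illustration only, and the block traffic attack itself is executed by Steadybit, not by this sketch.

```python
import os
import time
from datadog import initialize, api

initialize(
    api_key=os.environ["DD_API_KEY"],  # placeholder: supply your own Datadog keys
    app_key=os.environ["DD_APP_KEY"],
)

MONITOR_ID = 123456                 # hypothetical monitor watching host connectivity
HOST_SCOPE = "host:linux-host-1"    # hypothetical host under test

def monitor_state() -> str:
    """Return the monitor's overall state, e.g. 'OK' or 'Alert'."""
    return api.Monitor.get(MONITOR_ID)["overall_state"]

# 1. Pre-condition: the monitor must be OK before we block any traffic.
assert monitor_state() == "OK", "Monitor is not OK - aborting the experiment"

# 2. Create a downtime so the ongoing alert does not escalate during the attack.
now = int(time.time())
downtime = api.Downtime.create(
    scope=HOST_SCOPE,
    monitor_id=MONITOR_ID,
    start=now,
    end=now + 15 * 60,  # long enough to cover the block traffic attack
    message="Steadybit block traffic experiment in progress",
)

# ... run the block traffic attack here and wait for the host to come back ...

# 3. Post-condition: once traffic flows again, the monitor should return to OK.
for _ in range(30):
    if monitor_state() == "OK":
        print("Monitor recovered.")
        break
    time.sleep(10)
else:
    raise RuntimeError("Monitor did not return to OK after the attack")

# Clean up the downtime once the experiment is over.
api.Downtime.delete(downtime["id"])
```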