AWS
A Steadybit discovery and action implementation to inject faults into various AWS services.
Introduction
The AWS (Amazon Web Services) extension bundles various attacks, target discovery, and check capabilities for AWS managed services. For example, you can use the AWS extension to change the state of EC2 instances, trigger a reboot or failover for RDS instances, disrupt ECS tasks and services, or inject failures into Lambda functions.
The AWS extension is, in essence, an adapter for the AWS APIs.
To set up the extension and the required IAM permissions, consult steadybit/extension-aws/README.md.
Further Support for Managed Services
While the AWS extension integrates with managed services via the AWS APIs, we also offer deeper integration for the following services based on the underlying technology.
Elastic Kubernetes Service (EKS)
On AWS EKS, you get the same integration we offer for unmanaged Kubernetes clusters by installing the following extensions in your Kubernetes cluster: extension-kubernetes, extension-container, extension-host.
Elastic Compute Cloud (EC2)
When using Linux-based hosts in EC2, you can also benefit from extension-host's capabilities.
Load balancer covers an AWS EC2 restart
Amazon Elastic Compute Cloud (EC2) lets you acquire and release resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.
Motivation
Depending on your traffic demand, you can use the AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, do that on their own, while others, such as EC2, integrate with AWS Auto Scaling. Once configured, this means EC2 instances start and shut down frequently. Even when not using AWS Auto Scaling, your EC2 instances may need to be restarted occasionally for maintenance and updates. Thus, it is best practice to validate your application's behavior when an instance restarts.
Structure
First, we verify that a load-balanced, user-facing endpoint works fully while all EC2 instances are available. While an EC2 instance restarts, the HTTP endpoint should continue operating but may suffer from degraded performance (e.g., a lower success rate or higher response time). Once all EC2 instances are back, the success rate should recover to 100%.
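The expectation above can be written down as a simple pass/fail rule over recorded HTTP check results. The following is only a minimal sketch and not how Steadybit evaluates its HTTP check: the `Sample` type, the phase names, and the 80% tolerance during the restart are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One HTTP check result; `phase` is 'before', 'during', or 'after' (hypothetical labels)."""
    phase: str
    status: int


def success_rate(samples, phase):
    """Share of 2xx responses within one phase, as a float in [0, 1]."""
    in_phase = [s for s in samples if s.phase == phase]
    if not in_phase:
        raise ValueError(f"no samples recorded for phase {phase!r}")
    ok = sum(1 for s in in_phase if 200 <= s.status < 300)
    return ok / len(in_phase)


def experiment_passed(samples, min_rate_during=0.8):
    """Full success before and after the restart, tolerated degradation during it."""
    return (
        success_rate(samples, "before") == 1.0
        and success_rate(samples, "during") >= min_rate_during
        and success_rate(samples, "after") == 1.0
    )


# Made-up results: one failed request out of ten while the instance restarts.
samples = (
    [Sample("before", 200)] * 10
    + [Sample("during", 200)] * 9
    + [Sample("during", 503)]
    + [Sample("after", 200)] * 10
)
print(experiment_passed(samples))
```

Splitting the samples by phase keeps the strict 100% requirement for the steady state separate from the relaxed threshold during the restart.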
Solution Sketch
- AWS Well-Architected Framework
- Kubernetes liveness, readiness, and startup probes
AWS ECS Service Is Scaled up Within Reasonable Time
Verify that your ECS service is scaled up on increased CPU usage.
Motivation
Important ECS services should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.
Structure
First, we ensure that all tasks of the ECS service are ready to serve traffic. Afterward, we inject high CPU usage into the ECS tasks and expect that, within a reasonable amount of time, ECS increases the number of tasks and the new tasks become ready to handle incoming traffic.
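To make "within a reasonable amount of time" measurable, you can record the running task count over time and compute how long the scale-up took after the CPU attack started. This is a sketch under assumptions: the function names, the sample format of `(timestamp_seconds, running_task_count)` pairs, and the 120-second limit are hypothetical, and how the samples are collected is left open.

```python
def seconds_until_scaled(samples, baseline, started_at):
    """Seconds from `started_at` until the running task count first exceeds `baseline`.

    `samples` is an iterable of (timestamp_seconds, running_task_count) pairs.
    Returns None if the count never rose above the baseline after `started_at`.
    """
    for timestamp, running in sorted(samples):
        if timestamp >= started_at and running > baseline:
            return timestamp - started_at
    return None


def scaled_in_time(samples, baseline, started_at, max_seconds):
    """True if a scale-up was observed no later than `max_seconds` after the attack began."""
    elapsed = seconds_until_scaled(samples, baseline, started_at)
    return elapsed is not None and elapsed <= max_seconds


# Made-up observations: two tasks at first, CPU stress starts at t=30,
# a third task is running at t=90.
observed = [(0, 2), (30, 2), (60, 2), (90, 3), (120, 4)]
print(seconds_until_scaled(observed, baseline=2, started_at=30))
print(scaled_in_time(observed, baseline=2, started_at=30, max_seconds=120))
```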
Scaling up of ECS Service Within Given Time
Ensure that you can scale up your ECS service in a reasonable time.
Motivation
For an elastic and resilient cloud infrastructure, ensure you can scale up your ECS services within a reasonable time. Long startup times are undesirable but sometimes unnoticed and unexpected.
Structure
Validate that all ECS tasks of an ECS service are running. Once we scale the ECS service up, the newly scheduled task should be ready within a reasonable time.
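A check like this usually comes down to polling a condition until a deadline. The helper below is a minimal sketch of that pattern, not part of the extension. `wait_until` and its parameters are hypothetical names. In a real setup, the condition might compare the service's running task count with its desired count as reported by the ECS API, which is an assumption here. The example uses a fake clock and canned counts so it runs without AWS access.

```python
import time


def wait_until(condition, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll `condition()` until it returns True or `timeout_s` has elapsed.

    Returns the elapsed seconds on success, or None on timeout.
    `clock` and `sleep` are injectable so the helper can be tested without waiting.
    """
    start = clock()
    while True:
        if condition():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Stand-ins for illustration only: a fake clock and canned running task counts.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


running_counts = iter([2, 2, 3])  # hypothetical counts seen on successive polls
desired_count = 3

elapsed = wait_until(
    lambda: next(running_counts) >= desired_count,
    timeout_s=60,
    interval_s=5,
    clock=lambda: fake_now[0],
    sleep=fake_sleep,
)
print(elapsed)
```

Returning the elapsed time instead of a plain boolean lets you track how startup time develops across runs.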
Load balancer covers an AWS zone outage
AWS achieves high availability via redundancy across different Availability Zones. Ensure that failover works seamlessly by simulating a zone outage.
Motivation
AWS hosts your deployments and services across multiple locations worldwide. From a reliability standpoint, AWS Regions and Availability Zones are the most interesting. While the former are separate geographic areas spread worldwide, the latter are isolated locations within a Region. For most use cases, spreading deployments across multiple AWS Availability Zones is sufficient. Given that failures may happen at this level quite frequently, you should verify that your applications still work during a zone outage.
Structure
We leverage the AWS blackhole attack to simulate an AWS Availability Zone outage. Before the simulated outage, we ensure that a load-balanced, user-facing endpoint works appropriately. While the Availability Zone is unavailable, the HTTP endpoint must continue operating but may suffer from degraded performance (e.g., a lower success rate or higher response time). The performance should recover as soon as the zone is back.
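Before running the outage, it can help to check on paper that enough capacity remains when any single zone disappears. The sketch below does this for a plain mapping of instance IDs to zones. The function names, the instance IDs, the zone names, and the 50% capacity threshold are all hypothetical, and how you obtain the mapping is left open.

```python
from collections import Counter


def capacity_after_zone_outage(instance_zones, failed_zone):
    """Fraction of instances left when every instance in `failed_zone` is unreachable."""
    if not instance_zones:
        raise ValueError("no instances given")
    per_zone = Counter(instance_zones.values())
    remaining = len(instance_zones) - per_zone[failed_zone]
    return remaining / len(instance_zones)


def survives_any_single_zone_outage(instance_zones, min_capacity=0.5):
    """True if at least `min_capacity` of the instances remain for every possible zone outage."""
    zones = set(instance_zones.values())
    return len(zones) > 1 and all(
        capacity_after_zone_outage(instance_zones, zone) >= min_capacity
        for zone in zones
    )


# Hypothetical fleet: four instances spread over three zones.
fleet = {
    "i-a": "eu-central-1a",
    "i-b": "eu-central-1b",
    "i-c": "eu-central-1b",
    "i-d": "eu-central-1c",
}
print(capacity_after_zone_outage(fleet, "eu-central-1b"))
print(survives_any_single_zone_outage(fleet))
```

A fleet that fails this check would be expected to fail the experiment too, so it is a cheap precondition rather than a replacement for the simulated outage.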
Solution Sketch
- Regions and Zones
- Kubernetes liveness, readiness, and startup probes