AWS
A Steadybit discovery and action implementation to inject faults into various AWS services.
Introduction
The AWS (Amazon Web Services) extension bundles various attacks, target discovery, and check capabilities for AWS managed services. For example, you can use the AWS extension to change the state of EC2 instances, trigger a reboot or failover for RDS instances, disrupt ECS tasks and services, or inject failures into Lambda functions.
The AWS extension is, in essence, an adapter for the AWS APIs.
To set up the extension and the required IAM permissions, please consult the steadybit/extension-aws/README.md.
Further Support for Managed Services
While the AWS extension provides integrations to managed services via AWS APIs, we also offer deeper integration for the following services based on the underlying technology.
Elastic Kubernetes Service (EKS)
In AWS EKS, you can benefit from the same integration we offer for unmanaged Kubernetes clusters by installing the following extensions in your Kubernetes cluster: extension-kubernetes, extension-container, extension-host.
Elastic Compute Cloud (EC2)
When using Linux-based hosts in EC2, you can also benefit from extension-host's capabilities.
Load balancer covers an AWS EC2 restart
AWS Elastic Compute Cloud (EC2) can acquire and release resources depending on traffic demand. Check whether your application is elastic as well by rebooting an EC2 instance.
Motivation
Depending on your traffic demand, you can use the AWS cloud's ability to acquire and release resources automatically. Some services, such as S3 and SQS, do that automatically, while others, such as EC2, integrate with AWS Auto Scaling. Once configured, the result is a fluctuating set of EC2 instances that start and shut down frequently. Even when not using AWS Auto Scaling, your EC2 instances may need to be restarted occasionally for maintenance and updates. Thus, it is best practice to validate how your application behaves when an instance restarts.
Structure
We ensure that a load-balanced user-facing endpoint works fully while all EC2 instances are available. While an EC2 instance restarts, the HTTP endpoint continues operating but may suffer from degraded performance (e.g., lower success rate or higher response time). The performance should recover to a 100% success rate once all EC2 instances are back.
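The three phases described above can be expressed as a small evaluation over recorded HTTP status codes. The following is a minimal sketch, not the extension's implementation: the function names, the PhaseResult type, and the 90% threshold tolerated during the restart are assumptions chosen for illustration, and collecting the status codes is left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


def success_rate(status_codes: Sequence[int]) -> float:
    """Share of responses with a 2xx status code, as a value in [0.0, 1.0]."""
    if not status_codes:
        raise ValueError("no responses recorded")
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return ok / len(status_codes)


@dataclass
class PhaseResult:
    name: str
    rate: float
    passed: bool


def evaluate_phases(before: Sequence[int],
                    during: Sequence[int],
                    after: Sequence[int],
                    min_rate_during: float = 0.9) -> List[PhaseResult]:
    """Evaluate the endpoint before, during, and after the EC2 restart.

    Before and after the restart a 100% success rate is required.
    During the restart a degraded rate is tolerated down to
    min_rate_during (a hypothetical threshold, adjust to your needs).
    """
    checks = [
        ("before", before, 1.0),
        ("during", during, min_rate_during),
        ("after", after, 1.0),
    ]
    results = []
    for name, codes, required in checks:
        rate = success_rate(codes)
        results.append(PhaseResult(name, rate, rate >= required))
    return results
```

Each argument is the list of status codes observed in one phase. A failed "after" phase indicates that the endpoint did not recover once all instances were back.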
Solution Sketch
- AWS Well-Architected Framework
- Kubernetes liveness, readiness, and startup probes
New Relic detects an incident for CPU spikes in an ECS task
Validate your observability to detect a CPU spike in your AWS ECS cluster.
Motivation
When you have New Relic configured to detect CPU spikes in your AWS ECS cluster, you can easily validate your observability strategy with this experiment template.
Structure
First, we validate that New Relic has no ongoing incident. After that, we inject the CPU spike for an ECS service and expect that New Relic detects this as an incident within the given time frame of 3 minutes.
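The check-inject-wait sequence above amounts to a precondition followed by polling against a deadline. The sketch below illustrates that flow under stated assumptions: has_open_incident and inject_cpu_spike are hypothetical callables that you would back with your own New Relic query and fault injection, and the 5-second polling interval is an arbitrary choice. Only the 3-minute time frame comes from the template.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 180.0,
               interval_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def validate_detection(has_open_incident: Callable[[], bool],
                       inject_cpu_spike: Callable[[], None],
                       **wait_kwargs) -> bool:
    """Run the experiment flow with hypothetical, caller-supplied callables.

    1. Fail fast if an incident is already open (precondition).
    2. Inject the CPU spike.
    3. Wait for an incident to appear within the time frame.
    """
    if has_open_incident():
        raise RuntimeError("precondition failed: an incident is already open")
    inject_cpu_spike()
    return wait_until(has_open_incident, **wait_kwargs)
```

The clock and sleep parameters are injectable so the flow can be exercised with a fake clock instead of waiting for real time to pass.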
AWS ECS Service Is Scaled up Within Reasonable Time
Verify that your ECS service is scaled up on increased CPU usage.
Motivation
Important ECS services should be scaled up within a reasonable time for an elastic and resilient cloud infrastructure. Undetected high CPU spikes and long startup times are undesirable in these infrastructures.
Structure
First, we ensure that all of the ECS service's tasks are ready to serve traffic. Afterward, we inject high CPU usage into the ECS tasks and expect that, within a reasonable amount of time, ECS increases the number of tasks and the new tasks become ready to handle incoming traffic.
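The before and after conditions of this experiment can be written as two small predicates over service snapshots. This is a sketch under assumptions: the dictionaries are expected to carry the desiredCount, runningCount, and pendingCount fields that the ECS API reports for a service, the function names are made up for illustration, and a running task is treated as ready, which ignores load balancer health checks.

```python
def all_tasks_running(service: dict) -> bool:
    """True when every desired task is running and none is still pending."""
    return (service["runningCount"] == service["desiredCount"]
            and service["pendingCount"] == 0)


def scaled_up_and_ready(before: dict, after: dict) -> bool:
    """True when the service gained tasks and all of them are running."""
    return (after["desiredCount"] > before["desiredCount"]
            and all_tasks_running(after))
```

A snapshot taken before the CPU attack should satisfy all_tasks_running, and a snapshot taken after the expected scaling window should satisfy scaled_up_and_ready when compared with the first one. Fetching the snapshots, for example with an AWS SDK call that describes the service, is outside this sketch.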
Scaling up of ECS Service Within Given Time
Ensure that you can scale up your ECS service in a reasonable time.
Motivation
For an elastic and resilient cloud infrastructure, ensure you can scale up your ECS services within a reasonable time. Long startup times are undesirable but sometimes unnoticed and unexpected.
Structure
Validate that all ECS tasks of an ECS service are running. Once we scale the ECS service up, the newly scheduled task should be ready within a reasonable time.
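What counts as a "reasonable time" can be made explicit by measuring how long the service needs to reach the new task count. The sketch below assumes you have recorded (timestamp in seconds, running task count) samples starting at the moment of the scale-up. The sample format, the function names, and the time limit are illustrative assumptions, not part of the template.

```python
from typing import Iterable, Optional, Tuple


def time_to_ready(samples: Iterable[Tuple[float, int]],
                  desired_count: int) -> Optional[float]:
    """Seconds from the first sample to the first sample in which at
    least desired_count tasks are running, or None if never reached."""
    start = None
    for timestamp, running in samples:
        if start is None:
            start = timestamp
        if running >= desired_count:
            return timestamp - start
    return None


def scaled_within(samples: Iterable[Tuple[float, int]],
                  desired_count: int,
                  max_seconds: float) -> bool:
    """True when the desired task count was reached within max_seconds."""
    elapsed = time_to_ready(samples, desired_count)
    return elapsed is not None and elapsed <= max_seconds
```

For example, with samples recorded every 10 seconds, scaled_within(samples, desired_count=3, max_seconds=60) passes only if three tasks were running no later than 60 seconds after the first sample.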