New Relic Account
New Relic detects an incident for CPU spikes in an ECS task
Validate your observability to detect a CPU spike in your AWS ECS cluster
Motivation
When you have New Relic configured to detect CPU spikes in your AWS ECS cluster, you can easily validate your observability strategy with this experiment template.
Structure
First, we validate that New Relic has no ongoing incident. After that, we inject a CPU spike into an ECS service and expect New Relic to detect it as an incident within the given time frame of 3 minutes.
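The three steps above (clean baseline, attack, detection within a time window) can be sketched as a small polling loop. This is only an illustration under stated assumptions: `fetch_open_incidents` and `inject_cpu_spike` are hypothetical callables standing in for the New Relic check and the Steadybit attack, and the 5-second polling interval is an arbitrary choice. Only the 3-minute window comes from the template.

```python
import time
from typing import Callable, Sequence


def wait_for_incident(
    fetch_open_incidents: Callable[[], Sequence[str]],
    timeout_s: float = 180.0,   # the 3-minute window from the template
    interval_s: float = 5.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as an open incident is reported, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if fetch_open_incidents():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_experiment(
    fetch_open_incidents: Callable[[], Sequence[str]],
    inject_cpu_spike: Callable[[], None],
    **wait_kwargs,
) -> str:
    # Step 1: the baseline must be clean, otherwise a later incident
    # cannot be attributed to the attack.
    if fetch_open_incidents():
        return "aborted: incident already open"
    # Step 2: start the attack (hypothetical callable).
    inject_cpu_spike()
    # Step 3: expect an incident within the time window.
    detected = wait_for_incident(fetch_open_incidents, **wait_kwargs)
    return "passed" if detected else "failed: no incident within time frame"
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting for real time to elapse.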
New Relic should detect a crash-looping container as a problem
Verify that New Relic alerts you that pods are not ready to accept traffic for some time.
Motivation
Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps failing, Kubernetes restarts it in the hope that it eventually becomes ready. If that does not help, Kubernetes backs off and waits increasingly longer between restarts (CrashLoopBackOff), and the Kubernetes resource remains non-functional.
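To get a feel for the back-off mentioned above, the following sketch computes the waiting time before each restart. It assumes the kubelet behavior described in the Kubernetes documentation (delay starts at 10 seconds, doubles with each restart, and is capped at 5 minutes); the function name and parameters are illustrative, and your cluster may be configured differently.

```python
from typing import List


def restart_backoff_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> List[int]:
    """Delay in seconds before each of the first `restarts` restarts.

    Assumes a doubling back-off starting at `base_s` and capped at `cap_s`.
    """
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


print(restart_backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Under these assumptions the cap is reached after only a few restarts, so alerting needs to fire while the container is still cycling slowly rather than rely on a high restart rate.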
Structure
First, check that New Relic has no critical events for related entities. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, New Relic should raise an incident so that your on-call team can take action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
New Relic should detect a disrupted workflow when a workload is unavailable
Verify that New Relic alerts you to disruptions in your workflow, such as a critical deployment without pods ready to serve traffic.
Motivation
Kubernetes features a liveness probe to determine whether your pod is healthy and can accept traffic. If the probe keeps failing, Kubernetes restarts the container in the hope that it will eventually be ready. If this affects a critical deployment, the New Relic Workflow should alert on the disruption.
Structure
First, check that the New Relic Workflow is marked as operational. As soon as all pods of a workload are unreachable, caused by the block traffic attack, New Relic should detect this by marking the workflow as disrupted, so that your on-call team can take action.
Solution Sketch
- Kubernetes liveness, readiness, and startup probes
- New Relic Workflow