Incident Check

The incident check step can be dragged and dropped into the experiment editor. The action requires one or more New Relic accounts to be selected as targets. Once configured, it collects information about the state of the New Relic incidents and, optionally, verifies that they match the expected condition.

Experiments can be aborted and marked as failed when the incident check's actual state diverges from the expected state. This helps to implement pre-/postconditions and invariants, for example, to start an experiment only when the system is healthy.

Use Cases

Pre-/postcondition or invariant for any experiment.
Verify that incidents are triggered during experiments.

Parameters

Parameter	Description	Default
Duration	How long should Steadybit check for incidents?	30s
Incident Priority Filter	Which incident priorities should be reported?	"LOW", "MEDIUM", "HIGH", "CRITICAL"
Entity Tag Filter	Filter incidents based on the tags of their related entities
Condition	If you pick a condition, the experiment will fail if the condition is not met.	No check, only show incidents
Condition Check Mode	How often the condition must be met during the check: "At least once" or "All the time"	"All the time"
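
The difference between the two Condition Check Mode options can be illustrated with a minimal Python sketch. This is not Steadybit's implementation: the function name `condition_holds`, the mode strings, and the assumption that each poll during the check yields one boolean result are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def condition_holds(observations: Iterable[bool], mode: str) -> bool:
    """Evaluate per-poll results of a condition over the check duration.

    observations: one boolean per poll, True if the condition was met
                  at that poll (hypothetical representation).
    mode:         "all_the_time" or "at_least_once" (hypothetical names
                  mirroring the two Condition Check Mode options).
    """
    results = list(observations)
    if mode == "all_the_time":
        # Every single poll must satisfy the condition.
        return all(results)
    if mode == "at_least_once":
        # One satisfying poll within the duration is enough.
        return any(results)
    raise ValueError(f"unknown mode: {mode}")


# Example: the condition is not met in the first two polls, then met.
polls = [False, False, True, True]
print(condition_holds(polls, "all_the_time"))   # False
print(condition_holds(polls, "at_least_once"))  # True
```

Under these assumptions, "At least once" suits checks that wait for an incident to appear at some point during an attack, while "All the time" suits invariants that must hold for the whole duration.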

Useful Templates

New Relic detects an incident for CPU spikes in an ECS task

Validate your observability to detect a CPU spike in your AWS ECS cluster

Motivation

When you have New Relic configured to detect CPU spikes in your AWS ECS cluster, you can easily validate your observability strategy with this experiment template.

Structure

First, we validate that New Relic has no ongoing incident. After that, we inject the CPU spike into an ECS service and expect New Relic to detect it as an incident within the given time frame of 3 minutes.
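
The three steps of this template (precondition, attack, detection within a time frame) can be sketched in Python as follows. This is only an illustration of the control flow, not how Steadybit executes the template: `has_incident` and `inject_cpu_spike` are hypothetical callbacks standing in for the incident check and the CPU attack, and the 5-second poll interval is an assumption. The clock and sleep functions are injectable so the sketch can run without waiting.

```python
import time
from typing import Callable


def wait_for_incident(
    has_incident: Callable[[], bool],
    timeout_s: float = 180.0,       # the template's 3-minute time frame
    poll_interval_s: float = 5.0,   # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until an incident is reported or the time frame elapses."""
    deadline = clock() + timeout_s
    while True:
        if has_incident():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def run_template(has_incident, inject_cpu_spike, **wait_kwargs) -> str:
    # Step 1: precondition, no ongoing incident before the attack starts.
    if has_incident():
        return "aborted: incident already open"
    # Step 2: inject the CPU spike (hypothetical callback).
    inject_cpu_spike()
    # Step 3: expect an incident within the time frame.
    if wait_for_incident(has_incident, **wait_kwargs):
        return "passed"
    return "failed: no incident detected in time"


class FakeClock:
    """Simulated time so the example finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulate an incident that opens 60 seconds after the spike starts.
fake = FakeClock()
state = {"spike_at": None}


def inject() -> None:
    state["spike_at"] = fake.now


def incident_open() -> bool:
    return state["spike_at"] is not None and fake.now - state["spike_at"] >= 60


print(run_template(incident_open, inject, clock=fake.time, sleep=fake.sleep))  # passed
```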

New Relic

AWS ECS

CPU

ECS Tasks

New Relic Accounts

New Relic should detect a crash looping as problem

Verify that New Relic alerts you that pods are not ready to accept traffic for some time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If the pod doesn't become ready, Kubernetes tries to resolve this by restarting the underlying container in the hope that it eventually becomes ready. If this doesn't work, Kubernetes eventually backs off from restarting the container, and the Kubernetes resource remains non-functional.

Structure

First, check that New Relic has no critical events for the related entities. As soon as one of the containers is crash looping, caused by the Steadybit crash loop attack, New Relic should detect this and raise an incident so that your on-call team can take action.