Monitor Status

Check

Collects information about the monitor status and optionally verifies that the monitor has an expected status.

Targets:

Datadog monitors

Install now

Monitor Status

Collects information about the monitor status and optionally verifies that the monitor has an expected status.

Check

Targets:

Datadog monitors

Install now

Monitor Status

Check

Collects information about the monitor status and optionally verifies that the monitor has an expected status.

Targets:

Datadog monitors

Install now

Monitor Status

Collects information about the monitor status and optionally verifies that the monitor has an expected status.

Check

Targets:

Datadog monitors

Install now

Go back to list

YouTube content is not loaded by default for privacy reasons.

Introduction

The monitor status check step can be dragged&dropped into the experiment editor. Once done, you can use it to collect information about the state of the Datadog monitors and, optionally, to verify that they are within the expected status.

Experiments can be aborted and marked as failed when the Datadog monitor status check's actual state diverges from the expected state. This helps implement pre-/post-conditions and invariants. For example, to only start an experiment when the system is healthy.

At last, to help you understand the monitors' status and how they evolved, the run view also contains a status visualization. Through this visualization, you can see what states the Datadog monitors had throughout the experiment execution.

Use Cases

Pre-/postcondition or invariant for any experiment.
Verify that alerts are triggered during incidents.

Parameters

Parameter	Description	Default
Duration	How long should the status of the monitor be checked	30s
Expected Status	The expected state of the monitor. One of Ok, Alert, Warn, No Data, Unknown, Skipped, Ignored.
Status Check Mode	How often should the status be expected. "At least once" or "All the time"	"All the time"

Useful Templates (4 of 14)

See all

Kubernetes deployment survives Redis latency

Verify that your application handles an increased latency in a Redis cache properly, allowing for increased processing time while maintaining throughput.

Motivation

Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it can handle increased processing time and maintain its throughput during increased latency. Additionally, you can identify any potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis latency, we expect the system to maintain its throughput and indicate unavailability appropriately. We can introduce delays in Redis operations to simulate latency. The experiment aims to ensure that your system can handle increased processing time and maintain its throughput during increased latency. The performance should return to normal after the latency has ended.

Redis

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes deployment survives Redis downtime

Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by an unavailable Redis instance or a complete cluster.

Motivation

Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We will verify that a load-balanced user-facing endpoint fully works while having all pods ready. As soon as we simulate Redis downtime, we expect the system to indicate unavailability appropriately and maintain its throughput. We can block the traffic to the Redis instance to simulate downtime. The experiment aims to ensure that your system can gracefully handle the outage and continue delivering its intended functionality. The performance should return to normal after the Redis instance is available again.

Redis

Recoverability

Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If it isn't becoming ready, Kubernetes tries to solve it by restarting the underlying container and hoping to achieve its readiness eventually. If this isn't working, Kubernetes will eventually back off to restart the container, and the Kubernetes resource remains non-functional.

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Datadog monitor should alert and escalate it to your on-call team.

Solution Sketch

Kubernetes liveness, readiness, and startup probes

Crash loop

Harden Observability

Datadog

Restart

Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes pods

Linux host losing network connection is detected by Datadog

When a host suddenly loses connection to the network and your system, Datadog should alert about this. Eventually, everything should recover once the network is back again.

Motivation

When you're working in a less volatile system environment, a loss of network can be crucial as there is likely no backup host that will enable faster recovery. Thus, you should check your observability tools to catch this.

Structure

Before blocking a host from the network, we verify that the Datadog monitor is in an ok state Afterward, we block all traffic to and from a host and expect Datadog to alert about the isolated host. Eventually, when the host is online again, we expect Datadog to turn into an OK state again. While experimenting, we create a downtime for the Monitor so that it will not escalate due to the ongoing alert.