Steadybit Resilience Hub

Create Monitor Downtime

Other

Creates a downtime for a Datadog monitor.

Introduction

When running chaos experiments, you may want to mute your Datadog monitors so that you don't bother your ops colleagues. This action does that by creating a downtime for the monitors you select.

You can drag and drop the create downtime step into the experiment editor. Afterwards, select the monitors that should be muted by the downtime.

Use Cases

  • Avoid false positives in your monitoring system
  • Avoid alerting your ops colleagues during chaos experiments

Parameters

Parameter | Description | Default
Duration | How long should the downtime exist? | 30s
Notify after Downtime if unhealthy | Should Datadog notify after the downtime if the monitor is still in an unhealthy state? | yes
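
To make the Duration parameter concrete, here is a minimal sketch of how a value such as 30s could be turned into a downtime window for one monitor. It assumes the payload shape of Datadog's v1 downtime endpoint (monitor_id, scope, start and end as POSIX timestamps); the function names are hypothetical, and how the Steadybit extension actually calls Datadog is not stated on this page. The "Notify after Downtime if unhealthy" parameter is deliberately left out because its mapping to the Datadog API is not described here.

```python
import re
from datetime import datetime, timedelta, timezone

# Supported duration suffixes (an assumption; the page only shows "30s").
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '30s' or '5m' into a timedelta."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def build_downtime_payload(monitor_id: int, duration: str, start: datetime) -> dict:
    """Build a request body for a downtime that mutes a single monitor.

    Hypothetical helper: the field names follow Datadog's v1 downtime API,
    but this is not the extension's own code.
    """
    end = start + parse_duration(duration)
    return {
        "monitor_id": monitor_id,
        "scope": ["*"],  # mute the monitor for all of its scopes
        "start": int(start.timestamp()),
        "end": int(end.timestamp()),
        "message": "Muted during a chaos experiment",
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_downtime_payload(monitor_id=12345, duration="30s", start=now))
```

The payload is built by a pure function so that the window calculation can be checked without credentials or network access; sending it would additionally require an API key and an application key.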
Tags: Datadog, Observability, Monitoring
Homepage: hub.steadybit.com/extension/com.steadybit.extension_datadog
License: MIT
Maintainer: Steadybit

Useful Templates

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes uses a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps crashing or fails its liveness probe instead of becoming ready, Kubernetes restarts it, hoping that it eventually becomes ready. If that doesn't work, Kubernetes backs off and waits longer between restart attempts, and the Kubernetes resource remains non-functional.

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'okay' state. As soon as one of the containers is crash looping, caused by the crash loop attack, the Datadog monitor should alert and escalate it to your on-call team.
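
The expectation described above can be sketched as a small check over monitor states sampled before and during the attack. The function name and the sampled lists are hypothetical; the state strings "OK" and "Alert" follow Datadog's monitor state naming, and how the states are collected is left out.

```python
OK, ALERT = "OK", "Alert"


def check_monitor_expectation(states_before: list[str], states_during: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the expectation was met.

    Hypothetical helper: both arguments are monitor states sampled over time,
    before the crash loop attack and while it is running.
    """
    problems = []
    if any(state != OK for state in states_before):
        problems.append("monitor was not OK before the attack")
    if ALERT not in states_during:
        problems.append("monitor never alerted during the attack")
    return problems


if __name__ == "__main__":
    print(check_monitor_expectation(["OK", "OK"], ["OK", "Alert", "Alert"]))
```

Returning a list of problems rather than a boolean keeps both failure modes visible: a monitor that was already unhealthy before the attack invalidates the experiment, while a monitor that never alerts is the observability gap the template is looking for.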

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes

Tags: Crash loop, Harden Observability, Datadog, Restart, Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes pods

Graceful degradation and Datadog alerts when Postgres suffers latency

Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.

Motivation

Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we will monitor the system's behavior to ensure the service remains operational and can deliver its purpose. We will also analyze the performance metrics to identify any request types most affected by the latency and optimize them accordingly. Finally, we will end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly. By conducting this experiment, we can gain valuable insights into our application's resilience to database latencies and make informed decisions to optimize its performance under stress.
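
One way to find the request types most affected by the latency is to compare response times from before and during the attack. The sketch below assumes you have already collected response times in milliseconds per request type; the function name, the request type labels, and the sample numbers are all made up for illustration.

```python
from statistics import median


def rank_by_latency_increase(
    baseline: dict[str, list[float]], attack: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Rank request types by how much their median response time grew.

    Hypothetical helper: both arguments map a request type to response
    times in milliseconds, measured before and during the latency attack.
    """
    increases = {}
    for request_type, samples in attack.items():
        if request_type in baseline:
            increases[request_type] = median(samples) - median(baseline[request_type])
    # Largest increase first: these requests are the optimization candidates.
    return sorted(increases.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    baseline = {"GET /products": [20, 22, 21], "POST /orders": [40, 42, 41]}
    attack = {"GET /products": [125, 121, 123], "POST /orders": [350, 340, 345]}
    for request_type, increase in rank_by_latency_increase(baseline, attack):
        print(f"{request_type}: +{increase} ms")
```

A request type whose median grows by far more than the injected 100 milliseconds probably issues several database calls in sequence, which makes it a natural first candidate for timeout or query optimization.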

Tags: RDS, Postgres, Recoverability, Datadog, Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Graceful degradation and Datadog alerts when Postgres database cannot be reached

An unavailable database should be handled gracefully by your application and indicated appropriately. Specifically, we want to ensure that at least one monitor in Datadog is alerting us to the outage. You can address a potential impact on your system by implementing, e.g., a failover or caching mechanism.

Motivation

Database outages can occur for various reasons, including hardware failures, software bugs, network connectivity issues, or even intentional attacks. Such outages can have severe consequences for your business, such as lost revenue, dissatisfied customers, and reputational damage. By testing your application's resilience to a database outage, you can identify areas for improvement and implement measures to minimize the impact of such outages on your system.

Structure

To conduct this experiment, we will ensure that all pods are ready and that the load-balanced user-facing endpoint is fully functional. We will then simulate an unavailable PostgreSQL database by blocking the PostgreSQL database client connection on a given hostname. During the outage, we will monitor the system and ensure that the user-facing endpoint indicates unavailability by responding with a "Service unavailable" status. We will also verify that at least one monitor in Datadog is alerting us to the database outage. Once the database becomes available again, we will verify that the endpoint automatically recovers and resumes its normal operation. We will also analyze the monitoring data to identify any potential weaknesses in the system and take appropriate measures to address them. By conducting this experiment, we can identify any weaknesses in our system's resilience to database outages and take appropriate measures to minimize their impact.
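
The endpoint expectations above can be sketched as a check over the HTTP status codes observed in each phase. It assumes the "Service unavailable" status is HTTP 503 and that a recovered endpoint answers with 200; the function name and the observed code lists are hypothetical.

```python
from http import HTTPStatus


def check_outage_behaviour(during_outage: list[int], after_recovery: list[int]) -> list[str]:
    """Return a list of problems; an empty list means the endpoint behaved as expected.

    Hypothetical helper: both arguments are HTTP status codes observed on the
    user-facing endpoint while the database was blocked and after it returned.
    """
    problems = []
    if not during_outage or any(
        code != HTTPStatus.SERVICE_UNAVAILABLE for code in during_outage
    ):
        problems.append("endpoint did not consistently answer 503 during the outage")
    if not after_recovery or after_recovery[-1] != HTTPStatus.OK:
        problems.append("endpoint did not return to 200 after the outage")
    return problems


if __name__ == "__main__":
    print(check_outage_behaviour([503, 503, 503], [503, 200, 200]))
```

Only the last status code after recovery is required to be 200, so a short delay before the endpoint recovers is tolerated; tighten this if your recovery time objective demands it.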

Tags: RDS, Postgres, Recoverability, Datadog, Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

© 2024 Steadybit GmbH. All rights reserved.