Datadog

Extension

Bi-directional integration of Steadybit and Datadog via the Datadog API.

Introduction to the Datadog Extension

The Steadybit Datadog Extension connects Steadybit and Datadog. It adds checks to your Chaos Engineering experiments that validate the status of Datadog monitors, and it reports experiment events to Datadog to ease correlation.

Integration and Functionality

Steadybit integrates with Datadog via the Datadog API. All you need is a Datadog API key and application key, along with your Datadog site configuration.
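
To illustrate how these three settings fit together, the sketch below builds (but does not send) an authenticated request for a single monitor. The helper name `build_monitor_request`, the monitor ID, and the placeholder keys are hypothetical. The `DD-API-KEY` and `DD-APPLICATION-KEY` headers and the `api.<site>` host pattern follow Datadog's public API conventions; this is not taken from the extension's source.

```python
import urllib.request


def build_monitor_request(site: str, api_key: str, app_key: str,
                          monitor_id: int) -> urllib.request.Request:
    """Build a GET request for one monitor; nothing is sent here."""
    # The site (for example "datadoghq.com" or "datadoghq.eu") selects the API host.
    url = f"https://api.{site}/api/v1/monitor/{monitor_id}"
    headers = {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder credentials and a made-up monitor ID, for illustration only.
req = build_monitor_request("datadoghq.eu", "<api-key>", "<application-key>", 12345)
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; the JSON response for a monitor includes an `overall_state` field such as `OK` or `Alert`.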

Integration of Datadog in Steadybit

With the Monitor Status Check you can integrate your Datadog monitors into your experiments. Check that your observability strategy is working as expected by verifying that Datadog monitors notice a fault injected by Steadybit.
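
The idea behind such a check can be sketched as a small function that evaluates the monitor states polled while an experiment step runs. This is a minimal illustration under stated assumptions, not the extension's implementation: the function name, the `mode` values, and the sample states are hypothetical, with state names modeled on the `overall_state` values Datadog reports for monitors.

```python
from typing import Iterable


def monitor_check_passes(observed_states: Iterable[str], expected: str,
                         mode: str = "all_the_time") -> bool:
    """Evaluate monitor states polled during an experiment step.

    `mode` is a hypothetical setting: "all_the_time" requires every polled
    state to match, "at_least_once" requires a single match.
    """
    states = list(observed_states)
    if not states:
        return False  # no observations: treat the check as failed
    matches = [state == expected for state in states]
    if mode == "all_the_time":
        return all(matches)
    if mode == "at_least_once":
        return any(matches)
    raise ValueError(f"unknown mode: {mode}")


# States polled while a fault is injected: the monitor notices after two polls.
polled = ["OK", "OK", "Alert", "Alert"]
print(monitor_check_passes(polled, "Alert", mode="at_least_once"))  # True
print(monitor_check_passes(polled, "Alert", mode="all_the_time"))   # False
```

Before the fault is injected you would typically expect `OK` all the time; during the fault, expecting `Alert` at least once verifies that the monitor actually notices it.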

With the Create Downtime Action you can mute your monitors during an experiment to avoid false alarms and unnecessary incident processes.
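
As a rough sketch of what scheduling such a downtime involves, the snippet below assembles a request body in the shape accepted by Datadog's v1 downtime endpoint (`POST /api/v1/downtime`). The helper name, monitor ID, and message are hypothetical, and whether the extension uses this endpoint or the newer v2 API is not stated on this page.

```python
from datetime import datetime, timedelta, timezone


def build_downtime_payload(monitor_id: int, start: datetime,
                           duration: timedelta, message: str) -> dict:
    """Body for POST /api/v1/downtime; timestamps are POSIX seconds."""
    end = start + duration
    return {
        "monitor_id": monitor_id,
        "scope": ["*"],  # assumption: mute the monitor for all scopes
        "start": int(start.timestamp()),
        "end": int(end.timestamp()),
        "message": message,
    }


# Mute a made-up monitor for the ten minutes an experiment is expected to run.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
payload = build_downtime_payload(12345, start, timedelta(minutes=10),
                                 "Steadybit experiment running")
print(payload["end"] - payload["start"])  # 600
```

Setting an explicit end time means the monitor unmutes on its own even if the experiment is aborted before it can clean up.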

Integration of Steadybit in Datadog

The extension automatically reports experiment executions to Datadog, which helps you correlate anomalies detected in Datadog with your experiments. Furthermore, installing Steadybit's Datadog integration gives you a dashboard showing the number of experiment executions.
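
To give a feel for what such a reported event could look like, here is a minimal sketch of a body for Datadog's v1 events endpoint (`POST /api/v1/events`). The field names `title`, `text`, `alert_type`, `aggregation_key`, and `tags` are part of that API; the helper name, the tag names, and the wording are assumptions and do not reflect what the extension actually sends.

```python
import json


def build_experiment_event(experiment_key: str, execution_id: int, state: str) -> dict:
    """Body for POST /api/v1/events describing one experiment execution."""
    return {
        "title": f"Experiment {experiment_key} {state}",
        "text": f"Execution {execution_id} of experiment {experiment_key} {state}.",
        "alert_type": "info",
        # Hypothetical key that groups all events of one execution together.
        "aggregation_key": f"steadybit-execution-{execution_id}",
        # Hypothetical tags that make events filterable in dashboards.
        "tags": [
            "source:steadybit",
            f"experiment_key:{experiment_key}",
            f"execution_id:{execution_id}",
        ],
    }


event = build_experiment_event("SHOP-12", 481, "started")
print(json.dumps(event, indent=2))
```

Tagging events consistently is what allows them to be overlaid on graphs next to the anomalies they may have caused.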

Installation and Setup

To integrate the Datadog extension with your environment, follow our setup guide.

Tags: Datadog, Check, Observability, Monitoring
Homepage: hub.steadybit.com/extension/com.steadybit.extension_datadog
License: MIT
Maintainer: Steadybit

Provided Target Discovery

Datadog monitors

Provided Actions

create monitor downtime

Creates a downtime for a Datadog monitor.

Attack

Datadog monitors

Useful Templates (4 of 10)

Kubernetes deployment survives Redis latency

Verify that your application handles an increased latency in a Redis cache properly, allowing for increased processing time while maintaining throughput.

Motivation

Latency issues in Redis can lead to degraded system performance, longer response times, and potentially lost or delayed data. By testing your system's resilience to Redis latency, you can ensure that it tolerates longer processing times while maintaining its throughput. Additionally, you can identify potential bottlenecks or inefficiencies in your system and take appropriate measures to optimize its performance and reliability.

Structure

We first verify that a load-balanced, user-facing endpoint works fully while all pods are ready. We then introduce delays in Redis operations to simulate latency. While the latency is active, we expect the system to maintain its throughput and indicate unavailability appropriately. Performance should return to normal once the latency has ended.

Redis
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Kubernetes deployment survives Redis downtime

Check that your application gracefully handles a Redis cache downtime and continues to deliver its intended functionality. The cache downtime may be caused by a single unavailable Redis instance or by the outage of a complete cluster.

Motivation

Redis downtime can lead to degraded system performance, lost data, and potentially long system recovery times. By testing your system's resilience to Redis downtime, you can ensure that it can handle the outage gracefully and continue to deliver its intended functionality. Additionally, you can identify any potential weaknesses in your system and take appropriate measures to improve its performance and resilience.

Structure

We first verify that a load-balanced, user-facing endpoint works fully while all pods are ready. We then block the traffic to the Redis instance to simulate downtime. During the outage, we expect the system to indicate unavailability appropriately, maintain its throughput, and continue delivering its intended functionality. Performance should return to normal once the Redis instance is available again.

Redis
Recoverability
Datadog

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

Datadog alerts when a Kubernetes pod is in crash loop

Verify that a Datadog monitor alerts you when pods are not ready to accept traffic for a certain time.

Motivation

Kubernetes features a readiness probe to determine whether your pod is ready to accept traffic. If a container keeps crashing or failing its liveness probe, Kubernetes restarts it in the hope that the pod eventually becomes ready. If that does not help, Kubernetes backs off with increasing delays between restarts, and the Kubernetes resource remains non-functional.

Structure

First, check that the Datadog monitor responsible for tracking non-ready containers is in an 'OK' state. As soon as the crash loop attack causes one of the containers to crash loop, the Datadog monitor should alert and escalate to your on-call team.

Solution Sketch

  • Kubernetes liveness, readiness, and startup probes

Crash loop
Harden Observability
Datadog
Restart
Kubernetes

Datadog monitors

Kubernetes cluster

Kubernetes pods

Graceful degradation and Datadog alerts when Postgres suffers latency

Your application should continue functioning properly and indicate unavailability appropriately in case of increased connection latency to PostgreSQL. Additionally, this experiment can highlight requests that need optimization of timeouts to prevent dropped requests.

Motivation

Latencies in shared or overloaded databases are common and can significantly impact the performance of your application. By conducting this experiment, you can gain insights into the robustness of your application and identify areas for improvement.

Structure

To conduct this experiment, we first ensure that all pods are ready and that the load-balanced, user-facing endpoint is fully functional. We then simulate a latency attack on the PostgreSQL database by adding a delay of 100 milliseconds to all traffic to the database hostname. During the attack, we monitor the system's behavior to ensure the service remains operational and can still fulfill its purpose. We also analyze the performance metrics to identify the request types most affected by the latency so that they can be optimized. Finally, we end the attack and monitor the system's recovery time to ensure it returns to its normal state promptly.

RDS
Postgres
Recoverability
Datadog
Database

Containers

Datadog monitors

Kubernetes cluster

Kubernetes deployments

© 2024 Steadybit GmbH. All rights reserved.