Reasonable Recovery Time in Case of Container Failures

In cloud environments, it is accepted that a pod or container may crash; what matters more is that it recovers quickly. A short startup time helps here because it lowers the Mean Time To Recover (MTTR) and reduces user-facing downtime. A reasonably short startup time also lets the deployment scale out in time during request peaks.

Structure

We stop a container in one of the pods and measure the time until the pod is marked as ready again. Before stopping the container, we verify that the deployment is ready. We then stop the container and expect the number of ready pods to drop. Within a reasonable time (e.g., 60 seconds), the container should start up again, and all desired pods should be marked as ready.
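The sequence of checks can be sketched in a few lines of Python. This is only an illustration of the experiment's logic, not how Steadybit implements it: `measure_recovery_time`, `ready_pods`, and `stop_container` are hypothetical names, and the two callables stand in for whatever you use to query the deployment and to stop the container.

```python
import time
from typing import Callable


def measure_recovery_time(
    ready_pods: Callable[[], int],       # hypothetical: returns current ready pod count
    stop_container: Callable[[], None],  # hypothetical: stops one container
    desired_pods: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds between stopping a container and full readiness."""
    # Precondition: the deployment must be fully ready before the attack.
    if ready_pods() != desired_pods:
        raise RuntimeError("deployment is not ready; aborting experiment")

    stop_container()
    started = clock()
    saw_drop = False

    while clock() - started <= timeout_s:
        ready = ready_pods()
        if ready < desired_pods:
            # Expected: the ready count drops after the container stops.
            saw_drop = True
        elif saw_drop:
            # All desired pods are ready again after the drop.
            return clock() - started
        sleep(poll_interval_s)

    if not saw_drop:
        raise RuntimeError("ready pod count never dropped")
    raise TimeoutError(f"pods not ready again within {timeout_s} seconds")
```

Passing the clock and sleep functions as parameters keeps the sketch testable without a cluster; in a real run the defaults poll once per second against the 60-second limit mentioned above.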

Solution Sketch

Kubernetes liveness, readiness, and startup probes
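Probe settings bound how long Kubernetes waits for a container to start, so it can help to compare them with the recovery target. The sketch below is a rough estimate under stated assumptions: `startup_budget_seconds` is a hypothetical helper, the probe values are made-up examples, and the formula (initial delay plus failure threshold times period) is an approximation of the time a startup probe allows before the container is restarted. The field names and defaults follow the Kubernetes probe configuration.

```python
def startup_budget_seconds(probe: dict) -> int:
    """Approximate the maximum startup time a startup probe allows."""
    initial_delay = probe.get("initialDelaySeconds", 0)  # Kubernetes default: 0
    period = probe.get("periodSeconds", 10)              # Kubernetes default: 10
    failures = probe.get("failureThreshold", 3)          # Kubernetes default: 3
    return initial_delay + period * failures


# Example values only; not taken from any real deployment.
fast_probe = {"periodSeconds": 5, "failureThreshold": 6}
slow_probe = {"periodSeconds": 10, "failureThreshold": 30}

recovery_target_s = 60  # the "reasonable time" used in this experiment

for name, probe in [("fast", fast_probe), ("slow", slow_probe)]:
    budget = startup_budget_seconds(probe)
    verdict = "within" if budget <= recovery_target_s else "exceeds"
    print(f"{name}: up to {budget}s, {verdict} the {recovery_target_s}s target")
```

A budget above the target does not mean recovery will be slow, only that the probe configuration would tolerate a startup longer than the experiment expects.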

How to use this template?

Import via Hub Connection

Steadybit’s Reliability Hub is already connected to your platform. If you are an admin, you can import templates with just one click.

Are you on-prem?

This is how you import Templates

Import as Experiment

Download the template and upload it as an experiment to use it once. This is ideal if you are not an administrator of the platform and just want to use the template once.

Used Actions

Deployment Pod Count

Kubernetes Event Logs

Pod Count Metrics

Stop Container