Reasonable Recovery Time in Case of Container Failures
Quick startup times are favorable in Cloud environments to enable fast recovery and improve scaling.
Motivation
In Cloud environments, it is accepted that a pod or container may crash; what matters more is that it recovers quickly. A fast startup time helps here because it lowers the Mean Time To Recover (MTTR) and reduces user-facing downtime. In addition, during request peaks, a reasonably short startup time lets the deployment scale out quickly enough to absorb the load.
Structure
We stop a container of one of the pods and measure the time until it is marked as ready again. Before stopping the container, we verify that the deployment is ready. If it is, we stop the container and expect the number of ready pods to drop. Within a reasonable time (e.g., 60 seconds), the container should start up again, and all desired pods should be marked as ready.
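The check sequence above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Steadybit implements the template: the function name `measure_recovery_time` and the callables `get_ready_count` and `stop_container` are hypothetical placeholders that you would back with your own cluster tooling, and the clock and sleep functions are injectable so the logic can be exercised without a real cluster.

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    get_ready_count: Callable[[], int],   # hypothetical: returns number of ready pods
    stop_container: Callable[[], None],   # hypothetical: stops one container
    desired: int,                         # number of pods the deployment should run
    timeout_s: float = 60.0,              # "reasonable time" from the text
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until all pods are ready again, or None on timeout."""
    # Step 1: only run the experiment against a fully ready deployment.
    if get_ready_count() != desired:
        raise RuntimeError("deployment is not fully ready; aborting experiment")

    # Step 2: stop the container and start timing.
    stop_container()
    started = clock()
    saw_drop = False

    # Step 3: expect the ready count to drop, then to return to `desired`.
    while clock() - started <= timeout_s:
        ready = get_ready_count()
        if ready < desired:
            saw_drop = True
        elif saw_drop:
            return clock() - started
        sleep(poll_interval_s)

    # Either the ready count never dropped or it did not recover in time.
    return None
```

In a real Kubernetes setup, `get_ready_count` could, for example, read the deployment's `status.readyReplicas` field. A return value of `None` means the experiment failed its expectation within the timeout. Note that polling limits the resolution of the measurement to `poll_interval_s`, and a drop shorter than one polling interval may go unnoticed.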
Solution Sketch
How to use this template?
Import via Hub Connection
Steadybit’s Reliability Hub is already connected to your platform. If you are an admin, you can import templates with one click.
Are you on-prem?
This is how you import Templates