Some useful tips on using Liveness and Readiness Probes in Kubernetes – the difference between them, and how to properly configure these checks.
To put it very briefly:
- livenessProbe: used by Kubernetes to know when to restart a container in a Pod
- readinessProbe: used by Kubernetes to know when a container is ready to receive traffic, that is, when the corresponding Kubernetes Service can add this Pod to its routes
- startupProbe: used by Kubernetes to know when a container has started and is ready to be checked with livenessProbe and readinessProbe
livenessProbe and readinessProbe will start executing only after a successful startupProbe.
So, livenessProbe is used to determine whether the process in the Pod is alive, readinessProbe is used to determine whether the service in the Pod is ready to receive traffic, and startupProbe is used to determine when to start executing livenessProbe and readinessProbe.
This post is based on three materials that I once saved and use quite a bit:
- Kubernetes production best practices: useful tips not only about Probes, but about Kubernetes in general
- Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot: good examples of creating Probes and how to avoid mistakes when working with them
- Kubernetes: Best Practices for Liveness Probes: some nuances when working with Probes
And there will be a few more links at the end of this post.
livenessProbe is needed when, for example, a process is stuck in a deadlock and cannot perform its tasks. Another example is a process that has entered an infinite loop and is using 100% of the CPU: it cannot process requests from clients, but it still receives them because it is still connected to the Kubernetes network.
If you have a readinessProbe but no livenessProbe, then such a Pod will be disconnected from traffic, but will remain in the Running status, and will continue to occupy CPU/Memory resources.
livenessProbe checks are executed by the kubelet process on the same WorkerNode where the container is running, and after a restart the container will be started on the same WorkerNode.
A process in a container should stop with an error code
livenessProbe should not be a tool for responding to service errors: instead, the process should exit with an error code, which stops the container so that Kubernetes can start a new one.
livenessProbe is used only to check the status of the process itself in the container.
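As a minimal Python sketch of this idea, a process that hits an unrecoverable error can report it through its exit code instead of staying alive and waiting for livenessProbe to notice. The `load_config` step and the `APP_CONFIG` setting are hypothetical and used only for illustration:

```python
import sys


class FatalError(Exception):
    """Raised for errors the process cannot recover from on its own."""


def load_config(env):
    # Hypothetical startup step: a missing setting is unrecoverable.
    path = env.get("APP_CONFIG")
    if not path:
        raise FatalError("APP_CONFIG is not set")
    return path


def main(env):
    """Return the process exit code: 0 on success, 1 on a fatal error."""
    try:
        load_config(env)
    except FatalError as exc:
        print(f"fatal: {exc}", file=sys.stderr)
        return 1
    return 0
```

The entry point of the program would then call `sys.exit(main(os.environ))`, so that a fatal error ends the process with a non-zero code.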
Split Liveness and Readiness Probes
It is a common practice to use one endpoint for both livenessProbe and readinessProbe, but set a higher value of failureThreshold for the livenessProbe, that is, to disconnect traffic and customers earlier, and restart only if things are really bad.
But these Probes have different purposes, and therefore, although it is acceptable to use the same endpoint, it is better to have different checks. In addition, if both checks fail, Kubernetes will restart a Pod and disconnect it from the network at the same time, which can lead to 502 errors for clients.
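One way to keep the two checks separate is to serve them from different paths. The sketch below uses only the Python standard library. The `/healthz` and `/ready` paths, the `STATE` flags, and port 8080 are assumptions for illustration, not something Kubernetes requires:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would update these flags.
STATE = {"alive": True, "ready": False}


def probe_response(path, state):
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/healthz":
        # Liveness: only reports whether the process itself still works.
        return (200, "alive") if state["alive"] else (500, "stuck")
    if path == "/ready":
        # Readiness: reports whether the service can take traffic right now.
        return (200, "ready") if state["ready"] else (503, "not ready")
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, STATE)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8080):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)
```

With this layout, a Pod that is still warming up answers 503 on `/ready` while `/healthz` keeps returning 200, so it is taken out of traffic without being restarted.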
Pods should not refer to each other or to external services when running livenessProbe: your container should not perform database server availability checks, because if the database server is down, restarting your Pod will not help solve this problem.
Instead, you can create a separate endpoint for the monitoring system and perform such checks there – for alerts and dashboards in Grafana.
In addition, a process in a container should not crash if it cannot access a service it depends on. Instead, it should retry the connection, because Kubernetes expects that Pods can be started in any order.
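A connection retry of this kind could look roughly like the Python sketch below. The `connect` callable stands in for whatever client your application uses, and the delay values are arbitrary assumptions:

```python
import time


def connect_with_retry(connect, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Keep calling connect() until it succeeds, backing off between attempts."""
    delay = base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            # The dependency is not available yet: wait and try again
            # instead of crashing the process.
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff with a cap
```

The `sleep` parameter exists only so that the waiting can be replaced in tests; in normal use the default `time.sleep` is enough.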
Correct processing of the SIGTERM signal
A process in a container must correctly handle the SIGTERM signal: it is sent by the kubelet to the containers when they need to be restarted after livenessProbe has failed. If the process does not stop in response to SIGTERM (because it is “hanging”), then SIGKILL will be sent once the termination grace period expires (terminationGracePeriodSeconds, 30 seconds by default).
Or the process may treat SIGTERM like SIGKILL and stop immediately without closing open TCP connections; see Kubernetes: NGINX/PHP-FPM graceful shutdown and 502 errors.
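For illustration, here is a small Python sketch of SIGTERM handling: the handler only sets a flag, and the main loop finishes its current work and cleans up before exiting. The `handle_one_request` and `cleanup` callables are placeholders for your own application code:

```python
import signal
import threading


class GracefulShutdown:
    """Turns SIGTERM into a flag that the main loop can poll."""

    def __init__(self):
        self.stop = threading.Event()

    def handle(self, signum, frame):
        # Only record the request; the cleanup happens in the main loop.
        self.stop.set()

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle)


def serve(shutdown, handle_one_request, cleanup):
    """Process work until SIGTERM arrives, then release resources."""
    while not shutdown.stop.is_set():
        handle_one_request()
    cleanup()  # e.g. finish in-flight requests and close connections
```

At startup the application would create a `GracefulShutdown`, call `install()`, and pass the object to `serve()`. The whole shutdown has to fit into the termination grace period, otherwise the process is killed anyway.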
readinessProbe is needed so that requests are not sent to Pods that are still spinning up and are not ready to handle requests from users.
For example, if your Pod startup process takes 2 minutes (some kind of bootstrap, especially if it is a JVM, or loading some cache into memory), and you do not have a readinessProbe, then Kubernetes will start sending requests as soon as the Pod enters the Running status, and they will fail.
In readinessProbe, it may make sense to check the availability of services on which the Pod depends: if the service cannot fulfill a client's request because it has no connection to the database, there is no point in sending traffic to this Pod.
However, keep in mind that readinessProbe is executed continuously (every 10 seconds by default), and a separate database query will be executed for each such check.
But in general, it depends on your application. For example, if a database server is down, but you can return responses from some local cache, then the app can continue to work and return a 503 error only for requests with write operations.
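If the readiness check does query the database, one possible way to limit the extra load is to cache the result for a short time. The Python sketch below is only one option. The `CachedCheck` name and the 30-second TTL are assumptions, and `check` is whatever callable performs the real dependency check:

```python
import time


class CachedCheck:
    """Caches a dependency check so probes do not hit the database each time."""

    def __init__(self, check, ttl=30.0, clock=time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._expires = None  # no result cached yet
        self._result = False

    def __call__(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            # The cached result is missing or too old: run the real check.
            self._result = bool(self._check())
            self._expires = now + self._ttl
        return self._result
```

The trade-off is that the Pod reacts to a database outage with a delay of up to the TTL, so the value should be chosen together with periodSeconds and failureThreshold.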
startupProbe is executed only at the start of the container, so this is where you can check connections to external services or cache access.
For example, it can be useful to check the database connection when you deploy a new version of the Helm Chart and have a Kubernetes Deployment with Rolling Update, but the new version has an error in the URL or password to the database server.
startupProbe can also be useful to avoid increasing initialDelaySeconds for livenessProbe and readinessProbe, and instead delay their start until the startupProbe is finished: if the livenessProbe does not have time to succeed while the container is starting, then Kubernetes will restart the container, even though it is still “warming up”.
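The time a container gets to start under a startupProbe is roughly failureThreshold multiplied by periodSeconds, plus initialDelaySeconds. The Python sketch below does this arithmetic; the function names are made up for illustration:

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container gets to start before it is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(startup_seconds, period_seconds):
    """Smallest failureThreshold that covers the given startup time."""
    return math.ceil(startup_seconds / period_seconds)
```

For example, with periodSeconds set to 10, an application that needs up to 2 minutes to start requires a failureThreshold of at least 12, and in practice some margin on top of that is reasonable.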
Types of checks
In each Probe, we can use checks by:
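- exec: execute a command inside the container; the check passes if the command exits with code 0
- httpGet: send an HTTP GET request; the check passes if the response code is at least 200 and less than 400
- tcpSocket: open a TCP connection to the port; the check passes if the connection is established
- grpc: make a gRPC health check request to a port; the application must implement the gRPC Health Checking Protocol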
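To make the success criteria of the first three check types more concrete, here is a rough Python imitation of them. This is not how the kubelet is implemented, only a sketch of the rules, and the function names are invented for this example:

```python
import socket
import subprocess


def exec_check(command, timeout=1.0):
    """exec: the check passes when the command exits with code 0."""
    try:
        return subprocess.run(command, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # The command could not be started or did not finish in time.
        return False


def http_status_ok(status):
    """httpGet: any status from 200 up to, but not including, 400 passes."""
    return 200 <= status < 400


def tcp_check(host, port, timeout=1.0):
    """tcpSocket: the check passes when a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that a redirect (for example, 301 or 302) counts as a success for httpGet, which may hide a misconfigured endpoint.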
Parameters for Probes
All Probes have parameters that allow you to fine-tune the time of the checks:
- initialDelaySeconds: delay between the container start and the first check (0 by default)
- periodSeconds: how often to run the check after initialDelaySeconds (10 seconds by default)
- timeoutSeconds: how long to wait for a response to a request (1 second by default)
- failureThreshold: how many consecutive failed responses must be received to consider the check failed, that is, how many times to repeat the check before restarting the container or disconnecting the Pod from the network (3 by default)
- successThreshold: similarly, how many consecutive successful responses are needed to consider the check passed (1 by default; must be 1 for livenessProbe and startupProbe)
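These parameters together determine how quickly Kubernetes reacts to a failure. The Python sketch below models them with the default values from the Kubernetes documentation and estimates the reaction time. The `ProbeTiming` class is purely illustrative and is not part of any Kubernetes client library:

```python
from dataclasses import dataclass


@dataclass
class ProbeTiming:
    # Defaults as described in the Kubernetes documentation.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3
    success_threshold: int = 1

    def seconds_until_failed(self):
        """Rough upper bound on how long a broken container goes unnoticed.

        Ignores timeouts: in the worst case the container breaks right
        after a successful check, and failure_threshold checks must fail.
        """
        return self.period_seconds * self.failure_threshold
```

With the defaults, this gives about 30 seconds before the probe is considered failed, and lowering periodSeconds or failureThreshold shortens that time at the cost of more frequent checks or more false positives.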
- Configure Liveness, Readiness and Startup Probes – official documentation
- Kubernetes startup probe – a practical guide
- Kubernetes Readiness Probes | Practical Guide
- Kubernetes Readiness & Liveliness Probes — Best Practices