We have a RabbitMQ service that occasionally goes down.
So we need to:
- restart it if it exits with a failure
- send an email notification
Let’s do this with RabbitMQ’s systemd service (there are other options, e.g. monit; see the post “Monit: monitoring and restarting NGINX”).
We will use two options here:
- RestartSec=: the delay before a restart, to give any pending disk I/O operations a chance to finish, just in case
- Restart=: the condition under which the service is restarted
The available conditions for Restart= are:
Table 2. Exit causes and the effect of the Restart= settings on them
Restart settings/Exit causes | no | always | on-success | on-failure | on-abnormal | on-abort | on-watchdog |
---|---|---|---|---|---|---|---|
Clean exit code or signal |  | X | X |  |  |  |  |
Unclean exit code |  | X |  | X |  |  |  |
Unclean signal |  | X |  | X | X | X |  |
Timeout |  | X |  | X | X |  |  |
Watchdog |  | X |  | X | X |  | X |
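To make the table easier to check against a concrete case, here is a minimal shell sketch that encodes it as a lookup. The function name `will_restart` and the exit-cause labels (`clean`, `unclean-exit`, `unclean-signal`, `timeout`, `watchdog`) are made up for this illustration and are not systemd identifiers.

```shell
# Illustration only: encodes the Restart= table above as a lookup.
# The cause labels are hypothetical names, not systemd keywords.
will_restart() {
    setting=$1
    cause=$2
    case "$setting:$cause" in
        always:clean|always:unclean-exit|always:unclean-signal|always:timeout|always:watchdog) echo yes ;;
        on-success:clean) echo yes ;;
        on-failure:unclean-exit|on-failure:unclean-signal|on-failure:timeout|on-failure:watchdog) echo yes ;;
        on-abnormal:unclean-signal|on-abnormal:timeout|on-abnormal:watchdog) echo yes ;;
        on-abort:unclean-signal) echo yes ;;
        on-watchdog:watchdog) echo yes ;;
        *) echo no ;;
    esac
}

# kill -9 is an unclean signal, so Restart=on-failure restarts the service
will_restart on-failure unclean-signal
# a clean exit (exit code 0) does not trigger on-failure
will_restart on-failure clean
```

This is why on-failure fits the goal here: it covers crashes, kills and timeouts, but leaves a clean exit alone.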
Editing systemd unit files
RabbitMQ’s default unit file is /lib/systemd/system/rabbitmq-server.service.
You can view it with systemctl cat:
[simterm]
admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service
# /lib/systemd/system/rabbitmq-server.service
[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

[Install]
WantedBy=multi-user.target
[/simterm]
Do not edit it, or any other file, directly in /lib/systemd/system/: it will be overwritten by the next rabbitmq-server package upgrade.
To change a service’s default behavior, put your own files in the /etc/systemd/system directory.
To edit an existing service, use systemctl edit foo.service with the --full option:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service
[/simterm]
This opens a temporary file such as /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8, pre-filled with the current content of /lib/systemd/system/rabbitmq-server.service, which you can then modify.
Restart on failure
Add both options here, Restart=on-failure and RestartSec=60s:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop
Restart=on-failure
RestartSec=60s

[Install]
WantedBy=multi-user.target
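As an alternative to keeping a full copy of the unit, systemd also reads drop-in files from a `<unit>.d/` directory, so the two new options can live in a small override of their own. A minimal sketch, assuming the standard drop-in location /etc/systemd/system/rabbitmq-server.service.d/; the helper name `write_restart_dropin` and its directory argument are hypothetical and exist only so the snippet can be tried against a scratch directory first.

```shell
# Hypothetical helper: writes a drop-in holding only the two restart options.
# $1 is the target directory; for the real unit this would be
# /etc/systemd/system/rabbitmq-server.service.d (run as root).
write_restart_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF
}

# Try it against a scratch directory and inspect the result
scratch=$(mktemp -d)
write_restart_dropin "$scratch"
cat "$scratch/override.conf"
```

With a drop-in, the packaged unit keeps receiving upgrades and only the two overridden settings are maintained locally. A systemctl daemon-reload is still required for systemd to pick the file up.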
Reload systemd’s configuration files:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl daemon-reload
[/simterm]
systemd will create the /etc/systemd/system/rabbitmq-server.service file with the new content.
Now get the RabbitMQ server’s PID:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
Main PID: 14668 (rabbitmq-server)
[/simterm]
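Grepping the human-readable status output works, but for scripting the PID can also be read from systemctl show, which prints a `MainPID=<number>` line. A small sketch; the helper name `main_pid_from_show` is made up here, and it only parses text on stdin so it can be tried without a running service.

```shell
# Hypothetical helper: extracts the number from a "MainPID=<n>" line on stdin.
main_pid_from_show() {
    sed -n 's/^MainPID=\([0-9][0-9]*\)$/\1/p'
}

# Parsing a sample line (the PID is the one from the output above)
echo "MainPID=14668" | main_pid_from_show

# Against the real service it would be used like this:
#   systemctl show -p MainPID rabbitmq-server.service | main_pid_from_show
```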
Kill it with SIGKILL (see the post “Linux & FreeBSD: kill and nohup commands, signals and process management”) so that the on-failure condition applies:
[simterm]
root@bttrm-dev-console:/home/admin# kill -9 14668
[/simterm]
Check its status:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Main PID: 14668 (code=killed, signal=KILL)
[/simterm]
Logs:
[simterm]
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
[/simterm]
And after one minute:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
...
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server...
[/simterm]
Logs again:
[simterm]
Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console' ...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
[/simterm]
The “Service hold-off time over, scheduling restart” message marks the end of our 60-second RestartSec= delay.
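The delay can be read off the journal timestamps: the failure was logged at 13:26:00 and the restart was scheduled at 13:27:01. A tiny sketch of that arithmetic; `hms_to_seconds` is a hypothetical helper name, and it assumes two-digit fields and that both timestamps fall on the same day.

```shell
# Hypothetical helper: converts HH:MM:SS (two-digit fields) to seconds
# since midnight.
hms_to_seconds() {
    h=${1%%:*}          # hours
    rest=${1#*:}
    m=${rest%%:*}       # minutes
    s=${rest#*:}        # seconds
    # strip one leading zero so that e.g. "08" is not parsed as octal
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

failed=$(hms_to_seconds 13:26:00)
restarted=$(hms_to_seconds 13:27:01)
echo "$(( restarted - failed )) seconds between failure and restart"
```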
Email notification
Now let’s add an email notification that is sent if RabbitMQ goes down with an error.
Send a test email first:
[simterm]
root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice" [email protected]
[/simterm]
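To send the notification you can use either ExecStopPost= or OnFailure=. OnFailure= fires only when the unit enters the failed state, whereas ExecStopPost= runs after every stop, so let’s use OnFailure=.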
Create the /etc/systemd/system/[email protected] file:
[Unit]
Description=%i failure email notification

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" [email protected]'
Using systemctl edit, add the OnFailure= option to the [Unit] block of rabbitmq-server.service:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target
OnFailure=rabbitmq-notify-email@%i.service
...
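The notification unit is a template: whatever appears between `@` and `.service` in the name of the unit that gets started is the instance, and that is what `%i` expands to inside the template. The sketch below mimics that substitution in plain shell to show what the Description= and the mail subject end up as. `instance_of` is a hypothetical helper, not a systemd command, and the instance name rabbitmq-server is taken from the “Starting rabbitmq-server failure email notification...” line in this post’s logs.

```shell
# Hypothetical helper: prints the instance part of a template unit name,
# i.e. the text between "@" and the trailing ".service".
instance_of() {
    name=${1#*@}
    echo "${name%.service}"
}

i=$(instance_of "rabbitmq-notify-email@rabbitmq-server.service")
echo "Description=$i failure email notification"
echo "Subject: [$i] failure notification"
```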
Do not forget to reload systemd’s configuration files:
[simterm]
root@bttrm-dev-console:/home/admin# systemctl daemon-reload
[/simterm]
Kill RabbitMQ again:
[simterm]
root@bttrm-dev-console:/home/admin# kill -9 29970
[/simterm]
Check logs:
[simterm]
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
[/simterm]
The two lines to look for:
- Triggering OnFailure= dependencies.
- Started rabbitmq-server failure email notification.
Okay, everything works.
Mail logs:
[simterm]
root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt H=alt2.aspmx.l.google.com [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt H=alt1.aspmx.l.google.com [172.217.192.27] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => [email protected] R=dnslookup T=remote_smtp H=alt2.aspmx.l.google.com [74.125.193.27] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU H=alt2.aspmx.l.google.com [74.125.193.27] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU H=aspmx2.googlemail.com [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => [email protected] R=dnslookup T=remote_smtp H=aspmx3.googlemail.com [74.125.193.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= [email protected] U=root P=local S=1331
[/simterm]
If you didn’t get the email, check exim’s queue:
[simterm]
root@bttrm-dev-console:/home/admin# exim -bp
0m 1.2K 1gzL3R-0000dn-5h <[email protected]>
    [email protected]
[/simterm]
The message is stuck in the queue.
Run the queue manually:
[simterm]
root@bttrm-dev-console:/home/admin# runq
[/simterm]
Check logs:
[simterm]
root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= [email protected] U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h H=aspmx.l.google.com [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => [email protected] R=dnslookup T=remote_smtp H=aspmx.l.google.com [173.194.68.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed
[/simterm]
And the email arrives.
To work around the delivery issue (I am not sure why exim won’t send the messages on its own), add a dirty “hack” to /etc/systemd/system/[email protected], the ExecStartPost option:
...
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" [email protected]'
ExecStartPost=runq
...
To remove an old message from the queue, use its ID:
[simterm]
root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed
[/simterm]
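When several messages are stuck, the IDs can be pulled out of the exim -bp listing instead of being copied by hand. A sketch, assuming the classic 16-character ID format seen in the logs above (6-6-2, e.g. 1gzL3R-0000dn-5h); `queue_ids` is a hypothetical helper that only filters text on stdin, and `<sender>` in the sample line is a placeholder.

```shell
# Hypothetical helper: prints every exim message ID found on stdin.
# Assumes the classic XXXXXX-XXXXXX-XX format shown in the logs above.
queue_ids() {
    grep -Eo '[0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}'
}

# Parsing a sample queue line
echo " 0m  1.2K 1gzL3R-0000dn-5h <sender>" | queue_ids

# Against the real queue it could be combined with the removal command
# (xargs -r is a GNU extension that skips the call when there is no input):
#   exim -bp | queue_ids | xargs -r exim -Mrm
```

Review the list of IDs before piping it into exim -Mrm, since that removes every queued message, not only the stale ones.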
Done.