Linux: systemd-unit files edit, restart on failure and email notifications

By | 03/01/2019

We have a RabbitMQ service that occasionally goes down.

So we need to:

  1. restart it if it exits with a failure
  2. send an email notification

Let’s do both via RabbitMQ’s systemd service (there are other options, e.g. monit; see the post Monit: monitoring and restarting NGINX).

We will use two options here:

  • RestartSec=: the delay before a restart, to give any pending disk I/O a chance to finish, just in case
  • Restart=: the condition under which the service is restarted

The available Restart= conditions are:

Table 2. Exit causes and the effect of the Restart= settings on them

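Restart settings/Exit causes | no | always | on-success | on-failure | on-abnormal | on-abort | on-watchdog
-----------------------------+----+--------+------------+------------+-------------+----------+------------
Clean exit code or signal    |    | X      | X          |            |             |          |
Unclean exit code            |    | X      |            | X          |             |          |
Unclean signal               |    | X      |            | X          | X           | X        |
Timeout                      |    | X      |            | X          | X           |          |
Watchdog                     |    | X      |            | X          | X           |          | X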
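
To make the table easier to reason about, here is a minimal sketch that encodes it as a lookup. The function name will_restart and the cause labels (clean, unclean-exit-code, unclean-signal, timeout, watchdog) are made up for this illustration; they are not systemd identifiers.

```shell
# will_restart SETTING CAUSE
# Returns 0 if a unit with Restart=SETTING would be restarted after CAUSE,
# 1 if not, and 2 for an unknown setting. Mirrors the table above.
will_restart() {
    setting="${1:-}"
    cause="${2:-}"
    case "$setting" in
        no)          return 1 ;;
        always)      return 0 ;;
        on-success)  [ "$cause" = clean ] ;;
        # on-failure covers every cause except a clean exit
        on-failure)  [ "$cause" != clean ] ;;
        on-abnormal)
            case "$cause" in
                unclean-signal|timeout|watchdog) return 0 ;;
                *) return 1 ;;
            esac ;;
        on-abort)    [ "$cause" = unclean-signal ] ;;
        on-watchdog) [ "$cause" = watchdog ] ;;
        *)
            echo "unknown Restart= setting: $setting" >&2
            return 2 ;;
    esac
}
```

For example, `will_restart on-failure unclean-signal && echo "will restart"` prints "will restart": a process killed by SIGKILL counts as an unclean signal, which on-failure covers.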

systemd-unit files edit

RabbitMQ’s default unit file is /lib/systemd/system/rabbitmq-server.service.

You can view it with systemctl cat:

admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service
Description=RabbitMQ Messaging Server
ExecStop=/usr/sbin/rabbitmqctl stop

Do not edit it (or any other file) directly in /lib/systemd/system/: it will be overwritten by the next upgrade of the rabbitmq-server package.

When you need to change a service’s default behavior, put your files in the /etc/systemd/system directory instead.

To edit an existing service, use systemctl edit foo.service with the --full option:

root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service

This opens a temporary file such as /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8, pre-filled with the current /lib/systemd/system/rabbitmq-server.service content, which you can then change.

Restart on failure

Add both options, Restart=on-failure and RestartSec=60s, to the [Service] block:

[Unit]
Description=RabbitMQ Messaging Server
...

[Service]
...
ExecStop=/usr/sbin/rabbitmqctl stop
Restart=on-failure
RestartSec=60s

Re-read systemd’s config files:

root@bttrm-dev-console:/home/admin# systemctl daemon-reload

systemd will create the /etc/systemd/system/rabbitmq-server.service file with the new content.
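
To confirm that systemd has actually loaded the new values, you can query the unit’s properties. This is a minimal sketch: restart_policy is a hypothetical helper name, and note that systemd exposes the RestartSec= setting under the property name RestartUSec.

```shell
# restart_policy UNIT
# Prints the restart-related properties systemd has loaded for UNIT.
# RestartSec= from the unit file is reported as the RestartUSec property.
restart_policy() {
    systemctl show -p Restart -p RestartUSec "${1:?unit name required}"
}
```

After the change above, `restart_policy rabbitmq-server.service` should list Restart=on-failure along with the configured delay.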

Now get RabbitMQ’s server PID:

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
Main PID: 14668 (rabbitmq-server)

Kill it with SIGKILL (see the post Linux&FreeBSD: the kill and nohup commands, signals and process management) so that the on-failure condition applies:

root@bttrm-dev-console:/home/admin# kill -9 14668
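
The two manual steps above (read the PID, then kill it) can be combined. A minimal sketch, assuming a systemd version whose systemctl show supports the --value flag; kill_main_pid is a hypothetical helper name:

```shell
# kill_main_pid UNIT
# Looks up the main PID of UNIT and sends it SIGKILL, to simulate a crash.
# Returns 1 if the unit has no running main process.
kill_main_pid() {
    unit="${1:?unit name required}"
    pid="$(systemctl show -p MainPID --value "$unit")" || return 1
    case "$pid" in
        # MainPID is 0 (or empty) when the service is not running
        ''|0|*[!0-9]*)
            echo "$unit: no running main process" >&2
            return 1 ;;
    esac
    kill -KILL "$pid"
}
```

Usage: `kill_main_pid rabbitmq-server.service`, run as root.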

Check its status:

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Main PID: 14668 (code=killed, signal=KILL)


Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.

And after one minute:

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server...

Logs again:

Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console' ...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.

“Service hold-off time over, scheduling restart” – this is our 60-second delay at work.
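
Instead of re-running systemctl status by hand until the hold-off time has passed, you can poll the unit. A minimal sketch; wait_until_active is a hypothetical helper, and the default timeout of 90 seconds is an assumption chosen to cover RestartSec=60s plus some startup time:

```shell
# wait_until_active UNIT [TIMEOUT_SECONDS]
# Polls once per second until UNIT is active again.
# Returns 0 as soon as it is active, 1 if the timeout expires first.
wait_until_active() {
    unit="${1:?unit name required}"
    timeout="${2:-90}"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if systemctl is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Usage: `wait_until_active rabbitmq-server.service && echo "RabbitMQ is back"`.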

email notification

Now let’s add an email notification that is sent when RabbitMQ goes down with an error.

Send a test email first:

root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice"

For the notification itself you can use either ExecStopPost= or OnFailure=. OnFailure= looks better, so let’s use it.

Create the /etc/systemd/system/rabbitmq-notify-email@.service template unit; %i expands to the instance name given after the @:

[Unit]
Description=%i failure email notification

[Service]
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification"'
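
The bash one-liner in ExecStart= can also be written as a small function, which is easier to test by hand before wiring it into systemd. A minimal sketch: notify_failure is a hypothetical name, and the recipient address is a placeholder you have to supply yourself.

```shell
# notify_failure UNIT RECIPIENT
# Mails the current "systemctl status" output of UNIT to RECIPIENT.
notify_failure() {
    unit="${1:-}"
    recipient="${2:-}"
    if [ -z "$unit" ] || [ -z "$recipient" ]; then
        echo "usage: notify_failure UNIT RECIPIENT" >&2
        return 2
    fi
    # systemctl status exits non-zero for a failed unit; that is expected
    # here, so its exit code is ignored and only the output is mailed.
    { systemctl status --no-pager "$unit" 2>&1 || true; } |
        mailx -s "[$unit] failure notification" "$recipient"
}
```

For a manual check, run something like `notify_failure rabbitmq-server admin@example.com`, with the placeholder address replaced by a real mailbox.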

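Add the OnFailure= option to the [Unit] block of rabbitmq-server.service using systemctl edit. The part after the @ is what the template receives as %i:

Description=RabbitMQ Messaging Server
OnFailure=rabbitmq-notify-email@rabbitmq-server.service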

Do not forget to reload systemd files:

root@bttrm-dev-console:/home/admin# systemctl daemon-reload

Kill RabbitMQ again:

root@bttrm-dev-console:/home/admin# kill -9 29970

Check logs:

Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.

Note these two lines:

  1. Triggering OnFailure= dependencies.
  2. Started rabbitmq-server failure email notification.

Okay, everything works.

Mail logs:

root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt [] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => R=dnslookup T=remote_smtp [] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC," C="250 2.0.0 OK  1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU [] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => R=dnslookup T=remote_smtp [] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC," C="250 2.0.0 OK  1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= U=root P=local S=1331

If you didn’t get the email, check exim’s queue:

root@bttrm-dev-console:/home/admin# exim -bp
0m  1.2K 1gzL3R-0000dn-5h <>

The message is stuck in the queue.

Run the queue manually:

root@bttrm-dev-console:/home/admin# runq

Check logs:

root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => R=dnslookup T=remote_smtp [] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC," C="250 2.0.0 OK  1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed
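
The check-then-flush routine above can be scripted. A minimal sketch, assuming the exim binary is on PATH under that name; flush_if_queued is a hypothetical helper. It uses exim -bpc, which prints the number of queued messages, and exim -qf, which starts a queue run that forces a delivery attempt for each message:

```shell
# flush_if_queued
# Starts a forced queue run, but only if the exim queue is not empty.
flush_if_queued() {
    count="$(exim -bpc)" || return 1
    if [ "$count" -gt 0 ]; then
        exim -qf
    fi
}
```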

And the email arrives.

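To work around the delivery problem (it is not clear why exim does not send the messages right away), add a dirty “hack” to /etc/systemd/system/rabbitmq-notify-email@.service: an ExecStartPost= option that runs the queue right after the email is submitted. The runq path below is an assumption, so adjust it if runq lives elsewhere on your system:

ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification"'
ExecStartPost=/usr/sbin/runq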

To remove old messages from the queue, use their IDs:

root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed