Linux: editing systemd unit files, restart on failure, and email notifications

By | 03/01/2019
 

We have a RabbitMQ service which sometimes can go down.

So we need to:

  1. restart it if it exits with a failure
  2. send an email notification

Let’s do it via RabbitMQ’s systemd service (there are other options too, e.g. monit; see the post “Monit: monitoring and restarting NGINX”).

We will use two options here:

  • RestartSec=: the delay before a restart – to give any pending disk I/O a chance to finish, just in case
  • Restart=: the condition under which the service is restarted

The available values for Restart=, and the exit causes each of them reacts to:

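Table 2. Exit causes and the effect of the Restart= settings on them

Restart settings/Exit causes | no | always | on-success | on-failure | on-abnormal | on-abort | on-watchdog
-----------------------------+----+--------+------------+------------+-------------+----------+------------
Clean exit code or signal    |    | X      | X          |            |             |          |
Unclean exit code            |    | X      |            | X          |             |          |
Unclean signal               |    | X      |            | X          | X           | X        |
Timeout                      |    | X      |            | X          | X           |          |
Watchdog                     |    | X      |            | X          | X           |          | X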
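The table can also be read as a simple lookup. Below is a minimal sketch of that lookup in shell; the function name and the short cause labels (clean, unclean-exit, unclean-signal, timeout, watchdog) are made up for this example and are not systemd identifiers.

```shell
# would_restart SETTING CAUSE -> prints "yes" or "no"
# SETTING is a Restart= value; CAUSE is one of the hypothetical labels:
# clean, unclean-exit, unclean-signal, timeout, watchdog
would_restart() {
    setting=$1
    cause=$2
    case "$setting" in
        no)          allowed="" ;;
        always)      allowed="clean unclean-exit unclean-signal timeout watchdog" ;;
        on-success)  allowed="clean" ;;
        on-failure)  allowed="unclean-exit unclean-signal timeout watchdog" ;;
        on-abnormal) allowed="unclean-signal timeout watchdog" ;;
        on-abort)    allowed="unclean-signal" ;;
        on-watchdog) allowed="watchdog" ;;
        *) echo "unknown Restart= value: $setting" >&2; return 2 ;;
    esac
    # Pad with spaces so that only whole labels match
    case " $allowed " in
        *" $cause "*) echo yes ;;
        *)            echo no ;;
    esac
}

would_restart on-failure unclean-signal   # the "kill -9" case
would_restart on-success unclean-signal
```

So a process killed with SIGKILL (an unclean signal) is restarted with on-failure, but not with on-success.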

Editing systemd unit files

RabbitMQ’s default unit file is /lib/systemd/system/rabbitmq-server.service.

You can observe it using systemctl cat:

[simterm]

admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service 
# /lib/systemd/system/rabbitmq-server.service
[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

[Install]
WantedBy=multi-user.target

[/simterm]

Do not edit it (or any other file) directly in /lib/systemd/system/, as it will be overwritten by the next upgrade of the rabbitmq-server package.

When you need to change a service’s default behavior, put your new files in the /etc/systemd/system directory.

To edit an existing service, use systemctl edit foo.service with the --full option:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service

[/simterm]

This opens a temporary file, something like /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8, pre-filled with the current /lib/systemd/system/rabbitmq-server.service content, where you can make your changes.

Restart on failure

Add both options here – Restart=on-failure and RestartSec=60s:

[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

Restart=on-failure
RestartSec=60s

[Install]
WantedBy=multi-user.target
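If only these two options need to change, a drop-in file that holds just the changed keys may be enough; this is the kind of file that systemctl edit without --full creates. A minimal sketch, not what was used on this host: UNIT_DIR is a variable name chosen for the example, and it defaults to a scratch directory so the snippet is safe to run.

```shell
# UNIT_DIR is a name chosen for this sketch. It defaults to a scratch
# directory; set it to /etc/systemd/system (as root) to apply it for real.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"
dropin_dir="$UNIT_DIR/rabbitmq-server.service.d"
mkdir -p "$dropin_dir"

# Only the keys that differ from the packaged unit go into the drop-in
cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF

echo "wrote $dropin_dir/override.conf"
```

The packaged unit file stays untouched this way, and systemctl cat shows both the original file and the drop-in.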

Re-read systemd’s config files:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl daemon-reload

[/simterm]

The edited unit is saved as /etc/systemd/system/rabbitmq-server.service, which takes precedence over the packaged file in /lib/systemd/system/.

Now get RabbitMQ’s server PID:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
 Main PID: 14668 (rabbitmq-server)

[/simterm]
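To avoid copying the PID by hand, the "Main PID" line can be parsed in a script. A small sketch; the function name is made up for this example, and it only does text processing on the status output shown above.

```shell
# Read "systemctl status" output on stdin and print the number
# that follows "Main PID:"
main_pid_from_status() {
    awk '/Main PID:/ { print $3; exit }'
}

# Demo with the line from the output above
printf ' Main PID: 14668 (rabbitmq-server)\n' | main_pid_from_status
```

On a real host it would be used as: systemctl status rabbitmq-server.service | main_pid_from_status. The MainPID property printed by systemctl show -p MainPID rabbitmq-server.service is another way to get the same number.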

Kill it with SIGKILL (see the post “Linux&FreeBSD: kill and nohup commands, signals and process management”) so that the on-failure condition applies:

[simterm]

root@bttrm-dev-console:/home/admin# kill -9 14668

[/simterm]

Check its status:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
  Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
 Main PID: 14668 (code=killed, signal=KILL)

[/simterm]

Logs:

[simterm]

Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.

[/simterm]

And after one minute:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
...
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server...

[/simterm]

Logs again:

[simterm]

Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console' ...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.

[/simterm]

“Service hold-off time over, scheduling restart” – this is our 60-second delay.
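The delay can be checked against the log timestamps: the kill was logged at 13:26:00 and the restart was scheduled at 13:27:01. A tiny sketch of that arithmetic; the helper names are made up, and both timestamps are assumed to be on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# elapsed START END -> seconds between two same-day timestamps
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 13:26:00 13:27:01   # prints 61
```

That is the 60 seconds from RestartSec= plus about a second of overhead.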

Email notification

Now let’s add an email notification to be sent if RabbitMQ goes down with an error.

Send a test email first:

[simterm]

root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice" [email protected]

[/simterm]

Now you can use ExecStopPost= or OnFailure=. OnFailure looks better – let’s use it.
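One reason OnFailure= is more convenient: ExecStopPost= commands run after every stop, clean ones included, so a script called from there has to filter the result itself. Recent systemd versions pass the result to such commands in the $SERVICE_RESULT environment variable ("success" for a clean stop). A hedged sketch of such a filter; the function name and the messages are made up, and it only prints what it would do instead of sending mail.

```shell
# should_notify RESULT -> exit status 0 if a notification is warranted.
# RESULT is the value of $SERVICE_RESULT; an empty value is treated
# as "success" (an assumption made for this sketch).
should_notify() {
    [ "${1:-success}" != "success" ]
}

if should_notify "${SERVICE_RESULT:-}"; then
    echo "would send mail: result=${SERVICE_RESULT}"
else
    echo "clean stop, no mail"
fi
```

With OnFailure= none of this is needed, since the dependency is triggered only when the unit enters the failed state.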

Create the /etc/systemd/system/rabbitmq-notify-email@.service file:

[Unit]
Description=%i failure email notification

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" [email protected]'
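The %i specifier in this template unit stands for the instance name, i.e. the part of the unit name between the first "@" and the type suffix. The helper below mimics that expansion in plain shell, just to show what ends up in the mail subject; the function name is made up for this example.

```shell
# instance_of UNIT_NAME -> prints the instance part of a unit name,
# or an empty line if the name has no "@"
instance_of() {
    name=${1%.*}                      # drop the type suffix (.service)
    case "$name" in
        *@*) printf '%s\n' "${name#*@}" ;;   # text after the first "@"
        *)   printf '\n' ;;
    esac
}

instance_of rabbitmq-notify-email@rabbitmq-server.service
```

So when the unit is started as rabbitmq-notify-email@rabbitmq-server.service, the subject becomes "[rabbitmq-server] failure notification".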

Using systemctl edit, add the OnFailure option to the [Unit] block of rabbitmq-server.service:

[Unit]
Description=RabbitMQ Messaging Server
After=network.target
OnFailure=rabbitmq-notify-email@%i.service
...

Do not forget to reload systemd files:

[simterm]

root@bttrm-dev-console:/home/admin# systemctl daemon-reload

[/simterm]

Kill RabbitMQ again:

[simterm]

root@bttrm-dev-console:/home/admin# kill -9 29970

[/simterm]

Check logs:

[simterm]

Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.

[/simterm]

  1. Triggering OnFailure= dependencies.
  2. Started rabbitmq-server failure email notification.

Okay – everything works.

Mail logs:

[simterm]

root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt H=alt2.aspmx.l.google.com [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt H=alt1.aspmx.l.google.com [172.217.192.27] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => [email protected] R=dnslookup T=remote_smtp H=alt2.aspmx.l.google.com [74.125.193.27] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU H=alt2.aspmx.l.google.com [74.125.193.27] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU H=aspmx2.googlemail.com [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => [email protected] R=dnslookup T=remote_smtp H=aspmx3.googlemail.com [74.125.193.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= [email protected] U=root P=local S=1331

[/simterm]

If you didn’t get an email, check exim’s queue:

[simterm]

root@bttrm-dev-console:/home/admin# exim -bp
 0m  1.2K 1gzL3R-0000dn-5h <[email protected]>
          [email protected]

[/simterm]

The message is stuck in the queue.

Run the queue manually:

[simterm]

root@bttrm-dev-console:/home/admin# runq

[/simterm]

Check logs:

[simterm]

root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= [email protected] U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h H=aspmx.l.google.com [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => [email protected] R=dnslookup T=remote_smtp H=aspmx.l.google.com [173.194.68.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed

[/simterm]

And the email is delivered.

To work around the email sending issue (not sure why exim doesn’t send the messages right away) – add a dirty “hack”, the ExecStartPost option, to /etc/systemd/system/rabbitmq-notify-email@.service:

...
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" [email protected]'
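# absolute path assumed (/usr/sbin/runq on Debian); older systemd versions reject bare command names
ExecStartPost=/usr/sbin/runq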
...

To remove old messages from the queue, use their IDs:

[simterm]

root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed

[/simterm]
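If there are many stuck messages, the IDs can be pulled out of the exim -bp listing instead of being copied one by one. A rough sketch, assuming the listing format shown earlier (the ID is the third field of each message's first line); the function name and the example addresses are made up.

```shell
# Read "exim -bp" output on stdin and print one message ID per line
queued_ids() {
    awk '$3 ~ /^[0-9A-Za-z]+-[0-9A-Za-z]+-[0-9A-Za-z]+$/ { print $3 }'
}

# Demo with a listing in the same shape as the one above
printf ' 0m  1.2K 1gzL3R-0000dn-5h <sender@example.org>\n          rcpt@example.org\n' | queued_ids
```

On a real host: exim -bp | queued_ids | xargs -r exim -Mrm. It is worth checking the printed IDs first, since the removal cannot be undone.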

Done.