Redis: Sentinel – bind 0.0.0.0, the localhost issue and the announce-ip option

By | 04/10/2019
 

Originally, in a Sentinel’s configs, I have used the bind 0.0.0.0 to make them accessible from external hosts.

Because of this when I started rolling out this setup on a real environment faced with an issue when Sentinels could not determine a Master host and other Sentinel hosts.

In this post – such an issue example and its solution.

In fact – there were much more issues (on Friday evening) but I couldn’t reproduce them (on Monday morning).

Issues appeared when I rolled Redis replication from an Ansible role so some examples here will be in Ansible’s templates.

The current setup

Redis nodes and Sentinel instances are running on an AWS EC2 servers.

In the example below will use the next names:

  1. Console host: also is a Master – here is Redis Master node is running and the first Sentinel instance
  2. App1 and App2: another two ЕС2 with Redis replicas and two Sentinel instances

Redis replication configs

Redis Master config, just the main part. File redis-cluster-master.conf.j2:

bind 0.0.0.0
port 6389
...

Slave nodes, common config, file redis-cluster-slave.conf.j2:

slaveof dev.backend-console-internal.example.com 6379
bind 0.0.0.0
port 6389
...

Deploy, check:

root@bttrm-dev-console:/etc/redis-cluster# redis-cli -p 6389 info replication
Replication
role:master
connected_slaves:2
slave0:ip=10.0.2.91,port=6389,state=online,offset=1219,lag=1
slave1:ip=10.0.2.71,port=6389,state=online,offset=1219,lag=1
...

Loos good so far.

Redis Sentinels configs

Now a Sentinels config – redis-cluster-sentinel.conf.j2:

sentinel monitor {{ redis_cluster_name }} dev.backend-console-internal.example.com 6389 2
bind 0.0.0.0
port 26389
sentinel down-after-milliseconds {{ redis_cluster_name }} 6001
sentinel failover-timeout {{ redis_cluster_name }} 60000
sentinel parallel-syncs {{ redis_cluster_name }} 1
daemonize yes
logfile {{ redis_cluster_logs_home }}/redis-sentinel.log
pidfile {{ redis_cluster_runtime_home }}/redis-sentinel.pid

Deploy, check:

root@bttrm-dev-console:/etc/redis-cluster# redis-cli -p 26389 info sentinel
Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=redis-dev-cluster,status=ok,address=127.0.0.1:6389,slaves=2,sentinels=3

Looks like still good?

But nope.

An IPs issue

Check Sentinel’s log on the App-1:

root@bttrm-dev-app-1:/etc/redis-cluster# tail -f /var/log/redis-cluster/redis-sentinel.log
3163:X 08 Apr 14:24:18.586 * +sentinel-address-switch master redis-dev-cluster 10.0.2.104 6389 ip 10.0.2.104 port 26389 for 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
3163:X 08 Apr 14:24:19.034 * +sentinel-address-switch master redis-dev-cluster 10.0.2.104 6389 ip 127.0.0.1 port 26389 for 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
3163:X 08 Apr 14:24:20.653 * +sentinel-address-switch master redis-dev-cluster 10.0.2.104 6389 ip 10.0.2.104 port 26389 for 32aca990c4e875eab7ba8cd0a7c4e984d584e18c

And its config already updated by Sentinel:

root@bttrm-dev-app-1:/etc/redis-cluster# cat redis-sentinel.conf
sentinel myid a8fdd554a587467aadd811989c78d601433a2f37
bind 0.0.0.0
port 26389
sentinel monitor redis-dev-cluster 10.0.2.104 6389 2
...
Generated by CONFIG REWRITE
dir "/"
maxclients 4064
sentinel config-epoch redis-dev-cluster 0
sentinel leader-epoch redis-dev-cluster 0
sentinel known-slave redis-dev-cluster 10.0.2.71 6389
sentinel known-slave redis-dev-cluster 10.0.2.91 6389
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 a8fdd554a587467aadd811989c78d601433a2f37
sentinel current-epoch 0

Here is two issues at the same time:

  1. sentinel-address-switch master redis-dev-cluster 10.0.2.104 6389 ip 127.0.0.1 – in logs we see that Master’s IP is constantly changed between 10.0.2.104 (the Master’s EC2 address) and 127.0.0.1
  2. in the Sentinel config three recodrds known-sentinel were added althought must be only two, same as on the Master

Will back to the Master later, and now let’s take one more look at the Sentinel’s config.

In the sentinel myid a8fdd554a587467aadd811989c78d601433a2f37 line we can see the ID of this particular Sentinel instance.

Check config again:

sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 a8fdd554a587467aadd811989c78d601433a2f37

Here:

  • 10.0.2.91 26389 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d – is Sentinel on the App-2
  • 127.0.0.1 26389 32aca990c4e875eab7ba8cd0a7c4e984d584e18c – it’s Master (!) on the Console, and its IP – 127.0.0.1, althought we are checking from the App-1 host
  • 127.0.0.1 26389 a8fdd554a587467aadd811989c78d601433a2f37 – it’s Sentinel on the App-1, i.e. current instance

The similar picture is on the App-2 host:

root@bttrm-dev-app-2:/etc/redis-cluster# cat redis-sentinel.conf
sentinel myid 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d
bind 0.0.0.0
port 26389
sentinel monitor redis-dev-cluster 10.0.2.104 6389 2
...
Generated by CONFIG REWRITE
dir "/"
maxclients 4064
sentinel config-epoch redis-dev-cluster 0
sentinel leader-epoch redis-dev-cluster 0
sentinel known-slave redis-dev-cluster 10.0.2.71 6389
sentinel known-slave redis-dev-cluster 10.0.2.91 6389
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d
sentinel known-sentinel redis-dev-cluster 127.0.0.1 26389 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 a8fdd554a587467aadd811989c78d601433a2f37
sentinel current-epoch 0

And for comparison – the Master’s Sentinel config:

root@bttrm-dev-console:/etc/redis-cluster# cat redis-sentinel.conf
sentinel myid 32aca990c4e875eab7ba8cd0a7c4e984d584e18c
bind 0.0.0.0
port 26389
sentinel monitor redis-dev-cluster 127.0.0.1 6389 2
...
Generated by CONFIG REWRITE
dir "/"
maxclients 4064
sentinel config-epoch redis-dev-cluster 0
sentinel leader-epoch redis-dev-cluster 0
sentinel known-slave redis-dev-cluster 10.0.2.91 6389
sentinel known-slave redis-dev-cluster 10.0.2.71 6389
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 a8fdd554a587467aadd811989c78d601433a2f37
sentinel current-epoch 0

Also, when execute info sentinel on the Master and slaves – the’ll have different Sentinels number:

root@bttrm-dev-console:/etc/redis-cluster# redis-cli -p 26389 info sentinel | tail -1
master0:name=redis-dev-cluster,status=ok,address=127.0.0.1:6389,slaves=2,sentinels=3

And on the slaves:

root@bttrm-dev-app-1:/etc/redis-cluster# redis-cli -p 26389 info sentinel | tail -1
master0:name=redis-dev-cluster,status=ok,address=10.0.2.104:6389,slaves=2,sentinels=4

I.e.:

  1. at first – the Sentinel instance added itself to the known-sentinel althought he’s not supposed to
  2. and secodn thing – the Master is added as 127.0.0.1 instead of its real IP

This lead to a real proble during new master election.

Stop the Redis Master:

root@bttrm-dev-console:/etc/redis-cluster# systemctl stop redis-cluster.service

Check Sentinel’s logs on the Master host:

30848:X 08 Apr 14:36:36.155 # +sdown master redis-dev-cluster 127.0.0.1 6389
30848:X 08 Apr 14:36:36.436 # +new-epoch 1

Here is no odown (objective down) at all (see SDOWN and ODOWN failure state)

And on the both slaves – no election process completed.

The App-1 logs:

3163:X 08 Apr 14:36:36.087 # +sdown master redis-dev-cluster 10.0.2.104 6389
3163:X 08 Apr 14:36:36.155 # +odown master redis-dev-cluster 10.0.2.104 6389 #quorum 2/2
3163:X 08 Apr 14:36:36.155 # +new-epoch 1
3163:X 08 Apr 14:36:36.155 # +try-failover master redis-dev-cluster 10.0.2.104 6389
3163:X 08 Apr 14:36:36.157 # +vote-for-leader a8fdd554a587467aadd811989c78d601433a2f37 1
3163:X 08 Apr 14:36:36.158 # a8fdd554a587467aadd811989c78d601433a2f37 voted for a8fdd554a587467aadd811989c78d601433a2f37 1
3163:X 08 Apr 14:36:36.163 # 8a705b2e0050b0bd8935e1c3efd1a28fde5d581d voted for a8fdd554a587467aadd811989c78d601433a2f37 1
3163:X 08 Apr 14:36:36.220 # +elected-leader master redis-dev-cluster 10.0.2.104 6389
3163:X 08 Apr 14:36:36.220 # +failover-state-select-slave master redis-dev-cluster 10.0.2.104 6389
3163:X 08 Apr 14:36:36.311 # -failover-abort-no-good-slave master redis-dev-cluster 10.0.2.104 6389
3163:X 08 Apr 14:36:36.377 # Next failover delay: I will not start a failover before Mon Apr  8 14:38:36 2019

App-2:

3165:X 08 Apr 14:36:36.160 # +new-epoch 1
3165:X 08 Apr 14:36:36.162 # +vote-for-leader a8fdd554a587467aadd811989c78d601433a2f37 1
3165:X 08 Apr 14:36:36.168 # +sdown master redis-dev-cluster 10.0.2.104 6389
3165:X 08 Apr 14:36:36.235 # +odown master redis-dev-cluster 10.0.2.104 6389 #quorum 3/2
3165:X 08 Apr 14:36:36.235 # Next failover delay: I will not start a failover before Mon Apr  8 14:38:36 2019

Rubbish.

The solution

The solution was to add sentinel announce-ip in each Sentinel’s instance config.

See the Sentinel, Docker, NAT, and possible issues although I didn’t get why this appears in an EC2’s private network where no NAT between instances.

Stop Sentinels, restore their config to an origin view and add the sentinel announce-ip with a public IP (“public” here is an instance’s Private IP as they all are in a private network in a VPC).

For example, the Master host config will look like:

bind 0.0.0.0
port 26389
sentinel monitor redis-dev-cluster dev.backend-console-internal.example.com 6389 2
sentinel down-after-milliseconds redis-dev-cluster 6001
sentinel failover-timeout redis-dev-cluster 60000
daemonize yes
logfile "/var/log/redis-cluster/redis-sentinel.log"
pidfile "/var/run/redis-cluster/redis-sentinel.pid"
sentinel announce-ip 10.0.2.104

Check Sentinel’s status on the Master host:

root@bttrm-dev-console:/etc/redis-cluster# redis-cli -p 26389 info sentinel
Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=redis-dev-cluster,status=ok,address=127.0.0.1:6389,slaves=1,sentinels=3

And configs.

Master:

root@bttrm-dev-console:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 6072b737a93cf7812388be21360b5cb058343f4d
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 57869e8b8914861cc5a80895d3fede9259ce11f6

Okay…

App-1:

root@bttrm-dev-app-1:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 6072b737a93cf7812388be21360b5cb058343f4d
sentinel known-sentinel redis-dev-cluster 10.0.2.104 26389 1218b46be16fb759d52de6919de787c5492b4991

Okay…

App-2:

root@bttrm-dev-app-2:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 57869e8b8914861cc5a80895d3fede9259ce11f6
sentinel known-sentinel redis-dev-cluster 10.0.2.104 26389 1218b46be16fb759d52de6919de787c5492b4991

Okay.

And the final thing is to add this to an Ansible template.

To not create a dedicated template for each host – the lookup() can be used, see the Ansible: get a target host’s IP.

Update the redis-cluster-sentinel.conf.j2:

sentinel monitor {{ redis_cluster_name }} dev.backend-console-internal.example.com 6389 2
bind 0.0.0.0
port 26389
sentinel announce-ip {{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}
sentinel down-after-milliseconds {{ redis_cluster_name }} 6001
sentinel failover-timeout {{ redis_cluster_name }} 60000
sentinel parallel-syncs {{ redis_cluster_name }} 1
daemonize yes
logfile {{ redis_cluster_logs_home }}/redis-sentinel.log
pidfile {{ redis_cluster_runtime_home }}/redis-sentinel.pid

By the way: you can the same approach but just with the bind parameter.

Deploy, check.

Master:

root@bttrm-dev-console:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 8148dc7b30a692af02aee44ff051bee129710618
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 156a28ee1c3db77876c0e7326694313c24a56dc2

App-1:

root@bttrm-dev-app-1:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.91 26389 156a28ee1c3db77876c0e7326694313c24a56dc2
sentinel known-sentinel redis-dev-cluster 10.0.2.104 26389 b1fafcde1685861736930c7a88819b2aeac49eea

App-2:

root@bttrm-dev-app-2:/etc/redis-cluster# cat redis-sentinel.conf | grep known-sentinel
sentinel known-sentinel redis-dev-cluster 10.0.2.71 26389 8148dc7b30a692af02aee44ff051bee129710618
sentinel known-sentinel redis-dev-cluster 10.0.2.104 26389 b1fafcde1685861736930c7a88819b2aeac49eea

“It works!” (c)

And but the way – constantly sentinel-address-switch master disappeared as well.

Now – let’s stop Redis Maser to check if failover will works.

The Sentinel’s IDs at this moment:

  1. console/master: b1fafcde1685861736930c7a88819b2aeac49eea
  2. app1: 8148dc7b30a692af02aee44ff051bee129710618
  3. app2: 156a28ee1c3db77876c0e7326694313c24a56dc2

Master’s log:

1744:X 08 Apr 16:17:15.495 # +sdown master redis-dev-cluster 127.0.0.1 6389
1744:X 08 Apr 16:17:15.745 # +new-epoch 1
1744:X 08 Apr 16:17:16.809 # +config-update-from sentinel 156a28ee1c3db77876c0e7326694313c24a56dc2 10.0.2.91 26389 @ redis-dev-cluster 127.0.0.1 6389
1744:X 08 Apr 16:17:16.809 # +switch-master redis-dev-cluster 127.0.0.1 6389 10.0.2.71 6389
1744:X 08 Apr 16:17:16.809 * +slave slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.71 6389
1744:X 08 Apr 16:17:16.809 * +slave slave 127.0.0.1:6389 127.0.0.1 6389 @ redis-dev-cluster 10.0.2.71 6389
1744:X 08 Apr 16:17:22.823 # +sdown slave 127.0.0.1:6389 127.0.0.1 6389 @ redis-dev-cluster 10.0.2.71 6389

App-1:

4954:X 08 Apr 16:17:15.411 # +sdown master redis-dev-cluster 10.0.2.104 6389
4954:X 08 Apr 16:17:15.548 # +new-epoch 1
4954:X 08 Apr 16:17:15.550 # +vote-for-leader 156a28ee1c3db77876c0e7326694313c24a56dc2 1
4954:X 08 Apr 16:17:16.539 # +odown master redis-dev-cluster 10.0.2.104 6389 #quorum 2/2
4954:X 08 Apr 16:17:16.540 # Next failover delay: I will not start a failover before Mon Apr  8 16:19:16 2019
4954:X 08 Apr 16:17:16.809 # +config-update-from sentinel 156a28ee1c3db77876c0e7326694313c24a56dc2 10.0.2.91 26389 @ redis-dev-cluster 10.0.2.104 6389
4954:X 08 Apr 16:17:16.809 # +switch-master redis-dev-cluster 10.0.2.104 6389 10.0.2.71 6389
4954:X 08 Apr 16:17:16.810 * +slave slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.71 6389
4954:X 08 Apr 16:17:16.810 * +slave slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389
4954:X 08 Apr 16:17:22.859 # +sdown slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389

And App-2:

4880:X 08 Apr 16:17:15.442 # +sdown master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.542 # +odown master redis-dev-cluster 10.0.2.104 6389 #quorum 2/2
4880:X 08 Apr 16:17:15.543 # +new-epoch 1
4880:X 08 Apr 16:17:15.543 # +try-failover master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.545 # +vote-for-leader 156a28ee1c3db77876c0e7326694313c24a56dc2 1
4880:X 08 Apr 16:17:15.551 # 8148dc7b30a692af02aee44ff051bee129710618 voted for 156a28ee1c3db77876c0e7326694313c24a56dc2 1
4880:X 08 Apr 16:17:15.604 # +elected-leader master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.604 # +failover-state-select-slave master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.671 # +selected-slave slave 10.0.2.71:6389 10.0.2.71 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.671 * +failover-state-send-slaveof-noone slave 10.0.2.71:6389 10.0.2.71 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:15.742 * +failover-state-wait-promotion slave 10.0.2.71:6389 10.0.2.71 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:16.711 # +promoted-slave slave 10.0.2.71:6389 10.0.2.71 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:16.712 # +failover-state-reconf-slaves master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:16.808 * +slave-reconf-sent slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:17.682 # -odown master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:17.759 * +slave-reconf-inprog slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:17.759 * +slave-reconf-done slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:17.849 # +failover-end master redis-dev-cluster 10.0.2.104 6389
4880:X 08 Apr 16:17:17.849 # +switch-master redis-dev-cluster 10.0.2.104 6389 10.0.2.71 6389
4880:X 08 Apr 16:17:17.850 * +slave slave 10.0.2.91:6389 10.0.2.91 6389 @ redis-dev-cluster 10.0.2.71 6389
4880:X 08 Apr 16:17:17.850 * +slave slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389
4880:X 08 Apr 16:17:23.853 # +sdown slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389

Check current Master’s IP:

root@bttrm-dev-console:/etc/redis-cluster# redis-cli -h 10.0.2.104 -p 26389 sentinel get-master-addr-by-name redis-dev-cluster
1) "10.0.2.71"
2) "6389"

10.0.2.71 – it’s the App-1 host.

Check replication status here:

root@bttrm-dev-app-1:/etc/redis-cluster# redis-cli -p 6389 info replication
Replication
role:master
connected_slaves:1
slave0:ip=10.0.2.91,port=6389,state=online,offset=70814,lag=1

role:master – roles switch worked well, all good.

slave0:ip=10.0.2.91 – only one slave now as Redis node on the Master host was stopped.

Start it:

root@bttrm-dev-console:/etc/redis-cluster# systemctl start redis-cluster.service

Sentinel log:

4954:X 08 Apr 16:23:43.337 # -sdown slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389
4954:X 08 Apr 16:23:53.351 * +convert-to-slave slave 10.0.2.104:6389 10.0.2.104 6389 @ redis-dev-cluster 10.0.2.71 6389

Nod is Up and Sentinel changed it to the Slave-mode.

Also saw another issue when Redis Sentinel recofigured Redis node in such a way when it became Slave of itself:

14542:S 04 Apr 13:25:35.187 * SLAVE OF 127.0.0.1:6389 enabled (user request from ‘id=15 addr=10.0.2.104:40087 fd=5 name=sentinel-dc0483ad-cmd age=60 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 event
s=r cmd=exec’)
14542:S 04 Apr 13:25:35.187 # CONFIG REWRITE executed with success.
14542:S 04 Apr 13:25:36.059 * Connecting to MASTER 127.0.0.1:6389
14542:S 04 Apr 13:25:36.060 * MASTER <-> SLAVE sync started
14542:S 04 Apr 13:25:36.060 * Non blocking connect for SYNC fired the event.
14542:S 04 Apr 13:25:36.060 * Master replied to PING, replication can continue…
14542:S 04 Apr 13:25:36.060 * Partial resynchronization not possible (no cached master)
14542:S 04 Apr 13:25:36.060 * Master does not support PSYNC or is in error state (reply: -ERR Can’t SYNC while not connected with my master)

Unfrotunatelly – could’t reproduce it.