Bringing Redis cluster back up if in failed state

Hi,

I have set up a Redis cluster consisting of 5 master nodes. I noticed that the cluster sometimes enters a “failed” state when one of the master nodes is down. How can I bring the entire cluster back up in a simple and reliable manner?

For example, sometimes when I shut down the Redis server on a node and start it again within a minute, the remaining nodes in the cluster return from a failed state to an OK state. Other times, when I shut down the node and bring it back up after waiting 5 minutes or so, both the node and the rest of the cluster stay in a failed state.

E.g. output from the Redis node that was shut down and brought back up 5 minutes later:
127.0.0.1:6379> cluster info
cluster_state:fail
cluster_slots_assigned:13108
cluster_slots_ok:13108
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:5
cluster_size:5
cluster_current_epoch:5
cluster_my_epoch:2
cluster_stats_messages_ping_sent:3
cluster_stats_messages_pong_sent:4
cluster_stats_messages_sent:7
cluster_stats_messages_ping_received:4
cluster_stats_messages_pong_received:3
cluster_stats_messages_received:7
127.0.0.1:6379> cluster nodes
94fb993416d0babdeb4259e1936d3aac5e1931e0 :0@0 master,noaddr - 1653291601882 1653291601879 3 disconnected 6554-9829
414da71242dd4368948ec52e13a77f65f72e7869 :0@0 master,noaddr - 1653291601879 1653291601879 1 disconnected 2710
9df194415bf3eb084479513035d4c20ceceea36a :0@0 master,noaddr - 1653291601882 1653291601879 4 disconnected 9830-13106
93081ed37da5e945834a3381d0937670493e23e3 :0@0 master,noaddr - 1653291601882 1653291601879 5 disconnected 13107-16383
212fcd0b1e80b9d22ee7c0efd5f1d8ccdaaa5ae1 10.44.163.102:6379@16379 myself,master - 0 1653291601879 2 connected 3277-6553
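
Note that this node reports cluster_slots_assigned:13108 rather than 16384. To see which slots are missing from its view, here is a minimal sketch that tallies the slot ranges in the `cluster nodes` text above. It is plain Python with no Redis client; the helper names `parse_slot_ranges` and `missing_slot_ranges` are hypothetical, and it assumes the slot entries start at the ninth whitespace-separated field, as they do in the output I pasted.

```python
# Output of "cluster nodes" from the restarted node, copied from above.
NODES_OUTPUT = """\
94fb993416d0babdeb4259e1936d3aac5e1931e0 :0@0 master,noaddr - 1653291601882 1653291601879 3 disconnected 6554-9829
414da71242dd4368948ec52e13a77f65f72e7869 :0@0 master,noaddr - 1653291601879 1653291601879 1 disconnected 2710
9df194415bf3eb084479513035d4c20ceceea36a :0@0 master,noaddr - 1653291601882 1653291601879 4 disconnected 9830-13106
93081ed37da5e945834a3381d0937670493e23e3 :0@0 master,noaddr - 1653291601882 1653291601879 5 disconnected 13107-16383
212fcd0b1e80b9d22ee7c0efd5f1d8ccdaaa5ae1 10.44.163.102:6379@16379 myself,master - 0 1653291601879 2 connected 3277-6553
"""


def parse_slot_ranges(cluster_nodes_text):
    """Return a sorted list of (start, end) slot ranges found in the text."""
    ranges = []
    for line in cluster_nodes_text.strip().splitlines():
        fields = line.split()
        # Fields 0-7: id, address, flags, master id, ping-sent, pong-recv,
        # config epoch, link state. Anything after that is a slot entry.
        for entry in fields[8:]:
            if entry.startswith("["):
                continue  # skip migrating/importing markers
            start, _, end = entry.partition("-")
            ranges.append((int(start), int(end or start)))
    return sorted(ranges)


def missing_slot_ranges(ranges, total_slots=16384):
    """Return the (start, end) gaps not covered by the sorted ranges."""
    missing = []
    next_expected = 0
    for start, end in ranges:
        if start > next_expected:
            missing.append((next_expected, start - 1))
        next_expected = max(next_expected, end + 1)
    if next_expected < total_slots:
        missing.append((next_expected, total_slots - 1))
    return missing


if __name__ == "__main__":
    ranges = parse_slot_ranges(NODES_OUTPUT)
    print(sum(end - start + 1 for start, end in ranges))  # 13108
    print(missing_slot_ranges(ranges))  # [(0, 2709), (2711, 3276)]
```

So in the restarted node's view, slots 0-2709 and 2711-3276 are not assigned to anyone, which matches the 13108 figure in `cluster info`.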

Output from another Redis node in the cluster:
127.0.0.1:6379> cluster info
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:13107
cluster_slots_pfail:0
cluster_slots_fail:3277
cluster_known_nodes:5
cluster_size:5
cluster_current_epoch:5
cluster_my_epoch:1
cluster_stats_messages_ping_sent:1236
cluster_stats_messages_pong_sent:1140
cluster_stats_messages_fail_sent:6
cluster_stats_messages_sent:2382
cluster_stats_messages_ping_received:1136
cluster_stats_messages_pong_received:1234
cluster_stats_messages_meet_received:4
cluster_stats_messages_fail_received:1
cluster_stats_messages_received:2375
127.0.0.1:6379> cluster nodes
ee2106f0115a5d70664702dd6998283ef9a15c74 10.44.163.107:6379@16379 master - 0 1653291836731 3 connected 6554-9829
df007adf6ecc0c52ddbc8afde9b7af48c8ce562b 10.44.163.110:6379@16379 master - 0 1653291836000 5 connected 13107-16383
cd7c11c57a844c6e6cc591c15bf545e9e6b9445c 10.44.162.108:6379@16379 myself,master - 0 1653291834000 1 connected 0-3276
d7dd0a990d15048c92920a513304b87276e7e2c5 :0@0 master,fail,noaddr - 1653291291862 1653291288000 2 disconnected 3277-6553
faf3f0aae6f34a193f37bca4ec117c976a18f791 10.44.163.108:6379@16379 master - 0 1653291835000 4 connected 9830-13106

Why is this the case, and how can I reliably restore a Redis cluster to a working state if I detect that the cluster nodes are in a failed state?
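
For the detection part, this is a minimal sketch of the check I have in mind. It assumes the text of `cluster info` has already been captured (for example from redis-cli), the function names `parse_cluster_info` and `cluster_problems` are hypothetical, and it only reports problems; it does not attempt any repair.

```python
def parse_cluster_info(text):
    """Parse "key:value" lines from CLUSTER INFO output into a dict of strings."""
    info = {}
    for line in text.strip().splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def cluster_problems(info, total_slots=16384):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    state = info.get("cluster_state")
    if state != "ok":
        problems.append("cluster_state is %s" % state)
    assigned = int(info.get("cluster_slots_assigned", 0))
    if assigned != total_slots:
        problems.append("only %d of %d slots assigned" % (assigned, total_slots))
    failed = int(info.get("cluster_slots_fail", 0))
    if failed:
        problems.append("%d slots in fail state" % failed)
    return problems


if __name__ == "__main__":
    # First lines of "cluster info" from the second node shown above.
    sample = """\
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:13107
cluster_slots_pfail:0
cluster_slots_fail:3277
"""
    for problem in cluster_problems(parse_cluster_info(sample)):
        print(problem)
```

Run against the second node's output, this flags the fail state and the 3277 failed slots; what I am missing is the reliable recovery step that should follow.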