Redis instances are not stable

We set up open source Redis and kept it to a simple cluster configuration. We have 10 servers, and each server runs one master node and one slave node. The instances are not stable: some of them stop responding, and we get alerts that the server instances are down.
When we checked the logs, we found the warnings below.
Slave log:
WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add ‘vm.overcommit_memory = 1’ to /etc/sysctl.conf and then reboot or run the command ‘sysctl vm.overcommit_memory=1’ for this to take effect.
11485:M 30 Jan 2020 08:49:30.587 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command ‘echo never > /sys/kernel/mm/transparent_hugepage/enabled’ as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
13312:M 30 Jan 2020 09:07:55.832 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
13312:M 30 Jan 2020 09:07:55.832 # Server can’t set maximum open files to 10032 because of OS error: Operation not permitted.
13312:M 30 Jan 2020 09:07:55.833 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase ‘ulimit -n’.
4816:S 22 May 2020 15:31:46.515 * Non blocking connect for SYNC fired the event.
4816:S 22 May 2020 15:31:46.516 # Error reply to PING from master: ‘-MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-wri’
4816:S 22 May 2020 15:31:46.733 # Starting a failover election for epoch 1052.
4816:S 22 May 2020 15:31:50.250 * MASTER <-> REPLICA sync started
4816:S 22 May 2020 15:31:50.252 * Non blocking connect for SYNC fired the event.
4816:S 22 May 2020 15:31:50.280 # Error reply to PING from master: ‘-MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-wri’
4816:S 22 May 2020 15:31:52.287 * Connecting to MASTER x.60.9.106:7000
4816:S 22 May 2020 15:31:52.513 * MASTER <-> REPLICA sync started
4816:S 22 May 2020 15:31:52.693 * Non blocking connect for SYNC fired the event.
4816:S 22 May 2020 15:31:53.433 * Clear FAIL state for node bebde1846852919ca549e241ebe2e074eb87c06a: is reachable again and nobody is serving its slots after some time.
4816:S 22 May 2020 15:31:53.440 # Cluster state changed: ok
4816:S 22 May 2020 15:31:53.447 # Error reply to PING from master: ‘-MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-wri’
4816:S 22 May 2020 15:31:57.206 * Connecting to MASTER x.60.9.106:7000
4816:S 22 May 2020 15:31:57.207 * MASTER <-> REPLICA sync started
4816:S 22 May 2020 15:31:57.309 * Non blocking connect for SYNC fired the event.
4816:S 22 May 2020 15:31:57.325 # Error reply to PING from master: ‘-MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-wri’
4816:S 22 May 2020 15:31:59.081 * Connecting to MASTER x.60.9.106:7000

Master logs:
11476:M 30 Jan 2020 08:49:25.764 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
11476:M 30 Jan 2020 08:49:25.764 # Server initialized
11476:M 30 Jan 2020 08:49:25.765 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add ‘vm.overcommit_memory = 1’ to /etc/sysctl.conf and then reboot or run the command ‘sysctl vm.overcommit_memory=1’ for this to take effect.
11476:M 30 Jan 2020 08:49:25.767 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command ‘echo never > /sys/kernel/mm/transparent_hugepage/enabled’ as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
11476:M 30 Jan 2020 08:49:25.768 * Ready to accept connections
11476:M 30 Jan 2020 08:57:27.789 # configEpoch set to 7 via CLUSTER SET-CONFIG-EPOCH
11476:M 30 Jan 2020 08:57:27.922 # IP address for this node updated to x.60.9.106
11476:M 30 Jan 2020 09:05:40.387 * FAIL message received from d5fcbc4e4fb7a8480a2e6d670004b13574547d06 about 1e58f53ace34a666d7589053c20d7e5fcfd2e520
11476:M 30 Jan 2020 09:05:40.388 * FAIL message received from d5fcbc4e4fb7a8480a2e6d670004b13574547d06 about 5dd47c00296b6bfc8598e1b5c61878d22d9e8a45
11476:M 30 Jan 2020 09:06:03.098 * FAIL message received from e58f3a0495de7d53e295739dc0401b9cbd9460dc about 0b3dfddd1e08a8896c664a43ee077f5ec658b5cb
11476:M 30 Jan 2020 09:06:03.099 * FAIL message received from e58f3a0495de7d53e295739dc0401b9cbd9460dc about 1a93feba7cfd556fceaea3ee06685576fa474568
11476:M 30 Jan 2020 09:06:44.577 * FAIL message received from b1bfc6073682063b8ebdec63a1778b080835296f about e58f3a0495de7d53e295739dc0401b9cbd9460dc
11476:M 30 Jan 2020 09:06:44.578 * FAIL message received from b1bfc6073682063b8ebdec63a1778b080835296f about 9160a54395a90209c4fd97d90e53fdceab4bf4c9
13321:C 30 Jan 2020 09:08:04.507 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
13321:C 30 Jan 2020 09:08:04.508 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=13321, just started
13321:C 30 Jan 2020 09:08:04.509 # Configuration loaded
13321:M 30 Jan 2020 09:08:04.513 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
13321:M 30 Jan 2020 09:08:04.514 # Server can’t set maximum open files to 10032 because of OS error: Operation not permitted.
13321:M 30 Jan 2020 09:08:04.515 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase ‘ulimit -n’.
13321:M 30 Jan 2020 09:08:04.559 # Not listening to IPv6: unsupproted
13321:M 30 Jan 2020 09:08:04.563 * Node configuration loaded, I’m ad8fdf640e446889d0161daf36d6efe760082417
13321:M 30 Jan 2020 09:08:04.611 # Not listening to IPv6: unsupproted
13321:M 30 Jan 2020 09:08:04.616 * Running mode=cluster, port=7000.
13321:M 30 Jan 2020 09:08:04.618 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
13321:M 30 Jan 2020 09:08:04.619 # Server initialized
13321:M 30 Jan 2020 09:08:04.620 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add ‘vm.overcommit_memory = 1’ to /etc/sysctl.conf and then reboot or run the command ‘sysctl vm.overcommit_memory=1’ for this to take effect.
13321:M 30 Jan 2020 09:08:04.621 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command ‘echo never > /sys/kernel/mm/transparent_hugepage/enabled’ as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
13321:M 30 Jan 2020 09:08:04.622 * Ready to accept connections
13321:M 30 Jan 2020 09:10:05.318 * Clear FAIL state for node 9160a54395a90209c4fd97d90e53fdceab4bf4c9: is reachable again and nobody is serving its slots after some time.
13321:M 30 Jan 2020 09:10:12.420 * Clear FAIL state for node e58f3a0495de7d53e295739dc0401b9cbd9460dc: master without slots is reachable again.
13321:M 30 Jan 2020 09:10:57.224 * Clear FAIL state for node 0b3dfddd1e08a8896c664a43ee077f5ec658b5cb: master without slots is reachable again.
13321:M 30 Jan 2020 09:11:05.922 * Clear FAIL state for node 1a93feba7cfd556fceaea3ee06685576fa474568: is reachable again and nobody is serving its slots after some time.
13321:M 30 Jan 2020 09:11:17.872 * Clear FAIL state for node 1e58f53ace34a666d7589053c20d7e5fcfd2e520: master without slots is reachable again.
13321:M 30 Jan 2020 09:11:24.354 * Clear FAIL state for node 5dd47c00296b6bfc8598e1b5c61878d22d9e8a45: is reachable again and nobody is serving its slots after some time.
13321:M 31 Jan 2020 09:57:14.544 # configEpoch set to 0 via CLUSTER RESET HARD

Welcome, Praveen,

Have you verified that all the servers can connect to each other?

How much memory do you have on each server?

Is there any data stored on the cluster?
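
For the connectivity question, here is a minimal sketch of one way to check it from each server. It assumes the nodes listen on port 7000 (the port shown in your logs) and that the cluster bus uses the client port plus 10000, which is the Redis Cluster default. The function names are my own, and the replica port is not shown in your logs, so pass it explicitly if it differs.

```python
import socket

# Redis Cluster nodes talk to each other on a "cluster bus" port,
# which by default is the client port + 10000 (7000 -> 17000).
BUS_OFFSET = 10000


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, client_port=7000):
    """Map each (host, port) pair to whether it is reachable from this machine.

    Both the client port and the cluster bus port are probed, because
    the nodes need both to be open between every pair of servers.
    """
    results = {}
    for host in hosts:
        for port in (client_port, client_port + BUS_OFFSET):
            results[(host, port)] = can_connect(host, port)
    return results
```

Running check_nodes with the list of your 10 server IPs on every server should show True for every pair. Any False on the bus port usually points at a firewall rule rather than at Redis itself.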

You have quite a few warnings to deal with here:

13312:M 30 Jan 2020 09:07:55.832 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
13312:M 30 Jan 2020 09:07:55.832 # Server can’t set maximum open files to 10032 because of OS error: Operation not permitted.

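You need to raise the open file limit (ulimit -n) for the user that runs Redis. It is currently 4096, and Redis needs at least 10032 for maxclients 10000, so it has reduced maxclients to 4064.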
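
The arithmetic behind those numbers can be sketched as follows. This is only an illustration: the 32 reserved descriptors are inferred from the log lines (10000 -> 10032, 4096 -> 4064), and the helper names are made up for this example.

```python
import resource

# Redis keeps 32 descriptors for its own use on top of maxclients,
# which matches the log: maxclients 10000 needs 10032 descriptors.
RESERVED_FDS = 32


def required_fds(maxclients):
    """Open-files limit Redis asks for at a given maxclients."""
    return maxclients + RESERVED_FDS


def effective_maxclients(soft_limit):
    """maxclients Redis falls back to under a given open-files limit."""
    return max(soft_limit - RESERVED_FDS, 0)


def nofile_limits():
    """Return the (soft, hard) open-files limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} needed={required_fds(10000)}")
```

Run it as the same user, and in the same way, that Redis is started, since the limit is per process. How you raise it depends on how Redis is launched: for example limits.conf for a login shell, or LimitNOFILE for a systemd unit.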

Error reply to PING from master: ‘-MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-wri’

Looks like the master can’t write its RDB snapshot to disk, so it is refusing writes and returning this error to the replica’s PING.
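
A quick way to rule out the obvious causes is to check the directory Redis saves into (the dir setting in redis.conf). The sketch below is a rough check under my own assumptions: the function name is made up, the 1 GiB free-space threshold is arbitrary, and you have to pass in your real data directory.

```python
import os
import shutil


def check_save_dir(path, min_free_bytes=1 << 30):
    """Return a list of problems that would stop an RDB save into `path`.

    An empty list means the directory exists, is writable by the
    current user, and has at least `min_free_bytes` free.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append("not a directory")
        return problems
    # Writing a temp file and renaming it needs write + search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free")
    return problems
```

Run it as the user Redis runs as, not as root, because root passes the permission check regardless. If the directory looks fine, the failed save may be a fork failure under memory pressure, which is what the overcommit_memory warning in your logs is about.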

WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

You need to increase net.core.somaxconn to at least 511 so that Redis can use its configured TCP backlog.
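
To see where each server stands, something like the following reads the current values without changing anything. It is a sketch, not an official Redis tool: the function names are mine, and the conditions only approximate the three kernel warnings in your logs (somaxconn, overcommit_memory, and THP).

```python
from pathlib import Path


def read_setting(path):
    """Return the stripped contents of a /proc or /sys file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def active_thp_mode(text):
    """Pick the bracketed value out of e.g. 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def kernel_warnings(somaxconn, overcommit, thp, tcp_backlog=511):
    """List the kernel settings that match the warnings seen in the logs."""
    found = []
    if somaxconn is not None and int(somaxconn) < tcp_backlog:
        found.append(f"somaxconn {somaxconn} < tcp-backlog {tcp_backlog}")
    if overcommit is not None and int(overcommit) == 0:
        found.append("vm.overcommit_memory is 0")
    if thp is not None and active_thp_mode(thp) != "never":
        found.append("transparent huge pages not disabled")
    return found


def check_host():
    """Read the three settings from this machine and report what is off."""
    return kernel_warnings(
        read_setting("/proc/sys/net/core/somaxconn"),
        read_setting("/proc/sys/vm/overcommit_memory"),
        read_setting("/sys/kernel/mm/transparent_hugepage/enabled"),
    )
```

An empty list from check_host() on a server suggests those three warnings should be gone after the next Redis restart there. The fixes themselves are the commands quoted in the warnings, plus raising net.core.somaxconn with sysctl.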

I’d start by addressing these issues.

Best,
Kyle