RedisGraph replica replicating slowly

Hey,

I use RedisGraph version 2.8.17, or to be more exact the docker.io/redislabs/redisgraph@sha256:8db03866292e35f2b3e2eb1e2df6c56299510b837f90d146e84b9615d5ad5a0d Docker image.

When I write some data to RedisGraph, it serializes in around 4 minutes, but it then takes 45 minutes to replicate to the replica. I use the replica for read queries.
I have read some vanilla Redis documentation on replication. I use the default repl-backlog-size and the default client-output-buffer-limit for replicas.
I use more aggressive AOF rewrite settings, as reading the AOF back is relatively slow.
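For reference, this is roughly what my redis.conf looks like for those settings (the replication buffers are the Redis defaults; the AOF rewrite thresholds below are illustrative, not my exact values):

```
# Replication buffers left at the Redis defaults
repl-backlog-size 1mb
client-output-buffer-limit replica 256mb 64mb 60

# More aggressive AOF rewrite than the defaults
# (defaults are 100 / 64mb; exact values here are illustrative)
auto-aof-rewrite-percentage 50
auto-aof-rewrite-min-size 32mb
```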

Update 1 (to address the comment from SWilly22):

There are reads on the replica, but not often (one every 5 seconds).

The hardware of the master and the replica is the same. The CPU is not capped.

I have looked at some historical data: if the replication delay stays below 25K, the replica keeps up. If it goes above that, something weird happens: the number of commands processed on the replica drops to 15 ops/s while CPU utilization sits at 100%, and the replication delay shrinks only very slowly.

End of Update 1

Any ideas on how to optimize this?

Best Tim

The way replication currently works in RedisGraph v2.8.17 is as follows:
GRAPH.QUERY commands which modify the underlying graph in any way (e.g. introduce a new node, delete an edge) and are processed by the master are replicated as-is to the replica. You can see the replicated commands if you run the MONITOR command on the replica.
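For example, attaching MONITOR to the replica (hostname and graph key are placeholders, and the echoed line is only illustrative of the MONITOR output format) shows the write commands arriving verbatim:

```
# Watch the commands the replica receives from the master
redis-cli -h <replica-host> MONITOR
# 1671532827.877 [0 <master-addr>:6379] "GRAPH.QUERY" "mygraph" "MERGE (s:Node {id: 1})"
```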

Assuming both the master and its replica have the same resources (hardware) and both hold the same graph (no data inconsistency), the time to execute the same GRAPH.QUERY command should be the same, given that the replica isn’t processing additional read commands.

Will you be able to validate the above assumptions?
Thanks!

Hey, thanks for your questions. I hope I have addressed your assumptions in Update 1 in the original question.

Here is another example. I have ingested more data into the redis master.

And made the following observations:

  • It’s similar to the situation above
  • After a while there is a message on the slave
1:S 20 Dec 2022 10:40:27.877 # Closing client that reached max query buffer length: id=141 addr=170.36.6.4:6379 laddr=170.36.1.6:60932 fd=13 name= age=46279 idle=0 flags=Mb db=0 sub=0 psub=0 multi=-1 qbuf=1073746113 qbuf-free=268431157 argv-mem=406 obl=0 oll=0 omem=0 tot-mem=1342198190 events=r cmd=graph.QUERY user=(superuser) redir=-1 (qbuf initial bytes: "*3\r
$11\r
GRAPH.QUERY\r
$16\r
banana-split-serverless\r
$374\r
 MERGE (sourc") [log truncated]
  • where addr is the master’s IP
  • and laddr the slave’s
  • Is this the message indicating that the client query buffer filled up?
  • This happens twice
  • And it seems that a full sync improves the situation

Seems like the slave is unable to keep up.
Can you please share a number of the write queries you’re issuing, accompanied by their execution plans?
GRAPH.EXPLAIN <GRAPH-KEY> <QUERY>

I want to make sure both the master and the slave produce the same execution-plan for a given query.
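A concrete invocation would look like this (hostnames, graph key, and query are placeholders; run it against both instances and compare the output):

```
redis-cli -h <master-host>  GRAPH.EXPLAIN mygraph "MERGE (s:Node {id: 1})"
redis-cli -h <replica-host> GRAPH.EXPLAIN mygraph "MERGE (s:Node {id: 1})"
```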

I am not sure, but it must be a lot of write queries. Redis metrics report 2k commands/s for a while (about one hour).
I can implement a metric and come back with more exact data.

I don’t think I can run an explain on the replica, as it is read-only.

Is the current slow performance of the replica by design / just not yet optimized? I know it’s a lot of data for the master, but the master seems to process it much more efficiently than the replication does.

You can configure the replica to accept write commands by issuing the following command on the replica:
CONFIG SET slave-read-only no
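For instance, to run the explain on the replica and then restore the safe default (hostname, graph key, and query are placeholders):

```
redis-cli -h <replica-host> CONFIG SET slave-read-only no
redis-cli -h <replica-host> GRAPH.EXPLAIN mygraph "MERGE (s:Node {id: 1})"
redis-cli -h <replica-host> CONFIG SET slave-read-only yes
```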


Thanks for your time.

I had to fix the issue from yesterday and found a workaround for when the replica is stuck in the slow-replication state.

It seems that the full re-sync solves the issue immediately.
So one workaround strategy I can use: after a large ingestion, once the master has processed the writes, I can go onto the slave, run SLAVEOF NO ONE and then SLAVEOF master_addr port. That seems to force a full resync.
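In redis-cli terms, the workaround is (hostname and master address/port are placeholders):

```
# On the stuck replica: detach from the master...
redis-cli -h <replica-host> SLAVEOF NO ONE
# ...then re-attach, which triggers a full resync
redis-cli -h <replica-host> SLAVEOF <master-addr> <master-port>
```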

I was also playing with client-output-buffer-limit slave, hoping that too large a buffer would force the replica into a full resync, without much luck. I put the soft limit down to 4mb and 600 seconds; nothing happened.
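What I tried was along these lines (hard limit left at the 256mb default). Note that client-output-buffer-limit is set on the master, since it governs the master’s output buffer towards the replica; that may be why it had no effect on the replica-side "max query buffer length" message above:

```
# On the master: hard limit 256mb, soft limit 4mb, soft-limit window 600s
redis-cli -h <master-host> CONFIG SET client-output-buffer-limit "slave 268435456 4194304 600"
```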

I am still not sure if this is a bug or just a performance limitation of RedisGraph replication.

Do you want the explains while the replica is in this bad, slow replication state? I can also wipe the AOF and restart the replica so it resyncs from a completely blank state.