Stability: RedisGraph Startup time and "module is currently replicating" issues

Hey folks, RedisGraph seems to be a great solution for some of our runtime metadata lookup needs. Query performance for finding indirect relationships between nodes in our graphs has been great, often sub-millisecond and usually under 2-3 milliseconds. That part is pretty amazing.

The part that is less than amazing: since we added the RedisGraph module to our Redis services (1 master, two replicas), recovery/restart has become much, much slower before the instances are accessible again. We are using append-only (AOF) persistence, so if a Redis node in Kubernetes goes down, nothing should be lost. We are also on Amazon spot instances, which periodically drop, but thanks to our infrastructure replacements are immediately spun up and attached to the prior storage.

When our Redis held ONLY key/values, it took just 10-15 seconds on startup to replay its entire append-only file and be ready for queries. Now that we’ve added graphs, it takes several minutes (the graph writes we do take roughly 15 ms each on average, since each involves several merges and a delete). We’ve loaded about 47,000 nodes into a proof-of-concept graph, and replaying all of those graph statements from the AOF on startup takes many minutes. That’s a definite challenge, and I’m wondering what we can do to speed up recovery times. If we switch to snapshots only, I suspect it will recover faster; in a couple of tests the instance was queryable within 45 seconds using RDB instead of AOF. We could switch to that, but we’ve stuck with AOF because we don’t want to drop any transactions.
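For reference, our persistence settings are roughly along these lines (a sketch from memory; the exact values on the cluster may differ slightly):

# redis.conf (abridged)
appendonly yes          # AOF enabled so acknowledged writes survive a node loss
appendfsync everysec    # fsync the AOF roughly once per second
# as far as I can tell there are no "save" lines, i.e. no scheduled RDB snapshots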

Another MADDENING issue I’ve seen with the redislabs/redisgraph Docker image: even running the data locally on my laptop, the instance is never ready for queries and keeps showing “module is currently replicating”. The one I am running on my desktop doesn’t even have replicas (?!). I also launched the image in our Kubernetes cluster, and the logs say the master and replicas are in good states, but any graph query, even a read-only one, reports “module is currently replicating” hours after I have spun them up.

I see one other post on this forum about the replicating issue, but that is IT; I cannot find anywhere else talking about that error, and while that poster posted a solution, it’s not at all obvious what the solution is or what the cause of the original issue was.

I’m trying really hard to make a case for RedisGraph to my customer, but stuff like this really puts a damper on my advocacy. We never had issues with Redis replicas before adding the RedisGraph module. I need it to be stable, and I need to be able to use replicas without them getting stuck in a “replicating” state.

Any deeper insight into how to improve startup times and make the replicas responsive again is deeply appreciated. Thanks! -Michael

Update:

Yeah, the replicas have been up for over 24 hours and never get into a state where they are “ready” for RedisGraph. They can serve other key/values just fine…

$ redis-cli
127.0.0.1:6379> GRAPH.QUERY SourceSets "MATCH (n) RETURN n LIMIT 10"
(error) READONLY You can't write against a read only replica.
127.0.0.1:6379> GRAPH.RO_QUERY SourceSets "MATCH (n) RETURN n LIMIT 10"
(error) ERR RedisGraph module is currently replicating

If I look at this log output for Redis, this is all I’ve seen since we brought it online:

 1:S 26 Feb 2021 17:44:19.121 * Connecting to MASTER our-redis-master.svc.cluster.local:6379
 1:S 26 Feb 2021 17:44:19.122 * MASTER <-> REPLICA sync started
 1:S 26 Feb 2021 17:44:19.122 * Non blocking connect for SYNC fired the event.
 1:S 26 Feb 2021 17:44:19.125 * Master replied to PING, replication can continue...
 1:S 26 Feb 2021 17:44:19.127 * Partial resynchronization not possible (no cached master)
 1:S 26 Feb 2021 17:44:19.130 * Full resync from master: e9c2d80c2ac20bb99606c56c554a2bf24958b1e8:0
 1:S 26 Feb 2021 17:44:19.515 * MASTER <-> REPLICA sync: receiving 13937159 bytes from master to disk
 1:S 26 Feb 2021 17:44:19.594 * MASTER <-> REPLICA sync: Flushing old data
 1:S 26 Feb 2021 17:44:19.594 * MASTER <-> REPLICA sync: Loading DB in memory
 1:S 26 Feb 2021 17:44:19.625 * Loading RDB produced by version 6.0.11
 1:S 26 Feb 2021 17:44:19.625 * RDB age 0 seconds
 1:S 26 Feb 2021 17:44:19.625 * RDB memory usage when created 54.74 Mb
 1:S 26 Feb 2021 17:44:20.225 * <graph> Done decoding graph SourceSets
 1:S 26 Feb 2021 17:44:20.225 * <graph> Done decoding graph Tests
 1:S 26 Feb 2021 17:44:55.836 * <graph> Done decoding graph SourceSets
 1:S 26 Feb 2021 17:44:55.836 * MASTER <-> REPLICA sync: Finished with success
 1:S 26 Feb 2021 17:44:55.837 * Background append only file rewriting started by pid 201
 201:C 26 Feb 2021 17:44:55.838 * <graph> Created 1 virtual keys for graph SourceSets
 201:C 26 Feb 2021 17:44:55.838 * <graph> Created 0 virtual keys for graph Tests
 201:C 26 Feb 2021 17:44:56.096 * <graph> Done encoding graph SourceSets
 201:C 26 Feb 2021 17:44:56.267 * <graph> Done encoding graph Tests
 201:C 26 Feb 2021 17:44:56.351 * <graph> Done encoding graph SourceSets
 1:S 26 Feb 2021 17:44:56.462 * AOF rewrite child asks to stop sending diffs.
 201:C 26 Feb 2021 17:44:56.462 * Parent agreed to stop sending diffs. Finalizing AOF...
 201:C 26 Feb 2021 17:44:56.462 * Concatenating 0.00 MB of AOF diff received from parent.
 201:C 26 Feb 2021 17:44:56.462 * SYNC append only file rewrite performed
 201:C 26 Feb 2021 17:44:56.462 * <graph> Deleted 1 virtual keys for graph SourceSets
 201:C 26 Feb 2021 17:44:56.463 * <graph> Deleted 0 virtual keys for graph Tests
 201:C 26 Feb 2021 17:44:56.465 * AOF rewrite: 2 MB of memory used by copy-on-write
 1:S 26 Feb 2021 17:44:56.540 * Background AOF rewrite terminated with success
 1:S 26 Feb 2021 17:44:56.540 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
 1:S 26 Feb 2021 17:44:56.540 * Background AOF rewrite finished successfully

I’ve seen this behavior with Redis 6.0.5 and 6.0.11 now… :-/

Hi @Meremortal,

(I know we’ve already touched base, but am posting here for posterity.)

The poor performance of loading large graphs from AOF is a known issue for RedisGraph. We are working on some enhancements that should ameliorate the problem somewhat, but for the foreseeable future I expect RDB persistence to vastly outperform AOF persistence in terms of load times.

With regard to replicas getting stuck in a “currently replicating” state, it seems this may have been a serialization error that was corrected as of 2.2.14. We haven’t been able to verify that, as we have not been able to reproduce the bug on our side. If it appears again, please let us know, ideally with steps to reproduce the issue!
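In the meantime, if you want to check whether an affected replica is genuinely stuck rather than still loading or syncing, INFO is the quickest probe (just a diagnostic suggestion, not a fix):

127.0.0.1:6379> INFO replication
127.0.0.1:6379> INFO persistence

INFO replication reports the replica’s role and master_link_status, and INFO persistence includes a loading field that is 1 while the dataset is still being loaded. If the link is up and loading is 0 but GRAPH.RO_QUERY still returns “module is currently replicating”, that detail would be very helpful to include in a bug report.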

Thanks for your patience!


Hey Jeffrey, thank you! I meant to follow up on this message but was having login issues with the forum!

For newcomers: we have now been running RedisGraph 2.2.14 with Redis 6.0.11, with a master and two (2) replicas, for a week without seeing the “module is currently replicating” issue, so that is great!

For persistence we are using both RDB and AOF, but we discovered that whoever set up our original Redis 6.0.5 install did not include an RDB snapshot schedule, which exacerbated the known RedisGraph AOF issue that Jeffrey outlined.

The workaround we are trying for the moment is a save “60 1” schedule, which writes an RDB snapshot every 60 seconds if at least 1 key has changed. Since a whole graph is stored in what counts as a single key, we’re not sure whether several updates to one graph register as changes to more than one key, so we use the “60 1” schedule to try to ensure that on startup there will never be more than about 60 seconds of AOF graph transactions to replay. We can cope with instance recovery taking 30-60 seconds (previously it was 15-45 seconds), but not much longer.
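Concretely, the persistence section of our config now looks roughly like this (a sketch; we are still tuning the exact values):

# redis.conf (abridged)
appendonly yes       # keep AOF so acknowledged writes survive a crash
appendfsync everysec
save 60 1            # also write an RDB snapshot every 60 s if at least 1 key changed

The same schedule can also be applied to a running instance with CONFIG SET save "60 1", which is handy for testing before baking it into the config file.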

We don’t know whether making the RDB snapshot schedule more aggressive will cause replication churn that could undermine stability, but so far this approach seems tolerable for our needs. In an ideal world, service restarts are rare, certainly much rarer than they are in our dev environments with AWS spot instances in Kubernetes.

I’ll continue to kick the tires while working with this customer and see how it goes as we push it harder!

Thanks again!

-Michael (@Meremortal)