RediSearch and memory overhead when inserting new documents

Hi,

What is the best way to minimise memory overhead (RediSearch) when inserting a lot of documents?

Our RediSearch DB is a single index with around 5 million documents.
Various tests suggest that to insert all 5 million docs we have to spin up a larger (more memory) node than we need to run Redis for normal read/search operations.
The DB is populated only once and then serves read-only traffic. However, the data needs to be updated every month, so ideally we want to run the smallest node possible to keep costs down.

We are updating/inserting documents using the Redis protocol and piping, e.g. cat data.txt | redis-cli --pipe.
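For context, the piped file is just a stream of FT.ADD upserts along these lines (index and field names here are made up for illustration):

    FT.ADD products_idx doc:1000001 1.0 REPLACE FIELDS title "Some product" description "Some description"
    FT.ADD products_idx doc:1000002 1.0 REPLACE FIELDS title "Another product" description "Another description"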

RediSearch works fine with 5 GB of RAM for read/search operations on the entire 5 million docs. However, to successfully populate those 5 million docs we need a node with at least 8 GB of RAM.
We are already looking to expand our DB to around 15 GB, so I assume the overhead will be even larger?

We are running (mostly default settings):

  • redis_version:5.0.7
  • RediSearch version 1.6.13 (Git=v1.6.13)
    concurrent writes: OFF, gc: ON, prefix min length: 2, prefix max expansions: 200, query timeout (ms): 500, timeout policy: return, cursor read size: 1000, cursor max idle (ms): 300000, max doctable size: 20000000, search pool size: 20, index pool size: 8,

--loadmodule /usr/lib/redis/modules/redisearch.so MAXDOCTABLESIZE 20000000 GC_POLICY FORK FORK_GC_CLEAN_THRESHOLD 10000

redis.conf: |-
    save 900 1
    save 300 10
    save 60 1000
    dir /data
    dbfilename master.rdb
    rdbchecksum yes

We are running RediSearch on k8s on GCP.

Thanks!

First question: why is your MAXDOCTABLESIZE set to 20M? For 5M documents, setting it larger than 5M makes no sense (and I believe the default of 1M will also be fine for you).

Now regarding memory optimizations: there are several ways you can optimize memory usage (most of them depend on your use case and the latency you expect).

So first, are you using SORTABLE fields? Sortable fields increase memory usage but give better query performance, so it’s up to you whether to use them. Another option is the term offsets saved in the inverted index (use NOOFFSETS on FT.CREATE to disable them): if disabled, the inverted index will use less memory, but you will not be able to do an exact search of multiple terms or highlighting, so again it depends on whether you need those. The same goes for the NOFIELDS option: if set, it reduces memory usage, but you will not be able to search within specific fields (just general free-text search). Last is NOFREQS, which again saves memory but does not allow sorting based on the frequencies of a given term within the document.
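For illustration, this is how those index-level options look on FT.CREATE (the index and field names are just placeholders):

    # NOOFFSETS: drop term offsets -> smaller index, but no exact multi-term search or highlighting
    # NOFIELDS:  drop field flags  -> smaller index, but no per-field search
    # NOFREQS:   drop term freqs   -> smaller index, but no frequency-based sorting
    FT.CREATE myidx NOOFFSETS NOFIELDS NOFREQS SCHEMA title TEXT body TEXT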

There are also per-field memory optimization options, for example NOSTEM. With this option the field will not create any stemmed entries in the inverted index, but you will not find words by their stems either…

If you have fields which you do not want to search on, make sure to define them as NOINDEX so RediSearch will not index them at all (which definitely saves memory).
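Again for illustration, the per-field options (same placeholder names):

    # title is indexed but never stemmed; internal_url is stored with the document but not indexed
    FT.CREATE myidx2 SCHEMA title TEXT NOSTEM internal_url TEXT NOINDEX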

Let me know if those help; we can keep discussing your specific use case and come up with more ideas for optimization.

Thanks a lot for the answer!

I have reverted MAXDOCTABLESIZE to the default, so it will be 1M. 20M was set during our testing.

No, we do not use SORTABLE fields.

We need exact search and highlighting.

I am not sure I fully understand NOFIELDS and NOFREQS; where can I find some examples?

We are setting NOSTEM on all our fields (due to the specific nature of our data).

We have quite a lot of fields set as NOINDEX, and I am now wondering whether it would be better (memory-wise) to put that data in the PAYLOAD instead?
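If I understand PAYLOAD correctly, that would mean moving from the first form below to the second (names and values made up):

    # today: extra data kept in a NOINDEX field
    FT.ADD products_idx doc:1 1.0 FIELDS title "Some product" extra_data "never searched"
    # considered: the same data attached as a payload instead
    FT.ADD products_idx doc:1 1.0 PAYLOAD "never searched" FIELDS title "Some product"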

In general we are happy with the RediSearch memory consumption at “rest” under read-only traffic. Our concern is the case when we need to update our entire DB. We need to spin up a bigger (more RAM) node to accommodate the updates, otherwise (sorry, I have not mentioned this before) Redis crashes (Write error saving DB on disk: Cannot allocate memory write(): Cannot allocate memory). /proc/sys/vm/overcommit_memory is set to 1.
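For completeness, we apply that kernel setting on the node in the standard way (roughly):

    # allow the kernel to overcommit memory, as recommended for Redis background saves
    echo 1 > /proc/sys/vm/overcommit_memory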

Thanks

Interesting. Are all the updates actually updates, or new insertions? If they are all updates, I would expect the fork GC to keep up with the update pace and clean the garbage… Can you share the Redis log file? I would like to analyze the exact crash you are talking about.

This happens for both new insertions and updates, as we have already tested many times.

Here is the Redis log; not sure there is much info there. I also have a core dump available:

1:C 25 Jun 2020 05:48:43.391 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Jun 2020 05:48:43.391 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Jun 2020 05:48:43.391 # Configuration loaded
1:M 25 Jun 2020 05:48:43.393 * Running mode=standalone, port=6379.
1:M 25 Jun 2020 05:48:43.393 # Server initialized
1:M 25 Jun 2020 05:48:43.394 * <ft> RediSearch version 1.6.13 (Git=v1.6.13)
1:M 25 Jun 2020 05:48:43.394 * <ft> Low level api version 1 initialized successfully
1:M 25 Jun 2020 05:48:43.394 * <ft> concurrent writes: OFF, gc: ON, prefix min length: 2, prefix max expansions: 200, query timeout (ms): 500, timeout policy: return, cursor read size: 1000, cursor max idle (ms): 300000, max doctable size: 1000000, search pool size: 20, index pool size: 8, 
1:M 25 Jun 2020 05:48:43.394 * <ft> Initialized thread pool!
1:M 25 Jun 2020 05:48:43.394 * Module 'ft' loaded from /usr/lib/redis/modules/redisearch.so
1:M 25 Jun 2020 05:49:18.276 * DB loaded from disk: 34.882 seconds
1:M 25 Jun 2020 05:49:18.276 * Ready to accept connections
1:M 25 Jun 2020 05:50:46.245 * 1000 changes in 60 seconds. Saving...
1:M 25 Jun 2020 05:50:46.327 * Background saving started by pid 389
389:C 25 Jun 2020 05:52:09.428 * DB saved on disk
389:C 25 Jun 2020 05:52:09.604 * RDB: 415 MB of memory used by copy-on-write
1:M 25 Jun 2020 05:52:09.902 * Background saving terminated with success
1:M 25 Jun 2020 05:53:10.036 * 1000 changes in 60 seconds. Saving...
1:M 25 Jun 2020 05:53:10.104 * Background saving started by pid 824
824:C 25 Jun 2020 05:54:32.304 * DB saved on disk
824:C 25 Jun 2020 05:54:32.508 * RDB: 552 MB of memory used by copy-on-write
1:M 25 Jun 2020 05:54:32.929 * Background saving terminated with success
1:M 25 Jun 2020 05:55:33.020 * 1000 changes in 60 seconds. Saving...
1:M 25 Jun 2020 05:55:33.143 * Background saving started by pid 1271
1271:C 25 Jun 2020 05:55:49.832 # Write error saving DB on disk: Cannot allocate memory
1:M 25 Jun 2020 05:55:50.418 # Background saving terminated by signal 9
1:M 25 Jun 2020 05:55:50.518 * 1000 changes in 60 seconds. Saving...
1:M 25 Jun 2020 05:55:50.706 * Background saving started by pid 1326
1:signal-handler (1593064551) Received SIGTERM scheduling shutdown...
1:M 25 Jun 2020 05:55:51.207 # User requested shutdown...
1:M 25 Jun 2020 05:55:51.207 # There is a child saving an .rdb. Killing it!
1:M 25 Jun 2020 05:55:51.207 * Saving the final RDB snapshot before exiting.

@bogumil I do not see any crash report in the logs you sent; I see that you shut it down. How much does the memory increase during this big update, and does it go back down after the update finishes?

@meirsh I have run more tests and found the following:

  • it was k8s that was shutting down (sending SIGTERM) the Redis container due to memory overuse
  • I have disabled all automatic persistence, and this helped with the memory overhead during upserting (it looks like the background saving generated the overhead, most probably not directly but through the Linux kernel)

It would be good to know some general figures (%) for the memory overhead Redis needs during a bulk upsert.

@bogumil it’s probably not even related to RediSearch; it’s most likely the copy-on-write that happens during the RDB save. In the worst case, copy-on-write can double the memory usage, but you can avoid this by disabling the auto persistence (as you did) and triggering a save manually, for example after your index update has finished, or maybe by using AOF.
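A minimal sketch of that flow, assuming the monthly reload is scripted (the save schedule shown is the one from your redis.conf):

    # turn off automatic RDB snapshots for the duration of the bulk load
    redis-cli CONFIG SET save ""
    # run the monthly bulk upsert
    cat data.txt | redis-cli --pipe
    # trigger a single snapshot manually once the update has finished
    redis-cli BGSAVE
    # restore the original snapshot schedule
    redis-cli CONFIG SET save "900 1 300 10 60 1000"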