PyTorch on Gears - one more way to fry a RedisGears cluster

I now have 3 steps in the pipeline working, and I want to add a 4th: tokenisation using a BERT model.
Unfortunately, the tokeniser depends on PyTorch, which is an ~800 MB download.
It seems that after installing PyTorch the cluster becomes unstable:

161808:M 22 May 2020 15:01:13.516 * <module> GEARS: Successfully spellchecked sentence sentences:bafab6b3dd88dcdefe111698d02f81998c9accdb:236:{1x3}
161783:S 22 May 2020 15:03:42.420 * <module> Processing ./torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl
161783:S 22 May 2020 15:03:51.325 * <module> Installing collected packages: torch
161783:S 22 May 2020 15:04:02.674 * <module> Successfully installed torch-1.4.0
161783:S 22 May 2020 15:04:09.381 # <module> disconnected : 10.144.17.211:30006, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.402 # <module> disconnected : 10.144.17.211:30005, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.422 # <module> disconnected : 10.144.17.211:30003, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.443 # <module> disconnected : 10.144.17.211:30002, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.464 # <module> disconnected : 10.144.17.211:30004, status : -1, will try to reconnect.

The command I am trying to run is:

gears-cli --host 10.144.17.211 --port 30001 tokenizer_bert_run.py --requirements requirements_tokenizer.txt

where requirements_tokenizer.txt contains:

torch==1.4
transformers==2.9.1

and the code:

tokenizer = None


def loadTokeniser():
    global tokenizer
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    return tokenizer


def tokenise_sentence(record):
    # hashtag(), log() and execute() are built-ins provided by the RedisGears runtime
    global tokenizer
    if not tokenizer:
        tokenizer = loadTokeniser()
    sentence_key = record['key']
    sentence_orig = record['value']['content']
    shard_id = hashtag()
    log(f"Tokeniser received {sentence_key} and my shard id is {shard_id}")
    tokens = tokenizer.tokenize(sentence_orig)
    key = "tokenized:bert:%s:{%s}" % (sentence_key, shard_id)
    for token in tokens:
        execute('lpush', key, token)
    # mark the sentence as processed once, rather than once per token
    execute('SADD', 'processed_docs_stage3_tokenized', sentence_key)


bg = GearsBuilder()
bg.foreach(tokenise_sentence)
bg.count()
bg.run('sentences:*')
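For context, each record that GearsBuilder hands to foreach here should be a dict along these lines (a sketch only; the exact fields depend on the reader and RedisGears version, and the hash field names are assumptions taken from the code above):

# hypothetical shape of a record matched by 'sentences:*'
record = {
    'key': 'sentences:<sha1>:236:{1x3}',             # the matched Redis key
    'value': {'content': 'original sentence text'},  # the hash, as a dict
}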

I don't think it even reaches the point where it runs the code.
gears-cli times out with:

Results
-------

Errors
------
%d)     %s (1, 'Execution max idle reached')

@AlexMikhalev can you share the full logs of all the shards? I guess it just takes too long to install this requirement and we are reaching the execution max idle timeout (by the way, I already have a PR that sets the requirement-installation idle timeout to a longer value by default, because it makes sense that it might take a while: https://github.com/RedisGears/RedisGears/pull/326). Notice that you can increase this timeout: https://oss.redislabs.com/redisgears/configuration.html#executionmaxidletime.

@meirsh is there any way to get debug logs out of the shards?
I am running ./create-cluster tailall and the excerpt above is all there is - nothing else, no failures.

I increased the timeout with redis-trib.py execute --addr 10.144.17.211:30001 --master-only RG.CONFIGSET ExecutionMaxIdleTime 30 and re-submitted the same script as above. It resulted in a segfault - see gist.

OK @AlexMikhalev the issue is also this:

Downloading torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl (753.4 MB)

753.4 MB is more than the default Redis bulk size (proto-max-bulk-len, which defaults to 512 MB). Try increasing it with:

CONFIG SET proto-max-bulk-len 2048mb

Make sure to do it on all the shards, and do not forget to increase ExecutionMaxIdleTime as well.
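Something like this would apply both settings everywhere (a minimal sketch using redis-py; the host and the six ports are assumptions based on the create-cluster setup above, so adjust them to your topology):

import redis

HOST = '10.144.17.211'
PORTS = range(30001, 30007)  # assumed shard ports; adjust to your cluster

for port in PORTS:
    r = redis.Redis(host=HOST, port=port)
    # allow pip to stream the ~753 MB wheel through the Redis protocol
    r.config_set('proto-max-bulk-len', '2048mb')
    # give the requirement installation up to 5 minutes (value is in ms)
    r.execute_command('RG.CONFIGSET', 'ExecutionMaxIdleTime', '300000')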

I just tried it and it worked for me.

Regarding the crash, do you mind opening an issue on GitHub?

I tried increasing the bulk size (after a create-cluster clean, restart, and refresh). It still failed with the idle timeout, even on an empty cluster.
I will try to replicate the crash and file a bug report on github/RedisGears.

@AlexMikhalev as I said, this config set alone is not enough; you also need to increase ExecutionMaxIdleTime. When you load the module you can pass it as a parameter; set it to something like 5 minutes to be on the safe side (300000 ms).
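For example, something like this at startup (a sketch; the path to redisgears.so is an assumption):

redis-server --port 30001 --loadmodule ./redisgears.so ExecutionMaxIdleTime 300000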

@meirsh is ExecutionMaxIdleTime in seconds or in ms?

ms (some more chars to reach 20 chars so it will allow me to send the message :) )


For others, the fix is to run:

redis-trib.py execute --addr ip:30001 RG.REFRESHCLUSTER
redis-trib.py execute --addr ip:30001 CONFIG SET proto-max-bulk-len 2048mb
redis-trib.py execute --addr ip:30001 RG.CONFIGSET ExecutionMaxIdleTime 300000
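To verify a setting took effect on the shards, RG.CONFIGGET can be used in the same way, e.g.:

redis-trib.py execute --addr ip:30001 RG.CONFIGGET ExecutionMaxIdleTime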