TAG fields and escaping

I wonder if the query parser could be improved for querying TAG fields.

For storing the data (FT.ADD) I don’t have to escape anything in the tags except for the delimiter I supply myself.

Searching the tagged data, on the other hand, currently requires escaping, which feels awkward.
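For example, roughly what this looks like from Lua (a minimal sketch; the index, field name and ‘;’ separator are just placeholders):

    -- Storing: the raw tag values need no escaping, only my own ';' separator matters
    redis.call('FT.ADD', 'myIdx', 'doc:1', '1.0', 'FIELDS',
               'myTags', 'jane.doe@example.com;vip-customer')
    -- Searching: every punctuation character inside the tag clause has to be escaped
    redis.call('FT.SEARCH', 'myIdx', '@myTags:{jane\\.doe\\@example\\.com}')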

From Redis-Lua, I had to write a function to do the special escaping needed, which is cumbersome and prone to break whenever custom tokenization or other additions are made to RediSearch.

Here’s the function:

local function EscapeFtPunctuation(cRet)
    -- Escape RediSearch FT (full text) search punctuation characters, like '-' (becomes '\-')
    -- Also escape spaces, see: http://redisearch.io/Tags/
    -- For punctuation characters, see: https://github.com/RedisLabsModules/RediSearch/blob/master/src/toksep.h
    -- From the C code:
    --[[ [' '] = 1, ['\t'] = 1, [','] = 1, ['.'] = 1, ['/'] = 1, ['('] = 1, [')'] = 1,
         ['{'] = 1, ['}'] = 1, ['['] = 1, [']'] = 1, [':'] = 1, [';'] = 1, ['\\'] = 1,
         ['~'] = 1, ['!'] = 1, ['@'] = 1, ['#'] = 1, ['$'] = 1, ['%'] = 1, ['^'] = 1,
         ['&'] = 1, ['*'] = 1, ['-'] = 1, ['='] = 1, ['+'] = 1, ['|'] = 1, ['\''] = 1,
         ['`'] = 1, ['"'] = 1, ['<'] = 1, ['>'] = 1, ['?'] = 1,
    ]]
    -- Lua gsub: the Lua magic characters are ( ) . % + - * ? [ ^ $
    -- So: prepend them with '%' in the gsub pattern string (first parameter)
    return (cRet:gsub('[ \t,%./%(%){}%[%]:;\\~!@#%$%%%^&%*%-=%+|\'`"<>%?_]', {
        [' ']  = '\\ ',
        ['\t'] = '\\t',
        [',']  = '\\,',
        ['.']  = '\\.',
        ['/']  = '\\/',
        ['(']  = '\\(',
        [')']  = '\\)',
        ['{']  = '\\{',
        ['}']  = '\\}',
        ['[']  = '\\[',
        [']']  = '\\]',
        [':']  = '\\:',
        [';']  = '\\;',
        ['\\'] = '\\\\',
        ['~']  = '\\~',
        ['!']  = '\\!',
        ['@']  = '\\@',
        ['#']  = '\\#',
        ['$']  = '\\$',
        ['%']  = '\\%',
        ['^']  = '\\^',
        ['&']  = '\\&',
        ['*']  = '\\*',
        ['-']  = '\\-',
        ['=']  = '\\=',
        ['+']  = '\\+',
        ['|']  = '\\|',
        ['\''] = '\\\'',
        ['`']  = '\\`',
        ['"']  = '\\"',
        ['<']  = '\\<',
        ['>']  = '\\>',
        ['?']  = '\\?',
        -- Add underscore as well, seems to be needed
        ['_']  = '\\_',
    }))
end
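For context, this is roughly how it gets used when building a tag query (index and field names are placeholders again):

    -- e.g. 'jane.doe@example.com' becomes 'jane\.doe\@example\.com'
    local needle = EscapeFtPunctuation(ARGV[1])
    return redis.call('FT.SEARCH', 'myIdx', '@myTags:{' .. needle .. '}')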

Hi,
Sorry for the late reply, I was sick for a couple of days.

The reason it’s like that is that the query tokenizer is not aware of its state and the allowed delimiters while it is parsing a token. It just passes tokens on to the parser that builds the parse tree.

Making it contextual and recursive like that will make it slower and way more complex.

Second, the definition of tokens and escaping can be found in lexer.rl (which uses Ragel). The relevant part is:

escape = '\\';
escaped_character = escape (punct | space | escape);

So basically, we are talking about punct and space.

From Ragel’s manual:

punct – Punctuation. Graphical characters that are not alphanumerics. [!-/:-@[-`{-~]

space – Whitespace. [\t\v\f\n\r ]

I hope this helps. In C we have the ispunct and isspace functions; not sure if these are exposed by Lua.

Hi Dvir,

No problem, hope you are feeling better.

Your reply helps a lot with improving the mentioned Lua function (ispunct and isspace are not exposed by Lua, and can’t easily be imported into Redis’ Lua interpreter because of its strict determinism requirements; see the Redis docs).
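Based on your punct | space definition, the whole function apparently boils down to something like this (a sketch; Lua’s %p and %s pattern classes are backed by the same ispunct/isspace, so they should cover the same characters):

    -- Prefix every punctuation or whitespace character with a backslash,
    -- mirroring the lexer's escaped_character = escape (punct | space | escape) rule.
    local function EscapeFtPunctuation(s)
        return (s:gsub('([%p%s])', '\\%1'))
    end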

However, as said, I feel that a RediSearch user should preferably not have to write this function at all.

You say “Making it contextual and recursive like that will make it slower and way more complex”, and I agree, but consider the following:

  • ‘Slower’: yes, naturally. But when escaping needs to be done, you seem to pass the problem on to the RediSearch user, and the total performance hit will most probably be bigger than if it were handled by Ragel, with its excellent performance. In my Redis-Lua code my options are limited: I can only use a semi-regex engine, which is pretty fast, but not on par with Ragel.

  • You use a very simple cleanup when storing tags (strip outer whitespace, lower(), split by custom delimiter). Tags are very powerful for a lot of our scenarios. It would make sense for searching tags to be just as simple.

  • ‘More complex’: Can’t argue with that. But it might be made easy in the following way: treat ‘{’ and ‘}’ like you treat quotes, and pass these curly-brace parts directly to your search method (see the sketch after this list).

  • ‘More complex’: Different solution: split it up. Make the query parser a completely separate Redis Module. That way, you can always add more parsers (like an SQL-like parser you mentioned) without adding complexity to the RediSearch core.

  • ‘More complex’: Your documentation should be accurate. If you don’t add this to your parser, you must add the complete specs to your documentation. So, your documentation gets more complex. I should be able to write the Lua function EscapeFtPunctuation without looking at RediSearch’s source code.
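To illustrate the third point: if whatever sits between ‘{’ and ‘}’ were taken verbatim, the escaping would disappear from the query entirely. A sketch of the desired behaviour (not current syntax), with the same placeholder names as above:

    -- today: punctuation inside the tag clause must be escaped
    redis.call('FT.SEARCH', 'myIdx', '@myTags:{jane\\.doe\\@example\\.com}')
    -- proposed: the text between the braces is passed through as-is
    redis.call('FT.SEARCH', 'myIdx', '@myTags:{jane.doe@example.com}')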

You are right in general, but consider this: What if I have two tag fields, each with a different delimiter? Knowing that I’m inside a tag clause will not be enough.
One thing we can do is not allow quotes to be a delimiter in tags, and allow you to quote the tags in the query, negating the need for escaping at all (besides quotes themselves).
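Something along these lines (just a sketch of the proposed syntax, nothing implemented yet):

    -- quotes would delimit the tag verbatim; only quotes themselves would need escaping
    redis.call('FT.SEARCH', 'myIdx', '@myTags:{ "jane.doe@example.com" }')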

Yes, I think that’s a best of both worlds solution (including the ability to escape quotes within tag value searches).

Could you open an issue for this please?
Thanks

Yes sure: https://github.com/RedisLabsModules/RediSearch/issues/259