TAG fields and escaping


tw-bert

Jan 9, 2018, 3:45:35 AM
to redisearch
I wonder if the query parser could be improved for querying TAG fields.

For storing the data (FT.ADD) I don't have to escape anything in the tags except for the delimiter I supply myself.
Searching the tagged data, however, currently requires escaping, which feels awkward.

From Redis-Lua, I had to write a function to do the special escaping needed, which is cumbersome and likely to break when custom tokenization or other additions are made to RediSearch.

Here's the function:

local function EscapeFtPunctuation(cRet)
    -- Escape RediSearch FT (full text) search punctuation characters, like '-' (becomes '\-')
    -- Also escape spaces, see: http://redisearch.io/Tags/
    -- For punctuation characters, see: https://github.com/RedisLabsModules/RediSearch/blob/master/src/toksep.h
    -- From the C code:
    --[[ [' '] = 1, ['\t'] = 1, [','] = 1, ['.'] = 1, ['/'] = 1, ['('] = 1, [')'] = 1,
    ['{'] = 1, ['}'] = 1, ['['] = 1, [']'] = 1, [':'] = 1, [';'] = 1, ['\\'] = 1,
    ['~'] = 1, ['!'] = 1, ['@'] = 1, ['#'] = 1, ['$'] = 1, ['%'] = 1, ['^'] = 1,
    ['&'] = 1, ['*'] = 1, ['-'] = 1, ['='] = 1, ['+'] = 1, ['|'] = 1, ['\''] = 1,
    ['`'] = 1, ['"'] = 1, ['<'] = 1, ['>'] = 1, ['?'] = 1,
    ]]
    -- Lua gsub: The lua magic characters are ( ) . % + - * ? [ ^ $
    -- So: prepend with '%' in the gsub pattern string (first parameter)
    return (cRet:gsub('[ \t,%./%(%){}%[%]:;\\~!@#%$%%%^&%*%-=%+|\'`"<>%?_]', {
        [' ' ]='\\ ' ,
        ['\t']='\\\t' ,
        [',' ]='\\,' ,
        ['.' ]='\\.' ,
        ['/' ]='\\/' ,
        ['(' ]='\\(' ,
        [')' ]='\\)' ,
        ['{' ]='\\{' ,
        ['}' ]='\\}' ,
        ['[' ]='\\[' ,
        [']' ]='\\]' ,
        [':' ]='\\:' ,
        [';' ]='\\;' ,
        ['\\']='\\\\' ,
        ['~' ]='\\~' ,
        ['!' ]='\\!' ,
        ['@' ]='\\@' ,
        ['#' ]='\\#' ,
        ['$' ]='\\$' ,
        ['%' ]='\\%' ,
        ['^' ]='\\^' ,
        ['&' ]='\\&' ,
        ['*' ]='\\*' ,
        ['-' ]='\\-' ,
        ['=' ]='\\=' ,
        ['+' ]='\\+' ,
        ['|' ]='\\|' ,
        ['\'']='\\\'' ,
        ['`' ]='\\`' ,
        ['"' ]='\\"' ,
        ['<' ]='\\<' ,
        ['>' ]='\\>' ,
        ['?' ]='\\?' ,
        -- Add underscore as well, seems needed
        ['_' ]='\\_' ,
    }))
end


Sidenote: (sorry about the formatting, I do not know how to apply it in Google Groups -> actually, what's the added value of a mail group anyway nowadays...)

I even had to escape underscores, which aren't even tokenization characters... I'm interested in the reason, but I'd like to question the need for total punctuation character escaping at all.

I would like to know if the list above is complete, since that's what I'm relying on right now.

Cheers, TW
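
One way to check the list for completeness is sketched below. It assumes the characters that need escaping are ASCII punctuation and whitespace as Lua's own `%p` and `%s` pattern classes define them; that assumption is not confirmed by the post above. The `handled` string simply restates the characters the function already covers (the toksep.h list plus the underscore), and `missing` is a hypothetical name.

```lua
-- Characters already escaped by EscapeFtPunctuation (toksep.h list plus '_').
local handled = " \t,./(){}[]:;\\~!@#$%^&*-=+|'`\"<>?_"

-- Collect the ASCII codes that Lua classifies as punctuation or whitespace
-- but that do not appear in the handled list.
local missing = {}
for code = 1, 127 do
  local ch = string.char(code)
  -- find(..., 1, true) is a plain-text search, so magic characters are harmless here.
  if ch:find('[%p%s]') and not handled:find(ch, 1, true) then
    missing[#missing + 1] = code
  end
end

print(table.concat(missing, ','))
```

Under the default C locale this should leave only whitespace codes (newline, vertical tab, form feed, carriage return), which suggests the punctuation part of the list is complete once the underscore is included.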



Dvir Volk

Jan 10, 2018, 11:02:27 AM
to tw-bert, redisearch
Hi,
Sorry for the late reply, I was sick for a couple of days.

The reason is that the query tokenizer is not aware of its state or of the allowed delimiters when it is parsing a token. It just passes tokens on to the parser, which builds the parse tree.
Making it contextual and recursive like that would make it slower and way more complex.

Second, the definition of tokens and escaping can be found in lexer.rl (which uses Ragel). The relevant part is:
escape = '\\';
escaped_character = escape (punct | space | escape);

So basically, we are talking about punct and space.
From Ragel's manual:
punct – Punctuation. Graphical characters that are not alphanumerics. [!-/:-@[-`{-~]
space – Whitespace. [\t\v\f\n\r ]

I hope this helps. In C we have the ispunct and isspace functions; I'm not sure whether these are exposed by Lua.
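
The rule quoted above (a backslash followed by any punct or space character) can be sketched in Lua with the pattern classes `%p` and `%s`, which the Lua reference manual describes as "all punctuation characters" and "all space characters". Whether they coincide exactly with Ragel's `punct` and `space` outside plain ASCII is an assumption here, and both `EscapeByClass` and the field name `tags` are hypothetical names used only for illustration.

```lua
-- Hypothetical helper, not part of RediSearch: prefix every punctuation
-- or whitespace character with a backslash.
local function EscapeByClass(s)
  -- '%0' in the replacement stands for the whole match, so each matched
  -- character is emitted again with a backslash in front of it.
  -- The outer parentheses drop gsub's second return value (the match count).
  return (s:gsub('[%p%s]', '\\%0'))
end

-- Example: build a tag clause for a hypothetical TAG field named "tags".
local query = '@tags:{' .. EscapeByClass('hello world') .. '}'
print(query)
```

Because the underscore is a punctuation character under this definition, it is escaped as well, with no need to list it separately.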


tw-bert

Jan 12, 2018, 3:01:52 AM
to redisearch
Hi Dvir,

No problem, hope you are feeling better.

Your reply helps a lot with improving the mentioned Lua function (ispunct and isspace are not exposed by Lua, and can't be easily imported inside Redis' Lua interpreter because of strict determinism, see Redis docs).

However, as said, I feel that a RediSearch user should preferably not have to write this function at all.

You say "Making it contextual and recursive like that will make it slower and way more complex", and I agree, but consider the following:
- 'Slower': yes, naturally. But when escaping needs to be done, you pass the problem on to the RediSearch user, and the total performance hit will most probably be bigger than it would be with Ragel's excellent performance. In my Redis-Lua code, my options are limited, and I can only use a semi-regex engine, which is pretty fast, but is not on par with Ragel.
- You use a very simple cleanup when storing tags (strip outer whitespace, lower(), split by custom delimiter). Tags are very powerful for a lot of our scenarios. It would make sense that searching for tags would be simple as well.
- 'More complex': Can't argue with that. But it might be easy in the following way: treat '{' and '}' like you treat quotes. You can pass these curly brace parts directly to your search method.
- 'More complex': Different solution: split it up. Make the query parser a completely separate Redis Module. That way, you can always add more parsers (like an SQL-like parser you mentioned) without adding complexity to the RediSearch core.
- 'More complex': Your documentation should be accurate. If you don't add this to your parser, you must add the complete specs to your documentation. So, your documentation gets more complex. I should be able to write the Lua function EscapeFtPunctuation without looking at RediSearch's source code.

Dvir Volk

Jan 12, 2018, 4:00:41 AM
to tw-bert, redisearch
You are right in general, but consider this: What if I have two tag fields, each with a different delimiter? Knowing that I'm inside a tag clause will not be enough.
One thing we can do is not allow quotes to be a delimiter in tags, and allow you to quote the tags in the query, negating the need for escaping at all (besides the quotes themselves).
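
To illustrate how small the client-side helper would become under this proposal, here is a sketch. The quoted-tag syntax is only a suggestion at this point in the thread, so the exact rules are assumptions: this version wraps the tag in double quotes and escapes double quotes, and it also escapes backslashes so that the escape character itself stays unambiguous. `QuoteTag` is a hypothetical name.

```lua
-- Hypothetical helper for the *proposed* quoted-tag syntax; it does not
-- describe existing RediSearch behaviour.
local function QuoteTag(tag)
  -- Escape only the quote character and (by assumption) the backslash.
  -- Assigning to a single local keeps gsub's first return value only.
  local escaped = tag:gsub('["\\]', '\\%0')
  return '"' .. escaped .. '"'
end

print(QuoteTag('hello-world, again'))
print(QuoteTag('say "hi"'))
```

Everything else in the tag, including spaces and punctuation, would pass through untouched.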

tw-bert

Jan 12, 2018, 4:35:33 AM
to redisearch
Yes, I think that's a best of both worlds solution (including the ability to escape quotes within tag value searches).

Dvir Volk

Jan 12, 2018, 5:55:06 AM
to tw-bert, redisearch
Could you open an issue for this please?
Thanks

tw-bert

Jan 12, 2018, 7:17:47 AM
to redisearch