Lucene query syntax / space issue

Gergely Kakuszi

Feb 21, 2013, 10:05:35 AM

to dot...@googlegroups.com

Hi,

I think I have a problem similar to http://forumarchive.dotcms.com/Lucene-query-syntax-space-issue-td5710415.html, but I would like to be able to use a wildcard search with spaces involved.

My goal is to return suggestions to a Dojo filtering select, based on the value the user enters.

For demonstration purposes I created a new Structure called LuceneTest on the demo site with only one column called name.
I added the following content:

[a a]
[a a a]
[aaa a a]

The desired behavior would be:
    When the user enters [a]: all 3 content items should be returned
    When the user enters [a ]: the first two content items should be returned
    When the user enters [a a]: the first two content items should be returned
    When the user enters [a a ]: only the first content item should be returned

Everything works fine when spaces are not involved, using a query like [+structureName:LuceneTest +LuceneTest.name:USER_ENTERED_TEXT].
The following query returns all 3 content items: [+structureName:LuceneTest +LuceneTest.name:a]
However, the following query (notice the space after a) also returns all 3 content items instead of only the first two: [+structureName:LuceneTest +LuceneTest.name:a ]
I tried different approaches:
    [+structureName:LuceneTest +LuceneTest.name:a?a] (replacing space characters with ?) returns only the 3rd content item.
    [+structureName:LuceneTest +LuceneTest.name:a*a] (replacing space characters with *) returns only the 3rd content item.
    [+structureName:LuceneTest +LuceneTest.name:a\ a] (escaping space characters) returns all 3 content items.
    [+structureName:LuceneTest +LuceneTest.name:"aa"] (using quotes) does not return anything (however, quotes work for exact matches).

Is there a way to treat spaces as normal characters in dotCMS (or Lucene)?
Thanks

Maria Ahues Bouza

Feb 21, 2013, 11:53:19 PM

to dot...@googlegroups.com

Gergely,

Based on your content

[a a]
[a a a]
[aaa a a]

When user enters [a] : all 3 content should be returned --> OK
When user enters [a ] : the first two content should be returned --> All three have an "a " in the name. The search doesn't look only at the start of the name; it looks at all the tokens in the name.
When user enters [a a] : the first two content should be returned --> Same as above, all three have an "a a".
When user enters [a a ] : only the first content should be returned --> In this case I'm pretty sure we trim the end of the query, and that's why all three come back.

The only way to get just one is to use a token that the others don't have:

+structureName:LuceneTest +LuceneTest.name:"aaa a a"

+structureName:LuceneTest +LuceneTest.name:"aaa"
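The results reported in this thread are consistent with a simplified model in which the name field is split into whitespace-separated tokens and each query term or wildcard is tested against individual tokens. The sketch below is only an illustration in Python, not dotCMS or Lucene code; `tokenize` and `match_analyzed` are hypothetical helpers, and the real analyzer may behave differently (for example in how it handles case or punctuation).

```python
from fnmatch import fnmatchcase

# The three test values from the LuceneTest structure in this thread.
NAMES = ["a a", "a a a", "aaa a a"]


def tokenize(value):
    """Assumed analyzer behavior: lowercase and split on whitespace."""
    return value.lower().split()


def match_analyzed(pattern, names=NAMES):
    """Return the names where at least one token matches the wildcard pattern.

    fnmatchcase supports the same ? (one character) and * (any run of
    characters) wildcards used in the queries in this thread.
    """
    return [n for n in names if any(fnmatchcase(t, pattern) for t in tokenize(n))]


print(match_analyzed("a"))    # every name contains the token "a"
print(match_analyzed("a?a"))  # only the token "aaa" fits
print(match_analyzed("a*a"))  # again only "aaa" fits
```

Under this model a space can never be part of a match, because no single token contains one.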

-Maria

--
You received this message because you are subscribed to the Google Groups "dotCMS User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dotcms+un...@googlegroups.com.
To post to this group, send email to dot...@googlegroups.com.
Visit this group at http://groups.google.com/group/dotcms?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--
Community Manager

dotCMS
Main: 305.900.2001
Fax: 305.397.2579
www.dotcms.com
http://www.twitter.com/dotCMS
http://www.facebook.com/dotCMS
http://www.twitter.com/mabouza


Gergely Kakuszi

Feb 22, 2013, 3:28:53 AM

to dot...@googlegroups.com

Thanks Maria,

Your answer made things clearer, but isn't there a way to say "this space is not a separator"? (For example, by escaping it with a backslash: "\ ".)

Gergely

Mark Pitely

Feb 22, 2013, 10:06:18 AM

to dot...@googlegroups.com

Gergely,

You can also do exclusions with the - symbol, which might help.

I find that if I really need fine-grained processing, I do it at the foreach level with an if. The Velocity string tools are much more precise, and you can check for individual characters.
Granted, this will not work if you are doing some sort of massive pull. As a last resort, you can use SQL to get full regex support, but that wouldn't work for every situation.
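The post-filtering idea can be sketched as follows. This is an illustration in Python rather than Velocity, and `filter_by_prefix` is a hypothetical helper: it assumes a broad query has already returned the candidate names and keeps only those that start with exactly what the user typed, spaces included.

```python
def filter_by_prefix(names, user_text):
    """Keep only names that begin with user_text, treating spaces literally."""
    return [name for name in names if name.startswith(user_text)]


candidates = ["a a", "a a a", "aaa a a"]  # e.g. the results of a broad query
print(filter_by_prefix(candidates, "a "))
```

The cost is that every candidate has to be fetched before it can be filtered, which is why this approach does not suit very large result sets.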

Mark

Gergely Kakuszi

Feb 22, 2013, 3:52:15 PM

to dot...@googlegroups.com

Thanks Mark,

I think your suggestion could work. However, after a few hours of reading and investigation, I think I found a solution:
I noticed on the Elasticsearch admin page that almost every field has a companion field with a _dotraw suffix.
On this field I can run a Lucene query where spaces are escaped with \.

So now I can write the above query like this:
+structureName:LuceneTest +LuceneTest.name_dotraw:a\ a*
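For a suggestion box, the query string has to be assembled from whatever the user typed. A minimal sketch in Python, assuming the query is built as plain text before being handed to dotCMS; `escape_term` and `build_suggestion_query` are hypothetical helpers, not part of any dotCMS API, and the escaped character set follows the characters that Lucene query syntax treats as special, plus whitespace.

```python
import re

# Characters with special meaning in Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_term(text):
    """Backslash-escape query syntax characters and whitespace in user input."""
    return _SPECIAL.sub(r"\\\1", text)


def build_suggestion_query(structure, field, user_text):
    """Build a prefix query against the field's _dotraw companion."""
    return "+structureName:{0} +{0}.{1}_dotraw:{2}*".format(
        structure, field, escape_term(user_text)
    )


print(build_suggestion_query("LuceneTest", "name", "a a"))
# +structureName:LuceneTest +LuceneTest.name_dotraw:a\ a*
```

Escaping the user's text before appending the trailing * keeps any wildcard characters the user types from being interpreted as part of the query syntax.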

Maria:
I have one more question about this solution (technically two): Is using the _dotraw fields considered a big hack that should be avoided at all costs? Will I have to rewrite this part of the application when a new dotCMS release comes out?

Gergely

Maria Ahues Bouza

Feb 22, 2013, 11:19:29 PM

to dot...@googlegroups.com

Gergely,

I think you'll be fine using this for now. I'll confirm with our developers though because we don't use these fields in implementations.

-Maria

Jason Tesser

Feb 23, 2013, 5:37:00 AM

to dot...@googlegroups.com

I can confirm this field is OK to use. It has been around for a while. It is the unanalyzed text.
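If the _dotraw field holds the whole value as a single term that has not been split into tokens, as this thread suggests, a wildcard is tested against the full string, spaces included. A simplified Python model, not dotCMS code; `match_raw` is a hypothetical helper, and it assumes the raw term preserves the value exactly as entered.

```python
from fnmatch import fnmatchcase

# The three test values from the LuceneTest structure in this thread.
NAMES = ["a a", "a a a", "aaa a a"]


def match_raw(pattern, names=NAMES):
    """Test the wildcard against each whole value instead of its tokens."""
    return [n for n in names if fnmatchcase(n, pattern)]


print(match_raw("a*"))    # all three values start with "a"
print(match_raw("a a*"))  # only "a a" and "a a a" start with "a a"
```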
