lucene bug? with special characters


Anders Jonsson

Jun 25, 2010, 3:47:39 AM
to ravendb
I thought I'd make a new thread, updated with a better subject, and a
clearer description of the issue

If I have an email address such as anders.jonsson(at)gmail.com, I don't get any results if I search for "anders.jonsson". If I instead save it as anders.jonsson#gmail.com, I find it with "anders.jonsson".

Shouldn't # and @ be treated in the same way by Lucene?

And is there a way to specify which tokenizer to use, so I can be sure that anders.jonsson can be found with "anders", "jonsson" or "anders.jonsson", without making changes in both the index and the query?

Matt Warren

Jun 25, 2010, 5:34:00 AM
to ravendb
RavenDB currently doesn't let you change the Analyser/Tokenizer; it uses the standard one for the version of Lucene it's built against (2.9). I don't know if this will be changed in the future though.

All you can currently change are the Field.Index (ANALYZED/NOT_ANALYZED) and Field.Store (YES/NO/COMPRESS) parameters.

Based on a brief bit of research (http://stackoverflow.com/questions/1826927/correct-way-to-write-a-tokenizer-in-lucene), it seems that you are better off tokenising the text yourself and letting the StandardAnalyzer handle it.

Anders Jonsson

Jun 26, 2010, 9:49:25 AM
to ravendb
OK, I can deal with that. But I'm still not sure about the difference between @ and #. Is there a reason that they are behaving differently?

fschwiet

Jun 26, 2010, 7:30:47 PM
to ravendb
I think there is a bug in the client API. Are you running RavenDB as a server or embedded? Here's a test + fix if it's the same issue:
http://github.com/fschwiet/ravendb/commit/4b3df98cf6164c3546ffc7fb677d0c063eb91902

fschwiet

Jun 26, 2010, 8:07:40 PM
to ravendb
I think the difference between # and @ is because a URL has components http://host/path?querystring#fragment. This gets mishandled: part of the query string is removed because it looks like a fragment.
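You can see the effect with nothing more than System.Uri (the URL below is only an illustration, not the exact request the client sends):

using System;

class FragmentDemo
{
    static void Main()
    {
        // An unescaped '#' starts the fragment, so the server never sees the
        // rest of the query-string value.
        var raw = new Uri("http://localhost:8080/indexes/users?query=Email:anders.jonsson#gmail.com");
        Console.WriteLine(raw.Query);     // ?query=Email:anders.jonsson
        Console.WriteLine(raw.Fragment);  // #gmail.com

        // Escaping the value keeps the whole thing inside the query string.
        var safe = new Uri("http://localhost:8080/indexes/users?query=" +
                           Uri.EscapeDataString("Email:anders.jonsson#gmail.com"));
        Console.WriteLine(safe.Query);    // ?query=Email%3Aanders.jonsson%23gmail.com
    }
}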


fschwiet

Jun 27, 2010, 12:35:09 AM
to ravendb
I totally loused up these commits. You'd need commits
http://github.com/fschwiet/ravendb/commit/4b3df98cf6164c3546ffc7fb677d0c063eb91902
through http://github.com/fschwiet/ravendb/commit/6201302dc12fff0583a7677e0a1e2c097c95e28f
instead of just the one, I think.

After I realized what I pushed was incomplete, I amended locally... Still learning Git.


Matt Warren

Jun 27, 2010, 6:32:13 PM
to ravendb
Also, Lucene does treat the '@' and '#' characters differently. Using Luke (http://code.google.com/p/luke/downloads/detail?name=lukeall-1.0.1.jar) you can see what is stored by Lucene.

When storing the following data:
session.Store(new User { Age = 9999, Name = "M@T", Surname = "M#T" });

with the RavenDB default (Analysed) it gets stored as:
Name: "m@t"
Surname: "m"
Surname: "t"

So the Lucene StandardAnalyser obviously treats the '@' symbol as a special case (I guess because of email addresses), but splits the string on the '#' character (and removes it). I can't find any docs for the Lucene StandardAnalyser to get an explanation for this though.
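You can also reproduce it outside Luke by pushing strings through the analyzer directly. This is only a rough sketch against the Lucene.Net 2.9-era API (the token-iteration methods are from memory and moved around between the 2.9.x releases):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class AnalyzerDemo
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();
        foreach (var text in new[] { "M@T", "M#T", "anders.jonsson@gmail.com" })
        {
            Console.Write(text + " -> ");
            TokenStream stream = analyzer.TokenStream("Name", new StringReader(text));
            Token token;
            while ((token = stream.Next()) != null)   // old-style iteration, still present in 2.9
                Console.Write("[" + token.Term() + "] ");
            Console.WriteLine();
        }
        // Expected: [m@t] for the '@' case, [m] [t] for the '#' case,
        // and the whole address as a single token for a valid email.
    }
}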


Matt Warren

Jun 28, 2010, 6:46:45 AM
to ravendb
Okay I found some docs (but nothing complete). The low-level grammar for the StandardAnalyser is at https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/src/Luc... and the C# source that results from that is at https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/src/Luc....

The other source files are at https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/src/Luc....

Anders Jonsson

Jun 28, 2010, 7:22:36 AM
to ravendb
Thanks for the info! It was really helpful.

So there is some special treatment for email addresses. Found the grammar for the email parsing in
https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/src/Lucene.Net/Analysis/Standard/StandardTokenizerImpl.jflex

That certainly explains the weird results I've been getting. The only time Lucene splits the string is when it's an incorrect address (and probably on some correct ones as well, since its parsing of email addresses is a bit on the simple side, but it'll get the large majority right).

Then it seems like I need to get comfortable with the idea of pre-tokenizing email addresses by splitting the string myself (in the index, that is). Unless anyone has a better idea?
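Roughly what I have in mind, as a sketch (the index name is made up and I haven't checked the exact IndexDefinition syntax):

documentStore.DatabaseCommands.PutIndex("UsersByEmailTokens",
    new IndexDefinition
    {
        // Replace the separators server-side so the analyzer only sees plain words:
        // "anders.jonsson@gmail.com" becomes "anders jonsson gmail com".
        Map = @"from user in docs.Users
                select new { EmailTokens = user.Email.Replace('@', ' ').Replace('.', ' ') }"
    });

If the un-split form ("anders.jonsson") should also match as a single term, the map would have to emit that as well.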


Matt Warren

Jun 28, 2010, 12:03:18 PM
to ravendb
For the time being I can't see any other way (but I'm far from a
Lucene expert).

If RavenDB would allow you to specify the Analyser you could do it that way. But the Analyser would need to be exposed on the Server (if it was a custom one), and you'd have to make sure you use the same Analyser for indexing and querying, so it would need to be a MEF plug-in that the server could pick up. An easier option would be to expose the properties that the StandardAnalyser has (stop words, defaultReplaceInvalidAcronym, etc.), but I don't know if they'd cover what you need.

Also, based on some (brief) research, pre-tokenising the string seems easier than extending the Analyser.
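For reference, in raw Lucene the "same Analyser on both sides" requirement just looks like this (a sketch using the deprecated-but-still-present 2.9 constructors; none of this is exposed by RavenDB today):

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Store;

class SameAnalyzerSketch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();
        var directory = new RAMDirectory();

        // The writer and the query parser have to share the analyzer, otherwise
        // terms get cut up one way at index time and another way at query time.
        var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        // ... add documents here ...
        writer.Close();

        var parser = new QueryParser("Email", analyzer);
        Console.WriteLine(parser.Parse("anders.jonsson"));
    }
}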
