Wildcard searches. Contains, StartsWith, EndsWith


Anders Jonsson

Jun 21, 2010, 12:21:47 PM
to ravendb
Hi,

As far as I can tell, EndsWith is not possible with Lucene (I've been
toying with the silly idea of adding a garbage character in front of all
my strings, just to be able to search for #*myname, since *myname doesn't
work). Is there any way to do that kind of search without resorting to
ugliness? As far as I can tell, Contains, StartsWith and EndsWith are not
implemented for LINQ yet.

Also (maybe this should be a separate thread?), are there plans to
support Any()? My objects contain arrays, so Any() would be really
useful.

Thanks,
Anders

Anders Jonsson

Jun 21, 2010, 12:49:02 PM
to ravendb

I just realized that Any() would be useless against a flat index,
right?

Ayende Rahien

Jun 21, 2010, 1:12:42 PM
to rav...@googlegroups.com
EndsWith is something that would be _really_ bad from a performance perspective, since it would require scanning the entire index.
I can add support for that, but what is the usage scenario?

And can you explain a bit more what you need in terms of Any()?
A real case would be great
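
To illustrate why a suffix match costs so much more than a prefix match, here is a minimal C# sketch over a sorted list of unique terms. It is a toy model only, not Lucene's actual term dictionary, and the names TermDictionary, StartsWith and EndsWith are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of a sorted term dictionary (hypothetical, not Lucene's real structure).
public static class TermDictionary
{
    // Prefix lookup: binary-search to the first candidate, then read only the matching run.
    // Assumes sortedTerms is sorted with StringComparer.Ordinal and has no duplicates.
    public static List<string> StartsWith(List<string> sortedTerms, string prefix)
    {
        int index = sortedTerms.BinarySearch(prefix, StringComparer.Ordinal);
        if (index < 0) index = ~index; // position of the first term >= prefix

        var matches = new List<string>();
        while (index < sortedTerms.Count &&
               sortedTerms[index].StartsWith(prefix, StringComparison.Ordinal))
        {
            matches.Add(sortedTerms[index]);
            index++;
        }
        return matches;
    }

    // Suffix lookup: the sort order gives no help, so every term is inspected.
    public static List<string> EndsWith(List<string> sortedTerms, string suffix)
    {
        return sortedTerms
            .Where(term => term.EndsWith(suffix, StringComparison.Ordinal))
            .ToList();
    }
}
```

The prefix version touches only the terms it returns (plus the binary search), while the suffix version visits all of them, which is the "scanning the entire index" cost described above.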

Anders Jonsson

Jun 21, 2010, 2:22:38 PM
to ravendb
Yes, I understand the performance hit. I'd like to be able to skip it,
but our users have the ability to do EndsWith today (they use it to
find all people with an email address from a certain domain, for
example). Most of it could probably be worked around, for instance by
adding another field for the email domain, or by adding a reversed field
and searching on that with the reversed search string.
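
A minimal sketch of the reversed-field idea, assuming plain .NET strings. The helper names (ReversedField, Reverse, EndsWithViaReversed) are hypothetical, and the address used below is a made-up example. Reversal is done per char, so it is only safe for text without surrogate pairs or combining characters.

```csharp
using System;

// Hypothetical helpers illustrating the reversed-field workaround.
public static class ReversedField
{
    // Value to store in the extra index field: the original string reversed.
    public static string Reverse(string value)
    {
        char[] chars = value.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // An EndsWith test on the original value is a StartsWith test
    // on the reversed value, using the reversed search string.
    public static bool EndsWithViaReversed(string reversedFieldValue, string suffix)
    {
        return reversedFieldValue.StartsWith(Reverse(suffix), StringComparison.Ordinal);
    }
}
```

For example, Reverse("user@example.com") yields "moc.elpmaxe@resu", so a search for addresses ending in "@example.com" becomes a prefix search for "moc.elpmaxe@" on the reversed field.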

As for Any(), it really ties into my second thread (querying against
multiple indexes). I think this discussion is better off in that
thread, since the two are overlapping heavily by now. I realized too late
that it's really the same issue with two attempted (and failed)
solutions.

Thanks for your patience :)

/Anders

Ayende Rahien

Jun 21, 2010, 2:26:27 PM
to rav...@googlegroups.com
Anders,
Yes, I would strongly suggest having a field just for the name.
It doesn't have to be on the document, you can create that field in the index itself.

Anders Jonsson

Jun 21, 2010, 2:29:04 PM
to ravendb
Ah, that's true. Clever :)

Thanks!

Anders Jonsson

Jun 22, 2010, 4:53:19 AM
to ravendb
I've been thinking a bit. A reversed field does cover EndsWith, but what
about Contains?
One way is to add a separate field in the index that starts with a
predetermined letter, so that I can search for a*anders* when I want
the documents that contain "anders". I do realize this would be
inefficient, and I'd like to avoid it if possible. Any ideas?

If it's not possible, or performs worse than our current solution (LIKE
searches in MSSQL), we might be able to get away with removing the
ability to do Contains, but I'd prefer not to remove any features.

/Anders

Ayende Rahien

Jun 22, 2010, 7:45:03 AM
to rav...@googlegroups.com
Contains is what Lucene does by default

Anders Jonsson

Jun 22, 2010, 9:05:54 AM
to ravendb
OK, then I must be doing something wrong, because the results I'm
getting are kind of weird. My first attempt was to search for
"anders.jonsson". That never gave me any hits, even though I have
"anders....@gmail.com" in my database.

documentSession.LuceneQuery<Person>("PeopleByStdInfo").Where("Email", "gmail")
gives me hits on "anders@gmail" (an incorrect address), but
misses "and...@gmail.com".

Am I missing something in Lucene? It seems like the dot is messing
something up. I've tried escaping it, but that doesn't help.

Ayende Rahien

Jun 23, 2010, 6:35:13 AM
to rav...@googlegroups.com
I think that the problem is with tokenizing. In other words, Lucene doesn't know how to split the email properly.
For now, try this in the index:
Email = string.Join(" ", person.Email.Split(new[]{'@','.'}))
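
To show what that expression produces, here is the same call wrapped in a small standalone method. The class and method names (EmailIndexField, SplitForIndex) and the sample address are assumptions for the example; only the string.Join/Split expression comes from the post.

```csharp
using System;

// Hypothetical wrapper around the expression suggested in the post.
public static class EmailIndexField
{
    // Replaces '@' and '.' with spaces so each part becomes its own word.
    public static string SplitForIndex(string email)
    {
        return string.Join(" ", email.Split(new[] { '@', '.' }));
    }
}
```

With this, SplitForIndex("first.last@example.com") gives "first last example com", so a whitespace-based analyzer sees four separate words instead of one long token.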

Anders Jonsson

Jun 23, 2010, 7:13:35 AM
to ravendb
I'll give it a shot. Thanks

Anders Jonsson

Jun 24, 2010, 2:52:55 AM
to ravendb
That worked. But I also realize that if our users enter something like
"something1.something2" in a field, they're not going to be able to find it
correctly unless we split all fields in a similar manner.
Is this behavior intended? Is it a built-in weakness in Lucene? A
bug? Shouldn't an email address count as a single word, since there are
no spaces? I realize that Lucene wasn't designed for that scenario,
but I'm hoping for some cleverness to help me work this out :)

Also, the Contains that Lucene does by default doesn't give our
users the same abilities they have today (with LIKE in MSSQL), since it
searches for whole words (and the wildcard search doesn't allow
EndsWith). Does anyone see a solution or workaround for this? We would
benefit in some ways from Lucene, with the ability to do fuzzy
searches and so on, but first we need to get our current features up
and running in a NoSQL environment.

Matt Warren

Jun 24, 2010, 4:30:11 AM
to ravendb
You can do a fuzzy search using Lucene; see the docs here:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

The latest build has a lot of extra methods (on IDocumentQuery, not the
LINQ provider) that can handle these cases, such as Fuzzy, Boost and
Proximity, and more are being added.

Ryan Heath

Jun 24, 2010, 4:45:57 AM
to rav...@googlegroups.com
Looking at the (C#) code of Lucene, I think it should be possible to
create queries that support EndsWith.

The QueryParser has a method SetAllowLeadingWildcard(bool allowLeadingWildcard),
which defaults to false.

// Ryan

Anders Jonsson

Jun 24, 2010, 5:52:37 AM
to ravendb
I've thought about it, and treating an email address as one word
wouldn't be a good solution after all. Being able to search for
anders.jonsson and still get a hit for my email would be nice, but
with a single-word approach I'd have to use anders.jonsson*. Automatic
tokenizing of those words would be nice, though, so we don't have to
split the fields in both the indexes and the search parameters.

Yes, I've seen the Lucene documentation. It covers a lot and I'm sure
our users would benefit from the power of Lucene. I've also done tests
with fuzzy searches from the client API, and they work nicely.

Matt Warren

Jun 24, 2010, 6:11:58 AM
to ravendb
Could you store the email field as "anders jonsson gmail
anders....@gmail.com", so that all cases are covered?
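
A rough sketch of building such a combined value, assuming the parts are obtained by splitting on '@' and '.' as earlier in the thread. The names SearchField and Build are hypothetical, and unlike the example above this version also keeps the top-level domain as a word.

```csharp
using System;

// Hypothetical helper: individual parts followed by the full address.
public static class SearchField
{
    public static string Build(string email)
    {
        // Drop empty pieces so malformed input such as "test@.se" adds no blank words.
        string[] parts = email.Split(new[] { '@', '.' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts) + " " + email;
    }
}
```

For a made-up address, Build("first.last@example.com") returns "first last example com first.last@example.com", so both the individual words and the complete address are present in the indexed text.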

Anders Jonsson

Jun 24, 2010, 7:09:49 AM
to ravendb
Yes, right now we're doing that in the index. It works, but I've
realized that we'd need to do that for every field. If I search the
email for "anders", I don't get any hits on anders.jonsson. That would
mean that if one of our users writes "something1.something2" in any
field, Lucene won't find that field if I search for "something1".
Sure, I could do a wildcard search for it (which our users are forced
to do today, in MSSQL), but if we're going to use the Lucene approach
that searches by words, it would be nice to have an automatic
separation at the dots and @.

It's not that big of a deal to replace dots and @ with a space in the
index, and it works nicely, but I was curious to know where this
behavior is coming from, since it's not really consistent:

"gmail" gives me addresses that end in gmail (faulty addresses
without the top-level domain, such as "mytestaddress@gmail").
"mytestaddress", for example, gives me "mytestaddress@gmail".
"test" fails to give me "te...@gmail.com" (so this one fails while the
one above succeeds; somehow the dot in the domain messes with the
rest of the query).
"test" gives me "test@.se" (another faulty address).
"gmail.com" gives me -@gmail.com, _@gmail.com, mytest@test@gmail.com,
mytest&@gmail.com, mytest.@gmail.com etc. Only the ones with special
characters before the @.

That just doesn't seem right to me.

/Anders

Matt Warren

Jun 24, 2010, 8:37:16 AM
to ravendb
There's another thread that explains some of the issues with the Lucene
index and FieldStorage vs. FieldAnalysed; it might have some useful
info. See http://groups.google.com/group/ravendb/browse_thread/thread/13d9e250391ec598

Anders Jonsson

Jun 24, 2010, 11:31:56 AM
to ravendb
Thanks, that clears it up somewhat. It's always nice to get a better
understanding of the inner workings.

I was wondering about this bit:
"6. FieldIndexing.Analyzed causes values to be converted to strings and
to be parsed up into words similar to search engines (whitespace and
punctuation ignored)"

Punctuation doesn't seem to be ignored fully. If punctuation were
treated as whitespace, I'd get a hit when I search for "jonsson@gmail",
wouldn't I? And a search for "gmail" should find addresses with
"gmail.com", not just the faulty ones ending with "gmail".

As for the example above, where "gmail.com" only gives me the hits
with special characters just before the @, I get the same results if I
search for "@gmail.com", so it seems to be ignoring the @, but not if
there's a letter in front of the @.

I could understand that, if it wasn't for the inability to find
anders.jonsson(at)gmail.com with "anders.jonsson" or
"jonsson(at)gmail.com" or "jonsson@gmail". If there's only one word in
front of the @, such as test(at)gmail.com, I find it if I search for
"test", but if there are two words I can't find it at all without the
entire address. So test.testing(at)gmail.com can't be found with
"test" or "test.testing".

It's as if there was some special parsing for email addresses. Could
that be it? Are email addresses treated differently in Lucene?

The current solution (separating the words in the index) works, but
I'm worried that we'll run into issues with other fields as well.

I know that I'm repeating myself, but I'm really struggling to
understand what's going on. Thanks for your patience :)

By the way, is there a way to set SetAllowLeadingWildcard in the query
parser? I'd like to do a few tests, to see if the performance hit is
worth it.

/Anders

/Anders

Anders Jonsson

Jun 24, 2010, 1:54:56 PM
to ravendb
I just did another test. I've added a person with the address
"anders.jonsson#gmail.com". According to the Lucene docs, # isn't a
special character either, so in my mind it should be treated like @.
Now I DO get a hit if I search for "anders.jonsson" or "gmail.com".
So # isn't treated the same way as @. Any ideas?

If I search for "anders" or "anders jonsson" I don't get any hits, so
I guess dots aren't ignored, contrary to what that other thread said.
I found this in the Lucene docs: "Splits words at punctuation characters,
removing punctuation. However, a dot that's not followed by whitespace is
considered part of a token." That makes sense, but it's inconvenient,
since it makes it possible to mess up the searchability of a string
by not hitting space after punctuation.

The best solution, from my point of view, would be to allow users to
specify which Tokenizer to use. Or is that already possible? The
earlier solution requires a custom approach both in the index and when
querying. If I could use my own Tokenizer I could, hopefully, do
indexes and queries without thinking about that.
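
The quoted rule can be sketched as a toy tokenizer. This is only an illustration of that one sentence, assuming a slightly stricter reading in which a dot stays in the token when a letter or digit follows it directly. It is not Lucene's real grammar, the names ToyTokenizer and Tokenize are made up, and it does not reproduce the different handling of @ observed earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy tokenizer for the quoted rule only (hypothetical, not Lucene's analyzer).
public static class ToyTokenizer
{
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // A dot stays inside the token when it sits between token characters.
            bool dotInsideToken = c == '.'
                && current.Length > 0
                && i + 1 < text.Length
                && char.IsLetterOrDigit(text[i + 1]);

            if (char.IsLetterOrDigit(c) || dotInsideToken)
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // Any other character ends the current token and is dropped.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }
}
```

Under this rule "anders.jonsson#gmail.com" becomes the two tokens "anders.jonsson" and "gmail.com", which matches the hits described above, while "Hello. World" becomes "Hello" and "World" because the dot is followed by a space.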