Substring search

473 views
Skip to first unread message

Oren Eini (Ayende Rahien)

unread,
Jan 24, 2012, 6:56:03 AM1/24/12
to ravendb
We got a lot of requests recently related to how we can support substring searches. 
The major problem with this is that a substring (like) search can't really use the index properly.
It forces a full index scan, which is really slow.

The typical use case for wanting to do that is when we want to provide good search results for the users. Most times, we can get away with just doing full text search, but sometimes we need something more, truly searching inside the text.

For that, we can do NGram search.
What is NGram search? Instead of splitting the text of token boundaries, it generate searchable tokens every N characters. For example, the following statement:
"Arava is a Dog"

Using the default analyzer, it would be indexed as:
[arava is a dog]

Using the standard analyzer, it would be indexed as:

[arava] [dog]

And using the NGram analyzer (2, 6), it would be indexed as:

[ar] [ra] [av] [va] [ara] [rav] [ava] [arav] [rava] [do] [og] [dog] [arava]

That way, you can search for "ava" and it would find it for you.

This is an elegant way to do so, and it is much faster than the *ava* alternative. 
I have a test case showing how this works: https://gist.github.com/1669767

The question now becomes, whatever this would become a RavenDB feature, vs. just an option that you can use.
And if it is a builtin feature, how do we expose it?

Matt Warren

unread,
Jan 24, 2012, 7:14:21 AM1/24/12
to rav...@googlegroups.com
The question now becomes, whatever this would become a RavenDB feature, vs. just an option that you can use.

I think that is should be a RavenDB feature, but you have to enable it. It needs extra space in the index and in some scenarios you need to understand the best values (NGram min/max length), so it's probably better being an explicit step.
 
And if it is a builtin feature, how do we expose it?

What about doing something similar to the new Boost(..) feature,

public class Users_ByName : AbstractIndexCreationTask<User>
{
    public Users_ByName()
    {
        Map = users => from user in users
                       select new
                       {
                           FirstName = user.FirstName.WithSubstringSearches(),
                           user.LastName
                       };
    } 
Or would that cause problems if you want to do both, i.e. FirstName..Boost(3).WithSubstringSearches()?

Oren Eini (Ayende Rahien)

unread,
Jan 24, 2012, 7:16:27 AM1/24/12
to rav...@googlegroups.com
Matt,
The problem with that is that this isn't really how it works. We need to define an analyzer for this to work, not just modify the value that we save.

Ryan Heath

unread,
Jan 24, 2012, 7:26:12 AM1/24/12
to rav...@googlegroups.com
Why is it or should it be different from the current way of specifing
which analyzer you want to use?

// Ryan

Oren Eini (Ayende Rahien)

unread,
Jan 24, 2012, 7:29:13 AM1/24/12
to rav...@googlegroups.com
That is pretty much the point, I think it should be the same.
The only problem is when we want to provide configuration to that
But we can deal with that with some convention

Itamar Syn-Hershko

unread,
Jan 24, 2012, 7:33:40 AM1/24/12
to rav...@googlegroups.com
I don't think that should be built in. When searching for actual words in an ngram some unexpected results may come up (bad recall). That is true for search engines, and it becomes even more disturbing with DB queries. It also bloats the index.

My main point is you want to index with ngrams only for very specific use cases. You don't want this to be to easy for people to use. Otherwise you get people using shotguns for killing flies.

I think this should be a documented feature, leveraging ShingleFilter which we can easily port to .NET if it hasn't been ported yet.

Having proper docs on FTS describing substring matches side by side with various other available Analyzers should do the job.

jalchr

unread,
Jan 24, 2012, 7:54:30 AM1/24/12
to rav...@googlegroups.com
I think it should be part of the Indexing options .. and when used it will use the  NGram analyzer

using our little test:

using (var session = store.OpenSession())
                {
                    session.Store(new Image { Id = "1", Name = "Great Photo buddy" });
                    session.Store(new Image { Id = "2", Name = "Nice Photo of the sky" });
                    session.SaveChanges();
                }

                store.DatabaseCommands.PutIndex("test", new IndexDefinition
                {
                    Map = "from doc in docs.Images select new { doc.Name }",
                    Indexes = {
					    { "Name", FieldIndexing.Analyzed_Fragmented }
				    }   

                });
                
                using (var session = store.OpenSession())
                {
                    var images = session.Query<Image>("test")
                        .Customize(x => x.WaitForNonStaleResults())
                        .Search(x => x.Name, "Phot")
                        .ToList();
                    Assert.NotEmpty(images);
                }

Oren Eini (Ayende Rahien)

unread,
Jan 24, 2012, 8:14:46 AM1/24/12
to rav...@googlegroups.com
How would you handle things like determining the gram length?

jalchr

unread,
Jan 24, 2012, 8:22:15 AM1/24/12
to rav...@googlegroups.com
This could be handled somewhere in the convention configuration. 

Itamar Syn-Hershko

unread,
Jan 24, 2012, 8:23:07 AM1/24/12
to rav...@googlegroups.com
You want to have that per index

Itamar Syn-Hershko

unread,
Jan 24, 2012, 8:23:53 AM1/24/12
to rav...@googlegroups.com
Actually per field

Exactly why I'm against having this a built-in feature

Oren Eini (Ayende Rahien)

unread,
Jan 24, 2012, 8:26:20 AM1/24/12
to rav...@googlegroups.com
And the more I think about this, the more I lean toward agreeing with Itamar.
We will probably document this and point people to this, but it isn't something that we care to do.

Daniel Lang

unread,
Jan 24, 2012, 11:21:49 AM1/24/12
to rav...@googlegroups.com
I agree with Itamar, it should be hard (in terms of one has to read the documentation) to do such queries. I can hardly think of any use-case where one needs to have both a leading and a trailing wildcard. Most of the time, all you need is a trailing one that leads to a prefix query, which is already fast.

Justin A

unread,
Jan 26, 2012, 6:14:37 AM1/26/12
to rav...@googlegroups.com
>> Most of the time, all you need is a trailing one that leads to a prefix query, which is already fast.

Would be nice if FieldIndexing.Reverse was built in (maybe it is and i'm still a noob at this).

Itamar Syn-Hershko

unread,
Jan 26, 2012, 6:18:56 AM1/26/12
to rav...@googlegroups.com
You have a Reverse() extension method supported for index definitions, but how's that related?
Reply all
Reply to author
Forward
0 new messages