Autocomplete Lucene Query

Khalid Abuhakmeh

Oct 6, 2010, 12:22:14 PM
to ravendb
I am currently trying to write an autocomplete feature using a Lucene
query that looks like this:

LuceneQuery(Constants.Raven.Indexes.Players)
.WhereContains("Name", Name)
.AddOrder("Name", false)
.Fuzzy(0.5m)
.ToList();

The issue I'm running into is that fuzzy searching still matches on the
length of the search term, so Jon and Jen might show up together but
not John and Jon. I also tried putting * around the term and got more
results, but not necessarily the results I was looking for: John, Jon,
and Jen in the same set.

Is there a better way to set up a query for autocomplete that will give
me the behavior I am looking for?

Behavior being: exact matches, partial matches, and close matches with
fuzziness.

Ayende Rahien

Oct 6, 2010, 12:47:48 PM
to ravendb
What happens when you increase the fuzziness?

Ayende Rahien

Oct 6, 2010, 12:48:44 PM
to ravendb
Another option would be to add support for "Did you mean?"

Khalid Abuhakmeh

Oct 6, 2010, 12:55:06 PM
to ravendb
The closer the fuzziness gets to 0, the more it just decides that
everything is a match. I lowered the fuzziness down to 0.2 and it
seems like it is just getting me all the results that start with the
same letter.
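
To make the effect of the threshold concrete, here is a minimal C# sketch of an edit-distance similarity score. It assumes the score is 1 minus the edit distance divided by the length of the shorter term, which is roughly how I understand classic Lucene fuzzy matching to work (ignoring any prefix length); FuzzySketch and its methods are hypothetical names and are not part of RavenDB or Lucene.Net.

```csharp
using System;

public static class FuzzySketch
{
    // Classic dynamic-programming Levenshtein edit distance.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed scoring: 1 - distance / length of the shorter term.
    // A term "matches" when this score is above the fuzziness threshold.
    public static double Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1.0 : 0.0;
        return 1.0 - (double)EditDistance(a, b) / shorter;
    }
}
```

Under this assumption, FuzzySketch.Similarity("jon", "jen") is 1 - 1/3, and a threshold close to 0 admits terms that need many more edits, which fits the observation that low values match almost everything.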

Ayende Rahien

Oct 6, 2010, 12:57:06 PM
to ravendb
Then we probably need to work on implementing the "did you mean" feature.
Care to take a go at it?

Khalid Abuhakmeh

Oct 6, 2010, 1:23:13 PM
to ravendb
Sure, how do you picture it working? Would it be a bundle or an
Analyzer of some sort?

Khalid Abuhakmeh

Oct 6, 2010, 1:25:50 PM
to ravendb
I got something close to what I wanted using the LuceneQuery but it
looks really ugly in the code.

.Where(string.Format("Name:*{0}*", Name.Replace(" ", string.Empty)))
.Fuzzy(0.75m)

The issue is that fuzziness throws an exception if you try to put it
on multi-word searches.

Khalid Abuhakmeh

Oct 6, 2010, 2:15:11 PM
to ravendb
I might look crazy having a conversation with myself, but my last
reply didn't work as nicely as I thought it did. Here is my latest
iteration.

var query = LuceneQuery(Constants.Raven.Indexes.Players);

foreach (var word in Name.Split(' ')) {
    query.Where(string.Format("Name:*{0}*", word)).Fuzzy(0.75m).AndAlso();
}
query.Not.WhereEquals("Name", null);
query.AddOrder("Name", false);

var result = query.ToList();

The only issue with this (might not be) is that words are treated like
separate tokens and ' ' is assumed as the separator.
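
The per-word clauses in the loop above can also be sketched as a plain query string, independent of the RavenDB client. This is only an illustration: WildcardQuerySketch is a hypothetical helper, it splits on any whitespace rather than only ' ', and it does not escape Lucene special characters, which a real implementation would need to handle.

```csharp
using System;
using System.Collections.Generic;

public static class WildcardQuerySketch
{
    // Turn "john  smith" into "Name:*john* AND Name:*smith*".
    public static string Build(string field, string input)
    {
        var clauses = new List<string>();

        // Passing null splits on any whitespace; empty entries are dropped,
        // so repeated spaces do not produce empty "Name:**" clauses.
        var words = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
            clauses.Add(field + ":*" + word + "*");

        return string.Join(" AND ", clauses);
    }
}
```

Dropping empty entries avoids the case where a double space in the input yields an empty word and therefore a clause that matches everything.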

To what you said Ayende about a suggestion feature:

I think it would make sense to use a library like NHunspell on the
server side and be able to send back spelling suggestions either by
using the dictionaries that NHunspell has, or by somehow hitting an
index that returns a list of words that might be used as your
dictionary (or a combination of the two).

Suggestions would only be made if you explicitly made a call through
the client like

session.LuceneQuery<Search>(SearchIndex)
    .WhereEquals("Name", name)
    .Suggest(suggestionCount)
    .ToList();

and you would get back a Suggestions<T> that might have queries ready
to execute or just the suggestions themselves. My concern is this
could be very costly and may be considered a business concern rather
than a persistence concern.

Any thoughts?

Ayende Rahien

Oct 6, 2010, 2:33:56 PM
to ravendb
I don't think it can be done via a bundle, I think you would have to directly modify the Index class.
Basically, take a look at the link from SO above, which shows how to do that using raw Lucene.
We should be able to use that.
I like the API that you have.

Matt Warren

Oct 6, 2010, 5:52:55 PM
to ravendb
A simpler way might be to use the "spell-checker" technique outlined
here: http://norvig.com/spell-correct.html. There's a C# port here:
http://www.codegrunt.co.uk/2010/07/08/C-Sharp-Norvig-Spelling-Corrector.html.
BTW this is probably the cleverest piece of code I've seen, it's only
21 lines of Python.

You could use this to build a list of all the possible "corrections"
for a word and then search in the Lucene index for matches of any of
these words. To build the list of possible "corrections" you just need a
large corpus of text, see the Norvig article for more details.
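
As a rough illustration of the candidate-generation step from the Norvig article, the following C# sketch produces every string one edit away from a word and keeps those found in a set of known words. NorvigSketch, Edits1 and KnownCorrections are names chosen here for illustration, the alphabet is assumed to be lowercase a-z, and the frequency-based ranking from the article is left out.

```csharp
using System.Collections.Generic;

public static class NorvigSketch
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz";

    // All strings one edit away from the word: deletes, transposes,
    // replaces and inserts. Replacing a letter with itself means the
    // original word is also part of the result.
    public static HashSet<string> Edits1(string word)
    {
        var edits = new HashSet<string>();
        for (int i = 0; i <= word.Length; i++)
        {
            string left = word.Substring(0, i);
            string right = word.Substring(i);

            if (right.Length > 0)
                edits.Add(left + right.Substring(1));                       // delete
            if (right.Length > 1)
                edits.Add(left + right[1] + right[0] + right.Substring(2)); // transpose
            foreach (char c in Alphabet)
            {
                if (right.Length > 0)
                    edits.Add(left + c + right.Substring(1));               // replace
                edits.Add(left + c + right);                                // insert
            }
        }
        return edits;
    }

    // Keep only the candidates that exist in a set of known words,
    // for example the distinct names stored in an index.
    public static List<string> KnownCorrections(string word, ISet<string> knownWords)
    {
        var matches = new List<string>();
        foreach (string candidate in Edits1(word))
            if (knownWords.Contains(candidate))
                matches.Add(candidate);
        return matches;
    }
}
```

With this sketch, Edits1("jon") contains both "john" (one insert) and "jen" (one replace). The candidate set has to be regenerated for every search term and grows with the length of the word.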

Ayende Rahien

Oct 6, 2010, 5:56:57 PM
to ravendb
While you can do that, the Lucene way is probably better, since that is actually going to use the indexed terms themselves.

Matt Warren

Oct 6, 2010, 6:33:11 PM
to ravendb
Yeah, you're right, I guess they're doing things the other way
round.

The Lucene way seems to add all the misspellings of a word into the
index so they can be searched on. My method generates all the
misspellings each time and then searches for them, so it's not going to be
as quick.
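
As far as I know, the Lucene contrib spell checker does not store misspellings as such; it indexes overlapping character n-grams of each dictionary word and looks for words that share n-grams with the search term. The sketch below only illustrates that idea in plain C#. NGramSketch is a hypothetical name, the gram size is an arbitrary parameter, and the real library adds further details (such as separate handling of word starts and ends) that are omitted here.

```csharp
using System.Collections.Generic;

public static class NGramSketch
{
    // Split a word into overlapping character n-grams of a fixed size.
    public static List<string> NGrams(string word, int size)
    {
        var grams = new List<string>();
        for (int i = 0; i + size <= word.Length; i++)
            grams.Add(word.Substring(i, size));
        return grams;
    }

    // Count how many n-grams of the query also occur in the candidate.
    public static int SharedGrams(string query, string candidate, int size)
    {
        var candidateGrams = new HashSet<string>(NGrams(candidate, size));
        int shared = 0;
        foreach (string gram in NGrams(query, size))
            if (candidateGrams.Contains(gram))
                shared++;
        return shared;
    }
}
```

Because the n-grams are computed once at indexing time, a lookup only has to split the search term, rather than generate every possible misspelling on each request.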

Khalid Abuhakmeh

Oct 7, 2010, 11:43:29 AM
to ravendb
Which repository should I branch off of if I would like to modify the
code?

Ayende Rahien

Oct 7, 2010, 11:48:05 AM
to ravendb
http://github.com/ayende/ravendb

Although, given git's abilities, it doesn't really matter.

Khalid Abuhakmeh

Oct 7, 2010, 2:11:06 PM
to ravendb
I'm trying to think more about this before I try to crowbar this into
Raven.Database. I first thought this would be similar to the
SpatialIndex, but quickly realized that this is an entirely different
animal.

The SpatialIndex helps you create an index that Spatial.Net can use.
In this case we should already have (or can easily create) an index using
the facilities already in Raven. The real problem lies in the Query.

It almost seems like you don't really need a "Special" index for this
to work. This would work fine:
// dictionary Index (Name = Players)
from player in docs.Players
select new { player.Name }

The Client API would need to be altered in some form, maybe like
below?
// Client API
SuggestionQuery("DictionaryIndex").Suggest(term, indexFieldName, numOfSuggestions = 10, accuracy = 0.5f)
SuggestionQuery("Players").Suggest("John", "Name", numOfSuggestions = 10, accuracy = 0.5f)
SuggestionQuery<Player>("Players").Suggest(p => p.Name == "John", numOfSuggestions = 10, accuracy = 0.5f)

All you need is for Raven to scan an entire index using the
SpellChecker.Net library, then spit back a Json/C# object that matches
the following.

public class SuggestionResult
{
    // The term entered by the end user
    public string Term { get; }

    // The dictionary index you searched for suggestions
    public string IndexName { get; }

    // The Lucene index field that you searched on
    public string Field { get; }

    // The suggestions based on the term and dictionary
    public IEnumerable<string> Suggestions { get; }
}

It feels to me like I should create a new Query type, and get the
Raven.Database.QueryRunner class to execute the new SuggestionQuery
and then pump the results into RemoteQueryResults.Results on line 84
of QueryRunner.cs.

Any suggestions? Have I missed something fundamental?


Khalid Abuhakmeh

Oct 7, 2010, 2:21:02 PM
to ravendb
What about a Responder? That seems like an even better approach.

Ayende Rahien

Oct 7, 2010, 3:02:12 PM
to ravendb
I agree that a responder is the way to go about this.
The problem is that you also need to provide access to the actual index, which is why you need to modify RavenDB code.

Khalid Abuhakmeh

Oct 7, 2010, 3:10:44 PM
to ravendb
There are actually two approaches to accessing the index. Which one do
you think is better?

1.) I could access the Lucene index directly, but I would have to
modify DocumentDatabase or create a way to access the Lucene Query
directly.
2.) I could create a new Dictionary called RavenDictionary that
SpellChecker.Net could traverse. I can do this with the
existing facilities found in RequestResponder.

Ayende Rahien

Oct 7, 2010, 3:13:41 PM
to ravendb
I am not sure that I am following you.
What do you mean by RavenDictionary ?

Khalid Abuhakmeh

Oct 7, 2010, 3:31:46 PM
to ravendb
Spellchecker has a Dictionary interface that just expects to get words
back.

https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/contrib/SpellChecker.Net/SpellChecker.Net/Spell/Dictionary.cs

If I can implement a RavenDictionary that does the same, then we don't
need to access the lucene index directly.


Ayende Rahien

Oct 7, 2010, 3:32:59 PM
to ravendb
How would you get access to that?

Khalid Abuhakmeh

Oct 7, 2010, 3:51:40 PM
to ravendb
public class RavenDictionary : SpellChecker.Net.Search.Spell.Dictionary
{
    private readonly QueryResult _queryResult;
    private readonly string _fieldName;

    public RavenDictionary(QueryResult queryResult, string fieldName)
    {
        _queryResult = queryResult;
        _fieldName = fieldName;
    }

    public IEnumerator GetWordsIterator()
    {
        return _queryResult.Results
            .Select(result => result.Value<string>(_fieldName))
            .GetEnumerator();
    }
}

var spellchecker = new SpellChecker.Net.Search.Spell.SpellChecker(
    new RAMDirectory(), new LevenshteinDistance());
var result = Database.Query(indexName, new IndexQuery { FieldsToFetch = new[] { field } });

spellchecker.SetAccuracy(accuracy);
// indexing result again.... not sure about this part with RamDirectory()
spellchecker.IndexDictionary(new RavenDictionary(result, field));

Ayende Rahien

Oct 7, 2010, 3:56:23 PM
to ravendb
That isn't what you want. You don't want the results of the query (which may be nothing).
You want the terms for the field.

Khalid Abuhakmeh

Oct 7, 2010, 4:26:13 PM
to ravendb
I was thinking that the results of the Query (which is everything in
the index) would be your dictionary. I would only be pulling back the
field that the user asks for as a suggestion. Currently I am not allowing
people to filter on an index for the suggestion.

Ayende Rahien

Oct 7, 2010, 4:34:48 PM
to ravendb
You can't get everything in the index, Raven will prevent that (safe by default).
I think that it would be better to integrate it a bit more deeply.

Khalid Abuhakmeh

Oct 7, 2010, 4:44:05 PM
to ravendb
Oh yeah, forgot about that feature. You are right then, I will access
the lucene index directly.

On a side note, what is the best way to test a responder?

Ayende Rahien

Oct 7, 2010, 5:03:42 PM
to ravendb
2 ways.
You can test it via Fiddler, then drop the saved to the Raven Scenarios folder.
Just execute the code and run against it.

I don't suggest trying to test a responder in isolation, just use a RemoteClientTest.

Do note that we probably need to offer embedded & remote versions.

Khalid Abuhakmeh

Oct 8, 2010, 4:52:02 PM
to ravendb
Ok I think I got it working, w00t! (still too early to celebrate)

I tested on an index of 10, and tried to get suggestions from that. I
found the more stuff you had in your index the better chance you had
of getting suggestions, and that playing around with the distance
algorithms gave you different results. This still needs more testing,
I could use a little help in that department and a quick review to
make sure I didn't do anything dumb.

The code has been pushed up here.

http://github.com/khalidabuhakmeh/ravendb

Ayende Rahien

Oct 8, 2010, 8:58:39 PM
to ravendb
Until you get it working, can you sign the contributing agreement here: http://ravendb.net/faq/contributing

Khalid Abuhakmeh

Oct 10, 2010, 10:10:16 AM
to ravendb
Hello Ayende,

I sent the contributor agreement in to you and I have updated the code
found at the repository. I have tested it with an index containing 1000
documents and it works! Right now the code for SpellChecker.Net is
compiled and added as a reference, but in the future it might be
better to pull the code in and allow MEF to get a list of
StringDistance calculators. That way users or you could develop
suggestion algorithms that better suit your needs.

To try it out you will need to run a query like this through Fiddler:

http://localhost:8080/suggest?term=john*&index=PersonsByName&max=5&accuracy=0.1&field=Name&distance=default

and the result might look like:

{"Suggestions":["john summers","john stanley","jameson johns","slade johnston"],"Term":"john*","IndexName":"PersonsByName","Field":"Name","MaxSuggestions":5,"Distance":"Default","Accuracy":0.1}

The API command can be found under DatabaseCommands for both the
embedded and client API as Suggest(SuggestionsQuery q).
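
For anyone calling the endpoint by hand, here is a small C# sketch that assembles the query string shown above. The path and parameter names (term, index, max, accuracy, field, distance) are taken from the example URL; SuggestUrlSketch itself is a hypothetical helper and not part of the RavenDB client.

```csharp
using System;
using System.Globalization;

public static class SuggestUrlSketch
{
    // Build a /suggest URL with the same parameters as the example above.
    public static string Build(string baseUrl, string index, string field,
                               string term, int max, float accuracy, string distance)
    {
        // Invariant culture keeps the decimal separator a '.', whatever the
        // machine's regional settings are.
        return baseUrl.TrimEnd('/') + "/suggest"
            + "?term=" + Uri.EscapeDataString(term)
            + "&index=" + Uri.EscapeDataString(index)
            + "&max=" + max.ToString(CultureInfo.InvariantCulture)
            + "&accuracy=" + accuracy.ToString(CultureInfo.InvariantCulture)
            + "&field=" + Uri.EscapeDataString(field)
            + "&distance=" + Uri.EscapeDataString(distance);
    }
}
```

Note that, depending on the .NET version, Uri.EscapeDataString may percent-encode the trailing * in a term such as john*, so the exact text of the URL can differ from the hand-written one even though it should decode to the same value on the server.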

Hope this helps someone out there.

On Oct 8, 8:58 pm, Ayende Rahien <aye...@ayende.com> wrote:
> Until you get it working, can you sign the contributing agreement here: http://ravendb.net/faq/contributing

Ayende Rahien

Oct 11, 2010, 5:33:36 AM
to rav...@googlegroups.com
Changed to use:

I removed a lot of work that was done in exception catch clauses.

Most importantly, it doesn't have any tests.
And when I tried to write a test, it failed.

[Fact]
public void ExactMatch()
{
    using(var store = NewDocumentStore())
    {
        store.DatabaseCommands.PutIndex("Test", new IndexDefinition
        {
            Map = "from doc in docs select new { doc.Name }",
            Indexes = {{"Name", FieldIndexing.Analyzed}}
        });
        using(var s = store.OpenSession())
        {
            s.Store(new User{Name = "Ayende"});
            s.Store(new User { Name = "Oren" });
            s.SaveChanges();

            s.Query<User>("Test").Customize(x => x.WaitForNonStaleResults()).ToList();
        }

        using (var s = store.OpenSession())
        {
            var suggestionQueryResult = s.Advanced.DatabaseCommands.Suggest("Test",
                new SuggestionQuery
                {
                    Field = "Name",
                    Term = "Oren",
                    MaxSuggestions = 10
                });

            Assert.Equal(1, suggestionQueryResult.Suggestions.Length);
            Assert.Equal("Oren", suggestionQueryResult.Suggestions[0]);
        }
    }
}

You can see all of my changes here: