ngram substring searching

Cameron Hurst

Dec 17, 2010, 1:34:00 PM

to Sunspot

While working my way through the Solr manual and Sunspot, I came
across an odd point having to do with ngrams, specifically on this
page dealing with edge n-grams:
https://github.com/outoftime/sunspot/wiki/Wildcard-searching-with-ngrams

The suggestion is to add the EdgeNGram filter to the text analyzer,
which is good, but in the current configuration it applies the edge
n-grams to both queries and indexing, where it should only be used at
index time. If you use it as a query filter, you will essentially be
making up to an additional 15 queries if you hit the maximum length
exactly, and you would be diluting your search effectiveness in terms
of score, since you add in all these shorter (and presumably more
common) prefixes when matching the full 15-character term would
already give you a sufficiently high score.

My suggestion would be to split the analyzer into two separate
sections in the schema.xml file, one for indexing and one for
queries, so it would look something like this:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

As I wrote this up, I noticed that the substring matching example
already does this, but the same split should be carried over to this
page as well.
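
To illustrate the point about keeping edge n-grams out of the query analyzer, here is a rough sketch in Python. It is only a toy model of the matching, not how Solr works internally: the function names (edge_ngrams, analyze, search) are made up for this example, tokenizing is a plain whitespace split, and stop words and scoring are ignored. It assumes minGramSize=3 and maxGramSize=15 as in the schema above.

```python
def edge_ngrams(token, min_size=3, max_size=15):
    """Front edge n-grams of a token (hypothetical helper, not a Solr API)."""
    upper = min(max_size, len(token))
    return {token[:n] for n in range(min_size, upper + 1)}


def analyze(text, use_ngrams):
    """Lowercase, split on whitespace, and optionally expand to edge n-grams."""
    tokens = text.lower().split()
    if not use_ngrams:
        return set(tokens)
    terms = set()
    for token in tokens:
        terms |= edge_ngrams(token)
    return terms


def search(docs, query, ngram_query):
    """Return docs whose index-time terms share at least one term with the query."""
    query_terms = analyze(query, use_ngrams=ngram_query)
    # Documents are always expanded to edge n-grams at index time.
    return [doc for doc in docs if analyze(doc, use_ngrams=True) & query_terms]


docs = ["Elizabeth", "Elizabeths", "Elite"]

# Query analyzer without edge n-grams: only real prefix matches come back.
print(search(docs, "eliz", ngram_query=False))

# Query analyzer with edge n-grams: the query also becomes "eli",
# which drags in "Elite" as well.
print(search(docs, "eliz", ngram_query=True))
```

In this toy model the first search returns only the two "Elizabeth" entries, while the second returns all three documents because the shorter gram "eli" matches "Elite" too.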

Kevin

Dec 17, 2010, 2:02:43 PM

to Sunspot

Thanks for this.

Where'd you come up with the 15 queries, out of curiosity?

-Kevin

On Dec 17, 10:34 am, Cameron Hurst <cameron.a.hu...@gmail.com> wrote:
> Working my way through solr I have been working my way through its
> manual as well as sunspot and came up with an odd point having to do
> with ngrams, specifically this page dealing with edgengrams.
> https://github.com/outoftime/sunspot/wiki/Wildcard-searching-with-ngrams

Cameron Hurst

Dec 17, 2010, 2:42:12 PM

to Sunspot

Sorry, 15 should have been changed to 14. It will query the database
for each of the edge n-grams it comes up with, so I was using a
worst-case scenario where it splits a word into chunks from 2 letters
up to 15 letters. I just did bad math and ignored the minimum of 2, so
it should have been 14.
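
The count can be checked with a quick sketch. This is plain Python, not Solr code; edge_ngrams is a made-up helper and the 15-letter word is just an example chosen to hit the maximum length.

```python
def edge_ngrams(token, min_size, max_size):
    """Front edge n-grams of token, from min_size up to max_size characters."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


word = "internationally"  # 15 letters, so it reaches the maximum gram size

grams_from_2 = edge_ngrams(word, 2, 15)
grams_from_3 = edge_ngrams(word, 3, 15)

print(len(grams_from_2))  # 14 grams for sizes 2 through 15
print(len(grams_from_3))  # 13 grams for sizes 3 through 15
```

With a minimum of 2 the worst case is 14 grams. The schema posted earlier in the thread uses minGramSize="3", which by the same arithmetic would give 13.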

kja...@gmail.com

Apr 8, 2011, 12:09:22 AM

to ruby-s...@googlegroups.com

I am having trouble getting ngrams to work. Here's my schema.xml:

      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>

        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
      </analyzer>
    </fieldType>

My database has a bunch of entries with "Elizabeth" and "Elizabeths". When I query on "Elizabeth", I get only "Elizabeth" and not "Elizabeths".
The odd thing is, when I check the Solr admin, the Analysis page shows that the EdgeNGramFilterFactory is indeed available, and results in "Elizabeths" being expanded into

e
el
eli
eliz
eliza
elizab
elizabe
elizabet
elizabeth

It seems like the indexer isn't picking up on this. I have restarted Sunspot and reindexed multiple times. No dice. Any ideas? How can I directly check the indexed words list?
