Surprising Results with NGram Analyzer in Raven DB Build 355220


Shubha Lakshmi

unread,
Oct 18, 2017, 1:12:01 AM10/18/17
to RavenDB - 2nd generation document database
I have tried to compare the performance of the NGram analyzer and the Lucene full-text Standard Analyzer over some string columns in RavenDB.

Describing the scenario here:
NGram settings: min: 3, max: 100
String columns: Plan, DPlan
Integer columns: Weight, Cube

Please note that "Contains" and "Not Contains" are implemented through the "Search" method on "IRavenQueryable".

An example of a value stored in the string columns: 122-456-789-100

Scenario 1: Using NGram for both Contains and Not Contains

Filter: Plan contains 100 -- 894 ms
Plan not contains 100 -- 5.51 s
Plan contains 100 AND DPlan contains 123 -- 2.21 s
Plan contains 100 AND Weight > 100 AND Cube < 20 -- 3.61 s
Plan contains 122 -- 962 ms
Plan not contains 122 -- 5.96 s

Scenario 2: Using NGram for Contains and the Standard Analyzer for Not Contains

Filter: Plan contains 100 -- 894 ms
Plan not contains 100 -- 977 ms
Plan contains 100 AND DPlan contains 123 -- 2.21 s
Plan contains 100 AND Weight > 100 AND Cube < 20 -- 3.61 s
Plan contains 122 -- 962 ms
Plan not contains 122 -- 902 ms

Scenario 3: Using the Standard Analyzer for both Contains and Not Contains

Filter: Plan contains 100 -- 259 ms
Plan not contains 100 -- 977 ms
Plan contains 100 AND DPlan contains 123 -- 415 ms
Plan contains 100 AND Weight > 100 AND Cube < 20 -- 875 ms
Plan contains 122 -- 801 ms
Plan not contains 122 -- 902 ms


Total number of entities : 175824

As we can see, the default analyzer gives much better results than NGram.
I tried changing some settings such as Raven/NewIndexInMemoryMaxMB and Raven/Indexing/FlushIndexToDiskSizeInMb in Raven.Server.exe.config, but they had no impact.

A few questions on the same:

1. Are there any settings that would keep all the terms emitted by the NGram index in memory?

2. Why is "Not Contains" so slow, in spite of using the NGram analyzer?

3. Does using wildcards in the Search method degrade NGram index performance?

TIA



Grisha Kotler

unread,
Oct 18, 2017, 2:07:06 AM10/18/17
to rav...@googlegroups.com
Hi,

NGram has already generated the required terms; searching with wildcards then searches inside each term.
For example, the phrase "Hello World" produces the terms: "hel", "ell", "llo", "wor", "orl", "rld",
while the standard analyzer produces the terms: "hello", "world".

You shouldn't use wildcards in the Search method when using the NGram analyzer.
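To make the term generation concrete, here is a minimal Python sketch (not RavenDB code; it assumes lower-casing, per-word grams, and 3-character grams as in the example above):

```python
def ngram_terms(text, min_gram=3, max_gram=3):
    """Emit the n-gram terms an NGram-style analyzer would produce per word."""
    terms = []
    for word in text.lower().split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):
                terms.append(word[i:i + n])
    return terms

print(ngram_terms("Hello World"))
# ['hel', 'ell', 'llo', 'wor', 'orl', 'rld']
```

Since every substring of the configured lengths is already a term, a plain term lookup finds substrings directly; adding wildcards on top of that makes Lucene scan inside each of these (already numerous) terms.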


Hibernating Rhinos Ltd

Grisha Kotler l RavenDB Core Team Developer Mobile: +972-54-586-8647

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

RavenDB, paving the way to "Data Made Simple" - http://ravendb.net/


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shubha Lakshmi

unread,
Oct 18, 2017, 2:25:29 AM10/18/17
to RavenDB - 2nd generation document database
Thanks. But for some patterns, like 123*456*789 (123, followed by anything, followed by 456, ...) and 123*456-567*900, I am not sure how to avoid using wildcards. Also, why would "Not Contains" be so slow? Is it due to the fact that all the terms need to be scanned? And do any settings in Raven.Server.exe.config need to be changed to ensure that the entire index is kept in memory at all times?

Tal Weiss

unread,
Oct 18, 2017, 2:26:22 AM10/18/17
to RavenDB - 2nd generation document database
"Not contains" is expensive due to the way Lucene implements NOT queries: a NOT query in Lucene is basically "give me everything" AND NOT (query result).
Lucene has to fetch everything and then remove entries from that full set, so having a query with just a NOT in it is really not recommended.
The NGram analyzer's purpose is to locate substrings fast; if you are trying to find whole words, the standard analyzer will be much faster, since it has a lot less work to do and far fewer terms.
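The cost model above can be sketched in a few lines of Python (a hypothetical model of the behavior described, not Lucene code): a positive term query touches only one posting list, while a NOT query has to materialize the whole document set and subtract the matches.

```python
def search(index, term):
    """Inverted-index lookup: cheap, touches only the posting list for `term`."""
    return index.get(term, set())

def search_not(index, all_doc_ids, term):
    """A NOT query is 'everything' minus the matches: cost scales with ALL docs."""
    return all_doc_ids - search(index, term)

index = {"100": {1, 3}, "122": {2, 3}}
all_docs = {1, 2, 3, 4}
print(search_not(index, all_docs, "100"))
# {2, 4}
```

This is why the analyzer choice barely helps a pure NOT query: the dominant cost is enumerating "everything", which is the same regardless of how the terms were produced.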
--

Hibernating Rhinos Ltd

Tal Weiss l Core Team Developer Mobile: +972-54-802-4849

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811 l Skype: talweiss1982

Tal Weiss

unread,
Oct 18, 2017, 2:30:22 AM10/18/17
to RavenDB - 2nd generation document database
Shubha, it might be the case that writing a dedicated analyzer would boost your performance by a lot and yield better results.
If you go this way, please make sure to reuse the token stream, and be aware that any allocation you do in the analyzer will cost you a lot, so try to stick to zero-allocation patterns.





Shubha Lakshmi

unread,
Oct 18, 2017, 3:43:07 AM10/18/17
to RavenDB - 2nd generation document database
Thanks, I removed the leading and trailing wildcards and am getting good results for "Contains". For now I am using the Standard Analyzer for "Not Contains", until I find a better strategy.


Mrinal Kamboj

unread,
Oct 18, 2017, 10:32:47 AM10/18/17
to RavenDB - 2nd generation document database
Tal, are there any examples of how to write a custom analyzer for a specific requirement, which we could use to get started? Also, would it be recommended for us to create an analyzer that applies a regex filter to the column data, which is not supported by default?


Tal Weiss

unread,
Oct 18, 2017, 11:35:18 AM10/18/17
to RavenDB - 2nd generation document database
I have written the below analyzer:
And filter:
It doesn't do anything too special; it does the same thing the Lucene standard analyzer does, but it does so in a single phase, unlike Lucene's, which uses three separate phases.
Note that this is from the v3.5 branch, but the code should compile on v4.0 with perhaps some namespace changes (not sure).

Now, regarding your question about an analyzer for regex: you can write an analyzer that applies a regex and returns matches, but the problem is that you won't be able to modify the regex; each analyzer will have a hardcoded regex.
If your requirement allows you to use one regex per field, then it can easily be done; if you need to query a field by different regexes, I'm afraid you won't have a way to do so.
You could support regex1 OR/AND regex2 OR/AND regex3, but not choose which regex to query by.
If you have a small set of regexes, you can do this trick in the select:
select new
{
  MyOriginalField = doc.Field,
  regex1 = doc.Field,
  regex2 = doc.Field,
  regex3 = doc.Field
}

and have each field analyzed by a different regex analyzer.



Oren Eini (Ayende Rahien)

unread,
Oct 19, 2017, 2:22:55 AM10/19/17
to ravendb
Note that it is easier for us to handle the regex directly in the index:

select new
{
  MyOriginalField = doc.Field,
  regex1 = Regex.IsMatch(doc.Field, @"\s"),
  regex2 = Regex.IsMatch(doc.Field, @"[A-Z]")
}
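The idea behind this projection can be sketched in Python (a conceptual model, not RavenDB's actual indexing pipeline; the field names mirror the snippet above): each regex is fixed at index-definition time, its match result is stored as its own field, and queries then become cheap term lookups on those precomputed fields instead of per-query regex scans.

```python
import re

# Regexes fixed at index-definition time (mirrors the hardcoded-regex limitation).
REGEXES = {
    "regex1": re.compile(r"\s"),
    "regex2": re.compile(r"[A-Z]"),
}

def index_entry(field_value):
    """Build the index-time projection: original field plus one flag per regex."""
    entry = {"MyOriginalField": field_value}
    for name, rx in REGEXES.items():
        entry[name] = rx.search(field_value) is not None
    return entry

print(index_entry("Hello World"))
# {'MyOriginalField': 'Hello World', 'regex1': True, 'regex2': True}
```

Querying "regex1 = true" is then an ordinary term query, which is why computing the match at indexing time sidesteps the hardcoded-regex-analyzer limitation entirely.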

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: +972-52-548-6969
