Turkish I problem on queries and custom TurkishAnalyzer for Lucene.Net

tugber...@gmail.com

Jul 15, 2013, 8:13:49 AM
to rav...@googlegroups.com
Problem:

If I store Turkish text in RavenDB and run a query, the default LowerCaseKeywordAnalyzer fails on the notorious, pain-in-the-backside Turkish "I" casing. Repro:

// Namespaces below are the ones assumed for the RavenDB 2.x client
using System.Collections.Generic;
using System.Linq;
using Raven.Client;
using Raven.Client.Document;
using Raven.Client.Extensions; // EnsureDatabaseExists

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";

        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);

        using (var ses = store.OpenSession())
        {
            var user = new User { Name = "Irmak", Roles = new List<string> { "adMin", "GuEst" } };
            ses.Store(user);
            ses.SaveChanges();

            // This fails due to the Turkish I
            var user1 = ses.Query<User>().Where(usr => usr.Name == "ırmak").FirstOrDefault();

            // This finds Name:Irmak
            var user2 = ses.Query<User>().Where(usr => usr.Name == "IrMak").FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}
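
The mismatch can be reproduced without RavenDB at all. The following is only a minimal sketch, assuming the indexed term and the query term are both lowercased with culture-invariant rules before being compared (which matches the ToLowerInvariant behavior mentioned later in this thread). TurkishIDemo and its methods are made-up names for illustration, and the Turkish variant assumes the runtime has "tr-TR" culture data available.

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Assumption: the runtime ships Turkish culture data.
    private static readonly CultureInfo Turkish = new CultureInfo("tr-TR");

    // Compares two terms after culture-invariant lowercasing.
    // Invariant rules map 'I' to 'i' and leave the dotless 'ı' (U+0131) unchanged.
    public static bool MatchesInvariant(string stored, string query)
    {
        return stored.ToLowerInvariant() == query.ToLowerInvariant();
    }

    // Compares two terms after Turkish lowercasing.
    // Turkish rules map 'I' to 'ı' (U+0131) and 'İ' (U+0130) to 'i'.
    public static bool MatchesTurkish(string stored, string query)
    {
        return stored.ToLower(Turkish) == query.ToLower(Turkish);
    }
}
```

Under those assumptions, TurkishIDemo.MatchesInvariant("Irmak", "ırmak") is false while TurkishIDemo.MatchesInvariant("Irmak", "IrMak") is true, which mirrors the two queries in the repro; TurkishIDemo.MatchesTurkish("Irmak", "ırmak") is true.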

I'm not sure, but I think my problem can be solved by setting a custom analyzer for the Name field as described here: http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options#using-a-non-default-analyzer However, there is no suitable Lucene.Net analyzer for this case that I'm aware of. Apache Lucene (Java) seems to have one: http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishAnalyzer.html

AFAIK, RavenDB supports dropping in custom analyzers as indicated below:

You can also create your own custom analyzer, compile it to a dll and drop it in a directory called "Analyzers" under the RavenDB base directory. Afterward, you can use the fully qualified type name of your custom analyzer as the analyzer for a particular field.

Before going ahead and creating one from scratch, I would like to ask whether this is a known issue (well, if we can call it an issue) and whether a solution is already available for RavenDB.

Thanks!

Mircea Chirea

Jul 15, 2013, 9:52:37 AM
to rav...@googlegroups.com
Yes, it is a known problem. We use the following analyzer for full text search; the way it works is by normalizing all Unicode text into ASCII.
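
The analyzer referred to here is not reproduced in this copy of the thread. As a rough illustration of what "normalizing Unicode text into ASCII" can look like, here is a minimal sketch that uses only the .NET base class library rather than Lucene.Net. AsciiFoldingSketch and Fold are hypothetical names. One detail worth noting: the dotless ı (U+0131) has no canonical decomposition, so it needs an explicit mapping, whereas İ (U+0130) decomposes into I plus a combining dot.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AsciiFoldingSketch
{
    // Folds accented Latin text toward plain lowercase ASCII.
    // Only diacritics and the dotless i are handled; this is not a full folding table.
    public static string Fold(string input)
    {
        // Split each character into its base letter plus combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop combining marks such as the cedilla, breve or dot above.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Dotless i does not decompose, so map it by hand.
            sb.Append(c == '\u0131' ? 'i' : c);
        }

        return sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }
}
```

With this approach "Irmak", "ırmak" and "İrmak" all fold to the same term, so any of them would match. The trade-off is that ı and i can no longer be told apart in queries.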

Itamar Syn-Hershko

Jul 15, 2013, 10:09:55 AM
to rav...@googlegroups.com
Calling reader.ReadToEnd is inefficient.

You want to use this: https://github.com/apache/lucene.net/blob/3.0.3/src/core/Analysis/ASCIIFoldingFilter.cs Add it to the filter chain of your analyzer (by providing your own analyzer).



Oren Eini (Ayende Rahien)

Jul 15, 2013, 10:18:47 AM
to ravendb
We are actually using ToLowerInvariant, which should alleviate this problem, no?



tugber...@gmail.com

Jul 15, 2013, 10:44:40 AM
to rav...@googlegroups.com
For the usual cases, absolutely yes. However, that is exactly what creates the problem in my case. I modified RavenDB's default analyzer, as you can see:
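
The modified analyzer was attached as an embed that is not reproduced in this copy of the thread. The core idea is presumably to replace the invariant per-character lowercasing with Turkish-culture lowercasing while still emitting the whole field value as a single term. The sketch below shows that idea without any Lucene.Net dependency; TurkishKeywordNormalizer, Normalize and ToKeyword are hypothetical names and not the types inside LuceneAnalyzers.dll. It assumes the runtime has "tr-TR" culture data.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class TurkishKeywordNormalizer
{
    // Assumption: the runtime ships Turkish culture data.
    private static readonly CultureInfo Turkish = new CultureInfo("tr-TR");

    // The only point that differs from invariant lowercasing:
    // 'I' becomes 'ı' (U+0131) and 'İ' (U+0130) becomes 'i'.
    public static char Normalize(char c)
    {
        return char.ToLower(c, Turkish);
    }

    // Treats the entire input as one keyword term, reading it in
    // fixed-size chunks and normalizing each character on the way.
    public static string ToKeyword(TextReader reader)
    {
        var buffer = new char[256];
        var term = new StringBuilder();
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
                term.Append(Normalize(buffer[i]));
        }

        return term.ToString();
    }
}
```

Both the stored value and the query text have to go through the same normalization. For example, TurkishKeywordNormalizer.ToKeyword(new StringReader("Irmak")) and the same call with "ırmak" produce the identical term "ırmak".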


I dropped LuceneAnalyzers.dll into the /Server/Analyzers folder and created the following index:

    public class Users : AbstractIndexCreationTask<User>
    {
        public Users()
        {
            Map = users => from user in users
                           select new
                           {
                               user.Name
                           };

            Analyzers.Add(x => x.Name, typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer).FullName);
        }
    }


This sometimes throws on index creation and sometimes doesn't. When it doesn't throw on index creation, it throws the following error during the query:

An unhandled exception of type 'System.InvalidOperationException' occurred in Raven.Client.Lightweight.dll
Additional information: Url: "/indexes/Users?query=Name%253A%25C4%25B1rmak&pageSize=1"


System.InvalidOperationException: Cannot find analyzer type 'LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer' for field: Name
   at Raven.Database.Extensions.IndexingExtensions.CreateAnalyzerInstance(String name, String analyzerTypeAsString) in c:\Builds\RavenDB-Stable\Raven.Database\Extensions\IndexingExtensions.cs:line 48
   at Raven.Database.Indexing.Index.CreateAnalyzer(Analyzer defaultAnalyzer, ICollection`1 toDispose, Boolean forQuerying) in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\Index.cs:line 451
   at Raven.Database.Indexing.Index.IndexQueryOperation.GetLuceneQuery(String query, IndexQuery indexQuery) in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\Index.cs:line 1110
   at Raven.Database.Indexing.Index.IndexQueryOperation.GetLuceneQuery() in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\Index.cs:line 1081
   at Raven.Database.Indexing.Index.IndexQueryOperation.<Query>d__56.MoveNext() in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\Index.cs:line 803
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at Raven.Database.DocumentDatabase.<>c__DisplayClass97.<Query>b__8e(IStorageActionsAccessor actions) in c:\Builds\RavenDB-Stable\Raven.Database\DocumentDatabase.cs:line 1220
   at Raven.Storage.Esent.TransactionalStorage.ExecuteBatch(Action`1 action) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 558
   at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 516
   at Raven.Database.DocumentDatabase.Query(String index, IndexQuery query) in c:\Builds\RavenDB-Stable\Raven.Database\DocumentDatabase.cs:line 1237
   at Raven.Database.Server.Responders.Index.PerformQueryAgainstExistingIndex(IHttpContext context, String index, IndexQuery indexQuery, Guid& indexEtag) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Index.cs:line 499
   at Raven.Database.Server.Responders.Index.ExecuteQuery(IHttpContext context, String index, Guid& indexEtag) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Index.cs:line 436
   at Raven.Database.Server.Responders.Index.GetIndexQueryResult(IHttpContext context, String index) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Index.cs:line 375
   at Raven.Database.Server.HttpServer.DispatchRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 864
   at Raven.Database.Server.HttpServer.HandleActualRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 609

Am I doing something wrong here?

Oren Eini (Ayende Rahien)

Jul 15, 2013, 10:45:46 AM
to ravendb
Did you put it in the Analyzers folder?
Are you running in Web or in Service mode?

Oren Eini (Ayende Rahien)

Jul 15, 2013, 10:45:55 AM
to ravendb
You might want to enable the fusion log, as well.

tugber...@gmail.com

Jul 15, 2013, 10:50:45 AM
to rav...@googlegroups.com
Thanks.

I created a custom analyzer, too (nearly all of the code comes from Raven's default implementation), but I'm having a problem registering it. Dropping the compiled assembly into the Analyzers folder doesn't do anything.

Oren Eini (Ayende Rahien)

Jul 15, 2013, 10:54:26 AM
to ravendb
You have to restart the server.



tugber...@gmail.com

Jul 15, 2013, 10:58:39 AM
to rav...@googlegroups.com
I'm running it through Start.cmd and I have put my assembly in the Server/Analyzers folder. Is that the wrong place?

tugber...@gmail.com

Jul 15, 2013, 11:04:23 AM
to rav...@googlegroups.com
Yep, I restarted it. Still the same behavior. I'm getting the following error on index creation:

An unhandled exception of type 'System.InvalidOperationException' occurred in Raven.Client.Lightweight.dll
Additional information: Url: "/indexes/Users"


System.ArgumentException: Could not create analyzer for field: 'Name' because the type 'LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer' was not found
   at Raven.Database.Indexing.IndexStorage.AssertAnalyzersValid(IndexDefinition indexDefinition) in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\IndexStorage.cs:line 385
   at Raven.Database.Indexing.IndexStorage.CreateIndexImplementation(IndexDefinition indexDefinition) in c:\Builds\RavenDB-Stable\Raven.Database\Indexing\IndexStorage.cs:line 376
   at Raven.Database.DocumentDatabase.PutIndex(String name, IndexDefinition definition) in c:\Builds\RavenDB-Stable\Raven.Database\DocumentDatabase.cs:line 1092
   at Raven.Database.Server.Responders.Index.Put(IHttpContext context, String index) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Index.cs:line 83

   at Raven.Database.Server.HttpServer.DispatchRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 864
   at Raven.Database.Server.HttpServer.HandleActualRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 609

Oren Eini (Ayende Rahien)

Jul 15, 2013, 11:06:37 AM
to ravendb
Did you check the fusion log?

Oren Eini (Ayende Rahien)

Jul 15, 2013, 11:06:53 AM
to ravendb
Are you sure that you are compiling against the same RavenDB and Lucene.Net versions that RavenDB uses?

tugber...@gmail.com

Jul 15, 2013, 11:07:44 AM
to rav...@googlegroups.com
BTW, my RavenDB version is 2.0.3.2375 (got this version through CommonAssemblyInfo.cs)



Mircea Chirea

Jul 15, 2013, 4:06:08 PM
to rav...@googlegroups.com
That's... I mean... I can't really say much about sticking every damn Unicode character (or close to) as cases in a switch.

ReadToEnd may be inefficient, but we don't have the volume of data for that to matter. I prefer my implementation: it is simple and does the job fine.

Mircea Chirea

Jul 15, 2013, 4:07:28 PM
to rav...@googlegroups.com
That doesn't guarantee that Unicode text is transformed properly. It simply uses the invariant culture, which is based on English and is not tied to any specific locale.

tugber...@gmail.com

Jul 16, 2013, 2:37:34 AM
to rav...@googlegroups.com
I'm using Lucene.Net 3.0.3 (as you can see here: https://github.com/tugberkugurlu/LuceneAnalyzers/tree/master/src/LuceneAnalyzers) and the server is using the same version. Do you see anything wrong in the project?

BTW, how can I see the fusion log that you mentioned?

tugber...@gmail.com

Jul 16, 2013, 2:51:15 AM
to rav...@googlegroups.com
Oren,

What I was doing wrong was passing the FullName of the type instead of its AssemblyQualifiedName. I changed my index creation code to the following:

    public class Users : AbstractIndexCreationTask<User>
    {
        public Users()
        {
            Map = users => from user in users
                           select new
                           {
                               user.Name
                           };

            Analyzers.Add(x => x.Name, typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer).AssemblyQualifiedName);
        }
    }

Now the following query works like a charm:

var retrievedUser = ses.Query<User, Users>().Where(usr => usr.Name == "ırmak").FirstOrDefault();

I guess the documentation misled me yesterday: http://ravendb.net/docs/appendixes/lucene-indexes-usage#using-custom-analyzers It's either wrong or very outdated.

Thanks for the help.

Oren Eini (Ayende Rahien)

Jul 16, 2013, 2:51:03 AM
to ravendb
You need to use the version RavenDB uses, which is a custom build.

Oren Eini (Ayende Rahien)

Jul 16, 2013, 2:56:45 AM
to ravendb
Sigh, it is correct, but only if you are using Lucene's built-in analyzers.
I'll fix that.
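
The difference between the two names can be seen with plain .NET reflection. This is only a sketch of general Type.GetType behavior. How RavenDB itself resolves analyzer names is not shown in this thread, so treat the connection as an assumption that happens to fit the symptom. TypeNameResolutionDemo is a made-up name, and System.Linq.Enumerable merely stands in for any type that lives outside the calling assembly and the core library.

```csharp
using System;

public static class TypeNameResolutionDemo
{
    // Namespace-qualified name only, with no assembly information.
    public static string ShortName()
    {
        return typeof(System.Linq.Enumerable).FullName;
    }

    // Same name plus assembly name, version, culture and public key token.
    public static string QualifiedName()
    {
        return typeof(System.Linq.Enumerable).AssemblyQualifiedName;
    }

    // Without an assembly name, Type.GetType looks only in the calling
    // assembly and the core library, so types from other assemblies
    // come back as null.
    public static bool Resolves(string typeName)
    {
        return Type.GetType(typeName) != null;
    }
}
```

Here TypeNameResolutionDemo.Resolves(TypeNameResolutionDemo.ShortName()) is false, while the same call with QualifiedName() is true. A custom analyzer compiled into its own dll is in the same position as the stand-in type, which would explain why FullName was not enough.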

Tugberk Ugurlu

Jul 17, 2013, 2:26:55 AM
to rav...@googlegroups.com
FYI, I wrote a blog post about the solution I applied for this issue: http://www.tugberkugurlu.com/archive/turkish-i-problem-on-ravendb-and-solving-it-with-custom-lucene-analyzers

Please let me know if there is a better solution for this.

Thanks all for the help.