Upgrade to Lucene 5

Kai Hüner

Oct 14, 2015, 12:23:40 PM

to duke

Dear Duke community,

We are using Duke in a Maven project that has other dependencies on Lucene (i.e. from the Solr framework). Because of conflicting Lucene versions, we are having problems with the default LuceneDatabase, which uses an older Lucene version (4.0.0, released in 2012). To resolve these conflicts, we have compiled a custom Duke version (forked from the current 1.3 version) against Lucene 5.2.1. Only a few lines of code had to be changed to get it to work. Of course, this is not a complete migration to Lucene 5, because none of the new Lucene features is used and some of the Lucene functions that are used are deprecated. However, it works and makes integration with other Lucene-based components easier.

What is your opinion:

Is it worth starting the migration to Lucene 5?

Should we provide a pull request with our code?

And could this upgrade be included in the next release that is published to the central Maven repositories? This would really simplify integration (in our case).

Looking forward to your comments,

best regards,

kai

Lars Marius Garshol

Oct 14, 2015, 3:24:58 PM

to duke

* Kai Hüner

> Is it worth starting the migration to Lucene 5?

I made an initial attempt at migration a long time ago, but unfortunately I found that the compression scheme used in newer Lucene versions causes Duke performance to drop quite badly. If I remember correctly, performance was only half of what it was with earlier Lucene versions. I found an SPI for plugging in my own compression, which should have made it easy to plug in an implementation that does no compression at all, but I couldn't get it to work. So I dropped the whole thing.

But, yes, it would definitely make sense to make a new attempt. Continuing to rely on an antiquated Lucene version isn't really an option.

> Should we provide a pull request with our code?

Yes, please do.

> And could this upgrade be included in the next release that is published to the central Maven repositories? This would really simplify integration (in our case).

I'll try to make that happen, and will certainly bring the issue up for discussion if it means trading off performance again.

--Lars Marius
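
The compression concern in the message above can be illustrated with a small stand-alone sketch. It assumes nothing about Lucene's internals: it uses java.util.zip (DEFLATE) purely as a stand-in for whatever scheme Lucene applies to stored data, and the class and method names are hypothetical. The point is only that when every record lookup has to decompress before the comparators can run, a read-heavy workload pays that cost on each lookup.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: NOT Lucene code. DEFLATE is a stand-in for the
// compression applied to stored records.
public class CompressionCostSketch {

    // Compress a record once, at "index time".
    public static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a record; this runs on every lookup.
    public static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input, stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed record", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A made-up address record, repeated so there is something to compress.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("name=Example Person;street=Main Street 1;city=Oslo;");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);

        int lookups = 20_000;
        long checksum = 0; // keeps the JIT from discarding the work

        long t0 = System.nanoTime();
        for (int i = 0; i < lookups; i++) {
            checksum += raw.clone().length;        // uncompressed lookup
        }
        long rawNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < lookups; i++) {
            checksum += decompress(packed).length; // compressed lookup
        }
        long packedNanos = System.nanoTime() - t1;

        System.out.println("raw lookups:        " + rawNanos / 1_000_000 + " ms");
        System.out.println("compressed lookups: " + packedNanos / 1_000_000 + " ms");
        System.out.println("checksum: " + checksum);
    }
}
```

The absolute numbers from such a micro-test say little about Lucene itself; it only shows where the extra per-lookup work would come from.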

Kai Hüner

Oct 15, 2015, 3:53:11 AM

to duke-...@googlegroups.com

Hey Lars,

Thanks for your feedback.

I will provide a pull request with minimal revision, just to get Lucene 5 running.

With this code, we did not notice any performance issues -- but, yes, we have to check that thoroughly.

I will try to provide some benchmarks as well.

By the way: You really did a GREAT job with this framework!!


Lars Marius Garshol

Oct 15, 2015, 7:01:02 AM

to duke

* Kai Hüner

> I will provide a pull request with minimal revision, just to get Lucene 5 running.

That's great, thank you. :) I'll do some benchmarks on it, too, and then we can compare notes.

> By the way: You really did a GREAT job with this framework!!

Thank you. :-)

--Lars Marius

Kai Hüner

Oct 18, 2015, 11:34:22 AM

to duke-...@googlegroups.com

Hey Lars,

I have opened a pull request with the minimal code changes needed to use Lucene 5.

https://github.com/larsga/Duke/pull/217

To avoid ambiguities: This is NOT a complete upgrade to Lucene 5, but just a change of the Maven dependencies to the Lucene 5 libraries plus the code changes required to make it run.

I also compared the performance of this version with the version that is currently available in the Maven repositories (1.2) by deduplicating 20k address records. With "fuzzy-search" both activated and deactivated in the LuceneDatabase configuration, I did not notice any performance drawbacks.

However, my "comparison" is not a thorough benchmark of Lucene performance, because I just measured the duration of the entire Processor.deduplicate function. And of course, most of the time is taken by the comparators and not by the Lucene part...

To be honest: I have no idea how to separate the Lucene part from the comparators for a thorough test of Lucene performance :/ We have always used your framework as a "black box" and have never run just a single phase. Of course, if you could provide some hints or a basic test draft, I would provide better benchmarking results.

Best regards,

kai


Lars Marius Garshol

Oct 20, 2015, 4:41:37 AM

to duke

* Kai Hüner

> I have opened a pull request with the minimal code changes needed to use Lucene 5.
> https://github.com/larsga/Duke/pull/217
> To avoid ambiguities: This is NOT a complete upgrade to Lucene 5, but just a change of the Maven dependencies to the Lucene 5 libraries plus the code changes required to make it run.

Thank you! I'll experiment a bit with this.

> I also compared the performance of this version with the version that is currently available in the Maven repositories (1.2) by deduplicating 20k address records. With "fuzzy-search" both activated and deactivated in the LuceneDatabase configuration, I did not notice any performance drawbacks.
>
> However, my "comparison" is not a thorough benchmark of Lucene performance, because I just measured the duration of the entire Processor.deduplicate function. And of course, most of the time is taken by the comparators and not by the Lucene part...

I think with 20k records Lucene is going to perform very well, and you're not really going to be able to compare the two cases at all. You need a few hundred thousand records before Lucene performance really starts dropping off.

> To be honest: I have no idea how to separate the Lucene part from the comparators for a thorough test of Lucene performance :/ We have always used your framework as a "black box" and have never run just a single phase. Of course, if you could provide some hints or a basic test draft, I would provide better benchmarking results.

If you run Duke from the command line with "--profile", it will show you performance data. Here's the output from a trivial example:

    Duke version 1.3-SNAPSHOT, build 3,629 (2015-10-10), built by lars.garshol
    InMemoryDatabase
    Threads: 1
    231 processed, 1054 records/second; comparisons: 67221
    Run completed, 916 records/second
    231 records total in 0 seconds
    Reading from source: 0 (0%)
    Indexing: 0 (0%)
    Searching: 0 (0%)
    Comparing: 0 (98%)
    Callbacks: 0 (1%)
    Total memory: 257425408, free memory: 210951096, used memory: 46474312

--Lars Marius
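
For readers who want to see what a per-phase breakdown like the one in the message above involves, here is a minimal sketch of a phase timer. This is not how Duke implements "--profile"; the PhaseTimer class, its methods, and the dummy search and compare steps are all hypothetical and only show the general idea of accumulating elapsed time per named phase and reporting each phase's share of the total.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper, not part of Duke: accumulates wall-clock time per phase.
public class PhaseTimer {
    // LinkedHashMap keeps phases in the order they were first seen.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    // Run the given work and add its elapsed time to the named phase.
    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Elapsed nanoseconds recorded for one phase (0 if never timed).
    public long nanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    // Sum of all recorded phases.
    public long totalNanos() {
        long total = 0;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total;
    }

    // One line per phase: milliseconds and percentage of the total.
    public String report() {
        long total = Math.max(1, totalNanos()); // avoid division by zero
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%s: %d ms (%d%%)%n",
                    e.getKey(), e.getValue() / 1_000_000, 100 * e.getValue() / total));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 1000; i++) {
            final int id = i;
            // Stand-in for the candidate lookup (the "Lucene part").
            int[] candidates = timer.time("Searching", () -> new int[] {id, id + 1, id + 2});
            // Stand-in for running the comparators on each candidate.
            timer.time("Comparing", () -> {
                double score = 0;
                for (int c : candidates) {
                    score += Math.sqrt(c);
                }
                return score;
            });
        }
        System.out.print(timer.report());
    }
}
```

Wrapping only the search call in its own phase is what makes it possible to compare two Lucene versions without the comparator time drowning out the difference.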

Lars Marius Garshol

Oct 20, 2015, 6:31:31 AM

to duke

Ok, I did a little profiling of a job with 229,000 records. Not huge, but enough to get a real test.

--Lucene 4
1 thread: 217 seconds, 206 seconds
8 threads: 81 seconds, 93 seconds

--Lucene 5
1 thread: 215 seconds, 224 seconds
8 threads: 97 seconds, 100 seconds

In general, Lucene 4 does the first 120,000 records faster; after that, Lucene 5 does the next batches faster.

We'd need more test runs to be certain, but it looks as though Lucene 5 is slower, but not catastrophically slower. Probably the slowdown is worth it so that we can get up to the latest Lucene versions. With a little tuning perhaps the differences can be ironed out.

--Lars Marius
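
To put the timings in the message above into relative terms, the following sketch averages the two runs for each configuration and computes how much slower Lucene 5 was. The class and method names are made up for illustration; the only inputs are the eight timings quoted in the thread.

```java
import java.util.Locale;

// Hypothetical helper for summarizing the timings quoted in the thread.
public class BenchmarkSummary {

    // Arithmetic mean of the given run times (seconds).
    public static double mean(double... seconds) {
        double sum = 0;
        for (double s : seconds) {
            sum += s;
        }
        return sum / seconds.length;
    }

    // How much slower the candidate is than the baseline, in percent.
    public static double slowdownPercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double lucene4OneThread = mean(217, 206);
        double lucene5OneThread = mean(215, 224);
        double lucene4EightThreads = mean(81, 93);
        double lucene5EightThreads = mean(97, 100);

        System.out.printf(Locale.ROOT,
                "1 thread:  Lucene 4 %.1f s, Lucene 5 %.1f s, slowdown %.1f%%%n",
                lucene4OneThread, lucene5OneThread,
                slowdownPercent(lucene4OneThread, lucene5OneThread));
        System.out.printf(Locale.ROOT,
                "8 threads: Lucene 4 %.1f s, Lucene 5 %.1f s, slowdown %.1f%%%n",
                lucene4EightThreads, lucene5EightThreads,
                slowdownPercent(lucene4EightThreads, lucene5EightThreads));
    }
}
```

With these numbers the means are 211.5 s vs. 219.5 s for one thread (about 3.8% slower) and 87.0 s vs. 98.5 s for eight threads (about 13.2% slower). With only two runs per configuration, and a spread of up to 12 seconds between runs of the same configuration, these figures are rough at best.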
