Is it worth starting the migration to Lucene 5?
Should we provide a pull request with our code?
And could this upgrade be included in the next release published to the central Maven repositories? This would really simplify integration (in our case).
--
I will provide a pull request with minimal revisions, just to get Lucene 5 running.
By the way: you really did a GREAT job with this framework!!
--
I added a pull request with the minimum code revisions needed to use Lucene 5. To avoid ambiguity: this is NOT a complete upgrade to Lucene 5, just a change of the Maven dependencies to the Lucene 5 libraries plus the code revisions required to make it run.
I also compared the performance of this version with the version currently available in the Maven repositories (1.2) by deduplicating 20k address records. With "fuzzy-search" both activated and deactivated in the LuceneDatabase configuration, I did not notice any performance drawbacks. However, my "comparison" is not a thorough benchmark of Lucene performance, because I only measured the duration of the entire Processor.deduplicate function. And of course, most of the time is taken by the comparators and not the Lucene part...
To be honest: I have no idea how to separate the Lucene part from the comparators for a thorough test of Lucene performance :/ We have always used your framework as a "black box" and have never run just a single phase. Of course, if you could provide some hints or a basic test draft, I would provide better benchmarking results.
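One generic way to attribute time to individual phases, instead of timing the whole run, is to wrap each phase in a small accumulator. The following is only a minimal sketch: `PhaseTimer` and the phase names passed to it are hypothetical and not part of Duke's API, and the lambdas in `main` are stand-ins for whatever calls actually perform the candidate search and the record comparison.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Accumulates wall-clock time per named phase (hypothetical helper, not part of Duke). */
class PhaseTimer {
    // Insertion-ordered so the report lists phases in the order they first ran.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the task, adds its elapsed time to the named phase, and returns its result. */
    <T> T time(String phase, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            add(phase, System.nanoTime() - start);
        }
    }

    /** Adds an already measured duration (in nanoseconds) to the named phase. */
    void add(String phase, long elapsedNanos) {
        nanos.merge(phase, elapsedNanos, Long::sum);
    }

    /** Total time recorded across all phases, in nanoseconds. */
    long total() {
        long sum = 0;
        for (long n : nanos.values()) {
            sum += n;
        }
        return sum;
    }

    /** Share of the total spent in the given phase, as a whole percentage (rounded down). */
    long percent(String phase) {
        long total = total();
        return total == 0 ? 0 : 100 * nanos.getOrDefault(phase, 0L) / total;
    }

    /** One line per phase: name, milliseconds, percentage of the total. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanos.entrySet()) {
            sb.append(String.format("%s: %d ms (%d%%)%n",
                    e.getKey(), e.getValue() / 1_000_000, percent(e.getKey())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int record = 0; record < 1000; record++) {
            final int r = record;
            // Stand-in for the index lookup that finds candidate records.
            int candidates = timer.time("Searching", () -> r % 7);
            // Stand-in for running the comparators over those candidates.
            timer.time("Comparing", () -> Math.sqrt(candidates + 1.0));
        }
        System.out.print(timer.report());
    }
}
```

Because both phases are measured inside the same run, the percentages stay comparable between the Lucene 4 and Lucene 5 builds even when the absolute timings fluctuate.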
Duke version 1.3-SNAPSHOT, build 3,629 (2015-10-10), built by lars.garshol
InMemoryDatabase
Threads: 1
231 processed, 1054 records/second; comparisons: 67221
Run completed, 916 records/second
231 records total in 0 seconds
Reading from source: 0 (0%)
Indexing: 0 (0%)
Searching: 0 (0%)
Comparing: 0 (98%)
Callbacks: 0 (1%)
Total memory: 257425408, free memory: 210951096, used memory: 46474312
--Lars Marius