Lucene 4.8.1 dependency


Matteo Fiandesio

Aug 1, 2014, 10:47:13 AM
to duke-...@googlegroups.com
Hi guys,
I am working on an app that needs Duke's awesome features, but it also has a dependency on Lucene 4.8 that can't be downgraded.

When I invoke the deduplication process, I get this exception:

java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:120)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
at no.priv.garshol.duke.databases.LuceneDatabase.parseTokens(LuceneDatabase.java:393)
at no.priv.garshol.duke.databases.LuceneDatabase.findCandidateMatches(LuceneDatabase.java:266)
at no.priv.garshol.duke.Processor.match(Processor.java:417)
at no.priv.garshol.duke.Processor.match(Processor.java:252)
at no.priv.garshol.duke.Processor.deduplicate(Processor.java:244)
[...]


In a previous post I read that there is some kind of fix for this issue, and I would be glad if someone could explain how I can solve the problem.
Thanks a lot,
Matteo

Lars Marius Garshol

Aug 1, 2014, 11:55:07 AM
to duke-...@googlegroups.com

* Matteo Fiandesio
>
> I am working on an app that needs Duke's awesome features, but it also has a dependency on Lucene 4.8 that can't be downgraded.

There are two parts to this. One is that it's probably pretty easy to adapt Duke to the Lucene 4.8 API.
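For the exception in the stack trace, the message itself points at the cause: newer Lucene versions reuse the analyzer's tokenizer and refuse new input until the previous TokenStream has been closed. The consumer is expected to call reset(), loop on incrementToken(), then call end() and close(). Below is a minimal sketch of that contract. It uses hypothetical stand-in classes (StrictAnalyzer, StrictStream) rather than Lucene itself, so it only illustrates the pattern and is not Duke's or Lucene's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the TokenStream contract; none of these classes are Lucene's. */
final class TokenStreamContractSketch {

    /** Stand-in for a reused tokenizer: rejects new input while still open. */
    static final class StrictStream implements AutoCloseable {
        private String[] words = new String[0];
        private int pos;
        private boolean open; // true between setInput() and close()

        void setInput(String text) {
            if (open) {
                // Same message as in the stack trace, raised for the same reason.
                throw new IllegalStateException(
                    "TokenStream contract violation: close() call missing");
            }
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            open = true;
        }

        void reset() { pos = 0; }

        boolean incrementToken() {
            if (pos >= words.length) return false;
            pos++;
            return true;
        }

        String term() { return words[pos - 1]; }

        void end() { /* nothing to finalize in this model */ }

        @Override
        public void close() { open = false; }
    }

    /** Stand-in for an analyzer that hands out the same stream on every call. */
    static final class StrictAnalyzer {
        private final StrictStream shared = new StrictStream();

        StrictStream tokenStream(String text) {
            shared.setInput(text);
            return shared;
        }
    }

    /** Full workflow: reset, incrementToken loop, end, close (via try-with-resources). */
    static List<String> tokenize(StrictAnalyzer analyzer, String text) {
        List<String> tokens = new ArrayList<>();
        try (StrictStream ts = analyzer.tokenStream(text)) {
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(ts.term());
            }
            ts.end();
        }
        return tokens;
    }

    /** Broken workflow: the stream is never closed, so the next call fails. */
    static List<String> tokenizeWithoutClose(StrictAnalyzer analyzer, String text) {
        List<String> tokens = new ArrayList<>();
        StrictStream ts = analyzer.tokenStream(text);
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(ts.term());
        }
        return tokens;
    }

    public static void main(String[] args) {
        StrictAnalyzer analyzer = new StrictAnalyzer();
        System.out.println(tokenize(analyzer, "john smith"));
        System.out.println(tokenize(analyzer, "oslo"));
    }
}
```

In real code the same shape applies to Lucene's TokenStream, which is also closeable; whether LuceneDatabase.parseTokens needs exactly this change is an assumption based on the stack trace.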

The other part is that Lucene 4.x (for x >= 1) compresses stored fields, which makes Duke much slower. Basically, it punishes applications with many small field values. There is an API you can use to work around this (at least there was in 4.1), but unfortunately it entangles you in a service provider API and classloading magic, and I never worked it out.

But I guess it's high time we made another attempt to get over this hurdle.

More details here:
https://github.com/larsga/Duke/issues/85

--
Lars Marius Garshol
Head of product development, Sesam
Cell phone: +47 98 21 55 50
http://sesam.io


Matteo Fiandesio

Aug 1, 2014, 12:24:09 PM
to duke-...@googlegroups.com
Thanks, Lars, for your quick answer.
Do you have any benchmarks on how much slower Duke runs because of this?

Bye

Lars Marius Garshol

Aug 1, 2014, 12:30:03 PM
to duke-...@googlegroups.com

* Matteo Fiandesio
>
> Thanks, Lars, for your quick answer.
> Do you have any benchmarks on how much slower Duke runs because of this?

Results from Lucene 4.1 indicate that it takes twice as long as with 4.0, so it's not a trivial issue.

However, I'm sure it's possible to fix. It may be easier with 4.8. We need to make another attempt.