TokenStream contract violation: close() call missing

Haltoras

Apr 7, 2014, 9:53:23 AM
to duke-...@googlegroups.com
Hi guys,

I've got a problem with my first run of Duke.

I have a CSV file as input and I'm trying to deduplicate the records in it (10,000 records).

When I run the Duke class, I get this:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
    at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:120)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
    at no.priv.garshol.duke.databases.LuceneDatabase.parseTokens(LuceneDatabase.java:403)
    at no.priv.garshol.duke.databases.LuceneDatabase.findCandidateMatches(LuceneDatabase.java:275)
    at no.priv.garshol.duke.Processor.match(Processor.java:417)
    at no.priv.garshol.duke.Processor.match(Processor.java:252)
    at no.priv.garshol.duke.Processor.deduplicate(Processor.java:244)
    at no.priv.garshol.duke.Processor.deduplicate(Processor.java:216)
    at no.priv.garshol.duke.Duke.main_(Duke.java:163)
    at no.priv.garshol.duke.Duke.main(Duke.java:35)


Here is my configuration:

<duke>

  <schema>
    <threshold>0.7</threshold>

    <property type="id">
      <name>ID</name>
    </property>  
    <property>
      <name>NOME</name>
      <comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
      <low>0.3</low>
      <high>0.7</high>
    </property>
    <property>
      <name>COGNOME</name>
      <comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
      <low>0.3</low>
      <high>0.7</high>
    </property>
    <property>
      <name>INDIRIZZO</name>
      <comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
      <low>0.3</low>
      <high>0.88</high>
    </property>    
  </schema>
 
  <csv>
    <param name="input-file" value="tessere.csv"/>
    <param name="header-line" value="false"/>

    <column name="1"
            property="ID"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    <column name="2"
            property="NOME"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    <column name="3"
            property="COGNOME"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    <column name="4"
            property="INDIRIZZO"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
  </csv>

</duke>


Any idea about the issue?

Thank you so much,

Massimo

Fabrizio Fortino

Apr 7, 2014, 11:55:52 AM
to duke-...@googlegroups.com
Hi Massimo,

Are you sure you are running your application with Lucene 4.0?
I had the same issue when I tried using Lucene 4.6.
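
One quick way to check which Lucene is actually on the classpath is a small sketch like the one below. It is only an illustration: the class name `LuceneVersionCheck` and its methods are made up for this example, and it assumes the Lucene jar carries an `Implementation-Version` entry in its manifest that begins with the version number (repackaged or shaded jars may not). It looks up `org.apache.lucene.analysis.Tokenizer`, the class from the stack trace, by reflection, so it compiles without Lucene.

```java
import java.security.CodeSource;

// Sketch only: LuceneVersionCheck is a hypothetical helper, not part of Duke or Lucene.
class LuceneVersionCheck {

    // Class name taken from the stack trace; it is loaded by reflection.
    static final String MARKER_CLASS = "org.apache.lucene.analysis.Tokenizer";

    // Loads the class without initializing it; returns null if it is not on the classpath.
    static Class<?> load(String className) {
        try {
            return Class.forName(className, false, LuceneVersionCheck.class.getClassLoader());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    // Implementation-Version from the manifest of the jar providing the class, or null if unknown.
    static String implementationVersion(String className) {
        Class<?> c = load(className);
        if (c == null || c.getPackage() == null) {
            return null;
        }
        return c.getPackage().getImplementationVersion();
    }

    // Location of the jar (or directory) the class was loaded from, or null if unknown.
    static String jarLocation(String className) {
        Class<?> c = load(className);
        if (c == null) {
            return null;
        }
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return null;
        }
        return src.getLocation().toString();
    }

    // True only for 4.0.x version strings (assumes the string starts with the version number).
    static boolean isLucene40(String version) {
        return version != null && (version.equals("4.0") || version.startsWith("4.0."));
    }

    public static void main(String[] args) {
        String version = implementationVersion(MARKER_CLASS);
        System.out.println("Lucene loaded from: " + jarLocation(MARKER_CLASS));
        System.out.println("Implementation-Version: " + version);
        if (!isLucene40(version)) {
            System.out.println("This does not look like Lucene 4.0.x.");
        }
    }
}
```

Run it with the same classpath you use for Duke. If the printed location points at a 4.6 jar, or the version does not start with 4.0, that would match the error above.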

Cheers,
Fabrizio

Haltoras

Apr 7, 2014, 12:00:15 PM
to duke-...@googlegroups.com
Ah!

I'll try tomorrow. If I remember correctly, I'm using Lucene 4.6.

Thanks Fabrizio

Lars Marius Garshol

Apr 7, 2014, 1:00:04 PM
to duke-...@googlegroups.com

* Fabrizio Fortino
>
> Are you sure you are running your application using lucene 4.0?
> I had the same issue when I tried using lucene 4.6.

Thanks, Fabrizio. :-)

The reason we’re still using Lucene 4.0 is that Lucene 4.1 and up use compression of the index, which slows down applications that have small property values, like Duke.

It can be solved by using some sort of plugin interface (I forget the details right now). Since Lucene is becoming less important to Duke, it hasn’t seemed that urgent.

--
Lars Marius Garshol | Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no


Haltoras

Apr 8, 2014, 5:27:56 AM
to duke-...@googlegroups.com
Solved!

I downloaded Lucene 4.0.0 and now everything is working as expected.

Thanks again,

Massimo



