linkfile: all confidence probabilities equal 1

s142...@gmail.com

Dec 17, 2015, 8:18:42 AM

to duke

Hi,

I'm running Duke in record linkage mode with the following command:

duke.sh --linkfile=results.txt --singlematch script.xml

However, the results.txt file contains only records with a confidence level of 1.0, whereas in --interactive mode I see a range of different confidence levels.

Furthermore, I'm not able to execute:

duke.sh --linkfile=results.txt --singlematch --noreindex script.xml

ERROR:
Exception in thread "main" java.lang.NullPointerException
    at no.priv.garshol.duke.databases.LuceneDatabase$EstimateResultTracker.doQuery(LuceneDatabase.java:475)
    at no.priv.garshol.duke.databases.LuceneDatabase$EstimateResultTracker.doQuery(LuceneDatabase.java:465)
    at no.priv.garshol.duke.databases.LuceneDatabase.findCandidateMatches(LuceneDatabase.java:271)
    at no.priv.garshol.duke.Processor.match(Processor.java:417)
    at no.priv.garshol.duke.Processor.match(Processor.java:252)
    at no.priv.garshol.duke.Processor.linkBatch(Processor.java:379)
    at no.priv.garshol.duke.Processor.linkRecords(Processor.java:371)
    at no.priv.garshol.duke.Processor.linkRecords(Processor.java:342)
    at no.priv.garshol.duke.Duke.main_(Duke.java:174)
    at no.priv.garshol.duke.Duke.main(Duke.java:37)

I have attached the results.txt and script.xml files. Maybe someone can help me.

Thanks

results.txt

script.xml

Mauro Fraboni

Feb 20, 2017, 9:08:54 AM

to duke

I have similar problems.

Brandon Hoult

Jul 5, 2017, 4:39:07 PM

to duke

Same problem... Did you ever find a solution?

Brandon Hoult

Jul 6, 2017, 10:53:31 PM

to duke

So, it seems apparent that Duke has been basically abandoned: there are no updates after 2016 and information about it is pretty sparse. I needed this functionality, but numerous bugs made it difficult to achieve what I needed. With version 1.2 I could build the index (it took about 12 hours) and run a match, but --noreindex would produce an error, and I could not reindex every time. I noticed that Duke 1.3 was in the source repo along with build instructions, so I tried that, but the build failed because of a bunch of missing libraries that I could not locate. I found a built version here: https://oss.sonatype.org/content/repositories/snapshots/no/priv/garshol/duke/duke/1.3-SNAPSHOT/ This worked with --noreindex, but it ran out of memory before it could finish building the index. So I was planning to build the index with 1.2 and then match against it with 1.3.

If you still want to use Duke at this point, you can probably get it to work by going through all that.

I wondered why this was the only real open source deduplicator I could find. I also saw that some of the config files turned off fuzzy matching in Lucene, so I went to look into Lucene. It turns out that Lucene and Solr basically have all the functionality of Duke and are well maintained and documented. Solr is a search engine based on Lucene that does "fuzzy matching": if you load your data into it and then search against it, it will give you matches (duplicates). It has all the same matching engines as Duke plus a whole lot more.

So I expect the reason that everyone lost interest in Duke is that they figured this out and Duke is now redundant. Unfortunately it took me a week to realize this; hopefully I can save anyone who reads this some time.

Nicola Ghirardi

Jul 8, 2017, 10:15:47 AM

to duke

Hi,

I used Duke last year for a project, and I know what Solr and ES have to offer.

Even if the Lucene part can be (partially) done externally using existing search engines, Duke has the comparators and the union-of-fields comparison, which are not in ES and Solr.

If I had to do a dedup/matching project again, I would try a machine learning approach such as the one used by dedupe: https://github.com/dedupeio/dedupe

Hope it helps

Nicola

Brandon Hoult

Jul 8, 2017, 5:46:40 PM

to duke-...@googlegroups.com

Look again at Solr. It has all the same comparators as Duke, including phonetic, geospatial, n-gram, etc. In fact it has even more. It also has the equivalent of the Duke cleaners, fuzzy matching based on the number of edits, and scoring. The search expressions are way more flexible: you can do any Boolean logic with expression grouping and wildcards.
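
To make "fuzzy matching using number of edits" concrete, here is a minimal sketch of the underlying idea in Python. It computes plain Levenshtein distance and treats two values as a match when they are within a small number of edits. The function names and the `max_edits` default are illustrative assumptions, not Solr's or Duke's API, and the search engine's own implementation may differ in detail (for example in how it treats transpositions).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, max_edits: int = 2) -> bool:
    """Hypothetical helper: case-insensitive match within max_edits edits."""
    return edit_distance(a.lower(), b.lower()) <= max_edits


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(fuzzy_match("Garshol", "Garsholl"))
```

In a search engine the same threshold is normally expressed in the query itself rather than computed record by record, so this only illustrates what the threshold means.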

The only thing it lacks is the Bayesian scoring, but the scoring that it does include turned out to be more functional for my use anyway.

Anyway, if Duke were still maintained it would do just fine for its use case... But since it is not, it is hard to suggest it at this point.
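
For readers unfamiliar with the Bayesian scoring mentioned here, the following is a rough sketch of how per-property match probabilities can be folded into one confidence value with a naive Bayes update. It assumes each property comparison yields a probability between 0 and 1 and that the running value starts from a neutral 0.5. The function names are hypothetical, and this is not taken from Duke's source.

```python
from functools import reduce


def combine(p1: float, p2: float) -> float:
    """Naive Bayes combination of two independent match probabilities."""
    numerator = p1 * p2
    denominator = numerator + (1.0 - p1) * (1.0 - p2)
    # Note: p1 == 1.0 with p2 == 0.0 (or vice versa) gives 0/0 and raises
    # ZeroDivisionError; this sketch does not handle that case.
    return numerator / denominator


def overall_confidence(probabilities):
    """Fold per-property probabilities together, starting from a neutral 0.5."""
    return reduce(combine, probabilities, 0.5)


if __name__ == "__main__":
    print(overall_confidence([0.8, 0.7]))
    print(overall_confidence([0.9, 0.9, 0.9]))
    print(overall_confidence([1.0, 0.6]))
```

Under this formula several agreeing properties push the combined value quickly toward 1.0, and a single property with probability exactly 1.0 pins the result at 1.0 regardless of the others. Whether that explains the linkfile output reported at the top of this thread is not established here.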
