Duke deduplication engine : How can I give a string to deduplication program


kishore kumar suthar

Oct 7, 2015, 9:47:22 AM
to duke

I am attempting to use Duke to match records from one CSV file against another. Both files have the columns ID, Model, Price, CompanyName, Review, and Url. I want to match the first file against the second to find duplicate records.

package no.priv.garshol.duke;

import no.priv.garshol.duke.matchers.PrintMatchListener;

public class RunDuke {

  public static void main(String[] argv) throws Exception {
    Configuration config =
        ConfigLoader
            .load("/home/kishore/Duke-master/doc/example-data/presonalCare.xml");
    Processor proc = new Processor(config);
    proc.addMatchListener(new PrintMatchListener(true, true, true, false, config.getProperties(),
        true));
    proc.link();
    proc.deduplicate();
    proc.close();
  }

}
The program works fine with two CSV files.
How can I supply the data source as a String? I want to check a single record against the second CSV data source, without first writing that record to a CSV file. In other words, I want to pass a string as the data source.

Lars Marius Garshol

Oct 7, 2015, 3:12:39 PM
to duke
* kishore kumar suthar

    proc.link();
    proc.deduplicate();

You should have only one of these. .link() is for matching two data sets against each other, which is what you are doing.

.deduplicate() is for matching within a single dataset, which you're not doing. So you should delete this line.
 
How can I supply the data source as a String? I want to check a single record against the second CSV data source, without first writing that record to a CSV file. In other words, I want to pass a string as the data source.

One way to do it is:

(1) Set up Duke with a config that has a single CSV source (no <group>).
(2) Remove the .link() and .deduplicate() calls.
(3) Build a Record from your string
(4) Call proc.deduplicate(Collections.singleton(record)).
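
A minimal sketch of step (3), assuming the string is one comma-separated line with the same six columns as the CSV files. RecordFields and parse are hypothetical names, not part of Duke, and the split is naive (no quoted fields or embedded commas). The result is a property-name to values map; wrapping it in a Duke Record is left as a comment, because the exact constructor to use is an assumption here.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Duke: turns one comma-separated line
// into a map from property name to its values.
final class RecordFields {

  // Column order taken from the CSV files described in the question.
  static final String[] COLUMNS =
      {"ID", "Model", "Price", "CompanyName", "Review", "Url"};

  // Naive split on commas: assumes no quoted fields or embedded commas.
  static Map<String, Collection<String>> parse(String line) {
    String[] values = line.split(",", -1); // -1 keeps trailing empty fields
    if (values.length != COLUMNS.length) {
      throw new IllegalArgumentException(
          "Expected " + COLUMNS.length + " fields but got " + values.length);
    }
    Map<String, Collection<String>> data = new LinkedHashMap<>();
    for (int i = 0; i < COLUMNS.length; i++) {
      data.put(COLUMNS[i], Collections.singleton(values[i].trim()));
    }
    return data;
  }

  public static void main(String[] args) {
    // Sample values are made up for illustration.
    Map<String, Collection<String>> data =
        parse("42,Trimmer X,19.99,Acme,Works well,http://example.com/x");
    System.out.println(data);
    // Assumption: with Duke on the classpath, this map would then be wrapped
    // in a Record (for example RecordImpl, if its constructor accepts this
    // map shape) and handed over as in step (4):
    //   proc.deduplicate(Collections.singleton(record));
  }
}
```

The property names in the map need to match the property names declared in the Duke configuration file, otherwise the comparators have nothing to compare.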

Unfortunately, that's not quite what you want, since it will add the new record to the index, so later records can match it.

I need to make a small API change to support this case. Will do: https://github.com/larsga/Duke/issues/212

--Lars Marius