Duke - config xml - Data Deduplication

Soundarya Thiagarajan

Mar 31, 2016, 11:57:19 PM

to duke

I am working with Duke (https://github.com/larsga/Duke) for data deduplication.

In https://github.com/larsga/Duke/blob/master/doc/example-data/countries.xml, each field in the config XML has a property that defines the name, comparator, low, and high.

Can anyone explain how this works, in particular how these values relate to the probability threshold? Can we specify low alone in some cases? For example, if a value is at least above the threshold, is the pair considered a match?

One more question: the total probability threshold is also set at the top of the XML:

  <schema>
    <threshold>0.7</threshold>

https://github.com/larsga/Duke/blob/master/doc/example-data/countries.xml

How is the total probability actually calculated and compared against this threshold? It would be great if someone could explain the XML file here. Thank you so much in advance.

Lars Marius Garshol

Apr 1, 2016, 2:43:30 AM

to duke

* Soundarya Thiagarajan

In https://github.com/larsga/Duke/blob/master/doc/example-data/countries.xml, each field in the config XML has a property that defines the name, comparator, low, and high.
Can anyone explain how this works, in particular how these values relate to the probability threshold? Can we specify low alone in some cases?

No, you have to have both high and low. This wiki page

https://github.com/larsga/Duke/wiki/XMLConfig

explains the XML config syntax and goes through high, low, and the threshold.
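
To make the role of high and low concrete, here is a rough Java sketch of how a comparator's similarity score could be turned into a per-property probability. The mapping used here (similarity at or above 0.5 is scaled toward high, anything below yields low) is an assumption for illustration and is not confirmed in this thread; the class name, method name, and the low/high values are hypothetical. The wiki page above is the authoritative description.

```java
public class PropertyProbability {

    // Hypothetical helper: maps a comparator similarity in [0, 1]
    // to a match probability bounded by the property's low and high.
    // ASSUMPTION: the quadratic scaling below is illustrative only.
    static double toProbability(double similarity, double low, double high) {
        if (similarity >= 0.5) {
            // Similar values: probability rises from 0.5 toward high.
            return (high - 0.5) * similarity * similarity + 0.5;
        }
        // Dissimilar values: the property counts as evidence against a match.
        return low;
    }

    public static void main(String[] args) {
        double low = 0.09;   // illustrative value
        double high = 0.93;  // illustrative value
        System.out.println("identical values:  " + toProbability(1.0, low, high));
        System.out.println("different values:  " + toProbability(0.3, low, high));
    }
}
```

Under this reading, both values are needed because every comparison has to produce a probability on one side of 0.5 or the other: high says how much a matching value supports a match, and low says how much a differing value counts against it.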

One more question: the total probability threshold is also set at the top of the XML.

How is the total probability actually calculated and compared against this threshold? It would be great if someone could explain the XML file here.

The total probability for each pair of records is computed by combining the per-property probabilities with Naive Bayes. The result is then compared against the threshold. You can see the code here: https://github.com/larsga/Duke/blob/master/duke-core/src/main/java/no/priv/garshol/duke/utils/Utils.java#L11
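
As a minimal sketch of that combination step, the Java below applies the standard two-probability Naive Bayes formula to one property at a time and then compares the result with the threshold. The class and method names, the starting value of 0.5, and the per-property probabilities are assumptions for illustration; the linked Utils.java is the authoritative version.

```java
public class BayesCombine {

    // Standard Naive Bayes combination of two independent probabilities.
    static double computeBayes(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + ((1.0 - p1) * (1.0 - p2)));
    }

    // Folds all per-property probabilities into one overall probability.
    // ASSUMPTION: start from 0.5, which is neutral (computeBayes(0.5, p) == p).
    static double combine(double[] probabilities) {
        double total = 0.5;
        for (double p : probabilities) {
            total = computeBayes(total, p);
        }
        return total;
    }

    public static void main(String[] args) {
        double threshold = 0.7; // same value as <threshold> in the question

        // Illustrative per-property probabilities, not taken from real data.
        double[] bothAgree = {0.93, 0.61};
        double[] oneDisagrees = {0.93, 0.12};

        System.out.println("both agree:    " + (combine(bothAgree) > threshold));
        System.out.println("one disagrees: " + (combine(oneDisagrees) > threshold));
    }
}
```

With these illustrative numbers, two agreeing properties give roughly 0.95, which is above 0.7, while one strong agreement plus one disagreement gives roughly 0.64, which falls below it. That is why a single property above the threshold is not enough on its own: every compared property moves the total up or down.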