I am working with Duke (https://github.com/larsga/Duke) for data deduplication.
In the example config https://github.com/larsga/Duke/blob/master/doc/example-data/countries.xml, each field has a property that defines a name, a comparator, and low and high values.
Can anyone explain how this works, in particular how low and high relate to the probability threshold? Can we specify low alone in some cases, so that a pair is considered a match as long as it is at least above the threshold value?
One more question: the overall probability threshold is also set at the top of the XML:
<schema>
<threshold>0.7</threshold>
How is the total value that is compared against this threshold actually calculated? It would be great if someone could explain the XML file here. Thank you so much in advance.
https://github.com/larsga/Duke/blob/master/doc/example-data/countries.xml - for each field in the config XML, we have a property which defines the name, comparator, low and high.
Can anyone explain how this works, in particular how it works with the probability threshold? We can have low alone in some cases, right?
No, you have to have both high and low. This wiki page
https://github.com/larsga/Duke/wiki/XMLConfig
explains the XML config syntax and goes through high, low, and the threshold.
And one more question: the overall probability threshold is also set at the top of the XML.
How is the total threshold value actually calculated? It would be great if someone could explain the XML file here.
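As a rough illustration of how low, high, and the threshold could fit together, here is a minimal Python sketch. It is not Duke's code. It assumes that each comparator returns a similarity between 0 and 1, that this similarity is mapped to a probability between low and high, and that the per-property probabilities are combined with a naive Bayes rule starting from 0.5. The function names, the exact mapping formula, and the low/high values are assumptions for illustration only; the wiki page and the Duke source are the authority on the real behaviour.

```python
def property_probability(similarity, low, high):
    """Map a comparator similarity in [0, 1] to a match probability.

    Assumed mapping: a clearly dissimilar value yields `low`; otherwise the
    probability grows from 0.5 towards `high` as the similarity approaches 1.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def combine(p1, p2):
    """Combine two probabilities with a naive Bayes rule."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def record_probability(comparisons):
    """Fold all per-property probabilities into one overall probability.

    `comparisons` is a list of (similarity, low, high) tuples, one per property.
    """
    prob = 0.5  # start undecided
    for similarity, low, high in comparisons:
        prob = combine(prob, property_probability(similarity, low, high))
    return prob


THRESHOLD = 0.7  # the <threshold> value from the config

# Illustrative low/high values, not taken from any real config.
both_match = [(1.0, 0.09, 0.93), (1.0, 0.04, 0.73)]
one_mismatch = [(1.0, 0.09, 0.93), (0.2, 0.04, 0.73)]

print(record_probability(both_match) > THRESHOLD)    # above the threshold
print(record_probability(one_mismatch) > THRESHOLD)  # below the threshold
```

Under these assumptions the threshold itself is not calculated at all: it is a fixed cut-off that the combined probability of a record pair is compared against. It also suggests why both values are needed: high says how much a matching value should raise the overall probability, and low says how much a non-matching value should lower it.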