Detailed explanation of how Duke's Bayesian algorithm works?

Roman Hennig

Jun 2, 2016, 11:30:56 AM

to duke

Hi,

First of all, thanks for making Duke freely available. I think it will prove very valuable for what I'm doing.

I have a question about the workings of the Bayesian algorithm. I'm looking at the "How it works" page and trying to follow the steps, but it seems I'm missing something.

To make a simple example in Bayes' terms, we have the following assumptions:

event A: entities are equal (match)

event B: property (e.g. address, name) of two entries is equal (or is event B actually that the entries have a certain similarity score, calculated by a comparator function?)

we assume: P(A) = 0.5 without prior knowledge. This seems arbitrary but I get that you have to start somewhere.

we assert: P(A | B) = p_high, P(A | not B) = p_low

e.g. for the first step on the "How it works" page we have p_high = 0.65 and p_low = 0.25 for the address comparisons.

the page then tells us that the similarity between two given addresses is 0.867 and that this updates the probability to 0.6127.

This is where I am getting lost. How do you get the new probability, starting from 0.5? And how does a similarity score between two properties translate to P(B), the probability that the properties are actually equal?

I would be glad for any explanation or a link to a more detailed description of what the algorithm is doing.

Lars Marius Garshol

Jun 4, 2016, 5:06:54 AM

to duke

Hi there,

You can see the actual formula used here: https://github.com/larsga/Duke/blob/master/duke-core/src/main/java/no/priv/garshol/duke/utils/Utils.java

That file also links to an article that explains why the formula looks the way it does.
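For readers who do not want to open the file, here is a minimal sketch of what the combining step appears to do, assuming the method in Utils.java is the usual naive-Bayes pairing of two probabilities. The class name `BayesCombineSketch` is made up for illustration, and the formula should be checked against the linked source before relying on it.

```java
import java.util.Locale;

// Hypothetical sketch, not the Duke source: combines the running match
// probability with the probability contributed by one more property.
final class BayesCombineSketch {

    // Assumed formula: p1*p2 / (p1*p2 + (1 - p1)*(1 - p2))
    static double computeBayes(double prob1, double prob2) {
        return (prob1 * prob2)
             / ((prob1 * prob2) + ((1.0 - prob1) * (1.0 - prob2)));
    }

    public static void main(String[] args) {
        double running = 0.5;                   // starting point, no evidence yet
        running = computeBayes(running, 0.6127); // evidence from the address property
        System.out.println(String.format(Locale.ROOT, "%.4f", running)); // prints 0.6127
    }
}
```

Under this formula a running value of 0.5 is neutral, so the first property's probability passes through unchanged. That is consistent with the 0.5 to 0.6127 step quoted from the "How it works" page, but it does not explain where 0.6127 itself comes from.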

I hope that clears things up.

Best,

--Lars Marius

Roman Hennig

Jun 6, 2016, 4:19:49 PM

to duke

Hey, thanks for the reply.

I do understand the Bayesian algorithm, but here is the exact part that I don't understand:

the "how it works" example tells us this:

---ADDRESS1
'main street 101' ~ 'mian street 101': 0.867 (prob 0.6127)
Result: 0.5 -> 0.6127

so that means an address similarity score of 0.867 translates into a probability of 0.6127 for entity matching.

How did you arrive at that probability? It seems that you need to combine the 0.867 similarity score with the earlier assumption that matching addresses mean a 0.65 probability of the records being the same (the <high>0.65</high> part), but I have tried some straightforward ways and have not been able to come up with the result of 0.6127.

thanks a lot,

Roman

Lars Marius Garshol

Jun 30, 2016, 11:15:21 AM

to duke

Roman Hennig wrote:

> ---ADDRESS1
> 'main street 101' ~ 'mian street 101': 0.867 (prob 0.6127)
> Result: 0.5 -> 0.6127
>
> so that means an address similarity score of 0.867 translates into a probability of 0.6127 for entity matching.
>
> How did you arrive at that probability?

Good question. :) Yes, that part is indeed not covered.

The computation is here: https://github.com/larsga/Duke/blob/master/duke-core/src/main/java/no/priv/garshol/duke/PropertyImpl.java#L121

It's not really founded on anything other than practical experience. It appears to work very well.
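As a rough sketch of what that computation seems to do, assuming the linked method works as described here: a similarity of 0.5 or more is squared and used to scale the gap between 0.5 and the configured high value, and anything below 0.5 simply returns the low value. The class and method names are hypothetical. The input 13/15 is also an assumption, chosen because 'main street 101' and 'mian street 101' differ by two edits over 15 characters, which rounds to the 0.867 shown.

```java
import java.util.Locale;

// Hypothetical sketch, not the Duke source: maps a comparator similarity
// to a per-property probability using the configured high/low values.
final class PropertyProbabilitySketch {

    static double toProbability(double sim, double high, double low) {
        if (sim >= 0.5) {
            // Assumed rule: scale the gap between 0.5 and high by sim squared.
            return ((high - 0.5) * (sim * sim)) + 0.5;
        }
        // Assumed rule: anything less similar is treated as a mismatch.
        return low;
    }

    public static void main(String[] args) {
        double sim = 13.0 / 15.0; // assumed unrounded value behind 0.867
        double prob = toProbability(sim, 0.65, 0.25);
        System.out.println(String.format(Locale.ROOT, "%.4f", prob)); // prints 0.6127
    }
}
```

With high = 0.65 this works out to 0.15 * (13/15)^2 + 0.5, or about 0.61267. Plugging in the rounded 0.867 instead gives about 0.61275, so the 0.6127 in the example is consistent with this rule either way.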

--Lars Marius

Roman Hennig

Jul 4, 2016, 2:32:01 PM

to duke

Well, if it works well, that's fair enough. I just want to understand what I'm using :)

Thanks for the answer. I think it would be a good idea to post a link to this page in the relevant section of the "How it works" page.
