MySQL Example: cluster_score all over the place

55 views

Skip to first unread message

John Beales

unread,

May 2, 2018, 8:30:04 AM5/2/18

to open source deduplication

Hi,

I think this might just be me misunderstanding the cluster_score column in the MySQL example, but here goes:

I have adapted the mysql example and run it on my own data. It seems to work well. However, when I look at the output in the entity_map table, the cluster_score values are all over the place - anything from 0.000000026456188351176024 to 0.9999985098838806. I had expected all of the cluster_score values to be above the threshold passed to the matchBlocks method, (0.5).

Am I wrong about that, maybe the threshold is used for pair comparisons and cluster_score is a score for the whole group? If so, should I be worries about having low scores for a group?

Thanks,

John

mza...@clarityinsights.com

unread,

May 8, 2018, 10:16:08 PM5/8/18

to open source deduplication

John-

When I first started working on dedupe, I was on the same line of thought as you - I should only see cluster_scores ABOVE the threshold. However, I think these are related to how the connected components eventually group (similar to how you stated), and the threshold helps inform them in the background. As far as if a low score should be "worrisome"... I am not sure. I definitely see some odd groupings at times but have my doubts that it is only the *lowest* scores that have the bad/odd groupings.

My understanding of this is pretty rudimentary, and I would appreciate being corrected if I am way off base here. So, I don't think I was a lot of help here, but I can commiserate with you if that is of any consolation. :)