MySQL Example: cluster_score all over the place

55 views
Skip to first unread message

John Beales

unread,
May 2, 2018, 8:30:04 AM5/2/18
to open source deduplication
Hi,

I think this might just be me misunderstanding the cluster_score column in the MySQL example, but here goes:

I have adapted the mysql example and run it on my own data. It seems to work well. However, when I look at the output in the entity_map table, the cluster_score values are all over the place - anything from 0.000000026456188351176024 to 0.9999985098838806.  I had expected all of the cluster_score values to be above the threshold passed to the matchBlocks method, (0.5).

Am I wrong about that, maybe the threshold is used for pair comparisons and cluster_score is a score for the whole group?  If so, should I be worries about having low scores for a group?

Thanks,

John

mza...@clarityinsights.com

unread,
May 8, 2018, 10:16:08 PM5/8/18
to open source deduplication
John-
When I first started working on dedupe, I was on the same line of thought as you - I should only see cluster_scores ABOVE the threshold.  However, I think these are related to how the connected components eventually group (similar to how you stated), and the threshold helps inform them in the background.  As far as if a low score should be "worrisome"... I am not sure.  I definitely see some odd groupings at times but have my doubts that it is only the *lowest* scores that have the bad/odd groupings.

My understanding of this is pretty rudimentary, and I would appreciate being corrected if I am way off base here.  So, I don't think I was a lot of help here, but I can commiserate with you if that is of any consolation.  :)

-Matt Z
Reply all
Reply to author
Forward
0 new messages