Pre-Define Blocking rules


Tim Harder

Aug 16, 2018, 4:30:48 AM
to open source deduplication
Hi, 

is there a way to pre define blocking rules in dedupe instead of learning them? 

I have a dataset in which I know some parameters (e.g. country_code) should define blocks very effectively, and I would like to add those to the blocking rules in advance so that the learning phase can concentrate on the "other" parameters.
Is that possible at all?

Thanks
Tim

Josh Wieder

Aug 17, 2018, 8:22:59 PM
to open-source-...@googlegroups.com
Hi Tim -

If I understand correctly you have two fairly straight-forward options here.

If all you want to do is use a really straightforward string to
determine blocking, like an ISO 3166 country code, and you're certain
that the strings are accurate in your data set, you can declare the
country_code as an "Exact" type variable in training
(https://docs.dedupe.io/en/latest/Variable-definition.html) like this:

{'field' : 'country_code', 'type': 'Exact'}

Alternatively, if you've come up with something really clever for your
country code comparisons, or there's something weird about them in
your dataset, or if you are just using country codes as an example,
you can consider declaring a custom comparator type. According to the
docs (https://docs.dedupe.io/en/latest/Variable-definition.html#custom-types),
"The comparator must be a function that can take in two field values
and return a number."

The example docs give a simple comparison example:

def sameOrNotComparator(field_1, field_2) :
    if field_1 and field_2 :
        if field_1 == field_2 :
            return 0
        else:
            return 1
variable definition:

{'field' : 'country_code', 'type': 'Custom',
'comparator' : sameOrNotComparator}

If you're normalizing different types of country codes, your function
might want to look at number of characters in a string, but you've
already got your logic worked out.
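
To make the normalization idea concrete, here is a minimal sketch of a comparator that lowercases the codes and maps three-letter codes to two-letter ones before comparing. The lookup table, the helper normalize_country_code and the name normalized_country_comparator are all hypothetical and only cover a few countries for illustration. Returning None when a value is missing or unrecognized mirrors the docs example above, which falls through without a return value when a field is empty.

```python
# Hypothetical lookup table (ISO 3166-1 alpha-3 -> alpha-2); a real one
# would need to cover every country present in the data.
ALPHA3_TO_ALPHA2 = {"usa": "us", "deu": "de", "fra": "fr"}


def normalize_country_code(code):
    """Return a lowercase two-letter code, or None if missing/unrecognized."""
    if not code:
        return None
    code = code.strip().lower()
    if len(code) == 2:
        return code
    if len(code) == 3:
        return ALPHA3_TO_ALPHA2.get(code)
    return None


def normalized_country_comparator(field_1, field_2):
    """Return 0 for the same country, 1 for different countries,
    and None when either value cannot be normalized."""
    code_1 = normalize_country_code(field_1)
    code_2 = normalize_country_code(field_2)
    if code_1 is None or code_2 is None:
        return None
    return 0 if code_1 == code_2 else 1


# Variable definition in the same dict style as the examples above.
variable = {'field': 'country_code', 'type': 'Custom',
            'comparator': normalized_country_comparator}
```

With this sketch, "US" and "usa" compare as the same country, while "us" and "de" compare as different.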

--
All the best,
Josh W.
https://joshwieder.net

Tim Harder

Aug 23, 2018, 3:59:00 AM
to open source deduplication
Hi Josh,

thanks for the detailed answer and apologies for the delayed response.

I tried the Exact type, which seemed very straightforward; however, I still get quite a few matches (in training and in the final results) where the country codes do not match (e.g. "us" is matched with "de").

Maybe I am misunderstanding the way blocking is used. I thought that with blocking, matches could only be found within the same block?
Is there a block size that can be configured?
I am trying to match entity names from a city / country, so valid matches would always need to be in the same city and country.

Thanks
Tim