Blocking time varies drastically based on training phase (gazetteer)


meh...@thinkdataworks.com

May 4, 2018, 10:31:10 AM
to open source deduplication
Hi,

Testing the gazetteer package on multiple example use cases, I have noticed that each time I train on random samples, the time the blocking phase takes during matching varies. Some trainings lead to a very slow blocking phase. This is especially noticeable when using a field that holds long strings, such as addresses. I was wondering about the reason for this, and whether there is any way to resolve it or to understand what kind of training makes the process slower or faster and what effect it has on the quality of matches.

Thank you,
Mehrsa

Forest Gregg

May 4, 2018, 10:37:24 AM
to open-source-...@googlegroups.com
The blocking rules that dedupe learns are a function of the training data. The number and composition of the blocking rules determine how long blocking takes.

If you are drawing a random sample, there will be variation in your training pairs due to that sampling, which will produce variation in the learned blocking rules. To reduce the variation, increase the number of pairs you actually label and increase the sample size.
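For what it's worth, here is a minimal sketch of both knobs for the Gazetteer (assuming the dedupe 1.x API from around this time; fields, messy, and canonical are placeholder names):

import dedupe

gazetteer = dedupe.Gazetteer(fields)

# A larger sample_size means less run-to-run variation in the
# candidate pairs that active learning draws from.
gazetteer.sample(messy, canonical, sample_size=30000)

# Label more pairs than you otherwise would before typing (f)inished.
dedupe.consoleLabel(gazetteer)

gazetteer.train()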




meh...@thinkdataworks.com

May 4, 2018, 10:46:54 AM
to open source deduplication
Thank you for the explanation. So does this also mean that the more useful labels we provide in the training phase, the more efficient the blocking will be (and thus the less time blocking takes)?

Forest Gregg

May 4, 2018, 10:53:28 AM
to open-source-...@googlegroups.com
No.

Typically, if you provide more positive training labels, dedupe will learn more blocking rules to cover all the different cases in your training data. This will lead to more blocking, not less.

Think about the case where you have labeled only a single duplicate pair. Dedupe would attempt to find the blocking rule that blocks those two records together while minimizing the total number of comparisons. That single blocking rule is very unlikely to cover all true duplicates.
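One way to see this effect is to count the candidate comparisons the learned blocker implies. A rough sketch (the (block_key, record_id) iteration pattern follows the dedupe 1.x examples; gazetteer and messy are placeholder names):

from collections import defaultdict

# Group record ids by the block keys the learned blocker emits.
blocks = defaultdict(set)
for block_key, record_id in gazetteer.blocker(messy.items()):
    blocks[block_key].add(record_id)

# Upper bound on the pairwise comparisons implied by the rules.
comparisons = sum(len(ids) * (len(ids) - 1) // 2
                  for ids in blocks.values())
print(len(gazetteer.blocker.predicates), 'rules,',
      comparisons, 'candidate comparisons')

More rules generally means more block keys, and therefore more candidate comparisons to score.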




mza...@clarityinsights.com

May 8, 2018, 9:43:57 PM
to open source deduplication
Mehrsa-
A more layman's explanation to complement Forest's:

Take a look at what predicates and index_fields are created from each of your training runs. These will likely be different each time you train, and sets that are more "comprehensive" will take longer to process:

# Show the learned blocking predicates and the fields that get indexed
print(deduper.blocker.predicates)
print(deduper.blocker.index_fields)

Note: your object may be gazetteer rather than deduper.

On top of that, while you are doing the active learning step (where you answer y/n/u for each pair), you can see dedupe recalculating and changing these predicates as you assign more yeses and noes.

See a sample run-through that I did; I have removed my data, but you can see the output and the predicate set. It changes (and gets more complicated) as I label more pairs and accumulate more positives and negatives:

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonIntegerPredicate, s_road), SimplePredicate: (commonTwoTokens, scity))

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonIntegerPredicate, s_road), SimplePredicate: (commonTwoTokens, scity))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, nameclean), SimplePredicate: (wholeFieldPredicate, s_road))

6/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, nameclean), SimplePredicate: (tokenFieldPredicate, szip5))
INFO:dedupe.training:(SimplePredicate: (oneGramFingerprint, szip5), TfidfNGramCanopyPredicate: (0.8, s_road))
INFO:dedupe.training:(SimplePredicate: (commonIntegerPredicate, s_road), SimplePredicate: (commonTwoTokens, scity))

8/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, s_road), SimplePredicate: (oneGramFingerprint, szip5))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, nameclean), SimplePredicate: (tokenFieldPredicate, szip5))

19/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (twoGramFingerprint, s_house_number), TfidfNGramCanopyPredicate: (0.4, s_road))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, nameclean), SimplePredicate: (tokenFieldPredicate, szip5))
INFO:dedupe.training:(SimplePredicate: (commonIntegerPredicate, s_road), SimplePredicate: (commonTwoTokens, scity))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, nameclean), SimplePredicate: (commonTwoTokens, scity))

25/10 positive, 14/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (tokenFieldPredicate, sstate), SimplePredicate: (wholeFieldPredicate, s_house_number))
INFO:dedupe.training:(LevenshteinCanopyPredicate: (3, nameclean), SimplePredicate: (tokenFieldPredicate, szip5))
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.8, s_po_box), TfidfTextCanopyPredicate: (0.4, nameclean))


...

Pretty cool, right? This is a powerful set of algorithms that maximizes matching based on your recall_threshold parameter.
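If you want to see those INFO:dedupe.training lines during your own labeling session, enabling Python's standard logging at INFO level before the active learning step is enough (a minimal sketch; dedupe emits these messages through the standard logging module):

import logging

# Surface dedupe's training messages while you label pairs.
logging.basicConfig(level=logging.INFO)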

-Matt Z