Couple 'simple' questions

Rich DeRidder

Aug 17, 2023, 9:39:00 AM
to open source deduplication
Hey folks... I know just enough to be dangerous, but I don't know the details on many things here.
1. Is the prepare_training "sample size" parameter irrelevant if a training file is specified?
I just want to make sure it isn't impacting anything when a training file is used. Or, if I do specify both, does the process use all of the training-file data plus a sample of the source data as well?
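For whatever it's worth, my reading of the dedupe 2.x docs is that the two are combined: prepare_training loads the labeled pairs from training_file and still draws sample_size candidate pairs from your data for active learning. A toy sketch of that combination in plain Python (this illustrates the idea only, it is not the library's internals, and the record names are made up):

```python
import io
import json
import random

# Toy sketch, NOT dedupe internals: the training file supplies labeled
# pairs, while sample_size controls how many unlabeled candidate pairs
# are drawn from the source data for further labeling.
training_file = io.StringIO(json.dumps({
    "match": [({"name": "Ann Lee"}, {"name": "Anne Lee"})],
    "distinct": [({"name": "Ann Lee"}, {"name": "Bob Ray"})],
}))
labeled = json.load(training_file)  # labeled pairs come from the file

records = [{"name": n} for n in ("Ann Lee", "Anne Lee", "Bob Ray", "Cy Fox", "Di Orr")]

sample_size = 3  # analogous to prepare_training's sample_size parameter
random.seed(42)
all_pairs = [(a, b) for i, a in enumerate(records) for b in records[i + 1:]]
candidates = random.sample(all_pairs, sample_size)  # unlabeled pool

print(len(labeled["match"]), len(labeled["distinct"]), len(candidates))
```

So under that reading, sample_size is not a no-op when a training file is present; it just governs the fresh sampling side.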

2. Can someone explain why the data in the training file (matches and distincts) is not reflected 100% in the resulting entity map? For example, I have 120 matches in the training file, and only 98 of those appear in the resulting entity map; and I have 220 distincts in the training file, but 22 of them appear as matches in the resulting entity map.
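As I understand it, the training labels steer a pairwise scoring model rather than acting as hard constraints, and clustering is transitive, so a pair you labeled distinct can still be pulled into one entity through high-scoring links to a third record. A tiny sketch with a hypothetical union-find and made-up scores (not real model output):

```python
# Toy sketch of why a pair labeled "distinct" can still share a cluster:
# clustering is transitive, so strong links through a third record can
# join a labeled-distinct pair. Scores below are hypothetical.
scores = {
    ("A", "B"): 0.10,  # you labeled A/B distinct, and the model agrees...
    ("A", "C"): 0.95,  # ...but both records match C strongly
    ("B", "C"): 0.92,
}
threshold = 0.5

# Minimal union-find to mimic transitive clustering.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for (a, b), score in scores.items():
    if score >= threshold:
        union(a, b)

# A and B land in the same entity despite their low pairwise score.
print(find("A") == find("B"))  # True
```

The mirror case explains the missing matches: a pair labeled "match" whose learned score falls below the clustering threshold simply never gets linked.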

3. Does matching occur on records outside of the blocks? Should I expect every source record to have at least one block record for every field variable I define?

In other words, the blocking table should have at least one record for every record in the source, right? There would normally be many blocking records per source record, due to the various predicates/variables, but I had thought I should expect at least one. That isn't holding true, so I assume I'm missing something: many source records do not appear in my blocking table at all, and so, I assume, are never considered for matches. How can I ensure that every record is included in the blocks? Or am I off, and records that are not in any block still get considered for matches?
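My understanding is that comparisons only happen between records that share a block key, so a record the learned predicates never cover (blank phone, null email, etc.) is never compared to anything. One way to audit that is to diff the source ids against the blocking map; a minimal sketch with made-up ids and block keys (the (block_key, record_id) shape mirrors dedupe's blocking output, but nothing here is the library itself):

```python
# Sketch: records that never appear in the blocking map are never compared.
# Ids and block keys below are illustrative only.
source_ids = {1, 2, 3, 4, 5}
block_map = [
    ("name:ann", 1), ("name:ann", 2),
    ("phone:555", 2), ("phone:555", 3),
    # records 4 and 5 produced no block keys under the learned predicates
]

blocked_ids = {rec_id for _, rec_id in block_map}
unblocked = source_ids - blocked_ids  # these are invisible to matching
print(sorted(unblocked))
```

In SQL terms this is a LEFT JOIN from the source table to the blocking table where the blocking side is NULL. If the unblocked set matters, the usual fixes are retraining with more labeled examples or adding predicates that cover the fields those records actually populate.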

4. "clustering:A component contained 77937 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0012458356151866146"
Can someone explain in layman's terms what this really means, and give tips for avoiding it? Does it mean some record is matching 77,937 other records, or is some other aspect of the matching criteria basically too loose?
I typically see this message right before my server reboots (I think due to running out of memory).
Side info:
I use dedupe to find duplicates in a table of about 90k persons, matching on full_name, phone, email, and address fields.
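My reading of that message: during clustering, dedupe found one connected component of 77,937 elements, nearly your whole 90k table, all chained together by candidate links, so it re-filters that component at a stricter score cutoff (the 0.0012... value) to make it tractable. It usually points at something far too common tying records together, e.g. blank phones or a shared office email all landing in one block. The memory blow-up is quadratic, which this back-of-the-envelope sketch (sizes taken from your log message, block size of 100 is an assumption) makes concrete:

```python
# Back-of-the-envelope: pairwise comparisons grow quadratically with
# component size. 77_937 comes from the log message; the "well-blocked"
# size of 100 is an illustrative assumption.
records = 77_937
pairs_in_one_component = records * (records - 1) // 2
print(f"{pairs_in_one_component:,} candidate pairs")  # roughly 3 billion

block_size = 100
n_blocks = records // block_size
pairs_when_blocked = n_blocks * (block_size * (block_size - 1) // 2)
print(f"{pairs_when_blocked:,} candidate pairs")  # a few million
```

Practical tips I've seen suggested: normalize or null out placeholder values (phone "0000000000", email "none@none.com") before blocking, check for block keys shared by thousands of records, and raise the clustering threshold so weak links stop chaining everything into one component.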

Thanks!
Rich

