Andy

May 30, 2019, 4:13:35 PM

to open source deduplication

Let's say I have a trained deduper and used it to deduplicate a dataset successfully.

Now I add one new row to the dataset.

I want to check if this new row is a duplicate or not.

Is there a way to do that in dedupe (without reclassifying the whole dataset)?

Debie Viswo

Sep 29, 2019, 2:25:30 PM

to open source deduplication

Hello Andy.

I have the same question. Were you able to find a way to do that?

Thanks!

Debbie

Andrea Borruso

Sep 30, 2019, 1:36:36 AM

to open source deduplication

Debbie, I have no idea :(

Debie Viswo

Oct 2, 2019, 7:37:13 PM

to open source deduplication

Thank you, Andrea, for your response.

Just curious: what tools did you end up using, then? Any pointers?

Thanks

Kuldeepak Sharma

Jan 3, 2020, 9:56:17 AM

to open source deduplication

I don't know if this is still relevant, but check out Flávio Juvenal's talk. He discusses incremental record linkage strategies here: https://youtu.be/McsTWXeURhA?t=2541

I haven't tried any of these myself.

Thanks.

Flávio Juvenal

Jan 28, 2020, 7:03:05 PM

to open source deduplication

Hi folks, based on what Forest Gregg told me when I gave the talk, the workflow is the following (please correct me if I'm wrong, Gregg):

1. Dedupe your data with the Dedupe class.
2. You're now confident that you have good deduplicated data.
3. Some time passes and you get new data.
4. Now use the Gazetteer class to match this new data against your old deduplicated data. In the parameters, the new data is messy_data, while the deduplicated data is simply data.

Check the gazetteer_example from https://github.com/dedupeio/dedupe-examples
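
Step 4 for a single new row could be sketched roughly as below. This is only an illustration, not code from the gazetteer_example: it assumes a dedupe 2.x style gazetteer object that has already had the deduplicated records passed to its `index` method and whose `search` method returns a list of `(messy_id, ((canonical_id, score), ...))` pairs (older 1.x releases used a `match` method instead). The helper name `find_existing_match` and the default threshold are hypothetical.

```python
from typing import Any, Mapping, Optional, Tuple

Record = Mapping[str, Any]


def find_existing_match(
    gazetteer: Any,
    new_id: str,
    new_record: Record,
    threshold: float = 0.5,  # hypothetical default; tune for your data
) -> Optional[Tuple[Any, float]]:
    """Return (canonical_id, score) for the best match of one new record, or None.

    Assumption: `gazetteer` behaves like a dedupe 2.x Gazetteer whose
    `index(data)` was already called with the deduplicated records.
    """
    # The single new row plays the role of messy_data: a dict of id -> record.
    results = gazetteer.search({new_id: new_record}, threshold=threshold, n_matches=1)
    for _messy_id, matches in results:
        if matches:
            canonical_id, score = matches[0]
            return canonical_id, float(score)
    # No candidate scored above the threshold: treat the row as new.
    return None
```

With the real library, `gazetteer` would be something like a trained Gazetteer (or one reloaded from a saved settings file) that has indexed the deduplicated data; only the one-row dict is passed to `search`, so the whole dataset is not reclassified.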

Best,

Flávio.

Harshit Saxena

Apr 23, 2020, 10:54:32 AM

to open source deduplication

Hi Flavio,

Could you explain a bit more about point 4? I'm curious to learn more about incremental runs, specifically how to manage them.

1. Do we have to load all of the de-duped data every time a new data point comes in?

2. How should the newly created IDs be sequenced? (My plan is to keep a counter on the ClusterID.)
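
The counter idea in question 2 might look something like the sketch below. It is plain Python and independent of dedupe; the class name `ClusterIdAllocator` and the integer cluster IDs are assumptions made for illustration. A new record reuses the cluster ID of the record it matched, and otherwise gets the next unused ID.

```python
from typing import Dict, Hashable, Optional


class ClusterIdAllocator:
    """Reuse the matched record's cluster ID, or mint the next unused one."""

    def __init__(self, cluster_of: Dict[Hashable, int]) -> None:
        # cluster_of maps record_id -> cluster_id from the earlier dedupe run.
        self.cluster_of = dict(cluster_of)
        # Start the counter just past the largest existing cluster ID.
        self.next_id = max(self.cluster_of.values(), default=-1) + 1

    def assign(
        self, new_record_id: Hashable, matched_record_id: Optional[Hashable] = None
    ) -> int:
        if matched_record_id is not None:
            # Duplicate of an existing record: join its cluster.
            cluster_id = self.cluster_of[matched_record_id]
        else:
            # No match: open a new cluster and advance the counter.
            cluster_id = self.next_id
            self.next_id += 1
        self.cluster_of[new_record_id] = cluster_id
        return cluster_id
```

In a real system the counter and the mapping would need to be persisted (for example in the same database as the records) so that IDs stay unique across runs.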

Best,

Harshit

Flávio Juvenal

Apr 27, 2020, 11:27:07 AM

to open-source-...@googlegroups.com

Hi folks, all I know is in the gazetteer_example from https://github.com/dedupeio/dedupe-examples

Flávio Juvenal Partner & Dev

Dedupe one new row against existing dataset
