Dedupe one new row against existing dataset

Andy

May 30, 2019, 4:13:35 PM
to open source deduplication
Let's say I have a trained deduper and used it to deduplicate a dataset successfully.

Now I add one new row to the dataset.

I want to check if this new row is a duplicate or not.

Is there a way to do that in dedupe (without reclassifying the whole dataset)?

Debie Viswo

Sep 29, 2019, 2:25:30 PM
Hello Andy.

I also have this question. Were you able to find a way to do that?

Thanks!

Debbie

Andrea Borruso

Sep 30, 2019, 1:36:36 AM
Debbie, I have no idea :(

Debie Viswo

Oct 2, 2019, 7:37:13 PM
Thank you, Andrea, for your response.

Just curious: what tools did you end up using, then? Any pointers?

Thanks

Kuldeepak Sharma

Jan 3, 2020, 9:56:17 AM
I don't know if this is still relevant, but check out Flávio Juvenal's talk. He covers incremental record linkage strategies here: https://youtu.be/McsTWXeURhA?t=2541
I haven't tried any of these myself.

Thanks.

Flávio Juvenal

Jan 28, 2020, 7:03:05 PM
Hi folks, based on what Forest Gregg told me when I gave the talk, the workflow is the following (please correct me if I'm wrong, Gregg):

  1. Dedupe your data with Dedupe class
  2. You're confident that now you have good deduplicated data
  3. Some time passes, you get new data
  4. Now, use the Gazetteer class to match this new data to your old deduplicated data. The new data goes in the messy_data parameter, while the deduplicated data goes in data.
Check the gazetteer_example from https://github.com/dedupeio/dedupe-examples
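
For anyone who wants a feel for step 4 before opening the example repo, here is a minimal sketch of matching one new row against already-deduplicated records. It is not dedupe's API: it swaps in a plain difflib string similarity for the trained model and skips blocking entirely. The sample records, the similarity and find_match helpers, and the 0.8 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher

# Previously deduplicated records, keyed by record ID (made-up sample data).
canonical = {
    1: {"name": "Acme Corporation", "city": "Chicago"},
    2: {"name": "Globex Inc", "city": "Springfield"},
}


def similarity(a, b):
    """Average case-insensitive string similarity over the fields of `a`."""
    scores = [
        SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field in a
    ]
    return sum(scores) / len(scores)


def find_match(new_record, existing, threshold=0.8):
    """Return (record_id, score) for the best match, or None if below threshold."""
    best_id, best_score = None, 0.0
    for record_id, record in existing.items():
        score = similarity(new_record, record)
        if score > best_score:
            best_id, best_score = record_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None


# Only the new rows are compared; the existing records are not re-clustered.
match = find_match({"name": "Acme Corp.", "city": "Chicago"}, canonical)
no_match = find_match({"name": "Initech", "city": "Austin"}, canonical)
```

Here match points at record 1 and no_match is None. With dedupe, as I understand it, the trained Gazetteer takes the place of both the similarity function and the loop over every existing record; the gazetteer_example shows the real calls.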

Best,
Flávio.

Harshit Saxena

Apr 23, 2020, 10:54:32 AM
Hi Flavio,

Could you explain point 4 a bit more? I'm curious to learn more about incremental runs, specifically how to manage them.

1. Do we have to load the entire de-duped dataset every time a new data point comes in?
2. How should we sequence the new IDs that get created? (My plan is to keep a counter on the ClusterID.)
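
The counter idea in question 2 could look something like the sketch below. This is only an illustration in plain Python, not something dedupe provides: cluster_of, assign_cluster, and the sample IDs are hypothetical names, and it assumes cluster IDs are integers and that the matching step has already told you which existing record (if any) the new one matched.

```python
from itertools import count

# Existing assignments: record ID -> cluster ID (made-up sample data).
cluster_of = {"r1": 0, "r2": 0, "r3": 1}

# Start the counter just past the highest cluster ID already in use.
next_cluster_id = count(max(cluster_of.values()) + 1)


def assign_cluster(record_id, matched_record_id=None):
    """Reuse the matched record's cluster ID, or open a new cluster."""
    if matched_record_id is not None:
        cluster_of[record_id] = cluster_of[matched_record_id]
    else:
        cluster_of[record_id] = next(next_cluster_id)
    return cluster_of[record_id]


a = assign_cluster("r4", matched_record_id="r3")  # joins r3's cluster
b = assign_cluster("r5")                          # no match: new cluster
c = assign_cluster("r6")                          # no match: another new cluster
```

In a real setup the counter would need to live somewhere durable (for example a database sequence) so that IDs stay unique across runs.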

Best,
Harshit

Flávio Juvenal

Apr 27, 2020, 11:27:07 AM
Hi folks, all I know is in the gazetteer_example at https://github.com/dedupeio/dedupe-examples
