open source deduplication
Dedupe is an open-source Python library for deduplicating or matching messy data, brought to you by DataMade. You can get the library here: https://github.com/datamade/dedupe
Rich DeRidder
8/17/23
Couple 'simple' questions
Hey folks.. I know just enough to be dangerous, but don't know the details on many things here. 1
mmcneill
7/4/23
Asked to label pairs that are already marked
Hi all, I have a question about the intended behavior of mark_pairs and read_training more generally.
Tim Stallmann
2
9/22/22
possible to ID focal record within cluster -or- best way to choose canonical record from cluster
Lol I am realizing I totally failed to notice the `canonicalize` convenience function, which is
Maxim Kupfer
7/29/22
How to put less weight on certain words
If I have a field that contains entries with a common word or combination of words that don't
Maxim Kupfer
7/20/22
Matching with foreign languages
The String types only match on similarity, but what about foreign languages? For example: Day, Daag,
Satish Patil
6/22/22
dedupe.console_label to API
Hi Team, Facing an issue with deduper object state while serializing and deserializing. Trying to modify
Moksha Vora
4/28/22
How to add new data to the existing clusters and make prediction at run time
It is mentioned in the documentation, that it is possible to add new data to the existing pretrained
Mark Vervuurt
2/10/22
Visualise generated settings file
Dear Experts, Is there a way to inspect or visualise the settings file generated by Dedupe? Part of
Erik Paulson
5/28/21
questions on gazetteer API/using gazetteer with OpenRefine
Hello - I've been a big fan of the DeDupe work for quite a while, thanks for providing it and
Indrajit Saha
5/27/21
Regarding Custom Blocking & Centroid Calculation for non-numerical vector
Hi All, First of all thank you for making this package open-sourced. Hope all of you are doing well.
Lucy Choque Mansilla
11/26/20
Gazetteer - Datasets for training, testing, and application
Hello everyone, I am starting to use Dedupe 2.0 and learning about Active Learning as well. Currently
Dylan Culfogienis
6/22/20
One-to-many, many-to-one, and many-to-many matching (Gazetteer)
So, I have a clean dataset that I created using a combination of filtering methods and deduplication.
Andy, … Flávio Juvenal
8
4/27/20
Dedupe one new row against existing dataset
Hi folks, all I know is on the gazetteer_example from https://github.com/dedupeio/dedupe-examples On
Forest Gregg, Deepesh Chaudhari
3
3/9/20
Dedupe 2.0
🎉🎊 On Mon, Mar 9, 2020 at 11:04 AM Forest Gregg <fgr...@datamade.us> wrote: released! On Wed,
Matthew Gross, … Deepesh Chaudhari
10
3/4/20
Problem with mysql_example: No records have been blocked together
Hi, I have a similar error with a line that implies I'm not using the data that I had trained on
Flávio Juvenal
1/28/20
Question: sample_size is pairs or records?
Hi folks, once again thanks for this excellent library! I have a question about the sample method
Efrem Braun
7/17/19
AttributeError: Attempting to block with an index predicate without indexing records
Hello, I started using dedupe a few days ago, and so far I'm a big fan. Thanks for building this
Ednalson Guy Mirlin ELIODOR, Edward Wong
2
1/17/19
Synonym
Can you give an example of the use case? I think that'll give us a better sense of how to tackle
sergey....@gmail.com, Forest Gregg
3
9/28/18
Dedupe with multiple matches
Hi Forest, Thanks for quick response. When I tried gazetteer sample from GitHub and when I changing
Forest Gregg, … mza...@clarityinsights.com
5
9/27/18
Dedupe 0.8.0
I don't mean to dredge up ancient history; however this post addressed a question I had, and I
Tim Harder, Josh Wieder
3
8/23/18
Pre-Define Blocking rules
Hi Josh, thanks for the detailed answer and apologies for the delayed response. I tried the exact
mza...@clarityinsights.com, … rachel...@gettectonic.com
11
8/15/18
Dedupe - large datasets choke trying to output flat file from matchBlocks generator
@matt, can you share what you did to separate out the 3-digit zip? We are facing a similar problem
Tom Proctor
6/18/18
Adding labeled training data to already trained model
I have successfully been using dedupe to identify duplicates in a data set. I will be adding new data
John Beales, mza...@clarityinsights.com
2
5/8/18
MySQL Example: cluster_score all over the place
John- When I first started working on dedupe, I was on the same line of thought as you - I should
meh...@thinkdataworks.com, … mza...@clarityinsights.com
5
5/8/18
Blocking time varies drastically based on training phase (gazetteer)
Mehrsa- A more layman explanation to complement Forest's explanation: Take a look at what
Matt Chambers
4/18/18
Recommended data structure / field types for handling multiple/historical email addresses
Hey gang, Looking to add matching capability for multiple/historic email addresses and curious about
Andrea Borruso
4/1/18
Use csvdedupe for a CSV almost "clean"
Hi, I have a 3000-record CSV file. It's almost clean; it has only about 5 rows that are
Vinay Babu, ilan....@credifi.com
4
3/27/18
K nearest Classifier: ValueError: Found array with 0 sample(s) (shape=(0, 165)) while a minimum of 1
Hi, After your field definition you need to create an object of the Dedupe class with all the fields
ilan....@credifi.com
3/27/18
Help with array/list variable definitions
Appreciate any help in this area: I have a data set with many companies in it. Features include
Darek Lubomski, Abhinav Jain
2
3/1/18
Sizeof labeled examples
Look if you give all the 10000 in training then, first question arises how will you do this for this