Fwd: Dedupe Alpha is here

Jason Lally

Feb 28, 2013, 11:15:50 AM
to colorado-code-...@googlegroups.com
Check this out:

Begin forwarded message:

From: Derek Eder <derek...@gmail.com>
Date: February 27, 2013, 10:39:11 AM MST
To: Forest Gregg <fgr...@gmail.com>
Subject: Dedupe Alpha is here

After a year of development and over 500 code commits, Forest and I are happy to announce that the Alpha version of Dedupe is here.

What is dedupe, you ask? It is an open source Python library that quickly de-duplicates any set of data, up to millions of rows in size. Furthermore, you don’t need a powerful server to use it - we built it to run on a modern-day laptop.

Here are some of its features:

  • machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
  • built as a library - so it can be integrated into your applications or import scripts
  • extensible - supports adding custom data types, string comparators and blocking rules
  • open source - anyone can use, modify or add to it
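
To make "weights and blocking rules" concrete, here is a minimal sketch of the general idea in plain Python. It does not use the dedupe API: the records, field names, block_key rule and WEIGHTS values are all hypothetical, and the weights are hand-picked here, whereas a learning step would estimate them from labeled pairs.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; field names are illustrative only.
records = {
    1: {"name": "Jon Smith", "city": "Chicago"},
    2: {"name": "John Smith", "city": "Chicago"},
    3: {"name": "Mary Jones", "city": "Denver"},
}

# Hand-picked field weights (an assumption for this sketch).
# A learning step would estimate these from human-labeled pairs.
WEIGHTS = {"name": 0.7, "city": 0.3}


def block_key(record):
    """Hypothetical blocking rule: first letter of the name plus the city."""
    return (record["name"][:1].lower(), record["city"].lower())


def candidate_pairs(records):
    """Yield only the pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def similarity(a, b):
    """Weighted sum of per-field string similarities, each in [0, 1]."""
    return sum(
        weight * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field, weight in WEIGHTS.items()
    )


pairs = list(candidate_pairs(records))
scores = {
    (left, right): similarity(records[left], records[right])
    for left, right in pairs
}
```

Blocking is what keeps the comparison count manageable on large data sets: only records that land in the same block are ever scored against each other.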

Usage examples
To demonstrate how it works, we came up with two examples of how to use dedupe to clean up your data. They are:

  • CSV example - loads in a flat CSV file, asks the user to label some duplicate pairs, and outputs the same file with a new canonical ID column
  • MySQL example - imports a list of 1.7 million Illinois campaign contributions into a MySQL database, asks the user to label some duplicate pairs, and then assigns each donor a canonical ID. It then outputs the top 10 de-duplicated donors.
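
As a rough illustration of the last step in both examples, the sketch below shows one way to turn a list of duplicate pairs into a canonical ID column on a CSV. It is not taken from the dedupe examples: the sample rows, the duplicate_pairs list and the assign_canonical_ids helper are hypothetical, and the grouping uses a simple union-find, which may differ from the clustering the library actually performs.

```python
import csv
import io

# Hypothetical input; a real run would read a file from disk.
INPUT_CSV = """name,city
Jon Smith,Chicago
John Smith,Chicago
Mary Jones,Denver
J. Smith,Chicago
"""

# Hypothetical duplicate pairs (by row index), e.g. from labeling and scoring.
duplicate_pairs = [(0, 1), (1, 3)]


def assign_canonical_ids(row_ids, pairs):
    """Group rows connected by duplicate pairs using union-find.

    Each row's canonical ID is the smallest row ID in its group.
    """
    parent = {row_id: row_id for row_id in row_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always keep the smaller ID as the root of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {row_id: find(row_id) for row_id in row_ids}


rows = list(csv.DictReader(io.StringIO(INPUT_CSV)))
canonical = assign_canonical_ids(range(len(rows)), duplicate_pairs)

# Write the same rows back out with one extra column.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "city", "canonical_id"])
writer.writeheader()
for row_id, row in enumerate(rows):
    writer.writerow({**row, "canonical_id": canonical[row_id]})

output_lines = output.getvalue().splitlines()
```

Rows 0, 1 and 3 end up sharing canonical ID 0 even though rows 0 and 3 were never labeled as a pair directly, because the grouping is transitive.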

We need testers and feedback
What we need now is for you guys to test and use this thing. Specifically:

  • If you have big, messy data sets you want to deduplicate, follow one of our examples and try it out! We’d be happy to give guidance over email/in person with this.
  • We’d like to see how long it takes you to run the MySQL example on your computer (and what your stats are). For us, it takes anywhere from 1 to 4 hours.
  • If you notice any bugs or have a feature request for the dedupe API, please file an issue on GitHub.
  • Join our Google group and help us build a community around this tool.

Happy deduplicating!

Derek and Forest