Dedupe 0.8.0

Forest Gregg

Mar 10, 2015, 11:51:04 AM
to open-source-...@googlegroups.com
Hi all,

Just released dedupe 0.8. Lots of goodies. The main thing is that support for Python 3.4 has been added and Python 2.6 has been dropped. Python 2.7 is still supported.

Check out details here: https://github.com/datamade/dedupe/blob/master/CHANGELOG.md

Best,

Forest

Vijay Rao

Apr 16, 2015, 1:11:52 PM
to open-source-...@googlegroups.com
Hi All,

Can you comment on any known usage of this library with a) Arabic names and b) very sparse records with many missing field values?

Thanks

Vijay

Forest Gregg

Apr 16, 2015, 2:18:21 PM
to open-source-...@googlegroups.com
Hi Vijay,

The library should work with Arabic names (it's Unicode compatible). Many of the string algorithms were developed for English and may not perform very well for a non-Latin writing system. Would love to hear about your experience, or whether you know of string distances that work well for Arabic.
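
To illustrate the Unicode point, here is a minimal sketch of a string distance working on Arabic text in Python 3. It uses plain Levenshtein distance as a stand-in, not dedupe's own affine gap comparator, and the function name `levenshtein` is only for this example. The takeaway is that Python 3 strings are sequences of code points, so the comparison itself does not care about the script; whether the resulting distances are linguistically meaningful for Arabic is a separate question.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance over Unicode code points (illustrative only)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Two Arabic names written with escapes so the code points are unambiguous:
# "Muhammad" (meem, hah, meem, dal) and "Mahmoud" (meem, hah, meem, waw, dal).
muhammad = "\u0645\u062d\u0645\u062f"
mahmoud = "\u0645\u062d\u0645\u0648\u062f"

print(levenshtein(muhammad, mahmoud))
```

The two names differ by a single inserted letter, so distinct names can sit very close together under a generic edit distance, which is one reason algorithms tuned on English may behave differently here.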

Sparse records are not a huge problem because we have a way of modeling missing data.
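
One common way to model missing data in a pairwise comparison is sketched below. This is an assumption about the general approach, not a copy of dedupe's internals: when either value is missing, the distance feature is zeroed out and a separate indicator feature records that the comparison was missing. The names `compare_with_missing` and `exact_mismatch`, and the choice of 1.0 to mean "missing", are hypothetical.

```python
def exact_mismatch(a, b):
    """Toy distance: 0.0 if the values are equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def compare_with_missing(a, b, distance=exact_mismatch):
    """Return (distance_feature, missing_indicator) for one field.

    If either value is missing (None or empty string), the distance is
    zeroed and the indicator is set, so a downstream model can learn a
    separate weight for "this field could not be compared".
    """
    if a is None or b is None or a == "" or b == "":
        return 0.0, 1.0
    return float(distance(a, b)), 0.0


print(compare_with_missing("Cairo", "Giza"))  # both present, values differ
print(compare_with_missing("Cairo", None))    # one side missing
```

With features like these, a sparse record contributes no evidence for or against a match on its empty fields, rather than being treated as maximally different.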

Best,

Forest

--

---
You received this message because you are subscribed to the Google Groups "open source deduplication" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-source-dedupl...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vijay Rao

Apr 16, 2015, 2:49:20 PM
to open-source-...@googlegroups.com
Thanks for that feedback. I will report back on what I find in the coming days/weeks.

mza...@clarityinsights.com

Sep 27, 2018, 5:44:02 PM
to open source deduplication
I don't mean to dredge up ancient history; however, this post addressed a question I had, and I wanted to add a few keywords to it:

My question was whether Unicode (UTF-8/16, etc.), double-byte characters, and non-English characters are compatible with any/all of the algorithms in Dedupe. It sounds like the answer is YES! I would expect that things like affine gap would still be applicable, but some algorithms may fail miserably (n-char prefix, maybe?).

I suppose there may be some special considerations around addresses and other fields that may not use US/English conventions.

Cheers!