Re: using dedupe on the web


Friedrich Lindenberg

Nov 25, 2013, 4:39:01 AM
to Derek Eder, Declan Frye, Forest Gregg, open-source-...@googlegroups.com
Hi all,

(cc'ing open-source-deduplication, as suggested by Derek)




On Wed, Nov 20, 2013 at 5:17 PM, Derek Eder <derek...@gmail.com> wrote:
Friedrich & Declan -

Both of you have an interest in getting dedupe set up to work via a web interface, so it seems like a good idea to connect you!

Friedrich, Declan is the CTO at Purple Binder and has an interest in using dedupe to match locations that offer social services. He has started some work on this with his dedupe-http-server repo.

Declan, Friedrich is with the Open Knowledge Foundation and is interested in hooking dedupe up to Nomenklatura, a free and open source data reconciliation service. 

Hope you guys can find a way to collaborate! Would you mind starting a thread on our Dedupe Google Group so others can join in if interested? Also, Forest and I can always help out if you have any questions.


Friedrich Lindenberg

Nov 25, 2013, 4:53:56 AM
to Derek Eder, Declan Frye, Forest Gregg, open-source-...@googlegroups.com
Hi again,

sorry about the empty message - it's Monday morning and this is what happens.

Anyway, I wanted to introduce myself (http://pudo.org) to this group and present nomenklatura, a software project that I've been working on for a while and that I'm hoping may become a user of dedupe very soon.

Nomenklatura (http://nomenklatura.okfnlabs.org/, https://github.com/pudo/nomenklatura) came out of my attempts to build different open data projects, including an EU lobby tracker and a parliamentary tracking site in Germany. In both cases, I had to integrate multiple datasets by merging references to entities - in one case, German parliamentarians, in the other, European companies and NGOs. I decided that I wanted to store the identity mappings online, in a place independent of the applications they would be used in.

In that sense, nomenklatura is just an API to store entity names, plus a very simple UI that helps mark some of them as duplicates of others. There is no support for sophisticated clustering yet.

Since I published nomenklatura, few people besides me have adopted it. I've therefore recently started a charm offensive, including a major refactor to simplify the code base, a switch to AngularJS to improve the UX, and better documentation (cf. the "nextgen-13" branch on GitHub).

I hope to be done with much of this by early/mid December, and would then like to look at integrating a clustering mode based on dedupe. I'm very much looking for collaborators on this, and this community here may just be nerdy enough to find me some!

@Declan: it looks like dedupe-http-server is also based on Flask. Perhaps it could be turned into a blueprint with some internal API, so that nomenklatura could feed one of its datasets (essentially a list of entity objects with a variety of attributes) into the service, then use the training API and kick off a Celery job to run the real task?
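
To make that a bit more concrete, here is a rough, framework-free sketch of what such an internal API might look like: hand over a list of entity objects, record a few labeled pairs, then start a background job. Everything in it is an assumption for illustration. `MatchingService`, `submit_dataset`, `label_pair` and `start_clustering` are hypothetical names, not part of dedupe, dedupe-http-server or nomenklatura. A thread pool stands in for a Celery worker, and grouping by normalized name stands in for dedupe's actual clustering.

```python
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class MatchingSession:
    """Hypothetical per-dataset state: the entities plus any labeled pairs."""
    entities: list[dict]
    labeled_pairs: list[tuple[int, int, bool]] = field(default_factory=list)


class MatchingService:
    """Hypothetical internal API that a web layer could wrap."""

    def __init__(self) -> None:
        # Assumption: a thread pool stands in for a Celery worker here.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._sessions: dict[str, MatchingSession] = {}

    def submit_dataset(self, name: str, entities: list[dict]) -> None:
        """Feed a dataset (a list of entity objects) into the service."""
        self._sessions[name] = MatchingSession(list(entities))

    def label_pair(self, name: str, left: int, right: int, is_duplicate: bool) -> None:
        """Training step: record a human judgement about two entities."""
        self._sessions[name].labeled_pairs.append((left, right, is_duplicate))

    def start_clustering(self, name: str) -> Future:
        """Run the real task in the background and return a handle to it."""
        return self._pool.submit(_cluster, self._sessions[name])


def _cluster(session: MatchingSession) -> list[list[int]]:
    """Placeholder clustering, NOT dedupe: union entities that share a
    normalized name or that were labeled as duplicates."""
    parent = list(range(len(session.entities)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    first_seen: dict[str, int] = {}
    for index, entity in enumerate(session.entities):
        key = " ".join(entity["name"].lower().split())
        if key in first_seen:
            union(index, first_seen[key])
        else:
            first_seen[key] = index

    for left, right, is_duplicate in session.labeled_pairs:
        if is_duplicate:
            union(left, right)

    clusters: dict[int, list[int]] = {}
    for index in range(len(parent)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())
```

A caller would do `service.submit_dataset("orgs", entities)`, add a few `service.label_pair(...)` calls, and then wait on `service.start_clustering("orgs").result()` to get lists of entity indices that belong together. In a Flask setup, each of these three methods could presumably sit behind one blueprint route, with the thread pool swapped for a real task queue.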

All the best,

 - Friedrich
 