Hi Owen and Thad,
Thanks for adding your thoughts.
Owen, very interesting that you are working on reconciliation of Indian politicians. I'll touch base with you off-line to see if there are things that we can share or learn from each other. Our work is part of a non-partisan research center at Ashoka University.
You're right: I'm sure the parts are all there in OpenRefine, and it's just a matter of tweaking a few things. In particular, confirming a cluster could unify an ID field rather than the field values directly. As you pointed out, we also need to cluster on any variant of the field within a record, not just a representative value.
I'd like to emphasize a couple of important needs in an application like ours:
- Incremental clustering is essential, so the ability to only show clusters that span different records is very useful.
- It's quite possible for two records to have the same value in a field and still be different entities. The only way to track that is through an ID field, which is another reason we can't depend on just unifying the field value.
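
To make the two points above concrete, here is a rough sketch in Python. It is only an illustration under assumptions: the names `fingerprint`, `EntityIds` and `candidate_clusters` are made up here and are not OpenRefine APIs, and the key function is a simplified token-sorting key rather than any particular Refine method. The idea is that confirming a cluster merges IDs, never the name values, and that only clusters spanning more than one ID are shown.

```python
from collections import defaultdict

def fingerprint(name):
    # Simplified key: lowercase, drop punctuation, sort the unique tokens.
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(sorted(set(cleaned.split())))

class EntityIds:
    """Union-find over record IDs; merging never rewrites the name values."""
    def __init__(self):
        self.parent = {}

    def find(self, rid):
        self.parent.setdefault(rid, rid)
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def merge(self, rids):
        # Confirming a cluster: point every ID at the smallest root.
        roots = sorted({self.find(r) for r in rids})
        for r in roots[1:]:
            self.parent[r] = roots[0]
        return roots[0]

def candidate_clusters(rows, ids):
    """rows is a list of (record_id, name_variant) pairs. Every variant of a
    record contributes a key; only keys spanning more than one ID are kept."""
    by_key = defaultdict(set)
    for rid, name in rows:
        by_key[fingerprint(name)].add(ids.find(rid))
    return {k: sorted(v) for k, v in by_key.items() if len(v) > 1}

# Hypothetical data: record 2 has two variants; records 3 and 4 share a
# name but may well be different people.
rows = [
    (1, "Ashok Kumar Sharma"),
    (2, "Sharma, Ashok Kumar"),
    (2, "A. K. Sharma"),
    (3, "Ram Singh"),
    (4, "Ram Singh"),
    (5, "Sharma A K"),
]
ids = EntityIds()
before = candidate_clusters(rows, ids)
for key in ("ashok kumar sharma", "a k sharma"):
    ids.merge(before[key])              # analyst confirms these two clusters
after = candidate_clusters(rows, ids)   # only the unresolved cluster remains
```

In this toy run, records 1, 2 and 5 end up under one ID because record 2 matches on different variants, while the two "Ram Singh" records stay separate until someone confirms them. A real tool would presumably also need to remember rejected clusters so they stop reappearing.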
I think the above would be essential features for many applications of OpenRefine in entity resolution. We'd be happy to work on helping improve Refine to address this use case, if people think it's important and generally applicable. However, we would probably need a good deal of help from the core developers.
The clustering options in our own tool are not as sophisticated as Refine's (we have edit distance and another home-grown algorithm to cluster names modulo initials). But we have some other nice features to canonicalize Indian names, and elements in the user interface that make the analyst's job easier for our particular dataset.
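
For illustration only, one simple reading of "matching names modulo initials" could look like the sketch below. This is not our actual algorithm; the function names `name_tokens` and `match_modulo_initials` are hypothetical, and the sketch assumes tokens appear in the same order and that initials are written as separate letters.

```python
def name_tokens(name):
    # Lowercase and split on anything that is not a letter.
    cleaned = "".join(c if c.isalpha() else " " for c in name.lower())
    return cleaned.split()

def match_modulo_initials(a, b):
    """True if the names agree token by token, where a single-letter
    token is allowed to stand for any token starting with that letter."""
    ta, tb = name_tokens(a), name_tokens(b)
    if len(ta) != len(tb):
        return False
    for x, y in zip(ta, tb):
        if x == y:
            continue
        if len(x) == 1 and y.startswith(x):
            continue
        if len(y) == 1 and x.startswith(y):
            continue
        return False
    return True
```

Note that "A. K. Sharma" would match both "Ashok Kumar Sharma" and "Anil Kapoor Sharma" under this test, so it only proposes candidates; the analyst (and an ID field) still has to decide which records are really the same person.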
Thanks!
SH