Seeing clusters consisting of multiple records


Sudheendra Hangal

Nov 12, 2018, 5:49:42 AM
to openr...@googlegroups.com
Hello,
I'm new to Refine, so I apologize if I'm missing something obvious.

I'm trying to cluster names in a dataset that has already been partially merged. The first column of the dataset is a sorted ID (on which I did "Blank Down", so I could get records for each unique ID). The second column is the Name.

Now I need to cluster names that belong to different IDs, so that I can merge those IDs. Currently, when I cluster on the names column, I see candidate rows for merging that belong to the same record, i.e. they already have the same ID. How can I see only merges that involve multiple records?
Thanks!

SH

--
Sudheendra Hangal

Owen Stephens

Nov 12, 2018, 5:11:54 PM
to OpenRefine
Hi Sudheendra - welcome to the OpenRefine group!

It's hard to know exactly without seeing some examples, but the way clustering usually works in OpenRefine is to find all the values that might be the same, and then change them to a single value. It doesn't really pay any attention to the records the values belong to - they are all just values to be clustered.

If you need to keep the original values as well then you can create a duplicate of the Name column (by using 'Edit Column->Add column based on this column') and then use that for the actual clustering where you make changes to the values. Then you will have the original values in the original Name column, and the outcome of your clustering in the second column.

I'm not sure if this helps at all - if you could share some example data (dummy data that illustrates the problem you are trying to solve, if you can't share the original data) then it may be easier for me and others to give advice.

Owen

Sudheendra Hangal

Nov 12, 2018, 11:28:38 PM
to openr...@googlegroups.com
Hi Owen,
Thanks for the response. My use case is the following (and the data is entirely public at http://lokdhaba.ashoka.edu.in/):

Given a large list of names of Indian politicians (everyone who has contested either a federal or state election since 1962), we'd like to assign them IDs that are robust to variations in spelling, use of initials, titles, etc.
For example, the names
LAL KRISHNA ADVANI
Mr. L.K. ADVANI
ADVANI, LALKRISHNA
are all variants of the same person and I'd like to assign the same ID to them. Let's say I've merged the first two and have 

pid, name
1, LAL KRISHNA ADVANI
1, L.K. ADVANI
2, ADVANI, LALKRISHNA

The first two rows have been assigned the same ID, but the third has not, and I'd like it to get that ID too. However, I don't want to review the clustering between rows 1 and 2 again, because there will be a lot of such rows.

The dataset is large enough (about 1M rows) -- and important enough -- that we've spent several months trying to clean it, and in fact built a tool called Surf that helps resolve these rows to the same ID. However, we'd like to try clustering with some of the algorithms in OpenRefine because it may catch clusters that our algorithms didn't.

I'm imagining that this use case of assigning IDs for entities must be quite common. Are there existing extensions, or is it possible to write one for this job?
Thanks!

SH


--
Sudheendra Hangal
CEO, Amuse Labs

Owen Stephens

Nov 14, 2018, 2:31:05 AM
to OpenRefine
Thanks SH

By coincidence I'm also working with lists of Indian politicians at the moment! I'm working on a project (https://www.mysociety.org/democracy/democratic-commons/) that is adding information on politicians to Wikidata (https://www.wikidata.org/wiki/Wikidata:WikiProject_every_politician).

In terms of dealing with your clustering challenge - the OpenRefine clustering process is designed to actually change the values used for the clustered items, rather than simply clustering them. Alternatively, the OpenRefine 'reconciliation' process is designed to check a value against an authorised list of values and assign an ID from that authorised list to the value in your project. It seems like you are looking for something that sits between the two methods: use the clustering to group values, but rather than change the value, assign an ID to the value to group it with all other similar values. I can see some advantages to this approach, but at the moment OpenRefine doesn't support it directly.

It would probably be possible to write an extension to OpenRefine to behave in the way you need - although I've not really given the implementation any real thought, it feels like all the relevant pieces are there, it is just a matter of using them in a slightly different way to the current behaviour.

A very small step in the direction you need would be to:

1. Duplicate the name column
2. Blank down the PID column to get records
3. On the duplicate name column, use the GREL expression
   row.record.cells.name.value[0]

This gets you a single value for each record (that is, for each of your existing clusters) in the new column. You can then use the usual clustering process on that new column, which will update the values in that column. Grouping the rows by the resulting unique values would then allow you to bring together your PIDs.

The problem with this approach is that you lose the variety you have in the original name column: the clustering process is going to be poorer because it only uses a single value from each of your current, richer groups. So while this might get you some additional clustering, it wouldn't really be making use of the data you have and could easily miss matches that you'd find from the full list of values.
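
To make that limitation concrete, here is a rough Python sketch of the same workaround done outside OpenRefine. It is only an illustration under assumptions: `fingerprint` is a simplified approximation of a key-collision key (lowercase, strip punctuation, sort the unique tokens), not OpenRefine's actual implementation, and the function names and sample rows are made up for the example.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Simplified key: lowercase, replace punctuation with spaces, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(sorted(set(cleaned.split())))

def representative_names(rows):
    """Keep the first name seen for each pid (one value per record)."""
    first = {}
    for pid, name in rows:
        first.setdefault(pid, name)
    return first

def cluster_representatives(rows):
    """Group pids whose representative names share a key; keep groups of 2 or more."""
    groups = defaultdict(list)
    for pid, name in representative_names(rows).items():
        groups[fingerprint(name)].append(pid)
    return [pids for pids in groups.values() if len(pids) > 1]

# Hypothetical sample rows: (pid, name)
rows = [
    (1, "LAL KRISHNA ADVANI"),
    (1, "L.K. ADVANI"),
    (2, "ADVANI, LAL KRISHNA"),
    (3, "L K ADVANI"),
]
print(cluster_representatives(rows))  # [[1, 2]]
```

Record 3 is not proposed for merging even though it has the same key as the second name in record 1, because only the first name of each record takes part in the clustering.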

Sorry I can't offer a better solution here - maybe others can suggest alternative approaches that are closer to what you want.

Owen

Thad Guidry

Nov 14, 2018, 9:13:25 AM
to openr...@googlegroups.com
Interesting...

Cluster values by a chosen algorithm...and then create a new column with an ID in the cells of that new column...effectively, this is "Cluster and create records".


Sudheendra Hangal

Nov 16, 2018, 5:15:50 AM
to openr...@googlegroups.com
Hi Owen and Thad,
Thanks for adding your thoughts.

Owen, very interesting that you are working on reconciliation of Indian politicians. I'll touch base with you off-line to see if there are things that we can share or learn from each other. Our work is part of a non-partisan research center at Ashoka University.

You're right, I'm sure the parts are all there in OpenRefine, it's just a matter of tweaking a few things -- in particular, the confirmation of a cluster could unify an id field rather than the field values directly. As you pointed out, we need to cluster based on any variant of the field within a record, not just a representative value. 

I'd like to emphasize a couple of important needs in an application like ours:
- Incremental clustering is essential, so the ability to only show clusters that span different records is very useful.
- It's very possible that 2 fields have the same value, but are actually different records. The only way to track that would be through an ID field, so that is another reason we can't depend on just unifying the field value.
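
As a rough sketch of what "unify an ID field rather than the field values" could look like, here is a small union-find in Python. This is not existing OpenRefine behaviour; the class name `RecordMerger` and the sample rows are assumptions made for illustration. Confirming a cluster merges record IDs, while the names themselves are never edited, so two records with an identical name can stay separate.

```python
class RecordMerger:
    """Union-find over record IDs: a confirmed cluster merges IDs, names stay untouched."""

    def __init__(self):
        self.parent = {}

    def find(self, pid):
        """Return the canonical ID for pid."""
        self.parent.setdefault(pid, pid)
        while self.parent[pid] != pid:
            self.parent[pid] = self.parent[self.parent[pid]]  # path halving
            pid = self.parent[pid]
        return pid

    def merge(self, pid_a, pid_b):
        """Record that two IDs refer to the same entity; the smaller ID becomes canonical."""
        root_a, root_b = self.find(pid_a), self.find(pid_b)
        if root_a != root_b:
            self.parent[max(root_a, root_b)] = min(root_a, root_b)

# Hypothetical sample rows: (pid, name)
rows = [
    (1, "LAL KRISHNA ADVANI"),
    (1, "L.K. ADVANI"),
    (2, "ADVANI, LALKRISHNA"),
    (3, "L.K. ADVANI"),  # same string as a name in record 1, but a different person
]

merger = RecordMerger()
merger.merge(1, 2)  # the analyst confirmed that records 1 and 2 are the same person
# Record 3 is deliberately not merged, although its name equals one in record 1.

resolved = [(merger.find(pid), name) for pid, name in rows]
for pid, name in resolved:
    print(pid, name)
```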

I think the above would be essential features for many applications of OpenRefine in entity resolution. We'd be happy to work on helping improve Refine to address this use case, if people think it's important and generally applicable. However, we would probably need a good deal of help from the core developers.

If anyone is interested, the tool that we've built for our work is at http://lokdhaba.ashoka.edu.in/surf.
The clustering options in it are not as sophisticated as those in Refine (we have edit distance and another home-grown algorithm to cluster names modulo initials). But we have some other nice features to canonicalize Indian names, and elements in the user interface that make the analyst's job easier for our particular dataset.
Thanks!

SH


Thad Guidry

Nov 16, 2018, 5:39:19 AM
to openr...@googlegroups.com
One thing we thought of doing, but never had the time, was to make Clustering "pluggable" with custom algorithms or rules.
In your case, I imagine that would be ideal.

Could you share your thoughts, or the code you have for your algorithms, with us so we can begin a design plan for making OpenRefine Clustering "pluggable"?


Sudheendra Hangal

Nov 16, 2018, 6:28:41 AM
to openr...@googlegroups.com
Hi Thad,
A pluggable architecture for clustering would be great. But even without that, I wonder if it would be possible to have an option to only see clusters that span more than 1 record. That should be pretty straightforward, no?

That would already help us to find candidates for clustering and we can update IDs with some other tool.
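
To show what I mean, here is a minimal Python sketch of the filter, assuming the clusters are available as lists of (pid, name) pairs. The function name and sample data are made up for illustration; this is not an existing OpenRefine option.

```python
def clusters_spanning_records(clusters):
    """Keep only clusters whose members come from at least two distinct record IDs."""
    return [
        cluster
        for cluster in clusters
        if len({pid for pid, _name in cluster}) > 1
    ]

# Hypothetical clusters as produced by some clustering step: lists of (pid, name)
clusters = [
    [(1, "LAL KRISHNA ADVANI"), (1, "KRISHNA LAL ADVANI")],  # same record: already merged
    [(1, "L.K. ADVANI"), (2, "L K ADVANI")],                 # spans records 1 and 2
]

for cluster in clusters_spanning_records(clusters):
    print(cluster)
```

Only the second cluster is shown, since the first one involves a single ID.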

If you need to look at our clustering algorithms, here is the code:
(all the code is open source, under Apache 2.0 license. Happy to help people use it, if anyone finds it relevant.)

The compatible name algorithm manages to cluster things like
"JUGAL KISOR" and "JUGAL KISOR SARMA"
"IERAM REDI SUBA WENKATA" and "I REDI SUBA"
"ADWANI LAL KRISNA" and "L K ADWANI"
"AJMAL SIRAJUDIN" and "AJMAL SIRAJ UDIN"

(the above strings have themselves been re-tokenized and normalized based on variants of spellings in India - the rules are in Config.java.)
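
For readers who just want the flavour of "compatible" names, here is a simplified Python sketch. It is not the Surf code and makes its own assumptions: names count as compatible if they are identical once spaces are removed, or if every token of the shorter name matches a distinct token of the longer one, either exactly or as a single-letter initial. The function names are hypothetical, and a rule this loose would need human review of its candidates.

```python
def tokens_compatible(a: str, b: str) -> bool:
    """Tokens match if they are equal or one is a one-letter initial of the other."""
    if a == b:
        return True
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return len(short) == 1 and long_.startswith(short)

def names_compatible(name1: str, name2: str) -> bool:
    """Order-independent comparison of two already-normalized names."""
    t1, t2 = name1.split(), name2.split()
    # Same letters, different spacing (e.g. SIRAJUDIN vs SIRAJ UDIN)
    if "".join(t1) == "".join(t2):
        return True
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    remaining = list(longer)
    # Match long tokens first so that an initial cannot use up a full-token match.
    for token in sorted(shorter, key=len, reverse=True):
        match = next((r for r in remaining if tokens_compatible(token, r)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True

pairs = [
    ("JUGAL KISOR", "JUGAL KISOR SARMA"),
    ("IERAM REDI SUBA WENKATA", "I REDI SUBA"),
    ("ADWANI LAL KRISNA", "L K ADWANI"),
    ("AJMAL SIRAJUDIN", "AJMAL SIRAJ UDIN"),
]
for left, right in pairs:
    print(left, "|", right, "->", names_compatible(left, right))
```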

SH

Owen Stephens

Nov 16, 2018, 6:50:19 PM
to OpenRefine
Thanks SH

I think some work on this would be very interesting - I completely see why you want to keep the original values with IDs for the clusters and how that offers more than the current OpenRefine approach for you. I would suggest that a way forward would be to start a GitHub feature request where we can have some further discussion on the approach and other members of the development team could contribute.

And please do get in touch re Indian politicians - it would be interesting to compare notes :)

Owen