Clustering using the values in two columns

893 views
Skip to first unread message

Keith Maguire

unread,
Nov 19, 2012, 9:03:03 PM11/19/12
to openr...@googlegroups.com
Is it possible to include the values in two separate columns when clustering? 

I've got a file where I need to clean information relating to collectors and identifiers of biological specimens. 

The information in the columns needs to be kept separate but the same individuals are involved in both activities so I'd like to use the values in both of them so I don't end up cleaning the two columns in two different ways

(Unfortunately there isn't a third, clean, file which could be used for reconciliation.)

Can this be done?
Thanks
Keith

Thad Guidry

unread,
Nov 19, 2012, 9:21:05 PM11/19/12
to openr...@googlegroups.com
One approach to the problem is to just manually join those columns and create a new column, with a unique split identifier.

Without knowing or seeing some sample rows of your data, I suspect you have something like ?:

Collector    BioSpecimenID

and you could probably just Add a New Column called Collector_BioSpecimenID with a manual join like so:

cells.Collector.value + " splitmelater " + cells.BioSpecimenID.value

You might want to create some custom text facets using ngram() , fingerprint() or ngramFingerprint() on the columns where necessary.

If the collectors are famous scientists, and you can parse them out into a new column, then you might also want to reconcile those to Freebase or a standard reconcile service or a Web API that's out there.

Can you give us a few sample rows of your data to look at here ?  It's hard to see the cluster patterns without seeing a few.



Thad Guidry

unread,
Nov 19, 2012, 9:26:13 PM11/19/12
to openr...@googlegroups.com
The information in the columns needs to be kept separate but the same individuals are involved in both activities so I'd like to use the values in both of them so I don't end up cleaning the two columns in two different ways


Re-reading that, perhaps your really just asking how to make a copy of the cells values first into a new column...to work with further clustering and transforming...without disturbing your original columns ?

Just create Add a New Column based on this column with the expression.

value

or perhaps

value + "splitmelater" + cells.specimenIdColumnName.value

or

value + "splitmelater" + cells["Specimen ID Column 1"].value

whatever form works for your column name with/without spaces.

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Keith Maguire

unread,
Nov 19, 2012, 11:56:37 PM11/19/12
to openr...@googlegroups.com
Thanks Thad,

I'm okay with joining or splitting columns as necessary, but I'm not sure that that will enable me to do what I'm after - 

It's tricky to describe as the largest file has 60,000 entries with about 90 different people, expeditions and organisations named in it. But this is an example of the kind of entries there are:

Identifier   Collector
Gross G F
A M Lea
Lea A M Tindale N B 
Gross G F Stony Desert Survey
Tindale NB Tindale N B 
Gross G F Tindale N B 

I'd like to have

 "Lea A M" and "A M Lea" 

and

 "Tindale NB" and "Tindale N B" 

presented as clusters.

So I eventually end up with two clean columns of names where an individual or organisation is given the same format name whichever column they appear in.

I don't think joining the columns together avoids this but would be happy to be wrong!

Keith

Owen Stephens

unread,
Nov 20, 2012, 6:36:50 AM11/20/12
to openr...@googlegroups.com
Not sure if this is the best way to do it, but I think you could:

Create a new column that combines Identifier and Collector values separated by (e.g.) pipe |. Ensure you test for Blank cells. I think you'll want to include a 'BLANK' value (or something to keep number of values in cell consistent)
In this new column use 'split multi-valued cells' function to split out values. Now you have a column with all possible name values in it
Cluster cells in this column - based on the examples here a good approach might be Key Collision using Ngram Fingerprint with Ngram length of 1
Based on this clustering clean up the data using the form of the name you prefer
You should now have a column with all names in a consistent format - join multi value cells back together
Split this column into to two
These two columns are your new Identifier and Collector cols with the names normalised

As I say, no idea if this is the best way to do it, but I think it would work

Owen

Tom Morris

unread,
Nov 20, 2012, 10:00:56 AM11/20/12
to openr...@googlegroups.com
That's going to be kind of tricky.  The best Refine is going to be able to do is help you identify the mappings.  See below...

On Mon, Nov 19, 2012 at 11:56 PM, Keith Maguire <mrkjm...@gmail.com> wrote:

It's tricky to describe as the largest file has 60,000 entries with about 90 different people, expeditions and organisations named in it. But this is an example of the kind of entries there are:

Identifier   Collector
Gross G F
A M Lea
Lea A M Tindale N B 
Gross G F Stony Desert Survey
Tindale NB Tindale N B 
Gross G F Tindale N B 

I'd like to have

 "Lea A M" and "A M Lea" 

and

 "Tindale NB" and "Tindale N B" 

presented as clusters.columns needs to be kept separate but the same individuals are involved in both activities so I'd like to use the values in both of them so I don't end up cleaning the two columns in two different ways

You could create a new column which consists of the other two columns joined with a pipe or some other unused character (value+'|'+cells['Identifier']].value), then split the multi-valued cells on the new column using the same separate column.  This will give you a single column of all identifiers which you can cluster, but you're going to have to apply the mappings that it finds back to the original columns by hand.

Having said that, 90 identifiers doesn't sound like a whole lot.  Have you tried just editing using the text facet?  If things collate together fairly well, you could just edit by hand there pretty quickly.  Since Refine applies the edits to all rows with the same value, the import factor is the number of variants, not the total number of rows, so you're dealing with a couple hundred(?) edits, not 60,000 rows.

Tom

Martin Magdinier

unread,
Nov 20, 2012, 10:15:35 AM11/20/12
to openr...@googlegroups.com
I'd go with a sliglty different approach that Tom and Owen (as I don't understand why they join value from the two columns, this will create longer string making the cluster algorithm less efficient). 

1/ Make sure that your have no cell with blank content in both cells, if so replace it by a non used character like a pipe |
2/ transpose the two column across rows into: (1) one unique column that you can name: named and (2) by prepend the column name.
3/ Split your new column into two rows, one named 'previous column name', one named 'data'
4/ Cluster as much as you need on the new data column

Now we need to get your data back in the original format:

5/ filter on the 'previous column name' and select 'Collector' value
6/ Create a new column based on the 'data' column and keep the same value, name this new column Collector
7/ blank all Collector value on 'data' column
8/ Remove the facet on the 'previous column name' field
9/ Rename the data value to 'Identifier ' and fill down.
10/ Filter the Collector field on blank and remove all matching records
11/ You're done!


Martin


--
 
 

Owen Stephens

unread,
Nov 20, 2012, 10:33:33 AM11/20/12
to openr...@googlegroups.com
On Tuesday, November 20, 2012 3:15:57 PM UTC, Magdmartin wrote:
I'd go with a sliglty different approach that Tom and Owen (as I don't understand why they join value from the two columns, this will create longer string making the cluster algorithm less efficient). 

We don't cluster on the joined value - the 'split mutli-valued cell' function puts each | separated value in joined string into it's own row - so I think we reach a pretty similar solution by these different routes.

Thad Guidry

unread,
Nov 20, 2012, 11:58:25 AM11/20/12
to openr...@googlegroups.com
Ah, that makes it easier for me to help.

So what you really want is to keep the keys and the values , then work out the clustering, preferably within a single column (since that's how Stefano designed Clustering to work, and we do not support Clustering over more than 1 column currently).

We have a useful menu item that I asked for back in the early days of Refine that I use for cases such as yours.  Transpose cells across columns into rows.

You want to keep your identifiers I think, and actually probably just want to work with copies of the columns, come to think of it.  Anyways,

I have attached a project that you can import that shows your example and the operations that I performed, where I also did a hacky custom text facet to cluster on a name, but you can probably variate that regex or forEach() loop how you desire:

forEach(value.partition(/([A-Z][a-z]+)/), v, v.match(/([A-Z][a-z]+)/))[1].join('')


custom-cluster.google-refine.tar.gz

Thad Guidry

unread,
Nov 20, 2012, 12:05:21 PM11/20/12
to openr...@googlegroups.com
oops, wrong project... here ya go !

custom-cluster.google-refine (1).tar.gz

Thad Guidry

unread,
Nov 20, 2012, 12:22:59 PM11/20/12
to openr...@googlegroups.com
Owen,

I should have also pointed out that you want to Apply that custom GREL expression to the cluster_me 2 column, btw.

Keith Maguire

unread,
Nov 21, 2012, 1:33:39 AM11/21/12
to openr...@googlegroups.com
I'm grateful to everyone below for their help and suggestions - you've pointed me in a number of useful directions and also clarified which things cannot be done!

Thanks

Keith
Reply all
Reply to author
Forward
0 new messages