Duplicates: Cluster merge, but merging the whole rows


Chichon

Mar 8, 2012, 5:38:37 AM
to google...@googlegroups.com
Hi,
I'm new to Google Refine and I'm trying to clean up duplicates in a database of hotels from my city. Clustering is working great, and most of the suggested clusters are right. The thing is, I want to merge each cluster into a single row, even if its rows have different data in some columns.
A basic example would be this:

Name         | Address        | Phone
Hotel Great  | 14th street 12 | +1 543 765432
Great Hotel  |                | 543 765432
Grand        |                | 653 878754
Grand Hotel  | Gran Av. 654   |

Ending like this:

Name         | Address        | Phone
Great Hotel  | 14th street 12 | +1 543 765432
Grand Hotel  | Gran Av. 654   | 653 878754

The more complicated part here is what to do with the Great Hotel phone number, since the two rows disagree. I would like to do something simple, like keeping the data from the cell with the longest string (because the real database has many more columns with many different types of data). But I don't know how to approach this kind of work. Do you have any idea how to do it or, at least, do you think it is possible?
Thanks a lot!
FS

Tom Morris

Mar 8, 2012, 9:36:28 AM
to google...@googlegroups.com

Sounds like you should vote for this enhancement:
http://code.google.com/p/google-refine/issues/detail?id=90

In the meantime, you could cluster on the name, sort so that identical
names are adjacent, and then use the Blank Down command on the names,
which will convert all the independent rows into Refine "records". You
can then access the various phone numbers with the expression:

row.record.cells['Phone'].value

This will give you an array of all the values, which you could use with
forEach(), max(), or whatever expression will get you the value
that you think is appropriate. You could also use Blank Down on that
column to easily eliminate any duplicates before doing additional
processing (hint: trim leading and trailing whitespace first for best
results).
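
To make the "keep the longest value" rule concrete, here is a minimal sketch of the logic in plain Python, outside of Refine. It assumes the names have already been clustered to a single spelling; the function name `merge_longest` and the sample rows are hypothetical and only mirror the example from the question.

```python
def merge_longest(rows, key):
    """Collapse rows that share the same value in `key` into one row,
    keeping the longest trimmed value found for each column."""
    merged = {}  # key value -> merged row (dicts keep insertion order)
    for row in rows:
        target = merged.setdefault(row[key], {})
        for column, value in row.items():
            value = (value or "").strip()  # trim whitespace before comparing
            if len(value) > len(target.get(column, "")):
                target[column] = value
            else:
                target.setdefault(column, "")  # keep the column even if empty
    return list(merged.values())


# Sample data, assuming clustering has already unified the names.
rows = [
    {"Name": "Great Hotel", "Address": "14th street 12", "Phone": "+1 543 765432"},
    {"Name": "Great Hotel", "Address": "", "Phone": "543 765432"},
    {"Name": "Grand Hotel", "Address": "", "Phone": "653 878754"},
    {"Name": "Grand Hotel", "Address": "Gran Av. 654", "Phone": ""},
]

result = merge_longest(rows, "Name")
for row in result:
    print(row)
```

Longest-string is only a heuristic: it happens to pick the phone number with the country code here, but it will not always pick the most correct value in other columns.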

Hope this gets you started in the right direction.

Tom

Chichon

Mar 11, 2012, 11:12:36 AM
to google...@googlegroups.com
I'd really love to see that enhancement approved!

I'll try your suggested method now. Hope it works! I'll let you know how it goes.

Thanks, Tom!

Aju Badardeen

Jul 19, 2013, 11:13:33 AM
to openr...@googlegroups.com, google...@googlegroups.com
Clustering by record would be a great feature. I see the related issue was created in 2010 and was "accepted". Wondering if it has been implemented. Thanks for the great work!

- A

Martin Magdinier

Jul 26, 2013, 8:00:59 AM
to openrefine
Aju,

Thanks for your feedback. Indeed, clustering by records has been requested by a few people.
Currently the best solution is to join all the values into a single cell and cluster on this new field. You can separate the values with a pipe so that they are easy to split back into their original columns afterwards.
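
As a rough illustration of the join-then-split idea, here is a small Python sketch, outside of Refine. It assumes no value contains the pipe character itself; the helper names `join_row` and `split_row` and the sample row are hypothetical.

```python
SEP = "|"  # assumed separator; it must not occur inside any value


def join_row(row, columns):
    """Join the values of `columns` into one pipe-separated string."""
    return SEP.join(row.get(column, "") for column in columns)


def split_row(joined, columns):
    """Split a pipe-separated string back into its original columns."""
    return dict(zip(columns, joined.split(SEP)))


columns = ["Name", "Address", "Phone"]
row = {"Name": "Grand Hotel", "Address": "Gran Av. 654", "Phone": ""}

joined = join_row(row, columns)         # the new field to cluster on
restored = split_row(joined, columns)   # back to one value per column
print(joined)
print(restored)
```

Empty cells survive the round trip because they still contribute a separator, so the number of fields always matches the number of columns.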


Martin


