removing duplicate based two coulmn value.

Rajbhar

unread,

Jun 17, 2014, 7:23:05 AM6/17/14

to openr...@googlegroups.com

hello all,

I need an help from you guys.

my data is like this

A A value 1 value 2
A B value# value#
B A value# value#
D F value# value#
V F value# value#

i need either of one row either A B or B A.it should look like this

A A value 1 value 2
A B value# value#
B A value# value# .....this record should be removed
D F value# value#
V F value# value#

please help on this isssue.

Thanks all of you

sanjay

Owen Stephens

unread,

Jun 17, 2014, 8:11:34 AM6/17/14

to openr...@googlegroups.com

There are a few approaches I think, but the first one that came to mind was to create a key for each row based on the values in the two key columns, and then de-duplicate based on that key.

E.g. if first two cols are 'Col1' and 'Col2'. Create new column based on 'Col2' with the following expression

(value + "|" + row.cells.Col1.value).split("|").sort().join("")"

This puts the two values into an array, sorts the array and then rejoins the array into a string with no separator - results in

AA

AB

DF

VF

You can then facet on this key to find duplicates and remove them.

Owen

Rajbhar

unread,

Jun 17, 2014, 8:34:29 AM6/17/14

to openr...@googlegroups.com

no its give all record true when when do duplicate facet on new key coulmn.

true :26270

Owen Stephens

unread,

Jun 17, 2014, 8:41:03 AM6/17/14

to openr...@googlegroups.com

Can you check by eye that the key column is populated with the expected values and that two rows you regard as duplicates have ended up with the same value in the new column?

If so, then the find duplicates will work. Note that if all rows have a duplicate, the total number of rows in the duplicate facet would be the same number as the total number of rows.

If this is the case, then the best bet for de-duplicating would be to follow this tutorial on removing duplicate rows based on a key

http://googlerefine.blogspot.co.uk/2011/08/remove-duplicate.html

You'll want to move the new 'key' column to be the very left hand column in your refine project before working through the rest of the steps described by the tutorial.

Thanks,

Owen

SanjayKumar Rajbhar

unread,

Jul 4, 2014, 6:12:51 AM7/4/14

to openr...@googlegroups.com

okay it give me list desired possible duplicate. but how to delete one of them. keep one records.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Sanjay

Owen Stephens

unread,

Jul 4, 2014, 6:37:02 AM7/4/14

to openr...@googlegroups.com

For how to get rid of the duplicates see http://googlerefine.blogspot.co.uk/2011/08/remove-duplicate.html

SanjayKumar Rajbhar

unread,

Jul 5, 2014, 7:30:20 AM7/5/14

to openr...@googlegroups.com

Thanks Owen ..

it works like charm...!

can i know how the "blank down" command works,...

Owen Stephens

unread,

Jul 15, 2014, 1:57:02 PM7/15/14

to openr...@googlegroups.com

The 'blank down' command looks for sequentially identical values - that is it looks at a value in a column, then compares it to the next value in the column. If the next value is the same, it will be changed to a blank, then move on to the next value in the column and repeat. When it comes to a new value in the column it takes that as a the new value to compare to and does the same thing again.

By creating a key that will be the same for all duplicate rows, sorting this key column, and then doing blank down on this key column, you end up with just one of the duplicate rows containing the key (the first in the sort). You can then delete the rows where the key column is blank leaving you with one of each unique row.

I hope this explanation is helpful

Owen

SanjayKumar Rajbhar

unread,

Jul 16, 2014, 2:50:24 AM7/16/14

to openr...@googlegroups.com

perfectly fine.i got that. thanks

Reply all

Reply to author

Forward