Deduplicate based on two rows

Devin Pinkston

unread,

Sep 30, 2013, 8:33:19 AM9/30/13

to openr...@googlegroups.com

Hello,

After reading a few of the posts I've found here and on other blogs, I have a question if it is possible to determine if there is a duplicate row if two columns are identical. For instance:

--- name -- id -- price

name1 -- id1 - $100

name1 -- id1 - $99

name2 -- id2 - $200

So if the name and id columns are matches, I would want to remove the row with the highest price. Is there a way to do this using the duplicates facet? I know it can be done on a single column, I am just curious about the two columns matches for determining a duplicate row.

Thanks!

Thad Guidry

unread,

Sep 30, 2013, 9:59:55 AM9/30/13

to openr...@googlegroups.com

Create a Custom Text Facet that evaluates if your name and id columns are matches :

cells["name"].value == cells["id"].value

that outputs a true or false in a facet.

Then use the facet to select the true's.

Then use the All column's menu to Edit Rows -> Remove all matching rows.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--

-Thad
Thad on Freebase.com

Thad on LinkedIn

Devin Pinkston

unread,

Sep 30, 2013, 11:01:03 AM9/30/13

to openr...@googlegroups.com

Hey Thad,

Sorry i think my question was too ambiguous. I meant that duplicates are determined if the name and id columns appear twice, however their price and other columns could be different. In a dataset i am working with, there are about 100 identical name cells, but there are only a few duplicates since the ids only appear identical a couple times.

---NAME --- ID --- Price --- col4 ---

1 name 1 --- nm1 - 10 ---- info

2 name 1 --- nm2 - 20 ---- info

3 name 1 --- nm1 - 30 ---- info

4 name 1 --- nm3 - 40 ---- info

5 name 1 --- nm1 - 50 ---- info

So in this dataset, the duplicates are rows 1, 3, 5. And i would only keep row 1 since it has the lowest price.

Thanks!

Tom Morris

unread,

Sep 30, 2013, 11:25:29 AM9/30/13

to openr...@googlegroups.com

Two little tricks here:

1. Use Refine's record mode to group matching rows

2. Use join multi-valued cells to get all the values into a single cell where we can operate on them (Refine doesn't do operations across rows)

- get rid of dollar signs and any other extraneous characters in the "price" column

- Create a new composite key column "key" based on name with a value of value+'||'+cells.id.value

- move column "key" to the beginning

- sort on column "key"

- make sort permanent

- blank down on key

- join multi-valued cells on column "price" using a separator of "|" or something else unique

- transform the "price" column using: forEach(value.split("|"),price,price.toNumber()).sort()[0]

The last step splits apart our joined string, converts the pieces back to numeric form, sorts them and keeps the lowest value. Changing the array index to [-1] would keep the highest value.

Tom

Reply all

Reply to author

Forward