Deduplicate based on two rows

261 views
Skip to first unread message

Devin Pinkston

unread,
Sep 30, 2013, 8:33:19 AM9/30/13
to openr...@googlegroups.com
Hello,

After reading a few of the posts I've found here and on other blogs, I have a question if it is possible to determine if there is a duplicate row if two columns are identical.  For instance:

--- name -- id -- price
name1 -- id1 - $100
name1 -- id1 - $99
name2 -- id2 - $200

So if the name and id columns are matches, I would want to remove the row with the highest price.  Is there a way to do this using the duplicates facet?  I know it can be done on a single column, I am just curious about the two columns matches for determining a duplicate row.

Thanks!

Thad Guidry

unread,
Sep 30, 2013, 9:59:55 AM9/30/13
to openr...@googlegroups.com
Create a Custom Text Facet that evaluates if your name and id columns are matches :

cells["name"].value == cells["id"].value

that outputs a true or false in a facet.

Then use the facet to select the true's.

Then use the All column's menu to Edit Rows -> Remove all matching rows.




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--

Devin Pinkston

unread,
Sep 30, 2013, 11:01:03 AM9/30/13
to openr...@googlegroups.com
Hey Thad,

Sorry i think my question was too ambiguous.  I meant that duplicates are determined if the name and id columns appear twice, however their price and other columns could be different.  In a dataset i am working with, there are about 100 identical name cells, but there are only a few duplicates since the ids only appear identical a couple times.

  ---NAME --- ID --- Price --- col4 ---
1  name 1 --- nm1 - 10  ----  info
2  name 1 --- nm2 - 20  ---- info
3  name 1 --- nm1 - 30  ---- info
4  name 1 --- nm3 - 40  ---- info
5  name 1 --- nm1 - 50  ---- info

So in this dataset, the duplicates are rows 1, 3, 5.  And i would only keep row 1 since it has the lowest price.  

Thanks!

Tom Morris

unread,
Sep 30, 2013, 11:25:29 AM9/30/13
to openr...@googlegroups.com
Two little tricks here:
1. Use Refine's record mode to group matching rows
2. Use join multi-valued cells to get all the values into a single cell where we can operate on them (Refine doesn't do operations across rows) 

- get rid of dollar signs and any other extraneous characters in the "price" column
- Create a new composite key column "key" based on name with a value of value+'||'+cells.id.value
- move column "key" to the beginning
- sort on column "key"
- make sort permanent
- blank down on key
- join multi-valued cells on column "price" using a separator of "|" or something else unique
- transform the "price" column using: forEach(value.split("|"),price,price.toNumber()).sort()[0]

The last step splits apart our joined string, converts the pieces back to numeric form, sorts them and keeps the lowest value.  Changing the array index to [-1] would keep the highest value.

Tom


Reply all
Reply to author
Forward
0 new messages