Sum of duplicate rows / Merge rows based on two column content

439 views
Skip to first unread message

Clément Vogt

unread,
Feb 5, 2017, 5:10:51 AM2/5/17
to OpenRefine
Hello,

I want to merge duplicate rows based on two column content, and get the sum of these duplicate rows...

Ex : if I have the same "Year" for the "Name" Josephine, give me the sum for each year in the "count" column (or in a new column...). 

Is it possible ?

I make a capture of my table, for a better understanding of my problem !

Thank you !
Clément


OpenRefine Capture.png

Ettore RIZZA

unread,
Feb 5, 2017, 6:50:27 AM2/5/17
to openr...@googlegroups.com
Hi Clement, 

if the combination "name" + "year" should be unique, the easiest way to handle duplicates would be to concatenate these two columns in a single one, called "nameyear" or "ID".

Based on the column "name" : add column based on this column -> name it "ID" -> value + cells.year.value

Move this column to beginning and blank it down. You will thus have a single record per couple "name" + "year".

To compute the sum of their column "count": add a new column based on "count" (make sur that it contains numbers and not strings) and use this GREL formula: 

row.record.cells.count.value.sum()

And finally, perform a "facet by blank" on the ID column to isolate or remove duplicates.

Here is a screencast of the operation, it may be clearer.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Clément Vogt

unread,
Feb 5, 2017, 12:35:45 PM2/5/17
to OpenRefine
Hi Ettore !
Merci pour ta réponse / Thank you for your answer ;)

J'arrive bien à créer une nouvelle colonne ID, à la déplacer au début du tableur. Mais après, il ne se passe rien quand je "blank down" ("Blank down 0 cells in column ID") et j'ai encore 2 lignes par années... Après avoir tenté diverses manip, je me permets de revenir vers toi... (désolé !)

J'ai peut-être raté une étape... Quand tu dis "Move this column to beginning and blank it down", tu parles bien de "Edit cells" puis "blank down" ?? Pour être sûr que je vais au bon endroit, je rajoute une capture.
Dernière petite question : peut-on retrouver ton screencast quelque part (il est légèrement coupé sur la droite et je me dis que j'oublie peut-être quelque chose...)

Encore merci pour ton aide précieuse !







To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
blank down.png

Ettore RIZZA

unread,
Feb 5, 2017, 1:05:38 PM2/5/17
to openr...@googlegroups.com
J'ai un oublié un détail sur le screencast : le blank down ne fonctionne que si la colonne est triée. Voici la procédure :

I forgot to mention that Blank down works only if a column is sorted. Here is how to reorder your column ID permanently.





I hope this help/J'espère que cette fois ça roulera. :)

2017-02-05 16:17 GMT+01:00 Clément Vogt <cleme...@gmail.com>:
Hi Ettore !
Merci pour ta réponse / Thank you for your answer ;)

J'arrive bien à créer une nouvelle colonne ID, à la déplacer au début du tableur. Mais après, il ne se passe rien quand je "blank down" ("Blank down 0 cells in column ID") et j'ai encore 2 lignes par années... Après avoir tenté diverses manip, je me permets de revenir vers toi... (désolé !)

J'ai peut-être raté une étape... Quand tu dis "Move this column to beginning and blank it down", tu parles bien de "Edit cells" puis "blank down" ?? Pour être sûr que je vais au bon endroit, je rajoute une capture.
Dernière petite question : peut-on retrouver ton screencast quelque part (il est légèrement coupé sur la droite et je me dis que j'oublie peut-être quelque chose...)

Encore merci pour ton aide précieuse !








Le dimanche 5 février 2017 12:50:27 UTC+1, Ettore Rizza a écrit :
Hi Clement, 

if the combination "name" + "year" should be unique, the easiest way to handle duplicates would be to concatenate these two columns in a single one, called "nameyear" or "ID".

Based on the column "name" : add column based on this column -> name it "ID" -> value + cells.year.value

Move this column to beginning and blank it down. You will thus have a single record per couple "name" + "year".

To compute the sum of their column "count": add a new column based on "count" (make sur that it contains numbers and not strings) and use this GREL formula: 

row.record.cells.count.value.sum()

And finally, perform a "facet by blank" on the ID column to isolate or remove duplicates.

Here is a screencast of the operation, it may be clearer.


2017-02-05 2:04 GMT+01:00 Clément Vogt <cleme...@gmail.com>:
Hello,

I want to merge duplicate rows based on two column content, and get the sum of these duplicate rows...

Ex : if I have the same "Year" for the "Name" Josephine, give me the sum for each year in the "count" column (or in a new column...). 

Is it possible ?

I make a capture of my table, for a better understanding of my problem !

Thank you !
Clément


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages