grel for splitting a column's value, and put into clusters all unique values

274 views
Skip to first unread message

George S.

unread,
May 29, 2019, 10:48:03 AM5/29/19
to OpenRefine
i have some marc records output, in csv format:

"001","878","245","$c","Barmarick Publications,","$c","Emerald"

what i need to do, is separate each occurrence of the $c subfield, into each own row, but also keeping the first columns value (001 = record id)

"001","878","245","$c","Barmarick Publications,"
"001","878", "$c","Emerald"


Then , i use the cluster feature, and i get all values that are gathered into clusters.

How could i also in the same json output i get in the end, with the clusters results, also have in there also all other unique values that did not fit in any cluster?

Thank you in advance for an example of teh above

Petros Liveris

unread,
May 29, 2019, 3:15:39 PM5/29/19
to OpenRefine
for instance regarding the second part, i take the clusters json, and i reimport it to a new openrefine project.
Now i need to get all values that did not fit in a cluster, below the values that were already clustered.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/e9eeed80-5655-4ce0-966f-57d202e18394%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bradley Ferguson

unread,
May 29, 2019, 6:00:17 PM5/29/19
to OpenRefine
I would say your data isn't setup very well for OpenRefine.

In your example, I'm not sure why "878" shows up in the second row, but "245" does not. If "001" is the row ID, then shouldn't the second row be:
"001","$c","Emerald"

If you import your data as "line-based text files" you might get the flexibility you need to break it up properly. Once you have it chunked up, you can use "transpose" to change your columns into rows.

The workflow might be:
1. Import as line-based text file (this results in a single column of data)
2. Create new column based on Column 1 that is just the record ID
3. Split the imported column on "$c"
4. Transpose across cells into rows (this creates your new rows)
5. Fill down on your record ID
6. Split the new columns on a regular expression: ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

The regex is from https://stackoverflow.com/questions/18893390/splitting-on-comma-outside-quotes and seems to correctly handle commas inside quotes.

Brad

George S.

unread,
May 30, 2019, 3:14:06 AM5/30/19
to OpenRefine
thank you Brad,

you are correct.
The desired output should be:

"001","878","245","$c","Barmarick Publications,"
"001","878","245","$c","Emerald"

Now i need to have in the clusters output, somehow also all the values of 245$c, that did not fit in a cluster (all unique values),


Probably i will do use somehow the cross function, between the original project, and the clusters project.


Desired output:

... clusters in the cluster project...
Barmarick Publications,
Barmarick publications
Barmarick Publications
first row of the other unique values from original source project
Emerald
Sage
Wiley
...

Owen Stephens

unread,
May 30, 2019, 8:34:32 AM5/30/19
to OpenRefine
In case it is of any help/interest I blogged about using OpenRefine with Mnemonic MARC format at http://www.meanboyfriend.com/overdue_ideas/2015/07/worked-example-fixing-marc-data-4/

Also you could try importing the MARC directly into OpenRefine as it can parse native MARC - although it does this by converting the MARC to MARCXML and then imports as xml

Owen

George S.

unread,
May 30, 2019, 8:59:41 AM5/30/19
to OpenRefine
yes, i am aware of this,
thank you Owen.

So my use case is the following:
for a given tag/subfield, that is repeatable, i need to have a final project in openrefine, which would have the output of say the method key collision, and Keying Function fingerprint.

so the first challenge is to split
on the
say following value

"001","878","245","$c","Barmarick Publications,","$c","Emerald"

and make it look like:

"001","878","245","$c","Barmarick Publications,"
"001","878","245","$c","Emerald"
now one has the clustered values.
How could i add below, all other values that were not put in a cluster?
i would appreciate whenever you had the time, if you investigated this, since it may be useful to other librarians as well,
thank you

Thad Guidry

unread,
May 30, 2019, 11:29:42 PM5/30/19
to openr...@googlegroups.com
(See attached exported project with your example data... view the Undo/Redo to see the operations I applied. )
 
I would approach this with a more careful step by step method...
Carefully splitting on each occurrence of "$c" 
and keeping the string part obtained afterwards.

Add new column based on name it "part1"
  value.split("$c")[0]

Add new column based on and name it "part2"
  value.split("$c")[1]

...etc...

Then move columns around, even moving column "part1" to the beginning as an "index" and to create record rows with it using "Blank Down" on it.

Then perhaps doing your grouping of part1 + part2 and part1 + part3 , etc.

OpenRefine works best in this very methodical step-by-step approach to partitioning your data, and its best to handle that in a Column additive way quite often, rather than a Row additive way. (especially if you have Multi-valued cell values, as you do in your case)
Optionally creating Record rows when/where it makes sense for your Record indexing needs (Since we are a powertool we allow you to build Record rows however you need and do not limit you!  This was a feature that I always desparately lacked in my existing Database tools when we first starting designing OpenRefine ala Gridworks .  And David Huynh our original designer I worked with was kind enough to keep it entirely flexible by using overlay models under the covers)

I hope this helps you understand how to approach partitioning multi-value cell data in OpenRefine and how working with the overlay model system is a benefit, and not against it in your workflow, by ensuring you have an Undo step created at every GREL value.split/value.partition and retaining the previous row data, rather than throwing it away with a replace function or longer Python all-in-one function that does exactly what you want in 1 shot, but might get it wrong and need 100 shots to finally get it right. ;-)



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
clipboard.openrefine.tar.gz

Owen Stephens

unread,
Jun 3, 2019, 4:47:56 AM6/3/19
to OpenRefine
Sorry for the delay in replying George. Can you give any more information about your starting data set? You've given a single line example here so I'm unclear if all rows follow the same pattern or how much variation there might be.

Specifically - does each row only contain 001 and 245$$c values? Or do some rows have other fields?
I assume that the 245$$c can repeat a variable number of times (from once upwards)?

Thanks

Owen

Giwrgos Spuropoulos

unread,
Jun 3, 2019, 4:58:35 AM6/3/19
to openr...@googlegroups.com
thank you Owen.

all rows, have the 001 value.
i give some examples of 245:

245 may have one, two, three, or more $c subfields.

Imagine i used Marcedit to export only the 001 values, then the 245$c of each record.




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Owen Stephens

unread,
Jun 3, 2019, 5:15:05 AM6/3/19
to OpenRefine
Hi George,

If you import the file as a CSV into OpenRefine so each value in your example is in it's own column (e.g. named Column 1, Column 2, Column 3...)
From the second column do "Add a new column based on this column" operation with the GREL

forEachIndex(row.columnNames,i,cn,if(and(i>3,cells[cn].value!="$c"),cells[cn].value,null)).join("|")

This is basically a way of stepping through each column from column 5 onwards (because we know that the first 4 columns don't contain any publisher names) and extracting the value as long as it isn't just a "$c". The result should be a new column with a list of the $c values separated by a "|" character.

Now you've got those in a single cell, you can use "Edit cells->Split multi-valued cells" using | character.
Then finally use 'Fill down' on the column containing your 001 ID (and any other columns you want) and you should have what you need I think!

The approach here is based on the "Merging all or more than two columns in a project" recipe on the wiki https://github.com/OpenRefine/OpenRefine/wiki/Recipes#merging-all-or-more-than-two-columns-in-a-project, but with a slight tweak to skip the cells we don't need. You could use other variations on the same recipe to achieve the same outcome if you were happy to add in a few more steps.

Owen
To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

Petros Liveris

unread,
Jun 3, 2019, 5:59:18 AM6/3/19
to OpenRefine
thank you, what about the cross cell function, i would like to retrieve all values that did not fit in any cluster

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/52177754-f4cc-444f-8c4a-429bd032a518%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages