How to extract MARCXML data?

Felix Hemme

Jul 6, 2017, 1:22:02 PM
to OpenRefine
Hey,

I'd like to extract some data from a cell that contains MARCXML. The cell's content looks like this:

    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Rivera, Juan Carlos</subfield>
    </datafield>

I want "Rivera, Juan Carlos" to end up in a new column. I already tried OpenRefine's GREL parseHtml function, but I don't understand how it is supposed to work.

John Little

Jul 6, 2017, 2:29:53 PM
to openr...@googlegroups.com
Felix:

Try this:

Let's say the column that holds the XML data has the header name "XML Data".

Click the caret in the column header (XML Data) to open the context menu.

  1. XML Data > Edit column > Add column based on this column... 
  2. [In the resulting dialog box], 
    • Set "New Column Name" to "XML Text" (or whatever you like)
    • In the Expression box:  value.parseHtml().select("datafield")[0].htmlText()

If you type the expression in one part at a time, you can watch the processing in the Preview pane. There are 5 parts to the GREL expression.

For example:

  1. value -- shows the raw, unprocessed XML
  2. value.parseHtml() -- parses the XML as an HTML document
  3. .select("<<insert tag name here>>") -- selects elements by tag name. In this case you can use either "datafield" for the outer tag or "subfield" for the inner tag.
  4. [0] -- takes the first element of the array. You know your results are in an array because the preview shows the data preceded by an open bracket "[" and followed by a close bracket "]". In this case there is only one element in the array; you know this because there is no comma separating values within the array.
  5. .htmlText() -- takes the text within the tag
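
For comparison outside OpenRefine, the same five steps can be sketched in plain Python with the standard library's xml.etree.ElementTree. This is only an illustration and not part of the GREL answer: the names cell_value and first_tag_text are made up for the example, and it assumes the cell holds a well-formed fragment without an XML namespace, like the sample in the question.

```python
import xml.etree.ElementTree as ET

# Sample cell content, copied from the question (assumed to be the whole cell).
cell_value = """<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="a">Rivera, Juan Carlos</subfield>
</datafield>"""


def first_tag_text(xml_string, tag_name):
    """Return the text of the first <tag_name> element, or None if there is none.

    Hypothetical helper that mirrors the GREL steps one by one.
    """
    root = ET.fromstring(xml_string)        # roughly value.parseHtml()
    matches = list(root.iter(tag_name))     # roughly .select("..."), gives a list
    if not matches:
        return None
    first = matches[0]                      # roughly [0]
    # Roughly .htmlText(): join all text inside the element and trim whitespace.
    return "".join(first.itertext()).strip()


print(first_tag_text(cell_value, "subfield"))
```

Passing "datafield" instead of "subfield" gives the same text for this sample, because the subfield is the only content of the datafield.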

With the following expression you could also extract the subfield attribute, i.e. the subfield code, into a separate column to retain the subfield data:

value.parseHtml().select("subfield")[0].htmlAttr("code")
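
As a rough cross-check of what htmlAttr("code") returns, here is a small Python sketch that reads the code attribute together with the subfield text. The names cell_value and subfield_pairs are hypothetical, and the sketch again assumes a well-formed fragment without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Sample cell content, copied from the question (assumed to be the whole cell).
cell_value = """<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="a">Rivera, Juan Carlos</subfield>
</datafield>"""


def subfield_pairs(xml_string):
    """Return a list of (code, text) tuples, one per <subfield> element."""
    root = ET.fromstring(xml_string)
    pairs = []
    for subfield in root.iter("subfield"):
        code = subfield.get("code")  # roughly .htmlAttr("code")
        text = "".join(subfield.itertext()).strip()
        pairs.append((code, text))
    return pairs


pairs = subfield_pairs(cell_value)
print(pairs[0][0])  # code of the first subfield, like select("subfield")[0]
print(pairs)
```

Keeping the code next to the text is handy when a datafield has several subfields, since the text alone no longer says which subfield it came from.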

Thad Guidry

Jul 6, 2017, 3:22:39 PM
to openr...@googlegroups.com
Felix,

We support loading MARC XML files at import time, and they get parsed out nicely into cells, with the columns as tags and the values as String text.
Look for that importer when you're creating your new OpenRefine project.


-Thad

Owen Stephens

Jul 7, 2017, 5:21:30 AM
to OpenRefine
Hi Felix,

Although OpenRefine supports the import of both native MARC and MARCXML files (in the former case it converts the MARC to MARCXML first), my recommendation would be to use OpenRefine in conjunction with the MarcEdit program http://marcedit.reeset.net

I've written about using OpenRefine to work with MARC records at http://www.meanboyfriend.com/overdue_ideas/2015/07/worked-example-fixing-marc-data-4/ - this is part 4 of a series of 5 blog posts, but it is the first that deals with OpenRefine. There I first converted the MARC records to the so-called 'mnemonic' format using MarcEdit, and then worked with those records in OpenRefine.

Owen

Felix Hemme

Aug 2, 2017, 5:14:45 AM
to OpenRefine
Thank you all for your replies and good ideas! 