MARC-XML parsing


Parthasarathi Mukhopadhyay

May 20, 2021, 2:20:26 PM
to openr...@googlegroups.com

Dear All

I have a record fetched from OCLC FAST, shown below (truncated here, but the structure is unchanged).

I tried to extract each occurrence of tag 451 (the content of subfields "a" and "z"), but every attempt resulted in an error. I don't know how to select a marc:datafield element with tag 451, and I could not find an example of this anywhere. If that is possible, a GREL expression like this should work: forEach( value.parseXml().select("???")[0].select("???"), e, e.xmlText()).join("++").

Thanks


<?xml version="1.0" encoding="UTF-8"?>

<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">

<marc:leader> 00000cz a2200000n 4500 </marc:leader>

<marc:controlfield tag="001"> fst01332189 </marc:controlfield>

<marc:controlfield tag="003"> OCoLC </marc:controlfield>

<marc:controlfield tag="005"> 20180103163101.0 </marc:controlfield>

………………….

<marc:datafield tag="043" ind1=" " ind2=" "> <marc:subfield code="a"> a-ii--- </marc:subfield> </marc:datafield>

<marc:datafield tag="151" ind1=" " ind2=" "> <marc:subfield code="a"> India </marc:subfield> <marc:subfield code="z"> Coimbatore (District) </marc:subfield> </marc:datafield>

<marc:datafield tag="451" ind1=" " ind2=" "> <marc:subfield code="a"> India </marc:subfield> <marc:subfield code="z"> Coimbator District </marc:subfield> </marc:datafield>

<marc:datafield tag="451" ind1=" " ind2=" "> <marc:subfield code="a"> Kōyamputtūr Māvaṭṭam </marc:subfield> </marc:datafield>

…………………….

<marc:datafield tag="751" ind1=" " ind2="0"> <marc:subfield code="a"> Coimbatore (India : District) </marc:subfield> <marc:subfield code="0"> (viaf)147778104 </marc:subfield> <marc:subfield code="0"> https://viaf.org/viaf/147778104 </marc:subfield> </marc:datafield>

</marc:record>





-----------------------------------------------------------------------
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
-----------------------------------------------------------------------

Owen Stephens

May 20, 2021, 4:01:44 PM
to OpenRefine
Hi

Because the XML declares a namespace, you have to include the namespace prefix in your selector, written with a pipe (marc|datafield) rather than the colon used in the XML itself. So you can use something like:

value.parseXml().select("marc|datafield[tag=451]")

That will find all the 451 datafields.
If you are only interested in the content of the a and z subfields, you can extend the selector like this:

value.parseXml().select("marc|datafield[tag=451] marc|subfield[code~=(a|z)]")

The "code~=" takes a regular expression - so that's why you can use (a|z). If you were only interested in a specific subfield you would use a straight equals sign:

value.parseXml().select("marc|datafield[tag=451] marc|subfield[code=a]")
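For anyone who wants to check the same selection logic outside OpenRefine, here is a minimal Python sketch using only the standard library. It is not part of the GREL answer above: the function name select_subfields is hypothetical, the sample is a shortened copy of the record posted in this thread (with the XML declaration left out), and ElementTree's limited XPath is used in place of the CSS-style selectors.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URI is copied from the record's xmlns:marc declaration.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Shortened copy of the sample record (assumption: XML declaration omitted).
SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="151" ind1=" " ind2=" ">
    <marc:subfield code="a"> India </marc:subfield>
    <marc:subfield code="z"> Coimbatore (District) </marc:subfield>
  </marc:datafield>
  <marc:datafield tag="451" ind1=" " ind2=" ">
    <marc:subfield code="a"> India </marc:subfield>
    <marc:subfield code="z"> Coimbator District </marc:subfield>
  </marc:datafield>
  <marc:datafield tag="451" ind1=" " ind2=" ">
    <marc:subfield code="a"> Kōyamputtūr Māvaṭṭam </marc:subfield>
  </marc:datafield>
</marc:record>"""


def select_subfields(xml_text, tag, codes):
    """Return the trimmed text of every subfield with a wanted code
    inside every datafield carrying the given tag (flat list)."""
    root = ET.fromstring(xml_text)
    # Namespace-qualified path: all subfields under datafields with this tag.
    path = ".//marc:datafield[@tag='%s']/marc:subfield" % tag
    return [
        (sf.text or "").strip()
        for sf in root.findall(path, NS)
        if sf.get("code") in codes
    ]


print(select_subfields(SAMPLE, "451", {"a", "z"}))
```

As with the flat selector, the result is a single list, so it no longer shows which subfields belonged to the same 451 field.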

Hope this helps and let me know if I can assist further

Owen

Parthasarathi Mukhopadhyay

May 20, 2021, 5:00:50 PM
to openr...@googlegroups.com
Thanks Owen.

Meanwhile I tried this on the basis of an old thread (https://groups.google.com/g/openrefine/c/KLwPvpTU3gQ?pli=1)

value.parseXml().select("marc|datafield[tag=451]")[0].select("marc|subfield[code=a]")[0].xmlText() + "--" +value.parseXml().select("marc|datafield[tag=451]")[0].select("marc|subfield[code=z]")[0].xmlText()


But your solution is an elegant one as always.


I have tried your solution with forEach (as tag 451 is repeated in many records) and got a nice result:

forEach(value.parseXml().select("marc|datafield[tag=451] marc|subfield[code~=(a|z)]"),e,e.xmlText()).join("\n")

>>
India Tarmapuri Māvaṭṭam
India Tarumapuri Māvaṭṭam


Is it possible to tweak the above expression to generate something like this in one go?


$aIndia$zTarmapuri Māvaṭṭam

$aIndia$zTarumapuri Māvaṭṭam



Best regards





--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/0ab7ad37-4a42-4b43-83d6-fc622ebc2cd1n%40googlegroups.com.

Owen Stephens

May 21, 2021, 6:04:04 AM
to OpenRefine

You can get the subfield code by using xmlAttr() in GREL (https://docs.openrefine.org/manual/grelfunctions/#xmlattrs-element).
So, amending the expression you have above, you can do something like:
forEach(value.parseXml().select("marc|datafield[tag=451] marc|subfield[code~=(a|z)]"),e,"$"+e.xmlAttr("code")+e.ownText()).join("\n")
and get
$aIndia
$zTarmapuri Māvaṭṭam
$aIndia
$zTarumapuri Māvaṭṭam

But I think that's not quite what you wanted. If I understood correctly, you want one line of output per 451 field in the original. To do that, I think you need a longer expression that nests an additional forEach, so that the outer loop iterates over the 451 datafields and the inner loop over each field's subfields:
forEach(value.parseXml().select("marc|datafield[tag=451]"),f,forEach(f.select("marc|subfield[code~=(a|z)]"),e,"$"+e.xmlAttr("code")+e.ownText()).join("")).join("\n")
which should give
$aIndia$zTarmapuri Māvaṭṭam
$aIndia$zTarumapuri Māvaṭṭam
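
The same two-level loop can be sketched in plain Python for comparison. This is only an illustration under stated assumptions, not the GREL expression itself: the helper name fields_as_lines is hypothetical, it uses the standard library's ElementTree, and the sample is a shortened copy of the record posted at the top of this thread (so the values differ from the output shown above).

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URI is copied from the record's xmlns:marc declaration.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Shortened copy of the record from the first message (assumption:
# only the 451 fields are kept, XML declaration omitted).
RECORD = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="451" ind1=" " ind2=" ">
    <marc:subfield code="a"> India </marc:subfield>
    <marc:subfield code="z"> Coimbator District </marc:subfield>
  </marc:datafield>
  <marc:datafield tag="451" ind1=" " ind2=" ">
    <marc:subfield code="a"> Kōyamputtūr Māvaṭṭam </marc:subfield>
  </marc:datafield>
</marc:record>"""


def fields_as_lines(xml_text, tag, codes):
    """Build one '$<code><text>...' line per datafield with the given tag."""
    root = ET.fromstring(xml_text)
    lines = []
    # Outer loop: one iteration per matching datafield.
    for field in root.findall(".//marc:datafield[@tag='%s']" % tag, NS):
        parts = []
        # Inner loop: only the subfields of this one datafield.
        for sf in field.findall("marc:subfield", NS):
            code = sf.get("code")
            if code in codes:
                parts.append("$" + code + (sf.text or "").strip())
        lines.append("".join(parts))  # inner join with ""
    return "\n".join(lines)  # outer join with a newline


print(fields_as_lines(RECORD, "451", {"a", "z"}))
```

Joining inside the outer loop is what keeps each field's subfields together on one line, which the single flat selector cannot do.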

Owen

Parthasarathi Mukhopadhyay

May 21, 2021, 6:31:21 AM
to openr...@googlegroups.com
Thanks again, Owen.

It worked like a charm, and I now understand the logic of the XML parsing.

Best regards as always...


Thad Guidry

May 21, 2021, 6:04:37 PM
to openr...@googlegroups.com
Hi Parthasarathi,

I'd like to offer a suggestion...
If you are going to deal with MARC XML, then instead of using GREL directly
you might find it easier to switch the expression language from GREL to Python (Jython) and then use:
https://pypi.org/project/pymarc/
or
https://pypi.org/project/marcalyx/  (if you prefer its syntax and convenience methods over pymarc)

Do you know how to install and drop in Python libraries from pypi into your environment so that OpenRefine sees them?
Take a look at the Tutorials we have listed under our Jython & Clojure docs page: https://docs.openrefine.org/manual/jythonclojure




Parthasarathi Mukhopadhyay

May 22, 2021, 6:38:06 AM
to openr...@googlegroups.com
Thanks Thad for all the leads.

A quick look at the utilities you mentioned suggests they hold a lot of promise for my use cases.
I will explore this further.

Best regards
