XML to MRC results in 0 bytes file

Holly Pickering

Nov 11, 2024, 5:47:59 PM
to pymarc Discussion
Hi all,

New here, relatively new to MARC and XML, and not a particularly great programmer either, so please forgive me if this is a silly/obvious question. I also tried to send this to the group by email, but since I haven't used Google Groups before and didn't see it show up in the discussion, I'm not sure it went through.

I'm working with a bunch (127, to be exact) of XML files and need them in MRC format as one file. I've done this successfully with a few (3) XML files using some of Heidi Frank's code from many years ago (thank you!), updated slightly for Python 3 on Windows, but I'm trying again with XML files from a different supplier and not having any luck.

Unfortunately, I'm not getting any errors thrown - the process runs, seemingly fine, but then the MRC file it spits out is 0 bytes. I'm not sure where to start with troubleshooting. My code is as follows:

----------------------------------------------------------------------------

import os
import pymarc
from pymarc import Record, Field
import codecs

inst_code = input('Enter the 3-letter institutional code: ')
batch_date = input('Enter the batch date (YYYYMMDD): ')
base_dir = 'work/'+inst_code+'/'+inst_code+'_'+batch_date
marcRecsOut_bin = pymarc.MARCWriter(codecs.open(base_dir+'/'+inst_code+'_'+batch_date+'_1_orig_recs_bin.mrc', 'wb', 'utf-8'))

marcxml_dir = base_dir+'/marcxml'
for filename in os.listdir(marcxml_dir):
    file_path = os.path.join(marcxml_dir,filename)
    if os.path.isfile(file_path):
        if file_path[-3:]=='xml':
            marc_xml_array = pymarc.parse_xml_to_array(file_path)
            for rec in marc_xml_array:
                rec.leader = rec.leader[0:9] + 'a' +rec.leader[10]
                marcRecsOut_bin.write(rec)

marcRecsOut_bin.close()

----------------------------------------------------------------------------

I thought perhaps it was an encoding issue, hence the codecs/'utf-8' addition, but that made no difference. I'm guessing something is off with the files, but I am unsure what. I can't see how to attach files in these discussions, but I'm happy to share if anyone wants. Any help is greatly appreciated!

Holly

Andrew Hankinson

Nov 11, 2024, 6:03:28 PM
to pym...@googlegroups.com
Without seeing the XML it's difficult to tell where the problem is. You can copy / paste an example of the XML file in plain text if you can't get the file to attach.

I also think there's a bug in your code, but it wouldn't be the core problem.

rec.leader = rec.leader[0:9] + 'a' +rec.leader[10]

Should probably be:

rec.leader = rec.leader[0:9] + 'a' +rec.leader[10:]  <-- note the colon here.
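
The difference between the two lines can be checked in isolation. This is only a sketch: the leader string below is a made-up 24-character sample, not taken from any of the files in this thread.

```python
# A 24-character MARC leader (sample value, for illustration only)
leader = "00000nam  2200000   4500"

# Indexing with [10] keeps a single character after position 9
buggy = leader[0:9] + "a" + leader[10]

# Slicing with [10:] keeps everything from position 10 to the end
fixed = leader[0:9] + "a" + leader[10:]

print(len(buggy), repr(buggy))
print(len(fixed), repr(fixed))
```

With the index form the leader is cut down to 11 characters; with the slice it stays at 24 and only position 9 changes.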



To view this discussion visit https://groups.google.com/d/msgid/pymarc/0ec3b4db-220f-48e2-9ac7-cae89e716d03n%40googlegroups.com.

Tomasz Kalata

Nov 11, 2024, 7:21:50 PM
to pym...@googlegroups.com
Assuming you have several records in your XML file, the line below will not output them correctly ('wb' - truncate the file first). When you write to the output file, a previously saved record will be overwritten. Replace it with the 'ab' value, which will append records to the file. I am not sure if this is the reason for your problems (unless your last record in XML is some kind of ghost bib 😉), but based on your code it looks like it is your intended behavior.

marcRecsOut_bin = pymarc.MARCWriter(codecs.open(base_dir+'/'+inst_code+'_'+batch_date+'_1_orig_recs_bin.mrc', 'wb', 'utf-8'))
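
For reference, here is a small stdlib-only sketch of how 'wb' and 'ab' behave in plain Python. It uses a temporary file and dummy byte strings rather than pymarc, so it only illustrates the file modes themselves. As far as I can tell, truncation happens when the file is opened, so it matters mainly if the open call sits inside a loop or the script is re-run against an existing file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.mrc")

# One open() with 'wb', several writes: the writes accumulate
with open(path, "wb") as fh:
    fh.write(b"record1")
    fh.write(b"record2")
size_single_open = os.path.getsize(path)

# Re-opening with 'wb' truncates whatever was there before
with open(path, "wb") as fh:
    fh.write(b"record3")
size_after_reopen_wb = os.path.getsize(path)

# Re-opening with 'ab' keeps the existing bytes and appends
with open(path, "ab") as fh:
    fh.write(b"record4")
size_after_reopen_ab = os.path.getsize(path)

print(size_single_open, size_after_reopen_wb, size_after_reopen_ab)
```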

Cheers,
Tomasz



--
Tomasz Kalata
Assistant Director, Cataloging

BookOps

The New York Public Library & Brooklyn Public Library

Holly Pickering

Nov 13, 2024, 10:23:49 PM
to pymarc Discussion
Thanks for the correction, I'll fix that! I've tested a couple of XML files today and neither works. Here is the XML for a shorter file that doesn't work:

-------------

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
  <header>
    <m173>2013975</m173>  
    <m174>Penguin Random House</m174>
    <m175>Publishing Operations Support - 1-800-733-3000</m175>
    <m283>PRHDistr...@penguinrandomhouse.com</m283>
    <m182>202311042314</m182>
  </header>

  <product>
    <a001>9780735247628</a001>
    <a002>03</a002>
    <productidentifier>
      <b221>02</b221>
      <b244>0735247625</b244>
    </productidentifier>
    <productidentifier>
      <b221>03</b221>
      <b244>9780735247628</b244>
    </productidentifier>
    <productidentifier>
      <b221>14</b221>
      <b244>09780735247628</b244>
    </productidentifier>
    <productidentifier>
      <b221>15</b221>
      <b244>9780735247628</b244>
    </productidentifier>
    <b012>AJ</b012>
    <b384>06</b384>
    <n338 />
    <title>
      <b202>01</b202>
      <b203>Prophet</b203>
    </title>
    <workidentifier>
      <b201>01</b201>
      <b244>734634</b244>
    </workidentifier>
    <contributor>
      <b034>1</b034>
      <b035>A01</b035>
      <b036>Helen Macdonald</b036>
      <b037>Macdonald, Helen</b037>
      <b039>Helen</b039>
      <b040>Macdonald</b040>
      <personnameidentifier>
        <b390>16</b390>
        <b244>0000000042788941</b244>
      </personnameidentifier>
    </contributor>
    <contributor>
      <b034>2</b034>
      <b035>A01</b035>
      <b036>Sin Blaché</b036>
      <b037>Blaché, Sin</b037>
      <b039>Sin</b039>
      <b040>Blaché</b040>
    </contributor>
    <contributor>
      <b034>3</b034>
      <b035>E07</b035>
      <b036>Jake Fairbrother</b036>
      <b037>Fairbrother, Jake</b037>
      <b039>Jake</b039>
      <b040>Fairbrother</b040>
    </contributor>
    <contributor>
      <b034>4</b034>
      <b035>E07</b035>
      <b036>Ryan Forde Iosco</b036>
      <b037>Iosco, Ryan Forde</b037>
      <b039>Ryan Forde</b039>
      <b040>Iosco</b040>
    </contributor>
    <contributor>
      <b034>5</b034>
      <b035>E07</b035>
      <b036>Charlotte Davey</b036>
      <b037>Davey, Charlotte</b037>
      <b039>Charlotte</b039>
      <b040>Davey</b040>
    </contributor>
    <b056>UBR</b056>
    <language>
      <b253>01</b253>
      <b252>eng</b252>
    </language>
    <extent>
      <b218>09</b218>
      <b219>1022</b219>
      <b220>05</b220>
    </extent>
    <b064>FIC019000</b064>
    <subject>
      <b067>10</b067>
      <b069>FIC028010</b069>
    </subject>
    <subject>
      <b067>10</b067>
      <b069>FIC036000</b069>
    </subject>
    <subject>
      <b067>20</b067>
      <b070>cyberpunk;billionaire;body horror;techno horror;Father's Day;mystery thriller suspense;thriller books;literary fiction;biological warfare;mysteries and thrillers;espionage;spy novel;USA;corruption;creepy;horror;books fiction;literature;LGBTQ novel;gay romance;romance;speculative fiction;books science fiction;novel;gift books;science fiction novels;mystery and thriller;adventure;future;fantasy;mystery;technology;action;action adventure;dystopia;military;space;crime;science fiction;sci-fi;thriller</b070>
    </subject>
    <audience>
      <b204>01</b204>
      <b206>01</b206>
    </audience>
    <othertext>
      <d102>01</d102>
      <d103>02</d103>
      <d104>&lt;b&gt;A genre-bending, strikingly original tour-de-force about an unlikely spy duo on the most dangerous and otherwordly mission of their lives, from the &lt;i&gt;New York Times&lt;/i&gt;&amp;ndash;bestselling author of &lt;i&gt;H is for Hawk, &lt;/i&gt;Helen Macdonald,&amp;nbsp;and the stunning new voice of Sin Blach&amp;eacute;.&lt;/b&gt;&lt;br&gt;&lt;br&gt;Adam Rubinstein and Sunil Rao have been nemeses and reluctant partners since their Uzbekistan days. Adam is a seemingly unflappable American Intelligence officer; Rao is ex-MI6, an addict and rudderless pleasure-hound with an uncanny ability to discern the truth about anything and anyone&amp;mdash;except Adam.&lt;br&gt;&lt;br&gt;Adam and Rao have gone their separate ways until they are called back together when a full-sized, 1950s American diner shows up in an English farmer's field and a mysterious death ensues. What follows is a reality-twisting, action-filled quest as the&amp;nbsp;unlikely duo begin to uncover&amp;nbsp;how and why people&amp;rsquo;s fondest memories are being manifested and weaponized against them, in increasingly bizarre and tangible forms, by a spooky, ever-shifting substance called Prophet. Adam and Rao must find a way to stop these malevolent entities from taking over a world that is just one perilous step from our own.&lt;br&gt;&lt;br&gt;The brilliant minds of Helen Macdonald and Sin Blach&amp;eacute; have created a tantalizing fusion of sci-fi, detective noir, action, and romance in this high-tension, fast-paced adventure. &lt;i&gt;Prophet &lt;/i&gt;is a triumph of storytelling.</d104>
    </othertext>
    <othertext>
      <d102>08</d102>
      <d103>02</d103>
      <d104>&lt;b&gt;Praise for &lt;i&gt;Prophet&lt;/i&gt;:&lt;/b&gt;&lt;br&gt;&lt;br&gt;"It's a fabulous book! . . . Present day science fiction that feels like the best sort of spy novel with real people you can care about. And it's a page-turner. So good."&amp;#160;&lt;br&gt;&amp;mdash;&lt;b&gt;Neil Gaiman&lt;/b&gt;&lt;br&gt; &amp;#160;&lt;br&gt;&amp;ldquo;&lt;i&gt;Prophet&lt;/i&gt; is a crackling, shape-shifting romp with big ideas and a bigger heart. Blach&amp;eacute; and Macdonald take a no-holds-barred approach to manifesting the ways in which individual desires are exploited by the systems we live under, and ask the necessary question of whether escape from that cycle is possible. This is a display of sheer inventiveness, and a delight.&amp;rdquo; &lt;br&gt;&amp;mdash;&lt;b&gt;C Pam Zhang, author of &lt;i&gt;How Much of These Hills is Gold&lt;/i&gt;&lt;/b&gt;&lt;br&gt;&lt;br&gt;&amp;ldquo;Sin Blach&amp;eacute; and Helen Macdonald have turned nostalgia &amp;mdash; &amp;lsquo;the trash of hearts&amp;rsquo; &amp;mdash; into a world and a trap. Prophet promises to bring back everything you lost and now yearn for. Is it a drug? Or is it a new state of matter? Whatever it is, it&amp;rsquo;s proper science fiction &amp;mdash; self-aware, funny, ruthlessly propulsive, full of invention, parodic yet perfectly serious about its underlying issues with contemporary retro culture, and ending with a complex, emotionally satisfying extension of the personal into the sublime. I loved it.&amp;rdquo; &lt;br&gt;&amp;mdash;&lt;b&gt;M. John Harrison, author of &lt;i&gt;The Sunken Land Begins to Rise Again&lt;/i&gt;&lt;/b&gt;&lt;br&gt;&lt;br&gt;"A wildly fun, inventive, funny, and terrifying book, with a superb mystery that gets ever more compelling and weird and, horrifyingly, familiar. 
This book finds the nightmare in the comforting lies we tell ourselves about our pasts, and how they inform our present.&amp;rdquo; &lt;br&gt;&amp;mdash;&lt;b&gt;Phil Klay, author of &lt;i&gt;Missionaries&lt;br&gt;&lt;br&gt;&lt;/i&gt;&lt;/b&gt;&amp;ldquo;Absorbing, fast-paced and febrile,&amp;#160;Prophet&amp;#160;takes you through the world at an angle, exposing cracks in the reality we think we inhabit. An exhilarating and surprisingly tender trip.&amp;rdquo; &lt;br&gt;&lt;b&gt;&amp;mdash;&lt;b&gt;G. Willow Wilson, author of&lt;/b&gt;&lt;i&gt;&lt;b&gt; &lt;i&gt;The Bird King&lt;/i&gt;&lt;/b&gt;&lt;/i&gt;&lt;/b&gt;&lt;br&gt;&lt;br&gt;&amp;ldquo;A hyperkinetic headrush of a novel that proves its organic bona fides by getting you drunk with ideas before casually and cataclysmically breaking your heart.&amp;rdquo; &lt;br&gt;&amp;mdash;&lt;b&gt;Paraic O&amp;rsquo;Donnell, author of &lt;i&gt;The Maker of Swans&lt;br&gt;&lt;br&gt;&lt;/i&gt;&lt;/b&gt;"[A] thrilling dystopian novel." &lt;br&gt;&lt;i&gt;&lt;b&gt;&amp;mdash;TIME&lt;/b&gt;&lt;/i&gt;&lt;b&gt;&lt;i&gt;&lt;br&gt;&lt;/i&gt;&lt;/b&gt;&lt;br&gt;&amp;ldquo;[&lt;i&gt;Prophet&lt;/i&gt;]&lt;i&gt; &lt;/i&gt;is immense fun, a work of exceptional storytelling skill and stylistic panache. . . . The writing is high-spec, lively, vivid. . . . Without letting the pace slacken, Macdonald and Blach&amp;eacute; manage to fold in powerful reflections on loss and trauma. The balance of the lethal actualisation of happy memories with the sensitive, believable way the two main characters are shown processing their &lt;i&gt;un&lt;/i&gt;happy ones makes this novel a cut above the usual techno-thriller fare. H Is for Highly Recommended.&amp;rdquo; &lt;br&gt;&lt;b&gt;&amp;mdash;&lt;i&gt;The Guardian&lt;/i&gt;&lt;br&gt;&lt;/b&gt;&lt;br&gt;"Wildly surreal with occasional flashes of dark humor. . . . 
The authors&amp;rsquo; most irresistible achievement, though, is their odd-couple pairing of the Dionysian Rao with the fastidious Rubenstein. . . . The well-matched authors make good on their audacious premise." &lt;br&gt;&lt;i&gt;&lt;b&gt;&amp;mdash;Publishers Weekly&lt;/b&gt;&lt;/i&gt;&lt;b&gt;&lt;i&gt;&lt;br&gt;&lt;br&gt;&lt;/i&gt;&lt;/b&gt;&amp;ldquo;Shrewdly imagined, sharply crafted, witty, chilling, psychologically lush, grotesque, and romantic.&amp;rdquo; &lt;br&gt;&lt;i&gt;&lt;b&gt;&amp;mdash;Booklist&lt;/b&gt;&lt;/i&gt;</d104>
    </othertext>
    <othertext>
      <d102>13</d102>
      <d103>02</d103>
      <d104>HELEN MACDONALD is a writer, poet, naturalist, and historian of science. Their books include &lt;i&gt;H is for Hawk&lt;/i&gt;, which won many prizes including the Costa Book of the Year and the Samuel Johnson Prize for Non-Fiction, and the &lt;i&gt;Sunday Times&lt;/i&gt; bestselling &lt;i&gt;Vesper Flights&lt;/i&gt;. They live in Suffolk with their two parrots.&amp;nbsp;SIN BLACH&amp;Eacute; is a Black Irish musician and author. They have been writing horror and sci-fi stories all their life. Born in California, they live in the Northwest of Ireland and can be found obsessing over obscure folk instruments, being a reluctant saviour to feral cats, and playing too many video games.</d104>
    </othertext>
    <mediafile>
      <f114>04</f114>
      <f115>03</f115>
      <f116>01</f116>
      <f117>http://images.randomhouse.com/cover/d/9780735247628</f117>
      <f373>20230612</f373>
    </mediafile>
    <imprint>
      <b241>01</b241>
      <b243>NZ</b243>
      <b079>Viking</b079>
    </imprint>
    <publisher>
      <b291>01</b291>
      <b241>01</b241>
      <b243>9B</b243>
      <b081>Penguin Canada</b081>
    </publisher>
    <b394>04</b394>
    <b003>20230822</b003>
    <salesrights>
      <b089>01</b089>
      <b090>CA</b090>
    </salesrights>
    <salesrights>
      <b089>03</b089>
      <b090>AD AE AF AG AI AL AM AO AQ AR AS AT AU AW AX AZ BA BB BD BE BF BG BH BI BJ BL BM BN BO BQ BR BS BT BV BW BY BZ CC CD CF CG CH CI CK CL CM CN CO CR CU CV CW CX CY CZ DE DJ DK DM DO DZ EC EE EG EH ER ES ET FI FJ FK FM FO FR GA GB GD GE GF GG GH GI GL GM GN GP GQ GR GS GT GU GW GY HK HM HN HR HT HU ID IE IL IM IN IO IQ IR IS IT JE JM JO JP KE KG KH KI KM KN KP KR KW KY KZ LA LB LC LI LK LR LS LT LU LV LY MA MC MD ME MF MG MH MK ML MM MN MO MP MQ MR MS MT MU MV MW MX MY MZ NA NC NE NF NG NI NL NO NP NR NU NZ OM PA PE PF PG PH PK PL PM PN PR PS PT PW PY QA RE RO RS RU RW SA SB SC SD SE SG SH SI SJ SK SL SM SN SO SR SS ST SV SX SY SZ TC TD TF TG TH TJ TK TL TM TN TO TR TT TV TW TZ UA UG UM US UY UZ VA VC VE VG VI VN VU WF WS YE YT ZA ZM ZW</b090>
    </salesrights>
    <relatedproduct>
      <h208>06</h208>
      <productidentifier>
        <b221>02</b221>
        <b244>0735247587</b244>
      </productidentifier>
      <productidentifier>
        <b221>15</b221>
        <b244>9780735247581</b244>
      </productidentifier>
      <b012>BB</b012>
    </relatedproduct>
    <supplydetail>
      <j136>2013975</j136>
      <j137>Penguin Random House</j137>
      <j270>1-800-733-3000</j270>
      <j292>01</j292>
      <j141>IP</j141>
      <j396>20</j396>
      <j143>20230822</j143>
      <price>
        <j148>01</j148>
        <j261>06</j261>
        <discountcoded>
          <j363>02</j363>
          <j364>BQM</j364>
        </discountcoded>
        <j151>126.00</j151>
        <j152>CAD</j152>
        <b251>CA</b251>
      </price>
    </supplydetail>
  </product>
</ONIXmessage>

----------------------------------

Also, as a side note, I seem to be having problems posting, I responded yesterday on email and also on the groups site but my messages never went through. Just saying in case this does go through and a moderator or similar sees it.

Thanks all!
Holly

Michael Bolam

Nov 14, 2024, 11:03:22 AM
to pym...@googlegroups.com
Still pretty new to Python/pymarc, but I assume this is because the code is designed to work with MARCXML and you are passing it ONIX XML, which is a different format. I think pymarc.parse_xml_to_array() expects MARCXML rather than XML in general; see the XML and JSON section here: https://pypi.org/project/pymarc/ which specifically references using the function to parse MARCXML. I looked around a bit, and while there are some crosswalks from ONIX to MARC (e.g. https://www.loc.gov/marc/onix2marc.html), they are pretty complex. It doesn't look like common tools like MarcEdit have tried to tackle this. I saw a bit of discussion on places like StackExchange of people trying to extract data from ONIX, but not a full ONIX to MARCXML transformation. While ONIX has a lot of the data we may want to add to a MARC record, it doesn't seem to have data sufficient to create a MARC record.
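
One way to catch this kind of mismatch before conversion might be to look at the root element of each file. The sketch below is stdlib-only. The helper name looks_like_marcxml is made up for illustration, and it assumes MARCXML files use the standard http://www.loc.gov/MARC21/slim namespace on a collection or record root; the two sample strings are trimmed-down stand-ins, not real supplier files.

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def looks_like_marcxml(xml_text):
    """Return True if the root element is a namespaced MARCXML collection or record."""
    root = ET.fromstring(xml_text)
    expected = ("{%s}collection" % MARCXML_NS, "{%s}record" % MARCXML_NS)
    return root.tag in expected


# Hypothetical minimal samples for illustration
marc_sample = (
    '<collection xmlns="http://www.loc.gov/MARC21/slim">'
    "<record><leader>00000nam  2200000   4500</leader></record>"
    "</collection>"
)
onix_sample = (
    '<ONIXmessage release="2.1">'
    "<product><a001>9780735247628</a001></product>"
    "</ONIXmessage>"
)

print(looks_like_marcxml(marc_sample))
print(looks_like_marcxml(onix_sample))
```

A check like this could run on each file in the loop and report the filename when it fails, instead of silently producing an empty output file.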

Apologies in advance if I'm missing something here or misinterpreted the functions or the data. I'm still learning, too.

Mike

B W

Nov 14, 2024, 11:08:30 AM
to pym...@googlegroups.com
Agreed, Mike. The XML as posted by Holly will need to be parsed with something like lxml or ElementTree to extract the desired information: 
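
A minimal sketch of what that could look like with ElementTree, offered as an illustration rather than a full ONIX-to-MARC crosswalk. The function name extract_products and the dictionary keys are invented for this example. It assumes the ONIX 2.1 short tags seen in the posted file: b221/b244 for identifier type and value (type 15 taken to be ISBN-13), b203 for the title text, and b037 for the inverted contributor name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample based on the structure of the posted ONIX file
onix = """<ONIXmessage release="2.1">
  <product>
    <productidentifier><b221>15</b221><b244>9780735247628</b244></productidentifier>
    <title><b202>01</b202><b203>Prophet</b203></title>
    <contributor><b035>A01</b035><b037>Macdonald, Helen</b037></contributor>
    <contributor><b035>A01</b035><b037>Blaché, Sin</b037></contributor>
  </product>
</ONIXmessage>"""


def extract_products(xml_text):
    """Collect a few fields from each <product> element into plain dicts."""
    root = ET.fromstring(xml_text)
    products = []
    for product in root.iter("product"):
        isbn13 = None
        # Only direct children, so identifiers inside <relatedproduct> are skipped
        for ident in product.findall("productidentifier"):
            if ident.findtext("b221") == "15":
                isbn13 = ident.findtext("b244")
        products.append({
            "isbn13": isbn13,
            "title": product.findtext("title/b203"),
            "contributors": [c.findtext("b037") for c in product.findall("contributor")],
        })
    return products


for item in extract_products(onix):
    print(item)
```

Turning these values into pymarc records would be a separate step, and deciding which ONIX elements map to which MARC fields is where the crosswalk mentioned earlier in the thread comes in.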

Geoffrey Spear

Nov 14, 2024, 11:12:28 AM
to pym...@googlegroups.com
This is correct.

I'd add that it's probably best to call pymarc.parse_xml_to_array(filepath, strict=True) unless you're certain you have a file that's actually parseable as MARC-XML but doesn't conform to the XSD in some way that coincidentally doesn't break the parsing.

To be honest I think it's kind of weird that the parser doesn't default to strict mode, but it's probably an artifact of the library largely being written to get data out of bad MARC and into, uh, a better format that we've been hoping for 30 years would have killed MARC entirely by now. 
