Re: [EXTERNAL] [VuFind-General] Importing marc with upper case subcodes

Demian Katz

Aug 16, 2024, 5:56:28 AM
to Uwe Steinmann, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com
I'm copying the solrmarc-tech list into this reply in case anyone there can comment. 

It might be worth looking at the SolrMarc code in GitHub to see if the option is still supported. I can't think of a reason why it wouldn't be, but maybe it got broken in refactoring somewhere along the way. 

If nobody else replies and/or you need more help, please remind me and I'll look into this more deeply. 

- Demian 

From: Uwe Steinmann <vuf...@steinmann.cx>
Sent: Friday, August 16, 2024 4:30:18 AM
To: vufind-...@lists.sourceforge.net <vufind-...@lists.sourceforge.net>
Subject: [EXTERNAL] [VuFind-General] Importing marc with upper case subcodes
 
Hi,

We have some MARC files that use upper-case subfield codes:

  <datafield tag="952" ind1=" " ind2=" ">
    ...
    <subfield code="R">2021-03-25 15:10:25</subfield>
    <subfield code="J">Reference</subfield>
    <subfield code="P">1997-12-08</subfield>
    <subfield code="U">2002-07-09</subfield>
  </datafield>

This does not comply with the MARC specification, but currently
I can't change this.
In VuFind 5 it was no problem to import these files after setting

org.marc4j.MarcPermissiveStreamReader.upperCaseSubfields = true

in import.properties, but the importer shipped with VuFind 10 doesn't
seem to honor it anymore. I even passed it on the java command line
with -D, but the Solr field marc_error still contains
'["[fullrecord]Invalid code: R"]'

 INFO [main] (IndexDriver.java:166) - Reading and compiling index specifications: marc.properties, marc_local.properties
 INFO [main] (IndexDriver.java:257) - Opening index spec file: /home/www-data/dai/vufind-10.0/import/marc.properties
 INFO [main] (IndexDriver.java:257) - Opening index spec file: /home/www-data/dai/vufind-10.0/import/marc_local.properties
 INFO [main] (IndexDriver.java:100) - Opening input files: [/tmp/bibliography_2024-08-08/test.xml]
DEBUG [main] (IndexDriver.java:331) - System Class Path = /home/www-data/dai/vufind-10.0/import/solrmarc_core_3.5.jar
DEBUG [MarcReader-Thread] (MarcReaderThread.java:35) - record read : 000000001
 WARN [RecordIndexer-Thread-0] (Indexer.java:455) - Exception in record: 000000001
 WARN [RecordIndexer-Thread-0] (Indexer.java:456) - while processing index specification:  FullRecordAsJSON2
 INFO [main] (ThreadedIndexer.java:259) - Done with all indexing, finishing writing records to solr
 INFO [main] (ThreadedIndexer.java:272) - Done writing records to solr
 INFO [main] (Indexer.java:595) - Commmiting updates to Solr
 INFO [main] (IndexDriver.java:391) - 1 records read
 INFO [main] (IndexDriver.java:392) - 1 records indexed  and
 INFO [main] (IndexDriver.java:399) - 1 records sent to Solr in 1.44 seconds

It somewhat looks like MarcPermissiveStreamReader isn't used anymore.
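
A quick way to see which fields are affected before importing is to scan the MARCXML for upper-case subfield codes. The following is a minimal Python sketch using only the standard library; the function name `uppercase_subfield_codes` and the inline sample record are illustrative assumptions and are not part of VuFind, SolrMarc, or Marc4j.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def uppercase_subfield_codes(source):
    """Count upper-case subfield codes, keyed by (datafield tag, code).

    `source` is a file name or a binary file object containing MARCXML.
    """
    counts = Counter()
    current_tag = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop any namespace part
        if event == "start":
            if local_name == "datafield":
                current_tag = elem.get("tag")
            elif local_name == "subfield" and elem.get("code", "").isupper():
                counts[(current_tag, elem.get("code"))] += 1
        elif local_name == "record":
            elem.clear()  # free memory once a whole record has been seen
    return counts


# Hypothetical sample modelled on the 952 field shown in the message above.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="952" ind1=" " ind2=" ">
      <subfield code="R">2021-03-25 15:10:25</subfield>
      <subfield code="J">Reference</subfield>
      <subfield code="a">example</subfield>
    </datafield>
  </record>
</collection>
"""

if __name__ == "__main__":
    found = uppercase_subfield_codes(io.BytesIO(SAMPLE))
    for (tag, code), n in sorted(found.items()):
        print(f"{tag}${code}: {n}")
```

Passing a file name instead of the `BytesIO` object should work the same way, since `iterparse` accepts either.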

  Uwe

--
  MMK GmbH, Fleyer Str. 196, 58097 Hagen
  Uwe.St...@mmk-hagen.de
  Tel: 02331 840446    Fax: 02331 843920

Demian Katz

Aug 16, 2024, 8:38:04 AM
to Uwe Steinmann, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com
Uwe,

What input format are you using? I did a bit of poking around in the SolrMarc code and found this:

https://github.com/solrmarc/solrmarc/blob/d3ec70b6efe26ae6e30b344313ededc7f836458c/src/org/solrmarc/marc/SolrMarcMarcReaderFactory.java#L177-L184

I may be mistaken, since I'm not very familiar with the code, but it looks like the PermissiveStreamReader may only be getting used for binary MARC and not MARC-XML... so if you're using an XML format, perhaps that is a factor.

- Demian

-----Original Message-----
From: Uwe Steinmann <vuf...@steinmann.cx>
Sent: Friday, August 16, 2024 6:43 AM
To: Demian Katz <demia...@villanova.edu>
Cc: vufind-...@lists.sourceforge.net; solrma...@googlegroups.com
Subject: Re: [EXTERNAL] [VuFind-General] Importing marc with upper case subcodes

On Fri, Aug 16, 2024 at 09:56:23AM +0000, Demian Katz wrote:
> I'm copying the solrmarc-tech list into this reply in case anyone there can comment.
>
> It might be worth looking at the SolrMarc code in GitHub to see if the option is still supported. I can't think of a reason why it wouldn't be, but maybe it got broken in refactoring somewhere along the way.
I checked the code at
https://github.com/marc4j/marc4j/blob/master/src/org/marc4j/MarcPermissiveStreamReader.java
already. It's still there, but maybe the permissive stream reader isn't used at all, though the settings in import.properties indicate it is used.

marc.to_utf_8 = true
marc.permissive = true
marc.default_encoding = BESTGUESS
marc.include_errors = true

These properties look a lot like those that need to be passed to the constructor of MarcPermissiveStreamReader.

> If nobody else replies and/or you need more help, please remind me and I'll look into this more deeply.
Thanks again for your help.

Uwe

Demian Katz

Aug 16, 2024, 9:51:28 AM
to Uwe Steinmann, vufind-...@lists.sourceforge.net, solrma...@googlegroups.com
Uwe,

Have you tried converting your MARCXML to binary MARC using a tool like yaz-marcdump (https://software.indexdata.com/yaz/doc/yaz-marcdump.html) to see if that solves the problem? If so, you might be able to use format conversion as a workaround until we can find a better solution in SolrMarc; either way, it would help to prove or disprove my theory.
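
As a rough sketch of the conversion test proposed here, the yaz-marcdump call can be wrapped in a few lines of Python. The `-i` and `-o` options select the tool's input and output formats; the helper names and file paths are hypothetical, and character-set options are deliberately left out, so the yaz-marcdump documentation should be checked before relying on this for real data.

```python
import shutil
import subprocess


def marcxml_to_binary_cmd(xml_path):
    """Build the argument list for converting MARCXML to binary MARC."""
    return ["yaz-marcdump", "-i", "marcxml", "-o", "marc", xml_path]


def convert(xml_path, mrc_path):
    """Run yaz-marcdump and write its standard output to `mrc_path`."""
    if shutil.which("yaz-marcdump") is None:
        raise RuntimeError("yaz-marcdump was not found on the PATH")
    with open(mrc_path, "wb") as out:
        subprocess.run(marcxml_to_binary_cmd(xml_path), stdout=out, check=True)


if __name__ == "__main__":
    # Only prints the command; call convert("test.xml", "test.mrc") to run it.
    print(" ".join(marcxml_to_binary_cmd("test.xml")))
```

The resulting .mrc file can then be handed to the importer in place of the XML file.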

I don't think it's likely that the change in the way VuFind stores raw MARC in the index is related to your problem -- the format conversion for storage in the index happens after the file has been initially parsed/read, and it seems like the problem you're encountering is occurring at that earlier parsing/reading stage.

As I said, I'm not at all familiar with SolrMarc internals, so my theory may be entirely wrong -- but I'll be interested to hear what you find if you try my proposed test, and whatever outcome you find, I can spend a little more time digging deeper to see if I can either revise my theory or find a better workaround.

- Demian

-----Original Message-----
From: Uwe Steinmann <vuf...@steinmann.cx>
Sent: Friday, August 16, 2024 9:32 AM
To: Demian Katz <demia...@villanova.edu>
Cc: vufind-...@lists.sourceforge.net; solrma...@googlegroups.com
Subject: Re: [EXTERNAL] [VuFind-General] Importing marc with upper case subcodes

On Fri, Aug 16, 2024 at 12:37:57PM +0000, Demian Katz wrote:
> Uwe,
>
> What input format are you using? I did a bit of poking around in the SolrMarc code and found this:
>
> https://github.com/solrmarc/solrmarc/blob/d3ec70b6efe26ae6e30b344313ededc7f836458c/src/org/solrmarc/marc/SolrMarcMarcReaderFactory.java#L177-L184
>
> I may be mistaken, since I'm not very familiar with the code, but it looks like the PermissiveStreamReader may only be getting used for binary MARC and not MARC-XML... so if you're using an XML format, perhaps that is a factor.
That would explain it. I am using MARCXML, but I used that in VuFind 5 as well. I just re-imported into VuFind 5 the same MARC record that fails in VuFind 10. I also double-checked by making all subfield codes lower case, and then the importer in VuFind 10 no longer complains.
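
The lower-casing check described above can be reproduced with a small script. This is only a sketch under assumptions: the function name is made up for illustration, the whole document is loaded into memory, and lower-casing changes the data (an upper-case "R" would become indistinguishable from an existing "r" in the same field), so it is suited to testing rather than as a permanent fix.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def lowercase_subfield_codes(xml_bytes):
    """Return (new_xml_bytes, number_of_codes_changed) for a MARCXML document."""
    # Keep MARCXML as the default namespace when serializing.
    # Note that this registration is global to the ElementTree module.
    ET.register_namespace("", MARC_NS)
    root = ET.fromstring(xml_bytes)
    changed = 0
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "subfield":
            continue
        code = elem.get("code", "")
        if code != code.lower():
            elem.set("code", code.lower())
            changed += 1
    return ET.tostring(root, encoding="utf-8"), changed


if __name__ == "__main__":
    sample = (
        b'<record><datafield tag="952" ind1=" " ind2=" ">'
        b'<subfield code="R">2021-03-25 15:10:25</subfield>'
        b'<subfield code="a">example</subfield>'
        b"</datafield></record>"
    )
    new_xml, count = lowercase_subfield_codes(sample)
    print(count, "subfield code(s) changed")
```

Any index specification that refers to the upper-case codes would need the same adjustment.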

VuFind 5 uses solrmarc_core 3.1 and VuFind 10 has solrmarc_core 3.5, so something may have changed in between. I looked at the git commit tagged as 3.1, but that seems to use the same factory code as 3.5.

I also checked the Solr record in VuFind 5, which has binary data in the field 'fullrecord'. Could it be that the MARCXML was converted to binary MARC, and that this is what the MarcPermissiveStreamReader processes?
But when the record is converted to JSON, the MarcPermissiveStreamReader isn't used anymore?

Uwe

Tod Olson

Aug 16, 2024, 10:44:31 AM
to Uwe Steinmann, Demian Katz, solrma...@googlegroups.com, vufind-...@lists.sourceforge.net
One question for the solrmarc-tech group might be: is there a reason not to allow the use of org.marc4j.MarcPermissiveStreamReader in MARCXML or any of the other non-binary MARC formats? 

From a casual look at the code Demian referenced, it seems like a PR to honor the upperCaseSubfields property would not be difficult to create. I've had some interaction with people working with CNMARC, the Chinese MARC standard, which regularly uses some uppercase subfields. Supporting that for all input formats seems like a good move.

-Tod

Tod Olson <t...@uchicago.edu> (he/him)
Director of Integrated Library Systems
University of Chicago Library

On Aug 16, 2024, at 9:31 AM, Uwe Steinmann <vuf...@steinmann.cx> wrote:

On Fri, Aug 16, 2024 at 01:51:24PM +0000, Demian Katz wrote:
> Uwe,
>
> Have you tried converting your MARCXML to binary MARC using a tool
> like yaz-marcdump
> (https://software.indexdata.com/yaz/doc/yaz-marcdump.html) to see if
> that solves the problem? If so, you might be able to use format
> conversion as a workaround until we can find a better solution in
> SolrMarc; either way, it would help to prove or disprove my theory.
Your theory appears to be right. Importing the binary MARC works!
So I tried 100000 records after converting them to .mrc and it also
works. It even works without setting
org.marc4j.MarcPermissiveStreamReader.upperCaseSubfields = true
in import.properties.
By looking at the code I thought that would be needed in any case.

Thanks for the help so far. That at least keeps me going to get some
data into my test installation

 Uwe
_______________________________________________
VuFind-General mailing list
VuFind-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/vufind-general

Demian Katz

Aug 16, 2024, 10:49:35 AM
to Tod Olson, Uwe Steinmann, solrma...@googlegroups.com, vufind-...@lists.sourceforge.net

It certainly seems to me that the encoding of the input data should not change the behavior of the software… but I don’t know if there’s a technical reason why the MarcPermissiveStreamReader can only work with binary MARC data. I’d definitely favor greater consistency, but I’m not sure how hard that would be to achieve without digging deeper into the Marc4j code/documentation.

 

- Demian

Demian Katz

Aug 16, 2024, 11:53:06 AM
to Tod Olson, Uwe Steinmann, solrma...@googlegroups.com, vufind-...@lists.sourceforge.net

I did a little more digging and found that there is a technical reason for all this: the permissive behavior is a constructor switch on Marc4j’s MarcPermissiveStreamReader, which can only read binary MARC. The MarcUnprettyXmlReader class used for reading XML does not appear to have an equivalent switch. This seems to be a shortcoming/inconsistency in Marc4j that is just affecting the downstream SolrMarc code. Again, I only did a cursory investigation, so maybe I’m missing something, but this may not be simple to solve.

 

- Demian

Demian Katz

Aug 16, 2024, 1:47:56 PM
to Uwe Steinmann, Tod Olson, solrma...@googlegroups.com, vufind-...@lists.sourceforge.net
This is all a bit puzzling. Maybe the exception is coming from the code that translates the MARC into JSON, rather than the code that reads in the MARC initially. But if that's the case, I don't know why changing the input format would eliminate the error, unless the binary MARC reader is normalizing some specific thing that is making the JSON writer fail.

In any case, VuFind will work correctly regardless of which MARC encoding you store in the Solr index. We changed the default to JSON because it's a) more compact than XML and b) not subject to the length limits of the binary format, making it the best compromise for the largest set of use cases. But if you've been using binary all this time without running into problems with long records, there's no reason you can't continue using it if it saves you the trouble of dealing with this problem.

- Demian

-----Original Message-----
From: Uwe Steinmann <vuf...@steinmann.cx>
Sent: Friday, August 16, 2024 1:18 PM
To: Demian Katz <demia...@villanova.edu>
Cc: Tod Olson <t...@uchicago.edu>; solrma...@googlegroups.com; vufind-...@lists.sourceforge.net
Subject: Re: [VuFind-General] [EXTERNAL] Importing marc with upper case subcodes

On Fri, Aug 16, 2024 at 03:53:01PM +0000, Demian Katz wrote:
> I did a little more digging and found that there is a technical reason
> for all this: the permissive behavior is a constructor switch on
> Marc4j’s MarcPermissiveStreamReader, which can only read binary MARC.
> The MarcUnprettyXmlReader class used for reading XML does not appear
> to have an equivalent switch. This seems to be a
> shortcoming/inconsistency in Marc4j that is just affecting the
> downstream SolrMarc code. Again, I only did a cursory investigation, so
> maybe I’m missing something, but this may not be simple to solve.
Thanks for the explanation. I'm still wondering why the SolrMarc shipped with VuFind 5 didn't show this behaviour. Maybe it just wasn't that picky. Out of curiosity I set

fullrecord = FullRecordAsMarc

in import/marc_local.properties, and surprisingly this also works. Setting it to FullRecordAsXML works as well; only FullRecordAsJSON2 fails.

So it looks like the reason it worked in VuFind 5 is that fullrecord is set to FullRecordAsMarc by default there.

Uwe

Demian Katz

Aug 19, 2024, 9:05:48 AM
to Uwe Steinmann, Tod Olson, solrma...@googlegroups.com, vufind-...@lists.sourceforge.net
I did a little more digging into the code, since the error seems to be somehow tied to the FullRecordAsJSON2 output. Maybe the XML reader is somehow formatting things in a way that causes errors for the JSON2 writer. Not sure why that would be, of course! In any case, here's the relevant code:

https://github.com/solrmarc/solrmarc/blob/d3ec70b6efe26ae6e30b344313ededc7f836458c/src/org/solrmarc/index/extractor/impl/fullrecord/FullRecordAsJSON2ValueExtractor.java#L8

This in turn points to the Marc4j MarcJsonWriter class. So I guess it might be an interesting exercise to write a small program to read a record with a capital R subfield using the Marc4j XML reader and then output it with the Marc4j MarcJsonWriter and see if the problem can be isolated. Not sure if it's worth the effort given the availability of workarounds, but if I had to solve this, that would probably be my next step. 😊

- Demian

-----Original Message-----
From: Uwe Steinmann <vuf...@steinmann.cx>
Sent: Monday, August 19, 2024 4:49 AM
To: Demian Katz <demia...@villanova.edu>
Cc: Tod Olson <t...@uchicago.edu>; solrma...@googlegroups.com; vufind-...@lists.sourceforge.net
Subject: Re: [VuFind-General] [EXTERNAL] Importing marc with upper case subcodes

It's even more puzzling when looking at the xsd of marcxml https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

<xsd:complexType name="subfieldatafieldType" id="subfield.ct">
  <xsd:simpleContent>
    <xsd:extension base="subfieldDataType">
      <xsd:attribute name="id" type="idDataType" use="optional"/>
      <xsd:attribute name="code" type="subfieldcodeDataType" use="required"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>

<xsd:simpleType name="subfieldcodeDataType" id="code.st">
  <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="preserve"/>
    <xsd:pattern value="[\dA-Za-z!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?{}_^`~\[\]\\]{1}"/>
    <!-- "A-Z" added after "\d" May 21, 2009 -->
  </xsd:restriction>
</xsd:simpleType>

It explicitly allows A-Z for codes in a subfield
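
To make that concrete, the pattern from the schema can be tried against a few codes. The sketch below copies the character class by hand into Python's `re` syntax, which is an assumption about equivalence for these characters; XSD patterns match the whole value, so `fullmatch` is used. The function name is illustrative only.

```python
import re

# Character class copied from subfieldcodeDataType in MARC21slim.xsd.
SUBFIELD_CODE = re.compile(r"""[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}""")


def is_valid_subfield_code(code):
    """Return True if `code` satisfies the MARCXML schema pattern."""
    return SUBFIELD_CODE.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["R", "a", "9", " ", "RJ"]:
        print(repr(candidate), is_valid_subfield_code(candidate))
```

Under this reading, "R" is accepted by the schema, while a blank or a two-character code is not.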

Well, I suppose we will solve this in the near future, and I'll stick with binary MARC in the meantime.

Thanks for all your help.

Uwe