Problem loading some Russian and Arabic records into VuFind

Tulie Amichal


Oct 26, 2011, 3:46:03 AM

to solrma...@googlegroups.com, vufin...@lists.sourceforge.net

Hi All

We're working with VuFind 1.0.1 and are having problems loading certain records, mostly in Arabic and Russian. I'm attaching a sample record in Russian. I ran this record through MarcEdit to correct issues we found with empty indicators. With that corrected, the following error now stops us from loading:

ERROR [main] (MarcImporter.java:310) - Error indexing record: HUJ000043590 -- String index out of range: 0
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
        at java.lang.String.charAt(String.java:694)
        at org.solrmarc.index.SolrIndexer.getSubfieldDataAsSet(SolrIndexer.java:1612)
        at org.solrmarc.index.SolrIndexer.getFieldList(SolrIndexer.java:1103)
        at org.solrmarc.index.SolrIndexer.map(SolrIndexer.java:536)
        at org.solrmarc.marc.MarcImporter.addToIndex(MarcImporter.java:329)
        at org.solrmarc.marc.MarcImporter.importRecords(MarcImporter.java:262)
        at org.solrmarc.marc.MarcImporter.handleAll(MarcImporter.java:506)
        at org.solrmarc.marc.MarcImporter.main(MarcImporter.java:785)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at com.simontuffs.onejar.Boot.run(Boot.java:334)
        at com.simontuffs.onejar.Boot.main(Boot.java:170)
INFO [main] (MarcImporter.java:516) - Adding 0 of 1 documents to index

Any idea how I can see which field is causing the problem? Looking at the code got me to a dead end. Unless I'm not reading this correctly, the parameter subfldsStr is not null, is not longer than 1 character, and has no character at position 0.

(SolrIndexer.java roughly around line 1612)


            if (!isControlField(fldTag) && subfldsStr != null)
            {
                // DataField
                DataField dfield = (DataField) vf;

                if (subfldsStr.length() > 1 || separator != null)
                {
                    // ...[removed code]
                }
                else
                {
                    // get all instances of the single subfield
                    List<Subfield> subFlds = dfield.getSubfields(subfldsStr.charAt(0));
                    for (Subfield sf : subFlds)
                    {
                        resultSet.add(sf.getData().trim());
                    }
                }
            }

Any ideas on why this is happening, or on steps to continue debugging it? We have about 100,000 records like these.

Thanks
Tulie

--

טולי עמיכל
052-8700781
tulie....@gmail.com
http://about.me/tulie/

test_after_nooclc.mrc

Robert Haschart


Oct 26, 2011, 11:46:25 AM

to solrma...@googlegroups.com, vufin...@lists.sourceforge.net

Tulie,

Have you modified the marc.properties file or the marc_local.properties file? The code you highlighted takes one code path if you provide a field specification like

physical = 300abcefg:530abcd

where multiple subfields are to be extracted from a field and concatenated (when subfldsStr.length() > 1), and another when the length is not greater than 1, as in

publisher = 260b

However, it seems there is a bug in SolrMarc: if you accidentally forget to specify which subfield you are interested in, subfldsStr.length() will be zero. The code will therefore take the second path (because the length is not greater than 1) and then try to get the first character of a zero-length string, which throws the exception you are getting.

So look for a field specification in marc.properties or marc_local.properties where a field tag is specified but no subfield codes are, like either of the following:

publisher = 260

topic_facet = 600x:610x:611x:630:648x:650a:650x:651x:655x
                             ^

This is clearly a bug in SolrMarc. Rather than crashing with a stack dump, it should instead flag the error with a helpful error message and continue along its way.
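The failure mode described above, and the kind of guard that would avoid it, can be sketched in a few lines of Java. This is a minimal illustration, not SolrMarc's actual code: the class SubfieldSpecGuard and the method chooseCodePath are hypothetical names, and the returned strings only stand in for the two branches of the quoted snippet.

```java
class SubfieldSpecGuard {
    /**
     * Hypothetical helper, not part of SolrMarc. It chooses between the two
     * branches of the quoted snippet, but rejects an empty subfield string
     * with a descriptive message before charAt(0) can be reached.
     */
    static String chooseCodePath(String fldTag, String subfldsStr, String separator) {
        if (subfldsStr.isEmpty()) {
            // This is the case that currently ends in StringIndexOutOfBoundsException.
            throw new IllegalArgumentException(
                "Field " + fldTag + " has no subfield codes in its index specification");
        }
        if (subfldsStr.length() > 1 || separator != null) {
            return "concatenate";                  // e.g. physical = 300abcefg
        }
        return "single:" + subfldsStr.charAt(0);   // e.g. publisher = 260b
    }

    public static void main(String[] args) {
        System.out.println(chooseCodePath("300", "abcefg", null));
        System.out.println(chooseCodePath("260", "b", null));
        try {
            chooseCodePath("260", "", null);       // publisher = 260
        } catch (IllegalArgumentException e) {
            System.out.println("Configuration error: " + e.getMessage());
        }
    }
}
```

With a check like this, a specification such as publisher = 260 would produce a message naming the field instead of a stack trace.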

-Robert Haschart


Tulie Amichal


Oct 27, 2011, 8:38:51 AM

to solrma...@googlegroups.com, vufin...@lists.sourceforge.net

Hi Robert,

I think you're correct. I did some analysis by using MarcEdit to create MARC files, each with one row removed, and loading them. The problematic lines turned out to be fields such as 929 and 912, which are mapped as in the following example:

author2 = 110ab:111ab:700abcd:710ab:711ab:929:912

So I assume that listing these fields without subfield specifications, in a specification that also contains tags with subfields, will fail?

Thanks for confirming the bug
Tulie

Robert Haschart


Oct 27, 2011, 11:49:29 AM

to solrma...@googlegroups.com, vufin...@lists.sourceforge.net

Tulie,

It isn't that the MARC record is bad, or that those fields in the MARC record are bad. Nor is the problem that fields without subfields are listed in an index specification string alongside fields that have subfields. The problem is simply that listing a field in an index specification without following it with the subfield or subfields to be looked at is an error.
(The only exception to this rule is for fields 001 to 009, which are not allowed to have subfields.)

So this is an error, and will cause SolrMarc to exit:

publisher = 260

and this is an error

topic_facet = 600x:610x:611x:630:648x:650a:650x:651x:655x

as well as this:

author2 = 110ab:111ab:700abcd:710ab:711ab:929:912

But the bug in SolrMarc isn't that these index specifications aren't accepted. The bug is that the way it lets you know you've done something wrong is to unceremoniously crash and exit, rather than printing a useful error message indicating exactly what the problem is and then either continuing on as best it can or stopping.

To fix the problem in your index specification, change the above line to

author2 = 110ab:111ab:700abcd:710ab:711ab:929a:912a

or

author2 = 110ab:111ab:700abcd:710ab:711ab:929abcdef:912abcdefghij

or

author2 = 110ab:111ab:700abcd:710ab:711ab:929[a-z]:912[a-z]

(whichever is appropriate for your data) and you should be good to go.
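The rule stated in this message (every data field needs subfield codes, while 001 to 009 do not) can be checked mechanically before an import is run. The following is a minimal sketch under stated assumptions: IndexSpecChecker and tagsMissingSubfields are hypothetical names, not SolrMarc API, and the check only understands plain colon-separated lists of tags like the ones quoted in this thread, not every form a marc.properties value can take.

```java
import java.util.ArrayList;
import java.util.List;

class IndexSpecChecker {
    /**
     * Hypothetical helper, not part of SolrMarc. Returns the data-field tags
     * in a specification such as "110ab:929:912" that have no subfield codes.
     */
    static List<String> tagsMissingSubfields(String spec) {
        List<String> offenders = new ArrayList<String>();
        for (String entry : spec.split(":")) {
            String e = entry.trim();
            boolean bareTag = e.matches("\\d{3}");          // three digits and nothing else
            boolean controlField = e.matches("00[1-9]");    // 001 to 009 take no subfields
            if (bareTag && !controlField) {
                offenders.add(e);
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        String spec = "110ab:111ab:700abcd:710ab:711ab:929:912";
        for (String tag : tagsMissingSubfields(spec)) {
            System.err.println("author2: field " + tag + " is listed without subfield codes");
        }
    }
}
```

Run against the original author2 line, this reports 929 and 912. Run against any of the corrected lines above, it reports nothing.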
