Strange results for getPublicationDate


Delis, Christopher

Apr 19, 2019, 11:56:01 AM
to solrmarc-tech
Hello, all,

I'm having trouble understanding the behavior of solrmarc's built-in method, getPublicationDate. I'm not sure whether there is a bug somewhere or whether I simply misunderstand how the Index Specification Language is supposed to work. Can someone help me figure out why one of the following examples produces a result that seems reasonable (to me, anyway) while the other does not?

—————————————————————————————
import/marc.properties:

publishDate = custom, getPublicationDate
—————————————————————————————

I consulted the source code (src/org/solrmarc/index/SolrIndexerShim.java) and found that the built-in definition of getPublicationDate is equivalent to:

—————————————————————————————
import/marc.properties:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4),clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")
—————————————————————————————

I've verified that I get the same behavior with either definition. I will keep using the explicit definition rather than the built-in one because it lets me try variations more easily, in the hope of better understanding what is happening.

—————————————————————————————
Here's an example record with results I would expect:

"good" record:

<record>
<leader>01043nam a2200301 a 4500</leader>
<controlfield tag="001">123456</controlfield>
<controlfield tag="003">myorg</controlfield>
<controlfield tag="008">008 771024s1503 gw a 001 0 lat d</controlfield>
<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Impressum in nobili Heluecior[um] vrbe Arge[n]tina :</subfield>
<subfield code="b">Per Ioanne[m] Gruninger,</subfield>
<subfield code="c">anno 1503. XV Kalendas Aprilis.</subfield>
</datafield>
</record>

Relevant MARC data:

008[7-10] "1503"
008[11-14] " "
260$c "anno 1503. XV Kalendas Aprilis."

Solr result(*):

<arr name="publishDate">
<str>1503</str>
</arr>


(*) Here's the publishDate definition in schema.xml:

<field name="publishDate" type="string" indexed="true" stored="true" multiValued="true"/>
—————————————————————————————


—————————————————————————————
Here's an example record with results that do not make sense to me:

"bad" record:

<record>
<leader>01043nam a2200301 a 4500</leader>
<controlfield tag="001">789012</controlfield>
<controlfield tag="003">myorg</controlfield>
<controlfield tag="008">010618q14891493fr 000 0dlat d</controlfield>
<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">[Strasbourg :</subfield>
<subfield code="b">Johann Prüss,</subfield>
<subfield code="c">between 1489 and 1493]</subfield>
</datafield>
</record>

Relevant MARC data:

008[7-10] "1489"
008[11-14] "1493"
260$c "between 1489 and 1493]"
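
The bracketed 008 ranges above are zero-based and inclusive. As a minimal sketch (the class and method names are hypothetical and not part of SolrMarc), the two date slices can be read out of the 008 string as posted like this:

```java
class Field008Dates {
    // Returns the characters at zero-based positions start..end (inclusive),
    // mirroring the 008[7-10] style of range used in the specification.
    public static String slice(String field, int start, int end) {
        return field.substring(start, end + 1);
    }

    public static void main(String[] args) {
        // 008 of the "bad" record, copied from the post.
        String f008 = "010618q14891493fr 000 0dlat d";
        System.out.println(slice(f008, 7, 10));  // prints 1489
        System.out.println(slice(f008, 11, 14)); // prints 1493
    }
}
```

Only the first 15 characters matter here, so any whitespace lost later in the pasted string does not affect the two slices.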

Solr result:

<arr name="publishDate">
<str/>
</arr>

I would have expected the following result:

<arr name="publishDate">
<str>1489</str>
</arr>

—————————————————————————————

I'm using the most recent master branch of solrmarc:

$ git remote -v
origin https://github.com/solrmarc/solrmarc.git (fetch)
origin https://github.com/solrmarc/solrmarc.git (push)

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.


$ git log --name-status
commit 936774b41f4a1fa2128b99a2b4ca6f79284e11e3 (HEAD -> master, tag: v3.2, origin/master, origin/HEAD)
Author: Robert Haschart <rh...@virginia.edu>
Date: Tue Oct 2 15:36:13 2018 -0400

Update eclipse project file to reference new classgraph library

M .classpath

—————————————————————————————

I've run a few test cases on these two records using different variations of the publishDate definition. It seems that the map operation is somehow adversely affecting the second ("bad") record, but I'm not sure how or why.


—————————————————————————————
Test results for "bad" record:


Test 1 (original):

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4),clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

<arr name="publishDate">
<str/>
</arr>


Test 2:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4)

<arr name="publishDate">
<str>1489</str>
<str>1493</str>
<str>between 1489 and 1493]</str>
</arr>

Test 3:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4), clean

<arr name="publishDate">
<str>1489</str>
<str>1493</str>
<str>between 1489 and 1493</str>
</arr>

Test 4:

publishDate = 008[7-10]:008[11-14]

<arr name="publishDate">
<str>1489</str>
<str>1493</str>
</arr>

Test 5:

publishDate = 008[7-10]:008[11-14], clean

<arr name="publishDate">
<str>1489</str>
<str>1493</str>
</arr>

Test 6:

publishDate = 008[7-10], clean

<arr name="publishDate">
<str>1489</str>
</arr>

Test 7:

publishDate = 008[7-10], clean, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

[empty]

Test 8:

publishDate = 008[7-10], clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

[empty]


The results of Tests 2-6 are what I would have expected. But why does Test 7 return an empty set (the field is not indexed in Solr at all)?


—————————————————————————————


For comparison, I've run the same tests using the "good" record; these results seem reasonable/correct to me:


—————————————————————————————
Test results for "good" record:


Test 1 (original):

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4),clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

<arr name="publishDate">
<str>1503</str>
</arr>

Test 2:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4)

<arr name="publishDate">
<str>1503</str>
<str>anno 1503. XV Kalendas Aprilis.</str>
</arr>

Test 3:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4), clean

<arr name="publishDate">
<str>1503</str>
<str>anno 1503. XV Kalendas Aprilis</str>
</arr>

Test 4:

publishDate = 008[7-10]:008[11-14]

<arr name="publishDate">
<str>1503</str>
</arr>

Test 5:

publishDate = 008[7-10]:008[11-14], clean

<arr name="publishDate">
<str>1503</str>
</arr>

Test 6:

publishDate = 008[7-10], clean

<arr name="publishDate">
<str>1503</str>
</arr>

Test 7:

publishDate = 008[7-10], clean, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

<arr name="publishDate">
<str>1503</str>
</arr>

Test 8:

publishDate = 008[7-10], clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

<arr name="publishDate">
<str>1503</str>
</arr>


—————————————————————————————


Thanks in advance,
Chris

Delis, Christopher

Apr 19, 2019, 12:29:21 PM
to solrma...@googlegroups.com

I made a typo in my previous email: the "good" record's 008 was entered incorrectly as "008 771024s1503 gw a 001 0 lat d". The correct value does not contain the "008 " at the beginning. It should be "771024s1503 gw a 001 0 lat d".

Here's the *correct* "good" record:

<record>
<leader>01043nam a2200301 a 4500</leader>
<controlfield tag="001">123456</controlfield>
<controlfield tag="003">myorg</controlfield>

<controlfield tag="008">771024s1503 gw a 001 0 lat d</controlfield>


<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">Impressum in nobili Heluecior[um] vrbe Arge[n]tina :</subfield>
<subfield code="b">Per Ioanne[m] Gruninger,</subfield>
<subfield code="c">anno 1503. XV Kalendas Aprilis.</subfield>
</datafield>
</record>

</collection>

Sorry about that!


Delis, Christopher

Apr 22, 2019, 11:07:27 AM
to solrma...@googlegroups.com
It turns out that my problem was that the publication dates in my example are before 1500. The (20|1[5-9]) part of the map regexp only matches years from 1500 through 2099, so the earlier dates were clobbered:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4),clean, first, map("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")

My solution was to change 1[5-9] to 1[0-9], which extends the match to years from 1000 onward:

publishDate = 008[7-10]:008[11-14]:260c:264c?(ind2=1||ind2=4),clean, first, map("(^|.*[^0-9])((20|1[0-9])[0-9][0-9])([^0-9]|$)=>$2",".*[^0-9].*=>")
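
To see the difference in isolation, here is a minimal sketch that runs the two year patterns through plain java.util.regex. It only checks what the regular expressions match; it does not reproduce how SolrMarc's map function treats values that match no pattern, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearPatternCheck {
    // Year pattern from the built-in definition: accepts 1500-1999 and 2000-2099.
    public static final Pattern ORIGINAL =
        Pattern.compile("(^|.*[^0-9])((20|1[5-9])[0-9][0-9])([^0-9]|$)");
    // Modified pattern: 1[0-9] widens the accepted years to 1000-2099.
    public static final Pattern WIDENED =
        Pattern.compile("(^|.*[^0-9])((20|1[0-9])[0-9][0-9])([^0-9]|$)");
    // Second pattern of the map: any value that contains a non-digit.
    public static final Pattern HAS_NON_DIGIT = Pattern.compile(".*[^0-9].*");

    // Returns capture group 2 (the year) when the pattern is found, else null.
    // Assumption: find() is used here; for bare four-digit values such as
    // "1489" a full match would give the same answer.
    public static String extractYear(Pattern p, String value) {
        Matcher m = p.matcher(value);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractYear(ORIGINAL, "1503")); // prints 1503
        System.out.println(extractYear(ORIGINAL, "1489")); // prints null
        System.out.println(extractYear(WIDENED, "1489"));  // prints 1489
        // "1489" has no non-digit, so the second pattern does not match either.
        System.out.println(HAS_NON_DIGIT.matcher("1489").find()); // prints false
    }
}
```

With the original patterns, "1489" matches neither entry of the map, which is consistent with the empty results in Tests 7 and 8 for the "bad" record. Note that the widened pattern still rejects years before 1000 or after 2099.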


--Chris


--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to solrmarc-tec...@googlegroups.com.
To post to this group, send email to solrma...@googlegroups.com.
Visit this group at https://groups.google.com/group/solrmarc-tech.
For more options, visit https://groups.google.com/d/optout.

Demian Katz

Apr 22, 2019, 12:05:41 PM
to solrma...@googlegroups.com
Thanks for the update, Chris; I'm glad you figured it out. Sorry I didn't have a chance to take a closer look before you got that far; conference travels got in the way of prompt email follow-up! 🙂

- Demian



Delis, Christopher

Apr 22, 2019, 12:08:40 PM
to solrma...@googlegroups.com
No worries. I feel bad (and a little embarrassed) that I didn't recognize the regex issue sooner. Sometimes it helps to take a break (weekend) to see the obvious staring you in the face :-)


Haschart, Robert J (rh9ec)

Apr 24, 2019, 5:27:35 PM
to solrma...@googlegroups.com

Chris,


After seeing your description of how you solved the problem, I remembered that the exact same issue bit me in the past.


I actually stopped using getPublicationDate some time ago, and recently I have been making (yet another) effort to define a specification (or set of specifications) that reliably extracts a single, best, sortable date for sorting by publication date, and a separate publication date range for searching by date range.


So if a serial was published from 1949 to 1962, and the date range is stored as [1949 TO 1962], then a user searching for items published between 1950 and 1955 would get that serial back as a hit.
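
The matching rule in that example is an interval overlap test. A minimal sketch of just that logic follows; the class and method names are hypothetical, and this is neither the specification being developed nor how Solr evaluates range queries internally.

```java
class DateRangeOverlap {
    // True when the closed year ranges [aStart, aEnd] and [bStart, bEnd]
    // have at least one year in common.
    public static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart <= bEnd && bStart <= aEnd;
    }

    public static void main(String[] args) {
        // Serial published 1949-1962, user searches 1950-1955.
        System.out.println(overlaps(1949, 1962, 1950, 1955)); // prints true
        // Same serial, user searches 1963-1970.
        System.out.println(overlaps(1949, 1962, 1963, 1970)); // prints false
    }
}
```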


-Bob






Demian Katz

Apr 25, 2019, 11:00:59 AM
to solrma...@googlegroups.com

Bob,

 

Adding support for date range storage has been on my long-term to-do list forever, but it’s something that’s never become high enough priority to get to the top of the list. Please let me know if/when you get the indexing routine finished up, and that might be the motivation I need to get range functionality fully implemented in VuFind. 😊

 

- Demian
