DataUtil::cleanDate() and pre-1500 dates


Demian Katz

Oct 5, 2020, 1:03:03 PM
to solrma...@googlegroups.com

It’s recently been brought to my attention that SolrMarc’s built-in date validation logic drops pre-1500 dates, which can be a problem for collections containing early printed works. I’m pretty sure it’s down to this regular expression:

https://github.com/solrmarc/solrmarc/blob/master/src/org/solrmarc/tools/DataUtil.java#L23-L24

A few questions:

  1. Is there a compelling reason for this?
  2. Is anyone else actually using this method outside of the VuFind context? There are a few internal references in the code base, but some of them seem potentially obsolete.
  3. Does it make sense for this to be part of the SolrMarc core, or would this be better treated as an external support method?
  4. If there is a reason to keep this in the core, would it make sense to make it more configurable so the key regular expression(s) could be overridden without having to rewrite the whole function?

Thoughts/feedback welcome. I’m happy to help with a solution if we can agree on a general direction.

Thanks,

Demian

Uwe Reh

Oct 5, 2020, 2:20:45 PM
to solrma...@googlegroups.com
Hi Demian,

Since I have replaced SolrMarc with my own SolrPica, I can only offer my
thoughts.

To your questions:

1. To me, 1500 looks like a plausibility check, invented by a computer
scientist rather than an archivist.

2.
a) At HeBIS, the plausibility check in our SolrPica is "from 500 to
3000".
b) The German Index (GVI), which does not use VuFind but does use
SolrMarc, imports with no validation at all.
(The German consortium BSZ uses the GVI as the index in Boss3, a VuFind fork.)

3.
Some libraries hold manuscripts from the 7th century or even older.
With digitization, we get a lot of scans that carry the publication date
of the original writing. So I vote for getting rid of this limitation.

4.
You should consider a distinction between searching (publish_date) and
sorting (publish_date_sort).
* publish_date_sort has to hold a single value in a uniform format.
Normalization and validation are useful here.
* publish_date should be multi-valued. For example, journals are published
in more years than just their first or last. Validation may hide relevant hits.
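
A range-based plausibility check like the one mentioned in 2a, combined with the search/sort distinction in 4, could be sketched roughly as follows. This is only an illustration: the class `YearPlausibility`, its methods, and the candidate regex are hypothetical and are not taken from SolrPica or SolrMarc. Only the bounds 500 and 3000 come from the text above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearPlausibility {
    // Bounds mirror the "from 500 to 3000" rule quoted above.
    static final int MIN_YEAR = 500;
    static final int MAX_YEAR = 3000;

    // Assumption: a year is a run of three or four digits that is not part of a longer number.
    private static final Pattern YEAR_CANDIDATE =
            Pattern.compile("(?<!\\d)\\d{3,4}(?!\\d)");

    /** Returns every plausible year in the raw string, in order of appearance (multi-valued search field). */
    static List<Integer> plausibleYears(String raw) {
        List<Integer> years = new ArrayList<>();
        Matcher m = YEAR_CANDIDATE.matcher(raw);
        while (m.find()) {
            int year = Integer.parseInt(m.group());
            if (year >= MIN_YEAR && year <= MAX_YEAR) {
                years.add(year);
            }
        }
        return years;
    }

    /** Picks a single value for sorting: the earliest plausible year, or null if there is none. */
    static Integer sortYear(String raw) {
        List<Integer> years = plausibleYears(raw);
        return years.isEmpty() ? null : Collections.min(years);
    }
}
```

With this sketch, a journal run such as "1485-1501" keeps both years for searching, while sorting uses 1485 only.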

Just my two cents
Uwe



Demian Katz

Oct 6, 2020, 4:30:46 PM
to solrma...@googlegroups.com
Thanks for the thoughts; this is definitely helpful!

As you suggest, VuFind already has separate publishDate and publishDateSort fields, since there are differing needs for display vs. sorting. However, at present, these are both string fields. Strings work fine for four-digit years, but if we allow three-digit years, that will cause problems unless we either normalize the data or change the field type in the schema. This is a further argument for potentially having different normalization rules for sort vs. display/search.
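
The string-sorting problem with three-digit years can be reproduced in a few lines of Java. This is a minimal sketch, assuming the sort field compares values character by character as plain strings; the class `YearSortDemo` and the helper `padYear` are hypothetical names used only for illustration, and zero-padding is just one possible normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class YearSortDemo {
    /** Zero-pads a numeric year to four digits, e.g. "999" becomes "0999". */
    static String padYear(String year) {
        return String.format(Locale.ROOT, "%04d", Integer.parseInt(year));
    }

    /** Sorts a copy of the values character by character, as plain strings. */
    static List<String> sortedCopy(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1500", "999", "2020", "790");
        // Unpadded: three-digit years land after all four-digit years.
        System.out.println(sortedCopy(raw)); // prints [1500, 2020, 790, 999]

        List<String> padded = new ArrayList<>();
        for (String year : raw) {
            padded.add(padYear(year));
        }
        // Padded: string order and chronological order agree.
        System.out.println(sortedCopy(padded)); // prints [0790, 0999, 1500, 2020]
    }
}
```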

In any case, whether or not we decide to make a specific change for this particular case, I think it would be useful generally if all of the regular expressions in DataUtil could be overridden in a subclass, in order to make it easier to customize the code without having to copy and paste the whole thing. The fact that everything is private and final in the existing code makes this impractical... and since all of the properties and methods are static, I'm not sure what the best practice would be to improve upon this in Java. (In PHP, you can use the static:: prefix in place of self:: to inherit static properties in a flexible way, but I don't know if Java has an equivalent...). I'm open to suggestions, as always! If somebody has a preferred/recommended approach, I can try to open a PR when time permits.
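
On the Java side, static methods cannot be overridden: a subclass can only hide them, and calls are bound to the class named at compile time, so there is no direct counterpart to PHP's static:: prefix. One possible workaround, sketched below, is to put the pattern behind a protected instance method that a subclass overrides. Everything here is hypothetical: `DateCleaner`, `EarlyDateCleaner`, `yearPattern()`, and both regular expressions are illustrations, not SolrMarc's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCleaner {
    // Hypothetical default: accept only years 1500-2099.
    private static final Pattern DEFAULT_YEAR =
            Pattern.compile("(?<!\\d)(1[5-9]|20)\\d{2}(?!\\d)");

    /** Subclasses override this to change which years are accepted. */
    protected Pattern yearPattern() {
        return DEFAULT_YEAR;
    }

    /** Returns the first year accepted by yearPattern(), or null if none is found. */
    public String cleanDate(String date) {
        Matcher m = yearPattern().matcher(date);
        return m.find() ? m.group() : null;
    }
}

class EarlyDateCleaner extends DateCleaner {
    // Hypothetical override: accept any four-digit year from 0000 to 2999.
    private static final Pattern ANY_FOUR_DIGIT_YEAR =
            Pattern.compile("(?<!\\d)[0-2]\\d{3}(?!\\d)");

    @Override
    protected Pattern yearPattern() {
        return ANY_FOUR_DIGIT_YEAR;
    }
}
```

In this sketch, `new DateCleaner().cleanDate("[1485]")` returns null while `new EarlyDateCleaner().cleanDate("[1485]")` returns "1485". Existing static entry points could delegate to a shared instance to stay backward compatible.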

- Demian

Uwe Reh

Oct 7, 2020, 4:42:32 AM
to solrma...@googlegroups.com
On 06.10.20 at 22:30, Demian Katz wrote:
> As you suggest, VuFind already has separate publishDate and publishDateSort fields ...
Sorry, I should have had a look first.

I have now also taken a closer look at org/solrmarc/tools/DataUtil.java. I
would suggest simply patching the original regex to something more
permissive, like "[0-2]\\d{3}".

In the unlikely case that backward compatibility is needed, I could
offer a PR with an additional method:
* public static String cleanDate(final String date, String regEx)
or, for better performance:
* public static String cleanDate(final String date, Pattern compiledRegEx)
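
The overloads listed above could be sketched like this. Only the two signatures and the "[0-2]\\d{3}" pattern come from the proposal; the class name `DateCleanSketch` and the method bodies are assumptions for illustration, and the real DataUtil.cleanDate() does more cleanup than simply returning the first match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCleanSketch {
    // Permissive default from the proposal: any four-digit year from 0000 to 2999.
    private static final Pattern DEFAULT_YEAR = Pattern.compile("[0-2]\\d{3}");

    /** One-argument form: existing callers keep working and get the default pattern. */
    public static String cleanDate(final String date) {
        return cleanDate(date, DEFAULT_YEAR);
    }

    /** Convenience overload: compiles the regex on every call. */
    public static String cleanDate(final String date, String regEx) {
        return cleanDate(date, Pattern.compile(regEx));
    }

    /** Faster overload: the caller compiles the pattern once and reuses it. */
    public static String cleanDate(final String date, Pattern compiledRegEx) {
        if (date == null) {
            return null;
        }
        Matcher m = compiledRegEx.matcher(date);
        return m.find() ? m.group() : null;
    }
}
```

A caller that needs the stricter behavior could then pass its own pattern, for example `cleanDate(value, Pattern.compile("(1[5-9]|20)\\d{2}"))`, without touching the default.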

Uwe

Demian Katz

Oct 7, 2020, 12:08:36 PM
to solrma...@googlegroups.com
Thanks, Uwe. When time permits, I'll open a PR with your proposed simple solution in order to move this conversation to GitHub; from there (and with Bob's input), we can decide whether to merge as-is or refine it into something a little more nuanced. I appreciate the input and conversation!

Demian Katz

Oct 13, 2020, 12:42:57 PM
to solrma...@googlegroups.com
As promised, I've opened this pull request:

https://github.com/solrmarc/solrmarc/pull/90

I just started with supporting all 1xxx four-digit years, since I think three-digit years add some other dimensions of complexity that might require further thought/discussion. I put a TODO checkbox on the PR for that.

I'd be happy for this to be merged as-is if no one objects to it, but I'm also open to further discussion and refinement as needed. I'm opening this just to get the ball rolling and so we don't forget about the issue with pre-1500 dates.