RE: [EXTERNAL] [VuFind-Tech] Handling non-sorting control characters

Demian Katz

Nov 2, 2021, 6:48:15 AM
to Bernd Fehling, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
Bernd,

There is some discussion of the issue on this JIRA ticket:

https://vufind.org/jira/browse/VUFIND-974

I'm also copying the solrmarc-tech list into the reply for additional input, because I think this is probably an issue that should be addressed in SolrMarc (if it is not already).

- Demian

-----Original Message-----
From: Bernd Fehling <bernd....@uni-bielefeld.de>
Sent: Tuesday, November 2, 2021 6:28 AM
To: vufin...@lists.sourceforge.net
Subject: [EXTERNAL] [VuFind-Tech] Handling non-sorting control characters

Dear list,

a question about handling non-sorting control characters while importing/indexing.

How do you handle this?

How is VuFind handling this?

We need to remove them at index time because they disrupt the sorting of our search result lists.

Does VuFind have an out-of-the-box solution?

Unfortunately, there are many fields affected by non-sorting control characters.
Just in case you need more background info:
https://www.loc.gov/marc/nonsorting.html

Regards
Bernd

--
*************************************************************
Bernd Fehling Bielefeld University Library
Dipl.-Inform. (FH) LibTec - Library Technology
Universitätsstr. 25 and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060 bernd.fehling(at)uni-bielefeld.de
https://www.ub.uni-bielefeld.de/~befehl/

BASE - Bielefeld Academic Search Engine - http://www.base-search.net/
*************************************************************


_______________________________________________
Vufind-tech mailing list
Vufin...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Demian Katz

Nov 2, 2021, 7:28:42 AM
to Bernd Fehling, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
Bernd,

My recollection is that when this came up in 2014, it was mostly theoretical -- we couldn't find a use case involving MARC21, and the Unimarc problems discussed were solved by conversion scripts, so the need wasn't perceived as urgent and no further action was taken. If you have some sample records demonstrating the problem, feel free to attach one or two to the ticket to get the conversation moving again.

- Demian

-----Original Message-----
From: Bernd Fehling <bernd....@uni-bielefeld.de>
Sent: Tuesday, November 2, 2021 7:03 AM
To: vufin...@lists.sourceforge.net
Cc: solrma...@googlegroups.com
Subject: Re: [VuFind-Tech] [EXTERNAL] Handling non-sorting control characters

Thanks Demian, for pointing me to the jira.
It seems to be somewhat old, from 2014, and could use some attention.

Addressing this in SolrMarc makes sense, because many fields are affected.
It should be configurable which characters count as non-sorting control characters.

Nevertheless, it is still an open JIRA issue.

Regards
Bernd


Demian Katz

Nov 4, 2021, 10:32:44 AM
to Bernd Fehling, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
Bernd,

It seems like perhaps SolrMarc needs a new modifier as a complement to the existing stripInd/stripInd1/stripInd2 options, perhaps accompanied by some new configuration properties to control which characters are used for stripping. I haven't worked with modifier code before, so I'm not exactly sure how this works, but it seems like a possible approach. I'm copying solrmarc-tech back into the thread in case others there have thoughts on this.
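
For comparison, the indicator-based stripping that the stripInd options are named after can be pictured, as far as I understand them, as skipping a fixed number of leading characters taken from a MARC indicator. That is why it cannot cover marker characters embedded in the field data itself. The following is only a sketch: `IndicatorStripper` and `stripByIndicator` are hypothetical names, and this is not SolrMarc's actual modifier code.

```java
/** Hypothetical illustration only; not part of SolrMarc or marc4j. */
final class IndicatorStripper {

    /**
     * Skips as many leading characters as the indicator digit says.
     * A blank or non-digit indicator leaves the value untouched.
     */
    static String stripByIndicator(String value, char indicator) {
        if (value == null) return null;
        int skip = Character.digit(indicator, 10); // -1 for blank or non-digit
        if (skip <= 0 || skip > value.length()) return value;
        return value.substring(skip);
    }
}
```

With a title of "The Hobbit" and an indicator of '4', the sort value would be "Hobbit". A marker-based modifier would need the begin/end characters instead of a count.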

- Demian

-----Original Message-----
From: Bernd Fehling <bernd....@uni-bielefeld.de>
Sent: Thursday, November 4, 2021 10:17 AM
To: vufin...@lists.sourceforge.net
Subject: Re: [VuFind-Tech] [EXTERNAL] Handling non-sorting control characters

There is no need for any patches in marc4j; it already implements everything according to the MARC rules.

But some work could be done in SolrMarc, and it must be very flexible, because every catalog system has its own non-sorting characters or even character sequences. Your Pica uses "\u009b" and "\u009c", Alma uses "<<" and ">>".

The question is: should SolrMarc rely on the non-sorting rules of MARC records, or should it ignore any MARC rules and just do a replaceFirst on a special character or character sequence?
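
The configurable variant could look roughly like the sketch below. It is only an illustration under assumptions: `NonSortingStripper`, `MarkerPair`, `with`, `defaults`, `toSortKey` and `toDisplay` are hypothetical names, not SolrMarc or marc4j API. The default pairs are the ones quoted in this thread plus U+0098/U+009C, which the LoC page linked earlier lists as the Unicode non-sorting begin/end characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of SolrMarc or marc4j. */
final class NonSortingStripper {

    /** One begin/end marker pair that encloses a non-sorting block. */
    static final class MarkerPair {
        final String begin;
        final String end;

        MarkerPair(String begin, String end) {
            this.begin = begin;
            this.end = end;
        }
    }

    private final List<MarkerPair> pairs = new ArrayList<>();

    /** Registers a marker pair; returns this so calls can be chained. */
    NonSortingStripper with(String begin, String end) {
        pairs.add(new MarkerPair(begin, end));
        return this;
    }

    /** Example configuration; a real setup would read this from properties. */
    static NonSortingStripper defaults() {
        return new NonSortingStripper()
                .with("\u0098", "\u009c")  // non-sorting begin/end per the LoC page
                .with("\u009b", "\u009c")  // pair quoted in this thread
                .with("<<", ">>");         // Alma, as quoted in this thread
    }

    /** Drops a leading non-sorting block, markers included. */
    String toSortKey(String value) {
        if (value == null) return null;
        for (MarkerPair p : pairs) {
            if (value.startsWith(p.begin)) {
                int end = value.indexOf(p.end, p.begin.length());
                if (end >= 0) {
                    return value.substring(end + p.end.length()).stripLeading();
                }
            }
        }
        return value; // no configured marker at the start: leave as is
    }

    /** Keeps the text but removes the markers themselves. */
    String toDisplay(String value) {
        if (value == null) return null;
        String out = value;
        for (MarkerPair p : pairs) {
            out = out.replace(p.begin, "").replace(p.end, "");
        }
        return out;
    }
}
```

With this configuration, `toSortKey("<<The>> Hobbit")` gives "Hobbit" while `toDisplay("<<The>> Hobbit")` gives "The Hobbit". The sketch takes the second route from the question above: it looks only at the configured character sequences and ignores indicators and other MARC rules.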

Regards
Bernd


On 04.11.21 at 14:00, Uwe Reh wrote:
> Adding a dedicated method in SolrMarc sounds like a good approach. A patch
> to marc4j seems too general. (marc4j shouldn't change the
> content.)
>
> In our HDS project we are using dedicated fields for sorting:
>
> * 'author_sort', 'title_sort': filled respecting the rules for
> non-sorting prefixes.
>
> * 'date_sort_asc', 'date_sort_desc': filled with the first/last date of
> publication. (needed for journals and series)
>
> Since we fill our index with our own 'SolrPica', I can't share a
> SolrMarc solution. But the generic code is quite simple.
>
>> private String removeNonSortingLeader(String out) {
>>     if (out == null) return null;
>>     if (out.isEmpty()) return "";
>>     // Remove a leading '@'.
>>     if (out.charAt(0) == '@') return out.substring(1);
>>     // Remove the non-sorting prefix according to RAK (leading word followed by " @").
>>     out = out.replaceFirst("^\\w+ @", "");
>>     // Remove the non-sorting block according to MARC.
>>     out = out.replaceFirst("^˜\u009b.*\u009c", "");
>>     return out;
>> }
>
> Note 1) In my original code I'm using the control chars directly. With
> the encoded chars, the last regex might not work.
> Note 2) Yes, the code isn't optimized. I tend to leave this task to the JVM.
>
> Uwe
>
>
>
> On 04.11.21 at 13:10, Bernd Fehling wrote:
>> Hi Demian,
>>
>> I was looking for a general solution to enhance SolrMarc and
>> was therefore digging into marc4j.
>> It is generally possible to have this in SolrMarc because of marc4j.
>> But as I mentioned, the librarians have their own view and have only just left
>> the age of punched-card readers. ;-)
>>
>> I'll write a class for it and that's it. QAE (quick and easy)
>>
>> Regards
>> Bernd
>>
>


