Sean,
It might be helpful to look at some of the existing examples:
https://github.com/vufind-org/vufind/tree/master/import/index_java/src/org/vufind/index
For example, if you look at PublisherTools::getPublishers, you can see how the code loops through various fields and subfields.
FullTextTools::getFullText also demonstrates how you can pass a fieldspec from the SolrMarc properties file into the custom code.
Does that help at all? If anything remains unclear, please tell me a little more about what you are trying to do, and I should be able to help!
I’m also copying the solrmarc-tech list on this reply so that it can reach a broader audience.
Good luck!
- Demian
From: Sean Filipov <se...@uchicago.edu>
Sent: Wednesday, May 22, 2019 5:48 PM
To: Vufin...@lists.sourceforge.net
Subject: [VuFind-Tech] Write Java indexing code
Hi tech-list,
I want to write some index import Java files under local/import/index_java/src/, but I don’t know how to pull datafield (field and subfield) information from the marc.properties file, or how to loop through the records that I want to index and pull out all the needed fields.
Does anyone know how to do this? What methods should I use?
Thank you
Sean,
Is there a need to write a custom method, or is this something you can accomplish by chaining together the existing SolrMarc modifiers? (e.g., see stripInd2 at https://github.com/solrmarc/solrmarc/wiki/Specification-Modifiers). I imagine that even if you do need to do this within a custom method, there is probably a way to hook into the modifier from your code, though I’m not sure of the exact syntax. I can dig deeper if you need more help and can provide a little more context.
- Demian
From: Sean Filipov <se...@uchicago.edu>
Sent: Friday, July 12, 2019 4:58 PM
To: Demian Katz <demia...@villanova.edu>; Vufin...@lists.sourceforge.net; solrma...@googlegroups.com; Michael Katzmann <mk...@nlsbph.org>
Subject: [EXTERNAL] Re: [VuFind-Tech] Write Java indexing code
Michael, Demian,
Thank you for your replies.
I have another question:
I want to write custom index code to process title and journal browse. How can I use existing SolrMarc code to parse the tags and subfields specified in the marc.properties file? I don’t want to hard-code fields and subfields. What I want to do in my custom code is check the first and second indicators of the title fields to see whether the articles have to be trimmed.
Are there any existing functions in SolrMarc that I can use to drive our process?
Thank you
On May 23, 2019, at 9:21 AM, Michael Katzmann <mk...@nlsbph.org> wrote:
Sean,
The methods are those of MARC4J.
This book is a great resource for using the methods to access the fields/subfields of a record with MARC4J.
Michael
Sean,
By default, the title browse is using the title_fullStr field as the display value (populated by a copyField in the Solr schema) and the title_sort field as the sort value.
In reviewing how the title_sort field is populated, I discovered that the custom SolrMarc method in use is just a thin wrapper around a basic index specification, so I have just simplified the default VuFind configuration for clarity:
https://github.com/vufind-org/vufind/commit/47c6f76c4e3235b98108d546d8f7f91a577832ac
For your situation, it might be simpler to switch the title browse to use local custom fields so that you can work with multiple values. I think you might be able to do something like this:
title_browse_str_mv = 245[a-z]
title_browse_str_mv += 245[a-z] ? ind2 > 0, stripInd2
title_browse_sort_str_mv = 245abkp,titleSortLower
title_browse_sort_str_mv += 245abkp,titleSortLower ? ind2 > 0, stripInd2, notunique
That’s totally untested, and I may have some syntax wrong. But the idea here is, first always put all the 245 subfields into title_browse_str_mv… and also add a non-filing-stripped version when indicator 2 is non-zero. Also, create parallel sort fields, allowing duplication for rows with a non-zero indicator 2 to keep the parallel structure working.
If this works as expected (and it may not; I’m not an expert on this new SolrMarc syntax), then you should be able to search with or without articles and still successfully land on a heading in the correct position. However, a negative side effect is that titles will display in the browse list twice, reflecting versions with and without articles. If that’s unacceptable, then I think we need to come up with a solution that relies on browse handler changes rather than indexing changes.
Is that any help?
- Demian
From: Sean Filipov <se...@uchicago.edu>
Sent: Tuesday, July 16, 2019 7:23 PM
To: Demian Katz <demia...@villanova.edu>
Cc: Vufin...@lists.sourceforge.net; solrma...@googlegroups.com; Michael Katzmann <mk...@nlsbph.org>
Subject: Re: [EXTERNAL] [VuFind-Tech] Write Java indexing code
Hi Demian,
The problem is that we want to index titles with and without articles but display trimmed versions in alpha browse.
When people search for a title, they get different results depending on whether they include the article or not.
I think Demian is on the right track. But as he guessed the syntax is not quite right. The first one will work exactly as he specified.
title_browse_str_mv = 245[a-z]
title_browse_str_mv += 245[a-z] ?( ind2 > 0), stripInd2
title_browse_sort_str_mv = 245abkp, toLower, stripPunct, stripAccent, clean
title_browse_sort_str_mv |= 245abkp ,titleSortLower
title_browse_str_mv = 245[a-z]
title_browse_str_mv |= 245[a-z], stripInd2
Third Note:
= the result(s) generated by this specification will replace any and all previous values generated for this field for the current record.
+= the result(s) generated by this specification will be added as an additional value(s) for the specified field (possibly resulting in duplicate values)
|= the result(s) generated by this specification will be added as an additional value(s) for the specified field ONLY IF that exact value(s) isn't already present.
?= the result(s) generated by this specification will only be used if the specified field doesn't yet have a defined value from a previous specification.
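For anyone who finds it easier to read as code, the four operators can be modeled on a plain list of strings. This is only a sketch of the behavior described above, not SolrMarc's implementation; the class FieldOps and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the four assignment operators; not SolrMarc code.
class FieldOps {
    // "="  : replace any previous values with the new results.
    static List<String> assign(List<String> current, List<String> results) {
        return new ArrayList<>(results);
    }

    // "+=" : append every result, duplicates included.
    static List<String> append(List<String> current, List<String> results) {
        List<String> out = new ArrayList<>(current);
        out.addAll(results);
        return out;
    }

    // "|=" : append only the results that are not already present.
    static List<String> appendIfAbsent(List<String> current, List<String> results) {
        List<String> out = new ArrayList<>(current);
        for (String r : results) {
            if (!out.contains(r)) {
                out.add(r);
            }
        }
        return out;
    }

    // "?=" : use the results only if the field has no value yet.
    static List<String> assignIfEmpty(List<String> current, List<String> results) {
        return current.isEmpty() ? new ArrayList<>(results) : new ArrayList<>(current);
    }
}
```

With a field that already holds "ship of death", append with the same value yields two entries, while appendIfAbsent still yields one. That difference is what matters when two fields have to stay the same length.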
Thanks for the clarification, Bob!
One tricky thing, though – our goal here is to display titles both with and without articles at the same position using our alphabetical browse handler. This requires us to index parallel values in two fields. So in your example, we would want:
title_browse_str_mv : The ship of death [sound recording] / Peter Dickson Lopez.
title_browse_str_mv : ship of death [sound recording] / Peter Dickson Lopez.
(as you demonstrated), but we would want the SAME sort value for both, because we always want to sort by the article-stripped version:
title_browse_sort_str_mv : ship of death
title_browse_sort_str_mv : ship of death
We need to duplicate the value to ensure that for each title_browse_str_mv, there is exactly one parallel title_browse_sort_str_mv. That’s what I was trying to get at (though unsuccessfully) in my original specs. 😊
In any case, it would definitely be great to add the notes about the operators to the wiki somewhere. If you don’t have time, let me know and I can probably find a moment to paste this in for future reference.
- Demian
To view this discussion on the web visit https://groups.google.com/d/msgid/solrmarc-tech/BN7PR13MB2417873F0958618580FAF1A980C90%40BN7PR13MB2417.namprd13.prod.outlook.com.
Tod,
Thanks for the clarification!
Regarding the issue of lists getting out of sync, I think the distinction between += and |= might help, but it’s definitely tricky no matter how you slice it. Adding support to the browse handler for delimited fields instead of parallel fields might reduce a lot of this complexity (though building the delimited fields in SolrMarc might be a whole new puzzle to work out).
Regarding your question about leveraging the spec language within custom code, here’s the default implementation of getSortableTitle in SolrMarc:
I think this might serve as a helpful example….
- Demian
From: Tod Olson <t...@uchicago.edu>
Sent: Wednesday, July 17, 2019 11:20 AM
To: Demian Katz <demia...@villanova.edu>
Cc: Tod Olson <t...@uchicago.edu>; Sean Filipov <se...@uchicago.edu>; Vufin...@lists.sourceforge.net; solrma...@googlegroups.com
Subject: Re: [VuFind-Tech] [EXTERNAL] Write Java indexing code
Hi Demian,
The new solrmarc syntax is new to us also, but it's interesting to see the += operator used. That might do it. Our specs for title browse might be described as pathologically complete. From our current production:
title_browse = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowse(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t)
title_browse_sort = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowseSort(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t)
The new syntax would let us express something like this in single lines. And there are one or two of these fields where the non-filing indicator is ind1, not ind2, so that can be clearly expressed.
The problem that I anticipate is that duplicate entries may be thrown away, so the fields end up with different lengths. We see this in practice: a record can have two different display strings that normalize to the same sort key, and that puts the fields out of joint. Here's a contrived example:
245 _4 ‡aThe Thing
246 __ ‡a Thing
title_browse => "The Thing", "Thing"
title_browse_sort => "Thing", "Thing" => (passes through a Set somewhere) => "Thing"
And then the record is rejected from the browse indexes because the list of display strings and the list of sort keys are of different lengths. To fix this, we'll need to change the way we build the browse indexes. I need to lay this out in VUFIND-1342.
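The contrived example can be reproduced in a few lines of plain Java. This is a sketch only: the class ParallelFields is hypothetical, and using a LinkedHashSet to stand in for "passes through a Set somewhere" is an assumption about where the duplicates get dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration of parallel fields drifting out of sync.
class ParallelFields {
    // Collapses duplicates the way a Set-backed field would, keeping first-seen order.
    static List<String> dedupe(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    // The browse build needs one sort key per display string, matched by position.
    static boolean isParallel(List<String> display, List<String> sortKeys) {
        return display.size() == sortKeys.size();
    }
}
```

Calling dedupe on the display values "The Thing" and "Thing" leaves two entries, but calling it on the sort keys "Thing" and "Thing" leaves one, so isParallel reports a mismatch and the record would be rejected.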
So I think we can try this and see how it does.
Stepping away from this specific issue and looking more generally, there is a question in here about whether there is a way to use the expressive power of the spec language, or at least the tag strings, in local custom code. If a site is writing a custom indexing function and does not want to hard-code the fields involved, is there some library functionality to help take advantage of the tag string syntax, or does that logic need to be recreated in that custom function?
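To make the question concrete, here is roughly what recreating that logic by hand would look like for the simplest case. This is a sketch, not library functionality: the class TagSpec is hypothetical, and it only understands a plain three-character tag followed by subfield codes, separated by colons. It does not handle LNK-prefixed tags, bracketed ranges, conditions, or modifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hand-rolled parser for simple tag strings such as "245abkp:246abnp".
class TagSpec {
    // Returns a map from tag to subfield codes, in the order the tags appear.
    // A tag that appears twice keeps only its last set of subfield codes.
    static Map<String, String> parse(String spec) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String part : spec.split(":")) {
            String p = part.trim();
            if (p.length() < 3) {
                continue; // too short to contain a tag
            }
            out.put(p.substring(0, 3), p.substring(3));
        }
        return out;
    }
}
```

Parsing "245abkp:246abnp" gives 245 mapped to "abkp" and 246 mapped to "abnp", which a custom function could then use to decide which fields and subfields to read from the record.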
-Tod
Demian,
Ok so the goal for the title_browse_sort_str_mv part is to generate the same number of entries as are generated by the specifications for title_browse_str_mv , but for the values generated for the title_browse_sort_str_mv field to always be the same.
# generate displayable title value using all subfields
title_browse_str_mv = 245[a-z]
# if ind2 is > 0 generate additional displayable title value with leading article stripped (but otherwise the same)
title_browse_str_mv += 245[a-z] ?( ind2 > 0), stripInd2
# generate sort title value using abkp subfields, shifted to lowercase, with punctuation and accents stripped and the leading article (if present) removed
title_browse_sort_str_mv = 245abkp,titleSortLower
# if ind2 is > 0 generate additional sort title value identical to the first one.
title_browse_sort_str_mv += 245abkp ? (ind2 > 0), titleSortLower
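For readers who would rather do the trimming in custom Java code, the idea behind stripping by indicator is small enough to sketch. In MARC, the non-filing indicator counts the leading characters to skip. The class NonFiling below is hypothetical and is not how SolrMarc implements stripInd2; it only illustrates the rule.

```java
// Hypothetical helper: drop the leading non-filing characters counted by a MARC indicator.
class NonFiling {
    static String strip(String title, char indicator) {
        // Indicators '0' to '9' give the number of characters to skip; anything else means none.
        int skip = (indicator >= '0' && indicator <= '9') ? indicator - '0' : 0;
        if (skip > title.length()) {
            return title; // indicator is larger than the title, so leave the value alone
        }
        return title.substring(skip);
    }
}
```

For a 245 with second indicator 4 and the title "The ship of death", strip returns "ship of death"; with a blank or zero indicator the title comes back unchanged.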
Demian, Tod:
In the "custom_methods" repository https://github.com/solrmarc/custom_methods there is a custom method example called JoinFieldsMixin that might get close to what you are thinking of.
It allows you to provide two complete index specifications as the parameters to the method. It will evaluate both specifications and return a set of results that joins the results of the two together in a pairwise manner.
So, using the getComplexJoinedFields method defined in that example source file, you could use the following specifications:
For records where there is no initial article, only one field value would be generated:
browse_title : Ethnographica et folkloristica Carpathica. --|-- ethnographica et folkloristica carpathica
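The pairwise idea itself can be sketched in plain Java. This is not the JoinFieldsMixin source: the class PairwiseJoin is hypothetical, the separator is taken from the sample output above, and rejecting lists of different lengths is an assumption about how a mismatch should be handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of joining two result lists position by position.
class PairwiseJoin {
    static List<String> join(List<String> display, List<String> sortKeys, String separator) {
        if (display.size() != sortKeys.size()) {
            // Assumption: a mismatch is treated as an error rather than silently truncated.
            throw new IllegalArgumentException("lists must be the same length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < display.size(); i++) {
            out.add(display.get(i) + separator + sortKeys.get(i));
        }
        return out;
    }
}
```

Because each display value travels with its own sort key in a single string, the two can no longer drift out of step, which is the appeal of a delimited field over two parallel fields.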
-Bob Haschart
Thanks, Bob, this is all extremely helpful! 😊
- Demian
From: solrma...@googlegroups.com <solrma...@googlegroups.com>
On Behalf Of Haschart, Robert J (rh9ec)
Did you consider the possibility of using an inline translation map to strip the trailing characters in a more targeted way, rather than relying on a modifier? That might work. See the prefixing example here for a rough idea of how you can manipulate strings with regular expressions in SolrMarc:
https://vufind.org/wiki/indexing:solrmarc#customizing_record_ids
- Demian
From: Sean Filipov <se...@uchicago.edu>
Sent: Wednesday, July 24, 2019 3:09 PM
To: Haschart, Robert J (rh9ec) <rh...@virginia.edu>
Cc: solrma...@googlegroups.com; Vufin...@lists.sourceforge.net; Tod Olson <t...@uchicago.edu>; Demian Katz <demia...@villanova.edu>
Subject: Re: [VuFind-Tech] [EXTERNAL] Write Java indexing code
Hi Bob,
I tried to use specification modifiers for the title and journal browse/browse_sort fields. They are powerful tools, but stripPunct removes more than we want. For example, it removes hyphens, but we want to preserve them. Would it be possible to add one more modifier, something like ‘stripCharacters’, that takes a string of characters to strip, for example: stripCharacters(*,|,\)? We tried to read through the SolrMarc code, but it is hard to follow how all the abstract classes work together and which classes are called.
If you believe this code modification is not very challenging to do, could you tell us how your code works so we can start planning this modification?
Thank you
"title_browse":
["The sea-atlas.",
"The sea atlas.",
"The watter-world.",
"The watter world.",
"The water world.",
"De zee-atlas ofte water-waereld.",
"The sea-atlas, or, The watter-world shewing all the sea-coasts of y known parts of y earth with a generall doscription of the same : verie vsefull for all masters & mates of shipps & likwise for merchants newly sett forth."],
"title_browse_sort":
["sea atlas",
"watter world",
"water world",
"zee atlas ofte water waereld",
"sea atlas or the watter world shewing all the sea coasts of y known parts of y earth with a generall doscription of the same verie vsefull for all masters mates of shipps likwise for merchants newly sett forth"]},
"title_browse":
["The Quakers shaken.",
"A fire-brand snach'd out of the fire.",
"A fire brand snach'd out of the fire.",
"The Quakers shaken, or, A fire-brand snach'd out of the fire being a briefe relation of Gods wonderfull mercie extended to John Gilpin of Kendale in Westmoreland : who as will appeare by the sequel, was not only deluded, but possessed by the devill."],
"title_browse_sort":
["quakers shaken",
"fire brand snach d out of the fire",
"quakers shaken or a fire brand snach d out of the fire being a briefe relation of gods wonderfull mercie extended to john gilpin of kendale in westmoreland who as will appeare by the sequel was not only deluded but possessed by the devill"]},
{
I think Demian is on the right track here.
Doing a quick test of the following two specifications:
returns almost identical results.
this part:
filter("[=&(),'.]|\\[|\\]=>\ ")
says: for any part of the generated result that matches one of these characters = & ( ) , ' . [ ] replace that character with a space.
this part:
filter("[ ][ ]+=>\ ")
replaces runs of 2 or more spaces with a single space.
if the result contains a hyphen or a colon or a semi-colon or any other punctuation NOT listed, those punctuation marks will be untouched.
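The same two replacements can be tried out in plain Java before putting them into a specification. This is a sketch using String.replaceAll with the two patterns quoted above; the class PunctFilter is hypothetical, and SolrMarc's own handling of the filter modifier may differ in its details.

```java
// Hypothetical stand-alone version of the two filter steps described above.
class PunctFilter {
    static String apply(String value) {
        // Step 1: replace = & ( ) , ' . [ ] with a space; all other punctuation is left alone.
        String spaced = value.replaceAll("[=&(),'.]|\\[|\\]", " ");
        // Step 2: collapse runs of two or more spaces into a single space.
        return spaced.replaceAll("[ ][ ]+", " ");
    }
}
```

Applied to "The sea-atlas, or, The watter-world", this gives "The sea-atlas or The watter-world": the commas are gone and the hyphens survive. Note that a value ending in a period will end in a space after filtering, so a trim or the clean modifier may still be wanted afterwards.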
-Bob