Generalized getTitleSort function


Tod Olson

Aug 25, 2012, 5:21:46 PM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu
I'm looking to implement a generalized getTitleSortKeys method or function. I want to implement a BeanShell function that will take a tag string and return the text of the requested fields and subfields with the non-filing characters removed and with some other normalization for computing sort keys. And I have a couple of questions about tackling this.

The context is that we are working with the VuFind title browse, but extending it so a variety of title fields in a record can show up in the title browse list. This means parallel fields in Solr: a title_browse with all of the display versions, and a title_browse_keys with the same data normalized for sorting into a browse list. (These then get dumped into a relational table to provide the actual index.) So we'll have a couple of lines like this in the properties file (yes, we like to pull from everywhere):

title_browse = 210ab:211a:212a:214a:240:242abchnp:245abcdefghknps:246abfghnp:247abfghnp:490av:740ahnp:780bcst:785bcst:787bcst:840ahv:844a
title_browse_keys = script(ucGetTitleBrowseKeys.bsh), getTitleBrowseKeys(210ab:211a:212a:214a:240:242abchnp:245abcdefghknps:246abfghnp:247abfghnp:490av:740ahnp:780bcst:785bcst:787bcst:840ahv:844a)

Here are my questions:

1. Does this or some useful building block already exist? I think I'll have to reimplement the iteration over the tag string and handling the subfields. I've not really found something that would take the tag string and return Set<Field>. I have found methods that return a Set<String>, which does not give the non-filing information. If I'm overlooking some useful helper functions, a pointer would be welcome.

2. Are there any worries about implementing this in BeanShell? Have people found BeanShell to be a significant performance hit during indexing?

3. I'd like this to be general enough to be of use to others. If there's something about the function as described that could be generalized or broken out as a utility to be of more use, please let me know.

-Tod

Demian Katz

Aug 27, 2012, 10:34:03 AM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu

1.)    I’m not aware of a method that meets your need.  Perhaps it would make sense to refactor the existing Set<String> getFieldList to wrap around a separate method that returns a Set<Field>, but this does not look like it would be an entirely straightforward task, so it may be impractical.  Maybe Bob has a better idea.

2.)    BeanShell definitely causes a performance hit, but I don’t think it’s terribly significant.  It certainly doesn’t hurt to prototype in BeanShell.  If the performance is good enough, then you’re done; if you have problems, it’s not difficult to adapt it to pure Java and compile it in.

3.)    A method that takes a fieldspec and returns an array of title sort keys sounds pretty generalized to me.  My only question is whether this could be combined with the existing getSortableTitle in some way to avoid redundant logic (i.e. have one work as a special case of the other).

 

- Demian

 


Tod Olson

Aug 28, 2012, 8:38:53 AM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu
Thanks, that helps. After mulling it over for a bit, here's what I'm thinking.

A BeanShell version of getFieldList that takes an extra parameter to signal normalizing:

  Set<String> getFieldList(String taglist, String normalizer)

normalizer tells getFieldList whether to normalize, and what kind of normalization to apply, in its inner loop, after it has extracted the desired subfields from the tag spec. If normalizer=null, then it's just like the current getFieldList.

Maybe normalizer starts with a generic "sort" value, which trims non-sorting chars, downcases, and strips punctuation. Or maybe there's a "title-sort" value which knows about the non-sorting indicator, and a plain "sort" that doesn't include that logic. I'm not certain yet.
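A minimal sketch of what the generic "sort" normalization described above might look like in plain Java. The class and method names (TitleSortKeySketch, sortKey) are hypothetical and not part of SolrMarc, and the non-filing count is assumed to be passed in by the caller (for a 245 it would come from the second indicator) rather than read from a MARC record.

```java
import java.util.Locale;

/** Hypothetical helper, not a SolrMarc class: builds a sort key from field text. */
final class TitleSortKeySketch {
    private TitleSortKeySketch() {}

    /**
     * @param title     raw text of the extracted subfields
     * @param nonFiling number of leading characters to skip (assumed to be
     *                  supplied by the caller, e.g. from the 245 second indicator);
     *                  pass 0 for a plain "sort" with no non-filing logic
     */
    static String sortKey(String title, int nonFiling) {
        if (title == null) {
            return "";
        }
        // Clamp the skip count so a bad indicator value cannot throw.
        int skip = Math.max(0, Math.min(nonFiling, title.length()));
        String s = title.substring(skip).toLowerCase(Locale.ROOT);
        // Replace anything that is not a letter, combining mark, digit,
        // or whitespace with a space, then collapse runs of whitespace.
        s = s.replaceAll("[^\\p{L}\\p{M}\\p{N}\\s]", " ");
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Under these assumptions, sortKey("The Hobbit, or, There and back again /", 4) yields "hobbit or there and back again". A "title-sort" variant would differ only in where the nonFiling value comes from.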

Ideally, normalizer would be the name of a function/method to call to act as the normalizer. That would be the most general. But I don't think Java/BeanShell is so friendly to that kind of approach.

Anyhow, that's what I'm thinking of. Any reactions?

-Tod

Demian Katz

Aug 28, 2012, 8:46:38 AM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu

Hopefully Bob Haschart will chime in on this – he’s the main architect of SolrMarc, and his opinion is much more informed than mine!  However, I think this does sound like a sensible solution.  As you say, it would be great if you could take the functional programming approach and actually pass in the processing routine you want to use.  That’s not so easy in Java, but maybe the next best thing would be to define an interface (e.g. MarcFilterInterface) and pass in an object that implements the interface.  Then you just use the object to process the matches.  Obviously, the big limitation here is that you can’t pass arbitrary objects in from the SolrMarc configuration files…  but at least this would allow more flexibility under the hood, and you could create wrapper functions as needed to be called from the configs.
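A rough sketch of the interface idea, under stated assumptions: MarcFilterInterface is only the name floated above and does not exist in SolrMarc, and the filter here works on already-extracted strings rather than on MARC field objects. LowercaseFilter, FieldListSketch, and the "lower" normalizer name are likewise hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical interface; not an existing SolrMarc API. */
interface MarcFilterInterface {
    String filter(String value);
}

/** One example implementation: lowercase and trim each value. */
final class LowercaseFilter implements MarcFilterInterface {
    public String filter(String value) {
        return value.toLowerCase(Locale.ROOT).trim();
    }
}

final class FieldListSketch {
    private FieldListSketch() {}

    /** Applies the filter to every value; a null filter leaves values untouched. */
    static Set<String> applyFilter(Set<String> values, MarcFilterInterface filter) {
        Set<String> result = new LinkedHashSet<String>();
        for (String v : values) {
            result.add(filter == null ? v : filter.filter(v));
        }
        return result;
    }

    /**
     * Wrapper of the kind a config file could call: it maps a plain string
     * name to a filter object, since configs cannot pass objects directly.
     */
    static Set<String> applyNamedFilter(Set<String> values, String normalizer) {
        MarcFilterInterface f = "lower".equals(normalizer) ? new LowercaseFilter() : null;
        return applyFilter(values, f);
    }
}
```

The wrapper keeps the string-based signature usable from the properties file, while new normalizers only require a new class implementing the interface.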

 

- Demian

 


Tod Olson

Aug 28, 2012, 1:32:52 PM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu
So now we're trying to implement a BeanShell version of getFieldList, but some of the methods we need to use, like getSubfieldDataAsSet, are protected. So there's not a good way to do this in BeanShell without reimplementing some of these functions. And if we try to extend SolrIndexer in our own package, we'll just run into the same problem.

It seems that getSubfieldDataAsSet is a pretty useful utility method that SolrIndexer subclasses would want to use. Does it need to be protected? And can anyone suggest a workaround?

The secondary question is looking ahead towards support for non-Roman scripts. I see that getFieldList supports some undocumented syntax for specifying linked fields, so "LNK245ab" should indicate the 880 that corresponds to this 245. Are people using this syntax in their marc_local.properties files? It certainly makes sense that you'd want a way to specify specific 880s in the tag list.

But if this syntax is still intended to be supported, I think I've found a bug. It looks like this line:

    String subfield = tags[i].substring(3);

will erroneously set subfield to "245ab" rather than just "ab".
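To illustrate the offset problem, here is a small standalone sketch. It is not SolrMarc's actual getFieldList code; the class LnkTagSketch and method splitTag are hypothetical, and it assumes that an optional "LNK" prefix is always followed by a three-character tag and then the subfield codes.

```java
/** Hypothetical illustration of LNK-aware tag splitting; not SolrMarc code. */
final class LnkTagSketch {
    private LnkTagSketch() {}

    /** Returns {tag, subfields}, e.g. "LNK245ab" becomes {"245", "ab"}. */
    static String[] splitTag(String spec) {
        // Skip the 3-character "LNK" prefix when present (assumed syntax).
        int offset = spec.startsWith("LNK") ? 3 : 0;
        if (spec.length() < offset + 3) {
            throw new IllegalArgumentException("Tag spec too short: " + spec);
        }
        String tag = spec.substring(offset, offset + 3);
        // A fixed substring(3) would return "245ab" for "LNK245ab";
        // the subfield codes start after the prefix plus the tag.
        String subfields = spec.substring(offset + 3);
        return new String[] { tag, subfields };
    }
}
```

With this approach both "245ab" and "LNK245ab" split into the tag "245" and the subfields "ab", whereas "LNK245ab".substring(3) on its own gives "245ab".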

If this LNK syntax is considered supported, are there constraints on how it is supposed to work? Or is it better to rely on the getLinkedField* methods?

Anyhow, I'd appreciate any suggestions for working around this business of the protected static getSubfieldDataAsSet. And also any forward-looking suggestions for bringing in the non-Roman 880 data.

Best,

-Tod

Naomi Dushay

Aug 29, 2012, 10:23:59 PM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu
Tod,

see getAlphaSubfldsAsSortStr
and  getInd2AsInt
on line 1034.

The latter method is also in SolrIndexer in Bob's googlecode project.

I'm ignorant about BeanShell scripts, so I can't help you there.

- Naomi


Tod Olson

Sep 10, 2012, 1:44:34 PM
to solrma...@googlegroups.com, t...@uchicago.edu, se...@uchicago.edu
Thanks, Naomi. I'm trying to resume work on this this week. Thanks for the pointers. getInd2AsInt would also be nice to have access to, rather than duplicate.

BeanShell isn't a fave, but it's syntactically very close to Java, and it lets me postpone a UChicagoIndexer class.

-Tod