RE: Write Java indexing code


Demian Katz

May 23, 2019, 7:42:59 AM
to Sean Filipov, Vufin...@lists.sourceforge.net, solrma...@googlegroups.com

Sean,

 

It might be helpful to look at some of the existing examples:

 

https://github.com/vufind-org/vufind/tree/master/import/index_java/src/org/vufind/index

 

For example, if you look at PublisherTools::getPublishers, you can see how the code loops through various fields and subfields.

 

FullTextTools::getFullText also demonstrates how you can pass a fieldspec from the SolrMarc properties file into the custom code.

 

Does that help at all? If anything remains unclear, please tell me a little more about what you are trying to do, and I should be able to help!

 

I’m also copying the solrmarc-tech list on this reply so that it can reach a broader audience.

 

Good luck!

 

- Demian

 

From: Sean Filipov <se...@uchicago.edu>
Sent: Wednesday, May 22, 2019 5:48 PM
To: Vufin...@lists.sourceforge.net
Subject: [VuFind-Tech] Write Java indexing code

 

Hi tech-list,

 

I want to write some Java index import files under local/import/index_java/src/, but I don’t know how to pull data field (field and subfield) information from the marc.properties file, or how to loop through the records that I want to index and pull out all the needed fields.

Does anyone know how to do this? What methods should I use?

 

Thank you

 

 

 

Demian Katz

Jul 15, 2019, 9:23:38 AM
to Sean Filipov, Vufin...@lists.sourceforge.net, solrma...@googlegroups.com, Michael Katzmann

Sean,

 

Is there a need to write a custom method, or is this something you can accomplish by chaining together the existing SolrMarc modifiers (e.g., see stripInd2 at https://github.com/solrmarc/solrmarc/wiki/Specification-Modifiers)? I imagine that even if you do need to do this within a custom method, there is probably a way to hook into the modifiers from your code, though I’m not sure of the exact syntax. I can dig deeper if you need more help and can provide a little more context.

 

- Demian

 

From: Sean Filipov <se...@uchicago.edu>
Sent: Friday, July 12, 2019 4:58 PM
To: Demian Katz <demia...@villanova.edu>; Vufin...@lists.sourceforge.net; solrma...@googlegroups.com; Michael Katzmann <mk...@nlsbph.org>
Subject: [EXTERNAL] Re: [VuFind-Tech] Write Java indexing code

 

Michael, Demian,

Thank you for your replies.

 

I have another question:

I want to write custom index code to process the title and journal browse. How can I use existing SolrMarc code to parse the tags and subfields specified in the marc.properties file? I don’t want to hard-code fields and subfields. What I want to do in my custom code is check the first and second indicators of the title fields to see whether leading articles have to be trimmed.

Are there any existing functions in SolrMarc that I can use to drive this process?

 

Thank you

 

 

 

On May 23, 2019, at 9:21 AM, Michael Katzmann <mk...@nlsbph.org> wrote:

 

Sean,

The methods are those of MARC4J.

This book is a great resource in using the methods to access fields/subfields of a record using MARC4J

 

Michael

 

 



 

--

        |\      _,,,--,,_       
        /,`.-'`'   ._  \-;;,_   
       |,4-  ) )_   .;.(  `'-'  
      '---''(_/._)-'(_\_)       

 

 

Demian Katz

Jul 17, 2019, 10:01:23 AM
to Sean Filipov, Vufin...@lists.sourceforge.net, solrma...@googlegroups.com, Michael Katzmann

Sean,

 

By default, the title browse is using the title_fullStr field as the display value (populated by a copyField in the Solr schema) and the title_sort field as the sort value.

 

In reviewing how the title_sort field is populated, I discovered that the custom SolrMarc method in use is just a thin wrapper around a basic index specification, so I have just simplified the default VuFind configuration for clarity:

 

https://github.com/vufind-org/vufind/commit/47c6f76c4e3235b98108d546d8f7f91a577832ac

 

For your situation, it might be simpler to switch the title browse to use local custom fields so that you can work with multiple values. I think you might be able to do something like this:

 

title_browse_str_mv = 245[a-z]

title_browse_str_mv += 245[a-z] ? ind2 > 0, stripInd2

title_browse_sort_str_mv = 245abkp,titleSortLower

title_browse_sort_str_mv += 245abkp,titleSortLower ? ind2 > 0, stripInd2, notunique

 

That’s totally untested, and I may have some syntax wrong. But the idea here is, first always put all the 245 subfields into title_browse_str_mv… and also add a non-filing-stripped version when indicator 2 is non-zero. Also, create parallel sort fields, allowing duplication for rows with a non-zero indicator 2 to keep the parallel structure working.

 

If this works as expected (and it may not – I’m not expert on this new SolrMarc syntax), then you should be able to search with or without articles and still successfully land on a heading in the correct position. However, a negative side effect is that titles will display in the browse list twice, reflecting versions with and without articles. If that’s unacceptable, then I think we need to come up with a solution that relies on browse handler changes rather than indexing changes.

 

Is that any help?

 

- Demian

 

From: Sean Filipov <se...@uchicago.edu>
Sent: Tuesday, July 16, 2019 7:23 PM
To: Demian Katz <demia...@villanova.edu>

Cc: Vufin...@lists.sourceforge.net; solrma...@googlegroups.com; Michael Katzmann <mk...@nlsbph.org>
Subject: Re: [EXTERNAL] [VuFind-Tech] Write Java indexing code

 

Hi Demian,

 

The problem is that we want to index titles with and without articles but display trimmed versions in alpha browse.

When people search for a title, they get different results depending on whether they include the article.

Haschart, Robert J (rh9ec)

Jul 17, 2019, 11:01:00 AM
to Sean Filipov, solrma...@googlegroups.com, Vufin...@lists.sourceforge.net, Michael Katzmann

I think Demian is on the right track.   But as he guessed the syntax is not quite right.  The first one will work exactly as he specified.


title_browse_str_mv = 245[a-z]
title_browse_str_mv += 245[a-z] ?( ind2 > 0), stripInd2

This will add all of the subfields from the 245 field to a Solr Field named    title_browse_str_mv  
and then IF the indicator 2 is not equal to '0'  it will also add the field with the leading article removed as a second value in the Solr field named    title_browse_str_mv  

e.g.  
title_browse_str_mv : The ship of death [sound recording] / Peter Dickson Lopez.
title_browse_str_mv : ship of death [sound recording] / Peter Dickson Lopez.
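As a rough illustration of what stripping by indicator 2 means, here is a small Java sketch. It assumes the usual MARC convention that the 245 second indicator holds the number of non-filing characters; the class and method names are hypothetical, and SolrMarc's own stripInd2 implementation may differ in its details.

```java
final class NonFilingSketch {
    /**
     * Drops the number of leading characters given by a MARC non-filing
     * indicator ('1'-'9'). Any other indicator value leaves the title alone.
     * Hypothetical helper for illustration only, not SolrMarc code.
     */
    static String stripNonFiling(String title, char indicator) {
        if (indicator < '1' || indicator > '9') {
            return title;
        }
        int skip = indicator - '0';
        // Guard against an indicator larger than the title itself.
        return skip >= title.length() ? title : title.substring(skip);
    }

    public static void main(String[] args) {
        String title = "The ship of death";
        System.out.println(stripNonFiling(title, '4')); // leading "The " removed
        System.out.println(stripNonFiling(title, '0')); // unchanged
    }
}
```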


title_browse_sort_str_mv = 245abkp, toLower, stripPunct, stripAccent, clean
title_browse_sort_str_mv |= 245abkp ,titleSortLower

This will add subfields abkp from the 245 field to a Solr field named title_browse_sort_str_mv, with all accents and punctuation removed and shifted to lower case.
It will then generate a second entry from the same subfields, with all accents and punctuation removed, shifted to lower case, AND with the leading article removed (as indicated by a non-zero value in ind2). If that second entry differs from the value already added, it will be added as a second value in the title_browse_sort_str_mv field.
e.g.
title_browse_sort_str_mv : the ship of death
title_browse_sort_str_mv : ship of death


Note:  The first set of specifications Demian wrote could also be written as:

title_browse_str_mv = 245[a-z]
title_browse_str_mv |= 245[a-z], stripInd2

Second Note (since it may not be documented anywhere except in the commit comment for the associated Git commit):

If an index specification uses the same Solr field as a previous one, then there are four options for the operator following the field name (  =   +=   |=   and ?=  ):
=     the result(s) generated by this specification will replace any and all previous values generated for this field for the current record.
+=    the result(s) generated by this specification will be added as additional value(s) for the specified field (possibly resulting in duplicate values).
|=    the result(s) generated by this specification will be added as additional value(s) for the specified field ONLY IF that exact value(s) isn't already present.
?=    the result(s) generated by this specification will only be used if the specified field doesn't yet have a defined value from a previous specification.
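To make the four operators concrete, here is a small Java model of how each one combines newly generated values with the values a field already holds. This is only a sketch of the behaviour described in the list above, not SolrMarc's actual code; the class and method names are invented, and the ?= case models the intended behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FieldOperatorSketch {
    /** Applies one operator ("=", "+=", "|=" or "?=") to a field's current values. */
    static List<String> apply(List<String> current, String op, List<String> generated) {
        List<String> result = new ArrayList<>(current);
        switch (op) {
            case "=":   // replace everything generated so far
                result.clear();
                result.addAll(generated);
                break;
            case "+=":  // always append, duplicates allowed
                result.addAll(generated);
                break;
            case "|=":  // append only values not already present
                for (String value : generated) {
                    if (!result.contains(value)) {
                        result.add(value);
                    }
                }
                break;
            case "?=":  // use only when the field has no value yet
                if (result.isEmpty()) {
                    result.addAll(generated);
                }
                break;
            default:
                throw new IllegalArgumentException("Unknown operator: " + op);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sort = apply(new ArrayList<>(), "=", Arrays.asList("ship of death"));
        System.out.println(apply(sort, "+=", Arrays.asList("ship of death"))); // two entries
        System.out.println(apply(sort, "|=", Arrays.asList("ship of death"))); // one entry
    }
}
```

The difference between += and |= is the one that matters for parallel display and sort fields: only += keeps a repeated value.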

Third Note:
A quick test of the fourth operator above   (  ?= ), which is useful for defining a default value for a field, shows that it doesn't work exactly as intended.
If the specified field does not yet have a value, and if the specification following the ?= generates multiple values, only the first of the values generated by that specification will be added.   This is a bug.


-Bob Haschart



From: solrma...@googlegroups.com <solrma...@googlegroups.com> on behalf of Demian Katz <demia...@villanova.edu>
Sent: Wednesday, July 17, 2019 10:01:21 AM
To: Sean Filipov <se...@uchicago.edu>
Cc: Vufin...@lists.sourceforge.net <Vufin...@lists.sourceforge.net>; solrma...@googlegroups.com <solrma...@googlegroups.com>; Michael Katzmann <mk...@nlsbph.org>
Subject: [solrmarc-tech] RE: [EXTERNAL] [VuFind-Tech] Write Java indexing code
 
--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to solrmarc-tec...@googlegroups.com.
To post to this group, send email to solrma...@googlegroups.com.
Visit this group at https://groups.google.com/group/solrmarc-tech.
To view this discussion on the web visit https://groups.google.com/d/msgid/solrmarc-tech/BN3PR03MB2388095E65F6D865E61E6140E8C90%40BN3PR03MB2388.namprd03.prod.outlook.com.
For more options, visit https://groups.google.com/d/optout.

Demian Katz

Jul 17, 2019, 11:09:29 AM
to solrma...@googlegroups.com, Sean Filipov, Vufin...@lists.sourceforge.net, Michael Katzmann

Thanks for the clarification, Bob!

 

One tricky thing, though – our goal here is to display titles both with and without articles at the same position using our alphabetical browse handler. This requires us to index parallel values in two fields. So in your example, we would want:

 

title_browse_str_mv : The ship of death [sound recording] / Peter Dickson Lopez.
title_browse_str_mv : ship of death [sound recording] / Peter Dickson Lopez.

 

(as you demonstrated), but we would want the SAME sort value for both, because we always want to sort by the article-stripped version:

 

title_browse_sort_str_mv : ship of death
title_browse_sort_str_mv : ship of death

 

We need to duplicate the value to ensure that for each title_browse_str_mv, there is exactly one parallel title_browse_sort_str_mv. That’s what I was trying to get at (though unsuccessfully) in my original specs. 😊

 

In any case, it would definitely be great to add the notes about the operators to the wiki somewhere. If you don’t have time, let me know and I can probably find a moment to paste this in for future reference.

 

- Demian

Demian Katz

Jul 17, 2019, 11:26:01 AM
to Tod Olson, Sean Filipov, Vufin...@lists.sourceforge.net, solrma...@googlegroups.com

Tod,

 

Thanks for the clarification!

 

Regarding the issue of lists getting out of sync, I think the distinction between += and |= might help, but it’s definitely tricky no matter how you slice it. Adding support to the browse handler for delimited fields instead of parallel fields might reduce a lot of this complexity (though building the delimited fields in SolrMarc might be a whole new puzzle to work out).

 

Regarding your question about leveraging the spec language within custom code, here’s the default implementation of getSortableTitle in SolrMarc:

 

https://github.com/solrmarc/solrmarc/blob/285c52cfdc885251db20eb644a12fc3c8d7ec88e/src/org/solrmarc/index/SolrIndexerShim.java#L616

 

I think this might serve as a helpful example….

 

- Demian

 

From: Tod Olson <t...@uchicago.edu>
Sent: Wednesday, July 17, 2019 11:20 AM
To: Demian Katz <demia...@villanova.edu>
Cc: Tod Olson <t...@uchicago.edu>; Sean Filipov <se...@uchicago.edu>; Vufin...@lists.sourceforge.net; solrma...@googlegroups.com
Subject: Re: [VuFind-Tech] [EXTERNAL] Write Java indexing code

 

Hi Demian,

 

The new solrmarc syntax is new to us also, but it's interesting to see the += operator used. That might do it. Our specs for title browse might be described as pathologically complete. From our current production:

 

title_browse = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowse(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t)
marc.properties:title_browse_sort = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowseSort(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t) 

The new syntax would let us express something about this in single lines. And there are one or two of these fields where the non-filing indicator is ind1, not ind2. So that can be clearly expressed.

 

The problem that I anticipate is that duplicate entries may be thrown away and the fields become different lengths. We see this in practice: a record can have two different display strings which normalize to the same sort key, and that puts the fields out of joint. Here's a contrived example:

 

245 _4 ‡aThe Thing

246 __ ‡a Thing

 

title_browse => "The Thing", "Thing"

title_browse_sort => "Thing", "Thing" => (passes through a Set somewhere) => "Thing"

 

And then the record is rejected from the browse indexes because the list of display strings and the list of sort keys are of different lengths. To fix this, we'll need to change the way we build the browse indexes. I need to lay this out in VUFIND-1342.
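The length mismatch described here can be reproduced with a few lines of Java. This is a toy model under stated assumptions: the normalizer below (lowercase, drop a leading "the ") merely stands in for the real sort-key logic, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;

final class ParallelListSketch {
    /** Toy sort-key normalizer: lowercases and drops a leading "the ". */
    static String sortKey(String display) {
        String lower = display.toLowerCase(Locale.ROOT);
        return lower.startsWith("the ") ? lower.substring(4) : lower;
    }

    /** Builds one sort key per display string, keeping duplicates. */
    static List<String> sortKeys(List<String> displays) {
        List<String> keys = new ArrayList<>();
        for (String display : displays) {
            keys.add(sortKey(display));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> displays = Arrays.asList("The Thing", "Thing");
        List<String> keys = sortKeys(displays);
        // Passing the keys through a Set silently drops the repeated key.
        List<String> deduped = new ArrayList<>(new LinkedHashSet<>(keys));
        System.out.println(displays.size() + " display strings, " + keys.size()
                + " sort keys, " + deduped.size() + " after de-duplication");
    }
}
```

As long as the keys stay in a List, the two fields remain the same length; the mismatch only appears once duplicates are collapsed.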

 

So I think we can try this and see how it does.

 

Stepping away from this specific issue and looking more generally, there is a question in here about whether there is a way to use the expressive power of the spec language, or at least the tag strings, in local custom code. If a site is writing some custom indexing function where they do not want to hard-code the fields involved, is there some library functionality to help take advantage of the tag string syntax, or does that logic need to be recreated in that custom function?

 

-Tod

Haschart, Robert J (rh9ec)

Jul 17, 2019, 11:45:44 AM
to solrma...@googlegroups.com, Sean Filipov, Vufin...@lists.sourceforge.net, Michael Katzmann

Demian,


Ok so the goal for the   title_browse_sort_str_mv    part is to generate the same number  of entries as are generated by the specifications for   title_browse_str_mv  , but for the values  generated for the   title_browse_sort_str_mv    field to always be the same.


# generate displayable title value using all subfields
title_browse_str_mv = 245[a-z] 
# if ind2 is > 0 generate additional displayable title value with leading article stripped (but otherwise the same)
title_browse_str_mv += 245[a-z] ?( ind2 > 0), stripInd2

# generate sort title value using abkp subfields, shifted to lowercase, with punctuation and accents stripped and the leading article (if present) removed
title_browse_sort_str_mv = 245abkp,titleSortLower
# if ind2 is > 0 generate additional sort title value identical to the first one.
title_browse_sort_str_mv += 245abkp ? (ind2 > 0), titleSortLower



-Bob




Haschart, Robert J (rh9ec)

Jul 17, 2019, 12:15:54 PM
to Tod Olson, solrma...@googlegroups.com, Sean Filipov, Vufin...@lists.sourceforge.net

Demian, Tod:  


In the "custom_methods"  repository https://github.com/solrmarc/custom_methods  there is a custom method example called JoinFieldsMixin that might get close to what you are thinking of.


It allows you to provide two complete index specifications as the parameters to the method; it will evaluate both of the specifications and return a set of results that joins the results of the two specifications together in a pairwise manner.


so using the   getComplexJoinedFields  method defined in that example source file you could use the following specifications:


browse_title = getComplexJoinedFields("245[a-z]", "245abkp,titleSortLower", " --|-- ")
browse_title |= getComplexJoinedFields("245[a-z], stripInd2", "245abkp,titleSortLower", " --|-- ")


and the results would be field value(s) containing two entries separated by the separator string " --|-- "


browse_title : The ship of death [sound recording] / Peter Dickson Lopez. --|-- ship of death
browse_title : ship of death [sound recording] / Peter Dickson Lopez. --|-- ship of death

for records where there is no initial article only one field value would be generated


browse_title : Ethnographica et folkloristica Carpathica. --|-- ethnographica et folkloristica carpathica
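For readers who want to picture the pairwise join and how a consumer could take a joined value apart again, here is a minimal Java sketch. It does not reproduce the JoinFieldsMixin source; the class and method names are hypothetical, and only the " --|-- " separator comes from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class JoinedFieldSketch {
    static final String SEPARATOR = " --|-- ";

    /** Joins display and sort values position by position. */
    static List<String> joinPairwise(List<String> displays, List<String> sortKeys) {
        if (displays.size() != sortKeys.size()) {
            throw new IllegalArgumentException("Lists must be the same length");
        }
        List<String> joined = new ArrayList<>();
        for (int i = 0; i < displays.size(); i++) {
            joined.add(displays.get(i) + SEPARATOR + sortKeys.get(i));
        }
        return joined;
    }

    /** Splits one joined value back into {display, sortKey}. */
    static String[] split(String joined) {
        // Pattern.quote keeps the '|' in the separator from acting as regex alternation.
        return joined.split(Pattern.quote(SEPARATOR), 2);
    }

    public static void main(String[] args) {
        List<String> joined = joinPairwise(
                Arrays.asList("The ship of death.", "ship of death."),
                Arrays.asList("ship of death", "ship of death"));
        for (String value : joined) {
            String[] parts = split(value);
            System.out.println(parts[0] + " sorts as " + parts[1]);
        }
    }
}
```

Because each display string travels with its own sort key in a single value, the two can no longer drift out of sync the way parallel fields can.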



-Bob Haschart


Sent: Wednesday, July 17, 2019 11:25:56 AM
To: Tod Olson <t...@uchicago.edu>
Cc: Sean Filipov <se...@uchicago.edu>; Vufin...@lists.sourceforge.net <Vufin...@lists.sourceforge.net>; solrma...@googlegroups.com <solrma...@googlegroups.com>
Subject: [solrmarc-tech] RE: [VuFind-Tech] [EXTERNAL] Write Java indexing code
 

Demian Katz

Jul 17, 2019, 12:34:15 PM
to solrma...@googlegroups.com, Tod Olson, Sean Filipov, Vufin...@lists.sourceforge.net

Thanks, Bob, this is all extremely helpful! 😊

 

- Demian

 

From: solrma...@googlegroups.com <solrma...@googlegroups.com> On Behalf Of Haschart, Robert J (rh9ec)

Tod Olson

Jul 18, 2019, 9:07:15 AM
to Demian Katz, Tod Olson, Sean Filipov, Vufin...@lists.sourceforge.net, solrma...@googlegroups.com
Hi Demian,

The new solrmarc syntax is new to us also, but it's interesting to see the += operator used. That might do it. Our specs for title browse might be described as pathologically complete. From our current production:

title_browse = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowse(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t)
marc.properties:title_browse_sort = custom(org.vufind.index.ucTitleBrowseFunctions), titleBrowseSort(130adfklmnoprs:210ab:240adfklmnoprs:243adfklmnoprs:245abfgknps:246abnp:247abnp:534t:700fklmnoprst:710fklmnoprst:711fklnpst:730adfklmnoprs:740anp:775t:776t:LNK130adfklmnoprs:LNK210ab:LNK240adfklmnoprs:LNK243adfklmnoprs:LNK245abfgknps:LNK246abnp:LNK247abnp:LNK534t:LNK700fklmnoprst:LNK710fklmnoprst:LNK711fklnpst:LNK730adfklmnoprs:LNK740anp:LNK775t:LNK776t) 

The new syntax would let us express something about this in single lines. And there are one or two of these fields where the non-filing indicator is ind1, not ind2. So that can be clearly expressed.

The problem that I anticipate is that duplicate entries may be thrown away and the fields become different lengths. We see this in practice: a record can have two different display strings which normalize to the same sort key, and that puts the fields out of joint. Here's a contrived example:

245 _4 ‡aThe Thing
246 __ ‡a Thing

title_browse => "The Thing", "Thing"
title_browse_sort => "Thing", "Thing" => (passes through a Set somewhere) => "Thing"

And then the record is rejected from the browse indexes because the list of display strings and the list of sort keys are of different lengths. To fix this, we'll need to change the way we build the browse indexes. I need to lay this out in VUFIND-1342.

So I think we can try this and see how it does.

Stepping away from this specific issue and looking more generally, there is a question in here about whether there is a way to use the expressive power of the spec language, or at least the tag strings, in local custom code. If a site is writing some custom indexing function where they do not want to hard-code the fields involved, is there some library functionality to help take advantage of the tag string syntax, or does that logic need to be recreated in that custom function?

-Tod

Tod Olson

Jul 18, 2019, 9:07:15 AM
to Robert Haschart, Tod Olson, solrma...@googlegroups.com, Sean Filipov, Vufin...@lists.sourceforge.net
Bob,

Thanks for these responses. These do look like possible solutions for both the near-term and longer-term.

I really like the extended flexibility of the expression language. I'd actually like a better understanding of the implementation as I think it could be quite instructive. However, I've not really cracked that nut, nor really figured out the right entry point. Limited time and all. If you and Oliver ever feel motivated to put together a little outline of the implementation or hints on how to understand the code, there would be at least a small audience.

Thanks again!

-Tod

Demian Katz

Jul 24, 2019, 3:44:06 PM
to Sean Filipov, Haschart, Robert J (rh9ec), solrma...@googlegroups.com, Vufin...@lists.sourceforge.net, Tod Olson

Did you consider the possibility of using an inline translation map to strip the trailing characters in a more targeted way, rather than relying on a modifier? That might work. See the prefixing example here for a rough idea of how you can manipulate strings with regular expressions in SolrMarc:

 

https://vufind.org/wiki/indexing:solrmarc#customizing_record_ids

 

- Demian

 

From: Sean Filipov <se...@uchicago.edu>
Sent: Wednesday, July 24, 2019 3:09 PM
To: Haschart, Robert J (rh9ec) <rh...@virginia.edu>
Cc: solrma...@googlegroups.com; Vufin...@lists.sourceforge.net; Tod Olson <t...@uchicago.edu>; Demian Katz <demia...@villanova.edu>
Subject: Re: [VuFind-Tech] [EXTERNAL] Write Java indexing code

 

Hi Bob,

 

I tried to use specification modifiers for the title and journal browse/browse_sort fields. They are powerful tools, but stripPunct removes more than we want. For example, it removes hyphens, but we want to preserve them. Would it be possible to add one more modifier, such as ‘stripCharacters’, that takes a string of characters to strip, for example stripCharacters(*,|,\)? We tried to go through the SolrMarc code, but it’s hard to follow how all the abstract classes work together and which classes are called.

If you believe this code modification is not very challenging to do, could you tell us how your code works so we can start planning this modification?

 

Thank you



 "title_browse":["The sea-atlas.",
          "The sea atlas.",
          "The watter-world.",
          "The watter world.",
          "The water world.",
          "De zee-atlas ofte water-waereld.",
          "The sea-atlas, or, The watter-world shewing all the sea-coasts of y known parts of y earth with a generall doscription of the same : verie vsefull for all masters & mates of shipps & likwise for merchants newly sett forth."],
        "title_browse_sort":["sea atlas",
          "watter world",
          "water world",
          "zee atlas ofte water waereld",
          "sea atlas or the watter world shewing all the sea coasts of y known parts of y earth with a generall doscription of the same verie vsefull for all masters mates of shipps likwise for merchants newly sett forth"]},
      

 

        "title_browse":["The Quakers shaken.",
          "A fire-brand snach'd out of the fire.",
          "A fire brand snach'd out of the fire.",
          "The Quakers shaken, or, A fire-brand snach'd out of the fire being a briefe relation of Gods wonderfull mercie extended to John Gilpin of Kendale in Westmoreland : who as will appeare by the sequel, was not only deluded, but possessed by the devill."],
         "title_browse_sort":["quakers shaken",
          "fire brand snach d out of the fire",
          "quakers shaken or a fire brand snach d out of the fire being a briefe relation of gods wonderfull mercie extended to john gilpin of kendale in westmoreland who as will appeare by the sequel was not only deluded but possessed by the devill"]},
      {

 

Haschart, Robert J (rh9ec)

Jul 24, 2019, 5:09:04 PM
to Demian Katz, Sean Filipov, solrma...@googlegroups.com, Vufin...@lists.sourceforge.net, Tod Olson

I think Demian is on the right track here.


Doing a quick test of the following two specifications:


title0 = 245ab, titleSortLower
title1 = 245ab, toLower, stripInd2, stripAccent, clean, join(" "), filter("[=&(),'.]|\\[|\\]=>\ "), filter("[ ][ ]+=>\ ")


returns almost identical results.  


this part:

filter("[=&(),'.]|\\[|\\]=>\ ")

says for any parts of the generated result that matches one of these characters    = & ( ) , ' . [ ]    replace that character with a space.


this part:

filter("[ ][ ]+=>\ ")

replaces runs of 2 or more spaces with a single space.


if the result contains a hyphen  or a colon or a semi-colon   or any other punctuation NOT listed, those punctuation marks will be untouched.
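The same two replacements can be tried out in plain Java to see which characters survive. This sketch assumes the pattern part of each filter behaves like an ordinary Java regular expression; the class and method names are made up for illustration, and the rest of the modifier chain (toLower, stripInd2, and so on) is not modelled.

```java
final class FilterSketch {
    /** Replaces the listed punctuation with spaces, then collapses runs of spaces. */
    static String applyFilters(String value) {
        // Mirrors the first filter: = & ( ) , ' . [ ] each become a space.
        String spaced = value.replaceAll("[=&(),'.]|\\[|\\]", " ");
        // Mirrors the second filter: two or more spaces become one.
        return spaced.replaceAll("[ ][ ]+", " ");
    }

    public static void main(String[] args) {
        // Commas are replaced and the extra spaces collapsed; hyphens are kept.
        System.out.println(applyFilters("the sea-atlas, or, the watter-world"));
    }
}
```

Adding or removing a character in the first character class is all it takes to change which punctuation is stripped, which is what makes this approach more targeted than stripPunct.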


-Bob



From: Demian Katz <demia...@villanova.edu>
Sent: Wednesday, July 24, 2019 3:44:00 PM
To: Sean Filipov <se...@uchicago.edu>; Haschart, Robert J (rh9ec) <rh...@virginia.edu>
Cc: solrma...@googlegroups.com <solrma...@googlegroups.com>; Vufin...@lists.sourceforge.net <Vufin...@lists.sourceforge.net>; Tod Olson <t...@uchicago.edu>
Subject: RE: [VuFind-Tech] [EXTERNAL] Write Java indexing code
 

Tod Olson

Jul 25, 2019, 10:16:27 AM
to solrma...@googlegroups.com, Tod Olson, Demian Katz, Sean Filipov, Vufin...@lists.sourceforge.net
Thanks for all of the help with this. We may be able to get rid of more custom indexing code than we had hoped.

I'm planning to submit a PR to add an ant task to build the Javadoc. There's a lot of code with no docs, but even seeing the relationships between classes, interfaces, and abstract classes is useful in navigating the code.

-Tod
