Is it possible to distribute values non-redundantly from one MARC field to multiple Solr fields?

Andreas Kahl

Aug 5, 2009, 2:19:09 PM

to solrmarc-general

Hello everyone,

I am trying to do some analysis that looked simple at first glance on
my MARC field 653a, which contains topics. To get better facets in
VuFind, I want to distribute the subject headings across different
facet fields. The mapping for string values can be provided in several
.properties files. If a string is not found in any of the .properties
files, it should be indexed in topic/topicStr by default.

An example:
(assuming that this is part of a single record containing four 653a
fields)
653a: Constitutional law => no match => topic
653a: History => no match => topic
653a: United States => geographic.properties => geographic
653a: Monograph => genre.properties => genre

Do you have any recommendations on how to achieve this?

Andreas

P.S. My attempts to solve this so far:
1.
I have tried to use SolrMarc's translation maps. Those worked fine for
genre and geographic, but I am not able to delete the recognized
values from topicStr. If I try to do so by providing a topic.properties
that maps everything mentioned in the other .properties files to empty
strings, the values are deleted entirely from all fields. The
order of lines in marc.properties does not have any effect on this.

2.
I do not delete anything in marc.properties; I only use the following:
geographic = 653a, geographic.properties
genre = 653a, genre.properties
topic = 653a
With that, my index and facets contain the values mapped in geographic
and genre according to the translation maps.
Unfortunately, my topic facet now contains those same values as well.
So I declared a stopword filter in schema.xml listing all values
already mapped in the translation maps. My schema.xml:

"
<fieldtype name="sswd" class="solr.StrField" sortMissingLast="true"
omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.sswd.txt"
ignoreCase="true"/>
</analyzer>
</fieldtype>
[...]
<field name="topicStr" type="sswd" indexed="true" stored="false"
multiValued="true"/>
[...]
<copyField source="topic" dest="topicStr"/>
"

To my disappointment, this did not remove any of the values from the
topicStr field, and therefore not from the facets either. The only
change was that a refinement on a facet value contained in
stopwords.sswd.txt returned no results.
I would have expected the strings to be removed entirely from the
'Topic' facet.

Robert Haschart

Sep 17, 2009, 5:47:05 PM

to solrmarc...@googlegroups.com, Andrea...@gmx.net

Andreas,

Sorry to be slow to respond. At the time you wrote your initial
message, there really was no way of doing what you want to do without
downloading the SolrMarc source code and creating a custom Java method
(or perhaps several methods). Then you would have had to use ant to
build a jar containing all of SolrMarc as well as your custom code.

Although creating, compiling and using a custom indexing function is not
too difficult (and if you want to try that, I can help you get
things set up to do so), there is now (or rather there soon will be)
another option.

The soon-to-be-released next version of SolrMarc will support a
Java-like scripting language. You will be able to define an index
specification like the following in your xxxx_index.properties file:

topic_multi_facet = script(split_topics.bsh), getTopicFacets(653a)

which will load and interpret the script file named split_topics.bsh
from a directory named scripts next to the Generic_VuFind_SolrMarc.jar
file. Then, for each record, it will execute the function named
getTopicFacets that you define in the file split_topics.bsh.

As an exercise for testing the capability of the index scripting
functions, I implemented an example that does substantially what you
describe below. The sample script is attached to this e-mail message.

-Bob Haschart
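
The attached script is not reproduced in this thread. As a rough
illustration only, the routing described in the original question could
be sketched in plain Java as below. This is not the attached script and
it does not use any SolrMarc API: the class name TopicRouter, the method
route, and the in-memory maps standing in for geographic.properties and
genre.properties are all hypothetical. Each 653a value goes to the first
facet whose map contains it, and anything unmatched falls through to
topic, so no value lands in more than one field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SolrMarc.
final class TopicRouter {

    /**
     * Routes each value to the first facet whose translation map contains it.
     * Values found in no map are collected under "topic".
     * facetMaps: facet name -> (source string -> indexed string).
     */
    static Map<String, List<String>> route(List<String> values,
                                           Map<String, Map<String, String>> facetMaps) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String facet : facetMaps.keySet()) {
            result.put(facet, new ArrayList<>());
        }
        result.putIfAbsent("topic", new ArrayList<>());

        for (String value : values) {
            String target = "topic";   // default when no map matches
            String mapped = value;
            for (Map.Entry<String, Map<String, String>> entry : facetMaps.entrySet()) {
                String hit = entry.getValue().get(value);
                if (hit != null) {
                    target = entry.getKey();
                    mapped = hit;
                    break;             // first match wins, so no duplicates
                }
            }
            result.get(target).add(mapped);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-ins for geographic.properties and genre.properties (assumed content).
        Map<String, Map<String, String>> facetMaps = new LinkedHashMap<>();
        facetMaps.put("geographic", Map.of("United States", "United States"));
        facetMaps.put("genre", Map.of("Monograph", "Monograph"));

        List<String> values =
                List.of("Constitutional law", "History", "United States", "Monograph");
        System.out.println(route(values, facetMaps));
    }
}
```

With the four 653a values from the example record, "United States"
ends up only under geographic, "Monograph" only under genre, and the
remaining two values only under topic. In a real indexing script the
maps would be loaded from the .properties files and the values read
from the MARC record.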