Issue with SOLR indexing of MODS fields with multiple values:

dcha...@uncg.edu

Aug 29, 2018, 10:50:22 AM
to islandora
When using Islandora Solr Metadata to create a custom display, I discovered what seems to be an issue with the way our MODS metadata is being indexed, specifically with respect to fields with multiple values (e.g. subject headings, personal names, etc.).

When I start to select a field (e.g. mods_subject_authority_topic), it looks like the wrong data types are being applied. If I understand correctly, a fieldname ending with "_ms" should be a STRING, not a TEXT field, as is the case below, since "_ms" indicates a "multiple value string".

001_solrModsSingleMultipleIssue.png

Actually, it seems like there should not be any fieldnames ending in "_s" at all if the field has multiple values (but I may be wrong about that).

This is what similar fieldnames look like in the sandbox (note "string" rather than "text_en"):


002_solrModsSingleMultipleIssue.png


For the record, the additional attributes like "authority" don't seem to affect this one way or another so that doesn't seem to be the issue.

The big problem for me is that it also causes a MAJOR problem with the display. Neither "version" of the field displays all the multiple values. If I choose to display "mods_subject_authority_topic_s" only the first value displays. If I choose to display "mods_subject_authority_topic_ms" only the second and all subsequent values display.

So, the MODS XML looks like this:

<subject>
  <topic authority="lcsh">Subject 1</topic>
  <topic authority="lcsh">Subject 2</topic>
  <topic authority="lcsh">Subject 3</topic>
  <topic authority="lcsh">Subject 4</topic>
</subject>

While the display looks like this:

003_solrModsSingleMultipleIssue.png


When it SHOULD look like this (with everything displaying together as part of the same field):

004_solrModsSingleMultipleIssue.png


Jared Whiklo

Aug 29, 2018, 11:06:19 AM
to isla...@googlegroups.com
My guess is that the problem is in your Solr schema.

I would find your Solr "conf" directory, look for the schema.xml in there, and see how the <dynamicField name="*_ms" entry is defined.

If the type is text_en, then you've found your problem.
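
For anyone who wants to script that check, here is a minimal sketch in Python. It only reads the dynamicField entries and reports the type behind one suffix. The embedded SAMPLE_SCHEMA is a made-up excerpt (not anyone's real schema.xml), and the function names are hypothetical; swap in the contents of your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt standing in for a real schema.xml; the types shown
# here are assumptions chosen to reproduce the symptom in this thread.
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="false"/>
    <dynamicField name="*_ms" type="text_en" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>"""


def dynamic_field_types(schema_xml):
    """Map each dynamicField name pattern to its (type, multiValued) pair."""
    root = ET.fromstring(schema_xml)
    return {
        field.get("name"): (field.get("type"), field.get("multiValued", "false"))
        for field in root.iter("dynamicField")
    }


def check_suffix(schema_xml, pattern="*_ms", expected_type="string"):
    """Return a short human-readable verdict for one dynamicField pattern."""
    fields = dynamic_field_types(schema_xml)
    if pattern not in fields:
        return f"{pattern}: no dynamicField entry found"
    field_type, multi = fields[pattern]
    if field_type != expected_type:
        return f"{pattern}: type is {field_type}, expected {expected_type}"
    return f"{pattern}: ok ({field_type}, multiValued={multi})"


if __name__ == "__main__":
    print(check_suffix(SAMPLE_SCHEMA))
    # For a real file, pass its bytes instead of the sample, e.g.
    # print(check_suffix(open("schema.xml", "rb").read()))
```

A "no dynamicField entry found" result for "*_ms" would also be worth knowing about, since it suggests the file being read is not the schema you expected.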

cheers,
jared

--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
Oh, they have the Internet on computers now. -- Homer Simpson

dp...@metro.org

Aug 29, 2018, 11:18:21 AM
to islandora
Adding to Jared's very good answer:

1. Even if you change the field type definition inside your schema.xml file, you will have to completely delete your Solr core (or shard or index, depending on what you have there) and reindex your repo. Once a field has been stored/indexed as a certain type, it stays like that. A schema.xml file can be seen as a hint for Solr when it creates fields for the first time; it has no effect on data that is already in place.

2. Also: if a MODS document contains a single value for a certain tag/element, GSearch's "slurp all MODS" transformation will only create a *_s Solr field for it. If another Islandora object, with another MODS document, has 3 values for the same tag/element, it will create a *_s field with the first value and a *_ms field with the following 2 (that is what you are seeing in your output). To get around that issue/limitation or expected behavior (depending on where you come from), you need to add a copyField instruction to your schema.xml so that all _s fields also get copied into _ms. That way your Solr fields are uniform and everything is always inside your _ms fields (copyField is additive: if values are already there, it adds to them rather than replacing them). Copying every _s field to _ms can be expensive, so you can also choose just the ones you feel are most important.
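
To make point 2 concrete, here is a small Python model of the behavior described above. It is only a sketch: it does not run GSearch or Solr, the function names are hypothetical, and the order in which a copied value lands in the destination field is an assumption of this model, not something Solr is claimed to guarantee.

```python
def index_values(base_name, values):
    """Model the split described above: first value -> *_s, the rest -> *_ms."""
    doc = {}
    if values:
        doc[base_name + "_s"] = values[0]
    if len(values) > 1:
        doc[base_name + "_ms"] = list(values[1:])
    return doc


def apply_copy_field(doc, source_suffix="_s", dest_suffix="_ms"):
    """Model an additive copy of every *_s value into the matching *_ms field."""
    # Copy lists so the input document is left untouched.
    result = {k: (list(v) if isinstance(v, list) else v) for k, v in doc.items()}
    for key, value in doc.items():
        if key.endswith(source_suffix):
            dest = key[: -len(source_suffix)] + dest_suffix
            # Additive: append to whatever the destination already holds.
            result.setdefault(dest, []).append(value)
    return result


if __name__ == "__main__":
    topics = ["Subject 1", "Subject 2", "Subject 3", "Subject 4"]
    doc = index_values("mods_subject_authority_topic", topics)
    print(doc)                    # *_s has one value, *_ms has the other three
    print(apply_copy_field(doc))  # *_ms now holds all four values
```

With the copy in place, an object that has a single topic ends up with that value in both the _s and the _ms field, so a display built on the _ms field covers both cases.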

Hope this helps

Diego Pino
Metro.org

Donnie Hardin

Aug 29, 2018, 12:27:38 PM
to isla...@googlegroups.com
Hmm, well, this is interesting. I had a look at my schema.xml file, and it does not look anything like the one from the Discovery Garden basic_solr_config repo that I was SURE I put there. So strange. Here is what my current schema.xml file's "dynamicField name" entries look like (note that there is no "*_ms" entry):




Now here is the schema.xml file from DiscoveryGarden basic_solr_config (again, not sure how I missed placing it)



Diego, should I put the Discovery Garden basic_solr_config schema.xml file in place of the current one, then delete my Solr core, and then reindex? I assume it is a "core" and not a shard or an index, but I'm not sure. Can you clue me in to the best way to delete and reindex? Should I do that from the terminal or via the Solr admin page?
 




--
Donald Hardin
I.T. Professional: Linux administrator - LAMP application development
Electronic Resources and Information Technology
University Libraries
The University Of North Carolina at Greensboro 
P.O. Box 26170
Greensboro, NC 27402-6170

dp...@metro.org

Aug 29, 2018, 5:00:34 PM
to islandora
Hi Donnie, just to be 100% sure before you go deleting schema files and cores, can you make SURE the first schema.xml is in fact the one in use? You can check/compare the schema.xml file visible on the Solr admin interface in your

The reason I ask is that Solr's distribution ships with a lot of schema files, demo files, etc., and I, at least, have ended up with a few of them lying around many times.

Once you are sure what you are replacing and where you are replacing it (/usr/local/solr or wherever you have your core) with the correct DGI schema.xml file, and you have double-checked that all the copyFields and dynamicFields you want are there (OCR, etc.; remember reindexing is slow, so better safe than sorry), you can go ahead.


Sorry for confusing you with cores, shards and indexes. A core is the physical "Lucene index" plus its configuration, files, etc., i.e. the live instance. The index is the indexed/stored info on disk, the "data" of a core. A shard is what you get when you are dealing with SolrCloud and your index is split/distributed. And guess what: a collection can be a single core, many cores, multiple shards, etc. Weird, but collections are a meta-grouping of documents! So I'm 99.9% sure you just need to delete your index and then reload your core or restart Solr to make sure the new schema sticks. *just follow that link =)
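
As a rough illustration of the "delete your index" step, one common way to empty a Solr index without touching files on disk is to post a delete-by-query message to the core's update handler. The Python sketch below only builds that request and does not send it. The URL is an assumption (use the update handler of your own core), the function name is hypothetical, and reindexing the repository afterwards is a separate step not shown here.

```python
import urllib.request

# Solr's XML update message for "delete every document".
DELETE_ALL_XML = b"<delete><query>*:*</query></delete>"


def build_delete_all_request(update_url):
    """Build (but do not send) a POST that deletes all documents and commits."""
    return urllib.request.Request(
        url=update_url + "?commit=true",
        data=DELETE_ALL_XML,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed URL for illustration only; replace with your own update handler.
    request = build_delete_all_request("http://localhost:8080/solr/update")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending is left commented out on purpose, since it wipes the index:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Keeping the build and send steps apart makes it easy to double-check which core the request points at before anything is deleted.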

Hope this helps and doesn't add to the confusion

D



--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+...@googlegroups.com.