SEVERE level errors in Solr -- cannot parse special characters of MODS extension elements

Peter MacDonald

12 Aug 2017, 12:47:05
to islandora
Islandora: 7x-1.7
Solr: 4.2

We use MODS extension elements in some of our Islandora sites.

Solr indexes the MODS extension elements and searching works fine with one exception: un-escaped special characters.

The Solr logging page is full of "Cannot parse" errors, raised when Solr tries to parse a query on a MODS extension field whose search string contains an un-escaped question mark or square bracket. Solr interprets an un-escaped question mark as a single-character wildcard and un-escaped square brackets as range-query delimiters.

Obviously, we want ?, [ and ] to be escaped so they are treated as just text characters. Strangely enough, AFAIK, we don't have this problem with our regular MODS elements -- just with the extension elements.
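
To make the escaping described above concrete, here is a minimal sketch in Python (not the PHP that Islandora itself uses). It is loosely modeled on the character list in SolrJ's ClientUtils.escapeQueryChars; the function names escape_query_chars and field_query are made up for illustration and are not part of Islandora or Solr.

```python
# Characters the classic Lucene/Solr query parser treats as syntax.
# Assumption: this list mirrors SolrJ's ClientUtils.escapeQueryChars.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(value: str) -> str:
    """Backslash-escape every query-syntax character and whitespace."""
    out = []
    for ch in value:
        if ch in SPECIAL_CHARS or ch.isspace():
            out.append("\\")
        out.append(ch)
    return "".join(out)


def field_query(field: str, value: str) -> str:
    """Build a 'field:value' clause with the value escaped (hypothetical helper)."""
    return f"{field}:{escape_query_chars(value)}"


if __name__ == "__main__":
    print(field_query("mods_extension_dhiapw_prison_name_s",
                      "Florence Central (Florence, Arizona)"))
    print(escape_query_chars("Is this [a] question?"))
```

With escaping like this, ?, [ and ] reach the parser as plain text rather than as wildcard or range syntax.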

The Solr parser also has trouble with queries containing a percent sign as in:
Cannot parse 'mods_extension_dhiapw_prison_name_s:"United\ States\ Penitentiary%': Lexical error at line 1, column 67.  Encountered: <EOF> after : "\"United\\ States\\ Penitentiary%"
The full element value is: "United States Penitentiary (Terre Haute, Indiana)"

Here is an example of a failure to parse a string containing an apostrophe. Solr chokes at the percent sign that appears where the apostrophe should be.
Value: "Teachers's Aid"
Cannot parse '-mods_extension_dhiapw_prison_work_s:"Teacher%': Lexical error at line 1, column 47. Encountered: <EOF> after : "\"Teacher%"

Here are a few other examples (I have dozens of others too):
Value: Florence Central (Florence, Arizona)
Cannot parse 'mods_extension_dhiapw_prison_name_s:"Florence\ Central\ \(Flor': Lexical error at line 1, column 63.  Encountered: <EOF> after : "\"Florence\\ Central\\ \\(Flor"

Value: Calipatria State Prison (Calipatria, California)
Cannot parse 'mods_extension_dhiapw_prison_name_s:"Calipatria\ State\ Prison\ \(Calipatria,\ Califor': Lexical error at line 1, column 87.  Encountered: <EOF> after : "\"Calipatria\\ State\\ Prison\\ \\(Calipatria,\\ Califor"

As you can see, spaces, dashes, double quotation marks, and even some parentheses are escaped just fine, but when a parenthesis gets represented by a percent code in the query, the Solr parser gets confused and throws the "Cannot parse" error again.

Percent signs seem to always cause a Solr parsing failure.
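
One possible reading of the trailing "%", "%2" and "%5" fragments, offered only as an assumption and not confirmed anywhere in this thread: the escaped query was URL-encoded, cut off at some length in the middle of a percent escape (for example "%5C" for a backslash or "%20" for a space), and then decoded. The Python sketch below reproduces that effect; simulate_truncated_request and has_dangling_percent are hypothetical names used only for illustration.

```python
import re
from urllib.parse import quote, unquote


def simulate_truncated_request(escaped_query: str, max_len: int) -> str:
    """Percent-encode a query, cut it at max_len characters, then decode it.

    Assumption: something between the browser and Solr truncates the
    encoded URL. Incomplete escapes such as '%' or '%5' cannot be decoded
    and are left in the result as literal text.
    """
    encoded = quote(escaped_query, safe="")
    return unquote(encoded[:max_len])


def has_dangling_percent(query: str) -> bool:
    """True if the query ends in an incomplete percent escape."""
    return re.search(r"%[0-9A-Fa-f]?$", query) is not None


if __name__ == "__main__":
    # 'y\ ' encodes to 'y%5C%20'; cutting inside an escape leaves debris.
    for cut in (2, 3, 4):
        result = simulate_truncated_request("y\\ ", cut)
        print(cut, repr(result), has_dangling_percent(result))
```

If this reading is right, the opening double quote is never closed because the rest of the value was cut away, which would match the "Encountered: <EOF>" part of the messages.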

Let me add that I don't even know why these queries are being sent to Solr in the first place. They arrive a couple of times every hour, 24/7, but the fedora.log doesn't show any user-generated queries happening at those exact times. Could it be a cron job, automatic Solr indexing, or something else?

My Questions: 
1. What can I do to ensure that these MODS extension elements get the special characters escaped properly before they are sent to Solr?
2. What automatic mechanism would be sending these queries periodically to Solr 24/7?
3. What does it mean when the query starts with a dash as in:
 Cannot parse '-mods_extension_dhiapw_prison_name_s:"State\':
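
On question 3, for what it is worth: in the standard Lucene/Solr query syntax a leading dash is the prohibit (NOT) operator, so -field:"value" asks for documents that do not have that value in the field. Where these particular clauses originate is not established in this thread. A small Python sketch of reading such a clause follows; split_filter is a hypothetical helper, not an Islandora or Solr function.

```python
def split_filter(clause: str):
    """Split 'field:value' or '-field:value' into (negated, field, value).

    A leading '-' marks the clause as prohibited (NOT) in Lucene syntax.
    """
    negated = clause.startswith("-")
    body = clause[1:] if negated else clause
    # Split on the first colon only; the value may contain colons itself.
    field, _, value = body.partition(":")
    return negated, field, value


if __name__ == "__main__":
    print(split_filter('-mods_extension_dhiapw_prison_name_s:"State"'))
    print(split_filter('mods_extension_dhiapw_prison_name_s:"State"'))
```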

Thank you for reading this far.

Peter

====================
P.S. - Extra examples
====================

If you want more examples of queries that can't be parsed, look at these:

"None\ \-\ Still\ looking\ but%':
"None\ \-\ Still\ looking\ ':
"Tutor\ for\ prisoners\ pursuin':
"Teacher's\ aide\ \-\%2':
"Para\-legal\ \/\ Law\ Lib':
"Mentor\/Mentor\ instructor\ overseeing%5':
"Mentor\/Mentor\ instructor\ over':
"Mentor\/Mentor\ instructor\ ':
"Inmate\ barber,?\ college\/semi':
"Facilitator\ for\ Education\%':
"Administrative\ clerk\ in\ a\%':
"Administrative\ clerk\ in\ a\ sensitive\ security%5':
"7th\ year\ in\ SHU\/Iso\':
"7th\ year\ in\ SHU\/Iso\/A':
"Minimum\ \-\ Prison\ Cam':
"Minimum\ \-\ Prison\ Cam':
"Maximum\ \(1993\-2014\)\ Medium\ \(2014':
"Level\ IV\ and\ Level\%':
"Level\ 4\ Maximum\ Se':
"USP\ Tucson\ \(Tucson,?\%':
"Thumb\ Correctional\ Facility\':
"Southern\ Ohio\ Correctional%':
"SCI\-Albion\ \(Albion,?\ Pe':
"San\ Quentin\ State\ Prison\':
"San\ Quentin\ State\ Prison\':
"San\ Quentin\ State\ Prison%5':
"Mule\ Creek\ State\ Prison\%':
"Mississippi\ Department\ of\%2':
"Maine\ State\ Prison\ \(Thomaston,?\ Maine\)%2':
"Maine\ State\ Prison\ \(Thomaston,?\ Maine\)%':
"Kern\ Valley\ State\ Prison\':
"ESP\ \-\ Ely\ State\ Prison\ \(Ely,?\ N':
"ESP\ \-\ Ely\ State\ Prison':
"Eastern\ Correctional\ Facility\':
"Dixon\ Correctional\ Center%':
"Curran\ Fromhold\ Correctiona':
"CS':
"Commins\ Unity\ \(Grady%2':
"Chillicothe\ Correctional\ Institution\ \(Chillicothe,?\ Ohio\)':
"California\ State\ Prison\':
"California\ State\ Prison\ \-\ Los\ Angeles\ County\ \(Pro':
"California\ State\ Prison\ %5':
"California\ Prison\ State\':
"California\ Correctional\ Center\ \(Susanville,?\%':
"Brown\ Creek\ Correctional\ ':
"Bra':
"AZ.\ State\ Prison\ Compl':

dp...@metro.org

15 Aug 2017, 14:46:13
to islandora
Peter,
Your errors don't seem to be on the indexing side of things but on the querying side. Did you check whether the documents in your Solr index actually contain the info? I would guess yes, and if so, you are fine on that side and no escaping is required there at all. The issue is in fetching: how the query string is converted into a Solr query.
Give 7.x-1.x (head) a try. A few encoding fixes (form submission, facets and breadcrumbs) were merged recently (a few months ago?) into Islandora Solr Search to deal with wrongly escaped characters while searching.
If errors persist after upgrading, we can see what other fixes can be applied.

best

Diego Pino

Peter MacDonald

15 Aug 2017, 21:29:57
to isla...@googlegroups.com
Thanks, Diego:

I have already updated one set of our Islandora modules to 7x-1.9, but that set does not run the site I'm having trouble with. I'll be updating the set that does within the month.

I hope that will help solve my problem.

Peter

--
Peter MacDonald,
Library Information Systems Specialist
Hamilton College Library
Clinton, New York
315 859-4493
pmacdona-hamilton (Skype)
