Search with hyphen (-)

Carolina Melo

Oct 4, 2017, 2:41:52 PM
to AtoM Users
Hi, everyone!

I noticed that descriptions containing a hyphen (this character: - ) can't be found by searching, at least in my application.
I'm using the Brazilian Portuguese interface.
When I search for the description below, I get no results.


Is anyone else having the same problem?

Regards,

Carolina Melo

Dan Gillean

Oct 6, 2017, 12:13:29 PM
to ICA-AtoM Users
Hi Carolina,

Sorry for the delay in replying to this. I've managed to reproduce this behavior in English as well. It's strange! 

You can get results if you put your search terms in "quotation marks".

Alternatively (and strangely), if you search for the same string without the hyphen, you will get results - i.e. if you search for SUMÁRIO DOCUMENTOS BRASÍLIA.docx

I do know a bit about how this works, but I'm still surprised by the result. When Elasticsearch adds strings to the search index, each string is "tokenized" - that is, it is broken up into individual tokens, roughly one per word. Search results are then returned by matching tokens and performing a number of other relevance calculations. For example, in 2.4, a match in the title field is given more weight than a match in a notes field, but there are many other factors, such as how long the matching field is and how often the term appears. So if you search for "test", a record called Test will rank higher than one called Test One Two Three.
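To make the tokenizing and length-weighting idea concrete, here is a minimal Python sketch. It is only a toy model: the regex-based `tokenize` and the `toy_score` function are made up for illustration, and Elasticsearch's real analyzers and scoring formula are considerably more involved (for instance, a real analyzer may treat the period in a filename differently).

```python
import re

def tokenize(text):
    """Toy tokenizer: lowercase, then keep runs of letters, digits and underscores."""
    return re.findall(r"\w+", text.lower())

def toy_score(query, title):
    """Fraction of the title's tokens that match a query token (0.0 to 1.0)."""
    title_tokens = tokenize(title)
    if not title_tokens:
        return 0.0
    query_tokens = set(tokenize(query))
    hits = sum(1 for token in title_tokens if token in query_tokens)
    return hits / len(title_tokens)

# The hyphen never becomes a token in this toy model.
print(tokenize("SUMÁRIO - DOCUMENTOS BRASÍLIA.docx"))
# The shorter title scores higher for the same single match.
print(toy_score("test", "Test"), toy_score("test", "Test One Two Three"))
```

In this sketch the one-word title scores 1.0 and the four-word title 0.25, which mirrors the ordering described above without claiming to reproduce the actual numbers Elasticsearch would compute.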

When strings are tokenized, some things are deliberately NOT indexed, to prevent too many irrelevant results from overwhelming the results set. Some stop words (such as "a", "an", "the", etc.) are removed based on per-culture stop-word lists. We don't have such a list for every culture at present, but they are used where available, so that a search doesn't return every record that has "the" in it (this is far too common a word, and nearly every record might conceivably be returned).
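Stop-word removal can be sketched in a few lines of Python. The three-word `STOP_WORDS` set below is a hypothetical stand-in for a real per-culture list, and `index_terms` is an illustrative name, not something from AtoM or Elasticsearch.

```python
import re

# Hypothetical, tiny English stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the"}

def index_terms(text, stop_words=STOP_WORDS):
    """Tokenize the text, then drop any token found in the stop-word list."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(index_terms("The History of a Town"))  # ['history', 'of', 'town']
```

With a list like this, a search for "the" on its own has nothing left to match against, which is the intended effect.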

Another thing that I believe happens is that special characters are not tokenized - so there is likely no hyphen in the index at all. In fact, the hyphen appears to be a special case, because it is a reserved character in Elasticsearch: it excludes the term immediately following it from the results. For example, a search for -tea time means return results that have time, but do NOT include the word tea. This is noted in the AtoM documentation.
For reference, from the ES documentation, here are all the reserved characters: 

If you need to use any of the characters which function as operators in your query itself (and not as operators), then you should escape them with a leading backslash. For instance, to search for (1+1)=2, you would need to write your query as \(1\+1\)\=2.
The reserved characters are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /
Failing to escape these special characters correctly could lead to a syntax error which prevents your query from running.
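As a rough sketch of what that escaping rule looks like in practice, here is a small Python helper. The name `escape_query` is made up for this example and is not part of AtoM or any Elasticsearch client. It also assumes, as I understand the ES documentation to say, that < and > cannot be escaped and can only be removed from the query.

```python
import re

# Single characters that act as operators in the query syntax.
# "&" and "|" are escaped one at a time, which also covers "&&" and "||".
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')

def escape_query(text):
    """Backslash-escape reserved characters so they are treated literally."""
    # Assumption: "<" and ">" cannot be escaped, so they are dropped instead.
    text = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", text)

print(escape_query("(1+1)=2"))               # \(1\+1\)\=2
print(escape_query("SUMÁRIO - DOCUMENTOS"))  # SUMÁRIO \- DOCUMENTOS
```

Escaping only changes how the query is parsed. It does not change what was stored in the index, so an escaped hyphen can still fail to match if the hyphen was never indexed.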


Interestingly, this attempt to escape the special character does not seem to work on your site - searching for SUMÁRIO \- DOCUMENTOS BRASÍLIA.docx still does not return any results. 

What I think is happening is therefore a mismatch between how terms are submitted for searching, and how they are indexed. 

So when you search for SUMÁRIO - DOCUMENTOS, AtoM is actually searching for SUMÁRIO AND - AND DOCUMENTOS. However, because the hyphen is dropped rather than tokenized when ES indexes the record, no record appears to include all of the requested elements. It is even possible that the query is being interpreted as a special search - i.e. return records that have SUMÁRIO AND DOCUMENTOS but do not include a space (the hyphen here comes before a space) - which would certainly return zero results. I'm not sure that's what is happening, but it is one possibility.
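The suspected mismatch can be modelled in Python. This is purely a sketch of the hypothesis, not AtoM's or Elasticsearch's actual code: `index_tokens`, `query_terms` and `matches_all` are hypothetical names, and the real query parser is certainly more sophisticated than splitting on whitespace.

```python
import re

def index_tokens(document):
    """Index side (toy model): punctuation such as the hyphen is dropped."""
    return set(re.findall(r"\w+", document.lower()))

def query_terms(query):
    """Query side (toy model): split on whitespace only, so "-" stays a term."""
    return query.lower().split()

def matches_all(query, document):
    """AND semantics: every query term must be present in the index."""
    tokens = index_tokens(document)
    return all(term in tokens for term in query_terms(query))

record = "SUMÁRIO - DOCUMENTOS BRASÍLIA.docx"
print(matches_all("SUMÁRIO - DOCUMENTOS", record))  # False: "-" is not in the index
print(matches_all("SUMÁRIO DOCUMENTOS", record))    # True
```

If something like this is going on, the query fails only because one of its required terms can never exist in the index.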

If you put the search terms in quotation marks, then ES ignores the individual tokens and looks for exact string matches in the exact order given - which is why it finds results. If you search without the hyphen, then it IS finding records with SUMÁRIO AND DOCUMENTOS, so once again you are getting the matches you expect.
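One way a quoted search could still succeed is if the quoted phrase is analyzed the same way as the indexed text, so the hyphen disappears on both sides and only the order of the remaining words is compared. The Python sketch below assumes exactly that. `analyze` and `phrase_match` are illustrative names, and this is a guess at the mechanism rather than a description of what ES does internally.

```python
import re

def analyze(text):
    """Toy analyzer applied to both the quoted phrase and the record."""
    return re.findall(r"\w+", text.lower())

def phrase_match(phrase, document):
    """True if the phrase's tokens occur consecutively, in order, in the document."""
    needle = analyze(phrase)
    haystack = analyze(document)
    if not needle:
        return False
    width = len(needle)
    return any(haystack[i:i + width] == needle
               for i in range(len(haystack) - width + 1))

record = "SUMÁRIO - DOCUMENTOS BRASÍLIA.docx"
print(phrase_match("SUMÁRIO - DOCUMENTOS", record))  # True
print(phrase_match("DOCUMENTOS SUMÁRIO", record))    # False: wrong order
```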

The question is what we could do to improve this, without making all other search results much worse. A search online reveals many different ways people have solved this when they want to be able to get results that include hyphens, but many of the solutions involve adding fixes and hacks for specific fields - which wouldn't be a solution for all of AtoM. We could potentially add a substitution, so that a dash is replaced with another character (like a space) before indexing - but I'm not sure what effect this might have on reference code searching. 
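To illustrate the substitution idea and its possible side effect, here is a small Python sketch. `replace_hyphens` is a hypothetical helper, and the reference code F-123-A is invented for the example.

```python
def replace_hyphens(text, substitute=" "):
    """Swap every hyphen for the substitute, then collapse repeated whitespace."""
    return " ".join(text.replace("-", substitute).split())

# The title becomes searchable as plain words.
print(replace_hyphens("SUMÁRIO - DOCUMENTOS"))  # SUMÁRIO DOCUMENTOS
# An invented reference code is broken into separate pieces.
print(replace_hyphens("F-123-A"))               # F 123 A
```

The second line shows the concern: a hyphenated reference code would be indexed as several separate terms, which might change how reference code searches behave.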

Elasticsearch is highly configurable, but extremely complex. I think for us to be able to make changes to resolve this issue, we'd need to have some community support for research, testing, and analysis to find the best solution that will work across cultures, for all of AtoM's different needs and use cases. 

In the meantime, I suggest searching with quotation marks, or without the hyphen character. It's not ideal, but it should return what you need.

Regards, 



Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/a2a29a9c-861d-4066-a237-423d8adb47b9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Carolina Melo

Oct 13, 2017, 7:57:27 AM
to AtoM Users
Thank you so much, Dan!
Your answer helped a lot, as always.

Best wishes,

Carolina Melo



