Order of search hits in a long list of returns


Patricia McGuire

Feb 4, 2014, 10:28:46 AM
to ica-ato...@googlegroups.com
I can't find this in the AtoM 2.0.1 manual under Searching, but I have seen threads saying that fields are weighted in search results. However, what if each of your hits is found in, say, the title? Then what determines the order of returns? For example (I'll be boring), search for Sara Turing on our trial site http://ec2-54-196-124-87.compute-1.amazonaws.com/ and you get 100 hits. The first four in order are all from the same collection (and repository):

Reference code GB GBR-0272 AMT/K/7/24

Title Sara Turing

Reference code GB GBR-0272 AMT/K/1 /1/85

Title Note by Sara Turing


Reference code GB GBR-0272 AMT/K/1

Title Letters mainly to his mother, Sara Turing


Reference code GB GBR-0272 AMT/A/10

Title Letters from various authors to Mrs Sara Turing


So it's not alphabetical on reference code or title. Of course, this forces users to facet their searching. Is that the intention?

-Patricia McGuire
King's College
Cambridge, UK


Dan Gillean

Feb 4, 2014, 2:35:27 PM
to ica-ato...@googlegroups.com
Hi Patricia,

There are several factors at work here.

Search weighting is something we hope to add back into AtoM 2.0, but at the moment it has not been implemented in the search for archival descriptions. You can see the settings that we had in place for ICA-AtoM (the 1.x version of our software) here: https://www.ica-atom.org/doc/Search_fields

Unless members of our community present strong arguments for changing these settings, it's likely that we will work toward re-implementing what seemed to work well in 1.x, and use that as a base from which to make further improvements.

2.x was a huge architectural change for AtoM, and one in which we put in many unsupported development hours. Consequently, the 2.0.0 release was not as polished as we might have dreamed of, and there are elements of the application that still require fine-tuning. Elasticsearch was one of the major architectural changes, and while it is a powerful search and analytic engine, we have not yet had the chance to experiment with all of its possibilities and fine tune these. Most of the default ES settings have been maintained in AtoM - with some analyzers put in place for autocomplete fields in the application. You can find the ES documentation here: http://www.elasticsearch.org/guide/

There are also a couple other elements that will inform the current search ordering for archival descriptions. Notably:

1) Most recent vs. Alphabetical: If you navigate to Admin > Settings > Global, you will see that there are two settings for sort - one for authenticated (i.e. logged in) users, and one for anonymous (i.e. not logged in) users. These settings determine the default sort order for either kind of user. "Last updated" will set the default sort for search and browse results to show the most recently created or updated records that match the query first, while alphabetic will sort them alphabetically.

2) ASCII sort vs. Natural sort: this is a problem that exists not just in AtoM, but in computer science in general - it is known as the difference between ASCII sorting (i.e. the order that the computer's file system uses, based on the ASCII table where, for example, "a" comes after A-Z) and natural sorting - that is, an alphabetical or alpha numeric sorting that makes sense to humans. Coming up with an algorithm that transforms the computer's native ASCII sort into one that presents information to the end-user in a natural sort order that makes sense and works every time is surprisingly difficult. There is an article that outlines some of the challenges here: http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html. You will note, by taking a look at the ASCII table, that a description whose title starts with "A" will be preceded by one starting with a number, which in turn will be preceded by one beginning with a quotation mark, which will be preceded by a description that begins with a space before its first character. Similarly, capitalization affects sort order - lowercase letters appear farther down in the sort order than uppercase ones.
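To make the difference concrete, here is a minimal Python sketch. It is not AtoM code: `natural_key` is a hypothetical helper, and the titles are made up for illustration. It compares Python's default string sort (which follows code-point order, the same as ASCII for these characters) with one possible "natural" key that ignores leading spaces and case and compares runs of digits as numbers.

```python
import re

def natural_key(title):
    """Hypothetical sort key: ignore leading/trailing spaces and case,
    and compare runs of digits as numbers rather than as characters."""
    parts = re.split(r"(\d+)", title.strip().casefold())
    # re.split with a capture group alternates text chunks (even indices)
    # and digit runs (odd indices), so types always line up when comparing.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Made-up titles, chosen to show each ASCII-ordering quirk described above.
titles = ["letter 10", "Letter 2", "apple", "Zebra", " Box 1", "3 maps"]

ascii_order = sorted(titles)                     # plain code-point order
natural_order = sorted(titles, key=natural_key)  # order using the sketch key

print(ascii_order)
print(natural_order)
```

In the plain sort, the title with a leading space comes first, then the one starting with a digit, then uppercase titles, then lowercase ones. With the sketch key, "Letter 2" sorts before "letter 10" and case no longer matters. This key still handles only a few of the cases the linked article discusses (it does nothing about punctuation or accented characters, for instance), which is part of why a general solution is hard.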

At some point in the future, we'd love to spend some time seriously analyzing this challenge and creating a natural sort order for AtoM - but it will take time and consideration, and will likely require development sponsorship for us to be able to dedicate the time the task requires. For now, if you are concerned about sort order, be sure to consider this when naming your records.

I have added a note about this in our documentation about the use of the Sort button (here) on our Navigation page (along with an ASCII table for reference), but the Search and Advanced search pages are still works in progress, and I have not added the information about this in either place yet.

I hope that helps explain the current behavior.

Regards,

Dan Gillean, MAS, MLIS
AtoM Product Manager / Systems Analyst,
Artefactual Systems, Inc.
604-527-2056
@accesstomemory



kjam...@ualberta.ca

Apr 26, 2016, 12:17:16 PM
to ICA-AtoM Users
Have there been any changes to this in the past 2 years? And is it possible for this to be tweaked at the institutional level (for example, to have rankings take into account the level of description, so that fonds and series level results appear above item level results)?

Dan Gillean

Apr 26, 2016, 4:41:03 PM
to ICA-AtoM Users
Hi there,

Unfortunately, not much has changed. The ASCII sort vs. natural sort issue would require analysis and development sponsorship for us to be able to really tackle. There is still no weighting for archival descriptions in the search results - though we do hope this might be sponsored for inclusion in the 2.4 release, along with several other improvements to our search index implementation, for better results.

Regarding preference based on levels of description - this is a tricky issue, actually. Levels of description are kept in a user-editable taxonomy - this allows users to delete or edit terms they don't use, and add new ones as needed. All terms are equal in the taxonomy, to account for differences in local practice - in Australia for example, the Series is often the highest level of description, while in the U.S. the concept of the Record group is still in use. I've seen some users organize their materials by creating a top-level collection, with a number of fonds as children - how would ranking behave in this case?

The fact is, we would have to customize the term edit page for the levels of description taxonomy so users are able to add a weight or sort order based on level - because we don't want to hard-code this, since our users are international and have wildly different local practices. There is also the issue of relevance: adding these user-editable weights might be useful for ordering the browsing of all levels, but when a search is performed, the results should be ordered based on relevance (i.e. how closely does the result match the user's full search query?), not level - so how to reconcile the two factors when they are in conflict? This is complicated by the potential of adding search weighting based on fields in 2.4 - e.g. showing a match found in Scope and content higher than a match found in a notes field. There are also the technical aspects to consider, as we are currently getting all of our search result data from Elasticsearch, AtoM's search index. What changes would we have to make to the index so that user-added weighting based on levels of description is included - can it be included in the index, or is this an additional query that must be performed first? Which form of weighting gets priority? What effects on performance and scalability might such a feature have, when we add another database check and sort operation to every search?
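The conflict between relevance and level-based weights can be shown with a small Python sketch. Everything in it is hypothetical: `LEVEL_WEIGHTS`, the `rank` function, the record titles, and the scores are invented for illustration, and multiplying the two factors is just one of several ways they could be combined. Nothing here reflects an existing AtoM or Elasticsearch feature.

```python
# Hypothetical user-editable weights per level of description
# (an assumption for illustration, not an AtoM feature).
LEVEL_WEIGHTS = {"Fonds": 2.0, "Series": 1.5, "Item": 1.0}

def rank(results, level_weights):
    """Order (title, level, relevance) tuples by relevance multiplied
    by the weight of the record's level; unknown levels get weight 1.0."""
    return sorted(
        results,
        key=lambda r: r[2] * level_weights.get(r[1], 1.0),
        reverse=True,
    )

# Made-up search results with made-up relevance scores.
results = [
    ("Note by Sara Turing", "Item", 0.9),
    ("Turing family papers", "Fonds", 0.5),
]

by_relevance = sorted(results, key=lambda r: r[2], reverse=True)
by_weighted = rank(results, LEVEL_WEIGHTS)

print([r[0] for r in by_relevance])
print([r[0] for r in by_weighted])
```

With these numbers, the item is the closer match to the query and comes first by relevance alone, but the fonds overtakes it once the level weight is applied. Whether that is the desired outcome is exactly the open question, and the right combining rule would depend on local practice.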

Once again, it might not be impossible, but it would certainly require analysis and development to be able to implement such a solution. If your institution is interested in discussing development opportunities further, please feel free to contact me off-list.

As I mentioned, we are hoping to do a serious overhaul of the Elasticsearch implementation for archival descriptions in the 2.4 release - we are waiting for confirmation of a development contract. This should improve the accuracy of results significantly - but unfortunately, it won't necessarily fully resolve the specific issues raised at the start of this thread.

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

kjam...@ualberta.ca

Apr 26, 2016, 5:36:24 PM
to ICA-AtoM Users
Thanks for the extra info! Just one more quick question: do perfect matches outweigh partial matches in search results? Based on my own searches I would guess yes, but I am not 100% sure.

Best,
Krista

Dan Gillean

Apr 26, 2016, 6:17:37 PM
to ICA-AtoM Users
Hi Krista,

Yes, as I understand it, Elasticsearch's relevance ranking will mean that an exact match should return higher than a partial match. One problem with this, which we are hoping to correct in 2.4, is that the default operator in AtoM is currently set to OR - meaning that when you search for city hall, you are actually getting results returned for city OR hall, instead of city AND hall as you might expect. Results that have both terms should still come back higher than those with just one or the other, but it can affect how much extra cruft is included in the search results that might not actually be what you're looking for.
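As a rough illustration of what the default operator changes, here is a toy Python matcher. It is not how Elasticsearch or AtoM evaluates queries: `matches` is a hypothetical function that only checks whole-word presence, and the titles are made up. (Elasticsearch's `query_string` query does expose a `default_operator` option, but whether that is the setting AtoM uses is an assumption not confirmed in this thread.)

```python
def matches(query, text, operator="OR"):
    """Toy whole-word matcher: with OR, any query term may appear;
    with AND, every query term must appear."""
    terms = query.casefold().split()
    words = set(text.casefold().split())
    hits = [term in words for term in terms]
    return all(hits) if operator == "AND" else any(hits)

# Made-up titles for illustration.
titles = [
    "City Hall renovation plans",
    "City council minutes",
    "Concert hall programme",
]

or_hits = [t for t in titles if matches("city hall", t, "OR")]
and_hits = [t for t in titles if matches("city hall", t, "AND")]

print(or_hits)
print(and_hits)
```

With OR, all three titles match because each contains at least one of the two terms. With AND, only the first title matches. Ranking is a separate step that this sketch leaves out.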

We have one user in the UK who is currently making local changes in their AtoM implementation to use AND as the default operator instead of OR - we're looking forward to hearing more about how this goes. If our contract for search index enhancements goes through and this particular user reports back positively, then we'll likely implement AND as the default operator in the 2.4 release of AtoM as well.


You might also be interested in checking out some of the other Boolean characters you can use in AtoM searches:

Note that AtoM's search/browse/advanced search interface is getting a significant overhaul in 2.3 - I've already updated the documentation (though we're still finalizing this release), so if you're curious, you can read more and see screenshots here:


Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

kjam...@ualberta.ca

Apr 26, 2016, 6:38:52 PM
to ICA-AtoM Users
Oh that's great! Thanks for all the extra info! Very helpful. 
Best,
Krista