Seeking ArcLight Search Relevance Explanation


John Weise

Oct 11, 2022, 2:25:06 PM
to arclight-community
Hello! 

I hope everyone is well and open to a question being posed here on this group that has been quiet for some time. At Michigan, we are working again on our initial ArcLight implementation (based on Duke's) and one of the things requested by our campus stakeholders is an explanation of how search works in ArcLight. What would make it more understandable to end users and a bit less magical? If anyone has anything to share, I'd appreciate it. And in the absence of a user-friendly rendition, a technical explanation of what's going on behind the scenes would give us what we need to craft a user-friendly version. Thank you, all, for considering this.

Best, 

John

Sean Aery

Oct 12, 2022, 11:27:27 PM
to arclight-community
Hi John,

I'm happy to share some notes on ArcLight search functionality from our perspective at Duke. I don't know how much of this translates into patron-friendly, public-facing documentation, but I hope it'll at least help your staff understand it better.

Collections & Components
The Solr documents that'll be searched/retrieved will be either 1) collections or 2) components that belong to a collection. When you index a finding aid (an EAD2002 file) via ArcLight's core Traject pipeline ( https://github.com/projectblacklight/arclight/blob/main/lib/arclight/traject/ead2_config.rb ), you get one Solr doc that encodes all the collection-level description (from the <archdesc> level), then one Solr doc for each individual component therein (<c>, <c01>, <c02>, etc.), with the corresponding component-level description encoded.
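
A rough illustration of that split (one collection doc plus one doc per component), sketched in Python rather than in Traject. The sample EAD fragment, the split_ead helper, and the output keys (level, title) are invented for illustration; they are not ArcLight's actual field names or indexing code.

```python
import xml.etree.ElementTree as ET

# Tiny made-up EAD2002-like fragment (no namespace, for brevity).
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1901</unittitle></did>
        </c02>
      </c01>
      <c01 level="series">
        <did><unittitle>Photographs</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

def is_component(tag: str) -> bool:
    # Matches <c>, <c01>, <c02>, ... but not other tags that start with "c".
    return tag == "c" or (len(tag) == 3 and tag[0] == "c" and tag[1:].isdigit())

def split_ead(xml_text: str) -> list[dict]:
    """Return one dict for the collection, then one per component (document order)."""
    root = ET.fromstring(xml_text)
    archdesc = root.find("archdesc")
    docs = [{
        "level": "collection",
        "title": archdesc.findtext("did/unittitle"),
    }]
    for el in archdesc.iter():
        if is_component(el.tag):
            docs.append({
                "level": el.get("level"),
                "title": el.findtext("did/unittitle"),
            })
    return docs

docs = split_ead(SAMPLE_EAD)
for doc in docs:
    print(doc)
```

With this sample, the single finding aid yields four separate records: the collection and its three components, each carrying only its own description.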

The component docs use some native Solr document nesting (at least for their component-collection relationship), so, e.g., if you delete the Solr doc for a collection it'll delete all of its components too. Other component-component relationships are encoded in traditional Solr fields, e.g., an array of ancestor node IDs for a deeply nested component is encoded in parent_ssim.

When you search ArcLight using the default All Results view, you're searching all component and collection "documents" together -- they are all interleaved.  You do see breadcrumb trails to help give a sense of which collection each document is part of.  At Duke, we have about 4,000 collections. Each comprises an average of 250 components. So it's a bank of around 1,000,000 records that are searched, ranked, and potentially returned in search results.

What Fields are Searched and How Are They Weighted?

ArcLight ships with basic schema.xml and solrconfig.xml files that you may want to customize. Out of the box, most fields are indeed searched per this default config.

solrconfig.xml SearchHandler
https://github.com/projectblacklight/arclight/blob/main/solr/conf/solrconfig.xml#L68-L188
- big relevance boosts for collection title, title
- small relevance boosts for names, places, unitid
- a catch-all "text" field with no boosting
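
To make the boosting idea concrete, here is a small sketch of how per-field weights are typically written as a Solr qf (query fields) string, where field^N multiplies that field's contribution to the score. The field names and boost numbers below are placeholders chosen for illustration; they are not copied from ArcLight's solrconfig.xml, so check the linked file for the real values.

```python
# Hypothetical weights: big boosts for titles, small boosts for
# names/places/unitid, and an unboosted catch-all field.
FIELD_BOOSTS = {
    "collection_title": 150,
    "title": 100,
    "name": 10,
    "place": 10,
    "unitid": 5,
    "text": None,  # None means "searched, but with no boost"
}

def build_qf(boosts: dict) -> str:
    """Render a dict of field -> boost as a space-separated qf string."""
    parts = []
    for field, boost in boosts.items():
        parts.append(field if boost is None else f"{field}^{boost}")
    return " ".join(parts)

qf = build_qf(FIELD_BOOSTS)
print(qf)
```

A match in a heavily boosted field (a title, say) then counts for far more than the same match found only in the catch-all field.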

schema.xml copyField rules
https://github.com/projectblacklight/arclight/blob/main/solr/conf/schema.xml#L327-L371
- note how fields are defined to be auto-copied into a catch-all "text" field, which is one of the fields included in the above SearchHandler.
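
The effect of those copyField rules can be pictured with a few lines of Python. This only simulates the idea and is not Solr itself; the rule list and field names are assumptions made up for the example.

```python
# Hypothetical copy rules: (source field, destination field).
COPY_RULES = [
    ("title", "text"),
    ("scopecontent", "text"),
    ("names", "text"),
]

def apply_copy_rules(doc: dict, rules=COPY_RULES) -> dict:
    """Return a copy of doc with source values appended to their destination fields."""
    out = {field: list(values) for field, values in doc.items()}
    for source, dest in rules:
        if source in doc:
            out.setdefault(dest, []).extend(doc[source])
    return out

doc = {
    "title": ["Letters, 1901"],
    "names": ["Smith, Jane"],
    "extent": ["3 folders"],  # no copy rule, so it never reaches "text"
}
indexed = apply_copy_rules(doc)
print(indexed["text"])
```

The point to notice is the extent value: it is stored on the document, but because no rule copies it into the catch-all field, a search against that field will not find it.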

If there's data from the EAD you want to be indexed so it can be searched, first ensure it's actually getting captured and written into a Solr field via Traject ( https://github.com/projectblacklight/arclight/blob/main/lib/arclight/traject/ead2_config.rb ).  Then add it to the schema.xml and solrconfig.xml files.  Feel free to look at ours to see how we have extended the core setup:
https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/schema.xml
https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml

Caveat: there may be better ways nowadays to customize search scoping and relevance ranking without digging into those Solr config files; perhaps others in the Blacklight community can advise there.

Results Grouped by Collection

Note that this is not currently the default in ArcLight core, but one unique and powerful feature of ArcLight that distinguishes it from other Blacklight apps is grouping results by collection (collection_ssi field specifically, see https://github.com/projectblacklight/arclight/blob/main/lib/arclight/engine.rb#L18-L28). All of us at Duke, Princeton, Indiana, and Albany have modified our local apps to make Group by Collection the default in the UI.

It is perhaps unintuitive but nevertheless important to note that a top-level collection record is part of a collection group just like a component is.  This means that the collection document itself can and often will appear as a matching record within the collection group. So it looks a little redundant in the UI. But if it didn't work that way, you'd have no way to 1) show highlighted keyword-in-context snippets for query matches in the collection-level description; 2) have the collection-level description weigh heavily in relevance rank for a group.

Relevance Order When Grouped by Collection

On a search results page, the matching collection groups appear in relevance order, but there's more to it than meets the eye. They appear in order of their highest-scoring document for the query (remember, that document might be the top-level collection description itself or it might be an individual component from within the collection). E.g.:

Group: Collection A (note: this does not have a "score")
  Collection A doc (score 100)
  Component A1 doc (score 3)
  Component A2 doc (score 2)

Group: Collection B (note: this does not have a "score")
  Component B5 doc (score 95)
  Component B2 doc (score 80)
  Collection B doc (score 75)

Note that the number of components in a collection that match the query has no impact on the relevance ranking whatsoever. One highly relevant component in a not-very-relevant collection will make that collection group beat out a relevant collection that includes thousands of moderately-relevant components.
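
The ordering rule can be sketched in Python using the example scores above. This mirrors the behavior as described here, not Solr's actual grouping code; Collections C and D and the rank_groups helper are made up to show the "one strong hit beats many moderate hits" case.

```python
from collections import defaultdict

# (collection, document, score) taken from the example above.
HITS = [
    ("Collection A", "Collection A doc", 100),
    ("Collection A", "Component A1 doc", 3),
    ("Collection A", "Component A2 doc", 2),
    ("Collection B", "Component B5 doc", 95),
    ("Collection B", "Component B2 doc", 80),
    ("Collection B", "Collection B doc", 75),
]

def rank_groups(hits):
    """Order groups by their single best document score; match counts are ignored."""
    groups = defaultdict(list)
    for collection, doc, score in hits:
        groups[collection].append((doc, score))
    for docs in groups.values():
        docs.sort(key=lambda d: d[1], reverse=True)
    return sorted(groups.items(), key=lambda g: g[1][0][1], reverse=True)

# Hypothetical extra groups: 1,000 moderate hits vs. one slightly stronger hit.
many_moderate = [("Collection C", f"Component C{i} doc", 50) for i in range(1000)]
one_strong = [("Collection D", "Component D1 doc", 60)]

order = [name for name, _ in rank_groups(HITS + many_moderate + one_strong)]
print(order)
```

Collection D, with a single hit scoring 60, lands ahead of Collection C and its 1,000 hits scoring 50 each.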

You can test this out a bit in our UI by appending &debug=true to the end of a search result URL; you'll see the relevance score for each document, e.g.:
https://archives.lib.duke.edu/?utf8=%E2%9C%93&group=true&search_field=all_fields&q=basketball&debug=true
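
If you want to generate such URLs in bulk (say, for a list of test queries), a small helper along these lines would do it. The with_debug function is a hypothetical convenience, not part of ArcLight or Blacklight; it only manipulates the query string.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def with_debug(url: str) -> str:
    """Return url with debug=true set exactly once in its query string."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "debug"]
    params.append(("debug", "true"))
    return urlunparse(parts._replace(query=urlencode(params)))

search_url = (
    "https://archives.lib.duke.edu/"
    "?utf8=%E2%9C%93&group=true&search_field=all_fields&q=basketball"
)
debug_url = with_debug(search_url)
print(debug_url)
```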

Other ArcLight Search Amenities

There are a lot of nice search-supporting features that ArcLight provides out of the box that we have appreciated, especially:
- autosuggest/typeahead on the search boxes
- hit highlighting (keyword-in-context)
- within-field searches for things like Names, Places, Subjects
- faceting, especially for narrowing down to digital content

I'd be happy to answer any followup questions you might have. Best of luck with your implementation!

-- Sean 

~~~~~~~~~~~~~~~~~~~~~~
Sean Aery
Pronouns: he/him/his
Digital Projects Developer
Assessment & User Experience Strategy
Duke University Libraries
030U Bostock Library Box 90198
Durham, NC 27708
sean...@duke.edu

~~~~~~~~~~~~~~~~~~~~~~

Sean Aery

Oct 13, 2022, 9:05:01 AM
to arclight-community
Another important point I'll add is that, out of the box, there is not much cross-pollination of description between collections and the components belonging to them. The top-level collection document does not include any description from its components, and the component documents mostly don't inherit description from their top-level collection. Nested components also don't inherit description from their ancestor components (even though in archival descriptive practice this inheritance often seems to be the intent).

What this means is if your search query has two terms, one that is present only in the collection-level description and another that is present only in the description of one of its components, you will get zero results. The same is true if those two terms appear only in two different components within the same collection, even when those components are situated as ancestor-descendant.
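
A toy model of that behavior, under the assumption (implied above) that every query term has to match within a single Solr document. The two sample records and the search helper are invented for the example, and real Solr matching involves analysis and scoring that this ignores.

```python
def search(docs: dict, query: str) -> list:
    """Return ids of docs whose own text contains every query term."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    ]

docs = {
    "collection": "Papers of a university basketball coach",
    "component_1": "Scrapbook 1952",
}

print(search(docs, "basketball"))            # matches the collection only
print(search(docs, "scrapbook"))             # matches the component only
print(search(docs, "basketball scrapbook"))  # no single doc has both terms
```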

One strategy we have used to help with this at Duke is making sure at least the titles from the ancestor trail (including the collection) get indexed on each component. Note this is captured as an array (parent_unittitles_ssm) for display purposes but by default that's not indexed for search. So we have a local customization to also capture that as indexed text parent_unittitles_teim and add it to the list of fields in the schema.xml file that get copied into the all-purpose text field.
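
Sketching that customization in Python: each component gets the titles of its ancestors copied into its own searchable text. Only parent_unittitles_teim is a name from the setup described above; the sample record, the text key, and the helper functions are assumptions for illustration, and the real change lives in the Traject config and schema.xml.

```python
def add_parent_titles(component: dict, ancestor_titles: list) -> dict:
    """Return a copy of component with ancestor titles made searchable."""
    enriched = dict(component)
    enriched["parent_unittitles_teim"] = list(ancestor_titles)
    # Stand-in for the copyField into the catch-all text field.
    enriched["text"] = " ".join([component["title"], *ancestor_titles])
    return enriched

def matches(doc: dict, query: str) -> bool:
    """True if every query term appears in the doc's own text."""
    words = doc["text"].lower().split()
    return all(term in words for term in query.lower().split())

component = {"id": "c02_1", "title": "Scrapbook 1952", "text": "Scrapbook 1952"}
ancestors = ["Basketball Coach Papers", "Memorabilia"]  # collection title first

before = matches(component, "basketball scrapbook")
after = matches(add_parent_titles(component, ancestors), "basketball scrapbook")
print(before, after)
```

Before the change the two-term query misses the component; once the ancestor titles are part of its text, the same query finds it.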

There was not much time during the last (2019) ArcLight community work cycle to refine the out-of-the-box search relevance, so this is indeed an area with some rough edges. But with a few local revisions to the schema.xml, solrconfig.xml, and ead2_config.rb files, you can optimize the setup to work well for your data.

-- Sean

John Weise

Oct 13, 2022, 9:40:21 AM
to Sean Aery, arclight-community
Good morning, Sean.  

This is incredibly helpful. Thank you so much for your detailed and thoughtful response. You have catapulted our understanding forward. As we process this information more fully, it is easy to imagine we'll have some follow-up questions, so we appreciate your offer to field those too.

John
