Hi John,
I'm happy to share some notes on ArcLight search functionality from our perspective at Duke. I don't know how much of this translates into patron-friendly, public-facing documentation, but I hope it'll at least help your staff understand it better.
Collections & Components
The Solr documents that'll be searched/retrieved will be either 1) collections or 2) components that belong to a collection. When you index a finding aid (an EAD2002 file) via ArcLight's core Traject pipeline ( https://github.com/projectblacklight/arclight/blob/main/lib/arclight/traject/ead2_config.rb ), you get one Solr doc that encodes all the collection-level description (from the <archdesc> level), then one Solr doc for each individual component therein (<c>, <c01>, <c02>, etc.), with the corresponding component-level description encoded.
The component docs use some native Solr document nesting (at least for their component-collection relationship), so, e.g., if you delete the Solr doc for a collection it'll delete all of its components too. Other component-component relationships are encoded in traditional Solr fields, e.g., an array of ancestor node IDs for a deeply nested component is encoded in parent_ssim.
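To make the ancestor-array idea concrete, here is a minimal Python sketch (not ArcLight's actual Traject code) that flattens a nested finding aid into one record per collection and per component. The input shape, the `flatten_finding_aid` name, the `id`/`title` keys, and the choice to put the collection ID first in the ancestor list are all assumptions for illustration; only the `parent_ssim` field name comes from the notes above.

```python
def flatten_finding_aid(collection):
    """Return one flat record for the collection plus one per component.

    Each component record lists the IDs of all its ancestors, outermost
    first, in a parent_ssim array (hypothetical, simplified shape).
    """
    docs = [{"id": collection["id"], "title": collection["title"], "parent_ssim": []}]

    def walk(components, ancestors):
        for comp in components:
            docs.append({
                "id": comp["id"],
                "title": comp["title"],
                "parent_ssim": list(ancestors),
            })
            # Children inherit this component's ancestors plus the component itself.
            walk(comp.get("children", []), ancestors + [comp["id"]])

    walk(collection.get("children", []), [collection["id"]])
    return docs


# Hypothetical finding aid: a collection, one series, one folder inside it.
example = {
    "id": "coll01",
    "title": "Example Family Papers",
    "children": [
        {"id": "c1", "title": "Series 1", "children": [
            {"id": "c1a", "title": "Folder 1"},
        ]},
    ],
}

docs = flatten_finding_aid(example)
for doc in docs:
    print(doc["id"], doc["parent_ssim"])
```

The point is only that a deeply nested component carries its whole ancestry in a single multi-valued field, so it can be looked up without walking the tree at query time.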
When you search ArcLight using the default All Results view, you're searching all component and collection "documents" together -- they are all interleaved. You do see breadcrumb trails to help give a sense of which collection each document is part of. At Duke, we have about 4,000 collections. Each comprises an average of 250 components. So it's a bank of around 1,000,000 records that are searched, ranked, and potentially returned in search results.
What Fields Are Searched and How Are They Weighted?
ArcLight ships with basic schema.xml and solrconfig.xml files that you may want to customize. Out of the box, most fields are indeed searched per this default config.
solrconfig.xml SearchHandler
https://github.com/projectblacklight/arclight/blob/main/solr/conf/solrconfig.xml#L68-L188
- big relevance boosts for collection title, title
- small relevance boosts for names, places, unitid
- a catch-all "text" field with no boosting
schema.xml copyField rules
https://github.com/projectblacklight/arclight/blob/main/solr/conf/schema.xml#L327-L371
- note how fields are defined to be auto-copied into a catch-all "text" field, which is one of the fields included in the above SearchHandler.
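As a rough illustration of how per-field boosts are typically expressed, here is a small Python sketch that builds a Solr-style `qf` ("query fields") string in the usual `field^boost` syntax. This assumes an eDisMax-style handler; the field names and boost numbers are placeholders chosen to mirror the big/small/no-boost tiers above, not ArcLight's actual defaults, so check the linked solrconfig.xml for the real values.

```python
def build_qf(boosts):
    """Turn {field: boost} into a Solr qf string such as 'title^100 text'."""
    parts = []
    for field, boost in boosts.items():
        # A boost of 1 is the default, so the ^N suffix can be omitted.
        parts.append(field if boost == 1 else f"{field}^{boost}")
    return " ".join(parts)


# Placeholder fields and weights (assumptions, for illustration only).
example_boosts = {
    "collection_title": 150,  # big boost
    "title": 100,             # big boost
    "name": 10,               # small boost
    "place": 10,              # small boost
    "unitid": 5,              # small boost
    "text": 1,                # catch-all field, no boost
}

qf = build_qf(example_boosts)
print(qf)
```

A match in a heavily boosted field contributes far more to a document's score than the same match found only in the catch-all field.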
If there's data from the EAD you want to be indexed so it can be searched, first ensure it's actually getting captured and written into a Solr field via Traject ( https://github.com/projectblacklight/arclight/blob/main/lib/arclight/traject/ead2_config.rb ). Then add it to the schema.xml and solrconfig.xml files. Feel free to look at ours to see how we have extended the core setup:
https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/schema.xml
https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml
Caveat: there may be better ways nowadays to customize search scoping and relevance ranking without digging into those Solr config files; perhaps others in the Blacklight community can advise there.
Results Grouped by Collection
Note that this is not currently the default in ArcLight core, but one unique and powerful feature of ArcLight that distinguishes it from other Blacklight apps is grouping results by collection (the collection_ssi field specifically; see https://github.com/projectblacklight/arclight/blob/main/lib/arclight/engine.rb#L18-L28 ). All of us at Duke, Princeton, Indiana, and Albany have modified our local apps to make Group by Collection the default in the UI.
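For staff who want to see what grouping looks like at the Solr level, here is a hedged Python sketch of a grouped query using Solr's standard result-grouping parameters (`group`, `group.field`, `group.limit`). The `grouped_query_params` helper and the limit of 3 documents per group are assumptions for illustration; the exact parameters ArcLight sends are defined in the engine.rb lines linked above.

```python
from urllib.parse import urlencode


def grouped_query_params(query, group_field="collection_ssi", docs_per_group=3):
    """Build Solr parameters that group matching docs by a single field."""
    return {
        "q": query,
        "group": "true",                # turn on result grouping
        "group.field": group_field,     # one group per distinct field value
        "group.limit": docs_per_group,  # docs shown per group (assumed value)
    }


params = grouped_query_params("basketball")
query_string = urlencode(params)
print(query_string)
```

Each group then corresponds to one collection, with a handful of its best-matching documents listed underneath.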
It is perhaps unintuitive but nevertheless important to note that a top-level collection record is part of a collection group just like a component is. This means that the collection document itself can and often will appear as a matching record within the collection group. So it looks a little redundant in the UI. But if it didn't work that way, you'd have no way to 1) show highlighted keyword-in-context snippets for query matches in the collection-level description; 2) have the collection-level description weigh heavily in relevance rank for a group.
Relevance Order When Grouped by Collection
On a search results page, the matching collection groups appear in relevance order, but there's more to it than meets the eye. They appear in order of their highest-scoring document for the query (remember, that document might be the top-level collection description itself or it might be an individual component from within the collection). E.g.:
Group: Collection A (note: this does not have a "score")
  Collection A doc (score 100)
  Component A1 doc (score 3)
  Component A2 doc (score 2)
Group: Collection B (note: this does not have a "score")
  Component B5 doc (score 95)
  Component B2 doc (score 80)
  Collection B doc (score 75)
Note that the number of components in a collection that match the query has no impact on the relevance ranking whatsoever. One highly relevant component in a not-very-relevant collection will make that collection group beat out a relevant collection that includes thousands of moderately-relevant components.
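The ordering rule described above can be sketched in a few lines of Python. This is a toy model of the behavior, not Solr's implementation: the `rank_groups` function and the tuple layout are hypothetical, and the scores are the made-up ones from the example.

```python
from collections import defaultdict


def rank_groups(scored_docs):
    """Order groups by their single highest-scoring document.

    scored_docs is a list of (collection, doc_label, score) tuples.
    The number of matching docs in a group plays no part in the ordering.
    """
    groups = defaultdict(list)
    for collection, doc_label, score in scored_docs:
        groups[collection].append((doc_label, score))

    ranked = []
    for collection, docs in groups.items():
        docs.sort(key=lambda d: d[1], reverse=True)  # best doc first in group
        ranked.append((collection, docs))

    # A group's position depends only on its top document's score.
    ranked.sort(key=lambda g: g[1][0][1], reverse=True)
    return ranked


results = [
    ("Collection B", "Component B2 doc", 80),
    ("Collection A", "Component A1 doc", 3),
    ("Collection B", "Component B5 doc", 95),
    ("Collection A", "Collection A doc", 100),
    ("Collection B", "Collection B doc", 75),
    ("Collection A", "Component A2 doc", 2),
]
ranked = rank_groups(results)

# One strong hit outranks a thousand moderate ones.
many_moderate = [("Big", f"component {i}", 50) for i in range(1000)]
one_strong = [("Small", "component x", 60)]
lopsided = rank_groups(many_moderate + one_strong)
```

Here Collection A comes first because its best document scores 100, even though Collection B has three strong matches; likewise the "Small" group outranks "Big" despite having a single matching document.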
You can test this out a bit in our UI by appending &debug=true to the end of a search result URL; you'll see the relevance score for each document, e.g.:
https://archives.lib.duke.edu/?utf8=%E2%9C%93&group=true&search_field=all_fields&q=basketball&debug=true
Other ArcLight Search Amenities
There are a lot of nice search-supporting features that ArcLight provides out of the box that we have appreciated, especially:
- autosuggest/typeahead on the search boxes
- hit highlighting (keyword-in-context)
- within-field searches for things like Names, Places, Subjects