[DuraSpace JIRA] (FCREPO-1019) Exploration of complex GSearch use cases


Gert Schmeltz Pedersen (Created) (DuraSpace JIRA)

Oct 24, 2011, 5:25:03 AM
to fcrepo-...@googlegroups.com
Exploration of complex GSearch use cases
----------------------------------------

Key: FCREPO-1019
URL: https://jira.duraspace.org/browse/FCREPO-1019
Project: Fedora Repository Project
Issue Type: Story
Components: GSearch
Reporter: Gert Schmeltz Pedersen
Assignee: Gert Schmeltz Pedersen
Fix For: GSearch 2.4


This issue is created in response to the messages below. I want to explore complex GSearch use cases and sketch or implement solutions, based on existing and/or potential GSearch functionality. Such functionality includes many-repositories-to-many-indexes configurations, indexing XSLT stylesheets that create index documents across Fedora datastreams and/or objects, managing GSearch configurations in Fedora objects (FCREPO-1018), and interaction between the resource index and the Lucene/Solr index(es) (FCREPO-1009).


From: "aj...@virginia.edu" <aj...@virginia.edu>
Date: 24 Oct 2011 02:19:24 CEST
To: Support and info exchange list for Fedora users. <fedora-com...@lists.sourceforge.net>
Subject: Re: [fcrepo-user] [fcrepo-dev] GSearch planning
Reply-To: Support and info exchange list for Fedora users. <fedora-com...@lists.sourceforge.net>

The intention of bringing the structure of the indexing workflow out of XSLT into the RDF relationships between objects is not primarily to provide for complex cases, although it can do that. It is, instead, to make that structure part of the curation of the objects themselves.

The case for this move rests on the claim that the presentation of objects increasingly depends on indexing (in part because so many "front-end" frameworks for Fedora, e.g. Hydra or Islandora, rely on indexes rather than on direct retrieval from the repository to construct many user-facing web pages), and that indexing workflow therefore deserves to be curated alongside data contents in the _strongest practical way_. I claim further that the strongest possible way to curate relationships between content datastreams and indexing transforms in a Fedora repository is in explicit RDF, and that this is practical.

I quite agree that a powerful but unwieldy or opaque style of configuration may be worse than a weaker but more transparent style. But I believe that, with enough thought and attention to the specific modeling of workflow, we could provide graceful factoring in configuration, through which simple GSearch indexing workflows would incur very little expense (even less than they now do) while sophisticated workflows remain possible.

---
A. Soroka
Online Library Environment
the University of Virginia Library


On Oct 23, 2011, at 8:05 PM, Conal Tuohy wrote:

On 17/10/11 11:48, aj...@virginia.edu wrote:
Heartily seconded!

In the architecture we're exploring at UVa, we use RELS-INT to define relationships between datastreams and indexing transforms. The relevance to this issue lies in RELS-EXT. By indexing RELS-EXT as a datastream (and assuming that the molecular "para-object" that is responsible for a given index record is constructed via RELS-EXT relationships), we can obtain information about the other objects that may be involved in any index record with which a given object is associated. I agree that keeping the analysis of object relationships for indexing purposes in indexing XSLT is _not_ the best way. Instead, we look to combine this technique with the use of Enhanced Content Model Views to create the kind of multi-object records to which Jonathan is pointing, by hiding the explicit structure of the "para-object" from the indexing XSLT. This may or may not be the best possible solution for the problem, so I'm just offering it as a place to begin discussion.
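
The idea of declaring datastream-to-transform relationships in RELS-INT can be sketched roughly as follows. This is only an illustration, not the UVa implementation: the predicate `INDEXED_BY`, the PIDs, and the function name are all hypothetical, and the triples are hard-coded rather than parsed from a real RELS-INT datastream.

```python
from collections import defaultdict

# Hypothetical predicate; not a real Fedora or GSearch term.
INDEXED_BY = "info:example/relations#isIndexedBy"

# (subject, predicate, object) triples as they might be read from RELS-INT.
# All URIs below are invented for illustration.
RELS_INT = [
    ("info:fedora/demo:1/DC", INDEXED_BY, "info:fedora/demo:xslt/DC_TO_SOLR"),
    ("info:fedora/demo:1/MODS", INDEXED_BY, "info:fedora/demo:xslt/MODS_TO_SOLR"),
    ("info:fedora/demo:1/MODS", "info:example/relations#other", "ignored"),
]


def transforms_by_datastream(triples):
    """Map each datastream ID to the transforms declared for it."""
    plan = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == INDEXED_BY:
            # The datastream ID is the last path segment of the subject URI.
            dsid = subject.rsplit("/", 1)[-1]
            plan[dsid].append(obj)
    return dict(plan)


plan = transforms_by_datastream(RELS_INT)
```

The point of the sketch is that the indexing plan is read from relationships stored with the object, so it can be curated together with the content instead of being buried in stylesheet logic.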


---
A. Soroka
Online Library Environment
the University of Virginia Library


On Oct 16, 2011, at 8:15 PM, Jonathan Green wrote:

Something that I think needs to be considered when moving forward with GSearch is that the index may not always share a one-to-one relationship with objects in Fedora. In a very atomistic content model, perhaps the Solr document is actually composed of parts from many related objects. These types of decisions are currently very hard to make in XSLTs.
In what way hard? Can you expand a little on the difficulties you see?
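
To make the many-objects-to-one-document case concrete, here is a minimal sketch under stated assumptions: the `IS_PART_OF` predicate, the PIDs, and the field names are all hypothetical, and plain Python dictionaries stand in for both the repository and the resulting Solr document.

```python
# Hypothetical predicate; a real repository would use its own relationship terms.
IS_PART_OF = "info:example/relations#isPartOf"

# Invented objects, each contributing some fields to the index.
OBJECTS = {
    "demo:book": {"title": ["A Book"]},
    "demo:page1": {"text": ["first page"]},
    "demo:page2": {"text": ["second page"]},
}

# RELS-EXT-style triples linking the parts to the whole.
RELS_EXT = [
    ("demo:page1", IS_PART_OF, "demo:book"),
    ("demo:page2", IS_PART_OF, "demo:book"),
]


def build_index_document(root_pid, objects, triples):
    """Merge fields from the root object and its parts into one document."""
    parts = sorted(
        s for s, p, o in triples if p == IS_PART_OF and o == root_pid
    )
    doc = {"id": root_pid}
    for pid in [root_pid] + parts:
        for field, values in objects.get(pid, {}).items():
            # Collect values from every contributing object under one field.
            doc.setdefault(field, []).extend(values)
    return doc


doc = build_index_document("demo:book", OBJECTS, RELS_EXT)
```

Here one index document is assembled from three repository objects. Doing the same inside a stylesheet would require the XSLT itself to discover and fetch the related objects, which is the kind of decision under discussion.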

While I think XSLTs have a place in transforming metadata, there needs to be something more.
One issue to keep in mind here is the 80/20 rule. If Fedora's
indexing system is complex enough to allow for all manner of complex
cases, then it may be needlessly complex for many simple cases. A more
complex system would make complex indexing easier, but if it also makes
simpler cases harder (even just harder to understand a configuration
system), then the OVERALL ease-of-use might actually decrease. I don't
think it's possible to strike a perfect balance, but a technology like
XSLT might be a useful catch-all: it can handle simple cases very
simply, but can also be extended arbitrarily (including, for instance,
transcluding metadata from related Fedora objects or other XML datasources).

In very many cases, the mapping of Fedora objects to Solr documents is
very simple and won't, for instance, involve any aggregation. But the
mapping from Fedora objects to Solr documents is in principle arbitrary;
you might choose to do pretty much anything, quite legitimately. You
might have metadata schemas of any type; you might use the RDF store,
you might have external authority files, etc. This is where, I think, a
system which is sufficiently configurable to be fully general could well
end up as complex as an XSLT-based system would be, but without many of
the advantages of XSLT (code libraries, books and mailing lists, programmer
experience, etc).

It might be enough to ship Fedora with a basic set of XSLT transforms,
and a few sample transforms showing how to use the resource index, etc.
--

Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Chris Wilper (Updated) (DuraSpace JIRA)

Oct 25, 2011, 11:27:03 AM
to fcrepo-...@googlegroups.com

[ https://jira.duraspace.org/browse/FCREPO-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Wilper updated FCREPO-1019:
---------------------------------

Status: Open (was: Received)
