External indexing issues

Stefano Cossu

Oct 27, 2014, 12:00:32 PM

to fedor...@googlegroups.com

There has been a lively discussion about removing internal search from Fedora 4 on IRC and GitHub lately. I have a few points that I'd like to summarize here rather than in more volatile and technical places.

I understand the concerns about the complexity and possibly the inappropriateness of building a full-fledged query endpoint in Fedora while there are so many tools out there (Solr, triplestores, etc.) which do an excellent job at it.

I also feel, from a less technical and more product-centered standpoint, that a repository without even a minimal search facility would sound less appealing. Bundling the external indexer with the core repository, as @awoods suggested, might be a good step toward maintaining the one-click install feature and a complete set of basic repository tools.

Still I have two issues that I'm struggling with.

1 - Discovering relations to nodes that have just been created: e.g. a client creates a node, receives an HTTP response header with the new node's location, and wants to query nodes related to that new node right away. This is likely to fail with an asynchronous indexer.

2 - Implementing complex security policies and enforcing them all the way down the whole search/indexing module. E.g. I want Imaging staff to be able to search for an image by a 'poor quality' flag. This property of course has to be exposed in the indexer for query and retrieval by Imaging, but I don't want other departments, some of which may have ties to third parties, to see it.
A coarse distinction can be made between web users and staff, but we can't create a separate index for each access level case!
An alternative would be building security in a middleware layer, which can read the complex rules in Fedora (including an XACML implementation - ugh) and enforce them for indexing. I'd like to avoid such a cumbersome task.
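
For illustration only, the middleware alternative in issue 2 could look roughly like the sketch below: one shared index, with restricted fields filtered per caller at query time. The policy table, the role names and the field name `poor_quality` are all hypothetical, and the sketch does not read rules from Fedora or XACML; it only shows the shape of the filtering step.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical policy: which roles may see each restricted index field.
# Fields that are not listed are treated as visible to everyone.
FIELD_POLICY: Dict[str, Set[str]] = {
    "poor_quality": {"imaging"},
}


def may_see(field: str, roles: Set[str]) -> bool:
    """True if a caller holding `roles` is allowed to see `field`."""
    return field not in FIELD_POLICY or bool(FIELD_POLICY[field] & roles)


def visible_fields(doc: Dict[str, object], roles: Set[str]) -> Dict[str, object]:
    """Return a copy of an index document with restricted fields removed."""
    return {field: value for field, value in doc.items() if may_see(field, roles)}


def search(index: Iterable[Dict[str, object]], field: str, value: object,
           roles: Set[str]) -> List[Dict[str, object]]:
    """Match on `field` only if the caller may see it, then redact the results."""
    if not may_see(field, roles):
        # Refuse the query outright so restricted values cannot be probed.
        return []
    return [visible_fields(doc, roles) for doc in index if doc.get(field) == value]
```

With this approach a single index serves every department; what differs per caller is only which fields can be queried and returned.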

Both issues above assume that my Fedora repo is accessed by an arbitrary number of clients so I can't rely on a single, complex implementation (e.g. Hydra, Islandora).

Maybe I am missing the big picture, but I would appreciate some help to understand how these issues can be tackled.

Thanks,

Stefano

Andrew Woods

Oct 27, 2014, 2:54:08 PM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

Hello Stefano,

Thank you for elevating your concern/interest in the topic of the Fedora-4.0-release-term plans for internal-admin search. As you know, the project is putting full focus on tightening the codebase for the pending production release. That effort includes establishing a feature-base that will be stable as we move into the future. It will be much less painful for the Fedora user community to drop dependence on a feature now than after the production release has gone out.

One of the transformative capabilities that Fedora4 is bringing to the community is its native, intuitive, standards-based Linked Data interaction model. The issue of internal-admin search has come to a head because in the process of the team polishing the Linked Data interaction, it has become clear that there are consequences of the fact that there is no one-to-one mapping of RDF types and backend JCR property types. The result of this fact is that the internal-admin search capabilities (which rely on backend JCR indexing) would require Significant and Ugly software gymnastics to work at even a moderately functional level.

Given,

1) the unmaintainability and limited functionality of such an implementation,

2) the fact that existing (and Fedora-integrated) tooling already does a good job of search/indexing (Solr, triplestores),

3) that robust support for RDF interactions as a feature likely trumps internal-admin search,

4) the dearth of documented internal-admin-search-related use cases [1], and

5) the performance benefits of disabling JCR indexing,

it is hard to make the argument for internal-admin search remaining in the 4.0 feature set.

Getting back to the two issues you raise,

re:1) It sounds like the fundamental issue is a need for a quick (milliseconds? seconds?) update of a query-able index for a UI that is working directly against your Fedora. This is a reasonable use case, but one that does not necessarily imply a synchronous internal-admin search. I would be very interested in the actual user/usability requirements for such an interaction, and push on indexing/messaging models that support those requirements.

re:2) The question of authorization is one that the entire community has faced, and will continue to face. From the beginning, we have intentionally termed the internal-admin search capability exactly that to make clear that it was designed for internal, administrative users. Among other notions, embedded in that name is the implication that authorization is not a characteristic of the feature. That said, many Fedora users and Fedora-based frameworks have addressed the issue of authorization policies over external indexes which live as Fedora resources. Given the exact situation that you raise, we as a community, and myself personally, have a vested interest in ensuring that valid solutions exist.

In summary, there is a tension between Linked Data support and the internal-admin search feature. Given the developer resources, the timeline, the community use cases and feature priorities, indications would favor leaving internal-admin search out of the initial 4.0 production release.

I hear your position, Stefano, and sympathize with it. There are solutions to the issues you raise, and they will come collectively from the community. It is not clear that an expanded internal search is a good fit.

However, despite the strong technical and project-level arguments for favoring Linked Data over internal-admin-search, this is the time to raise your opinion/concerns as we approach this immediate juncture of removing or not removing internal-admin-search from the Fedora 4.0 production release.

Regards,

Andrew

[1] https://wiki.duraspace.org/label/FF/uc-search

Stefano Cossu

Oct 27, 2014, 4:39:42 PM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

Andrew,
Thanks for taking the time to explain in detail the reasons behind dropping internal search. I agree with them and support that decision, especially since the project is not in a completely functional state.

The reasons why I recently started exploring the possibilities to expand internal search were the two points I raised below, which I could not find a solution to using an external indexer.

Therefore:
#1 is not critical, but an internal search facility would have been very useful to get results synchronously with a legacy application that is hard to adapt to a messaging system. Given that this facility will not be available, I guess that the best solution in this case would be relying on an integration framework (which we are already building) to handle asynchronous processes for these systems.

#2 is a critical point for us to go live with an institution-wide Fedora repo and I am glad to hear that it is a common request. I am available to meet and explore some scenarios and try to find possible solutions.

Thanks,

Stefano Cossu
Director of Application Services, Collections

The Art Institute of Chicago
116 S. Michigan Ave.
Chicago, IL 60603
312-499-4026

Exorcise Your Mind
Temptation: The Demons of James Ensor
November 23–January 25

The Art Institute of Chicago
Voted #1 museum in the world by TripAdvisor

Andrew Woods

Oct 27, 2014, 5:25:42 PM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

Hello Stefano,

Thank you for your understanding. You and your team have been invaluable partners in the Fedora 4 effort, and your engagement is appreciated. As for point #2, there are related discussions actively underway in the Hydra community (see recent notes [1] and the scheduled call for this coming Wednesday [2]). In any case, an authorization model that supports external indexes is an important community priority.

Andrew

[1] https://wiki.duraspace.org/display/hydra/Hydra+Tech+Call+2014-10-15

[2] https://wiki.duraspace.org/display/hydra/Hydra+Tech+Call+2014-10-29

Stefano Cossu

Oct 27, 2014, 5:31:24 PM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

That sounds interesting. If you don't mind me crashing the meeting, I can peek in and make an effort to sit quietly in a corner and listen.

Even though we are not planning to implement Hydra in the near future, I am interested in which ideas may come up that we can apply to a more generic index filter (again, we are assuming different clients accessing Fedora and its indexes).

Thanks,

Stefano Cossu

Esmé Cowles

Oct 27, 2014, 5:38:40 PM

to fedor...@googlegroups.com

Stefano-

The current push to use WebAccessControl access metadata started with the Hydra community, but it's certainly not limited to just Hydra folks. We definitely want input from people using other systems, and hope to have a common solution that everybody can use regardless of what kinds of frontend applications they are using.

-Esme

Durbin, Michael (md5wz)

Oct 27, 2014, 10:50:33 PM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

I understand the desire to remove it. If there were no significant cost, I'd argue for ease of use that we retain it (and the full-text search we already scrapped).

I'd be curious to see the numbers for the performance cost of synchronous updates for the fcrepo4 index. In Fedora 3 it was big. In my work experience, I've always benefited from a synchronously updating index even though the performance cost was relatively high. This was worth it compared to the cost of developing workflows that had to maintain redundant indexes (or store redundant relationships) when Mulgara couldn't be trusted to accurately list children (for example) due to the high latency of asynchronous updates.

For UVA (and IU for whom I no longer work) synchronous query is a strongly requested feature. The use case has been expressed and we've completed an acceptance test. (https://wiki.duraspace.org/display/FF/2014-09-25+Acceptance+Test+-+Internal+search)

I'll be working in the next sprint, and if it makes sense, I can spend that time making the feature an option that's easy to add to a Fedora 4 instance, so that those whose use of Fedora doesn't require reliably querying it for recently updated RDF assertions can leave it out.

The use case boils down to being able to query one-way relationships immediately after adding them. There are definitely ways to work around it, but I guess I've grown to expect a system to be able to immediately verify something it just successfully did rather than have to wait an unknown interval.
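
One of the workarounds alluded to above is to poll the external index until a just-written relationship becomes visible, with a bounded wait. The sketch below is only an illustration of that idea: `wait_for_index` and `LaggingIndex` are hypothetical names, and `LaggingIndex` is an in-memory stand-in for an asynchronously updated index, not a Fedora, Solr or triplestore API.

```python
import time
from typing import Callable, List, Optional, Set, Tuple, TypeVar

T = TypeVar("T")
Triple = Tuple[str, str, str]


def wait_for_index(query: Callable[[], Optional[T]],
                   timeout: float = 5.0,
                   interval: float = 0.1) -> T:
    """Call `query` until it returns a non-None result or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = query()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("index did not catch up within %.2fs" % timeout)
        time.sleep(interval)


class LaggingIndex:
    """Hypothetical stand-in: a write becomes visible only after `lag` queries."""

    def __init__(self, lag: int) -> None:
        self._lag = lag
        self._pending: List[Tuple[int, Triple]] = []
        self._visible: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # The write is accepted immediately but not yet searchable.
        self._pending.append((self._lag, (subject, predicate, obj)))

    def subjects(self, predicate: str, obj: str) -> Optional[List[str]]:
        # Each query advances the simulated indexer by one step.
        remaining: List[Tuple[int, Triple]] = []
        for ticks, triple in self._pending:
            if ticks == 0:
                self._visible.add(triple)
            else:
                remaining.append((ticks - 1, triple))
        self._pending = remaining
        found = sorted(s for (s, p, o) in self._visible
                       if p == predicate and o == obj)
        return found or None
```

A client would add the relationship and then call something like `wait_for_index(lambda: index.subjects(predicate, obj))`. The cost is exactly the "unknown interval" mentioned above: the caller has to pick a timeout and handle the case where the index never catches up.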

-Mike

Scott Prater

Oct 28, 2014, 7:27:43 AM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

Our use case at UW Madison matches Mike's. We rely on synchronous queries heavily; the performance cost hasn't been an issue for us.

-- Scott

Andrew Woods

Oct 28, 2014, 9:46:23 AM

to fedor...@googlegroups.com, fedora-...@googlegroups.com

Hello Mike,

You raise a good point around the possibility of a search feature being a "drop-in" optional component (akin to the recent fcrepo4-oaiprovider [1]). We have seen two different search variations in Fedora 4: one offered basic keyword-matching over resource properties, and the other was a limited SPARQL-Query endpoint. As mentioned earlier in this thread, the SPARQL endpoint has a fundamental impedance mismatch which would likely be an unending source of discontent. However, it would be interesting to explore your, Scott's, and others' requirements to determine if the simpler keyword search would satisfy a real need.

That said, at the current point in the project-cycle where we are attempting to produce a trimmed, tight release, I suspect the immediate focus will be on testing/documenting/releasing.

Andrew

[1] https://github.com/fcrepo4-labs/fcrepo4-oaiprovider

Tom Cramer

Oct 28, 2014, 12:11:08 PM

to fedor...@googlegroups.com, Tom Cramer

Stefano,

Just to underscore Esme's point, you'd be welcome to sit in and weigh in on the Hydra-tech discussions. In addition to welcoming input and additional perspectives from fellow travelers (but not current adopters) in general, we're eager to see that the architectural approaches taken by Hydra align with overall trends in the broader Fedora and Web communities.

- Tom

Tom Cramer

Oct 28, 2014, 12:16:05 PM

to fedor...@googlegroups.com, Tom Cramer, Adam Wead

And hot off the presses, here is the agenda and connection information for tomorrow's Hydra Tech call on access controls:

From: Adam Wead <amste...@gmail.com>
Date: October 28, 2014 9:10:20 AM PDT
To: Hydra-Tech <hydra...@googlegroups.com>
Subject: [hydra-tech] tomorrow's call
Reply-To: hydra...@googlegroups.com

Hi all,

Tomorrow’s Hydra Tech call is a special topic on web access controls for Hydra rights metadata. I don’t think other agenda items are excluded, but the meeting may take the full hour on that topic.

https://wiki.duraspace.org/display/hydra/Hydra+Tech+Call+2014-10-29

…adam

- Tom
