[fcrepo-user] Fedora GSearch and OBJ datastream (application/pdf) extraction for Solr index


Serhiy Polyakov

Nov 23, 2011, 3:26:13 AM
to fedora-com...@lists.sourceforge.net
Hello,

I am trying to get an OBJ datastream (application/pdf) processed and
indexed into Solr 3.4 with GSearch 2.3. I excluded all MODS streams to
isolate the problem, so the object has only DC and OBJ (PDF) datastreams.

Note: PDF indexing was working for me in last spring's installation with
GSearch 2.2 on Lucene. The summer installation with GSearch 2.2 beta on
Solr 1.4 is not indexing PDFs either.

For debugging, I tried the command-line XSLT processor Xalan 2.7.0 that
comes with GSearch. I included all the classpath variables, as I mentioned
in previous messages.

It gives this error:

file:///home/fedora/3_XML_Pro/foxmlToSolr.xslt; Line #86; Column #-1;
XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a matching
8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl}getDatastreamText()
at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
at org.apache.xalan.xslt.Process.main(Process.java:1126)

Only when I downloaded Xalan 2.7.1 into a separate directory and added it
to the classpath on the command line could I process the file and get an
output file with all the fields, including the OBJ fulltext extracted from
the PDF. I tried overwriting the Xalan jars that came with GSearch with
the new ones, but it still gives the same error. Only when I run Xalan
2.7.1 directly from the separate directory does it process the input file.
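
For reference, the kind of command line I used for this debugging looks
roughly like the sketch below (run from the directory holding the Xalan
2.7.1 jars; the Tomcat path and the file names are just placeholders for my
layout, and the classpath may also need the GSearch lib jars so that the
exts class and its dependencies can be found):

java -cp xalan.jar:serializer.jar:xercesImpl.jar:xml-apis.jar:/path/to/tomcat/webapps/fedoragsearch/WEB-INF/classes \
    org.apache.xalan.xslt.Process \
    -IN islandora_6.foxml.xml -XSL foxmlToSolr.xslt -OUT islandora_6.solr.xml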

====================
Here is an excerpt from the input object's FOXML that I am processing:

<foxml:datastream ID="OBJ" FEDORA_URI="info:fedora/islandora:6/OBJ"
STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
<foxml:datastreamVersion ID="OBJ.0" LABEL="Title_2.pdf"
CREATED="2011-10-19T09:07:40.379Z" MIMETYPE="application/pdf"
SIZE="56276">
<foxml:contentLocation TYPE="INTERNAL_ID"
REF="http://myhost:8080/fedora/get/islandora:6/OBJ/2011-10-19T09:07:40.379Z"/>

====================
I am using the foxmlToSolr.xslt stylesheet that came with GSearch. It has
the following lines in the header:

xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
exclude-result-prefixes="exts"
---------------------------
And the following in the body:

<xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' or
@CONTROL_GROUP='E' or @CONTROL_GROUP='R']">
<field>
<xsl:attribute name="name">
<xsl:value-of select="concat('dsm.', @ID)"/>
</xsl:attribute>
<xsl:value-of select="exts:getDatastreamText($PID,
$REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
$TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
</field>
</xsl:for-each>
====================

When objects are submitted to Fedora, all inline datastreams get into the
index fine. All non-inline (Managed) datastreams that do not require
external processing (like OCR text) are also processed into the index.
The non-inline OBJ datastreams containing PDFs, which do require external
processing, are not getting into the index.

I have the class
dk.defxws.fedoragsearch.server.GenericOperationsImpl

under
..tomcat/webapps/fedoragsearch/WEB-INF/classes

It is used by GSearch to extract foxml.all.text, which means it is
visible to GSearch. It sounds like it is only when GSearch passes the PDF
content of the OBJ datastream for extraction that nothing comes back.

Could somebody confirm that objects with pdf content are fulltext
indexed OK with GSearch on Solr?

Thanks,

Serhiy

Gert Schmeltz Pedersen

Nov 23, 2011, 4:23:01 AM
to Support and info exchange list for Fedora users.
I can confirm that the pdf document in datastream DS2 of the demo object demo:18 is indexed in my test installation.

If I understand you correctly, you _do_ get the PDF indexed as part of foxml.all.text, right? That must mean the error is produced somewhere else in your indexing stylesheet, maybe at line #86 as indicated in the error message below. Also, it is strange that the error message refers to Saxon; Saxon cannot work when your exts namespace refers to Xalan. Look into fedoragsearch.log and catalina.out; there must be something there.

-Gert


Serhiy Polyakov

Nov 23, 2011, 1:09:14 PM
to Support and info exchange list for Fedora users.
Gert,

In demoFoxmlToSolr there is a part which says: the managed datastream is
fetched, and if its MIMETYPE can be handled, the text becomes the value
of the field.

All my installations of GSearch + Solr properly process managed
datastreams with MIMETYPEs “text/xml” or “text/plain”. MIMETYPEs
“application/msword”, “application/pdf”, and “application/ps” are not
processed into the index. My resulting foxml.all.text field misses the
fulltext from those three MIMETYPEs.
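
For reference, a selection along the lines below is what I understand by
"if its mimetype can be handled". This is only a sketch based on the
for-each from my foxmlToSolr.xslt excerpt, not the exact demoFoxmlToSolr
code, and the MIMETYPE list is just an example:

<xsl:for-each select="foxml:datastream[@CONTROL_GROUP='M' and
    (foxml:datastreamVersion[last()]/@MIMETYPE='text/plain' or
     foxml:datastreamVersion[last()]/@MIMETYPE='text/xml' or
     foxml:datastreamVersion[last()]/@MIMETYPE='application/pdf')]">
  <field>
    <xsl:attribute name="name">
      <xsl:value-of select="concat('dsm.', @ID)"/>
    </xsl:attribute>
    <xsl:value-of select="exts:getDatastreamText($PID, $REPOSITORYNAME,
        @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS,
        $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
  </field>
</xsl:for-each>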

I suppose that demoFoxmlToSolr is correct because I was able to fetch
and process these three MIMETYPEs into text in debugging mode from the
command line, with Xalan 2.7.1 downloaded into a separate directory.
However, I was not able to repeat that from the command line with the
Xalan that ships with FedoraGSearch. In the latter case, yes, it is
strange that the error message refers to Saxon.

I will be looking into the logs and testing, and will install a new
instance of GSearch + Solr.

Thanks,
Serhiy

Serhiy Polyakov

Nov 28, 2011, 5:47:10 AM
to Support and info exchange list for Fedora users.
Gert,

I fixed the problem of indexing managed datastreams by downloading a
newer GSearch 2.3 a few days ago. I had been trying to fix the problem
with the version I got from https://github.com/fcrepo/gsearch on October
31; I think it was a beta.

So this class
dk.defxws.fedoragsearch.server.GenericOperationsImpl
is working now.

I am still getting the error with another class that comes from
Islandora and is supposed to parse the MODS inline datastream.
Command-line processing with Xalan gives this error, which for some
reason refers to Saxon:

file:///home/user1/4_XML_Tr/demoFoxmlToSolr.xslt; Line #298; Column
#-1; XSLT Error (net.sf.saxon.trans.XPathException): Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
Exception in thread "main" java.lang.RuntimeException: Cannot find a
matching 8-argument function named
{xalan://ca.upei.roblib.DataStreamForXSLT}getDatastreamTextRaw()
at org.apache.xalan.xslt.Process.doExit(Process.java:1153)
at org.apache.xalan.xslt.Process.main(Process.java:1126)

Interestingly, the parser worked from the command line when I removed
tomcat/webapps/fedoragsearch/WEB-INF/lib/saxon9he.jar
and also copied fedora-client-3.1.jar there (taken from GSearch 2.2).
However, in that case http://myhost:8080/fedoragsearch/rest says Saxon
is missing.


Thanks,
Serhiy




Gert Schmeltz Pedersen

Nov 28, 2011, 6:10:27 AM
to Support and info exchange list for Fedora users.
One thing to do in order to avoid mixing Xalan and Saxon is to check the exts definition in your indexing stylesheet (whether Lucene or Solr) against fedoragsearch.properties:

# xsltProcessor, xalan or saxon
# this choice must be accompanied by the right namespace in your foxmlToLucene.xslt:
fedoragsearch.xsltProcessor = xalan
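
For example, with the xalan setting above, the exts namespace in the
stylesheet header should use the xalan:// form, as in the first line below.
A saxon setup would instead use Saxon's java: extension namespace; the
second line is only an illustration, so please check it against the demo
stylesheets that ship with GSearch:

xmlns:exts="xalan://dk.defxws.fedoragsearch.server.GenericOperationsImpl"
xmlns:exts="java:dk.defxws.fedoragsearch.server.GenericOperationsImpl"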

Please address the Islandora community also. I appreciate very much their use of GSearch, but I cannot answer questions about it.

-Gert

Serhiy Polyakov

Nov 28, 2011, 12:22:31 PM
to Support and info exchange list for Fedora users.
Gert,

Yes, my settings are all right:
fedoragsearch.xsltProcessor = xalan

I asked the Islandora community about the problem with their function;
hopefully they will update the class. I included that inquiry here
just in case.

Thanks,
Serhiy


Burgis, Richard

Nov 29, 2011, 11:04:04 AM
to Support and info exchange list for Fedora users.

I’m moving a test repository and trying various methods to get it to work.

 

For objects with  embedded content (X), I have no problem ingesting from the source repository. But when I try ingesting objects with Managed content (M), I get errors saying that the managed content cannot be found.

 

I tried exporting the objects and ingesting the exported objects, but I get the same result.

 

I modified the contentLocation  to use file URLs pointing to the location of the content in the file system. This time the ingest succeeded, but I got errors (500?) when I tried to edit or view the content. I got the same error when I attempted to re-import it via the Admin program. I was finally able to get the import to work after purging the items.

 

This seems unreasonable as a process, so I would assume that I am missing something critical.

 

Any suggestions?

Thanks

Rich

 

Richard Green

Nov 29, 2011, 12:00:01 PM
to Support and info exchange list for Fedora users.

Rich

 

Don’t assume that it’s just you.  This sounds rather familiar and we’ve been wondering if it was *us*.  Are you trying this with the Fedora Java admin client (ingest object(s) from repository)?  If so – exactly what error message do you get?  What version of Fedora are you using?

 

Richard

___________________________________________________________________

 

Richard Green

Consultant to Library & Learning Innovation, University of Hull

managing the History DMP and Hydra (Hull) Projects

 

http://hydra.hull.ac.uk

http://hydrangeainhull.wordpress.com

http://projecthydra.org

http://historydmp.wordpress.com

Benjamin Armintor

Nov 29, 2011, 12:19:52 PM
to Support and info exchange list for Fedora users.
Richard (the first):
Can you provide a bit more detail about the configuration of the two
fedoras (are they both using the same storage module?), what exactly
it is you're trying to do (it sounds like you're trying to submit the
exported foxml as the body of an ingest request), and the type of 500
errors you're getting (or any information from the logs)?

If you're changing the contentLocations to be file uris, did you also
update the relevant policy to allow resolution of those uris?

- Ben

Benjamin Armintor

Nov 29, 2011, 12:24:41 PM
to Support and info exchange list for Fedora users.

Also, just in case:
https://wiki.duraspace.org/display/FEDORA35/Using+File+URIs

Kyle Banerjee

Nov 29, 2011, 12:38:01 PM
to Support and info exchange list for Fedora users.
 

> I tried exporting the objects and ingesting the exported objects, but I get the same result.



You are doing nothing wrong.

I can't remember what the exact issue was since we haven't messed with this for a while, but our experience is that you cannot import exported objects without modifying them.

kyle

--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance 
bane...@uoregon.edu / 503.877.9773

Burgis, Richard

Nov 29, 2011, 12:40:14 PM
to Support and info exchange list for Fedora users.

I’ve tried both the web and the Java version of the administrator. I am using 3.5.

 

The error I got after the successful ingest (with the file URLs) was 500 Internal Server Error. Purging the datastream and then reimporting the files works fine.

Burgis, Richard

Nov 29, 2011, 12:49:00 PM
to Support and info exchange list for Fedora users.
They are both using the Akubra module. From the logs it appears that the object was not in low-level storage.

I did try using the export file directly as the ingest. That seemed the most straightforward way to do this. However, it refers to locations in the source repository for the managed datastreams, and the ingest could not find them. That's why I tried changing to a file URL.

I have not changed the policy file to allow this. Thank you very much; this sounds promising.

My other question relates to why I cannot ingest objects with managed datastreams directly from the original repository.

My biggest problem is that Fedora comes with a huge set of concepts to digest all at once in support of its capabilities. Getting past that will be the challenge.

Thank you
Rich
-----Original Message-----
From: Benjamin Armintor [mailto:armi...@gmail.com]
Sent: Tuesday, November 29, 2011 12:25 PM
To: Support and info exchange list for Fedora users.
Subject: Re: [fcrepo-user] Ingest Question

Also, just in case:
https://wiki.duraspace.org/display/FEDORA35/Using+File+URIs


awo...@duraspace.org

Nov 29, 2011, 12:56:31 PM
to Support and info exchange list for Fedora users.
If I understand you rightly you are using the "Migrate" style of export, so that managed datastreams will be expressed in the exported FOXML as URLs back into the original repository. If, by any chance, the original repository URLs are inaccessible at the time of ingest (e.g., because of XACML policy), you may see some funny behavior. It's something that has bitten me before because of my own absent-mindedness and you might want to check to make sure it's not happening to you. It's easy to check by taking one of those URLs and retrieving it _from the machine on which the new repository is running_ right before you start the ingest, using a tool like 'wget' or 'curl'. If this fails, it should at least give you more information about why it's happening (whether you have an XACML policy problem or, like me, are a bit absentminded and prone to turning things off without remembering you did {grin}).
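
For illustration, a "Migrate" export expresses a managed datastream with a
contentLocation that points back at the original repository, roughly like
the sketch below (the REF value is a placeholder, not taken from your
actual export):

<foxml:contentLocation TYPE="URL"
    REF="http://old-repo-host:8080/fedora/get/some:pid/DS1/2011-01-01T00:00:00.000Z"/>

It is exactly that URL that the new repository must be able to retrieve at
ingest time.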

---
A. Soroka
Online Library Environment
the University of Virginia Library

Burgis, Richard

Nov 29, 2011, 1:20:36 PM
to Support and info exchange list for Fedora users.
Thanks. That's one of the things that I checked. It makes it that much
more frustrating when I can get to the file via the web or a service.

But this leads to a bigger question. My background is in big systems and
I'm feeling my way around in the new world. I read in the section that
Mr. Armintor kindly pointed out that it seems like a very bad idea to
use file URIs for external datastreams. I am going to be using those
extensively in the future and would appreciate it if anyone could
suggest how to configure my server (or create the URIs) so that I can
use an http-based URI.

This is at a level of ignorance so high that I would think that a
reference to a web site or book would be the easiest way to answer.

Thank you very much
Rich




awo...@duraspace.org

Nov 29, 2011, 1:30:27 PM
to Support and info exchange list for Fedora users.
One natural question is why you want to use file:// URLs. Is it because you are dealing with very large datastreams (which is often the motivation)? Or for some other reason? As for using http:// URLs for external datastreams, it's not difficult to create objects that do so. The pattern of URLs to use is essentially arbitrary, as long as Fedora can dereference them when appropriate, and the specifics really depend on your larger system design criteria. Keep in mind that when using external datastreams, you give up some of Fedora's abilities to manage that content, e.g. eventing via JMS or content versioning.

As to the specific problem-- it would be helpful if you could provide the section of Fedora's log in which the new repository fails to retrieve the managed datastreams from the old repository. Then we might be able to determine just why that's happening in your situation. If it makes you feel any better about the time you are spending with this problem, I move objects between repositories by this means all the time, so it is possible, and there is some definite reason it isn't working for you, and we can find out what that is and fix it.

---
A. Soroka
Online Library Environment
the University of Virginia Library





Burgis, Richard

Nov 29, 2011, 1:49:22 PM
to Support and info exchange list for Fedora users.
Thanks.

The error is:

org.fcrepo.server.errors.HttpServiceNotFoundException:
[DefaultExternalContentManager] returned an error. The underlying error
was a org.fcrepo.server.errors.GeneralException The message was "Error getting
http://localhost:8080//fedora/get/uahc:AP00001/source/2011-10-19T18:14:34.840Z"


The Fedora log at this point just contains a Java exception dump that
starts out exactly as the above:

ERROR 2011-11-29 13:42:36.394 [http-bio-8080-exec-4]
(FedoraAPIMBindingSOAPHTTPImpl) Error ingesting
org.fcrepo.server.errors.HttpServiceNotFoundException:
[DefaultExternalContentManager] returned an error. The underlying error
was a org.fcrepo.server.errors.GeneralException The message was "Error getting
http://localhost:8080/fedora/get/uahc:AP00001/source/2011-10-19T18:14:34.840Z" .
    at org.fcrepo.server.storage.DefaultExternalContentManager.getExternalContent(DefaultExternalContentManager.java:155) [fcrepo-server-3.5.jar:na]
    at org.fcrepo.server.storage.DefaultDOManager.doCommit(DefaultDOManager.java:1203) [fcrepo-server-3.5.jar:na]
    at org.fcrepo.server.storage.SimpleDOWriter.commit(SimpleDOWriter.java:509) [fcrepo-server-3.5.jar:na]
    at org.fcrepo.server.management.DefaultManagement.ingest(DefaultManagement.java:177) [fcrepo-server-3.5.jar:na]
    at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) [na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [na:1.6.0_25]
    at java.lang.reflect.Method.invoke(Method.java:597) [na:1.6.0_25]
    at org.fcrepo.server.messaging.NotificationInvocationHandler.invoke(NotificationInvocationHandler.java:68) [fcrepo-server-3.5.jar:na]
    at $Proxy5.ingest(Unknown Source) [na:na]
    at org.fcrepo.server.management.ManagementModule.ingest(ManagementModule.java:354) [fcrepo-server-3.5.jar:na]
    at org.fcrepo.server.management.FedoraAPIMBindingSOAPHTTPImpl.ingest(FedoraAPIMBindingSOAPHTTPImpl.java:83) [fcrepo-server-3.5.jar:na]
    at org.fcrepo.server.management.FedoraAPIMBindingSOAPHTTPSkeleton.ingest(FedoraAPIMBindingSOAPHTTPSkeleton.java:355) [fcrepo-common-3.5.jar:na]
    at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) [na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [na:1.6.0_25]

There is a great deal more of course.

awo...@duraspace.org

Nov 29, 2011, 2:08:12 PM
to Support and info exchange list for Fedora users.
That's pretty much what we'd expect. I assume you have the new repository running on some other port on the same machine? Now, assuming that you can retrieve that exact URL for the managed datastream (http://localhost:8080//fedora/get/uahc:AP00001/source/2011-10-19T18:14:34.840Z) via 'wget' or the like from the new repository machine, I would ask you to show us what the log for the _old_ repository says at that moment. Is the request being received and if so, why isn't it being served? Presumably you should either see no request being received (in which case we have a problem around the repository) or some kind of exception is being thrown to indicate why that datastream won't be served.

---
A. Soroka
Online Library Environment
the University of Virginia Library





Burgis, Richard

Nov 29, 2011, 2:18:22 PM
to Support and info exchange list for Fedora users.
Actually, the first server is on a second Windows machine in my office,
and the second is running on Linux somewhere in our data center. I tried
connecting via the web and successfully retrieved my file. However,
since the server is on a different machine, I had to change the localhost
to the IP address of that machine.

This seems suspicious to me. I set the first machine up with all
defaults, thus the host defaulted to localhost. I'm wondering if I
should reinstall using the address of the machine instead.

I checked the source server and there is nothing in the log other than
"I gave you the object you asked for."

Thanks

awo...@duraspace.org

Nov 29, 2011, 2:25:59 PM
to Support and info exchange list for Fedora users.
I agree that it seems a little strange. {grin}

If the logs of the new repository show it requesting "http://localhost:8080..." it is going to be requesting that URL of itself, and it certainly won't get it! It seems that the export from the old repository has URLs containing "http://localhost:8080..." and not an actual externally discoverable address of the machine. I'm not very familiar with the export codebase, so I welcome advice from the committers who are, but I believe this may be the root of your problem.

If, as I believe is the case, the construction of that URL is governed by the parameter in fedora.fcfg:

<param name="fedoraServerHost" value="my.host.name.was.here">
<comment>Defines the host name for the Fedora server, as seen from the
outside world.</comment>
</param>

then you might try changing that value from "localhost" to something the new machine can find (even a bare IP address), restarting the old repository, and redoing a single export to see if the URLs for managed datastreams change. If they do (and I'm pretty confident that they will) you can try importing that object to see if all goes well.

---
A. Soroka
Online Library Environment
the University of Virginia Library





Richard Green

Nov 29, 2011, 2:17:06 PM
to Support and info exchange list for Fedora users.
Rich

That's exactly the error we've been seeing. Between the two of us I
think we have probably stumbled upon a real issue. Either that, or
we're both capable of being stupid - certainly not beyond the bounds of
possibility in my case!!!

Richard

Burgis, Richard

Nov 29, 2011, 2:42:55 PM
to Support and info exchange list for Fedora users.
I really liked that answer. It made sense. I changed the configuration
file, restarted the server and did the export. The contentLocation was
changed from localhost to the correct IP address. I tried to ingest the
object, which labored a long time and ended up with the same error. But
this time the error referenced the IP address. I just tried ingesting
from the source repository and got the same error.

The log on the source repository says that the object was requested and
exported successfully.

What was Huxley's line: "A beautiful theory wrecked by an inconvenient
fact" (or words to that effect)

awo...@duraspace.org

Nov 29, 2011, 3:06:08 PM
to Support and info exchange list for Fedora users.
I hate to bang on the same drum, but I'm still not yet absolutely clear about one thing: when you say that "I just tried ingesting from the source repository" (I assume you mean via the Java admin tool, or perhaps with the command-line utilities) and "I tried connecting via the web and successfully retrieved my file." do you mean exactly that you tried _from the new repository machine_ or that perhaps you tried from your desktop or some other machine? I just want to be very, very sure that the new repository machine can successfully connect via HTTP to the old repository machine, without worrying about Fedora's vagaries in making that connection. Essentially, I'm trying to separate any problems with firewalls or network-layer access restrictions or the like from problems with Fedora (in which I'm actually quite willing to believe {grin}).

That aside, I think it may be useful to dig into the old repository. Can you give us the output of a failing-to-ingest "Migration"-style export in its totality?

---
A. Soroka
Online Library Environment
the University of Virginia Library





Burgis, Richard

Nov 29, 2011, 3:36:05 PM
to Support and info exchange list for Fedora users.
I'm sorry I'm slow. Of course you're right. I tried ssh'ing into the
server machine and running wget: my connection times out each time. Just
to be sure I tried the same thing for the target server, and it worked
like a charm.

We have a firewall problem. (Or at least a firewall problem.)

On the off chance it's not, here is a non-importing file. For what it's
worth: when I had localhost in the locations, it failed quickly. With
this one it fails slowly, which is what you'd expect from Fedora
retrying the connection several times before giving up.
uahc_SpartanArchives.xml

awo...@duraspace.org

Nov 29, 2011, 3:41:42 PM
to Support and info exchange list for Fedora users.
Not at all. I have yet to come across a real-world deployment of Fedora that didn't involve at least a little bit of network-layer... um... "fun". {grin}

If you get that network-layer "fun" resolved and there are other problems, we'll be here.

The question of using "file://" URLs for external datastreams is a little different, and I'd still be interested to hear what you're trying to do in that way, particularly as a means of discovering where Fedora may need to grow and expand functionality.

---
A. Soroka
Online Library Environment
the University of Virginia Library





awo...@duraspace.org

Nov 29, 2011, 3:44:37 PM
to Support and info exchange list for Fedora users.
Richard--

I'm frankly very dubious that you're being stupid. {grin}

If you don't trace the problem to some kind of network issue, please let us have some stacktraces and FOXML. It's far from impossible that there is some kind of issue here that gets masked by other problems.

If nothing else, perhaps we need to augment the documentation to explain more clearly that a "migration"-style FOXML serialization of an object has URL-dependencies on the repository from which it was exported...

---
A. Soroka
Online Library Environment
the University of Virginia Library





Richard Green

Dec 1, 2011, 5:51:19 AM
to Support and info exchange list for Fedora users.
So, Adam (and others), it goes like this:

We have two properly configured Fedora 3.4 servers, 'test' and 'prod'.
By that I mean the fcfg file has a proper address for the server, not
just 'localhost'.

If, on my desktop machine's Fedora Java admin client, I log into test
and try to do 'ingest > one object > from repository' from prod (giving
correct passwords and asking for 'migrate' format) I get an
'HttpServiceNotFoundException' (full error listing below) for a
metadata-only object (this for a 'managed' metadata datastream).

If we use the admin client software actually on the test server, the
exact same thing happens. If we use a browser on the test server to access
http://....../fedora/search on prod and attempt to download the failing
datastream or file from there - it works fine (which makes us think it
isn't a firewall thing).

The log on the sending server (prod) reports a successful export.

If the attempted ingest involves a content datastream (which, for us
means an MD5 checksum is set), we get a checksum failure rather than an
HttpService error, but I'd like to bet that the checksum failure is
actually because the file can't be retrieved.

'Prod' really is our repository production server so we know that we're
starting with valid objects. The httpService error stack is below (I've
just removed part of the server address for security). If I paste the
full address from the error message into a browser I get an
authentication challenge and then the datastream content. (That won't
work for you because the Fedora server is behind the University
firewall.)

When we set up the prod server 10 weeks ago we tried to transfer content
in the reverse direction (test to prod) but failed (we ended up copying
the data directories and (re)building prod's Fedora over them - probably
quicker anyway). I can't now swear we got the exact same error, but I'd
be fairly certain. I'm very reluctant to try a 'test to prod' transfer
now, precisely because it is the production server we're dealing with.

Any insights welcome!!!

=================================

org.fcrepo.server.errors.HttpServiceNotFoundException:
[DefaultExternalContentManager] returned an error. The underlying error
was a org.fcrepo.server.errors.GeneralException The message was "Error getting
http://fedora-prod-vm.hull.ac.uk:8080/fedora/get/hull:4994/descMetadata/2011-11-28T10:22:40.549Z" .
    at org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:222)
    at org.apache.axis.message.SOAPFaultBuilder.endElement(SOAPFaultBuilder.java:129)
    at org.apache.axis.encoding.DeserializationContext.endElement(DeserializationContext.java:1087)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
    at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
    at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
    at org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62)
    at org.apache.axis.client.AxisClient.invoke(AxisClient.java:206)
    at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
    at org.apache.axis.client.Call.invoke(Call.java:2767)
    at org.apache.axis.client.Call.invoke(Call.java:2443)
    at org.apache.axis.client.Call.invoke(Call.java:2366)
    at org.apache.axis.client.Call.invoke(Call.java:1812)
    at fedora.server.management.FedoraAPIMBindingSOAPHTTPStub.ingest(FedoraAPIMBindingSOAPHTTPStub.java:537)
    at fedora.client.APIMStubWrapper$1.construct(APIMStubWrapper.java:31)
    at fedora.client.SwingWorker$2.run(SwingWorker.java:131)
    at java.lang.Thread.run(Unknown Source)


___________________________________________________________________

Richard Green
Consultant to Library & Learning Innovation, University of Hull
managing the History DMP and Hydra (Hull) Projects

http://hydra.hull.ac.uk
http://hydrangeainhull.wordpress.com
http://projecthydra.org
http://historydmp.wordpress.com





Stephen Bayliss

Dec 1, 2011, 1:14:54 PM
to Support and info exchange list for Fedora users.
Hi Richard

This looks like the error/stack trace from the Java client (apologies if I
am mistaken). If that is the case, can you find the relevant section of the
receiving server's log file and see if it provides any more detail?

Thanks
Steve


Richard Green

Dec 2, 2011, 5:33:54 AM
to Support and info exchange list for Fedora users.
You're right, Steve, that was the client log. Attached is the much longer
server log entry (I've edited a URL or two for security). Although we
recreated the problem this morning, it follows the exact pattern
described below. FYI the object was hull:4994, comprising five
metadata-only datastreams: DC (X), descMetadata (M), RELS-EXT (X),
rightsMetadata (X), defaultObjectRights (X). It does seem to be managed
datastreams that cause the problem. Note the final 'causedBy' 401 - to
my non-developer eye it makes me wonder if maybe credentials aren't
getting passed to the source server (despite the fact that the admin
client has credentials for both source and destination).

Richard
Fedora_transfer_error.rtf

awo...@duraspace.org

Dec 2, 2011, 7:49:48 AM
to Support and info exchange list for Fedora users.
So that I understand rightly: this object has managed datastreams that are guarded by XACML policies that do not allow unauthenticated access. The receiving repo has no way to authenticate, and therefore cannot obtain the datastreams. Is that right?

---
A. Soroka
Online Library Environment
the University of Virginia Library




On Dec 2, 2011, at 8:24 AM, Stephen Bayliss wrote:

> Hi Richard
>
> I think your non-developer eye may have spotted the issue here.
>
> The doCommit(...) method is where the object is being saved on the target
> server, and that's the point at which it is using a URL (from the FOXML) to
> grab the content from the managed content datastream from the origin server
> so it can save the datastream.
>
> And it would appear that it is not authenticating (and therefore getting a
> 401) when it is attempting to grab the content. In fact I don't think the
> target server would have the credentials at that point. And I am struggling
> to think as to what the best resolution for this might be...

Stephen Bayliss

Dec 2, 2011, 9:15:15 AM
to Support and info exchange list for Fedora users.
I think that's basically it - though in fact just requiring authentication
for API-A alone would cause this, I think, as I don't think the
DefaultExternalContentManager has any knowledge of how to authenticate.
(I believe Chris was working on some additional configuration for this to
add some auth config - though the use case was more for external
datastreams, I think.)

Steve

Stephen Bayliss

Dec 2, 2011, 9:28:47 AM
to Support and info exchange list for Fedora users.
> However, as the Fedora admin client has had to authenticate in order to
> get the FOXML export from the source, I'd have imagined that those
> credentials were around to use when fetching any managed datastreams
> that the subsequent ingest needed?

I think this is the issue - the Admin client has the credentials, and uses
these to generate the FOXML to pass to the target server. But the target
server doesn't have credentials to use when getting the datastream.


> We've just thought to check the
> source server log for a matching 401: the only thing in the fedora-prod
> logs is a INFO line saying that 'Completed Export..(pid:4994... Nothing
> else is reported.

That would make sense, I think - it is essentially a two-phase process:
- generating FOXML that has references to the datastream locations on the
source server
- ingesting that FOXML on the target server, at which point the datastreams
are grabbed

So possibly you can see some 401 errors further down?

Steve



Scott Prater

Dec 2, 2011, 8:58:51 AM
to Support and info exchange list for Fedora users.
Would IP-based FeSL authentication and an IP access rule in a global XACML policy be a good temporary workaround for this problem? That is, allowing unauthenticated access on the source repo from the target repo, and setting a corresponding rule that opens the door to the target IP in a global XACML policy on the source server?

That would basically mean any requests from the target server to the source server would be allowed in, but I suppose the XACML could be tightened to just apply to certain objects or operations, if that were necessary.
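
As a rough illustration only (untested, and the IP value is a placeholder),
the XACML side of that might be a permit rule keyed on Fedora's client IP
environment attribute, something like:

<Rule RuleId="permit-if-request-from-target-repo" Effect="Permit">
  <Condition FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
    <Apply FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-one-and-only">
      <EnvironmentAttributeDesignator
          AttributeId="urn:fedora:names:fedora:2.1:environment:httpRequest:clientIpAddress"
          DataType="http://www.w3.org/2001/XMLSchema#string"/>
    </Apply>
    <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">10.0.0.99</AttributeValue>
  </Condition>
</Rule>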

-- Scott

--
--
Scott Prater
Library, Instructional, and Research Applications (LIRA)
Division of Information Technology (DoIT)
University of Wisconsin - Madison
pra...@wisc.edu

Stephen Bayliss

Dec 2, 2011, 11:03:52 AM
to pra...@wisc.edu, Support and info exchange list for Fedora users.
Hi Scott

It might be a solution... but won't the authN challenge still happen if
API-A authentication is turned on? Won't the policy just define authZ,
i.e. whether the request is allowed, rather than whether the request has
to be authenticated?

Steve


Benjamin Armintor

Dec 2, 2011, 10:23:11 AM
to Support and info exchange list for Fedora users.
Yes, I think you'd have to turn off the API-A authentication
requirement, and then implement a policy somewhat like the one
disallowing API-M from non-localhost servers (but for API-A, and with an
exception, of course).

Richard Green

Dec 2, 2011, 8:15:34 AM
to Support and info exchange list for Fedora users.
Our test and production servers have authentication enabled for API-A
and API-M access, so it's not a locally added XACML policy, if that's
what you meant, Adam.

However, as the Fedora admin client has had to authenticate in order to
get the FOXML export from the source, I'd have imagined that those
credentials were around to use when fetching any managed datastreams
that the subsequent ingest needed? We've just thought to check the
source server log for a matching 401: the only thing in the fedora-prod
logs is an INFO line saying that 'Completed Export..(pid:4994... Nothing
else is reported.

Richard


Scott Prater

Dec 2, 2011, 10:30:49 AM
to Stephen Bayliss, Fedora Commons
Hi, Steve --

I was thinking that FeSL/JAAS could be configured for IP-based
authentication, and then the usual XACML policies used for authorization.
So the IP access would be configured in two places: at the front door
(during the authN request), then on the porch (for authZ).

From a little bit of reading just now: configuring JAAS for IP-based
authentication is apparently not as trivial as it sounds. Most web
application containers have their own IP-based access control mechanism;
Tomcat uses Valves. Valves and JAAS can be made to play together:
essentially, a custom Valve can do an IP check, then set a user
principal that is then made available to JAAS:

http://java.sys-con.com/node/1876662

There may be a simpler way to do it, but I don't want to hijack this
thread more than I already have.
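
For just the IP-filtering part, as opposed to the principal-setting custom
Valve described above, Tomcat's stock RemoteAddrValve in the relevant
Context or Host element would be enough; the allowed address below is only
a placeholder:

<Valve className="org.apache.catalina.valves.RemoteAddrValve"
       allow="10\.0\.0\.99"/>

That only restricts access by IP, though; it does not by itself set a user
principal or satisfy an authN challenge.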

-- Scott

Stephen Bayliss

Dec 2, 2011, 8:24:57 AM
to Support and info exchange list for Fedora users.
Hi Richard

I think your non-developer eye may have spotted the issue here.

The doCommit(...) method is where the object is being saved on the target
server, and that's the point at which it is using a URL (from the FOXML) to
grab the content from the managed content datastream from the origin server
so it can save the datastream.

And it would appear that it is not authenticating (and therefore getting a
401) when it is attempting to grab the content. In fact I don't think the
target server would have the credentials at that point. And I am struggling
to think as to what the best resolution for this might be...

Burgis, Richard

Nov 29, 2011, 2:44:33 PM
to Support and info exchange list for Fedora users.
I can't speak for you, but I'm quite capable of being stupid.

I truly want this to be a stupid problem, because then someone will come
by and say: well if you just did it this way it would work. And it
would.

Subtle problems take a lot longer.

Sorry you're in the same boat.
Rich
