islandora_book_batch - tuque timed out when updating objects


bgil...@pitt.edu

Jul 11, 2017, 4:28:27 PM
to islandora
Has anybody had a fatal error during a book batch ingest?

WD islandora: Failed to modify datastream RELS-EXT from pitt:31735054778760-0597
code: 6
message: name lookup timed out [error]
WD php: RepositoryException: name lookup timed out in RepositoryConnection->parseFedoraExceptions() (line 229 of /var/www/html/drupal-7.54_gamera_sandbox/sites/all/libraries/tuque/RepositoryConnection.php). [error]
RepositoryException: name lookup timed out in RepositoryConnection->parseFedoraExceptions() (line 229 of /var/www/html/drupal-7.54_gamera_sandbox/sites/all/libraries/tuque/RepositoryConnection.php).
Drush command terminated abnormally due to an unrecoverable error.

We do have an architecture that consists of separate servers for fedora, solr, mysql, http and separate http for admin, RI (blazegraph), as well as separate read-only fedora.

From the error message, I can see that there was an exception while performing one of the hooks for the object:model_datastream ID. That code seems to have been trying to modify a RELS-EXT datastream. The failure seems to trace back to a cURL timeout (CURLOPT_CONNECTTIMEOUT). The libraries/tuque/HttpConnection.php file specifies $connectTimeout = 5 (seconds). I may try raising that timeout to 10 seconds.
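To make the timeout in question concrete, here is a minimal sketch in plain PHP cURL terms. It is not tuque's actual code: the helper name build_curl_options() and the example URL are made up for illustration, and the 10-second value is just the one being considered above.

```php
<?php
/**
 * Hypothetical helper (not part of tuque): builds the cURL options that
 * limit how long a request may take.
 */
function build_curl_options($connect_timeout = 5, $total_timeout = NULL) {
  $options = array(
    // Limit (in seconds) on the connection phase, which includes the
    // DNS name lookup.
    CURLOPT_CONNECTTIMEOUT => (int) $connect_timeout,
    // Return the response body instead of printing it.
    CURLOPT_RETURNTRANSFER => TRUE,
  );
  if ($total_timeout !== NULL) {
    // Optional limit (in seconds) on the whole transfer.
    $options[CURLOPT_TIMEOUT] = (int) $total_timeout;
  }
  return $options;
}

// Usage (the URL is an assumption, shown commented out so nothing is fetched):
// $ch = curl_init('http://fedora.example.org:8080/fedora/describe');
// curl_setopt_array($ch, build_curl_options(10));
// $body = curl_exec($ch);
// if ($body === FALSE) {
//   print curl_errno($ch) . ': ' . curl_error($ch) . "\n";
// }
// curl_close($ch);
```

Raising the connect timeout only gives a slow lookup more time to finish; it does not explain why the lookup was slow in the first place.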

The code in islandora/includes/tuque_wrapper.inc is:
class IslandoraFedoraDatastream extends FedoraDatastream {

  ...

  protected function modifyDatastream(array $args) {
    try {
      parent::modifyDatastream($args);
      islandora_invoke_datastream_hooks(ISLANDORA_DATASTREAM_MODIFIED_HOOK, $this->parent->models, $this->id, $this->parent, $this);
      if ($this->state == 'D') {
        islandora_invoke_datastream_hooks(ISLANDORA_DATASTREAM_PURGED_HOOK, $this->parent->models, $this->id, $this->parent, $this->id);
      }
    }
    catch (Exception $e) {
      watchdog('islandora', 'Failed to modify datastream @dsid from @pid</br>code: @code<br/>message: @msg', array(
        '@pid' => $this->parent->id,
        '@dsid' => $this->id,
        '@code' => $e->getCode(),
        '@msg' => $e->getMessage()), WATCHDOG_ERROR);
      throw $e;
    }
  }


(This happens to be the second kind of fatal error during book ingest. The other one was due to the code trying to access the OBJ datastream before it was actually ingested into fedora. I wrote a "wait for object's datastream" module that I call during the derivative creation routine in the large image solution pack. It waits up to 30 seconds before giving up, and it has prevented that issue from happening again.)
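The "wait for the datastream" idea can be sketched as a generic polling loop. This is not the actual module mentioned above, only an illustration of the approach: wait_until_ready() is a hypothetical name, and the readiness check is passed in as a callable so the sketch does not depend on Islandora.

```php
<?php
/**
 * Hypothetical polling helper: calls $is_ready repeatedly until it returns
 * TRUE or until $max_wait seconds have passed.
 *
 * @return bool
 *   TRUE if the check succeeded in time, FALSE if we gave up.
 */
function wait_until_ready(callable $is_ready, $max_wait = 30, $interval = 1) {
  $deadline = microtime(TRUE) + $max_wait;
  while (TRUE) {
    if ($is_ready()) {
      return TRUE;
    }
    if (microtime(TRUE) >= $deadline) {
      // Out of time: let the caller decide how to handle the failure.
      return FALSE;
    }
    // Pause before checking again ($interval may be fractional).
    usleep((int) ($interval * 1000000));
  }
}

// Usage (assumption: $object is a tuque object and the check is a fresh
// lookup of its OBJ datastream):
// $ok = wait_until_ready(function () use ($object) {
//   return isset($object['OBJ']);
// }, 30, 1);
```

Returning FALSE instead of throwing keeps the decision about what to do after the 30 seconds with the derivative routine that called it.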

Any wisdom on this topic is greatly appreciated.

Brian Gillingham

University of Pittsburgh | University Library System

… sometimes I’m like ¯\_(ツ)_/¯

dp...@metro.org

Jul 12, 2017, 8:59:09 AM
to islandora
Hi Brian,

It seems like you have a DNS resolution issue. Is cURL accessing your Fedora repo via a hostname instead of an IP address? This is a pretty common issue when the calling machine (in this case Drupal) can't resolve a given host, temporarily or permanently, via its hostname or fully qualified domain name (e.g. myawesomefedora.com:8080). Is the Drupal server itself resolving names, or do you depend on an external DNS for this? What about /etc/resolv.conf? It could be just a temporary timeout or delay in your DNS resolver, but it could also be a timeout caused by one or more of your servers being overloaded during operations like this.
If you ping the fedora server from the Drupal server, how are the response times?
cURL allows longer timeouts; you could try hardcoding one in tuque at https://github.com/Islandora/tuque/blob/1.x/HttpConnection.php#L96
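One way to check the DNS side from the Drupal server is to time the lookup directly in PHP. The sketch below is only a diagnostic idea: time_dns_lookup() is a hypothetical helper, and the hostname in the usage comment is a placeholder. For reference, the "code: 6" in the log matches libcurl's CURLE_COULDNT_RESOLVE_HOST, which fits a name-resolution problem.

```php
<?php
/**
 * Hypothetical diagnostic: times a single IPv4 hostname lookup.
 */
function time_dns_lookup($hostname) {
  $start = microtime(TRUE);
  // gethostbyname() returns the input string unchanged when the lookup fails.
  $ip = gethostbyname($hostname);
  $elapsed = microtime(TRUE) - $start;
  // An input that is already an IP address also comes back unchanged,
  // so treat that case as resolved.
  $is_ip = filter_var($hostname, FILTER_VALIDATE_IP) !== FALSE;
  return array(
    'hostname' => $hostname,
    'ip' => $ip,
    'resolved' => ($ip !== $hostname) || $is_ip,
    'seconds' => $elapsed,
  );
}

// Usage (placeholder hostname): repeat the lookup to spot slow outliers.
// for ($i = 0; $i < 20; $i++) {
//   $r = time_dns_lookup('fedora.example.org');
//   printf("%s -> %s in %.3f s\n", $r['hostname'], $r['ip'], $r['seconds']);
// }
```

If some lookups take seconds while most take milliseconds, the resolver (or the path to it) is the more likely culprit than Fedora itself.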

Best

Diego

bgil...@pitt.edu

Jul 12, 2017, 9:38:16 AM
to islandora
Diego,

I think you could be right, but I am puzzled: millions of requests from the web server to the fedora server have worked, and then it just decides that one request will take more than the allowed 5 seconds and blows up? Based on the completely random nature of this, I think Java has instability issues. My theory is that it has something to do with garbage collection or memory management.

I have modified my tuque/HttpConnection.php to have a timeout of 30 seconds for these calls, but I am considering adding a second layer to the tuque/RepositoryConnection.php code so that it does not throw the exception on the first failure, but instead waits another minute and tries again. If it fails the second time, then I'd allow the exception to be thrown.
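The retry idea could look roughly like the sketch below. It is not a patch to tuque: call_with_retry() is a hypothetical wrapper, and it catches the generic Exception class so that it runs on its own. In a real change it would probably make sense to catch only RepositoryException.

```php
<?php
/**
 * Hypothetical retry wrapper: runs $operation and, if it throws, waits and
 * tries again, up to $max_attempts attempts in total.
 */
function call_with_retry(callable $operation, $max_attempts = 2, $delay_seconds = 60) {
  $attempt = 0;
  while (TRUE) {
    $attempt++;
    try {
      return $operation();
    }
    catch (Exception $e) {
      if ($attempt >= $max_attempts) {
        // Last attempt failed too: let the original exception propagate.
        throw $e;
      }
      // Wait before the next attempt ($delay_seconds may be fractional).
      usleep((int) ($delay_seconds * 1000000));
    }
  }
}

// Usage (assumption: $datastream is a tuque datastream being relabelled):
// call_with_retry(function () use ($datastream) {
//   $datastream->label = 'New label';
// }, 2, 60);
```

The operation passed in should be safe to repeat, since a request that timed out on the client may still have been applied on the Fedora side.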

My web server did not reach the fedora server at all using a ping command.  I was told it has something to do with DMZ (I do not understand all that network mumbo-jumbo).

I will try the same batch with "defer derivatives" - and gearman is configured as a client and worker on the same machine that has been attempting these ingests.

Brian Gillingham

dp...@metro.org

Jul 12, 2017, 10:20:49 AM
to islandora
Brian seems right. 
Can you monitor free resources (memory) while this happens? As you said, it could be Tomcat/Fedora running out of memory on your backend server, and yes, garbage collection can be a cause of random connection timeouts. They can also happen if the threshold of open connections is reached (Tomcat is so bad at closing connections!), or even if your client machine/Apache is running out of juice!
Oh, DMZ. Yeah, no ping if you have one machine outside and the other inside, and THAT can also add a lot of overhead to your connection!
I personally think retrying is good if you know why things fail (like in this case), but it can also be counterproductive if things are failing because of a real underlying issue. It's up to you, of course.
I use this garbage collecting strategy: G1GC http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html

Best!

Diego Pino
Metro.org

Giancarlo Birello

Jul 12, 2017, 11:37:11 AM
to isla...@googlegroups.com
I noticed some delays due to IPv6 with the latest Fedora/Islandora releases, so I refer to fedora by its IPv4 address.
In addition, I noticed network issues after long batch ingests; to reset things, I run a service networking restart on the fedora backend.
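A related option, different from pointing at the IPv4 address directly, is to keep the hostname and ask cURL to resolve only IPv4 addresses. The sketch below shows the cURL option involved; ipv4_only_curl_options() is a hypothetical helper, and as far as this thread shows, tuque would need a local patch to set it.

```php
<?php
/**
 * Hypothetical helper: adds the "IPv4 only" resolver setting to a set of
 * cURL options, leaving the other options untouched.
 */
function ipv4_only_curl_options(array $options = array()) {
  // Tell cURL to resolve hostnames to IPv4 addresses only.
  $options[CURLOPT_IPRESOLVE] = CURL_IPRESOLVE_V4;
  return $options;
}

// Usage (placeholder URL, shown commented out so nothing is fetched):
// $ch = curl_init('http://fedora.example.org:8080/fedora/describe');
// curl_setopt_array($ch, ipv4_only_curl_options(array(
//   CURLOPT_RETURNTRANSFER => TRUE,
// )));
```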

Just a couple of ideas

Giancarlo

--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/3efe1a18-c0a8-43a7-9544-9e91491fa36a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

bgil...@pitt.edu

Jul 12, 2017, 12:04:36 PM
to islandora
Giancarlo,

I don't know which version of Islandora makes that change -- we are on 7.x-1.6 with a lot of 7.x here and there.

Resetting the network to make it work correctly (again) is not normal, but do you do this programmatically based on some metric?

Brian Gillingham

bgil...@pitt.edu

Jul 12, 2017, 12:04:49 PM
to islandora
Diego,

I have been able to look at the resources for the various machines using our VCenter.  Since the issue only happens after at least an hour of nonstop processing, I am grateful that I can look at these metrics after the fact (and don't have to watch them all the time).

I updated our tuque library code and am trying the same ingest again - and I'll report back if that addresses my issue.

The garbage collection options you listed may help tune up our system.  I know there are a lot of settings that can improve performance, but nothing can really make network communications go any faster through DMZ.

thanks again!
Brian Gillingham

Giancarlo Birello

Jul 12, 2017, 12:11:53 PM
to isla...@googlegroups.com
I work with Islandora 7.x head.
I don't do that programmatically. At the moment I don't have an explanation, and I need more time to find out why it happens. It is an empirical way to get through a big ingest.


