"Unable to connect to IMS: Connection timed out: connect" error causes ACE-AM audit to fail, or crash ace-am Tomcat application altogether

Dave Rogers

Dec 20, 2013, 12:42:59 PM
to ace-...@googlegroups.com
Some of the ACE-AM audits that we run can take as long as 50 hours. That's mainly because of a network bottleneck that we hope to correct soon, and no fault of ACE-AM.

However, this does mean that the long audit runs tend to trip up from time to time with an error like the following:

Exception in batch thread edu.umiacs.ace.ims.api.IMSConnectionException: [201] Unable to connect to IMS: Connection timed out: connect
	at edu.umiacs.ace.ims.api.IMSService.handleException(IMSService.java:272)
	at edu.umiacs.ace.ims.api.IMSService.getRoundSummaries(IMSService.java:187)
	at edu.umiacs.ace.ims.api.TokenValidator.processBatch(TokenValidator.java:214)
	at edu.umiacs.ace.ims.api.TokenValidator.run(TokenValidator.java:267)
Caused by: java.net.ConnectException: Connection timed out: connect
	at java.net.DualStackPlainSocketImpl.connect0(Native Method)
	at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
	at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
	at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
	at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
	at java.net.PlainSocketImpl.connect(Unknown Source)
	at java.net.SocksSocketImpl.connect(Unknown Source)
	at java.net.Socket.connect(Unknown Source)
	at java.net.Socket.connect(Unknown Source)
	at sun.net.NetworkClient.doConnect(Unknown Source)
	at sun.net.www.http.HttpClient.openServer(Unknown Source)
	at sun.net.www.http.HttpClient.openServer(Unknown Source)
	at sun.net.www.http.HttpClient.<init>(Unknown Source)
	at sun.net.www.http.HttpClient.New(Unknown Source)
	at sun.net.www.http.HttpClient.New(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(Unknown Source)
	at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
	at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
	at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
	at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
	at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
	at com.sun.xml.internal.ws.api.pipe.Fiber._doRun(Unknown Source)
	at com.sun.xml.internal.ws.api.pipe.Fiber.doRun(Unknown Source)
	at com.sun.xml.internal.ws.api.pipe.Fiber.runSync(Unknown Source)
	at com.sun.xml.internal.ws.client.Stub.process(Unknown Source)
	at com.sun.xml.internal.ws.client.sei.SEIStub.doProcess(Unknown Source)
	at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(Unknown Source)
	at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(Unknown Source)
	at com.sun.xml.internal.ws.client.sei.SEIStub.invoke(Unknown Source)
	at com.sun.proxy.$Proxy99.getRoundSummaries(Unknown Source)
	at edu.umiacs.ace.ims.api.IMSService.getRoundSummaries(IMSService.java:183)
	... 2 more

I've attached an HTML file with an Activity Log display of such an error that occurred at Tue Dec 10 15:19:29 PST, only 8 minutes after the start of a file audit run. I've also attached the contents of the corresponding aceam.log file covering that audit run. To protect possibly sensitive information, I've replaced names of collections and paths with placeholders (e.g., <<PATH>>). Unfortunately, I no longer have a copy of the e-mail that was sent when the audit run aborted prematurely (although it was received).

In this case, the IMS connection error only caused the Audit run to end abruptly with an "Interrupted" status. In previous cases, the same IMS connection error has caused the entire ace-am Tomcat application to crash. It would then have to be restarted from the Tomcat Manager screen. I don't have a log file sample for those errors though.

Is the problem simply that ACE-AM can't connect to the IMS server for an extended time? I also see messages in the log file of the sort "Driver returned Item: /000013.tif error: false error msg: null hash: ...". Are those related? And are they of concern?

Because of the long audit run times, we do tend to get the IMS errors more often than not. Is there anything we can do from our end to better diagnose the problem? Can the IMS timeout period be extended in any way (in case we're experiencing intermittent connection problems from our remote location in northwest Canada)? Is there a way to catch the exception so that the audit run can proceed locally (with suitable warnings) if IMS is unreachable?
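One way to gather evidence from the ACE-AM host is to probe the IMS endpoint on a schedule and log how long each TCP connect takes. The sketch below uses only the standard `java.net.Socket` API with an explicit connect timeout; it is not part of ACE-AM, and the class name, host `ims.example.org`, and port are placeholders to be replaced with the IMS endpoint from your own ACE-AM settings.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Date;

// Hypothetical diagnostic helper; not part of ACE-AM.
class ImsProbe {

    /**
     * Tries to open a TCP connection to host:port within timeoutMillis.
     * Returns the elapsed time in milliseconds on success, or -1 if the
     * connection was refused, timed out, or the host could not be resolved.
     */
    static long probe(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (IOException e) {
            // SocketTimeoutException, ConnectException and UnknownHostException
            // are all IOExceptions, so every failure mode lands here.
            return -1L;
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint: substitute the IMS host and port you actually use.
        String host = args.length > 0 ? args[0] : "ims.example.org";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 80;
        long elapsed = probe(host, port, 10_000);
        System.out.println(new Date() + " " + host + ":" + port
                + (elapsed >= 0 ? " reachable in " + elapsed + " ms" : " unreachable"));
    }
}
```

Run from a scheduled task every few minutes, the output gives a timeline that can be compared against the timestamps of the IMSConnectionException entries in aceam.log.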

Thanks in advance for your help and advice. Merry Christmas and the best of the season to everyone in Maryland from us here in Yukon (where it was just about -40C = -40F this past Wednesday),

Dave Rogers
(on behalf of Yukon Archives)

aceam-activity-log-20131210.htm
aceam-ims-failure-20131210.log

shake

Dec 21, 2013, 12:35:17 AM
to ace-...@googlegroups.com
Hi Dave,

The server we run the IMS on did not have a clean kernel upgrade the other night, and unfortunately was down for some time. I'm going to see if we can possibly get some redundancy in case something like this happens in the future. It actually impacted one of the audits we have running as well :(. For reference, we have patch nights on the third Thursday of each month. If you ever have any long-running audits, I can tell the staff here to keep the server up; hopefully I can push forward with some redundancy efforts early next month and have it be less of a burden for everyone. If you'd like to know which dates specifically, here is a list of the maintenance days for the upcoming year.

As for a timeout period if the IMS cannot be reached - that's certainly an interesting idea and something I can look into. It is an annoyance to lose the state of an audit, though if all the items are registered you can resume by opening up a report and running an audit on the "corrupted" files ("corrupted" here meaning only that they were not audited correctly). I did add an "audit-only mode" which will validate checksums but not attempt to register new items or contact the IMS server. I believe ACE will fall back to this mode if it cannot connect to the server initially, but I don't think it attempts the fallback if the audit is already running. I think pausing the audit at that point would be the wisest thing to do, until it can reestablish a connection to the IMS. The crashing sounds worrisome; I'll see if I can make ACE crash while running audits. If it does happen again and you can get any errors, send them along and I'll put it on a high priority.
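The fallback described above (full audit when the IMS answers, checksum-only audit when it does not) can be sketched as a single decision made before the audit starts. This is an illustration only, not ACE's actual code: `AuditModeSketch`, `Mode`, and `ImsCheck` are hypothetical names, and `IOException` stands in for whatever exception the real IMS client throws.

```java
import java.io.IOException;

// Hypothetical illustration of an "audit-only" fallback; not ACE's implementation.
class AuditModeSketch {

    enum Mode { FULL, AUDIT_ONLY }

    /** Stand-in for any call that reaches the IMS server. */
    interface ImsCheck {
        void ping() throws IOException;
    }

    /** Chooses FULL when the IMS responds, AUDIT_ONLY when the check fails. */
    static Mode chooseMode(ImsCheck ims) {
        try {
            ims.ping();
            return Mode.FULL;
        } catch (IOException e) {
            // Checksums can still be validated locally without the IMS.
            System.err.println("IMS unreachable, validating checksums only: " + e.getMessage());
            return Mode.AUDIT_ONLY;
        }
    }

    public static void main(String[] args) {
        ImsCheck down = () -> { throw new IOException("Connection timed out: connect"); };
        System.out.println(chooseMode(down)); // prints AUDIT_ONLY
    }
}
```

Because the decision is taken once at start-up, a connection that drops mid-audit is not covered, which matches the gap described above.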

Lastly, the message you're seeing in your log file is a little hard to read if you don't know what to look for. Here's what it is:
Item Returned:  /some/path/to/an/item
error: true | false
error msg: "An error message" | null
hash: checksum of the item

So it's nothing to worry about, just a bit of not-so-great logging :)
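For anyone who wants to scan aceam.log for entries where the error field is true, the four fields can be pulled apart with a regular expression. This is a rough sketch based only on the sample line quoted earlier in the thread ("Driver returned Item: /000013.tif error: false error msg: null hash: ..."); the class name is made up, and the exact layout of real log lines is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for "Driver returned Item" log lines; the layout is
// inferred from one sample line and may not match every ACE-AM version.
class DriverLogLine {

    private static final Pattern LINE = Pattern.compile(
            "Driver returned Item:\\s*(.+?)\\s+error:\\s*(true|false)"
            + "\\s+error msg:\\s*(.*?)\\s+hash:\\s*(\\S*)");

    final String item;          // path of the audited item
    final boolean error;        // true if the driver reported a problem
    final String errorMessage;  // null when the log shows "null"
    final String hash;          // checksum of the item

    private DriverLogLine(String item, boolean error, String errorMessage, String hash) {
        this.item = item;
        this.error = error;
        this.errorMessage = errorMessage;
        this.hash = hash;
    }

    /** Returns the parsed fields, or empty if the line is not a driver entry. */
    static Optional<DriverLogLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        String message = "null".equals(m.group(3)) ? null : m.group(3);
        return Optional.of(new DriverLogLine(
                m.group(1), Boolean.parseBoolean(m.group(2)), message, m.group(4)));
    }
}
```

A path that itself contains the text " error: " would confuse this pattern, so treat it as a quick filter rather than a robust parser.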

Happy holidays to you all as well, the weather is treating us well - a very mild 55F.

-Mike

Dave Rogers

Jan 31, 2014, 1:37:09 PM
to ace-...@googlegroups.com
Hi Mike,

Well, we had another "Unable to connect to IMS: Connection timed out" error that aborted an audit about 33 hours into the run at 6am (PST) January 29th. As far as I know, the ACE application didn't crash.

I've attached an extract from the aceam.log file showing the beginning of the audit (27/Jan/2014:14:56:03) and then the error(s) at the end (29/Jan/2014:06:02:28).

There are actually two errors: first the "IMSConnectionException: [201] Unable to connect to IMS: Connection timed out: connect", and then the "[Audit] Error reading file: <<PATH>>\89_54_21.tif" error that follows. Could the two be related? One reason I ask is that we often get "Error reading file" errors for one particular path in a collection. Not every file in that path fails, but the ones that do fail do so consistently. The log shows another file in that folder being read without problem. We don't have any trouble reading those same files manually. Can you suggest what could be causing that file error?

In any case, the IMS error is the one causing the most trouble. Do you happen to know if there was a service hiccup at around 9am (EST) on the 29th?

Thanks for your help,

Dave Rogers
(on behalf of Yukon Archives)
aceam-ims-failure-20140129.log

Michael Ritter

Jan 31, 2014, 3:22:43 PM
to ace-...@googlegroups.com
Hi Dave,

I wasn't aware of any downtime of our server here. Looking through our logs, I do see a stop in traffic to it around 9AM, indicating that somewhere along the chain the network was interrupted (note: this is just going off of the IMS logs; I didn't get any notification through our normal network monitoring).

For the errors - the IMS exception certainly triggered the second by interrupting the audit and stopping the file read. I'll need to see if I can have audits end more gracefully; I've noticed things can get pretty nasty in the logs when abruptly stopping them. If the file reading errors are popping up without the IMSConnectionException, that's of some concern. I'll take a look at where some of the exceptions were thrown to see if I can spot anything unusual.

One thing which will interest you on this though - I have made some commits recently which will allow ACE to block if there is a connection timeout until a certain amount of time has elapsed. It's rather crude at the moment, but it gets the job done. I hope to refine it more to get it in the next release, and in the meantime I will update the beta version to include it if you would like to try it.
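The general idea of blocking on a timeout until a deadline passes can be sketched as below. This is not the code from those commits: `BlockingRetry`, `ImsCall`, and the parameter names are invented for illustration, and `IOException` stands in for the IMS connection exception.

```java
import java.io.IOException;

// Hypothetical sketch of "block and retry until a deadline"; not ACE's code.
class BlockingRetry {

    /** Stand-in for a remote call to the IMS that may fail with a timeout. */
    interface ImsCall<T> {
        T call() throws IOException;
    }

    /**
     * Repeats the call until it succeeds. After each failure the calling
     * thread sleeps for pauseMillis; once maxWaitMillis has elapsed in
     * total, the last failure is rethrown so the caller can stop the audit.
     */
    static <T> T callBlocking(ImsCall<T> call, long maxWaitMillis, long pauseMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (true) {
            try {
                return call.call();
            } catch (IOException e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw e; // waited long enough; give up
                }
                // Never sleep past the deadline, and always sleep at least 1 ms.
                long remainingMillis = Math.max(1L, remainingNanos / 1_000_000L);
                Thread.sleep(Math.min(pauseMillis, remainingMillis));
            }
        }
    }
}
```

With a call such as `callBlocking(() -> fetchSummaries(), 30 * 60 * 1000L, 60 * 1000L)`, where `fetchSummaries` is a placeholder for the real request, a short network interruption would pause the audit thread for up to half an hour instead of ending the run.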

-Mike

Dave Rogers

Feb 5, 2014, 1:10:16 PM
to ace-...@googlegroups.com
Hi Mike,

Yes, we would definitely welcome an update to ACE that could handle short network interruptions more gracefully. Would that be included in version 1.9 when it becomes stable?

I'll put aside the file reading errors for now, but if we see them again in a cleanly-finished audit I'll forward along the logs then.

As always, thanks so much for your help,

Dave
(on behalf of Yukon Archives)