idpf validator site went down


Marisa DeMeglio

Jan 31, 2012, 9:03:11 PM
to daisy-pip...@googlegroups.com
Hi all,

I noticed the IDPF validator site went down.  I restarted it and it's back up.  Part of the log is pasted below.  Any thoughts?
We're OK on disk space, but our tmp directory was full of about 1500 files.  Could it be that they were still being held open, and that eventually brought the system down?
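
If it helps next time, something like this should show whether the descriptors are actually still held open (the path and <pid> are placeholders for our setup):

# count leftover files in the application tmp directory
ls /path/to/epubcheck-web/tmp | wc -l
# count file descriptors currently held by the Play process
lsof -p <pid> | wc -l
# same count without lsof, via /proc
ls /proc/<pid>/fd | wc -l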

Marisa

# last known good validation
2012-01-30 08:07:44,134 INFO  ~ Validation from IP xxx.xxx.xx.xxx
2012-01-30 08:07:44,134 INFO  ~ 0 issues found in file 'xxxxxxxxxxxx.epub'.
2012-01-30 08:07:55,288 INFO  ~ Visitor user-agent [Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; chromeframe/16.0.912.77; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E; InfoPath.2)]
2012-01-30 08:07:55,288 INFO  ~ Visitor IP 175.193.45.248

# errors begin
2012-01-30 08:07:56,386 ERROR ~

@695oha4a6
Internal Server Error (500) for request POST /application/validate

Execution exception (In /app/controllers/Application.java around line 76)
RuntimeException occured : java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cannot run program "java": java.io.IOException: error=24, Too many open files

play.exceptions.JavaExecutionException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cannot run program "java": java.io.IOException: error=24, Too many open files
        at play.mvc.ActionInvoker.invoke(ActionInvoker.java:231)
        at Invocation.HTTP Request(Play!)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cannot run program "java": java.io.IOException: error=24, Too many open files
        at org.daisy.validation.epubcheck.EpubcheckBackend.validate(EpubcheckBackend.java:49)
        at org.daisy.validation.epubcheck.EpubcheckBackend.validate(EpubcheckBackend.java:30)
        at org.daisy.validation.epubcheck.EpubcheckBackend.run(EpubcheckBackend.java:24)
        at controllers.Application.runEpubcheck(Application.java:76)
        at controllers.Application.validate(Application.java:57)
        at play.mvc.ActionInvoker.invokeWithContinuation(ActionInvoker.java:548)
        at play.mvc.ActionInvoker.invoke(ActionInvoker.java:502)
        at play.mvc.ActionInvoker.invokeControllerMethod(ActionInvoker.java:478)
        at play.mvc.ActionInvoker.invokeControllerMethod(ActionInvoker.java:473)
        at play.mvc.ActionInvoker.invoke(ActionInvoker.java:161)
        ... 1 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cannot run program "java": java.io.IOException: error=24, Too many open files
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
        at java.util.concurrent.FutureTask.get(FutureTask.java:111)
        at org.daisy.validation.epubcheck.EpubcheckBackend.validate(EpubcheckBackend.java:43)
        ... 10 more
Caused by: java.lang.RuntimeException: Cannot run program "java": java.io.IOException: error=24, Too many open files
        at org.daisy.validation.epubcheck.CommandExecutor.run(CommandExecutor.java:56)
        at org.daisy.validation.epubcheck.EpubcheckBackend.doValidate(EpubcheckBackend.java:59)
        at org.daisy.validation.epubcheck.EpubcheckBackend.access$000(EpubcheckBackend.java:16)
        at org.daisy.validation.epubcheck.EpubcheckBackend$1.call(EpubcheckBackend.java:39)

Romain Deltour

Feb 1, 2012, 2:41:16 AM
to daisy-pip...@googlegroups.com
Huh.

There are some leads in this StackOverflow question:

It definitely seems we're not releasing some file descriptors there. The problem is it's really a pain to reproduce and debug…
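
One way to narrow it down might be to watch which descriptors keep piling up on the live server; a rough sketch (with <pid> as a placeholder for the Play process id):

# snapshot the open descriptors of the JVM, grouped by target, most frequent first
lsof -p <pid> | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20
# watch the total count grow over time
watch -n 5 'lsof -p <pid> | wc -l'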

Romain.

Jostein Austvik Jacobsen

Feb 1, 2012, 5:20:54 AM
to daisy-pip...@googlegroups.com
The EpubcheckBackend is a singleton, and in Play!, singletons persist between requests. Could this cause file handles in, for instance, EpubcheckBackend.validate(File) to escape garbage collection and remain open?

Jostein


Romain Deltour

Feb 1, 2012, 10:15:33 AM
to daisy-pip...@googlegroups.com
I've pushed a fix (hopefully). The root cause was this Java bug:

Romain.

Marisa DeMeglio

Feb 1, 2012, 1:40:17 PM
to daisy-pip...@googlegroups.com
Great, I'm going to restart the site now.  We can monitor it and see how it holds up.  I'm going to try to check it in the morning and evening; if anyone else is inclined to do the same, that would be great.  Theoretically, there is an AWS tool that would alert us if the site goes down, but I'm not sure which one it is or whether it costs extra.  I'll look into it when I get a chance.

Marisa DeMeglio

Feb 9, 2012, 2:06:38 PM
to daisy-pip...@googlegroups.com
Well, the site is still up, but we are accumulating temp files again.  I've manually deleted them for now, but it seems like the problem is still around.  There were around 400 temp files from Feb 1 through today, so not every upload is hanging around, but it's still enough to cause problems.

For now I think I'll set up a once-a-day cron script that deletes the tmp files (at the moment I'm just running a command to delete anything more than 10 minutes old, to avoid causing errors for anyone currently on the site).  We'll have to look into it in greater depth after the UK workshop, when we have time.
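
For the record, the kind of crontab entry I have in mind looks roughly like this (the path is a placeholder):

# once a day at 03:00, delete files in tmp/ that are more than 10 minutes old
0 3 * * * find /path/to/epubcheck-web/tmp -maxdepth 1 -type f -mmin +10 -delete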

Romain Deltour

Feb 9, 2012, 2:34:11 PM
to daisy-pip...@googlegroups.com
Argh, damn.

It's weird though that it only occurs from time to time… Anyway, a cron job sounds like a safe temporary solution. I'll have a look ASAP after the workshop.

Romain.

Marisa DeMeglio

Feb 14, 2012, 2:36:26 AM
to daisy-pip...@googlegroups.com
Argh again.  It went down again today.  I have been using a cron job to delete tmp files once per day, so there weren't even that many in the tmp dir, but I still saw the error about too many open files.  Perhaps that error has nothing to do with tmp files.

I have an automated monitoring service that failed to notify me :|   So, until after the workshop, I'll just have to check the site a few more times per day than I'm already doing.

Markus Gylling

Feb 14, 2012, 4:54:21 AM
to daisy-pip...@googlegroups.com
I just went through the EPUBCheck codebase and made sure input and output streams are properly closed. It seems unlikely to me that this is the problem, but even so. The changes are committed to the branch, so you can either build it yourself or ask me to send a cut whenever you are working on this next. /markus

Romain Deltour

Feb 20, 2012, 12:35:56 PM
to daisy-pip...@googlegroups.com
Hi Marisa,

I hope you had a safe trip back from the UK. I just had a quick look at the epubcheck issue, and in my experiments all files are deleted from the temporary directory.

Were the non-deleted files kept in the tmp/ or tmp/upload directory? Do you still have a copy of the log file(s)?

Thx,
Romain.

Marisa DeMeglio

Feb 21, 2012, 11:15:56 AM
to daisy-pip...@googlegroups.com
See attached for the log file.

The files that should have been deleted were found in the tmp directory.  The uploads dir only ever contains a handful of files, each 4096 bytes.

application.log

Marisa DeMeglio

Feb 21, 2012, 11:33:23 AM
to daisy-pip...@googlegroups.com
And, here is the correct log file!
application.log

Romain Deltour

Feb 21, 2012, 11:42:56 AM
to daisy-pip...@googlegroups.com
Hmmm… nothing really helpful in the log; it simply starts saying "too many open files" and then repeats it for all further requests. The previous logs, up to this one in particular, were fine, correct?

Could we analyze the logs to get an estimate of how many requests it takes to reach this buggy state?

R.


Marisa DeMeglio

Feb 21, 2012, 12:24:21 PM
to daisy-pip...@googlegroups.com
In this case, there were 10 log files, 8 of which were fine.  The file immediately before the one I sent is where errors started appearing.

There were approximately 21,995 validations before the first "too many open files" error occurred.

Marisa DeMeglio

Feb 22, 2012, 6:10:33 PM
to daisy-pip...@googlegroups.com
By the way, I ran an overnight test of 22k validations on a locally running epubcheck site, and there were no stray tmp files, nor were there any crashes.  Maybe it has something to do with the number of concurrent requests (in this case, 1 at a time).
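
If concurrency is the missing ingredient, a quick sketch for hammering the local site with parallel uploads would be something like this (the form field name and port are guesses, so they may need adjusting):

# 300 uploads, 10 at a time, against the local validate endpoint
seq 1 300 | xargs -P 10 -I{} curl -s -o /dev/null -F "file=@test.epub" http://localhost:9000/application/validate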

Marisa DeMeglio

Mar 4, 2012, 3:21:11 PM
to daisy-pip...@googlegroups.com
The latest is that we incorporated a new build of epubcheck last week, which featured more safety checks for file handle closing.  However, it doesn't seem to have had an effect, as the site crashed again today.  I've upped the file limits on the server as described here: http://lj4newbies.blogspot.com/2007/04/too-many-open-files.html

Except with huge limits:
* soft nofile 66535
* hard nofile 66535

We'll see if that has any effect.  It may just mask the underlying problem.
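
One thing worth double-checking, since limits.conf only takes effect for sessions opened after the change, is that the running JVM actually picked up the new limit (<pid> is a placeholder):

# the "Max open files" row should show the new soft/hard values
grep "open files" /proc/<pid>/limits
# limit for the current shell
ulimit -n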

When I query the process with lsof -p xxx | wc -l, there are never more than 5 files open, so I think whatever is leaving them hanging is outside of the Play framework code and running as a separate process. This would suggest epubcheck, because it runs separately. However, I haven't been able to reproduce it locally, so that makes me think it's not epubcheck.
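
To catch the separate epubcheck process in the act, a rough sketch would be to count descriptors for every java process, not just the Play one:

# descriptor count per java process (run as the same user or as root)
for pid in $(pgrep java); do echo "$pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l)"; done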

Just to review, the errors we're getting look like this:

2012-03-03 09:29:03,439 ERROR ~ serveStatic for request GET /public/javascripts/jquery-1.5.2.min.js
java.io.FileNotFoundException: /home/ec2-user/src/daisy-pipeline.epubcheck/epubcheck-web/public/javascripts/jquery-1.5.2.min.js (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at play.server.PlayHandler.serveStatic(PlayHandler.java:869)
        at Invocation.HTTP Request(Play!)

and

@69fdki32b
Internal Server Error (500) for request POST /application/validate

Oops: IllegalStateException
An unexpected error occured caused by exception IllegalStateException: Error when handling upload

play.exceptions.UnexpectedException: Unexpected Error
        at play.mvc.ActionInvoker.invoke(ActionInvoker.java:258)
        at Invocation.HTTP Request(Play!)
Caused by: java.lang.IllegalStateException: Error when handling upload
        at play.data.parsing.ApacheMultipartParser.parse(ApacheMultipartParser.java:591)
        at play.mvc.Scope$Params.checkAndParse(Scope.java:386)
        at play.mvc.Scope$Params.all(Scope.java:467)
        at play.mvc.ActionInvoker.getActionMethodArgs(ActionInvoker.java:637)
        at play.mvc.ActionInvoker.invokeControllerMethod(ActionInvoker.java:478)
        at play.mvc.ActionInvoker.invokeControllerMethod(ActionInvoker.java:473)
        at play.mvc.ActionInvoker.invoke(ActionInvoker.java:161)
        ... 1 more
Caused by: play.data.parsing.ApacheMultipartParser$IOFileUploadException: Processing of multipart/form-data request failed. /home/ec2-user/src/daisy-pipeline.epubcheck/epubcheck-web/tmp/upload_00016000.tmp (Too many open files)

Any thoughts?

Romain Deltour

Mar 4, 2012, 4:14:10 PM
to daisy-pip...@googlegroups.com
No epiphany since the last time I checked…

Do you still have some ghost files hanging around?

Btw, when you ran your local stress test: did you run all requests against the same EPUB file? Maybe Java doesn't acquire a new file handle if the file has already been opened…

R.

Romain Deltour

Mar 5, 2012, 9:52:56 AM
to daisy-pip...@googlegroups.com
I've run some stress tests against our epubcheck-backend project:

1. running 22000 sequential validations on different files
2. running 300 "parallel" validations (actually 10 at a time because of our built-in limit)

Both tests were successful.

I can't think of any cause other than a low file limit on the server; I suppose we'll see over the next few days.

BTW, do you have some Google Analytics or similar data to investigate the case?

R.

Marisa DeMeglio

Mar 5, 2012, 12:58:15 PM
to daisy-pip...@googlegroups.com
All I have at the moment are the log files, from which we can figure out roughly how many validations occur before the site goes down (it's always around 22k, which takes about a week to accumulate), and I suppose we could also estimate the number of simultaneous requests, since all requests are timestamped.  What kinds of data do you think would be helpful?
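
A rough way to pull that out of the logs (assuming the "Validation from IP" lines and the timestamp format shown earlier in this thread) could be:

# validations per second, busiest seconds first
grep "Validation from IP" application.log | cut -d',' -f1 | sort | uniq -c | sort -rn | head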

Marisa DeMeglio

Mar 5, 2012, 12:59:29 PM
to daisy-pip...@googlegroups.com

On Mar 4, 2012, at 1:14 PM, Romain Deltour wrote:

> No epiphany since the last time I checked…
>
> Do you still have some ghost files hanging around ?

This time there were about 400 tmp files.

>
> Btw, when you ran your local stress test: did you run all requests against the same EPUB file ? Maybe Java doesn't acquire a new file handle if the file has already been opened…

Yes, all against the same file. Good point.

Romain Deltour

Mar 5, 2012, 1:46:13 PM
to daisy-pip...@googlegroups.com
Yes, I was thinking of getting the number of simultaneous requests, to see whether the server was suddenly hit by a big burst or whether it's just casual use…

R.

Romain Deltour

Mar 5, 2012, 1:47:06 PM
to daisy-pip...@googlegroups.com

On 5 Mar 2012, at 18:59, Marisa DeMeglio wrote:

>
> On Mar 4, 2012, at 1:14 PM, Romain Deltour wrote:
>
>> No epiphany since the last time I checked…
>>
>> Do you still have some ghost files hanging around ?
>
> This time there were about 400 tmp files.

Wow.

>
>>
>> Btw, when you ran your local stress test: did you run all requests against the same EPUB file ? Maybe Java doesn't acquire a new file handle if the file has already been opened…
>
> Yes, all against the same file. Good point.

I ran my stress tests against different files; it made no difference…

Romain Deltour

Mar 5, 2012, 1:53:45 PM
to daisy-pip...@googlegroups.com
How did you run your load test, btw?

I've started playing with JMeter:
http://jmeter.apache.org/

I just launched a big stress test involving 50 users hitting a local server 300 times each. The server was started in PROD mode with a 10-thread execution pool.

I'll report the result tomorrow :)

R.

Marisa DeMeglio

Mar 5, 2012, 2:06:23 PM
to daisy-pip...@googlegroups.com

On Mar 5, 2012, at 10:47 AM, Romain Deltour wrote:

>
> On 5 Mar 2012, at 18:59, Marisa DeMeglio wrote:
>
>>
>> On Mar 4, 2012, at 1:14 PM, Romain Deltour wrote:
>>
>>> No epiphany since the last time I checked…
>>>
>>> Do you still have some ghost files hanging around ?
>>
>> This time there were about 400 tmp files.
>
> Wow.

And since I restarted the site yesterday, 52 more have accumulated.

Marisa DeMeglio

Mar 5, 2012, 2:08:37 PM
to daisy-pip...@googlegroups.com

On Mar 5, 2012, at 10:53 AM, Romain Deltour wrote:

> How did you run your load test btw ?

Nothing special, just a bunch of command-line Ruby programs (30 or so) that call the server a zillion times each. I'm not sure it was a great test because I don't know how the calls are interspersed. JMeter sounds interesting -- looking forward to the results tomorrow <grin>.

Romain Deltour

Mar 6, 2012, 8:01:01 AM
to daisy-pip...@googlegroups.com
FYI, I could run more than 21000 requests with no failure before JMeter ran out of memory (it was keeping the request responses in memory). I'll try again tonight with less reporting to reach 30000 requests, but it seems to indicate that all is fine with the app and framework, and it would confirm that the issue comes from a limitation on our EC2 server, which should hopefully be fixed now that you increased the limit. Let's wait and see how the live server reacts.

R.

Marisa DeMeglio

Mar 19, 2012, 1:25:57 PM
to daisy-pip...@googlegroups.com
Well, the site hasn't crashed since March 4th, so I think increasing the file limits helped, although who knows if this approach is just masking the problem. One issue is that files are lingering in the tmp directory and eventually we'll run out of disk space.

I'm going to be out of town this Wednesday through next Monday (March 21-26), and I won't have my computer with me, so I'll clear the tmp dir and restart the server before I leave to ensure that it stays alive while I'm gone.

Marisa
