Task failed to report status exceptions

Jonathan Herzig

Mar 17, 2011, 11:08:40 AM
to Jaql Developers
Hi,

I'm running Jaql on a cluster of 6 machines.

When I run my jobs on small data, they run smoothly.

However, when I use larger data (~4 GB), the following occurs:

I can see that a lot of tasks that had already completed go back to
the "pending" state.
When this happens I get exceptions that look like:

"Task attempt_201103161639_0002_m_000000_0 failed to report status for
607 seconds. Killing!"

Most of the time the cluster gets stuck after a while, apparently because
it runs out of memory, and has to be restarted.

What can possibly be wrong?
Are there any parameters I should change?

Thanks,
Jonathan

Vuk Erecegovac

Mar 17, 2011, 12:59:13 PM
to jaql-...@googlegroups.com
A couple of things to look at:

1. Is the machine with the long-running task thrashing?

    If so, check whether too many tasks (each one a JVM) are spawned concurrently, or
    whether the aggregate memory requirement across those JVMs is too high.

2. Does your task include a long-running function call?

    In our use, we have certain UDFs that can take a long time because both they and the data
    are complex. In such cases we wrap the suspect function call in "timeout" so that a given
    expression cannot exceed a certain time limit. If the limit is reached, an exception is thrown.
    If you want your task to continue after such an exception, you can place a "catch" in Jaql around
    the timeout, which lets you control whether to continue with the computation or not. The docs for
    these builtin functions have been updated; please look for "timeout" and "catch" here:

    http://code.google.com/p/jaql/wiki/Builtin_functions
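
To make the timeout-plus-catch idea concrete, here is a minimal sketch of the pattern in Python. It is not Jaql syntax and not how Jaql implements these builtins (see the wiki page above for the real signatures). The names `with_timeout`, `guarded`, and `slow_udf` are hypothetical and exist only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

# Shared worker pool. A call that times out keeps running in its worker
# thread (Python cannot forcibly stop it), so this only bounds the wait.
_pool = ThreadPoolExecutor(max_workers=4)

def with_timeout(fn, arg, limit_seconds):
    """Return fn(arg), or raise CallTimeout after limit_seconds of waiting."""
    return _pool.submit(fn, arg).result(timeout=limit_seconds)

def guarded(fn, arg, limit_seconds, fallback=None):
    """The 'catch' part: return `fallback` instead of failing on a timeout."""
    try:
        return with_timeout(fn, arg, limit_seconds)
    except CallTimeout:
        return fallback

def slow_udf(record):
    """Hypothetical UDF whose running time depends on the record."""
    time.sleep(record["cost"])
    return record["id"]

# Hypothetical input: the second record is the pathological one.
records = [
    {"id": 1, "cost": 0.0},
    {"id": 2, "cost": 1.0},
    {"id": 3, "cost": 0.0},
]

# One slow record no longer stalls the whole pass over the data.
results = [guarded(slow_udf, r, limit_seconds=0.2) for r in records]
```

The point of the pattern is that one pathological record yields a fallback value instead of holding the task past the framework's status-report limit. Whether to skip, log, or abort on a timeout is a choice made inside the catch.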

Let's see if either of these helps your situation; if not, we can explore further.
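
On the first point, a rough way to check the memory budget is to multiply the number of task slots per node by the per-task JVM heap and compare that with the RAM actually available. The sketch below assumes a Hadoop 0.20-style setup, where the slot counts come from `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` and the heap from the `-Xmx` value in `mapred.child.java.opts`. The function names and all numbers are hypothetical, not taken from your cluster.

```python
def worst_case_task_memory_mb(map_slots, reduce_slots, child_heap_mb):
    """Heap needed if every task slot on one node is occupied at once."""
    return (map_slots + reduce_slots) * child_heap_mb

def is_overcommitted(node_ram_mb, reserved_mb, map_slots, reduce_slots, child_heap_mb):
    """True if full slots could demand more heap than the node can spare.

    reserved_mb is what you set aside for the OS and the Hadoop daemons.
    """
    available_mb = node_ram_mb - reserved_mb
    return worst_case_task_memory_mb(map_slots, reduce_slots, child_heap_mb) > available_mb

# Hypothetical node: 4 GB RAM, 1 GB reserved, 4 map + 2 reduce slots, 1 GB heap each.
needed_mb = worst_case_task_memory_mb(map_slots=4, reduce_slots=2, child_heap_mb=1024)
overcommitted = is_overcommitted(4096, 1024, 4, 2, 1024)
```

If the check comes out true for your nodes, lowering the slot counts or the child heap is the usual first thing to try. This ignores non-heap JVM overhead, so treat the result as a lower bound on real usage.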

Vuk