aborted machine

P. Oscar Boykin

Jul 14, 2010, 11:38:39 AM

to acisp2...@googlegroups.com

Came in this morning, excited to check on how my big job was
progressing, and lo and behold, my virtual machine was not running and
VirtualBox said it was aborted.

What does this mean for my jobs? Are they lost?

If the submitter crashes, what then?
--
P. Oscar Boykin http://boykin.acis.ufl.edu
Assistant Professor, Department of Electrical and Computer Engineering
University of Florida

David Isaac Wolinsky

Jul 14, 2010, 11:40:30 AM

to acisp2...@googlegroups.com

Possibly. Restart the VM ASAP; the jobs may not have timed out yet.
If the jobs have timed out, the work done will, unfortunately, be lost.
Any clues as to why it aborted?

Regards,
David

Renato Figueiredo

Jul 14, 2010, 11:46:58 AM

to acisp2pusers

Peeking at CondorView: from the status of the pool, I think the jobs unfortunately timed out. If the machine running a job is unable to contact the job submitter for some time, it terminates the job. If I recall correctly, the default value is on the order of a couple of hours; I don't recall whether it can be configured on the submit side.
--rf

--
Dr. Renato J. Figueiredo
Associate Professor
ACIS Lab - ECE - University of Florida
UF Site Director, Center for Autonomic Computing
http://byron.acis.ufl.edu
ph: 352-392-6430

P. Oscar Boykin

Jul 14, 2010, 6:18:07 PM

to acisp2...@googlegroups.com

When I run condor_q, I see 68 jobs running, 311 idle.

Should I stop those? Why are they idle?

PS: I want to love this system, but I've wrestled with it a lot. I'm
sure some of our users might not have this much patience.

--

Renato Figueiredo

Jul 14, 2010, 8:12:25 PM

to acisp2pusers

In general, unless a job is in the "H" (hold) state, it is just a matter of time before Condor schedules it; you shouldn't need to restart any manually. Idle means the jobs are just sitting in the queue, waiting to be scheduled.

(Perhaps they should have chosen a better name for this state, like Queued instead of Idle)
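
For anyone who wants to tally queue states themselves, here is a minimal Python sketch. It assumes the usual Condor JobStatus codes (1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, 5 = Held) and that you feed it one code per line, for example from something like `condor_q -format "%d\n" JobStatus`. Both the code table and that command line are from memory, so check them against the manual for your Condor version. The function name `summarize` is just illustrative.

```python
from collections import Counter

# JobStatus codes as I understand them from the Condor manual;
# verify against the documentation for your installed version.
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


def summarize(status_lines: str) -> Counter:
    """Count jobs per state, given one integer JobStatus code per line."""
    counts = Counter()
    for line in status_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        code = int(line)
        counts[JOB_STATUS.get(code, f"Unknown({code})")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample input: two running jobs, three idle, one held.
    sample = "2\n2\n1\n1\n1\n5\n"
    for state, n in sorted(summarize(sample).items()):
        print(f"{state}: {n}")
```

Idle jobs in this tally are simply queued; only Held jobs should need manual attention.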

The worrisome thing is that if your VM hangs again, you may lose the jobs that were making progress; it might be a VirtualBox bug that will show up again. I believe the parameter to change if you want to increase the time before a job is given up on is JobLeaseDuration (the default in the appliance is 14400 seconds, i.e. 4 hours), but this is a job submission parameter, so it only applies to jobs you submit after it is set.
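
As a rough sketch of setting a longer lease at submit time: to my knowledge the submit-file command is spelled `job_lease_duration` and sets the JobLeaseDuration job attribute, but please confirm that against the `condor_submit` documentation for your version. The Python helper below only builds the text of a minimal submit description; the executable name `my_job.sh`, the vanilla universe, and the 8-hour value are placeholders, not recommendations.

```python
def submit_description(executable: str, lease_seconds: int, count: int = 1) -> str:
    """Return the text of a minimal Condor submit description file.

    Assumes the submit command for the lease is `job_lease_duration`
    (in seconds); the other settings are illustrative placeholders.
    """
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    if count < 1:
        raise ValueError("count must be at least 1")
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"job_lease_duration = {lease_seconds}",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job with an 8-hour lease instead of the appliance's 14400 s.
    print(submit_description("my_job.sh", 8 * 3600), end="")
```

Since the lease is fixed at submission, jobs already in the queue keep whatever value they were submitted with.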

--rf

David Isaac Wolinsky

Jul 23, 2010, 12:10:08 PM

to acisp2...@googlegroups.com

I just peeked at your machine because I noticed you had some jobs
running, and I have a few observations:

1) Over the past 24 hours, I found a pretty nasty bug in IPOP that was
causing it to go on and offline a lot. That could affect a submission
site, but I don't know how widespread the effect of the bug was; after
all, it didn't really become apparent until this past week.
2) Your VM is low on memory. For each job you submit, you're using 3M
(though 2.5 of that is shared) on your local machine, which is one
possible reason your VM is crashing.
3) Your VM is using swap space, another indication that things may be a
bit wonky.
4) Your jobs are probably going to be coming in at last. With all the
weirdness the system has experienced over the past 2 days, nothing
productive occurred; I hope that with the bug fixed, this will no longer
be a concern.
5) One of our sites didn't have IA32 libraries installed. I fixed that
(it was a configuration error on their part).

I've actually noticed that the newer Linux kernels are getting crappier
and crappier; it seems like someone is messing around with the memory
manager in a very user-unfriendly way.

As of now, I suspect you might have enough memory to complete your
current jobs, but I would definitely add more RAM before submitting
more jobs, or else run fewer jobs in parallel.
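
To put rough numbers on the memory point, here is a back-of-the-envelope Python sketch. It assumes "3M" means about 3 MB per job, that the 2.5 MB shared portion is counted once no matter how many jobs there are, and that the remaining 0.5 MB is private to each job. Those are my reading of the figures above, not measurements, and the function name is made up.

```python
def estimated_memory_mb(jobs: int, per_job_mb: float = 3.0, shared_mb: float = 2.5) -> float:
    """Estimate submitter-side memory for a number of jobs.

    Assumption: the shared portion is paid once, and each job adds only
    its private portion (per_job_mb - shared_mb).
    """
    if jobs < 0:
        raise ValueError("jobs must be non-negative")
    if jobs == 0:
        return 0.0
    private_mb = per_job_mb - shared_mb
    return shared_mb + jobs * private_mb


if __name__ == "__main__":
    # Job counts taken from the condor_q output quoted earlier in the thread.
    for n in (68, 68 + 311):
        print(f"{n} jobs: about {estimated_memory_mb(n):.1f} MB")
```

Under these assumptions the cost grows by roughly half a megabyte per job, so the total depends mostly on how many jobs are in flight at once.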

Cheers,
David
