Grid Appliance locked up, not responding to SSH


Nate

Nov 5, 2009, 3:47:04 PM
to Archer User's Group
I have (had?) a few hundred jobs submitted (two different clusters).
"top" indicates that VirtualBox is using 100% of one of my physical
CPUs, and the Grid Appliance won't respond, either through the
VirtualBox console or over SSH.

Is it possible that the grid appliance is still running, or is it
likely dead in the water? If I "pull the plug", is there any way to
reliably start my jobs up again where they left off, or will I have to
inspect the logs and start new clusters for the remaining jobs?

The lock-up occurred soon after submission of the second cluster of
jobs (relatively lightweight compared to the larger cluster of jobs
that was finishing up), presumably while the jobs were being
dispatched.

- Nathan

David Isaac Wolinsky

Nov 5, 2009, 8:56:54 PM
to archer-us...@googlegroups.com
As you put it, it is likely dead in the water. I suspect your VM ran
out of memory, but if you can't access the VM it's hard to debug. What
does top say the memory usage is? If you can't SSH in and you can't
reach it through the VM console, I am afraid you have no other choice
but to do a hard reboot. I don't know what state Condor will be in after
restarting. Sorry :(. After you reboot, could you check the syslog and
messages.log to see if there is anything interesting that might explain
what happened?
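
If it helps, here is a rough sketch of one way to look for signs of an
out-of-memory kill in those logs once the VM is back up. The file paths
and the function names (find_oom_lines, scan_log) are assumptions for
illustration, not something the appliance ships with; the search strings
are the ones the Linux kernel usually logs when its OOM killer fires.

```python
import re

# Substrings the Linux kernel typically logs when the OOM killer fires.
OOM_PATTERN = re.compile(r"oom-killer|Out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file; a missing or unreadable file yields no hits."""
    try:
        with open(path, errors="replace") as handle:
            return find_oom_lines(handle)
    except OSError:
        return []


if __name__ == "__main__":
    # Assumed locations; adjust to wherever the appliance keeps its logs.
    for path in ("/var/log/syslog", "/var/log/messages"):
        for hit in scan_log(path):
            print(f"{path}: {hit}")
```

No hits would not rule out memory exhaustion (a hard lock-up may never
get the chance to log anything), but a hit would be fairly strong
evidence for it.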

Regards,
David

Nathan Blythe

Nov 5, 2009, 11:09:49 PM
to archer-us...@googlegroups.com
Thanks for the input. The virtual machine certainly didn't come near
depleting the host's memory, and I don't think it ran up against its
VirtualBox-configured memory limit, but I'll have to double-check the
latter tomorrow.

The host indicated lots of network traffic even after the virtual
machine stopped responding, and the host wasn't doing anything else,
but maybe that was just the condor network banging on the door.

I'll reboot it tomorrow and with any luck I should be able to figure
out where it left off.
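
For working out where things left off, a minimal sketch along these
lines might do, assuming the submit files named a user log (the "log ="
setting) and that the log survived the reboot. It relies on the event
headers Condor writes to that log, where code 000 marks a submission and
005 a termination; the function name unfinished_jobs is just
illustrative.

```python
import re

# Event header as written in a Condor user log, e.g.
# "005 (123.004.000) 11/05 15:47:04 Job terminated."
EVENT = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")

SUBMITTED = "000"
TERMINATED = "005"


def unfinished_jobs(lines):
    """Return sorted (cluster, proc) pairs that were submitted but never terminated."""
    submitted, finished = set(), set()
    for line in lines:
        match = EVENT.match(line)
        if not match:
            continue  # body lines of an event, separators, etc.
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        if code == SUBMITTED:
            submitted.add(job)
        elif code == TERMINATED:
            finished.add(job)
    return sorted(submitted - finished)
```

Feeding it the lines of each cluster's log file would give the list of
jobs to resubmit. It only says which jobs logged a termination event,
not whether they exited cleanly, so the output files are still worth a
look.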

Thanks,
Nathan