[Rocks-Discuss] sge_execd dies after ~11 minutes


Jimmy Hedman

May 20, 2008, 4:56:48 AM
to Rocks List
Hi,
We have encountered a strange problem with SGE. sge_execd dies about
11 minutes after boot with the message: 'commlib error: got read
error (closing "test.southpole.se/qmaster/1")'. We have tried both Rocks
4.3 and 5. It happens only after boot; if we restart sgeexecd, it keeps
running.
Any ideas what this could be?

Many thanks in advance,
Jimmy Hedman

Jimmy Hedman

May 26, 2008, 3:00:09 AM
to Discussion of Rocks Clusters
More info on this. It seems to happen only directly after a reinstallation.
If I just reboot the machines, it works fine.

Philip Papadopoulos

May 26, 2008, 4:55:00 PM
to Discussion of Rocks Clusters
On Mon, May 26, 2008 at 12:00 AM, Jimmy Hedman <jimmy....@southpole.se>
wrote:

Can you find anything in the SGE logs? This is strange behavior.

-P


--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628

Jimmy Hedman

May 27, 2008, 3:22:04 AM
to Discussion of Rocks Clusters
On Mon, May 26, 2008 22:55, Philip Papadopoulos wrote:
> Can you find anything in the SGE logs? This is strange behavior.
The only thing is the message on the node ('commlib error: got read error
(closing "test.southpole.se/qmaster/1")'). The only thing the master says
is that I can re-compile SGE if I would like to have more than 1004 clients.
We had a different set of Rolls on Rocks 4.3 than on Rocks V. At first I
suspected the OFED roll, since it does magic stuff after the install, but
since we didn't have that roll on Rocks V, I'm pretty sure it's not the problem.
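
For anyone chasing the same symptom, here is a minimal sketch for pulling the
matching lines out of a log file. It only does a plain substring match on
'commlib error' and assumes nothing about the log line format. The names
find_commlib_errors and scan_file are made up for illustration, and the log
path is whatever your installation uses; none is assumed here.

```python
from typing import Iterable, List


def find_commlib_errors(lines: Iterable[str],
                        needle: str = "commlib error") -> List[str]:
    """Return every line containing the needle, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if needle in line]


def scan_file(path: str) -> List[str]:
    """Scan one log file (path supplied by the caller) for commlib errors."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as fh:
        return find_commlib_errors(fh)
```

Calling scan_file() on the execd log of a freshly reinstalled node and on one
that has only been rebooted might show whether the error appears in both cases.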

// Jimmy

