signs of life?


Daniel Gruner

Jan 17, 2009, 4:12:53 PM
to xc...@googlegroups.com
Hi All,

I haven't seen any postings on the list for about a month - well,
except for my own. What is going on with the xcpu project?

I've reported a bunch of bugs, specifically with bjs, but never heard
anything from anybody (Abhishek?). I've gone "production" (with
trepidation) with my cluster, and these bugs are quite a nuisance.
Furthermore, yesterday two of my nodes crashed - in fact appeared to
be completely powered off - while running stuff. No warnings or
apparent problems, but I am still investigating. One of the worst
issues with that is that bjs ends up in a weird state, hanging up
rather than reporting that there are two less nodes available, and it
does not recover until xstat shows that all the nodes are back online.

Then there is the MPI problem...

Regards,
Daniel

p.s. IF we can have most of these issues resolved, and xstat manages
to scale up significantly, then I would be very interested in testing
xcpu on our new 4,000 node cluster which should be available by end of
April. I'd love to be able to run xcpu on it (would be quite an
uphill battle, but if it is shown to work then it would be quite a
coup)...

Roger Mason

Jan 23, 2009, 11:44:09 AM
to xc...@googlegroups.com
Hi Daniel,

"Daniel Gruner" <dgr...@gmail.com> writes:

> I haven't seen any postings on the list for about a month - well,
> except for my own. What is going on with the xcpu project?
>
> I've reported a bunch of bugs, specifically with bjs, but never heard
> anything from anybody (Abhishek?). I've gone "production" (with
> trepidation) with my cluster, and these bugs are quite a nuisance.
> Furthermore, yesterday two of my nodes crashed - in fact appeared to
> be completely powered off - while running stuff. No warnings or
> apparent problems, but I am still investigating. One of the worst
> issues with that is that bjs ends up in a weird state, hanging up
> rather than reporting that there are two less nodes available, and it
> does not recover until xstat shows that all the nodes are back online.
>
> Then there is the MPI problem...

I'm still here. Not much use to you, I know.
I haven't had time to do any further debugging because term started and
I'm running to keep up. Soon, I hope.

Cheers,
Roger

seanb

Jan 23, 2009, 2:36:58 PM
to xc...@googlegroups.com
Hi,
We're swamped with other work and Abhishek has gone back to school.

Sean
