SGE matlabpool problem

Marcin
Nov 19, 2009, 11:27:18 AM

I'm trying to use Distributed Computing Server with an SGE scheduler. My configuration passes validation, but the pool I get always consists of a single lab. When I issue "matlabpool 1", everything is fine. When I issue "matlabpool 2", I get the following output in MATLAB on the client machine:

Starting matlabpool using the 'SGE-smart@dec120' configuration ...
Your job 1664 ("Job1.1") has been submitted
Your job 1665 ("Job1.2") has been submitted

and it gets stuck there.

Now, when I run qstat on the head node, it shows that Job1.1 keeps running (which is good), but Job1.2 runs for a moment and then finishes. When I look at the log files for both tasks (see below), there is an error for Task2, but I have no idea what it means. When I try "matlabpool 3" etc., it's always the first task that seems fine, and the same error appears for all the rest. It doesn't depend on which node executes the task (the same node works fine if it gets Task1 but fails if it gets Task2, 3, etc.). To make things even more complicated, my configuration passes the verification procedure without problems, although at the matlabpool stage I get "Connected to 1 lab" instead of 15.

I suspect it might be a problem with communication between the labs. However, I wasn't able to find anything in the documentation about how the labs actually communicate (protocol, port numbers, etc.).

---------------- Task1 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task1"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
About to add job dependencies
About to call jobStartup
About to call taskStartup
About to get evaluation data
About to pInstantiatePool
Pool instatiation complete
About to call poolStartup
Begin task function
End task function

---------------- Task2 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task2"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
Unexpected error in PreTaskEvaluate - MATLAB will now exit.
No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

Error in ==> dctEvaluateTask at 40
task.pPreTaskEvaluate;

Error in ==> distcomp_evaluate_filetask>iDoTask at 96
dctEvaluateTask(postFcns, finishFcn);

Error in ==> distcomp_evaluate_filetask at 38
iDoTask(handlers, postFcns);

Edric M Ellis
Nov 25, 2009, 3:45:22 AM

"Marcin " <mb1...@gazeta.pl> writes:

> I'm trying to use Distributed Computing Server with SGE scheduler. My
> configuration passes validation, but the pool I get always consists of a
> single lab only. When I issue "matlabpool 1", everything is fine. When I issue
> "matlabpool 2", I get the following output in matlab on the client machine:
>
> Starting matlabpool using the 'SGE-smart@dec120' configuration ...
> Your job 1664 ("Job1.1") has been submitted
> Your job 1665 ("Job1.2") has been submitted

I'm not too familiar with SGE - is that the expected behaviour for a *parallel*
job under SGE? Shouldn't there be a single parallel job submitted using
something like "qsub ... -pe matlab 2" ? Are you using the integration scripts
in toolbox/distcomp/examples/integration/sge? If not, do you know what the
parallel "qsub" command line looks like?

> [...]


> About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
> About to find job proxy using location "Job1"
> About to find task proxy using location "Job1/Task2"
> Completed pre-execution phase
> About to pPreJobEvaluate
> About to pPreTaskEvaluate
> Unexpected error in PreTaskEvaluate - MATLAB will now exit.
> No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

When things end up as "handle.handle", that's usually a sign that the underlying
files for the job or task have been deleted. Not quite sure how you're ending up
there...

Cheers,

Edric.

Marcin
Nov 25, 2009, 7:17:03 AM

Hi,

Yes, I am using the integration scripts that came with MATLAB, although I had to modify them a bit, as in their original form they didn't work at all (the job didn't even get submitted to the cluster).
I still think that there is a problem with communication between the labs, but I don't know how to check it.

Marcin

Edric M Ellis <eel...@mathworks.com> wrote in message <ytw4ooj...@uk-eellis-deb5-64.mathworks.co.uk>...

Edric M Ellis
Nov 25, 2009, 7:54:36 AM

"Marcin " <mb1...@gazeta.pl> writes:

> Yes, I am using the integration scripts which came with MATLAB, although I had
> to modify them a bit, as in the original they were not working at all (the job
> didn't even get submitted to the cluster). I still think that there is a
> problem with communication between the labs, but I don't know how to check it.

What does your parallel job "qsub" command line look like?

Cheers,

Edric.

Marcin
Nov 25, 2009, 8:24:04 AM

Edric M Ellis <eel...@mathworks.com> wrote in message <ytwzl6a...@uk-eellis-deb5-64.mathworks.co.uk>...

It's generated by the integration scripts and looks for example like this:

qsub -N Job2.8 -l q=matlab_pe -j yes -o "/home/smart/PCWIN_2009b/Job2/Task8.log" "/home/smart/PCWIN_2009b/Job2/sgeWrapper.sh"

Marcin

Edric M Ellis
Nov 25, 2009, 11:17:36 AM

"Marcin " <mb1...@gazeta.pl> writes:

Hmm, that's not actually submitting a parallel job, and you're not submitting
the parallel wrapper, so it's not surprising that it doesn't work.

You must submit something along the lines of

qsub ... -pe matlab 2 ... /path/to/Job#/sgeParallelWrapper.sh

otherwise there's no chance that a parallel job will function correctly. The
"-pe matlab 2" states that you need a "parallel environment" called "matlab",
and that you need two parallel processes. The script that you submit must be
something like the sgeParallelWrapper.sh which starts up the smpd daemons and
then uses mpiexec to launch the workers.
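For comparison with the command line you posted, a parallel submission would look roughly like this (flags and paths reused from your example purely for illustration; the log file name here is hypothetical):

qsub -N Job2 -pe matlab 2 -j yes -o "/home/smart/PCWIN_2009b/Job2/Job2.log" "/home/smart/PCWIN_2009b/Job2/sgeParallelWrapper.sh"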

I'd suggest going back to the shipping integration scripts (which should work
with only minor modifications) - what doesn't work when you use those?

Cheers,

Edric.

Marcin
Nov 25, 2009, 4:54:19 PM

Edric M Ellis <eel...@mathworks.com> wrote in message <ytwvdgy...@uk-eellis-deb5-64.mathworks.co.uk>...

That was it! Thank you, thank you, thank you 1000 times :)) I have discovered that indeed instead of sgeNonSharedParallelSubmitFcn, sgeNonSharedSimpleSubmitFcn was being called. It's a shame, though, that MathWorks support didn't notice it...
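For anyone who runs into the same thing, here is a quick way to check which submit functions a configuration will actually use (just a sketch; the configuration name is the one from my setup):

sched = findResource('scheduler', 'Configuration', 'SGE-smart@dec120');
get(sched, 'SubmitFcn')          % used for ordinary distributed jobs
get(sched, 'ParallelSubmitFcn')  % used for parallel jobs, matlabpool and pmode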

Marcin
Nov 26, 2009, 1:43:03 AM

"Marcin " <mb1...@gazeta.pl> wrote in message <hek92b$92v$1...@fred.mathworks.com>...

But now, I'm getting a new error:

>> pmode open 12
Starting pmode using the 'SGE-smart@dec120' configuration ...
Your job 2100 ("Job1") has been submitted

??? Error using ==> distcomp.interactiveclient.start at 103
The client lost connection to lab 6.
This might be due to network problems, or the interactive matlabpool job might have errored. This is causing:
java.io.IOException: An existing connection was forcibly closed by the remote host

Error in ==> pmode at 84
client.start('pmode', nlabs, config, 'opengui');

Sending a stop signal to all the labs ... stopped.

??? Error using ==> distcomp.interactiveclient.start at 119
Failed to initialize the interactive session.
This is caused by:
Java exception occurred:
com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)

Error in ==> pmode at 84
client.start('pmode', nlabs, config, 'opengui');

When I try to create a smaller pool, though, like "pmode open 8", it usually works. My cluster has 15 nodes, and the total number of slots has been set to 75 (5 per node). There shouldn't be any connectivity problems, as it all runs on a separate gigabit network.

Interestingly, when I submit a parallel job like this:

clusterHost = 'dec120.bmth.ac.uk';
remoteDataLocation = '/home/smart';
sched = findResource('scheduler', 'type', 'generic');
% Use a local directory as the DataLocation
set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
set(sched, 'GetJobStateFcn', @sgeGetJobState);
set(sched, 'DestroyJobFcn', @sgeDestroyJob);
set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});

parJob = createParallelJob(sched,'Min',15,'Max',15);
createTask(parJob, @labindex, 1);
submit(parJob);
waitForState(parJob);
results2 = getAllOutputArguments(parJob);

It finishes without error and all 15 nodes are involved (I can tell by examining the log files on the cluster).

Many thanks, Marcin

Edric M Ellis
Nov 26, 2009, 3:26:26 AM

"Marcin " <mb1...@gazeta.pl> writes:

> [...]

Hmm, glad we made *some* progress!

> But now, I'm getting a new error:
>
>>> pmode open 12
> Starting pmode using the 'SGE-smart@dec120' configuration ...
> Your job 2100 ("Job1") has been submitted
>
> ??? Error using ==> distcomp.interactiveclient.start at 103
> The client lost connection to lab 6.
> This might be due to network problems, or the interactive matlabpool job might have errored. This is causing:
> java.io.IOException: An existing connection was forcibly closed by the remote host
>
> Error in ==> pmode at 84
> client.start('pmode', nlabs, config, 'opengui');
>
> Sending a stop signal to all the labs ... stopped.
>
> ??? Error using ==> distcomp.interactiveclient.start at 119
> Failed to initialize the interactive session.
> This is caused by:
> Java exception occurred:
> com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
> at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)

That error basically means that the connection between the workers and the
client went away. Unfortunately, this is a relatively generic error that doesn't
really indicate what the cause might be.

> When I try to create a smaller pool though, like pmode open 8 - it usually
> works. My cluster has 15 nodes, the total number of slots has been set to 75
> (5 per node). There shouldn't be any connectivity problems, as it all runs on
> a separate gigabit network.

You say "usually works" - is there a point where it always works, and a point
where it always fails?

> Interestingly, when I submit a parallel job like this:
>
> clusterHost = 'dec120.bmth.ac.uk';
> remoteDataLocation = '/home/smart';
> sched = findResource('scheduler', 'type', 'generic');
> % Use a local directory as the DataLocation
> set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
> set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
> set(sched, 'HasSharedFilesystem', false);
> set(sched, 'ClusterOsType', 'unix');
> set(sched, 'GetJobStateFcn', @sgeGetJobState);
> set(sched, 'DestroyJobFcn', @sgeDestroyJob);
> set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
> set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});
>
> parJob = createParallelJob(sched,'Min',15,'Max',15);
> createTask(parJob, @labindex, 1);
> submit(parJob);
> waitForState(parJob);
> results2 = getAllOutputArguments(parJob);
>
> It finishes without error and all 15 nodes are involved (I know it by
> examining the log files on the cluster).

Are the settings that you've got there identical to whatever you've got set for
the configuration used by pmode? I'd try

sched = findResource( 'scheduler', 'Configuration', '<configname>' )

rather than all the manual settings and see if that works...

Cheers,

Edric.

Marcin
Nov 26, 2009, 4:56:03 AM

Edric M Ellis <eel...@mathworks.com> wrote in message <ytwr5rl...@uk-eellis-deb5-64.mathworks.co.uk>...

Hi,

It doesn't make a difference, but I know a bit more about the problem now. It seems there are two nodes in my cluster which, when used together, cause the problem. So when I create a pool using matlabpool or pmode and both of them end up in the pool, it crashes, but if only one of them gets picked, it works. Strange, as all of the nodes have exactly the same configuration. At least they should...

Marcin
Nov 26, 2009, 9:00:21 AM

"Marcin " <mb1...@gazeta.pl> wrote in message <heljbj$ft$1...@fred.mathworks.com>...

As a workaround we have removed one of the problematic nodes from the cluster and the admin is currently investigating the issue. I have another question though: how can I monitor the progress of my parallel job other than using qstat, which doesn't tell me much?

Edric M Ellis
Nov 26, 2009, 9:59:49 AM

"Marcin " <mb1...@gazeta.pl> writes:

> [...]


> As a workaround we have removed one of the problematic nodes from the cluster
> and the admin is currently investigating the issue.

Just a wild stab in the dark here - occasionally, we see weird problems caused
by bogus localhost entries in /etc/hosts - in particular, lines like

"127.0.0.1 <stuff> <real-machine-name>"

cause problems.
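In other words, the real hostname should resolve to the machine's real address, and 127.0.0.1 should map only to localhost, roughly like this (the address here is made up purely for illustration):

127.0.0.1       localhost
192.168.0.120   dec120.bmth.ac.uk   dec120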

> I have another question though: how can I monitor the progress of my parallel
> job other than using qstat, which doesn't tell me much?

We don't have any built-in facilities I'm afraid. (What sort of thing were you
after?) For now, your best bet is to write stuff out to a file that you can
access from your client.
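For example, something along these lines inside your task function would give a crude per-lab progress log (just a sketch; the path is the shared data location you mentioned earlier, and the loop count is a placeholder):

% Each lab appends to its own file under the shared data location, so
% progress can be watched (e.g. with "tail -f") from the cluster side.
logFile = sprintf('/home/smart/progress_lab%02d.log', labindex);
fid = fopen(logFile, 'a');
numIterations = 100;  % placeholder for the real loop count
for k = 1:numIterations
    % ... the real work for this lab goes here ...
    fprintf(fid, '%s  lab %d: finished iteration %d of %d\n', ...
            datestr(now), labindex, k, numIterations);
end
fclose(fid);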

Cheers,

Edric.

Marcin
Nov 26, 2009, 10:59:03 AM

Edric M Ellis <eel...@mathworks.com> wrote in message <ytwd435...@uk-eellis-deb5-64.mathworks.co.uk>...

Well, I was thinking about a way to monitor the resource usage on all nodes, how many resources have been allocated to a particular job, and that kind of thing.
Thanks again for your help.

Marcin
Nov 28, 2009, 2:14:06 PM

"Marcin " <mb1...@gazeta.pl> wrote in message <hem8k7$d1q$1...@fred.mathworks.com>...

Edric, I have another small problem. When I pmode to all my cluster nodes and issue the maxNumCompThreads command, each of them returns 1, although the machines have quad-core CPUs. After I issue maxNumCompThreads('automatic') it indeed changes to 4. Can I somehow force each worker to use more than one core at startup?

Thanks

Edric M Ellis
Nov 30, 2009, 3:33:56 AM

"Marcin " <mb1...@gazeta.pl> writes:

> Edric, I have another small problem. When I pmode to all my cluster nodes and
> issue the maxNumCompThreads command, each of them returns 1, although the
> machines have quadcore CPUs. After I issue maxNumCompThreads ('automatic') it
> indeed changes to 4. Can I somehow force each worker to use more than one core
> at startup?

You should be able to use jobStartup.m to do that.

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/jobstartup.html
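A minimal jobStartup.m along these lines should do it (just a sketch; see the documentation link above for where the file needs to live):

function jobStartup(job)
% jobStartup runs once on each worker at the start of the job.
% Let each worker choose its number of computational threads automatically
% (that gives 4 on the quad-core nodes described above).
maxNumCompThreads('automatic');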

Cheers,

Edric.
