[slurm-dev] Setting up a test cluster


hgies...@us.nanya.com

Jul 12, 2012, 11:12:03 AM7/12/12
to slurm-dev

I have SLURM 2.3.3 installed and running in production but have found that I need to do some more testing and tweaking before I can migrate all of our LSF jobs to SLURM. I would like to install a test cluster but am unsure about the following:

1. Will I need to install the test cluster on a separate controller or is it enough that I just install it to a different path and use different port numbers?

2. Can I use existing nodes from my production cluster in the test cluster?

3. Are there any other things to look out for in running a parallel cluster?

Thanks,

Howard Gieselman

Moe Jette

Jul 12, 2012, 11:39:04 AM7/12/12
to slurm-dev

Quoting hgies...@us.nanya.com:

> I have SLURM 2.3.3 installed and running in production but have found
> that I need to do some more testing and tweaking before I can migrate
> all of our LSF jobs to SLURM. I would like to install a test cluster but
> am unsure about the following:
>
> 1. Will I need to install the test cluster on a separate
> controller or is it enough that I just install it to a different path
> and use different port numbers?

Different paths and ports are sufficient.
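
For example, the test instance's slurm.conf might differ from the
production one only in lines like these (the ports and paths shown
are illustrative, not required values):

  ClusterName=test
  SlurmctldPort=7817                     # production default is 6817
  SlurmdPort=7818                        # production default is 6818
  StateSaveLocation=/var/spool/slurm-test/state
  SlurmdSpoolDir=/var/spool/slurm-test/spool
  SlurmctldPidFile=/var/run/slurmctld-test.pid
  SlurmdPidFile=/var/run/slurmd-test.pid
  SlurmctldLogFile=/var/log/slurmctld-test.log
  SlurmdLogFile=/var/log/slurmd-test.log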


> 2. Can I use existing nodes from my production cluster in the test
> cluster?

Yes.


> 3. Are there any other things to look out for in running a
> parallel cluster?

On pretty much all system types the two instances will just over-subscribe the nodes' resources, since each scheduler allocates the hardware independently.



Mark A. Grondona

Jul 12, 2012, 12:33:02 PM7/12/12
to slurm-dev
Also, be sure to audit your epilog and prolog scripts to make sure
they won't cause harm when running in parallel with another SLURM
instance. Be careful if you are using SLURM cgroups support, as cgroups
created for each parallel instance of SLURM will exist in the same
namespace, and that may cause unexpected issues.
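
One simple way to keep the instances from stepping on each other is to
give the test cluster its own no-op scripts in its slurm.conf (the
paths here are hypothetical):

  # test instance's slurm.conf -- never run production cleanup logic
  Prolog=/opt/slurm-test/etc/prolog.noop
  Epilog=/opt/slurm-test/etc/epilog.noop

where each no-op script is just:

  #!/bin/sh
  # placeholder prolog/epilog for the test instance; does nothing
  exit 0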

mark



hgies...@us.nanya.com

Jul 12, 2012, 3:07:02 PM7/12/12
to slurm-dev
How do I get another slurmctld running for the test cluster?

Howard Gieselman

Mark A. Grondona

Jul 12, 2012, 5:00:04 PM7/12/12
to slurm-dev

hgies...@us.nanya.com writes:
> How do I get another slurmctld running for the test cluster?

You just need to point slurmctld at a different slurm.conf that
uses different ports for slurmctld and slurmd, then point slurmd and
all the slurm commands at that new slurm.conf. If you are running out
of a build directory or an alternate path, you may also need to update
PluginDir and other directories (perhaps the epilog and prolog
locations as well, depending on what you want to test).
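
A minimal sketch of what that looks like in practice (the
/opt/slurm-test prefix is made up; substitute your own install path
and the ports from your test slurm.conf):

  # alternate config selected via the SLURM_CONF environment variable
  export SLURM_CONF=/opt/slurm-test/etc/slurm.conf
  # control daemon in the foreground with verbose logging for debugging
  slurmctld -D -vvv
  # on each test node, a slurmd reading the same alternate config
  slurmd -f /opt/slurm-test/etc/slurm.conf
  # client commands find the test cluster through SLURM_CONF as well
  SLURM_CONF=/opt/slurm-test/etc/slurm.conf sinfo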

I used to have a script that would "boot" a test slurm instance
as a SLURM job, but that script no longer works with recent
versions of SLURM (there really wasn't much to it, though).

mark

hgies...@us.nanya.com

Jul 17, 2012, 10:45:03 AM7/17/12
to slurm-dev
First, I want to thank Mark and Moe for helping me get a test cluster set up. I got so focused on getting it running that I failed to show my appreciation in a timely manner.

I took 5 nodes from my production cluster and configured them as a separate installation of SLURM 2.3.3. I was not successful in getting the test cluster to use the same MySQL database, so I built a new one. Everything seems to be communicating now, but I have a strange problem. No matter which user I submit jobs as, I get the following error on the screen:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

The slurmctld.log shows this:

[2012-07-17T09:30:10] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=20053
[2012-07-17T09:30:10] debug3: JobDesc: user_id=20053 job_id=-1 partition=test1 name=sleep
[2012-07-17T09:30:10] debug3: cpus=1-4294967294 pn_min_cpus=-1
[2012-07-17T09:30:10] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2012-07-17T09:30:10] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2012-07-17T09:30:10] debug3: immediate=0 features=(null) reservation=(null)
[2012-07-17T09:30:10] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2012-07-17T09:30:10] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2012-07-17T09:30:10] debug3: kill_on_node_fail=-1 script=(null)
[2012-07-17T09:30:10] debug3: argv="/bin/sleep"
[2012-07-17T09:30:10] debug3: stdin=(null) stdout=(null) stderr=(null)
[2012-07-17T09:30:10] debug3: work_dir=/home/hgieselman alloc_node:sid=myvtlin20:28611
[2012-07-17T09:30:10] debug3: resp_host=10.88.226.20 alloc_resp_port=47269 other_port=36729
[2012-07-17T09:30:10] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2012-07-17T09:30:10] debug3: mail_type=0 mail_user=(null) nice=55534 num_tasks=4294967294 open_mode=0 overcommit=-1 acctg_freq=-1
[2012-07-17T09:30:10] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2012-07-17T09:30:10] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2012-07-17T09:30:10] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2012-07-17T09:30:10] debug3: cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
[2012-07-17T09:30:10] error: User 20053 not found
[2012-07-17T09:30:10] _job_create: invalid account or partition for user 20053, account '(null)', and partition 'test1'
[2012-07-17T09:30:10] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

The issue seems to be that jobs get submitted with the UID instead of the username. We use NIS, and all of my NIS testing comes out clean. As I mentioned earlier, these were production compute nodes and did not have this problem before. Any ideas on where to look for the root cause?
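
For reference, the name-service lookups can be sanity-checked on the
controller node with generic commands like these (20053 and
hgieselman are the UID and username from the log above; nscd applies
only if it is actually running there):

  # on the node running slurmctld
  getent passwd 20053        # numeric lookup should print the NIS entry
  getent passwd hgieselman   # name lookup should print the same entry
  # if the box runs nscd, a stale cache is a common culprit; flush it
  nscd -i passwd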

Thanks,

hgies...@us.nanya.com

Jul 19, 2012, 9:00:03 AM7/19/12
to slurm-dev
Just a quick follow-up...

Mother Nature fixed my problem with UID vs. username. A severe storm on Tuesday night took out the power, and my server room soon went down with it. Upon powering everything back on, my test cluster now sees job submissions correctly by username. (A reboot clearing a stale name-service cache, e.g. nscd, would be consistent with this, though I can't confirm it.)

Thanks all,