First, I want to thank Mark and Moe for helping me get a test cluster set up. I got so focused on getting it running that I failed to show my appreciation in a timely manner.
I took 5 nodes from my production cluster and configured them as a separate installation of SLURM 2.3.3. I was not successful in getting the test cluster to use the same MySQL database, so I built a new one. Everything seems to be communicating now, but I have a strange problem. No matter which user I submit jobs as, I get the following error on the screen:
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
The Slurmctld.log shows this:
[2012-07-17T09:30:10] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=20053
[2012-07-17T09:30:10] debug3: JobDesc: user_id=20053 job_id=-1 partition=test1 name=sleep
[2012-07-17T09:30:10] debug3: cpus=1-4294967294 pn_min_cpus=-1
[2012-07-17T09:30:10] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2012-07-17T09:30:10] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2012-07-17T09:30:10] debug3: immediate=0 features=(null) reservation=(null)
[2012-07-17T09:30:10] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2012-07-17T09:30:10] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2012-07-17T09:30:10] debug3: kill_on_node_fail=-1 script=(null)
[2012-07-17T09:30:10] debug3: argv="/bin/sleep"
[2012-07-17T09:30:10] debug3: stdin=(null) stdout=(null) stderr=(null)
[2012-07-17T09:30:10] debug3: work_dir=/home/hgieselman alloc_node:sid=myvtlin20:28611
[2012-07-17T09:30:10] debug3: resp_host=10.88.226.20 alloc_resp_port=47269 other_port=36729
[2012-07-17T09:30:10] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2012-07-17T09:30:10] debug3: mail_type=0 mail_user=(null) nice=55534 num_tasks=4294967294 open_mode=0 overcommit=-1 acctg_freq=-1
[2012-07-17T09:30:10] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2012-07-17T09:30:10] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2012-07-17T09:30:10] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2012-07-17T09:30:10] debug3: cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
[2012-07-17T09:30:10] error: User 20053 not found
[2012-07-17T09:30:10] _job_create: invalid account or partition for user 20053, account '(null)', and partition 'test1'
[2012-07-17T09:30:10] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
The issue seems to be that the jobs get submitted with the numeric uid (20053 in the log above) instead of the username as the user. We use NIS, and all of my NIS tests come out clean. As I mentioned earlier, these were production compute nodes, and they did not have this problem then. Any ideas on where to look for the root cause?
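For anyone who wants to repeat the name-service check on the host running slurmctld, here is a minimal sketch in Python. It assumes that a standard passwd lookup (which goes through NIS when the host is configured for it) is a fair stand-in for how the uid gets resolved there. The helper name resolve_uid is made up for illustration, and the only real value in it is the uid 20053 from the log above. It covers only the name-service side and says nothing about whether the user exists in the newly built accounting database.

```python
import pwd


def resolve_uid(uid, lookup=pwd.getpwuid):
    """Return the username for a numeric uid, or None if no entry exists.

    `lookup` defaults to the system passwd lookup, which consults whatever
    sources the host is configured to use (local files, NIS, ...).
    """
    try:
        return lookup(uid).pw_name
    except KeyError:
        # pwd.getpwuid raises KeyError when the uid has no passwd entry.
        return None


if __name__ == "__main__":
    uid = 20053  # uid reported in the slurmctld log
    name = resolve_uid(uid)
    if name is None:
        print(f"uid {uid}: no passwd entry visible on this host")
    else:
        print(f"uid {uid} resolves to {name!r}")
```

If this prints a username on the controller host as well as on the compute nodes, the name service is probably not where the "User 20053 not found" message comes from, and the freshly created accounting database would be the next place to look.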
Thanks,