Unspecified number of nodes limit


Hyungsik Jo

Jun 20, 2022, 2:14:23 AM
to google-cloud-slurm-discuss
I want to run two jobs in one partition.

One job uses 100 nodes, and the other needs about 400 nodes.

However, when both jobs run at the same time, only 138 nodes are used in total.

I changed the node configuration and MaxJobCount / MaxArraySize in slurm.conf:

MaxJobCount=50000
MaxArraySize=50000

# COMPUTE NODES
NodeName=DEFAULT CPUs=16 RealMemory=63216 State=UNKNOWN
NodeName=node-0-[0-599] State=CLOUD
PartitionName=debug Nodes=node-0-[0-599] MaxTime=INFINITE State=UP DefMemPerCPU=3951 LLN=yes Default=YES

The sbatch settings for the second job are as follows.
#SBATCH -o ./out/%j.out
#SBATCH --ntasks=1
#SBATCH --array=0-399
#SBATCH --cpus-per-task=16
#SBATCH -W

As far as I understand, one node should be started for each array task, but only some tasks run. The rest stay pending with the reason (Resources), waiting for the earlier tasks to finish.

Is there an additional setting I need to change?
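
For reference, the arithmetic behind this question can be sketched in Python. This is only an illustration: the helper name count_nodes is hypothetical, and it handles just the simple bracket form used in the slurm.conf excerpt above, not the full Slurm hostlist syntax. It assumes each job or array task takes one whole node.

```python
import re


def count_nodes(hostlist: str) -> int:
    """Count hosts in a simple hostlist such as 'node-0-[0-599]'.

    Assumption: at most one bracket group containing comma-separated
    numbers or lo-hi ranges. Real Slurm hostlists allow more forms.
    """
    if "[" not in hostlist:
        return 1  # a plain host name with no bracket expression
    match = re.fullmatch(r"[^\[\]]+\[([0-9,\-]+)\]", hostlist)
    if match is None:
        raise ValueError(f"unsupported hostlist: {hostlist!r}")
    total = 0
    for part in match.group(1).split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total


partition_nodes = count_nodes("node-0-[0-599]")
requested = 100 + 400  # first job plus the 400-task array, one node each
print(partition_nodes, requested, requested <= partition_nodes)
# prints: 600 500 True
```

On these numbers the partition definition alone leaves room for both jobs, so the 138-node ceiling has to come from somewhere else.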

Olivier Martin

Jun 20, 2022, 9:13:38 AM
to Hyungsik Jo, google-cloud-slurm-discuss
Just some thoughts: does your project have enough quota? Are you seeing any allocation failures in the logs? When you run the largest (~400 nodes) job, do you get the same 138 nodes allocated?


Hyungsik Jo

Jun 20, 2022, 7:51:30 PM
to google-cloud-slurm-discuss
I'm sending this a second time because I think it is better to reply to the whole group so everyone can see it.

Please excuse my mistake.

1. I check the project quotas periodically, and none of them has been exceeded.

2. Even when I check the logs, nothing indicates the cause of the failure. The jobs just can't get resources.

3. When a job is run with more than 138 nodes, the node count is always capped at 138 and the remaining work runs sequentially.

Just a thought: maybe there is a limit on the debug partition itself?


Olivier Martin

Jun 21, 2022, 12:51:16 AM
to Hyungsik Jo, google-cloud-slurm-discuss
When you have 138 nodes busy, are you able to create a new node of the same type in the same project and region, outside Slurm's control (using gcloud, the console, the API, or other means)? Maybe you can; I'm just suggesting this as a way to see whether something is wrong outside of Slurm.

The logs I mean are those of Slurm, which I believe are on the controller or login node. They may be exported to Stackdriver if the service account for the VMs has the proper roles (logging.write, monitoring.write); I believe only logging.write would be necessary in this case.

You could also ssh to the controller node. I believe it is this one, rather than the login node, that is responsible for calling the bulk creation API, which should trigger the creation of the VMs by GCE.

Good luck!
Olivier

Hyungsik Jo

Jun 21, 2022, 1:03:43 AM
to google-cloud-slurm-discuss
I tested on the slurm-controller.

While 138 nodes were working, I tried to create a new node with the command srun --pty bash.

Because the running sbatch job already uses the maximum number of CPUs on each node, a new node is created for it.

However, the job is stuck in the following state:

# srun --pty bash
srun: job 59300 queued and waiting for resources
srun: job 59300 has been allocated resources

I also found errors in the log:
/var/log/slurm/slurmctld.log
[2022-06-21T04:58:49.406] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2022-06-21T04:58:49.406] error: slurm_set_addr: Unable to resolve "node-0-899"
[2022-06-21T04:58:49.406] error: fwd_tree_thread: can't find address for host node-0-899, check slurm.conf
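
A minimal Python sketch, assuming the log format shown above, for listing which node names slurmctld failed to resolve. The function name unresolved_hosts is hypothetical, and the sample lines are copied from this message.

```python
import re

# Matches the slurm_set_addr error line as it appears in the excerpt above.
UNRESOLVED = re.compile(r'error: slurm_set_addr: Unable to resolve "([^"]+)"')


def unresolved_hosts(log_lines):
    """Return the distinct unresolved host names, in first-seen order."""
    seen = []
    for line in log_lines:
        match = UNRESOLVED.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "[2022-06-21T04:58:49.406] error: get_addr_info: getaddrinfo() failed: Name or service not known",
    '[2022-06-21T04:58:49.406] error: slurm_set_addr: Unable to resolve "node-0-899"',
    "[2022-06-21T04:58:49.406] error: fwd_tree_thread: can't find address for host node-0-899, check slurm.conf",
]
print(unresolved_hosts(sample))
# prints: ['node-0-899']
```

To scan the real file, the same function could be given the open log file instead of the sample list, since a file object also yields lines.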

Thanks for your interest, Olivier.

P.S. Regarding the partition limit I suspected earlier, I tried creating a new partition and running the job there, but it produces the same error.


Olivier Martin

Jun 21, 2022, 12:51:37 PM
to Hyungsik Jo, google-cloud-slurm-discuss
Hi, when the 138 nodes are busy on something else, are you able to create a VM using something like:

gcloud compute instances create instance-1 \
    --project=<your_project> \
    --zone=us-central1-a \
    --machine-type=c2-standard-4 \
    --network-interface=network-tier=PREMIUM,subnet=<your_subnet_name> \
    --metadata=enable-oslogin=true \
    --maintenance-policy=MIGRATE \
    --provisioning-model=STANDARD \
    --service-account=<your_service_account> \
    --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
    --create-disk=auto-delete=yes,boot=yes,device-name=<instance_name>,image=projects/debian-cloud/global/images/debian-11-bullseye-v20220519,mode=rw,size=10,type=projects/<your_project>/zones/<your_zone>/diskTypes/pd-balanced \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --reservation-affinity=any

Of course your command would be different, and many parameters could be replaced (or you could do it from the console, which is where I got this command). The point is just to see whether some limit, quota, or other suspicious thing is being hit at the project/VPC/subnet level. Use your own machine type (c2-standard-4 in my example, but yours could be different).

cheers,
Olivier

Hyungsik Jo

Jun 21, 2022, 8:19:47 PM
to Olivier Martin, google-cloud-slurm-discuss
Oh, sorry, I missed the words "outside Slurm".

As you suggested, I tested creating a new VM while the 138 nodes were busy, and it was created normally.

All conditions were kept the same:
- Internal/external IP
- http-server/https-server tag
- machine type
- region

Wouldn't this be a quota or limit issue on Google Cloud?


Olivier Martin

Jun 21, 2022, 8:25:34 PM
to Hyungsik Jo, google-cloud-slurm-discuss
My understanding is that if the node is created normally from the console (while 138 nodes are busy running a Slurm job), then something within Slurm is not behaving normally or is not configured correctly. Not being support, it's hard for me to give you proper guidance. The next step is probably to investigate the Slurm logs, or to run tail -f on /var/log/messages or other files (I'm not sure exactly which ones) while you submit a Slurm job that would launch the 139th node. Also monitor Stackdriver for messages that could signal what is going wrong, assuming the service account for the controller node has the logging.write permission.
--

Olivier Martin

martin...@google.com

HPC Customer Engineer

(514) 670-8562

Hyungsik Jo

Jun 21, 2022, 9:00:48 PM
to Olivier Martin, google-cloud-slurm-discuss
/var/log/slurmctld.log contains the logs related to Slurm execution.

The relevant log entries for this run are as follows.

1. Creating Nodes and Assigning Tasks
sched: Allocate JobId=60961_199(60961) NodeList=node-0-999 #CPUs=16 Partition=partition
sched/backfill: _start_job: Started JobId=60961_199(60961) in partition on node-0-999

2-1. 138 nodes running normally
Node node-0-800 now responding

2-2. Resetting due to failure of the remaining nodes
job_time_limit: Configuration for JobId=60961_43(61005) complete
Resetting JobId=60961_43(61005) start time for node power up

node node-0-976 not resumed by ResumeTimeout(300) - marking down and power_save
requeue job JobId=61243_176(61420) due to failure of node node-0-976
Requeuing JobId=61243_176(61420)

After some of the 138 nodes finish their computation, the failed nodes are created and the work proceeds (similar to how an array job behaves).
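
A rough Python sketch, assuming the slurmctld.log message formats quoted above, for collecting the nodes that hit ResumeTimeout and the jobs requeued because of them. The function name summarize_resume_failures is hypothetical.

```python
import re

# Patterns follow the two message formats quoted from slurmctld.log above.
RESUME_FAIL = re.compile(r"node (\S+) not resumed by ResumeTimeout\((\d+)\)")
REQUEUE = re.compile(r"requeue job JobId=(\S+) due to failure of node (\S+)")


def summarize_resume_failures(log_lines):
    """Return ({node: timeout_seconds}, {node: [requeued job ids]})."""
    failed_nodes = {}
    requeued = {}
    for line in log_lines:
        match = RESUME_FAIL.search(line)
        if match:
            failed_nodes[match.group(1)] = int(match.group(2))
        match = REQUEUE.search(line)
        if match:
            requeued.setdefault(match.group(2), []).append(match.group(1))
    return failed_nodes, requeued


sample = [
    "Node node-0-800 now responding",
    "node node-0-976 not resumed by ResumeTimeout(300) - marking down and power_save",
    "requeue job JobId=61243_176(61420) due to failure of node node-0-976",
    "Requeuing JobId=61243_176(61420)",
]
failed, requeued = summarize_resume_failures(sample)
print(failed)
print(requeued)
# prints: {'node-0-976': 300}
# prints: {'node-0-976': ['61243_176(61420)']}
```

Counting the keys of the first dictionary over a whole run gives a quick estimate of how many node start-ups never completed.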

I've also found and copied the log of a VM that fails to be created and is deleted.

It appears to be a lack of resources in the zone.

No other quota appears to be exceeded.



{
  "protoPayload": {
    "status": {
      "code": 8,
      "message": "ZONE_RESOURCE_POOL_EXHAUSTED",
      "details": [
        {
          "value": {
            "zoneResourcePoolExhaustedWithDetails": {
              "zoneResource": {
                "resourceType": "ZONE",
                "resourceName": "asia-northeast3-b",
                "project": {
                  "canonicalProjectId": "748182922348"
                },
                "scope": {
                  "scopeType": "GLOBAL",
                  "scopeName": "global"
                }
              },
              "details": "(resource type:compute)"
            }
          }
        }
      ]
    },
    "authenticationInfo": {
      "principalEmail": "7481829223...@developer.gserviceaccount.com"
    },
    "requestMetadata": {
      "callerIp": "35.216.74.94",
      "callerSuppliedUserAgent": "Slurm_GCP_Scripts/1.2 (GPN:SchedMD) (gzip),gzip(gfe)",
      "requestAttributes": {},
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.bulkInsert",
    "authorizationInfo": [
      {
        "permission": "compute.instances.create",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/zones/asia-northeast3-b/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.disks.create",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/zones/asia-northeast3-b/disks/unusedName",
          "type": "compute.disks"
        }
      },
      {
        "permission": "compute.subnetworks.use",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/regions/asia-northeast3/subnetworks/node-asia-northeast3",
          "type": "compute.subnetworks"
        }
      },
      {
        "permission": "compute.subnetworks.useExternalIp",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/regions/asia-northeast3/subnetworks/node-asia-northeast3",
          "type": "compute.subnetworks"
        }
      },
      {
        "permission": "compute.instances.setMetadata",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/zones/asia-northeast3-b/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.instances.setTags",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/zones/asia-northeast3-b/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.instances.setServiceAccount",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/test-project/zones/asia-northeast3-b/instances/unusedName",
          "type": "compute.instances"
        }
      }
    ],
    "resourceName": "projects/test-project/zones/asia-northeast3-b/instances/node-compute-0-999",
    "request": {
    }
  },
  "insertId": "udzude11ki4",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "9147678799106660706",
      "zone": "asia-northeast3-b",
      "project_id": "test-project"
    }
  },
  "timestamp": "2022-06-22T00:51:42.846715Z",
  "severity": "ERROR",
  "logName": "projects/test-project/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "operation-1655859087952-5e1febcbc4ac4-b713c511-7ccfeadb",
    "producer": "compute.googleapis.com",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2022-06-22T00:51:43.611376754Z"
}
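
A small Python sketch for pulling the key fields out of an audit log entry like the one above. The embedded entry is a trimmed copy containing only fields that appear in this message, and the function name summarize_failure is hypothetical. ZONE_RESOURCE_POOL_EXHAUSTED is Compute Engine's error for a zone that cannot currently supply the requested capacity, which is separate from project quota.

```python
import json

# Trimmed copy of the audit log entry above (only fields used below).
entry = json.loads("""
{
  "protoPayload": {
    "status": {
      "code": 8,
      "message": "ZONE_RESOURCE_POOL_EXHAUSTED",
      "details": [
        {"value": {"zoneResourcePoolExhaustedWithDetails": {
          "zoneResource": {"resourceType": "ZONE",
                           "resourceName": "asia-northeast3-b"},
          "details": "(resource type:compute)"}}}
      ]
    },
    "methodName": "v1.compute.instances.bulkInsert",
    "resourceName": "projects/test-project/zones/asia-northeast3-b/instances/node-compute-0-999"
  },
  "severity": "ERROR"
}
""")


def summarize_failure(entry):
    """Extract the API method, error message, zone(s), and instance name."""
    payload = entry.get("protoPayload", {})
    status = payload.get("status", {})
    zones = []
    for detail in status.get("details", []):
        info = detail.get("value", {}).get("zoneResourcePoolExhaustedWithDetails", {})
        name = info.get("zoneResource", {}).get("resourceName")
        if name:
            zones.append(name)
    return {
        "method": payload.get("methodName"),
        "message": status.get("message"),
        "zones": zones,
        "instance": payload.get("resourceName", "").rsplit("/", 1)[-1],
    }


print(summarize_failure(entry))
```

Running the same function over every ERROR entry exported from the activity log would show whether all failed creations report the same message and zone.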

