Getting utilization, etc. info for a partition

Hossein Pourreza

Dec 15, 2016, 4:39:46 PM

to pys...@googlegroups.com

Hi.

I am trying to get utilization and similar information per SLURM partition. I tried to get the list of nodes in a partition using pyslurm.partition().find_id("part_name")["nodes"], but I cannot feed that list to any method that returns the information I want. I was hoping to change show-cluster-util.py to filter based on cluster name, but I could not get it to work.

Any help will be greatly appreciated.

Thanks

Hossein

Giovanni

Dec 15, 2016, 6:17:43 PM

to pyslurm

Hossein,

Which partition information are you specifically looking for? show-cluster-util outputs the information shown here: https://github.com/giovtorres/slurmtools/tree/master/show-cluster-util, but it does not break down the output by partition. Are you trying to filter by partition or by cluster (it sounds like you might have a multi-cluster Slurm setup)? I wrote another utility, salljobs (https://github.com/giovtorres/slurmtools/tree/master/salljobs), that can report usage by partition. For example, run `salljobs -p part_name`.

Let me know which information you are trying to gather and I can try to help.

Giovanni

Hossein Pourreza

Dec 16, 2016, 9:24:35 AM

to Giovanni, pyslurm

Dear Giovanni,

Many thanks for these useful tools and for the prompt reply. I really appreciate all your time and effort, and I hope I can contribute.

I really like show-cluster-util, but I want to be able to filter its results by partition name, e.g., show-cluster-util -p part_name. We have a few partitions (as we added new nodes, we grouped them into partitions), and I want to get statistics on the usage of each partition.

Maybe I can modify your salljobs script to include utilization info at the end.

Thanks again

Hossein

--
You received this message because you are subscribed to the Google Groups "pyslurm" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyslurm+u...@googlegroups.com.
To post to this group, send email to pys...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyslurm/177cf671-c9bf-45d9-88b2-17d46109fde5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Giovanni

Dec 16, 2016, 3:43:02 PM

to pyslurm

Hi Hossein,

Check out the "perpart" branch of slurmtools. I modified show-cluster-util to accept an optional -p flag that restricts the output to a single partition. Let me know if that is helpful and what you were looking for. I'll then merge it into master.
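
As an illustration only (this is not the code from the "perpart" branch), per-partition filtering could be sketched in Python roughly as below. The node records, their field names, and the helper names nodes_in_partition and cpu_totals are all hypothetical; they are not the data layout returned by pyslurm or used by show-cluster-util.

```python
# Hypothetical node records; the field names are assumptions for this sketch,
# not the layout returned by pyslurm or used by show-cluster-util.
NODES = {
    "n01": {"cpus": 32, "alloc_cpus": 32, "partitions": ["batch"]},
    "n02": {"cpus": 32, "alloc_cpus": 8, "partitions": ["batch", "debug"]},
    "n03": {"cpus": 16, "alloc_cpus": 0, "partitions": ["debug"]},
}


def nodes_in_partition(nodes, partition=None):
    """Return only the nodes that belong to `partition` (all nodes if None)."""
    if partition is None:
        return dict(nodes)
    return {
        name: info
        for name, info in nodes.items()
        if partition in info["partitions"]
    }


def cpu_totals(nodes):
    """Sum configured and allocated CPUs over the given nodes."""
    configured = sum(info["cpus"] for info in nodes.values())
    allocated = sum(info["alloc_cpus"] for info in nodes.values())
    return {"configured": configured, "allocated": allocated}


if __name__ == "__main__":
    # Totals restricted to the hypothetical "debug" partition.
    print(cpu_totals(nodes_in_partition(NODES, "debug")))
```

The idea is simply to narrow the node set first and then run the same aggregation over whatever remains, so the unfiltered case and the per-partition case share one code path.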

Giovanni

Hossein Pourreza

Dec 16, 2016, 4:09:13 PM

to Giovanni, pyslurm

Wow!! That is great. Very nice. Thank you so much. I should read the code to understand how you did it :)

Two questions:

· Does Partition CPU % (Alloc + Unalloc) mean Allocated + Idle? Should this and Partition CPU % Unallocatable add up to 100%? For me they add up to 81%.

    Total Allocated CPUs              : 4586
    Total Idle CPUs                   : 2131
    Total Down CPUs                   : 2828
    Total Unallocatable CPUs          : 787
    Total Eligible CPUs               : 7504
    Total Configured CPUs             : 10332
    Partition CPU % Unallocatable     : 10%
    Partition CPU % (Alloc + Unalloc) : 71%

· On one of the partitions I get the following assertion error:

    Traceback (most recent call last):
      File "show-cluster-util", line 253, in <module>
        metrics = get_util(nodes)
      File "show-cluster-util", line 139, in get_util
        all_metrics["total_nodes_down"] == all_metrics["total_nodes_config"]
    AssertionError

but sinfo -p part_name shows this:

    PARTITION  AVAIL  TIMELIMIT  NODES  STATE
    part_name  up     infinite       1  drng
    part_name  up     infinite       6  resv
    part_name  up     infinite      24  idle

Any idea?

Thanks again

Hossein

Giovanni

Dec 16, 2016, 4:28:48 PM

to pyslurm

It could be that you have overlapping partitions, so nodes were counted twice? There is a configurable list at the top of the script, OVERLAPPING_PARTITIONS, where you can include overlapping partitions that should be excluded.
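
To show how overlapping partitions can inflate totals, here is a small Python sketch. The partition layout, the CPU counts, and the function names are hypothetical, and the `exclude` argument only plays a role similar to OVERLAPPING_PARTITIONS; this is not the script's actual logic.

```python
# Hypothetical partition -> node-name mapping; "all" overlaps the other two.
PARTITIONS = {
    "batch": ["n01", "n02"],
    "debug": ["n03"],
    "all": ["n01", "n02", "n03"],
}
# Hypothetical CPU count per node.
CPUS_PER_NODE = {"n01": 32, "n02": 32, "n03": 16}


def total_cpus_naive(partitions, cpus):
    """Sum CPUs partition by partition; shared nodes are counted repeatedly."""
    return sum(cpus[name] for names in partitions.values() for name in names)


def total_cpus_unique(partitions, cpus, exclude=()):
    """Sum CPUs over distinct nodes, skipping any partitions in `exclude`."""
    seen = set()
    for part, names in partitions.items():
        if part in exclude:
            continue
        seen.update(names)
    return sum(cpus[name] for name in seen)


if __name__ == "__main__":
    print(total_cpus_naive(PARTITIONS, CPUS_PER_NODE))
    print(total_cpus_unique(PARTITIONS, CPUS_PER_NODE))
    print(total_cpus_unique(PARTITIONS, CPUS_PER_NODE, exclude=("all",)))
```

In this toy layout the naive sum counts every node twice because "all" repeats the nodes of "batch" and "debug", while collecting node names in a set, or excluding the overlapping partition, gives the real total.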

With regard to the first question:

Partition CPU % (Alloc + Unalloc) = (Total Allocated CPUs + Total Unallocatable CPUs) / Total Eligible CPUs
71% = (4586 + 787) / 7504

Idle means the CPUs are available for allocation. Unallocatable refers to the CPUs on nodes where, for example, a 32CPU/100GB RAM machine has been allocated a single job with 2 CPUs and 100GB of RAM. All the memory on this node has been allocated and, as a result, no further jobs will be allocated to this node. However, 30 CPUs on that node are technically free but unallocatable due to the memory allocation. I have some descriptions in the README.
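
The arithmetic can be checked with a short Python sketch using the numbers posted above. The function name partition_percentages is made up for this example, and treating Eligible as Configured minus Down is an inference from the posted figures (10332 - 2828 = 7504), not something taken from the script.

```python
def partition_percentages(allocated, idle, unallocatable):
    """Return CPU percentages relative to the eligible CPU count.

    Assumption for this sketch: eligible CPUs are the sum of allocated,
    idle, and unallocatable CPUs, which matches the figures in this thread.
    """
    eligible = allocated + idle + unallocatable
    return {
        "eligible": eligible,
        "unallocatable": 100.0 * unallocatable / eligible,
        "alloc_plus_unalloc": 100.0 * (allocated + unallocatable) / eligible,
        "idle": 100.0 * idle / eligible,
    }


if __name__ == "__main__":
    # Figures from the output posted earlier in the thread.
    configured, down = 10332, 2828
    result = partition_percentages(allocated=4586, idle=2131, unallocatable=787)

    # Inferred relationship: eligible = configured - down.
    assert result["eligible"] == configured - down

    for name, value in result.items():
        print(f"{name}: {value:.1f}")
```

With these inputs the Alloc + Unalloc share comes out just under 72% and the unallocatable share just over 10%, consistent with the reported 71% and 10% if the script drops the fractional part. The two reported percentages are not meant to add up to 100%: the unallocatable 10% is already included in the 71%, and the remainder (about 28%) is the idle share.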

Giovanni
