Getting utilization, etc. info for a partition


Hossein Pourreza

Dec 15, 2016, 4:39:46 PM
to pys...@googlegroups.com

Hi.

 

I am trying to get utilization, etc. information per SLURM partition. I tried to get the list of nodes in a partition using pyslurm.partition().find_id("part_name")["nodes"], but I cannot feed that list to any method that returns the information I want. I was hoping I could change show-cluster-util.py to filter based on cluster name, but I could not do it.

 

Any help will be greatly appreciated.

 

Thanks

Hossein
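
A minimal sketch of one way to turn a partition's node expression into individual node names, so that it can be matched against per-node data. It assumes the "nodes" value is a Slurm-style host expression such as n[01-03,07],gpu1 with at most one bracket group per name; expand_hostlist is a hypothetical helper, not part of pyslurm. On a machine with Slurm installed, `scontrol show hostnames` performs the same expansion.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host expression like 'n[01-03,07],gpu1' into names.

    Assumption: each comma-separated entry has at most one [...] group.
    """
    names = []
    # Split on commas that are outside square brackets.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,\[]*", expr):
        match = re.fullmatch(r"([^\[]*)\[([^\]]*)\](.*)", part)
        if match is None:
            # Plain host name with no range.
            names.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo          # a single number such as "07"
            width = len(lo)        # preserve zero padding
            for i in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{i:0{width}d}{suffix}")
    return names
```

The resulting list can then be used to pick the matching entries out of a per-node dictionary keyed by node name.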

 

Giovanni

Dec 15, 2016, 6:17:43 PM
to pyslurm
Hossein,

Which partition information are you specifically looking for?  show-cluster-util outputs the information shown here: https://github.com/giovtorres/slurmtools/tree/master/show-cluster-util, but does not break down the output by partition. Are you trying to filter by partition or by cluster (it sounds like you might have a multi-cluster Slurm setup)?  I wrote another utility, salljobs (https://github.com/giovtorres/slurmtools/tree/master/salljobs), that reports usage by partition; for example, run `salljobs -p part_name`.

Let me know which information you are trying to gather and I can try to help.

Giovanni

Hossein Pourreza

Dec 16, 2016, 9:24:35 AM
to Giovanni, pyslurm

 

Dear Giovanni,

 

Many thanks for these useful tools and for your prompt reply to my question. I really appreciate all your time and effort, and I hope I can contribute.

 

I really like show-cluster-util, but I want to be able to filter its results by partition name, e.g., show-cluster-util -p part_name. We have a few partitions (as we added new nodes, we grouped them into partitions) and I want to get usage stats for each partition.

 

Maybe I can modify your salljobs script to include utilization info at the end.

 

Thanks again

Hossein

--
You received this message because you are subscribed to the Google Groups "pyslurm" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyslurm+u...@googlegroups.com.
To post to this group, send email to pys...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyslurm/177cf671-c9bf-45d9-88b2-17d46109fde5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Giovanni

Dec 16, 2016, 3:43:02 PM
to pyslurm
Hi Hossein,

Check out the "perpart" branch of slurmtools.  I modified show-cluster-util to accept an optional -p flag that selects a single partition.  Let me know if that is helpful and what you were looking for.  I'll then merge it into master.
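
For readers following along, here is a rough sketch of how an optional -p flag could be wired up. It is not the code from the "perpart" branch: the --partition long option, the select_nodes helper, and the shape of the node dictionary are all assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse an optional -p flag; without it, all partitions are reported."""
    parser = argparse.ArgumentParser(description="Show CPU utilization")
    parser.add_argument(
        "-p", "--partition",  # long name is hypothetical
        default=None,
        help="limit the report to nodes in this partition",
    )
    return parser.parse_args(argv)


def select_nodes(nodes, partition_nodes=None):
    """Return only the nodes whose names appear in partition_nodes.

    nodes: dict mapping node name -> per-node info (assumed shape).
    partition_nodes: iterable of node names, or None to keep everything.
    """
    if partition_nodes is None:
        return dict(nodes)
    wanted = set(partition_nodes)
    return {name: info for name, info in nodes.items() if name in wanted}
```

With this layout, the metrics code does not need to know about partitions at all; it simply receives a smaller node dictionary when -p is given.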

Giovanni

Hossein Pourreza

Dec 16, 2016, 4:09:13 PM
to Giovanni, pyslurm

Wow!! That is great. Very nice. Thank you so much. I should read the code to understand how you did it : )

 

Two questions:

·         Does Partition CPU % (Alloc + Unalloc) mean Allocated + Idle? Should this and Partition CPU % Unallocatable add up to 100%? For me they add up to 81%.

 

Total Allocated CPUs                 :     4586
Total Idle CPUs                      :     2131
Total Down CPUs                      :     2828
Total Unallocatable CPUs             :      787
Total Eligible CPUs                  :     7504
Total Configured CPUs                :    10332
Partition CPU % Unallocatable        :      10%
Partition CPU % (Alloc + Unalloc)    :      71%

 

·         On one of the partitions I receive the following assertion error:

 

Traceback (most recent call last):
  File "show-cluster-util", line 253, in <module>
    metrics = get_util(nodes)
  File "show-cluster-util", line 139, in get_util
    all_metrics["total_nodes_down"] == all_metrics["total_nodes_config"]
AssertionError

 

but sinfo -p part_name shows this:

 

PARTITION  AVAIL  TIMELIMIT  NODES  STATE
part_name  up     infinite   1      drng
part_name  up     infinite   6      resv
part_name  up     infinite   24     idle

 

Any idea?

 

Thanks again

Hossein

 

 


Giovanni

Dec 16, 2016, 4:28:48 PM
to pyslurm
It could be that you have overlapping partitions, so some nodes were counted twice.  There is a configurable list at the top of the script, OVERLAPPING_PARTITIONS, where you can list overlapping partitions that should be excluded.
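
To check whether overlap is the cause, something along these lines may help. It is only a sketch: find_shared_nodes and count_unique_cpus are hypothetical helpers, and the inputs are assumed to be plain dictionaries of partition name to node names and node name to CPU count, not the script's actual data structures.

```python
from collections import defaultdict


def find_shared_nodes(partition_nodes):
    """Return {node: [partitions]} for nodes listed in more than one partition."""
    seen = defaultdict(list)
    for partition, nodes in partition_nodes.items():
        for node in nodes:
            seen[node].append(partition)
    return {node: parts for node, parts in seen.items() if len(parts) > 1}


def count_unique_cpus(partition_nodes, cpus_per_node, excluded=()):
    """Sum CPUs over all partitions, counting each node once.

    excluded plays the role of a list of partitions to skip entirely.
    """
    counted = set()
    total = 0
    for partition, nodes in partition_nodes.items():
        if partition in excluded:
            continue
        for node in nodes:
            if node not in counted:
                counted.add(node)
                total += cpus_per_node[node]
    return total
```

If find_shared_nodes returns anything for your cluster, a naive per-partition sum will be larger than the true total, which is the kind of mismatch an internal consistency assertion would catch.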

With regard to the first question:

  Partition CPU % (Alloc + Unalloc) = (Total Allocated CPUs + Total Unallocatable CPUs) / Total Eligible CPUs
  71% = (4586 + 787) / 7504

Idle means the CPUs are available for allocation.  Unallocatable refers to CPUs that are free but cannot be given to a job.  For example, suppose a machine with 32 CPUs and 100 GB of RAM has been allocated a single job with 2 CPUs and 100 GB of RAM.  All the memory on this node has been allocated and, as a result, no further jobs will be allocated to this node.  The remaining 30 CPUs on that node are technically free but unallocatable due to the memory allocation.  I have some descriptions in the README.
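
The arithmetic can be checked against the numbers posted earlier in the thread. The relationships between the totals below are inferred from those numbers, not taken from the script, and truncating to a whole percent is an assumption (rounding gives the same values here).

```python
# CPU counts copied from the output posted in the thread.
alloc = 4586
idle = 2131
down = 2828
unallocatable = 787

# These sums reproduce the posted "Eligible" and "Configured" totals.
eligible = alloc + idle + unallocatable   # 7504
configured = eligible + down              # 10332


def pct(part, whole):
    """Whole-number percentage, truncated (assumed display convention)."""
    return int(100 * part / whole)


pct_unallocatable = pct(unallocatable, eligible)          # 10
pct_alloc_unalloc = pct(alloc + unallocatable, eligible)  # 71
pct_idle = pct(idle, eligible)                            # 28

print(pct_unallocatable, pct_alloc_unalloc, pct_idle)
```

Because the unallocatable CPUs are counted in both reported percentages, those two figures are not expected to add up to 100%. It is the 71% and the idle share (about 28%) that together account for all eligible CPUs.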

Giovanni