Performance getting list of running jobs

Francesco

Jul 10, 2018, 9:33:03 AM
to pyslurm
Hi all,

I need to get the list of running jobs and the resources they use on a cluster running Slurm 17.11.2.
At the moment I do not need all of the information Slurm can provide, only a few fields, similar to what squeue and sinfo return.

I originally wrote a simple python module that runs squeue and sinfo and parses their output to extract the needed values.
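Roughly, the wrapper does something like this (the squeue format string and fields here are just an illustration, not my exact module):

import subprocess

# Ask squeue for running jobs only, without a header, using a
# pipe-separated format string, then split out the fields.
def running_jobs():
    out = subprocess.check_output(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%i|%u|%P|%C"]).decode()
    jobs = []
    for line in out.splitlines():
        jobid, user, partition, cpus = line.split("|")
        jobs.append({"job_id": jobid, "user": user,
                     "partition": partition, "cpus": int(cpus)})
    return jobs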

Yesterday I found pyslurm and tried it, hoping to improve performance.

Now, running squeue and sinfo from the shell takes about 76 ms.
Running my little wrapper that parses squeue and sinfo output takes about 103 ms.
Running pyslurm.job().get() takes 368 ms.

This is much longer than I was expecting. I know that job().get() returns much more information than a plain squeue listing, but it is still relatively slow, and it does not yet include the resource information (sinfo).

If I use find("job_state", "RUNNING"), or call get() and then keep only the running jobs, the total time is pretty much the same, and most of it is spent in the get method.

import time
import pyslurm

a = pyslurm.job()

t1 = time.time()
d = a.get()                      # load all jobs into a dict of dicts
t2 = time.time()
print "Time for get: ", t2 - t1
print "Returned jobs: ", len(d)

l = [d[k] for k in d if d[k]["job_state"] == "RUNNING"]   # keep only running jobs
t3 = time.time()
print "Running jobs: ", len(l)
print "Time for filter: ", t3 - t2

Time for get:  0.367117881775
Returned jobs:  915
Running jobs:  28
Time for filter:  0.00134110450745


What is job.get() really returning? It includes lots of already completed jobs, but definitely not all of them.
Would filtering out the completed jobs directly in the get method help performance?
Do you have any other suggestions?
Is this behaviour expected, or am I doing something wrong?

Thanks in advance,
Francesco

Giovanni

Jul 10, 2018, 8:01:02 PM
to pyslurm
Hi Francesco,

job.get() calls the slurm_load_jobs() API, which loads all jobs in all states, puts them into a dictionary of dictionaries, and returns that to you. Often, aside from running jobs, it returns jobs in the pending, completed, completing, and canceled states as well. `scontrol show jobs` does the same thing. You may have an environment variable controlling the output of squeue; either way, squeue also calls the same Slurm job API.

job.find() also calls job.get() first to load the dictionary, then searches the dictionary by name and value.
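In other words, find() amounts to roughly this (a sketch of the behaviour described above, not the actual source):

import pyslurm

# Load every job first (the same full slurm_load_jobs() call),
# then filter the resulting dictionary in Python.
def find(name, value):
    jobs = pyslurm.job().get()
    return [jid for jid, info in jobs.items() if info.get(name) == value]

So find() cannot be faster than get(); all the filtering happens after the full load.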

You can try running a profiler to see where the time is spent: https://github.com/PySlurm/pyslurm/wiki/Profiling-PySlurm#using-line_profiler.  I'd be curious to see the output.
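If you don't want to set up line_profiler, even a quick cProfile run from the standard library will show where the time goes inside get(), for example:

import cProfile
import pyslurm

# Profile one get() call and sort the report by cumulative time.
cProfile.run("pyslurm.job().get()", sort="cumulative")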

I rewrote the job class in a different branch to use objects instead of dictionaries, and this did improve performance, but that branch is one release behind at 17.02 and hasn't been officially released.
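The rough idea is the following (a hypothetical sketch, not the actual branch): instead of eagerly copying every field of the underlying job record into a Python dict, a thin object keeps a reference to the record and only converts the fields you actually access:

class JobView(object):
    # Hypothetical illustration of the object-based approach.
    __slots__ = ("_raw",)

    def __init__(self, raw):
        self._raw = raw                    # underlying job record

    @property
    def job_state(self):
        return self._raw["job_state"]      # looked up only on access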

Giovanni