Hi all,
I need to get the list of running jobs and the resources they use on a cluster running Slurm 17.11.2.
At the moment I don't need the full set of information that Slurm can provide, just a few fields, similar to what squeue and sinfo return.
I originally wrote a simple Python module that runs squeue and sinfo and parses their output to extract the values I need.
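For reference, the wrapper does essentially the following (the squeue format string and field list here are just an illustration of the approach, not my exact code):

```python
import subprocess

def parse_squeue(squeue_output):
    """Parse 'squeue -h -o %i|%T|%C' style output into (jobid, state, cpus) tuples."""
    jobs = []
    for line in squeue_output.strip().splitlines():
        jobid, state, cpus = line.split("|")
        jobs.append((jobid, state, int(cpus)))
    return jobs

# In the real wrapper the text comes from squeue itself, e.g.:
#   out = subprocess.check_output(["squeue", "-h", "-o", "%i|%T|%C"]).decode()
out = "1001|RUNNING|16\n1002|PENDING|8\n"
print(parse_squeue(out))
```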
Yesterday I found pyslurm and tried it, hoping to improve performance.
Now, running squeue and sinfo from the shell takes about 76 ms.
Running my little wrapper that parses squeue and sinfo output takes about 103 ms.
Running pyslurm.job().get() takes 368 ms.
This is much longer than I was expecting. I know that job().get() returns much more information than plain squeue output, but it is still relatively slow, and it does not yet include the resource information (sinfo).
If I use find("job_state", "RUNNING"), or get() and then keep only the running jobs, the total time is pretty much the same, and most of it is spent in the get() method.
import time
import pyslurm

a = pyslurm.job()

t1 = time.time()
d = a.get()          # dict keyed by job id
t2 = time.time()
print("Time for get: ", t2 - t1)
print("Returned jobs: ", len(d))

# keep only the running jobs
l = [d[k] for k in d if d[k]["job_state"] == "RUNNING"]
t3 = time.time()
print("Running jobs: ", len(l))
print("Time for filter: ", t3 - t2)
Time for get: 0.367117881775
Returned jobs: 915
Running jobs: 28
Time for filter: 0.00134110450745
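(As an aside, I measure with a small helper like the one below; on Python 3, time.perf_counter() is better suited than time.time() for this, since it is monotonic and has higher resolution. The function timed here is mine, not part of pyslurm.)

```python
import time

def timed(fn, *args):
    """Call fn(*args) once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    return result, elapsed

# Example with a stand-in workload instead of pyslurm.job().get():
result, elapsed = timed(sorted, range(1000, 0, -1))
print("Items: ", len(result))
print("Time: ", elapsed)
```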
What is job().get() actually returning? It includes lots of already completed jobs, but definitely not all of them.
Would filtering out the completed jobs directly in the get method help performance?
Do you have any other suggestions?
Is this behaviour expected, or am I doing something wrong?
Thanks in advance,
Francesco