Karl Schulz
Jan 1, 2013, 5:05:04 PM
to slurm-dev
Hello again,
Apologies for the slow barrage of rookie questions. I was curious whether others in the community see slurm commands become less responsive when using the topology/tree plugin at large scale together with the "--switches" option.
If I enable topology/tree in version 2.4.5 and use the --switches flag for a single job, I can verify that it honors the switch topology provided in topology.conf, as expected. However, on the same idle system, a small job submitted without the "--switches" option, which should fit on 1 switch, is not scheduled to 1 switch. I understand from the docs that the scheduling may be sub-optimal, but I was surprised to see that happen when there were no actively running jobs. Consequently, the remaining discussion focuses on testing with the --switches flag.
Note that based on the guidance provided in the docs for the topology.conf configuration, I have only defined 2 levels of the fat-tree topology (the first level connected to endpoint hosts, and the 2nd level which connects to all level-1 switches). This attempts to minimize how many switches are provided to the plugin, but it is still decent in size because of a large number of hosts (6400 in this case, with > 300 L1 switches and 288 L2 switches).
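For illustration only, a two-level topology.conf of the shape described above might look like the sketch below. The switch names, node names, and the count of 20 nodes per level-1 switch are hypothetical placeholders, not the actual configuration of this system.

```conf
# Hypothetical sketch of a two-level fat-tree description.
# Level 1: each leaf switch lists the hosts attached to it.
SwitchName=leaf001 Nodes=node[0001-0020]
SwitchName=leaf002 Nodes=node[0021-0040]
# ... one line per remaining level-1 switch ...

# Level 2: each core switch lists the level-1 switches it connects to.
SwitchName=core001 Switches=leaf[001-320]
# ... one line per remaining level-2 switch ...
```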
The issue seems to arise once I start submitting multiple jobs with "--switches" requests. Once there are more than a few, commands like squeue and sinfo intermittently become dramatically less responsive (most often taking 5-20 seconds, and at times more than a minute).
This observation is derived from a simple test which submits 40 small sbatch jobs to an otherwise idle system. As the jobs are very small, the scheduler should be able to have all jobs running simultaneously.
Test Mode 1 (no topology requirements):
In the first test mode, I submit the jobs without any extra "--switches" options, and in this case, slurm schedules all the jobs almost instantly. In this mode, there is no noticeable interactive command degradation.
Test Mode 2 (each job includes an additional --switches option):
In the second test mode, each job adds a --switches=[num_switches] option, with "num_switches" chosen to be the smallest value for which the job can be accommodated topologically. The 40 jobs are submitted in sequence from a simple shell script, and as the jobs begin to be accepted, slurm command interactivity becomes erratic. In this mode, I have seen squeue -u <userid> take over a minute to complete. In addition to the sluggish interactivity (which seems to disappear eventually, after a subset of the jobs are running), it takes much longer for the topology jobs to schedule. I certainly understand that the space-filling curve algorithm will slow this process down, but it seems to take more than a factor of 2 longer on an idle system. Would you expect this? The strange part is that some of the jobs continue to report as pending for resources, although there are thousands of nodes which could satisfy the min switch request.
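A submission loop of the kind described above could be sketched roughly as follows. This is not the actual test script: job.sh, the node count of 16, and the switch count of 1 are hypothetical, and the SBATCH variable is an assumption added so that the loop defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Sketch only: job.sh, the node count, and the switch count are hypothetical.
# SBATCH defaults to a dry run that prints each command;
# set SBATCH=sbatch in the environment to submit for real.
SBATCH=${SBATCH:-echo sbatch}

# submit_jobs NUM_JOBS NODES SWITCHES
# Submits NUM_JOBS identical jobs in sequence, each asking for NODES nodes
# spread over at most SWITCHES switches.
submit_jobs() {
    num_jobs=$1
    nodes=$2
    switches=$3
    i=1
    while [ "$i" -le "$num_jobs" ]; do
        $SBATCH -N "$nodes" --switches="$switches" job.sh
        i=$((i + 1))
    done
}

submit_jobs 40 16 1
```

While such a loop is running, the responsiveness of the interactive commands can be sampled from another terminal with something like `time squeue -u <userid>`.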
To quantify the difference a bit, the time required from submission of the first job to completion of the 40th job is as follows:
(1) Test Mode 1 (no topo requirements): ~5 minutes
(2) Test Mode 2 (each job with --switches option): ~12 minutes
Any thoughts on what might be amiss based on these tests?
Thanks again for any advice,
Karl