multicore jobs in v7r2


Daniela Bauer

Jun 18, 2021, 10:28:52 AM
to diracgrid-forum
Hi All,

We installed a v7r2(p10) test server at Imperial and are testing job submission.
The server was upgraded from v7r1.
Single core job submission works, the jobs run and I can retrieve the output.
In v7r1 we use the MultiProcessorSiteDirector to submit multicore jobs. As I understand it, this is now deprecated, so I am trying to use the standard SiteDirector.
My queues (all of them) accept both single-core and multicore jobs. I have set:
Tag = MultiProcessor
and
NumberOfProcessors = 8 (or 16 or 64)
on these queues.
I use a JDL that works in v7r1, it contains the line:
Tags = {"8Processors"};
(I also tried Tags = {"8Processors", "MultiProcessor"};)
The jobs are matched and run, but only seem to request one core (I can see the ClassAd on the server).

What am I overlooking?

For the chosen few who are members of the gridpp VO and have survived the recent purge/upgrade mishap you can see the test server at:

Regards,
Daniela

Federico Stagni

Jun 21, 2021, 5:56:24 AM
to Daniela Bauer, diracgrid-forum
Hi,
first of all: you do use Pilot3, right?
Anyway, you also need to set "LocalCEType = Pool". These options are added by the Bdii2CSAgent, but IIUC you don't use it.

This is an example of the configuration in LHCb, where we added them at queue level, but you can also set them at CE level:

[image.png attachment: screenshot of an LHCb queue configuration]
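For readers who cannot see the attachment, a minimal sketch of the queue-level settings being discussed (CE and queue names here are invented; only the option names come from this thread):

some-ce.example.org
{
  CEType = ARC
  ...
  Queues
  {
    gridlong
    {
      ...
      NumberOfProcessors = 8
      # optional; added automatically when NumberOfProcessors > 1 (see Andrei's reply below)
      Tag = MultiProcessor
      LocalCEType = Pool
    }
  }
}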

Cheers,
Federico


Daniela Bauer

Jun 21, 2021, 7:04:01 AM
to Federico Stagni, diracgrid-forum
Hi Federico,

We are using pilot3. I looked at the documentation of the PoolCE, but it is a bit sparse:
Obviously I can do a bit of trial and error, but how does the PoolComputingElement handle 'mixed' environments?
I.e. if I set NumberOfProcessors to 8, does it always wait for an 8-core slot and then try to fill that up with single-core jobs?
Because that would go spectacularly wrong for queues that take anything between 1- and 64-core jobs :-S

Regards,
Daniela
--
Sent from my guinea pig enhanced living room

-----------------------------------------------------------
daniel...@imperial.ac.uk
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: Working from home, please use email.
http://www.hep.ph.ic.ac.uk/~dbauer/

Andrei Tsaregorodtsev

Jun 21, 2021, 8:05:39 AM
to diracgrid-forum
Hi Daniela,

First, you do not need to set the MultiProcessor tag, as it is added automatically for queues with NumberOfProcessors > 1.
Second, in the job JDL you only have to set the NumberOfProcessors requirement (Min/MaxNumberOfProcessors is a separate topic).
The logic of the PoolComputingElement consists of the following steps (a concrete example follows the list):

1. If no jobs have been taken yet, ask for jobs with the WholeNode tag.
2. If nothing was taken in step 1, ask for jobs with NumberOfProcessors > 1; repeat as many times as needed to fill the local capacity.
3. If processors are still left after step 2, take single-core jobs to fill the rest.
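As a concrete (invented) illustration: a pilot administering 8 processors first tries to match a WholeNode job; failing that, it might pick up, say, one 4-core and one 2-core job in step 2; the remaining 2 processors would then be filled with single-core jobs in step 3.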

  Cheers,
  Andrei

Federico Stagni

Jun 21, 2021, 9:32:56 AM
to Andrei Tsaregorodtsev, diracgrid-forum
Hi Daniela,

The Pool "inner" CE will administer a pool of processors. If you want to use this on cloud VMs, I suggest you set NumberOfProcessors to the number of logical processors of each VM. For "normal" CEs (the example above is for an ARC CE), it will reserve an 8-core slot on ARC.

Cheers,
Federico

Daniela Bauer

Jun 22, 2021, 7:04:25 AM
to Federico Stagni, Andrei Tsaregorodtsev, diracgrid-forum
Hi Andrei, Federico,

The rather fundamental problem I have right now is that my 8-core jobs don't request 8 cores and all run as single-core jobs.
I just tried the DIRAC certification server and reconfigured ceprod03.grid.hep.ph.ic.ac.uk; you can check if it looks OK.

I currently have one dteam job sitting in the queue (and it's definitely my job), and looking at it in Condor I see:
RequestCpus = 1
I used this JDL:
lxplus749:v7r3-pre12 :~] cat multiVO.jdl  
[
Executable = "multiVOexe.sh";
StdOutput = "job.log";
StdError = "job.log";
InputSandbox = "multiVOexe.sh";
OutputSandbox = "job.log";
Site = "LCG.UKI-LT2-IC-HEP.uk";
JobName = "MultiVOTest";
Tag = {"8Processors"};
]
I use these two variables to check for multicore:
echo "Checking number of threads: NSLOTS=${NSLOTS}"
echo "Checking number of threads: OMP_NUM_THREADS=${OMP_NUM_THREADS}"
OMP_NUM_THREADS is set by the batch system, so the PoolCE might have other ideas about what it should be (in v7r1 it is definitely 8 for an 8-core job), but looking at the Condor log on the CE is pretty conclusive.

I tried an ARC CE for a different VO from my own v7r2 test server and that worked. Could it be that the Condor submission is broken?

Regards,
Daniela

Andrei Tsaregorodtsev

Jun 22, 2021, 9:16:13 AM
to diracgrid-forum
Hi Daniela,

First, in the job JDL you should use NumberOfProcessors = 8; instead of Tag = {"8Processors"};.
Yes, behind the scenes this option is transformed into the tag, but this can change in the future.
Second, in the certification installation I see that the only queue for the ceprod03.grid.hep.ph.ic.ac.uk CE
has NumberOfProcessors = 1. It should be 8 in your case, I guess.

  Cheers,
  Andrei

Daniela Bauer

Jun 22, 2021, 10:02:14 AM
to Andrei Tsaregorodtsev, diracgrid-forum
Hi Andrei,

It was 8 when I submitted it; your autoconfig overrides this. I'm not sure how to convince it otherwise, short of hacking the code on the server, and that's apparently frowned upon ;-) Is there a way I can exclude this CE from the automated updates? I'm not too familiar with the 'standard' auto updater.
I could make a dummy queue, but then I would have a multi-core and a single-core DIRAC submission queue both submitting (in principle) to the same queue at Imperial (as we only have one), and I am not sure what other side effects that would have.
I tried NumberOfProcessors on my own installation, to no avail.
(And for ARC I used Tag = {"8Processors"}; and that worked, so my money is still on Condor.)

Regards,
Daniela


Andrei Tsaregorodtsev

Jun 22, 2021, 10:31:14 AM
to diracgrid-forum
I think you can forbid updating the ceprod03.grid.hep.ph.ic.ac.uk CE by adding it to the list in the option
/Systems/Configuration/Certification/Agents/Bdii2CSAgent/BannedCEs
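Spelled out as a CS snippet (purely illustrative; only the agent path and the option name come from the line above, and the section nesting simply follows that path):

Systems
{
  Configuration
  {
    Certification
    {
      Agents
      {
        Bdii2CSAgent
        {
          # list of CEs the agent should leave alone (illustrative value)
          BannedCEs = ceprod03.grid.hep.ph.ic.ac.uk
        }
      }
    }
  }
}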

  Cheers,
  Andrei

Daniela Bauer

Jun 23, 2021, 6:10:46 AM
to Andrei Tsaregorodtsev, diracgrid-forum
Hi Andrei,

Nope, it still seems to reconfigure it.
Does someone feel an urge to look at the agent on the certification server?

Cheers,
Daniela

Daniela Bauer

Jun 24, 2021, 11:24:18 AM
to Andrei Tsaregorodtsev, diracgrid-forum
Hi All,

Coming back to my original problem (I have convinced the auto-config not to reconfigure my site):
If you have access to the DIRAC certification instance, jobs 2983 (HTCondorCE) and 2984 (ARC CE) show the problem. They were both submitted with identical JDLs (*) to either LCG.UKI-LT2-IC-HEP.uk (HTCondorCE) or LCG.UKI-SOUTHGRID-RALPP.uk (ARC). The jobs run as dteam, and multicore.sh just dumps the environment. To check for multicore I use OMP_NUM_THREADS. It's 8 for RALPP and 1 for Imperial.
Has anyone made multicore jobs work with an HTCondorCE for DIRAC >= v7r2? If so, how?

Regards,
Daniela


(*)
[
Executable = "multicore.sh";
StdOutput = "job.log";
StdError = "job.log";
InputSandbox = "multicore.sh";
OutputSandbox = "job.log";
Site = "LCG.UKI-SOUTHGRID-RALPP.uk";
JobName = "MultiCoreTest";
NumberOfProcessors = "8";
]


ernst pijper

Jul 13, 2021, 7:57:44 AM
to diracgrid-forum
Hi,

This thread helped me a lot in setting up the configuration for multicore jobs. So, thanks for that. 

Still, some things are not clear to me. I have configured an 8-core queue in the way that was explained in this thread.
Now, when I submitted 10 single-core jobs, 10 8-core pilots were submitted. After 5 min or so, I submitted another 9 single-core
jobs. Another 9 pilots were submitted. According to what Andrei explained, I expected that one 8-core pilot would take
8 single-core jobs. I must be doing something wrong, as this is not what I'm seeing.

If what I'm seeing is correct and we don't want to waste resources, does that mean that I would have to configure separate queues for 1, 2, 4, 16 etc. core jobs?
Having 3 ARC CEs and 4 different queues per ARC, this means in total > 50 DIRAC queues. And we would also have to configure all these separate queues on our ARCs,
as the ARC queue must match the DIRAC queue name, so for instance nordugrid-slurm-infra, nordugrid-slurm-infra-2c, nordugrid-slurm-infra-4c, etc.

So you can understand that I hope I'm doing something wrong...


Daniela Bauer

Jul 13, 2021, 9:39:36 AM
to ernst pijper, Federico Stagni, diracgrid-forum
Hi Ernst,

If it's any consolation, I haven't managed to get that set up properly either, though I have an HTCondorCE.
According to Andrei T, you would have to set up different queues in DIRAC (not at the site) with different numbers of processors, but at least for HTCondorCEs I don't think it's working (I'm still running tests, but I did end up with single-core jobs in 8-core slots even though I had a single-core queue available).
@Federico Stagni et al, I think the multicore tests in the certification don't cover this at all. I'll try and think of something catchy, but I'd need a working configuration server for tests between the hackathons, please.

Regards,
Daniela


Federico Stagni

Jul 13, 2021, 9:59:34 AM
to ernst pijper, diracgrid-forum
Hi,
no need to set up all the queues you mention. We had a similar question in this PR: https://github.com/DIRACGrid/DIRAC/pull/5242 so you will find what you need in the replies there. Ask more questions if you want to.

Cheers,
Federico


ernst pijper

Jul 13, 2021, 10:30:11 AM
to diracgrid-forum
Hi Federico,

Thanks for the link, but things are still a bit unclear. You say I don't need to set up all the queues. Could you then maybe show me an example
of how to do it in CS language? Suppose I want 1, 2, 4, and 8 cores, using the infra queue as an example.

Ernst


Andrei Tsaregorodtsev

Jul 13, 2021, 11:17:20 AM
to diracgrid-forum
Hi Ernst,
First, even if you need to set up several queues for your CE in DIRAC, you do not need to define as many queues in the ARC configuration. You just add the CEQueueName option to the DIRAC queue configuration parameters; its value is the name of the ARC CE queue. Several DIRAC queues can then point to the same ARC queue but with different parameters.

Second, I do not think you need separate queues for the 1, 2, 4 and 8 core cases. You can do it, but 2-core jobs will be eligible for both the 4-core and the 8-core queues, and so on. So the case with two queues, with NumberOfProcessors = 1 and 8, will be sufficient. A schematic example of the CS settings is the following:

YourSite
{
  CE = ce01.node.nl
  CEs
  {
     ce01.node.nl
     {
        CEType = ARC
        ...
        Queues
        {
           1core
           {
              # This is a single core queue, nothing special
              CEQueueName = nordugrid-torque-infra
              ...
           }
           8core
           {
              # This is an 8-core queue which accepts only MP jobs
              CEQueueName = nordugrid-torque-infra
              ...
              NumberOfProcessors = 8
              RequiredTag = MultiProcessor
              LocalCEType = Pool
           }
        }
     }
  }
}

As for more pilots being submitted than necessary: this is a feature of the concept, I would say. Indeed, if the number of user jobs is small,
the system tends to submit more pilots than seems reasonable and to grab too many resources. However, if the number of user jobs is large
(our usual target case), then all the eligible resources will get a chance to compete for user jobs and pilots will execute more than one user
job each. So there is a small waste of resources in the first case, and large resource savings in the second. And you can always try to
find a compromise by playing with the queue definitions as above.

  Cheers,
  Andrei

ernst pijper

Jul 14, 2021, 4:06:17 AM
to diracgrid-forum
Thanks Andrei, the CEQueueName parameter makes it a lot easier. I have now configured some multicore queues using your examples.
However, when trying to match a job using the dirac-wms-match command, I always get the 'Job Tag MultiProcessor not satisfied'
warning for multicore jobs. For single-core jobs it works just fine. I know that the 'MultiProcessor' tag is added automatically for multicore
jobs. I have also made sure that 'RequiredTag = MultiProcessor' is present for the multicore queues. I've also tried adding the 'MultiProcessor'
tag to the Tag parameter, and some other permutations as well. Nothing seems to work.

Is there perhaps a minimum DIRAC version requirement for this to work? I'm running v7r1p30.

Thanks,
Ernst



Daniela Bauer

Jul 14, 2021, 4:44:00 AM
to Andrei Tsaregorodtsev, diracgrid-forum
Hi Andrei,

I have now set up all my CEs as follows (I went with RequiredTag to avoid single-core requests trying to start multicore pilots). The one-core queue is configured by our standard v7r1 configuration tool; I just left it unchanged.

[image.png attachment: screenshot of the CE/queue configuration]

I submitted 20 x 2-core jobs and 20 x 1-core jobs.

While tracking the jobs (easier said than done), much to my surprise I found this on a worker node:

[root@wf26 DIRAC_DXQLmJpilot]# ls
138      142      150.jdl  153.jdl  159.jdl  166.jdl  174.jdl         
control             diracos         job                pilotCommands.py   PilotLogger.pyc       pilot.tar       work
138.jdl  142.jdl  151.jdl  154.jdl  160.jdl  167.jdl  178.jdl          
defaults-DIRAC.cfg  dirac-pilot.py  MessageSender.py   pilotCommands.pyc  PilotLoggerTools.py   pilotTools.py
139.jdl  144.jdl  152      155      161.jdl  168.jdl 
bashrc            DIRAC               etc             MessageSender.pyc  pilot.json         PilotLoggerTools.pyc  pilotTools.pyc
140.jdl  145.jdl  152.jdl  155.jdl  162.jdl  172.jdl 
checksums.sha512  dirac-install.py    __init__.py     pilot.cfg          PilotLogger.py     pilot.out             scripts

Now, jobs up to and including 157 were 2-core multicore jobs, while >= 158 were single-core jobs. How come this pilot collected single-core jobs on top of the multicore jobs? It didn't do any harm, but why would that happen?

Regards,
Daniela


ernst pijper

Jul 15, 2021, 4:28:37 AM
to diracgrid-forum
Hi,

It seems that although dirac-wms-match cannot find a match and gives the message 'Job Tag MultiProcessor not satisfied', when I actually submit the job it does find a match and is executed successfully.

Ernst


Andrei Tsaregorodtsev

Jul 16, 2021, 5:13:13 AM
to diracgrid-forum
Looks like a bug, will look into it
Andrei

Daniela Bauer

Oct 6, 2021, 9:16:53 AM
to diracgrid-forum

Hi All,

Simon and I just got back to this.

As a reminder: we wanted to set up single-core and multicore (8-core) jobs at each site, as most (though not all) of our VOs use single-core jobs and most of the resources these communities get are opportunistic, which means 8-core pilots will wait a long time, if they are enabled for these VOs at all (you will just have to trust me on that one, especially if you are on LHCb), so the PoolCE route of submitting 8-core pilots and filling them up with single-core jobs is not very attractive.
In v7r1 we do this via the MultiProcessorSiteDirector. For v7r2 this involves setting up two queues for each site, one specifying "NumberOfProcessors = 1" and the other "NumberOfProcessors = 8", but otherwise identical (sketched below). Would it be possible for NumberOfProcessors to take a comma-separated list and expand that internally? That way LHCb could stick with "NumberOfProcessors = 8" and the behaviour would be unchanged, but GridPP could have "NumberOfProcessors = 1, 8" and wouldn't have to create a mess in their configuration system. Simon and I had a quick look, but got lost in the code logic, hence the question.
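(For concreteness, the per-CE duplication in question would look roughly like this, following Andrei's schematic from earlier in the thread; the queue names are invented:)

Queues
{
  condor-1core
  {
    # single-core queue, nothing special
    NumberOfProcessors = 1
    ...
  }
  condor-8core
  {
    # identical apart from the multicore settings
    NumberOfProcessors = 8
    RequiredTag = MultiProcessor
    LocalCEType = Pool
    ...
  }
}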

Regards,
Daniela


Federico Stagni

Oct 8, 2021, 3:59:59 AM
to Daniela Bauer, diracgrid-forum
Hi,
as of today, no, it's not possible to set NumberOfProcessors = 1,8. This parameter is read in many places, and it would for sure create a mess. I would not be up for introducing such a possibility, but if needed we could think of introducing something similar to https://dirac.readthedocs.io/en/latest/AdministratorGuide/Resources/storage.html#storageelementbases for queues.

Cheers,
Federico


Daniela Bauer

Oct 20, 2021, 7:24:36 AM
to diracgrid-forum
Hi Federico,

For GridPP at least the queue-base scheme would be quite good; we could have single-core, multicore and GPU-based queues on top of that.
Are you offering to code something? :-)

Regards,
Daniela

Federico Stagni

Oct 20, 2021, 8:06:35 AM
to Daniela Bauer, diracgrid-forum
No, I am not offering to code something, are you? In either case, please make an issue on GitHub with the requirements and possibly a proposal.

Cheers,
Federico
