Multiple payload matching to same JobID?

17 views
Skip to first unread message

MIYAKE Hideki

unread,
Sep 22, 2021, 8:41:04 AM9/22/21
to diracgrid-forum
Hello,

At the last BiLD meeting I reported odd job status transition (from Done to Running) at Belle2.
After the meeting, I collected ~20 similar job samples and realized the issue could be originated by multiple payload matching to same JobID.
In many cases reschedulings occur but sometimes not, thus I'm not sure whether rescheduling affects this issue.

Anyways Matcher picks already assigned job sufficiently after some period from the first matching, as shown below.
In my understanding it should not occur since the JobID is removed from TaskQueue...perhaps any race condition in optimizer?
Since we have our extension in Optimizer, we must suspect that it comes from our modules, first of all. I'd leave the information as a record, and will update the topic when we find any clue.

However, before that, I'd appreciate it if you could give us any advice to investigate further detail or implement protection.


Source                      Status       MinorStatus                        ApplicationStatus                                       DateTime
=========================================================================================================================================================
JobManager                  Received     Job accepted                       Unknown                                                 2021-09-11 13:20:24
JobPath                     Checking     Scouting                           Unknown                                                 2021-09-11 13:20:33
JobSanity                   Checking     InputData                          Unknown                                                 2021-09-11 13:20:33
Scouting                    Checking     JobSanity                          Unknown                                                 2021-09-11 13:20:33
InputData                   Checking     JobScheduling                      Unknown                                                 2021-09-11 13:20:34
JobScheduling               Waiting      Pilot Agent Submission             Unknown                                                 2021-09-11 13:20:34
Matcher                     Matched      Assigned                           Unknown                                                 2021-09-11 21:26:51
JobPath                     Checking     Scouting                           Unknown                                                 2021-09-11 21:37:47
Scouting                    Checking     JobSanity                          Unknown                                                 2021-09-11 21:37:47
JobSanity                   Checking     InputData                          Unknown                                                 2021-09-11 21:37:47
InputData                   Checking     JobScheduling                      Unknown                                                 2021-09-11 21:37:49
JobScheduling               Checking     JobScheduling                      On Hold: after rescheduling 1                           2021-09-11 21:37:49
JobScheduling               Waiting      Pilot Agent Submission             Unknown                                                 2021-09-11 21:40:51
JobA...@OSG.BNL.us         Matched      Job Received by Agent              Unknown                                                 2021-09-11 21:42:07
Matcher                     Matched      Assigned                           Unknown                                                 2021-09-11 21:42:07
JobA...@OSG.BNL.us         Matched      Installing Software                Unknown                                                 2021-09-11 21:42:09
JobA...@OSG.BNL.us         Matched      Submitted To CE                    Unknown                                                 2021-09-11 21:42:19
JobWrapper                  Running      Job Initialization                 Unknown                                                 2021-09-11 21:42:25
JobWrapper                  Running      Downloading InputSandbox           Unknown                                                 2021-09-11 21:42:28
JobWrapper                  Running      Input Data Resolution              Unknown                                                 2021-09-11 21:42:37
JobWrapper                  Running      Application                        Unknown                                                 2021-09-11 21:43:00
Job_205833717               Running      Application                        Executing RunScriptStep1                                2021-09-11 21:43:05
basf2                       Running      Application                        Running                                                 2021-09-11 21:43:22
JobA...@DIRAC.IITG.in      Matched      Installing Software                Running                                                 2021-09-11 21:56:58
JobA...@DIRAC.IITG.in      Matched      Submitted To CE                    Running                                                 2021-09-11 21:57:13
JobA...@DIRAC.IITG.in      Matched      Job Received by Agent              Running                                                 2021-09-11 22:07:27
JobWrapper                  Running      Job Initialization                 Running                                                 2021-09-11 23:08:03
JobWrapper                  Running      Downloading InputSandbox           Running                                                 2021-09-11 23:18:08

Best Regards,
Hideki

Federico Stagni

unread,
Sep 22, 2021, 12:46:26 PM9/22/21
to MIYAKE Hideki, diracgrid-forum
Hi,
this is really weird, I have never seen such a logging.

I also suspect that there might be an issue in your Optimizer extension, as the first 10 lines are rather weird.
At the same time, the main issue here is the matcher matching twice the same job, and this is not something coming from the Optimizers' extension.

I just have for the moment some questions:

- how many instances of the Matcher service are you running?
- do you have other extensions in WMS? E.g in TaskQueueDB?
- which DIRAC version is this?
- can you get to know the percentage of jobs being rescheduled?
- is the logging above full? I think there are more lines to be shown


Cheers,
Federico


--
You received this message because you are subscribed to the Google Groups "diracgrid-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to diracgrid-for...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/diracgrid-forum/7d06425c-2904-4803-aef5-4116c8782388n%40googlegroups.com.

MIYAKE Hideki

unread,
Sep 22, 2021, 1:38:58 PM9/22/21
to diracgrid-forum
Hi Federico,

-how many instances of the Matcher service are you running?

We have only one Matcher, though Optimizers are 6.

- do you have other extensions in WMS? E.g in TaskQueueDB?

We have several modules in WMS (e.g. Agent to control our test job scheme (named "Scouting"))
Direct extensions to vanilla components are SiteDirector (PilotWrapper) and TaskQueueDB.
TaskQueueDB extenstion adds some new functions (e.g. update priority)

- which DIRAC version is this?

v7r0p56

- can you get to know the percentage of jobs being rescheduled?

Roughly 2% for Done and 9% for Failed jobs (RescheduleCounter>0 in JobDB)

- is the logging above full? I think there are more lines to be shown

I chopped since it is a bit longer and may be confusing.
Here is the rest.

JobPath                     Checking     Scouting                           Running                                                 2021-09-11 23:57:47
JobSanity                   Checking     InputData                          Running                                                 2021-09-11 23:57:47
Scouting                    Checking     JobSanity                          Running                                                 2021-09-11 23:57:47
JobManager                  Received     Job Rescheduled                    Unknown                                                 2021-09-11 23:57:47
InputData                   Checking     JobScheduling                      Unknown                                                 2021-09-11 23:57:49
JobScheduling               Checking     JobScheduling                      On Hold: after rescheduling 2                           2021-09-11 23:57:49
JobScheduling               Waiting      Pilot Agent Submission             Unknown                                                 2021-09-12 00:02:59
Matcher                     Matched      Assigned                           Unknown                                                 2021-09-12 00:03:05
JobWrapper                  Rescheduled  Input Sandbox Download             Unknown                                                 2021-09-12 00:38:19
JobWrapper                  Rescheduled  Input Sandbox Download             Failed Input Sandbox Download                           2021-09-12 00:38:19
JobWrapper                  Running      Failed Downloading InputSandbox    Failed Input Sandbox Download                           2021-09-12 00:38:19
JobA...@DIRAC.IITG.in      Matched      Job Received by Agent              Failed Input Sandbox Download                           2021-09-12 00:43:41
JobA...@DIRAC.IITG.in      Matched      Installing Software                Failed Input Sandbox Download                           2021-09-12 01:15:14
JobA...@DIRAC.IITG.in      Matched      Submitted To CE                    Failed Input Sandbox Download                           2021-09-12 01:25:28
JobWrapper                  Running      Job Initialization                 Failed Input Sandbox Download                           2021-09-12 02:36:21
JobWrapper                  Running      Downloading InputSandbox           Failed Input Sandbox Download                           2021-09-12 03:36:28
JobManager                  Received     Job Rescheduled                    Unknown                                                 2021-09-12 04:56:09
JobPath                     Checking     Scouting                           Unknown                                                 2021-09-12 04:56:09
Scouting                    Checking     JobSanity                          Unknown                                                 2021-09-12 04:56:09
JobSanity                   Checking     InputData                          Unknown                                                 2021-09-12 04:56:09
JobScheduling               Checking     InputData                          On Hold: after rescheduling 3                           2021-09-12 04:56:11
InputData                   Checking     JobScheduling                      On Hold: after rescheduling 3                           2021-09-12 04:56:11
JobScheduling               Waiting      Pilot Agent Submission             Unknown                                                 2021-09-12 05:06:34
JobWrapper                  Running      Failed Downloading InputSandbox    Unknown                                                 2021-09-12 05:06:40
JobWrapper                  Rescheduled  Input Sandbox Download             Unknown                                                 2021-09-12 05:06:40
JobWrapper                  Rescheduled  Input Sandbox Download             Failed Input Sandbox Download                           2021-09-12 05:06:40
Matcher                     Matched      Assigned                           Failed Input Sandbox Download                           2021-09-12 05:06:55
JobA...@DIRAC.Shandong.cn  Matched      Job Received by Agent              Failed Input Sandbox Download                           2021-09-12 05:06:55
JobA...@DIRAC.Shandong.cn  Matched      Installing Software                Failed Input Sandbox Download                           2021-09-12 05:06:57
JobA...@DIRAC.Shandong.cn  Matched      Submitted To CE                    Failed Input Sandbox Download                           2021-09-12 05:07:02
JobWrapper                  Running      Job Initialization                 Failed Input Sandbox Download                           2021-09-12 05:07:08
JobWrapper                  Running      Downloading InputSandbox           Failed Input Sandbox Download                           2021-09-12 05:07:10
JobWrapper                  Running      Input Data Resolution              Failed Input Sandbox Download                           2021-09-12 05:07:16
JobWrapper                  Running      Application                        Failed Input Sandbox Download                           2021-09-12 05:07:23
Job_205833717               Running      Application                        Executing RunScriptStep1                                2021-09-12 05:07:27
basf2                       Running      Application                        Running                                                 2021-09-12 05:07:39
gbasf2                      Running      Application                        Preparing to upload                                     2021-09-12 08:25:32
gbasf2                      Running      Application                        Uploading                                               2021-09-12 08:25:42
gbasf2                      Running      Application                        Done                                                    2021-09-12 08:25:53
Job_205833717               Running      Application                        Done                                                    2021-09-12 08:25:55
JobWrapper                  Completed    Application Finished Successfully  Done                                                    2021-09-12 08:26:06
JobWrapper                  Completed    Uploading Output Sandbox           Done                                                    2021-09-12 08:26:07
JobWrapper                  Done         Execution Complete                 Done                                                    2021-09-12 08:26:09
JobWrapper                  Completed    Output Sandbox Uploaded            Done                                                    2021-09-12 08:26:09
gbasf2                      Completed    Output Sandbox Uploaded            Preparing to upload                                     2021-09-12 13:13:01
gbasf2                      Completed    Output Sandbox Uploaded            Uploading                                               2021-09-12 13:13:32
gbasf2                      Completed    Output Sandbox Uploaded            Error: existing Replica                                 2021-09-12 13:39:57
gbasf2                      Completed    Output Sandbox Uploaded            Error: uploading failed                                 2021-09-12 13:39:57
gbasf2                      Completed    Output Sandbox Uploaded            Error: existing Replica                                 2021-09-12 14:03:05
Job_205833717               Completed    Output Sandbox Uploaded            basf2helper.py Exited With Status 254                   2021-09-12 14:03:06
Job_205833717               Completed    Output Sandbox Uploaded            RuntimeError("basf2helper.py Exited With Status 254",)  2021-09-12 14:03:07
JobWrapper                  Completed    Application Finished With Errors   RuntimeError("basf2helper.py Exited With Status 254",)  2021-09-12 14:03:16
JobWrapper                  Failed       Application Finished With Errors   RuntimeError("basf2helper.py Exited With Status 254",)  2021-09-12 14:03:18

2021年9月23日木曜日 1:46:26 UTC+9 sta...@gmail.com:

MIYAKE Hideki

unread,
Sep 22, 2021, 1:45:20 PM9/22/21
to diracgrid-forum
...and this is typical logging when job successfully finished

Source               Status     MinorStatus                        ApplicationStatus         DateTime
==================================================================================================================
JobManager           Received   Job accepted                       Unknown                   2021-09-19 13:42:04
JobPath              Checking   Scouting                           Unknown                   2021-09-19 13:42:12
Scouting             Checking   JobSanity                          Unknown                   2021-09-19 13:42:12
JobSanity            Checking   InputData                          Unknown                   2021-09-19 13:42:12
InputData            Checking   JobScheduling                      Unknown                   2021-09-19 13:42:13
JobScheduling        Waiting    Pilot Agent Submission             Unknown                   2021-09-19 13:42:13
Matcher              Matched    Assigned                           Unknown                   2021-09-19 13:47:46
JobA...@LCG.KEK.jp  Matched    Job Received by Agent              Unknown                   2021-09-19 13:47:46
JobA...@LCG.KEK.jp  Matched    Installing Software                Unknown                   2021-09-19 13:47:46
JobA...@LCG.KEK.jp  Matched    Submitted To CE                    Unknown                   2021-09-19 13:47:52
JobWrapper           Running    Input Data Resolution              Unknown                   2021-09-19 13:47:56
JobWrapper           Running    Downloading InputSandbox           Unknown                   2021-09-19 13:47:56
JobWrapper           Running    Job Initialization                 Unknown                   2021-09-19 13:47:56
JobWrapper           Running    Application                        Unknown                   2021-09-19 13:47:57
Job_206886131        Running    Application                        Executing RunScriptStep1  2021-09-19 13:47:58
basf2                Running    Application                        Running                   2021-09-19 13:48:08
gbasf2               Running    Application                        Preparing to upload       2021-09-19 13:57:51
gbasf2               Running    Application                        Uploading                 2021-09-19 13:57:52
gbasf2               Running    Application                        Done                      2021-09-19 13:57:59
Job_206886131        Running    Application                        Done                      2021-09-19 13:57:59
JobWrapper           Completed  Application Finished Successfully  Done                      2021-09-19 13:58:00
JobWrapper           Completed  Uploading Output Sandbox           Done                      2021-09-19 13:58:00
JobWrapper           Completed  Output Sandbox Uploaded            Done                      2021-09-19 13:58:00
JobWrapper           Done       Execution Complete                 Done                      2021-09-19 13:58:00

Cheers,
Hideki

2021年9月23日木曜日 2:38:58 UTC+9 MIYAKE Hideki:

Federico Stagni

unread,
Sep 23, 2021, 4:10:40 AM9/23/21
to MIYAKE Hideki, diracgrid-forum
Hi Hideki,
Only 1 matcher is correct. Don't have more than 1, it might lead to issues, at least in its current state.

Your DIRAC version is really old and not supported anymore, but this you knew already. There are also some relevant implications here:
- I remembered that I made this fix: https://github.com/DIRACGrid/DIRAC/pull/4799/commits/cba01c5ff22531a7d3153ee53cfe2858fec9a6e5 in v7r2. This makes sure that the JobWrapperTemplate kills the jobWrapper in case of exceptions. In the past in LHCb we had similar cases to yours.
- the full logging info shows that the time registered for these log events is not fully correct. There have been several updates in this field in later versions, e.g. https://github.com/DIRACGrid/DIRAC/pull/4839 but several others went in as fixes part of other PRs.

So, basically: update to more (most!) recent version ASAP.

Cheers,
Federico




MIYAKE Hideki

unread,
Sep 23, 2021, 11:15:21 PM9/23/21
to diracgrid-forum
Dear Federico,

Thank you very much for useful information and advices!

Actually, for this particular issue, we realized one specific site is involved in all cases.
Although still I don't understand the cause but the timestamp JobAgent started and Matcher assigned have contradiction regardless of same pilot reference.
[Matcher]
2021-09-11 21:26:51 UTC WorkloadManagement/Matcher/JobLoggingDB INFO: Adding record for job 205833717: 'status/minor/app=Matched/Assigned/idem' from Matcher
[pilot]
2021-09-11 22:07:23 UTC WorkloadManagement/JobAgent INFO: Loading WorkloadManagement/JobAgent

We are under further investigation but perhaps this case comes down to a site issue, though still I don't understand who triggered (probably) rescheduling at 21:37:47...StalledJobAgent?

Anyways, as you pointed out v7r0 sometimes shows odd timestamp and we need additional verification to check such logging info.
We really need to catch up the latest (yes ASAP)

Cheers,
Hideki

2021年9月23日木曜日 17:10:40 UTC+9 sta...@gmail.com:
Reply all
Reply to author
Forward
0 new messages