Update to 7.3.5: job rescheduling does not seem to work

Luisa Arrabito

Oct 7, 2021, 10:20:45 AM
to diracgrid-forum

I've just updated DIRAC servers to 7.3.5 and I'm running some tests.
I've noticed that Job Rescheduling of Failed jobs does not seem to work.
Jobs get:
MinorStatus=Job Rescheduled
but the Status remains Failed

In the job logging info I get:
....
JobManager Received Job Rescheduled Unknown 2021-10-07 13:48:53

but they stay stuck in this status.

Any idea?

Thank you,

Luisa

Federico Stagni

Oct 12, 2021, 10:27:11 AM
to Luisa Arrabito, diracgrid-forum
Hi,
I am not sure I understand. In the JobLoggingInfo you wrote:

JobManager Received Job Rescheduled Unknown 2021-10-07 13:48:53

which means, IIUC, that the job is in status "Received". But you are saying that the jobs are "Failed"...?

Can you post the whole JobLoggingInfo?

Cheers,
Federico


Luisa Arrabito

Oct 12, 2021, 11:17:35 AM
to diracgrid-forum
The job failed, then I rescheduled it, and here is what I get as JobLoggingInfo:

(base) bash-4.2$ dirac-wms-job-logging-info 262
Source                    Status      MinorStatus                       ApplicationStatus                                                            DateTime
==========================================================================================================================================================================
JobManager                Received    Job accepted                      Unknown                                                                      2021-10-12 13:59:26
JobPath                   Checking    JobSanity                         Unknown                                                                      2021-10-12 13:59:26
JobSanity                 Checking    JobScheduling                     Unknown                                                                      2021-10-12 13:59:26
JobScheduling             Waiting     Pilot Agent Submission            Unknown                                                                      2021-10-12 13:59:26
Matcher                   Matched     Assigned                          Unknown                                                                      2021-10-12 14:02:31
JobA...@LCG.IN2P3-CC.fr  Matched     Job Received by Agent             Unknown                                                                      2021-10-12 14:05:00
JobA...@LCG.IN2P3-CC.fr  Matched     Submitting To CE                  Unknown                                                                      2021-10-12 14:05:00
JobWrapper                Running     Job Initialization                Unknown                                                                      2021-10-12 14:05:01
JobWrapper                Running     Downloading InputSandbox          Unknown                                                                      2021-10-12 14:05:01
JobWrapper                Running     Application                       Unknown                                                                      2021-10-12 14:05:09
Job_262                   Running     Application                       Executing Step1_LS_Init                                                      2021-10-12 14:05:11
Job_262                   Running     Application                       ls -alhtr successful                                                         2021-10-12 14:05:13
Job_262                   Running     Application                       Executing Step2_Env                                                          2021-10-12 14:05:13
Job_262                   Running     Application                       env successful                                                               2021-10-12 14:05:13
Job_262                   Running     Application                       Executing Step3_SetupSoftware                                                2021-10-12 14:05:13
Job_262                   Running     Application                       cta-prod-setup-software Exited With Status 1                                 2021-10-12 14:05:14
Job_262                   Running     Application                       Operation not permitted ( 1 : cta-prod-setup-software Exited With Status 1)  2021-10-12 14:05:14
JobWrapper                Completing  Application Finished With Errors  Operation not permitted ( 1 : cta-prod-setup-software Exited With Status 1)  2021-10-12 14:05:26
JobWrapper                Failed      Application Finished With Errors  Operation not permitted ( 1 : cta-prod-setup-software Exited With Status 1)  2021-10-12 14:05:26

JobManager                Received    Job Rescheduled                   Unknown 

Luisa Arrabito

Oct 12, 2021, 11:19:48 AM
to diracgrid-forum
However, in the JobMonitor it is still Failed (jobID=262).

Federico Stagni

Oct 13, 2021, 5:21:25 AM
to Luisa Arrabito, diracgrid-forum

Luisa Arrabito

Oct 13, 2021, 5:31:38 AM
to diracgrid-forum
OK thanks a lot. I will give it a try as soon as it's merged.
Luisa

Luisa Arrabito

Oct 13, 2021, 10:14:17 AM
to diracgrid-forum
I've seen that it has been merged in 7.3.
Is there a plan to make a patch release out of it or do you want me to test it before?

Luisa Arrabito

Oct 13, 2021, 10:31:01 AM
to diracgrid-forum
So I've just tried to apply the fix on our test server and I'm now able to 'reset' jobs, but not to 'reschedule' them.
They get stuck in Checking status.

Here is what I get after rescheduling a Failed job:

(base) bash-4.2$ dirac-wms-job-status 285
JobID=285 ApplicationStatus=On Hold: after rescheduling 1; MinorStatus=JobScheduling; Status=Checking; Site=ANY;

(base) bash-4.2$ dirac-wms-job-logging-info 285

Source                        Status      MinorStatus                       ApplicationStatus                                              DateTime
================================================================================================================================================================
JobManager                    Received    Job accepted                      Unknown                                                        2021-10-13 14:23:17
JobPath                       Checking    JobSanity                         Unknown                                                        2021-10-13 14:23:17
JobSanity                     Checking    JobScheduling                     Unknown                                                        2021-10-13 14:23:17
JobScheduling                 Waiting     Pilot Agent Submission            Unknown                                                        2021-10-13 14:23:17
Matcher                       Matched     Assigned                          Unknown                                                        2021-10-13 14:23:25
JobManager                    Received    Job Rescheduled                   Unknown                                                        2021-10-13 14:24:03
JobPath                       Checking    JobSanity                         Unknown                                                        2021-10-13 14:24:03
JobSanity                     Checking    JobScheduling                     Unknown                                                        2021-10-13 14:24:03
JobScheduling                 Checking    JobScheduling                     On Hold: after rescheduling 1                                  2021-10-13 14:24:03
JobA...@LCG.DESY-ZEUTHEN.de  Matched     Job Received by Agent             On Hold: after rescheduling 1                                  2021-10-13 14:25:54
JobA...@LCG.DESY-ZEUTHEN.de  Matched     Submitting To CE                  On Hold: after rescheduling 1                                  2021-10-13 14:25:55
JobWrapper                    Running     Job Initialization                On Hold: after rescheduling 1                                  2021-10-13 14:25:57
JobWrapper                    Running     Downloading InputSandbox          On Hold: after rescheduling 1                                  2021-10-13 14:25:57
JobWrapper                    Running     Application                       On Hold: after rescheduling 1                                  2021-10-13 14:25:59
Job_285                       Running     Application                       Executing RunScriptStep1                                       2021-10-13 14:26:00
Job_285                       Running     Application                       ls test Exited With Status 2                                   2021-10-13 14:26:01
Job_285                       Running     Application                       No such file or directory ( 2 : ls test Exited With Status 2)  2021-10-13 14:26:01
JobWrapper                    Completing  Application Finished With Errors  No such file or directory ( 2 : ls test Exited With Status 2)  2021-10-13 14:26:11
JobWrapper                    Failed      Application Finished With Errors  No such file or directory ( 2 : ls test Exited With Status 2)  2021-10-13 14:26:20

Thank you,

Luisa
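
Note: the two outputs above disagree. dirac-wms-job-status reports Status=Checking, while the last logging record for the same job says Failed. A minimal Python sketch for spotting such a mismatch in a script follows. parse_job_status is a hypothetical helper (not part of DIRAC), and it assumes the status line always has the 'JobID=N Key=Value; Key=Value;' layout shown above.

```python
def parse_job_status(line):
    """Split one dirac-wms-job-status output line into a dict of strings.

    Hypothetical helper, not a DIRAC API. Assumes the layout
    'JobID=N Key=Value; Key=Value; ...' seen in the output quoted above.
    """
    head, _, rest = line.strip().partition(" ")
    fields = dict([head.split("=", 1)])  # the leading 'JobID=N' token
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if chunk:
            key, _, value = chunk.partition("=")
            fields[key] = value
    return fields


status_line = (
    "JobID=285 ApplicationStatus=On Hold: after rescheduling 1; "
    "MinorStatus=JobScheduling; Status=Checking; Site=ANY;"
)
last_logged_status = "Failed"  # Status column of the last logging record above

fields = parse_job_status(status_line)
if fields["Status"] != last_logged_status:
    print(f"Job {fields['JobID']}: status says {fields['Status']}, "
          f"last logging record says {last_logged_status}")
```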

Federico Stagni

Oct 13, 2021, 11:01:18 AM
to Luisa Arrabito, diracgrid-forum
How did you reschedule #285? With a DIRAC command?

Luisa Arrabito

Oct 13, 2021, 11:07:41 AM
to diracgrid-forum
With the 'reschedule' button of the web portal (while the 'reset' button worked fine).

Luisa Arrabito

Oct 14, 2021, 9:46:00 AM
to diracgrid-forum
Just to let you know that I get the same behaviour even when rescheduling with the CLI: dirac-wms-job-reschedule.

Luisa

Federico Stagni

Oct 14, 2021, 10:46:02 AM
to Luisa Arrabito, diracgrid-forum
Job Reset calls Job Reschedule, so I don't understand how the former can work while the latter does not. Have you restarted the JobManager in the meantime? Are you sure that you applied the hotfix in the correct location? The py3 code is, of course, in a different location.
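
One way to check which copy of a module the running interpreter actually imports, sketched in Python with the standard library only. module_source is a hypothetical helper name, and the check is only meaningful when run with the same Python environment as the service.

```python
import importlib.util


def module_source(name):
    """Return the file a module would be imported from, or None if not found.

    Hypothetical helper. For dotted names the parent packages get imported.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Standard-library module used as a stand-in so the sketch runs anywhere.
print(module_source("json.decoder"))
```

On a DIRAC server one would presumably pass a DIRAC module name instead (for example DIRAC.WorkloadManagementSystem.DB.JobLoggingDB) and compare the printed path with the directory where the hotfix was applied.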

Luisa Arrabito

Oct 14, 2021, 11:03:33 AM
to diracgrid-forum
Yes, I restarted the JobManager and also the OptimizationMind.
In the log of the OptimizationMind for one rescheduled job I get:

2021-10-14 14:57:20 UTC WorkloadManagement/OptimizationMind/WorkloadManagement/JobLoggingDB INFO: Adding record for job  230: 'status/minor/app=Failed/Could not dispatch task: Exception while calling dispatch callback/None' from OptimizationMindHandler
2021-10-14 14:57:20 UTC WorkloadManagement/OptimizationMind ERROR: Uncaught exception when serving Connect Function conn_connected
Traceback (most recent call last):
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 406, in _rh_executeConnectionCallback
    uReturnValue = oMethod(self.__trid, *args)
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Base/ExecutorMindHandler.py", line 154, in conn_connected
    return self.exec_executorConnected(trid, kwargs["executorTypes"])
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/OptimizationMindHandler.py", line 138, in exec_executorConnected
    return cls.__loadJobs(eTypes)
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/OptimizationMindHandler.py", line 111, in __loadJobs
    cls.executeTask(jid, CachedJobState(jid))
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Base/ExecutorMindHandler.py", line 278, in executeTask
    return cls.__eDispatch.addTask(taskId, taskObj)
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ExecutorDispatcher.py", line 654, in addTask
    return self.__dispatchTask(taskId)
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ExecutorDispatcher.py", line 571, in __dispatchTask
    self.__cbHolder.cbTaskError(taskId, taskObj, "Could not dispatch task: %s" % result["Message"])
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Base/ExecutorMindHandler.py", line 56, in cbTaskError
    return self.__taskErrCB(taskId, taskObj, errorMsg)
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/OptimizationMindHandler.py", line 215, in exec_taskError
    return jobState.setStatus("Failed", errorMsg, source="OptimizationMindHandler")
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Client/JobState/JobState.py", line 194, in setStatus
    return JobState.__db.logDB.addLoggingRecord(
  File "/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/DB/JobLoggingDB.py", line 85, in addLoggingRecord
    % (int(jobID), status, minorStatus, applicationStatus[:255], str(_date), epoc, source[:32])
TypeError: 'NoneType' object is not subscriptable

Any idea what the problem is?
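
Note on the traceback: the last frame slices applicationStatus[:255], and the log line just before the exception shows the application status as None, so that slice is the likely source of the TypeError. A minimal Python sketch of the failure and of one defensive way to avoid it follows. clip is a hypothetical helper and the "Unknown" placeholder is an assumption (it is the value shown in the ApplicationStatus column elsewhere in this thread); this is not the actual DIRAC fix.

```python
def clip(value, limit):
    """Return value as a string cut to `limit` characters.

    Hypothetical helper: None is mapped to the placeholder 'Unknown'
    (an assumption) instead of being sliced.
    """
    if value is None:
        return "Unknown"
    return str(value)[:limit]


# Reproduce the failure: slicing None raises TypeError.
application_status = None
try:
    application_status[:255]
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# The guarded version works for None and for over-long strings alike.
print(clip(application_status, 255))
print(len(clip("x" * 300, 255)))
```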

I've changed the code here:
/opt/dirac/versions/v2.0a5-1633444362/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/

I was wondering whether it is also necessary to clean __pycache__.

Thank you,

Luisa
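
Note on the __pycache__ question: with CPython's default timestamp-based .pyc files, the interpreter compares the source file's modification time and size with the values stored in the .pyc header and recompiles on a mismatch. Clearing __pycache__ by hand is therefore normally not needed after editing a .py file, although the service still has to be restarted because already-imported modules stay in memory. A sketch of that check, assuming Python 3.7+ and the default invalidation mode; pyc_is_stale is a hypothetical helper.

```python
import importlib.util
import os
import struct


def pyc_is_stale(source_path):
    """Tell whether the cached .pyc of a source file no longer matches it.

    Hypothetical helper. Returns None when there is no .pyc or when the
    .pyc is hash-based (its timestamp fields are then not used).
    """
    pyc_path = importlib.util.cache_from_source(source_path)
    if not os.path.exists(pyc_path):
        return None
    with open(pyc_path, "rb") as fh:
        header = fh.read(16)  # magic, flags, source mtime, source size
    if len(header) < 16:
        return True  # truncated file, treat as stale
    flags, mtime, size = struct.unpack("<III", header[4:16])
    if flags & 0x1:
        return None  # hash-based .pyc
    st = os.stat(source_path)
    return (int(st.st_mtime) & 0xFFFFFFFF) != mtime or (st.st_size & 0xFFFFFFFF) != size


# Stand-in path so the sketch runs anywhere; on a server one would pass the edited file.
print(pyc_is_stale(os.__file__))
```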

Federico Stagni

Oct 14, 2021, 11:26:30 AM
to Luisa Arrabito, diracgrid-forum

Luisa Arrabito

Oct 15, 2021, 8:17:41 AM
to diracgrid-forum
Thanks.
I've tried this hack and now jobs get rescheduled, but they take from 5 to 10 minutes to reach Waiting status:

JobWrapper           Failed      Application Finished With Errors   No such file or directory ( 2 : dirac Exited With Status 2)  2021-10-14 13:47:25
JobManager           Received    Job Rescheduled                    Unknown                                                      2021-10-15 09:10:29
JobPath              Checking    JobSanity                          Unknown                                                      2021-10-15 09:10:31
JobSanity            Checking    JobScheduling                      Unknown                                                      2021-10-15 09:10:31
JobScheduling        Checking    JobScheduling                      On Hold: after rescheduling 1                                2021-10-15 09:10:31
JobScheduling        Waiting     Pilot Agent Submission             Unknown                                                      2021-10-15 09:14:02

Is there any other issue?

Thank you,

Luisa
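
Note: the hold time can be read directly off the logging records. A small Python sketch, assuming rows shaped like the ones quoted above (columns separated by two or more spaces, timestamp last); timestamp and hold_seconds are hypothetical helpers.

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def timestamp(row):
    """Parse the DateTime column (the last column) of one logging-info row."""
    return datetime.strptime(re.split(r"\s{2,}", row.strip())[-1], FMT)


def hold_seconds(rescheduled_row, waiting_row):
    """Seconds elapsed between two logging-info rows."""
    return (timestamp(waiting_row) - timestamp(rescheduled_row)).total_seconds()


# The 'Job Rescheduled' and 'Waiting' rows from the records above, whitespace compacted.
rescheduled = "JobManager  Received  Job Rescheduled  Unknown  2021-10-15 09:10:29"
waiting = "JobScheduling  Waiting  Pilot Agent Submission  Unknown  2021-10-15 09:14:02"
print(hold_seconds(rescheduled, waiting))  # 213.0
```

For the records quoted above this gives 213 seconds, about three and a half minutes.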

Andrei Tsaregorodtsev

Oct 15, 2021, 12:28:01 PM
to diracgrid-forum
The jobs are put in On Hold status for several minutes so that they do not immediately hit the same problem that caused the rescheduling. So this is rather normal.
Andrei

Luisa Arrabito

Oct 18, 2021, 5:13:40 AM
to diracgrid-forum
Yes, but I was surprised because we don't have this behavior in 7.2. Is there a particular reason for it?
Thank you,
Luisa

Andrei Tsaregorodtsev

Oct 18, 2021, 5:18:34 AM
to diracgrid-forum
I think it was always like that. Are you sure you did not have it in v7r2?
Andrei

Luisa Arrabito

Oct 18, 2021, 5:44:00 AM
to diracgrid-forum
Hi Andrei,
OK, I probably had the wrong impression. I've just tried to reschedule one job on our production server and it took 3 minutes to get to Waiting status...
so it seems to be normal.

So can you please tell me if you plan a patch release with these fixes?

I would like to update our production instance with the fixes already integrated.

Thank you,

Luisa