Turning off watchdog stalling job during data upload due to low CPU load

Igor Pelevanyuk

May 28, 2020, 5:46:39 AM
to diracgrid-forum
Hello,

Sometimes a job needs to upload a very large file to a storage element. During the upload the CPU load is close to zero, so the watchdog concludes that the job is blocked and needs to be cancelled.
Is it possible to increase the period during which the watchdog tolerates low CPU usage?

Kind regards,
Igor Pelevanyuk

Text for searches:
Watchdog identified this job as stalled
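
To illustrate why an upload can look like a stall, here is a minimal sketch of a CPU-to-wallclock ratio check. It is not DIRAC's watchdog code: the function names and the 5% threshold are hypothetical, chosen only to show that a process waiting on I/O accumulates wallclock time but almost no CPU time.

```python
import time


def cpu_to_wallclock_ratio(cpu_seconds, wall_seconds):
    """Return CPU time as a percentage of wallclock time over an interval."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * cpu_seconds / wall_seconds


def looks_stalled(cpu_seconds, wall_seconds, min_ratio_percent=5.0):
    """Flag an interval as stalled when the CPU/wallclock ratio is too low.

    min_ratio_percent is a hypothetical threshold, not a DIRAC default.
    """
    return cpu_to_wallclock_ratio(cpu_seconds, wall_seconds) < min_ratio_percent


def measure(work, *args):
    """Run work(*args) and return the (cpu_seconds, wall_seconds) it consumed."""
    cpu_start, wall_start = time.process_time(), time.monotonic()
    work(*args)
    return time.process_time() - cpu_start, time.monotonic() - wall_start


if __name__ == "__main__":
    # Sleeping stands in for a process that is waiting on a network transfer.
    cpu, wall = measure(time.sleep, 0.5)
    print("ratio: %.1f%%" % cpu_to_wallclock_ratio(cpu, wall))
    print("looks stalled:", looks_stalled(cpu, wall))
```

A job that spends half an hour uploading and uses one second of CPU would be flagged by such a check, even though it is making progress.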

Federico Stagni

May 28, 2020, 8:58:02 AM
to Igor Pelevanyuk, diracgrid-forum
Hi Igor,
the watchdog check can be disabled if it finds a file named "DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK" in the path (I originally remembered it as an environment variable, but in the current implementation it is just a file: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/WorkloadManagementSystem/JobWrapper/Watchdog.py#L566 ).

Still, I think an environment variable would be easier to control, also operationally, so I have just added the possibility to specify such a variable in the already existing PR: https://github.com/DIRACGrid/DIRAC/pull/4619
How to activate it is explained in the thread https://groups.google.com/forum/#!msg/diracgrid-forum/8W1dxKAloAk/BgEKz9dHCQAJ which, BTW, might also solve your problems with user jobs not being able to issue DIRAC commands. In summary, if I were you I would:
1) Put in /Resources/Computing/CEDefaults/ the option "ExtraPilotOptions = --userEnvVariables DIRACSYSCONFIG:::pilot.cfg"
2) Wait for the next DIRAC v7r0 patch and then put, in the queue(s) for which you want to disable the watchdog, the option "ExtraPilotOptions = --userEnvVariables DIRACSYSCONFIG:::pilot.cfg,DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True"

Please try and let me know if it works.

Cheers,
Federico
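
The two switches described above (a marker file, and the environment variable added in the PR) can be sketched as a single check. This is an illustration only, not the code in Watchdog.py: the function name, the choice of the working directory as the place to look for the file, and the set of values treated as "true" are all assumptions.

```python
import os
import tempfile

DISABLE_FLAG = "DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK"
# Assumption: which strings count as "enabled" is not stated in the thread.
TRUE_VALUES = ("1", "true", "yes")


def cpu_wallclock_check_disabled(workdir=".", environ=None):
    """Return True if the CPU/wallclock check should be skipped.

    The check is considered disabled when a file named DISABLE_FLAG exists in
    workdir, or when an environment variable of that name holds a true value.
    """
    environ = os.environ if environ is None else environ
    if os.path.isfile(os.path.join(workdir, DISABLE_FLAG)):
        return True
    return environ.get(DISABLE_FLAG, "").strip().lower() in TRUE_VALUES


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(cpu_wallclock_check_disabled(workdir, {}))  # False: no file, no variable
    open(os.path.join(workdir, DISABLE_FLAG), "w").close()
    print(cpu_wallclock_check_disabled(workdir, {}))  # True: marker file present
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.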

Igor Pelevanyuk

Sep 3, 2020, 6:20:29 AM
to diracgrid-forum
Hi Federico,
I tested this option with a job that reads a file in small chunks and periodically calls time.sleep() for 1 second:
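import time

counter = 0
with open(file_name, "rb") as f:
  chunk = f.read(1024)
  while chunk:  # an empty read means end of file (true for both bytes and str)
    # Do stuff with chunk.
    chunk = f.read(1024)
    counter += 1
    if counter % 3500 == 0:
      time.sleep(1)  # pause for 1 second after every 3500 chunks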

Looking at the pilot log in the WebApp, I see that the option was passed correctly:

2020-08-29 16:24:49 UTC DEBUG    Checksum matched
2020-08-29 16:24:49 UTC INFO     Checking 'pilotTools.py' for checksum
2020-08-29 16:24:49 UTC DEBUG    Checksum matched
2020-08-29 16:24:49 UTC INFO     Executing: python dirac-pilot.py -S JINR-Production -V DIRAC -l DIRAC -o /Security/ProxyToken=ccdb0e2e23e38ce44d43225483aac4d6 -M 5 -C dips://dirac-conf.jinr.ru:9135/Configuration/Server -N lcgce12.jinr.ru -Q cream-pbs-mpd -n DIRAC.JINR-CREAM.ru DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True -o /Resources/Computing/CEDefaults/SubmitPool=mpdPool
2020-08-29 16:24:49 UTC DEBUG [PilotParams] Options list: [('-S', 'JINR-Production'), ('-V', 'DIRAC'), ('-l', 'DIRAC'), ('-o', '/Security/ProxyToken=ccdb0e2e23e38ce44d43225483aac4d6'), ('-M', '5'), ('-C', 'dips://dirac-conf.jinr.ru:9135/Configuration/Server'), ('-N', 'lcgce12.jinr.ru'), ('-Q', 'cream-pbs-mpd'), ('-n', 'DIRAC.JINR-CREAM.ru')]
2020-08-29 16:24:49 UTC DEBUG [PilotParams] JSON file loaded: pilot.json
2020-08-29 16:24:49 UTC DEBUG [PilotParams] CE name: lcgce12.jinr.ru
2020-08-29 16:24:49 UTC DEBUG [PilotParams] Setup: JINR-Production

But after some time all of them went to the Failed state with the minor status "Watchdog identified this job as stalled". DIRAC version: v7r0p27.

I also found that there are at least 3 kinds of stalled jobs in DIRAC, distinguished by their minor status:
  • Watchdog identified this job as stalled (This status I saw for all my test jobs)
  • Job stalled: pilot not running
  • Stalling for more than 11700 sec
Cheers,
Igor Pelevanyuk

André Sailer

Sep 3, 2020, 7:15:03 AM
to Igor Pelevanyuk, diracgrid-forum
Hi Igor,

Please check what you have in ExtraPilotOptions.

This printout of the options list doesn't contain userEnvVariables, so I
think the variable was not actually passed:

> 2020-08-29 16:24:49 UTC DEBUG [PilotParams] Options list: [('-S',
> 'JINR-Production'), ('-V', 'DIRAC'), ('-l', 'DIRAC'), ('-o',
> '/Security/ProxyToken=ccdb0e2e23e38ce44d43225483aac4d6'), ('-M', '5'),
> ('-C', 'dips://dirac-conf.jinr.ru:9135/Configuration/Server'), ('-N',
> 'lcgce12.jinr.ru'), ('-Q', 'cream-pbs-mpd'), ('-n',
> 'DIRAC.JINR-CREAM.ru')]


The variable should be passed like this (which is also what
ExtraPilotOptions should contain, if nothing else is set):

> --userEnvVariables=DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True

Not just

> DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True



Cheers,
Andre
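
The `NAME:::value` convention can be sketched with a small parser. This is not the pilot's own parsing code: the function name is hypothetical, and the comma as a separator between several variables is an assumption taken from the example earlier in this thread.

```python
def parse_user_env_variables(value):
    """Parse a string such as 'A:::1,B:::2' into {'A': '1', 'B': '2'}.

    Assumptions: entries are comma-separated, and each entry separates the
    variable name from its value with ':::'.
    """
    result = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, val = item.partition(":::")
        if not sep or not name.strip():
            raise ValueError("expected NAME:::value, got %r" % item)
        result[name.strip()] = val.strip()
    return result


if __name__ == "__main__":
    print(parse_user_env_variables("DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True"))
    try:
        parse_user_env_variables("DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True")
    except ValueError as err:
        print("rejected:", err)
```

With this reading, the `NAME=True` form carries no `:::` separator and is rejected rather than silently accepted.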

Igor Pelevanyuk

Sep 3, 2020, 8:06:20 AM
to diracgrid-forum
Hi Andre,

Thanks for your suggestions!
I have /Resources/Computing/CEDefaults/ExtraPilotOptions = DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True
and I see that this is wrong. I think it should be
 /Resources/Computing/CEDefaults/ExtraPilotOptions = --userEnvVariables=DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True

I will rerun the test again.

Cheers,
Igor

Igor Pelevanyuk

Sep 3, 2020, 10:46:06 AM
to diracgrid-forum
Hello again,

I can confirm that the option
/Resources/Computing/CEDefaults/ExtraPilotOptions = --userEnvVariables DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True
works perfectly.

Thank you, Federico!
Thank you, Andre!

Cheers,
Igor

P.S.
Some output from Pilot stdout:
2020-09-03 12:13:58 UTC INFO     Executing: python dirac-pilot.py -S JINR-Production -V DIRAC -l DIRAC -o /Security/ProxyToken=880865a6b8efb613822e40edf31b640a -M 5 -C dips://dirac-conf.jinr.ru:9135/Configuration/Server -N lcgce12.jinr.ru -Q cream-pbs-mpd -n DIRAC.JINR-CREAM.ru --userEnvVariables DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True -o /Resources/Computing/CEDefaults/SubmitPool=mpdPool
2020-09-03 12:13:59 UTC DEBUG [PilotParams] Options list: [('-S', 'JINR-Production'), ('-V', 'DIRAC'), ('-l', 'DIRAC'), ('-o', '/Security/ProxyToken=880865a6b8efb613822e40edf31b640a'), ('-M', '5'), ('-C', 'dips://dirac-conf.jinr.ru:9135/Configuration/Server'), ('-N', 'lcgce12.jinr.ru'), ('-Q', 'cream-pbs-mpd'), ('-n', 'DIRAC.JINR-CREAM.ru'), ('--userEnvVariables', 'DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True'), ('-o', '/Resources/Computing/CEDefaults/SubmitPool=mpdPool')]
2020-09-03 12:13:59 UTC DEBUG [PilotParams] JSON file loaded: pilot.json
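
A quick way to check a pilot log for this is to tokenize the "Executing:" command line and look for the `--userEnvVariables` option. The helper below is a hypothetical sketch, not part of DIRAC; the two sample commands are abridged from the pilot logs quoted in this thread.

```python
import shlex


def user_env_from_pilot_command(command):
    """Return the value given to --userEnvVariables in a command line, or None."""
    tokens = shlex.split(command)
    for index, token in enumerate(tokens):
        if token == "--userEnvVariables" and index + 1 < len(tokens):
            return tokens[index + 1]
        if token.startswith("--userEnvVariables="):
            return token.split("=", 1)[1]
    return None


# Abridged from the log of the failed attempt (bare NAME=True argument).
FAILED_RUN = (
    "python dirac-pilot.py -S JINR-Production -Q cream-pbs-mpd "
    "-n DIRAC.JINR-CREAM.ru DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK=True"
)
# Abridged from the log of the working attempt.
WORKING_RUN = (
    "python dirac-pilot.py -S JINR-Production -Q cream-pbs-mpd "
    "-n DIRAC.JINR-CREAM.ru "
    "--userEnvVariables DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK:::True"
)

if __name__ == "__main__":
    print(user_env_from_pilot_command(FAILED_RUN))
    print(user_env_from_pilot_command(WORKING_RUN))
```

In the failed run the variable appears only as a bare positional argument, so no `--userEnvVariables` value is found, which matches its absence from the options list in the first log.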
