Scheduler in systemd environment (running very long tasks)


Zbigniew Pomianowski

Dec 12, 2016, 8:58:47 AM
to web2py-users
First of all: I decided to use web2py for my purposes because it is awesome ;)
I believe this is not a web2py bug or anything of the sort. It is more likely an OS/systemd related issue.

Let me explain what I do and what is the environment. I work in a lab where we try to automate many tests on physical devices (like STBs and phones).
I have a single source tree for the master (Ubuntu server) and the slave servers (Ubuntu server/desktop). The master is configured with uwsgi+nginx+mysql+web2py services. The slaves use the same source, but can spawn tests within scheduler processes.

I need to connect many physical devices to the slaves (climate chambers, Arduino for IR control, v4l2 capture cards, ethernet-controlled power sources, power supply instruments, measurement instruments, and so on).
I decided to make a GUI using qooxdoo where the user can write Python code that allocates physical devices and runs specific test scenarios to examine the DUT (Device Under Test) condition.
These tests sometimes need to run for tens of hours. So the workflow can be described as:
  • user writes a script
  • the test is enqueued as a task in the db (JobGraph does a perfect job for me because I need to control the execution sequence, mainly because of physical devices like climate chambers; an allocated lab instrument cannot be used by two tests at the same time, and JobGraph can enforce that)
  • every slave has its unique group-name
    • DUTs and lab instruments are bound to a specific slave via the scheduler group-name
  • the slave executes the test scenario programmed by the user
    • a test is nothing more than an overridden TestUnit
    • every lab instrument has a child process which logs parameters (temperature, humidity, voltage, and so on)
    • for each DUT an instance of a class is also created that spawns child processes (video freeze detection based on gstreamer, udp/tcp/telnet interfaces to interact with the STB)
    • the test scenario contains plenty of sleeps - for example, it may demand that the STB stays in a climate chamber for 20h at a specific temperature and humidity
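The enqueue step above could be sketched like this, assuming the stock web2py Scheduler/JobGraph API; the task names, scripts, and the `enqueue_chamber_test` helper are illustrative, not taken from the real ATMS code:

```python
def enqueue_chamber_test(db, scheduler, slave_group):
    # JobGraph ships with web2py's scheduler (gluon/scheduler.py).
    from gluon.scheduler import JobGraph

    # Long-running tests: raise the timeout well above the 60 s default,
    # otherwise the worker will SIGTERM the task when time runs out.
    warmup = scheduler.queue_task(
        'run_test_scenario',
        pvars={'script': 'warmup.py'},
        group_name=slave_group,   # binds the task to one physical slave
        timeout=24 * 3600,        # seconds the task may stay RUNNING
    )
    soak = scheduler.queue_task(
        'run_test_scenario',
        pvars={'script': 'soak_20h.py'},
        group_name=slave_group,
        timeout=30 * 3600,
    )

    # JobGraph serializes tasks that would otherwise fight over the chamber:
    # argument order assumed (parent, child) as in gluon/scheduler.py,
    # i.e. soak waits until warmup has completed.
    graph = JobGraph(db, 'chamber_job')
    graph.add_deps(warmup.id, soak.id)
    graph.validate('chamber_job')
    return warmup, soak
```

The key point for this thread is the `timeout` argument: without it the scheduler uses its short default and a multi-hour task cannot survive.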
My systemd service file looks like this:
[Unit]
Description=ATMS workers
After=network-online.target
Wants=network-online.target

[Service]
User=<USER>
Restart=on-failure
RestartSec=120
Environment=DISPLAY=:<DISPLAY_NB> # usually 0
Environment=XAUTHORITY=/home/<USER>/.Xauthority
EnvironmentFile={{INSTALL}}/web2py_venv/web2py/applications/atms/private/atms.env
ExecStartPre=/bin/sh -c "${WEB2PYPY} ${WEB2PY} -S atms -M -R ${WEB2PYDIR}/applications/atms/systemd/on_start.py -P"
ExecStart=/bin/sh -c "${WEB2PYPY} ${WEB2PY} -K atms:%H,atms:%H"
ExecStop=/bin/sh -c "${WEB2PYPY} ${WEB2PY} -S atms -M -R ${WEB2PYDIR}/applications/atms/systemd/on_stop.py -P"

[Install]
# graphical because i had to make some kind of preview with ximagesink for fast lookup if video is ok on STB
WantedBy=graphical.target
Alias=atms.service
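For context, the `${WEB2PYPY}`/`${WEB2PY}`/`${WEB2PYDIR}` variables above come from the EnvironmentFile; a hypothetical atms.env might look like this (illustrative paths, the real ones depend on the install prefix):

```ini
# {{INSTALL}}/web2py_venv/web2py/applications/atms/private/atms.env
WEB2PYPY=/atms/web2py_venv/bin/python
WEB2PY=/atms/web2py_venv/web2py/web2py.py
WEB2PYDIR=/atms/web2py_venv/web2py
```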


I realised that for a very long test (the last one was planned to be longer than 100h) I got something like this in the logs:
gru 11 12:01:52 slaveX sh[2184]:   File "/atms/web2py_venv/web2py/gluon/packages/dal/pydal/adapters/base.py", line 1435, in
gru 11 12:01:52 slaveX sh[2184]:     return str(long(obj))
gru 11 12:01:52 slaveX sh[2184]:   File "/atms/web2py_venv/web2py/gluon/packages/dal/pydal/objects.py", line 82, in <lambda
gru 11 12:01:52 slaveX sh[2184]:     __long__ = lambda self: long(self.get('id'))
gru 11 12:01:52 slaveX sh[2184]: TypeError: long() argument must be a string or a number, not 'NoneType'

The test was stopped 20h before it was supposed to finish :/
After some digging I found that right before these errors I got this one:
gru 11 12:01:34 slaveX sh[2184]: ERROR:web2py.app.atms:[(</tmp/taskId10672_caseId852_duts32/test_script.py.TestCase testMethod=test_example>, 'Traceback (most recent call last):\n  File "/tmp/taskId10672_caseId852_duts32/test_script.py", line 90, in test_example\n    sleep(M10)\n  File "/atms/web2py_venv/web2py/gluon/scheduler.py", line 702, in <lambda>\n    signal.signal(signal.SIGTERM, lambda signum, stack_frame: sys.exit(1))\nSystemExit: 1\n')]
gru 11 12:01:34 slaveX sh[2184]: DEBUG:web2py.app.atms:    new task report: FAILED
gru 11 12:01:34 slaveX sh[2184]: DEBUG:web2py.app.atms:   traceback: Traceback (most recent call last):
... and many, many tracebacks with errors after that

Line 702 in scheduler.py is:
signal.signal(signal.SIGTERM, lambda signum, stack_frame: sys.exit(1))
.... in the scheduler's loop function. What does it mean? Was the process stopped because the kernel, systemd, or something else decided so?
Could the long sleep calls have something to do with it?
Has anyone encountered similar problems? Do you have any idea how to prevent such behavior?

Thank you in advance for any response :)

Niphlod

Dec 14, 2016, 2:35:24 PM
to web2py-users
that piece of code is in place to let the worker be terminated by a SIGTERM, i.e. a ctrl+c; it is useful for development purposes. It *should* have nothing to do with long running tasks, but to be honest I never had a single task alive for more than an hour. Frankly, I don't know how to test it: sitting in front of a terminal for 4 days is not feasible.

Zbigniew Pomianowski

Dec 17, 2016, 10:07:53 AM
to web2py-users
I totally agree that debugging such things can be difficult. I just wonder if there are some mechanisms that can kill tasks before they are finished. A 100h process, let's be honest, is not a common case. Honestly speaking, I am not quite sure how frequent this problem is; I did not manage to perform enough tests :P

My main suspect is systemd. I am trying to figure out if it is possible to create some kind of listener to get the source of the SIGTERM (the strace utility??). For some reason I think graphical.target can be the reason: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=825394
My next step will be to refactor the code to run with only a multi-user.target dependency. I will somehow need to rebuild the gstreamer pipeline so that the preview is handled by a second, independent application.
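The "listener" idea above can be done without strace: a plain signal handler cannot see the sender, but Python's `signal.sigwaitinfo` returns a siginfo struct whose `si_pid`/`si_uid` identify the sending process. A standalone probe (a sketch, not web2py integration; assumes Linux and Python 3.3+):

```python
import os
import signal


def wait_and_report_sigterm():
    """Collect a pending/incoming SIGTERM and report who sent it."""
    info = signal.sigwaitinfo({signal.SIGTERM})
    # In the real setup si_pid would be 1 if systemd is the sender.
    print("SIGTERM from pid=%d uid=%d" % (info.si_pid, info.si_uid))
    return info


if __name__ == "__main__":
    # Block SIGTERM first so it stays pending instead of killing us outright.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    # Demo: send ourselves a SIGTERM to show the reporting.
    os.kill(os.getpid(), signal.SIGTERM)
    wait_and_report_sigterm()
```

Blocking the signal changes the worker's shutdown behavior, so this is only suitable as a temporary diagnostic, not as a permanent patch to the scheduler.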

I will come back when I get any improvement or conclusions.
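For reference, the multi-user.target variant would just change the unit's [Install] section (a sketch; after editing, re-run `systemctl disable atms` / `systemctl enable atms` so the change takes effect):

```ini
[Install]
# multi-user.target does not depend on a graphical session, so the worker
# is no longer stopped when the display manager / X session goes away.
WantedBy=multi-user.target
Alias=atms.service
```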

Zbigniew Pomianowski

Feb 28, 2017, 8:12:57 AM
to web2py-users
Yeah - it seems it was because of graphical.target. Nevertheless I encountered another issue: for a very long task I got status 'failed', but the task keeps running without any visible problems.

Mirek Zvolský

Aug 15, 2017, 10:19:26 AM
to web2py-users
Hi Zbigniew and Niphlod.

I get the same/similar traceback when the task times out, in my case after 2h:

File "/home/www-data/web2py/gluon/scheduler.py", line 720, in <lambda>
    signal.signal(signal.SIGTERM, lambda signum, stack_frame: sys.exit(1))
SystemExit: 1

scheduler_run.stop_time is the start_time + 2h, so I think this is the standard quitting on timeout.
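If that is the case, the fix is simply a larger timeout when the task is queued; web2py's default task timeout is only 60 seconds. A minimal sketch assuming the standard `Scheduler.queue_task` signature (the task name and helper function are illustrative):

```python
def queue_with_long_timeout(scheduler):
    # Without an explicit timeout the scheduler sets stop_time only
    # 60 seconds after start_time and SIGTERMs the task at that point.
    return scheduler.queue_task(
        'my_long_task',     # illustrative task name
        timeout=4 * 3600,   # 4 hours, in seconds
    )
```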

But I have another problem (and this is the reason why the timeout no longer occurs):
my task restarts again and again, roughly every 10 minutes. I think I will ask in a separate thread.

Debian9+nginx+uwsgi+postgres+web2py