scheduler creates "zombie" processes


Erwn Ltmann

Nov 2, 2016, 5:53:58 AM
to web2py-users
Dear all,

I'm astonished that a lot of subprocesses of the scheduler workers are never finished.

pstree -p 16731
 
bash(16731)---python2.7(24545)-+-python2.7(24564)---{python2.7}(24565)
                               |-python2.7(24572)-+-python2.7(1110)
                               |                  |-python2.7(8647)
                               |                  |-python2.7(11747)
                               |                  |-python2.7(14117)
                               |                  |-python2.7(14302)

PID 16731 is my shell, from which I started the scheduler with four workers:

w2p -K arm:ticker,arm,arm,arm

PID 24564 is the ticker worker (it only holds the ticker) and 24572 is one of the three standard workers that process my task's function.

My first suspicion was the task function itself. But even if I strip the function down ('return True' right at the start), the zombies are still there. My next step was to print the PID at the entry point of the 'executor' function in scheduler.py: in the zombie cases this debug line is never reached. Then I printed the list of active child processes (multiprocessing.active_children()) at the exit point of tasks that hit the timeout (see the async function); that is the place in the scheduler code where 'task timeout' is logged. The timeout itself is expected, because the process never returns a result. But how is that possible?

Here is the output of my extra debug line in the timeout branch of the async function:

09:09:47.752 [24576] Process-4:488,
09:14:28.907 [24576] Process-4:488, Process-4:1125,
09:15:59.526 [24576] Process-4:488, Process-4:1125, Process-4:1301,
09:20:35.924 [24576] Process-4:488, Process-4:1880, Process-4:1125, Process-4:1301,

Why does the 'executor' function never get to run its code?

def async(self, task):
    ...
    out = multiprocessing.Queue()
    queue = multiprocessing.Queue(maxsize=1)
    p = multiprocessing.Process(target=executor, args=(queue, task, out))
    ...
    if p.is_alive():
        p.terminate()
        logger.debug('      +- Zombie (%s)' % multiprocessing.active_children())

And here is the extra line in executor:

def executor(queue, task, out):
    """The function used to execute tasks in the background process."""
    logger.debug('    task started PID:%s -> %s' % (os.getppid(), os.getpid()))
    ...

Of course, I have to put the scheduler under load before the zombies appear. The rate is about 1 in 1000 tasks - in my case 25 times per hour!
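For context, this is roughly the kind of load I put on it (a sketch, not my real code; 'noop' is a hypothetical task name, the rest is the standard web2py Scheduler API):

# models/scheduler.py (sketch): a stripped-down task plus the scheduler itself
from gluon.scheduler import Scheduler

def noop():
    # stand-in for the real task function ('return True' at the start)
    return True

scheduler = Scheduler(db, group_names=['ticker', 'arm'])

# somewhere in a controller (sketch): queue many quick tasks with a 30 s timeout
def stress():
    for i in range(1000):
        scheduler.queue_task(noop, group_name='arm', timeout=30)
    return 'queued 1000 tasks'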

Can anybody clarify this? Maybe it's a pure Python issue.

Thx,
Erwn

Niphlod

Nov 2, 2016, 7:00:48 AM
to web2py-users
I'd say there are a LOT of strange things going on on your system, since you're reporting several different issues that nobody ever faced and all in the last week.

zombie processes shouldn't be there unless you killed a worker process improperly.
Python can't really do anything about it, and that's why there's a specific API to kill (or terminate) a worker.

Erwn Ltmann

Nov 2, 2016, 12:47:42 PM
to web2py-users
Hi Niphlod,

your replies are always a pleasure to me. :)


On Wednesday, November 2, 2016 at 12:00:48 PM UTC+1, Niphlod wrote:
I'd say there are a LOT of strange things going on on your system, since you're reporting several different issues that nobody ever faced and all in the last week.

You mean the deadlocks and the zombies, right? Both issues show up when using the scheduler, not web2py in general, and only when I start more than one worker.

zombie processes shouldn't be there unless you killed a worker process improperly.
Python can't really do anything about it, and that's why there's a specific API to kill (or terminate) a worker.

You're right, the killer is the scheduler itself. Why? The scheduler terminates a task after the timeout has passed. The timeout happens because the task never does what it is supposed to do. In the zombie cases the subprocess is stuck in the sem_wait() function (according to pstack). I don't know why, but it happens before the 'executor' function is even entered, because no debug line is printed at the entry point of that function.

OK, that's all I know. I have different RHEL systems (RHEL 6, RHEL 5) with Python 2.7.12 and MariaDB. Not really exotic.

Thank you for your endurance
Erwn

 

Niphlod

Nov 2, 2016, 5:51:59 PM
to web2py-users
I'm naturally curious to know the culprit in your env, but unfortunately I can't reproduce it on any of my systems (*buntu 14.04 + postgresql/sqlite or Win2012r2 + postgresql/sqlite/mssql).
It needs to be said that I mostly hate mysql (and would never trust it nor recommend it to the worst of my enemies), but that's totally a personal bias. I know several people using mysql who never report any issues whatsoever.

Dunno if I'm the "most powered user" of the scheduler around, but I range from 4 to 20 workers (even 50, but "cheating" with the redis-backed version) and these kinds of issues never happened, except with a reaaaally long task with a reeeeaally long stdout output (which, once known, is easy to suppress, and btw has been less of a problem since a few releases ago).
"My" worker processes are usually alive for more than a week (i.e. same pid), so it's not a problem of phantom processes or leaks.

Although pretty heavy (resource-wise) compared to pyres, celery, huey, rq (just to name the "famous ones"), web2py's scheduler, which HAS to work on every OS web2py supports, is rather simple: each worker spawns a single process with a queue one element long to communicate over ... that process handles a single task, and then dies. Everything is thrashed and recreated at the next task pick-up.
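A standalone sketch of that pattern (plain multiprocessing, not the actual scheduler.py code):

import multiprocessing

def demo_task():
    return 42

def executor(queue, task, out):
    # runs inside the freshly spawned process: execute one task, report, die
    queue.put(task())

def run_one_task(task, run_timeout):
    # one short-lived process per task, a one-element queue to talk over
    out = multiprocessing.Queue()
    queue = multiprocessing.Queue(maxsize=1)
    p = multiprocessing.Process(target=executor, args=(queue, task, out))
    p.start()
    p.join(run_timeout)   # wait for completion, at most run_timeout seconds
    if p.is_alive():      # still running -> no mercy, it's a TIMEOUT
        p.terminate()
        p.join()
        return 'TIMEOUT'
    return queue.get()    # everything is thrashed; the next task starts fresh

if __name__ == '__main__':
    print(run_one_task(demo_task, run_timeout=5))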

Erwn Ltmann

Nov 3, 2016, 5:48:23 AM
to web2py-users
@Niphlod: a year ago I decided to use web2py for our projects. Now I have a problem and I have to solve it (soon) - with or without the group. You are very focused on the scheduler code itself; I am searching for any hint to understand the problem. It doesn't help me to know that mankind has never had a problem with the scheduler in decades of wonderful years. ;)

In the meantime I found a way to handle "my" problematic situations reasonably well (unfortunately only a workaround). The answer to the root cause is still outstanding.

A new attempt: let's abstract away from long-running tasks and timeouts in the normal case of a worker processing a task. The reason for 'my' TIMEOUT is different (please trust me, I have read the entire code of scheduler.py and understand the general concept).

scheduler.py (worker): infinite loop -> pop a task -> call the async function -> create the process environment -> start the process [1] -> wait for the process to complete, or show no mercy and terminate the process if the timeout is hit.

So far so good. Now a more detailed look at [1].

The process environment has an entry function as target (start point) for the subprocess. In the case of scheduler.py this is the function 'executor'. Again, for the 'zombie' candidates the entry point of this function is never reached. With pstack I saw the reason: the newly created process is waiting for a semaphore - sem_wait(). At that point nothing inside 'executor' has run, and therefore nothing of my actual task has been processed, simply because 'executor' never got to call "my" task function.

So the scheduler worker hits the timeout (the subprocess is still waiting for the semaphore) and calls terminate() on the subprocess. That process just keeps waiting, again and again and again ... Meanwhile scheduler.py registers the task as TIMEOUT and goes ahead to pick up the next task.

Back to my pstree output, now with comments:

bash(16731)                                                  // my shell
        \---python2.7(24545)                                 // scheduler.py (-K)
                  \-+-python2.7(24564)---{python2.7}(24565)  // idling worker
                    |-python2.7(24572)                       // worker with picked task
                            \-python2.7(1110)                // still waiting for semaphore (TIMEOUT)
                            \-python2.7(8647)                // still waiting for semaphore (TIMEOUT)
                            \-python2.7(11747)               // still waiting for semaphore (TIMEOUT)
                            \-python2.7(14117)               // runs the actual task (RUNNING)
                            \-python2.7(14302)               // still waiting for semaphore (TIMEOUT)

The actual reason is "waiting for a semaphore". But why? And in all probability it is not a problem of the scheduler.py code itself ;)

Thx again for your endurance.
Erwn

Niphlod

Nov 3, 2016, 4:25:41 PM
to web2py-users
Got that. Thanks for explaining your POV. What I was trying to say is that without being able to reproduce, I can only guess. And since the scheduler code, as far as process handling goes, is quite streamlined, there are not a lot of places where a zombie can be created.
So, let's guess...

Are the zombie processes "still running and churning data", or just leftovers? Usually there are two kinds of zombie processes: completely detached processes that keep running and consuming resources (which the scheduler should theoretically avoid at all costs), and simple leftovers in the "virtual" process table that are there because nobody (usually the parent process) reclaimed their output or status code. The second kind is something to avoid too, of course, but it shouldn't hurt the functionality of the app.
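To illustrate the second kind in isolation (plain multiprocessing, nothing web2py-specific): a finished child stays <defunct> in the process table until the parent reaps it.

import multiprocessing
import os
import time

def child():
    pass  # exits immediately

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    time.sleep(2)
    # the child has exited, but nobody collected its status yet, so ps shows
    # it as <defunct>: a harmless leftover in the process table
    os.system('ps -o pid,stat,comm --ppid %d' % os.getpid())
    p.join()  # reaps the child; the <defunct> entry disappears
    os.system('ps -o pid,stat,comm --ppid %d' % os.getpid())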

From the pstree you posted, it seems that (shortening the code here and there)

p = multiprocessing.Process(executor, ....)
p.start()
try:
    # task runs
    p.join(run_timeout)
except:
    # this should be raised only when a general error on the task happened, so it's a STOPPED one
    p.terminate()
    p.join()
else:
    # this is the codepath your task takes, since it's the one landing TIMEOUT tasks
    if p.is_alive():
        # this is ultimately the call that SHOULD kill the process you later find as a zombie
        p.terminate()
        ....

this "terminate" is labelled as using SIGKILL on the process. Eventual processes child of that one (i.e. a subprocess call inside the task itself) are not guaranteed to be terminated, but then they'll show as orphaned, while your pstree reports python processes still "attached" to the scheduler worker process. 

From where I stand, if the result is the task being labelled as TIMEOUT (with the corresponding "task timeout" debug line), it can only originate there.
Maybe there's a culprit there... can you add a p.join() after that p.terminate(), and maybe a few debug lines?
i.e.


... else:
    if p.is_alive():
        logger.debug(' MD: terminating')
        p.terminate()
        logger.debug(' MD: terminated')
        logger.debug(' MD: joining')
        p.join()
        logger.debug(' MD: joined')
        logger.debug(' task timeout')
        try:
            # we try to get a traceback here
            tr = queue.get(timeout=2)
...




Erwn Ltmann

Nov 4, 2016, 3:29:12 AM
to web2py-users
Sorry for using the term 'zombie'; in the Unix context it is misleading.


On Thursday, November 3, 2016 at 9:25:41 PM UTC+1, Niphlod wrote:
From where I stand, if the result is the task being labelled as TIMEOUT (with the corresponding "task timeout" debug line), it can only originate there.
Maybe there's a culprit there... can you add a p.join() after that p.terminate(), and maybe a few debug lines?
 
Good try. I did it, and got the result I suspected: infinite waiting.

First, the pstree for orientation (the worker's subprocess in focus is 29554):

[01] logs $> pstree -p 16731
bash(16731)---python2.7(28670)-+-python2.7(28678)---{python2.7}(28679)
                               |-python2.7(28680)-+-python2.7(29554)
                               |                  `-{python2.7}(28681)
                               ...


Next some logging lines:

11-04 07:56:47.508 [28680]   task starting CPID:29554
...
... configured timeout 30 seconds ...
...
11-04 07:57:02.510 [28680] MD: terminating
11-04 07:57:02.512 [28680] MD: terminated
11-04 07:57:02.512 [28680] MD: joining

As I feared: the worker is still waiting to join the forked process.

And finally, some lines from pstack for 29554 (the 'executor' function is never reached):

[01] logs $> pstack 29554
#0  0x00000030db40ce51 in sem_wait () from /lib64/libpthread.so.0
#1  0x00002ba981e3123d in PyThread_acquire_lock () from /usr/local/lib/libpython2.7.so.1.0
#2  0x00002ba981e35202 in lock_PyThread_acquire_lock () from /usr/local/lib/libpython2.7.so.1.0
#3  0x00002ba981dfb9c9 in PyEval_EvalFrameEx () from /usr/local/lib/libpython2.7.so.1.0
...
#11 0x00002ba981d82252 in function_call () from /usr/local/lib/libpython2.7.so.1.0
#12 0x00002ba981d54318 in PyObject_Call () from /usr/local/lib/libpython2.7.so.1.0
...
#46 0x00002ba981d82252 in function_call () from /usr/local/lib/libpython2.7.so.1.0
#47 0x00002ba981d54318 in PyObject_Call () from /usr/local/lib/libpython2.7.so.1.0
#48 0x00002ba981d6499f in instancemethod_call () from /usr/local/lib/libpython2.7.so.1.0
#49 0x00002ba981d54318 in PyObject_Call () from /usr/local/lib/libpython2.7.so.1.0
#50 0x00002ba981db701c in slot_tp_init () from /usr/local/lib/libpython2.7.so.1.0
#51 0x00002ba981db0f58 in type_call () from /usr/local/lib/libpython2.7.so.1.0
...
#61 0x00002ba981e20f37 in PyRun_SimpleFileExFlags () from /usr/local/lib/libpython2.7.so.1.0
#62 0x00002ba981e33726 in Py_Main () from /usr/local/lib/libpython2.7.so.1.0
#63 0x00000030da81d9f4 in __libc_start_main () from /lib64/libc.so.6
#64 0x0000000000400629 in _start ()


Erwn

Erwn Ltmann

Nov 4, 2016, 3:57:34 AM
to web...@googlegroups.com
BTW: if I use os.kill(p.pid, signal.SIGKILL) instead of p.terminate(), everything is fine.
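The changed lines look roughly like this (a sketch; the surrounding names are from the async() excerpt earlier in the thread):

import os
import signal

# inside async(), in the timeout branch, replacing the plain p.terminate():
if p.is_alive():
    logger.debug(' MD: terminating (os kill)')
    os.kill(p.pid, signal.SIGKILL)  # SIGKILL cannot be caught or blocked
    logger.debug(' MD: terminated')
    logger.debug(' MD: joining')
    p.join()                        # now returns promptly
    logger.debug(' MD: joined')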

11-04 08:40:59.521 [13556] web2py.Scheduler - DEBUG -  MD: terminating (os kill)
11-04 08:40:59.521 [13556] web2py.Scheduler - DEBUG -  MD: terminated
11-04 08:40:59.522 [13556] web2py.Scheduler - DEBUG -  MD: joining
11-04 08:40:59.524 [13556] web2py.Scheduler - DEBUG -  MD: joined

No running or still-waiting processes are left. :)

But if I configure timeouts longer than 30 seconds - say 10 minutes or more - the worker still waits for that entire timeout. Because of that I have to find the reason for the sem_wait().

Erwn

Niphlod

Nov 4, 2016, 5:43:34 PM
to web2py-users
well, on unix the python docs report that p.terminate() uses sigkill, so I wonder why in your env one works and not the other...

Niphlod

Nov 4, 2016, 5:44:31 PM
to web2py-users
BTW: does your task write a lot to stdout/stderr and/or return huge results?

Erwn Ltmann

Nov 4, 2016, 6:50:14 PM
to web2py-users
On Friday, November 4, 2016 at 10:44:31 PM UTC+1, Niphlod wrote:
BTW: does your task write a lot to stdout/stderr and/or return huge results?

Probably. I write more to the log than usual. :( Mostly DEBUGs - it's possible to switch them off.

Niphlod

Nov 6, 2016, 3:05:32 PM
to web2py-users
there was an issue with outputs of more than 1 KB that exposed pretty much a limit of python's implementation; that's solved in recent releases of the scheduler, but I wonder if the "tax" behind that sem_wait() is trying to flush or read the buffered output of the task.
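The isolated version of that pitfall (the "joining processes that use queues" note in the multiprocessing docs, not scheduler code) looks like this:

import multiprocessing

def f(queue):
    # a big result: the child's feeder thread keeps flushing it to the pipe
    # until the parent reads it
    queue.put('X' * 10000000)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=f, args=(queue,))
    p.start()
    p.join()           # hangs: the child won't exit until the pipe is drained
    obj = queue.get()  # draining first (get() before join()) avoids the hang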

Gualter Portella

Nov 6, 2016, 5:47:28 PM
to web...@googlegroups.com
Hi, your tip really worked. Thanks. Please see the code below:

DAL:

    indiban_meses = Meses.on(Indiban.meses_id == Meses.id)
    indiban_anos = Anos.on(Indiban.anos_id == Anos.id)
    indiban_indicadores = Indicadores.on(Indiban.indicadores_id == Indicadores.id)
    indiban_empresas = Empresas.on(Indiban.empresas_id == Empresas.id)
    indiban_eleban = Eleban.on((Indiban.empresas_id == Eleban.empresas_id)&(Indiban.meses_id == Eleban.meses_id)&(Indiban.anos_id == Eleban.anos_id))
    eleban_elementos = Elementos.on(Eleban.elementos_id == Elementos.id)
    
    
    rows = db((db.indicadores.id==rspla_id)&(db.elementos.id==ativo_id)&
              (db.meses.id==mes_id)&(db.anos.id==ano_id)).select(Empresas.nome, Meses.abrev, Anos.numero,
                                                           Eleban.valor, Indiban.valor, Empresas.id,
                                                           join=[indiban_meses, indiban_anos,
                                                                 indiban_indicadores, indiban_empresas,
                                                                 indiban_eleban, eleban_elementos],
                                                           orderby=~Eleban.valor, limitby=(0, 10))
    
SQL:

    sql = "SELECT empresas.nome, meses.abrev, anos.numero, eleban.valor, indiban.valor, empresas.id "
    sql = sql + "FROM indiban INNER JOIN meses ON indiban.meses_id = meses.id "
    sql = sql + "INNER JOIN anos ON indiban.anos_id = anos.id "
    sql = sql + "INNER JOIN indicadores ON indiban.indicadores_id = indicadores.id "
    sql = sql + "INNER JOIN empresas ON indiban.empresas_id = empresas.id "
    sql = sql + "INNER JOIN eleban "
    sql = sql + "ON indiban.empresas_id = eleban.empresas_id "
    sql = sql + "AND indiban.meses_id = eleban.meses_id AND indiban.anos_id = eleban.anos_id "
    sql = sql + "INNER JOIN elementos ON eleban.elementos_id = elementos.id "
    sql = sql + "WHERE indicadores.id = 3 AND elementos.id = 5 "
    sql = sql + "AND meses.id = 5 AND anos.id = 22 "
    sql = sql + "ORDER BY eleban.valor DESC LIMIT 10;"
    rows = db.executesql(sql)






--
Gualter Portella