scheduler autoscaling

Niphlod

Jun 8, 2014, 5:41:36 PM

to web2py-d...@googlegroups.com

Hi all,
with the latest version of the scheduler I started working on a script that autoscales workers. I'd like to brainstorm it with all of you ^_^ .
As you can see, I added a new column with "worker statistics" to the scheduler_worker table. Like any metric, it helps to understand a worker's state without going through all the logs. So far I've managed to collect these metrics without any additional queries, which is a nice bonus since they come "free of (additional) charge".
I notice, now and then, that many users start a boatload of workers just to accommodate the occasional spike of tasks. Until now, I handled this through the "max_empty_runs" parameter, since I knew when such spikes hit during the day/week/month (and I didn't have strict deadlines for results). So I scheduled (with the system's cron facilities) a batch of "auto-expiring" workers that eventually die out when there are no more tasks to process.
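To make the "auto-expiring" idea concrete, here is a toy model of it. It is not the scheduler's actual code: run_worker, max_loops and the in-memory queue are made up for the example, and only the max_empty_runs name comes from the scheduler. The worker counts consecutive polls that find nothing to do and exits once that count reaches max_empty_runs.

```python
from collections import deque


def run_worker(queue, max_empty_runs=3, max_loops=100):
    """Toy worker: consume tasks until max_empty_runs consecutive empty polls.

    queue is a plain deque standing in for the task table (an assumption
    of this sketch); max_loops only guards against looping forever.
    """
    processed = []
    empty_runs = 0
    for _ in range(max_loops):
        if queue:
            processed.append(queue.popleft())
            empty_runs = 0  # found work, so reset the counter
        else:
            empty_runs += 1
            if empty_runs >= max_empty_runs:
                break  # nothing to do for a while: the worker expires
    return processed, empty_runs
```

Started from cron just before a known spike, a worker like this drains the queue and then goes away on its own.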
Other task processors let you start a process that starts/stops workers once certain criteria are met. Until now, at least I think, users who needed that had to turn to third-party process monitors (circus, supervisord, etc.) to manage workers.
I was thinking of letting people define their own "start/stop" criteria, checked by an external process, as a simple function (with a fixed name) that returns True/False.
Something like this (in models):

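def myautoscale():
    sw = db.scheduler_worker
    # the ticker's worker_stats carries the current queue length
    ticker = db(sw.is_ticker == True).select(sw.worker_stats).first()
    if not ticker:
        # no ticker found, so ask for a worker
        return True
    # scale up while more than 30 tasks are queued
    return ticker.worker_stats['queue'] > 30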
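To try the criterion without a database, the decision itself can be pulled out into a plain function. This is only a sketch: wants_more_workers and its threshold parameter are names invented for the example, and it assumes worker_stats behaves like a dict with a 'queue' key, as in the snippet above.

```python
def wants_more_workers(worker_stats, threshold=30):
    """Return True when more workers should be started.

    worker_stats: dict-like stats of the ticker, or None when no ticker exists.
    threshold: hypothetical queue length above which we scale up.
    """
    if worker_stats is None:
        # no ticker found, so ask for a worker
        return True
    # a missing 'queue' key is treated as an empty queue
    return worker_stats.get('queue', 0) > threshold
```

myautoscale() would then only fetch the ticker row and pass its worker_stats (or None) to this function.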

the "autoscale process" could then do something like this (pseudo-code):

from gluon.shell import run
from gluon import current
import time
# appname: name of the application to watch, defined elsewhere
code_check = "from gluon import current; current._autoscale = myautoscale() or False"
# run() positional args: appname, plain, import_models, startfile, bpython, python_code
check_args = (appname, True, True, None, False, code_check)
while True:
    run(*check_args)
    if current._autoscale:
        pass  # ....start process
    else:
        pass  # ....kill process
    time.sleep(15)
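The same loop can be written so that it runs outside web2py, which makes it easier to experiment with. In this sketch the check and the start/stop actions are passed in as callables; autoscale_loop, its parameters and max_iterations are all hypothetical names, not part of the scheduler.

```python
import time


def autoscale_loop(check, start_worker, stop_worker, interval=15, max_iterations=None):
    """Poll check() and call start_worker() or stop_worker() accordingly.

    All three callables are placeholders supplied by the caller.
    max_iterations=None loops forever; a number makes the loop testable.
    Returns the number of iterations performed.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        if check():
            start_worker()
        else:
            stop_worker()
        done += 1
        # skip the sleep after the very last iteration
        if max_iterations is None or done < max_iterations:
            time.sleep(interval)
    return done
```

In the real thing, check would wrap the run(*check_args) call and read current._autoscale, while the other two would start or terminate a worker.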

What do you think ?

Problems to solve/brainstorm:
- cmdline syntax. For my use cases so far, just the max number of allowed processes would be fine, as I can have (at this point) an external process checking whether I need workers started (e.g. by counting queued tasks). Others use "--autoscale min,max", as in "--autoscale 1,3", to allow "from 1 to 3 workers"... this means, though, that the autoscale process would also have to become a "monitoring" process, spawning new workers until the min value is fulfilled (a good added value, also because the current "syntax" to start e.g. 6 workers is -K appname,appname,appname,appname,appname,appname, and it would then become e.g. -K appname --autoscale 1,6).
- group_names... workers can be assigned group_names to work with, and can run on different servers... this means we'd either have to impose some limits on this start/stop process and/or figure out a "solid" logic to deal with this additional layer of complexity. I'd leave the possibility to autoscale "different" apps completely out of the picture (today you can do -K appname1,appname2): if needed, two autoscalers will have to be started.
- I think users should either use autoscale or not. The logic behind this is that there's either a "managed" logic or there isn't, and once an "autoscaler" is launched, it will probably affect ALL workers. Figuring out additional logic to avoid messing with workers not started by the "autoscaler" seems a bit of a stretch.
- starting workers is relatively easy, given that we can start processes whenever we want. Killing them is a tad more difficult, because they may be processing a task. Now, with the additional "limit" argument to the terminate() function, it should be easier.
Following the same logic as before, these pieces should fit together pretty nicely...

from multiprocessing import Process
from gluon.shell import run
# appname: name of the application, defined elsewhere
code_start = "from gluon import current;current._scheduler.loop()"
code_stop = "from gluon import current;current._scheduler.terminate(limit=1);db.commit()"
start_args = (appname, True, True, None, False, code_start)
stop_args = (appname, True, True, None, False, code_stop)
## logic to start: each worker loops in its own child process
p = Process(target=run, args=start_args)
p.start()
## logic to terminate: runs inline, as we don't need an external process
run(*stop_args)
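For the "--autoscale min,max" idea in the list above, the bookkeeping could look roughly like this. It is a sketch under assumptions: the option does not exist yet, and parse_autoscale and workers_delta are names invented here. The second function decides, per check, whether one worker should be started, stopped, or nothing should change.

```python
def parse_autoscale(value):
    """Parse a 'min,max' string such as '1,3' into a (min, max) tuple."""
    low, high = (int(part) for part in value.split(','))
    if low < 0 or high < low:
        raise ValueError("expected 0 <= min <= max, got %r" % value)
    return low, high


def workers_delta(active, wants_more, low, high):
    """Return +1 to start a worker, -1 to stop one, 0 to leave things alone.

    active: number of workers currently alive.
    wants_more: result of the user-defined criterion (True/False).
    """
    if active < low:
        return 1   # "monitoring" role: refill up to the minimum
    if active > high:
        return -1  # too many workers around: shrink back to the maximum
    if wants_more and active < high:
        return 1
    if not wants_more and active > low:
        return -1
    return 0
```

A result of +1 would map to the "logic to start" above and -1 to the "logic to terminate", so the autoscaler moves by one worker per check.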

Paolo Valleri

Jun 9, 2014, 2:56:23 AM

to web2py-d...@googlegroups.com

I like the new cmdline syntax with "--autoscale 1,3"; it is far better than -K app,app,app.

On the other hand, I would code the "autoscale process" as a task of the scheduler itself instead of a single, independent process looping in a while (True).

By doing that we would have much more control over that task, leaving the internal scheduler code lighter and easier to improve. Admittedly, at first sight I don't know how to code it so that it takes the --autoscale parameters.

Paolo

Tim Richardson

Jun 9, 2014, 5:05:43 AM

to web2py-d...@googlegroups.com

How would this affect Windows users, where the scheduler runs as a service and where Windows will, if requested, restart failed scheduler services (a feature which relies on the service process being the scheduler worker process, if I understood it properly)?

Niphlod

Jun 9, 2014, 12:10:34 PM

to web2py-d...@googlegroups.com

@Paolo: we can't have a scheduler as a child of a scheduler process, even in a separate thread (signals get mixed, and on Windows the result is even worse).

Having it as a task is impossible, because the task that launches the process would effectively never end, blocking the "currently running" scheduler.

@Tim: if the autoscale process is meant to watch over a "minimum" number of active workers, then you wouldn't use a service per worker, but a service per autoscaler. If the autoscaler dies, it (probably) means that the db isn't there anymore, or the app itself got deleted.
