Daily task on production: scheduler or cron?


Lisandro

Dec 23, 2014, 9:18:09 AM
to web...@googlegroups.com
I've been reading about web2py's cron and scheduler ([1] and [2]).
Also, I've read a post where Massimo says "Please use the scheduler, not cron. Cron must die." [3]

Now I'm creating a web2py app and I want the user to be able to configure a daily background task. The task consists of sending a newsletter to subscribers. It doesn't take much time to complete (because the app doesn't use SMTP; it connects to an API, so the sending is handled by an external service). So the task only runs some queries against the database, connects to the API, gives the order to send, and disconnects.
The thing is: I want the user to be able to configure the time at which the newsletter is sent. For example, the user may configure the newsletter to be sent from Monday to Friday at 8am, but not on Saturdays or Sundays.

In this scenario, I first thought that cron would be the way to go. However, I read Massimo's comment ("cron must die"), so I don't know what to use.
I find the scheduler very complete and robust, but I don't know if it's the best option for this case, considering that the task runs in very little time and only once a day. I'm worried about resource consumption, because the same app is installed multiple times in production, serving multiple websites, so there would be multiple workers running in the background (maybe idle workers, but they would still take some memory, I guess).

Any tips or comments? Has anyone dealt with something similar? I want to stress that the user must be able to change the configuration of the scheduled task.
Thanks in advance!


Tim Richardson

Dec 23, 2014, 3:07:28 PM
to web...@googlegroups.com
I use the scheduler for activities that run daily to some that run every 15 minutes. 
It doesn't do day-of-the-week yet so you would need to check for that in your daily code. I for example have an app which sends SMS every hour, but the task checks that it is in office hours before doing any work. 
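A minimal sketch of that kind of in-task check, assuming a hypothetical daily_task function and a hypothetical SEND_DAYS setting (neither name comes from the thread). The scheduler still fires the task every day; the task itself decides whether to do any work:

```python
from datetime import datetime

# Hypothetical user setting: weekdays on which the newsletter goes out.
# datetime.weekday() numbers Monday as 0 and Sunday as 6.
SEND_DAYS = {0, 1, 2, 3, 4}  # Monday to Friday


def should_send(now, send_days=SEND_DAYS):
    """Return True if `now` falls on one of the configured weekdays."""
    return now.weekday() in send_days


def daily_task(now=None):
    """Body of the daily task: skip the work on days that are not configured."""
    now = now or datetime.now()
    if not should_send(now):
        return "skipped"
    # ... query the subscribers and call the external API here ...
    return "sent"
```

In a real app the set of days would presumably be read from a table that the user edits through the webapp, rather than from a module-level constant.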
The scheduler is a good tool, and once you've learnt it, you have a cross-platform tool at your disposal. It's easy to manipulate in web2py since there is an API (which basically updates records; this will make it easy to meet your requirement that the user chooses when to start the task). It has a simple logging approach (output is saved in tables) and it scales to multiple workers easily.

I think the references to "cron must die" refer to a deprecated web2py feature, unfortunately named cron; don't confuse it with the system cron built into your server OS. 
Therefore you may decide to use cron. In which case...
You can run web2py scripts in the context of your application using the web2py command line:
python web2py.py -S {app} -R {path_prefix}/mymodule.py
(see the documentation in the book; the -S option needs to be used as well).

You could for example put your code in a module, and have code in the global context (if __name__ == "__main__" ...) which will be run when you execute the command line. 
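A minimal sketch of such a module, assuming a hypothetical file name mymodule.py and a placeholder function send_newsletter (neither is taken from a real app); the stub body stands in for the actual work:

```python
# mymodule.py (hypothetical name): meant to be launched through the
# web2py command line shown above, so it runs in the app's context.

def send_newsletter():
    """Placeholder for the real job: query the db, then call the API."""
    # ... real work goes here ...
    return "newsletter sent"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when imported.
    print(send_newsletter())
```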

Lisandro

Dec 24, 2014, 9:56:34 AM
to web...@googlegroups.com
Thank you very much Tim, excellent info. I will use web2py's Scheduler, because it will be easy to write some code and let the user change the configuration from the webapp. 

However, I have one last concern about system resources on production.
As I'm going to use this in production, my webapp will be installed multiple times (that is, multiple virtual hosts, each one running an instance of web2py). Therefore, according to the documentation, I will have to run the scheduler as a Linux service (http://web2py.com/books/default/chapter/29/13/deployment-recipes#Start-the-scheduler-as-a-Linux-service--upstart-). 
In my case, I will be running one service per web2py instance but, as I said, each scheduler will only handle the sending of a daily newsletter, that is, each scheduler will execute only one task per day. Hence my question: does the scheduler consume many resources while "idle", waiting for tasks to be queued?

On the other hand, after reading web2py's scheduler documentation, I found out that, if I want to schedule a task that runs every day at a certain time, I would have to queue the task this way:

scheduler.queue_task(
    send_newsletter,             # the function that sends the newsletter
    start_time=first_execution,  # first_execution would be, for example, tomorrow at 8am
    period=86400,                # one day, expressed in seconds
    repeats=0                    # unlimited repeats
)
Is that ok? If I do it this way, should I increase the heartbeat to 60 seconds or more? That is, taking into account that the scheduler will only have to run one task per day.

Niphlod

Dec 24, 2014, 1:03:55 PM
to web...@googlegroups.com
a)
Some people never consider this possibility, but if you have 3 apps, e.g. app1, app2 and app3, you can run one scheduler for all of them. By default a worker processes tasks coming from the same app that queues them, but the switch is there: application_name. You only need to make sure that tasks are queued with an explicit application_name='.....', e.g. mysched.queue_task(thefunction, ...., application_name='app1'). That way you can even queue a task defined in app2 from app1. If instead the tasks defined in app1 are queued only within app1, the explicit application_name is not needed.
 
Of course, the database used by the scheduler would have to be the same, and once it's the same, no matter what appname you pass to the -K parameter...it will process any task queued in there (i.e. from ALL apps) without issues.

b)
If you don't care about letting users receive those notifications at ultra-fine-grained times, e.g. 2:37AM, but only at e.g. 00:00, 00:30, 01:00 (every half hour) and so on, you can avoid having the scheduler always active... you can instantiate it with

mysched = Scheduler(dbsched, max_empty_runs=10)

and start the scheduler with

web2py.py -K app1

every half hour.

A worker will then be fired up, process all queued tasks, and terminate automatically after 10 "empty loops", i.e. 10 rounds where no new tasks are found.
I use this "pattern" a lot, e.g. for a high number of tasks that need to be processed before arriving at the office, at 6:00am.
The use case is pretty much "leave all the raw data coming in during the day, aggregate and do some report on it at fixed intervals"... in my case, I know that tasks get queued during the day, but I only need to run them (aggregation and reporting) by 7:00am the next morning, so I just start the scheduler at 6:00am, let it process all the backlog and then die gracefully when there's no work to do. 
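To make the half-hour granularity concrete, here is a small sketch of how a user-chosen time maps onto the next worker launch. It assumes the OS cron starts the worker exactly at :00 and :30 and that a task whose start_time has already passed is picked up by that worker; next_worker_start is a hypothetical helper, not part of web2py, and the timedelta floor division needs Python 3:

```python
from datetime import datetime, timedelta


def next_worker_start(requested, interval_minutes=30):
    """Round `requested` up to the next launch of a cron-started worker."""
    midnight = requested.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = requested - midnight
    interval = timedelta(minutes=interval_minutes)
    # Ceiling division on timedeltas, written as -(-a // b).
    slots = -(-elapsed // interval)
    return midnight + slots * interval
```

Under those assumptions, a newsletter configured for 2:37 would effectively go out with the 3:00 worker, while one configured for 8:00 would go out with the 8:00 worker.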

c)
To have the task execute at the exact same time every day, you're encouraged to also pass the prevent_drift parameter set to True. This is explained in the book... quote:

Default behavior: The time period is not calculated between the END of the first round and the START of the next, but from the START time of the first round to the START time of the next cycle). This can cause accumulating 'drift' in the start time of a job. After v 2.8.2, a new parameter prevent_drift was added, defaulting to False. If set to True when queing a task, the start_time parameter will take precedence over the period, preventing drift
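Putting the pieces together, a daily task pinned to a fixed time could be queued roughly like this. This is only a sketch: first_execution_after and queue_daily are hypothetical helpers, and the scheduler argument is assumed to be the app's Scheduler instance (e.g. the mysched above); only the queue_task keyword arguments are the ones discussed in this thread:

```python
from datetime import datetime, time, timedelta

ONE_DAY = 86400  # one day, expressed in seconds


def first_execution_after(now, send_at=time(8, 0)):
    """Return the next occurrence of `send_at` strictly after `now`."""
    candidate = datetime.combine(now.date(), send_at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def queue_daily(scheduler, task, now, send_at=time(8, 0)):
    """Queue `task` to repeat every day at `send_at` without drifting."""
    return scheduler.queue_task(
        task,
        start_time=first_execution_after(now, send_at),
        period=ONE_DAY,
        repeats=0,           # unlimited repeats
        prevent_drift=True,  # start_time takes precedence over period
    )
```

Inside a web2py model or controller this might be called as queue_daily(mysched, send_newsletter, request.now), with send_at taken from whatever the user configured.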




Lisandro Rostagno

Dec 29, 2014, 10:11:35 AM
to web...@googlegroups.com
Thanks Niphlod for your tips. The idea of having a separate db for the scheduler crossed my mind when I noticed the "db" parameter for the Scheduler constructor; however, in my case, I have **multiple instances of web2py** (instead of a single instance with multiple apps), like this:

/var/www/web2py1/web2py.py
/var/www/web2py1/applications/myapp
/var/www/web2py1/applications/myapp/models/scheduler.py

/var/www/web2py2/web2py.py
/var/www/web2py2/applications/myapp
/var/www/web2py2/applications/myapp/models/scheduler.py

/var/www/web2py3/web2py.py
/var/www/web2py3/applications/myapp
/var/www/web2py3/applications/myapp/models/scheduler.py

so I guess it's not possible to run only one scheduler (using the web2py.py of the first instance) and have it execute the tasks queued by the other web2py instances, even though they are using the same db for the scheduling. Please correct me if I'm wrong. 

If I'm correct about that, and considering that I don't want to depend on OS cron, then the question is: would it be correct to queue the task in the following way?
scheduler.queue_task(
    send_newsletter,             # the function that sends the newsletter
    start_time=first_execution,  # first_execution would be, for example, tomorrow at 8am
    period=86400,                # one day, expressed in seconds
    repeats=0                    # unlimited repeats
)
That is, the task is queued one first time, with a period of one day and unlimited repeats. If this is the case, would it make sense to set the scheduler's heartbeat to, let's say, 10 minutes or more?



Niphlod

Dec 30, 2014, 9:39:12 AM
to web...@googlegroups.com
that's really someone overcomplicating the setup... on the queue_task thing, use prevent_drift as explained before. Again, setting the heartbeat to 10 minutes kinda sucks, as there are too many things that can happen in 10 minutes. BTW: if you don't use cron, what are you planning to use to keep the scheduler process(es) running ?

Lisandro Rostagno

Jan 12, 2015, 6:56:16 AM
to web...@googlegroups.com
Sorry for the delay (I was on vacation).

Thanks Niphlod, I actually forgot that I would finally have to use OS cron to keep the scheduler running, so, taking into account all the tips and suggestions, I will use OS cron to start the scheduler every half hour, and I will instantiate the scheduler with max_empty_runs=10.

Thanks again Tim and Niphlod for your help!


