A basic problem about threading and a time-consuming function...


Giuseppe Luca Scrofani

May 25, 2010, 6:33:02 PM
to web...@googlegroups.com
Hi all, as promised I'm here to prove you are patient and nice :)
I have to make this little app where a function reads the HTML content of
several pages of another website (like a spider) and, if a specified keyword
is found, the app refreshes a page showing the growing list of matches.
Now, the spider part is already coded; it is called search(), it uses twill
to log in to the target site, read the HTML of a list of pages, perform some
search procedures and keep adding the results to a list. I integrated this in
a default.py controller and call it in def index().
This makes the index.html page load for a long time, because it has to finish
scanning all the pages before returning the results.
What I want to achieve is to automatically refresh index every 2 seconds to
keep in touch with what is going on, seeing the list of matches grow in
"realtime". Even better would be some sort of ajax magic so I don't have to
refresh the entire page... but this is not vital, a simple page refresh would
be sufficient.
Question is: do I have to use threading to solve this problem?
Alternative solutions?
Do I have to make the list of matches a global to read it from another
function? Would it be simpler to have it write a text file, adding a line for
every match, and read that from the index controller? If I have to use a
thread, will it run on GAE?

Sorry for the long text and for my bad English :)

gls

mdipierro

May 25, 2010, 9:00:42 PM
to web2py-users
I would use a background process that does the work and adds the items
to a database table. The index function would periodically refresh, or
pull an updated list via ajax from the database table. There is no way
for the server to trigger an action in the browser unless 1) the
browser initiates it or 2) the client code embeds an ajax http server.
I would stay away from 1 and 2 and
use a reload or ajax.
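
Something along these lines would do it (untested sketch; the db.match table and its fields are placeholders for whatever the spider saves):

def index():
    # just render the page; the list arrives via the ajax poll on results()
    return dict()

def results():
    # called from the view every couple of seconds
    rows = db(db.match).select(orderby=~db.match.id)
    return UL(*[LI(r.keyword, ' found at ', r.url) for r in rows]).xml()

In the view, put a <div id="matches"></div> and something like
setInterval(function(){ajax("{{=URL('results')}}", [], "matches");}, 2000);
in a script block, so only the list is refreshed instead of the whole page.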


Candid

May 25, 2010, 9:39:42 PM
to web2py-users
Well, actually there is a way for the server to trigger an action in
the browser. It's called comet. Of course under the hood it's
implemented on top of HTTP, so it's the browser that initiates the
request, but from the developer's perspective it looks like there is a
dual-channel connection between the browser and the server, and both
can send messages to each other asynchronously. There are several
implementations of the comet technique. I've used Orbited
(http://orbited.org/) and it worked quite well for me.

Allard

May 25, 2010, 10:34:37 PM
to web2py-users
It seems like Comet would be hard to implement in web2py. Does web2py
use a threadpool internally? If so, I can see you running out of threads
pretty quickly. Ideally you would solve this kind of problem with an
asynchronous model (think Gevent, Eventlet, Concurrence, Tornado). I am
working on a project which requires a lot of slow processing (image
resizing, sending emails) based on client-initiated calls. Massimo, have
you considered an asynchronous model within web2py? Curious about your
thoughts on it. I would much rather handle the long-running tasks in a
green thread than block a complete thread.

My first post here, and I've just started to work with web2py on a social
site. Great work Massimo! Batteries included but still light.

Allard

May 25, 2010, 10:24:15 PM
to web2py-users
Comet is a nice way to get this done, but I wonder how to implement
comet efficiently in web2py. Massimo, does web2py use a threadpool
under the hood? For comet you would then quickly run out of threads.
If you tried to do this with a thread per connection, things would get
out of hand pretty quickly, so the best way is to do the work
asynchronously, like Orbited does. An alternative would be one of the
contemporary Python asynchronous libraries. These libraries provide
monkey patching of synchronous calls such as your URL fetching (see the
sketch after this list). Some suggestions:

Gevent: now with support for Postgres, probably the fastest out there
Eventlet: used at Linden Lab / Second Life
Concurrence: with a handy async MySQL interface
Tornado: a full async web server in Python
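
For example, with gevent the monkey-patched fetching could look roughly like this (untested sketch, URLs are just examples):

from gevent import monkey
monkey.patch_all()   # blocking socket calls now cooperate with the event loop

import gevent
import urllib2       # patched: urlopen yields to other greenlets while waiting

def fetch(url):
    return urllib2.urlopen(url).read()

urls = ['http://example.com/page%d' % i for i in range(10)]
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=30)
pages = [job.value for job in jobs]   # None for any fetch that failed or timed out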

Massimo: what do you think of an asynchronous model for web2py? It'd
be great to have asynchronous capabilities. I am writing an app that
will require quite a bit of client-initiated background processing
(sending emails, resizing images) which I would rather hand off to a
green thread than block one of the web2py threads. Curious about your
thoughts.

BTW - my first post here. I started to use web2py for a community site
and enjoy working in it a lot! Great work.


mdipierro

May 25, 2010, 10:59:57 PM
to web2py-users
On May 25, 9:24 pm, Allard <docto...@gmail.com> wrote:
> Comet is a nice way to get this done but I wonder how to implement
> comet efficiently in web2py.

I have never used comet, but I do not see any major problem.

> Massimo, does web2py use a threadpool
> under the hood? For comet you would then quickly run out of threads.

The web server creates a thread pool; for standalone web2py that
would be Rocket.
You do not run out of them any more than in any other web app.

> If you'd try to do this with a thread per connection, things would get
> out of hand pretty quickly, so the best way is doing the work
> asynchronously like Orbited. Alternatives would be using one of the
> contemporary Python asynchronous libraries. These libraries provide
> monkey patching of synchronous calls like your url fetching. Some
> suggestions:
>
> Gevent: now with support for Postgres, probably the fastest out there
> Eventlet: used at Linden Lab / Second Life
> Concurrence: with a handy async MySQL interface
> Tornado: a full async web server in Python
>
> Massimo: what do you think of an asynchronous model for web2py? It'd
> be great to have asynchronous capabilities. I am writing an app
> that will require quite a bit of client-initiated background
> processing (sending emails, resizing images) which I would rather hand
> off to a green thread and not block one of the web2py threads. Curious
> about your thoughts.

I do not think we can use async IO with web2py. Async IO, as far as I
understand, would require a different programming style.
Anyway, if you have a working proof of concept I would like to see it.

Massimo

Allard

May 26, 2010, 10:26:39 PM
to web2py-users
I won't have time to work out an async proof of concept at this time. I
hope to get to it after some more real-world profiling with my web2py
app, though. To give you an idea of how an async web framework could
feel as natural in programming style as web2py (e.g. no callbacks all
over the place), have a look at the Concurrence documentation if you're
interested:
http://opensource.hyves.org/concurrence/index.html
Implementing async for web2py is probably for the most part
straightforward (monkey patching all the IO). The trouble will be with
external libraries that block and can't be monkey patched, for example
db drivers. Maybe those blocking calls are best dealt with in a thread
pool and queue (a rough sketch follows).
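
Roughly like this (untested sketch using eventlet's tpool; blocking_query just stands in for a driver call that cannot be monkey patched):

from eventlet import tpool

def blocking_query(sql):
    # pretend this is a C database driver that blocks the whole thread
    import time
    time.sleep(1)
    return 'rows for: %s' % sql

# tpool.execute runs the call in a native worker thread and lets
# the other green threads keep running while it waits
result = tpool.execute(blocking_query, 'SELECT 1')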

The idea of Comet is to keep the connection to the client open and
stream data as it becomes available:
http://en.wikipedia.org/wiki/Comet_%28programming%29
It saves the overhead of the client polling at intervals and
establishing the connection each time. In a thread-per-connection
model you would need to keep a thread available per client. A thread
per client gets expensive quickly and does not scale nicely: after
a few hundred connections most servers slow down dramatically because
of thread context switching. See also:
http://www.kegel.com/c10k.html

For most web apps a thread per connection (from a threadpool) won't be
a problem, but for things like Ajax email applications or chat/IM
it does get troublesome.

Giuseppe Luca Scrofani

May 27, 2010, 4:10:40 AM
to web...@googlegroups.com
Thanks for answering, friends! I've extracted some good info from
this discussion, and the solution proposed by Massimo works well :)

Doug Warren

Jun 2, 2010, 11:42:32 PM
to web...@googlegroups.com
I started looking at this a bit. You can find the spec for the Comet
protocol, such as it is, at
http://svn.cometd.com/trunk/bayeux/bayeux.html It's built on top of
JSON but isn't quite JSON-RPC; it's more of a publish/subscribe model.
The latest 'cometd' just had a beta release, available
at http://download.cometd.org/, and it includes a JavaScript library for
both dojo and jQuery (well, it's written in/for dojo, but has a jQuery-style
interface as well). I started poking at it a bit, but I haven't
ever done any jQuery so it will probably be slow going.

My plan is to require basic auth, put the client ID into the
session, route any data on subscribed channels into that connection,
and then just use long-polling to keep it open.

Since there's no real way for a web2py app to be notified of internal
state changes, I'm not sure, long term, how I would handle actually
looking for anything to send out over the long poll. Though I've had
some thoughts about writing a scheduler for web2py with a granularity
of a second or so.
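
To make the long-polling part concrete, the action could look roughly like this (untested sketch; db.message stands for whatever table holds the data to push):

import time
from gluon.serializers import json

def poll():
    last_id = int(request.vars.last_id or 0)
    deadline = time.time() + 25              # hold the request open for up to ~25 seconds
    while time.time() < deadline:
        rows = db(db.message.id > last_id).select(orderby=db.message.id)
        if rows:
            return json(dict(messages=rows.as_list()))
        db.commit()                          # release the transaction before sleeping
        time.sleep(1)
    return json(dict(messages=[]))           # nothing new; the client reconnects

Note that every client waiting in poll() pins one thread from the server's pool for the whole 25 seconds, which is exactly the scaling problem discussed above.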

Mirek Zvolský

May 1, 2016, 2:09:23 PM
to web2py-users, glsd...@gmail.com
Thanks for the info and tips, 6 years later.

What I am trying to do
is a form with a single input, where the user enters a query string,
and then data about (usually ~300) books is retrieved via the Z39.50 protocol / MARC format, parsed and saved into a local database.

Of course this takes a while (2? 5? 20? seconds), so I decided
not to show the result immediately,
but to show the same form again with the possibility to enter the next query, plus a list of pending queries (and their status, checked via ajax every 5 seconds).

So my idea was to return from the controller fast and, just before returning, to start a new thread to retrieve/parse/save/commit the data.

From this discussion I understand that opening a new thread isn't the best idea.
I think it could still be possible, because even if my new thread were killed by the web server 60s later, together with the original thread, that would not be a fatal problem for me here.

However, since (as I read here) that would be a somewhat wild technique,
and because the other technologies mentioned here (https://en.wikipedia.org/wiki/Comet_(programming), paragraph Alternatives) are too difficult for me,
and because I don't want to use a scheduler (I need to start as soon as possible),

I will solve it like this:
I will make 2 HTTP requests from my page: one with the submit (which validates/saves the query to the database) and one with ajax/javascript (onSubmit from the old page or, better, onPageLoaded from the next page, where I put the query into the HTML DOM as a hidden value), which starts the Z39.50 retrieve/parse/save work.
This is much better, because in the ajax call web2py prepares the db variable with the proper db model for me (which I would otherwise have to handle myself in the separate thread).
The callback from this ajax call should/could be a dummy javascript function, because it is not certain, and not important, whether the page still exists when the server job finishes.
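
Roughly (untested sketch; retrieve_parse_save stands for my Z39.50/MARC routine, the rest follows the idea above):

def ask():
    # first request: the submit validates and saves the query
    form = SQLFORM(db.question)
    if form.process().accepted:
        redirect(URL('ask', vars=dict(start=form.vars.id)))
    return dict(form=form, start=request.vars.start)

def run_query():
    # second request, fired from the view once the page has loaded, e.g.
    # ajax("{{=URL('run_query', vars=dict(id=start))}}", [], ':eval')
    retrieve_parse_save(int(request.vars.id))   # the slow Z39.50 / MARC work
    return ''   # the page may already be gone; nothing to render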

So, if somebody is interested and reads this very old thread, maybe this can give them some ideas for time-consuming actions.
And maybe somebody will add other important hints or comments (thanks in advance).







Niphlod

May 1, 2016, 4:10:31 PM
to web2py-users, glsd...@gmail.com
The statement "I don't need to use the scheduler, because I want to start it as soon as possible" is flaky at best. If your "fetching" varies from 2 to 20 seconds and COULD extend further to 60 seconds, waiting a few seconds for the scheduler to start the process is .... uhm... debatable.
Of course, relying on ajax, accepting that your "fetching" can be killed in the process, is the only other way.

Mirek Zvolský

May 2, 2016, 2:35:05 PM
to web2py-users, glsd...@gmail.com
You are right.
At the moment it works well for me via ajax, and I will watch carefully for problems.
If they appear, I will move to the scheduler.

I see this is exactly what Massimo(?) writes at the bottom of the Ajax chapter of the book.

PS about the times:
On a notebook with a mobile connection it takes 20-40s, so it could be dangerous.
On a cloud server with an SSD it takes 2-10s, and that will be my case. And I feel better when the user gets a typical response in 3s instead of 8s.






Mirek Zvolský

May 3, 2016, 6:32:13 AM
to web2py-users, glsd...@gmail.com
Hi, Niphlod.

After reading a bit more about the scheduler,
I am definitely sorry for my previous notes
and I choose the web2py scheduler, of course.

It will be my first use of it (in a much older, ~3-year-old web2py app I have only used cron),
so it will take some time to learn the scheduler. But it is surely worth redesigning it this way.

Thank you for being patient with me.
Mirek





Niphlod

May 3, 2016, 8:21:23 AM
to web2py-users, glsd...@gmail.com
NP. As with everything, it's not a silver bullet, but with the redis incarnation I'm sure you can get below 3 seconds (and, if you tune the heartbeat, even below 1 second) from when the task gets queued to when it gets processed.
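
Something like this (untested sketch, assuming the redis contrib modules bundled with web2py and a local redis server):

# models/scheduler.py
from gluon.contrib.redis_utils import RConn
from gluon.contrib.redis_scheduler import RScheduler

rconn = RConn()                                             # defaults to localhost:6379
scheduler = RScheduler(db, redis_conn=rconn, heartbeat=1)   # workers poll every second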

Mirek Zvolský

May 5, 2016, 7:06:04 AM
to web2py-users, glsd...@gmail.com
Yes.
I am running with the scheduler already. It is really nice and great!
Moving away from the ajax solution was easy and there was almost no problem. (I have very simple parameters for the task and I return nothing; I just save into the db.)
The resulting code is cleaner (one task-queuing call instead of rendering a hidden html element + js reading from it + ajax call + parsing args).

Maybe my previous mistake (I mean my earlier message in this thread) will be helpful for others deciding to go with the scheduler.

What I need to do now is deployment for the scheduler (on Debian and nginx).

PS:
It was quick but important to find out
- where I can see code errors (in the scheduler's db tables),
- how to set a timeout (in the queue_task call)

Here is the code example (controller and models/scheduler.py):

# controller, e.g. controllers/default.py
def find():
    def onvalidation(form):
        form.vars.asked = datetime.datetime.utcnow()
    form = SQLFORM(db.question)
    if form.process(onvalidation=onvalidation).accepted:
        scheduler.queue_task(task_catalogize,
                pvars={'question_id': form.vars.id, 'question': form.vars.question, 'asked': str(form.vars.asked)},  # str() to make the datetime json-serializable
                timeout=300)
    return dict(form=form)

# models/scheduler.py
import datetime
from gluon.scheduler import Scheduler

def task_catalogize(question_id, question, asked):
    asked = datetime.datetime.strptime(asked, '%Y-%m-%d %H:%M:%S.%f')  # deserialize the datetime
    inserted = some_db_actions(question)   # my own Z39.50/MARC retrieve/parse/save routine
    db.question[question_id] = {
            'duration': round((datetime.datetime.utcnow() - asked).total_seconds(), 0),  # same/similar to what the scheduler keeps in its db tables
            'inserted': inserted}
    db.commit()

scheduler = Scheduler(db)




Niphlod

May 5, 2016, 9:56:32 AM
to web2py-users, glsd...@gmail.com
I'd really go with a specific format for the date you post, since you use a fixed one to parse it (i.e. don't rely on the default str()). Also, maybe you can just use request.now altogether?
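
i.e. roughly, only the queuing side changes (sketch based on the code above):

def onvalidation(form):
    form.vars.asked = request.now   # instead of datetime.datetime.utcnow()

# and when queuing, serialize with the same explicit format the task parses:
scheduler.queue_task(task_catalogize,
        pvars={'question_id': form.vars.id,
               'question': form.vars.question,
               'asked': form.vars.asked.strftime('%Y-%m-%d %H:%M:%S.%f')},
        timeout=300)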

Mirek Zvolský

May 5, 2016, 11:18:24 AM
to web2py-users, glsd...@gmail.com
Debian/nginx/systemd deployment:

I got the scheduler working with their help.

Thank you very much Niphlod, Michael M, Brian M.
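
For anyone finding this later, a rough sketch of a systemd unit for a scheduler worker (paths, user and app name are placeholders; web2py's -K option starts a scheduler worker for the named app, while nginx keeps serving the web requests separately):

# /etc/systemd/system/web2py-scheduler.service
[Unit]
Description=web2py scheduler worker
After=network.target

[Service]
User=www-data
WorkingDirectory=/home/www-data/web2py
ExecStart=/usr/bin/python /home/www-data/web2py/web2py.py -K myapp
Restart=always

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable web2py-scheduler and start it with systemctl start web2py-scheduler.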



