Using multiprocessing in web2py


David Mitchell

Feb 27, 2011, 4:32:22 AM2/27/11
to web...@googlegroups.com
Hello everyone,

I've read through the message archive and there seems to be a fairly clear message: don't use the multiprocessing module within web2py.

However, I'm hoping I might have a use case that's a bit different...

I've got an app that basically does analytics on moderately large datasets.  I've got a number of controller methods that look like the following:

def my_method():
    # Note: all data of interest has previously been loaded into 'session.data'
    results = []
    d = local_import('analysis')
    results += d.my_1st_analysis_method(session)
    results += d.my_2nd_analysis_method(session, date=date)
    results += d.my_3rd_analysis_method(session)
    results += d.my_4th_analysis_method(session, date=date)
    results += d.my_5th_analysis_method(session, date=date)
    return dict(results=results)

The problem I have is that all of the methods in my 'analysis' module, when run in sequence as per the above, simply take too long to execute and give me a browser timeout.  I can mitigate this to some extent by extending the timeout on my browser, but I need to be able to use an iPad's Safari browser and it appears to be impossible to increase the browser timeout on the iPad.  Even if it can be done, that approach seems pretty ugly and I'd rather not have to do it.  What I really want to do is run all of these analysis methods *simultaneously*, capturing the results of each analysis_method into a single variable once they've finished.

All of the methods within the 'analysis' module are designed to run concurrently - although they reference session variables, I've consciously avoided updating any session variables within any of these methods.  While all the data is stored in a database, it's loaded into a session variable (session.data) before my_method is called; this data never gets changed as part of the analysis.

Is it reasonable to replace the above code with something like this:

def my_method():
    import multiprocessing
    d = local_import('analysis')

    # Each task is a callable plus its positional and keyword arguments
    tasks = [
        {'job': d.my_1st_analysis_method, 'args': (session,), 'kwargs': {}},
        {'job': d.my_2nd_analysis_method, 'args': (session,), 'kwargs': {'date': date}},
        {'job': d.my_3rd_analysis_method, 'args': (session,), 'kwargs': {}},
        {'job': d.my_4th_analysis_method, 'args': (session,), 'kwargs': {'date': date}},
        {'job': d.my_5th_analysis_method, 'args': (session,), 'kwargs': {'date': date}},
    ]

    result_queue = multiprocessing.Queue()

    def run_task(task):
        # each worker runs one analysis method and sends its result list back
        result_queue.put(task['job'](*task['args'], **task['kwargs']))

    # relies on fork (the usual case on Unix), so the children inherit the
    # queue, the closure, and session without having to pickle the target
    workers = [multiprocessing.Process(target=run_task, args=(t,)) for t in tasks]
    for w in workers:
        w.start()

    # collect one result list per task, then wait for the workers to exit
    results = []
    for _ in tasks:
        results += result_queue.get()

    for w in workers:
        w.join()

    return dict(results=results)

Note: I haven't tried anything using the multiprocessing module before, so if you've got any suggestions as to how to improve the above code, I'd greatly appreciate it...
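[Editor's note: one possible simplification, sketched here and not tested against web2py, is to let multiprocessing.Pool manage the workers and result collection itself. Everything handed to the pool must be picklable, which web2py's session may well not be. The analysis functions below are stand-ins for the real ones from local_import('analysis').]

```python
from multiprocessing import Pool

# Stand-ins for the analysis methods; like the originals, each returns a list
def analysis_sum(data):
    return [sum(data)]

def analysis_max(data):
    return [max(data)]

def run_all(data):
    # apply_async submits every job at once; get() blocks until each finishes
    pool = Pool(processes=2)
    try:
        pending = [pool.apply_async(f, (data,)) for f in (analysis_sum, analysis_max)]
        results = []
        for job in pending:
            results += job.get()
    finally:
        pool.close()
        pool.join()
    return results
```

On platforms where worker processes are spawned rather than forked, call run_all from under an `if __name__ == '__main__':` guard so the children can re-import the module safely.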

Is introducing multiprocessing as I've outlined above a reasonable way to optimise code in this scenario, or is there something in web2py that makes this a bad idea?  If it's a bad idea, do you have any suggestions what else I could try?

Thanks in advance

David Mitchell

David Mitchell

Mar 1, 2011, 5:48:36 AM3/1/11
to web...@googlegroups.com
Bump

Jonathan Lundell

Mar 1, 2011, 10:56:10 AM3/1/11
to web...@googlegroups.com
On Mar 1, 2011, at 2:48 AM, David Mitchell wrote:
Bump

I tried something like that a while back with Python threads, for much the same reason you describe. In my case, each thread was farming out an xml-rpc request, each to a different server, an ideal case for this kind of thing, since all of my threads were in IO wait. (That is to say, I didn't need MP, since the processing didn't happen locally, but I could benefit from threads of control.)

The problem comes when the user/browser gives up and cancels the request, and immediately resubmits (or simply gets impatient and does a reload). In my case, things would get so tangled that I'd have to restart web2py. 

I think that mp might be a reasonable solution, but I'd look for a way to redefine the approach to be asynchronous. Fill in the results in the database, perhaps, and poll from the client with Ajax, maybe. But do it in such a way that the initial request (to web2py) can complete more or less immediately.
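[Editor's note: that asynchronous shape can be sketched in plain Python. An in-memory dict stands in for the database table, and start_job/poll are hypothetical names; in web2py, poll would be the controller the browser hits via Ajax.]

```python
import threading
import uuid

results = {}  # stand-in for a database table keyed by job id

def start_job(func, *args):
    # returns immediately, like the initial web2py request should
    job_id = str(uuid.uuid4())
    results[job_id] = {'status': 'running', 'value': None}

    def worker():
        results[job_id] = {'status': 'done', 'value': func(*args)}

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def poll(job_id):
    # the Ajax handler: the client calls this until status == 'done'
    return results[job_id]
```

A real version would write status rows to the database instead of a dict, so the jobs survive web2py restarts and the polling controller stays stateless.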

Lorin Rivers

Mar 1, 2011, 11:31:48 AM3/1/11
to web...@googlegroups.com
David,

If you can do any of the analysis ahead of time and store that, it might help. That's what I had to do.

In my case, I also discovered that the database I was using wanted more RAM than I could give it at the time and had to do my real-time analysis in smaller chunks.


--
Lorin Rivers
Mosasaur: Killer Technical Marketing <http://www.mosasaur.com>
<mailto:lri...@mosasaur.com>
512/203.3198 (m)


VP

Mar 1, 2011, 11:57:14 AM3/1/11
to web2py-users
>> The problem comes when the user/browser gives up and cancels the request, and immediately resubmits (or simply gets impatient and does a reload). In my case, things would get so tangled that I'd have to restart web2py.

I think the proper way to deal with this is to have a queue of tasks
with some type of scheduling.

There are many use-cases when a webapp is merely a web-based interface
to compute-intensive applications running on the server(s). As such,
web2py might want to have built-in support for these types of uses.

pbreit

Mar 1, 2011, 1:26:59 PM3/1/11
to web...@googlegroups.com
Yes, I think the use of queues in general needs to go way up. I use a very simple queue for emailing:

===models.py===
db.define_table('mail_queue',
    Field('status'),
    Field('email'),
    Field('subject'),
    Field('message'))

===mail_queue.py===
# run via web2py cron with the '*' flag, so db and mail from the models are in scope
import time

while True:
    rows = db(db.mail_queue.status=='pending').select()
    for row in rows:
        if mail.send(to=row.email,
            subject=row.subject,
            message=row.message):
            row.update_record(status='sent')
        else:
            row.update_record(status='failed')
        db.commit()
    time.sleep(60) # check every minute

===crontab===
@reboot root *applications/init/private/mail_queue.py

The trick is to offload all the tasks into a queue and then have the web page poll the app for completion via Ajax or something.
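[Editor's note: the same queue-and-poll trick can be sketched standalone; sqlite3 stands in for the DAL here, and the table and helper names (enqueue, process_pending) are made up for illustration.]

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE task_queue (id INTEGER PRIMARY KEY, status TEXT, payload TEXT)")

def enqueue(payload):
    # the web request does only this cheap insert, then returns at once
    db.execute("INSERT INTO task_queue (status, payload) VALUES ('pending', ?)", (payload,))
    db.commit()

def process_pending(handler):
    # one pass of the background worker loop: run each pending task
    # through the handler and record the outcome, committing as we go
    rows = db.execute("SELECT id, payload FROM task_queue WHERE status = 'pending'").fetchall()
    for row_id, payload in rows:
        status = 'sent' if handler(payload) else 'failed'
        db.execute("UPDATE task_queue SET status = ? WHERE id = ?", (status, row_id))
        db.commit()
```

The page then polls a controller that just reads the status column, which keeps every web request fast regardless of how long the tasks themselves take.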

Massimo inquired about more built-in queuing functionality, which is probably not a bad idea since it's such a useful approach. Maybe it can reasonably be delivered as a module.