Process + Threads + asyncio... does it make sense?


cr0hn cr0hn

Apr 18, 2016, 2:33:48 PM
to python-tulip
Hi all,

This is the first time I write to this list. Sorry if it's not the best place for this question.

After reading the asyncio documentation, PEPs, and articles/talks by Guido, Jesse Noller, David Beazley, etc., I developed a PoC library that mixes processes, threads, and asyncio tasks, following a scheme like this diagram:

 main -> Process 1 -> Thread 1.1 -> Task 1.1.1
                                 -> Task 1.1.2
                                 -> Task 1.1.3
                   -> Thread 1.2 -> Task 1.2.1
                                 -> Task 1.2.2
                                 -> Task 1.2.3

      -> Process 2 -> Thread 2.1 -> Task 2.1.1
                                 -> Task 2.1.2
                                 -> Task 2.1.3
                   -> Thread 2.2 -> Task 2.2.1
                                 -> Task 2.2.2
                                 -> Task 2.2.3

In my local tests, this approach appears to improve (and simplify) concurrency/parallelism for some tasks, but before releasing the library on GitHub I don't know whether my approach is flawed, and I would appreciate your opinion.
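A minimal sketch of the scheme above, using only the standard library (the helper names, the 2x2x3 fan-out, and the use of "fork" are illustrative assumptions, not the real library code):

```python
import asyncio
import concurrent.futures
import threading

async def io_task(name: str) -> str:
    # Stand-in coroutine for real I/O-bound work.
    await asyncio.sleep(0.01)
    return "done " + name

async def run_tasks(proc: int, thr: int) -> list:
    # Three asyncio tasks per thread, as in the diagram.
    names = ["%d.%d.%d" % (proc, thr, i) for i in (1, 2, 3)]
    return await asyncio.gather(*(io_task(n) for n in names))

def thread_worker(proc: int, thr: int) -> list:
    # Each thread drives its own private event loop.
    return asyncio.run(run_tasks(proc, thr))

def process_worker(proc: int) -> list:
    # Two threads per process, each running its own loop.
    results: list = []
    lock = threading.Lock()

    def target(thr: int) -> None:
        done = thread_worker(proc, thr)
        with lock:
            results.extend(done)

    threads = [threading.Thread(target=target, args=(t,)) for t in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

if __name__ == "__main__":
    # Two worker processes, mirroring "Process 1" and "Process 2" above.
    # "fork" keeps this sketch self-contained on Linux.
    import multiprocessing
    ctx = multiprocessing.get_context("fork")
    with concurrent.futures.ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        total = [r for chunk in pool.map(process_worker, (1, 2)) for r in chunk]
    print(len(total))  # 2 processes x 2 threads x 3 tasks = 12
```

Each thread owns its own loop, so tasks on different threads never share a loop; whether the extra thread layer pays off depends on the workload.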

Thank you very much for your time.

Regards!

Gustavo Carneiro

Apr 18, 2016, 3:25:36 PM
to cr0hn cr0hn, python-tulip
I don't think you need the threads.

1. If your tasks are I/O bound, coroutines are a safer way to do things, and probably even have better performance;

2. If your tasks are CPU bound, only multiple processes will help, multiple (Python) threads do not help at all.  Only in the special case where the CPU work is mostly done via a C library[*] do threads help.

I would recommend using multiple threads only when interacting with 3rd-party code that is I/O bound but not written with an asynchronous API, such as the requests library, selenium, etc.  But in that case, loop.run_in_executor() is probably a simpler solution.

[*] and a C API wrapped in such a way that it does a lot of work with few Python calls, plus it releases the GIL, so don't go thinking that a simple scalar math function call can take advantage of multithreading.
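For example, a minimal sketch of pushing a blocking call through the loop's default executor (the `blocking_fetch` helper is a stand-in for something like requests.get, not a real library call):

```python
import asyncio
import time

def blocking_fetch(url: str) -> str:
    # Stand-in for a blocking client call such as requests.get(url).
    time.sleep(0.1)
    return "response for " + url

async def main() -> list:
    loop = asyncio.get_running_loop()
    urls = ["http://a.example", "http://b.example", "http://c.example"]
    # None -> the loop's default ThreadPoolExecutor; the blocking calls
    # overlap in worker threads while the event loop stays responsive.
    futures = [loop.run_in_executor(None, blocking_fetch, u) for u in urls]
    return await asyncio.gather(*futures)

results = asyncio.run(main())
print(results[0])  # response for http://a.example
```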

--
Gustavo J. A. M. Carneiro
Gambit Research
"The universe is always one step beyond logic." -- Frank Herbert

Imran Geriskovan

Apr 18, 2016, 3:33:59 PM
to Gustavo Carneiro, cr0hn cr0hn, python-tulip
On 4/18/16, Gustavo Carneiro <gjcar...@gmail.com> wrote:
> I don't think you need the threads.
> 1. If your tasks are I/O bound, coroutines are a safer way to do things,
> and probably even have better performance;

Thread vs Coroutine context switching is an interesting topic.
Do you have any data for comparison?

Regards,
Imran

Tobias Oberstein

Apr 18, 2016, 4:04:42 PM
to Imran Geriskovan, Gustavo Carneiro, cr0hn cr0hn, python-tulip
My 2cts:

OS-native (= non-green) threads are an OS-scheduler-driven, preemptive
multitasking approach, necessarily with higher context-switching overhead
than a cooperative multitasking approach like the asyncio event loop.

E.g. context switching between threads involves saving and restoring the
whole CPU core register set. OS-native threads also involve bouncing
back and forth between kernel space and userspace.

Practical evidence: name one high-performance network server that uses
threads (and only threads), and not some event loop thing ;)

You want N threads/processes, where N is related to the number of cores
and/or the effective IO concurrency, _and_ each thread/process runs an
event loop. And because of the GIL, you want processes, not threads, on
(C)Python.

The effective IO concurrency depends on the number of IO queues your
hardware supports (the NICs or the storage devices). On an SMP system,
the IO queues should also have affinity to the nearest CPU core.

For network I/O, I once ran some experiments on how far Python can go. Here
is Python (PyPy) doing 630k HTTP requests/sec (12.6 GB/sec) using 40 cores:

https://github.com/crossbario/crossbarexamples/tree/master/benchmark/web

Note: that is Twisted, not asyncio, but the latter should behave the
same qualitatively.

Cheers,
/Tobias


Imran Geriskovan

Apr 18, 2016, 4:26:54 PM
to Tobias Oberstein, Gustavo Carneiro, cr0hn cr0hn, python-tulip
>>> I don't think you need the threads.
>>> 1. If your tasks are I/O bound, coroutines are a safer way to do things,
>>> and probably even have better performance;
>>
>> Thread vs Coroutine context switching is an interesting topic.
>> Do you have any data for comparison?

> My 2cts:
> OS native (= non-green) threads are an OS scheduler driven, preemptive
> multitasking approach, necessarily with context switching overhead that
> is higher than a cooperative multitasking approach like asyncio event loop.
> Note: that is Twisted, not asyncio, but the latter should behave the
> same qualitatively.
> /Tobias

Linux OS threads come with an 8MB stack per thread, plus the switching
costs you mentioned.

A) Python threads are not real threads. Python multiplexes "Python threads"
on a single OS thread. (Guido, can you correct me if I'm wrong,
and can you provide some info on the multiplexing/context switching of
"Python threads"?)

B) Whereas asyncio multiplexes coroutines on a single "Python thread"?

The question is "Which one is more effective?". The answer of course
depends on the use case.

However, as a heavy user of coroutines, I am beginning to think about going
back to "Python threads". Anyway, that's a personal choice.

Now let's clarify the advantages and disadvantages of A and B.

Regards,
Imran

Guido van Rossum

Apr 18, 2016, 6:54:29 PM
to Imran Geriskovan, Tobias Oberstein, Gustavo Carneiro, cr0hn cr0hn, python-tulip
On Mon, Apr 18, 2016 at 1:26 PM, Imran Geriskovan <imran.ge...@gmail.com> wrote:
A) Python threads are not real threads. It multiplexes "Python Threads"
on a single OS thread. (Guido, can you correct me if I'm wrong,
and can you provide some info on multiplexing/context switching of
"Python Threads"?)

Sorry, you are wrong. Python threads map 1:1 to OS threads. They are as real as threads come (the GIL notwithstanding).

--
--Guido van Rossum (python.org/~guido)

cr0hn

Apr 18, 2016, 7:34:06 PM
to python-tulip
Thank you for your responses.

The scenario (which I forgot in my first post): I'm trying to improve I/O access (disk/network...).

So, if a Python thread maps 1:1 to an OS thread, and the main problem (as I understood it) is the cost of context switching between threads/coroutines... this raises a new question for me:

If I only run a process with one thread (the default state), will the GIL switch context after the thread's ticks are spent? Or does it just run straight through until the program ends?

Thinking about that, I suppose that if the situation is 1 process <-> 1 thread, with no context switching, the best approach for high-performance network I/O is obviously to create coroutines rather than threads, right?

Am I wrong?
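A minimal sketch of that idea: one process, one thread, one event loop, many concurrent connections (the echo protocol and client count here are just illustrative):

```python
import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Echo one line back to the client, then close the connection.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main(n_clients: int = 50) -> int:
    # One event loop in one thread serves all clients concurrently.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i: int) -> bytes:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(f"ping {i}\n".encode())
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply

    replies = await asyncio.gather(*(client(i) for i in range(n_clients)))
    server.close()
    await server.wait_closed()
    return sum(r.startswith(b"ping") for r in replies)

echoed = asyncio.run(main())
print(echoed)  # 50: every client got its echo from a single-threaded server
```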


On April 19, 2016 at 0:54:28, Guido van Rossum (gu...@python.org) wrote:

---
Daniel García (cr0hn)
Security researcher and ethical hacker

Personal site: http://cr0hn.com
Twitter: @ggdaniel

Imran Geriskovan

Apr 19, 2016, 5:02:06 PM
to python...@googlegroups.com
>> A) Python threads are not real threads. It multiplexes "Python Threads"
>> on a single OS thread. (Guido, can you correct me if I'm wrong,
>> and can you provide some info on multiplexing/context switching of
>> "Python Threads"?)

> Sorry, you are wrong. Python threads map 1:1 to OS threads. They are as
> real as threads come (the GIL notwithstanding).

Ok then. Just to confirm, for CPython:
- Among these OS threads, only one thread can run at a time, due to the GIL.

A thread releases the GIL (thus allowing another thread to begin execution)
when waiting for blocking I/O. (http://www.dabeaz.com/python/GIL.pdf)
This is similar to what we do in asyncio with awaits.

Thus, multi-threaded I/O is the next best thing if we do not use asyncio.

Then the question is still this: which one is cheaper,
thread overheads or asyncio overheads?
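A rough micro-benchmark sketch of that question (numbers vary widely by machine and Python version, and this only measures spawn/schedule-and-join overhead, not realistic context-switch patterns):

```python
import asyncio
import threading
import time

N = 1000

def bench_threads() -> float:
    # Cost of spawning and joining N OS threads that do nothing.
    start = time.perf_counter()
    threads = [threading.Thread(target=lambda: None) for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

def bench_coroutines() -> float:
    # Cost of scheduling and awaiting N trivial coroutines on one loop.
    async def noop() -> None:
        pass

    async def run_all() -> None:
        await asyncio.gather(*(noop() for _ in range(N)))

    start = time.perf_counter()
    asyncio.run(run_all())
    return time.perf_counter() - start

t_threads = bench_threads()
t_coros = bench_coroutines()
print(f"{N} threads:    {t_threads:.4f}s")
print(f"{N} coroutines: {t_coros:.4f}s")
```

On typical machines the coroutine loop wins by a wide margin for this kind of fan-out, but the honest answer still depends on the workload.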

Tobias Oberstein

Apr 19, 2016, 5:39:28 PM
to Imran Geriskovan, python...@googlegroups.com
The overhead of cooperative multitasking is smaller, but for maximum
performance you need to combine it with preemptive multitasking,
because saturating modern hardware requires high IO concurrency.

(I am leaving out stuff like Linux AIO in this discussion)


Gustavo Carneiro

Apr 19, 2016, 5:52:00 PM
to Imran Geriskovan, python-tulip
IMHO, that is the wrong question to ask; it doesn't matter that much.  What matters most is which one is safer.  Threads appear deceptively simple... up to the point where you trigger a deadlock and your whole application freezes as a result.  That's because threads need lots and lots of locks everywhere.  Asyncio code may also need some locks, but only a fraction, because for a lot of things you can get away with no locking at all.  For example, imagine a simple statistics class like this:

class MeanStat:
    def __init__(self):
        self.num_values = 0
        self.sum_values = 0

    def add_sample(self, value):
        self.num_values += 1
        self.sum_values += value
       
    @property
    def mean(self):
        return self.sum_values/self.num_values if self.num_values > 0 else 0


The code above can be used as is in asyncio applications.  You can call MeanStat.add_sample() from multiple asyncio tasks at the same time without any locking and you know the MeanStat.mean property will always return a correct value.

However, if you try to do this in a threaded application without any locking, you will get incorrect results (and what is annoying is that you may not get incorrect results in development, but only in production!), because one thread may be reading MeanStat.mean while the sum/num_values expression ends up being calculated in the middle of another thread adding a sample:

    def add_sample(self, value):
        self.num_values += 1
              <<<<< switches to another thread here: num_values was updated, but sum_values was not!
        self.sum_values += value

The correct way to fix that code with threading is to add locks:

import threading

class MeanStat:
    def __init__(self):
        self.lock = threading.Lock()
        self.num_values = 0
        self.sum_values = 0

    def add_sample(self, value):
        with self.lock:
            self.num_values += 1
            self.sum_values += value
       
    @property
    def mean(self):
        with self.lock:
            return self.sum_values/self.num_values if self.num_values > 0 else 0

This is a very simple example, but it illustrates some of the problems with threading vs coroutines:

   1. With threads you need more locks, and the more locks you have: a) the lower the performance, and b) the greater the risk of introducing deadlocks;

   2. If you /forget/ that you need locks in some place (remember that most code is not as simple as this example), you get race conditions: code that /seems/ to work fine in development, but behaves strangely in production: strange values being computed, crashes, deadlocks.

So please keep in mind that things are not as black and white as "which is faster".  There are other things to consider.

Tobias Oberstein

Apr 19, 2016, 6:14:11 PM
to Gustavo Carneiro, Imran Geriskovan, python-tulip
Sorry, I should have been more explicit:

With Python (both CPython and PyPy), the least overhead / best
performance (throughput) approach to network servers is:

Use a multi-process architecture with shared listening ports (Linux
SO_REUSEPORT), with each process running an event loop (asyncio/Twisted).

I don't recommend using OS threads (of course) ;)
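A minimal sketch of the SO_REUSEPORT part (assumes Linux >= 3.9; each worker process would create such a socket and then hand it to `loop.create_server(..., sock=listener)`):

```python
import socket

def make_listener(port: int) -> socket.socket:
    # SO_REUSEPORT must be set before bind() on every socket sharing the
    # port; the kernel then load-balances incoming connections among them.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

# Two listeners on the same port, as two worker processes would create;
# each process would then run its own event loop on its own socket.
a = make_listener(0)                   # ephemeral port
b = make_listener(a.getsockname()[1])  # same port, second "process"
print(a.getsockname()[1] == b.getsockname()[1])  # True
```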


Imran Geriskovan

Apr 19, 2016, 7:00:08 PM
to python...@googlegroups.com
> 1. With threads you need more locks, and the more locks you have: a) the
> lower the performance, and b) the greater the risk of introducing
> deadlocks;
> So please keep in mind that things are not as black and white as "which is
> faster". There are other things to consider.

While handling mutually exclusive multithreaded I/O,
you don't need any locks. Aside from general advice,
my reasons for thinking about going back to threads are:

1) Awaits are viral. Async programming is kind of all_or_nothing:
you need all your I/O libraries to be async.

2) You can't use any blocking call anywhere in an async server.
If you do, your WHOLE server is dead in the water until that
blocking call returns. Do you think my design is faulty?
Then look at the SSH/TLS implementation of asyncio itself.
During the handshake, you are at the mercy of the openssh library.
Thus, it is impossible to build a medium- to high-load TLS server.
To do that safely and appropriately you would need an asyncio
implementation of openssh!

3) I appreciate the core idea of asyncio. However, it is not cheap.
It hardly justifies the whole new thing, when you could just drop
the "await"s, run it as multithreaded code, and preserve compatibility
with all the old libraries. If you haven't bought into the inverted
async patterns, you also preserve your chances of migrating
to any other classical language.

4) The major downside of the thread approach is memory consumption:
8MB per thread on Linux. Other than that, OS threads are cheap
on Linux. (Windows is another story.) If your use case can afford
it, why not use it?

Returning to the original subject of this message thread:
as cr...@cr0hn.com proposed, certain combinations of processes,
threads and coroutines definitely make sense.

Regards,
Imran

Imran Geriskovan

Apr 19, 2016, 7:03:14 PM
to python-tulip
> This is a very simple example, but it illustrates some of the problems with
> threading vs coroutines:
> 1. With threads you need more locks, and the more locks you have: a) the
> lower the performance, and b) the greater the risk of introducing
> deadlocks;
> So please keep in mind that things are not as black and white as "which is
> faster". There are other things to consider.


cr0hn cr0hn

Apr 25, 2016, 11:03:25 AM
to python-tulip
Thanks for your responses.

I uploaded my PoC code as a Gist, in case anyone would like to look at the code or send any improvements:


Regards,

Imran Geriskovan

Apr 25, 2016, 4:06:38 PM
to cr0hn cr0hn, python-tulip
On 4/25/16, cr0hn cr0hn <cr...@cr0hn.com> wrote:
> I uploaded as GIST my PoC code, if anyone would like to see the code or
> send any improvement:
> https://gist.github.com/cr0hn/e88dfb1fe8ed0fbddf49185f419db4d8
> Regards,

Thanks for the work.

>> 2) You cant use any blocking call anywhere in async server.
>> If you do, ALL your server is dead in the water till the return
>> of this blocking call. Do you think that my design is faulty?
>> Then look at the SSH/TLS implementation of asyncio itself.
>> During handshake, you are at the mercy of openssh library.
>> Thus, it is impossible to build medium to highload TLS server.
>> To do that safely and appropiately you need asyncio
>> implemenation of openssh!

It's openssl, not openssh... Sorry.