Long-Running processes in ThreadPool executor

40 views
Skip to first unread message

juko...@gmail.com

unread,
Mar 9, 2018, 6:22:17 PM3/9/18
to nameko-dev
I hope someone can help me....

So I have a process that marshals several co-routines via a concurrent.futures.ThreadPoolExecutor.

This works beautifully for quick processes.... but not all of my processes are quick.... and in fact one particular one (transcoding audio) is downright slooow.

And so I have fallen foul of https://github.com/nameko/nameko/issues/477 where the ThreadPool is blocking the heartbeats, which causes the worker to restart (and the whole thing starts again)


Now, I was able to switch the 'worker' side of this to the 'greened' Subprocess - so the actual Transcode shouldn't be blocking anymore.

`from eventlet.green import subprocess`

But I am pretty sure the issue is now with the ThreadPool.

I'm am a total newbie to Eventlet - and honestly didn't anticipate this kind of nuanced issue. So I'm not entirely sure how I can fix this. Does anyone have any thoughts?

I could try running it inside a tpool? or am I over-complicating it, and simply need to use the Futures in a diferent way to how I am right now?

```
                with concurrent.futures.ThreadPoolExecutor(max_workers=25) as executor:
                    x = '/net/' + platform.node() + temp_location
                    futures = [executor.submit(self.audio.transcode, source_location + src, x + tgt, **transcode)
                               for src, tgt in filemap]
                    for future in concurrent.futures.as_completed(futures):
                        future.result()  # Raises exceptions
```

*any* help, guidance, or pointers *greatly* appreciated :)

Geoff

juko...@gmail.com

unread,
Mar 9, 2018, 9:18:43 PM3/9/18
to nameko-dev
OK, I spent a long time really thinking on this, and checking patterns in the failures, and any non-nameko issues - and whilst I am sure I am falling foul of issue linked - it is not necessarily because of the threading/subprocess.

My problem appears to be I/O related, causing severe slowdowns at the OS level.

I thought I'd cracked it with the 'green subprocess, but when the issue continued (and after reassuring myself I'd done that bit right!) I started looking more fundamentally.

I happened to be running a (very crude) SaltStack command to monitor the disks on the workers, when I noticed it slowed down at times (just calculating disk space). So I logged into one of the 'slow' responders, and the shell was very very sluggish. I attempted to clear the `/tmp` folder, and it just hung out, for ages and ages.

I don't (yet) know if it's an issue with one or more hosts running the Nameko services; or if it's a more general issue with the hardware configuration (I have a cluster of identical blades running the services); or a network I/O issue - possible, but not probably.

My money is on crappy Disk IO on a couple of machines.

If anyone has tips on how I might track that down, I'm all ears (and eyes).

Many thanks,

Geoff

juko...@gmail.com

unread,
Mar 10, 2018, 12:14:48 AM3/10/18
to nameko-dev
I know I have a bad habit of replying to myself, but I figure it's polite to close the loop :)

I have confirmed that there is a disk IOPS issue. I used SaltStack to send the following command to all the workers at once (race!):

`fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 | grep iops`

The results (below) clearly show wildly variable IOPS performance.

So what looked like a Nameko/Kombu issue (those were the only errors being thrown) was actually more fundamental :)

Thanks for listening. 

```
worker-02-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=53605KB/s, iops=13401, runt= 58677msec
      write: io=1024.4MB, bw=17876KB/s, iops=4469, runt= 58677msec
worker-05-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=13852KB/s, iops=3463, runt=227064msec
      write: io=1024.4MB, bw=4619.5KB/s, iops=1154, runt=227064msec
worker-07-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=7588.3KB/s, iops=1897, runt=414510msec
      write: io=1024.4MB, bw=2530.6KB/s, iops=632, runt=414510msec
worker-06-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6205.5KB/s, iops=1551, runt=506908msec
      write: io=1024.4MB, bw=2069.3KB/s, iops=517, runt=506908msec
worker-03-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6203.6KB/s, iops=1550, runt=507030msec
      write: io=1024.4MB, bw=2068.8KB/s, iops=517, runt=507030msec
worker-08-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6049.2KB/s, iops=1512, runt=519969msec
      write: io=1024.4MB, bw=2017.3KB/s, iops=504, runt=519969msec
worker-04-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6021.4KB/s, iops=1505, runt=522370msec
      write: io=1024.4MB, bw=2007.2KB/s, iops=501, runt=522370msec
worker-01-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=3877.2KB/s, iops=969, runt=811256msec
      write: io=1024.4MB, bw=1292.1KB/s, iops=323, runt=811256msec
```


On Friday, March 9, 2018 at 3:22:17 PM UTC-8, juko...@gmail.com wrote:
Reply all
Reply to author
Forward
0 new messages