[Maya-Python] Optimise this: Subprocess IO


Marcus Ottosson

Apr 28, 2018, 1:36:09 AM
to python_in...@googlegroups.com

Hi all,

I’ve put together a snippet to profile latency and consistency of communicating with another process via stdin/stdout, and the results are rather disappointing.

Any idea of how to make this faster?

I’ve tried various values of bufsize, with and without universal_newlines, and various sizes of data. 1.4 MB/s is typically what I get, but what matters most is the time taken between sending and receiving a reply (in this case, around 10-12 ms).

Usage

$ python -u subprocess_profiling.py --size=100 --iterations=1000
...
# 1000 iterations
# 20000.00 kb total
# 20.00 kb/iteration
# 14.10 s (1418.00 kb/s)

If you have/install pygal, it’ll generate a nice plot for you as well.

Most interesting is how this runs quite a bit faster with Python 3.6:

# 1000 iterations
# 20098.00 kb total
# 20.10 kb/iteration
# 0.77 s (26249.37 kb/s)

That gives me confidence that this is a Python/snippet limitation rather than something in the OS. But I’m using it with Maya, so 2.7 is preferable.

Any ideas?

Best,
Marcus

Justin Israel

Apr 28, 2018, 5:18:21 AM
to python_in...@googlegroups.com
Hey Marcus,

There are a few things you can/should change in this test to fix your issue. First let me start with a link to a modified version of your script:

The results of comparing python2 and python3 against my modified version are pretty close now.

Here is a patch diff of the changes:

@@ -39,12 +39,14 @@
     popen = subprocess.Popen(

         # Open yourself, but run the below block
-        ["python", "-u", __file__],
+        # FIX: Don't use -u for unbuffered stdin/stdout
+        ["python", __file__],

         env=dict(os.environ, **{"CHILD": "1"}),
         stdin=subprocess.PIPE,
         stdout=subprocess.PIPE,
-        # bufsize=0,
+        # FIX: Use some form of buffering
+        bufsize=-1,
         universal_newlines=True,
     )

@@ -70,7 +72,12 @@

         # Write
         write0 = time.clock()
-        popen.stdin.write(data + "%d\n" % i)
+        # FIX: Don't allocate new huge strings to add suffix.
+        # Write to buffered stdin multiple times.
+        # Flushing is important to commit this immediately.
+        popen.stdin.write(data)
+        popen.stdin.write("%d\n" % i)
+        popen.stdin.flush()
         write1 = time.clock()

         # Read
@@ -152,6 +159,8 @@

     log.info("Ready for action..")
     sys.stdout.write("Ready\n")
+    # FIX: Flush to make sure parent sees this immediately
+    sys.stdout.flush()

     t0 = time.clock()
     while True:
@@ -159,8 +168,12 @@

         # Wait time
         t1 = time.clock()
-        duration = "-%.6f" % ((t1 - t0) * 1000)
-        sys.stdout.write(line + duration + "\n")
+        # FIX: Don't allocate huge new strings to add suffix.
+        # Write multiple times to buffered stdout and flush.
+        duration = "-%.6f\n" % ((t1 - t0) * 1000)
+        sys.stdout.write(line)
+        sys.stdout.write(duration)
+        sys.stdout.flush()
         t0 = time.clock()

     log.info("Dying..")

The summary of the changes: use buffering for your pipes, don't allocate extra temporary strings, and explicitly flush to make sure your "command" leaves the buffer and the reader sees it immediately.
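
Roughly, the pattern those changes add up to looks like this (a minimal self-contained sketch rather than your actual script; the child here is just an inline echo loop):

import subprocess

# Stand-in child: echo each line back, flushing so the parent sees it promptly.
CHILD = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)

popen = subprocess.Popen(
    ["python", "-c", CHILD],      # no -u: leave the child's stdio buffered
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=-1,                   # default (full) buffering on the parent's pipes
    universal_newlines=True,
)

data = "x" * (20 * 1024)

for i in range(1000):
    # Write the payload and the suffix separately rather than concatenating
    # them into one new large string, then flush to push the request through.
    popen.stdin.write(data)
    popen.stdin.write("%d\n" % i)
    popen.stdin.flush()
    reply = popen.stdout.readline()

popen.stdin.close()
popen.wait()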

Justin



Justin Israel

Apr 28, 2018, 7:20:43 AM
to python_in...@googlegroups.com
Additionally, this note from the py3 docs may account for the performance differences you were seeing:


"Changed in version 3.3.1: bufsize now defaults to -1 to enable buffering by default to match the behavior that most code expects. In versions prior to Python 3.2.4 and 3.3.1 it incorrectly defaulted to 0 which was unbuffered and allowed short reads. This was unintentional and did not match the behavior of Python 2 as most code expected."

The added explicit flushing makes sure each complete "request" is timed accurately. 

Justin 

Marcus Ottosson

Apr 28, 2018, 10:18:25 AM
to python_in...@googlegroups.com
That's genius; I never knew about bufsize=-1, and writing multiple times makes a lot of sense as well. Thanks, giving this a try next.

Beyond this, however, the Python 3 performance is also quite underwhelming. I'm expecting near-memory performance, i.e. gigabytes per second with sub-microsecond latency, considering there is no disk I/O involved.

I'm in control of both processes, one of them written in C++. Is there anything that can be done to reach those kinds of numbers? What is the bottleneck here, really?

I'm also looking at shared memory addresses via the multiprocessing module, but without understanding why stdin/stdout is this slow I have no reason to expect that to be any faster. 

Justin Israel

Apr 28, 2018, 11:09:22 PM
to python_in...@googlegroups.com
Generally speaking, you should be able to get much more performance over a file descriptor than what the most recent script is doing. But it depends on your specific goal and the kind of protocol you want to speak (the way you format your payload and perform the reading and writing). Your test is specifically measuring a request/response situation, where the parent sends a line, the child reads it, the child sends a reply and the parent reads it.

Have a play with this lower level variation of your original script:

This one is testing the reading/writing in a single direction, from parent to child process. I've just used a simpler fork instead of subprocess, and while I have created pipes in both directions, the test is only using a single direction. On my 5+ year old Mac laptop I am getting this kind of result:

$ python pipes_test.py -s 32 -i 1000
Total Size: 32036KB
 Send Size: 32KB
  Buf Size: 64KB
     Write: 560.678255 MB/s
      Read: 756.231229 MB/s (total: 32036KB)

$ python pipes_test.py -s 100 -i 1000
Total Size: 100036KB
 Send Size: 100KB
  Buf Size: 64KB
     Write: 588.818921 MB/s
      Read: 688.021778 MB/s (total: 100112KB)

$ python pipes_test.py -s 1000 -i 1000
Total Size: 1000036KB
 Send Size: 1000KB
  Buf Size: 64KB
     Write: 1318.013320 MB/s
      Read: 871.122543 MB/s (total: 1001129KB)
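
The core of that test is roughly this shape (a condensed sketch rather than the actual script, and POSIX-only since it uses os.fork; sizes are illustrative):

import os
import time

CHUNK = 64 * 1024                  # read buffer size
PAYLOAD = b"x" * (32 * 1024)       # one message
ITERATIONS = 1000

read_fd, write_fd = os.pipe()      # single direction: parent -> child

pid = os.fork()
if pid == 0:
    # Child: drain the pipe in fixed-size chunks until EOF.
    os.close(write_fd)
    while os.read(read_fd, CHUNK):
        pass
    os._exit(0)

# Parent: stream the payload as fast as the pipe allows.
os.close(read_fd)
t0 = time.time()
for _ in range(ITERATIONS):
    written = 0
    while written < len(PAYLOAD):
        # os.write may accept fewer bytes than asked for
        written += os.write(write_fd, PAYLOAD[written:])
os.close(write_fd)
os.waitpid(pid, 0)

elapsed = time.time() - t0
total_mb = ITERATIONS * len(PAYLOAD) / (1024.0 * 1024.0)
print("Write: %.2f MB/s" % (total_mb / elapsed))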

In this test, the readers are reading chunks of data instead of lines. I think line-based reading may end up being a bottleneck eventually if you are talking about theoretical max speeds. The buffer sizes and protocol really depend on what kind of pattern you want to apply. Is it request/response? Are the payloads huge or tiny? You could probably get more performance with a protocol such as framing, where the reader first sees how many bytes to expect in the payload (and/or what type of request to handle) and then knows to read exactly that many bytes.

Other than defining your own transfer protocol, something like gRPC, Thrift, or similar would give you that kind of structured data with protocols that transfer efficiently between sender and receiver. And sure, you could also explore an mmap solution to avoid pipes completely, if the data is huge and you can't get enough speed from pipes.

Justin




Marcus Ottosson

Apr 28, 2018, 11:11:19 PM
to python_in...@googlegroups.com

Ok, I’ve made an update to the gist with your changes.

Performance is considerably better..

        Throughput  Latency
Before  1.4 MB/s    14 ms
After   40.0 MB/s   0.002 ms

The main issue at this point is the variability of the latency, currently ranging from 0.002 to 0.200 ms.

That’s a 100x difference, at seemingly random intervals. If I could, I’d add a cache between reading and applying the data to smooth out the variation, but what I’m looking for is something capable of running per-frame in Maya, during playback. I’m close to concluding that stdin/stdout isn’t a very good candidate for this kind of usage.

Edit: Just noticed your post, will read through this next. :)



Marcus Ottosson

Apr 28, 2018, 11:21:32 PM
to python_in...@googlegroups.com

Have a play with this lower level variation of your original script:

I will, thanks. Those numbers are quite appealing. I won’t be able to fork, I think, as the child process is a separate executable, but hopefully the numbers carry across regardless.

As for the type of data, I have quite specific requirements:

  • Request/response pattern; think of the child process as an IK solver
  • ..running per-frame in Maya (currently via addTimeChangedCallback; a rough sketch of that hookup is just below this list), so no buffering (I think?)
  • Average data size, both in and out: 1-50 KB
  • Plain Python, i.e. nothing compiled, else I would probably use a third-party library like ZeroMQ or similar. The sole reason is simplified distribution, so it’s not impossible I’ll have to bite the bullet if this doesn’t pan out.
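
For context, the per-frame hookup is roughly this shape (a sketch using cmds.scriptJob rather than whatever wrapper addTimeChangedCallback comes from; the solver request itself is only stubbed out):

from maya import cmds

def request_solve():
    # Placeholder: serialise the current frame's inputs, write them to the
    # child's stdin, flush, then block on stdout for the reply.
    frame = cmds.currentTime(query=True)
    # popen.stdin.write(serialise(frame)); popen.stdin.flush()
    # reply = popen.stdout.read(expected_size)

# Fire a request whenever the current time changes (scrubbing or playback).
job_id = cmds.scriptJob(event=["timeChanged", request_solve])

# And when tearing down:
# cmds.scriptJob(kill=job_id, force=True)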

Testing your example now.



Marcus Ottosson

Apr 29, 2018, 11:01:44 PM
to python_in...@googlegroups.com

I can confirm those numbers; these are from the same machine as the results I’ve posted so far.

Total Size: 3203KB
 Send Size: 32KB
  Buf Size: 64KB
     Write: 2258.865412 MB/s
      Read: 2108.172908 MB/s (total: 3203KB)

I wasn’t able to use fork, both because it seems exclusive to Unix and, more importantly, because the child is an external process (e.g. some.exe), so I didn’t venture into the multiprocessing module, which seems to implement the equivalent functionality cross-platform.

I assume os.pipe is the equivalent of passing subprocess.PIPE to Popen, in which case what remains is passing fixed-size blocks of data, as opposed to lines. Before having tested it, it makes sense that this would be faster: for Python to know when to stop reading a line, I’d expect it to have to inspect each character for whether it is a \n or not, whereas reading a fixed-size chunk of bytes wouldn’t need to. Determining the size of the data coming from both parent and child shouldn’t be a problem, other than the overhead of actually having to compute it.

Ok, replacing os.fork and os.pipe with subprocess.Popen and subprocess.PIPE, and only writing to the child, yields faster results.

Read: 209.440087 MB/s
Write: 494.539 MB/s

Replacing readline with read(maxsize) yields faster results, as expected..

Read: 1,110.881 MB/s
Write: 1,076.158 MB/s

However, something occurred to me as I was doing it this way.

In both of our examples, we write to a child without waiting for it to reply. I expect the time spent between parent-writing-to-child and child-writing-to-parent to be the real performance hog.

In the initial graph, I included the time the child spent waiting for new input, and it’s quite significant. I suspect that for every request, the OS is given an opportunity to task-switch and lazily re-introduce the parent and child processes whenever it sees fit, which could explain the performance bottleneck and the variation in time taken per roundtrip.

Ok, so what about waiting for a response with read(maxsize), where maxsize corresponds to the exact size the parent expects back from the child? E.g. a full frame of positions for the IK solve, as a serialised JSON dictionary.

$ python subprocess_profile.py --iterations=10000 --size=100 --plot
                   Read: 1,929.443 MB/s
                  Write: 1,668.109 MB/s
             Total time: 1.179 s
             Write Time: |- 0.585 s
              Wait Time: |- 0.594 s
             Wait/Write: |- 1 / 1.01
             Iterations: 10,000
 Average roundtrip time: 0.118 ms/request
     Min roundtrip time: 0.049 ms/request
     Max roundtrip time: 0.748 ms/request
   Delta roundtrip time: 0.699 ms
                   Size: 102,400 B
            Buffer size: 32,768 B
                  Peaks: 26

Not bad!
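
For reference, the roundtrip being timed is essentially this (a trimmed-down sketch; the child here is a stand-in that echoes back exactly the expected number of bytes, not the real solver):

import subprocess
import time

SIZE = 100 * 1024                     # exact payload size, known to both ends

# Stand-in child: echo back exactly SIZE bytes per request.
CHILD = (
    "import sys\n"
    "size = %d\n"
    "while True:\n"
    "    data = sys.stdin.read(size)\n"
    "    if not data:\n"
    "        break\n"
    "    sys.stdout.write(data)\n"
    "    sys.stdout.flush()\n"
) % SIZE

popen = subprocess.Popen(
    ["python", "-u", "-c", CHILD],    # -u on the child, as described above
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=-1,
)

payload = b"x" * SIZE
t0 = time.time()
popen.stdin.write(payload)
popen.stdin.flush()
reply = popen.stdout.read(SIZE)       # read exactly what we expect back
roundtrip_ms = (time.time() - t0) * 1000
print("roundtrip: %.3f ms (%d bytes)" % (roundtrip_ms, len(reply)))

popen.stdin.close()
popen.wait()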

After a few trials, roundtrip time maxes out at about 0.8 ms, which is more than enough for my use case (e.g. real-time playback at 60 fps, incl. actually computing the data). But most impressive is how rarely these peaks occur now! I noticed that if I run the child buffered - i.e. without the -u flag - I get a lot more variability in latency, and much lower transfer rates.

$ python subprocess_profile.py --iterations=10000 --size=100 --buffered --plot
                   Read: 479.789 MB/s
                  Write: 500.890 MB/s
             Total time: 4.557 s
             Write Time: |- 1.950 s
              Wait Time: |- 2.607 s
             Wait/Write: |- 1 / 1.34
             Iterations: 10,000
 Average roundtrip time: 0.391 ms/request
     Min roundtrip time: 0.313 ms/request
     Max roundtrip time: 1.870 ms/request
   Delta roundtrip time: 1.557 ms
                   Size: 102,400 B
            Buffer size: 65,536 B
                  Peaks: 32

This is almost solved; the only remaining factor is reducing, ideally eliminating, these “peaks” or “outliers”. Consistent playback during animation is very important. For that I’m thinking that, because the peaks are so vastly different from the norm, I’ll make each request with a timeout of only 0.1 ms. If no reply is returned in time, I’ll make another request, for which the odds of a result coming back within that time are quite high. At the very least it may help round out the peaks, though I’m not sure that would actually work..
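
What I have in mind is roughly this (only a sketch: select() accepts pipe file descriptors on Linux/macOS but only sockets on Windows, so it wouldn’t carry over there as-is):

import os
import select

def poll_reply(fd, max_size, timeout=0.0001):
    # Return whatever bytes arrive on file descriptor `fd` within `timeout`
    # seconds (possibly fewer than max_size), or None if nothing arrived.
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None
    return os.read(fd, max_size)

# Usage sketch, assuming replies are read straight from the descriptor
# rather than through popen.stdout's buffered file object:
# chunk = poll_reply(popen.stdout.fileno(), expected_size)
# if chunk is None:
#     pass  # outlier: reuse the previous frame's result instead of stalling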

To sum it up, here’s the start till end of this thread.

Day       Throughput   Latency   Delta
Saturday  1.4 MB/s     14 ms     10-100x
Sunday    40.0 MB/s    0.002 ms  100x
Monday    2000.0 MB/s  0.5 ms    5x

Do look through this now-shortened version of the test. I wouldn’t be surprised if I’ve got some multiplication wrong and am really looking at KB rather than MB, or seconds rather than milliseconds..

Best,
Marcus

Justin Israel

Apr 30, 2018, 12:22:36 AM
to python_in...@googlegroups.com
Nice write-up, and good progress!

I think you may want to switch from time.clock() to time.time(). I was getting less accurate readings under Linux, so time.time() might be the better way to go for timing arbitrary start/stop spans.
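
If it helps, timeit already picks a sensible clock per platform (time.clock on Windows and time.time elsewhere under Python 2; time.perf_counter under Python 3.3+):

import time
from timeit import default_timer as timer

t0 = timer()
time.sleep(0.01)          # stand-in for the write/read span being measured
print("elapsed: %.6f s" % (timer() - t0))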

It totally makes sense that you would need to switch back to subprocess for your specific goal. You are correct that subprocess.PIPE is just telling subprocess to create pipes with your specified bufsize under the hood and assign them to the stdin/stdout/stderr of your Popen object. 

And just a clarification about the max read size. For the test right now you are reading a fixed max size each time, which makes it easy for the read to return after every write. But for your real-world application, as I mentioned in an earlier reply, maybe you can do something where the read is always first sitting on a fixed header size (say 20 bytes, for example). This header contains information about the payload size, and possibly a request type if you need to conditionally handle different payloads. Then you do a second read of exactly that many payload bytes, handle it, and go back to waiting for another request header.
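
As a sketch of that idea (using an 8-byte header here, 4 bytes of request type plus 4 bytes of payload length, and assuming binary-mode pipes, i.e. no universal_newlines):

import struct

# Fixed 8-byte header: 4-byte request type + 4-byte payload length.
HEADER = struct.Struct(">II")

def send_message(pipe, request_type, payload):
    pipe.write(HEADER.pack(request_type, len(payload)))
    pipe.write(payload)
    pipe.flush()

def read_exactly(pipe, size):
    # Loop until `size` bytes have been read; a single read may return less.
    chunks = []
    remaining = size
    while remaining:
        chunk = pipe.read(remaining)
        if not chunk:
            raise EOFError("pipe closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_message(pipe):
    request_type, length = HEADER.unpack(read_exactly(pipe, HEADER.size))
    return request_type, read_exactly(pipe, length)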

For the latency issues, to be honest I haven't really had to use Python in such a latency-sensitive manner. You may want to do some googling about using Python for (soft) realtime applications. This isn't really a hard realtime problem, but the search turns up relevant information. A quick search shows people discussing how Python's garbage collector kicking in can be one issue, among others. So maybe the first course of action is to allocate a buffer for your message ahead of time and reuse it, instead of allocating in your request loops. You are right that your processes could be preempted, and that you might see different latency between Windows/macOS/Linux.
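
For instance, something along these lines around the latency-critical window (only a sketch; whether it actually removes the outliers would need measuring):

import gc

# Pause the cyclic garbage collector for the duration of playback and
# collect at a safe point afterwards, rather than at arbitrary moments.
gc.disable()
try:
    buf = bytearray(50 * 1024)       # preallocated once, reused every frame
    view = memoryview(buf)
    # Per frame: read the reply into the existing buffer instead of
    # allocating a fresh string each iteration, e.g.
    # n = popen.stdout.readinto(view[:expected_size])
finally:
    gc.enable()
    gc.collect()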


Marcus Ottosson

Apr 30, 2018, 12:41:27 AM
to python_in...@googlegroups.com

But for your real world application, as I mentioned in an earlier reply, maybe you can do something where the read is always first sitting on a fixed header size (say 20 bytes for example).

Yep, got that and makes sense. Was going to implement that here as well, but decided not to.

For the latency issues, to be honest I haven’t really had to use Python in such a latency sensitive manner.

And it’s probably not the best use for it. I’m hoping that if this works it’d be a net gain on the whole, because there are savings to be had elsewhere if it does. The garbage collector is a good point; I think I saw some way of turning it off (temporarily), which might help pinpoint where some of the fluctuations in latency are coming from in the real thing.

Applying this now; the bottleneck at the moment is the C++ side of things (where I can safely hand things off to the next person :)

Thanks Justin!
