Hi all,
I’ve put together a snippet to profile the latency and consistency of communicating with another process via stdin/stdout, and the results are rather disappointing.
Any idea how to make this faster?
I’ve tried various values of bufsize, with and without universal_newlines, and various sizes of data. 1.4 MB/s is typically what I get, but what matters most is the time taken between sending and receiving a reply (in this case, around 10-12 ms).
Usage
$ python -u subprocess_profiling.py --size=100 --iterations=1000
...
# 1000 iterations
# 20000.00 kb total
# 20.00 kb/iteration
# 14.10 s (1418.00 kb/s)
If you have/install pygal, it’ll generate a nice plot for you as well.

Most interesting is how this runs quite a bit faster with Python 3.6..
# 1000 iterations
# 20098.00 kb total
# 20.10 kb/iteration
# 0.77 s (26249.37 kb/s)
That gives me confidence that this is a Python/snippet limitation rather than something in the OS. But I’m using it with Maya, so 2.7 is preferable.
Any ideas?
Best,
Marcus
```diff
@@ -39,12 +39,14 @@
     popen = subprocess.Popen(
         # Open yourself, but run the below block
-        ["python", "-u", __file__],
+        # FIX: Don't use -u for unbuffered stdin/stdout
+        ["python", __file__],
         env=dict(os.environ, **{"CHILD": "1"}),
         stdin=subprocess.PIPE,
         stdout=subprocess.PIPE,
-        # bufsize=0,
+        # FIX: Use some form of buffering
+        bufsize=-1,
         universal_newlines=True,
     )
@@ -70,7 +72,12 @@
     # Write
     write0 = time.clock()
-    popen.stdin.write(data + "%d\n" % i)
+    # FIX: Don't allocate new huge strings to add suffix.
+    # Write to buffered stdin multiple times.
+    # Flushing is important to commit this immediately.
+    popen.stdin.write(data)
+    popen.stdin.write("%d\n" % i)
+    popen.stdin.flush()
     write1 = time.clock()
     # Read
@@ -152,6 +159,8 @@
     log.info("Ready for action..")
     sys.stdout.write("Ready\n")
+    # FIX: Flush to make sure parent sees this immediately
+    sys.stdout.flush()
     t0 = time.clock()
     while True:
@@ -159,8 +168,12 @@
         # Wait time
         t1 = time.clock()
-        duration = "-%.6f" % ((t1 - t0) * 1000)
-        sys.stdout.write(line + duration + "\n")
+        # FIX: Don't allocate huge new strings to add suffix.
+        # Write multiple times to buffered stdout and flush.
+        duration = "-%.6f\n" % ((t1 - t0) * 1000)
+        sys.stdout.write(line)
+        sys.stdout.write(duration)
+        sys.stdout.flush()
         t0 = time.clock()
     log.info("Dying..")
```
$ python pipes_test.py -s 32 -i 1000
Total Size: 32036KB
Send Size: 32KB
Buf Size: 64KB
Write: 560.678255 MB/s
Read: 756.231229 MB/s (total: 32036KB)
$ python pipes_test.py -s 100 -i 1000
Total Size: 100036KB
Send Size: 100KB
Buf Size: 64KB
Write: 588.818921 MB/s
Read: 688.021778 MB/s (total: 100112KB)
$ python pipes_test.py -s 1000 -i 1000
Total Size: 1000036KB
Send Size: 1000KB
Buf Size: 64KB
Write: 1318.013320 MB/s
Read: 871.122543 MB/s (total: 1001129KB)
Ok, I’ve made an update to the gist with your changes.
Performance is considerably better..
|        | Throughput | Latency  |
|--------|------------|----------|
| Before | 1.4 MB/s   | 14 ms    |
| After  | 40.0 MB/s  | 0.002 ms |
The main issue at this point is the variability of the latency, currently ranging from 0.002 to 0.200 ms.

That’s a 100x difference, at seemingly random intervals. If I could, I’d add a cache between reading and applying the data to smooth out the variability, but what I’m looking for is something capable of running per-frame in Maya, during playback. I’m close to concluding that stdin/stdout isn’t a very good candidate for this kind of usage.
Edit: Just noticed your post, will read through this next. :)
Have a play with this lower-level variation of your original script:
I will, thanks. Those numbers are quite appealing. I won’t be able to fork, I think, as the child process is a separate executable, but hopefully the numbers carry across regardless.
As for the type of data, I have quite specific requirements; it needs to arrive per frame (via Maya’s addTimeChangedCallback), so no buffering (I think?). Testing your example now.
Can confirm those numbers; these are from the same machine as the results I’ve posted so far.
Total Size: 3203KB
Send Size: 32KB
Buf Size: 64KB
Write: 2258.865412 MB/s
Read: 2108.172908 MB/s (total: 3203KB)
I wasn’t able to use fork, both because it seems exclusive to Unix and, more importantly, because the child is an external process (e.g. some.exe) - so I didn’t venture into the multiprocessing module, which seems to implement the equivalent functionality cross-platform.
I assume os.pipe is the equivalent of passing subprocess.PIPE to Popen, in which case what remains is passing fixed-size blocks of data, as opposed to lines. Before having tested it, it makes sense that this would be faster: for Python to know when to stop reading a line, I’d expect it to inspect every character for whether it is a \n or not, whereas reading a fixed-size chunk of bytes wouldn’t do that. Determining the size of the data coming from both parent and child shouldn’t be a problem, other than the overhead of actually computing it.
Ok, replacing os.fork and os.pipe with subprocess.Popen and subprocess.PIPE, and only writing to the child, yields faster results.
Read: 209.440087 MB/s
Write: 494.539 MB/s
Replacing readline with read(maxsize) yields faster results, as expected..
Read: 1,110.881 MB/s
Write: 1,076.158 MB/s
However, something occurred to me as I was doing it this way.
In both of our examples, we write to a child without waiting for it to reply. I expect the time spent between parent-writing-to-child and child-writing-to-parent to be the real performance hog.
In the initial graph, I included the time the child spent waiting for new input, and it’s quite significant. I suspect that on every request the OS is given an opportunity to task-switch, and it lazily reschedules the parent and child processes whenever it sees fit; which could explain the performance bottleneck and the variability in time taken per roundtrip.
Ok, so what about waiting for a response with read(maxsize), where maxsize corresponds to the exact size of the reply the parent expects from the child? E.g. a full frame of positions for the IK solve, as a serialised JSON dictionary.
$ python subprocess_profile.py --iterations=10000 --size=100 --plot
Read: 1,929.443 MB/s
Write: 1,668.109 MB/s
Total time: 1.179 s
Write Time: |- 0.585 s
Wait Time: |- 0.594 s
Wait/Write: |- 1 / 1.01
Iterations: 10,000
Average roundtrip time: 0.118 ms/request
Min roundtrip time: 0.049 ms/request
Max roundtrip time: 0.748 ms/request
Delta roundtrip time: 0.699 ms
Size: 102,400 B
Buffer size: 32,768 B
Peaks: 26

Not bad!
After a few trials, roundtrip time maxes out at about 0.8 ms, which is more than enough for my use case (e.g. real-time playback at 60 fps, incl. actually computing the data). But most impressive is how rarely these peaks occur now! I noticed that if I run the child buffered - i.e. without the -u flag - I get a lot more variability in latency, and much lower transfer rates.
$ python subprocess_profile.py --iterations=10000 --size=100 --buffered --plot
Read: 479.789 MB/s
Write: 500.890 MB/s
Total time: 4.557 s
Write Time: |- 1.950 s
Wait Time: |- 2.607 s
Wait/Write: |- 1 / 1.34
Iterations: 10,000
Average roundtrip time: 0.391 ms/request
Min roundtrip time: 0.313 ms/request
Max roundtrip time: 1.870 ms/request
Delta roundtrip time: 1.557 ms
Size: 102,400 B
Buffer size: 65,536 B
Peaks: 32

This is almost solved; the only remaining factor is reducing, ideally eliminating, these “peaks” or “outliers”. Consistent playback during animation is very important. Since the peaks are so vastly different from the norm, I’m thinking I’ll try making requests with a timeout, waiting only 0.1 ms. If no reply is returned in time, I’ll make another request, for which the odds of a result coming back within that window are quite high. At the very least it may help round out the peaks. Though I’m not sure that would actually work..
To sum it up, here’s the start till end of this thread.
| Day | Throughput | Latency | Delta |
|---|---|---|---|
| Saturday | 1.4 MB/s | 14 ms | 10-100x |
| Sunday | 40.0 MB/s | 0.002 ms | 100x |
| Monday | 2000.0 MB/s | 0.5 ms | 5x |
Do look through this now-shortened version of the test. I wouldn’t be surprised if I’ve got some multiplication wrong and am really looking at KB rather than MB, or seconds rather than milliseconds..
Best,
Marcus
But for your real world application, as I mentioned in an earlier reply, maybe you can do something where the read is always first sitting on a fixed header size (say 20 bytes for example).
Yep, got that and makes sense. Was going to implement that here as well, but decided not to.
For the latency issues, to be honest I haven’t really had to use Python in such a latency sensitive manner.
And it’s probably not the best use for it. I’m hoping that if this works it’d be a net gain on the whole, because there are savings to be had elsewhere if it does. The garbage collector is a good point; I think I saw some way of turning it off (temporarily), which might help pinpoint where some of the fluctuations in latency are coming from in the real thing.
Applying this now; the bottleneck at the moment is the C++ side of things (where I can safely hand things off to the next person :)
Thanks Justin!