Fastest Way to Stream in Frames


DaAwesomeP

Oct 13, 2022, 2:36:02 PM
to open-lighting
Hello!

In my current application I am trying to hit at least 60fps. I save all of the DMX data of 24 universes into 24 different ola::DmxBuffer instances. This task has basically no overhead.

I then loop through the buffers and run mOLAClient->SendDmx(universe, mBuffers[i]). This loop takes approximately 36ms. Adding the 6ms per frame that my application itself takes (which I think is a reasonably small amount of time), this totals 42ms per frame, or about 24 fps. That is nowhere near the standard DMX refresh rate of ~44 fps.

Some more rough numbers:
  • Sending 1 universe has a 1ms to 3ms performance hit
  • Sending 2 universes has a 3ms to 9ms performance hit
  • Sending 3 universes has a 4ms to 13ms performance hit
Am I doing something wrong?

Should I spawn a thread for each universe to send to OLA? Can OLA handle this (i.e. does it spawn a thread as well)?

Is the holdup my application doing this synchronously, or OLA doing this synchronously? Is the time to complete SendDmx the time for OLA to receive the data into a queue (which should be fast), or the time for OLA to write the DMX out to all other places (i.e. sACN, DMX, etc.)?

Thanks!
Perry

E.S. Rosenberg

Oct 13, 2022, 4:38:41 PM
to open-l...@googlegroups.com
Hey Perry,

Can you provide more information?
Through what plugin are you trying to send the output?
Spawning more threads than your CPU can actually run would probably do more harm than good; as far as I recall, OLA spawns threads as needed for each plugin etc.

If you want to hit 60fps, did you also reduce the amount of data that you are sending in each frame? And is the receiving equipment programmed properly (i.e. does it handle shorter frames/higher refresh rates properly)?

HTH,
Eli

On Thu, Oct 13, 2022 at 21:36, DaAwesomeP <p2kwo...@live.com> wrote:

Peter Newman

Oct 13, 2022, 4:56:10 PM
to open-lighting
I think it's the time to receive it into a queue, but Simon would know better; I'd have to dig through lots of code to confirm.

You could also unpatch everything else to see if it makes a difference (which ought to answer your question).

We've also got these for testing:

If you search for latency and throughput, there's some previous discussion on similar stuff.

DaAwesomeP

Oct 14, 2022, 9:59:35 AM
to open-lighting
Hi all,

Thanks for the responses! Here is some more info:
  • The universes vary in size, but each has around 130 to 150 channels in use. The largest uses 171 channels. The channels start at 1 and increment sequentially, so OLA should be able to send smaller packets and cut off the trailing zeros if it is optimized to do so. I first black out the DMX buffer once and then only update the channels in the buffer that change before sending the DMX.
  • I unpatched all of the universes, and I still see around 30ms to 40ms of overhead. If I patch 12 universes to sACN (broadcast, so no reliance on devices) with the second 12 nonexistent, then I see no change in overhead. No change with all 24 universes patched to sACN as well.
  • Since there does not appear to be a change in overhead based on the outputs patched, it sounds like this is purely the time to queue the data into OLA (which hopefully can be improved!). I am starting to think that the slowdown is related to opening and closing a socket each time. Is there a way to send all 24 universes in one go?
  • I will take a look at the API source and confirm that claim above about the socket.
  • I will eventually be using this with KiNETv2 (unicast). I am very certain that the devices that I will use can achieve 60fps. I think it is at least reasonable to be able to get DMX frame rates (44fps) from OLA.
  • Yeah, a thread for each universe is maybe a bad idea, but a thread pool of, say, 2 or 3 threads to split the loop of 24 into chunks of 12 or 8 could work. Though I am thinking the better solution is to optimize SendDMX to work better for many universes.
I am going to modify one of the tests that Peter linked to try to replicate my findings outside of my project. I think I will have it oscillate the first 160 channels of multiple universes between 0 and 255. I'll do it in my GitHub fork and then hopefully y'all can try it out as well!

Thanks again!
Perry

DaAwesomeP

Oct 14, 2022, 10:36:54 AM
to open-lighting
This is purely speculation from a brief look at the sources, without a complete understanding, but:
  1. StreamingClient::SendDMX() calls StreamingClient::Send() (source)
  2. StreamingClient::Send() calls SelectServer::RunOnce() (source)
  3. SelectServer::RunOnce() calls SelectServer::CheckForEvents() (source)
  4. SelectServer::CheckForEvents() appears to run an event loop (source)
A single socket seems to stay open, so I was wrong about that guess.

I don't have an understanding of exactly what this event loop does. Since it runs the event loop on each call to SendDMX(), is it treating each universe as a complete frame? Is there a method anywhere in OLA to stream multiple universes in one frame? Would multiple calls to StreamDmxData() in a loop cause a race, or would it effectively send all of the universes in one frame (source)?

What do you think?

Perry

DaAwesomeP

Oct 14, 2022, 11:45:21 AM
to open-lighting
OK! I was able to reproduce it in a fresh VM with a fresh build of ola/master using my multi-universe throughput test. Please give it a try and let me know your results: https://github.com/DaAwesomeP/ola/blob/DaAwesomeP-StreamingClientThroughput/examples/ola-throughput-multi.cpp I added it to the Makefile in that branch, so a simple make all should build it.

When I run ./examples/ola_throughput_multi -u 24 I see a frame time of 34ms with no universes configured in OLA. My VM is KVM and has 4 threads available. I configured OLA with ./configure --disable-python-libs --disable-doxygen-doc and ran make all with no issues.

This test does not change any data. All of the frames are blackout frames, so this is purely testing the throughput of sending universes with the smallest data length.

Perry

DaAwesomeP

Oct 14, 2022, 1:55:35 PM
to open-lighting
OK, commenting out m_ss->RunOnce() definitely solves the time delay. The overhead drops to 0ms. With a crazy 240 universes I see a couple ms of latency, but that's it. 24 universes has 0ms latency.

I added an option to my new example called --oscillate-data that actually changes channels (the entire universe) between full on and off (same file in my fork linked in previous message). I am getting good output from KiNETv1. When I run with m_ss->RunOnce(), the strobing on my KiNET output is not uniform/consistent at the example's default 40ms loop delay. When I comment out m_ss->RunOnce(), then it becomes very smooth and consistent. As low as 18ms looks consistent to the eye!

Looking at the comments in that file, it seems that m_ss->RunOnce() was added due to some race condition. Maybe Simon could shed some light on that? Or maybe instead of eliminating this check we could find a way to send multiple universes after checking the event loop only once?

OK, apologies for the message spam. This is probably my last finding until someone else responds :D

Perry

mic...@cubic.org

Oct 14, 2022, 3:10:33 PM
to open-l...@googlegroups.com
Can you calculate the data for the next frame while the last frame is transmitted?

13.10.2022 20:36:19 DaAwesomeP <p2kwo...@live.com>:


DaAwesomeP

Oct 14, 2022, 4:32:28 PM
to open-lighting
> Can you calculate the data for the next frame while the last frame is transmitted?

I could put all of the OLA API calls in their own single thread, but it would still take 30ms to 35ms to send out 24 universes, while my program itself takes only 6ms per frame. Onscreen my program would speed up, but it would not noticeably speed up how fast data is going to the lights. The reason I need this high FPS (at least DMX speed) is that slow fades currently don't look good at all (lots of stepping) when I am sending all 24 universes.

I don't entirely understand what m_ss->RunOnce() is doing, but it seems to be forcing OLA to finish processing and flush out all of the data queued since the last StreamingClient::SendDMX(). Since I am sending 24 universes at a time (with the intention of those forming one frame), I only want to force a flush after every 24 StreamingClient::SendDMX() calls. I'm not very concerned about exactly which of those 24 universes everything gets pushed after (at 44fps to 60fps this should not be noticeable), but it definitely can't take 30ms for all 24 universes.

I discovered that the wait caused by m_ss->RunOnce() is due entirely to that call itself (and not to actually sending out packets), as I saw a dramatic improvement by commenting it out when sending to a real device. The wait does not appear to be necessary for OLA to send packets quickly; rather, there is some situation (a race) that the call is preventing (as mentioned in the code comments), but I do not know why.

Perry

mic...@cubic.org

Oct 14, 2022, 4:57:05 PM
to open-l...@googlegroups.com
The grandMA2 also sends DMX at only 33fps. It is important that the DMX output is done using very precise scheduling. Do not use sleep-for; use sleep-until with soft real-time scheduling. Make sure that the calculation does not introduce jitter in your DMX output.
Do not base your calculation on the real time; instead use a virtual time that corresponds to your frame interval. Otherwise, jitter in when your calculation is started can introduce jitter into your output. You calculate data for the future, more precisely for the moment when you send out your DMX data.

14.10.2022 22:32:40 DaAwesomeP <p2kwo...@live.com>:

DaAwesomeP

Oct 14, 2022, 5:28:23 PM
to open-lighting
I agree, but I'm not sure how that relates to the issue at hand. Have you taken a look at the sources that I linked to? This is purely an issue with the function that gets data into OLA, not with the output regulation. I have not added any sleeps anywhere.

Also, does OLA currently have any sort of input rate regulation implemented at all? My past experience with OLA makes me think not: this regulation is defined per output type (i.e. UART DMX, sACN, etc.) rather than at the C++ API data input. That would make sense, since many protocols operate at vastly different speeds.

The issue is that with the current implementation of StreamingClient::SendDMX(), it does not seem possible to queue more than one universe's worth of data at a time. Yes, you should load the data in and then separately send it out on a proper interval, but the issue is with loading that data in, so I'm not sure what you are referring to. I have not touched or read any of the output/rate-regulating code that manages the data once OLA has it.

Perry

DaAwesomeP

Oct 14, 2022, 5:32:39 PM
to open-lighting
I made another discovery: the "advanced" API version of this call in OlaClientCore does not implement the delay-inducing check found in StreamingClient. I'm confused as to why the race would exist in StreamingClient but not in OlaClientCore. They both call m_stub->StreamDmxData() (assuming callback arguments are turned off) in the same way.


Perry

DaAwesomeP

Oct 14, 2022, 5:33:10 PM
to open-lighting
I think I will translate this all into a GitHub issue at this point since there is a clear inconsistency between those APIs.

Perry

DaAwesomeP

Oct 14, 2022, 7:26:58 PM
to open-lighting

Simon Newton

Oct 18, 2022, 9:35:01 AM
to open-l...@googlegroups.com
What hardware is this on?

As you've discovered, the streaming client calls ss->RunOnce(), which under the hood calls select() / epoll() with what should be a 0s timeout. If that's taking 1ms, something doesn't sound right.

If you use the non-streaming client (OlaClient) and set a null callback then you get the same streaming behavior but without the call to ss->RunOnce(). This is because if you're using the OlaClient, then the assumption is you've created your own SelectServer (or used the Wrapper to do so).

If you can provide strace output with timing information that would help narrow down the problem.


Simon
 

DaAwesomeP

Oct 18, 2022, 12:39:34 PM
to open-lighting
Hi Simon,

Thanks for following up!


> What hardware is this on?

Debian x64. I tested with no universes added in web UI and with 24 universes added with sACN broadcast output. I also tested with the first universe set to KiNETv1 output. I tried directly on hardware on an 8-thread machine (Xeon W-2123) as well as in a 2-thread KVM machine. None of those permutations seemed to affect the performance.

> As you've discovered, the streaming client calls ss->RunOnce(), which under the hood calls select() / epoll() which what should be a 0s timeout. If that's taking 1ms something doesn't sound right.

I found that ss->RunOnce() takes 0ms for one universe, but it starts to take longer when multiple universes are sent sequentially in a for loop. Using the test that I made, I was able to reproduce it on multiple Debian machines: https://github.com/DaAwesomeP/ola/blob/2140930eeea4d08d62cea937852dcb1c9b69c8b6/examples/ola-throughput-multi.cpp


> If you can provide strace output with timing information that would help narrow down the problem.

I'll try to get a strace output, but please feel free to give it a try with the program above!

Thanks!
Perry

DaAwesomeP

Oct 28, 2022, 12:14:45 PM
to open-lighting
Hi Simon,

OK, I have some strace results! Please see the attached archive and pasted excerpts below!

The commands used to generate the traces are in the shell script in the archive. I did this with 0 universes configured in OLA. The archive contains traces for running at different intervals, numbers of universes (1, 2, 24, 240), oscillating vs. fixed data, and both APIs.

Here is one frame from the streaming client with 24 universes (strace_ola_throughput_multi_u24_s20000.txt):

clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=20000000}, NULL) = 0
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\207\3\32\rStreamDmxData\"\207\4\10\1\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\210\3\32\rStreamDmxData\"\207\4\10\2\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\211\3\32\rStreamDmxData\"\207\4\10\3\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\212\3\32\rStreamDmxData\"\207\4\10\4\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\213\3\32\rStreamDmxData\"\207\4\10\5\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\214\3\32\rStreamDmxData\"\207\4\10\6\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\215\3\32\rStreamDmxData\"\207\4\10\7\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\216\3\32\rStreamDmxData\"\207\4\10\10\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\217\3\32\rStreamDmxData\"\207\4\10\t\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\220\3\32\rStreamDmxData\"\207\4\10\n\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\221\3\32\rStreamDmxData\"\207\4\10\v\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\222\3\32\rStreamDmxData\"\207\4\10\f\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\223\3\32\rStreamDmxData\"\207\4\10\r\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\224\3\32\rStreamDmxData\"\207\4\10\16\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\225\3\32\rStreamDmxData\"\207\4\10\17\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\226\3\32\rStreamDmxData\"\207\4\10\20\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\227\3\32\rStreamDmxData\"\207\4\10\21\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\230\3\32\rStreamDmxData\"\207\4\10\22\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\231\3\32\rStreamDmxData\"\207\4\10\23\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\232\3\32\rStreamDmxData\"\207\4\10\24\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\233\3\32\rStreamDmxData\"\207\4\10\25\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\234\3\32\rStreamDmxData\"\207\4\10\26\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
epoll_wait(7, [], 10, 1)                = 0
sendto(6, "\36\2\0\20\10\n\20\235\3\32\rStreamDmxData\"\207\4\10\27\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
write(1, "frame time: 0040 ms\n", 20)   = 20

And for comparison this is the same configuration but with the advanced client (strace_ola_throughput_multi_u24_s20000_a.txt):

epoll_wait(3, [], 10, 15)               = 0
sendto(6, "\36\2\0\20\10\n\20\204\5\32\rStreamDmxData\"\207\4\10\1\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\205\5\32\rStreamDmxData\"\207\4\10\2\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\206\5\32\rStreamDmxData\"\207\4\10\3\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\207\5\32\rStreamDmxData\"\207\4\10\4\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\210\5\32\rStreamDmxData\"\207\4\10\5\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\211\5\32\rStreamDmxData\"\207\4\10\6\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\212\5\32\rStreamDmxData\"\207\4\10\7\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\213\5\32\rStreamDmxData\"\207\4\10\10\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\214\5\32\rStreamDmxData\"\207\4\10\t\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\215\5\32\rStreamDmxData\"\207\4\10\n\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\216\5\32\rStreamDmxData\"\207\4\10\v\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\217\5\32\rStreamDmxData\"\207\4\10\f\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\220\5\32\rStreamDmxData\"\207\4\10\r\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\221\5\32\rStreamDmxData\"\207\4\10\16\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\222\5\32\rStreamDmxData\"\207\4\10\17\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\223\5\32\rStreamDmxData\"\207\4\10\20\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\224\5\32\rStreamDmxData\"\207\4\10\21\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\225\5\32\rStreamDmxData\"\207\4\10\22\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\226\5\32\rStreamDmxData\"\207\4\10\23\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\227\5\32\rStreamDmxData\"\207\4\10\24\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\230\5\32\rStreamDmxData\"\207\4\10\25\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\231\5\32\rStreamDmxData\"\207\4\10\26\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
sendto(6, "\36\2\0\20\10\n\20\232\5\32\rStreamDmxData\"\207\4\10\27\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
write(1, "frame time: 0004 ms\n", 20)   = 20

The example and OLA were built from DaAwesomeP/ola branch DaAwesomeP-StreamingClientThroughput commit 2140930. The only change in this branch is the additional multi throughput example. I built with: autoreconf -i && ./configure --disable-python-libs --disable-doxygen-doc && make -j4 and ran OLA with LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/vm/GitHub/ola-multi3/olad/.libs/ ./olad/olad -c ~/ola -d /home/vm/GitHub/ola-multi3/olad/www -p 9090.

Best,
Perry
strace_and_output_ola_throughput_multi.tar.gz

Simon Newton

Oct 29, 2022, 2:08:42 AM
to open-l...@googlegroups.com
Can you run it with -tt please? We need to see the timestamps.

Simon


DaAwesomeP

Oct 31, 2022, 9:41:53 AM
to open-lighting
Hi Simon,

OK, new strace outputs attached!

Here is a new excerpt that is one frame from the streaming client with 24 universes (strace_ola_throughput_multi_u24_s20000.txt):

09:32:23.681649 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=20000000}, NULL) = 0
09:32:23.702527 epoll_wait(7, [], 10, 1) = 0
09:32:23.703887 sendto(6, "\36\2\0\20\10\n\20\372\3\32\rStreamDmxData\"\207\4\10\1\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.704173 epoll_wait(7, [], 10, 1) = 0
09:32:23.705745 sendto(6, "\36\2\0\20\10\n\20\373\3\32\rStreamDmxData\"\207\4\10\2\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.705942 epoll_wait(7, [], 10, 1) = 0
09:32:23.707136 sendto(6, "\36\2\0\20\10\n\20\374\3\32\rStreamDmxData\"\207\4\10\3\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.707248 epoll_wait(7, [], 10, 1) = 0
09:32:23.708502 sendto(6, "\36\2\0\20\10\n\20\375\3\32\rStreamDmxData\"\207\4\10\4\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.708698 epoll_wait(7, [], 10, 1) = 0
09:32:23.710066 sendto(6, "\36\2\0\20\10\n\20\376\3\32\rStreamDmxData\"\207\4\10\5\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.710203 epoll_wait(7, [], 10, 1) = 0
09:32:23.711564 sendto(6, "\36\2\0\20\10\n\20\377\3\32\rStreamDmxData\"\207\4\10\6\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.711821 epoll_wait(7, [], 10, 1) = 0
09:32:23.713663 sendto(6, "\36\2\0\20\10\n\20\200\4\32\rStreamDmxData\"\207\4\10\7\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.713842 epoll_wait(7, [], 10, 1) = 0
09:32:23.715643 sendto(6, "\36\2\0\20\10\n\20\201\4\32\rStreamDmxData\"\207\4\10\10\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.715907 epoll_wait(7, [], 10, 1) = 0
09:32:23.717690 sendto(6, "\36\2\0\20\10\n\20\202\4\32\rStreamDmxData\"\207\4\10\t\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.717852 epoll_wait(7, [], 10, 1) = 0
09:32:23.719231 sendto(6, "\36\2\0\20\10\n\20\203\4\32\rStreamDmxData\"\207\4\10\n\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.719497 epoll_wait(7, [], 10, 1) = 0
09:32:23.720886 sendto(6, "\36\2\0\20\10\n\20\204\4\32\rStreamDmxData\"\207\4\10\v\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.721067 epoll_wait(7, [], 10, 1) = 0
09:32:23.722532 sendto(6, "\36\2\0\20\10\n\20\205\4\32\rStreamDmxData\"\207\4\10\f\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.722753 epoll_wait(7, [], 10, 1) = 0
09:32:23.724459 sendto(6, "\36\2\0\20\10\n\20\206\4\32\rStreamDmxData\"\207\4\10\r\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.724663 epoll_wait(7, [], 10, 1) = 0
09:32:23.725923 sendto(6, "\36\2\0\20\10\n\20\207\4\32\rStreamDmxData\"\207\4\10\16\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.726076 epoll_wait(7, [], 10, 1) = 0
09:32:23.727409 sendto(6, "\36\2\0\20\10\n\20\210\4\32\rStreamDmxData\"\207\4\10\17\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.727597 epoll_wait(7, [], 10, 1) = 0
09:32:23.728896 sendto(6, "\36\2\0\20\10\n\20\211\4\32\rStreamDmxData\"\207\4\10\20\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.729061 epoll_wait(7, [], 10, 1) = 0
09:32:23.730384 sendto(6, "\36\2\0\20\10\n\20\212\4\32\rStreamDmxData\"\207\4\10\21\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.730546 epoll_wait(7, [], 10, 1) = 0
09:32:23.731804 sendto(6, "\36\2\0\20\10\n\20\213\4\32\rStreamDmxData\"\207\4\10\22\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.731939 epoll_wait(7, [], 10, 1) = 0
09:32:23.733131 sendto(6, "\36\2\0\20\10\n\20\214\4\32\rStreamDmxData\"\207\4\10\23\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.733244 epoll_wait(7, [], 10, 1) = 0
09:32:23.734465 sendto(6, "\36\2\0\20\10\n\20\215\4\32\rStreamDmxData\"\207\4\10\24\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.734601 epoll_wait(7, [], 10, 1) = 0
09:32:23.735938 sendto(6, "\36\2\0\20\10\n\20\216\4\32\rStreamDmxData\"\207\4\10\25\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.736201 epoll_wait(7, [], 10, 1) = 0
09:32:23.738038 sendto(6, "\36\2\0\20\10\n\20\217\4\32\rStreamDmxData\"\207\4\10\26\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.738294 epoll_wait(7, [], 10, 1) = 0
09:32:23.739690 sendto(6, "\36\2\0\20\10\n\20\220\4\32\rStreamDmxData\"\207\4\10\27\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:32:23.739977 write(1, "frame time: 0037 ms\n", 20) = 20

And the same new excerpt from the advanced client (strace_ola_throughput_multi_u24_s20000_a.txt):

09:31:11.399860 epoll_wait(3, [], 10, 14) = 0
09:31:11.414560 sendto(6, "\36\2\0\20\10\n\20\210\n\32\rStreamDmxData\"\207\4\10\1\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.414903 sendto(6, "\36\2\0\20\10\n\20\211\n\32\rStreamDmxData\"\207\4\10\2\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.415181 sendto(6, "\36\2\0\20\10\n\20\212\n\32\rStreamDmxData\"\207\4\10\3\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.415407 sendto(6, "\36\2\0\20\10\n\20\213\n\32\rStreamDmxData\"\207\4\10\4\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.415635 sendto(6, "\36\2\0\20\10\n\20\214\n\32\rStreamDmxData\"\207\4\10\5\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.415852 sendto(6, "\36\2\0\20\10\n\20\215\n\32\rStreamDmxData\"\207\4\10\6\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.416097 sendto(6, "\36\2\0\20\10\n\20\216\n\32\rStreamDmxData\"\207\4\10\7\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.416314 sendto(6, "\36\2\0\20\10\n\20\217\n\32\rStreamDmxData\"\207\4\10\10\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.416534 sendto(6, "\36\2\0\20\10\n\20\220\n\32\rStreamDmxData\"\207\4\10\t\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.416746 sendto(6, "\36\2\0\20\10\n\20\221\n\32\rStreamDmxData\"\207\4\10\n\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.416965 sendto(6, "\36\2\0\20\10\n\20\222\n\32\rStreamDmxData\"\207\4\10\v\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.417176 sendto(6, "\36\2\0\20\10\n\20\223\n\32\rStreamDmxData\"\207\4\10\f\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.417395 sendto(6, "\36\2\0\20\10\n\20\224\n\32\rStreamDmxData\"\207\4\10\r\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.417607 sendto(6, "\36\2\0\20\10\n\20\225\n\32\rStreamDmxData\"\207\4\10\16\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.417826 sendto(6, "\36\2\0\20\10\n\20\226\n\32\rStreamDmxData\"\207\4\10\17\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.418096 sendto(6, "\36\2\0\20\10\n\20\227\n\32\rStreamDmxData\"\207\4\10\20\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.418323 sendto(6, "\36\2\0\20\10\n\20\230\n\32\rStreamDmxData\"\207\4\10\21\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.418540 sendto(6, "\36\2\0\20\10\n\20\231\n\32\rStreamDmxData\"\207\4\10\22\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.418794 sendto(6, "\36\2\0\20\10\n\20\232\n\32\rStreamDmxData\"\207\4\10\23\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.419013 sendto(6, "\36\2\0\20\10\n\20\233\n\32\rStreamDmxData\"\207\4\10\24\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.419238 sendto(6, "\36\2\0\20\10\n\20\234\n\32\rStreamDmxData\"\207\4\10\25\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.419456 sendto(6, "\36\2\0\20\10\n\20\235\n\32\rStreamDmxData\"\207\4\10\26\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.419683 sendto(6, "\36\2\0\20\10\n\20\236\n\32\rStreamDmxData\"\207\4\10\27\22\200\4"..., 546, MSG_NOSIGNAL, NULL, 0) = 546
09:31:11.419911 write(1, "frame time: 0005 ms\n", 20) = 20

Same commands and environment as before just with strace -tt this time.

It seems that no single epoll_wait() is particularly long, but the cumulative wait time does add up.

Thanks again,
Perry
strace_tt_and_output_ola_throughput_multi.tar.gz

Simon Newton

Nov 1, 2022, 8:03:15 AM
to open-l...@googlegroups.com
Each call to epoll_wait has a 1ms delay on it, likely from https://github.com/OpenLightingProject/ola/blob/master/common/io/EPoller.cpp#L315 

Can you change that 1 to 0, rebuild and see if that helps.

Simon


DaAwesomeP

Nov 1, 2022, 11:09:35 AM
to open-lighting
OK, wow, that definitely helped! Still not quite a match, but definitely better. New results attached. I will open a pull request if this is the best solution.

Some quick comparisons:
  • output_ola_throughput_multi_u01_s20000_a (advanced) vs output_ola_throughput_multi_u01_s20000 (streaming): match, basically no delay!
  • output_ola_throughput_multi_u02_s20000_a (advanced) vs output_ola_throughput_multi_u02_s20000 (streaming): match, basically no delay!
  • output_ola_throughput_multi_u24_s20000_a (advanced) vs output_ola_throughput_multi_u24_s20000 (streaming): much, much closer in response time and definitely usable, though still a few more ms per frame than the advanced client.
  • output_ola_throughput_multi_u240_s20000_a (advanced) vs output_ola_throughput_multi_u240_s20000 (streaming): still about a 25ms difference per frame.
Now that epoll_wait() isn't delaying as long each time, the overhead is definitely much lower. However, unless there are more optimizations/delay reductions to apply, I think the next step would be to add a method to the streaming client that sends multiple universes in one function call, with only a single epoll_wait() call.

Thank you!
Perry

DaAwesomeP

Nov 1, 2022, 11:10:28 AM
to open-lighting
Oops, forgot to attach. Here are the new results with the EPoller.cpp modification.

Perry
EPollerModifcation_strace_and_output_ola_throughput_multi.tar.gz

Simon Newton

Nov 1, 2022, 7:05:27 PM
to open-l...@googlegroups.com
Can you send an strace for output_ola_throughput_multi_u240_s20000 with -T -tt please?

DaAwesomeP

Nov 2, 2022, 9:44:16 AM
to open-lighting
Sure, new outputs attached!

Thanks!
Perry
strace_ola_throughput_multi_u240_s20000.txt
output_ola_throughput_multi_u240_s20000.txt
EPollerModifcation_strace_T_and_output_ola_throughput_multi.tar.gz

Simon Newton

Nov 3, 2022, 8:29:22 PM
to open-l...@googlegroups.com
In the strace you posted:

sendto() takes 25uS
epoll() takes 25uS
Between calls to epoll() in the same batch there is ~100uS.

Can you time the call to ola_client.SendDmx()? I'm curious to see how long that's taking overall.

Simon




DaAwesomeP

Nov 3, 2022, 8:32:00 PM
to open-lighting
Sure! Should I pull out std::chrono or is there a simple way to do that with strace?

Perry

Simon Newton

Nov 4, 2022, 7:24:50 AM
to open-l...@googlegroups.com
Go for std::chrono.



DaAwesomeP

Nov 4, 2022, 4:26:59 PM
to open-lighting
OK!

New version of example here that includes sendDmx() timing measurements: https://github.com/DaAwesomeP/ola/blob/a5acd855c079c91c59a31a38204d2a138eb188bb/examples/ola-throughput-multi.cpp

All measurements are now in microseconds.

New strace and outputs attached, still with the EPoller.cpp modification.

Thanks,
Perry
sendDmxTimings_EPollerModifcation_strace_T_and_output_ola_throughput_multi.tar.gz

Peter Newman

Nov 6, 2022, 11:54:48 PM
to open-lighting
I think it would be great to get a PR into the 0.10 branch for the epoll change, given it definitely speeds things up on its own.

Have we checked there's not a similar bug in select via --no-use-epoll? Or kqueue, if you've got a Mac/BSD?

DaAwesomeP

Dec 3, 2022, 1:28:00 PM
to open-lighting
Pull opened to patch EPoll delay: https://github.com/OpenLightingProject/ola/pull/1801

I have not yet had a chance to test with --no-use-epoll. I also don't have a Mac or BSD machine that I can easily test on.

Simon, let me know if you have any ideas where the rest of the delays may be coming from.

Thanks!
Perry