Performance analysis


Micah Wedemeyer

Jan 20, 2015, 11:22:37 AM
to lightsh...@googlegroups.com
I was talking with a friend about the performance issues as light channels go up, and he asked a pretty simple question: Why should more light channels make a difference?

Based on some of my experiments with multi-threading, I've found that with few channels, the majority of time is spent waiting on data to play out through the audio out, and with many channels (and no cache), the FFT analysis takes the majority of time, to the point that it starves the rest of the app. In other words, the FFT analysis falls behind the audio playback and you get stuttering.

But when I looked in the FFT code, it really doesn't seem like the actual FFT (i.e. numpy.fft.rfft) has anything to do with the number of channels. The only channel-related code in there is where the FFT results are summed and log10'ed. I did an experiment where it runs the FFT, then just discards the result and returns a matrix of zeros for each channel; in other words, it skips the last for loop in the fft.py module. In that case, CPU usage dropped considerably and it was back to being audio-output bound.

So, what this leads me to believe is that the real performance issue related to increased light channels is memory allocation and copying in that bottom for-loop in fft.py, not the raw FFT itself. I think that the array slicing may end up allocating and copying a lot of values from the original power array into new arrays prior to the summing.
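For reference, here's a simplified sketch of the style of per-channel loop I'm talking about (the names and the exact piff formula are my reconstruction, not the actual fft.py source):

    import numpy as np

    def piff(freq, chunk_size, sample_rate):
        # map a frequency in Hz to an index into the rfft power array
        return int(chunk_size * freq / sample_rate)

    def calculate_levels_sliced(power, frequency_limits, chunk_size, sample_rate):
        # one slice + np.sum + log10 per channel, in the style of the current loop
        matrix = np.zeros(len(frequency_limits))
        for i, (low, high) in enumerate(frequency_limits):
            band = power[piff(low, chunk_size, sample_rate):piff(high, chunk_size, sample_rate)]
            matrix[i] = np.log10(np.sum(band))
        return matrix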

I think this could be reduced to a single pass over the power array, where each element in the power array is added to an element in the matrix array. So the matrix array starts at zeros (like it does now), but instead of using np.sum on a slice of the power array for each channel, we just iterate over the power array once and add each element to the corresponding element in the matrix array. Basically, we use the (kinda) inverse of the piff function to map from the power array index (i.e. 0 to chunk_size - 1) to the matrix index (i.e. 0 to gpiolen - 1). This completely eliminates all the slicing of the array and the potential array-copying issues that come with it.
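Roughly what I have in mind, as a sketch only (inverse_piff and the frequency_limits layout are assumptions on my part, not existing code):

    import numpy as np

    def inverse_piff(bin_index, chunk_size, sample_rate, frequency_limits):
        # rough inverse of piff: map an FFT bin index back to a light channel,
        # or -1 if the bin falls outside every channel's frequency range
        freq = bin_index * sample_rate / float(chunk_size)
        for channel, (low, high) in enumerate(frequency_limits):
            if low <= freq < high:
                return channel
        return -1

    def calculate_levels_single_pass(power, frequency_limits, chunk_size, sample_rate):
        # one pass over the power array, no slicing, accumulating per channel
        matrix = np.zeros(len(frequency_limits))
        for i, value in enumerate(power):
            channel = inverse_piff(i, chunk_size, sample_rate, frequency_limits)
            if channel >= 0:
                matrix[channel] += value
        return np.log10(matrix)

(The per-bin channel lookup could of course be precomputed once per song instead of searched on every chunk.)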

I'm going to play with this a bit and see what the results are. Ultimately, this will only affect non-cached results, but the question still remains, and is even more pertinent for the cached results: why should more channels make a difference? 8 channels? 24? 64? These are all very small numbers as far as for loops and such are concerned, even on a low-powered machine like the Pi. If the cache is just reading from a preloaded matrix, why should it matter whether the matrix is 4000x8 or 4000x64, if each row is read once per sound chunk and we have to wait on the audio out to play the chunk before using the next row in the matrix?

(Note: This is mainly for on/off mode; I'm still unclear exactly how PWM works, but it seems to require much more precise timing.)

Todd Giles

Jan 20, 2015, 11:47:11 AM
to lightsh...@googlegroups.com
On Tue Jan 20 2015 at 9:22:38 AM Micah Wedemeyer <micahwe...@gmail.com> wrote:
I was talking with a friend about the performance issues as light channels go up, and he asked a pretty simple question: Why should more light channels make a difference?

Based on some of my experiments with multi-threading, I've found that with few channels, the majority of time is spent waiting on data to play out through the audio out, and with many channels (and no cache), the FFT analysis takes the majority of time, to the point that it starves the rest of the app. In other words, the FFT analysis falls behind the audio playback and you get stuttering.

But when I looked in the FFT code, it really doesn't seem like the actual FFT (i.e. numpy.fft.rfft) has anything to do with the number of channels. The only channel-related code in there is where the FFT results are summed and log10'ed. I did an experiment where it runs the FFT, then just discards the result and returns a matrix of zeros for each channel; in other words, it skips the last for loop in the fft.py module. In that case, CPU usage dropped considerably and it was back to being audio-output bound.

This is indeed the case: the FFT itself is not channel dependent, only the binning at the end is.  Any improvements to this binning would be more than welcome!  I haven't put much more time into improving it yet, as it "worked" for all of my setups (even for real-time analysis with an 8-channel audio-in mode).  That said, if caching became unnecessary, that would simplify things and would of course improve the audio-in mode as well.
 
So, what this leads me to believe is that the real performance issue related to increased light channels is memory allocation and copying in that bottom for-loop in fft.py, not the raw FFT itself. I think that the array slicing may end up allocating and copying a lot of values from the original power array into new arrays prior to the summing.

I think this could be reduced to a single pass over the power array, where each element in the power array is added to an element in the matrix array. So the matrix array starts at zeros (like it does now), but instead of using np.sum on a slice of the power array for each channel, we just iterate over the power array once and add each element to the corresponding element in the matrix array. Basically, we use the (kinda) inverse of the piff function to map from the power array index (i.e. 0 to chunk_size - 1) to the matrix index (i.e. 0 to gpiolen - 1). This completely eliminates all the slicing of the array and the potential array-copying issues that come with it.

I'm going to play with this a bit and see what the results are.

Please do, let us know your results!
 
Ultimately, this will only affect non-cached results, but the question still remains, and is even more pertinent for the cached results: why should more channels make a difference? 8 channels? 24? 64? These are all very small numbers as far as for loops and such are concerned, even on a low-powered machine like the Pi. If the cache is just reading from a preloaded matrix, why should it matter whether the matrix is 4000x8 or 4000x64, if each row is read once per sound chunk and we have to wait on the audio out to play the chunk before using the next row in the matrix?

These are all great questions - I personally haven't gone above 8 channels yet, so this hasn't been an urgent issue for me.  I'll eventually get to more channels and investigate this more closely if I run into performance issues, but any findings you make in this area would be welcome!
 
(Note: This is mainly for on/off mode; I'm still unclear exactly how PWM works, but it seems to require much more precise timing.)

PWM mode currently uses wiringPi's software PWM library.

Of particular note, each pin using software PWM uses ~0.5% CPU, and you must run as root to use software PWM (wiringPiSetupSys() is not fast enough).
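As a rough illustration only (the module name and calls below are my assumption of how the wiringPi Python bindings mirror the C softPwm API, not our actual code):

    import time
    import wiringpi2 as wiringpi  # assumed name of the Python bindings

    PIN = 0          # wiringPi pin numbering
    PWM_RANGE = 100  # softPwmWrite then accepts values from 0 to PWM_RANGE

    wiringpi.wiringPiSetup()                   # needs root; wiringPiSetupSys() is too slow for this
    wiringpi.softPwmCreate(PIN, 0, PWM_RANGE)  # starts the per-pin software PWM thread (~0.5% CPU)

    # fade the channel up and back down once
    for value in list(range(PWM_RANGE + 1)) + list(range(PWM_RANGE, -1, -1)):
        wiringpi.softPwmWrite(PIN, value)
        time.sleep(0.01)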

If we wanted to drive a larger number of channels with PWM at the same time, I'd think we'd want to move to an external PWM chip solution.  That said, I'm still interested in seeing what all we can do with the Pi as-is (it keeps cost down, and it's just fun to see how much we can do with the base Pi).
 


Micah Wedemeyer

Jan 20, 2015, 12:33:45 PM
to lightsh...@googlegroups.com
I sort of reversed/inverted the piff function to go from the FFT bin to the GPIO channel, and I pre-computed all possible bins (chunk_size of them, right?) into a lookup table. So, if the chunk_size is 2048, I create a lookup table with 2048 entries, where each entry maps to a GPIO channel (i.e. 0 - 8, or 0 - 64).

This meant it was O(1) (constant) time to figure out which channel to assign a single FFT bin to, so I could iterate over the power array a single time and assign each bin to a channel. So, O(n) where n is the number of bins. With that, I was able to get non-cached smooth playback with 64 pins in on/off mode. It still fell apart when I went to 128 channels, but at that point it was not a case of starving while waiting on light matrix data, so there may be other issues when you get to 128 channels.
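In sketch form, it looks something like this (the names here are illustrative rather than exactly what I committed):

    import numpy as np

    def build_bin_to_channel(chunk_size, sample_rate, frequency_limits):
        # one lookup-table entry per FFT bin: bin index -> channel, -1 if unused;
        # built once per configuration, so the per-bin lookup at playback time is O(1)
        table = np.full(chunk_size, -1, dtype=np.int64)
        for channel, (low, high) in enumerate(frequency_limits):
            lo = int(chunk_size * low / sample_rate)   # piff(low)
            hi = int(chunk_size * high / sample_rate)  # piff(high)
            table[lo:hi] = channel
        return table

    def calculate_levels_lookup(power, table, gpiolen):
        # O(n) in the number of bins: each bin is added to its channel exactly once
        matrix = np.zeros(gpiolen)
        idx = table[:len(power)]
        valid = idx >= 0
        np.add.at(matrix, idx[valid], power[valid])  # unbuffered add handles repeated indices
        return np.log10(matrix)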

As for removing the need for the cache, I'm not sure about that; besides, the cache is a very simple (and good) implementation.

I'll clean things up a bit and post my code a little later today.

There's a good chance there are some off-by-one errors or other small mistakes in my code, as I really don't understand the FFT code all that well, like why the piff function does what it does. I know what it does (which is how I could figure out the inverse), just not exactly why.

My ultimate goal was to show that adding a little overhead with objects and indirection is probably not the deciding factor in whether the app is performant or not. My hunch is that we'll get the majority of performance improvement by looking closely at where time is being spent and focusing effort there. My investigations are showing that regardless of caching, loading and feeding the light matrix data is a bottleneck when you start upping the channels. That seems very strange to me.


Micah Wedemeyer

Jan 20, 2015, 1:02:08 PM
to lightsh...@googlegroups.com
Here's the related code. https://bitbucket.org/micahwedemeyer/lightshowpi/commits/abbe58dbfaed3e298911c07dcb57402d5ebd1db7

I didn't get a chance to clean it up as much as I'd like. There is some mingling of two different ideas there:
  • Playing around with cache loading, to see if there's a way to speed that up (I'm still curious here; perhaps using one of the native binary formats that numpy supports - there's a quick np.save/np.load sketch below)
  • Reworking the post-FFT binning process.
The second one is the more interesting at this point. Please take a look at the changes to fft.py and at how I create the fft_bucket_to_channel_map in fft_analyzer.py.
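For the first point, this is the kind of thing I mean by numpy's native binary formats (the file names are made up, and I haven't checked yet how the existing cache is actually written):

    import numpy as np

    matrix = np.random.rand(4000, 8)  # stand-in for a song's cached per-chunk light levels

    # text-based cache: easy to inspect, but slow to parse on every load
    np.savetxt("song.cache.txt", matrix)
    from_text = np.loadtxt("song.cache.txt")

    # numpy's native binary .npy format: much faster to load and smaller on disk
    np.save("song.cache.npy", matrix)
    from_binary = np.load("song.cache.npy")

    assert np.allclose(from_text, from_binary)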

Okay, it's been fun, but time to get back to my real job...

Todd Giles

Jan 20, 2015, 1:05:21 PM
to lightsh...@googlegroups.com
I understand getting back to the real job ;)  Wishing I had more time in the day!  Glad you're spending time on this, and I hope you're finding it interesting and fun as well!
