I was talking with a friend about the performance issues as light channels go up, and he asked a pretty simple question: Why should more light channels make a difference?
Based on some of my experiments with multi-threading, I've found that with few channels, the majority of time is spent waiting on data to play out through the audio out, and with many channels (and no cache), the FFT analysis takes the majority of time, to the point that it starves the rest of the app. In other words, the FFT analysis falls behind the audio playback and you get stuttering.
But when I looked in the FFT code, the actual FFT (i.e. numpy.fft.rfft) doesn't seem to depend on the number of channels at all. The only channel-related code in there is where the FFT results are summed and log10'ed per channel. As an experiment, I ran the FFT, then just discarded the result and returned a matrix of zeros for each channel; in other words, I skipped the last for loop in the fft.py module. In that case, CPU usage dropped considerably and the app was back to being audio-output bound.
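For context, the per-channel summing I'm describing has roughly this shape (a sketch from memory, not the real fft.py; the names piff and frequency_limits and the epsilon guard are my approximations):

```python
import numpy as np

def piff(freq, chunk_size, sample_rate):
    """Map a frequency in Hz to an index into the power array (sketch)."""
    return int(chunk_size * freq / sample_rate)

def sum_channels(power, frequency_limits, chunk_size, sample_rate):
    """Per-channel slice-sum-log10, as described above (hypothetical sketch)."""
    matrix = np.zeros(len(frequency_limits))
    for i, (lo, hi) in enumerate(frequency_limits):
        # Slice out this channel's frequency band, sum it, then log10 it.
        band = power[piff(lo, chunk_size, sample_rate):
                     piff(hi, chunk_size, sample_rate)]
        matrix[i] = np.log10(np.sum(band) + 1e-12)  # epsilon avoids log10(0)
    return matrix
```

This is the loop that runs once per channel per chunk, so its cost scales directly with the channel count.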
So, what this leads me to believe is that the real performance issue related to increased light channels is memory allocation and copying in that bottom for-loop in fft.py, not the raw FFT itself. I think that the array slicing may end up allocating and copying a lot of values from the original power array into new arrays prior to the summing.
I think this could be reduced to a single pass over the power array, where each element of the power array is added to an element of the matrix array. The matrix array still starts at zeros (like it does now), but instead of calling np.sum on a slice of the power array, we just iterate over the power array and add each element to the appropriate element of the matrix array. Basically, we use the (rough) inverse of the piff function to map from a power array index (i.e. 0 to chunk_size - 1) to a matrix index (i.e. 0 to gpiolen - 1). This completely eliminates the slicing of the array and any element copying associated with it.
I'm going to play with this a bit and see what the results are. Ultimately, this will only affect non-cached results, but the question still remains, and is even more pertinent for the cached results: why should more channels make a difference? 8 channels? 24? 64? These are all very small numbers as far as for loops and such are concerned, even on a low-powered machine like the Pi. If the cache is just reading from a preloaded matrix, why should it matter whether the matrix is 4000x8 or 4000x64, if each row is read once per sound chunk, and we have to wait for the audio out to finish playing the chunk before using the next row in the matrix?
(Note: This is mainly for on/off, I'm still unclear exactly how PWM works, but it seems to require much more precise timing.)