At the previous SIG meeting we discussed math-related optimizations for Cortex-M, for which CMSIS-DSP is a good option. As promised,
here are frontend optimizations for the micro speech example. Check out
these lines for a hint of what was done. This gives a significant reduction in the frontend cycle count.