I did some work on implementing Falcon (i.e. soon-to-be FN-DSA) on an ARM Cortex-M4 (an M4F, specifically). I already had a go at it around 2019, but this time the result is better: in a nutshell, it is twice as fast.
For Falcon-512: keygen in 72M cycles, signing in 22M, verifying in 255k (359k if the public key hash is included in the signed data, which grants BUFF properties). Verification cost could be reduced by another 50k cycles or so by unrolling some loops in the NTT and using the XKCP implementation for SHAKE, but that would substantially increase the code footprint.
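To put these cycle counts in perspective, here is a small sketch that derives the extra cost of hashing the public key (the difference between the two verification figures) and converts cycles to milliseconds. The clock frequency is purely an assumption for illustration (100 MHz, not a figure from my measurements), and the names `CYCLES`, `ASSUMED_CLOCK_HZ`, `cycles_to_ms` and `pk_hash_overhead` are made up for this example.

```python
# Falcon-512 cycle counts quoted above (Cortex-M4F, rounded figures).
CYCLES = {
    "keygen": 72_000_000,
    "sign": 22_000_000,
    "verify": 255_000,
    "verify_with_pk_hash": 359_000,
}

# ASSUMPTION: hypothetical core clock, chosen only for illustration.
ASSUMED_CLOCK_HZ = 100_000_000


def cycles_to_ms(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock."""
    return cycles * 1000 / clock_hz


def pk_hash_overhead(cycles=CYCLES):
    """Extra verification cycles spent hashing the public key."""
    return cycles["verify_with_pk_hash"] - cycles["verify"]


if __name__ == "__main__":
    for op, count in CYCLES.items():
        print(f"{op}: {count} cycles, {cycles_to_ms(count):.2f} ms")
    print(f"public key hash overhead: {pk_hash_overhead()} cycles")
```

With the rounded figures above, the public key hash accounts for about 104k cycles; actual wall-clock times depend on the real clock speed and on flash wait states of the board in use.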
Compared with Dilithium on the same platform (I am using the results from
https://eprint.iacr.org/2022/112), Falcon keygen is (as expected) a lot more expensive (45x); signing is also more expensive, but maybe not as much as one would fear (about 5.4x, less than an order of magnitude); verification, however, is considerably faster (4x to 6x).
Thomas