Yes, especially on ARM platforms. Regarding SSE2, there are still
a few functions that could be optimized. It's pretty easy (and fun) to do
with intrinsics.
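
To give an idea of what such an intrinsics port looks like, here is a minimal sketch of a sum-of-squared-errors over 16 pixels, with a plain C reference next to it. The function names are made up for illustration and are not the library's actual ones. The SSE2 path is only compiled when the compiler defines __SSE2__, so the same file still builds on ARM and falls back to the C version.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C reference: sum of squared differences over 16 bytes.
 * (Hypothetical name, for illustration only.) */
uint32_t SumSquaredError16_C(const uint8_t* a, const uint8_t* b) {
  uint32_t sum = 0;
  int i;
  for (i = 0; i < 16; ++i) {
    const int d = (int)a[i] - (int)b[i];
    sum += (uint32_t)(d * d);
  }
  return sum;
}

#if defined(__SSE2__)
/* Same computation with SSE2 intrinsics. */
uint32_t SumSquaredError16_SSE2(const uint8_t* a, const uint8_t* b) {
  const __m128i zero = _mm_setzero_si128();
  const __m128i va = _mm_loadu_si128((const __m128i*)a);
  const __m128i vb = _mm_loadu_si128((const __m128i*)b);
  /* Widen the 16 bytes to two vectors of eight 16-bit values. */
  const __m128i a_lo = _mm_unpacklo_epi8(va, zero);
  const __m128i a_hi = _mm_unpackhi_epi8(va, zero);
  const __m128i b_lo = _mm_unpacklo_epi8(vb, zero);
  const __m128i b_hi = _mm_unpackhi_epi8(vb, zero);
  /* Signed 16-bit differences, each in [-255, 255]. */
  const __m128i d_lo = _mm_sub_epi16(a_lo, b_lo);
  const __m128i d_hi = _mm_sub_epi16(a_hi, b_hi);
  /* madd squares each difference and adds adjacent pairs into 32-bit lanes. */
  const __m128i s_lo = _mm_madd_epi16(d_lo, d_lo);
  const __m128i s_hi = _mm_madd_epi16(d_hi, d_hi);
  __m128i sum = _mm_add_epi32(s_lo, s_hi);
  /* Horizontal add of the four 32-bit lanes. */
  sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
  sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
  return (uint32_t)_mm_cvtsi128_si32(sum);
}
#endif

/* Picks the SSE2 version when the build targets it, the C one otherwise. */
uint32_t SumSquaredError16(const uint8_t* a, const uint8_t* b) {
#if defined(__SSE2__)
  return SumSquaredError16_SSE2(a, b);
#else
  return SumSquaredError16_C(a, b);
#endif
}
```

Keeping the C version around makes it easy to check the optimized one against it on random inputs.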
There's also an unfinished higher-level attempt at reducing the number of
analysis passes done during encoding (or shortening them). For instance:
using more memory to store the coefficient tokens (while the optimal
distribution of probabilities is being recorded), and so avoiding re-doing
the last pass. This is pretty much what libvpx does.
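
A rough sketch of the token-recording idea, under simplifying assumptions: tokens are reduced to single bits tagged with a context index, and the probability formula is just a plausible 8-bit estimate. All names here (TokenBuffer, RecordBit, ReplayTokens, ...) are hypothetical and not taken from libwebp or libvpx. The point is only that bits are stored while statistics are gathered, then replayed with the final probabilities instead of running the pass again.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_CTX 4  /* hypothetical number of probability contexts */

typedef struct {
  uint16_t* tokens;             /* one entry per bit: (ctx << 1) | bit */
  size_t size, capacity;
  uint32_t counts[NUM_CTX][2];  /* occurrences of bit 0 / bit 1 per context */
} TokenBuffer;

/* Callback receiving each replayed bit with its final probability. */
typedef void (*EmitFn)(int bit, uint8_t proba, void* user);

void TokenBufferInit(TokenBuffer* tb) { memset(tb, 0, sizeof(*tb)); }

void TokenBufferFree(TokenBuffer* tb) {
  free(tb->tokens);
  memset(tb, 0, sizeof(*tb));
}

/* Called during the single analysis pass: stores the bit and updates the
 * statistics. Returns 0 on allocation failure, 1 otherwise. */
int RecordBit(TokenBuffer* tb, int ctx, int bit) {
  if (tb->size == tb->capacity) {
    const size_t new_cap = tb->capacity ? 2 * tb->capacity : 1024;
    uint16_t* p = (uint16_t*)realloc(tb->tokens, new_cap * sizeof(*p));
    if (p == NULL) return 0;
    tb->tokens = p;
    tb->capacity = new_cap;
  }
  tb->tokens[tb->size++] = (uint16_t)((ctx << 1) | (bit & 1));
  tb->counts[ctx][bit & 1]++;
  return 1;
}

/* 8-bit probability of a 0 bit per context, clamped to [1, 255].
 * Contexts that were never seen default to 255. */
void FinalizeProbas(const TokenBuffer* tb, uint8_t probas[NUM_CTX]) {
  int c;
  for (c = 0; c < NUM_CTX; ++c) {
    const uint64_t n0 = tb->counts[c][0];
    const uint64_t total = n0 + tb->counts[c][1];
    const uint64_t p = total ? (255 * n0) / total : 255;
    probas[c] = (uint8_t)(p < 1 ? 1 : p);
  }
}

/* Replays the stored bits in order with the final probabilities. A real
 * encoder would call its boolean coder from 'emit'. */
void ReplayTokens(const TokenBuffer* tb, const uint8_t probas[NUM_CTX],
                  EmitFn emit, void* user) {
  size_t i;
  for (i = 0; i < tb->size; ++i) {
    const int ctx = tb->tokens[i] >> 1;
    emit(tb->tokens[i] & 1, probas[ctx], user);
  }
}

/* Stand-in sink that only counts what it receives. */
typedef struct { size_t num_bits, num_ones; } CountingSink;

void CountingEmit(int bit, uint8_t proba, void* user) {
  CountingSink* const s = (CountingSink*)user;
  (void)proba;
  s->num_bits++;
  s->num_ones += (size_t)bit;
}
```

The trade-off is the one mentioned above: memory grows with the number of recorded tokens, in exchange for not re-doing the pass that produced them.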
A parallel, multithreaded analysis pass could be useful, too.