The comments in this code do describe the parallelism. E.g. "count bits
of each 2-bit chunk" is about doing all the 2-bit chunks in parallel, up
to the width of the architecture's `int`, which you can infer from the
code is 32 bits.
However, how it works is a different matter: there's no comment about
that. Studying the hand-optimized code resulting from some idea is like
reverse-engineering machine code, or (slight exaggeration) trying to
deduce the shape of a human of a given age from a DNA sequence.
Instead I would look for a description of the basic idea, and how that
idea was expressed in code.
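For what it's worth, here is a minimal sketch of how the basic idea is
usually expressed, assuming a 32-bit unsigned value. It is not the code
under discussion, and the name `popcount_swar` is made up for this
illustration. Each step adds neighbouring counts pairwise, for all the
chunks at once, so the chunk width doubles until one chunk covers the
whole word:

```cpp
#include <cstdint>

// Hypothetical name; a sketch of the usual technique, not the original code.
constexpr std::uint32_t popcount_swar(std::uint32_t x)
{
    // Each 2-bit chunk becomes the number of set bits in it (0..2).
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    // Each 4-bit chunk becomes the sum of its two 2-bit counts (0..4).
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    // Each byte becomes the sum of its two 4-bit counts (0..8).
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    // Each 16-bit half becomes the sum of its two byte counts (0..16).
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
    // The whole word becomes the sum of its two halves (0..32).
    x = (x & 0x0000FFFFu) + (x >> 16);
    return x;
}
```

The masks keep each count inside its own chunk, so a carry from one
addition never spills into the neighbouring chunk. Hand-optimized
versions typically fold some of these steps together, which is part of
what makes them hard to read.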
And instead of reinventing this particular wheel, why not use one of the
umpteen existing solutions? In C++ you have `std::bitset::count`,
portably. And in both C and C++ there are somewhat less portable C-level
compiler intrinsics, e.g. as listed at <url:
https://en.wikichip.org/wiki/population_count>.
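A rough sketch of both options, assuming a 32-bit value. The function
names are made up for this illustration, and the intrinsic shown is the
GCC/Clang one only; other compilers have their own, so the sketch falls
back to the `std::bitset` version there:

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Portable: std::bitset<N>::count() returns the number of set bits.
inline std::size_t popcount_bitset(std::uint32_t x)
{
    return std::bitset<32>(x).count();
}

// Less portable: a compiler intrinsic, with the portable version as fallback.
inline int popcount_intrinsic(std::uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(x);   // GCC and Clang; takes an unsigned int
#else
    return static_cast<int>(popcount_bitset(x));
#endif
}
```

With optimization enabled a compiler may turn either of these into a
single population count instruction where the target has one, but that
depends on the compiler and its flags, so check the generated code if it
matters.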
Cheers & hth.,
- Alf