It looks to me like your work unit generation is lazy. I also surmise
that it's taking 15 seconds to do 50,000 MD5 computations, i.e. 0.3 ms
per computation, and that figure might include the rest of your program, too.
Is it possible that the lazy computation of the work unit -- which is
not parallelized -- is only running fast enough to supply one agent
with input?
Have you tried completely materializing your work units before timing
the agent part?
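
To make that suggestion concrete, here is a rough sketch of the idea, in
Python rather than your actual code. Everything in it is an assumption
for illustration: work_units, md5_hex and timed_hashing are made-up
names, the work units are dummy byte strings, and a thread pool stands
in for your agents. The only point is where the timer starts relative
to work unit generation.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

N = 50_000  # number of work units, matching the figure discussed above


def work_units(n):
    """Lazily yield n hypothetical work units (dummy byte strings)."""
    for i in range(n):
        yield f"unit-{i}".encode()


def md5_hex(unit):
    """Return the MD5 hex digest of one work unit."""
    return hashlib.md5(unit).hexdigest()


def timed_hashing(units, workers=4):
    """Hash every unit on a thread pool; return (digests, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(md5_hex, units))
    return digests, time.perf_counter() - start


# Lazy input: the generator is consumed inside the timed region, so the
# measurement includes the single-threaded cost of producing the units.
lazy_digests, lazy_secs = timed_hashing(work_units(N))

# Materialized input: all units exist before the timer starts, so the
# measurement covers only the hashing side.
units = list(work_units(N))
eager_digests, eager_secs = timed_hashing(units)

print(f"lazy input:         {lazy_secs:.3f} s")
print(f"materialized input: {eager_secs:.3f} s")
```

If the two timings differ a lot in your real program, that would point
at work unit generation rather than at the agents. With dummy units as
cheap as these the gap will likely be small; the comparison only tells
you something when run against your actual generation code.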
Great news! Happy to help. This stuff is pretty new to me, too :)