Hi Sean,
First, sorry for the delayed response! We are currently working on exactly this type of problem: how to efficiently store, access, search and voxelize particles into VDB grids. Stay tuned for further updates on this work; if everything goes as planned we hope to share it in a release scheduled for August. Having said this, it sounds like you've already made some good progress yourself, so I'll try to address your questions without reference to our parallel work.
First off, I'm assuming you're referring to the fundamental FLIP step where velocities on particles are rasterized into a (MAC) VDB grid? As you probably already know, there are generally two strategies, a gather or a scatter, each with its pros and cons. The former requires a mapping from voxels to contributing particles, whereas the latter (which voxelizes particles individually) requires an auxiliary grid to accumulate the weights. Performance is highly dependent on the efficiency of your mapping and threading strategies, so it's hard to say which one is best. The scatter approach is arguably the easiest, since all you need are extra buffers (e.g. available in the LeafManager) or an additional grid to accumulate the particle weights, whereas the gather requires a particle acceleration structure. As a side note, it's our belief that the gather approach has potential for better performance than the scatter, but it really hinges on the performance of your particle acceleration structure.
So your next question seems to be how to safely thread the scatter approach. Again, you are faced with two different approaches: a parallel_reduce or a parallel_for. The former requires each thread to allocate its own grid (to store velocity components and weights), and the join method then unions the grids as threads finish (see e.g. tools/ParticlesToLevelSet.h for an example). Alternatively (and this is likely faster), if you can pre-determine the topology of the grids from the particles, you should allocate them up front and use parallel_for. In other words, you can only do thread-safe writes to a grid if it is pre-allocated. One way to do this is to use the particles to generate a BoolGrid of all voxels that contain a particle (you may even have to dilate this grid) and then use a topology copy-constructor to generate the velocity grids.
Let me know if you have more questions or need help deciphering my suggestions.
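In case it helps, here is a minimal sketch of the parallel_reduce-style scatter described above. It is not OpenVDB or TBB code: to keep it self-contained it uses a dense 1D array as a stand-in for the grid, std::thread instead of tbb::parallel_reduce, and simple linear (tent) weights. The names Particle, Accum, scatter and rasterize are all hypothetical and exist only for this illustration. Each thread scatters its share of particles into private buffers (weighted velocity sums plus the auxiliary weights), the buffers are then joined, and the sums are normalized by the accumulated weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical particle for this sketch: 1D position (in voxel units) and velocity.
struct Particle { float pos; float vel; };

// Private accumulation buffers owned by one thread:
// weighted velocity sum and weight sum per voxel.
struct Accum {
    std::vector<float> velSum, weight;
    explicit Accum(std::size_t n) : velSum(n, 0.0f), weight(n, 0.0f) {}

    // The "join" step of the reduction: merge another thread's partial sums.
    void join(const Accum& other) {
        assert(velSum.size() == other.velSum.size());
        for (std::size_t i = 0; i < velSum.size(); ++i) {
            velSum[i] += other.velSum[i];
            weight[i] += other.weight[i];
        }
    }
};

// Scatter one particle into the two voxels it overlaps, using tent weights.
inline void scatter(const Particle& p, Accum& a) {
    const int n = static_cast<int>(a.velSum.size());
    const int i0 = static_cast<int>(std::floor(p.pos));
    const float t = p.pos - static_cast<float>(i0);
    const int idx[2] = {i0, i0 + 1};
    const float w[2] = {1.0f - t, t};
    for (int k = 0; k < 2; ++k) {
        if (idx[k] < 0 || idx[k] >= n || w[k] == 0.0f) continue; // outside the grid
        a.velSum[idx[k]] += w[k] * p.vel;
        a.weight[idx[k]] += w[k];
    }
}

// Reduction-style rasterization: no thread ever writes to shared buffers.
std::vector<float> rasterize(const std::vector<Particle>& particles,
                             std::size_t numVoxels, unsigned numThreads) {
    if (numThreads == 0) numThreads = 1;
    std::vector<Accum> partial(numThreads, Accum(numVoxels));
    const std::size_t chunk = (particles.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(particles.size(), t * chunk);
            const std::size_t end = std::min(particles.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) scatter(particles[i], partial[t]);
        });
    }
    for (auto& w : workers) w.join();

    // Join the per-thread buffers into the first one.
    for (unsigned t = 1; t < numThreads; ++t) partial[0].join(partial[t]);

    // Normalize; voxels that received no weight keep a velocity of zero.
    std::vector<float> velocity(numVoxels, 0.0f);
    for (std::size_t i = 0; i < numVoxels; ++i) {
        if (partial[0].weight[i] > 0.0f) {
            velocity[i] = partial[0].velSum[i] / partial[0].weight[i];
        }
    }
    return velocity;
}
```

Calling rasterize(particles, numVoxels, numThreads) returns one normalized velocity per voxel. The cost of this pattern is one full set of buffers per thread plus the join, which is why pre-allocating the topology and using parallel_for is likely faster when you can determine the topology up front.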
-Ken