From what I understand, you have decomposed your main problem into many small, quickly solvable problems, right?
Do you have real cores, or do you use hardware multithreading/SMT/hyperthreading (all the same thing)? I.e. do you have an 8-core/16-thread machine, or a 16-core machine?
I'm asking because:
1) as a rule of thumb, using SMT (i.e. running two streams of instructions on a single core simultaneously) gives you an 1.5x gain in performance, instead of the expected 2x. Your mileage will vary depending on memory accesses, types of instructions in the different threads, whether they not interfere in their itineraries in the processor pipeline, etc, etc.. You should get 1x only in rare cases. But I've personally never witnessed a 2x approach.
2) If you have two processors (i.e. two sockets on the motherboard), it's again an all different story, because each of the processor manages its own memory, but can share it with the other processor, but at a huge cost, in the ballpark of 300 to 400 cycles. If your processes consume a lot of memory, you may end up having your processors exchanging data slowly, and at a high cost.
If my premise is correct, then you can consider using a second machine (or more), or trying on a true 16-core machine, to see how it behaves. There will be memory contention at one point, but if you have one of the latest designs for workstations or servers, there should be ample cache size, and this should probably work.
I hope this answers the question from the right angle. It's not solver-related but hardware-related.
Happy to follow up if you have more questions.