Have you ever looked at `concurrent.futures` from the Python 3 stdlib? I find it much nicer than `multiprocessing`, although it does not support advanced multiprocessing features like shared memory (something your application could surely benefit from). If the `concurrent.futures` approach fits your needs, then DO NOT use it directly; look instead at the `mpi4py.futures` package, which is a drop-in MPI-based implementation:
https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html
Communication with worker processes involves pickle serialization and may be a bit slow because of memory copies within the `pickle` module. I still have to find some time to implement a more efficient copy-free communication approach for large data applications.
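To give an idea of what "drop-in" means here, a minimal sketch follows. The helpers `square`, `squares`, `run_with_threads` and `run_with_mpi` are made-up names for illustration, not part of any library, and `square` only stands in for your real per-item work.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def square(x):
    # Hypothetical stand-in for the real per-item work. It is defined at
    # module level so that it can be pickled and sent to worker processes.
    return x * x


def squares(executor: Executor, values):
    # Relies only on the concurrent.futures Executor interface, so any
    # compatible pool can be passed in.
    return list(executor.map(square, values))


def run_with_threads(values):
    # Quick local check with a stdlib pool; no MPI involved.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return squares(executor, values)


def run_with_mpi(values):
    # Imported here so the rest of the script still loads without mpi4py.
    from mpi4py.futures import MPIPoolExecutor

    # Arguments and results travel to and from the workers via pickle.
    with MPIPoolExecutor() as executor:
        return squares(executor, values)
```

The MPI variant would be called from under an `if __name__ == "__main__":` guard and launched with something like `mpiexec -n 1 python script.py` (workers are spawned dynamically) or `mpiexec -n 5 python -m mpi4py.futures script.py`; which one works best depends on your MPI installation.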
Or perhaps use a pure-MPI approach. Split `COMM_WORLD` into subcomms whose processes can share memory (`node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)`), and then use `MPI.Win.Allocate_shared()` within each node-local subcomm to easily access shared memory. If your data involves large NumPy arrays, my guess is that this approach will give you the most performant solution. Of course, the implementation requires some knowledge of MPI.
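A rough sketch of the shared-memory route could look like the following. The function names `local_block`, `shared_array` and `demo`, the array length, and the choice of letting rank 0 of each node own the memory are all assumptions made for illustration; the synchronization is reduced to a plain barrier and may need more care in a real application.

```python
def local_block(rank, size, n):
    # Hypothetical helper: contiguous slice [start, stop) of range(n)
    # owned by `rank` when n items are spread over `size` ranks.
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def shared_array(node_comm, n, dtype="d"):
    # Imported here so that local_block() stays usable without MPI.
    import numpy as np
    from mpi4py import MPI

    itemsize = np.dtype(dtype).itemsize
    # Assumption: only rank 0 of the node contributes memory;
    # every other rank asks for zero bytes.
    nbytes = n * itemsize if node_comm.Get_rank() == 0 else 0
    win = MPI.Win.Allocate_shared(nbytes, itemsize, comm=node_comm)
    # Every rank queries rank 0's segment and wraps it as a NumPy array.
    buf, _ = win.Shared_query(0)
    return win, np.ndarray(buffer=buf, dtype=dtype, shape=(n,))


def demo(n=1000):
    from mpi4py import MPI

    # One subcommunicator per shared-memory node.
    node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    rank, size = node_comm.Get_rank(), node_comm.Get_size()

    win, data = shared_array(node_comm, n)
    start, stop = local_block(rank, size, n)
    data[start:stop] = rank   # each rank fills its own block, no copies
    node_comm.Barrier()       # wait until all ranks have written

    if rank == 0:
        print("sum over shared array:", data.sum())

    node_comm.Barrier()       # nobody frees the window while others read
    win.Free()
```

You would call `demo()` at the end of the script and launch it with something like `mpiexec -n 4 python script.py`. All ranks on a node then see the same `data` buffer, so nothing is pickled or copied between them.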