Hi Simone,
I think that if you are limited in RAM per node, then using MPI is a mistake; you should instead go for a threading-based solution so that all the threads can pool the available RAM. Alternatively, maybe you are running Open MPI in hybrid mode (MPI between nodes plus threads within a node), but that seems unlikely from your comments. PyTables with NumExpr as the backend is the obvious solution here. Threads do not usually scale as nicely as processes, but their setup/teardown time is minimal and they share RAM. If your cluster nodes are SMP (symmetric multiprocessing, i.e. shared-memory) machines, then they are usually suitable for this sort of thing. PyTables is just generally a better HDF5 interface than h5py, in my opinion.
In particular, pay attention to the 'evaluate' functionality, which comes from the NumExpr library: https://github.com/pydata/numexpr
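As a rough sketch of what I mean (untested, and the file/node names are made up, so substitute your own), you let PyTables pull the arrays off disk and NumExpr do one threaded pass over them:

    import tables
    import numexpr as ne

    # Hypothetical file and node names -- replace with yours.
    with tables.open_file("stack.h5", mode="r") as h5:
        stack = h5.root.images.read()   # load the stack into shared RAM
        dark = h5.root.dark.read()
        flat = h5.root.flat.read()
        # One threaded pass over the data, no Python-level loop:
        corrected = ne.evaluate("(stack - dark) / flat")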
If those 16-core machines are dual-socket (two 8-core CPUs), then NumExpr probably will not scale well past 8 cores, but 8 cores is still better than 3. It's very difficult to beat NumExpr at image processing on the CPU in the Python landscape, because the bottleneck is the volume of data flowing between the CPUs and memory. I've used pyFFTW and NumExpr in combination to keep up with competitors who program in Fortran 90.
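If it does turn out to be dual-socket, you can pin NumExpr to one socket's worth of cores and benchmark before going wider. The pyFFTW combination looks something like this (a sketch; the stack here is a random stand-in for your data):

    import numpy as np
    import numexpr as ne
    import pyfftw
    from pyfftw.interfaces import numpy_fft

    ne.set_num_threads(8)             # one socket's worth of threads
    pyfftw.interfaces.cache.enable()  # reuse FFTW plans across calls

    stack = np.random.rand(64, 512, 512)      # stand-in for your image stack
    F = numpy_fft.fft2(stack, threads=8)      # threaded 2-D FFT of each plane
    power = ne.evaluate("real(F * conj(F))")  # threaded power spectrum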
If your data is compressible you could also look at zarr, which uses Blosc to compress NumPy arrays chunk-by-chunk and decompress them on the fly, effectively into CPU cache, for processing: https://github.com/zarr-developers/zarr-python
PyTables also includes Blosc support now.
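A minimal zarr sketch (assuming zarr v2 with numcodecs; the shapes and compressor settings are just placeholders to tune for your data):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    compressor = Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE)
    z = zarr.zeros((64, 512, 512), chunks=(1, 512, 512),
                   dtype="f4", compressor=compressor)
    z[:] = np.random.rand(64, 512, 512)  # compressed transparently on write
    plane = z[10]                        # decompressed transparently on read

    # The PyTables equivalent is a Filters object:
    #   tables.Filters(complib="blosc", complevel=5)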
Stuff like Dask and Hadoop is aimed at parallelizing algorithms in machine learning and similar fields; if you just want to do matrix algebra, they're probably sub-optimal. It comes down to: how big is each of your stacks? You said chunking isn't practical; why? (See the sketch below for what I mean by chunking.)
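In case it helps, here is a rough out-of-core chunking sketch with PyTables (node names invented), processing a large on-disk stack one slab at a time so RAM use stays bounded:

    import tables
    import numexpr as ne

    with tables.open_file("stack.h5", mode="r") as h5:
        images = h5.root.images  # hypothetical array node, shape (N, ny, nx)
        for start in range(0, images.shape[0], 100):
            slab = images[start:start + 100]    # read 100 planes from disk
            result = ne.evaluate("slab * 2.0")  # threaded work on the slab
            # ... write out or accumulate result here ...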
Robert