Hi.
Just use MPI with one mesh per core, and do not use OpenMP at all. If you really have only 8000 per mesh, it should run very fast. Out of interest, what are you modelling?
But it would be worth investigating whether it is faster with 8 meshes, as 8 of the cores are high performance and 4 are efficient (slow, I assume). It is also worth checking whether the performance cores are used first and, if not, whether you can assign the meshes only to the high performance cores and not the efficient ones. If that is too complicated, maybe make a few much smaller meshes and get the order right so that they are assigned to the efficient cores. They would then solve quicker, and the performance cores would not have to wait on the efficient cores. Just one idea off the top of my head. Interested to hear what the others say here.
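To show why the slow cores matter, here is a rough back-of-envelope sketch in Python. It assumes every mesh has to finish its time step before the next one starts, so the slowest mesh/core pair sets the pace. The total cell count and the relative speed of the efficient cores (0.5) are made-up numbers for illustration only, and the model ignores communication and memory bandwidth entirely, so treat it as a way to think about the problem rather than a prediction.

```python
def step_time(mesh_cells, core_speeds):
    """Idealised time for one synchronised step.

    Each mesh runs on its own core; all meshes must finish before the
    next step, so the slowest mesh/core pair determines the step time.
    """
    if len(mesh_cells) != len(core_speeds):
        raise ValueError("need exactly one core per mesh")
    return max(cells / speed for cells, speed in zip(mesh_cells, core_speeds))


# Hypothetical numbers, not taken from any real case or CPU.
TOTAL_CELLS = 96_000
P_SPEED = 1.0   # relative speed of a performance core (assumed)
E_SPEED = 0.5   # relative speed of an efficient core (assumed)

# Option A: 8 equal meshes, performance cores only.
time_a = step_time([TOTAL_CELLS / 8] * 8, [P_SPEED] * 8)

# Option B: 12 equal meshes, 8 performance + 4 efficient cores.
speeds_12 = [P_SPEED] * 8 + [E_SPEED] * 4
time_b = step_time([TOTAL_CELLS / 12] * 12, speeds_12)

# Option C: 12 meshes sized in proportion to core speed, listed in the
# same order as the cores so the small meshes land on the efficient cores.
total_speed = sum(speeds_12)
sizes_c = [TOTAL_CELLS * s / total_speed for s in speeds_12]
time_c = step_time(sizes_c, speeds_12)

print(f"A: 8 equal meshes on P cores   -> {time_a:.0f}")
print(f"B: 12 equal meshes on all cores -> {time_b:.0f}")
print(f"C: 12 weighted meshes           -> {time_c:.0f}")
```

With these assumed numbers, 12 equal meshes come out slower than 8 meshes on the performance cores alone, because everything waits on the efficient cores, while the weighted split comes out fastest. Whether that holds in practice depends on how the meshes actually get mapped to cores and on the real speed difference, so it needs to be measured.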
Rob