mpi4py.futures does indeed have an executor setup cost (spawning the worker processes with MPI), while multiprocessing typically uses fork() (the default start method on Linux, and on macOS before Python 3.8), which is way faster than launching brand-new processes from scratch, even more so when each new process has to start a Python runtime.
How much time does each of your individual tasks require to run? If your individual tasks run very quickly, and you just have a few of them, then the setup cost of mpi4py.futures will hit you. Otherwise, mpi4py.futures should have negligible overhead compared to multiprocessing. If that's not the case, my guess is that something is wrong on your side with MPI. But without more information, I can only guess. You should really provide reproducing code, defining tasks that take approximately the same time to run as in your real application, but with a simple time.sleep() call to introduce an artificial delay. That way I can test the mpi4py.futures and multiprocessing versions of your code on my side, and try to figure out how things behave.
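
Something along these lines would do as a reproducer. This is only a minimal sketch: the task count, sleep duration, worker count, and the helper names `task` and `timed` are placeholders I made up, so adjust them to match your real application. The timing deliberately includes executor creation, since that is where the setup cost shows up.

```python
import time
from multiprocessing import Pool

# Placeholder values (assumptions): set these to match your real workload.
N_TASKS = 8
TASK_SECONDS = 0.1
N_WORKERS = 4


def task(seconds):
    """Stand-in for a real task: sleep for the given duration."""
    time.sleep(seconds)
    return seconds


def timed(label, make_executor, durations):
    """Time executor creation plus mapping `task` over `durations`."""
    start = time.perf_counter()
    with make_executor() as executor:
        results = list(executor.map(task, durations))
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s for {len(results)} tasks")
    return elapsed, results


if __name__ == "__main__":
    durations = [TASK_SECONDS] * N_TASKS

    timed("multiprocessing", lambda: Pool(N_WORKERS), durations)

    try:
        from mpi4py.futures import MPIPoolExecutor
    except ImportError:
        print("mpi4py is not installed; skipping the mpi4py.futures run")
    else:
        timed(
            "mpi4py.futures",
            lambda: MPIPoolExecutor(max_workers=N_WORKERS),
            durations,
        )
```

Run it once with plain `python` and once under your MPI launcher, and post both timings together with your MPI implementation and version. If the mpi4py.futures number stays much larger than the multiprocessing one even after raising TASK_SECONDS and N_TASKS, that points at the MPI setup rather than at the executor startup.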