There really isn't a great way to do this. You've found the tools we usually recommend; the deeper issue is that Hadoop simply isn't designed to be tuned along this axis.
It would be nice to be able to say "use exactly N mappers for this job," but that isn't always possible, because the input format has a say in how the data is partitioned (in fact, as far as I know, it has complete control over that).
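For what it's worth, the knobs that do exist are hints to the split computation rather than hard limits. For a FileInputFormat-based job, a per-job configuration along these lines can nudge the mapper count (property names vary by Hadoop version, and the values here are purely illustrative):

```xml
<!-- a hint only: the InputFormat decides the real split count -->
<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
</property>
<!-- raising the minimum split size reduces the number of splits,
     and therefore the number of mappers -->
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```

In practice the split-size properties are the more reliable lever, since the mapper-count hint is commonly ignored when splits are computed from file blocks.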
Lastly, forcing a fixed mapper count goes a *bit* against the idea of a mapper, which is supposed to be the trivially parallelizable portion of your code. To minimize latency you basically want as many mappers as possible; but because each mapper carries a fixed startup cost, there is some optimal number that minimizes total cost, if you knew the trade-off between startup overhead and per-mapper work.
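To make that trade-off concrete, here's a toy model (my own illustration, not anything Hadoop computes for you): assume a fixed per-mapper startup cost, a total amount of map work, and a fixed number of slots, so mappers run in waves. Latency falls as you add mappers up to the slot count, while total machine-time rises with every extra startup:

```python
import math

def latency(n_mappers, work, startup, slots):
    """Wall-clock time: mappers run in waves of at most `slots` at a time,
    each wave paying the startup cost plus its share of the work."""
    waves = math.ceil(n_mappers / slots)
    per_mapper = work / n_mappers
    return waves * (startup + per_mapper)

def machine_cost(n_mappers, work, startup):
    """Total machine-seconds: the work itself plus one startup per mapper."""
    return work + n_mappers * startup

# e.g. 3600s of total map work, 30s startup per mapper, 20 slots:
best = min(range(1, 201),
           key=lambda n: latency(n, work=3600, startup=30, slots=20))
# with these numbers the latency-optimal choice is one full wave (n = 20),
# while machine_cost keeps climbing as n grows
```

The numbers are made up, but the shape is the point: past one wave per slot, extra mappers only add startup overhead.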
Anyway, we don't have anything particularly good for this right now.