There really isn't a great way to do this. You've found the tools we usually recommend; the deeper issue is that Hadoop simply isn't designed to be tuned along this axis.
It would be nice to be able to say "use exactly N mappers for this job," but that isn't always possible, because the input format has a say in how the data is partitioned (in fact, as far as I know, it has complete control over that).
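For what it's worth, the knobs that do exist are hints to the split computation rather than hard limits. For a FileInputFormat-based job, a per-job configuration along these lines can nudge the mapper count (property names vary by Hadoop version, and the values here are purely illustrative):

```xml
<!-- a hint only: the InputFormat decides the real split count -->
<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
</property>
<!-- raising the minimum split size reduces the number of splits,
     and therefore the number of mappers -->
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```

In practice the split-size properties are the more reliable lever, since the mapper-count hint is commonly ignored when splits are computed from file blocks.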
Lastly, forcing a fixed mapper count goes a *bit* against the idea of a mapper, which is supposed to be the trivially parallelizable portion of your code. To minimize latency you basically want as many mappers as possible; but because each mapper carries a fixed startup cost, there is some optimal number that minimizes total cost, if you knew the trade-off between startup overhead and per-mapper work.
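To make that trade-off concrete, here's a toy model (my own illustration, not anything Hadoop computes for you): assume a fixed per-mapper startup cost, a total amount of map work, and a fixed number of slots, so mappers run in waves. Latency falls as you add mappers up to the slot count, while total machine-time rises with every extra startup:

```python
import math

def latency(n_mappers, work, startup, slots):
    """Wall-clock time: mappers run in waves of at most `slots` at a time,
    each wave paying the startup cost plus its share of the work."""
    waves = math.ceil(n_mappers / slots)
    per_mapper = work / n_mappers
    return waves * (startup + per_mapper)

def machine_cost(n_mappers, work, startup):
    """Total machine-seconds: the work itself plus one startup per mapper."""
    return work + n_mappers * startup

# e.g. 3600s of total map work, 30s startup per mapper, 20 slots:
best = min(range(1, 201),
           key=lambda n: latency(n, work=3600, startup=30, slots=20))
# with these numbers the latency-optimal choice is one full wave (n = 20),
# while machine_cost keeps climbing as n grows
```

The numbers are made up, but the shape is the point: past one wave per slot, extra mappers only add startup overhead.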
Anyway, we don't have anything particularly good for this right now.