I know we had to bump the local storage on the worker nodes a while
back to 1 TB each in order for some queries to succeed. Since having
made that change, we haven't had any similar problems. Is that the
local storage you mention that could eliminated or reduced by using
Celeborn? If so, I suspect our IT department would like it because
they could reclaim some of the added storage that goes unused most
of them time. :)
Hive-MR3 writes a lot of data to local disks, especially for shuffling. The story goes like:
1. A mapper writes its output to local disks.
2. A reducer fetches the mapper output via shuffle servers running inside the mapper, and stores it on its local disks.
3. The same reducer reads this data to produce its output.
Please see the attached image (vm3.png) which assumes that local disks of worker nodes are mapped to SAN. Then, the output of a mapper is stored twice on local disks, which incur three network transfers to/from SAN and one network transfer between mapper and reducer. Next, the reducer reads the mapper output, which incurs another transfer from SAN.
Now, please see the diagram at this page:
By using Celeborn and enabling an option that skips 3 in the diagram, we can eliminate part of the path (2). Please see another attached diagram vm4celeborn.png. This is the first advantage of using Celeborn.
Another advantage is that worker nodes need much smaller local disk storage because most of (but not all of) local data is sent to Celeborn. Thus, for example, a worker node migh be fine with 50GB local storage, while you can attach 10TB local storage to Celeborn which will be effectively shared by all worker nodes.
The disadvantage is:
1. You need to learn to operate Celeborn master and workers.
2. Celeborn workers usually assume physical local disk storage. In you case, it would not be physical local storage, so there will be hidden overhead.
3. Celeborn has bugs which we discovered under stress testing. For executing ordinary queries, hopefully this should not be a problem in practice.
4. We don't have a use case of Hive-MR3-Celeborn. A guy in the MR3 slack tested it in his company and he didn't report any problem.
In summary, if you find that local disk reads/writes are a serious performance bottleneck, Celeborn (remote shuffle service) can be an option to consider.
-- Sungwoo