For this problem, the solution is to set hive.exec.orc.split.strategy to ETL.
1. If
hive.exec.orc.split.strategy is set to BI, split computation uses a single thread, so it takes a lot of time to finish split computation if the input data is stored across thousands of files.
2. If
hive.exec.orc.split.strategy is set to ETL, split computation uses multiple threads as specified by hive.orc.compute.splits.num.threads. The default value for hive.orc.compute.splits.num.threads is 10, and I observe that increasing the value to 20 or higher reduces the time for split computation quite a bit when there are 30,000 input files.
3. If hive.exec.orc.split.strategy is set to HYBRID, split computation mixes BI and ETL. However, the percentage of ETL can be as small as 1 percent, which effectively makes hive.orc.compute.splits.num.threads irrelevant. This is the default value in Hive-MR3 configurations, and my guess is that this is the reason why your query takes a long time in the Initializing state.
So summarize, in order to reduce the time for split computation in the Initializing state, set
hive.exec.orc.split.strategy to ETL and increase the value for
hive.orc.compute.splits.num.threads as necessary. It's too bad that I assumed HYBRID would put a lot more weight on ETL than on BI :-(
I'd appreciate it if you could report new experiment results here.
Cheers,
-- Sungwoo