Just for fun, I'll share a little case study for discussion. I have a SQL query that I use to determine exactly when the cumulative sales amount exceeds 10 billion, and when it exceeds 50 billion. (Let's not get hung up on the correctness of the statistical logic.)
I tested with HIVE (HDP 3.1.5.26-1), HIVE-MR3, and SPARK (KYUUBI 1.8.0 + 3.4.1) respectively.
The time spent on each of the three engines is as follows:
HIVE:95.209 SEC (197G MEM PRE-JOB)
HIVE-MR3: 93.988 SEC (419G MEM LLVM)
SPARK-KYUUBI: 207.63 SEC (1TB MEM LLVM)
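For context, this kind of query is essentially a running-total-over-ordered-rows problem. Here is a minimal sketch of the idea using Python and SQLite (the table name, columns, and toy amounts are mine for illustration, not the actual production SQL):

```python
import sqlite3

# Hypothetical sales table; schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 4), ("2024-01-02", 5), ("2024-01-03", 3),
     ("2024-01-04", 7), ("2024-01-05", 40)],  # amounts in billions
)

# Running total ordered by date, then the first date each threshold is crossed.
row = conn.execute("""
    WITH running AS (
        SELECT sale_date,
               SUM(amount) OVER (ORDER BY sale_date) AS cum_amount
        FROM sales
    )
    SELECT MIN(CASE WHEN cum_amount > 10 THEN sale_date END) AS crossed_10,
           MIN(CASE WHEN cum_amount > 50 THEN sale_date END) AS crossed_50
    FROM running
""").fetchone()
print(row)  # ('2024-01-03', '2024-01-05')
```

The `SUM(...) OVER (ORDER BY ...)` window is the part that forces the engine to sort the whole dataset, which is exactly where the three engines diverge.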
The reason for this discrepancy, I believe, is the difference in sorting strategy.
MR3/TEZ uses merge sort and performs very well in sorting-related scenarios; it combines performance with stability.
SPARK uses global sorting, and when tasks are skewed, its performance is very poor.
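To illustrate the skew point: a global sort range-partitions the data so each task sorts one key range, and a hot key range lands entirely on one task. A toy simulation (the key distribution and partition boundaries are my own, purely illustrative):

```python
import bisect
import random

random.seed(0)
# 100k keys, 90% of them concentrated in the narrow range 0..9 -> skewed.
keys = [random.randint(0, 9) for _ in range(90_000)] + \
       [random.randint(10, 99) for _ in range(10_000)]

# Range partitioning with evenly spaced boundaries over key space 0..99,
# as a naive partitioner might pick; identical keys can never be split
# across partitions, so the hot range stays on one task.
boundaries = [25, 50, 75]   # 4 partitions
sizes = [0, 0, 0, 0]
for k in keys:
    sizes[bisect.bisect_right(boundaries, k)] += 1

print(sizes)  # partition 0 holds almost everything; its sort task dominates runtime
```

Spark does sample the data to choose boundaries, but when a small key range carries most of the rows, the task handling that range still does most of the sorting work.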
In addition, many MPP-type databases have the same problem in scenarios where memory is insufficient.
In unordered scenarios, SPARK and MPP databases can achieve very good efficiency, but in ordered scenarios the gap between them is as described above.
In daily ETL work we often deal with ordered joins, sorted partitioning, sort-based calculations, TOP-N, and similar scenarios, and in these cases we frequently face slow performance when using Spark (partition skew).
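As one concrete example of this workload class, a per-group TOP-N is typically written with a ranking window function, and the `PARTITION BY ... ORDER BY` clause is exactly where a per-partition sort (and therefore a skewed sort task) arises. A small sketch, again with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("east", 5), ("east", 9), ("east", 1),
    ("west", 7), ("west", 2), ("west", 8),
])

# Top-2 amounts per region; each region's rows are sorted independently,
# so one huge region means one slow task.
top2 = conn.execute("""
    SELECT region, amount FROM (
        SELECT region, amount,
               ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
        FROM orders
    ) WHERE rn <= 2
    ORDER BY region, amount DESC
""").fetchall()
print(top2)  # [('east', 9), ('east', 5), ('west', 8), ('west', 7)]
```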
If users are not aware of this, they will surely face many scenarios of degraded SQL performance when they switch to Spark/MPP, and this problem is not easy to solve once the data volume is large enough.
It is also hard to ask other departments to change their SQL code with you, especially if your users are data analysts without much programming experience, so I'm sure you'd be in big trouble.
So, with all due respect, I personally don't agree that Spark is the de facto standard for large-scale data processing. What I think is strong about Spark right now is its very broad ecosystem.