A little slow-query case


Ill

Nov 12, 2023, 5:44:37 AM11/12/23
to MR3
Just for fun, I'll share a little case study for discussion. I have a SQL query that finds exactly when the cumulative sales amount exceeds 10 billion, and when it exceeds 50 billion. (Let's not get hung up on whether the statistical logic is correct.)
select
  min(if(total >= 10000000000, statistical_dimension, '2224-10-31 20:00:00')) as over10000000000,
  min(if(total >= 50000000000, statistical_dimension, '2224-10-31 20:00:00')) as over50000000000
from (
  select
    sum(payment) over (order by statistical_dimension asc rows between unbounded preceding and current row) as total,
    statistical_dimension
  from (
    select
      payment,
      concat(substr(pay_time, 1, 17), '00') as statistical_dimension
    from dw_base.b_std_trade
    where pay_time >= '2023-10-31 20:00:00'
      and part in ('2023-11', 'active')
  ) small
) s1;
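For anyone who wants to try the logic locally, here is a scaled-down toy reproduction of the same running-sum-plus-threshold query using Python's built-in sqlite3 (window functions need SQLite >= 3.25). The table, payment values, and threshold are made-up stand-ins, not the original data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (payment INTEGER, statistical_dimension TEXT)")
conn.executemany(
    "INSERT INTO trade VALUES (?, ?)",
    [(4, "2023-10-31 20:00:00"),   # running total 4
     (5, "2023-10-31 20:10:00"),   # running total 9
     (3, "2023-10-31 20:20:00"),   # running total 12 -> first crosses 10 here
     (6, "2023-10-31 20:30:00")],  # running total 18
)
# Same shape as the query above: cumulative sum in a window, then min()
# over a sentinel-dated CASE to find the first crossing timestamp.
row = conn.execute("""
    SELECT min(CASE WHEN total >= 10 THEN statistical_dimension
               ELSE '2224-10-31 20:00:00' END) AS over10
    FROM (SELECT sum(payment) OVER (ORDER BY statistical_dimension ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW) AS total,
                 statistical_dimension
          FROM trade) s1
""").fetchone()
print(row[0])  # -> 2023-10-31 20:20:00
```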
I tested with HIVE (HDP 3.1.5.26-1), HIVE-MR3, and SPARK (KYUUBI 1.8.0 + 3.4.1). The time spent on each of the three engines is as follows:
HIVE:95.209 SEC (197G MEM PRE-JOB)
HIVE-MR3: 93.988 SEC (419G MEM LLVM)
SPARK-KYUUBI: 207.63 SEC (1TB MEM LLVM)

The reason for this discrepancy, I believe, is the difference in sort ordering. MR3/TEZ uses merge sort and performs very well in sorting-related scenarios; it combines performance and stability. SPARK uses global sorting, and when tasks are skewed, its performance is very poor.
In addition, many MPP-type databases have the same problem in scenarios where memory is insufficient. In unordered scenarios, SPARK and MPP databases can achieve very good efficiency, but in ordered scenarios the gap between them is as described above. In daily ETL work we are often involved in ordered joins, sorted partitioning, sorted computation, TOP-N, and similar scenarios, and in those cases we frequently face slow performance when using Spark (partition skew). Users who are not aware of this will surely face many scenarios of degraded SQL performance when they switch to Spark/MPP, and this problem is not easy to solve once the data volume is large enough. It is also hard to ask other departments to change their SQL code with you, especially if your users are data analysts without much programming experience, so you are in big trouble.

So, with all due respect, I personally don't agree that Spark is the de facto standard for large-scale data processing. What I think is strong about Spark right now is its very broad ecosystem.
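The skew problem described above can be sketched with some toy arithmetic. This is not Spark code, just an illustration of why one dominant key makes a globally partitioned sort stage run at the speed of its largest task (the key names and partitioner are hypothetical):

```python
from collections import Counter

# Toy rows: one "hot" key dominates, as in a skewed sort/partition key.
rows = ["hot"] * 9000 + ["a"] * 500 + ["b"] * 500

def part(key, n=4):
    # Deterministic stand-in partitioner: each key maps to one task.
    return sum(map(ord, key)) % n

partitions = Counter(part(k) for k in rows)
sizes = sorted(partitions.values(), reverse=True)
print(sizes)  # -> [9000, 500, 500]

# The largest task holds 90% of the rows, so the stage's runtime is
# bounded by that one straggler, not by the total data volume.
```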

Sungwoo Park

Nov 12, 2023, 6:17:02 AM11/12/23
to MR3
Thank you for sharing the story. I would like to add that with the Tez runtime (in HDP and Hive-MR3), one can also partially emulate Spark's shuffle behavior by setting:

tez.runtime.pipelined-shuffle.enabled to true
tez.runtime.enable.final-merge.in.output to false

With the above settings, mappers send their output to reducers without performing a final merge sort, similarly to Spark.
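For a quick experiment, the two properties above can be applied per session before running the query (assuming your installation allows overriding them at session level):

```sql
SET tez.runtime.pipelined-shuffle.enabled=true;
SET tez.runtime.enable.final-merge.in.output=false;
```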