Hi,
I have found that when Hive on MR3 executes SQL that computes deduplication statistics, its performance is consistently poor.
I prepared a table with 200 million rows, as shown below:
First, I used Hive on MR3 to run a grouped deduplication query on this table (the table is stored as ORC). The execution time is as follows:
The resources used by MR3 are shown in the following figure:
Then I executed the same SQL using Apache Kyuubi with Spark 3.3.2. The execution time is as follows:
The resources used by Kyuubi with Spark are shown in the following figure:
With Kyuubi and Spark, I used only 1/10 of the resources that MR3 used, yet execution efficiency improved by nearly 40%.
I suspect there is a problem with how I am using MR3, and that its parameters need some tuning. Based on my previous experience, MR3 should not be this much less efficient than Spark.
Can you help me?