Comparison between Hive-Tez and Hive-MR3


Sungwoo Park

Oct 25, 2023, 6:02:58 AM
to MR3
There was a question on the compatibility between Hive-Tez in HDP and Hive-MR3, so we compared the two systems using the TPC-DS benchmark.

--- Metastore
We use Metastore run by HDP 3.1.4.

--- Hive-Tez
We use HiveServer2 run by HDP 3.1.4 which executes Hive-Tez.
We allocate 8GB to each container.
(Too many queries fail if only 4GB is allocated to each container.)
hive.auto.convert.join.noconditionaltask.size is set to the default value of 1145044992 in Hive-Tez.

--- Hive-MR3
We use the release candidate of Hive 3 on MR3 1.8 (with Celeborn support).
We allocate 4GB to each task and 72GB to each container.
hive.auto.convert.join.noconditionaltask.size is set to the default value of 4000000000 in Hive-MR3, except for query 23 which uses 850000000.
 
--- Dataset
We first generate a dataset of 1TB TPC-DS in text format.
Then we use Hive-Tez (not Hive-MR3) to create an ORC dataset by executing 'create table XXX as select * from tpcds_text_1000.XXX' without partitioning.
Note that Hive-MR3 is not involved at all in data generation.
See the attached SQL script for loading the ORC dataset.
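
As a rough illustration of the loading step, the following Python sketch prints one statement per table using the statement quoted above. The table list is only a small subset of the standard TPC-DS tables, and the helper names are hypothetical. The quoted statement does not itself name ORC, so how the file format is selected is not shown here. The attached SQL script remains the authoritative version.

```python
# Sketch only: generate the 'create table XXX as select ...' statements quoted in the post.
# SOURCE_DB comes from the post; TABLES is an illustrative subset of TPC-DS table names.
SOURCE_DB = "tpcds_text_1000"
TABLES = ["store_sales", "store_returns", "item", "date_dim"]


def ctas_statement(table, source_db=SOURCE_DB):
    """Return the unpartitioned CTAS statement for a single table."""
    return f"create table {table} as select * from {source_db}.{table};"


def ctas_script(tables=TABLES):
    """Join the per-table statements into one script, one statement per line."""
    return "\n".join(ctas_statement(t) for t in tables)


print(ctas_script())
```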

--- Result
We run 99 modified TPC-DS queries found in the directory hive/benchmarks/hive-testbench/sample-queries-tpcds-hive4/ of the MR3 release.
 
See the attached Excel table.

Total running time of Hive-Tez: 10514.34 seconds
Total running time of Hive-MR3: 2356.366 seconds
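
For reference, the ratio between the two totals can be checked with a few lines of Python. This is plain arithmetic on the two numbers reported above; the variable names are only illustrative.

```python
# Total running times reported in the post, in seconds.
hive_tez_total = 10514.34
hive_mr3_total = 2356.366

# Ratio of total running times (Hive-Tez over Hive-MR3).
speedup = hive_tez_total / hive_mr3_total
print(f"Hive-Tez / Hive-MR3 total running time: {speedup:.2f}")
```

The ratio comes out to about 4.46, that is, Hive-MR3 finishes the sequential run in less than a quarter of the time taken by Hive-Tez.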

--- Correctness
Hive-Tez and Hive-MR3 agree on the results of all queries except query 65 and query 70.

Query 65: Hive-Tez 1830 rows vs Hive-MR3 1820 rows
The difference is perhaps due to rounding errors, so we can ignore the difference.

Query 70: Hive-Tez 17 rows vs Hive-MR3 56 rows
All the rows from Hive-Tez are included in the result from Hive-MR3.
The difference is likely to be due to a bug in Hive-Tez.
For example, when tested with 10TB TPC-DS, we see the following results from query 70:
    Hive-LLAP   --> 25 rows
    Spark 3.4.0 --> 124 rows
    Hive-MR3    --> 124 rows
    Trino 418   --> 124 rows
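
The containment claim for query 70 (every Hive-Tez row also appears in the Hive-MR3 result) is the kind of check that can be sketched as follows. This is not the script used for the comparison. It assumes each result has already been loaded as a list of tuples, and the rows in the example are made up for illustration.

```python
from collections import Counter


def missing_rows(smaller, larger):
    """Return rows of `smaller` that have no match in `larger`.

    Rows are compared as multisets, so a row that occurs twice in
    `smaller` must also occur at least twice in `larger`.
    """
    leftover = Counter(smaller) - Counter(larger)
    return list(leftover.elements())


# Hypothetical rows, not actual query 70 output.
tez_rows = [("TN", "Williamson County", 1)]
mr3_rows = [("TN", "Williamson County", 1), ("TN", "Ziebach County", 2)]

print(missing_rows(tez_rows, mr3_rows))  # empty list: every Tez row is in the MR3 result
```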
1000orchdp.sql
hivemr3.release1.8.compare.hivetez.xlsx

Sungwoo Park

Oct 25, 2023, 12:10:38 PM
to MR3
Continuing the comparison between Hive-Tez and Hive-MR3, here is the result of concurrent tests.

We submit the first 50 queries of TPC-DS (query 1 to query 50) at once. We measure the execution time of the longest running query, which is always query 23 (consisting of query 23-1 and query 23-2) for both Hive-Tez and Hive-MR3.

Hive-Tez:
2906.999 seconds (query 23-1: 2254.654 seconds + query 23-2: 652.345 seconds)
A total of 50 Tez DAGAppMasters, each with 4GB, are created.

Hive-MR3: 
first batch:  792.962 seconds (608.746 + 184.216)
second batch:  828.303 seconds (639.835 + 188.468)
third batch:  809.656 seconds (620.914 + 188.742)
We allocate 64GB memory to MR3 DAGAppMaster.

We see that Hive-MR3 yields about 3.5 times higher throughput than Hive-Tez (2906.999 seconds vs. about 790 to 830 seconds per batch).
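
The ratio can be reproduced from the query 23 timings listed above. The sketch below only redoes the arithmetic, comparing the Hive-Tez time against each Hive-MR3 batch separately. The variable names are illustrative.

```python
# Query 23 timings from the post, in seconds: (query 23-1, query 23-2).
tez_q23 = (2254.654, 652.345)
mr3_batches = {
    "first": (608.746, 184.216),
    "second": (639.835, 188.468),
    "third": (620.914, 188.742),
}

tez_total = sum(tez_q23)

# Ratio of the Hive-Tez time to the Hive-MR3 time, per batch.
ratios = {name: tez_total / sum(parts) for name, parts in mr3_batches.items()}

for name, ratio in ratios.items():
    print(f"{name} batch: {sum(mr3_batches[name]):.3f} s, Hive-Tez / Hive-MR3 = {ratio:.2f}")
```

Each batch gives a ratio between 3.5 and 3.7, which is consistent with the figure of about 3.5 times.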

--- Sungwoo