Bug in Hive-MR3: tblproperties with "skip.header.line.count" does not work on compressed text files


Sungwoo Park

Oct 14, 2022, 8:38:11 AM
to MR3
Currently Hive-MR3 does not correctly read compressed text files into tables with a "skip.header.line.count" property.

This bug was introduced in HIVE-21924 (https://issues.apache.org/jira/browse/HIVE-21924). It was later discovered and addressed in HIVE-22769 (https://issues.apache.org/jira/browse/HIVE-22769), along with its two subtasks HIVE-24224 and HIVE-24381. We backported HIVE-22769 and HIVE-24224 in MR3 1.3, but did not backport HIVE-24381. (So, Hive-MR3 1.2 does not show this problem.)

Until the release of MR3 1.6, please do not use compressed text files in tables whose tblproperties set "skip.header.line.count". The Docker image hive3:1.6-SNAPSHOT has backported HIVE-24381.
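
One possible stopgap, which is not from the original post and is only a sketch, is to strip the header lines from each gzip-compressed file before loading it, so that the table can be defined without "skip.header.line.count". The function name strip_header and its parameters are hypothetical, and the sketch assumes gzip compression and UTF-8 text.

```python
import gzip
import itertools


def strip_header(src_path, dst_path, header_lines=1):
    """Copy a gzip-compressed text file, dropping its first header_lines lines.

    Hypothetical helper: it assumes gzip compression and UTF-8 text.
    newline="" keeps the original line endings untouched.
    """
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        # islice skips the header lines and streams the rest line by line,
        # so the whole file never has to fit in memory.
        for line in itertools.islice(src, header_lines, None):
            dst.write(line)
```

The rewritten file can then back a table that has no "skip.header.line.count" property, which avoids the code path affected by this bug.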

Thanks to David for reporting this problem.

Cheers,

--- Sungwoo




Carol Chapman

Oct 18, 2022, 4:19:45 AM
to MR3
Thanks.

Carol Chapman

Oct 18, 2022, 4:31:17 AM
to MR3
By the way, according to this blog post (Performance Evaluation of Spark 2, Spark 3, Hive-LLAP, and Hive on MR3 | DataMonad), Hive 4.0 shows some performance improvement over Hive 3.0. Are there any relevant test results? Although Hive 4.0 has not been officially released, we would like to get a glimpse of the improvement.


Sungwoo Park

Oct 20, 2022, 3:56:45 AM
to MR3
We compared the performance of Hive 3 on MR3 vs Hive 4 on MR3 in April 2020. Hive 4 was generally faster than Hive 3 at that time. Please see the blog article:


With the impending release of Hive 4 (its release tag changed from 4.0.0-alpha2-SNAPSHOT to 4.0.0-SNAPSHOT yesterday), we are also stabilizing Hive 4 on MR3. From our evaluation, Hive 4 (on Tez) is quite unstable. In April 2020, Hive 4 completed all 99 TPC-DS queries successfully, but the latest commit in the master branch fails to complete some TPC-DS queries or returns wrong results. If anyone wants to try Hive 4 on MR3 before the release of Hive 4, please let us know.

Cheers,

--- Sungwoo

Carol Chapman

Oct 20, 2022, 6:28:56 AM
to MR3
That sounds great. If the latest Hive 4 on MR3 becomes available, we would be glad to try it.

Sungwoo Park

Oct 22, 2022, 11:09:28 AM
to MR3
In its current release, Hive 4 is a huge setback in comparison with Hive 3. See the Jira ticket I created which reports all the failing TPC-DS queries when tested with Hive 4 on Tez.


This is really bad because Hive 3 completes all the 99 TPC-DS queries producing correct results (when cross-checked with SparkSQL and Presto). It seems that the Hive community does not even run system tests (e.g., using TPC-DS).

When Hive 4 on MR3 returns correct results on the TPC-DS benchmark, we will release it. Until all the bugs reported in the above Jira ticket are fixed, we are going to disable all the (buggy) optimizations.

Cheers,

--- Sungwoo

Sungwoo Park

Oct 25, 2022, 2:27:46 PM
to MR3
I have uploaded Hive 4 on MR3 using the latest release in the master branch of Apache Hive (as of yesterday).


Here are a few comments.

1. I disabled several optimizations implemented after April 2020, which are buggy and do not bring significant performance improvement. You can check the last section in conf/tpcds/hive4/hive-site.xml to see the optimizations that are disabled (e.g., hive.optimize.shared.work.dppunion).

2. I tested with TPC-DS 1TB ORC and all 99 queries return correct results.

3. When tested with TPC-DS 1TB, Hive 4 on MR3 is no faster than Hive 3 on MR3 on average (mostly because of the increased time in compiling queries). This is bad news because when we tested in April 2020, Hive 4 on MR3 was noticeably faster than Hive 3 on MR3. I suspect that several optimizations implemented after that actually degraded the performance.

4. I tested with TPC-DS 1TB using ZSTD, and found no big difference from Zlib. The uploaded release does not support ZSTD yet because of a bug in the ORC library. (If interested, please see https://issues.apache.org/jira/browse/HIVE-26668)
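
For readers who want to see which optimizations are switched off (comment 1 above), the following is a minimal sketch that lists every property set to false in a Hadoop-style configuration file such as conf/tpcds/hive4/hive-site.xml. The function name disabled_properties and the SAMPLE text are hypothetical; the sample values are illustrative and not copied from the actual file.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for hive-site.xml; the values are assumptions.
SAMPLE = """
<configuration>
  <property>
    <name>hive.optimize.shared.work.dppunion</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>
</configuration>
"""


def disabled_properties(hive_site_xml):
    """Return the names of all properties whose value is 'false'."""
    root = ET.fromstring(hive_site_xml)
    disabled = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name and value == "false":
            disabled.append(name)
    return disabled


if __name__ == "__main__":
    for name in disabled_properties(SAMPLE):
        print(name)
```

To inspect a real file, pass its contents instead of SAMPLE, for example disabled_properties(open("conf/tpcds/hive4/hive-site.xml").read()). Note that this lists every false-valued property in the file, not only the optimizations in its last section.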

Cheers,

-- Sungwoo

Ill

Dec 1, 2022, 5:07:26 AM
to MR3
Hi,
I found that the code of Hive on MR3 was updated three days ago, and the problem was fixed.
When can MR3 release a hotfix version?
Thanks!

Sungwoo Park

Dec 1, 2022, 5:38:11 AM
to Ill, MR3
Which problem are you referring to? If you are using Hive-MR3 on Kubernetes, the Docker image (mr3project/hive3:1.6-SNAPSHOT) has already been updated.

--- Sungwoo


Sungwoo Park

Dec 9, 2022, 12:15:37 AM
to MR3
I have released hivemr3-1.6-SNAPSHOT-hive3.1.3-k8s.tar.gz which is built using the latest source code of Hive-MR3.


Hive 3 on MR3 1.6-SNAPSHOT includes HIVE-23953 (https://issues.apache.org/jira/browse/HIVE-23953).

Cheers,

--- Sungwoo
