Hive 4 and Iceberg


Sungwoo Park

Jan 25, 2023, 7:09:14 AM
to MR3
Hive 4 is nearing its initial release (probably in the middle of this year), and one of its key features is support for Iceberg. (Hive 3 also supports Iceberg, but only to a limited degree.) We have been stabilizing Hive 4 on MR3, but chose not to release it because of rather serious bugs in Hive 4.

To be specific, we found a few bugs in the query optimizer (introduced after 2020) which caused a few TPC-DS queries to either fail or return wrong results. This is really bad news because Hive 3 (whether on Tez or on MR3) returns correct results on all 99 TPC-DS queries. These bugs were reported in the following JIRA ticket, and we have also figured out workarounds for some of these problems.


The good news is that Seonggon (of the MR3 team) has fixed the remaining bugs, and now Hive 4 returns correct results on all 99 TPC-DS queries, both from ORC tables and from Iceberg tables. Seonggon created two JIRA tickets:


So, if you would like to try Hive 4 on MR3, please let us know, and we will release the current build of Hive 4 on MR3. This is especially worthwhile if you are expecting a performance boost from using Iceberg on S3 for storage. Migrating from Hive 3 to Hive 4 is quite simple.

Sungwoo

David Engel

Jan 25, 2023, 2:24:56 PM
to Sungwoo Park, MR3
Hi Sungwoo,

I've put Apache Iceberg on my TODO list of things to try to look into.
Am I correct in assuming that using Iceberg with Hive 4 would address
the performance issues we've discussed before that are caused by S3's
lack of a move/rename operation?

David
--
David Engel
da...@istwok.net

Sungwoo Park

Jan 25, 2023, 9:51:59 PM
to David Engel, MR3
> I've put Apache Iceberg on my TODO list of things to try to look into.
> Am I correct in assuming that using Iceberg with Hive 4 would address
> the performance issues we've discussed before that are caused by S3's
> lack of a move/rename operation?

From my understanding (and also from my own testing), that's an advantage that Iceberg provides when using S3. Moreover, split computation no longer requires directory listing and query compilation can be noticeably faster, especially if there are lots of sub-directories.

Sungwoo

Sungwoo Park

Jan 27, 2023, 12:04:06 PM
to David Engel, MR3
I have uploaded a new Docker image mr3project/hive4:1.7-SNAPSHOT. I have also updated the Git repo mr3project/mr3-run-k8s which sets additional configurations in hive-site.xml for running Hive 4. You can reuse existing settings for running Hive 3, except:

1) Docker image (which should be mr3project/hive4:1.7-SNAPSHOT)
2) Metastore database

All of the following operations execute successfully:

1) loading ORC managed tables from external Text tables
2) loading Iceberg tables from ORC tables
3) converting existing non-transactional ORC tables to Iceberg tables
4) loading ORC Iceberg tables from Text tables
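
As an illustration of operation 2), here is a minimal sketch of loading an Iceberg table from an existing ORC table. The table names (tpcds_orc_2.call_center, call_center_iceberg) are hypothetical, and the sketch assumes that the build accepts the STORED BY ICEBERG shorthand; it is not necessarily the exact statement used in our tests.

```sql
-- Hypothetical names: tpcds_orc_2.call_center is assumed to be an existing ORC table.
-- Assumption: the Hive 4 build accepts the STORED BY ICEBERG shorthand for the
-- Iceberg storage handler.

-- Create an Iceberg table whose data files are written as ORC,
-- populated from the existing ORC table.
CREATE TABLE call_center_iceberg
STORED BY ICEBERG
STORED AS ORC
AS SELECT * FROM tpcds_orc_2.call_center;
```

The source here is an ORC table rather than a Text table, which matches operation 2) and not operation 4).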

Sungwoo

David Engel

Jan 30, 2023, 4:13:35 PM
to Sungwoo Park, MR3
Am I correct in assuming that Hive 3 and Hive 4 cannot share the same
Metastore? IOW, even though your new snapshot can use an existing
configuration, Hive 3 and Hive 4 (and their respective Metastores) are
essentially independent and should use separate warehouse directories.

Also, you mention non-transactional ORC tables below in item 3. I
think in my very limited Iceberg research so far, I saw other
references to non-transactional tables. Does that mean Iceberg tables
either don't support transactional tables or have their own, different
support for transactions?

David
--
David Engel
da...@istwok.net

Sungwoo Park

Jan 30, 2023, 5:49:04 PM
to David Engel, MR3
> Am I correct in assuming that Hive 3 and Hive 4 cannot share the same
> Metastore? IOW, even though your new snapshot can use an existing
> configuration, Hive 3 and Hive 4 (and their respective Metastores) are
> essentially independent and should use separate warehouse directories.

Right, they cannot share the same Metastore. Hive 4 needs its own Metastore using the Hive 4 jar. Hive 3 and Hive 4 may use the same backing database server (e.g., the same MySQL database server), but they should use separate databases on that server so that their Metastore tables do not collide. So, you need to adjust the configuration before using Hive 4.

In the case of using mr3-run-k8s/typescript, this means:

1) metastoreEnv.databaseHost can be the same for both Hive 3 and Hive 4 (if the database server supports Hive 4).
2) metastoreEnv.databaseName must be different.

(Once Hive 4 is released, the Hive 4 Metastore may accept access from Hive 3, but I am not sure whether this will be supported.)

> Also, you mention non-transactional ORC tables below in item 3. I
> think in my very limited Iceberg research so far, I saw other
> references to non-transactional tables. Does that mean Iceberg tables
> either don't support transactional tables or have their own, different
> support for transactions?

Iceberg supports transactional tables by design. Item 3) means that existing non-transactional, non-Iceberg ORC tables can be converted to Iceberg ORC tables by adding metadata only, by executing 'ALTER TABLE', e.g.:

--- create non-transactional ORC table
create table call_center
stored as orc
TBLPROPERTIES('transactional'='false')
as select * from tpcds_text_2.call_center;

--- convert to Iceberg table
ALTER TABLE call_center SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');

Please note that some Iceberg operations are not supported or are buggy in the current build of Hive 4. For example, we found that creating Iceberg tables using CTAS from Text tables is not stable. In the case of TPC-DS data, loading the TPC-DS Iceberg tables succeeds, but queries return wrong (empty) results.
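
Because a load can appear to succeed while queries later return empty results, a quick sanity check after loading may help. The following is only a sketch: call_center_iceberg is a hypothetical name for the newly loaded Iceberg table, and tpcds_text_2.call_center is the Text source table from the example above.

```sql
-- Hypothetical name: call_center_iceberg stands for the newly loaded Iceberg table.

-- Row count of the source Text table.
SELECT COUNT(*) FROM tpcds_text_2.call_center;

-- Row count of the Iceberg table; it should match the count above.
-- A count of 0 would indicate the empty-result problem.
SELECT COUNT(*) FROM call_center_iceberg;

-- Read a few actual rows as well, rather than relying on the count alone.
SELECT * FROM call_center_iceberg LIMIT 5;
```

Matching counts do not prove that every query is correct, but a mismatch is a cheap early signal that the table should be reloaded by another route (e.g., from an ORC table).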

Sungwoo

David Engel

Jan 30, 2023, 9:14:59 PM
to Sungwoo Park, MR3
Thanks for the clarifications.

David
--
David Engel
da...@istwok.net

Ill

Jan 31, 2023, 2:48:16 AM
to MR3
Great