Apache Ozone


David Engel

Oct 18, 2023, 5:08:24 PM
to MR3
We are running Hive/MR3 on Kubernetes using VMs on a Dell VxRail cluster. Because our disks are actually virtual disks on a SAN, and because of our unfortunate choice of MinIO/S3 with its lack of a rename operation, many of our Hive queries can involve 2, 4 or even 6 copies across the network. I understand Apache Iceberg and Hive4 should mitigate some of the extra copies, but when those will be ready is still unknown. Running HDFS on Kubernetes is apparently neither recommended nor well supported.

I recently ran across Apache Ozone (https://ozone.apache.org/), which claims to run well on Kubernetes and provide a "Hadoop-compatible" file system. I'm wondering if Hive/MR3 can support Apache Ozone on Kubernetes without changes? If not, what would it take to do so, and is it even worth pursuing?

Sungwoo Park

Oct 18, 2023, 10:45:16 PM
to David Engel, MR3
Hi David,

I haven't tried Hive-MR3-Ozone, but my guess is that it should work after modifying core-site.xml and adding the Ozone jar to the classpath. If you would like to try Hive-MR3-Ozone, I can test it on a small Ozone deployment sometime later this week (or next week).

If you are concerned about local disk usage, you might be interested in Hive-MR3-Celeborn (which will be released with MR3 1.8). By using the streaming mode of Hive-MR3-Celeborn, you can avoid generating most spill files (for unordered edges), cutting local disk usage nearly in half.

By the way, please consider joining MR3 Slack if you need quick communication:
https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg

Thanks,

--- Sungwoo


David Engel

Oct 19, 2023, 2:45:50 PM
to Sungwoo Park, MR3
Local disk space is not the issue. I'm talking about writes to
directories named like the following:

.hive-staging_hive_2023-06-13_22-43-39_737_9038277851085593007-3307

When data is inserted into a table, Hive writes files to directories
like this, sometimes multiple times, as it accumulates results. This
happens even with non-transactional tables. Hive eventually writes
the results to the table's root directory or appropriate partition. I
assume this is all to maintain some level of atomicity or consistency
as Hive updates the Metastore.

As previously noted, because of our use of VMs, each of these writes
turns into two writes. The first write is from the worker pod to a
MinIO pod. The MinIO pod then writes the data to its virtual disk,
which is actually located on our SAN, and that results in
another write over the network.

Even though the underlying network is 10 Gbps and some of the writes
are actually internal to the VM cluster, I contend that all of these
writes of 10s and 100s of GB add up and result in slower operation
than need be. Ideally, we'd use the native HDFS support provided by
our SAN, but it's an extra add-on and the cost is prohibitive.
Consequently, I'm looking for any filesystem options that are
compatible with Hive/MR3 and can avoid these extra copies.
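
To put rough numbers on the write amplification described above, here is a
minimal back-of-the-envelope sketch. It rests on assumptions not stated in the
thread: that the "2, 4 or even 6 copies" correspond to 1, 2 or 3 staging passes
with two network hops each (worker pod to MinIO pod, MinIO pod to SAN), that
every hop crosses the 10 Gbps network at full line rate, and that 1 GB is
10^9 bytes. The function names are hypothetical and exist only for
illustration.

```python
def network_copies(staging_passes: int, hops_per_write: int = 2) -> int:
    """Number of times the data crosses the network.

    Assumption: each staging pass is written once by the worker pod to a
    MinIO pod and once more by the MinIO pod to its SAN-backed virtual disk.
    """
    return staging_passes * hops_per_write


def transfer_seconds(data_gb: float, copies: int, link_gbps: float = 10.0) -> float:
    """Transfer time in seconds if every copy runs at full link rate.

    Assumption: copies are sequential and nothing else shares the link, so
    this is an optimistic lower bound for the given number of copies.
    """
    gigabits = data_gb * 8  # 1 GB = 8 Gb
    return gigabits * copies / link_gbps


if __name__ == "__main__":
    for passes in (1, 2, 3):
        copies = network_copies(passes)
        seconds = transfer_seconds(100, copies)
        print(f"{passes} staging pass(es): {copies} copies, "
              f"at least {seconds:.0f} s for 100 GB")
```

Under these assumptions, a 100 GB insert that goes through three staging passes
spends at least 480 seconds on network transfer alone, versus 80 seconds for a
single copy. Writes that stay internal to the VM cluster would lower the real
figure, so treat this only as an order-of-magnitude estimate.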

Joining your Slack channel is on my TODO list and has been since you
created it. Unfortunately, my TODO list acts more like a stack where
things get pushed on and seldom get popped off. I might have a tiny
lull this afternoon, though, and will try to get it done.

David

--
David Engel
da...@istwok.net

David Engel

Oct 19, 2023, 5:32:37 PM
to Sungwoo Park, MR3
I neglected to answer the first and most important part.

If you have time to try Ozone yourself next week, please do. I will
hopefully try myself next week too.

David


Sungwoo Park

Oct 19, 2023, 5:47:26 PM
to David Engel, MR3
Alright, let me try Hive-MR3-K8s with Ozone.

You have been operating Hive-MR3-K8s for almost a year, so if you remember any problems you had with Hive-MR3, please share them with us.

--- Sungwoo

David Engel

Oct 19, 2023, 9:17:10 PM
to Sungwoo Park, MR3
Okay. I'll try to put together a list of things we've had to
customize or do differently. I'll also list issues we still occasionally
have. It might be tomorrow but will more likely be next week.

David
--
David Engel
da...@istwok.net