Local disk space is not the issue. I'm talking about writes to
directories named like the following:
.hive-staging_hive_2023-06-13_22-43-39_737_9038277851085593007-3307
When data is inserted into a table, Hive writes files to directories
like this, sometimes multiple times, as it accumulates results. This
happens even with non-transactional tables. Hive eventually writes
the results to the table's root directory or appropriate partition. I
assume this is all to maintain some level of atomicity or consistency
as Hive updates the Metastore.
As previously noted, because of our use of VMs, each of these writes
turns into two writes. The first write is from the worker pod to a
MinIO pod. The MinIO pod then writes the data to its virtual disk,
which is actually located on our SAN, and that results in another
write over the network.
Even though the underlying network is 10 Gbps and some of the writes
are actually internal to the VM cluster, I contend that all of these
writes of 10s and 100s of GB add up and result in slower operation
than need be. Ideally, we'd use the native HDFS support provided by
our SAN, but it's an extra add-on and the cost is prohibitive.
Consequently, I'm looking for any filesystem options that are
compatible with Hive/MR3 and can avoid these extra copies.
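To put rough numbers on the doubling, here is a back-of-envelope sketch. It is not a measurement: the function names, the 100 GB figure, and the assumption that the final table write is one more full copy after the staging passes are all made up for illustration. It also treats every hop as crossing the 10 Gbps link, so it is an upper bound, since some writes stay inside the VM cluster.

```python
# Back-of-envelope estimate only; all names and inputs are hypothetical.
HOPS_PER_WRITE = 2  # worker pod -> MinIO pod, then MinIO pod -> SAN-backed disk


def network_gb(logical_gb: float, staging_passes: int = 1) -> float:
    """GB moved for one insert: staging passes plus the final write, times hops."""
    # Assumption: the final write to the table/partition directory is one
    # more full copy of the data after the staging passes.
    writes = staging_passes + 1
    return logical_gb * writes * HOPS_PER_WRITE


def min_transfer_seconds(gb: float, link_gbps: float = 10.0) -> float:
    """Lower bound on transfer time if every byte crossed a single link."""
    return gb * 8 / link_gbps


if __name__ == "__main__":
    moved = network_gb(100, staging_passes=1)
    print(moved, "GB moved,", min_transfer_seconds(moved), "s minimum at 10 Gbps")
```

With one staging pass, a 100 GB insert moves four times the logical data under these assumptions, which is the overhead a filesystem without the extra copies would avoid.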
Joining your Slack channel is on my TODO list and has been since you
created it. Unfortunately, my TODO list acts more like a stack where
things get pushed on and seldom get popped off. I might have a tiny
lull this afternoon, though, and will try to get it done.
David
--
David Engel
da...@istwok.net