
Workdir location


David Engel

Apr 16, 2025, 11:50:04 AM
to MR3
I'm using the Kubernetes instructions for setting up MR3 2.0. The
default is to use a PersistentVolume backed by NFS for the scratch
directory and results cache. The directions also describe how to
optionally put the scratch directory and results cache in S3/HDFS.
Are there any significant advantages to either approach?

David
--
David Engel
da...@istwok.net

Sungwoo Park

Apr 16, 2025, 12:20:53 PM
to MR3
The main advantage of using S3/HDFS is the simplicity of the setup: there is no need to bother with a PersistentVolume. Since only a small amount of data is written for this type of intermediate data, there is no performance penalty, either. (Even on Amazon AWS, the cost associated with this type of intermediate data is negligible in comparison with the cost of reading input data.)

However, if you want to use Ranger, we recommend using PersistentVolume because you want to keep the data written by Ranger and Ranger cannot write to HDFS/S3. If you don't use PersistentVolume, Ranger writes data to container-local directories, which are ephemeral. Similarly for MR3-UI and Grafana.

Another small benefit of using PersistentVolume is that it is easier to inspect the intermediate data (if you want to check it for some reason).

To summarize, if you don't plan to use Ranger, MR3-UI, and Grafana, using HDFS/S3 is perfectly fine. Otherwise, we recommend using PersistentVolume.
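For reference, pointing the scratch directory and results cache at S3 usually comes down to a couple of entries in hive-site.xml along these lines. This is only a sketch: the bucket name and paths below are placeholders, so please check the MR3 documentation for the authoritative values in your setup.

```xml
<!-- Hypothetical example: scratch directory and results cache on S3.
     Replace my-bucket with your own bucket; the paths are placeholders. -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>s3a://my-bucket/workdir</value>
</property>

<property>
  <name>hive.query.results.cache.directory</name>
  <value>s3a://my-bucket/workdir/_resultscache_</value>
</property>
```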

--- Sungwoo

David Engel

Apr 16, 2025, 5:20:22 PM
to Sungwoo Park, MR3
Thanks for the thorough explanation. While we do have other
persistent volumes we need mounted, I think we'll use HDFS as it
eliminates another location that, albeit unlikely, could run out of
space. However, I noticed the following additional properties that
use work-dir by default. Should they be moved to HDFS too, or is there
no need to worry about them?

<property>
  <name>hive.repl.rootdir</name>
  <value>/opt/mr3-run/work-dir/${user.name}</value>
</property>

<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/opt/mr3-run/work-dir/${hive.session.id}_resources</value>
</property>

David


--
David Engel
da...@istwok.net

Sungwoo Park

Apr 16, 2025, 7:33:42 PM
to MR3
From the previous MR3 documentation:

Do not update the configuration key `hive.downloaded.resources.dir` because it should point to a directory on the local file system.

(I removed this line in the new MR3 documentation because I thought that users don't need to change the value and thus it was only a distraction.)

hive.repl.rootdir is used only when you use the "REPL DUMP" command for replicating data between two Hive installations. If you plan to use "REPL DUMP", it should point to a directory on HDFS/S3. Otherwise, it is okay to leave it as it is.
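For example, if you did plan to use replication, the change would look something like the sketch below. The HDFS path here is a placeholder, not a recommendation.

```xml
<!-- Hypothetical example: point the replication dump root at HDFS
     instead of the container-local work-dir. The path is a placeholder. -->
<property>
  <name>hive.repl.rootdir</name>
  <value>hdfs:///user/hive/repl</value>
</property>
```

With this in place, a "REPL DUMP" of a database writes its dump files under that HDFS directory, where they survive container restarts.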

A minor implementation note: "REPL DUMP" is another reason that Apache Hive cannot run on Kubernetes in its current form, because this command initiates Hadoop MapReduce jobs. How, then, does Hive-MR3-K8s handle this issue? We convert the MapReduce jobs into MR3 DAGs and send them to MR3, so there is no more dependence on Hadoop :-) Another similar example is compaction.

Cheers,

--- Sungwoo

David Engel

Apr 17, 2025, 5:54:00 AM
to Sungwoo Park, MR3
Thanks again. I'll leave those settings alone. Without work-dir
being mounted from outside, I was mildly concerned about it filling up
under extreme or unusual cases.

David


--
David Engel
da...@istwok.net