Hive on MR3 on Kubernetes Eval Questions


David Engel

Jun 28, 2022, 3:59:30 PM6/28/22
to MR3
Hi,

We're evaluating Hive on MR3 on Kubernetes to replace our old HDP 2.5 and 3.1 clusters.  We're also interested in Spark on MR3, but my focus right now is on Hive since it's what I'm most familiar with.  I've set up a modest, 8-node cluster of virtual machines, with each node having 8 cores, 64 GB RAM and 1 TB disk.  I'm completely new to Kubernetes and MinIO so probably don't have things set up optimally.  For Hive/MR3, I followed the instructions at https://mr3docs.datamonad.com/docs/quick/k8s/run-k8s/.  I'd appreciate help with the following questions and issues.

1. Neither "show databases" nor "show tables" shows any results.  The databases and tables are there, and I can create and use them, but the "show" queries don't return anything.  For the Metastore database, I'm using MySQL installed on the master node running Ubuntu 20.04 LTS.

2. When I run what I think is a simple query of the form "select count(*) from large_table where some_column = 'some_value'", I get an error like "Terminating unsuccessfully: Vertex failed, vertex_28875_0000_19_01."  By "large_table" I mean one with 5.6 billion rows.  I have the Hive workers configured for 2 CPUs and 12 GB RAM.  I need to increase that back to 16 GB per the instructions; I forgot to reset them after adding more RAM to the virtual machines.  FWIW, when I first ran the same query on the same table on the old cluster, it too failed with the same error.  However, it succeeded after that.

3. What's the recommended way to run Hive/MR3 with MinIO/S3?  I currently have MinIO running on every worker node and let Hive/MR3/Kubernetes schedule the Hive workers as they see fit.  Does Hive on MR3 have any notion of data locality like Hive on HDFS does (or at least did)?  If not, is it preferred to run MinIO and Hive workers on separate nodes?

4. What's the recommended way to make common storage available to the docker image running the Hive server and workers for things like "ADD JAR" and "ADD FILE" commands?  On our old clusters, we have home and other common directories NFS mounted on the Hive server and worker nodes.

5. Some of our projects make extensive use of Hive transforms written in Python.  Assuming we can solve #4 above, what is the recommended way to deploy custom Python setups on worker nodes?  It appears the supplied Docker images are based on Debian 10.  Is it preferred to build a new Docker image with the extra parts we need from Debian 10, try to build a completely stand-alone and statically linked version of Python, or something in between?

Thanks
David

Sungwoo Park

Jun 29, 2022, 4:54:52 AM6/29/22
to MR3
Hi,

1.
We have not seen the same problem before, so we are not sure why this happens. However, it might be due to an incompatibility between your MySQL database and the MySQL connector that Metastore downloads, so if you let us know the version of your MySQL database, we could try it in our local environment.

With the Docker image available on DockerHub, Metastore downloads mysql-connector-java 8.0.28 (from https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz). The user should use a MySQL database compatible with this connector. The MR3 Docker image does not include a MySQL driver because of the copyright of the connector.

If you would like to try a custom MySQL driver, there are two options (which are both a bit tricky):

1) Build a custom Docker image with the MySQL driver copied in /opt/mr3-run/lib
2) Use a subdirectory 'lib' of the PersistentVolume

For more details, please see #1 at:
https://mr3docs.datamonad.com/docs/k8s/troubleshoot/

2.
If you could share the message and stack trace in the log, we could look into the problem.

Increasing the capacity of worker Pods is one option, but by creating large worker Pods and packing multiple tasks into a single worker Pod, you could make more efficient use of resources while avoiding OOM.

For example, if each Kubernetes node has 8 CPUs and 64GB RAM, you could create MR3 worker Pods with about the same amount of memory. For configuring resources, please see: https://mr3docs.datamonad.com/docs/k8s/performance/performance-tuning/

Example) To assign 7.5 CPUs and 60GB to a single worker Pod which is shared by 10 concurrent tasks:

    <property>
      <name>hive.mr3.resource.vcores.divisor</name>
      <value>100</value>
    </property>

    <property>
      <name>hive.mr3.map.task.memory.mb</name>
      <value>6144</value>
    </property>

    <property>
      <name>hive.mr3.map.task.vcores</name>
      <value>0</value>
    </property>

    <property>
      <name>hive.mr3.reduce.task.memory.mb</name>
      <value>6144</value>
    </property>

    <property>
      <name>hive.mr3.reduce.task.vcores</name>
      <value>0</value>
    </property>

    <property>
      <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
      <value>61440</value>
    </property>

    <property>
      <name>hive.mr3.all-in-one.containergroup.vcores</name>
      <value>7500</value>
    </property>

3.
AFAIK, MinIO/S3 does not support data locality, so there is no particular advantage in collocating compute nodes with MinIO nodes. When using MinIO/S3, a typical setup would be based on 'separation of compute and storage', where compute nodes (running Hive, Spark, Presto, etc.) are connected to MinIO nodes via a fast network.

Hive on MR3 supports data locality only when HDFS is used. On Hadoop, data locality is fully exploited. On Kubernetes, Hive on MR3 tries to execute a map task in a Pod running on the physical node hosting the input data. For more details, please see 'Location hints when using HDFS' at:

https://mr3docs.datamonad.com/docs/k8s/performance/performance-tuning-k8s/

4.
This is automatically taken care of by HiveServer2, provided that PersistentVolume is properly mounted. So, you don't need to update the Docker image.

For using UDFs, please see examples at:

https://mr3docs.datamonad.com/docs/k8s/user/use-udf/
https://mr3docs.datamonad.com/docs/k8s/advanced/use-s3/

5.
You can use Python scripts similarly to common UDFs, except that you should set mr3.container.localize.python.working.dir.unsafe to true in mr3-site.xml. For more details, please see:

https://mr3docs.datamonad.com/docs/k8s/user/use-udf/

If you have lots of Python files or some additional requirements on running Python, it might make sense to build a new Docker image. In such a case, put all the Python files in mr3-run/kubernetes/hive/hive/ (which is mapped to /opt/mr3-run/hive/ inside Docker images) before building a Docker image, and all Python scripts should be automatically available to MR3 workers. (However, we have not tried this approach ourselves.)

For using Spark on MR3, we are currently testing Spark 3.2.1 on MR3. We are extending MR3 Cloud to launch both Hive on MR3 and Spark on MR3 sharing a common Metastore.

If you have any other questions, please let us know.

--- Sungwoo


David Engel

Jun 30, 2022, 12:08:47 PM6/30/22
to MR3
1. We're using MySQL 8.0.29-0ubuntu0.20.04.3.  It's the current version in Ubuntu LTS 20.04.  I moved the hivemr3 database to docker containers running MySQL 5.6 and 8.0.28 and the behavior was the same.

I don't think the problem is with the connector but rather something unexpected or wrong in our database.  Do you know where in the Metastore source code it queries the DBS and TBLS tables?  I looked briefly last night but didn't find it.  I think manually running the same queries Metastore runs will shed some light on this issue.

2. The log is attached.  This is from using the initial settings of 2 cores and 16 GB.  I have not yet tried your suggestion to increase the cores/memory.

Here is the query.

100.69.216.140:testdb> select count(*) from test_table where ip =
                         hex('08080808');

And here is what I think might be an important part:

Caused by: java.lang.RuntimeException: Hive 2 Internal error: unsupported conversion from type: binary

The ip column is type binary representing an IP address.  It is either NULL or of length 4 or 16 bytes.

3. Thanks.  Currently, all nodes are in the same VM cluster so the virtual networking should be pretty fast.  If we take this to production, I'll move MinIO to its own nodes.

What's your opinion/experience with running HDFS instead of S3 on Kubernetes?  While it's possible, my limited searching seems to indicate that S3 is preferable with Kubernetes.

4 and 5. Thanks.  I was able to confirm that (at least a subset of) our Java UDFs work.  Our python transforms will take some more testing.

David
hivemr3.log

David Engel

Jun 30, 2022, 3:11:35 PM6/30/22
to MR3
2. Disregard this one.  I had the meaning of the hex() and unhex() functions backwards.  I finally realized this when I tried the query using our UDF for converting IP address strings to binary.

David

Sungwoo Park

Jun 30, 2022, 3:23:26 PM6/30/22
to MR3
For 2, it looks like a conversion error fails the query. The ip column is of type binary, so it cannot be compared with hex('08080808'), which is of type string.

Could you try this query and see what happens?

select count(*) from test_table where ip = unhex('08080808');
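As a rough analogy for the direction of the two functions (Python's binascii mirrors Hive's unhex/hex on binary data; an illustration, not Hive code):

```python
import binascii

# unhex('08080808') in Hive turns a hex string into binary,
# which is comparable to the binary ip column.
ip_bytes = binascii.unhexlify('08080808')
print(ip_bytes)  # b'\x08\x08\x08\x08'

# hex() goes the other way: binary -> hex string.
print(binascii.hexlify(ip_bytes).decode())  # 08080808
```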

For 3, my opinion is that Kubernetes does not particularly prefer S3 to HDFS because it is agnostic to the storage system. However, if a new storage system should be built to be used with Kubernetes, I would go for S3 mostly because it is relatively easier to maintain than HDFS.

HDFS has its own strength. For example, in the current implementation of Hive, HDFS provides better performance than S3 when writing data to Hive tables. HDFS also supports data locality, which can be useful if data nodes are collocated with Kubernetes nodes.

If your organization is already running HDFS and has enough resources to maintain it (which would be part of maintaining Hadoop), then it may make sense to use the current HDFS instead of migrating all the data to a new S3 system. For me, the only reason to choose S3 over HDFS is the maintenance cost.

For 1, I don't have a clue yet, so let me think more about it.

--- Sungwoo

David Engel

Jun 30, 2022, 4:11:28 PM6/30/22
to MR3
2. Yes, I found that PEBKAC earlier. :)

3. Our interest in Kubernetes and S3 vs. HDFS stems from the following.  Our existing clusters using HDP 2.5 and 3.1 are running on discrete, aging and, in some cases, failing hardware.  Our IT department has invested heavily in new VM clusters from Dell and wants to move our Hive/Spark use to their VM cluster and get rid of the old hardware.  The obvious choices are to replicate HDP as is or upgrade to Cloudera's current offerings.  We (engineering and development) are not happy with either choice -- we don't want to remain stuck with an aging and abandoned cluster framework with the former, and our relatively small company can't easily afford the latter.  Another option is installing and maintaining the various Apache/Hadoop/HDFS components ourselves, but experience tells us that is a very hard, tedious and time-consuming job.  Consequently, we're looking at alternative ways to run Hive, Spark and possibly Presto/Trino.
  
David

Sungwoo Park

Jul 1, 2022, 3:59:16 AM7/1/22
to MR3
1.
Could you share the logs of HiveServer2 and Metastore when you execute 'show databases'? The logs might give some hint on why no result is returned.

By default, the logging level in Hive-MR3-K8s is INFO. To change the logging level, update this file before executing HiveServer2 and Metastore: kubernetes/conf/hive-log4j2.properties

3.
In case you haven't seen our blog articles, we have previously compared Hive-MR3 with Hive-LLAP, Spark, and Presto using the TPC-DS benchmark.


Performance is not the only criterion for choosing a solution, but the above results may be useful for evaluating different options. Our results come from the TPC-DS benchmark in our internal clusters, so the results might differ when using your own workloads in your clusters. However, I believe one would get more or less similar results from other workloads.

Thanks,

--- Sungwoo

David Engel

Jul 14, 2022, 3:41:26 PM7/14/22
to MR3
Sorry for the slow reply.  I was on vacation all last week and just now noticed your latest message.

1. I was about to change the log4j settings to get the logs you requested.  Before doing so, I tried show databases/tables again and it now works.  One of the configuration tweaks or restarts I've done along the way must have fixed it.  Either way, it's no longer an issue.

3. I'm well aware of your blog posts and the reported performance.  I even partially tested an earlier version of Hive/MR3 on our HDP 2.5 cluster a few years ago.  The results were promising but ultimately incomplete.  I no longer remember the details, but because I could not adversely affect our existing Hive server, I had to run the MR3 version in parallel, with both using our existing Metastore, and I was not able to get everything working in the time I had allotted.

This leads nicely, though, into my likely next question.  In my small test cluster, Hive/MR3 is performing comparably to our HDP 2.5 cluster running Hive/Tez (without LLAP).  This same test cluster running Trino is much faster for simple queries (searches and aggregations) than Hive/MR3 and comparable to PrestoDB running on our HDP 2.5 cluster.  Our test cluster is not currently big enough to run both Hive/MR3 and Trino at the same time without significantly slowing down both.  The results so far are promising enough, however, that my boss has asked IT to build me several more VMs to add to the test cluster.  I'm currently waiting on them to be ready.

Long story short, I might be asking you soon for help tuning Hive/MR3 in order to get its performance closer to that of Trino.  Our use of PrestoDB on the HDP cluster was purely to solve the lack of speed for simple, interactive queries using Hive/Tez.  If Hive/MR3 can perform close enough to Trino, we won't need the latter at all.

David

Sungwoo Park

Jul 14, 2022, 9:30:21 PM7/14/22
to MR3
1. The Hive-MR3 Docker image on DockerHub has a limit of 1TB on the total memory of worker Pods. For example, you can run up to 16 worker Pods, each with 64GB memory. If you would like to test Hive-MR3 in a large cluster, we can upload a new image, so please let us know.

2. For performance tuning, you could start with the following pages:


The first topics to consider are:

- Resources for mappers, reducers, and ContainerWorkers
- Memory settings for Hive and Tez
- Eliminating fetch delays
- Page cache for Docker containers

For changing kernel parameters and configuring network interfaces (in case you need to deal with packet loss and drops causing fetch delays), you might need assistance from the administrator of your cluster.

Cheers,

--- Sungwoo

David Engel

Jul 21, 2022, 3:35:39 PM7/21/22
to Sungwoo Park, MR3
I've done some work on performance tuning but am currently waiting on
IT to reconfigure the VMs in our virtual cluster.

In the meantime, I'm trying to test out transform scripts using
Python. I've successfully built and deployed a new docker image with
Python included. I'm so far unable to get Hive to find my transform
script.

In our existing cluster, we have home and a few other directories NFS
mounted on the system running the Hive server. That allows the server
to find the files we use in "add file" directives and push them to
worker nodes. We aren't currently mounting any directories specially
in our test Hive/MR3 cluster, so I'm doing something like the
following:

add file s3a://hivemr3/myscript.py

Using an s3a URL was how I was previously able to get the Hive server
to load the jar with our custom UDFs. This doesn't seem to be
working in this case. Hive doesn't complain about the add file
directive but myscript.py does not make it to the worker node/image as
far as I can tell.

I'm guessing add file either works differently in Hive/MR3 than it
does in Hive/Hadoop or doesn't even work at all. Can you please
clarify? If it doesn't work at all, is there a recommended
workaround? I'm thinking NFS mounting the desired directories into the
Hive worker Docker images might be the preferred way.

David
> --
> You received this message because you are subscribed to the Google Groups "MR3" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to hive-mr3+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/02c251cd-0d20-4c34-8805-04184bc7b6d3n%40googlegroups.com.


--
David Engel
da...@istwok.net

Sungwoo Park

Jul 22, 2022, 7:02:47 AM7/22/22
to MR3
Hive on MR3 supports transform scripts written in Python, but it has several restrictions. Please see the documentation for a summary:

https://mr3docs.datamonad.com/docs/k8s/user/use-udf/

1.
Unlike UDFs, transform scripts are disabled in SQL standard based authorization (https://issues.apache.org/jira/browse/HIVE-6415). Hence you can use transform scripts only if authorization is completely disabled in HiveServer2.

In order to use transform scripts written in Python, set the following configuration keys in hive-site.xml and mr3-site.xml before starting HiveServer2.

1) hive.security.authorization.manager to org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory or org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory in hive-site.xml

2) hive.security.authorization.enabled to false in hive-site.xml

3) mr3.container.localize.python.working.dir.unsafe to true in mr3-site.xml

You cannot use transform scripts if Ranger is used for authorization. This combination does not make much sense anyway.

2.
Now, you can use transform scripts. For example, you can execute 'add file' to load files and execute queries:

add file /opt/mr3-run/work-dir/foo.py;
add file s3a://hivemr3/bar.py;
SELECT TRANSFORM(key, value) USING 'python3 foo.py' AS key, value FROM src limit 6;
SELECT TRANSFORM(key, value) USING 'python3 bar.py' AS key, value FROM src limit 6;

In the first line above, '/opt/mr3-run/work-dir/' is a path inside HiveServer2 Pod where the PersistentVolume is mounted. In this way, you can load files available in the PersistentVolume.

A Python script must have an extension either 'py' or 'PY' (which is a requirement specific to MR3).

Python command (e.g., 'python3') should be available inside ContainerWorker Pods, so a custom Docker image should be built, as you have already done.
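As a minimal sketch of such a transform script (hypothetical; it assumes two tab-separated input columns, like the key/value example above, and does a trivial upper-casing):

```python
#!/usr/bin/env python3
# Minimal Hive TRANSFORM script sketch: Hive pipes input rows to stdin
# as tab-separated columns and reads tab-separated rows back from stdout.
import sys

def transform_row(line: str) -> str:
    key, value = line.rstrip('\n').split('\t')
    # Trivial example transformation: upper-case the value column.
    return key + '\t' + value.upper()

# Guard on isatty so importing or running interactively does not block.
if __name__ == '__main__' and not sys.stdin.isatty():
    for row in sys.stdin:
        print(transform_row(row))
```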

3.
With authorization disabled, transform scripts should be used carefully, especially if several users access HiveServer2 simultaneously.

- If a transform script fails (e.g., because of a syntax error), the query may not terminate normally, so the user should kill it manually.

- Adding a Python file works only if no Python file with the same name from a different source was added previously. Here is an example:

add file /opt/mr3-run/work-dir/newline.py;
add file s3a://hivemr3/newline.py;
SELECT TRANSFORM(key, value) USING 'python3 newline.py' AS key, value FROM src limit 6;

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: Previous writer likely failed to write file:/opt/mr3-run/work-dir/hive/hive/_mr3_session_dir/b6acebbe/newline.py. Failing because I am unlikely to write too.

In this case, the user (or whoever is accessing HiveServer2 at the moment) should delete one manually, e.g.:

delete file /opt/mr3-run/work-dir/newline.py;

4.
If you want to use lots of Python scripts, it might make sense to build a custom Docker image with all Python scripts built-in. By configuring Python so as to look for the Python scripts automatically, one could skip 'add file' commands altogether. I have not tried this approach myself, but if you would like to pursue this approach, please let us know.

Cheers,

--- Sungwoo

Sungwoo Park

Jul 22, 2022, 8:44:50 AM7/22/22
to MR3
Here is a minor note. We can run two HiveServer2 instances, each with its own hive-site.xml, that share the MR3 master and workers. In this setting, one instance can enable authorization so that ordinary users can run queries, while the other disables authorization so that admin users can run transform scripts.

--- Sungwoo

David Engel

Jul 23, 2022, 5:19:52 PM7/23/22
to Sungwoo Park, MR3
We are well aware of the deprecation status of Hive transforms.
However, we have a legacy project which predates that change and makes
extensive use of transforms. Your pre-announcement of MR3 1.5,
especially the bit about Hive and Spark sharing a common Metastore, is
very exciting. One hope is that we can one day rewrite our Hive
transforms in Spark. Of course, being a legacy project, it's not
clear when or if the resources will be made available to make that
change.

Anyway, thanks for the tips regarding the use-udf URL and the
mr3.container.localize.python.working.dir.unsafe setting. However, the
setting appears to have no effect. The test Python script I'm using
always shows up in the DAG subdirectory. Regardless, I don't think
the limitation of not duplicating script names will work for us. We
don't intentionally duplicate script names. However, we do sometimes
run multiple data sets in parallel where one or both sets of scripts
might be changed for testing or development.

Another approach I'm considering is mounting our development/home
directories into the cluster. Doing that would allow using the full,
absolute path to distinguish which script to run. I successfully
modified yaml/hive.yaml to add an NFS volume that gets mounted in the
hivemr3-hiveserver2 container. However, the mr3worker containers
apparently don't share the same volumeMounts. Is it possible to add
an NFS volume to the mr3worker containers?

David


--
David Engel
da...@istwok.net

Sungwoo Park

Jul 24, 2022, 1:39:44 AM7/24/22
to MR3
1.
When testing mr3.container.localize.python.working.dir.unsafe, could you check if the existing ConfigMap object built from mr3-site.xml is deleted before restarting HiveServer2? It might be that the previous ConfigMap object is still in use (one that does not set mr3.container.localize.python.working.dir.unsafe). You can check mr3-conf.properties inside the DAGAppMaster Pod, e.g.:

$ kubectl exec -n hivemr3 -it mr3master-5782-0-7bc4bf6b95-jmxzd /bin/bash
hive@mr3master-5782-0:/opt/mr3-run/hive$ grep mr3.container.localize.python.working.dir.unsafe mr3-conf.properties
mr3.container.localize.python.working.dir.unsafe=true

mr3.container.localize.python.working.dir.unsafe is covered by one of the Q tests in Hive-MR3 (remote_script.q), and I also tested it manually. If this still does not work, I guess you use some setting that I am not aware of.

2.
You can mount an NFS volume as a PersistentVolume in worker Pods with the configuration key mr3.k8s.worker.persistentvolumeclaim.mounts in mr3-site.xml. From https://mr3docs.datamonad.com/docs/mr3/guide/configure-mr3/:

Comma-separated list of pairs of a PersistentVolumeClaim and its mount point. For example, foo1=bar1,foo2=bar2,foo3=bar3 mounts PersistentVolumeClaim foo1 on directory bar1 in ContainerWorker Pods, and so on.


With the executable scripts in https://github.com/mr3project/mr3-run-k8s, you can set this configuration key using two variables in kubernetes/env.sh:

WORK_DIR_PERSISTENT_VOLUME_CLAIM=workdir-pvc
WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR=/opt/mr3-run/work-dir

In this example, PersistentVolumeClaim workdir-pvc is mounted in the directory /opt/mr3-run/work-dir. Internally kubernetes/hive/mr3/mr3-setup.sh sets the configuration key:

-Dmr3.k8s.worker.persistentvolumeclaim.mounts=$WORK_DIR_PERSISTENT_VOLUME_CLAIM=$WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR

You can also mount several PersistentVolumes. Either you could rebuild the Docker image from scratch after updating mr3-setup.sh, or use an ugly hack like the following (with /foo/bar1, /foo/bar2, /foo/bar3 existing in the Docker image):

WORK_DIR_PERSISTENT_VOLUME_CLAIM=pvc1
WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR=/foo/bar1,pvc2=/foo/bar2,pvc3=/foo/bar3
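To illustrate the pair-list format of mr3.k8s.worker.persistentvolumeclaim.mounts, here is a hypothetical parser (an illustration of the format only, not MR3's actual code):

```python
def parse_pvc_mounts(spec: str) -> dict:
    # "foo1=bar1,foo2=bar2,foo3=bar3" maps each PersistentVolumeClaim
    # name to the directory where it is mounted in ContainerWorker Pods.
    mounts = {}
    for pair in spec.split(','):
        claim, mount_dir = pair.split('=', 1)
        mounts[claim.strip()] = mount_dir.strip()
    return mounts

# The hack above concatenates the two env variables into exactly this form:
print(parse_pvc_mounts("pvc1=/foo/bar1,pvc2=/foo/bar2,pvc3=/foo/bar3"))
```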

Sungwoo

David Engel

Jul 24, 2022, 10:42:52 PM7/24/22
to Sungwoo Park, MR3
Thanks so much for all of your help. You are going above and beyond
helping this Kubernetes neophyte out.

1. conf/mr3-site.xml is the correct file to use, right? It doesn't
seem to be taking effect.

user01@cg-node$ kubectl exec -n hivemr3 -it mr3master-4789-0-69cbdbdc7c-v5dtr /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
hive@mr3master-4789-0:/opt/mr3-run/hive$ grep mr3.container.localize.python.working.dir.unsafe mr3-conf.properties

user01@cg-node$ grep -B1 -A2 mr3.container.localize.python.working.dir.unsafe conf/mr3-site.xml
<property>
<name>mr3.container.localize.python.working.dir.unsafe</name>
<value>true</value>
</property>

Also, when restarting Hive, I'm only deleting the hiveserver2 and
mr3master deployments and then running run-hive.sh. I shouldn't need
to restart Metastore, correct?

2. I haven't gotten the NFS pvc working quite yet. I'm sure I'll
figure it out soon enough. Thanks for the pointer.

David



--
David Engel
da...@istwok.net

Sungwoo Park

Jul 25, 2022, 2:09:55 AM7/25/22
to MR3
1.
This probably happens because the previous ConfigMap resources are not deleted. When using executable scripts, reconfiguring and restarting HiveServer2/DAGAppMaster is a bit tricky because of ConfigMap resources.

The following ConfigMap resources are created by HiveServer2 and DAGAppMaster:

$ kubectl get configmaps -n hivemr3 | grep mr3conf
mr3conf-configmap-master           1      24h
mr3conf-configmap-worker           1      24h

These Service resources are also created by HiveServer2 and DAGAppMaster:

gitlab-runner@orange1:~/mr3-cloud/src/general/test$ kubectl get svc -n hivemr3 | grep "service-"
service-master-9001-0   ClusterIP      10.96.246.161   <none>         80/TCP,9890/TCP                                24h
service-worker          ClusterIP      None            <none>         <none>                                         24h

Check if these resources are found when restarting HiveServer2/DAGAppMaster, and delete them, e.g.:

$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
$ kubectl delete svc -n hivemr3 service-master-9001-0 service-worker

If you are still not sure, just delete all the resources. See 'Stopping Hive on MR3' on this page: https://mr3docs.datamonad.com/docs/quick/k8s/run-k8s/

2.
Here is an example of creating PV and PVC for NFS:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: workdir-pv
  namespace: hivemr3
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  nfs:
    server: 192.168.10.1
    path: /home/nfs/hivemr3/orange5

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workdir-pvc
  namespace: hivemr3
  annotations: {}
spec:
  resources:
    requests:
      storage: 100Gi
  accessModes:
  - ReadWriteMany
  volumeName: ''
  storageClassName: ''
  selector: {}

--- Sungwoo

Sungwoo Park

Jul 25, 2022, 9:33:39 AM7/25/22
to MR3
Actually you can inspect the ConfigMap object directly, e.g.:

$ kubectl describe configmap -n hivemr3 mr3conf-configmap-master | grep python
mr3.container.localize.python.working.dir.unsafe=true

This ConfigMap is created from conf/mr3-site.xml by HiveServer2, and eventually mounted as the file /opt/mr3-run/hive/mr3-conf.properties inside the DAGAppMaster Pod. Hence, after updating conf/mr3-site.xml, you want to delete this ConfigMap (among others) before restarting HiveServer2.

Cheers,

--- Sungwoo

David Engel

Jul 26, 2022, 5:46:40 PM7/26/22
to Sungwoo Park, MR3
1. I believe you are correct. After a complete restart of the
cluster, I saw the expected behavior when
mr3.container.localize.python.working.dir.unsafe is true.

2. Thanks for your .yaml example. It is similar to another example I
was trying to use but yours had a few extra things that might or might
not have been important. I also had some other issues before I
realized run-hive.sh was creating/recreating the workdir pv/pvc. It's
all working now, though, and I successfully ran a Python transform
script.

3. Our IT finally reconfigured the VM nodes for me so I can tackle the
last main bit of my Hive/MR3 evaluation: trying to get
performance similar to that of Trino for simple queries. Our analysts
like to run simple lookups and aggregations interactively when
investigating issues. We started using PrestoDB long ago for those
due to pre-LLAP Hive being unacceptably slow.

What follows is an example of what we might like to run interactively.

The dns_ip_answer table is stored in ORC files with two levels of
partitions. The first partition is called 'day' and represents the
date the data was collected. The second partition represents the
source of the data and is not used in this sample query. For the
sample quere below, there are typically 4 to 8 distinct sources per
day and an average of about 10000000 total rows per day.

Here is the sample query I'm running.

select count(distinct ip),
sum(statistic_occurs64bit),
min(statistic_timefirstseen),
max(statistic_timelastseen)
from dns_ip_answer
where day between '2022-05-01' and '2022-05-31'
and ip between unhex('01010101') and unhex('01010101') ;

Here is a section of conf/hive-site.xml configuring the pods.

<property>
<name>hive.mr3.containergroup.scheme</name>
<value>all-in-one</value>
</property>

<property>
<name>hive.mr3.container.max.java.heap.fraction</name>
<value>0.7f</value>
</property>

<property>
<name>hive.mr3.resource.vcores.divisor</name>
<value>100</value>
</property>

<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>8192</value>
<!--<value>24576</value>-->
</property>

<property>
<name>hive.mr3.map.task.vcores</name>
<value>100</value>
<!--<value>300</value>-->
</property>

<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>8192</value>
<!--<value>24576</value>-->
</property>

<property>
<name>hive.mr3.reduce.task.vcores</name>
<value>100</value>
<!--<value>300</value>-->
</property>

<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>98304</value>
</property>

<property>
<name>hive.mr3.all-in-one.containergroup.vcores</name>
<value>1200</value>
</property>
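My reading of these settings (worth double-checking against the MR3 docs) is that each worker pod runs about 12 concurrent tasks, bounded by both memory and vcores:

```python
# How the all-in-one ContainerGroup settings above translate into
# concurrent tasks per mr3worker pod (my interpretation; verify
# against your own cluster).
containergroup_memory_mb = 98304
task_memory_mb = 8192
containergroup_vcores = 1200   # with hive.mr3.resource.vcores.divisor = 100
task_vcores = 100              # i.e. 1 physical core per task

tasks_by_memory = containergroup_memory_mb // task_memory_mb   # 12
tasks_by_cores = containergroup_vcores // task_vcores          # 12
concurrent_tasks = min(tasks_by_memory, tasks_by_cores)
print(concurrent_tasks)  # 12
```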

With our test cluster, I can get 6 mr3worker pods running with this
configuration. In short, I have 6 nodes with about 96 GB and 12 cores
available to run mr3 workers.

With the hiveserver2, mr3master and 6 mr3worker pods already running,
the sample query consistently takes about 45-50 seconds. I've tried
all (I think) of the suggestions found at the locations below and
none of them seemed to make any difference.

https://mr3docs.datamonad.com/docs/k8s/performance/performance-tuning/
https://mr3docs.datamonad.com/docs/k8s/performance/performance-tuning-k8s/

In contrast, when running Trino with 16 GB workers, the same query
only takes about 12 seconds with 2 workers and about 5 seconds with 6
workers. I realize the absolute difference between 5-12 seconds and
45-50 seconds isn't all that great, but the 4x to 10x slowdown seems
to hold for slightly more complex queries.

Is it possible to get Hive/MR3 closer to Trino for this type of query?
We could continue using Trino/PrestoDB but it would be nice if we
could settle on just Hive/MR3 and not have to maintain two sets of
UDFs and train people on the different dialects of SQL.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/26eb2321-448e-407d-b890-2f5a98bce2fdn%40googlegroups.com.


--
David Engel
da...@istwok.net

Sungwoo Park

unread,
Jul 26, 2022, 9:34:58 PM7/26/22
to MR3
A quick question - did you run the analyze command, e.g., 'analyze table dns_ip_answer compute statistics for columns'?

--- Sungwoo

David Engel

unread,
Jul 26, 2022, 10:33:47 PM7/26/22
to Sungwoo Park, MR3
Very good question. I'd totally forgotten about that command. Here
is the output if it helps.

INFO : Status: Succeeded
INFO : Map 1 190 tasks 1052311 milliseconds
INFO : Reducer 2 253 tasks 408723 milliseconds
INFO : Status: DAG finished successfully in 1392.86 seconds
INFO :
INFO : Query Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO : ----------------------------------------------------------------------------------------------
INFO : Compile Query 142.03s
INFO : Prepare Plan 1.40s
INFO : Submit Plan 0.09s
INFO : Start DAG 0.00s
INFO : Run DAG 1392.86s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO : --------------------------------------------------------------------------------------------------------------------------------
INFO : VERTICES TOTAL_TASKS FAILED_ATTEMPTS KILLED_TASKS INPUT_RECORDS OUTPUT_RECORDS
INFO : --------------------------------------------------------------------------------------------------------------------------------
INFO : Map 1 190 0 0 5,655,091,437 8,221
INFO : Reducer 2 253 0 0 8,221 0
INFO : --------------------------------------------------------------------------------------------------------------------------------
INFO :
INFO : Starting task [Stage-2:STATS] in serial mode
Map 1 100.0% [==================================>] 190/190 eta [ 00:00]
Reducer 2 25.1% [========> ] 253/1009 eta [1:38:57]
INFO : Completed executing command(queryId=hive_20220727014330_debcc31e-9203-4532-85c4-aa66bd350ce1); Time taken: 1990.697 seconds
Query OK, 0 rows affected
Time: 2133.950s
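A rough sanity check on those numbers (Map input records divided by the DAG wall-clock time, just back-of-envelope):

```python
# Throughput implied by the analyze run above: total Map input records
# over the reported "Run DAG" time.
input_records = 5_655_091_437
dag_seconds = 1392.86
rows_per_second = input_records / dag_seconds
print(f"{rows_per_second / 1e6:.2f} million rows/s")  # about 4.06
```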

There was no change in the test query afterwards as it still took 48
seconds.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/72506666-26cd-4b5c-88a6-eaf04f1118f3n%40googlegroups.com.


--
David Engel
da...@istwok.net

Sungwoo Park

unread,
Jul 27, 2022, 1:29:40 AM7/27/22
to MR3
From my understanding, this is an example of a query that Presto is particularly good at: no join operation, all intermediate data fits in memory, etc. While I don't think we can bring the performance of Hive-MR3 close to that of Presto for this type of query, we can still try to make better use of resources for Hive-MR3.

As an example, let me use a similar query on the 10-terabyte TPC-DS dataset.

select count(distinct ss_sold_time_sk), sum(ss_quantity), min(ss_list_price), max(ss_sales_price)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;

1.
For this type of query, 8GB per task might be unnecessarily large. Here is the result of the above query in our local K8s cluster where 6 worker Pods are created:

-- (12 cores, 96GB memory, 12 tasks) x 6 workers: 40.593 seconds, 51.442 seconds, 52.057 seconds
-- (12 cores, 48GB memory, 12 tasks) x 6 workers: 42.496 seconds, 40.249 seconds, 38.938 seconds
-- (12 cores, 36GB memory, 12 tasks) x 6 workers: 43.809 seconds, 42.952 seconds, 40.581 seconds
-- (12 cores, 24GB memory, 12 tasks) x 6 workers: OOM

We see that allocating only 3GB to an individual task achieves nearly the same performance. In your cluster, I guess increasing the memory for each worker would have no effect, while increasing the number of cores and running correspondingly more tasks would have an immediate effect.

2.
After running the analyze command, min() and max() can be computed directly from statistics. We can also split the query into smaller queries to get the result faster.

select count(distinct ss_sold_time_sk), sum(ss_quantity), min(ss_list_price), max(ss_sales_price)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
--> 51.136 seconds

select min(ss_list_price), max(ss_sales_price)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
--> 1.092 seconds

select sum(ss_quantity)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
--> 4.816 seconds

select count(distinct ss_sold_time_sk)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
--> 27.055 seconds
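Summing the split timings as a quick check: the three smaller queries together finish in about 33 seconds, versus 51 seconds for the combined query, with count(distinct) dominating.

```python
# Sum of the three split queries' times vs. the combined query,
# using the timings reported above.
combined = 51.136
split_times = [1.092, 4.816, 27.055]  # min/max, sum, count(distinct)
total_split = sum(split_times)
print(round(total_split, 3))  # 32.963
```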

3.
If the data source is connected over network and the same input data is accessed repeatedly, using LLAP I/O might help (although I wouldn't really recommend this for on-premise clusters). For the above query, here is the result of using LLAP I/O in our local cluster where each worker consumes 96GB memory. Running the same query repeatedly increases cache hit ratio, thus reducing the total running time. (The result here is not very different because the data source (HDFS) is collocated.)

-- (12 cores, 60GB, 12 tasks, 36GB LLAP IO) x 6 workers: 111.79 seconds, 60.912 seconds, 50.146 seconds, 45.62 seconds, 37.867 seconds

This is all I can think of about your query for now. You could try changing Hive configuration parameters, but it would be hard to beat Presto because of its architecture strength (which, however, translates to its architectural weakness when it comes to join operations and fault tolerance).

Cheers,

--- Sungwoo

Sungwoo Park

unread,
Jul 27, 2022, 4:39:32 AM7/27/22
to MR3
Out of curiosity, depending on the type of the day column, I wonder if it makes any difference to rewrite

where day between '2022-05-01' and '2022-05-31'

as:

where day between cast('2022-05-01' as date) and cast('2022-05-31' as date)

--- Sungwoo

Sungwoo Park

unread,
Jul 27, 2022, 12:04:08 PM7/27/22
to MR3
I evaluated Presto 0.271 using the same query on the same 10TB TPC-DS dataset in the same cluster.

Setup:
- Presto 0.271 (which is rather old, but should be okay for quick comparison)
- 6 Presto workers
- Each worker runs on a node with 256GB memory and 40 cores (vs 48GB memory and 12 cores for a single Hive-MR3 worker).
- Worker configurations are the same as in the blog article: https://www.datamonad.com/post/2019-08-22-comparison-presto317-0.10/

Result:
- The same query takes about 1.25 minutes (vs less than 50 seconds for Hive-MR3).

$ ./run_query.sh /home/gitlab-runner/a.sql

WARNING: COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough
"46800","271662029378","1.0","200.0"

As Hive-MR3 far outperforms Presto 0.271 for a similar query in my experiment, there might be some configuration changes we could try in your Hive-MR3 setup.

Q. Do you use S3 (like MinIO) to store datasets?

-- Sungwoo

David Engel

unread,
Jul 27, 2022, 2:57:40 PM7/27/22
to Sungwoo Park, MR3
I agree that Hive is not necessarily the best tool for queries like
this.

1. I tried reconfiguring to have workers with more cores but saw no
change.

2. I don't see similar improvements when breaking up the query. Each
individual select essentially takes the same time as running all in
one query.

3. The data is logically over a network. However, all nodes are
virtual machines running in the same Dell VM cluster. It's likely
that some, maybe even most, "network" operations are as fast as VMware
can move data between VMs.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/99c3f835-c9f4-4ff9-bab2-93693a50acdfn%40googlegroups.com.


--
David Engel
da...@istwok.net

David Engel

unread,
Jul 27, 2022, 3:01:21 PM7/27/22
to Sungwoo Park, MR3
Day is used as a date but is actually defined as a string. I don't
remember the exact details any more as it was several years ago, but
when we first tried using day as a partition, Hive had a bug using
dates as partitions.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/c32bedbe-f896-4c11-a195-f57be84f78e5n%40googlegroups.com.


--
David Engel
da...@istwok.net

David Engel

unread,
Jul 27, 2022, 3:07:22 PM7/27/22
to Sungwoo Park, MR3
Those are very interesting results and counter to what we're seeing.

Yes, we are using MinIO. As previously reported, all of our nodes are
VMs in the same VMware cluster so network access should be very fast.

I don't know how much this would affect results, but I have
hive.exec.scratchdir and hive.query.results.cache.directory set to use
S3/MinIO. I did that initially to avoid the extra complexity of
setting up more PVs. Do you think it would make much difference to
use a local PV on each node? I have about 100GiB available on each
node.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/c9f264b7-a785-40d0-bf3f-836d05276744n%40googlegroups.com.


--
David Engel
da...@istwok.net

David Engel

unread,
Jul 27, 2022, 4:39:07 PM7/27/22
to Sungwoo Park, MR3
Nevermind the scratchdir and results.cache question. I confused the
workdir usage with something else.

David
--
David Engel
da...@istwok.net

Sungwoo Park

unread,
Jul 27, 2022, 10:32:11 PM7/27/22
to MR3
1. I tried reconfiguring to have workers with more cores but saw no
change.

2. I don't see similar improvements when breaking up the query. Each
individual select essentially takes the same time as running all in
one query.

From these two points and the use of S3, I guess the S3 configuration for accessing input data might be the culprit. From my understanding, count(distinct) is a very expensive operation and should take much longer than the simple sum() query, as shown in my experiment. As this is not the case in your experiment and having more workers does not affect the performance either, it seems that S3 is the bottleneck (irrespective of network hardware).

I wonder if you set S3 configuration parameters in conf/core-site.xml. The following page (Accessing S3-compatible storage) talks a bit about this topic: https://mr3docs.datamonad.com/docs/k8s/advanced/access-s3/
Or, could you try large values for these configuration keys in conf/core-site.xml, if you haven't set them yet?:

    <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
    </property>

    <property>
      <name>fs.s3a.connection.maximum</name>
      <value>5000</value>
    </property>

    <property>
      <name>fs.s3.maxConnections</name>
      <value>5000</value>
    </property>

    <property>
      <name>fs.s3a.threads.max</name>
      <value>250</value>
    </property>

    <property>
      <name>fs.s3a.threads.core</name>
      <value>250</value>
    </property>

--- Sungwoo

Sungwoo Park

unread,
Jul 28, 2022, 12:55:00 AM7/28/22
to MR3
I ran a small experiment to show the effect of configurations for S3.

=== Dataset and basic settings

Dataset: 1TB TPC-DS, stored in MinIO, accessed over secure network

Query:

select count(distinct ss_sold_time_sk), sum(ss_quantity), min(ss_list_price), max(ss_sales_price)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;

Hive-MR3 runs on Kubernetes.
We use 3 workers.
Each worker has 19 cores and 60GB memory.
DAGAppMaster should be given enough cores. In my experiment, DAGAppMaster uses 6 cores.

=== Setting #1: set S3 configuration parameters in conf/core-site.xml

conf/core-site.xml:


    <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
    </property>

    <property>
      <name>fs.s3a.connection.maximum</name>
      <value>5000</value>
    </property>

    <property>
      <name>fs.s3.maxConnections</name>
      <value>5000</value>
    </property>

    <property>
      <name>fs.s3a.threads.max</name>
      <value>250</value>
    </property>

    <property>
      <name>fs.s3a.threads.core</name>
      <value>250</value>
    </property>

    <property>
      <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
      <value>50</value>
    </property>

In DAGAppMaster, initializing Map vertex takes about 5 seconds (spent in HiveSplitGenerator). This operation is fast on HDFS, but not as fast on S3. As far as I know, Presto implements its own optimization and is efficient in accessing S3, whereas Hive uses the Hadoop library which is slower.

Map 1                 llap  Initializing     -1          0        0       -1       0       0  

=== Result #1 - all the queries run normally.


select count(distinct ss_sold_time_sk), sum(ss_quantity), min(ss_list_price), max(ss_sales_price)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
+--------+--------------+------+--------+
1 row selected (24.217 seconds)
1 row selected (20.258 seconds)
1 row selected (21.428 seconds)


select count(distinct ss_sold_time_sk)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
+--------+
1 row selected (15.761 seconds)
1 row selected (16.398 seconds)
1 row selected (14.358 seconds)


select sum(ss_quantity)
from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
+--------------+
1 row selected (8.124 seconds)
1 row selected (8.945 seconds)
1 row selected (8.722 seconds)

select min(ss_list_price)

from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
+------+
1 row selected (8.529 seconds)
1 row selected (9.412 seconds)
1 row selected (8.819 seconds)

select max(ss_sales_price)

from store_sales
where ss_sold_date_sk between 2451922 and 2452287;
+--------+
1 row selected (9.742 seconds)
1 row selected (10.415 seconds)
1 row selected (8.6 seconds)

=== Setting #2: unset all the above S3 configuration parameters in conf/core-site.xml

In DAGAppMaster, initializing Map vertex takes forever.

Map 1                 llap  Initializing     -1          0        0       -1       0       0  
Reducer 2             llap           New      4          0        0        4       0       0  
Reducer 3             llap           New      1          0        0        1       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 00/03  [>>--------------------------] 0%    ELAPSED TIME: 256.85 s   

The bottom line is that S3 parameters should be set appropriately.

--- Sungwoo

David Engel

unread,
Jul 28, 2022, 2:58:20 PM7/28/22
to Sungwoo Park, MR3
It's probably worth repeating that I followed the instructions on
https://mr3docs.datamonad.com/docs/quick/k8s/run-k8s/. As such, my
configuration is that of https://github.com/mr3project/mr3-run-k8s.git
except for things that had to be changed, like the S3 endpoint.

The S3 settings I was using are the following defaults from
mr3-run-k8s.git:

fs.s3a.path.style.access = true
fs.s3a.connection.maximum = 4000
fs.s3.maxConnections = 4000
fs.s3a.threads.max = 250
fs.s3a.threads.core = 250
mapreduce.input.fileinputformat.list-status.num-threads = <unset>

I changed them to the following but saw no appreciable change.

fs.s3a.path.style.access = true
fs.s3a.connection.maximum = 16000
fs.s3.maxConnections = 16000
fs.s3a.threads.max = 1000
fs.s3a.threads.core = 1000
mapreduce.input.fileinputformat.list-status.num-threads = 50

Next, I changed mr3.am.resource.cpu.cores to 6 in conf/mr3-site.xml.
No change.

Finally, I changed requests.cpu and limits.cpu to 6 in yaml/hive.yaml.
Still no change.

David
> To view this discussion on the web visit https://groups.google.com/d/msgid/hive-mr3/2076516d-06af-45ae-bfef-15132af683fbn%40googlegroups.com.


--
David Engel
da...@istwok.net

Sungwoo Park

unread,
Jul 28, 2022, 10:58:16 PM7/28/22
to MR3
Judging from your results, I think there might be something that we miss in your configuration of Hive-MR3. At this stage, without access to the VM cluster and without the log of HiveServer2/MR3 master/MR3 workers, it would be hard to figure out why. If you would like to continue to work on this problem, please let me know (by email).

As the last note, it may be that Trino is really fast, even on count(distinct). I never tested Trino myself, but for Presto, it was slower than Hive in my own testing based on TPC-DS.

--- Sungwoo

David Engel

unread,
Jul 28, 2022, 11:29:55 PM7/28/22