ORC File Concatenation


David Engel

Nov 9, 2022, 3:52:20 PM
to MR3
Are there any settings which control the concatenation of ORC files
when initiated with ALTER TABLE ... CONCATENATE? The way our data
comes in and gets loaded leads to lots of files with sizes in the
range of a few MB to a few tens of MB. I tried running the
aforementioned CONCATENATE query to combine the ORC files into larger
ones. It doesn't have the desired effect, though, and looks like it
simply does the same as COMPACT 'MAJOR'.

David
--
David Engel
da...@istwok.net

Sungwoo Park

Nov 10, 2022, 5:36:05 AM
to David Engel, MR3
Could you try increasing the value of mapreduce.input.fileinputformat.split.minsize (e.g., to 240000000)?
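
For example, something like the following could be tried in the session before the concatenation (the table name below is only a placeholder):

set mapreduce.input.fileinputformat.split.minsize=240000000;
alter table your_table concatenate;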

Sungwoo


David Engel

Nov 10, 2022, 8:53:22 PM
to Sungwoo Park, MR3
That doesn't seem to have any effect. I've tried reading the Hive
source and found some of the concatenate logic but not yet any setting
which controls the resulting file sizes.

David
--
David Engel
da...@istwok.net

David Engel

Nov 10, 2022, 9:20:53 PM
to Sungwoo Park, MR3
I just did a quick search for a setting I saw earlier and it led me
to the code which silently turns ALTER TABLE ... CONCATENATE on ACID
tables into COMPACT 'MAJOR'. I suspect the only way to force the
creation of larger files is to copy the data to a temporary table,
delete it from the target table and then re-insert it with a job
having a sufficiently small number of threads.
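
A rough sketch of what I mean (table names, the partition column, and
the date range are only placeholders, assuming an ACID target table):

-- copy the affected rows out to a temporary table
create temporary table tmp_copy stored as orc as
select * from target_table where day between '2019-01-01' and '2019-01-31';

-- delete them from the ACID target table
delete from target_table where day between '2019-01-01' and '2019-01-31';

-- re-insert them; fewer concurrent writers should yield fewer, larger files
insert into target_table select * from tmp_copy;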

For non-ACID tables, CONCATENATE fires off a job which essentially
does an in-place (RE)INSERT OVERWRITE. The test I ran for this
created 3 threads. I suspect there is a setting which could control
that.

David
--
David Engel
da...@istwok.net

Sungwoo Park

Nov 10, 2022, 10:10:39 PM
to David Engel, MR3
> I just did a quick search for a setting I saw earlier and it led me
> to the code which silently turns ALTER TABLE ... CONCATENATE on ACID
> tables into COMPACT 'MAJOR'.  I suspect the only way to force the
> creation of larger files is to copy the data to a temporary table,
> delete it from the target table and then re-insert it with a job
> having a sufficiently small number of threads.

A quick thought -- Deleting data from an existing table generates new delta files, so if you would like to effectively rebuild the table, maybe 'delete followed by re-insert' could be implemented as 'delete the whole table and create a fresh table with CTAS'. 
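
A minimal sketch of that idea (table names are placeholders; if the table is partitioned, the new table may need to be created first and then populated with insert overwrite, since plain CTAS may not carry over partitioning and table properties):

create table target_table_rebuilt stored as orc
as select * from target_table;

drop table target_table;
alter table target_table_rebuilt rename to target_table;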

Sungwoo

Sungwoo Park

Nov 11, 2022, 12:01:33 AM
to David Engel, MR3
> For non-ACID tables, CONCATENATE fires off a job which essentially
> does an in-place (RE)INSERT OVERWRITE.  The test I ran for this
> created 3 threads.  I suspect there is a setting which could control
> that.

For controlling the size of output ORC files, we can change mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. (I have found that changing only minsize does not achieve the desired result.)

Here is the result of a quick experiment.

create table inventory_copy2 (inv_item_sk bigint, inv_warehouse_sk bigint, inv_quantity_on_hand int, inv_date_sk bigint) stored as orc TBLPROPERTIES('transactional'='false', 'transactional_properties'='default');
insert into table inventory_copy2 select * from inventory_copy;
--> 130 files, max 17M, min 4.4M

set mapreduce.input.fileinputformat.split.minsize=248435456;
set mapreduce.input.fileinputformat.split.maxsize=248435456;
alter table inventory_copy2 concatenate;
--> 7 files, max 248M, min 66M

set mapreduce.input.fileinputformat.split.minsize=128435456;
set mapreduce.input.fileinputformat.split.maxsize=128435456;
alter table inventory_copy2 concatenate;
--> 12 files, max 135M, min 101M

set mapreduce.input.fileinputformat.split.minsize=68435456;
set mapreduce.input.fileinputformat.split.maxsize=68435456;
alter table inventory_copy2 concatenate;
--> 22 files, max 79M, min 57M

Sungwoo

David Engel

Nov 12, 2022, 2:27:07 PM
to Sungwoo Park, MR3
On Fri, Nov 11, 2022 at 02:00:49PM +0900, Sungwoo Park wrote:
> >
> > Non-ACID tables fires off a job which essentially does an in-place
> > (RE)INSERT OVERWRITE. The test I ran for this created 3 threads. I
> > suspect there is a setting which could control that.
> >
>
> For controlling the size of output ORC files, we can
> change mapreduce.input.fileinputformat.split.minsize
> and mapreduce.input.fileinputformat.split.maxsize. (I have found that
> changing only minsize does not achieve the desired result.)

Isn't mapreduce.input.fileinputformat.split.maxsize also the setting
which affects the choice of split strategy when set to HYBRID?

> Here is the result of a quick experiment.
>
> create table inventory_copy2 ( inv_item_sk bigint,
> inv_warehouse_sk bigint, inv_quantity_on_hand int, inv_date_sk bigint)
> stored as orc TBLPROPERTIES('transactional'='false',
> 'transactional_properties'='default');
> insert into table inventory_copy2 select * from inventory_copy;
> --> 130 files, max 17M, min 4.4M
>
> set mapreduce.input.fileinputformat.split.minsize=248435456;
> set mapreduce.input.fileinputformat.split.maxsize=248435456;
> alter table inventory_copy2 concatenate;
> --- 7 files, max 248M, min 66M
>
> set mapreduce.input.fileinputformat.split.minsize=128435456;
> set mapreduce.input.fileinputformat.split.maxsize=128435456;
> alter table inventory_copy2 concatenate;
> --> 12 files, max 135M, min 101M
>
> set mapreduce.input.fileinputformat.split.minsize=68435456;
> set mapreduce.input.fileinputformat.split.maxsize=68435456;
> alter table inventory_copy2 concatenate;
> --> 22 files, max 79M, min 57M

I'll put this on the list of things to try later. Thanks.

David Engel

Nov 12, 2022, 2:50:54 PM
to Sungwoo Park, MR3
I've been using a mix of reloading data from scratch and copying data
out, deleting and copying back in. For existing projects where we
need to update our processing and also want to change the
partitioning, I've been reloading data from scratch as that seems to
be faster than other methods of moving data between clusters. For
dormant projects whose data is still sometimes referenced, I've
moved data in various ways and then when changing the partitioning,
I've copied from an old table to a new one.

One disturbing trend with the copying from one table to another is
that things seem to be taking longer and longer. For example, one
table I'm almost finished copying is partitioned by day. I've been
copying it in 1 month chunks so about 30 partitions at a time. Each
partition is about 20 GB so the total amount copied in a chunk is
about 500-600 GB. The first chunks copied fairly routinely and about
as quickly as expected. The recent chunks are all taking several
hours to copy. The copying to a .hive-staging directory in the target
table appears to happen at about network speed. However, the part
where Hive moves the staged data into place and updates the Metastore
appears to be where Hive gets really, painfully slow.

Do you know how to find out why that copy is taking so long? This is the
last of the data I intend to copy this way for the near term so it
might be a moot point. I'd still like to know what's going on,
though, so I can fix it for future copies or know if it's a more
serious issue that will affect other, normal processing.

Sungwoo Park

Nov 13, 2022, 12:51:41 AM
to David Engel, MR3
> Isn't mapreduce.input.fileinputformat.split.maxsize also the setting
> which affects the choice of split strategy when set to HYBRID?

Yes, it is used when hive.exec.orc.split.strategy is set to HYBRID or ETL.
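
For instance, only as a sketch (the size value here is arbitrary):

set hive.exec.orc.split.strategy=ETL;
set mapreduce.input.fileinputformat.split.maxsize=248435456;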

Sungwoo
 

Sungwoo Park

Nov 13, 2022, 1:43:56 AM
to David Engel, MR3
> One disturbing trend with the copying from one table to another is
> that things seem to be taking longer and longer.  For example, one
> table I'm almost finished copying is partitioned by day.  I've been
> copying it in 1 month chunks so about 30 partitions at a time.  Each
> partition is about 20 GB so the total amount copied in a chunk is
> about 500-600 GB.  The first chunks copied fairly routinely and about
> as quickly as expected.  The recent chunks are all taking several
> hours to copy.  The copying to a .hive-staging directory in the target
> table appears to happen at about network speed.  However, the part
> where Hive moves the staged data into place and updates the Metastore
> appears to be where Hive gets really, painfully slow.
>
> Do you know how to find out why that copy is taking so long?  This is the
> last of the data I intend to copy this way for the near term so it
> might be a moot point.  I'd still like to know what's going on,
> though, so I can fix it for future copies or know if it's a more
> serious issue that will affect other, normal processing.

(Do you copy one chunk at a time by executing 'insert overwrite' and observe the slow-down when copying recent chunks? Or do you try to copy all chunks by executing a single command?)

Not sure why loading takes increasingly longer in your case. The symptom looks like some sort of resource leak, and it could be due to a bug in Metastore or HiveServer2. To find out the source of the problem, I think we should watch Metastore and HiveServer2 (e.g., memory consumption and active threads) because the performance bottleneck can happen at various stages (HiveServer2, Metastore, database, and S3).

For configuration parameters related to MoveTask, we could change:
hive.mv.files.thread
hive.load.dynamic.partitions.thread
hive.metastore.fshandler.threads
(However, I don't think changing these parameters will solve the problem of the loading time getting increasingly longer.)
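
For illustration, the first two could be raised in the session before running the copy (the values below are only guesses to experiment with; hive.metastore.fshandler.threads is a Metastore-side setting, so it likely has to be changed in the Metastore configuration rather than per session):

set hive.mv.files.thread=30;
set hive.load.dynamic.partitions.thread=30;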

Sungwoo


David Engel

Nov 13, 2022, 1:10:24 PM
to Sungwoo Park, MR3
On Sun, Nov 13, 2022 at 03:43:42PM +0900, Sungwoo Park wrote:
> >
> > One disturbing trend with the copying from one table to another is
> > that things seem to be taking longer and longer. For example, one
> > table I'm almost finished copying is partitioned by day. I've been
> > copying it in 1 month chunks so about 30 partitions at a time. Each
> > partition is about 20 GB so the total amount copied in a chunk is
> > about 500-600 GB. The first chunks copied fairly routinely and about
> > as quickly as expected. The recent chunks are all taking several
> > hours to copy. The copying to a .hive-staging directory in the target
> > table appears to happen at about network speed. However, the part
> > where Hive moves the staged data into place and updates the Metastore
> > appears to be where Hive gets really, painfully slow.
> >
> > Do you know how find out why that copy is taking so long? This is the
> > last of the data I intend to copy this way for the near term so it
> > might be a moot point. I'd still like to know what's going on,
> > though, so I can fix it for future copies or know if it's a more
> > serious issue that will affect other, normal processing.
> >
>
> (Do you copy one chunk at a time by executing 'insert overwrite' and
> observe the slow-down when copying recent chunks? Or do you try to copy all
> chunks by executing a single command?)

The target partitions are empty so I don't use overwrite. I basically
do the following:

insert into desttable
select col1, col2, ..., colN
from sourcetable
where day between '2019-01-01' and '2019-01-31';

It is when all vertices and the overall progress report show 100%
complete that things pause for several hours.

> Not sure why loading takes increasingly longer in your case. The symptom
> looks like some sort of resource leak, and it could be due to a bug in
> Metastore or HiveServer2. To find out the source of the problem, I think we
> should watch Metastore and HiveServer2 (e.g., memory consumption and active
> threads) because the performance bottleneck can happen at various stages
> (HiveServer2, Metastore, database, and S3).

I don't suspect a resource leak as other operations appear to be
working normally.

> For configuration parameters related to MoveTask, we could change:
> hive.mv.files.thread
> hive.load.dynamic.partitions.thread
> hive.metastore.fshandler.threads
> (However, I don't think changing these parameters will solve the problem of
> the loading time getting increasingly longer.)

After I confirm that the just-completed copies are correct, I might
try tweaking some of these settings.

I wonder if large, uncompacted deltas in other partitions are
contributing to the problem. With each new month's copy, there are
roughly 30 more, large deltas. I've noticed before that Hive really
bogs down when there are large deltas. In these cases the partitions
don't overlap at all, but IME, Hive doesn't always realize that and
doesn't optimize accordingly.

Sungwoo Park

Nov 14, 2022, 2:56:23 AM
to David Engel, MR3
> I wonder if large, uncompacted deltas in other partitions are
> contributing to the problem.  With each new month's copy, there are
> roughly 30 more, large deltas.  I've noticed before that Hive really
> bogs down when there are large deltas.  In these cases the partitions
> don't overlap at all, but IME, Hive doesn't always realize that and
> doesn't optimize accordingly.

I think we can first analyze the execution time of your query. Here is an example of analyzing a sample query:

insert into store_sales
select * from tpcds_bin_partitioned_orc_1003.store_sales
where ss_sold_date_sk between (2450816 + 183) and (2450998 + 183);

After finishing a query, get the log of HiveServer2 and execute:

$ grep -e "DAG finished successfully" -e "Starting task" -e "Executing command" -e "Completed executing command" hs2.log
...
Status: DAG finished successfully in 245.69 seconds
2022-11-14T07:18:27,850  INFO [HiveServer2-Background-Pool: Thread-363] SessionState: Status: DAG finished successfully in 245.69 seconds
2022-11-14T07:20:52,788  INFO [HiveServer2-Background-Pool: Thread-363] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
2022-11-14T07:20:52,788  INFO [HiveServer2-Background-Pool: Thread-363] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
2022-11-14T07:21:26,328  INFO [HiveServer2-Background-Pool: Thread-363] ql.Driver: Starting task [Stage-3:STATS] in serial mode
2022-11-14T07:21:36,290  INFO [HiveServer2-Background-Pool: Thread-363] ql.Driver: Completed executing command(queryId=hive_20221114071419_d0ca1e0b-e311-45a9-8294-9bfc23ccece8); Time taken: 434.717 seconds

In this example, the total execution time is 434.717 seconds.

1) Executing a DAG (in MR3) takes 245.69 seconds.

2) After completing the execution of the DAG and before starting [Stage-2:DEPENDENCY_COLLECTION], HiveServer2 spends about 07:20:52-07:18:27 = 145 seconds closing the operation (executing FileSinkOperator.jobCloseOp()). In the case of using S3, this operation is quite inefficient because renaming is implemented as copy-and-delete. In the end, the output directory is copied twice.

3) [Stage-0:MOVE] spends 07:21:26-07:20:52 = 34 seconds.

4) [Stage-3:STATS] spends 07:21:36-07:21:26 = 10 seconds.

Could you analyze your query as shown above? Steps 1) and 2) do not involve the Metastore.

Sungwoo

Sungwoo Park

Nov 14, 2022, 6:24:39 AM
to David Engel, MR3
> I wonder if large, uncompacted deltas in other partitions are
> contributing to the problem.  With each new month's copy, there are
> roughly 30 more, large deltas.  I've noticed before that Hive really
> bogs down when there are large deltas.  In these cases the partitions
> don't overlap at all, but IME, Hive doesn't always realize that and
> doesn't optimize accordingly.

To avoid creating delta_??? directories and instead create base_???, you can try 'insert overwrite table' instead of 'insert into'.

Example of a query producing base_???:

insert overwrite table store_sales partition (ss_sold_date_sk)
select * from tpcds_bin_partitioned_orc_1003.store_sales
where ss_sold_date_sk between 2450816 and 2450998
sort by ss_sold_date_sk;

Example of a query producing delta_???:

insert into store_sales partition (ss_sold_date_sk)
select * from tpcds_bin_partitioned_orc_1003.store_sales
where ss_sold_date_sk between (2450816 + 183 * 3) and (2450998 + 183 * 3)
sort by ss_sold_date_sk;

Avoiding delta_??? directories may fix the problem of increasingly longer loading time, but unfortunately this is not something I can confirm at this point :-(

Cheers,

Sungwoo


David Engel

Nov 14, 2022, 10:32:44 PM
to Sungwoo Park, MR3
I'm still waiting for my test query to finish but I think you've
already helped identify the main problem. Not terribly surprisingly,
it appears that S3, or rather our specific MinIO configuration, is to
blame.

As you may recall, our IT really pushed running Hive/MR3 on Kubernetes
in their VM cluster. I think that's still reasonable for the master
and worker nodes. It probably isn't for the storage nodes as I
initially feared. The PVs used by MinIO are actually virtual disks
located on an adjacent SAN. Because of that, any S3 operation whose
data isn't already cached in the MinIO node's RAM requires two network
copies -- one between MinIO and the SAN and another between MinIO
and the client. Add in the new revelation that S3 moves/renames are
really copies/deletes(*), and the number of copies doubles again.

I think I'll have to discuss our S3 situation with my fellow engineers
and IT again. Unfortunately, the last time I looked, using the VM
cluster vendors' S3 implementation for the amount of data we like to
keep on-line was prohibitive for our small business. Is it possible
to configure Hive with multiple S3 endpoints? Maybe we could keep
our main, active data small enough to fit in what the vendor allows
for free and keep the rest in external tables on a slower MinIO VM
setup.

David

(*)I was beginning to suspect that but hadn't yet checked on it. That
definitely explains some but not all of the slowness.

--
David Engel
da...@istwok.net

Sungwoo Park

Nov 15, 2022, 8:05:16 AM
to David Engel, MR3
> I think I'll have to discuss our S3 situation with my fellow engineers
> and IT again.  Unfortunately, the last time I looked, using the VM
> cluster vendors' S3 implementation for the amount of data we like to
> keep on-line was prohibitive for our small business.  Is it possible
> to configure Hive with multiple S3 endpoints?  Maybe we could keep
> our main, active data small enough to fit in what the vendor allows
> for free and keep the rest in external tables on a slower MinIO VM
> setup.

I think it is tricky (or impossible) to use multiple S3 endpoints because only a single value can be set for the configuration key fs.s3a.endpoint in core-site.xml.

I wonder if you can mount virtual disks of SAN on a PersistentVolume in every Pod (HS2, MR3 DAGAppMaster, and ContainerWorkers). If this is feasible, you could store cold data in external tables mapped to virtual disks of SAN (via the PersistentVolume).

Here is a simple example.

1. Assume that a virtual disk is accessible at /opt/mr3-run/work-dir/external/ inside all Pods.

2. Copy the ORC data for a table under a directory /opt/mr3-run/work-dir/external/tpcds_bin_partitioned_orc_1000/call_center/ (which would point to tpcds_bin_partitioned_orc_1000/call_center/ on the virtual disk).

3. Make sure that the ORC data was created in a non-transactional table (without base_??? and delta_??? directories).

bash-4.2$ pwd
/opt/mr3-run/work-dir/external/tpcds_bin_partitioned_orc_1000/call_center
bash-4.2$ ls
000000_0

4. Create an external table.

create external table call_center(cc_call_center_sk bigint , ...)
stored as orc
location 'file:/opt/mr3-run/work-dir/external/tpcds_bin_partitioned_orc_1000/call_center';

5. Now you can read the external table.

In order to add new partitions, you can prepare new data in a temporary table using S3, manually copy it under the destination directory on the virtual disk, and run 'msck repair'. This is not an ideal solution, but if the data is accessed only sporadically and thus not really worth storing on S3, this plan might be acceptable.
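
For example, assuming a partitioned external table named ext_table (the name is only a placeholder), the last step would be something like:

msck repair table ext_table;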

Sungwoo

David Engel

Nov 15, 2022, 10:45:05 AM
to Sungwoo Park, MR3
On Tue, Nov 15, 2022 at 10:05:04PM +0900, Sungwoo Park wrote:
> >
> > I think I'll have to discuss our S3 situation with my fellow engineers
> > and IT again. Unfortunately, the last time I looked, using the VM
> > cluster vendors' S3 implementation for the amount of data we like to
> > keep on-line was prohibitive for our small business. Is it possible
> > to configure Hive with multiple, S3 endpoints? Maybe we could keep
> > our main, active data small enough to fit in what the vendor allows
> > for free and keep the rest in external tables on a slower, Minio, VM
> > setup.
> >
>
> I think it is tricky (or impossible) to use multiple S3 endpoints because
> we can use a single value for the configuration key fs.s3a.endpoint in
> core-site.xml.
>
> I wonder if you can mount virtual disks of SAN on a PersistentVolume in
> every Pod (HS2, MR3 DAGAppMaster, and ContainerWorkers). If this is
> feasible, you could store cold data in external tables mapped to virtual
> disks of SAN (via the PersistentVolume).

That might be possible. We currently mount the home directories and a
very large data filesystem on our client VM into all pods via NFS.
Both are virtual disks which ultimately reside on the SAN, and I've
noticed similar slowness with them, though not quite as bad. I
intend to check if the SAN can natively support NFS so those could be
mounted directly from the SAN into pods.

David
--
David Engel
da...@istwok.net

David Engel

Nov 15, 2022, 4:49:04 PM
to Sungwoo Park, MR3
We might actually have a really good solution. I finally got
specific information from IT on what SAN they are using and it appears
to support both HDFS and S3. It's not yet clear if that support is
standard or an extra cost option. Assuming it is standard or
affordable, would HDFS be the preferred solution because of the S3
move/rename issue?

David
--
David Engel
da...@istwok.net

Sungwoo Park

Nov 15, 2022, 5:03:22 PM
to David Engel, MR3
> We might actually have a really good solution.  I finally got
> specific information from IT on what SAN they are using and it appears
> to support both HDFS and S3.  It's not yet clear if that support is
> standard or an extra cost option.  Assuming it is standard or
> affordable, would HDFS be the preferred solution because of the S3
> move/rename issue?

Yes, I think so, provided that scalability is not an issue. (Hive-MR3 even exploits data locality if data is stored on HDFS and worker Pods are collocated with HDFS nodes, but this is irrelevant in your case because you use SAN.)

Sungwoo
 

David Engel

Nov 15, 2022, 6:11:58 PM
to Sungwoo Park, MR3
As best I can tell, the HDFS option is an extra cost option whereas
the S3 option is standard. I'll ask my IT guy to check on the cost.

Data locality won't be applicable. The HDFS support on the storage
cluster is purely HDFS with only name- and datanodes. Any worker and
compute nodes must be external to the SAN cluster.

The S3 support is a subset of Amazon's full S3 API. Can you please
take a look at the following PDF and see if you think it would be
sufficient for Hive/MR3?

https://www.delltechnologies.com/asset/en-us/products/storage/industry-market/h18293-dell-emc-powerscale-onefs-s3-api-guide.pdf.external

Sungwoo Park

Nov 15, 2022, 8:53:34 PM
to David Engel, MR3
> The S3 support is a subset of Amazon's full S3 API.  Can you please
> take a look at the following PDF and see if you think it would be
> sufficient for Hive/MR3?
>
> https://www.delltechnologies.com/asset/en-us/products/storage/industry-market/h18293-dell-emc-powerscale-onefs-s3-api-guide.pdf.external

It looks like the supported subset of the API is enough for running Hive-MR3 (or Hive in general).

Internally Hive uses the Hadoop FileSystem library. S3AFileSystem in Hadoop uses the following S3 API requests:

Supported by OneFS:
  CopyObjectRequest
  DeleteObjectsRequest
  ListObjectsRequest
  ListObjectsV2Request
  PutObjectRequest
  UploadPartRequest
  AbortMultipartUploadRequest

Not supported by OneFS:
  GetObjectMetadataRequest
  InitiateMultipartUploadRequest
  ListMultipartUploadsRequest

From the specification, the last three requests are not supported by OneFS (maybe except GetObjectMetadataRequest, which might be another name for GetObjectRequest which is supported by OneFS). The unsupported requests are not used in Hive, so there is a good chance that OneFS can be used with Hive out of the box. However, OneFS has a few limitations, so I think we need to try it out to see if Hive-MR3 works okay.

Sungwoo

David Engel

Nov 17, 2022, 12:29:18 AM
to Sungwoo Park, MR3
Thanks for checking.