I just did a quick search for a setting I saw earlier, and it led me
to the code which silently turns ALTER TABLE ... CONCATENATE on ACID
tables into COMPACT 'MAJOR'. I suspect the only way to force the
creation of larger files is to copy the data to a temporary table,
delete it from the target table, and then re-insert it with a job
that uses a sufficiently small number of threads.
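Roughly, the workaround I have in mind (a sketch only; the table,
partition, and reducer-cap values are made up, and capping reducers
is just one possible knob for limiting the number of output files):

    -- Cap reducers so the re-insert writes a few large files
    SET hive.exec.reducers.max=4;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Stage the partition's rows (SELECT * on a partitioned table
    -- returns the partition column last, which PARTITION (ds) needs)
    CREATE TEMPORARY TABLE t_stage AS
      SELECT * FROM target_table WHERE ds = '2023-01-01';

    -- Remove the original rows, then write them back as fewer files
    DELETE FROM target_table WHERE ds = '2023-01-01';
    INSERT INTO target_table PARTITION (ds) SELECT * FROM t_stage;
    DROP TABLE t_stage;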
For non-ACID tables, CONCATENATE fires off a job which essentially
does an in-place INSERT OVERWRITE. The test I ran for this created 3
threads. I suspect there is a setting which could control that.
Isn't mapreduce.input.fileinputformat.split.maxsize also the setting
which affects the choice of split strategy when
hive.exec.orc.split.strategy is set to HYBRID?
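For the record, the settings I was looking at (values illustrative).
My understanding is that HYBRID picks between the BI and ETL
strategies partly based on file sizes relative to the max split size,
and I suspect the same split-size settings govern how many tasks the
non-ACID CONCATENATE job spawns:

    -- ORC split generation strategy: BI, ETL, or HYBRID
    SET hive.exec.orc.split.strategy=HYBRID;
    -- Upper bound on split size; 256 MB here
    SET mapreduce.input.fileinputformat.split.maxsize=268435456;
    -- Possibly relevant to merge jobs: target size per merge task
    SET hive.merge.size.per.task=268435456;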
One disturbing trend with the copying from one table to another is
that things seem to be taking longer and longer. For example, one
table I'm almost finished copying is partitioned by day. I've been
copying it in 1 month chunks so about 30 partitions at a time. Each
partition is about 20 GB so the total amount copied in a chunk is
about 500-600 GB. The first chunks copied fairly routinely and about
as quickly as expected. The recent chunks are all taking several
hours to copy. The copying to a .hive-staging directory in the target
table appears to happen at about network speed. However, the part
where Hive moves the staged data into place and updates the Metastore
appears to be where Hive gets really, painfully slow.
Do you know how to find out why that copy is taking so long? This is the
last of the data I intend to copy this way for the near term so it
might be a moot point. I'd still like to know what's going on,
though, so I can fix it for future copies or know if it's a more
serious issue that will affect other, normal processing.
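One thing I may try myself: EXPLAIN shows the move/load stages that
run after the copy job itself, which at least identifies the stages
involved (query is illustrative):

    EXPLAIN
    INSERT INTO target_table PARTITION (ds)
    SELECT * FROM source_table
    WHERE ds BETWEEN '2023-01-01' AND '2023-01-31';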
I wonder if large, uncompacted deltas in other partitions are
contributing to the problem. With each new month's copy, there are
roughly 30 more large deltas. I've noticed before that Hive really
bogs down when there are large deltas. In these cases the partitions
don't overlap at all, but in my experience, Hive doesn't always
realize that and doesn't optimize accordingly.
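If the deltas are the cause, forcing compaction on the already-copied
partitions might help; something like (partition spec illustrative):

    -- See what the compactor has queued or running
    SHOW COMPACTIONS;
    -- Force a major compaction on one partition of the target table
    ALTER TABLE target_table PARTITION (ds='2023-01-01') COMPACT 'major';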
I think I'll have to discuss our S3 situation with my fellow engineers
and IT again. Unfortunately, the last time I looked, using the VM
cluster vendor's S3 implementation for the amount of data we like to
keep online was prohibitively expensive for our small business. Is it
possible to configure Hive with multiple S3 endpoints? Maybe we could
keep our main, active data small enough to fit in what the vendor
allows for free and keep the rest in external tables on a slower
MinIO VM setup.
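From what I can tell, Hadoop's S3A connector supports per-bucket
overrides, so two endpoints could coexist; the properties below
(shown as key = value; they would go in core-site.xml, and the bucket
names and URLs are made up) sketch the idea:

    # Fast, vendor-hosted bucket for the active warehouse
    fs.s3a.bucket.warehouse.endpoint = https://s3.vendor.example.com
    # Slower MinIO bucket for external/archive tables
    fs.s3a.bucket.archive.endpoint = https://minio.internal.example.com:9000
    fs.s3a.bucket.archive.path.style.access = true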
We might actually have a really good solution. I finally got
specific information from IT on what SAN they are using, and it
appears to support both HDFS and S3. It's not yet clear if that
support is standard or an extra-cost option. Assuming it is standard
or affordable, would HDFS be the preferred solution because of the S3
move/rename issue?
The S3 support is a subset of Amazon's full S3 API. Can you please
take a look at the following PDF and see if you think it would be
sufficient for Hive/MR3?
https://www.delltechnologies.com/asset/en-us/products/storage/industry-market/h18293-dell-emc-powerscale-onefs-s3-api-guide.pdf.external