create external s3 table


Bartek Siudeja

Feb 10, 2021, 8:41:05 PM
to MR3
Hello,

I was playing with Hive on Kubernetes, on minikube using the pre-built images. So far it has been quite a smooth experience. I can use Beeline, run queries, and create tables, including external tables on S3. Really nice, all on a single AWS instance (S3 access is via an IAM role)!

But I noticed one issue. Maybe it is just Hive, not MR3 related. Create external table commands seem really slow, and slow down further as the number of objects in the S3 path grows:
# 30 partitions (shards), 1 file in each partition
CREATE EXTERNAL TABLE table1 (value string) PARTITIONED BY (shard string) LOCATION 's3a://path/date=2021-02-01/';
INFO  : Completed compiling command(queryId=); Time taken: 7.753 seconds
INFO  : Completed executing command(queryId=); Time taken: 3.735 seconds

# 30 * 60 partitions (dates and shards), 1 file in each partition
CREATE EXTERNAL TABLE table2 (value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://path/';
INFO  : Completed compiling command(queryId=); Time taken: 488.435 seconds
INFO  : Completed executing command(queryId=); Time taken: 0.227 seconds
This is surprising to me since create table does not actually create partitions, just an empty table. I need to run msck repair table table2 to actually get all the partitions (and this completes in about 20 seconds). So the time spent in the create command seems unnecessary.

I thought I had tracked this down to https://issues.apache.org/jira/browse/HIVE-20316, which essentially turns off file listing in the create external table command. But I am getting the same slow speeds for both the hive3:1.2 and hive4:1.2 Docker Hub images. Is this patch included in either of those images (the Hive issue lists 3.2.0 and 4.0.0 as fixed versions)? Or maybe this is something completely different?

Best,
Bartek

Sungwoo Park

Feb 10, 2021, 9:47:35 PM
to MR3
Hello,

HIVE-20316 is included in Hive 4 on MR3, but not yet in Hive 3 on MR3. Since you experience the problem even with Hive 4 on MR3, it might be due to some other issue. I believe it is not related to MR3, but let me try to see why this happens and get back to you.

FYI, MR3 1.3 (to be released this month) will include HIVE-20316 in its Hive 3 distribution. Besides, it will place no limit on the amount of resources that MR3 can use on Kubernetes.

Best,

--- Sungwoo

Bartek Siudeja

Feb 11, 2021, 1:21:07 AM
to MR3
Thank you for a very quick answer. 

It might be that Hive is doing something else with S3 while running the create external table command. I am not sure what, but I am also not really familiar with the Hive codebase. I would really appreciate it if you could run some tests on your side. I believe any S3 external table that already has lots of partition data in many subfolders on S3 is automatically problematic for me. I also tried not including `location` at all in the create, and then running `alter table X set location 's3://true_location/'`. Hive is really slow at the alter table as well.

Thanks again,
Bartek

Sungwoo Park

Feb 11, 2021, 1:39:08 AM
to MR3
Hello,

I have a quick question. What is the value of hive.exec.scratchdir in hive-site.xml? Does it point to an S3 bucket?

I assume that you set fs.defaultFS to file:/// in core-site.xml.
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

Cheers,

--- Sungwoo

Sungwoo Park

Mar 11, 2021, 9:53:13 AM
to MR3
This is still an open question, but the following JIRA seems to suggest that it is actually the expected behavior of the current implementation of Hive Metastore (although it is about Hive 2.3). (On MinIO simulating S3, this problem is not reproduced.)


Cheers,

--- Sungwoo

Sungwoo Park

Mar 12, 2021, 5:38:02 AM
to MR3
My finding is that the current code is working as implemented, although I guess this behavior is not intended. It turns out that both HiveServer2 and Metastore make connections to S3 when 'create external table' is executed with a concrete location. Here is part of the trace log from MinIO while 'create external table' is executed.

0: jdbc:hive2://orange1:444/> CREATE EXTERNAL TABLE t2 (value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://tmp/load/';

01:29:41.889 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2      667µs       ↑ 153 B ↓ 196 B
01:29:41.893 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2      749µs       ↑ 153 B ↓ 196 B
01:29:41.897 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false  192.168.10.2      1.788ms      ↑ 153 B ↓ 611 B
01:29:41.903 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2      561µs       ↑ 153 B ↓ 196 B
01:29:41.906 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2      714µs       ↑ 153 B ↓ 196 B
01:29:41.910 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false  192.168.10.2      1.079ms      ↑ 153 B ↓ 611 B
01:29:41.914 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=5000&prefix=load%2F&fetch-owner=false  192.168.10.2      10.547ms     ↑ 153 B ↓ 6.8 KiB
01:29:41.962 [200 OK] s3.HeadBucket orange0:9000/tmp/ 192.168.10.2      507µs       ↑ 153 B ↓ 189 B
01:29:41.983 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2      419µs       ↑ 153 B ↓ 196 B
01:29:41.985 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2      810µs       ↑ 153 B ↓ 196 B
01:29:41.989 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false  192.168.10.2      1.415ms      ↑ 153 B ↓ 611 B

The problem here is that HiveServer2 and Metastore send ListObjectsV2 requests, which are equivalent to executing 'aws s3api list-objects-v2 --bucket ...' and thus perform a recursive traversal of the destination bucket. As a result, the execution time is proportional to the number of entries inside the bucket.

For HiveServer2, we can update the toTable() method in CreateTableDesc.java. It contains this code:

if (!this.isCTAS && (tbl.getPath() == null || (tbl.isEmpty() && !isExternal()))) {

This code is problematic because it calls tbl.isEmpty(), which sends a ListObjectsV2 request, only for the whole condition to come out false in the end because isExternal() evaluates to true. We can revise the code as follows:

if (!this.isCTAS && (tbl.getPath() == null || (!isExternal() && tbl.isEmpty()))) {

With this change, HiveServer2 no longer sends ListObjectsV2 requests. However, Metastore still sends a ListObjectsV2 request from Warehouse.isDir().

So, revising the toTable() method in CreateTableDesc.java is only a partial solution.
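The fix relies purely on Java's short-circuit evaluation of `&&`: with the cheap `!isExternal()` check first, the expensive `tbl.isEmpty()` call (the one that triggers the S3 listing) is never made for an external table. A standalone sketch, where `isEmpty()` and `isExternal()` are hypothetical stand-ins for Hive's methods, not the real code:

```java
// Minimal sketch of why reordering the && operands avoids the S3 call.
// isEmpty() and isExternal() are hypothetical stand-ins, not Hive's code.
public class ShortCircuitDemo {
    static int listCalls = 0;  // counts simulated ListObjectsV2 requests

    // stand-in for tbl.isEmpty(): pretends to list the S3 location
    static boolean isEmpty() {
        listCalls++;           // each call would be a recursive S3 listing
        return false;
    }

    // stand-in for isExternal(): a cheap in-memory metadata check
    static boolean isExternal() {
        return true;
    }

    public static void main(String[] args) {
        // original order: isEmpty() runs even though its result is discarded
        boolean before = isEmpty() && !isExternal();
        int callsBefore = listCalls;

        // patched order: !isExternal() is false, so isEmpty() is skipped
        boolean after = !isExternal() && isEmpty();
        int callsAfter = listCalls - callsBefore;

        System.out.println(before + " " + after + " " + callsBefore + " " + callsAfter);
    }
}
```

Both orderings compute the same boolean, but only the original one pays for the listing.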

A more principled solution would be to avoid calling FileSystem.listStatus() or FileSystem.getFileStatus() for S3, but this is beyond our scope.

Since many people must have seen this problem in production, what still puzzles me is how commercial distributions of Hive like CDP avoid this issue. It might be either a configuration issue of S3 on the server side (e.g., in our local cluster where S3 is simulated with MinIO, everything is very fast) or a configuration issue of S3A on the client side.

At any rate, let me include this patch in the next snapshot release. I believe this is also a fix for HIVE-24849 (https://issues.apache.org/jira/browse/HIVE-24849).

Cheers,

--- Sungwoo

Bartek Siudeja

Mar 12, 2021, 11:10:33 AM
to Sungwoo Park, MR3
Amazing job investigating this! Big thanks, Sungwoo.

If I understand correctly, this is a partial solution (also for the mentioned Hive ticket). Could things still be slow because of `isDir` in the Metastore, or really because of the `FileSystem.getFileStatus` call inside it?

You mention S3 or S3A configuration problems as a possibility (matching the FS issue above). Maybe AWS fixed this issue internally in some EMR version later than the 5.23 mentioned in the ticket. EMR uses a proprietary AWS S3 connector, at least in Spark, with no Apache S3A in there.

Thanks again for digging into this,
Bartek


Sungwoo Park

Mar 13, 2021, 12:22:02 AM
to MR3
> If I understand correctly this is a partial solution (also for the mentioned hive ticket). Things still could be slow because of `isDir` in metastore, or really because of `FileSystem.getFileStatus` call in this one?

The operations are still slow because of Warehouse.isDir(), so this is actually not a solution, unfortunately. Warehouse.isDir() calls FileSystem.getFileStatus(), which can take a long time when the FileSystem is S3.

Cheers,

--- Sungwoo

Sungwoo Park

Aug 12, 2021, 7:33:52 AM
to MR3
This issue has been fixed in MR3 1.3. It turns out that the fix shown above is a complete solution (not a partial one) because Warehouse.isDir() issues only a single S3 read request. Here is the result of testing on AWS: data2 contains 2 * 2 files, data10 contains 10 * 10 files, and data20 contains 20 * 20 files. I made a pull request for the JIRA: https://issues.apache.org/jira/browse/HIVE-24849

0: jdbc:hive2://3.35.2.152:9852/> CREATE EXTERNAL TABLE data2(value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://mr3-hive-warehouse-dir/data2';
No rows affected (0.225 seconds)

0: jdbc:hive2://3.35.2.152:9852/> msck repair table data2;
No rows affected (1.633 seconds)

0: jdbc:hive2://3.35.2.152:9852/> CREATE EXTERNAL TABLE data10(value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://mr3-hive-warehouse-dir/data10';
No rows affected (0.252 seconds)

0: jdbc:hive2://3.35.2.152:9852/> msck repair table data10;
No rows affected (9.075 seconds)

0: jdbc:hive2://3.35.2.152:9852/> CREATE EXTERNAL TABLE data20(value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://mr3-hive-warehouse-dir/data20';
No rows affected (0.234 seconds)

0: jdbc:hive2://3.35.2.152:9852/> msck repair table data20;
No rows affected (17.282 seconds)
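The reason the fix is complete can be pictured with a local-filesystem analogy (an illustration only, not Hive's code): an isDir-style check is a single metadata lookup on one path, comparable to one S3 HEAD request, rather than a recursive listing of everything underneath it.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class IsDirDemo {
    // Analogous to Warehouse.isDir(): one metadata lookup on the path itself,
    // not a recursive walk of its contents. Cost is independent of how many
    // files live under the directory.
    static boolean isDir(Path p) {
        return Files.isDirectory(p);  // single stat-style call
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("isdir-demo");
        // existing directory -> true; missing child path -> false
        System.out.println(isDir(dir) + " " + isDir(dir.resolve("missing")));
    }
}
```

This is why the create statements above finish in a fraction of a second regardless of table size, while msck repair (which really does enumerate the partitions) still scales with the number of files.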

Cheers,

--- Sungwoo 
