Creating Hive external table with STORED BY 'io.delta.hive.DeltaStorageHandler' throws error


Mich Talebzadeh

Jun 12, 2021, 1:31:31 PM
to Delta Lake Users and Developers
Hi,

This is a simple example of trying to create a Hive external Delta Lake table.

Hive 3.1.1
Hadoop 3.1.1

I downloaded the JAR file delta-hive-assembly_2.12-0.2.0.jar and added it to an HDFS directory:

hdfs dfs -ls hdfs://rhes75:9000/jars/delta-hive-assembly_2.12-0.2.0.jar
-rw-r--r--   3 hduser supergroup   16450204 2021-06-12 16:11 hdfs://rhes75:9000/jars/delta-hive-assembly_2.12-0.2.0.jar

as per the instructions here:


Add Hive uber JAR

The second step is to upload the above uber JAR to the machine that runs Hive. Next, make the JAR accessible to Hive. There are several ways to do this, listed below. To verify that the JAR was properly added, run LIST JARS; in the Hive CLI.

  • in the Hive CLI, run ADD JAR <path-to-jar>;
which I do as follows:

0: jdbc:hive2://rhes75:10099/default> ADD JAR hdfs://rhes75:9000/jars/delta-hive-assembly_2.12-0.2.0.jar;
No rows affected (0.039 seconds)

plus the following

0: jdbc:hive2://rhes75:10099/default> SET hive.input.format=io.delta.hive.HiveInputFormat;
No rows affected (0.001 seconds)
0: jdbc:hive2://rhes75:10099/default> SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
No rows affected (0.002 seconds)

So I should be able to create an external Hive table

0: jdbc:hive2://rhes75:10099/default> CREATE EXTERNAL TABLE test.deltaTable(col1 INT, col2 STRING) STORED BY 'io.delta.hive.DeltaStorageHandler' LOCATION 'hdfs://rhes75:9000/tmp/deltaTable'
. . . . . . . . . . . . . . . . . . > Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/metastore/MetaStoreUtils (state=08S01,code=1)


For comparison, this is how I would normally create an external Hive table:

CREATE EXTERNAL TABLE IF NOT EXISTS test.something
(
          TransactionDate  date
  ...
)
COMMENT 'External csv files for ..'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://rhes75:9000/data/stg/accounts/nw/10124772/archive'
;


Does Hive understand this STORED BY 'io.delta.hive.DeltaStorageHandler' clause?

Thanks





Logan Boyd

Jun 13, 2021, 12:20:38 PM
to Delta Lake Users and Developers
Currently only Hive 2.x is supported. You can see this further down on the Connectors page.
I'm in the same boat: we use Cloudera CDP 7.1.6 with Hive 3.1. My current pattern is to maintain the Delta table via MERGE commands, then do a read from Delta and a write to ORC after the merge, and point my Hive table at the ORC files (roughly as sketched below).
It's the best we can do for now.
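A rough sketch of that pattern (the paths, table, and merge condition are illustrative, not our production code; updates_df is whatever DataFrame carries the new or changed rows, and spark is an active session with the Delta extensions enabled):

from delta.tables import DeltaTable

delta_path = "hdfs:///data/delta/my_table"    # where the Delta table lives (illustrative)
orc_path   = "hdfs:///data/orc/my_table"      # the LOCATION of the external Hive ORC table (illustrative)

# 1) maintain the Delta table with a MERGE
target = DeltaTable.forPath(spark, delta_path)
target.alias("t") \
    .merge(updates_df.alias("u"), "t.id = u.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# 2) read the merged Delta table back and rewrite it as ORC files for Hive
spark.read.format("delta").load(delta_path) \
    .write.orc(orc_path, mode="overwrite")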

Logan

Mich Talebzadeh

Jun 13, 2021, 2:51:48 PM
to Logan Boyd, Delta Lake Users and Developers
Thanks Logan

I took your suggestion and tried to put the show on the road so to speak. :)

First, these are my assumptions, as I have less than 24 hours of experience working with Delta so far.

Start by creating a Delta table from a DataFrame and storing it in an external location (in my case HDFS). The only thing I know about this Delta table is the location where it was created.
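For completeness, the Delta table itself was written in an earlier run from a DataFrame df of random data, along these lines:

hdfs_path = "hdfs://rhes75:9000/delta/randomDataDelta"
df.write.format("delta").mode("append").save(hdfs_path)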

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

appName = "deltatest"
builder = SparkSession.builder \
        .appName(appName) \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .enableHiveSupport()
spark = configure_spark_with_delta_pip(builder).getOrCreate()


hdfs_path = "hdfs://rhes75:9000/delta/randomDataDelta"
DB = "test"
tableName = "randomDataDelta"
fullyQualifiedTableName =  DB + "."+ tableName
# Read delta table I have just created
df2 = spark.read.format("delta").load(hdfs_path)
print("""Schema of delta table""")
df2.printSchema()
#df2.orderBy("id").show(10000,False)

In Spark, create an ORC table in a Hive database (mine is called test) if it does not yet exist:

sqltext  = ""
if (spark.sql("SHOW TABLES IN test like 'randomDataDelta'").count() == 1):
  rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
  #print (f"""number of rows from ORC table {fullyQualifiedTableName} is {rows}""")
else:
  print("\nTable test.randomDataDelta does not exist, creating table ")
  sqltext = """
  CREATE TABLE if not exists test.randomDataDelta(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(40)
    )
  STORED AS ORC
  TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="10000" )
  """
  spark.sql(sqltext)
df2.createOrReplaceTempView("tmp")
sqltext = """
INSERT OVERWRITE TABLE test.randomDataDelta
SELECT
          ID
        , CLUSTERED
        , SCATTERED
        , RANDOMISED
        , RANDOM_STRING
        , SMALL_VC
        , PADDING
FROM tmp
"""
spark.sql(sqltext)
spark.sql(f"""select MIN(id) AS minID, MAX(id) AS maxID from {fullyQualifiedTableName}""").show()


So it works fine and it is a workaround.

starting at ID =  171 ,ending on =  182
Schema of delta table
root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+-----+-----+
|minID|maxID|
+-----+-----+
|    1|  182|
+-----+-----+

I gather the concept of the Delta table is likely inherited (no pun intended) from the transactional ORC tables developed by Hortonworks around 2015, with DML and compaction support.


I wrote an article on the ORC file format and storage index in Hive on LinkedIn back in early 2016, and I recall there were issues reading from and writing to ORC tables in Spark a while back. It sounds like Spark's handling of ORC tables has since been resolved.


I attach the sample PySpark code that generates random data, stores it in a Delta table, reads it back from the HDFS directory where the Delta data resides, and finally writes the DataFrame to an ORC table in Hive.


HTH

Attachment: dynamic_ARRAY_generator_delta.py

Mich Talebzadeh

Jun 14, 2021, 2:09:19 PM
to Logan Boyd, Delta Lake Users and Developers
Hi Logan,

Did you create your ORC table in Spark with "transactional" = "true"?

  STORED AS ORC
  TBLPROPERTIES (
  "transactional" = "true",

Thanks


Logan Boyd

Jun 14, 2021, 2:29:23 PM
to Delta Lake Users and Developers
We create our hive tables manually with HQL DDL files.
We do NOT use Hive transactional tables, but I also don't specifically set transactional=false; that is just the default for CREATE EXTERNAL TABLE.
I think the issue for you may be that you are not using the EXTERNAL modifier when creating the table.
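If I remember right, on Hive 3 / CDP a managed ORC table can end up transactional by default, whereas CREATE EXTERNAL TABLE does not. One quick way to check what your Spark-created table actually ended up as (a sketch, run from your PySpark session):

# inspect the table type (MANAGED vs EXTERNAL) and its table properties
spark.sql("DESCRIBE FORMATTED test.randomDataDelta").show(100, truncate=False)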

-Logan

Mich Talebzadeh

Jun 14, 2021, 4:47:50 PM
to Logan Boyd, Delta Lake Users and Developers
OK thanks.

To clarify, we have a final Delta table created with Spark as below:

hdfs_path = "hdfs://rhes75:9000/delta/randomDataDelta"
df.write.format("delta").mode("append").save(hdfs_path)

Then I create an ORC external table in another HDFS location with the Hive CLI:

0: jdbc:hive2://rhes75:10099/default> CREATE EXTERNAL TABLE test.randomDataDelta(
. . . . . . . . . . . . . . . . . . >        ID INT
. . . . . . . . . . . . . . . . . . >      , CLUSTERED INT
. . . . . . . . . . . . . . . . . . >      , SCATTERED INT
. . . . . . . . . . . . . . . . . . >      , RANDOMISED INT
. . . . . . . . . . . . . . . . . . >      , RANDOM_STRING VARCHAR(50)
. . . . . . . . . . . . . . . . . . >      , SMALL_VC VARCHAR(50)
. . . . . . . . . . . . . . . . . . >      , PADDING  VARCHAR(40)
. . . . . . . . . . . . . . . . . . >     )
. . . . . . . . . . . . . . . . . . > COMMENT 'Testing delta table'
. . . . . . . . . . . . . . . . . . > STORED AS ORC
. . . . . . . . . . . . . . . . . . > LOCATION "hdfs://rhes75:9000/tmp/randomDataDelta"
. . . . . . . . . . . . . . . . . . > ;
No rows affected (0.267 seconds)

Note that the new ORC external table is created in a different HDFS location.

Then an INSERT OVERWRITE into the ORC table from Spark throws this error:

2021-06-14 21:32:46,592 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 171)
org.apache.orc.FileFormatException: Malformed ORC file hdfs://rhes75:9000/tmp/randomDataDelta/part-00000-880d7c58-1a6b-464a-aae0-b75531a9bf60-c000.snappy.parquet. Invalid postscript.

I did a Google search on this error ("Malformed ORC file ... Invalid postscript") and it points to the ORC external table being incompatible with what is written by Delta. That is my guess here. One of the answers says:

"The issue here is that you imported the sqoop data as --as-textfile but you created the table with ORC store format. Therefore Hive is throwing this error because the data is not in ORC format."

When I try reading it in Hive from the ORC table I get:

0: jdbc:hive2://rhes75:10099/default> select count(1) from test.randomDataDelta;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I (state=08S01,code=1)


Maybe Hadoop 3.1.1 (in our case) is not compatible with the ORC table. I reported this three years ago.

Logan Boyd

Jun 14, 2021, 10:11:26 PM
to Delta Lake Users and Developers
After you create your delta table you need to have a separate statement that reads the delta table and writes the orc files.
Something like
spark.read.format('delta').load(hdfs_path_for_delta_table).repartition(num_files).write.orc(hdfs_path_for_external_hive_table, mode='overwrite')
where num_files can be used to compact your data into a specific number of ORC files. You should target between 128 MB and 1 GB per file to avoid the many-small-files problem.
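If it helps, one rough way to pick num_files is to size it from the Delta table's footprint on HDFS (a sketch that goes through the Hadoop FileSystem API via the session's JVM gateway; 256 MB per output file is just a middle-of-the-road target, and the content summary also counts _delta_log and any older data file versions, so it over-estimates a little):

jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(hdfs_path_for_delta_table)
fs = path.getFileSystem(hadoop_conf)
size_bytes = fs.getContentSummary(path).getLength()
num_files = max(1, int(size_bytes // (256 * 1024 * 1024)))   # aim for roughly 256 MB per ORC file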

-Logan

Mich Talebzadeh

Jun 15, 2021, 5:50:50 AM
to Logan Boyd, Delta Lake Users and Developers
Thanks for the hint.

Simple test

hdfs_path_for_delta_table = "hdfs://rhes75:9000/delta/randomDataDelta"
hdfs_path_for_external_hive_table = "hdfs://rhes75:9000/tmp/deltaTable"
num_partitions = 10
spark.read.format("delta") \
      .load(hdfs_path_for_delta_table) \
      .repartition(num_partitions) \
      .write \
      .orc(hdfs_path_for_external_hive_table, mode="overwrite")

I am still getting this error:

2021-06-15 09:24:46,388 ERROR executor.Executor: Exception in task 2.0 in stage 13.0 (TID 178)

org.apache.orc.FileFormatException: Malformed ORC file hdfs://rhes75:9000/tmp/randomDataDelta/part-00000-880d7c58-1a6b-464a-aae0-b75531a9bf60-c000.snappy.parquet. Invalid postscript.
 
Logically one expects the Delta table to have ORC-like features, hence it should be possible to read the data and load it into an externally created ORC table in Hive.

The error I am getting implies some incompatibility issue, as reported here, and maybe that is the cause of it:

Reason:
The ORC format is columnar storage. You cannot import data into it directly from a local file. Only when the data source table is also stored in the ORC format can it be loaded directly. Otherwise, the above error will be reported.

Solution:
Either change the data source table to one stored in the ORC format, or create a new temporary table in the TEXTFILE format, first load the source file data into that table, and then insert the data into the ORC target table from the TEXTFILE table.

Even when I tried loading into the TEXTFILE table and populating the ORC table from the TEXTFILE table, it threw the same error.

I suspect there is an underlying issue there.

Cheers

Logan Boyd

Jun 15, 2021, 11:20:39 AM
to Delta Lake Users and Developers
The error message is because you are reading Parquet files with an ORC data reader. I'm not sure how you're doing that with the code you showed, though.
Can you do a simple test of reading from the Delta table and doing a .show() or a .count()?

-Logan

Mich Talebzadeh

Jun 15, 2021, 12:26:04 PM
to Logan Boyd, Delta Lake Users and Developers

No problem. See below

1) Create delta table first

df.write.format("delta").mode("overwrite").save(hdfs_path_for_delta_table)

2) Read that delta table

df2 = spark.read.format("delta").load(hdfs_path_for_delta_table)
df2.printSchema()
df2.orderBy("id").show(100,False)
rows = df2.count()
print(f"""No of rows is {rows}""")

The output

root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+---+-------------------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+
|ID |CLUSTERED          |SCATTERED|RANDOMISED|RANDOM_STRING                                     |SMALL_VC                                         |PADDING                                 |
+---+-------------------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+
|1  |0.0                |0.0      |2.0       |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|                                                1|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|2  |0.07142857142857142|1.0      |13.0      |dffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxG|                                               2 |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|3  |0.14285714285714285|2.0      |3.0       |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|                                              3  |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|4  |0.21428571428571427|3.0      |3.0       |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|                                             4   |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|5  |0.2857142857142857 |4.0      |9.0       |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|                                            5    |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|6  |0.35714285714285715|5.0      |12.0      |KfFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSg|                                           6     |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|7  |0.42857142857142855|6.0      |5.0       |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|                                          7      |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|8  |0.5                |7.0      |3.0       |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|                                         8       |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|9  |0.5714285714285714 |8.0      |7.0       |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|                                        9        |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|10 |0.6428571428571429 |9.0      |9.0       |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|                                       10        |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|11 |0.7142857142857143 |10.0     |7.0       |JXDCGLmlZGEONYlgCtjfIZSOcMzCPVNPkNaHedcmpMbXDuCLmH|                                      11         |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|12 |0.7857142857142857 |11.0     |7.0       |rQHQwjyaxErPZDSMoJaQNjCxkvndKmeGRvRzZfbdQGofBChQBi|                                     12          |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|13 |0.8571428571428571 |12.0     |4.0       |sRRZlPoQjoPUlieInVsbBiRMarjfYZZqCVBiYqwoFWJKBQxBOu|                                    13           |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|14 |0.9285714285714286 |13.0     |1.0       |NSWPHprVqsUeQCtDRzzXhqouwZqxZOOHjkJQQrkaPehMvbfrny|                                   14            |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
+---+-------------------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+

No of rows is 14

So Spark works OK with the delta table.

I can store data from Delta in a managed table stored as a text file in Hive.

a) Create a managed table stored as TEXTFILE in Hive:

use test;
DROP TABLE IF EXISTS test.randomDataText;
CREATE TABLE test.randomDataText(

       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(40)
    )
COMMENT 'Testing delta table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE;


b) Go back and populate that TEXTFILE table in Spark from the Delta table:

df2.createOrReplaceTempView("tmp")
sqltext = """
INSERT OVERWRITE TABLE test.randomDataText

SELECT
          ID
        , CLUSTERED
        , SCATTERED
        , RANDOMISED
        , RANDOM_STRING
        , SMALL_VC
        , PADDING
FROM tmp
"""
spark.sql(sqltext)
spark.sql("""SELECT COUNT(1) AS rows FROM test.randomDataText""").show()

+----+
|rows|
+----+
|  14|
+----+


So far OK.

The EXTERNAL ORC table in Hive is created as below

use test;
DROP TABLE IF EXISTS test.randomDataDelta;
CREATE EXTERNAL TABLE test.randomDataDelta(

       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(40)
    )
COMMENT 'Testing delta table'
STORED AS ORC
LOCATION "hdfs://rhes75:9000/tmp/randomDataDelta"
;

Now I do an INSERT/SELECT in Spark from the managed text table in Hive into the external ORC table:

# Load the data from textable into ORC external table
spark.sql("""INSERT INTO test.randomDataDelta SELECT * FROM test.randomDataText""")

That works fine, no apparent error

# Try to read that data

print("""\n Reading from ORC table just populated""")
try:
  spark.sql("""SELECT * from test.randomDataDelta order by ID""").show(20,False)
except Exception as e:
  print(f"""{e}, quitting""")
  sys.exit(1)

Output

Reading from ORC table just populated
2021-06-15 17:15:20,604 ERROR executor.Executor: Exception in task 0.0 in stage 27.0 (TID 390)
org.apache.orc.FileFormatException: Malformed ORC file hdfs://rhes75:9000/tmp/randomDataDelta/part-00000-6c7de582-742d-418d-8256-ed8da3349c21-c000.snappy.parquet. Invalid postscript.

Trying to read the ORC table in Hive:

0: jdbc:hive2://rhes75:10099/default> select * from test.randomDataDelta order by id;

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I (state=08S01,code=1)

I gather in my case it may be related to the Hadoop version 3.1.1 that I am using. It is also interesting that the file Spark is trying to read as ORC is a snappy.parquet file.
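Out of curiosity, listing what actually sits under the table LOCATION should show where that snappy.parquet file comes from (a quick sketch using the Hadoop FileSystem API from the PySpark session):

location = "hdfs://rhes75:9000/tmp/randomDataDelta"
path = spark._jvm.org.apache.hadoop.fs.Path(location)
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
for status in fs.listStatus(path):
    # if these turn out to be *.snappy.parquet files rather than ORC files,
    # the external ORC table is pointing at Parquet/Delta data
    print(status.getPath().getName())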

Logan Boyd

Jun 15, 2021, 12:32:12 PM
to Delta Lake Users and Developers
You're creating your Hive table with the location of your Delta Parquet data. That doesn't work. The Hive table needs to point to the folder with the ORC files.
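Something like this, as a sketch with your paths (the key point being that the ORC write target and the DDL LOCATION are the same folder, and that folder is not the Delta one):

orc_location = "hdfs://rhes75:9000/orc/randomDataDelta"   # must match the LOCATION in the CREATE EXTERNAL TABLE
spark.read.format("delta") \
    .load("hdfs://rhes75:9000/delta/randomDataDelta") \
    .write.orc(orc_location, mode="overwrite")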

-Logan

Mich Talebzadeh

Jun 15, 2021, 12:59:01 PM
to Logan Boyd, Delta Lake Users and Developers
Thanks Logan for spotting the mistake

1) Create ORC table in a new folder

use test;
DROP TABLE IF EXISTS test.randomDataDelta;
CREATE EXTERNAL TABLE test.randomDataDelta(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(40)
    )
COMMENT 'Testing delta table'
STORED AS ORC
LOCATION "hdfs://rhes75:9000/orc/randomDataDelta"
;

Then run the original Spark code

hdfs_path_for_delta_table = "hdfs://rhes75:9000/delta/randomDataDelta"
hdfs_path_for_external_hive_table = "hdfs://rhes75:9000/orc/deltaTable"
num_partitions = 10
df.write.format("delta").mode("overwrite").save(hdfs_path_for_delta_table)
spark.read.format("delta") \
      .load(hdfs_path_for_delta_table) \
      .repartition(num_partitions) \
      .write \
      .orc(hdfs_path_for_external_hive_table, mode="overwrite")
spark.sql(f"""SELECT * FROM {fullyQualifiedTableName} ORDER BY id""").show(20,False)

rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
print (f"""number of rows from ORC table {fullyQualifiedTableName} is {rows}""")

+---+---------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+
|id |clustered|scattered|randomised|random_string                                     |small_vc                                         |padding                                 |
+---+---------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+
|1  |0        |0        |2         |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|                                                1|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|2  |0        |1        |0         |ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR|                                               2 |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|3  |0        |2        |3         |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|                                              3  |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|4  |0        |3        |3         |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|                                             4   |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|5  |0        |4        |9         |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|                                            5    |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|6  |0        |5        |9         |fFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSgW|                                           6     |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|7  |0        |6        |5         |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|                                          7      |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|8  |0        |7        |3         |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|                                         8       |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|9  |0        |8        |7         |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|                                        9        |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|10 |0        |9        |9         |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|                                       10        |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
+---+---------+---------+----------+--------------------------------------------------+-------------------------------------------------+----------------------------------------+

number of rows from ORC table test.randomDataDelta is 10

Mich Talebzadeh

Jun 16, 2021, 3:27:16 AM
to Logan Boyd, Delta Lake Users and Developers
One useful option when writing to ORC on HDFS from Spark would be to compress the data to reduce storage space:

hdfs_path_for_delta_table = "hdfs://rhes75:9000/delta/deltaTable"
hdfs_path_for_external_hive_table = "hdfs://rhes75:9000/orc/randomDataDelta"
spark.read.format("delta") \
      .load(hdfs_path_for_delta_table) \
      .repartition(num_partitions) \
      .write \
      .option("orc.compress", "snappy") \
      .orc(hdfs_path_for_external_hive_table, mode="overwrite")
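As an aside, I believe the Spark ORC writer also accepts a plain compression option (and snappy is, if I remember correctly, already the default ORC codec in recent Spark releases), so the following should be equivalent:

spark.read.format("delta") \
      .load(hdfs_path_for_delta_table) \
      .repartition(num_partitions) \
      .write \
      .option("compression", "snappy") \
      .orc(hdfs_path_for_external_hive_table, mode="overwrite")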

Of course, as we cannot read data in this ORC external table from Hive, no real verification can be done until the error below is resolved:

0: jdbc:hive2://rhes75:10099/default> select * from test.randomDataDelta order by id;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I (state=08S01,code=1)


Artur Brandys

Dec 21, 2021, 10:15:46 AM
to Delta Lake Users and Developers
Hi.
We have exactly the same error.
We have Hive 3.1.0 and the latest JAR connector.
In the documentation, it says that this version supports Hive 2.x and 3.x.

Shixiong(Ryan) Zhu

Dec 21, 2021, 9:39:11 PM
to Artur Brandys, Delta Lake Users and Developers
Hey Artur,

0.2.0 doesn't support Hive 3 yet. You can check https://github.com/delta-io/connectors/tree/v0.2.0#frequently-asked-questions-faq, which has the right doc for version 0.2.0. We will release version 0.3.0 soon to support Hive 3. The doc currently is not clear, and we will update it to make it clearer.

Best Regards,

Ryan



Artur Brandys

Dec 22, 2021, 3:35:27 AM
to Delta Lake Users and Developers
OK, I understand, thank you for the clarification. So we will be waiting for the 0.3.0 release. :)