Error in ALTER TABLE ... CONCATENATE


David Engel

Feb 23, 2024, 3:07:48 PM
to MR3
I'm again evaluating various ways to compact data in Hive tables into
larger files. This, in conjunction with ALTER TABLE ... COMPUTE
STATISTICS, can dramatically speed up some types of queries. With the
latest Hive/MR3, I'm getting the error below when running ALTER TABLE
... CONCATENATE. I vaguely recall hitting this "this.reader is null"
error before with another query, but I don't remember how I fixed or
worked around it.
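For reference, the workflow being described can be sketched as follows. The table, partition, and size values are placeholders, and the statistics step uses Hive's ANALYZE TABLE syntax; this is a sketch of the general approach, not the exact commands from this thread:

```sql
-- Control the target size of merged ORC files (256 MB here; adjust as needed).
set mapreduce.input.fileinputformat.split.minsize=268435456;
set mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Merge the small ORC files in one partition into larger files.
alter table mytable partition (day='2024-02-15') concatenate;

-- Refresh basic and column statistics so the optimizer can use them.
analyze table mytable partition (day='2024-02-15') compute statistics;
analyze table mytable partition (day='2024-02-15') compute statistics for columns;
```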

David

set mapreduce.input.fileinputformat.split.minsize=268435456
set hive.exec.orc.default.block.size=268435456
alter table mytable partition (day='2024-02-15') concatenate
INFO : Compiling command(queryId=hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa): alter table mytable partition (day='2024-02-15') concatenate
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa); Time taken: 0.116 seconds
INFO : Executing command(queryId=hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa): alter table mytable partition (day='2024-02-15') concatenate
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Run MR3 instead of Tez
INFO : MR3Task.execute(): hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa:169
INFO : Starting MR3 Session...
INFO : Finished building DAG, now submitting: hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa:169
INFO : Status: Running (Executing on MR3 DAGAppMaster): hive_20240223193644_23064ca5-8105-40da-afd1-75f9260669fa:169
INFO : Status: Running

INFO : File Merge: -/-
INFO : File Merge: 0(+269)/269
INFO : File Merge: 2(+269)/269
Traceback (most recent call last):
File "/home/dengel/bin/run-hive-query", line 153, in <module>
run_hql(verbose, cursor, hql)
File "/home/dengel/bin/run-hive-query", line 122, in run_hql
raise Exception(
Exception: query returned abnormal status ERROR_STATE (TGetOperationStatusResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), operationState=5, sqlState='08S01', errorCode=2, errorMessage='Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.DDLTask. Terminating unsuccessfully: Vertex failed, vertex_4354_0000_167_00. File Merge 269 tasks 67012 milliseconds: Failed, Some(Task unsuccessful: File Merge, task_4354_0000_167_00_000002, java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:417)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
at com.datamonad.mr3.tez.ProcessorWrapper.run(TezProcessor.scala:63)
at com.datamonad.mr3.worker.LogicalIOProcessorRuntimeTask.$anonfun$run$1(RuntimeTask.scala:316)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:223)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:156)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:359)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:180)
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:74)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:214)
... 16 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:128)
... 18 more

java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.plan.MapWork.getPathToPartitionInfo()" because "this.mrwork" is null
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:417)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
at com.datamonad.mr3.tez.ProcessorWrapper.run(TezProcessor.scala:63)
at com.datamonad.mr3.worker.LogicalIOProcessorRuntimeTask.$anonfun$run$1(RuntimeTask.scala:316)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.plan.MapWork.getPathToPartitionInfo()" because "this.mrwork" is null
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:460)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:915)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:908)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:716)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:164)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:690)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:649)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:152)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.getMRInput(MergeFileRecordProcessor.java:256)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.init(MergeFileRecordProcessor.java:82)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:358)
... 14 more

)', taskStatus='[{"returnValue":2,"errorMsg":"org.apache.hadoop.hive.ql.metadata.HiveException: Terminating unsuccessfully: Vertex failed, vertex_4354_0000_167_00. File Merge 269 tasks 67012 milliseconds: Failed, Some(Task unsuccessful: File Merge, task_4354_0000_167_00_000002, java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:417)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
at com.datamonad.mr3.tez.ProcessorWrapper.run(TezProcessor.scala:63)
at com.datamonad.mr3.worker.LogicalIOProcessorRuntimeTask.$anonfun$run$1(RuntimeTask.scala:316)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:223)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:156)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:359)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:180)
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:74)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:214)
... 16 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.io.orc.Reader.getObjectInspector()" because "this.reader" is null
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:128)
... 18 more

java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.plan.MapWork.getPathToPartitionInfo()" because "this.mrwork" is null
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:417)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
at com.datamonad.mr3.tez.ProcessorWrapper.run(TezProcessor.scala:63)
at com.datamonad.mr3.worker.LogicalIOProcessorRuntimeTask.$anonfun$run$1(RuntimeTask.scala:316)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hive.ql.plan.MapWork.getPathToPartitionInfo()" because "this.mrwork" is null
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:460)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:915)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:908)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:716)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:164)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:690)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:649)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:152)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.getMRInput(MergeFileRecordProcessor.java:256)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.init(MergeFileRecordProcessor.java:82)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:358)
... 14 more

)","beginTime":1708717004780,"endTime":1708717071851,"taskId":"Stage-0","taskState":"FINISHED","taskType":"DDL","name":"DDL","elapsedTime":67071}]', operationStarted=1708717004647, operationCompleted=1708717071880, hasResultSet=False, progressUpdateResponse=TProgressUpdateResp(headerNames=['VERTICES', 'MODE', 'STATUS', 'TOTAL', 'COMPLETED', 'RUNNING', 'PENDING', 'FAILED', 'KILLED'], rows=[['File Merge ', 'container', 'Failed', '269', '2', '266', '1', '3', '0']], progressedPercentage=0.0074349441565573215, status=1, footerSummary='VERTICES: 00/01', startTime=1708717004821)))

--
David Engel
da...@istwok.net

Sungwoo Park

Feb 25, 2024, 9:00:40 AM
to MR3
We discussed the concatenate command at length in this previous thread:

https://groups.google.com/g/hive-mr3/c/6dY2FdeNijs/m/FiBx257qAgAJ

Here are some of my own notes:

1. ALTER TABLE ... CONCATENATE is implemented as a major compaction.

To change the size of the resulting ORC files, update both mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.

Examples:

create table inventory_copy2 ( inv_item_sk          bigint, inv_warehouse_sk    bigint, inv_quantity_on_hand  int, inv_date_sk bigint) stored as orc TBLPROPERTIES('transactional'='false', 'transactional_properties'='default');

insert into table inventory_copy2 select * from inventory_copy;
--- 130 files, max 17M

set mapreduce.input.fileinputformat.split.minsize=248435456;
set mapreduce.input.fileinputformat.split.maxsize=248435456;
alter table inventory_copy2 concatenate;
--- 7 files, max 248M, min 66M

set mapreduce.input.fileinputformat.split.minsize=128435456;
set mapreduce.input.fileinputformat.split.maxsize=128435456;
alter table inventory_copy2 concatenate;
--- 12 files, max 135M, min 101M

set mapreduce.input.fileinputformat.split.minsize=68435456;
set mapreduce.input.fileinputformat.split.maxsize=68435456;
alter table inventory_copy2 concatenate;
--- 22 files, max 79M, min 57M

2. In some cases, you get an NPE when executing Concatenate. (For an example, see list_bucket_dml_8.q, although it uses the RC format, not the ORC format.)

From my own testing of Hive 4 on Tez (using the Hive master branch), Concatenate/Merge can only be performed on managed tables.
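One way to guard against this is to verify the table type before attempting the merge. This is a hypothetical check using standard Hive DDL, not something from the thread:

```sql
-- Check whether the table is managed before attempting CONCATENATE.
describe formatted inventory_copy2;
-- In the output, "Table Type:" should read MANAGED_TABLE. On an
-- EXTERNAL_TABLE, concatenate would be expected to fail (possibly
-- with an NPE, per the observation above).
```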

3. Even with hive.stats.autogather set to true in hive-site.xml, you still want to manually execute COMPUTE STATISTICS, because Hive 3 on MR3 is based on Hive 3.1.3, whose implementation of auto-gathered statistics is incomplete. (We tried to backport the related patches but did not make it.) Things might be easier when Hive 4 on MR3 is released; at that point, it may be irrelevant anyway if everything has migrated to Iceberg tables.
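The manual statistics step can be sketched as follows; the table name is a placeholder reused from the examples above:

```sql
-- Manually gather statistics after loading or compacting data,
-- since auto-gathering is incomplete in Hive 3.1.3.
-- (Add PARTITION (...) for partitioned tables.)
analyze table inventory_copy2 compute statistics;
analyze table inventory_copy2 compute statistics for columns;
```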

--- Sungwoo

David Engel

Feb 25, 2024, 6:19:31 PM
to Sungwoo Park, MR3
I know we talked about concatenate and compact a long time ago, but I
didn't remember any NPEs from that discussion. I've slept since then,
though, so I could have easily forgotten.

My reason for looking at it again is that I recently completed our
switch from MinIO to HDFS for storage. Some things are running much
more smoothly now, and I hoped this might too. I've gotten into the
habit of using hive.merge.tezfiles=true when inserting into some
tables. That helps pack things into larger files, but I wanted to give
explicit concatenate another try. I believe it was another use of
hive.merge.tezfiles where I saw this particular NPE.
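The hive.merge.tezfiles approach might look roughly like this; the thresholds and the staging table name are assumptions for illustration, not values from this thread:

```sql
-- Merge small output files automatically at the end of a Tez insert.
set hive.merge.tezfiles=true;
-- Assumed merge thresholds (256 MB); tune for your cluster.
set hive.merge.smallfiles.avgsize=268435456;
set hive.merge.size.per.task=268435456;

-- Hypothetical insert; staging_mytable is a placeholder source.
insert into table mytable partition (day='2024-02-15')
select * from staging_mytable where day='2024-02-15';
```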

Thanks for bringing up the compute statistics case. You saved me from
following up with questions related to that... except for one. :) That
is, what is the status of Hive 4? There's not much talk of it on the
Hive user list, and I don't follow the developer lists.

David


--
David Engel
da...@istwok.net

Sungwoo Park

Feb 27, 2024, 10:05:13 AM
to MR3
1.
For the NPE error, we don't have test cases other than the q tests included in Hive.
If you know how to reproduce the error, please let me know. At this point, it is not clear whether the error is an MR3 problem or a Hive problem.

2.
It seems to me that Hive 4 will be released within a few months. My overall impression is that it is being released not because the current build is stable with no major bugs, but because Hive users would like to try Hive 4 in its current form anyway. Maybe the Cloudera folks felt okay with whatever state Hive was in and didn't find it really necessary to release Hive 4.

We will eventually release Hive 4-MR3, but not right after the release of Hive 4. In our previous testing, Hive 4 had some performance issues, and it might take a while for it to stabilize.

--- Sungwoo

