INSERT Issue with hive-bigquery-storage-handler


Vishal Verma

Feb 15, 2022, 10:33:41 PM
to Google Cloud Dataproc Discussions
We have a requirement from one of our clients to set up Hive/Spark on a VM and connect them to BigQuery. They don't want to use Dataproc as a solution for this.
We set up a VM with Hadoop/Hive and configured it with the GCS and BigQuery connectors. We took this repository (https://github.com/GoogleCloudDataproc/hive-bigquery-storage-handler) as a reference and used it for our Hive-BigQuery connectivity. We are able to fetch records in the Hive shell from the underlying BQ storage, but somehow INSERTs are not working: Hive shows us a Success message, but no records are actually inserted. I just followed these two steps.
i) Created table in Hive Shell:
CREATE TABLE bq_test (name string, age bigint)  
 STORED BY
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
 TBLPROPERTIES (
 'bq.dataset'='poc_hive_bq',
 'bq.table'='hive_bq_tbl',
 'mapred.bq.project.id'='fk-batch-pocs',
 'mapred.bq.temp.gcs.path'='gs://poc_hive_bq/tmp',
 'mapred.bq.gcs.bucket'='poc_hive_bq'
 );
ii) Tried to insert record:
INSERT INTO bq_test VALUES ('Vishal', 33);

Let me know if you see any issue with the syntax we are using here. I have also attached an image of the output we get after triggering the insert statement. After running the insert query it gives me a Success message; please see the attached image. The upper part is the insert into BQ, which does not insert any record, and the lower part is an insert into a Hive native table, which works fine. The only difference I see between the two is the highlighted part.

Note: We have tried this with an external table as well, and tested it with both the Tez and MR execution engines.
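As a sanity check outside the Hive shell, the bq CLI should show whether anything actually landed in BigQuery (dataset and table names as in the DDL above):

bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM poc_hive_bq.hive_bq_tbl'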

Please guide as necessary.
(Attachment: hive_output.PNG)

Mich Talebzadeh

Feb 16, 2022, 4:22:28 AM
to Google Cloud Dataproc Discussions
You can deploy a single-node Dataproc cluster for this purpose. Anyway, what versions of Hive, Spark, etc. have you installed on that VM?
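For instance, a single-node cluster can be created along these lines (the cluster name and region below are placeholders):

gcloud dataproc clusters create hive-bq-poc \
    --single-node \
    --region=us-central1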

HTH

Mich Talebzadeh

Feb 16, 2022, 5:23:24 AM
to Google Cloud Dataproc Discussions
BTW, what is the role of Spark here?

Your client wants to access BigQuery data in Hive, but Hive is just another storage layer, much like BigQuery. If you want to process BigQuery data and store it in Hive, you can use the efficient Spark BigQuery connector to read the data, do your ETL, and write from Spark to Hive, as sketched below.
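A minimal sketch of that route, assuming the spark-bigquery connector is on the classpath (the table names and connector version below are placeholders, not from this thread):

# launch with e.g.:
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2 etl.py
from pyspark.sql import SparkSession

# Hive support is needed so saveAsTable writes through the Hive metastore.
spark = (SparkSession.builder
         .appName("bq-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read the BigQuery table through the connector.
df = (spark.read.format("bigquery")
      .option("table", "my_project.poc_hive_bq.hive_bq_tbl")
      .load())

# ... your ETL transformations here ...

# Write the result into a Hive table.
df.write.mode("overwrite").saveAsTable("default.bq_test_copy")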

Vishal Verma

Feb 21, 2022, 7:52:04 AM
to Google Cloud Dataproc Discussions
Actually, they want to set up BigQuery connectivity with both Hive and Spark. I understand that the Spark BQ connector is more efficient and that most things work with it, but I need to implement Hive-BQ connectivity as well. So, are you able to insert records from Hive into a BQ table? Also, this Hive-BQ storage handler internally uses the Avro file format. Can we somehow use other file formats as well? Let me know if you have any inputs on these points.

Mich Talebzadeh

Feb 21, 2022, 10:52:25 AM
to Google Cloud Dataproc Discussions
Where is your Hive: in the cloud or on-premise?

Are you going to run your Spark on Dataproc or on a similar VM in GCP?

HTH

Mich Talebzadeh

Feb 21, 2022, 2:08:32 PM
to Google Cloud Dataproc Discussions
Basically, if you want to export data from Hive and import it into BigQuery, you can use the AVRO, Parquet, or ORC format.

1) Create a table in Hive as AVRO or Parquet from the source table:
CREATE TABLE ll_18201960_AVRO STORED AS AVRO AS select * from ll_18201960;
CREATE TABLE ll_18201960_parquet STORED AS PARQUET AS select * from ll_18201960;

2) Extract these two tables to a directory

hdfs dfs -get /user/hive/warehouse/elayer.db/ll_18201960_avro .
hdfs dfs -get /user/hive/warehouse/elayer.db/ll_18201960_parquet .

3) Push them into storage buckets in GCP

gsutil cp -r ll_18201960_avro gs://accountsbucket/accounts/ll_18201960_avro
gsutil cp -r ll_18201960_parquet gs://accountsbucket/accounts/ll_18201960_parquet

4) Load them into BigQuery from GCP bucket

bq load --autodetect --replace=true --source_format=AVRO accounts.ll_18201960_avro "gs://accountsbucket/accounts/ll_18201960_avro/*"
bq load --autodetect --replace=true --source_format=PARQUET accounts.ll_18201960_parquet "gs://accountsbucket/accounts/ll_18201960_parquet/*"
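To sanity-check the load, bq show should report the row count in the table metadata:

bq show --format=prettyjson accounts.ll_18201960_avro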


That is one way.

The other way is to use Spark on-premise to read from the Hive tables and load them into a BigQuery dataset, as sketched below.
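A minimal sketch of that direction, again assuming the spark-bigquery connector is available (the project, dataset, and bucket names below are placeholders); the connector's indirect write path needs a temporary GCS bucket:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-to-bq")
         .enableHiveSupport()
         .getOrCreate())

# Read the on-premise Hive table.
df = spark.table("elayer.ll_18201960")

# The connector stages files in the temporary GCS bucket and then
# loads them into the target BigQuery table.
(df.write.format("bigquery")
 .option("table", "my_project.accounts.ll_18201960")
 .option("temporaryGcsBucket", "accountsbucket")
 .mode("overwrite")
 .save())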

Vishal Verma

Mar 3, 2022, 9:22:43 AM
to Google Cloud Dataproc Discussions
All I wanted to understand is this: when I create a table with the syntax below, I am not able to insert any records into it afterwards from the Hive shell.

CREATE EXTERNAL TABLE bq_test (id BIGINT, name STRING, dept STRING, loc STRING)  

 STORED BY
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
 TBLPROPERTIES (
 'bq.dataset'='poc_hive_bq',
 'bq.table'='bq_native_tbl',
 'mapred.bq.project.id'='<some_project_id>',
 'mapred.bq.temp.gcs.path'='gs://poc_hive_bq/tmp',
 'mapred.bq.gcs.bucket'='poc_hive_bq'
 );

Inserts like this are not working: INSERT INTO bq_test VALUES(4, 'DDD', 'DDD', 'DDD');
Note: I am not getting any error message; the records are simply not added to the table.


Do you have any experience with this connector? Please let me know.

Regards,
Vishal

Mich Talebzadeh

Mar 4, 2022, 12:48:43 PM
to Google Cloud Dataproc Discussions
In Hive, can you run:

desc formatted bq_test

HTH

Mich Talebzadeh

Mar 4, 2022, 3:41:25 PM
to Google Cloud Dataproc Discussions
OK,

I built that jar file, hive-bigquery-storage-handler-1.0-shaded.jar. I have a table in GBQ that is populated from Spark on Kubernetes, and I created a Hive external table on-premise (I have all the Google SDK etc. installed).
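For anyone following along, building the shaded jar from the repo is roughly this, assuming the repo's standard Maven build (not verified against the current README):

git clone https://github.com/GoogleCloudDataproc/hive-bigquery-storage-handler.git
cd hive-bigquery-storage-handler
mvn clean package -DskipTests
# the shaded jar should end up under target/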

use test;
DROP TABLE IF EXISTS test.randomdatabq;
CREATE EXTERNAL TABLE test.randomdatabq(
       ID BIGINT
     , CLUSTERED DOUBLE
     , SCATTERED DOUBLE
     , RANDOMISED DOUBLE
     , RANDOM_STRING STRING
     , SMALL_VC STRING
     , PADDING  STRING
     , op_type STRING
     , op_time TIMESTAMP
    )

STORED BY
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
 TBLPROPERTIES (
 'bq.dataset'='test',
 'bq.table'='randomdata',
 'mapred.bq.project.id'='xxx',
 'mapred.bq.temp.gcs.path'='gs://tmp_storage_bucket/tmp',
 'mapred.bq.gcs.bucket'='etcbucket/hive/test'
)
;
DESCRIBE FORMATTED test.randomdatabq;


OK, I can pick up data from the GBQ table:

0: jdbc:hive2://rhes75:10099/default> SELECT id,clustered, scattered, randomised, random_string, small_vc, op_type, from_unixtime(cast(cast(op_time as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss') FROM test.randomdatabq LIMIT 2
. . . . . . . . . . . . . . . . . . > +-----+----------------------+------------+-------------+----------------------------------------------------+----------------------------------------------------+----------+----------------------+
| id  |      clustered       | scattered  | randomised  |                   random_string                    |                      small_vc                      | op_type  |         _c7          |
+-----+----------------------+------------+-------------+----------------------------------------------------+----------------------------------------------------+----------+----------------------+
| 2   | 0.10000000149011612  | 1.0        | 0.0         | ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR |                                                2   | 1        | 2022-03-04 07:38:58  |
| 1   | 0.0                  | 0.0        | 2.0         | KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ |                                                 1  | 1        | 2022-03-04 07:38:58  |
+-----+----------------------+------------+-------------+----------------------------------------------------+----------------------------------------------------+----------+----------------------+
2 rows selected (3.866 seconds)

This works. However, I cannot insert into GBQ from Hive itself. I get this error:

0: jdbc:hive2://rhes75:10099/default> insert into test.randomdatabq values(11,0.10000000149011612, 1.0, 0.0, 'ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR', 2, 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',1,'2022-03-04 07:38:58')
. . . . . . . . . . . . . . . . . . > Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)


Will have to investigate it further.

BTW, GBQ table columns need to be all lowercase.

HTH

Mich Talebzadeh

Mar 7, 2022, 1:40:41 PM
to Google Cloud Dataproc Discussions
Hi,

After adding the correct jar file to Hive as below

ADD JAR <LOCATION>/hive-bigquery-storage-handler-1.0-shaded.jar;

and also adding the jar file gcs-connector-hadoop3-2.2.5-shaded.jar to HADOOP_CLASSPATH, as suggested by Igor, I can insert rows through the Hive table into the GBQ table as below.
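The classpath addition is along these lines (the jar location below is a placeholder):

export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/opt/jars/gcs-connector-hadoop3-2.2.5-shaded.jar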

DROP TABLE IF EXISTS test.testmebq;
CREATE EXTERNAL TABLE test.testmebq(
       ID BIGINT
    )
STORED BY
 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
 TBLPROPERTIES (
 'mapred.bq.temp.gcs.path'='gs://tmp_storage_bucket/tmp',
 'mapred.bq.gcs.bucket'='etcbucket/hive/test'
)
;
After creating a similar table, test.testme, in the GBQ dataset with the id column in lowercase, the insert from Hive goes through with no error!

    > insert  into test.testmebq values(11);
2022-03-07 18:10:17,159 INFO  [main] conf.HiveConf: Using the default value passed in for log id: 3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b
2022-03-07 18:10:17,406 WARN  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf
2022-03-07 18:10:17,406 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Query ID = hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf
Total jobs = 1
2022-03-07 18:10:17,406 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Total jobs = 1
2022-03-07 18:10:17,406 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
2022-03-07 18:10:17,420 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] avro.AvroSerDe: AvroSerde::initialize(): Preset value of avro.schema.literal == null
2022-03-07 18:10:17,420 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] avro.AvroSerDe: AvroSerde::initialize(): Preset value of avro.schema.literal == null
2022-03-07 18:10:17,420 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] avro.AvroSerDe: AvroSerde::initialize(): Preset value of avro.schema.literal == null
2022-03-07 18:10:17,466 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Starting task [Stage-1:DDL] in serial mode
2022-03-07 18:10:17,822 INFO  [LocalJobRunner Map Task Executor #0] exec.MapOperator: Initializing operator MAP[0]
2022-03-07 18:10:17,822 INFO  [LocalJobRunner Map Task Executor #0] mr.ExecMapper:
<MAP>Id =0
  <Children>null
  <\Children>
  <Parent><\Parent>
<\MAP>
2022-03-07 18:10:18,742 WARN  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2022-03-07 18:10:18,742 Stage-2 map = 0%,  reduce = 0%
2022-03-07 18:10:18,742 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] exec.Task: 2022-03-07 18:10:18,742 Stage-2 map = 0%,  reduce = 0%
2022-03-07 18:10:19,187 INFO  [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: RECORDS_OUT_1_test.testmebq:1, RECORDS_OUT_INTERMEDIATE:0, RECORDS_OUT_OPERATOR_FS_5:1,
2022-03-07 18:10:19,187 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner:
2022-03-07 18:10:19,187 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task: Task:attempt_local378396178_0007_m_000000_0 is done. And is in the process of committing
2022-03-07 18:10:19,190 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner: map
2022-03-07 18:10:19,190 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task: Task 'attempt_local378396178_0007_m_000000_0' done.
2022-03-07 18:10:19,190 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner: Finishing task: attempt_local378396178_0007_m_000000_0
2022-03-07 18:10:19,190 INFO  [Thread-1079] mapred.LocalJobRunner: map task executor complete.
2022-03-07 18:10:19,745 Stage-2 map = 100%,  reduce = 0%
2022-03-07 18:10:19,745 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] exec.Task: 2022-03-07 18:10:19,745 Stage-2 map = 100%,  reduce = 0%
Ended Job = job_local378396178_0007
2022-03-07 18:10:19,746 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] exec.Task: Ended Job = job_local378396178_0007
MapReduce Jobs Launched:
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: MapReduce Jobs Launched:
Stage-Stage-2:  HDFS Read: 19312718 HDFS Write: 0 SUCCESS
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Stage-Stage-2:  HDFS Read: 19312718 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Total MapReduce CPU Time Spent: 0 msec
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: Completed executing command(queryId=hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf); Time taken: 2.341 seconds
OK
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] ql.Driver: OK
2022-03-07 18:10:19,747 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] lockmgr.DbTxnManager: Stopped heartbeat for query: hduser_20220307181017_90752d3e-bb82-4f5a-94db-49afdf96cabf
Time taken: 2.607 seconds
2022-03-07 18:10:19,774 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] CliDriver: Time taken: 2.607 seconds
2022-03-07 18:10:19,775 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] conf.HiveConf: Using the default value passed in for log id: 3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b
2022-03-07 18:10:19,775 INFO  [3bf1fd70-21ae-4648-8f4f-bcd12eb0d63b main] session.SessionState: Resetting thread name to  main

However, the inserted row shows up neither in the Hive table nor in the GBQ table:

0: jdbc:hive2://rhes75:10099/default> select * from testmebq
. . . . . . . . . . . . . . . . . . > +--------------+
| testmebq.id  |
+--------------+
+--------------+
No rows selected (2.512 seconds)


and neither in the GBQ table:

SELECT * FROM test.testme
Query complete (0.3 sec elapsed, 0 B processed)
This query returned no results.


So in short: there is no error thrown, but no rows are added!

Let me see what the likely cause is.


HTH

Mich Talebzadeh

Mar 8, 2022, 5:18:58 AM
to Google Cloud Dataproc Discussions
Hi Vishal,

Can you run Hive in debug mode as below

${HIVE_HOME}/bin/hive --hiveconf hive.root.logger=DEBUG,console

and try the insert again?

Mich Talebzadeh

Mar 8, 2022, 10:26:06 AM
to Google Cloud Dataproc Discussions
Also, I am surprised that the data does not end up in the local Hive table. Is there an issue with hive-bigquery-storage-handler-1.0-shaded.jar, or a version compatibility problem with Hive 3.1.1? You would have thought that a simple insert into a Hive table would at least be OK, regardless of whether the row is added to the BigQuery table or not.

Any ideas?

Mich Talebzadeh

Mar 9, 2022, 5:58:40 AM
to Google Cloud Dataproc Discussions
I believe that with MR as the execution engine, the issue is that the OutputCommitter is somehow not being called, probably something to do with the jar, as everything goes through OK but produces no result.

Mich Talebzadeh

Mar 11, 2022, 4:05:40 AM
to Google Cloud Dataproc Discussions
I don't think the project https://github.com/GoogleCloudDataproc/hive-bigquery-storage-handler is maintained any more. Can someone verify that, please?

Mich 
