Initialization action in Serverless Dataproc job


Ankit Mukhija

Jul 18, 2022, 12:56:22 PM
to Google Cloud Dataproc Discussions
Hello everyone,

I am trying to run an initialization action on Serverless Dataproc. Since there is no direct provision for it, I am trying to use a custom Docker image to run cloud_sql_proxy.sh (a shell script that configures a Cloud SQL metastore via the Cloud SQL Proxy), but somehow it's not getting executed. Has anyone tried something similar or have any experience with this?



bla...@google.com

Jul 19, 2022, 9:10:07 AM
to Google Cloud Dataproc Discussions
Hi Ankit --

The Hive Metastore Service, or HMS, is not bundled with Dataproc Serverless. If you can, create and use a Dataproc Metastore Service (DPMS) instance, as it includes HMS and provides a Thrift endpoint. You can import your old metastore into DPMS. If your existing metastore is in Cloud SQL and cannot be migrated to DPMS, you're going to want to create a 1-node Dataproc 2.x cluster to act as a Thrift proxy for Serverless Spark jobs.
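
Roughly, that proxy cluster can be stood up with the public cloud-sql-proxy initialization action. A minimal sketch, assuming placeholder names for the cluster, project, region and Cloud SQL instance (adjust to your setup):

    # Single-node cluster whose HMS is backed by the existing Cloud SQL metastore.
    # The cloud-sql-proxy init action opens the tunnel to the instance named in the
    # hive-metastore-instance metadata key.
    gcloud dataproc clusters create metastore-proxy \
        --region=REGION \
        --single-node \
        --image-version=2.0 \
        --scopes=sql-admin \
        --initialization-actions=gs://goog-dataproc-initialization-actions-REGION/cloud-sql-proxy/cloud-sql-proxy.sh \
        --metadata=hive-metastore-instance=PROJECT:REGION:CLOUDSQL_INSTANCE

    # Serverless batches then point at the cluster's HMS Thrift endpoint
    # (9083 is the Hive Metastore default port), assuming they run on a
    # network from which the master node resolves:
    #   spark.hive.metastore.uris=thrift://metastore-proxy-m:9083

This is an outline rather than a tested recipe; the cloud-sql-proxy README in the GoogleCloudDataproc/initialization-actions repo documents the supported metadata keys.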

Blake

Ankit Mukhija

Jul 19, 2022, 11:34:50 AM
to Google Cloud Dataproc Discussions
Hi Blake, thanks for the response. I understood the point about the Hive Metastore, but what if we want to run a script before a Spark job, i.e. mimic the initialization action functionality of Dataproc?

Regards,
Ankit

bla...@google.com

Jul 20, 2022, 3:49:40 AM
to Google Cloud Dataproc Discussions
Hi Ankit --

Are you trying to use the Cloud SQL Proxy to connect to another database that's not a Hive Metastore? Can you expand a little bit more on what your init actions will do?

In the Dataproc Serverless world, you need to put all customizations into the container image. If you have pre- or post-processing steps, I recommend using Cloud Run or Cloud Dataproc on GCE. You can orchestrate all of this with Cloud Composer (managed Airflow).
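
To make the container route concrete, here is a minimal sketch of baking the Cloud SQL Auth Proxy into a custom image and handing it to a batch. The image name, repo, bucket and region are placeholders, the v1 proxy binary is assumed, and the user/UID step reflects my reading of the custom-containers guide:

    # Dockerfile -- Dataproc Serverless mounts Spark itself at runtime, so the image
    # only needs the extras you want available inside the Spark containers.
    FROM debian:11-slim
    ADD https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 /opt/cloud_sql_proxy
    RUN chmod +x /opt/cloud_sql_proxy
    # The custom-containers guide also expects a 'spark' user with UID/GID 1099
    # (assumption from the docs; double-check the current requirements).
    RUN groupadd -g 1099 spark && useradd -u 1099 -g 1099 -m spark
    USER spark

    # Build, push, and reference the image from a batch:
    docker build -t REGION-docker.pkg.dev/PROJECT/REPO/spark-custom:1.0 .
    docker push REGION-docker.pkg.dev/PROJECT/REPO/spark-custom:1.0

    gcloud dataproc batches submit pyspark gs://BUCKET/job.py \
        --region=REGION \
        --container-image=REGION-docker.pkg.dev/PROJECT/REPO/spark-custom:1.0

Keep in mind the container is not an init action: nothing runs automatically before the job; it only makes the binary available to your Spark code.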


Deepanshu Bhateja

Jan 29, 2024, 5:47:44 PM
to Google Cloud Dataproc Discussions

Hello Blake,

I hope this message finds you well. I find myself in a similar situation, where I need to connect to an external Hive Metastore (Cloud SQL) from Serverless Spark jobs. I'm writing to ask for your help in understanding how to use a "1-node Dataproc 2.x cluster to act as a Thrift proxy for Serverless Spark jobs". Is there any documentation or resource that goes into the configuration options that need to be set when creating such a cluster?

Thanks for your time and assistance; I look forward to hearing from you soon.

Best Regards,
Deepanshu


bla...@google.com

Feb 5, 2024, 2:32:24 PM
to Google Cloud Dataproc Discussions
Hi Deepanshu --

Per the "Submit a Spark batch workload" docs, you can just specify the Hive Metastore config:

"gcloud dataproc batches submit \
    --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
    other args ..."
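
Filled in with made-up values, and pointed at either a DPMS endpoint or the master node of a proxy cluster like the one discussed earlier in the thread, that might look like:

    gcloud dataproc batches submit spark \
        --region=us-central1 \
        --jars=gs://my-bucket/my-job.jar \
        --class=org.example.MyJob \
        --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=thrift://metastore-proxy-m:9083,spark.hive.metastore.warehouse.dir=gs://my-bucket/hive-warehouse

The URI, jar, class and warehouse path are all placeholders, and the batch needs to run on a network from which that Thrift endpoint is reachable.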

Happy developing,
Blake 