DataFlow failed with return code 1 with Airflow DataflowHook.start_python_dataflow


Srikanth Gudimalla

Jun 8, 2018, 10:56:26 AM
to cloud-composer-discuss
I am getting the error below when I run the following code. I am trying to transform gVCF/VCF files in Google Cloud Storage to BigQuery using the gcp-variant-transforms API.

[2018-06-06 16:46:42,589] {models.py:1428} INFO - Executing on 2018-06-06 21:46:34.252526
[2018-06-06 16:46:42,589] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run GcsToBigQuery gcsToBigquery_ID 2018-06-06T21:46:34.252526 --job_id 168 --raw -sd DAGS_FOLDER/GcsToBigQuery.py']
[2018-06-06 16:46:43,204] {base_task_runner.py:98} INFO - Subtask: [2018-06-06 16:46:43,202] {__init__.py:45} INFO - Using executor SequentialExecutor
[2018-06-06 16:46:43,284] {base_task_runner.py:98} INFO - Subtask: [2018-06-06 16:46:43,283] {models.py:189} INFO - Filling up the DagBag from /apps/airflow/dags/GcsToBigQuery.py
[2018-06-06 16:46:43,853] {base_task_runner.py:98} INFO - Subtask: [2018-06-06 16:46:43,852] {gcp_dataflow_hook.py:111} INFO - Start waiting for DataFlow process to complete.
[2018-06-06 16:46:46,931] {base_task_runner.py:98} INFO - Subtask: [2018-06-06 16:46:46,930] {GcsToBigQuery.py:48} ERROR - Status : FAIL : gcsToBigquery: Not able to run: DataFlow failed with return code 1
[2018-06-06 16:46:46,931] {base_task_runner.py:98} INFO - Subtask: [2018-06-06 16:46:46,930] {python_operator.py:90} INFO - Done. Returned value was: None

Here is my code:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook
from airflow.operators.python_operator import PythonOperator
import logging

default_args = {
    'owner': 'My Name',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 6),
    'email': ['MY Email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('GcsToBigQuery', default_args=default_args,
          description='To move GVCF/VCF files from Google Cloud Storage to Big Query',
          schedule_interval='@once',
          start_date=datetime(2018, 6, 6))

dataflow_py_file = 'gcp_variant_transforms.vcf_to_bq'
PY_OPTIONS = ['-m']

DATAFLOW_OPTIONS_PY = {
    "project": "project-Name",
    "input_pattern": "gs://test-gvcf/1000-genomes.vcf",
    "output_table": "project-Name:1000genomicsID.1000_genomesSamp",
    "staging_location": "gs://test-gvcf/vcftobq/staging",
    "temp_location": "gs://test-gvcf/vcftobq/temp",
    "job_name": "dataflowstarter25",
    # "setup_file": "./setup.py",
    "runner": "DataflowRunner"
}


def gcsToBigquery():
    try:
        dataflowHook = DataFlowHook(gcp_conn_id='google_cloud_platform_id')
        dataflowHook.start_python_dataflow(task_id='dataflowStarter2_ID',
                                           variables=DATAFLOW_OPTIONS_PY,
                                           dataflow=dataflow_py_file,
                                           py_options=PY_OPTIONS)
    except Exception as e:
        logging.error("Status : FAIL : gcsToBigquery: Not able to run: " + str(e))

gcsToBigquery_task = PythonOperator(task_id='gcsToBigquery_ID',
                                    python_callable=gcsToBigquery,
                                    dag=dag)

Feng Lu

Jun 11, 2018, 4:44:54 AM
to srikan...@gmail.com, cloud-composer-discuss
Hi Srikanth,

Could you try to use a standard operator like DataflowPythonOperator to launch your job?

Feng 


Srikanth Gudimalla

Jun 11, 2018, 10:59:09 AM
to cloud-composer-discuss
Thanks Feng Lu!

It is working now.

Srikanth Gudimalla

Jun 11, 2018, 12:58:30 PM
to cloud-composer-discuss
FYI:

from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

gcsToBigquery_task = DataFlowPythonOperator(
    task_id='gcstobigquery',
    py_file='gcp_variant_transforms.vcf_to_bq',
    py_options=['-v', '-m'],
    dataflow_default_options={
        "input_pattern": "gs://testgvcf/1000-genomes.vcf",
        "output_table": "project-poc:genomicsdataset.genomes3",
        "project": "project-poc",
        "staging_location": "gs://testgvcf/staging",
        "temp_location": "gs://testgvcf/temp",
        "job_name": "dataflowstarter3",
        "setup_file": "/home/airflow/gcs/data/gcp-variant-transforms/setup.py"},
    gcp_conn_id='google_cloud_platform_conn',
    dag=dag)

Before running the above program, install the gcp-variant-transforms package in your environment.

I also made it work with a direct python command and the DataFlowHook.start_python_dataflow method.

The original failure was mainly because gcp_variant_transforms was not installed in the cloud environment.
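Since the root cause was a missing package, one quick way to check whether a module is actually importable in the Python environment Airflow is running in (module name as used above; a sketch, not part of the original post):

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# If this prints False on the Composer/Airflow workers, launching
# gcp_variant_transforms.vcf_to_bq will fail the same way as above.
print(is_installed("gcp_variant_transforms"))
```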

Hope it helps! Thanks!