Output DataPrep job to Postgres


dan slaman

Aug 22, 2018, 10:43:35 AM
to Google Cloud SQL discuss
I'm running jobs using DataPrep and DataFlow. I write out to BigQuery at the moment; my team plans to move from our Postgres DB to BigQuery once we outgrow the former.

However, DataPrep and DataFlow are far superior at producing our desired output. We are currently writing the output to CSV in a storage bucket rather than running code from our local machines that chunks the rows into Postgres.

My question (really a feature request) is: when will I be able to write DataPrep and DataFlow output directly into my Postgres DB? That would be ideal. Even when we move to BigQuery, our Postgres implementation still satisfies business cases on the cheap, so we won't move off it completely.

Thanks for reading


Jordan (Cloud Platform Support)

Aug 22, 2018, 3:55:34 PM
to Google Cloud SQL discuss
You can find examples on Stack Exchange of writing to a Postgres database via the Java JdbcIO transform. If you are not using Java, then it is recommended to post an official feature request with the Apache Beam team in their Issue Tracker.
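To give a rough idea, a JdbcIO write in a Beam pipeline looks roughly like the sketch below. The connection string, credentials, table name, and the KV<Integer, String> element type are placeholders made up for illustration, not something specific to your job; the input Create step just stands in for whatever transforms produce your rows.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class WriteToPostgres {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the output of your real transforms: two hard-coded rows.
    p.apply(Create.of(KV.of(1, "alpha"), KV.of(2, "beta")))
     .apply("WriteToPostgres",
        JdbcIO.<KV<Integer, String>>write()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver",
                        "jdbc:postgresql://<HOST>:5432/<DATABASE>") // placeholder connection string
                    .withUsername("<USER>")                          // placeholder credentials
                    .withPassword("<PASSWORD>"))
            .withStatement("INSERT INTO my_table (id, payload) VALUES (?, ?)") // placeholder table/columns
            .withPreparedStatementSetter((element, statement) -> {
              // Bind each KV element to the INSERT statement's parameters.
              statement.setInt(1, element.getKey());
              statement.setString(2, element.getValue());
            }));

    p.run().waitUntilFinish();
  }
}

This also needs the beam-sdks-java-io-jdbc module and the Postgres JDBC driver (org.postgresql:postgresql) on the classpath.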

As a workaround you can always use the TextIO transform to write the data to Google Cloud Storage in something like a .csv file. Then set up a trigger that runs a simple function in Google Cloud Functions that reads the file and writes it to your Postgres DB.
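The function side of that workaround could be something along these lines: a plain Java method that downloads the CSV object with the Cloud Storage client library and batch-inserts its rows over JDBC. Again, the bucket, table, connection details, and the two-column row format are assumptions for the sketch, and you would still wrap this in whatever Cloud Functions trigger interface you deploy with.

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvToPostgres {

  // Downloads a CSV object from Cloud Storage and batch-inserts its rows into Postgres.
  public static void loadCsv(String bucket, String objectName) throws Exception {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get(BlobId.of(bucket, objectName));
    String csv = new String(blob.getContent(), StandardCharsets.UTF_8);

    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://<HOST>:5432/<DATABASE>", "<USER>", "<PASSWORD>"); // placeholders
         PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO my_table (id, payload) VALUES (?, ?)")) {               // placeholder table
      for (String line : csv.split("\r?\n")) {
        if (line.isEmpty()) {
          continue;
        }
        String[] fields = line.split(",", 2); // naive split; real CSV data may need a proper parser
        stmt.setInt(1, Integer.parseInt(fields[0]));
        stmt.setString(2, fields[1]);
        stmt.addBatch();
      }
      stmt.executeBatch();
    }
  }
}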

dan slaman

Aug 27, 2018, 11:11:06 AM
to Google Cloud SQL discuss
Errrr, so if I kicked off a DataFlow job and it's been running for over 13 hours... should I just kill it? I did the same job with a smaller data set and it completed in 20 minutes; now with a larger data set it's taken about 14 hours and it's still not done.

Jordan (Cloud Platform Support)

Sep 2, 2018, 5:11:24 PM
to Google Cloud SQL discuss
It is recommended to look at the Stackdriver logs for your job to investigate why it is taking so long. Since you increased the data volume, this is most likely a memory issue where the workers performing the job do not have enough memory to handle the larger data set. If you see 'GC (Allocation Failure)' in your logs, this is a sign of insufficient memory.

- The solution is to either audit your code to ensure that you do not have a memory leak, or increase the machine type to one with more memory (or both). Also ensure that your work and data can be properly split so that Dataflow can scale up more workers to complete the job instead of tasking a single worker to do it all (which would be limited to that machine type's resources); see the sketch below.
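For reference, on a custom Dataflow/Beam job these resources can be set through pipeline options, roughly as in this sketch. The machine type and worker count are just example values, and a Dataprep-launched job does not expose these settings directly.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HighMemJob {
  public static void main(String[] args) {
    // Parse --project, --region, etc. from the command line, then override worker resources in code.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setWorkerMachineType("n1-highmem-4"); // example: workers with more memory
    options.setMaxNumWorkers(10);                 // example: let the service scale out

    Pipeline p = Pipeline.create(options);
    // ... attach the same transforms as the original job ...
    p.run();
  }
}

The same settings can also be passed as command-line flags (--workerMachineType, --maxNumWorkers) when launching the job.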

dan slaman

Sep 3, 2018, 6:38:27 PM
to Google Cloud SQL discuss
OK, some more info: I wrote the job from scratch in Apache Beam and it worked, so I agree DataPrep isn't setting the arguments correctly. How do I set the parameters in DataPrep the way I would in a Cloud DataFlow or Apache Beam job?

Jordan (Cloud Platform Support)

Sep 3, 2018, 8:53:13 PM
to Google Cloud SQL discuss
I believe Dataprep handles all of the Dataflow configuration for you. I therefore highly recommend that you report this issue in the Public Issue Tracker. Be sure to include the job ID of the stuck job and the changes you made in your custom Beam job that resulted in it completing successfully. The Dataprep team will then be able to improve their system with your feedback.

- All further communications about this issue should continue in your Public Issue Tracker report with the Dataprep team.  