Processing multiple files from a folder in a bucket


Mahendra Navarange

Sep 2, 2022, 10:11:51 AM
to cdap...@googlegroups.com, Emilian Damian
Hi,

I have a question about how to process multiple files from a GCS bucket.

Each file contains a view name and a list of columns; there is one JSON file per view.
I need to create a view in BigQuery using the view definition provided in each file.

I can use BigQuery Execute along with Argument Setter to build a single view. However, I can't find a way to iterate the pipeline over all of the view files.

Has anyone faced this issue? How do you process multiple files using a single pipeline?

With regards,
Mahendra Navarange
AppsBroker Ltd

Albert Shau

Sep 6, 2022, 6:20:29 PM
to CDAP User
Hi Mahendra,

The GCS source will read the contents of all files under the path you specify. By default it only looks at the files directly in that directory, but you can configure it to read recursively and to use a regex to filter which files are read. You could write a custom sink that takes the data and creates the views, or write an action plugin that simply lists the objects at the GCS path, reads them, and creates the views. However, this isn't really the type of use case that CDAP is trying to solve, as there will be a lot of overhead associated with running this workload on Hadoop. Pipelines are focused on reading large amounts of data in parallel, rather than on reading small messages and performing an action.
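
As a rough illustration of that list-read-create approach (as a standalone script, not an actual CDAP plugin), here is a minimal Python sketch using the google-cloud-storage and google-cloud-bigquery client libraries. The bucket name, prefix, and JSON field names (view_name, columns, source_table) are assumptions about what your files might look like, not something CDAP defines:

    # Minimal sketch: list the view-definition files under a GCS prefix,
    # read each one, and create the corresponding BigQuery view.
    # Assumes each JSON file looks roughly like (hypothetical layout):
    #   {"view_name": "my_view", "columns": ["a", "b"], "source_table": "proj.ds.tbl"}
    import json

    from google.cloud import bigquery, storage

    storage_client = storage.Client()
    bq_client = bigquery.Client()

    # Hypothetical bucket and prefix; replace with your own.
    for blob in storage_client.list_blobs("my-bucket", prefix="view-definitions/"):
        if not blob.name.endswith(".json"):
            continue
        spec = json.loads(blob.download_as_text())
        column_list = ", ".join(spec["columns"])
        ddl = (
            f"CREATE OR REPLACE VIEW `my_dataset.{spec['view_name']}` AS "
            f"SELECT {column_list} FROM `{spec['source_table']}`"
        )
        bq_client.query(ddl).result()  # block until the DDL statement finishes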

This type of use case looks like a much better fit for Cloud Functions (https://cloud.google.com/functions/docs/calling/storage), where a function is triggered to run whenever a GCS object is created.
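
For instance, a 1st-gen background Cloud Function deployed on the google.storage.object.finalize event runs once per object created in the bucket and could create each view directly. A minimal sketch, reusing the same hypothetical JSON field names as above:

    # Sketch of a 1st-gen background Cloud Function (Python runtime), deployed
    # with: gcloud functions deploy create_view \
    #         --trigger-resource <bucket> \
    #         --trigger-event google.storage.object.finalize
    # It runs once per object created in the bucket.
    import json

    from google.cloud import bigquery, storage

    def create_view(event, context):
        """Create a BigQuery view from a newly uploaded view-definition file."""
        if not event["name"].endswith(".json"):
            return
        blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
        spec = json.loads(blob.download_as_text())
        column_list = ", ".join(spec["columns"])
        ddl = (
            f"CREATE OR REPLACE VIEW `my_dataset.{spec['view_name']}` AS "
            f"SELECT {column_list} FROM `{spec['source_table']}`"
        )
        bigquery.Client().query(ddl).result()  # wait for the view to be created

Unlike a batch pipeline, this handles each file as it arrives, with none of the Hadoop overhead mentioned above.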

Regards,
Albert