How to design a daily job for bulk inserts of data

Niklas Andersson

Sep 18, 2016, 10:37:19 AM
to Google App Engine
I have a bunch of files at Cloud Storage that I need to parse and insert into Cloud SQL once per day.
The operation takes about an hour on an n1-standard-2 instance, plus ~10 minutes to insert the data into Cloud SQL.

My idea for a design to solve this with GAE:
  1. Define a scheduled (cron) task in GAE that makes an HTTP request to a URL
  2. Create a URL handler in GAE that responds to that request; its main task would be to boot a GCE instance
  3. The GCE instance would be based on a template where it does a few things automatically at boot:
    1. Download the file from Cloud Storage
    2. Parses it
    3. Inserts the data into Cloud SQL
    4. Shuts down
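The boot-time part of step 3 could be sketched as a GCE startup script set in the instance template's metadata, so it runs automatically on every boot. Everything here (bucket, paths, `parse.py`, table and database names) is a hypothetical placeholder, not something from this thread:

```shell
#!/bin/bash
# Sketch of a startup script for the daily-ETL instance template.
# All names below are assumptions for illustration only.
set -euo pipefail

# 1. Download the day's file from Cloud Storage
gsutil cp gs://my-bucket/daily/input.csv /tmp/input.csv

# 2. Parse it into a load-ready CSV (parse.py is a hypothetical parser)
python /tmp/parse.py /tmp/input.csv > /tmp/rows.csv

# 3. Bulk-load into Cloud SQL (one LOAD DATA is far faster than
#    row-by-row INSERTs)
mysql --host="$CLOUDSQL_IP" --user=etl --password="$DB_PASS" mydb \
  -e "LOAD DATA LOCAL INFILE '/tmp/rows.csv' INTO TABLE daily_rows
      FIELDS TERMINATED BY ','"

# 4. Shut the instance down so it stops billing
sudo shutdown -h now
```

Because the script is instance metadata rather than deployed code, the GAE handler in step 2 only needs to start the instance; everything else happens on the VM.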

Would this be the best design to solve the problem?

Evan Jones

Sep 19, 2016, 10:59:37 AM
to Google App Engine
From my limited experience doing ETL-type tasks on App Engine, I'd suggest using a "backend instance" with basic scaling if you can (although these instances are probably slower than an n1-standard-2, and will certainly have much less memory). This avoids the complexity of managing the lifetime of a separate instance, and if you already have a bunch of App Engine standard stuff, it's simpler to manage. Backend instances can take a long time to process a cron request, or can start a background thread to do the ETL, so you can avoid the request timer: https://cloud.google.com/appengine/docs/java/an-overview-of-app-engine
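The background-thread variant could look something like this sketch. Plain `threading` is used here so it runs anywhere; on App Engine standard you would actually use `google.appengine.api.background_thread` on a manual- or basic-scaling module, and `etl_job` is a stand-in for the real download/parse/insert work:

```python
import threading

def etl_job(results):
    # Stand-in for the real work: download from Cloud Storage,
    # parse, insert into Cloud SQL.
    results.append("etl-done")

def handle_cron_request():
    # Kick off the long-running ETL and return to the cron request
    # immediately, so the request itself never hits the deadline.
    results = []
    worker = threading.Thread(target=etl_job, args=(results,))
    worker.start()
    return worker, results
```

The cron handler returns as soon as the thread is started; the instance stays alive (basic scaling keeps it up while work is in flight) and the thread finishes the job.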

If you really need to run this on a machine with more memory or custom code, we've used an App Engine flexible module that we leave running all the time. This does cost us some fraction of the instance time, but we figured the simplicity of not needing to deal with starting, stopping, and waiting for instance boot makes it worth it, and relative to our other costs, running a single small VM 24x7 is not significant. We used an AE flexible module so that App Engine manages the instance for us, and so deploying to it is extremely similar to our other app engine code.

Jim

Sep 19, 2016, 3:22:03 PM
to Google App Engine
I concur with Evan re: backend instances. I also suggest you make copious use of tasks. In your step #3, rather than inserting each row into Cloud SQL directly, I would drop an individual task onto a push queue for each row insert, and have another handler fire for each task to insert that row. Your "main" process iterating through the parsed data will run a _lot_ faster if it doesn't have to wait on the database for each insert. If your experience is like ours, you'll find that one process parsing a text file and dropping tasks can keep 5-10 instances busy doing the actual database inserts. Depending on the size of your input files, you might also want to consider splitting up the processing of your input file, or using map-reduce if your files are truly "big".
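The fan-out pattern described above can be sketched like this. In real App Engine code the `enqueue` callable would be something like `lambda payload: taskqueue.add(url="/insert_row", params=payload)`; here a plain in-memory list stands in for the push queue so the flow can be run anywhere, and the comma-splitting parser is a hypothetical example:

```python
def parse_lines(lines):
    # Hypothetical parser: one CSV-ish record per non-blank line.
    return [line.strip().split(",") for line in lines if line.strip()]

def enqueue_row_inserts(rows, enqueue):
    # The main process never touches the database; it only enqueues
    # one task per row and moves on. Separate task-handler instances
    # do the actual INSERTs in parallel.
    for row in rows:
        enqueue({"row": row})

queue = []  # stand-in for an App Engine push queue
rows = parse_lines(["a,1", "b,2", "c,3"])
enqueue_row_inserts(rows, queue.append)
```

Because enqueueing is cheap relative to a round-trip to Cloud SQL, the parser can saturate several insert workers, which is where the 5-10 busy instances come from.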