Thank you very much, Gilberto!
It's great to make contact with people out there who are in the same boat.
I've just been watching a series of videos on pipelines, and I'm starting to see the pattern for big data processing that Google promotes:
Datastore -> Cloud Storage -> BigQuery
The key point is that BigQuery is "append only", something I hadn't realized before (I've put a sketch of a load job after the links below).
Here are the videos:
- Google I/O 2012 - Building Data Pipelines at Google Scale: http://youtu.be/lqQ6VFd3Tnw
- BigQuery: Simple example of a data collection and analysis pipeline + Yo...: http://youtu.be/btJE659h5Bg
- GCP Cloud Platform Integration Demo: http://youtu.be/JcOEJXopmgo
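Coming back to the append-only point: the way BigQuery ingests a file from Cloud Storage is a load job, which appends rows to the destination table by default. Here is a minimal sketch of the job configuration (BigQuery REST API v2, jobs.insert); the project, dataset, table and bucket names are made up, and the OAuth-authorized HTTP object is assumed to exist already:

from apiclient.discovery import build  # google-api-python-client

# Hypothetical names throughout; substitute your own project/bucket.
load_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/export.csv"],
            "sourceFormat": "CSV",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "events",
            },
            # Append rows rather than rewriting the table: the
            # "append only" usage pattern from the videos.
            "writeDisposition": "WRITE_APPEND",
        }
    }
}

# authorized_http: an OAuth2-authorized httplib2.Http (auth not shown).
bigquery = build("bigquery", "v2", http=authorized_http)
bigquery.jobs().insert(projectId="my-project", body=load_job).execute()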
All I need, it seems, is the Pipeline API: iterating over the Datastore (in order, I guess via a query) and producing CSV (and other formats) as output.
That should allow me to do what I already do, but on top of multiple (perhaps sequential) task queues rather than just one.
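For the record, here is roughly what I have in mind, assuming the App Engine MapReduce library's MapperPipeline with the stock DatastoreInputReader; the entity kind, the module path "main", and the params keys below are guesses to be checked against the library version:

from mapreduce import base_handler, mapper_pipeline


def entity_to_csv(entity):
    # Map function: one CSV line per entity. The property names are
    # hypothetical; substitute the real schema.
    yield "%s,%s\n" % (entity.key().id_or_name(), entity.value)


class CsvExportPipeline(base_handler.PipelineBase):
    # Iterates over a Datastore kind across sharded task queues and
    # writes the mapped lines through an output writer.
    def run(self):
        yield mapper_pipeline.MapperPipeline(
            "csv_export",
            handler_spec="main.entity_to_csv",
            input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={"entity_kind": "main.MyEntity",
                    "output_writer": {"mime_type": "text/csv"}},
            shards=4)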
From the point of view of costs: I currently rely heavily on memcache, possibly abusing it. Without memcache, I expect costs to go up.
A further improvement would be to update only subsets of the data rather than the whole lot. I've been designing a new Datastore 'schema' so that my data is hierarchically organized in entity groups; that way I could generate a file per entity group (once it has changed) and have a final stage that assembles those files together.
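A sketch of that fan-out/fan-in shape, using the Pipeline API directly; ExportGroup and CombineFiles are hypothetical stages, and the actual Cloud Storage reads/writes are elided:

import pipeline
from google.appengine.ext import db


class ExportGroup(pipeline.Pipeline):
    # Exports one entity group to its own file; returns the filename.
    # Being synchronous (no yield), its return value is its output.
    def run(self, root_key_str):
        root = db.Key(root_key_str)
        for entity in db.Query().ancestor(root).run():
            pass  # ... serialize the entity into the group's file ...
        return "/gs/my-bucket/%s.csv" % root.id_or_name()  # name made up


class CombineFiles(pipeline.Pipeline):
    # Final stage: assembles the per-group files into one output.
    def run(self, *filenames):
        pass  # ... concatenate the files in Cloud Storage ...


class ExportAll(pipeline.Pipeline):
    # Fans out one ExportGroup per changed entity group; the futures
    # passed to CombineFiles make it wait until every export is done.
    def run(self, root_key_strs):
        parts = []
        for key_str in root_key_strs:
            parts.append((yield ExportGroup(key_str)))
        yield CombineFiles(*parts)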
I'm pretty happy with my current task because, as I wrote, it is simple and elegant.
If I could upgrade the same algorithm into a Datastore input reader for pipelines, that should do for us.
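In case it helps anyone attempting the same, this is the interface such a reader would have to implement, assuming the MapReduce library's InputReader base class; the ordered GQL query and the single-shard split are placeholders for my existing algorithm:

from mapreduce import input_readers
from google.appengine.ext import db


class OrderedKindReader(input_readers.InputReader):
    # Iterates one shard of a kind in key order, checkpointing with a
    # query cursor so the shard can resume after a task retry.

    def __init__(self, kind, cursor=None):
        self._kind = kind
        self._cursor = cursor

    def __iter__(self):
        query = db.GqlQuery("SELECT * FROM %s ORDER BY __key__" % self._kind)
        if self._cursor:
            query.with_cursor(self._cursor)
        for entity in query:
            self._cursor = query.cursor()  # saved by to_json()
            yield entity

    def to_json(self):
        return {"kind": self._kind, "cursor": self._cursor}

    @classmethod
    def from_json(cls, json):
        return cls(json["kind"], json.get("cursor"))

    @classmethod
    def split_input(cls, mapper_spec):
        # A single shard keeps the global ordering; more shards would
        # need key-range splitting as in the stock DatastoreInputReader.
        return [cls(mapper_spec.params["entity_kind"])]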
Emanuele