Hi Andrew,
Apologies for taking so long to respond.
I can only give definite answers about features that are available
now, and for the hypotheticals I'll have to do a mental judo-flip and
answer your question with a question.
On Jan 10, 9:33 pm, Andrew Stevens <andrewstev...@qone.com.au> wrote:
> ...
> Perhaps a somewhat hypothetical question - looking forward to the future of
> BQ handling truly massive amounts of batch & live data from many dispersed
> systems via an ETL:
Regarding "truly massive amounts": do you estimate that your daily ETL
volume will exceed the current limits of
100 ingestion jobs per day * 100 GB per job = 10 TB per day?
( https://developers.google.com/bigquery/docs/quota-policy )
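To make that ceiling explicit, here is the arithmetic as a quick sketch (the quota figures are the ones quoted above and are subject to change):

```python
# Back-of-the-envelope daily ingestion ceiling under the quotas
# cited above (100 jobs/day, 100 GB/job; figures may change).
JOBS_PER_DAY = 100
GB_PER_JOB = 100

daily_ceiling_gb = JOBS_PER_DAY * GB_PER_JOB
print(daily_ceiling_gb)  # 10000 GB, i.e. 10 TB/day
```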
> How do we guarantee the performance of streaming or near streaming data,
> via perhaps a quite capable ETL layer hosted by an IaaS provider, but able
> to have a fat enough pipe into google infrastructure to ensure that all
> that data is making its way in, in a timely fashion
In the current system there is no direct support for streaming
updates. If we did support streaming, what would you estimate your
needs to be in terms of rows, fields and bytes per second?
Also, in the current system there is an upper limit of 100 ingestion
jobs per day, so if your streaming ETL ran at a constant rate you
would average one ingestion job about every 15 minutes. Each job takes
a few minutes to digest, so let's estimate that your data would lag
real time by about 20 minutes (without true streaming support). Would
that be "timely" enough for you?
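Spelled out as a rough sketch (the 5-minute per-job digest time here is my assumption, not a published figure):

```python
# Steady-state lag if a constant stream is batched into the
# 100 ingestion jobs currently allowed per day.
MINUTES_PER_DAY = 24 * 60
JOBS_PER_DAY = 100
DIGEST_MINUTES = 5  # assumed per-job processing time

interval = MINUTES_PER_DAY / JOBS_PER_DAY  # 14.4 minutes between jobs
lag = interval + DIGEST_MINUTES            # ~19.4 minutes; call it 20
print(interval, lag)  # 14.4 19.4
```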
If your stream of data fluctuates over the course of the day, the
fastest you could submit ingestion job requests would be 10 requests
every 5 minutes. At that rate you would exhaust your 100-job daily
quota in 50 minutes, and pushing the full 10 TB through in that window
would require a sustained upload rate of about 3.413 GB/s.
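The burst-rate arithmetic, spelled out (using 1 TB = 1024 GB, which is how the 3.413 GB/s figure falls out):

```python
# How quickly a bursty ETL could burn the daily job quota, and the
# sustained upload rate needed to move the full 10 TB in that window.
JOBS_PER_DAY = 100
REQUESTS_PER_WINDOW = 10   # rate limit: 10 job requests per 5 minutes
WINDOW_MINUTES = 5

minutes_to_exhaust = JOBS_PER_DAY // REQUESTS_PER_WINDOW * WINDOW_MINUTES
required_gb_per_s = 10 * 1024 / (minutes_to_exhaust * 60)
print(minutes_to_exhaust)           # 50
print(round(required_gb_per_s, 3))  # 3.413
```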
>
> The neatest solution would likely be an ETL piece through appengine,
> however currently off the shelf cloud based ETL providers haven't made
> their way there yet.
>
If we were to try to work with some of these ETL providers, which ones
would be most useful to you, and why? (and, of course, our readers at
home are encouraged to list their favorites as well)
> The overarching architectural consideration is where an ETL piece should
> physically live in relation to the big data storage/analytics piece to
> prevent any bottle neck ingesting data - presumably theres benefits of
> ETL/Big Data residing in the same data centre(s)? presumably there is a
> benefit to the same vendor providing both services? How much does this/will
> this hurt if it's not the case?
I don't have enough information to give an estimate on this now.
Google would definitely need to handle the 'Load' portion of ETL, and
that's a number I could research for you.
Where is the Extraction happening? How does the data get transferred
to Google? (I wonder if we should call this ETTL?) Is the data
'Transformed' before or after it arrives at Google? (i.e. do you use
App Engine or some other transformation at Google, or clean it up
before you send it?) And finally, how does the data get into Google
Storage (e.g. direct upload or through App Engine)?
One more note: I suspect that the current quotas on ingestion jobs
will outweigh any processing bottlenecks. If the quotas change, or
different quotas apply to streaming input, then we will need to dig
into a detailed bottleneck analysis. For now, if you and the other
readers could answer the questions above, it would help us plan for
working with more ETL providers.
Thanks for starting this discussion!
/Rufus