ETL Bandwidth/Latency


Andrew Stevens

Jan 11, 2012, 12:33:18 AM
to bigquery...@googlegroups.com
Hi all, and hi to the few of you I met while in MTV. I thought my question was broadly relevant enough to post publicly.

A somewhat hypothetical question, looking forward to the future of BQ handling truly massive amounts of batch & live data from many dispersed systems via an ETL layer:

How do we guarantee the performance of streaming or near-streaming data via a capable ETL layer hosted by an IaaS provider, with a fat enough pipe into Google infrastructure to ensure that all of that data makes its way in, in a timely fashion?

The neatest solution would likely be an ETL piece running on App Engine; however, off-the-shelf cloud-based ETL providers haven't made their way there yet.

The overarching architectural consideration is where an ETL piece should physically live in relation to the big-data storage/analytics piece to prevent any bottleneck when ingesting data. Presumably there are benefits to ETL and Big Data residing in the same data centre(s)? Presumably there is a benefit to the same vendor providing both services? How much does this hurt, or will it hurt, if that's not the case?


Rufus

Jan 15, 2012, 7:10:27 PM
to BigQuery discuss
Hi Andrew,

Apologies for taking so long to respond.

I can only give definite answers about features that are available
now, and for the hypotheticals I'll have to do a mental judo-flip and
answer your question with a question.

On Jan 10, 9:33 pm, Andrew Stevens <andrewstev...@qone.com.au> wrote:
> ...
> A somewhat hypothetical question, looking forward to the future of BQ
> handling truly massive amounts of batch & live data from many dispersed
> systems via an ETL layer:

Regarding "truly massive amounts," do you estimate that your daily ETL
will be larger than the current limits of:
100 ingestion jobs per day * 100 GB per job = 10 TB per day?
(https://developers.google.com/bigquery/docs/quota-policy )

> How do we guarantee the performance of streaming or near-streaming data
> via a capable ETL layer hosted by an IaaS provider, with a fat enough
> pipe into Google infrastructure to ensure that all of that data makes
> its way in, in a timely fashion?

In the current system there is no direct support for streaming
updates. If we did support streaming, what would you estimate your
needs to be in terms of rows, fields and bytes per second?

Also in the current system there is an upper limit on the number of
ingestion jobs per day (100), so if your streaming ETL was constant
throughout the day then you would average an ingestion about every 15
minutes. That would take a few minutes to digest, so let's estimate
that your data would lag real time by 20 minutes (without true
streaming support). Would this be "timely" enough for you?

If your stream of data fluctuates over the course of the day, the
fastest you could make ingestion job requests would be 10 requests
every 5 minutes. At that rate you would use up your 100/day quota in
50 minutes, assuming you could upload at roughly 3.3 GB/s to fill
each job.
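
To make that arithmetic concrete, here is a rough back-of-envelope
sketch in Python using the quota numbers quoted above. It is only an
illustration of the pacing, not official tooling:

# Back-of-envelope pacing under the current ingestion quotas.
# Assumed quotas (from the quota-policy page linked above):
JOBS_PER_DAY = 100        # max ingestion jobs per day
GB_PER_JOB = 100          # max data per job
BURST_JOBS = 10           # max job requests...
BURST_WINDOW_MIN = 5      # ...per 5-minute window

# Evenly spread: one load roughly every 14.4 minutes, ~20 min lag end to end.
minutes_per_day = 24 * 60
even_interval_min = minutes_per_day / JOBS_PER_DAY
print("Evenly spread: one load every %.1f minutes" % even_interval_min)

# Bursting at the per-window limit: the daily quota is gone in 50 minutes.
minutes_to_exhaust = (JOBS_PER_DAY / BURST_JOBS) * BURST_WINDOW_MIN
print("Bursting: daily quota exhausted in %d minutes" % minutes_to_exhaust)

# Sustained upload rate needed to actually fill those bursts with 100 GB jobs.
gb_per_second = (BURST_JOBS * GB_PER_JOB) / (BURST_WINDOW_MIN * 60.0)
print("Required upload rate: %.2f GB/s" % gb_per_second)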

>
> The neatest solution would likely be an ETL piece running on App Engine;
> however, off-the-shelf cloud-based ETL providers haven't made their way
> there yet.
>

If we were to try to work with some of these ETL providers, which ones
would be most useful to you, and why? (and, of course, our readers at
home are encouraged to list their favorites as well)

> The overarching architectural consideration is where an ETL piece should
> physically live in relation to the big-data storage/analytics piece to
> prevent any bottleneck when ingesting data. Presumably there are benefits
> to ETL and Big Data residing in the same data centre(s)? Presumably there
> is a benefit to the same vendor providing both services? How much does
> this hurt, or will it hurt, if that's not the case?

I don't have enough information to give an estimate on this now.
Google would definitely need to handle the 'Load' portion of ETL, and
that's a number I could research for you.

Where is the Extraction happening? How does the data get transferred
to Google? (I wonder if we should call this ETTL?) Is the data
'Transformed' before or after it arrives at Google? (i.e. do you use
App Engine or some other transformation at Google, or clean it up
before you send it?) And finally, how does the data get into Google
Storage (e.g. direct upload or through App Engine)?

Lastly, remember the current quotas on ingestion jobs.

I suspect that the current quotas on ingestion jobs will outweigh any
processing bottlenecks. If the quotas change, or different quotas
apply to streaming input, then I think we need to dig into the
detailed bottleneck analysis. For now, if you and other readers could
answer the questions above, it would help us plan for working with
more ETL providers.

Thanks for starting this discussion!

/Rufus

Alex Moore

Jan 16, 2012, 2:29:23 AM
to bigquery...@googlegroups.com
This actually answers most of the question I posted in another thread. I'm not the OP, but these questions directly apply to my project as well.

>Regarding "truly massive amounts," do you estimate that your daily ETL 
>will be larger than the current limits of: 
>100 ingestion jobs per day * 100 GB per job = 10 TB per day? 
>(https://developers.google.com/bigquery/docs/quota-policy ) 
Ours may be tiny by comparison, 3 GB per day :) but for us that's a huge amount of data to deal with.

> How do we guarantee the performance of streaming or near-streaming data
> via a capable ETL layer hosted by an IaaS provider, with a fat enough
> pipe into Google infrastructure to ensure that all of that data makes
> its way in, in a timely fashion?

The users of our system use the data for three purposes:
1. Historical analysis
For historical analysis it doesn't matter if the data is 15 minutes out of date.

2. What's happening recently
It's important that they know what's going on over the last 15 minutes. Really, it would be optimal if they only had to wait a minute or so to get updates (5 minutes would probably be fine).

Note: We want to generate alarms and alerts that send SMSs and emails and make lights flash, so a 15-20 minute delay is quite long.

3. What's happening right now
This can be solved by a different method: sending a command to the sensor, opening a socket, and just streaming the data over UDP.

The problem with catering to number 2 is that without quick updates we would probably have to push the data feed to two systems, one for historical data and one for recent data. For us it's obviously better if we can just query a single database.

> The neatest solution would likely be an ETL piece running on App Engine;
> however, off-the-shelf cloud-based ETL providers haven't made their way
> there yet.
At the moment we're using our own TCP/UDP servers for the ETL.

As far as getting the data goes, we currently have 5,000 devices that send data via UDP to our own UDP server (running on Amazon AWS). That server writes directly to a Postgres table, and a separate process then picks up that data, decodes it, and shoves it into another table.

My preferred process would be:
1. Devices send data to Google (App Engine)
2. App Engine decodes the data and sends it to BigQuery
3. The UI of the application makes requests to BigQuery, and that's where all the data comes from. The user has access to all data that is 5 minutes or older, but can stream live data if they want to.
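
In very rough terms, the ingest side of steps 1 and 2 might look something like the Python sketch below. Everything here is illustrative: decode() and send_batch() are placeholders for our own device protocol and for however App Engine would hand batches over to BigQuery (which, without streaming support, would mean staging a file and running a load job), so batching is the point of the sketch.

import json
import socket
import time

BATCH_SECONDS = 300  # flush roughly every 5 minutes to stay well under job quotas

def decode(datagram):
    # Stand-in for our own device protocol decoding; JSON assumed for the sketch.
    return json.loads(datagram.decode("utf-8"))

def send_batch(rows):
    # Placeholder for step 2: App Engine handing the decoded rows to BigQuery.
    # With no streaming insert today, this would mean staging a file and
    # submitting an ingestion (load) job per batch.
    print("would load %d rows into BigQuery" % len(rows))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9999))
sock.settimeout(1.0)

batch = []
last_flush = time.time()
while True:
    try:
        datagram, _addr = sock.recvfrom(65535)
        batch.append(decode(datagram))
    except socket.timeout:
        pass
    if batch and time.time() - last_flush >= BATCH_SECONDS:
        send_batch(batch)
        batch = []
        last_flush = time.time()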

Andrew Stevens

Jan 16, 2012, 7:33:31 PM
to bigquery...@googlegroups.com
Hi Rufus,

Thanks for the detailed response!

Just briefly, the generic goal for our big-data ambitions is to mesh CRM data and operational feeds to provide live(-ish) status updates about our customers to our front-line support staff on the day. We have a subset of data targeted, and a subset of customers who we think would be suitable for a POC of this idea. We handle about 40,000 customers a day, each of whom has a variety of operational attributes (~5-10) plus your typical CRM information.

I'm sure we could give the 100 ingestions/day a good run, and the data quota should be plenty for this initial trial. I'm struggling to guesstimate the bytes/rows figures right now, but it's on my todo list.
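
Just to show the shape of the arithmetic I still need to do, here's a quick back-of-envelope in Python; the row width and the one-row-per-attribute assumption are pure guesses at this stage:

# Pure guesswork to frame the estimate: 40,000 customers/day, ~5-10
# operational attributes each; the 200-byte row width is an assumption.
customers_per_day = 40000
attributes_per_customer = 10     # upper end of the 5-10 range
bytes_per_row = 200              # assumed average row width

# If each attribute update were its own row:
rows_per_day = customers_per_day * attributes_per_customer    # 400,000 rows
mb_per_day = rows_per_day * bytes_per_row / 1e6               # ~80 MB/day
print(rows_per_day, "rows/day, roughly", mb_per_day, "MB/day")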

ETL: I've had a chat with Jaspersoft in San Francisco, but haven't spoken to anyone else. A vendor that supports and has a connector for BQ is crucial; our company can't and won't solve that piece ourselves, as it's just too far outside our business. Would really appreciate suggestions from Google's end when you have some partners on board.

Extraction would be from a data centre in the EU. Data would be transformed before it hits Google, presuming the ETL isn't on GAE. Direct upload into GCS, I would say; I'm not sure what value a GAE app would add here.
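
Conceptually, something like the Python sketch below. The upload_to_gcs and start_bq_load functions are just placeholders to show where the transform, the upload into GCS, and the BQ load job would sit; they are not real vendor or Google API calls.

import csv
import gzip

def transform(raw_rows):
    # 'Transform' before the data leaves our EU data centre: clean/reshape rows.
    for row in raw_rows:
        yield {"customer_id": row["id"].strip(), "status": row["status"].lower()}

def write_csv_gz(rows, path):
    # Stage the transformed data as a compressed CSV, ready for upload.
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "status"])
        writer.writeheader()
        writer.writerows(rows)

def upload_to_gcs(path, bucket, object_name):
    # Placeholder for the direct upload into Google Cloud Storage
    # (e.g. gsutil cp or a resumable upload) -- not a real API call here.
    print("would upload %s to gs://%s/%s" % (path, bucket, object_name))

def start_bq_load(bucket, object_name, table):
    # Placeholder for submitting a BigQuery ingestion (load) job pointing at
    # the uploaded object; each call consumes one of the daily ingestion jobs.
    print("would load gs://%s/%s into %s" % (bucket, object_name, table))

raw = [{"id": " 42 ", "status": "ACTIVE"}]
write_csv_gz(transform(raw), "batch-0001.csv.gz")
upload_to_gcs("batch-0001.csv.gz", "my-bucket", "batch-0001.csv.gz")
start_bq_load("my-bucket", "batch-0001.csv.gz", "mydataset.customers")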

The final step for us, once we have this data reliably digesting into BQ, is to connect BI tools and web views (e.g. GAE apps) over the top so we can cut and deliver the data to suit the end user.

Thanks!

Amanda Bradford

Apr 25, 2012, 7:22:43 PM
to bigquery...@googlegroups.com
Hi Andrew,
Is your company still evaluating BigQuery? We would love to hear more about your project and see how we can better help you with it; a lot has happened since January in the realm of BI + ETL tools! Please email me at amandab...@google.com and we can chat more about your big-data project.