Filter Questions


Derek Perkins

Jul 22, 2015, 2:31:45 AM
to Google Cloud Bigtable Discuss
I know that Bigtable is optimized for reading contiguous rows, so I am curious how filters play into things.
  1. Filter location: Does the filtering happen on the Bigtable cluster so that only the data I ask for comes back over the wire, or does it return everything and then get filtered inside the Bigtable package?
  2. ChainFilters vs InterleaveFilters: Is the difference as simple as AND (meets all filters) vs OR (meets any of the criteria and is deduped)?
  3. Reading single rows vs filters: What would be some recommended cutoff points or things to keep in mind when getting non-contiguous data?
    1. Scenario: Assume a server reports 1 row of metrics daily for 5 years. Running a large report, I only want to get 1 reading per week on Mondays - 260 out of 1825 rows. Am I better off doing a RowRange across the whole period with a RowKeyFilter on the key or just running 260 individual ReadRow requests (assuming predictable keys)?
    2. What if in the same scenario as above, I only want to get data for the 1st of each month, so 60 data points out of 1825. Does that change the recommendation from the above? What about just 1 data point per year, so 5 rows?
    3. What if I had 100,000 servers reporting those daily metrics and I wanted to get weekly / monthly data from 100 of them?
Thanks,
Derek

Douglas Mcerlean

Jul 22, 2015, 8:44:16 AM
to google-cloud-b...@googlegroups.com

Hi Derek, thanks for your questions! Some answers are inline, along with some clarification questions:

On Jul 22, 2015 2:31 AM, "Derek Perkins" <de...@derekperkins.com> wrote:
>
> I know that Bigtable is optimized for reading contiguous rows, so I am curious how filters play into things.
> https://godoc.org/google.golang.org/cloud/bigtable#Filter
> Filter location: Does the filtering happen on the Bigtable cluster so that only the data I ask for comes back over the wire, or does it return everything and then get filtered inside the Bigtable package?

The filters that can be issued via the Go API are all natively supported by Bigtable, and are executed server-side. This is not always the case when using the HBase client, as some HBase filters don't map cleanly to Bigtable ones, but we still try to do as much work at the server as possible.
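
For concreteness, a minimal sketch of a server-side filtered read with the Go client might look like the following. It assumes the current cloud.google.com/go/bigtable import path and placeholder project, instance, and table names (the 2015 google.golang.org/cloud/bigtable package took a zone and cluster rather than an instance, but the read call is essentially the same); the filter travels with the request and is applied by the servers, so only matching cells cross the wire.

```go
// A minimal sketch of a server-side filtered read with the Go client.
// Project, instance, table, and family names are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	tbl := client.Open("metrics")

	// The filter is sent with the request and evaluated by the Bigtable
	// servers; only cells that pass it are streamed back to the client.
	err = tbl.ReadRows(ctx, bigtable.PrefixRange("server-42#"),
		func(r bigtable.Row) bool {
			fmt.Println(r.Key())
			return true // keep scanning
		},
		bigtable.RowFilter(bigtable.FamilyFilter("stats")),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```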


> ChainFilters vs InterleaveFilters: Is the difference as simple as AND (meets all filters) vs OR (meets any of the criteria and is deduped)?

Not quite, but close. In a Chain filter the order may matter; for example, applying a StripValueTransformer before a ValueRangeFilter gives a different result than applying it after (apologies if those aren't the exact names used by Go). You can think of it loosely as piping one program into another in a Unix shell.

In an Interleave filter there should be no deduping. If multiple interleaved filters all produce an output, then those outputs should all appear in the result or be sent to the next filter in a Chain as appropriate. You can think of it loosely as running multiple programs in a shell all producing output at the same time (albeit without stomping on each other's output).
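
As a rough illustration of the two combinators using the Go filter helpers (names from the current cloud.google.com/go/bigtable package; the column names and value bounds are made up):

```go
package main

import "cloud.google.com/go/bigtable"

// buildFilters sketches the Chain vs Interleave distinction described above.
func buildFilters() (chain, interleave bigtable.Filter) {
	// Chain: each sub-filter consumes the previous one's output, so order
	// matters. Here the value-range check sees the original cell values and
	// only afterwards are the values stripped from the result; swapping the
	// two would strip values first and the range check would match nothing.
	chain = bigtable.ChainFilters(
		bigtable.ValueRangeFilter([]byte("100"), []byte("200")),
		bigtable.StripValueFilter(),
	)

	// Interleave: every sub-filter sees the same input and their outputs are
	// merged without deduplication, so a cell emitted by two sub-filters
	// appears twice in the combined stream.
	interleave = bigtable.InterleaveFilters(
		bigtable.ColumnFilter("clicks"),
		bigtable.ColumnFilter("impressions"),
	)
	return chain, interleave
}
```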


> Reading single rows vs filters: What would be some recommended cutoff points or things to keep in mind when getting non-contiguous data?

We don't have a really hard and fast rule for this, though generally we'd suggest trying to make the data contiguous if it's expected to be accessed together. If that's not an option, it comes down to a trade-off between code complexity and efficiency, and so the right course depends on your situation.


> Scenario: Assume a server reports 1 row of metrics daily for 5 years. Running a large report, I only want to get 1 reading per week on Mondays - 260 out of 1825 rows. Am I better off doing a RowRange across the whole period with a RowKeyFilter on the key or just running 260 individual ReadRow requests (assuming predictable keys)?

For so few rows, the performance difference will likely be negligible, so I would put the work on the server if there's a RowKeyFilter that will do the job.
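
As a sketch of that approach, assuming a hypothetical serverID#YYYYMMDD key layout: the Monday keys can be enumerated client-side and turned into one anchored alternation passed as a RowKeyFilter over a single RowRange, so the server drops the other six days.

```go
package main

import (
	"strings"
	"time"

	"cloud.google.com/go/bigtable"
)

// mondaysQuery builds a single RowRange over the whole period plus a
// RowKeyFilter matching only the Monday keys. The serverID#YYYYMMDD key
// layout is an assumption for illustration.
func mondaysQuery(serverID string, start, end time.Time) (bigtable.RowRange, bigtable.ReadOption) {
	var keys []string
	for d := start; !d.After(end); d = d.AddDate(0, 0, 1) {
		if d.Weekday() == time.Monday {
			keys = append(keys, serverID+"#"+d.Format("20060102"))
		}
	}
	// NewRange takes an exclusive upper bound, so use the day after `end`.
	rng := bigtable.NewRange(
		serverID+"#"+start.Format("20060102"),
		serverID+"#"+end.AddDate(0, 0, 1).Format("20060102"),
	)
	// Anchored alternation so it matches whole keys regardless of whether the
	// server treats the pattern as a full or partial match.
	filter := bigtable.RowKeyFilter("^(" + strings.Join(keys, "|") + ")$")
	return rng, bigtable.RowFilter(filter)
}
```

The alternative is simply 260 ReadRow calls on the same keys; as noted above, at this scale either should perform similarly.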


> What if in the same scenario as above, I only want to get data for the 1st of each month, so 60 data points out of 1825. Does that change the recommendation from the above? What about just 1 data point per year, so 5 rows?

Again, for small data sets do the simple thing for sure.


> What if I had 100,000 servers reporting those daily metrics and I wanted to get weekly / monthly data from 100 of them?

Here grabbing individual rows might be worth it, as this is a selective filter on a large data set. However, it depends a bit on your row key structure, as you could potentially pare down the range of rows the server needs to scan. What does your schema look like? Additionally, how often do you expect to run this query? If it's not often, simple code is more important than high performance.
> Thanks,
> Derek
>

Derek Perkins

Jul 22, 2015, 10:33:19 AM
to Google Cloud Bigtable Discuss
Doug,

I really appreciate your quick and thoughtful response; I already feel like I have a better understanding. In my previous scenario, I was trying to keep things pretty simple and followed one of the examples from the Bigtable time-series PDF. It's similar to my situation, but not exact. We're running a multi-tenant marketing system, and each of our tenants is likely to have many of their own clients. Different marketing campaigns are executed for each client, and we're grabbing external metrics like social shares throughout the day. This is what my row key looks like:

tenantID#clientID#campaignID#YYMMDDHHmm#metricID

Our most common query is to show all the data for a campaign with a date range selected, and we also support rollup reporting where you could pull up one or a few clients at a time, or view all the metrics for the entire tenant. Users are just as likely to look at 1 campaign over the last week as all of the campaigns for 5 years. We're doing a lot of client-side filtering in JavaScript, so there's a max of about 100k rows that we can work with before impacting browser performance. What I want to do is sample the data to make sure I only return ~100k rows. For example, if there are 1M matching rows in Bigtable, I only want to return every 10th day.

The tenant can pick which metrics to monitor and how often to report them, and some of the metrics can match multiple results (i.e., create multiple rows), so it's difficult for me to know the total count before running the query. The easiest solution appears to be running a keys-only query and counting the results, though that seems counter to how I'd like to use Bigtable. I've also contemplated keeping another table keyed on part of that same key, tenantID#clientID#campaignID#YYMMDD, and storing the total rows per day there, so for every request I can query that table first to determine my sampling rate, then run the full query. It's more complex, but it seems like it would significantly reduce the Bigtable workload.

We expect vast differences between tenants, some probably returning 3-5 orders of magnitude more rows than others, so a few large ones could return 100 million rows when viewing a rollup of their entire tenantID, while others will have fewer than 100k rows total. These API calls power our entire system, and will be running up to thousands of times per minute as we start scaling. Our other approaches have failed to scale, so we're really betting the farm on Bigtable.

Given that scenario, I have a few questions:
  1. When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
  2. What's the best way to get a total count to determine my sampling rate?
  3. If I report on a user-selected (effectively random) set of N clients or campaigns, should I use an InterleaveFilter composed of N RowKeyFilters to grab all of those rows in parallel?
Thanks again for your time, I really appreciate it.

Derek

Douglas Mcerlean

Jul 22, 2015, 11:48:21 AM
to google-cloud-b...@googlegroups.com

Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.
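
A rough sketch of that loop, assuming the thread's key layout, a hypothetical `collect` callback, and errgroup for the parallelism:

```go
package main

import (
	"context"

	"cloud.google.com/go/bigtable"
	"golang.org/x/sync/errgroup"
)

// readSampledDays issues one prefix scan per sampled day against keys shaped
// like tenantID#clientID#campaignID#YYMMDDHHmm#metricID, running the scans in
// parallel. collect must be safe for concurrent use.
func readSampledDays(ctx context.Context, tbl *bigtable.Table,
	tenant, clientID, campaign string, days []string, collect func(bigtable.Row)) error {

	g, ctx := errgroup.WithContext(ctx)
	for _, day := range days { // days are "YYMMDD" strings, e.g. "150706"
		day := day // capture the loop variable for the goroutine
		g.Go(func() error {
			// Each sampled day is a small contiguous chunk under this prefix,
			// so the server never touches the days in between.
			prefix := tenant + "#" + clientID + "#" + campaign + "#" + day
			return tbl.ReadRows(ctx, bigtable.PrefixRange(prefix),
				func(r bigtable.Row) bool {
					collect(r)
					return true
				})
		})
	}
	return g.Wait()
}
```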


> What's the best way to get a total count to determine my sampling rate?

If you don't want to keep it around, you will unfortunately need to run the full query and count the results. You might be better off with an approach where, whenever you've hit 100k rows, you cut your sampling factor in half for future reads and drop the appropriate rows that have already been buffered.
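
A small sketch of that adaptive idea, independent of the Bigtable API (the 100k cap and the stride bookkeeping are the only moving parts; all names here are made up):

```go
package main

// downsampler keeps every stride-th row and, whenever the buffer reaches
// maxRows, doubles the stride and evicts buffered rows that no longer land
// on the new stride. This is the "halve the sampling factor as you go" idea.
type downsampler struct {
	maxRows int
	stride  int // keep rows whose running index is a multiple of stride
	next    int // running index of the next incoming row
	rows    []sampledRow
}

type sampledRow struct {
	index int
	key   string
}

func newDownsampler(maxRows int) *downsampler {
	return &downsampler{maxRows: maxRows, stride: 1}
}

func (d *downsampler) add(key string) {
	i := d.next
	d.next++
	if i%d.stride != 0 {
		return // not part of the current sample
	}
	d.rows = append(d.rows, sampledRow{index: i, key: key})
	if len(d.rows) >= d.maxRows {
		// Coarsen the sample: double the stride and drop buffered rows
		// whose index no longer matches it.
		d.stride *= 2
		kept := d.rows[:0]
		for _, r := range d.rows {
			if r.index%d.stride == 0 {
				kept = append(kept, r)
			}
		}
		d.rows = kept
	}
}
```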


> If I report on a user-selected (effectively random) set of N clients or campaigns, should I use an InterleaveFilter composed of N RowKeyFilters to grab all of those rows in parallel?

I would issue these separately as well, since they're known keys that will generally be quite selective.


> Thanks again for your time, I really appreciate it.

Happy to help, and as always let us know if you have any other questions!


>
> Derek
>

David Symonds

Jul 22, 2015, 1:00:23 PM
to google-cloud-b...@googlegroups.com
I can answer your first two questions, but not your third.


On 22 July 2015 at 16:24, Derek Perkins <de...@derekperkins.com> wrote:

> Filter location: Does the filtering happen on the Bigtable cluster so that
> only the data I ask for comes back over the wire, or does it return
> everything and then get filtered inside the Bigtable package?

The filtering happens on the server side.


> ChainFilters vs InterleaveFilters: Is the difference as simple as AND (meets
> all filters) vs OR (meets any of the criteria and is deduped)?

Roughly, yes.

Derek Perkins

Jul 22, 2015, 4:12:34 PM
to Google Cloud Bigtable Discuss

> When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.


So to make sure I'm understanding you, you would not do a single RowRange and filter out non-matching rows, you would issue a single RowRange per day that I am sampling, potentially 100+ requests? It's probably about the same code complexity for me either way, so I want to do whatever is most performant on the Bigtable side. On the MySQL side of things, I would typically do that all in one request because that's usually gentler on the database, but what I'm hearing you say is that the individual connections aren't as important here as not iterating through a lot of rows that I don't need.
 

> What's the best way to get a total count to determine my sampling rate?
If you don't want to keep it around, you will unfortunately need to run the full query and count the results. You might be better off with an approach where, whenever you've hit 100k rows, you cut your sampling factor in half for future reads and drop the appropriate rows that have already been buffered.


I think that I'll keep track of it somewhere. If the response is on the high end of 100 million rows, I'm not going to be able to buffer all of that in memory.
 

> If I report on a user-selected (effectively random) set of N clients or campaigns, should I use an InterleaveFilter composed of N RowKeyFilters to grab all of those rows in parallel?
I would issue these separately as well, since they're known keys that will generally be quite selective.


Sounds like a good strategy.

I feel like I have a much better handle on how to responsibly use Bigtable now. :)

Douglas Mcerlean

Jul 22, 2015, 5:09:15 PM
to google-cloud-b...@googlegroups.com
On Wed, Jul 22, 2015 at 4:12 PM, Derek Perkins <de...@derekperkins.com> wrote:

> When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.


So to make sure I'm understanding you, you would not do a single RowRange and filter out non-matching rows, you would issue a single RowRange per day that I am sampling, potentially 100+ requests? It's probably about the same code complexity for me either way, so I want to do whatever is most performant on the Bigtable side. On the MySQL side of things, I would typically do that all in one request because that's usually gentler on the database, but what I'm hearing you say is that the individual connections aren't as important here as not iterating through a lot of rows that I don't need.
Correct. It sounds to me like in the common case the rows you want will be in decently-sized contiguous chunks, with much larger gaps in between. Issuing multiple range queries allows you to avoid going to those tablets at all. Of course if you want to filter for a specific metric, you'll still want a RowKeyFilter applied within each range.
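
For example, a per-day prefix scan narrowed to one metric might look roughly like this (key layout from earlier in the thread; the regex is anchored so it matches whole keys either way the server interprets the pattern):

```go
package main

import (
	"context"
	"regexp"

	"cloud.google.com/go/bigtable"
)

// readDayForMetric scans one day's prefix (tenantID#clientID#campaignID#YYMMDD)
// and asks the server to keep only rows for a single metricID, treating the
// HHmm portion of the key as a wildcard in between.
func readDayForMetric(ctx context.Context, tbl *bigtable.Table,
	dayPrefix, metricID string, collect func(bigtable.Row)) error {

	pattern := "^" + regexp.QuoteMeta(dayPrefix) + `\d{4}#` + regexp.QuoteMeta(metricID) + "$"
	return tbl.ReadRows(ctx, bigtable.PrefixRange(dayPrefix),
		func(r bigtable.Row) bool {
			collect(r)
			return true
		},
		bigtable.RowFilter(bigtable.RowKeyFilter(pattern)))
}
```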

Just to sanity check my assumptions, what's a typical sampling rate, and what's a typical number of metrics for a campaign on any given day? From the numbers you've given I'm expecting that they're both rather large. It also occurs to me that this may not be so easy when you want to downsample a rollup of a particular tenant, since you don't know which clients/campaigns it has underneath it. Is that something you need to do regularly?
 

> What's the best way to get a total count to determine my sampling rate?
If you don't want to keep it around, you will unfortunately need to run the full query and count the results. You might be better off with an approach where, whenever you've hit 100k rows, you cut your sampling factor in half for future reads and drop the appropriate rows that have already been buffered.


I think that I'll keep track of it somewhere. If the response is on the high end of 100 million rows, I'm not going to be able to buffer all of that in memory.
 

> If I report on a user-selected (effectively random) set of N clients or campaigns, should I use an InterleaveFilter composed of N RowKeyFilters to grab all of those rows in parallel?
I would issue these separately as well, since they're known keys that will generally be quite selective.


Sounds like a good strategy.

I feel like I have a much better handle on how to responsibly use Bigtable now. :)


Derek Perkins

Jul 22, 2015, 6:43:26 PM
to Google Cloud Bigtable Discuss, dmmce...@google.com

> When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.


So to make sure I'm understanding you, you would not do a single RowRange and filter out non-matching rows, you would issue a single RowRange per day that I am sampling, potentially 100+ requests? It's probably about the same code complexity for me either way, so I want to do whatever is most performant on the Bigtable side. On the MySQL side of things, I would typically do that all in one request because that's usually gentler on the database, but what I'm hearing you say is that the individual connections aren't as important here as not iterating through a lot of rows that I don't need.
Correct. It sounds to me like in the common case the rows you want will be in decently-sized contiguous chunks, with much larger gaps in between. Issuing multiple range queries allows you to avoid going to those tablets at all. Of course if you want to filter for a specific metric, you'll still want a RowKeyFilter applied within each range.

Just to sanity check my assumptions, what's a typical sampling rate, and what's a typical number of metrics for a campaign on any given day? From the numbers you've given I'm expecting that they're both rather large. It also occurs to me that this may not be so easy when you want to downsample a rollup of a particular tenant, since you don't know which clients/campaigns it has underneath it. Is that something you need to do regularly?

We're just launching right now, so while I'm pretty confident about our access patterns, I'm not 100% sure. This is what we expect our average client to have stored. (Our first client, which is being onboarded now, has significantly higher numbers than this.) Also, we store all our client/campaign relationship data in MySQL, so we know everything outside of Bigtable except the total row count.
  1. Expected Average Tenant
    1. Assumptions
      1. 100 clients
      2. 3 campaigns per client
      3. 100 metrics per campaign
      4. 3 rows per metric
      5. 300 avg rows per campaign per day
      6. 200 users
    2. Most common queries
      1. 3 months for a single campaign: 1 campaign x 100 metrics x 3 rows x 90 days = 45,000 rows (no sampling)
      2. 6 months for a single client: 1 client x 3 campaigns x 100 metrics x 3 rows x 180 days = 162,000 rows (sample every other day for 81,000 rows)
      3. 12 months for the whole tenant rollup: 1 tenant x 100 clients x 3 campaigns x 100 metrics x 3 rows x 365 days = 32.8M rows (sample the first and last day of the period for 180,000 rows; our sampling is a little more complex in practice, since for these larger overviews we'll typically return fewer columns, which lets us raise the row count, but that's up to how our users define their views)
    3. Usage
      1. 50 high usage users pulling 10-20 of the above queries per day
      2. 10-20 tenant rollups per day
      3. 150 occasional users
  2. Our current client
    1. Assumptions
      1. 20 clients
      2. 10 campaigns per client
      3. 5,000 metrics per campaign
      4. 10 rows per metric
      5. 50,000 avg rows per campaign per day
      6. 40 users
    2. Common queries
      1. 3 months for a single campaign: 1 campaign x 5,000 metrics x 10 rows x 90 days = 4.5M rows (sample once a month for 160,000 rows)
      2. 6 months for a single client: 1 client x 10 campaigns x 5,000 metrics x 10 rows x 180 days = 270M rows (sample somehow)
      3. 12 months for the whole tenant rollup: 1 tenant x 20 clients x 10 campaigns x 5,000 metrics x 10 rows x 365 days = 3.6B rows (sample somehow)
    3. Usage
      1. 40 extremely high usage users pulling 50-75 of the above queries per day
      2. 25-50 tenant rollups per day

Douglas Mcerlean

Jul 23, 2015, 6:05:35 PM
to Derek Perkins, Google Cloud Bigtable Discuss
This is a really interesting use case...given all this detail, I'm actually going to suspend judgement for a moment since I could see things going either way in some cases. Here are my thoughts:

You'll certainly want to issue separate prefix reads for tenants/clients/campaigns, as these are big contiguous chunks (O(100000) rows for a year of data, even for small clients) that will have big contiguous gaps between them.

As you reach dates/metrics, things are a little less clear. For smaller tenants without too much downsampling, you're looking at 300 good rows, then 300 bad, then 300 good, and so on. At that scale, it's probably faster to let the server manage the filtering. However, for large clients we're talking about 50000 on, then 1.5M off, and so on. For that, it's again pretty cut and dried that you should scan those chunks separately. The question is, where should the cutoff be? And the meta-question is, how important is it to even get that right, given that you're only expecting a few thousand queries/tenant/day?

As a further clarification, around what size do you expect your rows to be? The larger the rows, the better separate reads look, as even relatively nearby rows may be in separate regions, and pre-fetching the next few rows from disk will be less of a benefit.

Derek Perkins

Jul 23, 2015, 7:21:30 PM
to Google Cloud Bigtable Discuss, dmmce...@google.com
Doug, I really appreciate the time you've put into helping us analyze this.

The question is, where should the cutoff be? And the meta-question is, how important is it to even get that right, given that you're only expecting a few thousand queries/tenant/day?

It's very possible that I could be off here, depending on how often they change views on our front end. It could easily be 10x, but unlikely to exceed 100x my estimate. At some point, we'll have to just pick a method and then see what the performance data has to say. If I had a gun to my head and had to pick a sampling threshold based on our conversations so far, I would probably issue separate queries for sampling rates under 25% and row filters for anything above that. (that's also based on me storing row counts somewhere else, so I can pick the sampling rate before retrieving any data)

As a further clarification, around what size do you expect your rows to be?

A full row for us is only about 1kb.

Under the conditions I've described, I'm under the impression that I should have a lot of room to grow before maxing out my starter 3-node cluster. Is that right? It seems like the quota we'd run into first is the 200k reads/sec limit. Is that referring to rows, not rows * columns?

Does Cloud Bigtable stream query results back / paginate under the hood? I'm hoping to stream the data to my frontend in 1,000 row chunks and complete in a few seconds.


Douglas Mcerlean

Jul 28, 2015, 1:16:28 PM
to Derek Perkins, Google Cloud Bigtable Discuss
Sorry, this got lost.

On Thu, Jul 23, 2015 at 7:21 PM, Derek Perkins <de...@derekperkins.com> wrote:
Doug, I really appreciate the time you've put into helping us analyze this.

The question is, where should the cutoff be? And the meta-question is, how important is it to even get that right, given that you're only expecting a few thousand queries/tenant/day?

It's very possible that I could be off here, depending on how often they change views on our front end. It could easily be 10x, but unlikely to exceed 100x my estimate. At some point, we'll have to just pick a method and then see what the performance data has to say. If I had a gun to my head and had to pick a sampling threshold based on our conversations so far, I would probably issue separate queries for sampling rates under 25% and row filters for anything above that. (that's also based on me storing row counts somewhere else, so I can pick the sampling rate before retrieving any data)
Yeah, at a certain point you just have to try it out. The important thing is understanding what the tradeoffs are. 

As a further clarification, around what size do you expect your rows to be?

A full row for us is only about 1kb.

Under the conditions I've described, I'm under the impression that I should have a lot of room to grow before maxing out my starter 3 node cluster. Is that right? It seems like the quota we'd be running into first is the 200k reads / sec. That's referring to rows, not rows * columns?
For reads the primary bottleneck will be throughput, not QPS. Our guidance on the "create cluster" page is roughly 10MB/sec/server, which for your ~1KB rows works out to about 10,000 rows/sec/server. It's possible, however, that you'll see better performance given that you're doing large sequential scans without much filtering. I'd encourage you to see what you can get out of 3 servers, and (shameless plug) take advantage of our one-click linear scaling if you need more ;)

Does Cloud Bigtable stream query results back / paginate under the hood? I'm hoping to stream the data to my frontend in 1,000 row chunks and complete in a few seconds.


Cloud Bigtable streams the rows back under the hood, so you'll be able to buffer as many or as few rows as you want before sending them along to the frontend.
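
So, for example, a read loop that forwards fixed-size chunks as the stream arrives might look roughly like this (`flush` is a hypothetical hook that ships a chunk to the frontend, e.g. over a websocket or HTTP response):

```go
package main

import (
	"context"

	"cloud.google.com/go/bigtable"
)

// streamInChunks forwards rows to the frontend in chunks of chunkSize while
// ReadRows is still streaming results back from the server.
func streamInChunks(ctx context.Context, tbl *bigtable.Table, rng bigtable.RowRange,
	chunkSize int, flush func([]bigtable.Row) error) error {

	buf := make([]bigtable.Row, 0, chunkSize)
	err := tbl.ReadRows(ctx, rng, func(r bigtable.Row) bool {
		buf = append(buf, r)
		if len(buf) == chunkSize {
			if flush(buf) != nil {
				return false // frontend went away; stop the scan early
			}
			// Allocate a fresh slice so the flushed chunk is not overwritten.
			buf = make([]bigtable.Row, 0, chunkSize)
		}
		return true
	})
	if err != nil {
		return err
	}
	if len(buf) > 0 {
		return flush(buf) // final partial chunk
	}
	return nil
}
```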