Hi Derek, thanks for your questions! Some answers are inline, along with some clarification questions:
On Jul 22, 2015 2:31 AM, "Derek Perkins" <de...@derekperkins.com> wrote:
>
> I know that Bigtable is optimized for reading contiguous rows, so I am curious how filters play into things.
> https://godoc.org/google.golang.org/cloud/bigtable#Filter
> Filter location: Does the filtering happen on the Bigtable cluster so that only the data I ask for comes back over the wire, or does it return everything and then get filtered inside the Bigtable package?
The filters that can be issued via the Go API are all natively supported by Bigtable, and are executed server-side. This is not always the case when using the HBase client, as some HBase filters don't map cleanly to Bigtable ones, but we still try to do as much work at the server as possible.
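For a concrete picture, a rough (untested) sketch against the Go package you linked might look like this; the *bigtable.Table handle tbl, the "server123#" key prefix, and the "metrics" column family are all placeholders for whatever you actually use:

    import (
        "log"

        "golang.org/x/net/context"
        "google.golang.org/cloud/bigtable"
    )

    // readMetrics scans one server's rows. The FamilyFilter is evaluated by
    // the Bigtable servers, so only cells in the "metrics" family come back
    // over the wire; the client library does no filtering of its own.
    func readMetrics(ctx context.Context, tbl *bigtable.Table) error {
        return tbl.ReadRows(ctx, bigtable.PrefixRange("server123#"),
            func(r bigtable.Row) bool {
                log.Printf("row %q: %d metrics cells", r.Key(), len(r["metrics"]))
                return true // keep scanning
            },
            bigtable.RowFilter(bigtable.FamilyFilter("metrics")))
    }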
> ChainFilters vs InterleaveFilters: Is the difference as simple as AND (meets all filters) vs OR (meets any of the criteria and is deduped)?
Not quite, but close. In a Chain filter the order may matter, for example if you apply a StripValueTransformer before a ValueRangeFilter as opposed to after (apologies if those aren't the exact names used by Go). You can think of it loosely as piping one program into another in a Unix shell.
In an Interleave filter there should be no deduping. If multiple interleaved filters all produce an output, then those outputs should all appear in the result or be sent to the next filter in a Chain as appropriate. You can think of it loosely as running multiple programs in a shell all producing output at the same time (albeit without stomping on each other's output).
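In code, the two compose roughly like this (untested sketch; the family and column names are placeholders, and as noted above the exact filter names in the Go package may differ slightly):

    import "google.golang.org/cloud/bigtable"

    // chained builds a filter whose sub-filters run in sequence, each one
    // consuming the previous one's output -- so order matters. Here the
    // value-range test runs on the real values and only then are the values
    // stripped; swapping the last two lines would test the stripped values
    // instead and match nothing useful.
    func chained() bigtable.Filter {
        return bigtable.ChainFilters(
            bigtable.FamilyFilter("metrics"),
            bigtable.ValueRangeFilter([]byte("100"), []byte("999")),
            bigtable.StripValueFilter(),
        )
    }

    // interleaved builds a filter where every sub-filter sees the same input
    // and all of their outputs are merged -- closer to OR, but with no
    // deduping of cells that several sub-filters emit.
    func interleaved() bigtable.Filter {
        return bigtable.InterleaveFilters(
            bigtable.ColumnFilter("clicks"),
            bigtable.ColumnFilter("impressions"),
        )
    }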
> Reading single rows vs filters: What would be some recommended cutoff points or things to keep in mind when getting non-contiguous data?
We don't have a really hard and fast rule for this, though generally we'd suggest trying to make the data contiguous if it's expected to be accessed together. If that's not an option, it comes down to a trade-off between code complexity and efficiency, and so the right course depends on your situation.
> Scenario: Assume a server reports 1 row of metrics daily for 5 years. Running a large report, I only want to get 1 reading per week on Mondays - 260 out of 1825 rows. Am I better off doing a RowRange across the whole period with a RowKeyFilter on the key or just running 260 individual ReadRow requests (assuming predictable keys)?
For so few rows, the performance difference will likely be negligible, so I would put the work on the server if there's a RowKeyFilter that will do the job.
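For illustration, the two options might look like this (untested sketch; it assumes a made-up key layout of server#YYYYMMDD, that you precompute the list of Monday dates since day-of-week can't be derived from such a key by regex, and that the row-key regex is matched against the full key):

    import (
        "strings"

        "golang.org/x/net/context"
        "google.golang.org/cloud/bigtable"
    )

    // withFilter does a single scan over the whole period and lets Bigtable
    // drop the non-Monday rows server-side.
    func withFilter(ctx context.Context, tbl *bigtable.Table, mondays []string,
        collect func(bigtable.Row) bool) error {
        pattern := "server123#(" + strings.Join(mondays, "|") + ")"
        return tbl.ReadRows(ctx, bigtable.PrefixRange("server123#"), collect,
            bigtable.RowFilter(bigtable.RowKeyFilter(pattern)))
    }

    // withPointReads issues one ReadRow per Monday instead. For ~260 rows the
    // performance difference should be in the noise either way.
    func withPointReads(ctx context.Context, tbl *bigtable.Table, mondays []string) ([]bigtable.Row, error) {
        var rows []bigtable.Row
        for _, d := range mondays {
            r, err := tbl.ReadRow(ctx, "server123#"+d)
            if err != nil {
                return nil, err
            }
            if len(r) > 0 { // skip days with no data
                rows = append(rows, r)
            }
        }
        return rows, nil
    }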
> What if in the same scenario as above, I only want to get data for the 1st of each month, so 60 data points out of 1825. Does that change the recommendation from the above? What about just 1 data point per year, so 5 rows?
Again, for small data sets do the simple thing for sure.
> What if I had 100,000 servers reporting those daily metrics and I wanted to get weekly / monthly data from 100 of them?
Here grabbing individual rows might be worth it, as this is a selective filter on a large data set. However, it depends a bit on your row key structure, as you could potentially pare down the range of rows the server needs to scan. What does your schema look like? Additionally, how often do you expect to run this query? If it's not often, simple code is more important than high performance.
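As a sketch of what I mean by paring down the range (untested; it assumes a key layout that leads with the server ID, e.g. srv42#20150720, which may not match your schema):

    import (
        "golang.org/x/net/context"
        "google.golang.org/cloud/bigtable"
    )

    // readServers scans only the key ranges belonging to the ~100 servers of
    // interest, so the rows of the other ~99,900 servers are never touched.
    func readServers(ctx context.Context, tbl *bigtable.Table, serverIDs []string,
        collect func(bigtable.Row) bool) error {
        for _, id := range serverIDs {
            if err := tbl.ReadRows(ctx, bigtable.PrefixRange(id+"#"), collect); err != nil {
                return err
            }
        }
        return nil
    }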
> Thanks,
> Derek
> When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.
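A sketch of that loop, fanned out with goroutines (untested; the prefixes, e.g. tenant1#clientA#campaign9#20150706#, assume a key layout you may not actually have):

    import (
        "sync"

        "golang.org/x/net/context"
        "google.golang.org/cloud/bigtable"
    )

    // readSampled issues one PrefixRange read per sampled prefix, in parallel.
    // Each read is an independent RPC, so Bigtable can serve them from
    // different tablets concurrently -- which a single filtered scan can't do.
    func readSampled(ctx context.Context, tbl *bigtable.Table, prefixes []string) ([]bigtable.Row, error) {
        var (
            mu   sync.Mutex
            rows []bigtable.Row
            wg   sync.WaitGroup
        )
        errs := make(chan error, len(prefixes))
        for _, p := range prefixes {
            wg.Add(1)
            go func(prefix string) {
                defer wg.Done()
                err := tbl.ReadRows(ctx, bigtable.PrefixRange(prefix), func(r bigtable.Row) bool {
                    mu.Lock()
                    rows = append(rows, r)
                    mu.Unlock()
                    return true
                })
                if err != nil {
                    errs <- err
                }
            }(p)
        }
        wg.Wait()
        close(errs)
        if err := <-errs; err != nil { // report the first error, if any
            return nil, err
        }
        return rows, nil
    }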
> What's the best way to get a total count to determine my sampling rate?
If you don't want to keep it around, you will unfortunately need to run the full query and count the results. You might be better off with an approach where, whenever you've hit 100k rows, you cut your sampling factor in half for future reads and drop the appropriate rows that have already been buffered.
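If you do end up counting on the fly, you can at least keep that scan cheap by asking the server to strip cell values so only keys and metadata cross the wire (untested sketch, same package and placeholders as the earlier snippets; assumes the package exposes a StripValueFilter):

    // countRows counts the rows in a range without pulling the cell values back.
    func countRows(ctx context.Context, tbl *bigtable.Table, rr bigtable.RowRange) (int, error) {
        n := 0
        err := tbl.ReadRows(ctx, rr, func(r bigtable.Row) bool {
            n++
            return true
        }, bigtable.RowFilter(bigtable.StripValueFilter()))
        return n, err
    }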
> If I report on a user-selected (aka random) set of N clients or campaigns, should I use an InterleaveFilter using N RowKeyFilters to grab all of those rows in parallel?
I would issue these separately as well, since they're known keys that will generally be quite selective.
> Thanks again for your time, I really appreciate it.
Happy to help, and as always let us know if you have any other questions!
> When sampling every Nth day of data, based on your previous answers, it seems like I should just use a RowKeyFilter. Is that still the correct approach?
> Actually, for the use case you describe I'd issue multiple reads. The RowKeyFilter you'd have to use could get complicated, which obviates any code complexity advantage. And you're often going to be making selective queries over large data. I would just loop over each requested tenant, client, campaign, date, and metric, and issue a read over the appropriate PrefixRange. This will also get you parallelism, which a RowKeyFilter would not.
So to make sure I'm understanding you: you would not do a single RowRange and filter out non-matching rows; you would issue a single RowRange per day that I'm sampling, potentially 100+ requests? It's probably about the same code complexity for me either way, so I want to do whatever is most performant on the Bigtable side. On the MySQL side of things, I would typically do that all in one request because that's usually gentler on the database, but what I'm hearing you say is that the individual connections aren't as important here as not iterating through a lot of rows that I don't need.

> What's the best way to get a total count to determine my sampling rate?
> If you don't want to keep it around, you will unfortunately need to run the full query and count the results. You might be better off with an approach where, whenever you've hit 100k rows, you cut your sampling factor in half for future reads and drop the appropriate rows that have already been buffered.
I think that I'll keep track of it somewhere. If the response is on the high end of 100 million rows, I'm not going to be able to buffer all of that in memory.

> If I report on a user-selected (aka random) set of N clients or campaigns, should I use an InterleaveFilter using N RowKeyFilters to grab all of those rows in parallel?
> I would issue these separately as well, since they're known keys that will generally be quite selective.
Sounds like a good strategy. I feel like I have a much better handle on how to responsibly use Bigtable now. :)
> So to make sure I'm understanding you: you would not do a single RowRange and filter out non-matching rows; you would issue a single RowRange per day that I'm sampling, potentially 100+ requests?
Correct. It sounds to me like in the common case the rows you want will be in decently-sized contiguous chunks, with much larger gaps in between. Issuing multiple range queries allows you to avoid going to those tablets at all. Of course, if you want to filter for a specific metric, you'll still want a RowKeyFilter applied within each range.
Just to sanity-check my assumptions: what's a typical sampling rate, and what's a typical number of metrics for a campaign on any given day? From the numbers you've given I'm expecting that they're both rather large. It also occurs to me that this may not be so easy when you want to downsample a rollup of a particular tenant, since you don't know which clients/campaigns it has underneath it. Is that something you need to do regularly?
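A sketch of that combination, one small range per sampled day with a key filter inside each (untested, same imports as the earlier snippets; the campaign#date#metric key layout and the "clicks" metric name are assumptions, and the regex assumes full-key matching):

    // readSampledDays reads one narrow range per sampled day and applies a
    // RowKeyFilter inside each range to keep only the "clicks" metric rows.
    func readSampledDays(ctx context.Context, tbl *bigtable.Table, campaign string,
        days []string, collect func(bigtable.Row) bool) error {
        for _, day := range days {
            err := tbl.ReadRows(ctx, bigtable.PrefixRange(campaign+"#"+day+"#"), collect,
                bigtable.RowFilter(bigtable.RowKeyFilter(".*#clicks")))
            if err != nil {
                return err
            }
        }
        return nil
    }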
The question is, where should the cutoff be? And the meta-question is, how important is it to even get that right, given that you're only expecting a few thousand queries/tenant/day?
As a further clarification, around what size do you expect your rows to be?
Doug, I really appreciate the time you've put into helping us analyze this.
> The question is, where should the cutoff be? And the meta-question is, how important is it to even get that right, given that you're only expecting a few thousand queries/tenant/day?
It's very possible that I could be off here, depending on how often they change views on our front end. It could easily be 10x, but it's unlikely to exceed 100x my estimate. At some point we'll have to just pick a method and then see what the performance data has to say. If I had a gun to my head and had to pick a sampling threshold based on our conversations so far, I would probably issue separate queries for sampling rates under 25% and row filters for anything above that. (That's also based on me storing row counts somewhere else, so I can pick the sampling rate before retrieving any data.)
> As a further clarification, around what size do you expect your rows to be?
A full row for us is only about 1 KB. Under the conditions I've described, I'm under the impression that I should have a lot of room to grow before maxing out my starter 3-node cluster. Is that right? It seems like the quota we'd be running into first is the 200k reads/sec. Is that referring to rows, not rows * columns?
Does Cloud Bigtable stream query results back / paginate under the hood? I'm hoping to stream the data to my frontend in 1,000-row chunks and complete in a few seconds.