Indexing a monotonically increasing value: How much does it limit write throughput?


Michael Hermus

May 7, 2012, 9:01:14 AM
to Google App Engine
All,

I have read in several places that using a monotonically increasing
value for a key or in an index is a bad thing, primarily because of
the way Big Table manages the process of splitting overloaded tablets.
In a few comments, I have read that it really isn't an issue unless
you hit around 300 QPS.

There are many cases where I would like to use timestamp in an index
(to order or limit query results based on creation time, for example).
So, my questions are:

1) Is there a definitive issue with throughput when using a timestamp
in an index?
2) If yes, what is the approximate QPS threshold where it becomes a
problem?
3) What can we expect after exceeding this threshold?

Thanks in advance!

Mike Wesner

May 7, 2012, 9:39:38 AM
to google-a...@googlegroups.com
I'd also like to know if this is any different on high replication than it was on master/slave.  (I know it's a lower-level Bigtable side effect, but does HRD/Paxos deal with it differently?)


Are you planning on writing these entities at hundreds of qps?   Most of the conversations I have had about this revolve around the entity key rather than the indexes, but I suppose you could create a similar problem there.

Have you done any tests?  This might be one thing that is better to actually test even if you get a solid answer here.

Mike

Michael Hermus

May 7, 2012, 10:02:55 AM
to google-a...@googlegroups.com
Mike: You are absolutely right, testing is essential, and I have not yet performed any. However, I would love some insight (without the effort associated with setting up adequate test scenarios) before making certain design choices.

I am not 100% sure if indexed values have the same limitation, or if it exists only with keys. My thought is that index rows would have the same issues with hot tablets at some point. Also, I think Brett Slatkin's pipeline presentation referenced this issue with regard to an indexed value (as opposed to a key).

Jeff Schnitzer

May 7, 2012, 2:57:24 PM
to google-a...@googlegroups.com
There was a discussion of this in an I/O session last year. I don't
remember if it was the "more 9s please" or the "life in production"
session. In the Q&A I asked if the HRD affects this limit, and IIRC
the answer was that the limit would effectively be multiplied by the
number of data centers that receive the writes. So if you are writing
across 3 data centers, the limit would be approximately 3X the
master/slave limit.

Google doesn't disclose the actual number of data centers in operation
(other than "more than 2") so it's hard to guess what this number
really is.

Jeff
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/9VrhgQ8J23QJ.
>
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

Michael Hermus

May 7, 2012, 10:06:36 PM
to google-a...@googlegroups.com
Jeff: I will poke around there, thanks. I love Objectify by the way - great work.

I think one workaround might be to create a 'Sharded Index'. Let's say we want to index a timestamp; we could do the following:


public void setTimestamp(Long timestamp) {
    // Well-distributed value in a defined range (a single digit here),
    // either randomly generated or derived from an existing variable.
    int shardNum = new java.util.Random().nextInt(10);

    this.shardedTimestamp = String.valueOf(shardNum) + String.valueOf(timestamp);
}

public Long getTimestamp() {
    // Strip the one-digit shard prefix added above.
    return Long.parseLong(shardedTimestamp.substring(1));
}



When it comes time to query, we create a utility wrapper that creates sharded queries. Something like this:

public class ShardedQuery<T> {

    int numberOfShards;
    public List<Query<T>> queries;

    public ShardedQuery(Query<T> q, int numberOfShards) {
        this.numberOfShards = numberOfShards;
        this.queries = new ArrayList<Query<T>>(numberOfShards);
        for (int i = 0; i < numberOfShards; i++) {
            this.queries.add(q.clone());
        }
    }

    public void addShardedFilter(String condition, String value) {
        for (int i = 0; i < numberOfShards; i++) {
            String shardedVal = String.valueOf(i) + value;
            this.queries.get(i).filter(condition, shardedVal);
        }
    }

    public List<T> execute() {
        List<QueryResultIterable<T>> iterables =
                new ArrayList<QueryResultIterable<T>>(numberOfShards);
        for (int i = 0; i < numberOfShards; i++) {
            iterables.add(this.queries.get(i).fetch());
        }

        List<T> out = new LinkedList<T>();
        for (QueryResultIterable<T> result : iterables) {
            for (T item : result) {
                out.add(item);
            }
        }
        return out;
    }
}

Actually, you could probably add something to Objectify to handle all this for you... perhaps create a '@ShardedIndex' annotation that would automatically prepend a shardNum and split queries for you.


stevep

May 7, 2012, 10:21:07 PM
to Google App Engine
I use a DateProperty field with auto_now_add=True and indexed=True all
the time to enable querying a section of records. AFAIK this is likely
the best way to do things like "process all of yesterday's invoices
that were booked between 10:00-11:00". Unfortunately it often requires
a custom index, because you are going to add an order clause to the
query. That adds to your write and index overhead costs, and gets you
involved with eventual-consistency issues. This is all very different
from the complex issues related to how to create a true, monotonically
increasing value. (The date index should work for most use cases even
if some records get the same create date.) If you absolutely must have
a unique field value, then you are into the more complex monotonic
issue, for which I have yet to read a really good solution at very
high QPS. (I only dream of such issues; my practical reality more
often involves QPM :-)

There are some circumstances I have encountered when using an indexed
date field where it would be very helpful to know how the eventual
consistency of the index gets worked out. If anyone has insights into
this, please consider my SO question. thx -stevep

SO Link:
http://stackoverflow.com/questions/9357068/do-composite-index-updates-maintain-same-order-as-original-updates
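The "yesterday between 10:00-11:00" query stevep describes boils down to computing two datetime bounds and filtering on them. A minimal plain-Python sketch of the window arithmetic (the `Invoice` kind and `booked` property in the comment are hypothetical; the actual GAE filter syntax is only sketched):

```python
from datetime import datetime, timedelta, time

def window_for_yesterday(now, start_hour=10, end_hour=11):
    """Compute the [start, end) bounds for 'yesterday between 10:00 and 11:00'."""
    yesterday = (now - timedelta(days=1)).date()
    return (datetime.combine(yesterday, time(start_hour)),
            datetime.combine(yesterday, time(end_hour)))

start, end = window_for_yesterday(datetime(2012, 5, 8, 9, 0))
# The datastore query itself would then look something like (sketch only):
#   Invoice.all().filter('booked >=', start).filter('booked <', end).order('booked')
```

Note this needs a DateTimeProperty (time component included) rather than a date-only property to support hour-level bounds.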

Jeff Schnitzer

May 8, 2012, 3:19:08 AM
to google-a...@googlegroups.com
How common is it to need this kind of logic? What kind of behavior
did you start to see and what was the qps load when it happened? In
all the talk of this issue, nobody ever described the failure mode...
datastore timeouts?

I hesitate to bake exotic, data-mangling behavior like this into
Objectify because once it gets added, it's pretty much impossible to
remove. Even if Google "fixes" the underlying issue.

Jeff
Amy Unruh

May 8, 2012, 4:05:43 AM
to google-a...@googlegroups.com
On 7 May 2012 23:39, Mike Wesner <mike....@webfilings.com> wrote:
I'd also like to know if this is any different on high replication than it was on master/slave.  (I know it's a lower-level Bigtable side effect, but does HRD/Paxos deal with it differently?)

There can be such an issue with throughput when indexing a monotonically increasing value like a timestamp.
Interestingly, the use of HRD may reduce this somewhat, as the access to multiple data centers can help reduce this issue for a given data center.



Michael Hermus

May 8, 2012, 7:09:19 AM
to google-a...@googlegroups.com
Hahah... 'data-mangling' behavior is a good description. I guess you are right; it isn't actually appropriate for Objectify since this is probably rarely an issue in practice for most apps, but it is a convenient place to put such logic if one were inclined.

Plus, I don't think my hastily sketched concept would even work properly because of sort ordering. You could get range queries to work, but open-ended inequality filters would fail.
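The sort-ordering problem is easy to see with plain string comparison (a standalone Python sketch, not GAE code; it assumes timestamps are zero-padded to a fixed width, as they would need to be for string ordering to work at all):

```python
# Sharded values sort by shard digit first, then by timestamp, so a single
# open-ended inequality filter would interleave shards in the wrong order.
def sharded_value(shard, timestamp):
    return str(shard) + str(timestamp)

newer_in_shard0 = sharded_value(0, 2000)  # "02000"
older_in_shard1 = sharded_value(1, 1000)  # "11000"

# Lexicographically, shard 1's OLDER value sorts after shard 0's NEWER one.
assert newer_in_shard0 < older_in_shard1
```

This is why the ShardedQuery fan-out issues one bounded filter per shard and merges the results client-side, rather than running one inequality filter across all shards.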

Michael Hermus

May 8, 2012, 11:05:43 AM
to google-a...@googlegroups.com
I searched through Google I/O 2011 sessions, and I found this one: http://www.google.com/events/io/2011/sessions/scaling-app-engine-applications.html
A good article on the topic (many people are probably already familiar with this post): http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/

Both confirm that this is definitely an issue, and provide some ideas on how to avoid the situation. One thing that I have seen Nick Johnson mention in various posts is that a batch put may alleviate the hot-tablet issue to some extent; I wonder if I am misinterpreting him, or if it really is that simple a solution. In other words, just throw work onto a pull queue, and take it off in batches of 100 (or whatever). If most of the indexed rows for that batch of entities go to the same tablet, and are processed as a batch, it should be significantly more efficient.
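The batching idea can be sketched in plain Python (the chunking helper below is hypothetical; on App Engine the items would come from a pull-queue lease, and each chunk would go to the datastore in a single batch put):

```python
def batches(items, size=100):
    """Yield successive fixed-size chunks, so each datastore write touches
    the hot tablet once per batch instead of once per entity."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Sketch of the worker loop: lease work, write it in batches (API calls
# to the task queue and datastore are omitted here).
work = list(range(250))
batch_sizes = [len(b) for b in batches(work)]
assert batch_sizes == [100, 100, 50]
```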

Of course, as Mike pointed out in an earlier post, unless someone from the GAE team can provide a definitive answer, it all boils down to trying something and testing it out under load.

stevep

May 8, 2012, 8:00:44 PM
to Google App Engine
Believe it or not, Ikai could do a version of the "Dummies" books for
GAE with his helpful doodles and have a good alternative to life at
the Googleplex.

The "high transaction rate, monotonically increasing" field that
supports queries without having to dedicate an index is something that
if Google could accomplish would be even cooler than Task Queues --
and I love me some TQs.

Sure, it's hard, but if there is ONE issue that has been consistently
discussed without a good solution over my years viewing GAE discussion
threads, this IS IT. It would be awesome if G. could devise a Big
Table solution.

As ever, fingers crossed. -stevep
