Worst-case scenario for eventual consistency in the HRD?


Jeff Schnitzer

Sep 19, 2011, 8:16:36 PM
to Google App Engine
I know that an index update in the HRD will typically be visible
within a couple seconds. That's the average case. What is the
worst-case?

Assuming something in the datacenter goes wacky, how long might it
take for an index to update? Tens of seconds, minutes, hours, days?

Thanks,
Jeff

Ikai Lan (Google)

Sep 19, 2011, 9:55:09 PM
to google-a...@googlegroups.com
I'll check for you, but FWIW here are the last numbers I heard for master/slave replication delay:

- most of the time data is replicated within hundreds of milliseconds
- when there is something wrong, the mean time is 3 minutes, with an upper bound of roughly 10 minutes

If a data center goes offline, that's the window of writes you may lose on master/slave. On HRD you don't lose data.

I'll double check with the datastore team to see if we have numbers, but it might not work the same way. When you do a write, a majority of datastore instances have to acknowledge receiving the write and having appended it to the write journal. Thus, if the primary datastore goes offline, application servers make RPCs to the other datastores and use the first response that comes back. The datastores that are running behind still try to catch up in the background and continue to apply writes from the journal. I suppose the number you're looking for here is: what is replication delay if a datastore isn't forced to catch up?

--
Ikai Lan 
Developer Programs Engineer, Google App Engine






Robert Kluin

Sep 20, 2011, 10:40:00 AM
to google-a...@googlegroups.com, Alfred Fuller
I think Jeff was actually asking about "index lag" -- how long before
the indexes will be updated, not how long for the data to replicate.
I'd like to know this info too.


Robert

Mike Wesner

Sep 20, 2011, 10:40:32 AM
to Google App Engine
I don't think Ikai read your post...

Robert and I wanted to write a little HRD status site to track this
and get real data, but we haven't done so yet. I have never seen the
replication take more than about 1s. I think 1s will cover about four
9's, but that is just an educated guess. Until we (the users)
actually measure this over time I don't think we can know for sure.

-Mike

Mike Wesner

Sep 20, 2011, 10:42:22 AM
to Google App Engine
And then I went and used the word replication... i meant index lag.

Ikai Lan (Google)

Sep 20, 2011, 1:10:23 PM
to google-a...@googlegroups.com
Well, indexes are just Bigtable rows, so replication lag does apply to them as well.

--
Ikai Lan 
Developer Programs Engineer, Google App Engine



Jeff Schnitzer

Sep 20, 2011, 3:37:58 PM
to google-a...@googlegroups.com
I'm doing a lot of work lately with data that requires a large degree
of transactional consistency. One pattern I've found that makes some
of the pain of HRD eventuality go away is to add an extra entity that
uses your query field as a natural key. This really requires global
transactions to work (as announced, it's in trusted testing, wheee!)
but here's an example:

Say you associate a facebook id with an account. In M/S, you'd
probably have something like this:

class User {
  @Id Long id;
  long fbId;
  ...
}

...and then when a request arrives with a facebook id, you would query
for the user record. No user record? Create one. With eventual
consistency, this creates a larger window (with M/S it was small)
where you can get duplicate Users for the same fbId.

The solution to transactional integrity and strong consistency is to
add a FbId entity:

class FbId {
  @Id String fbId;
  long userId;
}

I've got several of these mapping entities in place now. Using
global transactions to create the FbId and the User at the same time,
it seems to solve consistency issues entirely. I don't know how it
will perform yet under load, but obviously there's not heavy
contention in this situation so I would be surprised if the 2pc hurt
much.
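The get-or-create pattern can be sketched without any datastore at all. Here is a minimal in-memory analogy (the class and method names are hypothetical, and a ConcurrentHashMap's atomic insert-if-absent stands in for the global transaction that creates the FbId and User together):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory analogy of the FbId mapping-entity pattern: the facebook id
// acts as a natural key, and an atomic "insert if absent" plays the role
// of the transaction that creates FbId and User at the same time.
class UserDirectory {
    private final Map<Long, Long> fbIdToUserId = new ConcurrentHashMap<>();
    private final AtomicLong nextUserId = new AtomicLong(1);

    // Returns the existing user id for this facebook id, or atomically
    // creates a new one. Two concurrent calls with the same fbId can
    // never produce two different users -- which is exactly the duplicate
    // window the eventually consistent query leaves open.
    long getOrCreateUser(long fbId) {
        return fbIdToUserId.computeIfAbsent(fbId,
                ignored -> nextUserId.getAndIncrement());
    }
}
```

In the real thing the lookup is a get-by-key on FbId inside the transaction, which is strongly consistent, instead of a global query on User.fbId, which is not.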

I'm starting to notice several of these FbId-type mapping objects
showing up in my code as a way to force queries (for unique items)
into strong consistency. I'm guessing you could do this for
multi-item queries using a list property instead:

Instead of query(Thing.class).filter("color", someColor), you could
instead keep updating an entity like this:

class ColorThings {
  @Id String color;
  List<Key<Thing>> things;
}

...which feels upside-down but really has a lot of advantages. If you
put ColorThings in memcache, it's like a query cache which actually
updates properly.
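The "upside-down query" can be sketched the same way. This is an illustrative in-memory stand-in (ColorIndex and its methods are made-up names), where the maintained key list replaces the filter query:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ColorThings idea: instead of asking an index for
// "all Things with this color", we maintain the key list ourselves
// and read it back with a single get.
class ColorIndex {
    private final Map<String, List<String>> thingsByColor = new HashMap<>();

    // In the real pattern this runs in the same transaction that saves
    // the Thing, so the "query result" updates atomically with the entity.
    void addThing(String color, String thingKey) {
        thingsByColor.computeIfAbsent(color, c -> new ArrayList<>()).add(thingKey);
    }

    List<String> thingsWithColor(String color) {
        return thingsByColor.getOrDefault(color, Collections.emptyList());
    }
}
```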

Is anyone else noticing their code being pushed into this pattern by the HRD?

Jeff

Alfred Fuller

Sep 20, 2011, 4:03:17 PM
to google-a...@googlegroups.com
An interesting notion. Although you could also just use ColorThings(key_name=color) as the parent entity for all the Things. That way the list of things would be queryable directly (using an ancestor query), and there would be no limit on the number and size of Things. They would also exist next to each other in the underlying Bigtable, so there is only one 'seek' to find them (which is the largest cost when looking things up, if you don't count serialization).

Alfred Fuller

Sep 20, 2011, 4:28:09 PM
to google-a...@googlegroups.com
Ikai is correct to think about replication in this case. In a single replica you could have one of three states:

Applied - fully visible
Committed - has the log entry, but has yet to apply it
Missing - the log entry has yet to be replicated

Only in the first case is it visible to a global query. When you write something, the log entry is committed to at least a majority of replicas. The datastore returns success, then immediately tries to apply the write everywhere the log entry was committed. It usually takes a couple hundred ms to apply, which is why the majority of writes take O(100 ms) to become visible. For a very small % of writes, the write either cannot commit to the local replica or cannot be applied after the commit. In those cases the datastore will still return success, but the write won't be visible until a background process picks it up and applies it, which can take O(minutes).

If there is something wrong in the replica you are querying (for example, replication is backed up, the Bigtable is unavailable, or the background processes in that replica are having issues), then it could take a good deal longer (this becomes very, very unlikely very quickly, but not impossible). There really is no hard upper bound, because distributed systems will have pieces that fail (and are designed to still function when they do).
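The three states can be captured in a toy model (everything here is illustrative, not how the datastore is actually coded): a write succeeds once a majority of replicas hold the log entry, but a replica only serves the write to queries once it has applied it.

```java
import java.util.List;

// Toy model of the three replica states described above. A write succeeds
// when a majority of replicas hold the log entry (COMMITTED or APPLIED);
// a global query against one replica sees only APPLIED entries.
class ReplicaModel {
    enum State { MISSING, COMMITTED, APPLIED }

    // Success requires a strict majority of replicas to have the log entry.
    static boolean writeSucceeds(List<State> replicas) {
        long holdingLog = replicas.stream()
                .filter(s -> s != State.MISSING).count();
        return holdingLog * 2 > replicas.size();
    }

    // Visibility in the replica you happen to query.
    static boolean visibleToGlobalQuery(State localReplica) {
        return localReplica == State.APPLIED;
    }
}
```

The gap between these two predicates is the index lag: a write can have succeeded globally while the replica answering your query has only committed, or is entirely missing, the entry.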

 - Alfred


Jeff Schnitzer

Sep 20, 2011, 4:58:32 PM
to google-a...@googlegroups.com
The problem with using a key-parent is that it limits you to a single
index -- say I want to index Things by color and texture.

The downside of this multiple-thing index entity is that (like a
parent-key) it limits throughput. And since there's a 2pc involved,
it probably limits throughput quite a lot...

Jeff

Jeff Schnitzer

Sep 20, 2011, 5:04:58 PM
to google-a...@googlegroups.com
Thanks... while I didn't follow it exactly, I get the gist of what's
going on. Sounds like I should expect five- or six-sigma
probabilities of minute+ eventuality in global query indexes.

Jeff

Robert Kluin

Sep 20, 2011, 11:43:46 PM
to google-a...@googlegroups.com
I've been using the same pattern as Jeff mentions for quite some time
-- even while I was on M/S. I use it because it reduces my problems
to "fetch by key" scenarios, and I can build multiple specialized
"indexes" in this way. Part of the reason I started doing this was
due to "exploding indexes" type issues; this lets me "control the
explosion," and possibly even defer the writes in some cases.

It also allows you to avoid contention issues when the Things are
"frequently" updated, but the indexed values may not be.


Robert

Robert Kluin

Sep 20, 2011, 11:53:45 PM
to google-a...@googlegroups.com
I get that indexes are "just Bigtable rows too," and that the normal
replication rules we all know and love apply, so I guess this boils
down to indexes being written separately from the entity. Does the
index write apply to the same nodes, or possibly to different nodes?

Alfred, your next project idea: write some type of low-level
high-performance batcher providing crazy high write-rates to a single
entity group. Perhaps with that you (or we) could come up with a
higher performance way to maintain global indexes. ;)


Robert

Oh, for those wondering what this thread is about... we're just making
up words / phrases.

Ronoaldo José de Lana Pereira

Sep 21, 2011, 8:17:53 AM
to google-a...@googlegroups.com
I'm planning the migration of our app to HRD. It is a "collective buying" site, and I found lots of places where I need to change my models/queries. One case where we need consistency is this scenario:

class Product {
  @Id Long productId;
}
class Order {
  @Id Long orderId;
  List<Long> productId;
}
class Voucher {
  @Id Long voucherId;
  Long orderId;
  Long productId;
}

Vouchers must be created before orders, so they are currently root entities. When an order is approved, a specialized queue with max_concurrent_requests = 1 picks the next available voucher (which has orderId = null) and associates it with the order. To check if the order is filled with all its vouchers, I "count" how many Vouchers are linked with that orderId, and if there are Vouchers missing, I schedule another queue task to consume a Voucher again.

On HRD, this doesn't work, because the query to get the next Voucher and the query to check how many Vouchers I have for an Order are likely to be inconsistent. What I'm planning to do is group Vouchers from the same "product" (~30k vouchers per product) and then perform an ancestor query, as suggested by the docs. In this case, I'll end up with:

class Voucher {
  @Parent Key<Product> productId;
  @Id Long voucherId;
  Long orderId;
}

... and I will be able to query for how many Vouchers are linked to an order (one query for each item in Order.productId). Is this a good pattern for this particular scenario? Writes/second is not a problem for us: if the order stays open for a few minutes until the Vouchers are all filled, it is OK.

Another issue: I have to keep some financial accounting records, and currently I have this entity:

class AccountingRegistry {
  @Parent Key<AccountingRegistry> parentRegistry;
  @Id Long id;
  Date date;
  Long amount;
  List<String> filters;
}

To represent an accounting transaction, I'm grouping in the same entity group all registries that are related and that sum up to 0. To avoid performing the same transaction twice (i.e., registering the same order approval twice), I'm using the "filters" list property to query for another registry that has the same filters (i.e., the order id, the "APROVED" keyword, the domain, etc.). They are also useful for some specialized reports, like all sales that came from a given domain (the domain is one value in the list property). On M/S, as Jeff said, the time window is small and the chance of a problem is small, but on HRD the window may be several minutes, and in that case I may have a very inconsistent sales report at the end of the day.

Do you guys think I can use the same pattern Jeff suggested to solve this problem? Any advice?

Thanks in advance

- Ronoaldo