I don't get it... Datastore and chronological order


Daniel Jozsef

Jan 21, 2018, 1:32:35 PM
to Google App Engine
Hello dear people,

There's this thing that has been bothering me for a while. I need to work on an application that we expect to scale, and I have trouble reconciling loudly stated best practices and baseline requirements.

Almost all "web2" media relies on a chronological order. When I browse Facebook, or Google+, or YouTube, it's not posts and videos from 10 years ago that I want to (or do) see. Even though Facebook and Google never seem to present a "fully chronologically ordered" list, the worst that can happen is that I see a post from two hours ago after one from two days ago. Never after one from 2008.

However, it would seem that distributed NoSQL databases *hate* timestamps. And not only timestamps, but chronological order *in general*. (see https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/)

So... I've looked into the alternatives... And the *only* solution I found was that I could prefix timestamps with random "bucket ids", their number potentially scaled based on the "write heat" of each entity, and run a separate query for each bucket... but that makes managing pagination beyond ridiculous, and I worry that it would make queries - like someone just randomly navigating to the front page, or hitting reload - expensive, and my gut feeling is that making the most frequent query type more expensive is a bad idea.
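To make the bucket idea concrete, here is a minimal sketch of what "one query per bucket, merged into a single page" could look like. Everything in it is an assumption made for illustration: `NUM_BUCKETS`, `make_key`, `FakeIndex`, and `latest_page` are hypothetical names, the index is an in-memory stand-in rather than the real Datastore API, and timestamps are assumed to be unique so that a timestamp can double as the pagination cursor.

```python
import heapq
import random
from itertools import islice

NUM_BUCKETS = 4  # hypothetical; would be tuned to the write rate


def make_key(timestamp, bucket=None):
    """Prefix the timestamp with a bucket id (random unless given)."""
    if bucket is None:
        bucket = random.randrange(NUM_BUCKETS)
    return (bucket, timestamp)


class FakeIndex:
    """In-memory stand-in for an index ordered by (bucket, timestamp)."""

    def __init__(self):
        self.rows = []  # list of ((bucket, timestamp), post)

    def put(self, key, post):
        self.rows.append((key, post))

    def query_bucket(self, bucket, before=None):
        """Return an iterator of (timestamp, post) for one bucket, newest first."""
        rows = [
            (key[1], post)
            for key, post in self.rows
            if key[0] == bucket and (before is None or key[1] < before)
        ]
        rows.sort(key=lambda row: row[0], reverse=True)
        return iter(rows)


def latest_page(index, page_size, before=None):
    """Run one query per bucket and merge the sorted streams lazily."""
    streams = [index.query_bucket(b, before) for b in range(NUM_BUCKETS)]
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    page = list(islice(merged, page_size))
    # The oldest timestamp on this page is the cursor for the next page.
    next_cursor = page[-1][0] if page else None
    return page, next_cursor
```

Passing the returned cursor back in as `before` fetches the next page. The cost concern still applies: every page load issues `NUM_BUCKETS` queries instead of one.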

The problem is some way in the future now, but I'm really interested in how the big players do it. I mean, just thinking of all the writes Facebook must handle...

Yannick (Cloud Platform Support)

Jan 21, 2018, 6:52:46 PM
to Google App Engine
Hello Daniel,

First, let me point out the final tip from the article you linked: "don’t prematurely optimize for this case, since chances are, you won’t run into it." Since the Datastore lets you change your schema at any time, and since the optimizations you will need depend on future usage that you may not be able to predict, it is probably best not to spend too much time on optimizing these queries yet.

That being said, the final solution is going to depend heavily on the kinds of queries you want to be able to run against your Datastore. As pointed out in the best practices article, the prefix added to a timestamp can be tied to a specific query you need to make (the given example being a userid), or it can be random, as you pointed out, but this doesn't force you to query the resulting "buckets" separately. The prefix just needs to vary enough to properly shard the index across many Bigtable tablets and allow for faster reads and writes of that index as a whole.
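As a rough illustration of a query-related prefix, the sketch below builds an indexed string value from a user id followed by a zero-padded timestamp. The names `index_value` and `user_range`, the `|` separator, and the 13-digit millisecond padding are all assumptions made for this example; they are not taken from the article or from the Datastore API.

```python
TS_DIGITS = 13  # assumed width: millisecond timestamps, zero-padded


def index_value(user_id, timestamp_ms):
    """Build a sortable value: the user id prefix, then the padded timestamp."""
    return "%s|%0*d" % (user_id, TS_DIGITS, timestamp_ms)


def user_range(user_id):
    """Return the inclusive (low, high) bounds covering one user's entries."""
    return index_value(user_id, 0), index_value(user_id, 10 ** TS_DIGITS - 1)
```

Writes from different users then land in different parts of the index, while a range scan between the two bounds from `user_range` returns a single user's entries in chronological order. The zero padding matters: without it, string ordering and numeric ordering of the timestamps would disagree.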

Whether or not it is worthwhile to perform the sorting in-memory rather than have the Datastore index do it is something you will need to decide based on your experience with the performance of each of your queries.

Regarding expensive queries being made often, such as for the content that appears on your main page, you can and certainly should be storing the result of those queries using Memcache so that popular listings do not need to be constantly re-computed.
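A minimal sketch of that caching pattern follows. To keep it self-contained and runnable, it uses a small in-process `TTLCache` class as a stand-in for Memcache; that class, the `cached_front_page` helper, the `"front_page"` key, and the 30-second expiry are all hypothetical choices for illustration, not the App Engine API.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def cached_front_page(cache, run_query, ttl_seconds=30):
    """Serve the listing from the cache; run the query only on a miss."""
    listing = cache.get("front_page")
    if listing is None:
        listing = run_query()
        cache.set("front_page", listing, ttl_seconds)
    return listing
```

With this shape, the expensive query runs at most once per expiry window no matter how many visitors reload the front page, at the price of the listing being up to `ttl_seconds` stale.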

I hope this helped. Please let me know if some aspect(s) of my answer need to be further detailed.