Migrate from Cloud SQL to DataStore


Azeem Haider

Nov 1, 2017, 12:12:59 AM
to Google Cloud SQL discuss

I'm building an app that needs post feeds. Right now I'm using App Engine Standard and a Cloud SQL db-n1-standard-1 instance (MySQL-based), but I've noticed it is very expensive. The app is in production, and in the first month the cost was much higher than I expected. The most costly part of Cloud SQL is the instance-hours rate, so I've decided to migrate to Datastore.


I have three tables: 1) users, 2) posts, 3) comments. Each table has roughly 10 billion rows, and the posts and comments tables are growing very fast; hundreds of millions of new rows are expected.


Users Table


user_id         name        email           phone
--------------------------------------------------
u123            Abc         abc@m.com       123456
u124            Cde         cde@m.com       789065
u786            Qwe         qwe@m.com       754599
 .               .              .               .
 .               .              .               .
 .               .              .               .


Posts Table


post_id         user_id         type        src                 date
------------------------------------------------------------------------
p098            u123            img         path/to/file        13-3-17
p456            u123            vid         path/to/file        14-3-17
p239            u124            img         path/to/file        15-3-17
 .               .               .              .                   .
 .               .               .              .                   .
 .               .               .              .                   .


Comments Table


cmnt_id         post_id         user_id         comment
--------------------------------------------------------
m392            p098            u123            Some Text
m234            p098            u786            Some Text
m324            p456            u123            Some Text
 .               .               .                .
 .               .               .                .
 .               .               .                .


As you can see, a user can have one or more posts, and a single post can have one or more comments.

I need to get post data along with the post's comments. I know I need a JOIN for this, and I can easily use one in Cloud SQL because it's based on MySQL.

These three tables are related. As I understand it, Cloud SQL is a relational database but Datastore is not.


My idea for converting these tables to Datastore:


1 - Create an entity kind Users and save all user records there

2 - Create an entity kind Posts and save all post records there

3 - Create an entity kind Comments and save all comment records there


I always know the post ID for which I need to get data.


For example, I need post data for ID p098. I'm using Java, so I decided to create two threads: the first thread gets the post data from the Posts kind, and the second gets the comment data from the Comments kind. Then I merge the two data sets in Java to produce the same result as a JOIN in MySQL, because I understand Datastore does not allow JOINs.
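The fetch-in-parallel-and-merge idea described above can be sketched with plain JDK classes. This is a minimal simulation, not real Datastore code: fetchPost and fetchComments are hypothetical stand-ins returning fixed in-memory data where real code would query the Posts and Comments kinds.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class PostJoin {
    // Hypothetical stand-ins for the two Datastore reads; real code would
    // query the Posts and Comments kinds instead of returning fixed data.
    static Map<String, String> fetchPost(String postId) {
        return Map.of("post_id", postId, "user_id", "u123", "type", "img");
    }

    static List<Map<String, String>> fetchComments(String postId) {
        return List.of(
            Map.of("cmnt_id", "m392", "post_id", postId, "comment", "Some Text"),
            Map.of("cmnt_id", "m234", "post_id", postId, "comment", "Some Text"));
    }

    public static void main(String[] args) {
        String postId = "p098";
        // Run both fetches concurrently, then merge, mimicking a SQL JOIN.
        CompletableFuture<Map<String, String>> post =
            CompletableFuture.supplyAsync(() -> fetchPost(postId));
        CompletableFuture<List<Map<String, String>>> comments =
            CompletableFuture.supplyAsync(() -> fetchComments(postId));

        Map<String, String> p = post.join();
        List<Map<String, String>> cs = comments.join();
        System.out.println(p.get("post_id") + " has " + cs.size() + " comments");
    }
}
```

CompletableFuture avoids managing raw threads by hand; the merge itself is just pairing the post record with its comment list.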


Here I need some suggestions.

  1. Is it a good way to do that?
  2. Which provides the best performance in this situation?
  3. If billions of users (a very large number of users) are accessing post and comment data at the same time, which one handles it better?
  4. What about the pricing effect if I migrate from Cloud SQL to Datastore in this situation?
  5. If your app needs a continuously available database, then the SQL instance runs continuously, which makes it expensive, doesn't it? For example, in my app every user who opens the app needs to see posts; with a large number of users I think the SQL instance never shuts down, which means it runs 730 hours a month.
  6. Any recommendation?


Jordan (Cloud Platform Support)

Nov 1, 2017, 3:08:24 PM
to Google Cloud SQL discuss
You are correct, if you require scalability and speed at lower costs (as seen in the pricing calculator) it is recommended to use the Datastore. The Datastore scales to handle any load, meaning any number of your users can be interacting with the Datastore at any given time and it will continue to be fast. Because of the optimized structure of the Datastore you are forced to use APIs instead of running SQL queries. 

- Every Datastore 'Kind' (aka table) contains 'Entities' (aka rows). You would have three 'Kinds' in the Datastore: 'User', 'Post', and 'Comment'. You can think of these like classes, where an 'Entity' is an object of that class (e.g. a 'User' entity).

- When you save an entity to the Datastore you give it a key. The key specifies the Kind to save it to, the ancestors it has (aka the keys of its parent 'Post' and 'User' entities), and the ID of the actual entity. The actual entity ID can be auto-generated by the Datastore or specified by you. For example:

KeyFactory keyFactory = datastore.newKeyFactory()
    .addAncestors(PathElement.of("User", "u123"), PathElement.of("Post", "p098"))
    .setKind("Comment");
Key commentKey = keyFactory.newKey("m392");

Entity comment = Entity.newBuilder(commentKey)
    .set("comment", "Some Text")
    .build();

datastore.put(comment);


- Your comment in this example would now only have one text property called 'comment' and one Datastore key which contains the keys of its ancestors and its own unique ID. You can then directly ask Datastore for all the child Comment entities of a given Post entity just by using the Post's key. For example:

Query<Entity> comments = Query.newEntityQueryBuilder()
    .setKind("Comment")
    .setFilter(PropertyFilter.hasAncestor(
        datastore.newKeyFactory().addAncestors(PathElement.of("User", "u123"))
            .setKind("Post").newKey("p098")))
    .build();

QueryResults<Entity> results = datastore.run(comments);


You can think of the actual keys in Datastore as the JOINs that provide relations between kinds. Of course you will need to create a script that migrates your current data into the Datastore, but once it is migrated your queries will be very fast. Behind the scenes Datastore builds indexes (aka tables) that already contain the answers to your queries. If you require more complex queries with sort orders, you must build your own index configuration file.

Azeem Haider

Nov 2, 2017, 12:51:53 PM
to Google Cloud SQL discuss
Thanks for this information, but I'm a little bit confused about ancestors. What I've read in the documentation and elsewhere is that ancestors are used for transactions, all ancestors are in the same entity group, and an entity group is stored on the same Datastore node, which makes it less scalable. Is that right?

Is creating User as the parent of posts, and posts as parents of comments, a good idea?

Rather than this, we could add one extra property to the Posts entity, like the user_id shown in the example, and filter on that property.

Which is better and more scalable: filtering posts by ancestor, or adding an extra user_id property to the Posts entity and filtering by property?

I know both approaches give the same results, but I want to know which one is better for performance and scale.

Sorry, I'm new to Datastore.

Jordan (Cloud Platform Support)

Nov 2, 2017, 2:01:10 PM
to Google Cloud SQL discuss
You can definitely continue to keep your separate ids, but querying for the results will be slower as Datastore would then need to perform multiple equality filters in order to find the correct 'user_id', 'post_id', and 'cmnt_id' for an entity (and equality filters on Strings are never fast). 

To clarify, you may only perform transactions on entity groups. This doesn't mean that ancestors are only used for transactions. Ancestors provide quicker querying, strong consistency between updating an entity and retrieving the new results, as well as the ability to perform transactions on them (all-or-nothing updates). They do so by placing the ancestor tree within the same entity group, as you have correctly mentioned. This is actually much more optimized for querying, as it minimizes the I/O required to find your results, as mentioned in our Datastore Best Practices documentation.

There are many advantages to structuring your data using only a single key with ancestors (e.g. cleaner code, simpler logic, faster queries, etc.) as opposed to having multiple ID properties, but at the end of the day the choice is completely up to you.

Note that Google Groups is meant for general product discussions only. If you require technical support for the Datastore it is recommended to post your full questions to Stack Overflow using the supported Cloud tags. 

Azeem Haider

Nov 4, 2017, 12:08:30 PM
to Google Cloud SQL discuss
But there is a problem with ancestor relations, as I asked on Stack Overflow.

There is a limitation on ancestor relations: you cannot write more than one entity per second to an entity group.
As described in the documentation, keep your entity groups small. They also said:

only create them when transactions are absolutely necessary

But I don't understand the meaning of this line: "Note that entity groups are not required if you simply plan to reference one entity from another."

Please check my Stack Overflow question for more info.


Ani Hatzis

Nov 4, 2017, 4:50:35 PM
to Google Cloud SQL discuss
 
"Note that entity groups are not required if you simply plan to reference one entity from another."

I believe that sentence is meant to explain the availability of key properties, i.e. a comment entity could contain a key property storing the post's key and another one storing the user's key. If the comment had its author's user entity as ancestor, you wouldn't necessarily need to reference the user again, because you could just read the parent.

There is a limitation on ancestor relations. You cannot write more than one entity per second.

Yes. If you decide to go for User > Post, that means a user cannot create/update/delete more than one post per second (including all writes to the user). On the other hand, you get fast transactional queries for all posts of the user (ancestor query).
If you go for User > Comment (with a reference to the post the comment belongs to), a user cannot create/update/delete more than one comment per second (including all writes to the user). On the other hand, you get fast transactional queries for all comments of the user.
If you go for Post > Comment, only one comment per post can be created/updated/deleted per second (including all writes to the post), and it doesn't even matter which user is submitting the request. If you think that (transactionally) querying and reading a post together with all its comments is really valuable, you could try to work around the write limit by offsetting/buffering any comment-related write ops, so writing is throttled below the 1/sec limit.
Of course, User > Post > Comment would make for terrible scalability.

With an existing app, maybe you can make these decisions based on your analytics: which kinds of requests/operations will be the most frequent, and which are critical for your users and your business. For example:
  1. get one post with the 3 most recent comments; 100 requests; should be very fast (less than x ms)
  2. transactionally create one comment on a post; 10 requests; can be 202 ACCEPTED, but visible to others within a few seconds
  3. non-transactional query of the 10 most recent posts of a user; 20 requests; should be fast (less than y ms)
  4. transactional query of batches of 100 posts per user and deletion of them (user deletes account); 5 requests; administrative/background
And so on. As the next step, weigh them, estimate costs, and you will get an idea of which kinds of ops/requests you want to optimize your schema for.

However, I wanted to add that, independently of your schema, you should let Datastore assign the IDs of all entities to get the best performance and scalability. As you have read in the docs, you will get random long numerical IDs that are optimally distributed for scalability and performance.

Since you already have billions of records, and maybe many of their IDs are monotonic sequences, it might be worth applying the same to them after the fact. When importing your old users, I would let Datastore allocate these IDs for you, and store the old ID in a separate UserMap kind that has the same new ID. E.g. an existing user with ID "u123" becomes in Datastore an entity with KEY(User, 39001), and you add an entity with KEY(UserMap, 39001) with an indexed property old_id='u123'. When you then migrate the posts of user 'u123', query UserMap for that old_id and use the resulting ID, KEY(UserMap, 39001), to construct and store KEY(User, 39001). First all users, then all posts, then all comments. Once you have completed your migration, you just delete the UserMap entities etc., and you will have an optimal distribution of IDs from the start.
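The remapping scheme just described can be illustrated with an in-memory sketch. A plain counter stands in for Datastore's ID allocator, and maps stand in for the UserMap lookup and the post entity; this is only a shape of the algorithm, not migration code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class IdRemap {
    public static void main(String[] args) {
        // Stand-in for Datastore's ID allocator (real code would allocate IDs).
        long[] nextId = {39001};

        // Step 1: import users, recording old -> new in a UserMap-style lookup.
        Map<String, Long> userMap = new LinkedHashMap<>();
        for (String oldId : new String[] {"u123", "u124", "u786"}) {
            userMap.put(oldId, nextId[0]++);
        }

        // Step 2: rewrite each post's user_id via the map before storing it.
        Map<String, String> post =
            new HashMap<>(Map.of("post_id", "p098", "user_id", "u123"));
        long newUserId = userMap.get(post.get("user_id"));
        post.put("user_id", Long.toString(newUserId));

        System.out.println(post.get("user_id"));
    }
}
```

The same two-step pass repeats for comments once all posts are mapped, which is why the order "users, then posts, then comments" matters.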

You got some good feedback here and on SO. I hope I didn't add to your confusion :-)

Azeem Haider

Nov 8, 2017, 8:52:10 AM
to Google Cloud SQL discuss
I understood how I can migrate from SQL to Datastore. But now the problem is:

No more than one write operation per second

This is creating trouble for me. Suppose there is another kind, Like, as a child of Post; then it is quite possible there is more than one like per second.
I don't want to create Like or Comment under User; they always go under Post, because in my app it is a very rare case to need all posts on which a user commented. Mostly I need all comments of a post. So I think it is best to make a post the parent of its comments, to make the query fast. You mentioned:

Try to work around the write limit by offsetting/buffering any comment-related write ops, so writing is throttled below the 1/sec limit.

How can I do this? I'm using Java; is there any article or tutorial?

Suppose you were in my place and you had four kinds: User, Post, Comment, Like. How would you define them? Which would you prefer, if posts are shown like Facebook or Instagram, where you need a post with its 3 most recent comments, and on a comment click you need all comments of that post, and the same for likes?

Any suggestion?

Jordan (Cloud Platform Support)

Nov 8, 2017, 2:37:33 PM
to Google Cloud SQL discuss
As previously stated the choice of how you want to migrate your kinds over is completely up to you. If you expect traffic like that of Facebook or Instagram then yes you may run into contention issues due to the write limit on entity groups. 

In this case you may want to just have your User kind a parent of Post kind (since a single user cannot make more than one post in a second). Then your Comment and Like kinds simply have a 'post_id' property. 

No matter what API call you make to Google it is always recommended to perform error handling to catch any possible exceptions such as a contention error. Once you catch these 5xx error responses in your code, you should perform exponential backoff retry in order to retry the comment or like again after a small exponential delay. The idea is to have your clients backoff until their requests eventually succeed. Of course in your UI you will show the new comment or like right away, but behind the scenes your client will retry until it truly succeeds.
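The retry loop described above might look roughly like this. writeComment is a hypothetical operation standing in for the real Datastore call; production code would also add random jitter and cap the maximum delay.

```java
public class BackoffRetry {
    static int attempts = 0;

    // Hypothetical operation: fails twice with contention, then succeeds.
    static void writeComment() {
        if (++attempts < 3) throw new RuntimeException("contention");
    }

    public static void main(String[] args) throws InterruptedException {
        long delayMs = 100;                 // initial backoff
        for (int i = 0; i < 5; i++) {
            try {
                writeComment();
                System.out.println("succeeded after " + attempts + " attempts");
                return;
            } catch (RuntimeException e) {
                Thread.sleep(delayMs);      // wait before retrying
                delayMs *= 2;               // exponential growth
            }
        }
        System.out.println("gave up");
    }
}
```

The bounded loop matters: a client that retries forever can amplify an outage instead of riding it out.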

Additionally if you are using the Google Cloud Client Libraries Google will automatically perform exponential backoff for you for most requests. As a side note I went ahead and did a test where I had a thousand individual comment entities created under a single Post parent via a script. There were no contention errors returned from the Datastore and all of the entities were committed. This shows that the Datastore actually replicates and scales quickly, but that these limits are documented as best practices to protect both the Datastore and your application from potential issues. 

 

Ani Hatzis

Nov 8, 2017, 4:22:15 PM
to Google Cloud SQL discuss
Hi Azeem,
 
I understood how I can migrate from SQL to Datastore. But now the problem is:

No more than one write operation per second

This is creating trouble for me. Suppose there is another kind, Like, as a child of Post; then it is quite possible there is more than one like per second.
I don't want to create Like or Comment under User; they always go under Post, because in my app it is a very rare case to need all posts on which a user commented. Mostly I need all comments of a post. So I think it is best to make a post the parent of its comments, to make the query fast. You mentioned:

Alright, assuming your most frequent use case will be to get a post with the last few comments (or just a snippet of each of the last few comments), the way G+ and Facebook give you a preview of a post in your stream: I doubt you need strong consistency for these requests. And then the user opens a post and you want to show more comments (maybe not all, just a batch; let them scroll or click for the next batch); again, eventual consistency might be sufficient. Normally, browsing posts or comments is no good reason for transactions.

But a strongly consistent read might make sense when cleaning Datastore of the posts/comments of a user who happens to be a spam bot, or of a user who wants to delete their account (privacy), or for something like Google Checkout. Having the user as parent might make it easier and safer to query really all of a user's comments and posts batch-wise.

For getting the post previews:
You could compute the post preview as one bigger entity (the post body including the recent comment snippets as a repeated structured property). For each new comment on a post, you write some data of the comment into the post's recent-comments property. However, since many comments on one post could create a hotspot on the post entity, where even exponential backoff might not be sufficient, you can apply the offsetting/buffering described below.

For browsing the comments of a post:
Add a computed indexed property to comments that is used for the complete "query comments for this post", including the sorting by date: first the key or ID of the post, a separator character, then the UTC date-time of the comment's creation, e.g.:

Post:
KEY(User, 1, Post, 7)

Comment:
KEY(User, 5, Comment, 99)
indexedSortByPost = "user_1_post_7-20171024151617.4929"
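Composing such a computed sort property is simple string building. A sketch of the example format above (the exact layout is illustrative, not a required scheme):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SortKey {
    // Compose the post reference and a sortable UTC timestamp into one
    // indexed string, so one filtered, sorted query works per post.
    static String sortKey(long userId, long postId, LocalDateTime utc) {
        return "user_" + userId + "_post_" + postId + "-"
            + utc.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss.SSSS"));
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2017, 10, 24, 15, 16, 17, 492_900_000);
        System.out.println(sortKey(1, 7, t));
    }
}
```

Because the timestamp is zero-padded and most-significant-first, plain string ordering of this property matches chronological ordering within a post.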

For counting likes:
Likes could be a special case, because normally you only care about the number of likes, unless you want to combine this with an update notification (i.e., if a user likes a post, the user will automatically be notified of any new comment). If it is just about counting the likes, you could implement them with sharding counters. See this article with a Java sample: Sharding counters
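The sharding idea from the linked article, simulated in plain Java: each array slot plays the role of one counter-shard entity. Real code would put/get shard entities and query them by their post reference instead.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ShardedCounter {
    private final long[] shards;   // each slot stands in for one shard entity

    ShardedCounter(int numShards) { shards = new long[numShards]; }

    // Writes spread over random shards, so no single entity is a hotspot.
    void increment() {
        shards[ThreadLocalRandom.current().nextInt(shards.length)]++;
    }

    // Reads sum all shards (in Datastore: query all shard entities of the post).
    long total() {
        long sum = 0;
        for (long s : shards) sum += s;
        return sum;
    }

    public static void main(String[] args) {
        ShardedCounter likes = new ShardedCounter(20);
        for (int i = 0; i < 1000; i++) likes.increment();
        System.out.println(likes.total());
    }
}
```

With 20 shards, the sustainable write rate is roughly 20x the per-entity-group limit, at the cost of a multi-entity read when totalling.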

Regarding good Datastore designs, I want to recommend this article from the Datastore docs: Best practices.
There are some more articles in the App Engine docs that will give you a better understanding:
 Computing the post-preview
Try to work around the write limit by offsetting/buffering any comment-related write ops, so writing is throttled below the 1/sec limit.

How can I do this? I'm using Java; is there any article or tutorial?

I don't know of an article on this topic, but for this example it would work something like this:
  1. Whenever a comment is created, the request handler will not just write the new Comment entity to Datastore, but will also transactionally add a task to a pull task queue "post-need-new-preview", with the hashed key of the post (e.g. a URL-encoded string) as a tag. The same can be done in request handlers that update existing posts or comments, or that delete comments.
  2. Add a cron job to your app with a schedule that reflects how often you want the previews to be updated at the earliest, e.g. every 5 minutes.
  3. The cron job will run a task in a push task queue, where tasks in the pull queue "post-need-new-preview" are leased by tag. Remember: each tag represents a post. Pull queues allow you to lease up to 1,000 tasks with the same tag, starting with the oldest task in the queue.
  4. For each tag, get the post by key, query the last x comments, copy their snippets into the post, and write the post back to Datastore (non-transactionally). If there are more tasks with the same tag, wait 1 or 2 seconds and queue another task into the push queue to repeat.
  5. If there are no tasks left, just exit the cron-job task handler. You could also exit if the oldest task for a tag is too fresh (e.g. a task only 30 seconds old), to save instance hours.
With this, you can throttle the write ops for each post, respecting the 1/sec limit. Due to the nature of task queues, the number of instances would grow with the workload (i.e. the number of posts that need a new preview). If you put handlers like these into some kind of backend service of your app, you could configure a cheap, slow instance class for them. With the cron-job schedule you can control how fresh you want your post previews to be.

Of course, there could be alternative implementations where your app stores markers per post in Datastore to achieve more or less the same as with the pull queue. But the basic idea is the same: defer pending write ops, group them by post, and make a single write op at certain intervals instead.
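The defer/group/coalesce idea can be sketched without task queues, using an in-memory buffer keyed by post as a hypothetical stand-in for the pull queue's tags. The flush method is what the periodic job would run.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreviewBuffer {
    // Pending comment events grouped by post key, like pull-queue tasks by tag.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    void onNewComment(String postKey, String snippet) {
        pending.computeIfAbsent(postKey, k -> new java.util.ArrayList<>())
               .add(snippet);
    }

    // Called by the periodic job: one write per post, however many comments
    // arrived since the last run.
    int flush() {
        int writes = 0;
        for (String postKey : pending.keySet()) {
            // In real code: re-query the last few comments for this post and
            // rewrite that post's preview property (one put() per post).
            writes++;
        }
        pending.clear();
        return writes;
    }

    public static void main(String[] args) {
        PreviewBuffer buf = new PreviewBuffer();
        buf.onNewComment("post_7", "first!");
        buf.onNewComment("post_7", "nice pic");
        buf.onNewComment("post_9", "lol");
        System.out.println(buf.flush());   // 3 comments collapse into 2 writes
    }
}
```

However bursty the comments are, each post receives at most one preview write per flush interval, which is exactly the throttling the 1/sec limit requires.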

Azeem Haider

Nov 12, 2017, 3:14:59 PM
to Google Cloud SQL discuss
After your reply I searched a lot on sharded counters. What I understand is that a sharded counter is a technique where you divide a single entity into multiple entities, right? You pick one of them randomly and update it, and when you need the count, you fetch all these entities and sum up their values for the total.

The thing confusing me is: if you convert one entity into multiple entities with the same parent, aren't they inside a single entity group? If it's a single entity group, then you can't apply more than one write op per second.

For example, Post1 has two counter shards, A and B. Are Post1, A and B in the same entity group? Or are Post1 and A one entity group, and Post1 and B another?

If these are separate entity groups, then we can also use this technique for the comments system, am I right?

Another thing I don't know whether I understand correctly or not.


For getting the post previews:
You could compute the post preview as one bigger entity (the post body including the recent comment snippets as a repeated structured property). For each new comment on a post, you write some data of the comment into the post's recent-comments property. However, since many comments on one post could create a hotspot on the post entity, where even exponential backoff might not be sufficient, you can apply the offsetting/buffering described below.

You said to create a Post entity with a comments property which holds the list of the 3 most recent comment IDs? But with this we need to insert the entity into the Comment kind as well as update the comment ID list in the Post entity. Please correct me.
 

One more thing I want to ask: if I access any ancestor entity by ID and update it, does the one-write-per-second limit apply?
For example, User is the parent of Post and you also have a post_id. If you update the Post entity using post_id, does the 1/sec write limit apply?

Ani Hatzis

Nov 12, 2017, 4:20:25 PM
to Google Cloud SQL discuss


On Sunday, 12 November 2017 21:14:59 UTC+1, Azeem Haider wrote:
After your reply I searched a lot on sharded counters. What I understand is that a sharded counter is a technique where you divide a single entity into multiple entities, right? You pick one of them randomly and update it, and when you need the count, you fetch all these entities and sum up their values for the total.


 
The thing confusing me is: if you convert one entity into multiple entities with the same parent, aren't they inside a single entity group? If it's a single entity group, then you can't apply more than one write op per second.

For example, Post1 has two counter shards, A and B. Are Post1, A and B in the same entity group? Or are Post1 and A one entity group, and Post1 and B another?


The sharded counters must not have any ancestor, because putting them into the same entity group again would defeat the write-limit workaround you want to achieve.

  1. Every counter must be in its own entity group (no parent).
  2. The count property should not be indexed and not be part of a composite index.
  3. Every counter must have an indexed reference to the post for which it counts, so you can add this filter to a query and collect all counters of a particular post.
 
If these are separate entity groups, then we can also use this technique for the comments system, am I right?


Sharded counters work because, when you want the total number of likes, you query all counters of the particular post, then just add up all the counts and return a single value.

You can also count the comments of a post this way. But you cannot store the comments themselves in shards. The idea is nice, but then, when your user is browsing the comments of a post, your handler has to query multiple shards, and there are too many issues here: you cannot JOIN multiple query results easily, and you would need multiple cursors or offsets per HTTP request, mapping them back to the different shard queries. Much pain awaits on this road.
 
Another thing I don't know whether I understand correctly or not.


For getting the post previews:
You could compute the post preview as one bigger entity (the post body including the recent comment snippets as a repeated structured property). For each new comment on a post, you write some data of the comment into the post's recent-comments property. However, since many comments on one post could create a hotspot on the post entity, where even exponential backoff might not be sufficient, you can apply the offsetting/buffering described below.

You said to create a Post entity with a comments property which holds the list of the 3 most recent comment IDs?

Not only a list of recent comment IDs, but even a preview of the actual comments (the first two lines or so of each), so that when the server returns a post, or even a list of posts, the query result will already contain everything you need for the response.
 
But with this we need to insert the entity into the Comment kind as well as update the comment ID list in the Post entity. Please correct me.
 

Almost correct. Both kinds have a user as parent.

Post has the author of the post as parent, and Comment has the author of the comment as parent:

  • KEY(User, 1, Post, 5)
  • KEY(User, 2, Comment, 3) with post property KEY(User, 1, Post, 5)
So it is two different entity groups. If user 2 writes a comment, the comment itself is written into user 2's entity group, not that of user 1. A user will normally not write enough comments to exceed the 1/sec limit per entity group. Neither will that happen with writing posts.

However, your code can't immediately update the "recent comments" list of the original post 5 of user 1, because in that case you would get a hotspot for posts with many comments. For this reason, task queues can be used to update the post's preview only when necessary, and only at a rate slower than the 1/sec limit. The use case was an example of buffering write ops into entity groups.
 

One more thing I want to ask: if I access any ancestor entity by ID and update it, does the one-write-per-second limit apply?

Yes. But in Java it should be possible to do a batch write, that is, you pass multiple entities to the same write call. If they are all in the same entity group, it should count as one write.

In Python the function is ndb.put_multi(user1, post1_1, post1_2); there is certainly something similar for Java.
 
For example, User is the parent of Post and you also have a post_id. If you update the Post entity using post_id, does the 1/sec write limit apply?

I'm not sure what you mean by "updating the post entity using post_id". If you get a post by its Datastore key and update it, it counts towards the rate limit of its entity group. There is no difference in rate whether you read the entity by its key or from a query before you update it.

Ani Hatzis

Nov 12, 2017, 4:20:25 PM
to Google Cloud SQL discuss
One more note, since this is the group for Cloud SQL and your questions are more related to Cloud Datastore and App Engine: both have their own Google Groups, where you might find more people who have experience with those products:


