Model datastore application

117 views
Skip to first unread message

Susan Lin

unread,
Apr 12, 2016, 7:22:52 PM4/12/16
to Google App Engine

I am looking how to create an efficient model which will satisfy the requirements I put below. I have tried using gcloud-node but have noticed it has limitations with read consistencies, references, etc. I would prefer to write this is nodejs, but would be open to writing in java or python as long as it would improve my model. I am building around the new pricing model which will come July 1st.

My application consists of a closed email system. In essence what happens is users register to the site. These user's can make friends. Then they can send emails to each other.

Components of the app:

Users - Unlimited amount of users can join.

Friends - A User can have 200 confirmed friends and 100 pending friend requests. When a friendlist is retrieved it should show the name of the friend. (I will also need to receive the id of the friends so I can use it on my client side to create emails).

Emails - Users can send emails to their friends and they can receive emails from their friends. The user can then view all their sent emails independently(sentbox) and all their received emails independently(inbox). They can also view the the emails sent between themselves and a friend order by newest. The emails should show the senders and receivers names. Once an email is read it needs to be marked as read.

My model looks something like this, but as you can see their are inefficiencies.

Datastore Kinds:

USER
-email (id) //The email doesn't need to be the id, but I need to be able to retrieve users by their email
-hash_password
-name
-account_status
-created_date

FRIEND
-id (auto-generated)
-friend1
-friend2
-status

EMAIL
-id (auto-generated)
-from
-to
-mutual_id
-message
-created_date
-has_seen

Procedures of the application:

Register - Get operation to see if a user with this email exists. If does not insert key.

Login - Get operation to get user based on email. If exists retrieve the hash_password from the entity and compare to user's input.

Send friend request - Friend data will be written twice for every relationship. Then using the index on friend1 and index on status I will query all the friends for a user and filter only those which are 'pending'. I will then count these friends and see if they are over X. Again I will do this for the other user. If they are both not over the pending limit, I will insert the friend request. This needs to run in a transaction.

Accept a friend request - Friend data will be written twice for every relationship. Then using the index on friend1 and index on status I will query all the friends for a user and filter only those which are pending. I will then count these friends and see if they are over X. Again I will do this for the other user. If they are both not over the pending limit, I will change both entities's status to accepted as a transaction.

Show confirmed friends - Friend data will be written twice for every relationship. Then using the index on friend1 and index on status I will query all the friends for a user and filter only those which are accepted. Not sure how I will show the friend's names (e.g what happens if a user changed their name this needs to be reflected in all friend relationships and emails!).

Show pending friends - Friend data will be written twice for every relationship. Then using the index on friend1 and index on status I will query all the friends for a user and filter only those which are pending. Not sure how I will show the friend's names (e.g what happens if a user changed their name this needs to be reflected in all friend relationships and emails!).

View sent emails - Using the index on the from property I would query to get all the sent emails from a user 5 at a time ordered by created_date (newest first). (e.g what happens if a user changed their name this needs to be reflected in all friend relationships and emails!).

View received emails - Using the index on the to property I would query to get all the received emails to a user 5 at a time ordered by created_date (newest first). When a emails is seen it will update that entities has_seen property to true. (e.g what happens if a user changed their name this needs to be reflected in all friend relationships and emails!).

View emails between 2 users - Using the index on mutual_id which is based on [lower_lexicographic_email]:[higher_lexicographic_email] to query the mutual emails. Ordered by newest, 5 at a time. (e.g what happens if a user changed their name this needs to be reflected in all friend relationships and emails!).

Create email - Using the friend1 and status index I will confirm the user's are friends. If they are friends, I will insert an email.

Anastasios Hatzis

unread,
Apr 13, 2016, 4:36:43 AM4/13/16
to Google App Engine
Susan,

as far as I understand your model and your procedures, Google Cloud Datastore should be a good choice. My suggestions are based on my experience with Python and the NDB API, but I assume there is no significant difference.

Since your app maintains hard limits in the number of friendships and friend requests, I suggest to tweak your model a little bit, so you can apply strong consistent queries (which require an ancestor, i.e. a parent). By doing this, we put the Friend objects into so-called entity groups, where each parent forms its own group.

FRIEND
-parent (Key of the owning user)
-id (auto-generated)
-friend1
-friend2
-status

This way, the app can perform a strongly consistent, keys-only query on both, number of friendships and number of pending friend requests. I guess, both queries would happen rather often (each time a friend request is created or friendship is accepted).

Furthermore, you already suggested two have two objects, one for each side of the friendship, good thinking. Since you already have the key of friend1 as parent, we could remove it from the properties, too.

In addition, we could use the ID of the friend as ID of this relationship to avoid inconsistency by duplicate relationships. (I assume all User keys are parent-less, so the IDs already guarantee unique keys for all users in the datastore). In general, it is cheaper and faster to lookup for a key, than doing a query. This way, the existence of a key would tell us, if there is already a Friend object. The app also can compute the Friend key and directly get the object, rather than always first querying. And since we already have the friend's key (by the ID), we can also remove the property friend2.

FRIEND
-parent (Key of the friend1)
-id (ID of friend2's key)
-status

The status must be indexed though to perform the counts on requests and existing friendships. Any creation, update, and deletion of Friend objects would be in transactions across two groups (of friend1 and friend2), so both legs of the relationship are consistent.

In my experience using webapp2 for user authentication, it has benefits to keep the actual user account (the kind used for authentication) out of anything else in the app. So I would not use User keys as parent or for the friend key-property, but instead: computed keys with the same ID as the user (for uniqueness and consistency) and a different kind. Parents don't need to exist.

FRIEND
-parent (Key of kind FriendParent, ID the same as of corresponding user)
-id (ID of friend2)
-status

For example, two User objects stored, id:1 and id:2. As result of a friend request from 1 to 2, these two objects are created:
FRIEND
- parent: KEY(FriendParent, 1)
- id: 2
- status: not_accepted_yet
FRIEND
- parent: KEY(FriendParent, 2)
- id: 1
- status: new_request_to_accept

The parent keys don't exist as objects in the datastore, They only exist as parent keys in the Friend objects to put them into entity groups, so the app can apply strong consistent ancestor queries for each user.

Downside of the variant with parent: There is a technical limitation how often the app (or any other client) can write into the same entity-group (1/sec). In this model, it would give a hard limit how often each user can:
  • send/revoke friend request (or by the friend)
  • accept/deny request (or by the friend)
  • remove friend (or by the friend)
In other words: With parents in Friend, datastore can not maintain huge numbers of friendships, which implies high frequency of such write-ops, like followers at Twitter, channel subscriptions on YouTube etc. However, considering the rather low hard limits (200/100) I thought this constraint doesn't matter.

If it matters though, we cannot put Friend into bigger entity-groups, but then also loose the ability to perform strong consistent queries (which requires ancestor/parent). An eventually consistent ancestor-less query (for example, when the app counts number of requests for a user) may miss an entity that was written just milliseconds / seconds before, so it could be that the limit is slightly exceeded for some users. In that case, I would suggest to keep Friend in their own entity-group:
FRIEND
-id (similar to mutual_id: friend1_ID:friend2_ID)
-friend1
-friend2
-status


The app can still make the transactional writes with both. With this little tweak the app could at least get Friend by key in more use-cases than with an auto-generated ID.

I will continue my suggestion with the parent-version of Friend.

You have mentioned the display of names. As a general rule of thumb (and I would think that many Datastore users follow this rule), you do less data normalization than in SQL databases, in lack of join queries and such.

As far as it is only names, I would think that user names don't change frequently, so I would add the friend's name to the Friend model, so we don't need to query for the current name of up to 300 users every time we show the list of friends and friend requests (that would double the reads). If a user changes the name, we would need to update all Friend objects with this user. Given the parent-variant of Friend, we can do this strongly consistent. Perform an ancestor query of all Friend objects owned by this user (or the FriendParent object), compute the keys of the mirrored Friend objects, and batch update them to the new name in a few separate transactions.

As no limitations to the number of emails have been mentioned, this could be pretty heavy on writes. Maybe a few thousands emails to touch for each name change? And this would be needed for both, sender or recipient.

Furthermore, what about avatar images of users or other profile information? They may change more frequently. I think it's difficult to make forecasts on all the scenarios, so you could decide which approach would be cheaper. Probably it is safe to assume that the name and the avatar won't change often later on, so it makes sense to write them directly into Friend and maybe also EMail.

FRIEND
-parent (Key of kind FriendParent, ID the same as of corresponding user)
-id (ID of friend2)
-name (of friend2)
-imgUrl (of friend2)
-status

Every other profile-related data should be stored into a separate kind, especially if it can change frequently (last seen, online status etc.). In the HTML templates or with some JS wizardry, the link to each user's profile can be computed easily even in a friend / request list, with-out actually reading a User object. Basically, a Friend object would contain everything the app needs for the most frequent requests.

As I have mentioned earlier, I would separate the authentication-related data of a user from profile-related data, so instead of putting name, avatar etc. into the User kind, I would put it into a UserProfile kind, where the ID always is the same as of the corresponding user account.

USERPROFILE
-id (ID of USER)
-name
-imgUrl
-about me (etc.)
-status

One last note regarding the ID of USER, I suggest to not use the email ID which can change, but datastore keys (ID is part of the key) are immutable. An auto-generated ID would be fine.

For standard GAE environment there is the User API available, if you want to count on Google accounts or OpenID. I use a custom user management and authentication based on webapp2, but certainly, other frameworks also provide similar features. I've thought this is much safer and easier than implementing my own authentication features. There is so much that could go wrong.

Well, I hope this helped you a little.

Ani

Susan Lin

unread,
Apr 13, 2016, 9:31:10 PM4/13/16
to Google App Engine
Thank you very much for the detail response ! One question as I am not famiar with the parent type entities. If I need to list all the contacts would this result in 1 read operation of 200 read operations (considering 200 friends)?

Anastasios Hatzis

unread,
Apr 14, 2016, 3:10:19 AM4/14/16
to Google App Engine
On Thursday, April 14, 2016 at 3:31:10 AM UTC+2, Susan Lin wrote:
Thank you very much for the detail response ! One question as I am not famiar with the parent type entities. If I need to list all the contacts would this result in 1 read operation of 200 read operations (considering 200 friends)?

It depends on the implementation. Considering the "most expensive" user with 200 friends and 100 friend request:

For a keys-only query, as used for counting the number of friends to validate if the 200+100 limits is reached, these are 300 small ops (all free).

If you run a projection query, which should be possible for many of your use-cases of this query, the results would also be 300 small ops (all free).

If you retrieve fully all objects with the query, it is 300 read ops in the old pricing model, or 300 entity reads in the new pricing model.

In other queries or use-cases, it could be much less if a recently read object with this key already is in memcache, so that read hits the memcache instead of the datastore. Since all 300 friends are unique in the result, memcache may only get a chance for hits in subsequent requests for the same friend list. So, for each non-hit entity read, it is a read op (or entity read). memcache hits are free.

Another option to reduce the reads would be to run the query page-wise only, e.g. 5 or 20 friends, maybe the newest first, because that's likely the most interesting for users. That would be potentially up to 20 read ops (or entity reads) for a page size of 20.

In the use-case where you want to show the current user all their friends and friend requests, or display the list on the profile page of a user, it could make sense to use Search API instead (if that is available on your environment). If the app mirrors any writes to Friend entites also into one search document per friend/request, the app doesn't even need to ask the Datastore. Although search documents are not transactional and can be a little stale (depending on the implementation maybe 1 or 2 seconds) the Search APi calls are free (some limitations and quotas to consider). Extra benefit: users can run full text search in the list, sorting and filtering is all possible. I think this solution would be more interesting for EMails, rather than running datastore queries. So your app uses datastore for reading/writing transactionally, but for users to read and browse, your app uses Search API. Given a current user can only search through their own mails of course :)

Finally, one alternative I want to re-iterate: the use of big entities, here with repeated properties, although I don't think we are safely with-in the entity size limit of 2MByte per entity.

You use the Friend kind model as suggested, but whenever the app creates/updates/deletes a Friend object, also mirror the change into a big entity, e.g. the FriendParent kind. If the write happens in the same transaction with the Friend objects, it doesn't affect the 1 write / sec limit, as it happens in the same entity-group anyway. It causes more write-ops (or 1 entity write, I guess), but also reduces the read ops (or entity reads) to 1 for showing all friends. The application code would be a little more complex to make this work fail-safe though:

FRIENDCOPY (Model with no own ID)
-key (of friend2)
-name (of friend2)
-imgURL (of friend2)
-status

FRIENDPARENT
-friends (repeated structured property of kind FRIENDCOPY)

As I've mentioned, I have my doubts your app could take this approach given the 300 potential values, unless you come up with a good idea of splitting the list up into multiple entities which the app still can maintain programmatically and safely. However, it can be a useful pattern in other cases.

Reply all
Reply to author
Forward
0 new messages