How to implement 'Mark All Read' feature in appengine

102 views
Skip to first unread message

nischalshetty

unread,
Jan 9, 2011, 11:57:55 AM1/9/11
to Google App Engine
Say a user selects 5000 unread messages (each message is an entity)
and wants to mark all of them as read. It would be a massive update.
Can anyone help me on what the best way to do such a large update is?

Is there a different way to do this apart from updating each of the
5000 entities by marking their status as read? Even if there is no
escaping the large update, how do you update so many entities on
appengine?

Ernesto Karim Oltra

unread,
Jan 9, 2011, 12:04:43 PM1/9/11
to Google App Engine
Maybe you need to re-think your database models. This message may
help: http://groups.google.com/group/google-appengine/browse_thread/thread/52d44bde94f2e8b8?pli=1

Jay

unread,
Jan 9, 2011, 4:08:24 PM1/9/11
to Google App Engine
That is a good link Ernesto.

nischalshetty - I think the best answer is probably an "it depends"
kind of thing. What do you want to do with unread status? Of course
the key here is that unread is a per user deal. You might be able to
keep some kind of bit-map or other compressed data structure to hold
read/unread info, but your ability to do that may depend on the
structure of your models and your use cases.

I would look closely at the use case / story. Updating 5000 articles
as unread sounds like a 'catch up' kind of operation. You might be
able to maintain an unread since kind of concept similar to the idea
Nick proposed in the post Ernesto referenced. If you can do something
along those lines, you might be able to just move a 'pointer' to
'now'.

If you do need to update 5000 'somethings' you want to try and do it
in batch if at all possible (does the api support > 1000 now?). And if
you can defer the operation by placing it on a task queue, even
better.

-- Jay

On Jan 9, 11:04 am, Ernesto Karim Oltra <ernestoka...@gmail.com>
wrote:
> Maybe you need to re-think your database models. This message may
> help:http://groups.google.com/group/google-appengine/browse_thread/thread/...

nischalshetty

unread,
Jan 9, 2011, 9:14:45 PM1/9/11
to google-a...@googlegroups.com
@Ernesto Thanks for the link, that's a good starting point.

@Jay Thanks for some more pointers. 

Simply put here's what I want to do (a twitter app): 

1. Pull all the tweets of a user
2. The user can selectively read tweets (which would be marked as 'Read')
3. Unread tweets would be well 'Unread'

Here, the latest timestamp thingy won't really help because it can so happen that the user reads a few tweets that were say 5 days old then the user navigates to page one and reads a tweet that have appeared today. So all the tweets in between would still need to be Unread.

Apart from this, I need to give users the ability to say, mark all tweets as Read in case they want to start fresh. But again, Mark All Read would be a checkbox against all tweets, so the user can 'deselect' a few tweets and then click the Mark as Read button :(

Hope you're getting what I want to accomplish. I've marked more than 10,000 emails as 'Read' in Gmail, how do they do it? 

-N

Stephen Johnson

unread,
Jan 9, 2011, 10:49:52 PM1/9/11
to google-a...@googlegroups.com
Here are some quick statistics I did using the idea that you have a entity which consists solely of a key and no other properties. I called this entity kind 'READ'. If a tweet has a matching key in this entity then the tweet has been read. If it doesn't exist then the tweet is unread. I did a batch PUT of 1000, 2000 and 10000 for comparison purposes and a batch GET of 2000 and 10000.

Results for batch put of 1000 was
1383ms 48493cpu_ms 48400api_cpu_ms
Results for batch put of 2000 was
1608ms 97030cpu_ms 96866api_cpu_ms
Results for batch put of 10000 was
7791ms 481290cpu_ms 480916api_cpu_ms
So, each entity key cost was consistently approx. 48ms. Which for the 10000 entity case costs approximately 8 minutes of CPU. This seems like the worse case scenario since someone will have tweets which have already been marked as read and a PUT of these  keys would not need to be done.

Then performing a batch get of 2000 was:
181ms 16830cpu_ms 16666api_cpu_ms
And a batch get of 10000 was:
421ms 83870cpu_ms 83333api_cpu_ms
So, cost for each entity key retrieval is approx. 8.33ms.

So, it seems that this could work since I assume you're only going to be showing a page of tweets at a time. Also, as another design tweak you could put all tweets for a given hour or half hour or 15 minute increment, etc. into one entity. Since tweets are a max. 140 characters you could fit over 7000 tweets in one entity for an hour and that's without compression. Then instead of allowing marking read/unread of individual tweets you could mark all tweets for a given hour or 30 minute increment as read or not read which would significantly cut down on the number of entities that need to be queried and read/unread key markers that need to be maintained.

Hope some of this helps,
Steve

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

nischalshetty

unread,
Jan 10, 2011, 12:57:48 AM1/10/11
to google-a...@googlegroups.com
@Stephen Wow! I'm grateful to you for taking time out to make things so clear. From what I see, you're suggesting to have a different key that when present indicates a tweet as 'Read'.

You're right, there would be pagination so I'll be displaying say 50 tweets per page. Now, I would fetch 50 tweets and then check if any of the tweets have a key in the 'Read' table. Suppose it has, then my pagination would go for a toss. I hope you're getting me? Pagination becomes cumbersome if I try to match an entity's status in a different table :(

-N

Stephen Johnson

unread,
Jan 10, 2011, 1:09:35 AM1/10/11
to google-a...@googlegroups.com
It sholdn't be cumbersome at all. Use the same key identifier for the tweets as for the Read entity, then do a batch get with those keys. Based on my numbers I'd say latency for 50 keys will be from 10ms to 20ms. You'll get back a Map (if using Java - not sure for Python). If they key is in the map then that tweet has been read otherwise it is unread. With a map doing lookups is fast and easy.


-N

--

nischalshetty

unread,
Jan 10, 2011, 1:27:29 AM1/10/11
to google-a...@googlegroups.com
My bad. I forgot to mention, there would be an option where I show just the 'unread' tweets to the user or just the 'read' tweets. When I show all tweets, your example would work perfectly fine. But what about the other two scenarios? Any pointers would help. I use GAE/J.

-N

Jeff Schwartz

unread,
Jan 10, 2011, 7:35:44 AM1/10/11
to google-a...@googlegroups.com
I don't know if the requirement are the same but here's something I came up with that might be of use to you as well.

If I need to mark a group of entities as having some common trait (the trait could be read or unread in your case) in response to an action the user has taken upon a group of entieis (not an entity group but the number of entities in a group with a common trait and the number can vary depending upon your use case) I would create special 'trait' entities. For a trait of 'read' for instance each trait entity would have 2 longs representing the range of entity ids of all the entities sharing the trait. When marking a group of entities, lets say 5000 of them, as read I would create a trait entity that has the 2 longs, the first long equaling the id of the 1st entity in the group and the second long the id of the last entity of the group. When determining if an entity is read I would check for a trait record whose 2 longs encompassed the id of the entity in question. If it existed I would use that to determine its read state and if no trait entity existed that encompassed the entity in question I would use the actual entity's read state field. A conflict exists when an entity is encompassed by a trait entity and whose own read field contradicts the state indicated by the trait. In those cases I check the actual entity's monotonic counter field and if it is greater than the monotonic field value of the trait entity then the actual entity's read field overrules the trait entity. A simple model follows. Please note that I am using Objectify ORM:

trait:

class Trait implements Serializable {

long serialVersionUID = 1L;
 
@Id
Long id;                                 // autogenerated

String a;                                // common attribute such as read, unread, etc and
                                            // can be used to query for trait entities of a specific
                                            // attribute type such as read, unread, yes, no, hot, cold,
                                            // etc.


long r1;                                // key id of 1st entity in the group
long r2;                                // key id of last entity in the group

long m;                                 // monotonic value  set when created and never modified thereafter

.
.
.
}

class Email implements Serializable {

long serialVersionUID = 1L;
 
@Id
Long id; 

boolean read;

long m;                                 // monotonic value set when entity is created and
                                            // only updated if the user reads or unreads

.
.
.

}

Like I said, I don't know if this fits your use case but it worked in mine and it is very light weight.

Jeff


On Mon, Jan 10, 2011 at 1:27 AM, nischalshetty <nischal...@gmail.com> wrote:
My bad. I forgot to mention, there would be an option where I show just the 'unread' tweets to the user or just the 'read' tweets. When I show all tweets, your example would work perfectly fine. But what about the other two scenarios? Any pointers would help. I use GAE/J.

-N

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.



--
Jeff Schwartz

Stephen Johnson

unread,
Jan 10, 2011, 11:33:30 AM1/10/11
to google-a...@googlegroups.com
Ok. A trick I use is to use the key field to hold more than one piece of data so that I can get around some of the limitations of queries as well as get back info doing keys only queries and keeping down the number of the custom keys I have, so you could do something like the following: I suppose each of your users has some sort of user id and that the tweets have some sort of identifier, let's say User-ID and Tweet-ID. Also, I will assume the Tweet-ID you assign also relates to the time of the tweet, that is Tweet-ID 17 would always have a date/time before Tweet-ID 18 and Tweet-ID 18 would always be before Tweet-ID 19, etc. And that is how you are paging through your tweets. That is by date/time. If not then an additional piece of information of date/time would need to be factored in. I will now use the entity name RS standing for ReadStatus instead of Read. We want to keep entity names as short as possible because it is meaningless information (as far as we are considered not Google) that is added to the keys. So you could use as your key format the following: User-ID:Tweet-ID

Also, we would add one more property to the RS entity of type boolean and name it R (again keep names short) indicating whether or not the Tweet has been read (true) or unread (false).

Now, if the user wants to paginate through all tweets then use the Tweet entity itself and use the RS entity and a keys only query to determine its read status. If the user wants to page through tweets that are yet unread then again do a keys only query (using a cursor to save position) on the RS entity and then do a get by keys on the Tweet entities. Do the same for tweets that have been read. Queries would look like

  select __key__ from RS where R == FALSE

Another difference between what I suggested before is that you will need to add entries to the RS entity for both read and unread entities, but the upshot of this approach would be that when you do a Mark All As Read for 10000 tweets you don't have to actually query the 10000 tweets and pull those entities in to memory and then update them. You can just get the keys, then do a PUT (without a GET) to the RS entity safely overwriting any existing values since the only value is the boolean flag.

I hope this helps you to think through and find a solution,
Steve

On Sun, Jan 9, 2011 at 11:27 PM, nischalshetty <nischal...@gmail.com> wrote:
My bad. I forgot to mention, there would be an option where I show just the 'unread' tweets to the user or just the 'read' tweets. When I show all tweets, your example would work perfectly fine. But what about the other two scenarios? Any pointers would help. I use GAE/J.

-N

--

Eli Jones

unread,
Jan 10, 2011, 11:59:48 AM1/10/11
to google-a...@googlegroups.com
Nick Johnson's newsgroup e-mail explains pretty much what you need.. except he leaves out some details (don't know, maybe he was busy building out the Appengine subsystem at the time or something) explaining exactly how to set it up underneath.

You should probably try something like this (which I believe is what Mr Johnson's example uses underneath).  I use Python so you're on your own with Java, but the concept is identical (I'm doing this the lazy way so there is no thought put to what property types should be used.):

class baseEmail(db.Model):
    From = db.StringProperty()
    To = db.StringProperty()
    Subject = db.StringProperty()
    Message = db.StringProperty()
    TimeStamp = db.DateTimeProperty()
    Attachment = db.BlobProperty()

class metaEmail(db.Model):
    Status = db.StringProperty()
    From = db.StringProperty(indexed=False)
    Subject = db.TextProperty()
    TimeStamp = db.DateTimeProperty()

I haven't bothered with thinking too deeply about the "right" ways to pick property types.. or whether to mark some as indexed=False.. but this gets the core concept.

For this example, baseEmail.key().name() == metaEmail.key().name()  (but they don't need to be in some child parent relationship or anything silly like that.. you don't need transactions for this).

When a user clicks "Refresh" to see if they have more messages.. a process runs that does something like this:

newEmails = db.baseEmail().all().filter('TimeStamp >', MaxMetaEmailTimeStamp).fetch()
newMetaEmails = []
for email in newEmails:
    metaEmail = metaEmail(key_name = email.key().name(), Status = 'Unread', From = email.From, Subject = email.Subject, TimeStamp = email.TimeStamp)
    newMetaEmails.append(metaEmail)
db.put(newMetaEmails)

And, at the same time.. it displays the newMetaEmails on screen.

Now, there are more and more optimizations to add (you could use ReferenceProperty with this if you want tighter associations)..  but this is the core idea..

You have metaEmail which is just metadata pointing to the underlying baseEmail data.. make metaEmail as small as possible.. if you don't need Subject or other stuff... leave those off.. the only properties you really need are Unread and TimeStamp.

You can change the metaEmail.Status to 'Deleted' for deleted emails.. and only truly delete them once every 24 hours or something.. this would allow for a handy undelete option.. etc.
    

On Mon, Jan 10, 2011 at 1:27 AM, nischalshetty <nischal...@gmail.com> wrote:
My bad. I forgot to mention, there would be an option where I show just the 'unread' tweets to the user or just the 'read' tweets. When I show all tweets, your example would work perfectly fine. But what about the other two scenarios? Any pointers would help. I use GAE/J.

-N

--

master outside

unread,
Jan 10, 2011, 12:20:53 PM1/10/11
to Google App Engine
One option is to use a date time stamp when they mark all as read. If
you allow people to mark items as unread then you in addition need
have a way to detect that on old items. For example if you were to
use '0' for unread and '1' for read you could use '2' to override the
the all read after time.

nischalshetty

unread,
Jan 10, 2011, 11:21:16 PM1/10/11
to Google App Engine
You guys are all awesome. My friend and I are soon going to finalize a
way and I'll post it here. Your feedback would be greatly appreciated.

-N

David

unread,
Jan 9, 2011, 3:34:58 PM1/9/11
to Google App Engine
You can save a date for the user on when the last mark all read date
was set. Then consider anything before that date as read by that
user. That way you only have to update a single entity.

nischalshetty

unread,
Jan 11, 2011, 1:12:03 PM1/11/11
to google-a...@googlegroups.com
@David

That sounds good but it would probably fail if the user marks one of the messages in between as 'unread' :(

Julian Namaro

unread,
Jan 12, 2011, 12:11:02 AM1/12/11
to Google App Engine
This one year old thread discusses the same problem I think:
http://groups.google.com/group/google-appengine/browse_thread/thread/b35a0387dfdf918b/55563f57306defbc?lnk=gst&q=mark+as+read#55563f57306defbc

A possible solution:

class Message(db.Model):
sender=db.StringProperty()
body=db.TextProperty()

class MessageIndex(db.Model):
# parent = Message, one message can have several
# MessageIndex children to spread the load of recipients
recipient = db.StringListProperty()

class MessageRead(db.Model):
# key_name = message_id + recipient_id
# The entity is written when a message is read
pass

def getMessagesForUser( user, nb_msgs ):
indexes = MessageIndexes.all(keys_only=True).filter(
recipient= user).fetch( nb_msgs )
for k in indexes:
keys.append( k.parent() )
keys.append( db.Key.from_path('MessageRead',
k.parent().id() + recipient_id) )
messages = db.get(keys)


I guess it's about the same that what Stephen is proposing.

sameer.mhatre

unread,
Jan 17, 2011, 1:28:05 PM1/17/11
to google-a...@googlegroups.com
@Jay
As you have mentioned in the post about "bit-map or other compressed data structure to hold read/unread info". Can you give an example for it or else any link which provides implementation for bit-map thing.

Jay

unread,
Jan 18, 2011, 12:21:58 PM1/18/11
to Google App Engine
@Sameer
A bit-map for holding the "list" of read/unread is a pretty common and
widely used pattern.
http://en.wikipedia.org/wiki/Bit_array
Also see data structure and algorithm books.

You can think of the bitmap as a compact list of all read/unread
assuming that the entries can be encoded with a key that increments or
something like that. Then, and this is the part I don't know how to do
offhand, if you could efficiently get an actual list from that bitmap
(expanding it), then you could do a batch get by key.

Something more realistic might be to find the first 0, then get the
next n 0's, get those in batch by key, then get whatever else might be
needed using an ajax pattern, i.e. "later".

I am not sure this advances the discussion any. The idea is half-baked
at best.

sameer.mhatre

unread,
Jan 18, 2011, 1:38:14 PM1/18/11
to google-a...@googlegroups.com
@Jay
Thanks for your reply. This is really helpful to me to solve my problem. I was quite aware of how the bit map works and all. But not able to co-relate my problem to this. Thanks again.
Reply all
Reply to author
Forward
0 new messages