Creating a highly writable object


thecheatah

Jun 14, 2011, 11:20:55 PM
to Google App Engine
I am trying to implement a system for an object that will be updated a
lot. My idea is to turn the updates into inserts, then have a batch job
that applies the inserts in batches to the highly writable object. The
inserts can be ordered either by time or by some sort of incrementing
identifier. This identifier or timestamp can be stored on the highly
writable object so that the next time the job runs it knows where the
next batch should start.
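
Roughly what I have in mind, as a Python sketch (all of these names are made up):

from google.appengine.ext import db

class InsertLog(db.Model):
    # One cheap entity per update; the batch job replays these later.
    delta = db.IntegerProperty(default=0)        # hypothetical update payload
    created = db.DateTimeProperty(auto_now_add=True)

class HotObject(db.Model):
    total = db.IntegerProperty(default=0)        # the highly writable state
    # Watermark: timestamp (or identifier) of the last insert applied,
    # so the next batch run knows where to start.
    last_applied = db.DateTimeProperty()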

Using a timestamp, I am running into a problem with eventual
consistency. When I query for inserts to apply, some inserts might not
show up in the results because they have not been written into the
index yet. So suppose we have inserts A, B, and C. If A and C make it
into the batch job, it will mark all work up to C as completed, and B
will never be executed.

Using incremented identifiers seems like it would solve the problem,
but how to implement such an identifier is itself unclear. To explain
why it would solve the original problem: we would be able to detect
that we jumped from A to C, because the difference between their
identifiers would be greater than 1. The sharded counter is great for
counting, but given eventual consistency it is not good as a source of
unique identifiers.
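
For instance, the job could notice the hole like this (a sketch, assuming each insert carries an incrementing seq field):

# 'batch' is the list of inserts fetched this run; 'seq' is the assumed counter.
ids = sorted(ins.seq for ins in batch)
safe_upto = ids[-1] if ids else None
for prev, cur in zip(ids, ids[1:]):
    if cur - prev > 1:
        safe_upto = prev   # seq prev+1 .. cur-1 not visible yet; stop here
        break
# Only apply inserts with seq <= safe_upto this run; retry the rest next run.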

I could use the memcache increment function, but the counter might be
flushed out of memory at any time. I believe memcache's update speed
would be enough for what I want to do.
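
What I mean is something like this (the key name is made up, and as noted the counter can be evicted at any time):

from google.appengine.api import memcache

# Atomic increment; initial_value=0 creates the counter if it is absent.
# If memcache evicts the key, the sequence silently restarts from 0,
# which is exactly the weakness described above.
seq = memcache.incr('insert_seq', initial_value=0)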

If I had an upper bound on the eventual-consistency delay, I could make
my system only process inserts older than that limit.
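
That would look something like this (a sketch using the InsertLog model from above, and assuming a one-minute window is a safe upper bound):

import datetime

# Only pick up inserts older than the assumed consistency window,
# so anything still missing from the index cannot fall inside the batch.
cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=1)
batch = InsertLog.all().filter('created <', cutoff).order('created').fetch(100)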

Anyway, those are my thoughts; any feedback is appreciated.

BTW: the inserts processed in a batch are assumed to be independent of
each other.

Ravi Sharma

Jun 15, 2011, 5:18:32 AM
to google-a...@googlegroups.com
If A, B, and C are not dependent on each other and ordering doesn't matter to you (e.g. if you process C, A, B, that's also fine), then you can put another column on this insert table, say 'processed'. When inserting, make it 'N' (if a string) or false (if a boolean), and query the entity based on this column. Whenever you process one row, set the value to 'Y' or true, and carry on with the next insert.

Or you can even delete these rows once you have processed them; then you will not need the extra column.

Note: I am assuming that for one update you will be processing all of its inserts in one task or job, with no multiprocessing.
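
In code it is roughly this (a Python sketch; the names are just examples, and apply_to_main_object is a made-up helper):

from google.appengine.ext import db

class InsertRow(db.Model):
    payload = db.TextProperty()                    # whatever the update carries
    processed = db.BooleanProperty(default=False)  # false ('N') on insert

def process_pending():
    # The job only picks up rows that are not processed yet...
    for row in InsertRow.all().filter('processed =', False).fetch(100):
        apply_to_main_object(row)  # hypothetical: fold the row into the main object
        row.processed = True       # ...and flips the flag (or row.delete() instead)
        row.put()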

thecheatah

Jun 15, 2011, 11:20:54 AM
to Google App Engine
Ravi,

Thanks for the feedback. I was thinking exactly along the lines of
what you have said. The only problem I see is that I plan on
processing multiple inserts in one batch job. The inserts and the
highly updated object will not be updatable in a single transaction,
so there might be situations where an insert was processed but the
flag was not set or the row was not deleted. To overcome this issue, I
am going to either make sure that processing an insert multiple times
does not affect the output, or accept a small percentage of failures.

Ravneet

Ravi Sharma

Jun 15, 2011, 11:48:46 AM
to google-a...@googlegroups.com
In that scenario you can go ahead and do something extra.

Keep a list of keys in your highly updated object, and whenever you process one insert and apply it to the main object, make sure you put the insert's key into this list property. That way the main object knows whether it already has the content of a given insert.

Only after that do you delete or mark the insert object. Then, when you next get the same insert (because it failed while you were marking it as processed), check whether its key exists in the list; if yes, just mark the insert object as processed and remove its key from the list property.

You will also need another job that cleans the list property on the updated object: read the object's list, get the insert object for each key, and if it is marked as processed, remove it from the list.

This will increase your datastore puts, but you will not have to worry about the inconsistency.


So your code will look like this. The highly updated object will have a property like:

List<Key> processedInserts; // in Java JDO

TASK 1
1) Get the next insert object, say i1; assume its key is k1.
   (At this stage, say processedInserts is empty.)
2) Check if k1 exists in processedInserts. If no, go to step 3; otherwise go to step 4.
3) Update the highly updated object with the content of insert object i1, and also add k1 to processedInserts.
   (At this stage processedInserts contains k1.)
4) Mark i1 as processed.

After this the processedInserts property will keep growing, and an entity has an upper bound on its size, so you need to keep the list down. Have another job run once in a while, or submit a task from step 2 if processedInserts.size() exceeds some number, say 500.

TASK 2
1) Get the highly updated object.
2) Loop through processedInserts.
3) Get each insert object; if it is marked as processed, delete its key from processedInserts.

Just make sure only one of TASK 1 and TASK 2 is running at a time. You can even run TASK 2 as part of TASK 1 after step 4; it's up to you where you see it as safe and with fewer if-then-elses :)
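
A minimal Python sketch of the same flow (the property above is Java JDO, but the idea is the same; all names here are made up and error handling is omitted):

from google.appengine.ext import db

class PendingInsert(db.Model):
    delta = db.IntegerProperty(default=0)        # hypothetical payload
    processed = db.BooleanProperty(default=False)

class HotObject(db.Model):
    total = db.IntegerProperty(default=0)        # hypothetical aggregate
    processed_inserts = db.ListProperty(db.Key)  # keys already applied

def task1(hot_key):
    hot = db.get(hot_key)
    for ins in PendingInsert.all().filter('processed =', False).fetch(100):
        if ins.key() not in hot.processed_inserts:   # step 2
            hot.total += ins.delta                   # step 3: apply the insert
            hot.processed_inserts.append(ins.key())
            hot.put()
        ins.processed = True                         # step 4: if this put fails,
        ins.put()                                    # the list makes the retry harmless

def task2(hot_key):
    # Cleanup: drop keys whose insert is confirmed processed (or already deleted).
    hot = db.get(hot_key)
    keep = []
    for k in hot.processed_inserts:
        ins = db.get(k)
        if ins is not None and not ins.processed:
            keep.append(k)
    hot.processed_inserts = keep
    hot.put()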

thecheatah

Jun 15, 2011, 3:38:46 PM
to Google App Engine
This is actually a pretty good implementation. The only issue is the
size of the processed-inserts list. Instead of having two tasks, I am
thinking that the one task will clean up the processed list before it
begins its work, basically checking that the processed inserts have
indeed been deleted.

So the processed list records all the inserts processed in the
previous run. The task first deletes those inserts if needed, then
goes on to process new inserts.

Thanks,

Ravneet

Bert

Jun 16, 2011, 4:52:51 AM
to Google App Engine
Hi Ravneet,

Have you taken a look at fork join queues?
http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
or
High concurrency counters without sharding?
http://blog.notdot.net/2010/04/High-concurrency-counters-without-sharding

I think they may do what you need and are proven solutions.

Thanks
Rob

thecheatah

Jun 17, 2011, 9:01:31 AM
to Google App Engine
Thanks Bert,

The first link is exactly what I was looking for.

Ravneet

Noah McIlraith

Jun 17, 2011, 10:47:47 AM
to google-a...@googlegroups.com
You can use the db.allocate_ids method (I think it's called that, IIRC) to generate incremental numeric IDs. It uses the same system the db module uses to assign unique keys, so it scales like a boss.
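
Something like this (a Python sketch; the kind name is made up):

from google.appengine.ext import db

# Reserve a block of 10 IDs for entities of the (hypothetical) InsertLog kind.
# Returns (start, end); the IDs start..end inclusive will not be used by the
# datastore's automatic ID assignment.
start, end = db.allocate_ids(db.Key.from_path('InsertLog', 1), 10)
keys = [db.Key.from_path('InsertLog', i) for i in range(start, end + 1)]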

thecheatah

unread,
Jun 17, 2011, 11:50:31 AM6/17/11
to Google App Engine
I have looked at that method. The API does not mention whether the
keys are guaranteed to be consecutive or not.

Ravneet