Architecture approaches for puts


stevep

Nov 9, 2010, 3:07:06 PM
to google-a...@googlegroups.com
I would like some feedback about pluses / minuses for handling new records. Currently I need to optimize how the client request handler processes new entity put()s. Several custom indices for the model are used, so puts run too close to the 1,000 ms limit (they were running over the limit prior to the Nov. 6th maintenance – thanks, Google).

The entities are written with unique integer key values. Integers are generated using Google’s recommended sharded process. The client currently POSTs a new record to the GAE handler. If the handler does not send back a successful response, the client will retry the POST “n” times (at least twice, but possibly more). After “n” continued failures, the client informs the user that the record could not be created, saves the data locally, and asks the user to try again later.

The planned new process will use the Task Queue (a rough handler sketch follows this list):
1) Client POSTs new entity data to the handler. At this point, the user sees a dialog box saying the record is being written.
2) Handler uses the shards to generate the next integer value for the key.
3) Handler enqueues a task with the new key value and record data, and responds to the client with the key value.
4) Client receives the key value back from the handler, and changes the dialog to inform the user that the record write is being confirmed on the server (or, as before, retries the entire POST if the response is an error code).
5) Client waits a second or two (for the task queue to finish), then issues a GET to the handler to read the new record using the key value.
6) Handler does a simple key-value read of the new record, and responds to the client with either a found or not-found status.
7) If the client gets a found response, we are done. If not found, or an error response, the client waits a few seconds and issues another GET.
8) If after “n” tries no GET yields a successful read, the client informs the user that the record could not be written, and “please try again in a few minutes” (saving the new record data locally).
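A minimal sketch of steps 2-3 plus the task worker might look like the following (webapp + db; next_record_id() is a placeholder for the sharded-counter call, Record is a stand-in model, and a string key_name stands in for the real integer keys; on older SDKs the taskqueue import lives under google.appengine.api.labs):

```python
from google.appengine.api import taskqueue  # labs.taskqueue on older SDKs
from google.appengine.ext import db, webapp

class Record(db.Model):
    data = db.TextProperty()

class NewRecordHandler(webapp.RequestHandler):
    def post(self):
        new_id = next_record_id()  # placeholder: the app's sharded-counter id generator
        # Step 3: enqueue the actual write, then hand the key straight back to the client.
        taskqueue.add(url='/tasks/write_record',
                      params={'id': new_id, 'data': self.request.get('data')})
        self.response.out.write(str(new_id))

class WriteRecordTask(webapp.RequestHandler):
    def post(self):
        # The task worker does the expensive, fully indexed put() outside the user request.
        Record(key_name=self.request.get('id'),
               data=self.request.get('data')).put()
```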

I know this is not ideal, but given GAE’s limitations I believe it is a valid approach to minimize lost writes. Would very much appreciate feedback. I should note that the imposition of a few seconds’ delay while writing the record should not be an issue, given it is a single transaction at the end of a creative process which has engaged the user for several minutes. Also, none of our logic requires the model’s integer key values to be free of gaps (missing values).

TIA,
stevep

Robert Kluin

Nov 10, 2010, 1:39:47 AM
to google-a...@googlegroups.com
Why not use allocate_ids to generate the ids? That might simplify the
process a bit.
http://code.google.com/appengine/docs/python/datastore/functions.html#allocate_ids
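A minimal sketch of what that looks like in the Python db API (assuming a kind named Record); allocate_ids reserves an inclusive range of ids that the datastore will never assign automatically:

```python
from google.appengine.ext import db

# Reserve a single id for the Record kind; the first argument is just a
# "handmade" key used to identify the kind (and entity-group path).
first, last = db.allocate_ids(db.Key.from_path('Record', 1), 1)
new_key = db.Key.from_path('Record', first)  # safe to hand to the client or a task
```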

I've been using a similar process for batch updates for quite some
time. It works well for my case, but there is no user involved: it is
an automated sync process to another system's database, so I have a
unique id to use for lookups to avoid duplicates.

What happens if the client does not get the response in step 4?

Also, I assume that if you get a failure and resend the entity, you'll
use the previous id?

Robert

stevep

Nov 10, 2010, 1:34:13 PM
to Google App Engine
Thanks for the response.

We are not very write intensive, so new records occur infrequently. As
I read the allocate_ids documentation, it appears more oriented toward
batch processes. Also, Google's recommended shard process for
generating unique, sequential integers is a very nice bit of flexible,
fast kit that we like.

We use a client-side async URL call for the new record POST, which
allows us to define a time limit on the server response. If we do not
get a response within a few seconds, the call itself generates an
error, which we handle the same as a server error response. If the
response comes after this, it will be ignored.

If we do end up having to save the user data, the id will be saved
with it. Logic for the resend will check first to see if the record
did eventually get posted by the Task Queue. If not, it will send it.
The client will always check the local store for un-posted records
each time the user opens the web page. However, we certainly cannot
expect every user presented with a "try again later" message will
return, so holes in the key number sequence are certain.

This "too late" task queue may be a more common error than a total
failure, so it may make sense to add something like email follow-up.
However, the overall risk (based on recent GAE performance) is a
situation where an app suddenly starts throwing Deadline Exceeded
errors due to GAE infrastructure issues rather than developer-
contolled code issues. In that case, the ability to post the record
for email follow-up will likely fail also.

Down the road, I've thought to run an AWS / MySQL server for backup.
If a specific "high need" GAE post fails, then it would be relatively
simple to redirect to AWS. It's a good bit of redundancy work at this
point, though (and only works for a specific type of record), so we
will put that off until we start to get out into the real world and
see some volume**. Odds of both cloud sources having internal
infrastructure issues at the same time will hopefully be very low.

Again, thanks for your response.
stevep

** GAE clearly wins against AWS for low-volume apps because there is
no hourly charge. However, as an app increases its use, the constant
kill/start of instances after n transactions (~10K right now) appears
to me to amount to an hourly charge based on CPU-cycle "overhead".
However, I need a lot more data before understanding this.

On Nov 9, 11:39 pm, Robert Kluin <robert.kl...@gmail.com> wrote:
> Why not use allocate_ids to generate the ids?  That might simplify the
> process a bit.
>  http://code.google.com/appengine/docs/python/datastore/functions.html...

Eli Jones

Nov 10, 2010, 2:28:16 PM
to google-a...@googlegroups.com
How big is the average entity for this Model that you are putting to? (Are you just putting one entity at a time?)

If you create a separate Model with the same properties but no indexes, how long and how much CPU does it use up on a put (in comparison to your fully indexed model)?  Also, what do you mean by "puts run too close to the 1,000 ms limit"?  Do you just mean that your app uses up 1,000 CPU MS or 1,000 API_CPU MS?

Why are you generating a custom integer id instead of using the one that the datastore would create? (I am not saying you should not do this; I am just wondering what requirement makes you need to do it.)

Also, you mention that you are not very write intensive and new records occur infrequently.. so what is the main reasoning for this complicated put process (does the processing leading up to the put place you near the 30 second limit)?

Depending on what your restrictions are.. there are different recommendations that can/could be made.

Robert Kluin

Nov 11, 2010, 12:56:56 PM
to google-a...@googlegroups.com
Hi Steve,
Some follow-up responses are inline.


On Wed, Nov 10, 2010 at 13:34, stevep <pros...@gmail.com> wrote:
> Thanks for the response.
>
> We are not very write intensive, so new records occur infrequently. As
> I read the allocate_ids documentation, it appears more oriented toward
> batch processes. Also, Google's recommended shard process for
> generating unique, sequential integers is a very nice bit of flexible,
> fast kit that we like.

allocate_ids can be used to generate a single id, just like doing
SomeKind().put() will generate an id automatically. I am not aware of
any recommended way to use sharded counters to generate a unique
sequence of sequential numbers. The whole point of sharding is to let
you partially mitigate issues with transactions and contention, but
without transactions you could easily generate duplicate id values. I
would suggest using the Google-provided infrastructure to generate
unique ids; why reinvent the wheel?

Eli brought up a good point regarding this issue. I assume the reason
for this 'complicated' process is to return an id to the client as
quickly as possible, so that if the client re-submits the request you
do not get duplicated data? I also use a very similar processing
sequence for error checking: once I get a save request I generate a
new key, then process the data for errors. If the errors are 'fatal'
I return the key and the error information for the client to fix
without saving the entity. If the errors are 'acceptable' (i.e., more
like warnings) I will save the entity before returning to the client.
In either case I always return a key for the new entity.
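A rough sketch of that save flow, with hypothetical names (next_record_id(), Record, the 'body' field) standing in for whatever the real code uses:

```python
from google.appengine.ext import db, webapp

class Record(db.Model):
    body = db.TextProperty()

class SaveHandler(webapp.RequestHandler):
    def post(self):
        new_id = next_record_id()  # hypothetical id generator (allocate_ids, counter, etc.)
        body = self.request.get('body')
        errors = []
        if not body:
            errors.append({'fatal': True, 'msg': 'body missing'})
        if not any(e['fatal'] for e in errors):
            # No errors, or only 'acceptable' warnings: save before returning.
            Record(key_name=str(new_id), body=body).put()
        # Fatal or not, the response always carries the new key (errors follow it).
        self.response.out.write('\n'.join([str(new_id)] + [e['msg'] for e in errors]))
```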


>
> We use an client async URL call for the new record post which allows
> us to define a time limit on the server response. If we do not get a
> response within a few seconds, the call itself generates an error
> which we handle same as a server error response. If the response comes
> after this, it will be ignored.

I was more curious about the case when you make a request and then the
internet connection fails (or the user hits 'submit' twice really
fast). The server will still successfully complete the write, but
the client will not know. So how do you prevent the client from
re-submitting the request, which might result in a duplicate record?


>
> If we do end up having to save the user data, the id will be saved
> with it. Logic for the resend will check first to see if the record
> did eventually get posted by the Task Queue. If not, it will send it.
> The client will always check the local store for un-posted records
> each time the user opens the web page. However, we certainly cannot
> expect every user presented with a "try again later" message will
> return, so holes in the key number sequence are certain.

I totally agree with this approach; I think it is very similar to my
process. I do not use the task queue to write the initial record,
though -- it is put during the same request that returns the key. Your
approach should be safe because you return the key in the first
request, but there could be cases that result in lots of unneeded
re-submits, for instance if the task queue is backed up.


>
> This "too late" task queue may be a more common error than a total
> failure, so it may make sense to add something like email follow-up.
> However, the overall risk (based on recent GAE performance) is a
> situation where an app suddenly starts throwing Deadline Exceeded
> errors due to GAE infrastructure issues rather than developer-
> contolled code issues. In that case, the ability to post the record
> for email follow-up will likely fail also.
>
> Down the road, I've thought to run an AWS / MySql server for backup.
> If a specific "high need"  GAE post fails, then it would be relatively
> simple to use redirect to AWS. Its a good bit of redundancy work at
> this point though- and only works for a specific type of record), so
> we will put that off until we start to get out into the real word and
> see some volume**. Odds of both clouds sources having internal
> infrastructure issues at the same time hopefully will be very low.

Again, totally agree with this. I've been working on using TyphoonAE
to keep a 'hot-copy' running that I can fail-over to. This lets me
run the same code in both places, just need a little sync-logic. My
app has good service points, and the core business data comes from a
single kind. So I am able to simply fire tasks with 'update' requests
to the fail-over app. Everything stays in sync and gives me emergency
fail-over ability.

>
> Again, thanks for your response.
> stevep
>
> **  GAE clearly wins against AWS for low-volume apps because there is
> no hourly charge. However, as an app increases its use, the constant
> kill/start of instances after n transactions (~10K right now) appears
> to me as achieving an hourly change based on CPU cycles "overhead".
> However, need a lot more data before understanding this.
>
> On Nov 9, 11:39 pm, Robert Kluin <robert.kl...@gmail.com> wrote:
>> Why not use allocate_ids to generate the ids?  That might simplify the
>> process a bit.
>>  http://code.google.com/appengine/docs/python/datastore/functions.html...
>>
>> I've been using a similar process for batch updates for quite some
>> time.  Works well for my case, but in my case there is not a user
>> involved.  It is an auto-mated sync process to another system's
>> database, so I have a unique id to use for lookups to avoid
>> duplicates.
>>
>> What happens if the client does not get the response in step 4.
>>
>> Also, I assume if you get a failure, and resend the entity you'll use
>> the previous id?
>>
>> Robert
>>
>

stevep

Nov 11, 2010, 2:35:49 PM
to Google App Engine
On Nov 10, 11:28 am, Eli Jones <eli.jo...@gmail.com> wrote:

> How big is the average entity for this Model that you are putting to? (Are
> you just putting one entity at a time?)

The data will vary based on user input. For the model in question, the
size is generally 3k or less. However, we have a large number of
custom indices to support user searches on multiple combinations of
properties. I'm pretty sure the index updates are the real resource
cost for the new record put().

>
> If you create a separate Model with the same properties but no indexes, how
> long and how much CPU does it use up on a put (in comparison to your fully
> indexed model)?  Also, what do you mean by "puts run too close to the 1,000
> ms limit"?  Do you just mean that your app uses up 1,000 CPU MS or 1,000
> API_CPU MS?

If I look at the appstats detail for my POST function, the api-cpu
cost for the most recent put is approximately 585 ms. There are some
other DB accesses in the full function call, so I segmented out this
data. This 585 ms is the bulk of the total api-cpu cost for the full
function. I am sorry, but right now I do not have anything set up to
test writing a similar model entity without any indices.
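For what it's worth, running Eli's comparison would just need a clone of the model with every property marked indexed=False (property names here are made up), then timing a put() of the same payload against both kinds in Appstats:

```python
from google.appengine.ext import db

class RecordNoIndex(db.Model):
    # Same shape as the real model, but no property maintains an index,
    # so a put() only writes the entity itself.
    title = db.StringProperty(indexed=False)
    score = db.IntegerProperty(indexed=False)
    body = db.TextProperty()  # TextProperty is never indexed anyway
```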

To be honest, I am not sure what the "throttling due to 1,000 cpu ms
response" really means relative to CPU or API_CPU ms, or whether it
involves any smoothing across multiple handler functions.

Just looking at the 585 ms api-cpu cost for the put() indicates we
would likely have throttling issues should the infrastructure slow
down due to overall load. So, we're moving the put()s to the task
queue. This is in accordance with what I believe Google engineers have
said in comments to this forum about the 1,000 ms throttling effect
(some posts perhaps suggesting 800 ms is a more practical limit vs.
the 1,000 ms "theoretical limit" -- the latter being my term).

The TQ approach also allows us to break up some of the overhead in the
current single function. Overall, we'll be much better off re:
throttling risk and task separation, but it does mean we cannot
confirm to the client, in the response to its single POST request,
that everything related to the function completed.

>
> Why are you generating a custom integer id instead of using the one that the
> datastore would create (I am not saying you should not do this, but I am
> wondering what the requirement is that makes you need to do it.)?

In retrospect, using our own generated integer key value may not have
been needed when the put() was done by the handler function called by
the client's POST. However, as we look to use the task queue for
executing the put, it will prove beneficial: it allows us to quickly
stage the task queue update, and to tell the client which key value to
request when it seeks to confirm the task queue function via GETs
subsequent to the initial POST.

>
> Also, you mention that you are not very write intensive and new records
> occur infrequently.. so what is the main reasoning for this complicated put
> process (does the processing leading up to the put place you near the 30
> second limit)?

The current process is pretty simple. If we did not face the issue
with our handler being throttled because we have a transaction or two
going through over 1,000 ms, then I'd leave it as is.

However (again according to my reading of their comments) Google's
engineers advocate moving logic that can take over 1,000 ms to the
task queue. I'm ok with that quite honestly.

As I think through the TQ step we will need to use, I am not sure how
a client can, with a single POST to the online handler, be assured
that the new record TQ put() completed. So that leads me to the
complicated process of the initial client POST being followed by a GET
to ensure completion.
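The confirmation GET itself can stay a bare key fetch; a sketch, assuming the same hypothetical Record kind as the handler sketch earlier (string key_name standing in for the integer keys):

```python
from google.appengine.ext import db, webapp

class Record(db.Model):  # same stand-in kind as the earlier sketch
    data = db.TextProperty()

class ConfirmRecordHandler(webapp.RequestHandler):
    def get(self):
        # Found means the task-queue put() has finished.
        record = Record.get_by_key_name(self.request.get('id'))
        if record is None:
            self.error(404)  # not written yet; the client waits and retries
        else:
            self.response.out.write('found')
```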

If I am missing something, and the client can simply send the POST
data to the Task Queue and assume 100% of the TQ tasks sent will be
completed, then please advise. Just having the client send the current
single POST call is so much easier.

We are nowhere near 30 seconds for anything, even with a cold start.
No client URL calls currently take more than a few seconds when
combined with a cold start. Prior to the Nov 6th update, all the new
record puts for this model were between 1-2 seconds, but they are now
coming in lower than 1 sec.

Again, this is all to avoid being throttled due to the new record
put()s should: 1) the current gains from the Nov 6th maintenance not
hold, or 2) variability in infrastructure load, even with the Nov 6th
gains, push our 585 ms api-cpu over 1,000. (Getting throttled when the
infrastructure overall is running slow is a double whammy for which
users might not wait!)

>
> Depending on what your restrictions are.. there are different
> recommendations that can/could be made.

I'll happily try to clarify the restrictions, but am not sure what is
being asked here.

Again, many thanks for your help.
stevep

stevep

Nov 11, 2010, 3:36:46 PM
to Google App Engine

Robert,

Overall let me say thanks. Your comments really helped for this
subject.

> allocate_ids can be used to generate a single id, just like doing
> SomeKind().put() will generate an id automatically.  I am not aware of
> any recommended way to use sharded counters to generate a unique
> sequence of sequential numbers.

We are using this code (suggested for incrementing counters, but it
works for sequential key values as well):
http://code.google.com/appengine/articles/sharding_counters.html
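Paraphrasing that article (this is not its exact code, and NUM_SHARDS is illustrative): each shard is a separate entity, an increment picks one shard at random and bumps it inside a transaction, and the "current value" is the sum over all shards:

```python
import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    count = db.IntegerProperty(default=0)

def increment():
    index = random.randint(0, NUM_SHARDS - 1)
    def txn():
        shard = CounterShard.get_by_key_name(str(index))
        if shard is None:
            shard = CounterShard(key_name=str(index))
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def current_value():
    # Summing across shards is NOT transactional, which is what makes
    # using this total as a unique id risky (see Robert's later reply).
    return sum(s.count for s in CounterShard.all())
```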

To be honest, I was not aware of allocate_ids when setting this up
(started coding as noobie_levelZero, now have advanced to
noobie_levelOne :-)

There seems to be very little overhead for this approach, so we will
stick with it for now. We do download this model's data for some
analytics processing. Sequential numeric keys are not necessary for
this, but they prove beneficial anytime we eyeball the analytics. I'm
not sure whether the allocate_ids key values would be numerically
sequenced. Once everything else is done, I'll look deeper into
comparing these two approaches, and will post a thread at that time.
That's surely a "don't hold your breath" schedule though.

> Eli brought up a good point regarding this issue.  I assume the reason
> for this 'complicated' process is to return an id to the client as
> quickly as possible, that way if the client re-submits the request you
> do not get duplicated data?

The bigger issue (as per my response to Eli) is to avoid throttling of
the client response handler.

Unless we screw up the handling of the generated key value from the
initial POST call, a resend will simply cause a duplicate put()
costing us cpu cycles, but not a duplicate record.

> I was more curious about the case when you make a request then the
> internet connection fails (or the user hits 'submit' twice really
> fast).  The server will still successfully completes the write, but
> the client will not know.  So how do you prevent the client from
> re-submitting the request, which might result in a duplicate record.

Right now we're using an Adobe AIR app for the client, using keys
generated by the client. The process first posts the data to the local
AIR SQLite DB; then, if the user is online, it attempts the GAE POST.
If the response works, the returned key value goes into the SQLite DB
as a reference field. The GUI is disabled while this happens (with an
appropriate dialog showing). This is all going to be changed with the
new browser-only version, so these issues will need to be addressed.

> I totally agree with this approach, I think it is very similar to my
> process.  I do not use the task-queue to write the initial record
> though, it is put during the request I return the key in.  Your
> approach should be safe because you return the key in the first
> request, but there could be cases that result in lots of unneeded
> re-submits.  For instance if the task-queue is backed up.

Thanks -- nice to know I am not missing something obvious. Writing the
new rec during the initial POST call is much preferred, as I noted in
my response to Eli, but I just think we would have too much "double
whammy" risk when GAE infrastructure is under load -- see the end of
my response to Eli.**

My thanks again,
stevep

** Wouldn't it be nice if the throttling limit were dynamic according
to how well GAE infrastructure was running -- something we cannot
control, so why do we end up paying for it? There is IMHO a perverse,
reverse incentive in the current setup, where profit and revenues will
maximize at the point where GAE infrastructure investments are
minimized to yield maximum "under load" conditions without causing
customers to decide other cloud services are clearly superior.
Note that this also applies to cold-start CPU overhead costs. A
dynamic throttling algo and a standard cold-start charge (based on
standard infrastructure performance, not varying due to load) would go
a very long way toward correcting this perversion. Here's an
interesting link about the importance of incentives that made me think
of GAE's current setup when I read it (weird but true):
http://www.npr.org/blogs/money/2010/09/09/129757852/pop-quiz-how-do-you-stop-sea-captains-from-killing-their-passengers

Eli Jones

Nov 11, 2010, 4:17:15 PM
to google-a...@googlegroups.com
Steve,

Mainly, I'm trying to get down to the core issue you are trying to avoid to see if going the Task Queue route pays you more than it can cost you.

Have you actually run into dynamic request limit issues or throttling of your app (when you were just doing vanilla puts)?  Or did you more generally get one of those log warnings saying something to the effect of "This process used a high amount of CPU and may be throttled (or may soon exceed its quota)."?  If you do see that error.. what is the exact one?

It would seem odd that Google would throttle your app across the board just because the occasional put took more than 1 second, and that seems wrong of them to do.

Though, I am very familiar with hitting the dynamic request limit due to thousands of near simultaneous calls to the app that each take more than 1 second to run.  Most of my app is all back end processing using chained and fanned out tasks so I'm definitely not here to tell you to not use Task Queue.

But, the put process you describe has several moving parts, and each part is a point of failure.  So, I'm trying to determine if you had to go this path due to actual throttling, or if you went down this path because you thought you had to in order to avoid the potential of being throttled.

(This is no direct judgement of what I perceive your abilities or foresight to be.. I ask questions like this since I am frequently rewriting complex processes that I initially came up with.. only to later realize that the reasons I had for creating the complex process no longer exist.  So, I am perpetually churning out a bunch of code to get something done.. and then post-optimizing it back down to a more manageable size after the fact.)


Robert Kluin

Nov 11, 2010, 5:14:44 PM
to google-a...@googlegroups.com
Steve,
No problem. Hopefully some of our comments / questions are beneficial. :)
By the way, as an economist I totally agree with your last remark;
I've had the same thoughts as you (re performance) many times. In
this case I think Google's incentive is competition -- predominantly
from AWS, and possibly from Azure. If the platform is too unstable or
performs too poorly, people will find a better solution.

If you are using the total from the counter as your id, you might
want to rethink using a sharded counter to generate your id
values. If you get two requests sufficiently close together, you will
get a duplicate id. Each shard is in its own entity group, and you
cannot read from other groups within a transaction. That means you are
getting your id value outside a transaction. Of course, if you are
actually using the shard_id + the shard's current value as a key_name,
you're fine, provided you get the value in a transaction.
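To make that concrete, a minimal sketch of a transaction-safe variant for a low-write app: a single, unsharded counter entity bumped inside a transaction, which hands back unique sequential values at the cost of roughly one increment per second on that entity group:

```python
from google.appengine.ext import db

class IdCounter(db.Model):
    value = db.IntegerProperty(default=0)

def next_id():
    def txn():
        counter = IdCounter.get_by_key_name('record_ids')
        if counter is None:
            counter = IdCounter(key_name='record_ids')
        counter.value += 1
        counter.put()
        return counter.value
    # run_in_transaction retries on contention and returns txn()'s result.
    return db.run_in_transaction(txn)
```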


Robert

Stephen Johnson

Nov 12, 2010, 11:45:53 AM
to google-a...@googlegroups.com
I apologize if this has been asked and answered in this thread. I tried looking through it but it is very long. From what I can tell you haven't answered what your latency is for these requests, and I believe that is very important before you do all the work which is outlined below. From what I understand, throttling is affected by the latency of the requests, not by CPU or API CPU usage. For instance, you could have a request which takes 2,000 ms API CPU but whose latency is 300 ms. The first number in the logs, before the CPU MS and API CPU MS, indicates the latency for the request. You say that you have a lot of custom indexes for these entities. My assumption would be that these custom indexes are updated by the datastore in parallel, and as such you could have a very high API CPU usage while your latency is very low. For example, looking at my log I have a request which has numbers like this:

238ms 678cpu_ms 305api_cpu_ms
The first number is the number after the 200 return code and is the latency. It is roughly a third of the cpu_ms total and falls (I am presuming) safely within the recommended latency times. At least this is the way I understand it. Also, if this has been asked/answered below, I apologize for repeating.

Steve

stevep

Nov 12, 2010, 3:30:58 PM
to Google App Engine


On Nov 11, 1:17 pm, Eli Jones <eli.jo...@gmail.com> wrote:
> Steve,
...
> Have you actually run into dynamic request limit issues or throttling of
> your app (when you were just doing vanilla puts)?

Ah, I see what you are asking now. No, we have not come close to being
throttled -- currently have only thrown some spaghetti on the wall
without much volume.

Previously, I'd not been too worried about final volumes because our
overall application was set up more for quality than quantity of user
inputs. So if we got an odd throttle-back it would not be a big deal.

Current plan is to leverage a noodle that looks pretty sticky to do
something that could attain high volumes of inputs. Rightly or
wrongly, we want to try to architect in as optimal a fashion as
possible now vs. re-coding later.

The POST/GET confirmation process is more overhead, but I'm guessing
it won't be too bad. Plus, the ability to atomize elements of the
original full POST function into different task queue streams with
different priority levels is really appealing.

Cheers,
stevep

stevep

Nov 12, 2010, 3:40:22 PM
to Google App Engine


On Nov 11, 2:14 pm, Robert Kluin <robert.kl...@gmail.com> wrote:

>   If you are using the total from the counter as your id, you might
> want to think rethink using a sharded counter to generate your id
> values.  If you get two requests sufficiently close together you will
> get a duplicate id.

You're right. As noted, we had not initially planned a high volume
app.

This is a valid enough concern now to

stevep

Nov 12, 2010, 4:02:06 PM
to Google App Engine
Stephen,

Thanks for your input. Sorry about the thread length, but I've always
been too verbose.

I'd not known exactly which number in that appstats line applied to
the 1,000 ms throttling. Appreciate the input.

Our latency (first number) on the new record put is around 850 ms for
the put that had 585 ms api-cpu total. So I definitely now plan to
make this change.

The changes on the GAE side are really quite simple. We'll move the
code from the handler function to the task queue module. I've already
done that with some other function calls, and am very confident this
will be easy and fast.

On the client side, I've got to rip up a good bit anyway because we
will not be leveraging Adobe Air's local datastore with the new
product. So this is the right time to do it.

As I noted, there are some very appealing parts to using the task
queue, so overall I'm happy to do this. Example: some of the overhead
related to the current new record "all-inclusive" handler function can
be sent to low-priority queues. It may also prove very beneficial in
the future that we have some ability to tune individual TQ resource
availability.

Thanks again,
stevep

On Nov 12, 8:45 am, Stephen Johnson <onepagewo...@gmail.com> wrote:
> I apologize if this has been asked and answered in this thread. I tried
> looking through it but it is very long. From what I can tell you haven't
> answered what your latency is for these requests and I believe that is very
> important before you do all the work which is outlined below. From what I
> understand throttling is affected by the  latency time for the requests not
> for CPU or API CPU usage. For instance you could have a request which takes
> 2,000 MS API CPU but it's latency is 300MS. The first number in the logs
> before the CPU MS and API CPU MS indicates the latency for the requests. You
> say that you have a lot of custom indexes for these entities. My assumption
> would be that these custom indexes would be updated by the datastore in
> parallel and as such you could have a very high API CPU usage but your
> latency could be very low. For example, looking at my log I have a request
> which has numbers like this:
>
> 238ms 678cpu_ms 305api_cpu_ms
> The first number is the number after the 200