My Best App Engine Advice Would Be: Throttle Well

554 views

Kaan Soral

May 26, 2015, 4:57:06 AM5/26/15
to google-a...@googlegroups.com
I've been using App Engine for probably five years. I have one major app that has been running that whole time; it's very well polished, and its traffic and behaviour are very predictable *knocking on wood*
I have another app that I've been working on for 3 years that hasn't taken off yet; the new app is unpredictable in behaviour, vast, and unthrottled

While the old app has been handling millions of requests without errors or issues, the new one is failing on even the simplest tasks. The logs are filled with TransientErrors, instance failures, and transaction failures; the whole thing is chaotic

The old app has throttled queues and basic tasks; its throughput is calibrated to complete all the tasks in 5 hours using an optimal number of instances. The traffic is regular, easing in and out throughout the day. (Without throttling, the old app was in a similar state before.)
The new app is built to perform, so its queues have no limits; it trusts App Engine to scale as much as it can
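As a concrete illustration of the kind of throttling I mean, here's what a calibrated push queue looks like in queue.yaml (the name and numbers are made up for illustration, not my actual settings):

```yaml
queue:
- name: calibrated-work        # illustrative name, not a real queue of mine
  rate: 10/s                   # steady task dispatch rate
  bucket_size: 20              # small burst allowance
  max_concurrent_requests: 50  # cap on in-flight tasks
```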

Well, it turns out that trust isn't well placed. App Engine is supposed to scale on its own, yet when you leave the limits to App Engine, it fails to perform
You might ask: "Why would I use App Engine if I'm going to manually scale the limits myself?" That's a good question. If you're going to have to adjust the limits and throughput manually while your app grows, you might as well use AWS or a similar, more reliable service

This is mostly a rant post, but the advice is still solid: one has to manually calibrate the throughput of routines to prevent load spikes, and instance births and deaths should always be eased in and out; otherwise various App Engine services fail to perform
On the bright side, throttling also reduces costs significantly, so it's a good idea to always keep an eye on the app and manually calibrate all routines. On the other hand, if your app gains additional traffic without your supervision, these throttled routines will hog and halt
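To show the sort of manual calibration I'm talking about, here is a minimal client-side throttle sketch (illustrative only, not code from my app):

```python
import time

class TokenBucket:
    """Minimal client-side throttle: allow at most `rate` operations
    per second, with short bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.time()

    def allow(self):
        now = time.time()
        # Refill tokens based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5, capacity=5)
allowed = sum(1 for _ in range(100) if bucket.allow())
print(allowed)  # roughly the burst capacity when called back-to-back
```

A worker loop that checks `allow()` before dispatching work (and sleeps briefly when denied) eases load in instead of spiking it.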

------

On a more technical side, some of these errors are:
"Request was aborted after waiting too long to attempt to service your request." - they come in 100's - flood the logs - these are taskqueue push tasks, so the error is pretty stupid, if they can't be handled, they should be left in the queue
"google.appengine.api.taskqueue.taskqueue.TransientError" - these are from pull queues, there are invisible/untold limits of pull queues, this is also very concerning, because if your app grows, your scale might be bound by these limits, so try not to use pull queues too much
"DeadlineExceededError's" - these are pretty random and rare, yet when you run thousands of tasks, you get these in your logs, they might be omitted
Transaction errors and anomalies: these used to happen a lot, but I switched to pull-queue-based logic to prevent them; now they are replaced by pull queue anomalies

(It would have been great if the limits and capacities of each service were more transparent. I really think taskqueues need some eased bucket configurations - things that would help task batches execute in an eased manner; currently the only way to achieve this is to set flat and low throughput limits. Similar control could be offered at the instance scheduler level.)

------

Also, after 5 years, I gave up on App Engine support. There was a time we used to get actual support from this Google group; currently it's just random initial replies and no follow-ups. So unless you are paying $500 or something monthly for support, don't expect much - you are on your own to detect issues and prevent them through experimentation and volunteer help

Jesse Scherer (Google Cloud Support)

May 26, 2015, 2:48:31 PM5/26/15
to google-a...@googlegroups.com, kaan...@gmail.com
Greetings Kaan,

You have brought up concerns in the past over how we handle this group, so I want to share some improvements with you and thank you for your part in inspiring them.

- In the interest of reducing confusion about where to post, we have updated our help center documentation and welcome message to better enumerate your community support options. Note also that pinned threads referring to StackOverflow have vanished.
- We've asked the support staff to change their display names on this forum to follow the pattern "Bob (Google Cloud Support)." This, hopefully, will make it more obvious when one of my colleagues' requests for information is motivated by Google's desire to provide support here.
- In order to prevent the appearance of censorship, I've asked the support staff to reserve the use of moderation powers for only the most extreme circumstances, such as when a certain Italian-speaking user shares his political opinions.
- Finally, in order to foster a sense of community we have re-organized how Support assigns itself to groups. You'll find that a few Support Team members who have concentrated on this group are now nearly as active as yourself.

Regards,
Jesse

P.S. Another of my colleagues on the team will be following up on some specific points you made in part 2 of your mail.

Nick (Cloud Platform Support)

May 26, 2015, 7:22:29 PM5/26/15
to google-a...@googlegroups.com, kaan...@gmail.com
Hey Kaan,

I'll echo what Jesse has said about the new efforts in place to provide closer work between the community of developers and Cloud Platform Support, and I look forward to the good discussions that can be had here, as well as working together on stackoverflow and the public issue tracker to make the best use of those forums. 

Thanks for taking the time to bring up some issues you've been seeing. In regards to each of these issues, I'll enumerate them from one to four, according to the order they appeared in your post. I'll discuss my impression of what the issue may be, or what information is missing in order to make a good issue report. I'll also generally comment with some advice on where to move next in getting some support eyes on any potential issues.

    1. "Request was aborted after waiting too long to attempt to service your request"
      • If you've observed log lines with this error appearing when the tasks in-queue seem to overload the processing power of your available instances, this may indicate a platform issue, or it may indicate an issue in your own app's config/code. It's not possible to tell without more details, such as the following:
            * the .yaml/.xml config files (mostly the scaling settings are of interest)
            * a brief description of what the system was doing, preferring code snippets, numbers, and logs to informal verbal description
            * a time-frame and name of an affected instance
      • With such details, an adequate issue report can be created and dealt with in the public issue tracker, or a valid stack overflow question can be created, depending on whether you perceive it to be a platform or user code issue.
    2. "google.appengine.api.taskqueue.taskqueue.TransientError"
      • As documented here, it's possible this can happen when using Pull queues. This can be, as you correctly observe, related to rate-limiting in the infrastructure, although you feel the details of how rates are set are not sufficiently documented. It's likely that this derives from attempting to lease_tasks() from the queue too often, but it's true that we can't be sure.
      • I definitely understand you here and encourage you to create a public issue tracker thread which other users can star to demonstrate interest in more detailed documentation around this limit.
      • In the meantime, on a platform which does allow you to scale up aggressively, in the context of a data-center (network) with shared but well-isolated and ample resources, error responses such as these will occur periodically, and apps still need to handle them. A well-scaling app can ride out transient errors and rate-limiting with a small application of exponential back-off, non-spiking load, etc. I encourage you to take the advice of the docs and rate-limit when you see this error, as it's likely the per-queue lease rate is too high.
      • If you find that a behaviour still appears anomalous to you - that is to say if a behaviour of the system seems out of sync with the documented behaviour - then you should open a public issue tracker issue with sufficient information to allow investigation. If the issue report contains sufficient information, it will be likely to produce a positive result, and quickly.
    3. "DeadlineExceededError"
      • This issue can also occur from the same cause as 2., and it's worth investigating. My advice again is to create a public issue tracker issue as soon as you notice something you perceive to be anomalous about the behaviour of any App Engine system.
    4. "push/pull queue anomalies"
      • I'm unsure what you mean by this, although as above, if you feel there's an issue on the platform, I want to encourage you to report it adequately, as we're here and happy to help.
So, to conclude: once each of the issues you bring up has been investigated against the documented behaviour and, if necessary, developed into a proper issue report for the platform, the public issue tracker issue you create will be picked up and brought to the attention of platform developers/engineers/support. If, rather than a platform issue, it looks like the issue is related to your specific use of the platform's services, you should instead create a stackoverflow question on the related tags to get support in that form.
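As an illustrative sketch of the exponential back-off mentioned above (the taskqueue call is stubbed with a plain function here, since the real lease_tasks() API only exists inside App Engine):

```python
import random
import time

def lease_with_backoff(lease_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a lease operation with exponential back-off plus jitter.
    `lease_fn` stands in for e.g. a taskqueue lease call; for simplicity,
    any exception raised by it is treated as transient here."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return lease_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a randomized fraction of the delay so many workers
            # don't retry in lock-step, then double the delay.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)

# Example: a stand-in lease that fails twice, then succeeds.
calls = {"n": 0}
def flaky_lease():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("TransientError")
    return ["task-1", "task-2"]

print(lease_with_backoff(flaky_lease, base_delay=0.01))
```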

Finally, to address what you say in parentheses before the last "------": it's definitely possible to implement easing and rate-limiting on pull queue task execution, since the frequency of task lease/execution is tune-able in whatever timing logic you set up.

For push queues, to implement easing, you can define a stepped gradient of queues with different configured processing rates, bucket sizes, etc. The task-adding logic can then read the current fullness of the various queues (you can store information about queue fullness/rate in Memcache or Datastore, or just use Task Queue API calls), possibly along with API calls to get the number of instances in the handler module, and use that to determine which queue to step up to, or which to include in the rotation of queues receiving tasks, when adding tasks with given payloads.
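A minimal sketch of that stepped-queue selection (queue names, rates, and thresholds here are hypothetical; on App Engine the backlog numbers could come from taskqueue.QueueStatistics instead of a plain dict):

```python
# Hypothetical stepped gradient of push queues: increasing rates,
# and the task-adder picks the slowest queue with headroom left.
QUEUES = [
    # (queue name, configured rate in tasks/sec, max acceptable backlog)
    ("ease-slow", 1, 100),
    ("ease-medium", 10, 1000),
    ("ease-fast", 50, 10000),
]

def pick_queue(backlogs):
    """Return the name of the slowest queue whose current backlog is
    under its threshold; fall back to the fastest queue otherwise."""
    for name, _rate, max_backlog in QUEUES:
        if backlogs.get(name, 0) < max_backlog:
            return name
    return QUEUES[-1][0]

print(pick_queue({"ease-slow": 5}))
print(pick_queue({"ease-slow": 100, "ease-medium": 50}))
```

Tasks then step up to faster queues only as the slower ones fill, which eases load in rather than spiking it.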

Using the basic building blocks, some complex timing logic can be implemented, and if you feel that you'd like to make a feature request such as "provide easing parameter in queue configuration", describing how it works, the place for feature requests is the public issue tracker.

I hope you've come away from this feeling heard, and with a better understanding of where and how to get support with any issues you may encounter. I tried to address each of the issues you brought up to make sure you get useful information.

Have a great day!

- Nick


Kaan Soral

May 27, 2015, 11:07:14 AM5/27/15
to google-a...@googlegroups.com
Hi Jesse, Nick

As a small reply on the technical issues: my main complaint is how complicated things are. Yes, I could develop the easing logic myself, but I shouldn't have to. As for the limits: to non-users they seem like simple limits, yet for a user these limits block the scalability of the app, and they are only observable through experimentation - which means building something and discovering the cold fact that it doesn't scale

I see now that a lot has changed recently and the support efforts have really improved - your replies are proof enough of this, thank you

When I experience an issue again, I will give the public issue tracker another chance: https://code.google.com/p/googleappengine/issues/list

In the meantime, as a very small suggestion, it would be great if the public issue tracker issues were visited and cleaned up; it would show users like me that things are changing and re-ignite the momentum that we once had
For example, this was a small Images service issue I had - the operation is within limits but it used to fail, and I never got a reply: https://code.google.com/p/googleappengine/issues/detail?id=10301 - There are many more issues like these that I've forgotten about, and this is why, when I experience a problem, I resort to public campaigning instead of simple reporting

Anyway, thanks once again,
Kaan

Nick (Cloud Platform Support)

May 27, 2015, 1:29:23 PM5/27/15
to google-a...@googlegroups.com, kaan...@gmail.com
Hi Kaan,

Thanks for your quick reply and your feedback on this forum. It's truly an open question to what extent the complication of network programming can be reduced, given that some understanding is needed to implement more complex solutions from the basic building blocks available. We certainly wouldn't want to design a quite specific easing algorithm and force it on users when we could instead leave them a more malleable basis on which to build whichever solution they deem best. Regardless, this really comes down to different attitudes to programming and software design, and shouldn't be a block in our discussion. As I mentioned, don't hesitate to make a feature request if you'd like some more complex functionality to be available as a config-file option.

As to your comment on activity in the public issue tracker, rest assured that we're actively doing triage and solving issues day in and day out. The thing is, we need to work through issues in some kind of priority order, which is based partly on the number of users starring an issue, along with several other factors. The issue you created is only visible to Googlers and yourself, since it was tagged "Type-Production"; not having been starred by anybody else, it would not be picked up first as we triage the highest-priority issues. I hope you'll understand that we have to focus first on issues that many people are reporting, although eventually you will see action on your thread.

Since I think you didn't mean to use the private "Type-Production" ACL for this issue, and it doesn't contain any information private to you, I've gone ahead and fixed the tags on the issue, and it is now visible to other users to star if they have the same issue.

For now, as an aside related to the public issue tracker thread you linked, I'll note that JPEG files can, depending on several image factors, be larger than their sibling GIF image. This might be causing the API response size to break the 32MB limit. This is just conjecture, and reliably reproducing a failing example where both the input and output are below the size limit would be a good way to make sure this issue can be handled as smoothly as possible.

Overall, I hope I've helped put you at ease with regard to how we handle issues and the new focus on community support going forward. I don't think you should need to resort to "public campaigning" in the future if you make the best use of the appropriate forum for each issue. In a few words: "Google Groups for discussion, the public issue tracker for platform issues, stackoverflow for code issues" - with the understanding that it might not always be possible to determine whether an issue belongs in the public issue tracker or on stackoverflow, but doing our best nonetheless to choose the one that seems most appropriate.

We're dedicated to helping as many people as much as we can, and though we won't always be active on every new issue or question as soon as it pops up, you will see us active, with very good responsiveness, once we sit down to take on a particular issue.

Sincerely,

Nick