Scalability and the magic 1000ms request time


Jason C

Sep 16, 2010, 8:41:48 AM
to Google App Engine
The number of instances that App Engine makes available to your
application depends on whether you keep your average request time under
1000ms for user-facing requests.

Ikai Lan (I believe) said that taskqueue and cron job requests do not
count against this boundary. Ikai also said that this boundary was in
place because longer requests were bad for the ecosystem.

Since taskqueue and cron job requests do not count against this
boundary, in order for them to not be bad for the ecosystem, I'm
guessing that they are served from a different set of servers than
user-facing requests are.

We (appid: steprep) have a number of external machines that also hit
our urls. While we make every effort to keep user-facing requests
quick and responsive, we often spend many seconds serving the requests
that are built for external machines (by design).

It has only just struck me this morning that this could be having a
bad (perhaps dramatic) impact on our overall scalability.

First off, is it true that cron and taskqueue items are served on a
different set of servers? If so, is there any way to designate that a
particular url is being requested by a machine and can be routed to
this alternate set (of presumably slower) servers (e.g., a request
header)?
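For what it's worth, App Engine does tag its own offline requests with headers: task queue requests carry X-AppEngine-QueueName and cron requests carry X-AppEngine-Cron. That doesn't answer the routing question, but a handler can at least tell these request types apart. A minimal sketch (the headers dict is just a stand-in for whatever your framework exposes, e.g. self.request.headers):

```python
# Sketch: classify a request as offline (task queue or cron) vs user-facing
# by the headers App Engine sets on its own requests. The plain dicts below
# stand in for a real framework's request-header object.

def is_offline_request(headers):
    """True if the request came from the task queue or cron, not a user."""
    return ('X-AppEngine-QueueName' in headers or
            headers.get('X-AppEngine-Cron') == 'true')

print(is_offline_request({'User-Agent': 'Mozilla/5.0'}))         # False
print(is_offline_request({'X-AppEngine-QueueName': 'default'}))  # True
print(is_offline_request({'X-AppEngine-Cron': 'true'}))          # True
```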

If I'm way off on all of this, and if taskqueue and cron jobs are
served from the same set of servers, I'm not sure how the "bad for the
ecosystem" argument holds, and perhaps Google should revisit this
1000ms boundary condition altogether.

Nick Johnson (Google)

Sep 16, 2010, 8:44:34 AM
to google-a...@googlegroups.com
Hi Jason,

The same appservers are used to serve user-facing and offline traffic. The volume of user-facing traffic (that is below the latency threshold) you serve determines how many appservers we provision for your application, which in turn affects the capacity available for running offline (task queue and cron) tasks.

-Nick Johnson


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.




--
Nick Johnson, Developer Programs Engineer, App Engine
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047

Jason C

Sep 16, 2010, 8:54:52 AM
to Google App Engine
Hmmm, that poses a different issue for us.

Our application does _substantially_ more long-running, taskqueue-
based requests relative to user-facing requests. Indeed, our primary
user interaction is via the reports that we compile and email to them.

This just sounds like a generally bad situation for us.

Any thoughts?


bFlood

Sep 16, 2010, 8:56:59 AM
to Google App Engine
"which in turn affects the capacity available for running offline
tasks" - so, if you have a low volume site, you won't get that many
instances for your tasks? likewise, if you have some user-facing
requests that go longer than 1000ms (by design or otherwise), the
instances available for your tasks are impacted? or am I confused?


Ikai Lan (Google)

Sep 16, 2010, 10:05:52 AM
to google-a...@googlegroups.com
Jason, I think your situation is fine. Offline tasks have the property that, unlike user-facing tasks, they do not require instant execution. If you schedule an offline task for "now", that actually means "when there's capacity", and App Engine can allocate idle capacity to process your request. Thus, spinning up additional instances is unnecessary in most cases. Are you seeing that your tasks are backed up?


bFlood

Sep 16, 2010, 11:09:53 AM
to Google App Engine
Ikai - can we assume by your answer that the task queue is in fact
impacted by user-facing requests? the task queue is set up to handle 40
requests/second; how could you ever get this performance if the
instance count is dictated by user requests?

if this is the case, then the only way to get decent task queue
performance on a low volume site is to bombard it with small, no-op
requests so your app instance count increases.

is this just a patchwork fix until long-running processes (from
the Roadmap) are complete?


Jason C

Sep 16, 2010, 11:23:13 AM
to Google App Engine
Ikai,

I wouldn't say that I've seen periods of task backups, but in the logs
I do see a lot of "10-second" timeouts when App Engine attempts to
execute tasks.

By "10-second" timeout, I mean this guy:

"Request was aborted after waiting too long to attempt to service your
request. This may happen sporadically when the App Engine serving
cluster is under unexpectedly high or uneven load. If you see this
message frequently, please contact the App Engine team."

j


Eric Sukmajaya

Sep 16, 2010, 7:12:38 PM
to Google App Engine
Can anyone tell me how to determine the average request time for user-
facing requests of a particular app?

I understand we have a "Milliseconds/Request" chart on the admin
console but there's no differentiation between user facing requests
and offline tasks.

Is this something that the app must track internally?
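If the app does have to track it itself, one rough approach is a running per-category average kept by the app. A sketch (pure Python; the categories and numbers are invented for illustration, and in a real handler the category would come from the X-AppEngine-* request headers):

```python
from collections import defaultdict

# Sketch: keep a running per-category average of request latency inside
# the app itself, since the admin console chart lumps everything together.

class LatencyTracker(object):
    def __init__(self):
        self.totals = defaultdict(float)  # category -> total ms
        self.counts = defaultdict(int)    # category -> request count

    def record(self, category, elapsed_ms):
        self.totals[category] += elapsed_ms
        self.counts[category] += 1

    def average_ms(self, category):
        if not self.counts[category]:
            return 0.0
        return self.totals[category] / self.counts[category]

tracker = LatencyTracker()
tracker.record('user', 240.0)
tracker.record('user', 760.0)
tracker.record('offline', 4000.0)
print(tracker.average_ms('user'))     # 500.0 - what the 1000ms rule cares about
print(tracker.average_ms('offline'))  # 4000.0 - tracked separately
```

In practice the counters would need to live somewhere shared (memcache or the datastore), since instances come and go.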


Jan Z/ Hapara

Sep 16, 2010, 9:41:21 PM
to Google App Engine
Hi Ikai - the behavior we are seeing suggests the "offline" tasks are
subject to the same 1000msec rule as external requests.

Queuing up a number of tasks reliably results in the "Request was
aborted after waiting too long to attempt to service your request"
error - which is actually fine, BUT then App Engine kicks in its
back-off algorithm.

This results in tasks that cycle for 20+ generations, with mean time
between run attempts of 19hr+.

How do we know the 1000 msec rule is in effect?

The situation improves drastically if we introduce a large number of
"no-op" tasks that complete in ~40 msec and skew the averages.
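(The skew is just weighted-average arithmetic; a toy sketch, with all counts and latencies invented for illustration:)

```python
# Toy arithmetic for the skew described above: a batch of slow tasks puts
# the mean latency over 1000ms, and a pile of ~40ms no-ops pulls it back
# under. All numbers are made up.

def mean_latency_ms(batches):
    """batches: list of (request_count, latency_ms_each) tuples."""
    total_ms = sum(n * ms for n, ms in batches)
    total_requests = sum(n for n, ms in batches)
    return total_ms / total_requests

without_noops = [(50, 3000)]          # 50 tasks at 3s each
with_noops = [(50, 3000), (200, 40)]  # plus 200 no-ops at 40ms each

print(mean_latency_ms(without_noops))  # 3000.0 -> well over 1000ms
print(mean_latency_ms(with_noops))     # 632.0  -> back under the threshold
```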

J


Jason C

Sep 20, 2010, 11:33:01 AM
to Google App Engine
Ikai,

Do you have a definitive answer on whether or not task/cron requests
count towards the 1000ms threshold? There seems to be some confusion
and counter-evidence here.

Including our cron/task requests, we run at 1500-2000ms / request.
This is largely because we have LOTS of taskqueue items and we tend to
do a fair amount of work in them. Further, when we do large spike jobs
(e.g., mapreduce), we see lots of deadline-related errors.

What is the best way to know if we're above or below this threshold?
(appid: steprep)

j

Ikai Lan (Google)

Sep 20, 2010, 11:55:56 AM
to google-a...@googlegroups.com
Task Queues and cron jobs should not. We encourage small tasks, but in general tasks that take several seconds to run should not impact your autoscaling. If you're seeing otherwise, please let us know.

--
Ikai Lan 
Developer Programs Engineer, Google App Engine




Jason C

Sep 20, 2010, 12:26:06 PM
to Google App Engine
I _do_ believe I'm seeing otherwise - in the form of lots of deadline-
related errors on large spike jobs (e.g., mapreduce and other
continuation-styled jobs).

Do you have any suggestions how I could measure this?

j


Kenneth

Sep 20, 2010, 3:24:49 PM
to Google App Engine
We're seeing a lot of 10-second and deadline errors today. Nothing
like last week, but it is still pretty bad.

There are 21 non-task 10-second errors and five 30-second deadline
errors in the past 8 hours. The deadline errors are on calls that
would normally take <500ms.

Jason C

Sep 20, 2010, 4:59:22 PM
to Google App Engine
For one of my urls, for a 1 hour period (12.50p to 1.50p log time
2010-09-20), I saw 89 DeadlineExceededErrors and 101 of the 10-second
timeouts. (appid: steprep)

WRT DeadlineExceededError - these requests are normally _well_ below
the 30s boundary.

j

Jason C

Sep 20, 2010, 5:19:06 PM
to Google App Engine
I did another count across all of our urls.

1 hour, 1.04p - 2.04p log time, 2010-09-20

DeadlineExceededErrors: 157
10-second timeouts: 179

Given the high rate of 10-second timeouts, can I assume that I am over
the 1000ms threshold?

These errors represent almost all of the errors from our system.

j
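(As a sketch, the same tally can be made on a downloaded log file - the sample lines below are fabricated; in practice the file would come from something like `appcfg.py request_logs <app-dir> requests.log`:)

```shell
# Sketch: count the two error signatures in a downloaded request log.
# Sample log lines are fabricated stand-ins for real App Engine logs.
cat > requests.log <<'EOF'
... DeadlineExceededError in /report ...
... Request was aborted after waiting too long to attempt to service ...
... 200 OK /noop ...
... DeadlineExceededError in /report ...
EOF
grep -c 'DeadlineExceededError' requests.log   # -> 2
grep -c 'Request was aborted' requests.log     # -> 1
```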

Darien Caldwell

Sep 21, 2010, 11:44:50 AM
to Google App Engine


On Sep 20, 2:19 pm, Jason C <jason.a.coll...@gmail.com> wrote:
> Given the high rate of 10-second timeouts, can I assume that I am over
> the 1000ms threshold?

no, the 10-second timeouts are when GAE can't even begin to service
your request. It's not related to the speed of your request handler.

On an aside, GAE does seem more prone to DeadlineExceededErrors now,
than it did before last week's maintenance. I would usually see maybe
1 a week. Now I see at least 3-5 a day.