Need help or consulting: Request was aborted errors


Richard

Jul 8, 2012, 4:14:05 PM
to google-a...@googlegroups.com
I have an app that produces "request was aborted" errors.  My users are getting pissed.  So am I.  I thought App Engine was meant to scale?

The app is running up to 40 instances to serve 500 users.  All 500 users will query the app within a 5 second window (like a mini-DDoS).  It is Python 2.7 and marked as thread-safe.  Why one instance can only serve under 20 users, I don't know!  Average memory usage is 44MB on F1 instances.

Latency is set at max 500msec.  I have tried setting idle instances to 20-Automatic.  Billing has been around $30 for the last 12 hours.  However, this is CRAZY.  It cost me $30 to push out 1.25GB of data to under 500 users (500 is the peak!)?!

The stupid part is the code looks like this:
    global GAME_DATA
    global GAME_DATA_TIMESTAMP
    delta = datetime.datetime.now() - GAME_DATA_TIMESTAMP
    if delta.total_seconds() < 5.0:
        # return cached data
        game = GAME_DATA
    else:
        # get new
        cg_ref = CurrentGame.get_by_key_name(current_game_key_name)
        game = cg_ref.k
        GAME_DATA = game
        GAME_DATA_TIMESTAMP = datetime.datetime.now()

    data = {'s': 'OK', 'g': game.board}
    self.response.out.write(json.dumps(data))

Yes, that's right... we don't actually do 500 db queries...
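One thing worth flagging in that pattern: in a threadsafe Python 2.7 instance, several request threads can hit the stale branch at the same moment and all re-fetch. A minimal sketch of the same read-through cache with a lock (all names hypothetical, not Richard's actual app):

```python
import threading
import time

# Hypothetical sketch, not the actual handler: the same read-through
# cache, but guarded by a lock so concurrent request threads don't all
# refresh at once when the 5-second window expires.
_cache_lock = threading.Lock()
_cache = {'value': None, 'stamp': 0.0}

def get_cached(fetch, ttl=5.0):
    """Return the cached value, calling fetch() at most once per ttl seconds."""
    with _cache_lock:
        if _cache['value'] is None or time.time() - _cache['stamp'] >= ttl:
            _cache['value'] = fetch()   # only one thread does the refresh
            _cache['stamp'] = time.time()
        return _cache['value']
```

With the check-and-refresh done under the lock, at most one thread per instance re-reads the game when the window rolls over; the rest serve the cached copy.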

At this point, I think there is some internal throttle with GAE that only allows a limited number of http connections at once and just kills off the rest or something. 

Can anyone explain what is going on here, or point me to a competent consultant who I can pay to tell me how to fix this?

Jeff Schnitzer

Jul 8, 2012, 4:58:10 PM
to google-a...@googlegroups.com
Since you have a multi-threaded app, the very first thing I would try
is switching to F4 instances. You may not need the memory, but
concurrency is theoretically limited by CPU utilization. You may find
that one F4 gives you better results than four F1s.

Jeff
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/Bgk6iG2U7pkJ.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

Richard

Jul 8, 2012, 6:03:37 PM
to google-a...@googlegroups.com
Jeff: Thanks for the response/suggestion.

Here is one of my most common requests:
 
ms=30 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000263

Seems to be no problems with anything...

And here is another (different request):
ms=15376 cpu_ms=37 api_cpu_ms=17 cpm_usd=0.003428 pending_ms=8265

Seriously? A 15-second response? Of which 8 seconds were spent sitting in pending? And what happened to the other 7 seconds?

This does not seem to be a CPU issue (cpu_ms is super low!).

My customers' requests are just sitting around doing nothing... and I don't know enough to know where the latency is coming from! This really is very frustrating.


I have also tried to limit the number of idle instances and watched the error rate. There is NO correlation between the two. :(

-R


Kyle Finley

Jul 8, 2012, 6:23:02 PM
to google-a...@googlegroups.com
Richard,

Another option would be to move the Game State request to a Go instance, either as a backend or as a separate version. I believe a single Go instance should be able to handle 500 requests/second. You could then share the Game State between the Python version and the Go version through Memcache, caching to instance memory every 5 sec.
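That two-tier idea can be sketched in Python; here `memcache_client` just stands in for GAE's memcache module (its `get`/`set` calls), and every other name is hypothetical:

```python
import time

def make_two_tier_getter(memcache_client, fetch_from_datastore, ttl=5.0):
    """Hypothetical sketch: check instance memory first, then a shared
    memcache, and only hit the datastore on a double miss."""
    local = {'value': None, 'stamp': 0.0}

    def get_game_state():
        # Tier 1: per-instance memory, valid for ttl seconds.
        if local['value'] is not None and time.time() - local['stamp'] < ttl:
            return local['value']
        # Tier 2: shared memcache (one copy serves every instance).
        value = memcache_client.get('game_state')
        if value is None:
            # Tier 3: the authoritative datastore read.
            value = fetch_from_datastore()
            memcache_client.set('game_state', value, time=ttl)
        local['value'], local['stamp'] = value, time.time()
        return value

    return get_game_state
```

In the best case only one instance per ttl window touches the datastore, and every other instance gets the state from memcache or its own memory.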

- Kyle 

Per

Jul 8, 2012, 7:11:29 PM
to google-a...@googlegroups.com

Sounds odd. I'd also suggest trying F4 (just for the fun of it, even if the chance is low that you'll find something, it won't hurt to try it for a few hours).

Did you check out appstats yet?

Also, in my experience the whole fiddling-with-sliders was never productive, I'd suggest resetting to automatic/automatic values, it just seemed more reliable to me.

You didn't mention if you had checked the quota details screen, there are a couple of per-minute quotas that seem to bite people out of the blue.  I once had a backup-job that was stuck and cost me lots and lots of money, it was running in the background on the built-in version, maybe something like this can also contribute to throttling the core application?

Well, admittedly we're all poking in the dark here, maybe you can detail what else you have tried and where you've looked already?

All the best
Per

Richard

Jul 8, 2012, 9:19:04 PM
to google-a...@googlegroups.com
I could if memcache actually worked.  But it does not.  I originally tried to use it and found that I could not push the game state to memcache and then have the other instances pull it.  They would get versions of it that were up to 5 minutes old.  My timings are 5 second windows.
5 secs to submit all scores
5 secs to reap scores and calc leaderboards
5 secs to fan out results to clients

Experience shows that memcache is just broken for that sort of timing.

As for using Go instead of Python: I am not sure I follow why Go should be better.  The lag is not coming from CPU or queries.

Right now I am running 50 instances to serve 500 game clients. $48 for the last 18 hours.  11% of my requests result in "Request was aborted".  Yeah, that is 12 THOUSAND fails.

Back in the year 1995, ftp.cdrom.com could serve 2000 clients simultaneously on a Pentium Pro 200MHz.... and I cannot serve 20 clients on a 500MHz virtual box?

I still contend there is some internal throttling going on somewhere.

-R

Per

Jul 8, 2012, 10:13:28 PM
to google-a...@googlegroups.com

Not sure where I saw it, but I believe an application must respond within a second at most *on average*, or GAE reserves the right to throttle it. So, if you have some kind of slowdown inside your app, then requests start piling up. I'm not a Python developer, but if you had the equivalent of a synchronised map in there, with lots of concurrent access, then you might end up in a situation like this. Not sure about what's really to blame of course, but I'd strongly recommend setting up a controlled loadtesting environment (just copy the app, and start firing requests at it).  Add plenty of logging to your app, enable appstats, and then slowly increase the load. Maybe even strip the application down, starting only with read access. I'm sure you will find something, and I would love to hear what it was! I wish you luck!
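A controlled load test along those lines could be as small as this hypothetical harness (the request callable, e.g. a urllib fetch against a staging copy of the app, is left to the reader):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(make_request, total=100, concurrency=10):
    """Hypothetical load-test harness: fire `total` requests using
    `concurrency` worker threads and report latencies in milliseconds."""
    latencies = []

    def timed():
        start = time.time()
        make_request()   # e.g. a urllib request to the staging copy
        latencies.append((time.time() - start) * 1000.0)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(total):
            pool.submit(timed)
    # The with-block waits for all workers before we summarize.
    latencies.sort()
    return {
        'median_ms': latencies[len(latencies) // 2],
        'p95_ms': latencies[int(len(latencies) * 0.95) - 1],
    }
```

Slowly ramp `concurrency` up and watch where the median and p95 diverge; that knee is where the analysis should start.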

Richard

Jul 8, 2012, 10:26:01 PM
to google-a...@googlegroups.com
Yep, I know about that.  The internal limit is 1000msec, but you should aim for 750msec just to 'be safe'.

Here are my timings:
ms=15376 cpu_ms=37 api_cpu_ms=17 cpm_usd=0.003428 pending_ms=8265

37 msec of CPU .... after waiting 8.2 seconds to execute.

Some math:
- 500 incoming requests
- 40 servers
= ~ 15 requests/server
... but each request takes ~55 msec of CPU + API ....

That's 825 msec for THE WHOLE LOT! .... So why on earth are there 8 seconds of waiting before that request was serviced?


-R

Per

Jul 9, 2012, 6:12:29 AM
to google-a...@googlegroups.com

It's obviously very hard to diagnose this remotely. But if it's spending 8 out of 15 seconds waiting, then what about the other 7 seconds: where exactly are they lost? There's no point arguing that this should work (of course it should); IMO you need to start analysing. Maybe you have already, but it's not obvious from your posts what you've tried and what you haven't. In another thread you mentioned you hadn't used appstats for a long time - I'd do that first thing. Also, I'd add very detailed manual logging, so you know which bits run fast, and which ones don't.

I've encountered plenty of GAE glitches in the past, but also quite a few things were our own fault. I'm not sure how performance analysis works on Python, but in Java it would be time to bring on the good old Profiler. I'd make sure that on localhost a request really just takes a few milliseconds, even under load, and then take it live to a loadtesting environment and see what happens there under growing load. Given the numbers you posted, I'd not be surprised if some locking and synchronizing is part of the problem, but that's just a guess. Find out what operation really takes the most time, or at what load things start going wrong.

timh

Jul 9, 2012, 7:19:51 AM
to google-a...@googlegroups.com
Others have asked, and I will too: have you profiled your app with appstats yet?

A few other points,

- Have you compared the performance of multiple instances with threading off vs threading enabled?
- Are you using warmup requests to start up new instances?
- I personally have never experienced out-of-date memcache data.  Have you logged an issue with the App Engine team?  Getting 5-minute-old stale data from memcache should never happen.
- Finally, Python with threads isn't going to give you magical increases in concurrent throughput. If the request is CPU-bound then other threads won't get a look in.  Where threading in requests works is when your request starts waiting on RPCs.  Then other work can be scheduled.
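That last point is easy to see in plain Python: two sleeps (standing in for RPC waits) overlap almost completely, whereas CPU-bound work would be serialized by the GIL. A minimal, purely illustrative sketch:

```python
import threading
import time

def run_in_threads(task, n=2):
    """Run `task` in n threads and return the wall-clock time taken."""
    threads = [threading.Thread(target=task) for _ in range(n)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

# Two 0.2s "RPC waits" overlap: total wall time is ~0.2s, not ~0.4s.
io_time = run_in_threads(lambda: time.sleep(0.2))
```

Swap the sleep for a tight CPU loop and the wall time roughly doubles instead, which is why a CPU-bound Python 2.7 handler gains little from the threadsafe flag.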

T

Jeff Schnitzer

Jul 9, 2012, 12:35:38 PM
to google-a...@googlegroups.com
Did switching to F4 make a difference?

Are these requests startup requests? (they will be noted in the logs
"app engine started a new process...")

Have you put a log message at the beginning and end of the requests to
check timing? It would be good to know if the lag is in your code or
outside of your code.

You mentioned that your app is very bursty, with 500 users submitting
data in a 5s window. What's the time period between submissions?

Jeff

Kyle Finley

Jul 9, 2012, 5:28:26 PM
to google-a...@googlegroups.com
I'm not sure why memcache would have stale data. Maybe a datastore query is somehow responsible. The HRD is only strongly consistent when fetching by key.

As for Go, if I'm going to evangelize I should probably back it up. I put together a sample app to demonstrate its concurrency.


The Go runtime uses goroutines to handle requests. So in theory it should be able to handle thousands of requests/second. I would be interested to see if this holds true on a 500MHz virtual box.

Feel free to DDoS the Leader Board JSON URL. I have the app set to 1 instance - so any latency will be from 1 instance handling all of the requests. If you are still seeing the "Request was aborted" errors then App Engine must have some type of throttle in place.

- Kyle

Richard

Jul 9, 2012, 11:05:49 PM
to google-a...@googlegroups.com
Ok, so I sat with Jeff in a G+ hangout for a while tonight, and we could not figure it out - other than that my app is confusing the crap out of the scheduler, which then isn't sending requests to instances.



Kyle Finley

Jul 10, 2012, 12:19:29 AM
to google-a...@googlegroups.com
I'm seeing the same thing with the test app I created. Even though I had max instances set to 1, the scheduler still started up a second instance. During this time I had several requests that resulted in 500 server errors. Once the second instance was started, latency increased - as if the scheduler couldn't decide which instance to route the requests to. After a few seconds it subsided. I believe this transition period may be responsible for the error rates. The bad thing about it is:

1) The errors didn't appear in the logs. So it is quite likely that this is a common problem, but few developers are aware of it.
2) There's nothing that can be done to fix it.

In fairness I was testing under the free quota, so maybe the scheduler behaves better on paid apps.

Jeff Schnitzer

Jul 10, 2012, 12:25:38 AM
to google-a...@googlegroups.com
This is something different. There is no "max instances" setting -
it's "max idle instances". GAE will still spin up as many instances
as it deems necessary to meet demand... right up until you exhaust
quota.

FWIW, the paid scheduler does seem to operate considerably better than
the free scheduler. For reasons I don't quite understand, free apps
seem to be pretty much unusable. Enabling billing creates a much
smoother experience.

Jeff

Kyle Finley

Jul 10, 2012, 1:18:02 AM
to google-a...@googlegroups.com
Jeff, Thanks for the clarification. It's good to know that this is a free quota issue. The fact that the errors are not reported in the logs, makes it difficult to diagnose.

Jeff Schnitzer

Jul 10, 2012, 12:59:30 PM
to google-a...@googlegroups.com
Updated: This may not be a scheduler problem. It may be that we
can't wring more than 10 requests/second out of Python27
application instances.

Given a request that does nothing but write a single entity to the
datastore (typically 50ms, though it sometimes varies), what should we
expect throughput per Python27 instance to be?

Jeff

Wilson MacGyver

Jul 10, 2012, 1:03:15 PM
to google-a...@googlegroups.com
That'd be interesting, because for the most part I can't get an
instance to respond to more than 10 req/sec on gae-java either.

There was a brief week or so when I was seeing 15 req/sec, but it went away.



--
Omnem crede diem tibi diluxisse supremum.

Rerngvit Yanggratoke

Jul 10, 2012, 3:50:26 AM
to google-a...@googlegroups.com
I hope the claim that the "paid" scheduler is better than the "free" scheduler is not true. What about the "try before you buy" principle for the free quota?

On Tue, Jul 10, 2012 at 7:18 AM, Kyle Finley <kylef...@gmail.com> wrote:
Jeff, Thanks for the clarification. It's good to know that this is a free quota issue. The fact that the errors are not reported in the logs, makes it difficult to diagnose.




--
Best Regards,
Rerngvit Yanggratoke 

Barry Hunter

Jul 10, 2012, 1:11:09 PM
to google-a...@googlegroups.com
On Tue, Jul 10, 2012 at 6:03 PM, Wilson MacGyver <wmac...@gmail.com> wrote:
> That'd be interesting. because for the most part, I can't get an
> instance to respond to more than 10req/sec
> on gae-java either.

Jon McAlister here:
https://groups.google.com/group/google-appengine/msg/2949285edc35fc8f
seems to suggest that there is a hard 10req/sec limit enforced by the
scheduler.

Jeff Schnitzer

Jul 10, 2012, 1:25:40 PM
to google-a...@googlegroups.com
On Tue, Jul 10, 2012 at 10:11 AM, Barry Hunter <barryb...@gmail.com> wrote:
>
> Jon McAlister here:
> https://groups.google.com/group/google-appengine/msg/2949285edc35fc8f
> seems to suggest that there is a hard 10req/sec limit enforced by the
> scheduler.

That's a concurrency limit. If each request took 1000ms, they would
be equivalent... however, each request that we're talking about is
~50ms. What I expect to see is throughput of 1000 / 50 * concurrency
= (20 * concurrency) requests per second. If concurrency is 10, that
would be 200 QPS.
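That back-of-envelope can be written out as a hypothetical one-liner:

```python
def expected_qps(request_ms, concurrency):
    """Ideal per-instance throughput: how many request_ms-long requests
    `concurrency` concurrent slots can complete per second."""
    return 1000.0 / request_ms * concurrency

# 50ms requests at concurrency 10 -> 200 QPS per instance in theory,
# versus the ~10 QPS people report actually seeing.
```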

Jeff

Barry Hunter

Jul 10, 2012, 1:42:26 PM
to google-a...@googlegroups.com
Oops, sorry, my mistake! I just saw the figure 10 and remembered seeing
that number quoted as a limit. Didn't really stop to understand
the threads. Sorry.

Kyle Finley

Jul 10, 2012, 1:47:03 PM
to google-a...@googlegroups.com
I would like to see how Go fares.

Would someone with high bandwidth please try running this:


With one instance running I'm getting:
AB:
Requests per second:    31.31 [#/sec] (mean)
Datastore Admin:

Total number of instances: 1 (Average QPS 16.267, Average Latency 3.4 ms, Average Memory 7.8 MBytes)

QPS      Latency   Requests   Errors   Age        Memory
16.267   3.4 ms    6723       0        19:39:50   7.8 MBytes
But I think this might be limited by my bandwidth.

Thanks,

Kyle


Kyle Finley

Jul 10, 2012, 2:05:51 PM
to google-a...@googlegroups.com
Thank you, whoever ran that.

It looks like Go doesn't do all that much better. The scheduler started up 3 instances for that, with the peak (when I happened to check) at 25 QPS. I would be curious to hear what the real latency was.

Total number of instances: 3 (Average QPS 12.022, Average Latency 18.5 ms, Average Memory 6.8 MBytes)

QPS      Latency   Requests   Errors   Age        Memory
25.083   14.8 ms   9144       0        19:52:14   7.8 MBytes
10.850   26.9 ms   815        0        00:01:02   7.1 MBytes
0.133    14.8 ms   21         0        00:01:02   5.6 MBytes

This is under the free quota, though, but with no datastore ops, just serving a string from memory.



Barry Hunter

Jul 10, 2012, 2:11:07 PM
to google-a...@googlegroups.com
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking goscore.scotch-media.appspot.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests


Server Software:        Google
Server Hostname:        goscore.scotch-media.appspot.com
Server Port:            80

Document Path:          /leaders
Document Length:        0 bytes

Concurrency Level:      500
Time taken for tests:   4.177839 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      152000 bytes
HTML transferred:       0 bytes
Requests per second:    239.36 [#/sec] (mean)
Time per request:       2088.919 [ms] (mean)
Time per request:       4.178 [ms] (mean, across all concurrent requests)
Transfer rate:          35.43 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4   14  13.3      5      46
Processing:   120 1300 1110.8    907    4040
Waiting:      119 1299 1110.8    906    4040
Total:        124 1314 1114.3    928    4079

Percentage of the requests served within a certain time (ms)
  50%    928
  66%   1296
  75%   1437
  80%   1679
  90%   3432
  95%   3848
  98%   4043
  99%   4054
 100%   4079 (longest request)


(From a machine in a datacenter in Germany, with a consistent 8MB/s connection.)

Kyle Finley

Jul 10, 2012, 2:27:38 PM
to google-a...@googlegroups.com
Thank you, Barry!

I was hoping for better results, considering there are no service calls. Here's the instance chart.


It looks like I would only be billed for 1 instance. I just wish the scheduler was less aggressive.
