sudden and inexplicable large amount of 500 errors last night...

80 views
Skip to first unread message

Gerardo C

unread,
Jan 26, 2016, 12:44:52 PM1/26/16
to Google App Engine
Hi,

Last night for an hour starting at 2:33am PST we had hundreds of failed requests and then for another hour sporadic failed requests.  I have pasted the first failed request type, another type we saw later one and one that seemed to linger until we restarted all the instances by doing a push of the same code base that was in production.  Though the last error seemed to be confined to one instance ID.  Our last code push happened at 6:48pm PST the day before, so it was not related to a code push as far as we can tell.  The logs are pasted below.  I have removed the URL paths for security reasons.

Any ideas of what could have triggered this and how we can prevent it.

Thanks,

Gerardo


TYPE 1 ERROR (many of these):
E 2016-01-26 02:33:00.622  500       0 B 101.87 s I 02:34:42.494 E 02:34:42.494 /...
  2602:304:ce44:a2f0:d695:24ff:fe20:cbfe - - [26/Jan/2016:02:33:00 -0800] "GET /... HTTP/1.1" 500 - - "okhttp/2.6.0" "s.chowbot.com" ms=101872 cpu_ms=0 cpm_usd=0 instance=00c61b117cf18db3caccebe2327c6a61c33a2d app_engine_release=1.9.32 trace_id=-
    I 02:34:42.494 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
    E 02:34:42.494 Process terminated because the request deadline was exceeded. (Error code 123)


TYPE 2 ERROR (many of these too, though variations of it)
E 2016-01-26 02:51:00.834  500       0 B 68.43 s E 02:52:09.245 I 02:52:09.267 ...
  2602:304:ce44:a2f0:d695:24ff:fe20:cbfe - - [26/Jan/2016:02:51:00 -0800] "GET ... HTTP/1.1" 500 - - "okhttp/2.6.0" "s.chowbot.com" ms=68433 cpu_ms=3422 cpm_usd=0 instance=00c61b117c8e83d4f99203f8f778e18c217ba3 app_engine_release=1.9.32 trace_id=79bb49b033cf51b225722e76415955ec
    E 02:52:09.245 Traceback (most recent call last):
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
          result = handler(dict(self._environ), self._StartResponse)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
          rv = self.router.dispatch(request, response)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 5984, in custom_dispatcher
          rv = router.default_dispatcher(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
          return route.handler_adapter(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
          return handler.dispatch()
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
          return method(*args, **kwargs)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 2554, in get
          'open': check_open(merchant_id, timezone=m.timezone)[0]}
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 1532, in check_open
          schedule = session.query(model).filter(model.day_of_week == day_of_week, model.merchant_id == merchant_id).all()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/query.py", line 2398, in all
          return list(self)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/query.py", line 2515, in __iter__
          return self._execute_and_instances(context)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/query.py", line 2528, in _execute_and_instances
          close_with_result=True)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/query.py", line 2519, in _connection_from_session
          **kw)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/session.py", line 882, in connection
          execution_options=execution_options)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/session.py", line 887, in _connection_for_bind
          engine, execution_options)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/orm/session.py", line 334, in _connection_for_bind
          conn = bind.contextual_connect()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/engine/base.py", line 2026, in contextual_connect
          self._wrap_pool_connect(self.pool.connect, None),
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/engine/base.py", line 2061, in _wrap_pool_connect
          return fn()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 337, in connect
          return _ConnectionFairy._checkout(self)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 644, in _checkout
          fairy = _ConnectionRecord.checkout(pool)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 439, in checkout
          rec = pool._do_get()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 1055, in _do_get
          return self._create_connection()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 284, in _create_connection
          return _ConnectionRecord(self)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 410, in __init__
          self.connection = self.__connect()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/pool.py", line 538, in __connect
          connection = self.__pool._creator()
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/engine/strategies.py", line 89, in connect
          return dialect.connect(*cargs, **cparams)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/sqlalchemy/engine/default.py", line 377, in connect
          return self.dbapi.connect(*cargs, **cparams)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/MySQLdb-1.2.4b4/MySQLdb/__init__.py", line 81, in Connect
          return Connection(*args, **kwargs)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/MySQLdb-1.2.4b4/MySQLdb/connections.py", line 193, in __init__
          if type(k) is not int ])
      DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
    I 02:52:09.267 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
    

TYPE 3 ERROR (sporadic on one instance for an hour until all instances were restarted)
    E 2016-01-26 02:51:45.386  500       0 B 16.74 s E 02:52:02.119 E 02:52:02.121 /...
  2601:645:8100:3a00:d695:24ff:fe20:8550 - - [26/Jan/2016:02:51:45 -0800] "GET /... HTTP/1.1" 500 - - "okhttp/2.6.0" "s.chowbot.com" ms=16741 cpu_ms=148 cpm_usd=0 instance=00c61b117cc79f75158a730cf23eba719dbe70 app_engine_release=1.9.32 trace_id=62038f678bf3651d2f6ef64dc3c3e20e
    E 02:52:02.119 'America/Los_Angeles'
      Traceback (most recent call last):
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
          rv = self.handle_exception(request, response, e)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
          rv = self.router.dispatch(request, response)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 5984, in custom_dispatcher
          rv = router.default_dispatcher(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
          return route.handler_adapter(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
          return handler.dispatch()
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
          return self.handle_exception(e, self.app.debug)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
          return method(*args, **kwargs)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 2554, in get
          'open': check_open(merchant_id, timezone=m.timezone)[0]}
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/main.py", line 1528, in check_open
          local_time = convert_to_local(datetime.datetime.utcnow(), timezone)
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/meerutils.py", line 39, in convert_to_local
          return pytz.utc.localize(utc_time).astimezone(pytz.timezone(timezone))
        File "/base/data/home/apps/s~meerchant-sms-1/1.390245166414852862/pytz/__init__.py", line 180, in timezone
          raise UnknownTimeZoneError(zone)
      UnknownTimeZoneError: 'America/Los_Angeles'



Gerardo C

unread,
Jan 26, 2016, 12:51:18 PM1/26/16
to Google App Engine
For additional help, I have attached graphs of activity during that time.  The one thing that jumps out is how we shot up to 30 instances all of a sudden.

Gerardo
Screen Shot 2016-01-26 at 9.48.35 AM.png
Screen Shot 2016-01-26 at 9.48.04 AM.png
Screen Shot 2016-01-26 at 9.47.50 AM.png
Screen Shot 2016-01-26 at 9.47.34 AM.png
Screen Shot 2016-01-26 at 9.47.24 AM.png
Screen Shot 2016-01-26 at 9.47.13 AM.png
Screen Shot 2016-01-26 at 9.46.43 AM.png
Screen Shot 2016-01-26 at 9.46.22 AM.png

Nick (Cloud Platform Support)

unread,
Jan 26, 2016, 3:27:48 PM1/26/16
to Google App Engine
It seems from a cursory investigation to be a combination of aggressive down-scaling (given that the traffic decreases just before the errors and most of the errors shown are first requests sent to an instance) combined with a moderate increase in traffic, combined with latency between your app and the database causing requests to outlive their deadline and fail. Especially with a high-latency routine inside a request handler, and a deficiency of instances to handle requests, requests can time out easily since they wait before even reaching the request handler.

If you're interested to know whether we can do a more deep analysis as to whether our own infrastructure had an issue at that time, or whether it's an issue of your system, you should post to the Public Issue Tracker with lots of technical details, just like this thread. That forum is explicitly dedicated to such threads, while this forum is more for general / more high-level discussion of the platform and services. 

I'm making some abstract guesses as to the cause of things by inspecting the stack traces and graphs, so anything I've said is at best a guide to further investigation. If you'd like, you could post to the Public Issue Tracker, and someone will be along quickly to assist in a potentially-deeper analysis.
Reply all
Reply to author
Forward
0 new messages