Initial notes on TyphoonAE


hawkett

Oct 4, 2010, 11:31:34 AM
to typhoonae
Hi Guys,

Thanks for the help earlier with the task queue method issue. I
have been playing around a little more now that my app is working, and
have a few questions -

1. When I see new server starts (e.g. main.py being (re)imported),
I'm wondering whether these are additional application instances,
stops/starts of existing ones, or both. Your architecture diagram
appears to show that multiple application instances are used.

2. I'm running into a problem that is probably fairly unique to my
application. I initially 'boot' the database on dev/SDK with a
long-running request that also serves as a testing mechanism for my
server code. This process doesn't occur on live - I just upload the raw
data using remote_api. I *think* I am seeing this process terminated
prematurely and a new server instance being started, and I am guessing
this could be due to the termination of memory hogs discussed here:
http://groups.google.com/group/typhoonae/browse_thread/thread/33ff124636a78980/0b0a9d26b8f7b0d9?lnk=gst&q=memory#0b0a9d26b8f7b0d9
I know you say you are looking at an option to set the memory amount,
but would you be able to point me in the direction of how to raise this
limit (even if it requires a rebuild of the server)? I would like to
verify that this is the problem with my boot process. I am also
wondering if there is a way for you to terminate a memory hog *after*
it has finished processing all its requests, rather than just killing
it - perhaps set a flag on the server to accept no more requests and
poll it until the active request count drops to zero, or something like
that? This might be why GAE uses timed termination - to ensure it
doesn't kill servers actively running requests.

3. Given that TyphoonAE can handle multiple instances (including stop/
start), it would be great to have some unique identifier in the
application logs to detect that a different server instance is running
the request - I think this is why GAE live uses a 'request thread' log
style - it is really hard to read aggregate logs from multiple threads
- especially if there is no indication which thread a log message is
from. If there is a unique identifier, we can grep to get a log for
each server individually. It would be even better if every request
thread had a unique identifier in the logs, so we could grep per
server and per request :)

4. Is there a way to set the request timeout to something other than
30s? I know this number matches GAE live, but the SDK has no such
limitation. To mimic the SDK behaviour it would be good to turn this
off. This is not a big issue, as the server continues to finish the
request regardless of the front end timeout, but just something I was
wondering.

5. Almost all server errors show up as Gateway errors - is there a way
to print the python stack trace to the browser?

6. I am sometimes seeing caching behaviour that is not evident in the
SDK - e.g. if I load a page and get the gateway error, look in the
logs for the cause, modify my code and hit refresh - most of the time
the refresh just shows the gateway error, and the server logs show no
activity. I'm just wondering where this caching is occurring and if I
can turn it off? i.e. every browser refresh of every url will always
hit the server.

Anyway - enough for now :) Thanks for the great work,

Colin

Tobias

Oct 5, 2010, 9:33:23 AM
to typhoonae
Hey Colin,

First and foremost, many thanks for your first impressions and initial
notes on TyphoonAE! This is all very helpful. Let me try to give you
some answers inline.

On Oct 4, 5:31 pm, hawkett <hawk...@gmail.com> wrote:
> ...
>
> 1. If I am seeing new server starts (e.g. main.py is being
> (re)imported) I'm wondering if this is additional application
> instances, or stop/start of existing ones, or both? Your architecture
> diagram looks like it shows multiple application instances are used.

The number of appservers per app can be configured in the fcgi-program
section of the corresponding supervisor config file (e.g.
etc/1.latest.appid-supervisor.conf). The default number of appservers
per app is 2:

numprocs = 2

For instance, if you experience heavy workload on the appservers over
a number of requests, you might want to increase the number of
appserver processes manually. However, it's on us to find a nifty
algorithm for automatic scaling.
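
For orientation, such a section looks roughly like the sketch below.
The program name, command and socket path are placeholders I've made
up; the real section is generated by apptool and includes the actual
appserver command line:

; illustrative sketch, not a generated file
[fcgi-program:appid.1]
command = bin/run_appserver_placeholder
socket = unix:///tmp/appid.1.sock
process_name = %(program_name)s_%(process_num)02d
numprocs = 2

After changing numprocs, the supervisor configuration needs to be
reloaded for the new process count to take effect.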

All appserver processes are connected to the NGINX HTTP frontend
server via shared sockets managed by the supervisor daemon. Generally,
we don't restart appserver processes. That's different from GAE and the
main reason why apps stay "hot" on TyphoonAE. There are only three
reasons why an appserver is restarted:

a) appserver crashes due to an uncaught exception which causes an
automatic restart (see the supervisor docs for configuring exit codes)
b) manual restart
c) appserver memory consumption exceeds a configurable threshold
(monitored by a separate memmon process)

> 2. I'm running into a problem that is probably fairly unique to my
> application. I initially 'boot' the database on dev/SDK with a long
> running request that also serves as a testing mechanism for my server
> code. This process doesn't occur on live - I just upload the raw data
> using remote_api. I *think* I am seeing this process terminated
> prematurely and new server instance being started, and I am guessing
> this could be due to this (terminating memory hogs)http://groups.google.com/group/typhoonae/browse_thread/thread/33ff124...
> - I know you say you are looking at an option here to set the memory
> amount - but would you be able to point me in the direction of how to
> raise this limit (even if it requires a rebuild of the server), as I
> would like to verify that this is the problem with my boot process. I
> am wondering if there is a way for you  to terminate a memory hog
> *after* it has finished processing all its requests, rather than just
> killing it? Perhaps set a flag on the server to accept no more
> requests, and poll it until the active request count drops to zero, or
> something like that?  This might be why GAE uses timed termination -
> to ensure it doesn't kill servers actively running requests?

The memory limit can also be easily configured in the
1.latest.appid-supervisor.conf file. Here is an example:

[eventlistener:appid.1_monitor]
command=/Users/tobias/projects/appengine/typhoonae-dev/bin/memmon -g appid=200MB
events=TICK_60

Just raise the limit by modifying the appid=x part. If you want to
disable the event listener completely, either delete this section or run
bin/supervisorctl stop appid.1_monitor.

Providing a flag to finish the current request before killing the
process would be a great enhancement. I have to ponder on this,
though. An appropriate signal handler in fcgiserver.py might work.
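
To make the idea concrete, here is a rough sketch of what such a
handler could look like - this is not the actual fcgiserver.py code,
and the names (the SIGUSR1 choice, serve_request, handle_one_request)
are illustrative:

import signal
import threading

_draining = threading.Event()
_active = [0]                  # number of requests currently in flight
_lock = threading.Lock()

def _request_drain(signum, frame):
    # memmon (or an operator) would send this signal instead of SIGKILL
    _draining.set()

signal.signal(signal.SIGUSR1, _request_drain)

def serve_request(handle_one_request):
    # Wrap a single request; exit cleanly once draining and idle.
    _lock.acquire()
    _active[0] += 1
    _lock.release()
    try:
        handle_one_request()
    finally:
        _lock.acquire()
        _active[0] -= 1
        idle = (_active[0] == 0)
        _lock.release()
        if _draining.is_set() and idle:
            # supervisord sees the exit and starts a fresh process
            raise SystemExit(0)

The flag-plus-poll variant Colin describes would work the same way,
just with the counter exposed to an external monitor instead of being
checked in-process.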

> 3. Given that TyphoonAE can handle multiple instances (including stop/
> start), it would be great to have some unique identifier in the
> application logs to detect that a different server instance is running
> the request - I think this is why GAE live uses a 'request thread' log
> style - it is really hard to read aggregate logs from multiple threads
> - especially if there is no indication which thread a log message is
> from. If there is a unique identifier, we can grep to get a log for
> each server individually. It would be even better if every request
> thread had a unique identifier in the logs, so we could grep per
> server and per request :)

Great idea! Maybe more granular log levels would be helpful, too.
That's definitely next on my list after pulling in Joaquin's great work
on the Celery integration. Would you prefer to still keep appserver
logs separate from HTTP logs?
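
For illustration, per-process and per-request tagging could be as
simple as the sketch below - not TyphoonAE code, and the WSGI app is
purely illustrative:

import logging
import uuid

# Tag every line with the process id, so aggregated logs can be grepped
# per appserver process; %(process)d is a standard logging attribute.
logging.basicConfig(
    format='%(asctime)s pid=%(process)d %(message)s',
    level=logging.INFO)

def application(environ, start_response):
    request_id = uuid.uuid4().hex[:8]   # unique per request
    logging.info('req=%s %s %s', request_id,
                 environ.get('REQUEST_METHOD'), environ.get('PATH_INFO'))
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['ok\n']

With both tags in place, grepping for pid=... gives one appserver's log
and grepping for req=... gives one request's log.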

> 4. Is there a way to set the request timeout to something other than
> 30s? I know this number matches GAE live, but the SDK has no such
> limitation. To mimic the SDK behaviour it would be good to turn this
> off. This is not a big issue, as the server continues to finish the
> request regardless of the front end timeout, but just something I was
> wondering.

I'm a bit confused on this one, because we currently configure a
default server-side keepalive timeout of 65 seconds. Where do you
see requests being cut off after 30 seconds?

> 5. Almost all server errors show up as Gateway errors - is there a way
> to print the python stack trace to the browser?

This needs to be fixed in the fcgiserver.py as well, I guess. Would
you like to add an issue for that?

> 6. I am sometimes seeing caching behaviour that is not evident in the
> SDK - e.g. if I load a page and get the gateway error, look in the
> logs for the cause, modify my code and hit refresh - most of the time
> the refresh just shows the gateway error, and the server logs show no
> activity. I'm just wondering where this caching is occurring and if I
> can turn it off? i.e. every browser refresh of every url will always
> hit the server.

After modifying code you should manually restart the appserver
processes for the changes to take effect.

This can be done with the following command (it restarts the whole
'process group'):

bin/supervisorctl restart appid.version:

Alternatively, the apptool takes the --develop option, which disables
module caching.

> Anyway - enough for now :) Thanks for the great work,

Thank you again and don't hesitate to post more questions and
suggestions. I hope that my answers above help so far.

- Tobias

Colin H

Oct 6, 2010, 4:51:02 AM
to typhoonae
Hi Tobias,

  Thanks for all the details - that should see me right for a while. 

  Re logs, as an application developer I prefer the HTTP log and the app log to be the same. I don't really have a use case for them to be separate, and it's a pain to have two files to consider - often you can spot a problem much earlier than you would otherwise, just because the HTTP stuff is right there. I can see that for others having them together might be equally painful, although a grep command could probably solve that problem.

  Re the timeout - it might well be 65 seconds - I kind of jumped at a number without measuring it :) Whatever the case, the browser reports a timeout, but the server keeps processing the request until it completes, which is good from my perspective; I was just wondering about raising it or turning it off - again, this one's not a biggie.

  Re the Bad Gateway issue - here 'tis - http://code.google.com/p/typhoonae/issues/detail?id=72

Thanks again,

Colin



Tobias

Oct 7, 2010, 8:08:24 PM
to typhoonae
Hi Colin,

Sorry for my late response. You can either modify the
keepalive_timeout variable in buildout.cfg (line 197) or directly in
parts/nginxctl/nginxctl.conf (line 14). The latter will be
overwritten when running bin/buildout, though.

Thanks for adding the issue regarding server errors (Bad Gateway).
I've modified fcgiserver.py (the appserver) to catch and handle this
kind of exception.

http://code.google.com/p/typhoonae/source/detail?r=aa446ae10e016f491e6ea7d1fdc81a6a762d5529

I'm also looking for a way to re-implement GAE's error_handlers.
Please let me know if this works for you so far.

Thanks!
Tobias


Tobias

Oct 16, 2010, 3:29:16 PM
to typhoonae
Hi Colin,

I apparently mixed something up in my last reply.

On Oct 8, 2:08 am, Tobias <tobias.rodae...@googlemail.com> wrote:

> Sorry for my late response. You can either modify the
> keepalive_timeout variable in buildout.cfg (line 197) or directly in
> parts/nginxctl/nginxctl.conf (line 14). The latter will be
> overwritten when running bin/buildout, though.

This is not correct. In fact, the fastcgi_read_timeout variable
(default 60s) is responsible for that. A request that runs for longer
than 60 seconds results in a 504 Gateway Time-out, because NGINX gets
no response from the upstream appserver in time. However, a (hanging)
appserver instance isn't killed and must be manually restarted. An
additional monitoring process could take care of that. I'm doing some
research now to find out what this means for the architecture.
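
For reference, fastcgi_read_timeout is an NGINX directive that can be
set in the fastcgi location block of the generated config
(parts/nginxctl/nginxctl.conf). In the sketch below the location block
and upstream address are made-up placeholders; only the timeout line is
the relevant knob:

location / {
    fastcgi_pass 127.0.0.1:8081;
    # default is 60s; raise it to allow longer-running requests
    fastcgi_read_timeout 300;
}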

Cheers,
Tobias

Tim Hoffman

Oct 16, 2010, 7:35:04 PM
to Tobias, typhoonae
Hi Tobias

It will be an interesting job to work out how to tell the difference
between a hanging app instance and a long-running process on the
instance. You could monitor CPU usage of the instance.

Rgds

T


Joaquin Cuenca Abela

Oct 16, 2010, 7:49:32 PM
to Tim Hoffman, Tobias, typhoonae
The standard procedure is to set up a handler for /_ah/health that
answers with OK. Ping this URL with a supervisord monitor, and if it
stops answering, restart the process.
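
A minimal sketch of such a handler, in the Python webapp style of the
time - only the /_ah/health path comes from the suggestion above, the
rest is illustrative:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class HealthCheck(webapp.RequestHandler):
    def get(self):
        # Answer anything cheap; the monitor only cares that we respond.
        self.response.out.write('OK')

application = webapp.WSGIApplication([('/_ah/health', HealthCheck)])

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()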

--
Joaquin Cuenca Abela

Tim Hoffman

Oct 16, 2010, 8:30:07 PM
to typhoonae


If each instance is single-threaded and the instance is running a long job, then it won't respond to the handler.

So it will look like it's wedged.

Rgds

T

Tobias

Oct 16, 2010, 9:46:59 PM
to typhoonae
As soon as we have more than one appserver instance for an app,
pinging a URL won't help, because the request might be passed to
another instance that is _not_ hanging (round robin). Monitoring the
CPU usage only helps to distinguish long-running requests from other
states, but there is no difference between a ready-to-go instance and
a hanging one in terms of CPU load.

It turns out that the GAE approach of letting processes die after a
certain period of idleness is quite reasonable. Some kind of
asynchronous heartbeat might be another way to solve that.

But most importantly, I really appreciate all your thoughts on that!

Thanks!
Tobias
