Open Question: Turbogears and scaling...


ajones

Mar 16, 2006, 6:28:39 PM
to TurboGears
I was wondering how well turbogears scales? Obviously interesting
caching tricks can be done on the web face, and other strange voodoo
can be done in the database, but is turbogears itself designed to
scale?

Can I host the controller, for instance, on multiple servers for better
response time? If not, would it even be a good idea to implement that
kind of functionality?

If you had to design a truly big system, or a lot of little
interconnected remote sites, and wanted TG to do it what are the
options?

Can I do this entire post entirely with questions? I think not.

Karl Guertin

Mar 16, 2006, 7:00:58 PM
to turbo...@googlegroups.com
On 3/16/06, ajones <ajo...@gmail.com> wrote:
> I was wondering how well turbogears scales? Obviously interesting
> caching tricks can be done on the web face, and other strange voodoo
> can be done in the database, but is turbogears itself designed to
> scale?

Performance is not a core priority for TG. Almost any technology can
scale with the correct server architecture and proper caching tricks.
TG is no exception here.

> Can I host the controller, for instance, on multiple servers for better
> response time?

Nothing in TG prevents you from doing this. The biggest restriction
would probably be using the Identity framework with the default
sqlobject provider, as that hits the database on every pageview. I'm
pretty sure you can write your own provider so that it doesn't do
this, but I don't know how you'd go about it.
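None of this is the real Identity provider API; purely as a sketch of the idea (all names here are hypothetical), a short-TTL memo around the per-user lookup would keep most pageviews off the database:

```python
import time

# Hypothetical sketch: memoize the per-user identity lookup for a short TTL
# so repeated pageviews don't each hit the database.
_identity_cache = {}
TTL = 60  # seconds a cached identity stays fresh

def cached_identity(user_id, load_from_db, now=time.time):
    """Return the identity for user_id, loading from the DB at most
    once per TTL window."""
    entry = _identity_cache.get(user_id)
    if entry and now() - entry[0] < TTL:
        return entry[1]
    identity = load_from_db(user_id)
    _identity_cache[user_id] = (now(), identity)
    return identity
```

The trade-off is that a revoked login stays valid for up to TTL seconds, which is exactly the "how dirty can your data get" question discussed later in this thread.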

> If you had to design a truly big system, or a lot of little
> interconnected remote sites, and wanted TG to do it what are the
> options?

I've never done this, only read about it, so I'll leave that to
someone else. Word on the street is that the magic google phrase is
'shared nothing'.

> Can I do this entire post entirely with questions? I think not.

So close!

Justin Johnson

Mar 16, 2006, 8:32:29 PM
to turbo...@googlegroups.com

> I was wondering how well turbogears scales? Obviously interesting
> caching tricks can be done on the web face, and other strange voodoo
> can be done in the database, but is turbogears itself designed to
> scale?

I don't see any reason why it shouldn't scale. A lot of it boils down
to the nature of your app. Python as a language is easily good enough
to power large systems. Ruby, Perl and PHP have all been used to
implement large systems and they're arguably slower than Python.

So a lot depends on your webserver configuration and your system
architecture.

Cache what you can, cut down database hits etc.

> Can I host the controller, for instance, on multiple servers for better
> response time? If not, would it even be a good idea to implement that
> kind of functionality?

Yes - here's how I'm approaching this:

I'm planning to run Apache with either FastCGI or mod_python. CherryPy
will sit behind Apache and handle my controllers. I'm not going to use
sessions, though I'll use minimal cookie data to store state. I'm
going to follow the mantra of sharing nothing.

The controllers will access the database server using SQLObject. The
database server is running Postgres. Another server will be used for
storing user images and files using the native flat file system.

Scaling then: when I need to scale I'll add another web server running
Apache and TurboGears/CherryPy. I'll load balance between these web
servers probably in a round-robin way to begin with. There are hardware
and software solutions to load balancing. So, different web servers
will be able to handle different requests from the same user. This is
why sharing nothing is a good idea. Try not to store state on your web
servers.
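The round-robin part is simple enough to sketch; assuming three interchangeable app servers (the hostnames here are made up), the balancer just cycles through the pool:

```python
from itertools import cycle

# Hypothetical pool of identical TurboGears/CherryPy backends.
BACKENDS = ["app1:8080", "app2:8080", "app3:8080"]

_pool = cycle(BACKENDS)

def pick_backend():
    """Return the next backend in strict round-robin order."""
    return next(_pool)
```

Because any backend can serve any request, this only works if the servers share nothing, which is the whole point of the architecture above.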

As the number of web servers increases they'll begin to stress the
database. The database server will become the bottleneck. Cache if you
can. Next on the cards, then, is some form of database replication,
although that's far enough into the future not to worry about just
now. :-)

Bottom line: I think the framework is good enough to trigger your
controllers and deliver templated responses in a timely fashion. The
rest is down to external factors that apply to any other web system out
there.

I've still got some decisions to make, such as FastCGI vs mod_python -
anybody have the pros and cons of these two?

Justin Johnson

Mar 16, 2006, 8:52:36 PM
to turbo...@googlegroups.com

I wrote previously:

> I've still got some decisions to make such as FastCGI vs mod_python -
> anybody have the pro's and con's between these two?

Sorry, after doing some Googling and looking around it appears I should
be considering SCGI rather than FastCGI.

Jorge Godoy

Mar 16, 2006, 9:07:56 PM
to turbo...@googlegroups.com
Justin Johnson <just...@ntlworld.com> writes:

> I wrote previously:
>
> > I've still got some decisions to make such as FastCGI vs mod_python -
> > anybody have the pro's and con's between these two?
>
> Sorry, after doing some Googling and looking around it appears I should
> be considering SCGI rather than FastCGI.

I dunno how all of that will play with FirstClass (TG + WSGI) since this is
looking like it will be the future...

Anyway, the best thing you can do is benchmark *your* application. And
you should also think about using mod_proxy; CP is pretty good and fast,
but combined with Apache's cache it goes even farther than one would
expect. One advantage of such a configuration is not having to take
Apache down to upgrade your TG website (at least I'm used to running
more than one application here, so instead of taking all of them down I
just restart the TG app and it is updated ;-)).

--
Jorge Godoy <jgo...@gmail.com>

Jonathan LaCour

Mar 16, 2006, 9:47:16 PM
to turbo...@googlegroups.com
> I was wondering how well turbogears scales? Obviously interesting
> caching tricks can be done on the web face, and other strange voodoo
> can be done in the database, but is turbogears itself designed to
> scale?

TurboGears itself is definitely designed to scale exactly the same
way you scale most "LAMP" type systems. You scale it the same way
that you would scale Rails applications, incidentally. There is
certainly some additional work that could be done with caching, etc.,
but overall it is definitely designed to scale if you do it right.

The real question is: is your TurboGears application designed to
scale? I will get to this below.

> Can I host the controller, for instance, on multiple servers for
> better
> response time? If not, would it even be a good idea to implement that
> kind of functionality?

If you properly design your application to be as stateless as
possible, load balancing TurboGears is actually quite easy. Here is
the short list of how to make this happen:

1. Don't use sessions, use cookies. If you absolutely *have* to
use sessions, don't use in-memory sessions. Use database
sessions if you have to.

2. Don't store any state inside your controllers, or in memory.

If you do those things, then you can fire up as many processes as you
like, on as many servers as you like, and use lighttpd's built-in
load-balancing to point at your application processes. I have done
this myself with SCGI and lighttpd, and it's crazy fast.

A lot of people tend to focus on caching and minimizing trips to the
database when it comes to scaling, but I think the above is much more
important at an architectural level. This is how you should
structure your application, and you can get to optimizing with
caching and minimizing database hits later. Both caching and
database optimization take considerably more time and effort than
throwing a few extra processes at the problem.
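Rule 1 above can be sketched with a signed cookie: the state travels with the request, any server holding the shared secret can verify it, and nothing is stored server-side. This is a minimal illustration, not TurboGears' actual cookie handling; the secret and helper names are made up:

```python
import hashlib
import hmac

SECRET = b"change-me"  # shared by every app server, never sent to the client

def sign(value):
    """Build a cookie payload: the value plus an HMAC over it."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value + "|" + mac

def verify(cookie):
    """Return the value if the signature checks out, else None."""
    value, _, mac = cookie.rpartition("|")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None
```

Any of the load-balanced processes can validate such a cookie without touching shared storage, which is what makes "fire up as many processes as you like" work.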

> If you had to design a truly big system, or a lot of little
> interconnected remote sites, and wanted TG to do it what are the
> options?

Too many options to speak of, but the WSGI future of TurboGears will
make this much easier than it is today (not that the situation is all
that bad today!).

Best of luck scaling! I don't suspect you will have too many
problems if you follow some of the simple rules mentioned in this
thread already.

--
Jonathan LaCour
http://cleverdevil.org


Kevin Dangoor

Mar 16, 2006, 9:56:59 PM
to turbo...@googlegroups.com
On 3/16/06, Jorge Godoy <jgo...@gmail.com> wrote:

>
> Justin Johnson <just...@ntlworld.com> writes:
> > Sorry, after doing some Googling and looking around it appears I should
> > be considering SCGI rather than FastCGI.
>
> I dunno how all of that will play with FirstClass (TG + WSGI) since this is
> looking like it will be the future...

CherryPy already uses WSGI on the front end, so First Class will not
actually change web server deployment.

(What First Class changes is how you can compose your applications and
reuse bits from elsewhere.)

Kevin

Robin Haswell

Mar 17, 2006, 4:56:04 AM
to turbo...@googlegroups.com
> I've still got some decisions to make such as FastCGI vs mod_python -
> anybody have the pro's and con's between these two?

My experience with setting up shared hosts tells me this: FastCGI is
slow but works with SuEXEC so is secure. mod_python is fast but all
processes run as the Apache user - draw your own conclusions from this.

What really solves the problem is the Apache Perchild MPM - mod_* speed
with suexec capabilities. Unfortunately for some unfathomable reason
perchild isn't finished and isn't being developed :'(

Is this accurate?

-Rob

Gerhard Häring

Mar 17, 2006, 5:35:21 AM
to turbo...@googlegroups.com
Robin Haswell wrote:
>>I've still got some decisions to make such as FastCGI vs mod_python -
>>anybody have the pro's and con's between these two?
>
> My experience with setting up shared hosts tells me this: FastCGI is
> slow but works with SuEXEC so is secure. mod_python is fast but all
> processes run as the Apache user - draw your own conclusions from this.

Not considering benchmarks of "hello-world" style apps, is
FastCGI/SCGI/mod_proxy really noticeably slower than using mod_python
for real applications?

> What really solves the problem is the Apache Perchild MPM - mod_* speed
> with suexec capabilities. Unfortunately for some unfathomable reason

> perchild isn't finished and isn't being developed :'( [...]

Apparently developing something like the Apache perchild MPM is hard.
Because it didn't get fixed and had several problems, it was finally
scrapped in Apache 2.2.

There were/are several attempts to develop a replacement, none of them
ready for production yet, according to their developers:

metuxmpm: http://www.sannes.org/metuxmpm/
itk mpm: http://home.samfundet.no/~sesse/mpm-itk/
peruser mpm: http://www.telana.com/peruser.php

-- Gerhard

Robin Haswell

Mar 17, 2006, 6:10:16 AM
to turbo...@googlegroups.com

Gerhard Häring wrote:
> Robin Haswell wrote:
>
>>>I've still got some decisions to make such as FastCGI vs mod_python -
>>>anybody have the pro's and con's between these two?
>>
>>My experience with setting up shared hosts tells me this: FastCGI is
>>slow but works with SuEXEC so is secure. mod_python is fast but all
>>processes run as the Apache user - draw your own conclusions from this.
>
>
> Not considering benchmarks of "hello-world" style apps, is
> FastCGI/SCGI/mod_proxy really noticably slower than using mod_python for
> real applications?

My experience is really only with PHP, where the startup times are quite
high (PHP doesn't have the concept of runtime module selection as such).
However, I can't imagine any situation where FCGID would be quicker than
mod_python, as mod_python's Python interpreter runs within the Apache
process itself.

Lateef

Mar 17, 2006, 2:34:15 PM
to TurboGears
Most things seem to be covered except for caching, so I will share my
witless drivel about caching.
First you need to decide how dirty your data can get. A realtime stock
quote system probably can't live with a lot of dirt, whereas blog
comments probably don't need to be instant (well, at least the
ones I write probably don't need to be read at all). The second problem
is that "the web" is stateless, so you can't send updates down the socket
to the web browser. Thanks to TG's widgets and awesome AJAX support this
can be minimized (and there is much rejoicing, thanks!).

It doesn't matter how you "architect" the system, writes end up in one
place. Sure you can do replication, but there is a world of setup,
configuration, schema changes, transactions and data synchronization, oh
my! You are still only writing in one place; plus replication is just
a linear optimization, while caching can get you exponential optimization
without all the lions, tigers and bears. Cache only after you have
tried some happy hacking to make things faster; my experience from
a couple of years in EJB (oh the humanity!) was cache it and forget. Most
of the performance issues with EJB have to do with locking data so
there is no dirty data, plus the threading architecture is... Sorry, I
digress often. Just cache as little as possible.

In the big project I work on we cache at a couple of different levels.
The highest level is caching the page, thus skipping the template engine,
controller code, and database lookups. In this project we use memcached
because it is very flexible in how we set up the caching system. No
matter what, you should create a wrapper around your caching
implementation so you can move from one type of cache to another without
changing any controller code. memcached supports a number of different
languages, which is nice for integration.
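A wrapper like that can be as small as this sketch. A plain dict stands in for the memcached client here, and the method names are just illustrative; the point is that swapping backends later means reimplementing only this one class, never the controllers:

```python
class CacheWrapper:
    """Thin facade so controller code never touches the cache client
    directly. Replace the dict-backed methods with memcached calls
    later without changing any caller."""

    def __init__(self):
        self._backend = {}  # stand-in for a memcached client

    def get(self, key):
        """Return the cached value, or None on a miss."""
        return self._backend.get(key)

    def set(self, key, value):
        self._backend[key] = value

    def delete(self, key):
        """Invalidate one key; a miss is not an error."""
        self._backend.pop(key, None)
```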

There are many ways to invalidate cached data. The easiest way is to
just set an expiration date. Since we are happy PostgreSQL users we use
a trigger system (sure, features like triggers make the database slower,
but I want the entire system to run faster, not just the database).
There are two ways that we could do this: the first is to write a Python
function that gets called when an update or delete happens on cached
data. The Python function in our case invalidates all the caches that
had that data. The second way is to have a trigger insert records into
an invalidation table. Every n seconds an external process
consumes the records by invalidating your caches (which, by the way, is
basically how Slony works for replication).
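The invalidation-table approach can be sketched in a few lines. SQLite stands in for Postgres here, and the table and key names are made up; in the real setup a trigger would be doing the INSERTs:

```python
import sqlite3

# Stand-in for the Postgres invalidation table the trigger writes into.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invalidations (id INTEGER PRIMARY KEY, cache_key TEXT)")

# A toy page cache keyed the same way the trigger names things.
cache = {"article:1": "<html>...</html>", "article:2": "<html>...</html>"}

def consume_invalidations(conn, cache):
    """Drain pending rows, dropping each named key from the cache.
    This is what the every-n-seconds external process would run."""
    rows = conn.execute("SELECT id, cache_key FROM invalidations").fetchall()
    for row_id, key in rows:
        cache.pop(key, None)
        conn.execute("DELETE FROM invalidations WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)
```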

* Optimize code and queries first
* Keep an abstract layer or two between you and the cache implementation
* Application servers scale easy as pie, database servers scale like
hernias
* TurboGears rules!

Good luck

Karl Guertin

Mar 17, 2006, 3:04:43 PM
to turbo...@googlegroups.com
On 3/17/06, Lateef <lateef....@gmail.com> wrote:
> The second problem
> is "the web" is stateless so you can't send updates down the socket to
> the web browser.

Half true. HTTP is stateless, but you can send updates down the socket.

http://alex.dojotoolkit.org/?p=545

Cool, eh?

Robin Haswell

Mar 17, 2006, 4:02:22 PM
to turbo...@googlegroups.com
My $0.02 (approx £0.012 where I come from) on this:

In my company, an application that scales is an application that you can
throw hardware at without having to think about it. We generally don't
bother with intricate caching and optimisations, because my time has a
cost and optimisation often buys less performance than, say, a Dell
SC1425. I guess what I want to know about this is, are there any parts
of TG's standard setup (with a separate DB server) that store changes on
the application machine? Stuff like sessions - are sessions stored
on-disk/memory? And if they are... why?

Also, I'm a firm believer that any app has a potential to become really
popular, and will potentially need more hardware. In other words,
optimisation is fighting a losing battle.

I'm not saying don't optimise - by all means, make sure your indexes are
correct, make sure you're running the right SQL statements and make sure
you're not doing needless work, but after that I don't believe in
investing time and effort in caching mechanisms when an expansion path
will give you more bang for your buck. Sure, there are applications when
caching is obvious, but those aren't the applications I'd usually choose
TurboGears for.

-Rob

PS. You don't always write to one place :-) Most people do, but there
are solutions that are fast and safe, you just need to think outside the
box (but I don't think I can say much more than that).

Lateef

Mar 17, 2006, 4:45:34 PM
to TurboGears
Yeah, I knew I would get roasted like a pig for not pointing out the
exceptions. I hadn't seen this stateful HTTP trick before; it is
cool!

ajones

Mar 17, 2006, 5:09:37 PM
to TurboGears
Some interesting commentary here, and a few cool ideas. I think I might
try to make this a recurring thing with slightly more of a focus on
individual pieces of turbogears instead of usage scenarios.

Lateef

Mar 17, 2006, 5:11:11 PM
to TurboGears
I am not a TG expert, but the options I believe are file, memory,
database and roll your own.

"my time has a cost and optimisation often buys less performance than,
say, a Dell SC1425"

Unfortunately my time is not worth an IBM 64-way mainframe (or I would be
one happy hacker). Bigger machines help, but as I said in my earlier
comment, this will give you only linear optimization; at some point you
will need _exponential_ optimizations. This also depends on the
complexity of the data relationships that your application needs. If you
need a machine that is 64 times faster, buy a 64-proc machine; but if
you need a machine 10000x faster, then start hacking.

"You don't always write to one place"

My experience, although limited, did include some trading systems over on
this side of the pond, and I have not seen any applications that wrote to
more than one place. I wish I got to work on a project that required
this kind of technical scaling, but alas I am a bottom feeder :) Data is
stored in an RDBMS or on a mainframe, so I cannot comment on such fancy
stuff.

Cheers,
Lateef

Robin Haswell

Mar 17, 2006, 6:17:26 PM
to turbo...@googlegroups.com

> "my time has a cost and optimisation often buys less performance than,
> say, a Dell SC1425"
> Unfortunatly my time is not worth a IBM 64way mainframe (or I would be
> one happy hacker). Bigger machines help but as my comment said before
> this will give you only linear optimization at some point you will need
> _exponential_ optmizaitions. This also depends on the complexity of the
> data relationships that your application needs. You need a machine that
> is 64 times faster buy

Nah mate you miss my point! Not bigger machines, *more* machines. A Dell
SC1425 is a pretty low-end piece of kit, the idea is you use multiple
machines.

Let's say you have an application that is currently running at 100%
above acceptable capacity. You can solve this problem in basically four
ways:

1. Buy hardware that is twice as powerful
2. Perform optimisation, caching - etc.
3. A combination of the 1) and 2)
4. Buy another similar server and run them both

In my experience, 4) is always the cheapest option, and requires less
hassle than 2) and 3) (and less hassle is the TG way!). The trick is to
make option 4 possible by asking questions like "What will happen if I
use two app or database servers - or both?" early on in the build
process. I do this for everything and it's served me well so far :-)
Part of my personal PHP standard library is some wrappers around session
management and database handling that mean:

1) All my session data is stored in the database, which means from then
on I can implement *all* my persistent storage in an RDBMS.

2) My database "reads" and my database "writes" are separated and
controllable, so if we need to add replication it's possible to direct
all writes to the master server and balance reads between the slaves.
(Yes I said there are alternatives to the master/slave setup, but in web
apps which are mostly read-heavy it's a pretty good solution anyway).
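Point 2) can be sketched as a small router that inspects the SQL verb. The class, names, and verb list here are illustrative, not from any real library; the idea is just that writes go to the master and reads round-robin across the slaves:

```python
import itertools

class RoutingConnection:
    """Route writes to the master, spread reads across slaves
    round-robin. Controllers only ever see this one object."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        """Return the server this statement should run against."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        return self.master if verb in self.WRITE_VERBS else next(self._slaves)
```

With all database access funneled through one such wrapper, turning on master/slave replication later is a configuration change rather than a rewrite.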

-Rob

PS. If you're interested in the "writing to one place" problem, you
should look at m/cluster
(http://www.continuent.com/index.php?option=com_content&task=view&id=211&Itemid=168).
We have our own solution, but in general it's a pretty awesome setup for
database scaling across multiple servers.

Bob Ippolito

Mar 17, 2006, 8:42:27 PM
to turbo...@googlegroups.com

Scaling horizontally, what you list as 4, is the only real option.
There's plenty of public record that shows that all the successful
guys (Google and LiveJournal come to mind) are using lots of
relatively cheap servers, rather than small numbers of giant
servers. If you design for that, you'll never have a problem so long
as you can afford to operate, and that's not so tough of a problem
because the costs are at worst linear. With any other option, the
price to upgrade grows exponentially and there's a ceiling on what
kind of power you can even buy to run an app that is mostly serial.

Good optimizations can do wonders in the short term, e.g. cut
immediate hardware costs in half... but you get that anyway if you
wait about a year. It's typically better to expand your service such
that it maximizes profits, rather than optimize your service to
minimize your overhead. There's only so low you can go in cutting
your overhead... but there's no well-defined ceiling for maximum
profits (look at Google!).

-bob

lateef jackson

Mar 18, 2006, 9:33:26 AM
to turbo...@googlegroups.com
I guess my confusion came from my original post where I said:
* Application servers scale easy as pie, database servers scale like
hernias
I should have been more specific: by application server I meant TG, PHP, wherever the business logic of the code lives. Maybe "easy as pie" was the wrong analogy to use.

The m/cluster technology is what I usually call a database proxy. C-JDBC was the first time I thought about using a db proxy. This is a really great solution for failover but not for scalability. If I am doing more database transactions per second I can't just add another database server, I have to buy a bigger database server; and if I am using a database proxy with 3 database servers, that means I have to buy 3 bigger database servers. You are technically correct that database writes don't end up in exactly one place, but those writes are copied, not split up. So now you have every write ending up on every single database node, which only scales if your writes stay the same and your reads increase. In all the applications I have worked on, reads and writes increase as load increases; sometimes you can have a spike in the reads and not the writes, it just depends on the application.

To write in more than one place you have to have the technology to be able to load balance the writes. Then reads will need to know what server to go to for what data. I think this would be a blast to code!

Fortunately reads are the majority of most applications, especially web applications. That is why some simple caching is easier and more effective than replication of data.


lateef jackson

Mar 18, 2006, 9:51:42 AM
to turbo...@googlegroups.com
LiveJournal is an excellent example; it is where I stole the idea of using memcached: http://www.linuxjournal.com/article/7451

My vision for scaling up is to use URL partitioning so that similar data all goes to the same host(s), with no shared cache. For example, if you have a news site, all the tech articles could go to one server (or servers) that has all tech articles in its cache. This makes invalidation have to iterate over a list of hosts, but thems the breaks.
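That partitioning scheme might look like this sketch, where the first path segment of the URL picks the cache host. The section names and hostnames are made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical section -> cache-host mapping; every "/tech/..." URL
# lands on the same host, so its cache holds all the tech articles.
PARTITIONS = {
    "tech": "cache-tech.example.com",
    "sports": "cache-sports.example.com",
}
DEFAULT = "cache-misc.example.com"

def cache_host_for(url):
    """Route a URL to its partition's cache host by first path segment."""
    section = urlparse(url).path.strip("/").split("/")[0]
    return PARTITIONS.get(section, DEFAULT)
```

Invalidating a tech article then only needs to touch the tech host(s), not every cache in the fleet.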