taking python enterprise level?...

simn_stv

unread,

Feb 25, 2010, 5:26:18 AM2/25/10

to

hello people, i have been reading posts on this group for quite some
time now and many, if not all (actually not all!), seem quite
interesting.
i plan to build an application, a network based application that i
estimate (and seriously hope) would get as many as 100, 000 hits a day
(hehe,...my dad always told me to 'AIM HIGH' ;0), not some 'facebook'
or anything like it, its mainly for a financial transactions which
gets pretty busy...
so my question is this would anyone have anything that would make
python a little less of a serious candidate (cos it already is) and
the options may be to use some other languages (maybe java, C (oh
God))...i am into a bit of php and building API's in php would not be
the hard part, what i am concerned about is scalability and
efficiency, well, as far as the 'core' is concerned.

would python be able to manage giving me a solid 'core' and will i be
able to use python provide any API i would like to implement?...

im sorry if my subject was not as clear as probably should be!.
i guess this should be the best place to ask this sort of thing, hope
im so right.

Thanks

Steve Holden

unread,

Feb 25, 2010, 6:13:28 AM2/25/10

to pytho...@python.org

I'd suggest that if you are running an operation that gets 100,000 hits
a day then your problems won't be with Python but with organizational
aspects of your operation.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

Martin P. Hellwig

unread,

Feb 25, 2010, 6:21:30 AM2/25/10

to

On 02/25/10 10:26, simn_stv wrote:
<cut>

> what i am concerned about is scalability and
> efficiency, well, as far as the 'core' is concerned.
>
> would python be able to manage giving me a solid 'core' and will i be
> able to use python provide any API i would like to implement?...

<cut>
Python isn't the most efficient language, the assembler provided by the
maker of your CPU probably is the best you can get, everything after
that is a trade-off between performance and flexibility (flexible in the
most flexible sense of the word :-)).

That being said, for me, Python (well actually any turing complete
programming language), is more like a box of lego with infinite amount
of pieces.
Scalability and API issues are the same as the shape and function of the
model your making with lego.

Sure some type of pieces might be more suited than other types but since
you can simulate any type of piece with the ones that are already
present, you are more limited by your imagination than by the language.

So in short, I don't see any serious problems using Python, I have used
it in Enterprise environments without any problems but than again I was
aware not to use it for numerical intensive parts without the use of 3rd
party libraries like numpy. Which for me resulted in not doing the
compression of a database delta's in pure python but to offload that to
a more suitable external program, still controlled from Python though.

--
mph

Tim Wintle

unread,

Feb 25, 2010, 6:45:22 AM2/25/10

to simn_stv, pytho...@python.org

On Thu, 2010-02-25 at 02:26 -0800, simn_stv wrote:
> i plan to build an application, a network based application that i
> estimate (and seriously hope) would get as many as 100, 000 hits a day
> (hehe,...my dad always told me to 'AIM HIGH' ;0), not some 'facebook'
> or anything like it, its mainly for a financial transactions which
> gets pretty busy...

I've got apps running that handle *well* over 100,000 hits / process /
day using Python - although some of the heavy lifting is off-loaded to C
and MySql - obviously without actually looking at your requirements that
doesn't mean much as I don't know how much work each hit requires.

Regarding financial transactions - you'll almost certainly want to
integrate with something that already has transactional support (sql
etc) - so I expect that will bear the brunt of the load

> so my question is this would anyone have anything that would make
> python a little less of a serious candidate (cos it already is) and
> the options may be to use some other languages (maybe java, C (oh
> God))

I've avoided integrating java with my python (I'm not a big fan of java)
- but I've integrated quite a bit of C - it's fairly easy to do, and you
can just port the inner loops if you see the need arise.

> ...i am into a bit of php and building API's in php would not be
> the hard part, what i am concerned about is scalability and
> efficiency, well, as far as the 'core' is concerned.

I've heard that php can be well scaled (by compiling it to bytecode/C++)
- but my preference would always be to python.

Tim

simn_stv

unread,

Feb 25, 2010, 7:46:38 AM2/25/10

to

very well noted steve, i'd be careful (which is a very relative word)
with the organizational aspects...
i'm sure ure quite rooted in that aspect, hey u need a job??........;)

Tim Chase

unread,

Feb 25, 2010, 8:07:59 AM2/25/10

to simn_stv, pytho...@python.org

simn_stv wrote:
> i plan to build an application, a network based application that i
> estimate (and seriously hope) would get as many as 100, 000 hits a day
> (hehe,...my dad always told me to 'AIM HIGH' ;0), not some 'facebook'
> or anything like it, its mainly for a financial transactions which
> gets pretty busy...
> so my question is this would anyone have anything that would make
> python a little less of a serious candidate (cos it already is) and
> the options may be to use some other languages (maybe java, C (oh
> God))...i am into a bit of php and building API's in php would not be
> the hard part, what i am concerned about is scalability and
> efficiency, well, as far as the 'core' is concerned.

Python is as "enterprise" as the developer who wields it.

Scalability revolves entirely around application design &
implementation. Or you could use Erlang (or haskell, etc ;-)

-tkc

D'Arcy J.M. Cain

unread,

Feb 25, 2010, 8:58:16 AM2/25/10

to simn_stv, pytho...@python.org

On Thu, 25 Feb 2010 02:26:18 -0800 (PST)
simn_stv <nan...@googlemail.com> wrote:
> i plan to build an application, a network based application that i
> estimate (and seriously hope) would get as many as 100, 000 hits a day

That's nothing. I ran a financial type app on Python that sometimes
hit 100,000 transactions an hour. We kept looking for bottlenecks that
we could convert to C but never found any. Our biggest problem was in
a network heavy element of the app and that was low level TCP/IP stuff
that rather than being Python's problem was something we used Python to
fix.

As others have pointed out, you will want some kind of enterprise
database that will do a lot of the heavy lifting. I suggest
PostgreSQL. It is the best open source database engine around. That
will take the biggest load off your app.

There's lots of decisions to make in the days ahead but I think that
choosing Python as your base language is a good first one.

> so my question is this would anyone have anything that would make
> python a little less of a serious candidate (cos it already is) and
> the options may be to use some other languages (maybe java, C (oh
> God))...i am into a bit of php and building API's in php would not be
> the hard part, what i am concerned about is scalability and
> efficiency, well, as far as the 'core' is concerned.

Scaleability and efficiency won't be your issues. Speed of development
and clarity of code will be. Python wins.

--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

Martin P. Hellwig

unread,

Feb 25, 2010, 10:29:34 AM2/25/10

to

On 02/25/10 13:58, D'Arcy J.M. Cain wrote:
> On Thu, 25 Feb 2010 02:26:18 -0800 (PST)

<cut>

> Our biggest problem was in
> a network heavy element of the app and that was low level TCP/IP stuff
> that rather than being Python's problem was something we used Python to
> fix.

<cut>
Out off interest, could you elaborate on that?

Thanks

--
mph

D'Arcy J.M. Cain

unread,

Feb 25, 2010, 11:18:14 AM2/25/10

to Martin P. Hellwig, pytho...@python.org

Somewhat - there is an NDA so I can't give exact details. It was
crucial to our app that we sync up databases in Canada and the US (later
Britain, Europe and Japan) in real time with those transactions. Our
problem was that even though our two server systems were on the
backbone, indeed with the same major carrier, we could not keep them in
sync. We were taking way to long to transact an update across the
network.

The problem had to do with the way TCP/IP works, especially closer to
the core. Our provider was collecting data and sending it only after
filling a buffer or after a timeout. The timeout was short so it
wouldn't normally be noticed and in most cases (web pages e.g.) the
connection is opened, data is pushed and the connection is closed so
the buffer is flushed immediately. Our patterns were different so we
were hitting the timeout on every single transaction and there was no
way we would have been able to keep up.

Our first crack at fixing this was to simply add garbage to the packet
we were sending. Making the packets an order of magnitude bigger sped
up the proccessing dramatically. That wasn't a very clean solution
though so we looked for a better way.

That better way turned out to asynchronous update transactions. All we
did was keep feeding updates to the remote site and forget about ACKS.
We then had a second process which handled ACKS and tracked which
packets had been properly transferred. The system had IDs on each
update and retries happened if ACKS didn't happen soon enough.
Naturally we ignored ACKS that we had already processed.

All of the above (and much more complexity not even discussed here) was
handled by Python code and database manipulation. There were a few
bumps along the way but overall it worked fine. If we were using C or
even assembler we would not have sped up anything and the solution we
came up with would have been horrendous to code. As it was I and my
chief programmer locked ourselves in the boardroom and had a working
solution before the day was out.

Python wins again.

Aahz

unread,

Feb 25, 2010, 11:33:59 AM2/25/10

to

In article <5cd38064-34d6-40d3...@i39g2000yqm.googlegroups.com>,

simn_stv <nan...@googlemail.com> wrote:
>
>i plan to build an application, a network based application that i
>estimate (and seriously hope) would get as many as 100, 000 hits a day
>(hehe,...my dad always told me to 'AIM HIGH' ;0), not some 'facebook'
>or anything like it, its mainly for a financial transactions which
>gets pretty busy...

Remember that YouTube runs on Python.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer

Martin P. Hellwig

unread,

Feb 25, 2010, 1:22:40 PM2/25/10

to

On 02/25/10 16:18, D'Arcy J.M. Cain wrote:
<cut working around ISP's with braindead network configurations>
Very interesting, I had a similar kind of problem (a network balancer
that doesn't balance small tcp packages too well) and solved it by
wrapping the TCP package in UDP. UDP was treated differently, although
in overall switch and router manager it has a lower priority compared to
other tcp packages, in normal usage it was faster.

Probably because UDP has less things to inspect and by this can be
processed faster by all the network equipment in between, but to be
honest it worked for me and the client wasn't interested in academic
explanations and since this was a working solution I didn't investigated
it any further.

Oh and a big thank you for PyGreSQL,! It has proven to be an extremely
useful module for me (especially since I used to hop a lot between
different unixes and Windows).
--
mph

Roy Smith

unread,

Feb 25, 2010, 2:21:51 PM2/25/10

to

In article <mailman.244.1267114...@python.org>,

"D'Arcy J.M. Cain" <da...@druid.net> wrote:

> The problem had to do with the way TCP/IP works, especially closer to
> the core. Our provider was collecting data and sending it only after
> filling a buffer or after a timeout. The timeout was short so it
> wouldn't normally be noticed and in most cases (web pages e.g.) the
> connection is opened, data is pushed and the connection is closed so
> the buffer is flushed immediately. Our patterns were different so we
> were hitting the timeout on every single transaction and there was no
> way we would have been able to keep up.
>
> Our first crack at fixing this was to simply add garbage to the packet
> we were sending. Making the packets an order of magnitude bigger sped
> up the proccessing dramatically. That wasn't a very clean solution
> though so we looked for a better way.

Interesting, the system I'm working with now has a similar problem. We've
got a request/ack protocol over TCP which often sends lots of small packets
and can have all sorts of performance issues because of this.

In fact, we break completely on Solaris-10 with TCP Fusion enabled. We've
gone back and forth with Sun on this (they claim what we're doing is
broken, we claim TCP Fusion is broken). In the end, we just tell all of
our Solaris-10 customers to disable TCP Fusion.

Diez B. Roggisch

unread,

Feb 25, 2010, 7:12:00 PM2/25/10

to

> That better way turned out to asynchronous update transactions. All we
> did was keep feeding updates to the remote site and forget about ACKS.
> We then had a second process which handled ACKS and tracked which
> packets had been properly transferred. The system had IDs on each
> update and retries happened if ACKS didn't happen soon enough.
> Naturally we ignored ACKS that we had already processed.

sounds like using UDP to me, of course with a protocol on top (namely
the one you implemented).

Any reason you sticked to TCP instead?

Diez

D'Arcy J.M. Cain

unread,

Feb 25, 2010, 11:01:26 PM2/25/10

to Diez B. Roggisch, pytho...@python.org

TCP does a great job of delivering a stream of data in order and
handling the retries. The app really was connection oriented and we
saw no reason to emulate that over an unconnected protocol. There were
other wheels to reinvent that were more important.

Diez B. Roggisch

unread,

Feb 26, 2010, 4:19:15 AM2/26/10

to

Am 26.02.10 05:01, schrieb D'Arcy J.M. Cain:

> On Fri, 26 Feb 2010 01:12:00 +0100
> "Diez B. Roggisch"<de...@nospam.web.de> wrote:
>>> That better way turned out to asynchronous update transactions. All we
>>> did was keep feeding updates to the remote site and forget about ACKS.
>>> We then had a second process which handled ACKS and tracked which
>>> packets had been properly transferred. The system had IDs on each
>>> update and retries happened if ACKS didn't happen soon enough.
>>> Naturally we ignored ACKS that we had already processed.
>>
>> sounds like using UDP to me, of course with a protocol on top (namely
>> the one you implemented).
>>
>> Any reason you sticked to TCP instead?
>
> TCP does a great job of delivering a stream of data in order and
> handling the retries. The app really was connection oriented and we
> saw no reason to emulate that over an unconnected protocol. There were
> other wheels to reinvent that were more important.

So when you talk about ACKs, you don't mean these on the TCP-level
(darn, whatever iso-level that is...), but on some higher level?

Diez

mdipierro

unread,

Feb 26, 2010, 4:32:30 AM2/26/10

to

100,000 hits a day is not a low. I get that some day on my web server
without problem and without one request dropped.

Most frameworks web2py, Django, Pylons can handle that kind of load
since Python is not the bottle neck.
You have to follow some tricks:

1) have the web server serve static pages directly and set the pragma
cache expire to one month
2) cache all pages that do not have forms for at least few minutes
3) avoid database joins
4) use a server with at least 512KB Ram.
5) if you pages are large, use gzip compression

If you develop your app with the web2py framework, you always have the
option to deploy on the Google App Engine. If you can live with their
constraints you should have no scalability problems.

Massimo

simn_stv

unread,

Mar 1, 2010, 7:01:08 AM3/1/10

to

On Feb 25, 12:21 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> On 02/25/10 10:26, simn_stv wrote:
> <cut>> what i am concerned about is scalability and
> > efficiency, well, as far as the 'core' is concerned.
>
> > would python be able to manage giving me a solid 'core' and will i be
> > able to use python provide any API i would like to implement?...
>
> <cut>
> Python isn't the most efficient language, the assembler provided by the
> maker of your CPU probably is the best you can get,

<cut>
LOL...;), yeah right, the mere thought of writing assembler
instructions is SCARY!!

simn_stv

unread,

Mar 1, 2010, 7:11:07 AM3/1/10

to

sure it wouldnt have sped it up a bit, even a bit?; probably the
development and maintenance time would be a nightmare but it should
speed the app up a bit...

>
> Python wins again.
>
> --
> D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolveshttp://www.druid.net/darcy/ | and a sheep voting on

> +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

seriously added to the reputation of python, from my own
perspective...kudos python!

simn_stv

unread,

Mar 1, 2010, 7:27:12 AM3/1/10

to

sure it wouldnt have sped it up a bit, even a bit?; probably the

development and maintenance time would be a nightmare but it should
speed the app up a bit...

>
> Python wins again.
>
> --
> D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolveshttp://www.druid.net/darcy/ | and a sheep voting on

> +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

seriously added to the reputation of python, from my own
perspective...kudos python!

simn_stv

unread,

Mar 1, 2010, 7:32:16 AM3/1/10

to

On Feb 26, 10:32 am, mdipierro <massimodipierr...@gmail.com> wrote:
> 100,000 hits a day is not a low. I get that some day on my web server
> without problem and without one request dropped.
>
> Most frameworks web2py, Django, Pylons can handle that kind of load
> since Python is not the bottle neck.

taking a look at django right now, doesnt look too bad from where im
standing, maybe when i get into the code i'd run into some issues that
would cause some headaches!!

> You have to follow some tricks:
>
> 1) have the web server serve static pages directly and set the pragma
> cache expire to one month
> 2) cache all pages that do not have forms for at least few minutes
> 3) avoid database joins

but this would probably be to the detriment of my database design,
which is a no-no as far as im concerned. The way the tables would be
structured requires 'joins' when querying the db; or could you
elaborate a little??

> 4) use a server with at least 512KB Ram.

hmmm...!, still thinking about what you mean by this statement also.

thanks for the feedback...

D'Arcy J.M. Cain

unread,

Mar 1, 2010, 9:43:13 AM3/1/10

to simn_stv, pytho...@python.org

On Mon, 1 Mar 2010 04:11:07 -0800 (PST)
simn_stv <nan...@googlemail.com> wrote:
> > All of the above (and much more complexity not even discussed here) was
> > handled by Python code and database manipulation. There were a few
> > bumps along the way but overall it worked fine. If we were using C or
> > even assembler we would not have sped up anything and the solution we
> > came up with would have been horrendous to code. As it was I and my
> > chief programmer locked ourselves in the boardroom and had a working
> > solution before the day was out.
>
> sure it wouldnt have sped it up a bit, even a bit?; probably the
> development and maintenance time would be a nightmare but it should
> speed the app up a bit...

What do you mean by "even a bit?" The bulk of the time is in sending
bits on the wire. Computer time was always negligible in this
situation. Yes, I can write all of my applications in assembler to get
a 0.00000000000001% increase in speed but who cares?

If you have decent algorithms in place then 99% of the time I/O will be
your bottleneck and if it isn't then you have a compute heavy problem
that assembler isn't going to fix.

And even if I get a 100% increase in speed, I still lose. Computer
time is cheaper than programmer time by so many orders of magnitude
that it isn't even worh factoring in the speedup.

simn_stv

unread,

Mar 1, 2010, 9:42:28 AM3/1/10

to

i think its on the TCP that he's referring to or is it?...
if it is, that means he's doing some 'mean' network level scripting,
impressive...but i never thought python could go that deep in network
programming!...

D'Arcy J.M. Cain

unread,

Mar 1, 2010, 10:02:15 AM3/1/10

to simn_stv, pytho...@python.org

On Mon, 1 Mar 2010 06:42:28 -0800 (PST)
simn_stv <nan...@googlemail.com> wrote:
> On Feb 26, 10:19 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> > So when you talk about ACKs, you don't mean these on the TCP-level
> > (darn, whatever iso-level that is...), but on some higher level?
>

> i think its on the TCP that he's referring to or is it?...

No, I mean in our own application layer.

> if it is, that means he's doing some 'mean' network level scripting,
> impressive...but i never thought python could go that deep in network
> programming!...

What I meant was that we just keep sending packets which TCP/IP keeps
in order for us by reassembling out-of-order and retransmitted
packets. Asynchronously we sent back to our own application an ACK
that our app level packet was finally received. It's a sliding window
protocol. http://en.wikipedia.org/wiki/Sliding_Window_Protocol

mdipierro

unread,

Mar 1, 2010, 7:20:06 PM3/1/10

to

On Mar 1, 6:32 am, simn_stv <nany...@googlemail.com> wrote:
...

>
> > You have to follow some tricks:
>
> > 1) have the web server serve static pages directly and set the pragma
> > cache expire to one month
> > 2) cache all pages that do not have forms for at least few minutes
> > 3) avoid database joins
>
> but this would probably be to the detriment of my database design,
> which is a no-no as far as im concerned. The way the tables would be
> structured requires 'joins' when querying the db; or could you
> elaborate a little??

Joins are the bottle neck of most web app that relay on relational
databases. That is why non-relational databases such as Google App
Engine, CouchDB, MongoDB do not even support Joins. You have to try to
minimize joins as much as possible by using tricks such as de-
normalization and caching.

> > 4) use a server with at least 512KB Ram.
>
> hmmm...!, still thinking about what you mean by this statement also.

I meant 512MB. The point is you need a lot of ram because you want to
run multiple python instances, cache in ram as much as possible and
also allow the database to buffer in ram as much as possible. You will
see Ram usage tends to spike when you have lots of concurrent
requests.

D'Arcy J.M. Cain

unread,

Mar 2, 2010, 1:58:13 AM3/2/10

to mdipierro, pytho...@python.org

On Mon, 1 Mar 2010 16:20:06 -0800 (PST)
mdipierro <massimod...@gmail.com> wrote:
> Joins are the bottle neck of most web app that relay on relational
> databases. That is why non-relational databases such as Google App
> Engine, CouchDB, MongoDB do not even support Joins. You have to try to
> minimize joins as much as possible by using tricks such as de-
> normalization and caching.

I keep seeing this statement but nothing to back it up. I have created
many apps that run on Python with a PostgreSQL database with a fully
normalized schema and I can assure you that database joins were never
my problem unless I made a badly constructed query or left off a
critical index.

> I meant 512MB. The point is you need a lot of ram because you want to
> run multiple python instances, cache in ram as much as possible and
> also allow the database to buffer in ram as much as possible. You will
> see Ram usage tends to spike when you have lots of concurrent
> requests.

Put as much memory as you can afford/fit into your database server.
It's the cheapest performance boost you can get. If you have a serious
application put at least 4GB into your dedicated database server.
Swapping is your enemy.

Aahz

unread,

Mar 2, 2010, 11:22:38 AM3/2/10

to

In article <mailman.99.12675130...@python.org>,

D'Arcy J.M. Cain <da...@druid.net> wrote:
>
>Put as much memory as you can afford/fit into your database server.
>It's the cheapest performance boost you can get. If you have a serious
>application put at least 4GB into your dedicated database server.
>Swapping is your enemy.

Also, put your log/journal files on a different spindle from the database
files. That makes a *huge* difference.

mk

unread,

Mar 3, 2010, 11:26:19 AM3/3/10

to pytho...@python.org

D'Arcy J.M. Cain wrote:

> I keep seeing this statement but nothing to back it up. I have created
> many apps that run on Python with a PostgreSQL database with a fully
> normalized schema and I can assure you that database joins were never
> my problem unless I made a badly constructed query or left off a
> critical index.

I too have done that (Python/PGSQL), even adding a complicated layer of
SQLAlchemy on top of it and have not had issue with this: when I
profiled one of my apps, it turned out that it spent most of its
computation time... rendering HTML. Completely unexpected: I expected DB
to be bottleneck (although it might be that with huge datasets this
might change).

Having said that, re evidence that joins are bad: from what I've *heard*
about Hibernate in Java from people who used it (I haven't used
Hibernate apart from "hello world"), in case of complicated object
hierarchies it supposedly generates a lot of JOINs and that supposedly
kills DB performance.

So there *may* be some evidence that joins are indeed bad in practice.
If someone has smth specific/interesting on the subject, please post.

Regards,
mk

Tim Wintle

unread,

Mar 3, 2010, 12:50:02 PM3/3/10

to mk, pytho...@python.org

On Wed, 2010-03-03 at 17:26 +0100, mk wrote:
>
> So there *may* be some evidence that joins are indeed bad in
> practice.
> If someone has smth specific/interesting on the subject, please post.

I have found joins to cause problems in a few cases - I'm talking about
relatively large tables though - roughly order 10^8 rows.

I'm on Mysql normally, but that shouldn't make any difference - I've
seen almost the same situation on Oracle

consider this simple example:

/* Table A */
CREATE TABLE TableA (
project_id BIGINT NOT NULL,
cost INT,
date DATETIME,
PRIMARY KEY (project_id, date)
);

/* Table projects */
CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,
INDEX(client_id)
);

... now the index on TableA has been optimised for queries against date
ranges on specific project ids which should more or less be sequential
(under a load of other assumptions) - but that reduces the efficiency of
the query under a join with the table "projects".

If you denormalise the table, and update the first index to be on
(client_id, project_id, date) it can end up running far more quickly -
assuming you can access the first mapping anyway - so you're still
storing the first table, with stored procedures to ensure you still have
correct data in all tables.

I'm definitely glossing over the details - but I've definitely got
situations where I've had to choose denormalisation over purity of data.

Rolled-up data tables are other situations - where you know half your
queries are grouping by field "A" it's sometimes a requirement to store
that.

Tim

Steve Holden

unread,

Mar 3, 2010, 1:14:18 PM3/3/10

to pytho...@python.org

mk wrote:
> D'Arcy J.M. Cain wrote:
>> I keep seeing this statement but nothing to back it up. I have created
>> many apps that run on Python with a PostgreSQL database with a fully
>> normalized schema and I can assure you that database joins were never
>> my problem unless I made a badly constructed query or left off a
>> critical index.
>
> I too have done that (Python/PGSQL), even adding a complicated layer of
> SQLAlchemy on top of it and have not had issue with this: when I
> profiled one of my apps, it turned out that it spent most of its
> computation time... rendering HTML. Completely unexpected: I expected DB
> to be bottleneck (although it might be that with huge datasets this
> might change).
>
> Having said that, re evidence that joins are bad: from what I've *heard*
> about Hibernate in Java from people who used it (I haven't used
> Hibernate apart from "hello world"), in case of complicated object
> hierarchies it supposedly generates a lot of JOINs and that supposedly
> kills DB performance.
>

> So there *may* be some evidence that joins are indeed bad in practice.
> If someone has smth specific/interesting on the subject, please post.
>

I suspect that this myth is propagated from the distributed database
world: joining tables across two different servers can indeed be
problematic from a performance point of view.

However, the classic advice in database design is to start with a
normalized design and then vary it only if you need to for performance
reasons (which will also involve taking a hit on the coding side,
especially if updates are involved).

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

Philip Semanchuk

unread,

Mar 3, 2010, 2:15:26 PM3/3/10

to Python (General)

On Mar 3, 2010, at 11:26 AM, mk wrote:

> D'Arcy J.M. Cain wrote:
>> I keep seeing this statement but nothing to back it up. I have
>> created
>> many apps that run on Python with a PostgreSQL database with a fully
>> normalized schema and I can assure you that database joins were never
>> my problem unless I made a badly constructed query or left off a
>> critical index.
>
> I too have done that (Python/PGSQL), even adding a complicated layer
> of SQLAlchemy on top of it and have not had issue with this: when I
> profiled one of my apps, it turned out that it spent most of its
> computation time... rendering HTML. Completely unexpected: I
> expected DB to be bottleneck (although it might be that with huge
> datasets this might change).
>
> Having said that, re evidence that joins are bad: from what I've
> *heard* about Hibernate in Java from people who used it (I haven't
> used Hibernate apart from "hello world"), in case of complicated
> object hierarchies it supposedly generates a lot of JOINs and that
> supposedly kills DB performance.
>
> So there *may* be some evidence that joins are indeed bad in
> practice. If someone has smth specific/interesting on the subject,
> please post.

It's an unprovable assertion, or a meaningless one depending on how
one defines the terms. You could also say "there *may* be some
evidence that Python lists are bad in practice". Python lists and SQL
JOINs are like any part of a language or toolkit. They're good tools
for solving certain classes of problems. They can also be misapplied
to problems that they're not so good at. Sometimes they're a
performance bottleneck, even when solving the problems for which
they're best. Sometimes the best way to solve a performance bottleneck
is to redesign your app/system so you don't need to solve that kind of
problem anymore (hence the join-less databases). Other times, the cure
is worse than the disease and you're better off throwing hardware at
the problem.

My $.02
Philip

mk

unread,

Mar 3, 2010, 2:39:35 PM3/3/10

to pytho...@python.org

Hello Tim,

Pardon the questions but I haven't had the need to use denormalization
yet, so:

Tim Wintle wrote:

> /* Table A */
> CREATE TABLE TableA (
> project_id BIGINT NOT NULL,
> cost INT,
> date DATETIME,
> PRIMARY KEY (project_id, date)
> );
>
> /* Table projects */
> CREATE TABLE projects (
> client_id BIGINT NOT NULL,
> project_id BIGINT NOT NULL,
> INDEX(client_id)
> );
>
>

> .... now the index on TableA has been optimised for queries against date

> ranges on specific project ids which should more or less be sequential
> (under a load of other assumptions) - but that reduces the efficiency of
> the query under a join with the table "projects".
>
> If you denormalise the table, and update the first index to be on
> (client_id, project_id, date) it can end up running far more quickly -

IOW you basically merged the tables like follows?

CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,

cost INT,
date DATETIME,
INDEX(client_id, project_id, date)
);

From what you write further in the mail I conclude that you have not
eliminated the first table, just made table projects look like I wrote
above, right? (and used stored procedures to make sure that both tables
contain the relevant data for client_id and project_id columns in both
tables)

Have you had some other joins on denormalized keys? i.e. in example how
the join of hypothetical TableB with projects on projects.client_id
behave with such big tables? (bc I assume that you obviously can't
denormalize absolutely everything, so this implies the need of doing
some joins on denormalized columns like client_id).

> assuming you can access the first mapping anyway -

? I'm not clear on what you mean here.

> so you're still
> storing the first table, with stored procedures to ensure you still have
> correct data in all tables.

Regards,
mk

mk

unread,

Mar 3, 2010, 3:58:15 PM3/3/10

to pytho...@python.org

Philip Semanchuk wrote:
>> So there *may* be some evidence that joins are indeed bad in practice.
>> If someone has smth specific/interesting on the subject, please post.
>
> It's an unprovable assertion, or a meaningless one depending on how one
> defines the terms. You could also say "there *may* be some evidence that
> Python lists are bad in practice". Python lists and SQL JOINs are like
> any part of a language or toolkit. They're good tools for solving
> certain classes of problems. They can also be misapplied to problems
> that they're not so good at. Sometimes they're a performance bottleneck,
> even when solving the problems for which they're best. Sometimes the
> best way to solve a performance bottleneck is to redesign your
> app/system so you don't need to solve that kind of problem anymore
> (hence the join-less databases). Other times, the cure is worse than the
> disease and you're better off throwing hardware at the problem.

Look, I completely agree with what you're saying, but: that doesn't
change the possibility that joins may be expensive in comparison to
other SQL operations. This is the phrase I should have used perhaps;
'expensive in comparison with other SQL operations' instead of 'bad'.

Example from my app, where I behaved "by the book" (I hope) and
normalized my data:

$ time echo "\c hrs;
SELECT hosts.ip, reservation.start_date, architecture.architecture,
os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project,
email.email FROM hosts
INNER JOIN project ON project.id = hosts.project_id
INNER JOIN architecture ON hosts.architecture_id = architecture.id
INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
INNER JOIN os_version ON hosts.os_version_id = os_version.id
INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
INNER JOIN reservation on reservation.id =
reservation_hosts.reservation_id
INNER JOIN email ON reservation.email_id = email.id

;" | psql > /dev/null

real 0m0.099s
user 0m0.015s
sys 0m0.005s

$ time echo "\c hrs;
> SELECT hosts.ip FROM hosts;
> SELECT reservation.start_date FROM reservation;
> SELECT architecture.architecture FROM architecture;
> SELECT os_rel.os_rel FROM os_rel;
> SELECT os_version.os_version FROM os_version;
> SELECT project.project FROM project;
> SELECT email.email FROM email;
> " | psql > /dev/null

real 0m0.046s
user 0m0.008s
sys 0m0.004s

Note: I've created indexes on those tables, both on data columns like
hosts.ip and on .id columns.

So yes, joins apparently are at least twice as expensive as simple
selects without joins, on a small dataset. Not a drastic increase in
cost, but smth definitely shows.

It would be interesting to see what happens when row numbers increase to
large numbers, but I have no such data.

Regards,
mk

D'Arcy J.M. Cain

unread,

Mar 3, 2010, 4:23:14 PM3/3/10

to mk, pytho...@python.org

On Wed, 03 Mar 2010 20:39:35 +0100
mk <mrk...@gmail.com> wrote:
> > If you denormalise the table, and update the first index to be on
> > (client_id, project_id, date) it can end up running far more quickly -

Maybe. Don't start with denormalization. Write it properly and only
consider changing if profiling suggests that that is your bottleneck.
With a decent database engine and proper design it will hardly ever be.

> From what you write further in the mail I conclude that you have not
> eliminated the first table, just made table projects look like I wrote
> above, right? (and used stored procedures to make sure that both tables
> contain the relevant data for client_id and project_id columns in both
> tables)

Note that rather than speeding things up this could actually slow
things down depending on your usage. If you do lots of updates and you
have to write extra information every time then that's worse than a few
extra reads, especially since read data can be cached but written data
must be pushed to disk immediately in an ACID database.

Philip Semanchuk

unread,

Mar 3, 2010, 5:40:36 PM3/3/10

to Python (General)

On Mar 3, 2010, at 3:58 PM, mk wrote:

> Philip Semanchuk wrote:
>>> So there *may* be some evidence that joins are indeed bad in
>>> practice. If someone has smth specific/interesting on the subject,
>>> please post.
>> It's an unprovable assertion, or a meaningless one depending on how
>> one defines the terms. You could also say "there *may* be some
>> evidence that Python lists are bad in practice". Python lists and
>> SQL JOINs are like any part of a language or toolkit. They're good
>> tools for solving certain classes of problems. They can also be
>> misapplied to problems that they're not so good at. Sometimes
>> they're a performance bottleneck, even when solving the problems
>> for which they're best. Sometimes the best way to solve a
>> performance bottleneck is to redesign your app/system so you don't
>> need to solve that kind of problem anymore (hence the join-less
>> databases). Other times, the cure is worse than the disease and
>> you're better off throwing hardware at the problem.
>
> Look, I completely agree with what you're saying, but: that doesn't
> change the possibility that joins may be expensive in comparison to
> other SQL operations. This is the phrase I should have used perhaps;
> 'expensive in comparison with other SQL operations' instead of 'bad'.

Well OK, but that's a very different argument. Yes, joins can be
expensive. They're often still the best option, though. The first step
people usually take to get away from joins is denormalization which
can improve SELECT performance at the expense of slowing down INSERTs,
UPDATEs, and DELETEs, not to mention complicating one's code and data
model. Is that a worthwhile trade? Depends on the application. As I
said, sometimes the cure is worse than the disease.

Don't worry about joins until you know they're a problem. As Knuth
said, premature optimization is the root of all evil.

Good luck
Philip

PS - Looks like you're using Postgres -- excellent choice. I miss
using it.

simn_stv

unread,

Mar 4, 2010, 5:30:15 AM3/4/10

to

till i think i absolutely need to trade-off easier and less
complicated code, better db structure (from a relational perspective)
and generally less "head aches" for speed, i think i'll stick with the
joins for now!...;)

the thought of denormalization really doesnt appeal to me...

mk

unread,

Mar 4, 2010, 7:34:39 AM3/4/10

to pytho...@python.org

Philip Semanchuk wrote:
> Well OK, but that's a very different argument. Yes, joins can be
> expensive. They're often still the best option, though. The first step
> people usually take to get away from joins is denormalization which can
> improve SELECT performance at the expense of slowing down INSERTs,
> UPDATEs, and DELETEs, not to mention complicating one's code and data
> model. Is that a worthwhile trade?

I'd say that in more than 99% of situations: NO.

More than that: if I haven't normalized my data as it should have been
normalized, I wouldn't be able to do complicated querying that I really,
really have to be able to do due to business logic. A few of my queries
have a few hundred lines each with many sub-queries and multiple
many-to-many joins: I *dread the thought* what would happen if I had to
reliably do it in a denormalized db and still ensure data integrity
across all the business logic contexts. And performance is still more
than good enough: so there's no point for me, as of the contexts I
normally work in, to denormalize data at all.

It's just interesting for me to see what happens in that <1% of situations.

> Depends on the application. As I
> said, sometimes the cure is worse than the disease.
>
> Don't worry about joins until you know they're a problem. As Knuth said,
> premature optimization is the root of all evil.

Sure -- the cost of joins is just interesting to me as a 'corner case'.
I don't have datasets large enough for this to matter in the first place
(and I probably won't have them that huge).

> PS - Looks like you're using Postgres -- excellent choice. I miss using it.

If you can, I'd recommend using SQLAlchemy layer on top of
Oracle/Mysql/Sqlite, if that's what you have to use: this *largely*
insulates you from the problems below and it does the job of translating
into a peculiar dialect very well. For my purposes, SQLAlchemy worked
wonderfully: it's very flexible, it has middle-level sql expression
language if normal querying is not flexible enough (and normal querying
is VERY flexible), it has a ton of nifty features like autoloading and
rarely fails bc of some lower-level DB quirk AND its high-level object
syntax is so similar to SQL that you quickly & intuitively grasp it.

(and if you have to/prefer writing some query in "low-level" SQL, as I
have done a few times, it's still easy to make SQLAlchemy slurp the
result into objects provided you ensure there are all of the necessary
columns in the query result)

Regards,
mk

Tim Wintle

unread,

Mar 4, 2010, 10:23:33 AM3/4/10

to mk, pytho...@python.org

On Wed, 2010-03-03 at 20:39 +0100, mk wrote:
> Hello Tim,
>
> Pardon the questions but I haven't had the need to use denormalization
> yet, so:

> IOW you basically merged the tables like follows?

>
> CREATE TABLE projects (
> client_id BIGINT NOT NULL,
> project_id BIGINT NOT NULL,
> cost INT,
> date DATETIME,
> INDEX(client_id, project_id, date)
> );

Yup

> From what you write further in the mail I conclude that you have not
> eliminated the first table, just made table projects look like I wrote
> above, right? (and used stored procedures to make sure that both tables
> contain the relevant data for client_id and project_id columns in both
> tables)

Yup

> Have you had some other joins on denormalized keys? i.e. in example how
> the join of hypothetical TableB with projects on projects.client_id
> behave with such big tables? (bc I assume that you obviously can't
> denormalize absolutely everything, so this implies the need of doing
> some joins on denormalized columns like client_id).

For these joins (for SELECT statements) this _can_ end up running faster
- of course all of this depends on what kind of queries you normally end
up getting and the distribution of data in the indexes.

I've never written anything that started out with a schema like this,
but several have ended up getting denormalised as the projects have
matured and query behaviour has been tested

> > assuming you can access the first mapping anyway -
>
> ? I'm not clear on what you mean here.

I'm refering to not eliminating the first table as you concluded

>
> Regards,
> mk
>

Tim Wintle

unread,

Mar 4, 2010, 10:29:23 AM3/4/10

to D'Arcy J.M. Cain, pytho...@python.org, mk

On Wed, 2010-03-03 at 16:23 -0500, D'Arcy J.M. Cain wrote:
> On Wed, 03 Mar 2010 20:39:35 +0100
> mk <mrk...@gmail.com> wrote:
> > > If you denormalise the table, and update the first index to be on
> > > (client_id, project_id, date) it can end up running far more quickly -
>
> Maybe. Don't start with denormalization. Write it properly and only
> consider changing if profiling suggests that that is your bottleneck.

Quite - and I'd add to cache reads as much in front end machines as is
permissible in your use case before considering denormalisation.

> With a decent database engine and proper design it will hardly ever be.

I completely agree - I'm simply responding to the request for an example
where denormalisation may be a good idea.

Tim