
Using Python for processing of large datasets (convincing management)


Thomas Jensen

Jul 6, 2002, 12:05:00 PM
Hello group (and list :-),

I've used Python for several years (and followed this group until about
6 months ago).
I work in a small company which specialises in collecting and processing
financial data. Most of our production environment is based on Microsoft
stuff like ASP/VBScript, VB6, WinNT, MS SQL Server, etc.

One of the next development tasks is rewriting the nightly processing
job which is having problems with our ~100mb database (it is written in
Borland C++, but absolutely not optimized for speed!).

The goals of the rewritten piece of software would be:
* Improved speed
* Improved scalability - parallel processing on multiple machines/CPUs
* Improved scalability - ability to handle greater databases (>1gb)
* Ability to calculate only a subset of the data

Now, instead of rewriting the job in C++, I'd (of course) like to use
Python.
However the CEO (small company, told you :-), made a couple of somewhat
valid points against it.
1) He was worried about getting a replacement developer in case I left.
2) He said, "Name 3 companies using Python for key functions"
3) He was worried about the stability/reliability of python in our
production environment (you know, 99.999 % and all that)

I was hoping someone in this group could help with some really
compelling arguments, as I'd really like to use Python for this job.

Best regards
Thomas Jensen

Peter Hansen

Jul 6, 2002, 12:28:01 PM
Thomas Jensen wrote:
>
> However the CEO (small company, told you :-), made a couple of somewhat
> valid points against it.
> [...]

> I was hoping someone in this group could help with some really
> compelling arguments, as I'd really like to use Python for this job.

If you email me offline I'd be happy to give you details from my
current employer's extensive use of Python as a highly reliable
and effective tool, and how it has never been an issue that
developers who know Python before they start working with it are
not as common as weeds.

(The offer is open to anyone, by the way.)

(Thomas, I assume nu...@obscure.dk.X is not a valid email address. ;-)

-Peter

Thomas Jensen

Jul 6, 2002, 12:34:38 PM
Peter Hansen wrote:

<snip>

> If you email me offline I'd be happy to give you details from my
> current employer's extensive use of Python as a highly reliable
> and effective tool, and how it has never been an issue that
> developers who know Python before they start working with it are
> not as common as weeds.

Thank you, I will!

> (Thomas, I assume nu...@obscure.dk.X is not a valid email address. ;-)

Ahem, sorry! It should be valid now, except the underscore.
(Perhaps I'm a bit paranoid about spam, but on the other hand I get less
than one spam mail per week, which I believe is rather low ;-)

--
Best Regards
Thomas Jensen
(remove underscore in email address to mail me)

William Park

Jul 6, 2002, 1:54:40 PM


If your cronjob can tackle 1MB but not 1GB, then I don't think this is a
programming language issue. Rather, you should look at your algorithm and
data structure.

If your company is a private for-profit company, then use the money argument:

- Anyone who knows Python or Unix shell will have the necessary
  analytical skills. And they can easily be found on
  <comp.lang.python> or <comp.unix.shell>.

- Scaling to multiple CPUs is an OS issue. Much easier with Linux (no
comment on Windows :-)

- Scaling to GB is an algorithm issue. Python makes development easier,
because it's easy to write and read.

Mostly, he saves money because he will be able to find the right people. The
fact that they happen to know the right language is just a bonus.

--
William Park, Open Geometry Consulting, <openge...@yahoo.ca>
8-CPU Cluster, Hosting, NAS, Linux, LaTeX, python, vim, mutt, tin

Alex Martelli

Jul 6, 2002, 4:40:48 PM
Thomas Jensen wrote:
...

> However the CEO (small company, told you :-), made a couple of somewhat
> valid points against it.
> 1) He was worried about getting a replacement developer in case I left.
> 2) He said, "Name 3 companies using Python for key functions"
> 3) He was worried about the stability/reliability of python in our
> production environment (you know, 99.999 % and all that)

Come visit www.python-in-business.org -- we founded the Python
Business Forum, a non-profit alliance of firms which do use Python
"for key functions", and one of the PBF's primary purposes is to
help reassure CEO's (and CTO's &c, for larger companies:-) about
just such issues as these.

Point 1 is basically never a problem because Python is so easy
to pick up (and somebody else's Python code is easier to maintain
than for any other language), but the PBF aims to help, in the
long run, by establishing a network and referral point for
Python trainers and consultants.

Point 2 is easy -- Zope Corporation relies on Python so much it
pays to employ the Python core team, Google uses Python enough to
demand Python skill in some of its job offers, Fidelity (a huge
financial trust in Britain) is dependent on Python (the basis of
all of ReportLab's excellent products) for all of its reports-as-
PDF web sales strategy -- that's three. We can no doubt find
more, but three is what your boss asked for.

Point 3 is troublesome if taken literally -- I know of NO language
claiming 99.999% freedom from defects for its implementations.
The standard Python distribution is certainly far less buggy than,
e.g., most Microsoft language products I've ever used -- on which
many corporations, wisely or otherwise, rely for their business --
but that's still a FAR sight from 1-part-in-100,000-or-lower
defect rates. One of PBF's plans, named "Python in a Tie", is
a very stable Python distribution, due to be blessed by Guido but
tested very intensely by the PBF. But the targets, though not
quantified so far, are NOT as ambitious as 99.999%. If you're
planning deployment in life-critical applications (the only
reason I can see for such strict, enormously expensive demands)
I think, regretfully, that you'll have to look elsewhere (I'll
be curious to hear about what software tools you find that claim
those reliability levels, at least if they back it up with real
hard-cash reliability insurance -- words are cheap).


Alex

Thomas Jensen

Jul 6, 2002, 5:18:02 PM
William Park wrote:

[snip]

> If your cronjob can tackle 1MB but not 1GB, then I don't think this is a
> programming language issue. Rather, you should look at your algorithm and
> data structure.

I am inclined to agree. The current implementation is very inefficient
in its database access (lots of small queries with no supporting
indexes; furthermore, the same data is often read multiple times).

> If your company is a private for-profit company, then use the money argument:

It is.

>
> - Anyone who knows Python or Unix shell will have the necessary
>   analytical skills. And they can easily be found on
>   <comp.lang.python> or <comp.unix.shell>.

The company is based in Denmark, and I believe that the number of Danish
people in the group is rather small?
However, I recently heard that some Danish universities use Python as
the primary language in CS.

> - Scaling to multiple CPUs is an OS issue. Much easier with Linux (no
>   comment on Windows :-)

I've heard that Python threads don't scale (well?) to multiple CPUs?
Maybe that's only on Windows?
I was planning on using XML-RPC (should we go with Python) for
inter-process/-machine communications.

> - Scaling to GB is an algorithm issue. Python makes development easier,
>   because it's easy to write and read.
>
> Mostly, he saves money because he will be able to find the right people. The
> fact that they happen to know the right language is just a bonus.

Well said.

Alex Martelli

Jul 6, 2002, 5:30:18 PM
Thomas Jensen wrote:
...

> I've heard that Python threads don't scale (well?) to multiple CPUs?
> Maybe that's only on Windows?

No, it's true on all platforms: Python threads only use one CPU. There
is a global interpreter lock (GIL) to ensure only one Python bytecode
is being executed at any given time, except for those which call out to
C-coded parts (e.g. for blocking I/O) which explicitly release the GIL.

With Python, you can exploit multiple CPUs only by multi-*processing* --
and here, it's possible that Windows' multi-processing inefficiencies
may byte you (with Unix-like systems, often multiple processes or
multiple threads in one process have quite comparable performance).
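A quick way to see this for yourself (a rough, untested sketch; the loop
body and counts are made up): on a multi-CPU box the threaded version
takes about as long as the sequential one, because the GIL serializes
the bytecode.

import threading, time

def burn(n):
    # pure-Python CPU-bound loop; it never releases the GIL for long
    x = 0
    for i in xrange(n):
        x = x + 1

N = 5000000

start = time.time()
burn(N)
burn(N)
print "sequential:", time.time() - start, "seconds"

start = time.time()
threads = [threading.Thread(target=burn, args=(N,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "two threads:", time.time() - start, "seconds"
# Expect roughly equal times, even with two CPUs available.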


Alex

Thomas Jensen

Jul 6, 2002, 5:52:51 PM
Alex Martelli wrote:
> Thomas Jensen wrote:

[snip]

> Come visit www.python-in-business.org -- we founded the Python
> Business Forum, a non-profit alliance of firms which do use Python
> "for key functions", and one of the PBF's primary purposes is to
> help reassure CEO's (and CTO's &c, for larger companies:-) about
> just such issues as these.

Appreciate the link. I'll keep it for later use, should we go for Python.

> Point 1 is basically never a problem because Python is so easy
> to pick up (and somebody else's Python code is easier to maintain
> than for any other language), but the PBF aims to help, in the
> long run, by establishing a network and referral point for
> Python trainers and consultants.

Sounds interesting.

> Point 2 is easy -- Zope Corporation relies on Python so much it
> pays to employ the Python core team, Google uses Python enough to
> demand Python skill in some of its job offers, Fidelity (a huge
> financial trust in Britain) is dependent on Python (the basis of
> all of ReportLab's excellent products) for all of its reports-as-
> PDF web sales strategy -- that's three. We can no doubt find
> more, but three is what your boss asked for.

Oh, I thought Zope was "just" a product.
Google and Fidelity, nice! (Do you have any links regarding this?)

> Point 3 is troublesome if taken literally -- I know of NO language
> claiming 99.999% freedom from defects for its implementations.

I'm sorry I didn't make myself clear (in Denmark 5 nines usually means
uptime or accessibility for web services). Actually I believe we "only"
guarantee 99.99% to our customers :-)

> The standard Python distribution is certainly far less buggy than,
> e.g., most Microsoft language products I've ever used -- on which
> many corporations, wisely or otherwise, rely for their business --
> but that's still a FAR sight from 1-part-in-100,000-or-lower
> defect rates. One of PBF's plans, named "Python in a Tie", is
> a very stable Python distribution, due to be blessed by Guido but
> tested very intensely by the PBF. But the targets, though not
> quantified so far, are NOT as ambitious as 99.999%. If you're
> planning deployment in life-critical applications (the only
> reason I can see for such strict, enormously expensive demands)
> I think, regretfully, that you'll have to look elsewhere (I'll
> be curious to hear about what software tools you find that claim
> those reliability levels, at least if they back it up with real
> hard-cash reliability insurance -- words are cheap).

I wholeheartedly agree.
I believe the concern was more about Python taking down our servers and
thereby ruining our uptime (we have a load-balancing unit, so both
servers would have to go down).
Having used Python quite a bit, I know that wouldn't happen, but two
people saying it is better than one :-)

I've worked a lot with MS tools professionally and must say that my view
of MS has gone from bad to worse. Visual Basic is by far the worst,
most clumsy and buggy computer language I've ever used. I've come to
really hate that language. The only "good" thing about it is that it's
so closely bound to COM that COM components are really easy to make.
Don't even get me started about the stability of using IIS for SOAP
servers :-(

Thomas Jensen

Jul 6, 2002, 6:13:21 PM
Alex Martelli wrote:

[snip]

> With Python, you can exploit multiple CPUs only by multi-*processing* --
> and here, it's possible that Windows' multi-processing inefficiencies
> may byte you (with Unix-like systems, often multiple processes or
> multiple threads in one process have quite comparable performance).

Ok, thanks.
The actual job is easily parallelisable (is that a word? :-) in that it
can be broken into a number (about 500) of calls to a function that
takes one integer as input, i.e.
calcUnit(unitnum)
(This assumes that a database connection is available to the function
through a class or global variable.)

I was planning on spawning one single-threaded XMLRPC-server per CPU per
machine and then having a control process on one of the machines with a
thread per process. These threads would fetch unit numbers from a Queue
object and call the XMLRPC server using xmlrpclib.

Am I correct in believing that this would utilize all CPUs? (Windows
issues aside).
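Roughly what I have in mind (an untested sketch; host names, the port
and the calculation body are made up):

# server.py -- one single-threaded instance per CPU, on each machine
import SimpleXMLRPCServer

def calcUnit(unitnum):
    # stand-in for the real calculation (would use the DB connection)
    return unitnum

server = SimpleXMLRPCServer.SimpleXMLRPCServer(("", 8000))
server.register_function(calcUnit)
server.serve_forever()

# control.py -- on the control machine: one thread per server process
import threading, Queue, xmlrpclib

work = Queue.Queue()
for unitnum in range(500):
    work.put(unitnum)

def worker(url):
    proxy = xmlrpclib.Server(url)
    while 1:
        try:
            unitnum = work.get_nowait()
        except Queue.Empty:
            return
        proxy.calcUnit(unitnum)

urls = ["http://hostA:8000", "http://hostB:8000"]  # one per CPU (made up)
threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()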

Matt Gerrans

Jul 6, 2002, 3:37:07 PM
> One of the next development tasks is rewriting the nightly processing
> job which is having problems with our ~100mb database (it is written in
> Borland C++, but absolutely not optimized for speed!).

BCB is great -- and you can still use it for the performance-critical areas
by creating COM Automation servers which are a snap to call from Python.
One of the great things about Python is that it works well with C/C++ so
that you can eat your cake and have it too.

> The goals of the rewritten piece of software would be:
> * Improved speed

Python is not going to help in this area, unfortunately, unless you are
talking about improved speed of development! ;-)

> * Improved scalability - parallel processing on multiple machines/CPUs

This might be more easily accomplished with Java, depending on exactly how
you intend to implement it. Java is probably the best tool for distributed
processing; in particular JINI is ideal for this kind of thing.

> * Improved scalability - ability to handle greater databases (>1gb)

This is probably more dependent on your design than the language or platform
you choose.

> * Ability to calculate only a subset of the data

Also dependent more on your design.

> Now, instead of rewriting the job in C++, I'd (of course) like to use
> Python.

Naturally!

> However the CEO (small company, told you :-), made a couple of somewhat
> valid points against it.
> 1) He was worried about getting a replacement developer in case I left.

I don't think that is a problem at all, these days. I think Python
developers are becoming pretty ubiquitous. On top of that, any experienced
programmer can learn Python in a snap -- it is so engaging that it is fun
and quick to learn.

In fact, you can show him some Python and the equivalent C++ as a
demonstration of how much simpler and more elegant Python is. For instance, can
you imagine writing a small program in C++ which will recurse directories
doing a search-and-replace operation with regular expressions support? It
is a big task in C++, but it is pretty trivial in Python.
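For instance, the whole thing fits on one screen (a rough, untested
sketch; the pattern and replacement are made up):

import os, re

pattern = re.compile(r"foo")   # what to search for
replacement = "bar"            # what to put in its place

def visit(arg, dirname, names):
    for name in names:
        path = os.path.join(dirname, name)
        if os.path.isfile(path):
            text = open(path, "rb").read()
            newtext = pattern.sub(replacement, text)
            if newtext != text:
                open(path, "wb").write(newtext)

os.path.walk(".", visit, None)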

> 2) He said, "Name 3 companies using Python for key functions"

I'd bet *every* company in the Fortune 500 uses Python for one thing or
another, whether they know it or not. Many are probably using it for very
important functions; they just don't advertise it. Why should they --
their business is not about explaining how they accomplish every task, it is
about doing it. I have developed Python code for one of the largest of
them that is very key to their business, but I doubt that the CEO would know
of it or that the company would tout this fact -- what they care about is
creating and selling their products.

> 3) He was worried about the stability/reliability of python in our
> production environment (you know, 99.999 % and all that)

As long as you are not using it for GUI development, in my experience, it is
extremely solid.

> I was hoping someone in this group could help with some really
> compelling arguments, as I'd really like to use Python for this job.

I think the most compelling argument you can come up with is to write a demo
in Python that works on a subset of the data, as you mentioned above. The
speed with which you can develop and the quality of the code you develop
will be the biggest selling factor.

Be aware, though, that your demo could also convince you that Python is not
the right tool for this job. Python is a great tool, but it is not the best
tool for *every* task.


Paul Rubin

Jul 7, 2002, 12:39:37 AM
Depending on what kind of computation you're doing on that dataset,
Python may not give you the speed you need. It has big advantages over
C/C++ for development speed and program maintainability, but the ugly
truth is that the interpreter is pretty slow.

You may be best off with a hybrid approach, writing the control
structure of your program in Python but the inner computational loops
in C, called through the Python-to-C API or through SWIG. These have
a bit more learning curve than programming in pure Python, but will
let you get at the performance of C code when you need it.

Thomas Jensen

Jul 7, 2002, 6:17:10 AM
Matt Gerrans wrote:
> BCB is great -- and you can still use it for the performance-critical areas
> by creating COM Automation servers which are a snap to call from Python.
> One of the great things about Python is that it works well with C/C++ so
> that you can eat your cake and have it too.

I'll consider it, however I don't want to rely too much on COM, in case
we'd migrate to some Unix flavour some day.
Maybe I'll look into writing some of the algorithms as C modules, but I'm
not sure it's worth it (see below).

>>The goals of the rewritten piece of software would be:
>>* Improved speed
>
> Python is not going to help in this area, unfortunately, unless you are
> talking about improved speed of development! ;-)

Yes and no. In my estimate, the primary reason for the slowness of the
current job is inefficient data access and data handling.
A few examples:
* Using "SELECT TOP 1 value FROM T_MyTable ORDER BY value DESC" to get the
maximum value.
* linear searches
* LOTS of SQL calls returning only one row

When the job was originally written a lot of factors were not known and,
well, it did its job.
However now the amount of data requires better performance.
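To illustrate the first and last points, a rough DB-API sketch (untested;
the module choice, table and column names are made up):

import MySQLdb  # any DB-API 2.0 module works much the same way

conn = MySQLdb.connect(db="mydb")   # connection details made up
cur = conn.cursor()

# Let the server aggregate instead of sorting and taking the top row:
cur.execute("SELECT MAX(value) FROM T_MyTable")
maxvalue = cur.fetchone()[0]

# Fetch a batch in one round trip instead of many one-row queries:
cur.execute("SELECT unitid, value FROM T_MyTable WHERE unitid IN (1, 2, 3)")
rows = cur.fetchall()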

>>* Improved scalability - parallel processing on multiple machines/CPUs
>
> This might be more easily accomplished with Java, depending on exactly how
> you intend to implement it. Java is probably the best tool for distributed
> processing; in particular JINI is ideal for this kind of thing.

I don't know much Java, I must admit. However, for my needs I believe
XML-RPC will do just fine.
I've looked a little into DCOM and, while I won't use it, I must admit I
like the idea of having the component automatically instantiated on the
remote machine instead of having to have a running server. Hmm, if only
Windows had inetd.

>>* Improved scalability - ability to handle greater databases (>1gb)
>
> This is probably more dependent on your design than the language or platform
> you choose.

Yes, but I also think that Python makes it easier to get the design right.

>>Now, instead of rewriting the job in C++, I'd (of course) like to use
>>Python.
>
> Naturally!

:-)

>>However the CEO (small company, told you :-), made a couple of somewhat
>>valid points against it.
>>1) He was worried about getting a replacement developer in case I left.
>
> I don't think that is a problem at all, these days. I think Python
> developers are becoming pretty ubiquitous. On top of that, any experienced
> programmer can learn Python in a snap -- it is so engaging that it is fun
> and quick to learn.

I'm happy so many people seem to think that; last week I sent a Python
example to one of my co-workers and he picked it up immediately (he's a
VB/ASP developer with a little C++ knowledge).
When told about it, most people are somewhat skeptical about using
indentation for program structure (funny, it seems to be the same people
that never indent their code, out of laziness or whatever ;-). However,
when they see an actual program, most realize how beneficial it is.

>>2) He said, "Name 3 companies using Python for key functions"
>
> I'd bet *every* company in the Fortune 500 uses Python for one thing or
> another, whether they know it or not. Many are probably using it for very
> important functions; they just don't advertise it. Why should they --
> their business is not about explaining how they accomplish every task, it is
> about doing it. I have developed Python code for one of the largest of
> them that is very key to their business, but I doubt that the CEO would know
> of it or that the company would tout this fact -- what they care about is
> creating and selling their products.

And that's what they should care about, I guess. It's funny how it works,
isn't it. Our CEO is very worried about all this Open Source stuff (be
it Python, Linux, *BSD, MySQL, whatever). The problem is not the free as
in speech - it's the free as in beer. Many people simply can't believe
that something that is gratis can be any good (which is probably a good
rule of thumb when speaking of material goods like books and such, just
not software).
Being able to buy MySQL might actually be the convincing factor.

[snip]

> I think the most compelling argument you can come up with is to write a demo
> in Python that works on a subset of the data, as you mentioned above. The
> speed with which you can develop and the quality of the code you develop
> will be the biggest selling factor.

I'm doing it as we speak, shouldn't take long :-)

> Be aware, though, that your demo could also convince you that Python is not
> the right tool for this job. Python is a great tool, but it is not the best
> tool for *every* task.

I believe it's the perfect tool for this case, but if it's not, it's
better to find out now, I think.

Thomas Jensen

Jul 7, 2002, 6:26:54 AM
Paul Rubin wrote:

[snip]

> You may be best off with a hybrid approach, writing the control
> structure of your program in Python but the inner computational loops
> in C, called through the Python-to-C API or through SWIG. These have
> a bit more learning curve than programming in pure Python, but will
> let you get at the performance of C code when you need it.

I gotta look into this.
However, I am uncertain as to how to structure my program.
One of the tasks of the program will be to calculate the standard
deviation of rows of daily values (which are the result of another
calculation, etc). I was planning on using lists and tuples like this:
[(date, value), (date, value), ...]
How well will this perform, I wonder? Since lists and tuples are Python
structures, won't they still be "slow" to traverse?
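What I have in mind is roughly this (a pure-Python sketch with made-up
data):

import math

rows = [("2002-07-01", 10.0), ("2002-07-02", 12.5), ("2002-07-03", 11.0)]

values = [value for (date, value) in rows]
n = len(values)

mean = 0.0
for v in values:
    mean = mean + v
mean = mean / n

var = 0.0
for v in values:
    var = var + (v - mean) ** 2
sd = math.sqrt(var / n)
print "standard deviation:", sd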

Anders Dahlberg

Jul 7, 2002, 6:36:26 AM
Hello, comments inline

Newbie argument:

Why not consider using jython?
Same scripting as Python, better scaling to multiple CPUs - seems at least
to me an easier solution than XML-RPC?

(maybe it's easier to sell the idea to your boss too, due to java-hype and
all ;)

> Best Regards
> Thomas Jensen
/Anders


Thomas Jensen

Jul 7, 2002, 6:41:39 AM
Anders Dahlberg wrote:

[snip]

> Why not consider using jython?
> Same scripting as Python, better scaling to multiple CPUs - seems at least
> to me an easier solution than XML-RPC?

Does it also scale well to multiple machines?
If so, do you have any links to information regarding this?

> (maybe it's easier to sell the idea to your boss too, due to java-hype and
> all ;)

That might be a point :-)

Alex Martelli

Jul 7, 2002, 6:45:33 AM
Thomas Jensen wrote:
...

> One of the tasks of the program will be to calculate the standard
> deviation of rows of daily values (which are the result of another
> calculation, etc). I was planning on using lists and tuples like this:
> [(date, value), (date, value), ...]
> How well will this perform i wonder? Since lists and tuples are Python
> structures, won't they still be "slow" to traverse?

If you have substantial numeric computations to perform on large
arrays, you're almost surely better off using the Numeric package
to extend Python, particularly if performance is an issue.
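E.g. the standard deviation you mention becomes a couple of whole-array
operations (a sketch against the Numeric package, with made-up data):

import math
import Numeric

values = Numeric.array([10.0, 12.5, 11.0])   # the "value" column
n = len(values)
mean = Numeric.sum(values) / n
sd = math.sqrt(Numeric.sum((values - mean) ** 2) / n)
print "standard deviation:", sd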


Alex

Anders Dahlberg

Jul 7, 2002, 6:57:41 AM

"Thomas Jensen" <spam@ob_scure.dk> wrote in message
news:3D281AE3.3070800@ob_scure.dk...

> Anders Dahlberg wrote:
>
> [snip]
>
> > Why not consider using jython?
> > Same scripting as Python, better scaling to multiple CPUs - seems at least
> > to me an easier solution than XML-RPC?
>
> Does it also scale well to multiple machines?
> If so, do you have any links to information regarding this?

Considering it compiles down to Java bytecode which is then run with a
Java VM - it should scale as well as any Java solution (or at least ~90%).
Links could be:

http://java.sun.com ;)
http://www.bea.com/products/weblogic/jrockit/index.shtml - a very good
server VM (optimised for Intel).

Plus, all the J2EE servers are running on a Java VM, so any scaling
information you can gather from them should probably apply to your case too?
IIRC though, all Java code running on the default Sun JVM will take
advantage of additional CPUs if they are available - no extra coding should
be required (except that the code should be easy enough for the VM to
"parallelize" :)

disclaimer: This is just what I believe - if you want to base your case on
this, I believe you should contact other companies using Java with
multiple CPUs!

> > (maybe it's easier to sell the idea to your boss too, due to java-hype and
> > all ;)
>
> That might be a point :-)
>
> --
> Best Regards
> Thomas Jensen
> (remove underscore in email address to mail me)

/Anders


Thomas Jensen

Jul 7, 2002, 7:14:24 AM
Anders Dahlberg wrote:

[snip]

> Considering it compiles down to Java bytecode which is then run with a
> Java VM - it should scale as well as any Java solution (or at least ~90%).
> Links could be:
>
> http://java.sun.com ;)
> http://www.bea.com/products/weblogic/jrockit/index.shtml - a very good
> server VM (optimised for Intel).

[snip]

> disclaimer: This is just what I believe - if you want to base your case on
> this, I believe you should contact other companies using Java with
> multiple CPUs!

I am quite certain that scaling (well) to multiple CPUs requires one to
use threading at least. Scaling to several physical machines might be
(relatively) easy with Java, but I imagine it must require some coding, or?

Paul Rubin

Jul 7, 2002, 7:19:20 AM
Thomas Jensen <spam@ob_scure.dk> writes:
> Yes and no. In my estimate, the primary reason for the slowness of the
> current job is inefficient data access and data handling.
> A few examples:
> * Using "SELECT TOP 1 value FROM T_MyTable ORDER BY value DESC" to get the
> maximum value.
> * linear searches
> * LOTS of SQL calls returning only one row
>
> When the job was originally written a lot of factors were not known
> and, well, it did its job.
> However now the amount of data requires better performance.

It sounds like your application would speed up a lot by judiciously
adding some indices to your tables. Talk to your DBA.

> >> * Improved scalability - parallel processing on multiple machines/CPUs
> > This might be more easily accomplished with Java, depending on exactly
> > how you intend to implement it. Java is probably the best tool for
> > distributed processing; in particular JINI is ideal for this kind of
> > thing.
>
> I don't know much Java, I must admit. However, for my needs I believe
> XML-RPC will do just fine.

It doesn't sound to me like you need anything like this. Reorganizing
your database may completely solve your problem.

Thomas Jensen

Jul 7, 2002, 7:20:51 AM
Alex Martelli wrote:
>
> If you have substantial numeric computations to perform on large
> arrays, you're almost surely better off using the Numeric package
> to extend Python, particularly if performance is an issue.

http://www.pfdubois.com/numpy/ ?
It looks just like what I need!
Thanks!

Paul Rubin

Jul 7, 2002, 7:21:39 AM
Thomas Jensen <spam@ob_scure.dk> writes:
> I gotta look into this.
> However, I am uncertain as to how to structure my program.
> One of the tasks of the program will be to calculate the standard
> deviation of rows of daily values (which are the result of another
> calculation, etc). I was planning on using lists and tuples like this:
> [(date, value), (date, value), ...]
> How well will this perform, I wonder? Since lists and tuples are Python
> structures, won't they still be "slow" to traverse?

Lists and tuples are fast to traverse (they're just vectors in memory)
but I don't see their relevance if by "rows" you mean database rows.
You have to iterate over the rows and compute the SD. I expect the
time for that will be mostly taken by database operations.

Anders Dahlberg

Jul 7, 2002, 7:29:13 AM

"Thomas Jensen" <spam@ob_scure.dk> wrote in message
news:3D282290.1050101@ob_scure.dk...

> Anders Dahlberg wrote:
>
> [snip]
>
> > Considering it compiles down to Java bytecode which is then run with a
> > Java VM - it should scale as well as any Java solution (or at least ~90%).
> > Links could be:
> >
> > http://java.sun.com ;)
> > http://www.bea.com/products/weblogic/jrockit/index.shtml - a very good
> > server VM (optimised for Intel).
>
> [snip]
>
> > disclaimer: This is just what I believe - if you want to base your case on
> > this, I believe you should contact other companies using Java with
> > multiple CPUs!
>
> I am quite certain that scaling (well) to multiple CPUs requires one to
> use threading at least.

Yes, this should be quite straightforward to implement.

> Scaling to several physical machines might be
> (relatively) easy with Java, but I imagine it must require some coding, or?

Yes, Java is no magic bullet :(

I *imagine* it would be "easy" due to the Java RMI protocol and maybe even a
technology such as JavaSpaces (distributed space - "Linda"?).
E.g. http://www.javaworld.com/javaworld/jw-10-2000/jw-1002-jiniology-p5.html
(it's not Python - but should be similar to a Jython implementation)

> --
> Best Regards
> Thomas Jensen
> (remove underscore in email address to mail me)

/Anders


Thomas Jensen

Jul 7, 2002, 7:40:33 AM
Paul Rubin wrote:

> It sounds like your application would speed up a lot by judiciously
> adding some indices to your tables. Talk to your DBA.

That's me (small company :-)
Seriously, I've been spending some time optimizing the DB recently, which
resulted in a speedup of a factor of about 2. Most long-running calls
are gone now, but there are still architectural flaws in the (C++) code
that cause major slowdowns.
Still, we'd like to be able to handle >10 times the current amount of data,
preferably without buying 10 times the horsepower :-).

[snip - rewriting jobs]

> It doesn't sound to me like you need anything like this. Reorganizing
> your database may completely solve your problem.

We do need to consider several modifications to our database - indices
as well as new/modified tables. However the current job would need
substantial modifications to take advantage of it.

Paul Rubin

Jul 7, 2002, 7:41:55 AM
Thomas Jensen <spam@ob_scure.dk> writes:
> I am quite certain that scaling (well) to multiple CPUs requires one to
> use threading at least. Scaling to several physical machines might be
> (relatively) easy with Java, but I imagine it must require some coding,
> or?

You are obsessed with this multi-CPU stuff but have not given the
slightest bit of evidence that you need that complexity to meet your
performance goals. Spend your time trying to understand your problem
better rather than throwing fancy technology at it. Chances are you
can do what you need with a simple, single-CPU approach.

Alex Martelli

Jul 7, 2002, 7:48:55 AM
Thomas Jensen wrote:
...

> I was planning on spawning one single-threaded XMLRPC-server per CPU per
> machine and then having a control process on one of the machines with a
> thread per process. These threads would fetch unit numbers from a Queue
> object and call the XMLRPC server using xmlrpclib.
>
> Am I correct in believing that this would utilize all CPUs? (Windows
> issues aside).

Yes, but xmlrpclib may not yield the best performance -- since performance
appears to be key for you, you may want to look at other distributed
computing solutions, such as pyro (pyro.sourceforge.net).


Alex

Alex Martelli

Jul 7, 2002, 7:54:09 AM
Thomas Jensen wrote:
...

> Oh, I thought Zope was "just" a product.

It's a product. Not sure what the "just" means here.

> Google and Fidelity, nice! (Do you have any links regarding this?)

Look at Google's own site, where they list job prospects, to see
Python as a prereq for some job positions.

About Fidelity, I think you may want to contact ReportLab -- it's
their customer, after all; I only know about it because it was
mentioned at EuroPython recently.


>> Point 3 is troublesome if taken literally -- I know of NO language
>> claiming 99.999% freedom from defects for its implementations.
>
> I'm sorry I didn't make myself clear (in Denmark 5 nines usually means
> uptime or accessibility for web services). Actually I believe we "only"
> guarantee 99.99% to our customers :-)

That's quite another issue, and mostly related to setting up redundancy
suitably (no single point of failure). And one in ten thousand IS much
more reasonable / achievable than one in a hundred thousand.

> I believe the concern was more about Python taking down our servers and
> thereby ruining our uptime (we have a load-balancing unit, so both
> servers would have to go down).

No reason to think that's any more likely than with any other
programming language. We're talking about a language that's been
around for over ten years, after all, not a "new kid on the block".

> Having used Python quite a bit, I know that wouldn't happen, but two
> people saying it is better than one :-)
>
> I've worked a lot with MS tools professionally and must say that my view
> of MS has gone from bad to worse. Visual Basic is by far the worst,
> most clumsy and buggy computer language I've ever used. I've come to
> really hate that language. The only "good" thing about it is that it's
> so closely bound to COM that COM components are really easy to make.
> Don't even get me started about the stability of using IIS for SOAP
> servers :-(

I won't. But -- any opinion on the VB7, aka VBNET, release? On paper
it looks better than VB6 (it should -- they made two dozen incompatible
changes, each taking it a bit closer to Python:-) but I have no idea
about the reliability of the implementation, not having yet had to use it.


Alex

Thomas Jensen

Jul 7, 2002, 8:11:36 AM

:-)

Please read my original post again. I merely said that one of the
design goals was scalability.
The current job takes about 5 hours to complete! I am absolutely certain
that I would be able to write a new job in any language whatsoever (be
it Python, C++ or even *shiver* Basic) that would complete the job in
less than 30 minutes, given the right DB optimizations and program
design. It could be written entirely in SQL for that matter (which would
probably perform rather well!).

"Spend your time trying to understand your problem better rather than
throwing fancy technology at it"

This is exactly what I'm trying to do. Distributed computing is far from
the number one priority in this project!
However, since no one knows exactly how much data we will be handling in
a year or two, I have been asked to make sure the job is written in a
scalable manner.
Once the algorithms and database design have been optimized, there is
still an upper bound as to how much data one CPU can handle, don't you
agree?
I expect it to be much easier to build the job around a distributed core
now, rather than adding support later.

Thomas Jensen

Jul 7, 2002, 8:29:42 AM
Alex Martelli wrote:
> Thomas Jensen wrote:
> ...
>
>>Oh, I thought Zope was "just" a product.
> It's a product. Not sure what the "just" means here.

I didn't know there was a Zope Corporation :-)

>>I'm sorry I didn't make myself clear (in Denmark 5 nines usually means
>>uptime or accessibility for web services). Actually I believe we "only"
>>guarantee 99.99% to our customers :-)
>
> That's quite another issue, and mostly related to setting up redundancy
> suitably (no single point of failure). And one in ten thousand IS much
> more reasonable / achievable than one in a hundred thousand.

I guess it is, and we do have redundancy on most of our systems. (We are
currently looking into total redundancy, but it quickly adds up).

[snip - VB6, etc]

> I won't. But -- any opinion on the VB7, aka VBNET, release? On paper
> it looks better than VB6 (it should -- they made two dozen incompatible
> changes, each taking it a bit closer to Python:-) but I have no idea
> about the reliability of the implementation, not having yet had to use it.

Me neither.
I think there are fundamental problems in basing a professional
programming language on Basic, but they may have twisted the Basic part
so much that it is actually getting better :-)

Paul Rubin

Jul 7, 2002, 9:02:16 AM
Thomas Jensen <spam@ob_scure.dk> writes:
> Please read my original post again. I merely said that one of the
> design goals was scalability.
> The current job takes about 5 hours to complete!

Is 5 hours acceptable, if your data doesn't get any bigger?
If not, what's the maximum you can accept?

> I am absolutely certain that I would be able to write a new job in
> any language whatsoever (be it Python, C++ or even *shiver* Basic)
> that would complete the job in less than 30 minutes, given the right
> DB optimizations and program design. It could be written entirely in
> SQL for that matter (which would probably perform rather well!).

OK, you're certain you can do it in 30 minutes. Are you certain
you CAN'T do it in 5 minutes? If you can do it in 5 minutes, maybe
you can stop worrying about scaling.

> "Spend your time trying to understand your problem better rather than
> throwing fancy technology at it"
>
> This is exactly what I'm trying to do. Distributed computing is far
> from the number one priority in this project!
> However, since no one knows exactly how much data we will be handling
> in a year or two, I have been asked to make sure the job is written in
> a scalable manner.

In another post you said you wanted to handle 10 times as much data
as you currently handle. Now you say it's not known exactly--do you
have an idea or not?

If it's acceptable for the program to need 3 hours, and you can handle
the current data size in 10 minutes, then you can handle 10x the data
size with plenty of speed to spare (assuming no
seriously-worse-than-linear-time processes).

> Once the algorithms and database design have been optimized, there is
> still an upper bound as to how much data one CPU can handle, don't you
> agree?

I think the bottleneck is going to be the database. You might not get
better throughput with multiple client CPU's than with just one. If
you do, maybe your client application needs more optimization.

> I expect it to be much easier to build the job around a distributed
> core now, rather than adding support later.

First gather some evidence that a distributed client will be better
than a single one for large datasets. It could well be that you'll
never have reason to add distributed support.

What is the application? What is the data and what do you REALLY need
to do with it? How much is there ever REALLY likely to be? Is an SQL
database even really the best way to store and access it? If there's
not multiple processes updating it, maybe you don't need its overhead.
Could a 1960's mainframe programmer deal with your problem, and if
s/he could deal with it at all, why do you need multiple CPU's when
each one is 1000 times faster than the 1960's computer?

Inside most complicated programs there's a simple program struggling
to get out.

Alex Martelli

Jul 7, 2002, 10:10:05 AM
Thomas Jensen wrote:
...

> I didn't know there was a Zope Corporation :-)

They used to have a different name ("Digital Creations", I believe), but
www.zope.com now gives their name as "Zope Corporation".

> I guess it is, and we do have redundancy on most of our systems. (We are
> currently looking into total redundancy, but it quickly adds up).

Very. For 99.99% uptime, you may be better off accepting one or two
highly reliable single-point-of-failure subsystems, rather than trying
to add redundancy *every*where (as higher levels might require).


Alex

Alex Martelli

Jul 7, 2002, 10:19:51 AM
Thomas Jensen wrote:

> Alex Martelli wrote:
>>
>> If you have substantial numeric computations to perform on large
>> arrays, you're almost surely better off using the Numeric package
>> to extend Python, particularly if performance is an issue.
>
> http://www.pfdubois.com/numpy/ ?
> It looks just like what I need!

It's indeed a wonderful tool for computations on large arrays, vastly
extending Python's applicability to scientific applications.

> Thanks!

You're welcome! I take advantage of the occasion to mention that
Paul Dubois is among the nominees for the "Programmers' Choice"
award, http://www.activestate.com/Corporate/Awards/ActiveAwards.html
(so am I, but I never contributed to Numeric!-), open for everybody
to vote until July 17. If you find Numeric the best thing since
sliced bread, visiting the site and voting for Paul might be a nice
way to signal your appreciation.


Alex

Peter Hansen

Jul 7, 2002, 10:54:10 AM
Alex Martelli wrote:
>
> You're welcome! I take advantage of the occasion to mention that
> Paul Dubois is among the nominees for the "Programmers' Choice"
> award, http://www.activestate.com/Corporate/Awards/ActiveAwards.html
> (so am I, but I never contributed to Numeric!-), open for everybody
> to vote until July 17. If you find Numeric the best thing since
> sliced bread, visiting the site and voting for Paul might be a nice
> way to signal your appreciation.

Unfortunately it is not possible to vote only for the Python programmers,
making it impossible for me to vote in an ethical manner. I know
nothing of most of the other areas (Perl, etc.) and I won't resort
to random voting just to get in my votes for Paul, you, or the others.

(I assume I didn't miss something there. I left the others
unselected and it refused my submission.)

-Peter

Thomas Jensen

Jul 7, 2002, 11:57:18 AM
Paul Rubin wrote:
> Thomas Jensen <spam@ob_scure.dk> writes:
>>The current job takes about 5 hours to complete!
>
> Is 5 hours acceptable, if your data doesn't get any bigger?
> It not, what's the maximum you can accept?

5 hours is about the maximum acceptable. Usually the time is a little
shorter, but 5 hours happens when a lot of new data is added.
However, this is a case of "faster is better". 5 hours is acceptable, but
1 minute would open up new business opportunities.

> OK, you're certain you can do it in 30 minutes. Are you certain
> you CAN'T do it in 5 minutes? If you can do it in 5 minutes, maybe
> you can stop worrying about scaling.

It would perhaps be possible to reach 5 minutes. However I am quite
certain that the effort and development time of going from 10 minutes to
5 minutes would be far greater than writing a simple RPC client/server
architecture. Since we have 8 (or is it 6, I don't remember) CPUs that
are mostly idle at nighttime, why not utilize them?

> In another post you said you wanted to handle 10 times as much data
> as you currently handle. Now you say it's not known exactly--do you
> have an idea or not?

AFAIR I said at least 10 times (that's what I meant, anyway).
Faster calculations => able to handle more data => new business
opportunities.
If it was a big problem making the program distributed, I probably
wouldn't consider it.

> If it's acceptable for the program to need 3 hours, and you can handle
> the current data size in 10 minutes, then you can handle 10x the data
> size with plenty of speed to spare (assuming no
> seriously-worse-than-linear-time processes).

Most sub-calculations currently scale almost linearly, assuming indices
are set up correctly.

> I think the bottleneck is going to be the database. You might not get
> better throughput with multiple client CPU's than with just one. If
> you do, maybe your client application needs more optimization.

We already have 2 DB Servers, a master replicating changes to a slave.
Our analysis shows that most database operations are/will be SELECTs.
Adding more DB servers is trivial, especially if we migrate to MySQL
(well, cheaper at least :-)

> What is the application? What is the data and what do you REALLY need
> to do with it? How much is there ever REALLY likely to be? Is an SQL
> database even really the best way to store and access it? If there's
> not multiple processes updating it, maybe you don't need its overhead.

We have several applications accessing the DB via SQL. Migrating to
something else is out of the question in the near future.

> Could a 1960's mainframe programmer deal with your problem, and if
> s/he could deal with it at all, why do you need multiple CPU's when
> each one is 1000 times faster than the 1960's computer?

I've never used/programmed a mainframe, so I don't know :-)

> Inside most complicated programs there's a simple program struggling
> to get out.

I agree, I really do. I usually rewrite most of my programs 1 or more
times to make them less complicated.

Before going on with the distributed approach, I will probably write a
"proof of concept" demo. Should this demo show that it is not worth the
effort, I will put it aside for now.

The reason I mentioned XML-RPC earlier is that I've used it before and in
my opinion it is extremely easy and intuitive to use.
The model I'm currently working with is characterized by having relatively
few RPC calls, with each call having only one integer as input and output.
Should I require more complex data structures, I'd probably look for a
binary protocol.

But all that apart - the distributed part is not really the hard or
complex part of this project. I understand that as soon as the
calculations take place in more than one thread (be it on one or more
CPUs/machines) it adds some complexity. However, designing the
application in such a way that parallel computations are
possible/plausible can't be that bad, I think.

I really see all this distribution talk as one among several
optimization strategies.

An extreme example of another strategy: Develop the entire thing in
assembler, using flat files or entirely bypassing the file-system.
If done correctly, it would probably outperform other strategies by far,
but it would also be:
* Less maintainable
* Less readable
* a lot harder to use from ASP/PHP
* etc

Sometimes you just have to choose.

If you like I can come up with some less extreme examples :-)

Thomas Jensen

Jul 7, 2002, 12:00:34 PM
Alex Martelli wrote:

> It's indeed a wonderful tool for computations on large arrays, vastly
> extending Python's applicability to scientific applications.

I can't wait to dive into it!

> to vote until July 17. If you find Numeric the best thing since
> sliced bread, visiting the site and voting for Paul might be a nice
> way to signal your appreciation.

I will!

Alex Martelli

Jul 7, 2002, 12:08:35 PM
Peter Hansen wrote:
...
>> award, http://www.activestate.com/Corporate/Awards/ActiveAwards.html
...

> Unfortunately it is not possible to vote only for the Python programmers,

Didn't USE to be, but they fixed it -- you can now choose "abstain
for each category" for each language you don't want to vote on, and
indeed it's now the default for each category.

> making it impossible for me to vote in an ethical manner. I know
> nothing of most of the other areas (Perl, etc.) and I won't resort
> to random voting just to get in my votes for Paul, you, or the others.

This does you honor and indeed many others must have felt the same
way -- some, enough to let ActiveState know, since ActiveState did
immediately respond by fixing the page accordingly.

> (I assume I didn't miss something there. I left the others
> unselected and it refused my submission.)

It was that way until a few days ago, but now it's fixed. Retry...


Alex

William Park

Jul 7, 2002, 6:47:14 PM
Thomas Jensen <spam@ob_scure.dk> wrote:
> We already have 2 DB Servers, a master replicating changes to a slave.
> Our analysis shows that most database operations are/will be SELECTs.
> Adding more DB servers is trivial, especially if we migrate to MySQL
> (well, cheaper at least :-)

As I and others have said, deal with algorithm issues first. Especially,
since you already have something that is working.

It may be that you are getting killed by overheads. For example, if your
situation goes something like
Given table of (a, b, x, y, z),
select a=1, b=1; then do something with x, y, z; insert it back.
select a=1, b=2; then do something with x, y, z; insert it back.
...
select a=2, b=1; then do something with x, y, z; insert it back.
select a=2, b=2; then do something with x, y, z; insert it back.
...
(1 million lines)
Then, you have
1e6 x 2 x (connect time, search time, load time, disconnect time)

Can you dump the whole thing as a text file in one shot, do whatever with
(a,b,x,y,z), and load it back in one shot?
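In DB-API terms, the batched version would look something like this (a
rough, untested sketch; the module choice, table name and the "do
something" step are made up):

import MySQLdb  # any DB-API 2.0 module works much the same way

conn = MySQLdb.connect(db="mydb")   # connection details made up
cur = conn.cursor()

# One SELECT instead of a million:
cur.execute("SELECT a, b, x, y, z FROM T_Table")
updates = []
for (a, b, x, y, z) in cur.fetchall():
    z = x * y                       # stand-in for "do something"
    updates.append((z, a, b))

# One batched write-back instead of a million round trips:
cur.executemany("UPDATE T_Table SET z = %s WHERE a = %s AND b = %s", updates)
conn.commit()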

--
William Park, Open Geometry Consulting, <openge...@yahoo.ca>
8-CPU Cluster, Hosting, NAS, Linux, LaTeX, python, vim, mutt, tin

Cameron Laird

Jul 8, 2002, 12:12:52 AM
In article <7xwus7n...@ruckus.brouhaha.com>,

Me, too. While I know quite well how difficult
it is to describe any program that's worth
writing, what we've heard of this one puzzles me.
I'll summarize by saying simply that I'm with
Paul: I *strongly* suspect that database
operations swamp arithmetic operations in elapsed
time, and that attention to the former will be
most rewarding.

You've mentioned once already that you might do
more with your SQL. I can imagine that much the
greatest returns in performance will come from
writing more of your algorithms in SQL. That's
likely to be a more scalable and satisfying
approach than the multi-processing complexities
at which you've hinted.
--

Cameron Laird <Cam...@Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

Paul Rubin

Jul 8, 2002, 12:13:34 AM
cla...@starbase.neosoft.com (Cameron Laird) writes:
> You've mentioned once already that you might do more with your SQL.
> I can imagine that much the greatest returns in performance will
> come from writing more of your algorithms in SQL. That's likely to
> be a more scalable and satisfying approach than the
> multi-processing complexities at which you've hinted.

Actually, SQL tends to be awfully slow too. Recent Oracle versions let
you write stored procedures in Java, which are orders of magnitude
faster than PL/SQL procedures. MySQL also has some alternate language
interfaces for stored procedures now, but I'm not up on them.

Paul Rubin

Jul 8, 2002, 12:23:51 AM
Thomas Jensen <spam@ob_scure.dk> writes:
> > Is 5 hours acceptable, if your data doesn't get any bigger?
> > If not, what's the maximum you can accept?
>
> 5 hours is about the maximum acceptable. Usually the time is a little
> shorter, but 5 hours happens when a lot of new data is added.
> However, this is a case of "faster is better". 5 hours is acceptable,
> but 1 minute would open up new business opportunities.

OK, that makes sense.

> > I think the bottleneck is going to be the database. You might not get
> > better throughput with multiple client CPU's than with just one. If
> > you do, maybe your client application needs more optimization.
>
> We already have 2 DB Servers, a master replicating changes to a slave.
> Our analysis shows that most database operations are/will be SELECTs.
> Adding more DB servers is trivial, especially if we migrate to MySQL
> (well, cheaper at least :-)

If you're doing all these single row selects without many updates,
and you're not doing many joins, it really sounds more and more like
an SQL database isn't the best tool for your task.

> Before going on with the distributed approach, I will probably write a
> "proof of concept" demo. Should this demo show that it is not worth
> the effort, I will put it aside for now.

Fair enough. You could just check the CPU load on your SQL server
right now, as your single client runs.

> But all that apart - the distributed part is not really the hard or
> complex part of this project. I understand that as soon as the
> calculations take place in more than one thread (be it on one or more
> CPUs/machines) it adds some complexity. However, designing the
> application in such a way that parallel computations are
> possible/plausible can't be that bad, I think.

What kinds of calculations are these really? The only one you've
described so far is selecting a bunch of rows and computing the SD of
the numbers in one column. It may be fastest to do that with a
server-side stored procedure.

> I really see all this distribution talk as one among several
> optimization strategies.

OK, that's good, as long as you see there's a range of approaches.
Sometimes all someone will have is a hammer and everything looks like
a thumb ;-).

> An extreme example of another strategy: Develop the entire thing in
> assembler, using flat files or entirely bypassing the file-system.
> If done correctly, it would probably outperform other strategies by
> far, but it would also be:
> * Less maintainable
> * Less readable
> * a lot harder to use from ASP/PHP
> * etc

If your data layout is simple enough you might just store it in a
fixed-width record format, then mmap() it into memory and crunch it
with a C program (or even a Python program). That approach is
generally simple and fast. It will probably outperform any SQL
approach by orders of magnitude.
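A sketch of what I mean, in Python (the record layout and file name are
made up; the same idea works from C):

import mmap, os, struct

RECFMT = "=id"                  # e.g. int date, double value per record
RECSIZE = struct.calcsize(RECFMT)

f = open("data.bin", "rb")
size = os.fstat(f.fileno())[6]  # stat tuple index 6 is st_size
m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

n = size / RECSIZE
total = 0.0
for i in range(n):
    date, value = struct.unpack(RECFMT, m[i*RECSIZE:(i+1)*RECSIZE])
    total = total + value
print "records:", n, "mean value:", total / n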

Gerhard Häring

Jul 8, 2002, 12:24:16 AM
* Paul Rubin <phr-n...@NOSPAMnightsong.com> [2002-07-07 21:13 -0700]:

> cla...@starbase.neosoft.com (Cameron Laird) writes:
> > You've mentioned once already that you might do more with your SQL.
> > I can imagine that much the greatest returns in performance will
> > come from writing more of your algorithms in SQL. That's likely to
> > be a more scalable and satisfying approach than the
> > multi-processing complexities at which you've hinted.
>
> Actually, SQL tends to be awfully slow too. Recent Oracle versions let
> you write stored procedures in Java,

Certainly sounds like a good deal for Oracle, Sun and the chip
producers. I'm not sure if it solves any real problem.

> which are orders of magnitude faster than PL/SQL procedures. MySQL
> also has some alternate language interfaces for stored procedures now
> but I'm not up on them.

PostgreSQL allows you to write stored procedures in Python (and Tcl, and
Perl, and PL/pgSQL). Now that's useful.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id AD24C930
public key fingerprint: 3FCC 8700 3012 0A9E B0C9 3667 814B 9CAA AD24 C930
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))


Thomas Jensen

Jul 8, 2002, 4:47:27 AM
William Park wrote:
> Thomas Jensen <spam@ob_scure.dk> wrote:
>
>>We already have 2 DB Servers, a master replicating changes to a slave.
>>Our analysis shows that most database operations are/will be SELECTs.
>>Adding more DB servers is trivial, especially if we migrate to MySQL
>>(well, cheaper at least :-)
>
> As I and others have said, deal with algorithm issues first. Especially,
> since you already have something that is working.

Simply rewriting the current job to be distributed has never been the
plan. I am very grateful for all the kind advice regarding algorithm
design. I assure you that a considerable amount of time has gone
into algorithm design already.

> It may be that you are getting killed by overheads. For example, if your
> situation goes something like
> Given table of (a, b, x, y, z),
> select a=1, b=1; then do something with x, y, z; insert it back.
> select a=1, b=2; then do something with x, y, z; insert it back.
> ...
> select a=2, b=1; then do something with x, y, z; insert it back.
> select a=2, b=2; then do something with x, y, z; insert it back.
> ...
> (1 million lines)
> Then, you have
> 1e6 x 2 x (connect time, search time, load time, disconnect time)
>
> Can you dump the whole thing as text file in one-shot, do whatever with
> (a,b,x,y,z), and load it back in one-shot?

It's something like that; actually it's more like this:
select a from T_A;
select b from T_B;
select c from T_C;
calculate and update c;
The problem is that most of the job is built around this model (there's
also T_D, T_E, T_F, etc., some alike, some not), so changing the general
approach would require rewriting most of the program anyway (touching
at least 75% of the code, I estimate).

The current model looks more like this (this is a very simplified
example that only shows a very small part of the calculation):
select date, value from T_A where unitid = x order by date;
calculate T_B values from T_A and perhaps external values
select date, value from T_B where unitid = x order by date;
update T_B where it differs from the calculated values

We are aware that there may be other models which would be faster;
however, apart from being fast, the calculations must *always* be
correct. This is the model we have chosen to achieve that (since factors
other than T_A may affect the value of T_B).

I don't think dumping anything to a text file will be necessary;
however, I will consider it should there be problems.
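
For the curious, the skeleton of one such per-unit pass looks roughly
like this in DB-API terms (derive_b_values() stands in for the real
financial calculation, conn is an open connection, and the qmark
paramstyle depends on the driver):

    def recalc_unit(conn, unitid):
        cur = conn.cursor()
        cur.execute("SELECT date, value FROM T_A WHERE unitid = ?"
                    " ORDER BY date", (unitid,))
        calculated = derive_b_values(cur.fetchall())  # placeholder
        cur.execute("SELECT date, value FROM T_B WHERE unitid = ?"
                    " ORDER BY date", (unitid,))
        stored = dict(cur.fetchall())
        for date, value in calculated:
            if stored.get(date) != value:             # update only diffs
                cur.execute("UPDATE T_B SET value = ?"
                            " WHERE unitid = ? AND date = ?",
                            (value, unitid, date))
        conn.commit()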

Thomas Jensen

unread,
Jul 8, 2002, 5:12:11 AM7/8/02
to
Paul Rubin wrote:
> Thomas Jensen <spam@ob_scure.dk> writes:

> If you're doing all these single row selects without many updates,
> and you're not doing many joins, it really sounds more and more like
> an SQL database isn't the best tool for your task.

I agree, however the data needs to end up in a SQL database, since we
have quite a lot of code (ASP code and COM components) depending on it
being a SQL database (some doing quite complex joins, etc).

>>Before going on with the distributed approach, I will probably write a
>>"proof of concept" demo. Should this demo show, that it is not worth
>>the effort, I will put it aside for now.
>
> Fair enough. You could just check the CPU load on your SQL server
> right now, as your single client runs.

25-50%, but that's not a realistic measure, since the client makes a
huge number of small selects, which probably makes network latency play
some role.

> What kinds of calculations are these really? The only one you've

All kinds, ranging from complex financial calculations to simple AVG,
MIN, MAX. Sometimes the output of a calculation is smaller (byte-wise)
than the input, sometimes larger.

> described so far is selecting a bunch of rows and computing the SD of
> the numbers on one column. It may be fastest to do that with a server
> sided stored procedure.

I agree, and indeed we considered writing the entire job using Stored
Procs. We ditched the idea because:
* MSSQL7 is not integrated with SourceSafe (AFAIK?)
* We generally like the idea of separating data and code.
* (eh, other stuff I don't remember right now :-)
Actually, parts of the calculations are already written in SPs (for some
realtime calculations), but to our surprise they didn't deliver the
performance one would expect.

>>I really see all this distribution talk as one among several
>>optimization strategies.
>
> OK, that's good, as long as you see there's a range of approaches.
> Sometimes all someone will have is a hammer and everything looks like
> a thumb ;-).

Hehe, that's right :-)

> If your data layout is simple enough you might just store it in a
> fixed-width record format, then mmap() it into memory and crunch it
> with a C program (or even a Python program). That approach is
> generally simple and fast. It will probably outperform any SQL
> approach by orders of magnitude.

Please see above why this wouldn't work.

Cameron Laird

unread,
Jul 8, 2002, 7:58:40 AM7/8/02
to
In article <7xznx2j...@ruckus.brouhaha.com>,
Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote:
.
.
.

>Actually, SQL tends to be awful slow too. Recent Oracle versions let
>you write stored procedures in Java, which are orders of magnitude
>faster than PL/SQL procedures. MySQL also has some alternate language
>interfaces for stored procedures now but I'm not up on them.

I'm *not* ready to ratify these propositions.

SQL's a special-purpose language--there's no question about
that. To call it slow ... well, I just don't find it
sufficiently comparable to general-purpose languages for that
to be meaningful to me. What I do know is that it's very
typical for beginners in SQL to do simple-minded accesses,
then sort-and-calculate in a host (client-side) language,
without realizing that they might get ORDERS of magnitude
better performance by having SQL do the calculations on the
server side.
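
A caricature of the two styles, with invented table and column names
(cur is an open DB-API cursor):

    # all rows cross the network; the client does the arithmetic
    cur.execute("SELECT value FROM T_A WHERE unitid = 42")
    rows = cur.fetchall()
    avg = sum([r[0] for r in rows]) / len(rows)

    # one row crosses the network; the server does the arithmetic
    cur.execute("SELECT AVG(value) FROM T_A WHERE unitid = 42")
    avg = cur.fetchone()[0]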

It's also possible to push calculations to the server side
and degrade performance. I'm simply urging Mr. Jensen to
follow through on his inclination to look to SQL improvements
as an early step in performance improvement.

We don't know yet, do we, what RDBMS he uses? Yes, the
vendors certainly do attempt product differentiation by the
extension languages they offer for stored procedures,
server-side functions, and other goodies.

Michael Hudson

unread,
Jul 8, 2002, 7:30:50 AM7/8/02
to
Thomas Jensen <spam@ob_scure.dk> writes:

> (Perhaps I'm a bit paranoid about spam, but on the other side I get
> less than one spam mail per week, which I believe is rather low ;-)

Bloody hell yes. Though spamassassin has cut the spam that hits my
inbox to about that level...

Cheers,
M.

--
Academic politics is the most vicious and bitter form of politics,
because the stakes are so low. -- Wallace Sayre

Michael Hudson

unread,
Jul 8, 2002, 7:43:18 AM7/8/02
to
Thomas Jensen <nu...@obscure.dk.X> writes:

> Hello group (and list :-),

They don't seem to be online yet, but Tim Couper's slides on "Selling
Python to a Fortune 500 Company" from EuroPython might be interesting
to read.

They'll appear here:

http://europython.zope.nl/sessions/presentations/all

soonish (I hope).

Cheers,
M.

--
Any form of evilness that can be detected without *too* much effort
is worth it... I have no idea what kind of evil we're looking for
here or how to detect is, so I can't answer yes or no.
-- Guido Van Rossum, python-dev

Stephan R.A. Deibel

unread,
Jul 8, 2002, 8:22:56 AM7/8/02
to
On Sat, 6 Jul 2002, Laura Creighton wrote:
> 2) He said, "Name 3 companies using Python for key functions"

1) Industrial Light and Magic uses Python extensively, e.g. in creating
Star Wars II.

2) Google is based on Python.

3) NASA uses Python quite a bit. For example:

http://builder.com.com/article.jhtml?id=u00420020617DGS01.htm

(does require a free membership to see the article)

Some more links to possibly useful marketing-python sort of stuff
are here:

http://pythonology.org/

HTH.

- Stephan

------------------------------------------------------------------------
Wing IDE for Python Archaeopteryx Software, Inc
www.wingide.com Take Flight!


Neil Schemenauer

unread,
Jul 8, 2002, 11:18:30 AM7/8/02
to
Thomas Jensen wrote:
> I am quite certain that scaling (well) to multiple CPUs requires one to
> use threading at least.

What's wrong with fork()?

Neil


Bo M. Maryniuck

unread,
Jul 8, 2002, 11:18:10 AM7/8/02
to
On Monday 08 July 2002 17:18, Neil Schemenauer wrote:
> What's wrong with fork()?

Certainly resources. Maybe... ;)

--
Sincerely yours, Bogdan M. Maryniuck

"What you end up with, after running an operating system concept through
these many marketing coffee filters, is something not unlike plain hot
water."
(By Matt Welsh)

Thomas Jensen

unread,
Jul 8, 2002, 5:43:38 PM7/8/02
to
Cameron Laird wrote:
> Me, too. While I know quite well how difficult
> it is to describe any program that's worth wri-
> ting, what we've heard of this one puzzles me.
> I'll summarize by saying simply that I'm with
> Paul: I *strongly* suspect that database opera-
> tions swamp arithmetic operations in elapsed
> time, and that attention to the former will be
> most rewarding.

I have, on purpose, not described the workings of this program in very
great detail, since my original post was more about the general idea of
using Python for this kind of job. Having easy access to distributed
computations is merely a bonus and, if nothing else, a buzz-word to
mention to management *hint*.
Furthermore, the ability to scale the application simply gives a good
feeling, even if it is *never* needed.

> You've mentioned once already that you might do
> more with your SQL. I can imagine that much the
> greatest returns in performance will come from
> writing more of your algorithms in SQL. That's
> likely to be a more scalable and satisfying ap-
> proach than the multi-processing complexities at
> which you've hinted.

Satisfying, perhaps, but could you elaborate on scalable?

I simply fail to see how it is that distributed computing is so bad?
Everybody seems to think that once you make something distributed, every
other optimization possibility simply disappears?
I never said distributed computing was a priority or even would be a
part of the first version. It *is* a design goal, however, that should we
one day, after all other optimizations in the world, using SQL, need
more speed, we can do so by adding machines/CPUs (be it DB servers or
application servers).

Thomas Jensen

unread,
Jul 8, 2002, 5:45:45 PM7/8/02
to

Michael Hudson wrote:
> They don't seem to be online yet, but Tim Couper's slides on "Selling
> Python to a Fortune 500 Company" from EuroPython might be interesting
> to read.
>
> They'll appear here:
>
> http://europython.zope.nl/sessions/presentations/all

Page bookmarked, thanks!

Thomas Jensen

unread,
Jul 8, 2002, 5:51:30 PM7/8/02
to
Neil Schemenauer wrote:
> What's wrong with fork()?

Available on Win32?
In our current situation that's a requirement.
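
On POSIX the idiom would be simple enough -- a sketch, with
recalc_unit() standing in for the real per-unit work -- but none of it
runs on NT:

    import os

    batches = [[1, 2, 3], [4, 5, 6]]   # hypothetical unit partitioning
    pids = []
    for batch in batches:
        pid = os.fork()                # not available on Win32
        if pid == 0:                   # child process
            for unitid in batch:
                pass                   # recalc_unit(conn, unitid) here
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)             # wait for all children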

Thomas Jensen

unread,
Jul 8, 2002, 5:57:58 PM7/8/02
to
Stephan R.A. Deibel wrote:
> On Sat, 6 Jul 2002, Laura Creighton wrote:

[snip - cool links]

Being away from this group for 6 months, I had almost forgotten how
kind and helpful the people here are - that goes for all of you, thanks.

Since I'm quite busy right now (and I'm on vacation, damn it :-) and am
leaving for Praha, Czech Republic on Thursday, I might not be able to
reply to every post. However, do know that I will utilize my own (and
Google's) mirror of this thread in the coming weeks.

Cheers! (already looking forward to cheap Budweiser and Urquell! :-)

Cameron Laird

unread,
Jul 8, 2002, 7:01:58 PM7/8/02
to
In article <3D2A078A.7040502@ob_scure.dk>,
Thomas Jensen <spam@ob_scure.dk> wrote:
.
.
.
>I have, on purpose, not described the workings of this program in very
>great detail, since my original post was more about the general idea of
>using Python for this kind of job. Having easy access to distributed
>computations is merely a bonus and, if nothing else, a buzz-word to
>mention to management *hint*.
>Furthermore, the ability to scale the application simply gives a good
>feeling, even if it is *never* needed.
.
.
.
>Satisfying, perhaps, but could you elaborate on scalable?
>
>I simply fail to see how it is that distributed computing is so bad?
>Everybody seems to think that once you make something distributed, every
>other optimization possibility simply disappears?
>I never said distributed computing was a priority or even would be a
>part of the first version. It *is* a design goal, however, that should we
>one day, after all other optimizations in the world, using SQL, need
>more speed, we can do so by adding machines/CPUs (be it DB servers or
>application servers).
.
.
.
I call SQL noodling "scalable" in the sense that good
SQL queries can be hosted on bigger and bigger servers.
We know how to do that--it's a commercial reality.

I *like* distributed computing. I've spent much of the
last eighteen months promoting SOAP, XML-RPC, and CORBA.
Your mention of Linda and its descendants, including
T-Spaces, thrilled me. HOWEVER, I rarely recommend
distribution for performance objectives, for reasons
that have mostly appeared already in this thread.
Commercial applications (as opposed to scientific ones)
just don't find success that way.

Your situation might be an exception. It's hard to know.
The computations you describe--DB retrievals, elementary
statistics, ...--sound to me like ones that I've seen
most successfully hosted on conventional architectures.

Thomas Jensen

unread,
Jul 8, 2002, 7:51:34 PM7/8/02
to
Cameron Laird wrote:
> In article <3D2A078A.7040502@ob_scure.dk>,

> I call SQL noodling "scalable" in the sense that good
> SQL queries can be hosted on bigger and bigger servers.
> We know how to do that--it's a commercial reality.

Ok, I understand.
I think it's often a question of choosing the right tool for the job.
Consider the following example: find the average of a series of values
found in a table. Of course(?) doing a "SELECT AVG(value) FROM T_MyTable
WHERE ..." would be much faster that retriving all the values and doing
the calculations on the client/app-server side. However if, for some
reason, the contents of T_MyTable was already in the clients memory
(perhaps it was calculated there), calculating the average on the client
would perhaps be faster.
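
(Sketch of that second case -- calculate_series() is a made-up
stand-in for whatever produced the series client-side:)

    values = calculate_series()        # hypothetical producer
    avg = sum(values) / len(values)    # no server round-trip needed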

Be assured, though, that for each calculation both SQL and
Python (/C++/VB or whatever it ends up being) solutions will be written
and the fastest chosen. As has been noted, the result might be that
the SQL approach is the fastest; only time will tell.

> I *like* distributed computing. I've spent much of the
> last eighteen months promoting SOAP, XML-RPC, and CORBA.
> Your mention of Linda and its descendants, including
> T-Spaces, thrilled me. HOWEVER, I rarely recommend
> distribution for performance objectives, for reasons
> that have mostly appeared already in this thread. Com-
> mercial applications (as opposed to scientific ones)
> just don't find success that way.

Well, you might be right, I don't know.
I'm a little scared, though, about using SQL too extensively.
I might be too much of an SQL newbie, but there's just some stuff that's
hard to write in (portable) SQL.
For example, I've done some quite fancy calculations using multiple
"DECLARE CURSOR" statements, etc. in MSSQL. However, trying to run these
through MySQL is, well, problematic.
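
A portable fallback would be to keep the cursor-style loop in Python,
roughly like this (running-balance example; cur and the names are
invented):

    # the kind of loop that needs DECLARE CURSOR in T-SQL
    cur.execute("SELECT date, value FROM T_A WHERE unitid = 42"
                " ORDER BY date")
    running = 0.0
    for date, value in cur.fetchall():
        running = running + value
        # ... use (date, running) here ...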

> Your situation might be an exception. It's hard to know.
> The computations you describe--DB retrievals, elementary
> statistics, ...--sound to me like ones that I've seen
> most successfully hosted on conventional architectures.

I think I'm currently planning on a 90% conventional design with the
possibility of later expansion to distributed computing :-)

The last 6 months I've been working almost exclusively on a (commercial)
project heavily based on SOAP (not for performance objectives though :-).
That part really doesn't scare me :-)

Cameron Laird

unread,
Jul 8, 2002, 10:19:43 PM7/8/02
to
In article <3D2A2586.5070302@ob_scure.dk>,
Thomas Jensen <sp...@obscure.dk> wrote:
.
.
.

>The last 6 months I've been working almost exclusively on a (commercial)
>project heavily based on SOAP (not for performance objectives though :-).
>That part really doesn't scare me :-)
.
.
.
Let me know if you want to publicize your work.
I frequently write articles and profiles of SOAP
projects for State-side magazines (and worldwide,
when there's an opportunity).

Alex Martelli

unread,
Jul 9, 2002, 1:49:52 AM7/9/02
to
Thomas Jensen wrote:
...

> I'm a little scared though about using SQL too extensivly.

Joe Celko's "SQL for Smarties" might be just what you need to move you
to the next level of SQL usage (or, to turn you off it for life,
depending:-).


Alex

Paul Rubin

unread,
Jul 9, 2002, 2:12:01 AM7/9/02
to
Alex Martelli <al...@aleax.it> writes:
> > I'm a little scared though about using SQL too extensivly.
>
> Joe Celko's "SQL for Smarties" might be just what you need to move you
> to the next level of SQL usage (or, to turn you off it for life,
> depending:-).

Philip Greenspun's SQL for Web Nerds

http://philip.greenspun.com/sql/index.html

is pretty good (at least for non-SQL-whizzes like me) and has a
reasonable amount of info on performance tuning and optimization.

One thing I remember about Oracle is that prepared queries really
help. I don't know about other databases, but if you're not using
them, you probably should give them a try.
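
In DB-API terms that mostly means binding parameters instead of
pasting values into the SQL text, so the server can reuse one parsed
statement (named binds are Oracle-style; conn and T_A are invented):

    cur = conn.cursor()
    sql = "SELECT AVG(value) FROM T_A WHERE unitid = :uid"
    for uid in (1, 2, 3):
        cur.execute(sql, {"uid": uid})  # same SQL text, new bind value
        print(cur.fetchone()[0])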

Alex Martelli

unread,
Jul 9, 2002, 3:49:04 AM7/9/02
to
Paul Rubin wrote:

> Alex Martelli <al...@aleax.it> writes:
>> > I'm a little scared though about using SQL too extensivly.
>>
>> Joe Celko's "SQL for Smarties" might be just what you need to move you
>> to the next level of SQL usage (or, to turn you off it for life,
>> depending:-).
>
> Philip Greenspun's SQL for Web Nerds
>
> http://philip.greenspun.com/sql/index.html
>
> is pretty good (at least for non-SQL-whizzes like me) and has a
> reasonable amount of info on performance tuning and optimization.

Indeed, a fast scan seems to show that Philip's coverage is quite
different from Joe's -- Joe talks about SQL (lots of SQL, very
advanced SQL, SQL to do things you wouldn't think could be done
in SQL [and maybe in some cases shouldn't!-)]), while Philip does
a reasonably wide-ranging survey of Oracle issues, including the very
best piece of advice of them all -- *hire an expert*.

There's a think3 application that uses a RDBMS and supports many
different RDBMS brands. Towards the end of my long stay in think3,
most substantial installations of this application were being
deployed on either MS SQL Server, or Oracle. It seems the customers
using MS SQL Server were more or less "OK, not happy, but OK" with
our performance. OTOH, the customers using Oracle tended to fall
in two widely separated camps -- a group perfectly happy with our
performance, a group deeply dissatisfied with it. Funny -- why...?

Solution: MS SQL Server ain't all that tunable -- it gives you
reasonable performance if you program and administer it reasonably,
and that's about it... no superb performance, no horrors either.
Oracle, OTOH, _is_ very highly tunable, and *demands* such kind
of tuning -- if you give it proper care and feeding you get stellar
performance and scalability, otherwise it can truly be the pits.

The happy Oracle-using customers were large firms who had hired
(or home-grown) full-time experts at Oracle tuning, working full
time to keep the datases humming, including those used by our
application. The unhappy ones were mostly small and middling
firms who _thought_ they could just buy Oracle, perhaps pay for
a once-off installation / tuning / training for low-skilled or
overworked/wide-responsibility sysadm staff, and forget it.

Wrong. Very wrong. If you can afford Oracle, you can afford to
hire (or at least regularly retain a freelance) expert help to
tune it, keep it tuned, make it perform to its potential. If you
can't, you may be better off with cheaper stuff (not necessarily
MS SQL Server, which has the big minus of constraining you to
wimpy Windows servers, but, say, SAP/DB, nee Adabas -- free, and
there's a lot of expertise around for it from the times when it
was a costly, enterprise-grade commercial RDBMS).


Disclaimer: this is basically hearsay -- I couldn't tune an Oracle
DB myself any more than I could tune a fish -- AND it may be based
on now-obsolete versions of the products named (for all I _know_, the
Oracle you can buy today is self-tuning, and/or the MS SQL Server
you can buy today requires half a dozen full-time staff with degrees
in nuclear engineering -- I just find such hypotheses unlikely:-).

Alex

Thomas Jensen

unread,
Jul 9, 2002, 4:35:14 AM7/9/02
to
Cameron Laird wrote:
> In article <3D2A2586.5070302@ob_scure.dk>,
> Thomas Jensen <sp...@obscure.dk> wrote:
> .
>
>>The last 6 months I've been working almost exclusivly on a (commercial)
>>project heavily based on SOAP (not for performance objectives though :-).
>>That part really doesn't scare me :-)
>
> .
> Let me know if you want to publicize your work.
> I frequently write articles and profiles of SOAP
> projects for State-side magazines (and worldwide,
> when there's an opportunity).

Sounds interesting; it's coded in VB/C# though.
Please mail me privately if you're still interested.
(I've had some problems with my email, but I'm hoping to sort them out
today)

Andrew MacIntyre

unread,
Jul 11, 2002, 6:51:26 AM7/11/02
to
On Tue, 9 Jul 2002, Alex Martelli wrote:

> Wrong. Very wrong. If you can afford Oracle, you can afford to
> hire (or at least regularly retain a freelance) expert help to
> tune it, keep it tuned, make it perform to its potential.

{lots snipped}

The justification for Oracle is that there's a lot of people on tap for
problem solving.

The bit that bites, as you note, is that you _need_ those people on tap
for Oracle to be useful.

Having been involved with Ingres for a while, it was easy to see the
management types get itchy that there weren't that many Ingres people
about. The fact that the installation was pretty much self-tuning and
the Ingres specialists weren't needed very much was lost on the PHBs.

The real problem with Ingres was that CA hardly bothered to really make
much of it. That just made the Oracle decision a no-brainer for the PHBs.

--
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: and...@bullseye.apana.org.au | Snail: PO Box 370
and...@pcug.org.au | Belconnen ACT 2616
Web: http://www.andymac.org/ | Australia
