
String concatenation - which is the fastest way ?


przem...@poczta.fm

Aug 10, 2011, 7:17:54 AM
to pytho...@python.org
Hello,

I'd like to write a Python (2.6/2.7) script which connects to a database, fetches
hundreds of thousands of rows, concatenates them (basically: creates XML)
and then puts the result into another table. Do I have any choice
regarding string concatenation in Python from the performance point of view?
Since the number of rows is big I'd like to use the fastest possible library
(if there is any choice). Can you recommend something?

Regards
Przemyslaw Bak (przemol)


Chris Angelico

Aug 10, 2011, 8:32:06 AM
to pytho...@python.org
On Wed, Aug 10, 2011 at 12:17 PM, <przem...@poczta.fm> wrote:
> Hello,
>
> I'd like to write a python (2.6/2.7) script which connects to database, fetches
> hundreds of thousands of rows, concat them (basically: create XML)
> and then put the result into another table. Do I have any choice
> regarding string concatenation in Python from the performance point of view ?
> Since the number of rows is big I'd like to use the fastest possible library
> (if there is any choice). Can you recommend me something ?

First off, I have no idea why you would want to create an XML dump of
hundreds of thousands of rows, only to store it in another table.
However, if that is your intention, list joining is about as efficient
as you're going to get in Python:

lst=["asdf","qwer","zxcv"] # feel free to add 399,997 more list entries
xml="<foo>"+"</foo><foo>".join(lst)+"</foo>"

This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>' which
may or may not be what you're after.

ChrisA

przem...@poczta.fm

Aug 10, 2011, 9:31:46 AM
to pytho...@python.org

Chris,

since this process (XML building) is currently running inside the database (using native SQL commands)
and is a single-threaded task, it is quite slow. What I wanted to do is spawn several Python subprocesses in parallel,
each of which will concatenate a subset of the whole table (and then merge all of them at the end).
Basically:
- fetch all rows from the database (up to 1 million): what is the recommended data type?
- spawn X Python processes, each one:
   - concatenating its own subset
- merge the results from all the subprocesses

This task is running on a server which has many, but slow, cores and I am trying to divide this task
into many subtasks.

Regards
Przemyslaw Bak (przemol)



Steven D'Aprano

Aug 10, 2011, 10:36:59 AM
przem...@poczta.fm wrote:

> Hello,
>
> I'd like to write a python (2.6/2.7) script which connects to database,
> fetches hundreds of thousands of rows, concat them (basically: create XML)
> and then put the result into another table. Do I have any choice
> regarding string concatenation in Python from the performance point of
> view ? Since the number of rows is big I'd like to use the fastest
> possible library (if there is any choice). Can you recommend me something
> ?

For fast string concatenation, you should use the str.join() method:

substrings = ['a', 'bb', 'ccc', 'dddd']
body = ''.join(substrings)

Using string addition in a loop, like this:

# Don't do this!
body = ''
for sub in substrings:
    body += sub

risks being *extremely* slow for large numbers of substrings. (To be
technical, string addition can be O(N**2), while ''.join is O(N).) This
depends on many factors, including the operating system's memory
management, and the Python version and implementation, so repeated addition
may be fast on one machine and slow on another. Better to always use join,
which is consistently fast.

You should limit string addition to small numbers of substrings:

result = head + body + tail # This is okay.
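If you want to see the effect yourself, a rough (and deliberately unscientific) timing sketch along these lines will do; the string sizes and repeat counts are arbitrary, and the += loop may or may not look slow on your machine, for the reasons given above:

import timeit

setup = "substrings = ['x' * 10] * 100000"

loop_stmt = """
body = ''
for sub in substrings:
    body += sub
"""

join_stmt = "body = ''.join(substrings)"

# Lower numbers mean faster; expect ''.join to be consistently quick.
print("+= loop : %.3f s" % timeit.timeit(loop_stmt, setup=setup, number=10))
print("''.join : %.3f s" % timeit.timeit(join_stmt, setup=setup, number=10))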


--
Steven

Chris Angelico

Aug 10, 2011, 10:38:16 AM
to pytho...@python.org
On Wed, Aug 10, 2011 at 2:31 PM, <przem...@poczta.fm> wrote:
> - fetch all rows from the database (up to 1 million): what is recommended data type ?
> - spawn X python processes each one:
>    - concat its own subset
> - merge the result from all the subprocesses
>

What you're writing is, fundamentally, glue between your SQL engine
and your SQL engine. Look up what you get from your query and work
with that. Which SQL library are you suing?

Python may or may not be the best tool for this job.

Chris Angelico

Chris Angelico

Aug 10, 2011, 10:38:42 AM
to pytho...@python.org
On Wed, Aug 10, 2011 at 3:38 PM, Chris Angelico <ros...@gmail.com> wrote:
> Which SQL library are you suing?

And this is why I should proof-read BEFORE, not AFTER, sending.

Which SQL library are you *using*?

ChrisA

Stefan Behnel

Aug 10, 2011, 12:20:10 PM
to pytho...@python.org
przem...@poczta.fm, 10.08.2011 15:31:

> On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
>> On Wed, Aug 10, 2011 at 12:17 PM,<przem...@poczta.fm> wrote:
>>> I'd like to write a python (2.6/2.7) script which connects to database, fetches
>>> hundreds of thousands of rows, concat them (basically: create XML)
>>> and then put the result into another table. Do I have any choice
>>> regarding string concatenation in Python from the performance point of view ?
>>> Since the number of rows is big I'd like to use the fastest possible library
>>> (if there is any choice). Can you recommend me something ?
>>
>> First off, I have no idea why you would want to create an XML dump of
>> hundreds of thousands of rows, only to store it in another table.
>> However, if that is your intention, list joining is about as efficient
>> as you're going to get in Python:
>>
>> lst=["asdf","qwer","zxcv"] # feel free to add 399,997 more list entries
>> xml="<foo>"+"</foo><foo>".join(lst)+"</foo>"
>>
>> This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>' which
>> may or may not be what you're after.
>
> since this process (XML building) is running now inside database (using native SQL commands)
> and is one-thread task it is quite slow. What I wanted to do is to spawn several python subprocesses in parallel which
> will concat subset of the whole table (and then merge all of them at the end).
> Basically:
> - fetch all rows from the database (up to 1 million): what is recommended data type ?
> - spawn X python processes each one:
> - concat its own subset
> - merge the result from all the subprocesses
>
> This task is running on a server which has many but slow cores and I am trying to divide this task
> into many subtasks.

Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL)
come with built-in Python support.

You still didn't provide enough information to make me understand why you
need XML in between one database and another (or the same?), but if you go
that route, you can just read data through multiple connections in multiple
threads (or processes), have each build up one (or more) XML entries, and
then push those into a queue. Then another thread (or more than one) can
read from that queue and write the XML items into a file (or another
database) as they come in.

If your data has a considerable size, I wouldn't use string concatenation
or joining at all (note that it requires 2x the memory during
concatenation), but rather write it into a file, or even just process the
data on the fly, i.e. write it back into the target table right away.
Reading a file back in after the fact is much more resource friendly than
keeping huge amounts of data in memory. And disk speed is usually not a
problem when streaming data from disk into a database.

It may also be worth considering writing out literal SQL INSERT statements
instead of XML. Most databases have a decent bulk upload tool.
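A bare-bones sketch of the queue pattern described above, with invented names (worker, writer, the <foo> wrapping). It uses threads and the Python 2 Queue module for brevity; genuinely CPU-bound work would want multiprocessing and its Queue instead. Note it writes fragments in whatever order they arrive, and ordering is dealt with later in the thread:

import threading
import Queue  # named 'queue' in Python 3

def worker(rows, out_queue):
    # Each worker turns its share of the rows into XML fragments.
    for row in rows:
        out_queue.put("<foo>%s</foo>" % row)
    out_queue.put(None)  # sentinel: this worker is finished

def writer(out_queue, n_workers, outfile):
    # Single consumer: drain the queue until every worker has signalled.
    finished = 0
    while finished < n_workers:
        item = out_queue.get()
        if item is None:
            finished += 1
        else:
            outfile.write(item)

if __name__ == "__main__":
    data = ["row1", "row2", "row3", "row4"]  # stand-in for rows from the DB
    q = Queue.Queue()
    workers = [threading.Thread(target=worker, args=(data[i::2], q))
               for i in range(2)]
    for t in workers:
        t.start()
    with open("out.xml", "w") as f:
        writer(q, len(workers), f)
    for t in workers:
        t.join()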

Stefan

przem...@poczta.fm

Aug 11, 2011, 2:40:30 AM
to pytho...@python.org

The data are in Oracle so I have to use cx_Oracle.

> You still didn't provide enough information to make me understand why you
> need XML in between one database and another (or the same?), but if you
> go that route, you can just read data through multiple connections in
> multiple threads (or processes), have each build up one (or more) XML
> entries, and then push those into a queue. Then another thread (or more
> than one) can read from that queue and write the XML items into a file
> (or another database) as they come in.

I am not a database developer so I don't want to change the whole process
of data flow between applications in my company. Another process is
reading this XML from a particular Oracle table, so I have to put the final XML there.

> If your data has a considerable size, I wouldn't use string concatenation
> or joining at all (note that it requires 2x the memory during
> concatenation), but rather write it into a file, or even just process the
> data on the fly, i.e. write it back into the target table right away.
> Reading a file back in after the fact is much more resource friendly than
> keeping huge amounts of data in memory. And disk speed is usually not a
> problem when streaming data from disk into a database.

This server has 256 GB of RAM so memory is not a problem.
Also the select which fetches the data is sorted. That is why I have to
carefully divide it into subtasks and then merge them in the correct order.

Regards
Przemyslaw Bak (przemol)


przem...@poczta.fm

Aug 11, 2011, 2:30:08 AM
to pytho...@python.org

cx_oracle

Regards
Przemyslaw Bak (przemol)


Chris Angelico

Aug 11, 2011, 6:59:31 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 7:40 AM, <przem...@poczta.fm> wrote:
> I am not a database developer so I don't want to change the whole process
> of data flow between applications in my company. Another process is
> reading this XML from particular Oracle table so I have to put the final XML there.

I think you may be looking at a submission to
http://www.thedailywtf.com/ soon. You seem to be working in a rather
weird dataflow. :( Under the circumstances, you're probably going to
want to go with the original ''.join() option.

> This server has 256 GB of RAM so memory is not a problem.
> Also the select which fetches the data is sorted. That is why I have to
> carefully divide into subtasks and then merge it in correct order.

There's no guarantee that all of that 256GB is available to you, of course.

What may be the easiest way is to do the select in a single process,
then partition it and use the Python multiprocessing module to split
the job into several parts. Then you need only concatenate the handful
of strings.
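A minimal sketch of that approach, assuming the select has already produced a list of strings; make_xml, the chunk size and the process count are all placeholders, not anything from the actual application:

from multiprocessing import Pool

def make_xml(chunk):
    # Hypothetical per-chunk worker: wrap each row of the subset in <foo> tags.
    return "<foo>" + "</foo><foo>".join(chunk) + "</foo>"

if __name__ == "__main__":
    rows = ["row%d" % i for i in range(100000)]  # stand-in for the fetched rows

    chunk_size = 10000
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

    pool = Pool(processes=8)            # roughly one per (slow) core
    parts = pool.map(make_xml, chunks)  # map() returns results in chunk order
    pool.close()
    pool.join()

    xml = "".join(parts)                # only a handful of big strings left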

You'll need to do some serious profiling, though, to ascertain where
the bottleneck really is. Is it actually slow doing the concatenation,
or is it taking more time reading/writing the disk? Is it actually all
just taking time due to RAM usage? Proper string concatenation doesn't
need a huge amount of CPU.
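For the profiling itself, something as simple as cProfile can answer the CPU question; build_everything below is a made-up stand-in for the real work:

import cProfile

def build_everything():
    # Toy workload: build the rows and join them, so the profile shows
    # how much time the concatenation itself actually costs.
    rows = ["row%d" % i for i in range(100000)]
    return "<foo>" + "</foo><foo>".join(rows) + "</foo>"

cProfile.run("build_everything()", sort="cumulative")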

ChrisA

przem...@poczta.fm

Aug 11, 2011, 7:52:55 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 11:59:31AM +0100, Chris Angelico wrote:
> On Thu, Aug 11, 2011 at 7:40 AM, <przem...@poczta.fm> wrote:
> > I am not a database developer so I don't want to change the whole process
> > of data flow between applications in my company. Another process is
> > reading this XML from particular Oracle table so I have to put the final XML there.
>
> I think you may be looking at a submission to
> http://www.thedailywtf.com/ soon. You seem to be working in a rather
> weird dataflow. :( Under the circumstances, you're probably going to
> want to go with the original ''.join() option.
>
> > This server has 256 GB of RAM so memory is not a problem.
> > Also the select which fetches the data is sorted. That is why I have to
> > carefully divide into subtasks and then merge it in correct order.
>
> There's no guarantee that all of that 256GB is available to you, of course.

I am the admin of this server - the memory is available for us :-)

> What may be the easiest way is to do the select in a single process,
> then partition it and use the Python multiprocessing module to split
> the job into several parts. Then you need only concatenate the handful
> of strings.

This is the way I am going to use.

> You'll need to do some serious profiling, though, to ascertain where
> the bottleneck really is. Is it actually slow doing the concatenation,
> or is it taking more time reading/writing the disk? Is it actually all
> just taking time due to RAM usage? Proper string concatenation doesn't
> need a huge amount of CPU.

I did my homework :-) - the CPU working on concatenation is a bottleneck.


Regards
Przemyslaw Bak (przemol)



Chris Angelico

Aug 11, 2011, 7:58:36 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 12:52 PM, <przem...@poczta.fm> wrote:
> On Thu, Aug 11, 2011 at 11:59:31AM +0100, Chris Angelico wrote:
>> There's no guarantee that all of that 256GB is available to you, of course.
>
> I am the admin of this server - the memory is available for us :-)

Hehe. I mean to any particular application; obviously you're way WAY
beyond the 4GB limit of a 32-bit app, though there may be other
limits; the main consideration here is contention - whatever Oracle
itself is using, your concatenation app can't.

>> You'll need to do some serious profiling, though, to ascertain where
>> the bottleneck really is. Is it actually slow doing the concatenation,
>> or is it taking more time reading/writing the disk? Is it actually all
>> just taking time due to RAM usage? Proper string concatenation doesn't
>> need a huge amount of CPU.
>
> I did my homework :-) - the CPU working on concatenation is a bottleneck.

Glad you have! So many people don't. Looks like you have a plan there, then!

ChrisA

Stefan Behnel

Aug 11, 2011, 7:59:50 AM
to pytho...@python.org
Chris Angelico, 11.08.2011 12:59:

> On Thu, Aug 11, 2011 at 7:40 AM,<przem...@poczta.fm> wrote:
>> I am not a database developer so I don't want to change the whole process
>> of data flow between applications in my company. Another process is
>> reading this XML from particular Oracle table so I have to put the final XML there.
>
> I think you may be looking at a submission to
> http://www.thedailywtf.com/ soon. You seem to be working in a rather
> weird dataflow. :( Under the circumstances, you're probably going to
> want to go with the original ''.join() option.
>
>> This server has 256 GB of RAM so memory is not a problem.
>> Also the select which fetches the data is sorted. That is why I have to
>> carefully divide into subtasks and then merge it in correct order.
>
> There's no guarantee that all of that 256GB is available to you, of course.
>
> What may be the easiest way is to do the select in a single process,
> then partition it and use the Python multiprocessing module to split
> the job into several parts. Then you need only concatenate the handful
> of strings.

Or join them using an n-way merge.

Stefan

przem...@poczta.fm

Aug 11, 2011, 9:46:13 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 11:59:31AM +0100, Chris Angelico wrote:
>
> What may be the easiest way is to do the select in a single process,
> then partition it and use the Python multiprocessing module to split
> the job into several parts. Then you need only concatenate the handful
> of strings.

This is the way I am going to use.
But what is the best data type to hold so many rows and then operate on them ?

Regards
Przemyslaw Bak (przemol)


Chris Angelico

Aug 11, 2011, 9:48:43 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 2:46 PM, <przem...@poczta.fm> wrote:
> This is the way I am going to use.
> But what is the best data type to hold so many rows and then operate on them ?
>

List of strings. Take it straight from your Oracle interface and work
with it directly.
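A hypothetical sketch of that step with cx_Oracle (standard DB-API calls); the connection string, table and column names are placeholders:

import cx_Oracle

conn = cx_Oracle.connect("user/password@dbname")
cur = conn.cursor()
cur.execute("SELECT some_column FROM some_table ORDER BY some_key")

# fetchall() returns a list of tuples; pull out the single column so the
# rest of the pipeline works with a plain list of strings.
rows = [r[0] for r in cur.fetchall()]

cur.close()
conn.close()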

ChrisA

SigmundV

Aug 11, 2011, 5:38:32 PM
When I saw the headline I thought "oh no, not string concatenation
again... we have had scores of these threads before...", but this is a
rather interesting problem. The OP says he's not a database
developer, but why is he then fiddling with internal database
operations? Wouldn't it be better to go back to the database
developers and have them look into parallel processing? I'm sure that
Oracle databases can do parallel processing by now...

Sigmund

przem...@poczta.fm

Aug 12, 2011, 3:10:42 AM
to pytho...@python.org

:-)
Good question, but let me try to explain what motivates me to do it.
The first reason (I think the most important :-) ) is that I want to learn
something new - I am new to Python (I am a unix/storage sysadmin, but with a programming
background, so Python was a natural choice for more complicated
sysadmin tasks).
Another reason is that our server (and I am responsible for it) has
many, many but slow cores (as I had written before). It means that
parallelization of operations is obvious - the developer is not keen
to spend much time on it (she is busy) - and for me this is something new
(among some boring daily tasks ... ;-) ) and fresh :-)
Another intention is to get some more knowledge about parallelization:
how to divide a task into subtasks, what the optimal way to do it is, etc.
And the last reason is that I love performance tuning :-)

Regards
Przemyslaw Bak (przemol)


przem...@poczta.fm

Aug 12, 2011, 3:50:29 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 02:48:43PM +0100, Chris Angelico wrote:
> On Thu, Aug 11, 2011 at 2:46 PM, <przem...@poczta.fm> wrote:
> > This is the way I am going to use.
> > But what is the best data type to hold so many rows and then operate on them ?
> >
>
> List of strings. [...]

Let's assume I have the whole list in memory:
Can I use this list in the following way ?
subprocess_1 - run on list between 1 and 10000
subprocess_2 - run on list between 10001 and 20000
subprocess_3 - run on list between 20001 and 30000
etc
...
Can I use some sort of indexing on this list?


Regards
Przemyslaw Bak (przemol)


przem...@poczta.fm

Aug 11, 2011, 10:39:23 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 02:48:43PM +0100, Chris Angelico wrote:
> On Thu, Aug 11, 2011 at 2:46 PM, <przem...@poczta.fm> wrote:
> > This is the way I am going to use.
> > But what is the best data type to hold so many rows and then operate on them ?
> >
>
> List of strings. Take it straight from your Oracle interface and work
> with it directly.

Can I use this list in the following way ?


subprocess_1 - run on list between 1 and 10000
subprocess_2 - run on list between 10001 and 20000
subprocess_3 - run on list between 20001 and 30000
etc
...

Sort of indexing ?

Regards
Przemyslaw Bak (przemol)


Chris Angelico

Aug 12, 2011, 5:20:41 AM
to pytho...@python.org
On Thu, Aug 11, 2011 at 3:39 PM, <przem...@poczta.fm> wrote:
> On Thu, Aug 11, 2011 at 02:48:43PM +0100, Chris Angelico wrote:
>> List of strings. Take it straight from your Oracle interface and work
>> with it directly.
>
> Can I use this list in the following way ?
> subprocess_1 - run on list between 1 and 10000
> subprocess_2 - run on list between 10001 and 20000
> subprocess_3 - run on list between 20001 and 30000
> etc
> ...

Yep! You use list slicing. Working with smaller numbers for an example:

>>> ltrs=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> ltrs[:10]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
>>> ltrs[10:20]
['k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't']
>>> ltrs[20:]
['u', 'v', 'w', 'x', 'y', 'z']

(I actually created that list as "list(string.ascii_lowercase)" for
what that's worth.)

Slice operations are quite efficient.
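If you would rather not hard-code the slice boundaries, a tiny helper like this (the name split_into is made up) produces n roughly equal slices:

def split_into(lst, n):
    # Ceiling division so no elements are left over.
    step = (len(lst) + n - 1) // n
    return [lst[i:i + step] for i in range(0, len(lst), step)]

chunks = split_into(list("abcdefghijklmnopqrstuvwxyz"), 3)
# -> [['a', ..., 'i'], ['j', ..., 'r'], ['s', ..., 'z']]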

Chris Angelico

Stefan Behnel

Aug 12, 2011, 5:25:06 AM
to pytho...@python.org
przem...@poczta.fm, 11.08.2011 16:39:

> On Thu, Aug 11, 2011 at 02:48:43PM +0100, Chris Angelico wrote:
>> On Thu, Aug 11, 2011 at 2:46 PM,<przem...@poczta.fm> wrote:
>>> This is the way I am going to use.
>>> But what is the best data type to hold so many rows and then operate on them ?
>>>
>>
>> List of strings. Take it straight from your Oracle interface and work
>> with it directly.
>
> Can I use this list in the following way ?
> subprocess_1 - run on list between 1 and 10000
> subprocess_2 - run on list between 10001 and 20000
> subprocess_3 - run on list between 20001 and 30000
> etc
> ...

Sure. Just read the data as it comes in from the database and fill up a
chunk, then hand that on to a process. You can also distribute it in
smaller packets, just check what size gives the best throughput.

Still, I'd give each work parcel a number and then reorder the results
while collecting them; that allows you to vary the chunk size and the
number of processes independently, without having to wait for a process that
happens to take longer.
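A toy illustration of the numbering idea, reusing the same made-up per-chunk worker shape as earlier in the thread; imap_unordered lets workers finish in any order, while the attached index restores the original order at the end:

from multiprocessing import Pool

def make_xml_numbered(numbered_chunk):
    # Keep the chunk's position so the pieces can be re-ordered afterwards.
    index, chunk = numbered_chunk
    return index, "<foo>" + "</foo><foo>".join(chunk) + "</foo>"

if __name__ == "__main__":
    chunks = [["a", "b"], ["c", "d"], ["e", "f"]]  # toy data
    pool = Pool(processes=3)
    # Results come back as soon as any worker finishes, in arbitrary order...
    results = pool.imap_unordered(make_xml_numbered, enumerate(chunks))
    # ...and sorting on the index puts them back in the original order.
    parts = [xml for _, xml in sorted(results)]
    pool.close()
    pool.join()
    final = "".join(parts)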

Stefan

SigmundV

Aug 12, 2011, 12:09:42 PM
On Aug 12, 8:10 am, przemol...@poczta.fm wrote:
> Good question but I try to explain what motivates me to do it.
> First reason (I think the most important :-) ) is that I want to learn
> something new - I am new to python (I am unix/storage sysadmin but with programming
> background so python was a natural choice for more complicated
> sysadmin tasks).
> Another reason is that our server (and I am responsible for it) has
> many, many but slow cores (as I had written before). It means that
> parallelization of operations is obvious - the developer is not keen
> to spent much time on it (she is busy) - and for me this is something new
> (among some boring daily tasks ... ;-) ) and fresh :-)
> Another intention is to get some more knowledge about parallelization:
> how to divide some task into subtasks, what is the most optimal way to do it, etc
> And the last reason is that I love performance tuning :-)

When you put it this way I better understand what you're after and why
you do this. And I agree with all your points. Learning something new
is a noble cause in itself. :)

Sigmund

przem...@poczta.fm

Aug 16, 2011, 5:38:00 AM
to pytho...@python.org
On Fri, Aug 12, 2011 at 11:25:06AM +0200, Stefan Behnel wrote:
> przem...@poczta.fm, 11.08.2011 16:39:
>> On Thu, Aug 11, 2011 at 02:48:43PM +0100, Chris Angelico wrote:
>>> On Thu, Aug 11, 2011 at 2:46 PM,<przem...@poczta.fm> wrote:
>>>> This is the way I am going to use.
>>>> But what is the best data type to hold so many rows and then operate on them ?
>>>>
>>>
>>> List of strings. Take it straight from your Oracle interface and work
>>> with it directly.
>>
>> Can I use this list in the following way ?
>> subprocess_1 - run on list between 1 and 10000
>> subprocess_2 - run on list between 10001 and 20000
>> subprocess_3 - run on list between 20001 and 30000
>> etc
>> ...
>
> Sure. Just read the data as it comes in from the database and fill up a
> chunk, then hand that on to a process. You can also distribute it in
> smaller packets, just check what size gives the best throughput.

Since the performance is critical I wanted to use the multiprocessing module.
But when I get all the source rows into one list of strings, can I easily
share it between X processes?


Regards
Przemyslaw Bak (przemol)

