Parallel sort -m ?

JQ

unread,

Aug 9, 2004, 12:08:48 AM8/9/04

to

I have many hundred very large pre-sorted files that I want to sort
merge (sort -m) *in parallel*. Performance is obviously much better
having multiple sort merge processes feeding each other on SMP machines.

Having some initial problems with scripting a solution. Does anyone
have any knowledge of how to do this properly, the standard web searches
only return commercial solutions.

For instance on a 4-way AIX box taking in several hundred foo files, and
getting bar as a result:

mkfifo sm.1 sm.2 sm.3
sort -m foo10 foo20 ... >sm.1 &
sort -m foo30 foo40 ... >sm.2 &
sort -m foo50 foo60 ... >sm.3 &
sort -m sm.1 sm.2 sm.3 >bar

Looking for the best approach to parallel out sort -m given arbitrary #
of CPUs and arbitrary # of input files. What's the optimum # of named
pipes? optimum # of files per sort -m? Any suggestions or known
solutions prior to me trying to script some kind of semblance to the
above with argumented degree of parallelism and input files?

--
JQ
http://xepoch.com

Matt C

unread,

Aug 9, 2004, 1:59:12 PM8/9/04

to

"JQ" <f...@bar.com> wrote in message
news:10hdu6h...@corp.supernews.com...

> I have many hundred very large pre-sorted files that I want to sort
> merge (sort -m) *in parallel*. Performance is obviously much better
> having multiple sort merge processes feeding each other on SMP machines.

.
. <snip>
.
> --
> JQ
> http://xepoch.com

Do youself a favour and get your sysadmins to install the gnu version of
sort.
Our guys installed it, but called the binary as "gsort" instead of "sort" so
as to not accidentally
upset any other scripts that use sort.

We were planning on "going live" in two steps, but the gnu sort proved so
fast that we went in one step.
In fact the first time we ran it, we couldn't believe that it had worked!

This saved our project a hell of a lot of money, they were planning on
buying a commercial sort for £25k! Maybe I should try and get them to
donate some money to the FSF!

This doesn't answer your parallel question, but it's worth knowing :-)

I'm not sure that merging files A & B into X separately from C & D into Y
then going onto merging X and Y would be very efficient in disk
reads/writes. This will probably outway *any* amount of CPU usuage as you
are effectively reading the same files twice instead of once.

The best way of speeding it up is to stick the source files on 1 set of
striped disk and put the dest file onto another set of striped disks.

Hope this helps!
Matt

JQ

unread,

Aug 10, 2004, 12:09:23 AM8/10/04

to

Matt C wrote:
> I'm not sure that merging files A & B into X separately from C & D into Y
> then going onto merging X and Y would be very efficient in disk
> reads/writes. This will probably outway *any* amount of CPU usuage as you
> are effectively reading the same files twice instead of once.

Aye, but the subsequent reads could be cached. Placing the files in
some tree-like structure and having parent sort -m's actually does speed
things along considerably as long as you have an SMP box. A sort -m on
named pipes (from `child' sort -m's) is CPU-bound, not IO, vs. a lot of
IO on the data file sort -m's.

It may not be efficient IO wise, but total wall clock time is less.

--
JQ
http://xepoch.com

Matt C

unread,

Aug 11, 2004, 4:49:26 PM8/11/04

to

"JQ" <f...@bar.com> wrote in message

news:10hgijk...@corp.supernews.com...

I'm really surprised that a merge takes up much CPU!?

JQ

unread,

Aug 15, 2004, 11:30:00 PM8/15/04

to

Matt C wrote:
> I'm really surprised that a merge takes up much CPU!?

Me too, but yep it does. Again, though, only on those being fed from
named pipes, not those reading IO from auxillary storage.

--
JQ
http://xepoch.com