Having some initial problems scripting a solution. Does anyone know how
to do this properly? The standard web searches only return commercial
solutions.
For instance, on a 4-way AIX box, taking in several hundred foo files and
getting bar as a result:
mkfifo sm.1 sm.2 sm.3
sort -m foo10 foo20 ... >sm.1 &
sort -m foo30 foo40 ... >sm.2 &
sort -m foo50 foo60 ... >sm.3 &
sort -m sm.1 sm.2 sm.3 >bar
Looking for the best approach to parallelizing sort -m given an arbitrary
number of CPUs and an arbitrary number of input files. What's the optimum
number of named pipes? The optimum number of files per sort -m? Any
suggestions or known solutions before I try to script something resembling
the above, with the degree of parallelism and the input files passed as
arguments?
--
JQ
http://xepoch.com
Do yourself a favour and get your sysadmins to install the GNU version of
sort.
Our guys installed it, but named the binary "gsort" instead of "sort" so
as not to accidentally upset any other scripts that use sort.
We were planning on "going live" in two steps, but GNU sort proved so
fast that we went in one step.
In fact the first time we ran it, we couldn't believe that it had worked!
This saved our project a hell of a lot of money: they were planning on
buying a commercial sort for £25k! Maybe I should try and get them to
donate some money to the FSF!
This doesn't answer your parallel question, but it's worth knowing :-)
I'm not sure that merging files A & B into X separately from C & D into Y
and then going on to merge X and Y would be very efficient in disk
reads/writes. This will probably outweigh *any* amount of CPU usage, as you
are effectively reading the same data twice instead of once.
The best way of speeding it up is to stick the source files on one set of
striped disks and put the dest file on another set of striped disks.
Hope this helps!
Matt
Aye, but the subsequent reads could be cached. Placing the files in
some tree-like structure and having parent sort -m's actually does speed
things along considerably, as long as you have an SMP box. A sort -m on
named pipes (fed from `child' sort -m's) is CPU-bound rather than IO-bound,
whereas the sort -m's reading the data files do a lot of IO.
It may not be efficient IO-wise, but total wall clock time is less.
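For what it's worth, the tree layout described above could be scripted
roughly as follows. This is only a sketch under assumptions: parallel_merge
and merge_group are made-up names, mktemp -d is assumed to be available, the
input files are assumed to be already sorted in the current locale, and
dealing the files out round-robin is just one simple way to split them, not
a tuned choice.

```shell
#!/bin/sh
# Hypothetical helpers, not from any existing tool.

# merge_group G N FIFO FILE...
# Child merge: keep every N-th file starting at index G, write to FIFO.
merge_group() {
    g=$1 n=$2 fifo=$3
    shift 3
    count=$#
    i=0
    # The list of "$@" is expanded once, so appending inside the loop is safe.
    for f in "$@"; do
        if [ $((i % n)) -eq "$g" ]; then
            set -- "$@" "$f"
        fi
        i=$((i + 1))
    done
    shift "$count"              # drop the original list, keep this group's files
    sort -m "$@" >"$fifo"
}

# parallel_merge N OUTPUT FILE...
# N child `sort -m` processes feed one parent `sort -m` through named pipes.
parallel_merge() {
    pm_n=$1 pm_out=$2
    shift 2
    if [ "$#" -lt 1 ] || [ "$pm_n" -lt 1 ]; then
        echo "usage: parallel_merge N OUTPUT FILE..." >&2
        return 2
    fi
    # Never start more children than there are input files.
    if [ "$pm_n" -gt "$#" ]; then pm_n=$#; fi

    pm_dir=$(mktemp -d) || return 1   # assumption: mktemp -d exists here
    pm_pids=
    pm_g=0
    while [ "$pm_g" -lt "$pm_n" ]; do
        mkfifo "$pm_dir/sm.$pm_g" || return 1
        merge_group "$pm_g" "$pm_n" "$pm_dir/sm.$pm_g" "$@" &
        pm_pids="$pm_pids $!"
        pm_g=$((pm_g + 1))
    done

    # Parent merge reads only from the named pipes.
    sort -m "$pm_dir"/sm.* >"$pm_out"
    pm_status=$?
    # If the parent failed, children may still be blocked on their pipes.
    if [ "$pm_status" -ne 0 ]; then kill $pm_pids 2>/dev/null; fi
    wait
    rm -rf "$pm_dir"
    return "$pm_status"
}
```

On the 4-way box from the original example, something like
parallel_merge 3 bar foo* would give three child merges plus the parent, the
same shape as the hand-written version. Whether three is the optimum number
of pipes is exactly the open question, so treat the first argument as
something to measure rather than a recommendation.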
--
JQ
http://xepoch.com
I'm really surprised that a merge takes up much CPU!?
Me too, but yep, it does. Again, though, only on those being fed from
named pipes, not those reading from auxiliary storage.
--
JQ
http://xepoch.com