Splitting a file

ExecMan

Apr 13, 2012, 4:08:46 PM

Hi,

Ok, I have a large file and I want to split it into X number of
files. The split command allows me to specify the number of lines
per file, but not number of files.

So if my file say contains 10 records and I want to split it into 3
files, one of them will contain 4 records........

Is there a simple way to do this?

Thanks!

The Natural Philosopher

Apr 13, 2012, 4:19:02 PM

gcc?

:-) :-)

--
To people who know nothing, anything is possible.
To people who know too much, it is a sad fact
that they know how little is really possible -
and how hard it is to achieve it.

Janis Papanagnou

Apr 13, 2012, 4:24:29 PM

Only if you have information about the files in advance, or if you
can make two passes over the file, or if you alternately fill the
target files (lines 1, 4, 7 in file1, lines 2, 5, 8 in file2, etc.).

The latter can be done, e.g., by

awk '{print > (FILENAME ((NR-1)%3+1))}' inputfile

and you get as result files inputfile1, inputfile2, and inputfile3,
with the respective lines.

Janis


unruh

Apr 13, 2012, 5:22:03 PM

> On 13.04.2012 22:08, ExecMan wrote:
>> Hi,
>>
>> Ok, I have a large file and I want to split it into X number of
>> files. The split command allows me to specify the number of lines
>> per file, but not number of files.
>>
>> So if my file say contains 10 records and I want to split it into 3
>> files, one of them will contain 4 records........
>>
>> Is there a simple way to do this?

Sure. Try it. If you have too many files, raise the size limit; if too
few, lower it. But if you have no limit on the file size, why not just
leave everything in one file?

Wayne C. Morris

Apr 13, 2012, 6:34:40 PM

In article <dccb345d-c572-49bc...@p6g2000yqi.googlegroups.com>,

Yes, if each line is one whole record. Use wc to count the number of lines in
the input file, divide by X, and round up. The result is the number of lines
per output file.
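
A minimal sketch of that recipe, assuming every record is exactly one line and the input is not empty. The function name split_into_n and the default prefix are made up for illustration and are not from the thread:

```shell
#!/bin/sh
# split_into_n FILE N [PREFIX]: hypothetical helper, not from the thread.
# Splits FILE into at most N pieces of ceil(lines / N) lines each.
split_into_n() {
    file=$1 n=$2 prefix=${3:-x}
    lines=$(wc -l < "$file")                # total number of lines
    per_file=$(( (lines + n - 1) / n ))     # integer division, rounded up
    split -l "$per_file" "$file" "$prefix"  # pieces are PREFIXaa, PREFIXab, ...
}

# Example call (assumed file name): split_into_n input.txt 3 part_
```

For a 10-line file and N=3 this gives 4 lines per piece, so the pieces hold 4, 4 and 2 lines. Rounding up can also yield fewer than N pieces when N is large relative to the line count.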

Eli the Bearded

Apr 13, 2012, 6:58:54 PM

#/bin/sh
# splitfile.sh: a wrapper around split(1)
# warning: relies on user's PATH
# Tested under NetBSD, Linux, and Solaris
# Eli the Bearded 13 April 2012

file=
num=10
out=
lines=

usage () {
    echo "$0: a wrapper around split(1) -- usage"
    echo " splitfile [ options ]"
    echo ""
    echo "Options:"
    echo " -h or --help print usage and exit"
    echo " -f FILE input file to split (required)"
    echo " -n NUMBER number of output files (default 10)"
    echo " -o OUTPUT output file prefix"
    echo ""
    echo "splitfile will exit with 2 if split(1) is not run or the"
    echo "exit value of split if it was run."
}

while [ "X$1" != X ] ; do
    case "X$1" in
        X-h|X--help) usage; exit 2 ;;
        X-f) shift; file="$1" ;;
        X-n) shift; num="$1" ;;
        X-o) shift; out="$1" ;;
        X*) echo "$0: unrecognized option $1; use -h for help"
            exit 2 ;;
    esac
    shift
done

case "X$file" in
    X) echo "$0: need input file"
       exit 2 ;;
esac

case "X$num" in
    # solaris /bin/sh wouldn't like something like "X[^1-9]"
    X[1-9]) : okay ;;
    X*) echo "$0: need number of files"
        exit 2 ;;
esac

case "X$out" in
    X) : no value okay ;;
esac

wcout=`wc -l < "$file"` # has whitespace
lines=`expr $wcout + 0` # fix whitespace for solaris expr

case "X$lines" in
    X[1-9]*) : okay ;;
    X*) echo "$0: unexpected output from wc"
        exit 2 ;;
esac

split=`expr "$lines" / "$num"`

# deal with expr rounding the wrong way
total=`expr $split \* $num`
if expr "$total" \< "$lines" > /dev/null ; then
    split=`expr $split + 1`
fi

split -$split "$file" "$out"
exit $?

Elijah
------
most of the script, like so many, is just error checking

Wayne C. Morris

Apr 14, 2012, 3:10:37 PM

In article <eli$12041...@qz.little-neck.ny.us>,
Eli the Bearded <*@eli.users.panix.com> wrote:

> #/bin/sh

Should be:

#!/bin/sh

> case "X$file" in
> X) echo "$0: need input file"
> exit 2 ;;
> esac

As long as you're doing error checking, you should also make sure the file
exists, is a regular file, and is readable:

if [ ! -f "$file" ] || [ ! -r "$file" ] ; then
    echo "$0: input file is not a regular file or is not readable"
    exit 2
fi

> case "X$num" in
> # solaris /bin/sh wouldn't like something like "X[^1-9]"
> X[1-9]) : okay ;;
> X*) echo "$0: need number of files"
> exit 2 ;;
> esac

That won't allow more than 9 files. Untested on Solaris, but this should work:

case "X$num" in
    X*[!0-9]*) echo "$0: need number of files"
        exit 2 ;;
    # test for all zeroes
    X*[1-9]*) : okay ;;
    X*) echo "$0: number of files cannot be zero"
        exit 2 ;;
esac

> wcout=`wc -l < "$file"` # has whitespace
> lines=`expr $wcout + 0` # fix whitespace for solaris expr
>
> case "X$lines" in
> X[1-9]*) : okay ;;
> X*) echo "$0: unexpected output from wc"
> exit 2 ;;
> esac

That tests the output from expr, not wc. I'm not sure a test is necessary
anyway; is there any situation where wc could give unexpected output if $file is
a regular file and can be read? If not, I'd just write:

lines=`wc -l < "$file"`

> split=`expr "$lines" / "$num"`
>
> # deal with expr rounding the wrong way
> total=`expr $split \* $num`
> if expr "$total" \< "$lines" > /dev/null ; then
> split=`expr $split + 1`
> fi

Here's a simpler way to divide and round up:

split=`expr \( $lines + $num - 1 \) / $num`

ExecMan

Apr 14, 2012, 4:25:21 PM

On Apr 14, 2:10 pm, "Wayne C. Morris" <wayne.mor...@this.is.invalid>
wrote:
> In article <eli$1204131...@qz.little-neck.ny.us>,

I found this line on a website and it works:

awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)print a[i] > "tmp_" 1+int(n*((i-1)/NR))}' n=$msgs $pfpfile

But I do not understand what it is doing. I see NR in there; what is
that?

unruh

Apr 14, 2012, 5:57:39 PM

On 2012-04-14, ExecMan <artm...@yahoo.com> wrote:
> I found this line on a website and it works:
>
> awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)print a[i] > "tmp_" 1+int(n*((i-1)/NR))}' n=$msgs $pfpfile
>
> But I do not understand what it is doing. I see NR in there; what is
> that?

NR is the number of records read so far (i.e. the current line number);
in the END block it holds the total number of lines. The script sets up
an associative array within awk where each line is one member of the
array. Those lines are then printed to the files tmp_N, where N is the
number of the file. (awk concatenates "tmp_" and the number directly, so
there is no blank in the name.)
That is, this is extremely inefficient in space, and could well croak on
a long file, as the whole file is copied into memory.
Note that all of these will fail pretty miserably if one of the lines is,
say, 99% of the length of the file, and the other 500 lines are each two
characters long.


ExecMan

Apr 14, 2012, 9:34:35 PM

On Apr 14, 4:57 pm, unruh <un...@invalid.ca> wrote:

The average length of the line is very short, actually just an email
address. The file itself is about 800,000 lines long. So, average
line length is like 50 characters.....

unruh

Apr 15, 2012, 12:50:18 AM

On 2012-04-15, ExecMan <artm...@yahoo.com> wrote:

> The average length of the line is very short, actually just an email
> address. The file itself is about 800,000 lines long. So, average
> line length is like 50 characters.....

So the file is about 40 MB long (800,000 lines at roughly 50 characters
each). That would probably fit in memory.


ExecMan

Apr 16, 2012, 12:59:50 PM

On Apr 14, 11:50 pm, unruh <un...@invalid.ca> wrote:

I'm just trying to understand the 'logic' on how the statement works:

awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)print a[i] > "tmp_" 1+int(n*((i-1)/NR))}' n=$msgs $pfpfile

Say $msgs is 3.
Say $pfpfile has 10 records in it.
So, the array 'a' gets 10 records.

It simply spreads the records into N number of files. I'm curious on
the logic for that:

1+int(n*((i-1)/NR))

Record 1: 1 + int(3*((1-1)/10)) = ?
Record 2: 1 + int(3*((2-1)/10)) = ?
Record 3: 1 + int(3*((3-1)/10)) = ?

I noticed that the first 3 records were in file 1, the next 3 in file
2 and the final 4 in file 3. How did it do that?

Janis Papanagnou

Apr 16, 2012, 2:48:27 PM

On 16.04.2012 18:59, ExecMan wrote:
>>[...]

>
> I'm just trying to understand the 'logic' on how the statement works:
>
>
> awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)print a[i] > "tmp_" 1+int(n*((i-1)/NR))}' n=$msgs $pfpfile

Mind that it may not be advantageous to store everything in memory only
to do the final calculation after you know the maximum value of NR.

Since you're implicitly making a two-pass calculation anyway you could
separate the steps using wc(1), lines calculation (only once!) in shell,
and split(1).

>
> Say $msgs is 3.
> Say $pfpfile has 10 records in it.
> So, the array 'a' gets 10 records.

Yes.

>
> It simply spreads the records into N number of files. I'm curious on
> the logic for that:
>
> 1+int(n*((i-1)/NR))
>
> Record 1: 1 + int(3*((1-1)/10)) = ?
> Record 2: 1 + int(3*((2-1)/10)) = ?
> Record 3: 1 + int(3*((3-1)/10)) = ?
>
> I noticed that the first 3 records were in file 1, the next 3 in file
> 2 and the final 4 in file 3. How did it do that?

Math?

If you don't understand the calculation it may help to partition the
expression into parts, say, i, (i-1)/NR, n*((i-1)/NR), int(n*((i-1)/NR))
and write a small script to see how the sub-expressions evolve. In this
case

 i   (i-1)/NR   n*((i-1)/NR)   int(n*((i-1)/NR))
 1     0.00         0.00               0
 2     0.10         0.30               0
 3     0.20         0.60               0
 4     0.30         0.90               0
 5     0.40         1.20               1
 6     0.50         1.50               1
 7     0.60         1.80               1
 8     0.70         2.10               2
 9     0.80         2.40               2
10     0.90         2.70               2

Column 1 is the record number
Column 2 is the value mapped/scaled onto the interval [0,1[
Column 3 is the value scaled to cover the ranges of files
Column 4 is the truncated value for indexing the files
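
A small script of that kind might look like the sketch below. The helper name show_parts is hypothetical, and the last column already includes the leading "1 +", so it is the number of the output file rather than the bare truncated value:

```shell
#!/bin/sh
# show_parts N TOTAL: hypothetical helper that prints, for each record i,
# the intermediate values of 1+int(n*((i-1)/NR)) with NR fixed at TOTAL.
show_parts() {
    awk -v n="$1" -v total="$2" 'BEGIN {
        for (i = 1; i <= total; i++) {
            frac   = (i - 1) / total    # record position scaled into [0,1[
            scaled = n * frac           # stretched over the n output files
            printf "%2d %.2f %.2f %d\n", i, frac, scaled, 1 + int(scaled)
        }
    }'
}

show_parts 3 10
```

With n=3 and 10 records, the last column comes out as 1 for records 1 to 4, 2 for records 5 to 7 and 3 for records 8 to 10, since int() only steps up once the scaled value reaches the next whole number.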

Hope that helps.

Janis

ExecMan

Apr 17, 2012, 9:27:56 AM

On Apr 16, 1:48 pm, Janis Papanagnou <janis_papanag...@hotmail.com>
wrote:

Very interesting. This is probably something I would not have thought
of by myself. Whether it is the most efficient way, who knows. But
it is a creative way to do it.

Peter J. Holzer

Apr 17, 2012, 4:32:15 PM

This can be easily solved: To get the first of n parts, seek to position
size/n in the file, then read until the end of the current line. Copy
everything up to this position into the first output file. Then repeat
for the rest of the file and the remaining parts.

In Perl it looks like this:

#!/usr/bin/perl

use warnings;
use strict;
use File::stat;

=head1 NAME

split2nfiles - split file into n (equal-sized) files

=head1 SYNOPSIS

split2nfiles n file [prefix]

=cut

my ($parts, $file, $prefix) = @ARGV;
$prefix = "$file." unless defined $prefix;
# zero-pad the part number to the width of the part count
my $format = "%s%0" . length($parts) . "d";

open(my $in_fh, '<', $file) or die "$0: cannot open $file: $!\n";

my $st = stat($in_fh);
my $size = $st->size;

my $last_pos = 0;

for my $part (1 .. $parts) {
    # aim for an equal share of the bytes that are still unassigned
    my $new_pos = ($size - $last_pos) / $parts + $last_pos;
    seek($in_fh, $new_pos, 0) or die "$0: seek to $new_pos failed: $!\n";
    # read to the end of the current line so that no line is cut in two
    my $dummy = <$in_fh>;
    $new_pos = tell($in_fh);
    my $out_file_name = sprintf($format, $prefix, $part);
    open (my $out_fh, '>', $out_file_name) or die "$0: cannot open $out_file_name: $!\n";
    # go back and copy everything from the end of the previous part
    seek($in_fh, $last_pos, 0) or die "$0: seek to $last_pos failed: $!\n";
    read($in_fh, my $buffer, $new_pos - $last_pos);
    print $out_fh $buffer;
    close($out_fh) or die "$0: error closing $out_file_name after write: $!\n";
    # one part fewer to fill from the remaining bytes
    $parts--;
    $last_pos = $new_pos;
}
__END__

(only lightly tested)

This still needs as much memory as the largest part. Reading each part
in lines or blocks is left as an exercise for the reader.

The output files aren't all the same size of course, because each has to
contain an integral number of lines, but it's close. If your example
file with 1 very long and 500 very short lines is split into 5 parts,
the first will contain the first line and the other 4 parts will contain
126, 125, 125 and 124 lines respectively.

hp

--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | h...@hjp.at |
__/ | http://www.hjp.at/ | -- Bill Code on as...@irtf.org

unruh

Apr 17, 2012, 11:35:06 PM

??? How would this solve the long-first-line problem? No matter what you
did, that first line would produce a first file that was 90% of the
whole file. Or if it was the last line, you would always copy the whole
file into the first split.

But anyway, you could also use awk to do it without storing the whole
thing in memory, just not by using an associative array.
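
One way that could look, as a sketch rather than a tested tool: count the lines with wc first, then let awk write each line straight to its output file using the same formula as the one-liner discussed above. The function name stream_split and the default prefix are hypothetical, and the input is assumed to end with a newline so that the wc count matches awk's NR:

```shell
#!/bin/sh
# stream_split FILE N [PREFIX]: hypothetical helper, not from the thread.
# Pass 1 counts the lines; pass 2 sends each line to its output file as it
# is read, so awk never holds more than the current line in memory.
stream_split() {
    file=$1 n=$2 prefix=${3:-tmp_}
    total=$(wc -l < "$file")
    awk -v n="$n" -v total="$total" -v prefix="$prefix" '
        { print > (prefix (1 + int(n * ((NR - 1) / total)))) }
    ' "$file"
}

# Example call (assumed file name): stream_split addresses.txt 3 tmp_
```

The file is read twice, but memory use stays constant. The parentheses around the file name expression avoid the ambiguity some awk versions have with an unparenthesized concatenation after ">". For a very large number of output files, some awk implementations may run out of open file handles unless finished files are closed with close().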