How to find what is between n'th and the next tab?

Pekka Kumpulainen

Nov 24, 2000, 3:00:00 AM

Hi,
I am a newbie with perl but I have read the faq, searched from books and
web tutorials.
I want to get a column from tab-separated ascii-file.
Is it possible to find where in my line is the n'th tab and then get the
characters from that to next tab?

Now I do it with split:
(@stuff) = split /\s/, $_;
print OUT "$stuff[$colnum-1]\n";

It works, but it takes ages to run. If I get the first one or second
limiting the split:
(@stuff[0..1],$rest) = split /\s/, $_, 3;
print OUT "$stuff[1]\n";

This runs fast. So splitting entire string seems to be inefficient when
I only want one value.

Can I find positions of tabs n and n+1? Then I could use substr.
Or is there a feature in split (which I could not find yet) that allows
to pick n:th value directly, something like: ($crap,$mystuff,$morecrap)
= split /\s/, $_, n;

Thanks in advance

Meanwhile I'll try running index n times increasing startpoint etc. to
see how it performs.
My current datafile has 1376216 rows and 82 columns and there is more
coming...

Michael Guenther

Nov 24, 2000, 3:00:00 AM

Try
my $str = "1001\t101\t010001\t00000\tFR0ED\trolf";
my $count = 0;
my $result;

print "FRED".$str."#\n";

while ($str=~/\t(\w*)/g){
$count ++;
if ($count == 4){
$result=$1;
}
}
print $result."

Michael

Anno Siegel

Nov 24, 2000, 3:00:00 AM

Pekka Kumpulainen <ku...@mit.tut.fi> wrote in comp.lang.perl.misc:

>Hi,
>I am a newbie with perl but I have read the faq, searched from books and
>web tutorials.
>I want to get a column from tab-separated ascii-file.
>Is it possible to find where in my line is the n'th tab and then get the
>characters from that to next tab?
>
>Now I do it with split:
> (@stuff) = split /\s/, $_;
> print OUT "$stuff[$colnum-1]\n";
>
>It works, but it takes ages to run. If I get the first one or second
>limiting the split:
> (@stuff[0..1],$rest) = split /\s/, $_, 3;
> print OUT "$stuff[1]\n";
>
>This runs fast. So splitting entire string seems to be inefficient when
>I only want one value.
>
>Can I find positions of tabs n and n+1? Then I could use substr.
>Or is there a feature in split (which I could not find yet) that allows
>to pick n:th value directly, something like: ($crap,$mystuff,$morecrap)
>= split /\s/, $_, n;

Let me number the columns from zero.

It is certainly possible to determine the position of two consecutive
tabs in a string. You could walk from tab to tab using index() (the
three-argument form), but that would require a loop at the Perl level,
something like $pos = index( $str, "\t", $pos + 1) while $col--;
This also needs special treatment for the 0th and the last column.
Most likely, split is faster than this.
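
For illustration, the index() walk could be sketched roughly as follows. The sub name nth_field is invented for this example, and the line is assumed to be chomped already. It is untimed, so it says nothing about whether this beats split.

```perl
use strict;
use warnings;

# Hypothetical helper: return column $col (counted from zero) of a
# tab-separated, already chomped $line, or undef if there are too few columns.
sub nth_field {
    my ($line, $col) = @_;
    my $start = 0;                        # column 0 starts at offset 0
    for (1 .. $col) {                     # empty range when $col is 0
        my $tab = index($line, "\t", $start);
        return undef if $tab < 0;         # fewer than $col tabs in the line
        $start = $tab + 1;                # next column starts after the tab
    }
    my $end = index($line, "\t", $start);
    $end = length $line if $end < 0;      # last column runs to end of line
    return substr($line, $start, $end - $start);
}

my $line = join "\t", 'A' .. 'F';
print nth_field($line, 0), "\n";   # first column
print nth_field($line, 3), "\n";   # a middle column
print nth_field($line, 5), "\n";   # last column
```

The two special cases show up as the empty loop range for column 0 and the fallback to the string length for the last column.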

Another approach uses a regex: qr/(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
deposits the $col'th column in $1. This requires compilation of a
regex for each column. If you do that for every access, I doubt that
it can beat split. You might try and pre-compile a regex for every
column. Only a benchmark can tell if that is actually faster on your
machine.

Anno

Linc Madison

Nov 24, 2000, 3:00:00 AM

In article <3A1E6A39...@mit.tut.fi>, Pekka Kumpulainen
<ku...@mit.tut.fi> wrote:

> I am a newbie with perl but I have read the faq, searched from books
> and web tutorials. I want to get a column from tab-separated
> ascii-file. Is it possible to find where in my line is the n'th tab
> and then get the characters from that to next tab?
>
> Now I do it with split:
> (@stuff) = split /\s/, $_;
> print OUT "$stuff[$colnum-1]\n";
>
> It works, but it takes ages to run. If I get the first one or second
> limiting the split:
> (@stuff[0..1],$rest) = split /\s/, $_, 3;
> print OUT "$stuff[1]\n";
>
> This runs fast. So splitting entire string seems to be inefficient when
> I only want one value.
>
> Can I find positions of tabs n and n+1? Then I could use substr.
> Or is there a feature in split (which I could not find yet) that allows
> to pick n:th value directly, something like: ($crap,$mystuff,$morecrap)
> = split /\s/, $_, n;
>

> Thanks in advance
>
> Meanwhile I'll try running index n times increasing startpoint etc. to
> see how it performs.
> My current datafile has 1376216 rows and 82 columns and there is more
> coming...

I haven't timed this for efficiency, but you could play around with
something like this:

    my $n = 43 - 1; # i.e., $n = 42 for the 43rd column
    while (<IN>) {
        ...
        my $temp = $_; # in case you want to do something else with $_
        $temp =~ s/([^\t]*\t){$n}//o; # discard the first 42 columns
        my ($stuff) = split /\t/, $temp; # discard all columns after 43
        print OUT "$stuff\n";
        ...
    }

You can extend this to pick out the 14th, 32nd, and 59th columns, or
whatever you need.

By the way, you don't want to use "split /\s/" if there might be spaces
in the text of any of your fields.

The /o flag on the substitution operation causes the search pattern to
be compiled only once, instead of being recompiled each time you run
through the loop.

Note also that

my $stuff = split ...

and

my ($stuff) = split ...

produce very different results.
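
A small sketch of that difference, assuming a reasonably recent perl (the variable names are made up for the example): in scalar context split returns the number of fields, while the parenthesized form puts split in list context and assigns the first field.

```perl
use strict;
use warnings;

my $line = "alpha\tbeta\tgamma";

my $count   = split /\t/, $line;   # scalar context: how many fields
my ($first) = split /\t/, $line;   # list context: the first field

print "count = $count\n";
print "first = $first\n";
```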

Tad McClellan

Nov 24, 2000, 10:21:11 AM

Michael Guenther <MiGue...@lucent.com> wrote:

>Try

Can't. It won't compile.

>print $result."

Can't find string terminator '"' anywhere before EOF at ./temp line 17.

--
Tad McClellan SGML consulting
ta...@metronet.com Perl programming
Fort Worth, Texas

Tad McClellan

Nov 24, 2000, 11:35:10 AM

Pekka Kumpulainen <ku...@mit.tut.fi> wrote:

>I am a newbie with perl

Welcome!

>but I have read the faq, searched from books and
>web tutorials.

Thanks. We all appreciate that.

Just in case you don't know, I wouldn't want you to miss a
large and authoritative resource:

The FAQs are only 10 of the 50-80 files worth of standard
docs that come with the perl distribution.

So don't forget to check the non-FAQ docs too.

>I want to get a column from tab-separated ascii-file.

use split().

>Is it possible to find where in my line is the n'th tab and then get the
>characters from that to next tab?

Yes. You could call index() a bunch of times, updating the value
of the 3rd argument each time.

But that seems like an awful lot of bother...

>Now I do it with split:
> (@stuff) = split /\s/, $_;
                    ^
> print OUT "$stuff[$colnum-1]\n";

Note that that does not do what you said you wanted to do.

It splits on any of 5 characters, but you said you wanted to
split only on 1 particular character (a tab).

You want \t there, not \s
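
To make the difference concrete, here is a minimal sketch with made-up data in which one field contains a space:

```perl
use strict;
use warnings;

my $line = "New York\t42\tblue";   # invented sample record

my @by_space = split /\s/, $line;  # also splits inside "New York"
my @by_tab   = split /\t/, $line;  # splits only at tabs

print scalar(@by_space), " fields with \\s\n";
print scalar(@by_tab),   " fields with \\t\n";
```

With \s the city name is cut in two, so every later column number is off by one.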

>Or is there a feature in split (which I could not find yet) that allows
>to pick n:th value directly, something like: ($crap,$mystuff,$morecrap)
>= split /\s/, $_, n;

It is not a feature of split(), but there is a feature of Perl
that will help you do that.

It is called a "list slice".

See the "Slices" section in perldata.pod.
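
As a rough illustration of a list slice applied to split (sample data and variable names are invented for the example):

```perl
use strict;
use warnings;

my $line = join "\t", 'A' .. 'Z';   # 26 tab-separated columns

# One column, counted from zero, without a named temporary array.
my $sixth = (split /\t/, $line)[5];

# Several columns in one go; a negative index counts from the end.
my ($first, $tenth, $last) = (split /\t/, $line)[0, 9, -1];

print "$sixth\n";
print "$first $tenth $last\n";
```

Note that the whole line is still split; the slice only saves the temporary array.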

>Thanks in advance

You're welcome in arrears.

>Meanwhile I'll try running index n times increasing startpoint etc. to
>see how it performs.

You should use the Benchmark module for benchmarking :-)

------------------------
#!/usr/bin/perl -w
use Benchmark;

$str = "1001\t101\t010001\t00000\tFR0ED\trolf";

$colnum = 5;

timethese 1_000_000, {
    split_s => q( @vals = split(/\s/, $str); $val = $vals[$colnum-1]; ),
    split_t => q( @vals = split(/\t/, $str); $val = $vals[$colnum-1]; ),
    slice   => q( $val = (split /\t/, $str)[$colnum-1]; ),
    match   => q( $val = $1 if $str =~ /([^\t]*(\t|$)){$colnum}/; ),
};
------------------------

(partial) output:

match: 19 wallclock secs (18.50 usr + 0.01 sys = 18.51 CPU)
slice: 13 wallclock secs (11.56 usr + 0.02 sys = 11.58 CPU)
split_s: 19 wallclock secs (19.01 usr + 0.04 sys = 19.05 CPU)
split_t: 11 wallclock secs (12.26 usr + 0.02 sys = 12.28 CPU)

Change $colnum to 1 and we get:

match: 19 wallclock secs (18.65 usr + 0.01 sys = 18.66 CPU)
slice: 10 wallclock secs (11.51 usr + 0.02 sys = 11.53 CPU)
split_s: 20 wallclock secs (18.92 usr + 0.05 sys = 18.97 CPU)
split_t: 13 wallclock secs (12.39 usr + 0.03 sys = 12.42 CPU)

Modify it to run with your real data, and see what it says.

nob...@mail.com

Nov 24, 2000, 1:12:54 PM

Pekka Kumpulainen <ku...@mit.tut.fi> writes:

> This runs fast. So splitting entire string seems to be inefficient when
> I only want one value.

Using a simple match is about twice as fast as split even though it
involves re-compiling a regex. Interestingly, if you avoid the
recompilation it saves very little.

#!/usr/bin/perl -w
use Benchmark;
use strict;

my $n = 50;
my $data = join "\t" => map "foo$_" => 0 .. 100;
my $piece;
my %regex_cache;

timethese 10000 => {
    split => sub {
        $piece = (split /\s/, $data, $n+2)[$n];
    },

    match => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;
    },

    match_o => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;
    },

    match_cache => sub {
        ($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );
    },
}

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\

Michael Guenther

Nov 26, 2000, 3:00:00 AM

Tad McClellan wrote in message ...

>Michael Guenther <MiGue...@lucent.com> wrote:
>
>>Try
>
>
>Can't. It won't compile.
>
>
>>print $result."

>
>
>Can't find string terminator '"' anywhere before EOF at ./temp line 17.

Sorry, a copy-and-paste problem. This line should be

print $result."\n";

I think you guys know this.

Sorry Michael

Pekka Kumpulainen

Nov 28, 2000, 3:00:00 AM

Pekka Kumpulainen wrote:

Thanks a lot for all the answers.
The regex-things seemed to be the fastest here. I have no idea how they
work, but I will try to figure it out. At last some motivation to spend time
learning regex.
Here are some results from a small testfile and column 160:
--
$stuff = $1 if $_ =~ /([^\t]*(\t|$)){$colnum}/;
28 sec
--
$stuff = (split /\t/, $_, $colnum+2)[$colnum];
34 sec
--
$_ =~ s/([^\t]*\t){$n}//o; # discard the first 42 columns
my ($stuff) = split /\t/, $_; # discard all columns after 43
22 sec
--
$_ =~ qr/(?:[^\t]*\t){$colnum}([^\t]*)(?:\t|$)/;
22 sec
--

($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;

($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;

($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );

all these 19 sec

Bart Lateur

Nov 28, 2000, 3:00:00 AM

Anno Siegel wrote:

>Another approach uses a regex: qr/(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
>deposits the $col'th column in $1. This requires compilation of a
>regex for each column.

I think I'd try that with the /o option. And your regex can use some
clean-up. A: you don't want to match newlines. B: '(?:\t|$)' is
superfluous, because of greedy matching, '[^\t]*' will slurp in as much
as possible -- until the next tab, or the end of the string.
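
A quick sketch of both points, with the two patterns wrapped in helper subs whose names (col_long, col_short) are invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns under discussion.
sub col_long {
    my ($line, $col) = @_;
    my ($field) = $line =~ /(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
    return $field;
}

sub col_short {
    my ($line, $col) = @_;
    my ($field) = $line =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/;
    return $field;
}

my $line = "a\tb\tc\td\n";   # not chomped, as read from a file

for my $col (2, 3) {
    for my $field (col_long($line, $col), col_short($line, $col)) {
        (my $shown = $field) =~ s/\n/\\n/;   # make a newline visible
        print "column $col: '$shown'\n";
    }
}
```

For a middle column both patterns capture the same text. For the last column of an unchomped line, only the pattern with [^\t\n] leaves the newline out.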

If you need more than one column, I might use a closure (custom sub).
From 5.6 on, /o works properly inside a closure.

    sub get_column {
        my $col = shift;
        return sub {
            shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
            $1;
        };
    }

my $sub = get_column(5);
$_ = join "\t", 'A' .. 'Z';
print $sub->($_);
-->
F

In pre-5.6, you need an eval for the sub, so that every closure contains
its own regex -- instead of a common one.

    sub get_column {
        my $col = shift;
        return eval
            'sub {
                shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
                $1;
            }';
    }

I haven't benchmarked it.

--
Bart.

Anno Siegel

Nov 28, 2000, 3:00:00 AM

Bart Lateur <bart....@skynet.be> wrote in comp.lang.perl.misc:

>Anno Siegel wrote:
>
>>Another approach uses a regex: qr/(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
>>deposits the $col'th column in $1. This requires compilation of a
>>regex for each column.
>
>I think I'd try that with the /o option.

...as long as $col doesn't vary from call to call. Since it was
a variable in the original code I don't think we can assume that.
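
To illustrate why that matters, a small sketch (variable names invented for the example): with /o the value of $col is interpolated only the first time the match runs, so later values are silently ignored.

```perl
use strict;
use warnings;

my $line = join "\t", 'A' .. 'F';
my @seen;

for my $col (0, 2, 4) {
    # With /o the pattern is built once, using the first value of $col.
    my ($field) = $line =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
    push @seen, $field;
}

print "@seen\n";
```

As far as I know this prints the first column three times rather than columns 0, 2 and 4. Dropping the /o gives the columns actually asked for.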

It may pay to pre-compile the requisite set of regexes:

my @get_col = map qr/^(?:[^\t]*\t){$_}([^\t\n]*)/, 0 .. $n_fields - 1;
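
A sketch of how such a table might be used. The value of $n_fields and the sample line are assumptions made only for this example.

```perl
use strict;
use warnings;

my $n_fields = 6;   # assumed column count, for illustration only

# One precompiled regex per column, indexed from zero.
my @get_col = map qr/^(?:[^\t]*\t){$_}([^\t\n]*)/, 0 .. $n_fields - 1;

my $line = join "\t", 'A' .. 'F';
for my $col (0, 3, 5) {
    my ($field) = $line =~ $get_col[$col];   # list context returns the capture
    print "column $col: $field\n";
}
```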

[snip regex cleanup]

>If you need more than one column, I might use a closure (custom sub).
>From 5.6 on, /o works properly inside a closure.
>
> sub get_column {
> my $col = shift;
> return sub {
> shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
> $1;
> };
> }
>
> my $sub = get_column(5);
> $_ = join "\t", 'A' .. 'Z';
> print $sub->($_);
>-->
> F

That is an interesting property of closures. It is hard to see,
however, how this can be realized without repeated compilation
of the regex.

Anno

Bart Lateur

Nov 28, 2000, 3:00:00 AM

Anno Siegel wrote:

>That is an interesting property of closures. It is hard to see,
>however, how this can be realized without repeated compilation
>of the regex.

It needs to be compiled only once per column. So, if you want the 17th
column of each record, you get one closure for this column, and call it
for each line.

--
Bart.

nob...@mail.com

Nov 29, 2000, 3:00:00 AM

Bart Lateur <bart....@skynet.be> writes:

> It needs to be compiled only once per column. So, if you want the 17th
> column of each record, you get one closure for this column, and call it
> for each line.

If you run the example in my earlier contribution to this thread, where
I benchmark the split(), recompiled regex and precompiled regex
approaches, you'll see that avoiding recompilation is probably
not worth the trouble in this case. (I didn't include the closure and /o
case because I'm running in 5.5.)

I forgot to include the output last time so here it is again (slightly
modified):

#!/usr/bin/perl -w
use Benchmark;
use strict;

my $n = 50;
my $data = join "\t" => map "foo$_" => 0 .. 100;
my $piece;

my @regex_cache;
my %regex_cache;

timethese 10000 => {
    split => sub {
        $piece = (split /\s/, $data, $n+2)[$n];
    },

    match => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;
    },

    match_o => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;
    },

    match_cache_h => sub {
        ($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );
    },

    match_cache => sub {
        ($piece) = $data =~ ( $regex_cache[$n] ||= qr/(?:\S*\s){$n}(\S+)/ );
    },
}

Benchmark: timing 10000 iterations of match, match_cache, match_cache_h, match_o, split...
match: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU)
match_cache: 1 wallclock secs ( 1.06 usr + 0.00 sys = 1.06 CPU)
match_cache_h: 1 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU)
match_o: 1 wallclock secs ( 1.01 usr + 0.00 sys = 1.01 CPU)
split: 3 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)

Anno Siegel

Nov 29, 2000, 3:00:00 AM

Bart Lateur <bart....@skynet.be> wrote in comp.lang.perl.misc:

>Anno Siegel wrote:
>
>>That is an interesting property of closures. It is hard to see,
>>however, how this can be realized without repeated compilation
>>of the regex.
>

>It needs to be compiled only once per column. So, if you want the 17th
>column of each record, you get one closure for this column, and call it
>for each line.

True, in principle only one translation per case is necessary. But I
have trouble imagining how the interpreter would keep track of cases
that have been dealt with.

Anno

Rick Delaney

Nov 29, 2000, 10:11:32 PM

Bart Lateur wrote:
>
> If you need more than one column, I might use a closure (custom sub).
> From 5.6 on, /o works properly inside a closure.

Unfortunately not true. What is your perl -V? It works correctly in
ActiveState's build 613 but not in any other *nix version of 5.6 or 5.7
that I have. Maybe I missed a config option when compiling myself. :-(

--
Rick Delaney
rick.d...@home.com

Bart Lateur

Nov 30, 2000, 3:00:00 AM

Anno Siegel wrote:

>>It needs to be compiled only once per column. So, if you want the 17th
>>column of each record, you get one closure for this column, and call it
>>for each line.
>
>True, in principle only one translation per case is necessary. But I
>have trouble imagining how the interpreter would keep track of cases
>that have been dealt with.

Oh. I was assuming that, for example, the "name" was the 17'th column,
and you want to extract the name for every record. That's not too
uncommon, I assume?

But anyway, what you propose isn't as hard to do in plain Perl. Just
make an array of closures. Even caching the closures is dead easy.

$field = ($sub[$col] ||= get_column($col))->($line);
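
Put together, that could look roughly like the sketch below. Note that this get_column is a variant that stores a qr// object in each closure instead of relying on /o, since /o inside closures was reported elsewhere in this thread not to behave the same on every build. The sample data is invented.

```perl
use strict;
use warnings;

# Variant of get_column: each closure carries its own precompiled regex.
sub get_column {
    my $col = shift;
    my $re  = qr/^(?:[^\t]*\t){$col}([^\t\n]*)/;   # built once per closure
    return sub {
        my ($line) = @_;
        return $line =~ $re ? $1 : undef;          # undef if too few columns
    };
}

my @sub;   # cache of closures, indexed by column number

my $line = join "\t", 'A' .. 'Z';
for my $col (5, 0, 5) {
    # The closure for a column is created on first use and reused afterwards.
    my $field = ($sub[$col] ||= get_column($col))->($line);
    print "column $col: $field\n";
}
```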

--
Bart.