Now I do it with split:
(@stuff) = split /\s/, $_;
print OUT "$stuff[$colnum-1]\n";
It works, but it takes ages to run. If I fetch only the first or second
column by limiting the split:
(@stuff[0..1],$rest) = split /\s/, $_, 3;
print OUT "$stuff[1]\n";
This runs fast. So splitting the entire string seems to be inefficient when
I only want one value.
Can I find the positions of tabs n and n+1? Then I could use substr.
Or is there a feature in split (which I have not found yet) that allows
picking the n:th value directly, something like: ($crap,$mystuff,$morecrap)
= split /\s/, $_, n;
Thanks in advance
Meanwhile I'll try running index n times, increasing the start point each
time, to see how it performs.
My current datafile has 1376216 rows and 82 columns and there is more
coming...
print "FRED".$str."#\n";
while ($str=~/\t(\w*)/g){
$count ++;
if ($count == 4){
$result=$1;
}
}
print $result."
Michael
Let me number the columns from zero.
It is certainly possible to determine the position of two consecutive
tabs in a string. You could walk from tab to tab using index() (the
three-argument form), but that would require a loop at the Perl level,
something like $pos = index( $str, "\t", $pos + 1) while $col--;
This also needs special treatment for the 0th and the last column.
Most likely, split is faster than this.
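For illustration only, here is a minimal sketch of the index() walk with the special cases handled. It assumes zero-based columns and a line that has already been chomped; the name column_by_index is made up for this example, and nothing here has been benchmarked against split.

```perl
use strict;
use warnings;

# Return column $col (0-based) of a tab-separated line,
# or nothing if the line has too few columns.
sub column_by_index {
    my ($line, $col) = @_;
    my $start = 0;
    for (1 .. $col) {                      # skip over $col tabs
        my $tab = index($line, "\t", $start);
        return if $tab < 0;                # ran out of columns
        $start = $tab + 1;
    }
    my $end = index($line, "\t", $start);  # tab that closes our column
    $end = length $line if $end < 0;       # last column: runs to end of line
    return substr($line, $start, $end - $start);
}

my $line = join "\t", 'A' .. 'Z';
print column_by_index($line, 5), "\n";
```

Column 0 needs no tab search at all, and the last column is closed by the end of the string instead of a tab; those are the two special cases.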
Another approach uses a regex: qr/(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
deposits the $col'th column in $1. This requires compilation of a
regex for each column. If you do that for every access, I doubt that
it can beat split. You might try and pre-compile a regex for every
column. Only a benchmark can tell if that is actually faster on your
machine.
Anno
> I am a newbie with perl but I have read the faq, searched from books
> and web tutorials. I want to get a column from tab-separated
> ascii-file. Is it possible to find where in my line is the n'th tab
> and then get the characters from that to next tab?
I haven't timed this for efficiency, but you could play around with
something like this:
my $n = 43 - 1;    # i.e., $n = 42 for the 43rd column
while (<IN>) {
    ...
    my $temp = $_;                    # in case you want to do something else with $_
    $temp =~ s/([^\t]*\t){$n}//o;     # discard the first 42 columns
    my ($stuff) = split /\t/, $temp;  # discard all columns after 43
    print OUT "$stuff\n";
    ...
}
You can extend this to pick out the 14th, 32nd, and 59th columns, or
whatever you need.
By the way, you don't want to use "split /\s/" if there might be spaces
in the text of any of your fields.
The /o flag on the substitution operation causes the search pattern to
be compiled only once, instead of being recompiled each time you run
through the loop.
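One consequence worth spelling out: with /o the interpolated variable is read only the first time the pattern is used, so later changes to it are ignored. A small sketch of that behaviour (the variable names are illustrative only):

```perl
use strict;
use warnings;

my (@with_o, @without_o);
for my $n (1, 2) {
    # With /o, $n is interpolated on the first pass only,
    # so the pattern stays /^a{1}$/ on the second pass too.
    push @with_o, $n if 'aa' =~ /^a{$n}$/o;

    # Without /o the pattern follows $n, so /^a{2}$/ is tried on the second pass.
    push @without_o, $n if 'aa' =~ /^a{$n}$/;
}
print "with /o:    matched for n = (@with_o)\n";
print "without /o: matched for n = (@without_o)\n";
```

So /o is only safe when the column number stays fixed for the whole run.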
Note also that
my $stuff = split ...
and
my ($stuff) = split ...
produce very different results.
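A short sketch of the difference, assuming a reasonably current perl (5.12 or later), where split in scalar context simply returns the number of fields:

```perl
use strict;
use warnings;

my $line = "alpha\tbeta\tgamma";

my $count   = split /\t/, $line;   # scalar context: the number of fields
my ($first) = split /\t/, $line;   # list context: the first field

print "count=$count first=$first\n";
```

The parentheses around $first are what put the assignment, and therefore split, into list context.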
>Try
Can't. It won't compile.
>print $result."
Can't find string terminator '"' anywhere before EOF at ./temp line 17.
--
Tad McClellan SGML consulting
ta...@metronet.com Perl programming
Fort Worth, Texas
>I am a newbie with perl
Welcome!
>but I have read the faq, searched from books and
>web tutorials.
Thanks. We all appreciate that.
Just in case you don't know, I wouldn't want you to miss a
large and authoritative resource:
The FAQs are only 10 of the 50-80 files worth of standard
docs that come with the perl distribution.
So don't forget to check the non-FAQ docs too.
>I want to get a column from tab-separated ascii-file.
use split().
>Is it possible to find where in my line is the n'th tab and then get the
>characters from that to next tab?
Yes. You could call index() a bunch of times, updating the value
of the 3rd argument each time.
But that seems like an awful lot of bother...
>Now I do it with split:
> (@stuff) = split /\s/, $_;
                    ^^
> print OUT "$stuff[$colnum-1]\n";
Note that that does not do what you said you wanted to do.
It splits on any of 5 characters, but you said you wanted to
split only on 1 particular character (a tab).
You want \t there, not \s
>Or is there a feature in split (which I could not find yet) that allows
>to pick n:th value directly, something like: ($crap,$mystuff,$morecrap)
>= split /\s/, $_, n;
It is not a feature of split(), but there is a feature of Perl
that will help you do that.
It is called a "list slice".
See the "Slices" section in perldata.pod.
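A minimal sketch of a list slice applied to split, using made-up sample data (the letters A to Z as 26 columns) and a 1-based $colnum as in the original post:

```perl
use strict;
use warnings;

my $line   = join "\t", 'A' .. 'Z';
my $colnum = 5;                      # 1-based column number

# Subscript the list returned by split; no intermediate array is needed.
my $val = (split /\t/, $line)[$colnum - 1];

# A slice can also pick several columns at once (0-based indices 2 and 9).
my ($third, $tenth) = (split /\t/, $line)[2, 9];

# With a LIMIT, split stops early and leaves the rest of the line
# in one final field that the slice simply ignores.
my $limited = (split /\t/, $line, $colnum + 1)[$colnum - 1];

print "$val $third $tenth $limited\n";
```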
>Thanks in advance
You're welcome in arrears.
>Meanwhile I 'll try running index n times increasing startpoint etc. to
>see how it performes.
You should use the Benchmark module for benchmarking :-)
------------------------
#!/usr/bin/perl -w
use Benchmark;
$str = "1001\t101\t010001\t00000\tFR0ED\trolf";
$colnum = 5;
timethese 1_000_000, {
    split_s => q( @vals = split(/\s/, $str); $val = $vals[$colnum-1]; ),
    split_t => q( @vals = split(/\t/, $str); $val = $vals[$colnum-1]; ),
    slice   => q( $val = (split /\t/, $str)[$colnum-1]; ),
    match   => q( $val = $1 if $str =~ /([^\t]*(\t|$)){$colnum}/; ),
};
------------------------
(partial) output:
match: 19 wallclock secs (18.50 usr + 0.01 sys = 18.51 CPU)
slice: 13 wallclock secs (11.56 usr + 0.02 sys = 11.58 CPU)
split_s: 19 wallclock secs (19.01 usr + 0.04 sys = 19.05 CPU)
split_t: 11 wallclock secs (12.26 usr + 0.02 sys = 12.28 CPU)
Change $colnum to 1 and we get:
match: 19 wallclock secs (18.65 usr + 0.01 sys = 18.66 CPU)
slice: 10 wallclock secs (11.51 usr + 0.02 sys = 11.53 CPU)
split_s: 20 wallclock secs (18.92 usr + 0.05 sys = 18.97 CPU)
split_t: 13 wallclock secs (12.39 usr + 0.03 sys = 12.42 CPU)
Modify it to run with your real data, and see what it says.
> This runs fast. So splitting entire string seems to be inefficient when
> I only want one value.
Using a simple match is about twice as fast as split even though it
involves re-compiling a regex. Interestingly, if you avoid the
recompilation it saves very little.
#!/usr/bin/perl -w
use Benchmark;
use strict;
my $n = 50;
my $data = join "\t" => map "foo$_" => 0 .. 100;
my $piece;
my %regex_cache;
timethese 10000 => {
    split => sub {
        $piece = (split /\s/, $data, $n+2)[$n];
    },
    match => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;
    },
    match_o => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;
    },
    match_cache => sub {
        ($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );
    },
}
>
>
>Can't find string terminator '"' anywhere before EOF at ./temp line 17.
Sorry, a copy-and-paste problem. This line should be
print $result."\n";
I think you guys know this.
Sorry Michael
Thanks a lot for all the answers.
The regex approaches seemed to be the fastest here. I have no idea how they
work, but I will try to figure it out. At last, some motivation to spend time
learning regexes.
Here are some results from a small testfile and column 160:
--
$stuff = $1 if $_ =~ /([^\t]*(\t|$)){$colnum}/;
28 sec
--
$stuff = (split /\t/, $_, $colnum+2)[$colnum];
34 sec
--
$_ =~ s/([^\t]*\t){$n}//o; # discard the first 42 columns
my ($stuff) = split /\t/, $_; # discard all columns after 43
22 sec
--
$_ =~ qr/(?:[^\t]*\t){$colnum}([^\t]*)(?:\t|$)/;
22 sec
--
($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;
($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;
($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );
all these 19 sec
>Another approach uses a regex: qr/(?:[^\t]*\t){$col}([^\t]*)(?:\t|$)/;
>deposits the $col'th column in $1. This requires compilation of a
>regex for each column.
I think I'd try that with the /o option. And your regex could use some
clean-up. A: you don't want to match newlines. B: '(?:\t|$)' is
superfluous: because of greedy matching, '[^\t]*' will slurp in as much
as possible, until the next tab or the end of the string.
If you need more than one column, I might use a closure (custom sub).
From 5.6 on, /o works properly inside a closure.
    sub get_column {
        my $col = shift;
        return sub {
            shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
            $1;
        };
    }

    my $sub = get_column(5);
    $_ = join "\t", 'A' .. 'Z';
    print $sub->($_);
-->
F
In pre-5.6, you need an eval for the sub, so that every closure contains
its own regex instead of a common one.
    sub get_column {
        my $col = shift;
        return eval
            'sub {
                shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
                $1;
            }';
    }
I haven't benchmarked it.
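As a possible alternative that does not depend on how /o behaves inside a closure, the regex can be compiled explicitly with qr// when the closure is built. This is only a sketch: the name make_column_getter is hypothetical, columns are zero-based as above, and it is unbenchmarked as well.

```perl
use strict;
use warnings;

# Build an extractor for one 0-based column.
# The regex is compiled once, here, and captured by the closure.
sub make_column_getter {
    my ($col) = @_;
    my $re = qr/^(?:[^\t]*\t){$col}([^\t\n]*)/;
    return sub {
        my ($line) = @_;
        return $line =~ $re ? $1 : undef;   # undef if the line is too short
    };
}

my $get  = make_column_getter(5);
my $line = join "\t", 'A' .. 'Z';
print $get->($line), "\n";
```

Each call to make_column_getter yields a closure with its own compiled regex, so getters for different columns do not interfere with each other.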
--
Bart.
...as long as $col doesn't vary from call to call. Since it was
a variable in the original code I don't think we can assume that.
It may pay to pre-compile the requisite set of regexes:
my @get_col = map qr/^(?:[^\t]*\t){$_}([^\t\n]*)/, 0 .. $n_fields - 1;
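A small usage sketch of such a precompiled table, assuming 26 sample columns (the letters A to Z) and a column number that changes from one access to the next:

```perl
use strict;
use warnings;

my $n_fields = 26;    # assumed column count of the sample data below

# One compiled regex per 0-based column, built once up front.
my @get_col = map qr/^(?:[^\t]*\t){$_}([^\t\n]*)/, 0 .. $n_fields - 1;

my $line = join "\t", 'A' .. 'Z';
my @picked;
for my $col (0, 5, 25) {
    my ($field) = $line =~ $get_col[$col];   # list context returns the capture
    push @picked, $field;
}
print "@picked\n";
```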
[snip regex cleanup]
>If you need more than one column, I might use a closure (custom sub).
>From 5.6 on, /o works properly inside a closure.
>
> sub get_column {
> my $col = shift;
> return sub {
> shift =~ /^(?:[^\t]*\t){$col}([^\t\n]*)/o;
> $1;
> };
> }
>
> my $sub = get_column(5);
> $_ = join "\t", 'A' .. 'Z';
> print $sub->($_);
>-->
> F
That is an interesting property of closures. It is hard to see,
however, how this can be realized without repeated compilation
of the regex.
Anno
>That is an interesting property of closures. It is hard to see,
>however, how this can be realized without repeated compilation
>of the regex.
It needs to be compiled only once per column. So, if you want the 17th
column of each record, you get one closure for this column, and call it
for each line.
--
Bart.
> It needs to be compiled only once per column. So, if you want the 17th
> column of each record, you get one closure for this column, and call it
> for each line.
If you run the example in my earlier contribution to this thread, where
I benchmark the split(), recompiled-regex and precompiled-regex
approaches, you'll see that avoiding recompilation is probably
not worth the effort in this case. (I didn't include the closure and /o case
because I'm running 5.5.)
I forgot to include the output last time, so here it is again (slightly
modified):
#!/usr/bin/perl -w
use Benchmark;
use strict;
my $n = 50;
my $data = join "\t" => map "foo$_" => 0 .. 100;
my $piece;
my @regex_cache;
my %regex_cache;
timethese 10000 => {
    split => sub {
        $piece = (split /\s/, $data, $n+2)[$n];
    },
    match => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/;
    },
    match_o => sub {
        ($piece) = $data =~ /(?:\S*\s){$n}(\S+)/o;
    },
    match_cache_h => sub {
        ($piece) = $data =~ ( $regex_cache{$n} ||= qr/(?:\S*\s){$n}(\S+)/ );
    },
    match_cache => sub {
        ($piece) = $data =~ ( $regex_cache[$n] ||= qr/(?:\S*\s){$n}(\S+)/ );
    },
}
Benchmark: timing 10000 iterations of match, match_cache, match_cache_h, match_o, split...
match: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU)
match_cache: 1 wallclock secs ( 1.06 usr + 0.00 sys = 1.06 CPU)
match_cache_h: 1 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU)
match_o: 1 wallclock secs ( 1.01 usr + 0.00 sys = 1.01 CPU)
split: 3 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)
True, in principle only one translation per case is necessary. But I
have trouble imagining how the interpreter would keep track of cases
that have been dealt with.
Anno
Unfortunately not true. What is your perl -V? It works correctly in
ActiveState's build 613 but not in any other *nix version of 5.6 or 5.7
that I have. Maybe I missed a config option when compiling myself. :-(
--
Rick Delaney
rick.d...@home.com
>>It needs to be compiled only once per column. So, if you want the 17th
>>column of each record, you get one closure for this column, and call it
>>for each line.
>
>True, in principle only one translation per case is necessary. But I
>have trouble imagining how the interpreter would keep track of cases
>that have been dealt with.
Oh. I was assuming that, for example, the "name" was the 17'th column,
and you want to extract the name for every record. That's not too
uncommon, I assume?
But anyway, what you propose isn't as hard to do in plain Perl. Just
make an array of closures. Even caching the closures is dead easy.
$field = ($sub[$col] ||= get_column($col))->($line);
--
Bart.