Suppose I have a long string $a, and a test string $b.
I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.
For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.
I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?
Thanks,
Peng
Not easily, and probably fairly slow.
You need some heuristic algorithym.
sln
----------------------------
use strict;
use warnings;
my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
print @tmp,"\n",$rxpattern,"\n\n";
while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}
__END__
abcdef
(?i-xsm:((?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))
2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'
Hi,
I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?
Thanks,
Peng
Thanks,
Peng
You can find out about the @- array in:
perldoc perlvar
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
> I can not find what '$-[$i]' means in your code
It is accessing the array named @-
> on online perl
> documentation. Would you please let me know its meaning? Where is its
> documentation?
perldoc perlvar
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?
BTW, I'd think that the perldoc be made more readable for newbies. I'm
not sure whether other newbies would have the same feeling.
Thanks,
Peng
Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;
my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";
my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();
print "\nmatching '$a' against '$b'\n";
OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}
__END__
matching
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx'
against 'aacdxf'
match 'abcdef' at offset 0
match 'bacdef' at offset 6
match 'aacdxf' at offset 21
match 'aacdef' at offset 28
match 'bacdef' at offset 38
match 'aacdxf' at offset 53
match 'aacdef' at offset 60
match 'aacdxf' at offset 66
match 'aacdfx' at offset 72
---------------------------------------------------------------
Sorry about the line wrap
--
"Usenet is like a herd of performing elephants with diarrhea:
massive, difficult to redirect, awe-inspiring, entertaining, and a
source of mind-boggling amounts of excrement when you least
expect it." (Gene Spafford, 1992).
If $a could be very long, say, of the order of MB, will this approach
be much slower than the code given by sln?
Thanks,
Peng
PY> BTW, I'd think that the perldoc be made more readable for
PY> newbies. I'm not sure whether other newbies would have the same
PY> feeling.
There's a tradeoff between making the perldocs more readable for newbies
and making them more information-dense for experts. There are other
sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.
Further, you can learn more about Perl by following up on everything in
the documentation that you don't understand. Indeed, that's how many of
us learned as much as we did.
Charlton
--
Charlton Wilbur
cwi...@chromatico.net
I total understand what you mean.
At this moment, I only be interested in understanding how @- works. I
tried to read that perldoc page, but I feel it is hard to follow.
Since that section of perldocs is not very readable to newbies, I'm
wondering if there is any more introductory material that explains it
with more examples.
I tried to look the more introductory materials on @- online but I
don't find any.
I think as I understand perl more, I shall be able to read perldoc
more easily.
Thanks,
Peng
Or more simpler as:
#!/usr/bin/perl
use strict;
use warnings;
my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';
my $limit = 2;
my $lenb = length $b;
print "\nmatching '$a' against '$b'\n";
while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}
__END__
> On Nov 20, 12:19 am, "John W. Krahn" <some...@example.com> wrote:
>> Peng Yu wrote:
>> > On Nov 19, 7:55 pm, s...@netherlands.com wrote:
>>
...
>> > I can not find what '$-[$i]' means in your code on online perl
>> > documentation. Would you please let me know its meaning? Where is
>> > its documentation?
>>
>> You can find out about the @- array in:
>>
>> perldoc perlvar
>
> It seems that the perldoc is not easy to understand by a newbie. While
> it explains @-, the perldoc keeps on referring to others things that I
> don't understand. Can somebody help explain @- for me with more
> examples?
I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:
<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so $-[1] is the offset where $1 begins,
$-[2] the offset where $2 begins, and so on.
</blockquote>
Consider the example below:
C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl
use strict;
use warnings;
my $s = '123abcABC';
if ( $s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))$/ ) {
print "@-\n";
print "$_\n" for $1, $2, $3, $4;
}
__END__
In this case, a successful match would produce
$1 = 123 (offset = 0)
$2 = 3 (offset = 2)
$3 = ABC (offset = 6)
$4 = BC (offset = 7)
The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).
Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
> Suppose I have a long string $a, and a test string $b.
>
> I want to fine all the substrings in $a, whose length is the same as
> $b with at most n mismatches.
Use substr() and the "^" operator and y/\0//c.
> For example, string 'abcdef' and string 'aacdxf' have two mismatches
> at the 2nd character and the 5th character.
$ perl -wle'
$r = "abcdef" ^ "aacdxf";
print $r =~ y/\0//c
'
2
--
Affijn, Ruud
"Gewoon is een tijger."
Yes, this is by far the fastest way. It benches 1/3 of the time
of using arrays and substr. Although, traversing with substr's and
using the bitwise ^ strings with tr///c might be faster, if only
by a little, its worth investigating. I personally think using the
regex engine for this is faster.
This is a prime example where bit-wise operators rule over string
comparisons... pure math.
If the problem were to spill over to multiple character comparisons
this won't work but the OP didn't say that.
Fast indeed!
I looked over this pretty good. I got a question about that look-ahead
zero-width assertion.
I noticed the capture group is in the assertion and it only advances the
regexp ordinal position by one.
Example:
/(.{$lenb})/ advances 6
/(?=(.{$lenb}))/ advances 1
I looked in perlre,perlretut and there is only one mention of capture within
a "non-capture zero-width assertion" extended expression elements.
That was just a side from some example on (?>..), and that just mentioned,
the ordinal position is adjusted with a (?=(regex)) analogy.
Do you know why this is?
sln
Yes, a regular expression is also possible. First write code to create
the regex string, for your example it can be done like this:
perl -wle'
my ($b, @b) = "aacdxf";
for my $i (0 .. 4) {
substr(my $bi = $b, $i, 1, ".");
for my $j ($i+1 .. 5) {
substr(my $bj = $bi, $j, 1, ".");
push @b, $bj;
}
}
my $re = join("\n | ", @b);
$re = qr/
( $re
)
/x;
print $re;
'
(?x-ism:
( ..cdxf
| .a.dxf
| .ac.xf
| .acd.f
| .acdx.
| a..dxf
| a.c.xf
| a.cd.f
| a.cdx.
| aa..xf
| aa.d.f
| aa.dx.
| aac..f
| aac.x.
| aacd..
)
)
Follow-up:
Below is a modified version that chops the time in half.
The fastest solution conforming to the original problem statement
is from John Krahn. Because you can't fight math when posing
1-dimensional problems like this.
However, I'm posting this modified version incase the problem set
expands to multi-character heuristics.
The modifications include staying away from the @- array altogether.
In my opinion, it is just a eval{} in disguise anyway, don't know
don't care. Position is set every time. The pos() function is free,
its virtually no overhead. Reading @-, the patter/sub-pattern matching
position array, on the otherhand, incurrs 5-10 times the overhead.
Its to be avoided at all costs on performance related regexp issues.
Staying away from the @- array but keeping the algorithym means doing
something thats not supposed to be done, but it works:
$i = 1; # is not a reference
$$i # $1
I didn't see any side affects except you have to manage the pos() each time.
See: FAQ 7.29: How can I use a variable as a variable name?
That being said Expanding notation to multi-character substrings
and to add some intelligence (possibly) to dynamically generated regexp's,
might be a help in the future.
For instance, you could expand the definition of matching items like so:
@terms = (
# string - required - generated rxterm
# ----------------------------------------
'Merry ' , 1, # (?:Merry )
'C', , 0, # (?:C|(.))
'h', , 0, # (?:h|(.))
'r', , 0, # (?:r|(.))
'i', , 0, # (?:i|(.))
's', , 0, # (?:s|(.))
't', , 0, # (?:t|(.))
'mas', , 0, # (?:mas|(.{3}))
);
But then your getting into heuristics.
Good luck!
sln
--------------------------------
use strict;
use warnings;
use Benchmark ':hireswallclock';
my ($t0,$t1,$tdif);
my $pattern = 'aacdxf';
my $misses = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
my $extent = @tmp+1;
my ($i,$cnt,$lastpos) = (0,0,0);
print @tmp,"\n",$rxpattern,"\n\n";
my $fname = 'c:\temp\5MEG_FILE.txt';
open my $fh, $fname or die "can't open $fname...";
$t0 = new Benchmark;
while (<$fh>)
{
chomp;
$lastpos = 0;
while ( /$rxpattern/g )
{
$cnt = 0;
for $i (2..$extent) {
last if (($cnt += defined( $$i)) > $misses);
}
if ($cnt <= $misses) {
print "$cnt bad chars, but found a close match: '$1'\n";
}
pos() = ++$lastpos;
}
}
$t1 = new Benchmark;
close $fh;
$tdif = timediff($t1, $t0);
print "the code took:",timestr($tdif),"\n";
__END__
Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.
I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.
Thanks,
Peng
It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.
>> > While
>> > it explains @-, the perldoc keeps on referring to others things that I
>> > don't understand. Can somebody help explain @- for me with more
>> > examples?
Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.
>> I looked at the perldoc perlvar page to remind myself what it said. I am
>> unable to figure out how one could make it clearer:
>Thank you the example. I see what it means. In general, I think the
>document can be made easier to read with more examples.
But there are examples in that section:
<quote>
After a match against some variable $var:
"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
"$3" is the same as "substr $var, $-[3], $+[3] - $-[3])"
</quote>
>I agree that adding more examples would make the manual to long to be
>read by experts. But I think that this can be solved by having two
>sets of document---one is with examples, the other is without
>examples.
Go ahead and write the fivehundredandnintysecond tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.
Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.
jue
>Peng Yu <Peng...@gmail.com> wrote:
>>On Nov 20, 3:46 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>>> Peng Yu <PengYu...@gmail.com> wrote innews:8f31f2be-e332-4d6d...@o2g2000yqd.googlegroups.com:
>>> > It seems that the perldoc is not easy to understand by a newbie.
>
>It is a reference, concise and accurate (as much as possible), it is not
>and never will be a tutorial.
What are you defending, a reference?? A reference is and has always been
another word for a POS! It is just a list of prototypes.
A reference has been and always will be nothing of significance at all.
Used by somebody who already knows it all to adjust for typo's and syntax.
I hope those people don't actually write the referece.
>
>>> > While
>>> > it explains @-, the perldoc keeps on referring to others things that I
>>> > don't understand. Can somebody help explain @- for me with more
>>> > examples?
>
>Then you will have to look for a tutorial or dig down into those other
>things. Very often it is not possible or not desirable to explain
>something without making references to other concepts. And @- is
>certainly not the right place to explain grouping in regular
>expressions.
Why the f*ck not? Can't explain something you have no idea where you
learned it from, then blowhard to someone else needing answer's? Get off man..
>
>>> I looked at the perldoc perlvar page to remind myself what it said. I am
>>> unable to figure out how one could make it clearer:
>
>>Thank you the example. I see what it means. In general, I think the
>>document can be made easier to read with more examples.
>
>
>But there are examples in that section:
><quote>
> After a match against some variable $var:
>
> "$`" is the same as "substr($var, 0, $-[0])"
> "$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
> "$'" is the same as "substr($var, $+[0])"
> "$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
> "$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
> "$3" is the same as "substr $var, $-[3], $+[3] - $-[3])"
></quote>
>
>
>>I agree that adding more examples would make the manual to long to be
>>read by experts. But I think that this can be solved by having two
>>sets of document---one is with examples, the other is without
>>examples.
>
>Go ahead and write the fivehundredandnintysecond tutorial about Perl.
>You will find that either you are repeating yourself all over the place
>or you start using references all over the place because Perl's concept
>are so intertwined and cannot be explained stand-alone.
>
You my friend should not be involved in Perl. You have cactus up your
ass, every move you make when reading posts on this group drive's that
POINT home brother!!!
>Not to mention the ensuing maintenance nightmare when you have to fix
>something all over the place.
>
>jue
Go fly a friggin kite friend..
sln
Very elegant and fast, but you don't need all the modifiers /sog to the
regex, just /g will do.
Actually, he needs 'so' as well. If this is used in a multi-lined
way, the '.' needs the 's' modifier if it is possible a '\n' is encounterred
and could be part of the pattern.
The 'o' suggests to not recompile, which might happen, don't know, anything is
possible. By including both, no harm, ho foul.
sln
I didn't understand the above examples because it refers to $-[0],
which was exactly what I wanted to understand.
> >I agree that adding more examples would make the manual to long to be
> >read by experts. But I think that this can be solved by having two
> >sets of document---one is with examples, the other is without
> >examples.
>
> Go ahead and write the fivehundredandnintysecond tutorial about Perl.
> You will find that either you are repeating yourself all over the place
> or you start using references all over the place because Perl's concept
> are so intertwined and cannot be explained stand-alone.
>
> Not to mention the ensuing maintenance nightmare when you have to fix
> something all over the place.
I agree your points. But I think that the perldoc shall have as less
cyclic references as possible. If the example that you mentioned in
the perldoc is not cyclic referred, it would easier to read.
Thanks,
Peng
> On Nov 20, 9:06 pm, Jürgen Exner <jurge...@hotmail.com> wrote:
>>
>> But there are examples in that section:
>> <quote>
>> After a match against some variable $var:
>>
>> "$`" is the same as "substr($var, 0, $-[0])"
>> "$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
>> "$'" is the same as "substr($var, $+[0])"
>> "$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
>> "$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
>> "$3" is the same as "substr $var, $-[3], $+[3] - $-[3])"
>> </quote>
>
> I didn't understand the above examples because it refers to $-[0],
> which was exactly what I wanted to understand.
Arrays are referred to with @, but array *elements* with $. So, given
the array @-, its first element is $-[0].
sherm--
--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net
Harm will be done using the 'o'nce option, when the code is
transferred to a subroutine and the regexp will be called with $b's of
different length, because the regexp will be compiled with the first
$lenb encountered and never change.
Better put the regexp into a variable and use this in the while ():
my $re = qr/(?=(.{$lenb}))/s;
while ( $a =~ /$re/g ) {
HTH
> On Nov 20, 9:06 pm, Jürgen Exner <jurge...@hotmail.com> wrote:
<snip>
>> But there are examples in that section:
>> <quote>
>> After a match against some variable $var:
>>
>> "$`" is the same as "substr($var, 0, $-[0])"
>> "$&" is the same as "substr($var, $-[0], $+[0] -
> $-[0])"
>> "$'" is the same as "substr($var, $+[0])"
>> "$1" is the same as "substr($var, $-[1], $+[1] -
> $-[1])"
>> "$2" is the same as "substr($var, $-[2], $+[2] -
> $-[2])"
>> "$3" is the same as "substr $var, $-[3], $+[3] -
> $-[3])"
>> </quote>
>
> I didn't understand the above examples because it refers to $-[0],
> which was exactly what I wanted to understand.
Are you joking? How can one explain what @- is without explaining the
meaning of its elements?
You do know that $x[1] refers to the second element of @x, right?
> I agree your points. But I think that the perldoc shall have as less
> cyclic references as possible. If the example that you mentioned in
> the perldoc is not cyclic referred, it would easier to read.
I don't see such cyclic references in the documentation for @-.
Did you quote from a newer version than 5.10.0? It's slightly different
in my docs (only checked 5.8.8 and 5.10.0).
>> >Thank you the example. I see what it means. In general, I think the
>> >document can be made easier to read with more examples.
>>
>> But there are examples in that section:
>> <quote>
>> After a match against some variable $var:
>>
>> "$`" is the same as "substr($var, 0, $-[0])"
>> "$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
>> "$'" is the same as "substr($var, $+[0])"
>> "$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
>> "$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
>> "$3" is the same as "substr $var, $-[3], $+[3] - $-[3])"
>> </quote>
>
> I didn't understand the above examples because it refers to $-[0],
> which was exactly what I wanted to understand.
If it didn't contain $-[0], it wouldn't be an example for the use of
$-[0], would it?
> I agree your points. But I think that the perldoc shall have as less
> cyclic references as possible. If the example that you mentioned in
> the perldoc is not cyclic referred, it would easier to read.
I don't see any cyclic references in the section about @-. $-[0] is
explained in the very first sentence:
@- $-[0] is the offset of the start of the last successful match.
To understand the examples, it helps to know what $`, $&, etc. are.
These are explained elsewhere in the same document, and none of them
is explained by referring to @-. No cycles there.
hp
>>> >On Nov 20, 3:46 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid>
>>> >wrote:
>>> >> I looked at the perldoc perlvar page to remind myself what it
>>> >> said. I am unable to figure out how one could make it clearer:
>
> Did you quote from a newer version than 5.10.0? It's slightly
> different in my docs (only checked 5.8.8 and 5.10.0).
C:\Documents and Settings\asu1> perl -v
This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)
Copyright 1987-2007, Larry Wall
Binary build 1004 [287188] provided by ActiveState
http://www.ActiveState.com
Built Sep 3 2008 13:16:37
I never used the /o modifier. Tested it a long time ago and found it
squirley. I only use qr//.
Yes, you are right, the 'o' option lasts the life of the script.
This defies all logic, where is this regexp stored. Is it related to
variables, even lexical ones? This has to be a runtime association on
the first $pattern 'named' variable, even to nested scope lexicals?
I guess its possible.
Over the "life of the script" is all thats mentioned. I think this has
strictly limited use, basically a one time usage.
Docs:
"If $pattern won't be changing over the lifetime of the script,
we can add the //o modifier, which directs perl to only perform
variable substitutions once"
Thanks!
sln
----------------
my $pattern = 'this';
print getmatch_o('this is a test',$pattern)."\n";
$pattern = 'is';
print getmatch_o('this is a test',$pattern)."\n";
sub getmatch_o
{
my ($target,$pattern) = @_;
print $pattern."\n";
my $ttt = $pattern;
my $ggg = $target;
return $1 if ($ggg =~ /($ttt)/o);
return '';
}
__END__
this
this
is
this
>On Thu, 20 Nov 2008 10:06:33 -0800, "John W. Krahn" <som...@example.com> wrote:
>
[snip]
>>while ( $a =~ /(?=(.{$lenb}))/sog ) {
[snip]
>
>I noticed the capture group is in the assertion and it only advances the
>regexp ordinal position by one.
>
>Example:
>
>/(.{$lenb})/ advances 6
>/(?=(.{$lenb}))/ advances 1
>
>I looked in perlre,perlretut and there is only one mention of capture within
>a "non-capture zero-width assertion" extended expression elements.
>That was just a side from some example on (?>..), and that just mentioned,
>the ordinal position is adjusted with a (?=(regex)) analogy.
>
>Do you know why this is?
>
>
>sln
Does anybody know?
sln