[perl #27515] Simple regular expression is wrongly optimised

Jamie Lokier

unread,

Mar 8, 2004, 12:11:14 PM3/8/04

to bugs-bi...@netlabs.develooper.com

# New Ticket Created by Jamie Lokier
# Please include the string: [perl #27515]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=27515 >

This is a bug report for perl from ja...@shareable.org,
generated with the help of perlbug 1.34 running under perl v5.8.0.

-----------------------------------------------------------------
[Please enter your report here]

Try this:

$_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'

It should print "match :1:" but it actually prints "no match".
Now this:

$_ = "1x"; /(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'

It should print "match :1:" but it actually prints "match ::".

What's going wrong? You can see from the following "use re 'debug'"
output: The "floating substr" optimisation is broken:

Guessing start of match, REx `^(.*)(?=x)x' against `1x'...
Found floating substr `x' at offset 1...
By STCLASS: moving 0 --> 1
Guessed: match at offset 1
Matching REx `^(.*)(?=x)x' against `x'
Setting an EVAL scope, savestack=3
1 <1> <x> | 1: BOL
failed...
Match failed
no match

It is equally broken with regexes of the form /^(.*(?=x))x/,
/^(.*)(?=[x])x/ and /^(.*)(?=[a-z])x/. /^(.*)(?=[^\n])x/,
/^(.*)(?=[1x])x/ and /^(.*)(?=x)./ are fine.

Have a nice day,
Thanks,
-- Jamie

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=high
---
Site configuration information for perl v5.8.0:

Configured by bhcompile'
cf_email='bhcompile at Wed Aug 13 11:45:59 EDT 2003.

Summary of my rderl (revision 5.0 version 8 subversion 0) configuration:
Platform:
osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
uname='linux str'
config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef'
useithreads=define usemultiplicity=
useperlio= d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=un uselongdouble=
usemymalloc=, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
optimize='',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers=''
gccversion='3.2.2 200302'
intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long'
k', ivsize=4'
ivtype='l, nvtype='double'
o_nonbl', nvsize=, Off_t='', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='gcc'
l', ldflags =' -L/u'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
perllibs=
libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper
gnulibc_version='2.3.2'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
cccdlflags='-fPIC'
ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s Unicode/Normalize XS/A'

Locally applied patches:
MAINT18379

---
@INC for perl v5.8.0:
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
/usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.0
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.0
/usr/lib/perl5/vendor_perl
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
.

---
Environment for perl v5.8.0:
HOME=/home/jamie
LANG=en_GB.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jamie/bin
PERL_BADLANG (unset)
SHELL=/bin/bash
dlflags='-share (unset)

h...@crypt.org

unread,

Mar 9, 2004, 8:15:14 AM3/9/04

to perl5-...@perl.org

Jamie Lokier (via RT) <perlbug-...@perl.org> wrote:
:Try this:

:
: $_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'
:
:It should print "match :1:" but it actually prints "no match".
:Now this:
:
: $_ = "1x"; /(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'
:
:It should print "match :1:" but it actually prints "match ::".

Thanks for the report, I can confirm this bug still exists with
the latest development version of perl. It was actually reported
before, and there are example tests (marked as 'expected to fail')
in the test suite for exactly this type of case.

The original discussion is at:
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-01/msg00650.html

:What's going wrong? You can see from the following "use re 'debug'"

:output: The "floating substr" optimisation is broken:

It isn't the floating substr but the stclass optimisation that's broken;
quoting from another part of the -Dr output:
floating `x' at 0..2147483647 (checking floating) stclass `ANYOF[x]' minlen 1

That's saying:
- we must find a string "x" somewhere, at any position
- the matched text must start with something matching character class [x]
- the minimum length we will match is 1

Now, the stclass logic for /.*x/ should be:
must start with ( [^\n] or [x] )
.. and for /.*(?=x)x/:
must start with ( [^\n] or ([x] and [x]) )

I believe what happens is that regcomp.c:S_study_chunk() uses flags
SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do,
but because it is working forward through the compiled regexp, it
goes something like:
(start) [any] AND
.* [^\n] OR
(?=x) ([^\n] or [x]) AND
x (([^\n] or [x]) and [x])
.. but study_chunk() is quite, er, challenging to read through, and
I think I got stuck last time I tried to track this down.

I'll give it another go. :)

Hugo

h...@crypt.org

unread,

Mar 9, 2004, 9:13:15 AM3/9/04

to perl5-...@perl.org

h...@crypt.org wrote:
:I believe what happens is that regcomp.c:S_study_chunk() uses flags

:SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do,
:but because it is working forward through the compiled regexp, it
:goes something like:
: (start) [any] AND
: .* [^\n] OR
: (?=x) ([^\n] or [x]) AND
: x (([^\n] or [x]) and [x])
:.. but study_chunk() is quite, er, challenging to read through, and
:I think I got stuck last time I tried to track this down.

So, if we have an expression like:
a?(?=b)c?(?=d)e
(where a..e represent possibly overlapping character classes), we need
to be able to construct a start class like:
a | (b & (c | (d & e)))

I think we can achieve this by passing two classes around, both an 'or'
class and an 'and' class. Initialise to [none] and [any] respectively,
and then:
- for each subpattern, combine its two classes with OR
- if the subpattern class should be ANDed, AND it with our 'and' class
- if the subpattern should be ORed, OR (it AND our 'and' class) with
our 'or' class

For the above example that would give:
where style or-class and-class
------------------------------------------
(start) and none any
a? or a any
(?=b) and a b
c? or a | (b & c) b
(?=d) and a | (b & c) b & d
e and a | (b & c) b & d & e
.. giving a final class of:
(a | (b & c)) | (b & d & e)
.. which I think is correctly equivalent to the original:
a | (b & (c | (d & e)))

If anyone can follow this, and can either see flaws or confirm its
general correctness that would be very useful.

Hugo

Nick Ing-Simmons

unread,

Mar 9, 2004, 12:05:20 PM3/9/04

to h...@crypt.org, perl5-...@perl.org

Not at 2nd reading, but it smells of De Morgan

A&B = ~(~A|~B)