[cl-ppcre-devel] report a bug

Xiangjun Wu

Jun 24, 2009, 10:31:45 PM
to cl-ppcr...@common-lisp.net
While trying to fix a Montezuma issue, http://code.google.com/p/montezuma/issues/detail?id=3, I think I found a bug in cl-ppcre.

CL-USER> (cl-ppcre:scan
          (cl-ppcre:create-scanner "(\\w+)*\\@\\w+")
          "______________________________________"
          :start 0)
;; Evaluation aborted.

It hangs once the number of underscores reaches a critical value.
I suspect the cause is that '\w' also matches the underscore in regular expressions.
Replacing the underscore with any other '\w' character triggers the same problem:

CL-USER> (cl-ppcre:scan
          (cl-ppcre:create-scanner "(a\\w+)*\\@\\w+")
          "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
          :start 0)
;; Evaluation aborted.

But if I remove the trailing \w+, it works:
CL-USER> (cl-ppcre:scan
          (cl-ppcre:create-scanner "(_\\w+)*\\@")
          "_______________________________________"
          :start 0)
NIL

I also checked it in Perl. Perhaps Perl handles this kind of regular expression more efficiently:
even after I increased the number of underscores, it finished without hanging.

$str = "john._________________________________________________________________________";

if ($str =~ m/(_*\w+)*\@\w+/)
{
   print "ok\n";
}

Please take a look and let me know what you think.

A wisp of cloud, as distant as the sky; a long night, as lonely as the moon.

Edi Weitz

Jun 25, 2009, 3:21:53 PM
to General interest list about cl-ppcre and cl-unicode
Hi,

On Thu, Jun 25, 2009 at 4:31 AM, Xiangjun Wu<neta...@gmail.com> wrote:

>                  "(\\w+)*\\@\\w+"

That's the type of regular expression that typically leads to a
combinatorial explosion in regex engines unless they use specific
"tricks" to deal with this. Recent versions of Perl are pretty clever
in this regard (they look for "floating" substrings) while CL-PPCRE
isn't, but - frankly - I don't really see the point of this. I think
this is mainly so that the regex engine looks good in benchmarks. I
definitely wouldn't call this a bug.

The question is - what do you want to achieve with this regular
expression? Can't you write it in a simpler way?

Cheers,
Edi.

_______________________________________________
cl-ppcre-devel site list
cl-ppcr...@common-lisp.net
http://common-lisp.net/mailman/listinfo/cl-ppcre-devel
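
The combinatorial explosion mentioned above can be made concrete with a little counting. The sketch below is plain Common Lisp and does not touch CL-PPCRE at all; `count-splits` is a hypothetical helper name introduced only for this illustration. It counts how many ways a run of N word characters can be cut into consecutive pieces, which is roughly the number of alternatives a simple backtracking matcher may have to try for the starred group before it can conclude that no "@" follows. It assumes a naive backtracking strategy with no special optimizations.

```lisp
;; Count the ways to cut a run of N identical word characters into
;; consecutive pieces that are each at least MIN-PIECE characters long.
;; MIN-PIECE is 1 for (\w+)* and 2 for (_\w+)* on an all-underscore string.
(defun count-splits (n min-piece)
  (let ((ways (make-array (1+ n) :initial-element 0)))
    ;; The empty run can be split in exactly one way (no pieces).
    (setf (aref ways 0) 1)
    (loop for len from 1 to n
          do (loop for piece from min-piece to len
                   ;; Choose the length of the last piece, then count
                   ;; the splits of whatever precedes it.
                   do (incf (aref ways len) (aref ways (- len piece)))))
    (aref ways n)))

(count-splits 10 1) ; => 512
(count-splits 10 2) ; => 34
(count-splits 40 1) ; => 549755813888
(count-splits 40 2) ; => 63245986
```

With pieces of length one or more the count doubles with every extra character, and with pieces of length two or more it grows like the Fibonacci numbers, so in both cases a few dozen characters are enough to make an exhaustive search impractical.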

Leslie P. Polzer

Jun 26, 2009, 9:10:14 AM
to cl-ppcr...@common-lisp.net

On Jun 25, 9:21 pm, Edi Weitz <e...@agharta.de> wrote:

> The question is - what do you want to achieve with this regular
> expression? Can't you write it in a simpler way?

Isn't this pattern pretty useful in general:

A@B

where A and B are word characters and @ is a specific non-word
character?

How else could we specify it?

[a-zA-Z0-9] doesn't seem acceptable to me since it relies on
the Latin alphabet...

Leslie

--
http://www.linkedin.com/in/polzer

Hans Hübner

Jun 26, 2009, 10:09:44 AM
to leslie...@gmx.net, General interest list about cl-ppcre and cl-unicode
On Fri, Jun 26, 2009 at 15:10, Leslie P. Polzer<s...@viridian-project.de> wrote:

> On Jun 25, 9:21 pm, Edi Weitz <e...@agharta.de> wrote:
>
>> The question is - what do you want to achieve with this regular
>> expression?  Can't you write it in a simpler way?
>
> Isn't this pattern pretty useful in general:
>
> A@B
>
> where A and B are word characters and @ is a specific non-word
> character?

Sure, but the original bug report was about this:

(\\w+)*\\@\\w+

I can't make any sense of this regular expression, but maybe that is
because I am lacking some skills. Maybe Wu can explain what he wants
to achieve with it?

-Hans

Xiangjun Wu

Jun 26, 2009, 10:46:38 AM
to General interest list about cl-ppcre and cl-unicode
Very sorry, it was a typo. :(

It should be:

(cl-ppcre:scan
 (cl-ppcre:create-scanner "(_\\w+)*\\@\\w+")
 "______________________________________"
 :start 0)

The other examples do show what I meant, though.


--
A wisp of cloud, as distant as the sky; a long night, as lonely as the moon.

Leslie P. Polzer

Jun 26, 2009, 11:01:39 AM
to General interest list about cl-ppcre and cl-unicode

Xiangjun Wu wrote:
> Very sorry, it is a typo, :(
>
> It should be:
>
> (cl-ppcre:scan
> (cl-ppcre:create-scanner
> "(_\\w+)*\\@\\w+") "______________________________________"
> :start 0)
>
> but other examples indicate the accurate idea.

Looking at this I'm not sure what this is good for.

Why would we want to match strings of the form _xxx@xxx
in a full-text indexer?

Perhaps it would be best to get rid of the whole messy
regex (of which this is only a small part) and write
a new documented one from scratch. Or use a custom
state-based tokenizer.

Chris Dean

Jun 26, 2009, 2:17:52 PM
to General interest list about cl-ppcre and cl-unicode

Xiangjun Wu <neta...@gmail.com> writes:
> (cl-ppcre:scan
> (cl-ppcre:create-scanner
> "(_\\w+)*\\@\\w+") "______________________________________"
> :start 0)
>


Perhaps

(cl-ppcre:create-scanner "(_[_\\w]+)?@\\w+")

will work for your app? The problem in the original expression is that
the "+" nested inside the "*" can lead to a combinatorial explosion.

If you loosen the requirement that all non-zero matches in the first
expression must begin with an "_" you could have:

(cl-ppcre:create-scanner "[_\\w]*@\\w+")

Cheers,
Chris Dean
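
For anyone who wants to try the two suggested scanners at the REPL, here is a minimal sketch. It assumes Quicklisp is available for loading CL-PPCRE; the variable names and the test strings are made up for the illustration. Since '\w' already matches the underscore, as noted at the start of the thread, "[_\\w]" should behave the same as a plain "\\w".

```lisp
;; Assumes Quicklisp is installed; any other way of loading CL-PPCRE works too.
(ql:quickload :cl-ppcre)

;; Hypothetical test input: a run of underscores with no @ in it.
(defparameter *underscores* (make-string 40 :initial-element #\_))

;; The two scanners suggested above.
(defparameter *optional-prefix* (cl-ppcre:create-scanner "(_[_\\w]+)?@\\w+"))
(defparameter *loose-prefix*    (cl-ppcre:create-scanner "[_\\w]*@\\w+"))

;; There is no @ in the input, so both calls should return NIL, and they
;; should do so quickly because no quantifier is nested inside another.
(cl-ppcre:scan *optional-prefix* *underscores*)
(cl-ppcre:scan *loose-prefix* *underscores*)

;; With an @ present, both scanners should match the whole string.
(cl-ppcre:scan *optional-prefix* "_foo@bar")
(cl-ppcre:scan *loose-prefix* "_foo@bar")
```

Neither pattern has a quantified group inside another quantifier, so on a failing input each starting position costs at most a linear amount of backtracking.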

Xiangjun Wu

Jul 23, 2009, 11:18:53 PM
to General interest list about cl-ppcre and cl-unicode

Thank you, it works for our application.

A wisp of cloud, as distant as the sky; a long night, as lonely as the moon.