Max. 256 chars in regexp!

Alexandru

Feb 14, 2019, 10:47:11 AM
I just hit the wall of max. 256 chars allowed by regexp!

regexp {^(.){0,255}$} "dfsdfsdf"

works!

regexp {^(.){0,256}$} "dfsdfsdf"

fails with:

couldn't compile regular expression pattern: invalid repetition count(s)

Wow! How is this possible? Next bug waiting to happen in my code all over the place.

Alexandru

Feb 14, 2019, 10:49:17 AM
Don't write "you can use string length". I know that. The above is just a simple example. My regular expression is a little more complicated.

Harald Oehlmann

Feb 14, 2019, 11:22:12 AM
Alexandru,
thank you for the message. This is the documentation on the re_syntax
manual page:

"The forms using { and } are known as bounds. The numbers m and n are
unsigned decimal integers with permissible values from 0 to 255 inclusive."

The reason is that this form is quite expensive to evaluate.
If a length check is needed, it is better to use something else.
Because of that, for input verification I describe each input with 2
values: a regexp and a maximum length.

proc check {s re maxlen} {
    if {[string length $s] > $maxlen} {return 0}
    return [regexp -- $re $s]
}
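
A minimal usage sketch of this idea, restated with `string length` so that it runs on its own. The pattern `^[a-z]+$` and the limit of 1000 are illustrative assumptions, not values from the thread:

```tcl
# Length check first, regexp second: the limit never appears in the pattern,
# so the 255 cap on bounds does not apply to it.
proc check {s re maxlen} {
    # Reject over-long input before the regexp engine sees it.
    if {[string length $s] > $maxlen} {return 0}
    return [regexp -- $re $s]
}

puts [check "dfsdfsdf" {^[a-z]+$} 1000]              ;# prints 1
puts [check [string repeat a 1001] {^[a-z]+$} 1000]  ;# prints 0
```

The maximum length can be any integer here, because it is compared by `string length` rather than compiled into the regular expression.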

Hope this helps,
Harald

Alexandru

Feb 14, 2019, 11:26:28 AM
Hi Harald,

That's a good workaround. Thanks.
So if regexp allowed longer strings, would that also affect the performance on shorter strings?

Harald Oehlmann

Feb 14, 2019, 11:51:09 AM
On 14.02.2019 at 17:26, Alexandru wrote:
> Hi Harald,
>
> That's a good workaround. Thanks.
> So if regexp allowed longer strings, would that also affect the performance on shorter strings?
>

What I understood from Donal's explanation is that x{n,m} is
internally replaced by "x" (n times) followed by "x?" (m-n times).

So, \d{3,5} is transformed to \d\d\d\d?\d?

The performance gets quite bad for higher repetition counts.
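
The expansion described above can be sketched in Tcl itself. This only illustrates the description given in this post; it is not the regexp engine's actual code, and `expandBound` is a hypothetical helper name:

```tcl
# Hypothetical helper: rewrite atom{n,m} as n mandatory copies of the atom
# followed by (m - n) optional copies, as described in the post above.
proc expandBound {atom n m} {
    if {$n < 0 || $m < $n} {
        error "expected 0 <= n <= m, got n=$n m=$m"
    }
    return [string repeat $atom $n][string repeat ${atom}? [expr {$m - $n}]]
}

puts [expandBound {\d} 3 5]   ;# prints \d\d\d\d?\d?
```

The length of the expanded pattern grows linearly with m, which gives a rough intuition for why large counts become expensive.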

Little knowledge here,
Harald

Alexandru

Feb 14, 2019, 11:54:03 AM
But this means that changing the core implementation to allow higher counts would not cause performance issues for existing user code. The performance would remain the same. Why not let performance be the user's issue/decision?

Rich

Feb 14, 2019, 12:57:13 PM
Alexandru <alexandr...@meshparts.de> wrote:
> On Thursday, February 14, 2019 at 16:47:11 UTC+1, Alexandru wrote:
>> I just hit the wall of max. 256 chars allowed by regexp!
>>
>> regexp {^(.){0,255}$} "dfsdfsdf"
>>
>> works!
>>
>> regexp {^(.){0,256}$} "dfsdfsdf"
>>
>> fails with:
>>
>> couldn't compile regular expression pattern: invalid repetition
>> count(s)
>>
>> Wow! How is this possible?

We live in a finite world. There is always an 'upper limit',
everywhere. Yes, some of the upper limits are large enough that one
can't reasonably encounter them without concerted effort to do so, but
there is always /some/ limit somewhere.

>> Next bug waiting to happen in my code all over the place.

Only if you are generating dynamic repetition counts inside dynamic
regular expression patterns. Static patterns (e.g., {15}) will never
encounter the limit unexpectedly, because their counts are fixed when
you write them.

> Don't write "you can use string length". I know that. The above is
> just a simple example. My regular expression is a little more complicated.

You can chain multiple repetition counts:

% regexp {^(.{0,255}.{0,255})$} "dfsdfsdf"
1

But in general using string length to measure lengths, or verify if a
string is within some size bounds, is always going to be faster than
using a regular expression for the same effect. And the longer the
length being measured, the greater the performance difference.
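
As a sketch of the chaining idea for dynamically generated counts: the hypothetical helper `chainBound` below splits one large bound into pieces that each stay within the documented limit of 255. It assumes `atom` is a single regexp atom (a character, a bracket expression, or a parenthesized group), and it does nothing about the performance cost discussed in this thread:

```tcl
# Hypothetical helper: express atom{min,max} as a chain of bounds whose
# counts never exceed $limit (255 per the re_syntax manual page).
proc chainBound {atom min max {limit 255}} {
    if {$min < 0 || $max < $min} {
        error "expected 0 <= min <= max, got min=$min max=$max"
    }
    set pattern ""
    set extra [expr {$max - $min}]
    # Mandatory part: exactly $min repetitions, in chunks of at most $limit.
    while {$min > 0} {
        set chunk [expr {min($min, $limit)}]
        append pattern "${atom}{$chunk}"
        incr min -$chunk
    }
    # Optional part: up to (max - min) further repetitions, chunked the same way.
    while {$extra > 0} {
        set chunk [expr {min($extra, $limit)}]
        append pattern "${atom}{0,$chunk}"
        incr extra -$chunk
    }
    return $pattern
}

set re "^[chainBound . 0 510]\$"
puts $re                          ;# prints ^.{0,255}.{0,255}$
puts [regexp -- $re "dfsdfsdf"]   ;# prints 1
```

For very large counts a plain `string length` check remains the better choice, as noted above.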

Robert Heller

Feb 14, 2019, 2:56:37 PM
Beyond a certain point, it is *always* going to be better to do something
like:

if {[regexp {(\d+)} $string => matched] > 0 &&
    [string length $matched] >= 20 &&
    [string length $matched] <= 2048} {
    # whatever
}

than

if {[regexp {(\d{20,2048})} $string => matched]} {
    # whatever
}

Actually, taking the length check out [of the regexp], except for trivial
cases, is probably a "best practice" in general. I expect that the performance
issue kicks in quite early, so a range bound above a fairly low number
(probably < 20) is not recommended anyway. An upper bound of 255 is
probably considered a very generous limit.

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

Rolf Ade

Feb 14, 2019, 3:36:26 PM
Alexandru <alexandr...@meshparts.de> writes:
> On Thursday, February 14, 2019 at 17:22:12 UTC+1, Harald Oehlmann wrote:
>> On 14.02.2019 at 16:47, Alexandru wrote:
>> > I just hit the wall of max. 256 chars allowed by regexp!
>> >
>> > regexp {^(.){0,255}$} "dfsdfsdf"
>> >
>> > works!
>> >
>> > regexp {^(.){0,256}$} "dfsdfsdf"
>> >
>> > fails with:
>> >
>> > couldn't compile regular expression pattern: invalid repetition count(s)
>> >
>> > Wow! How is this possible? Next bug waiting to happen in my code all over the place.
>> >
>> [...]
>> "The forms using { and } are known as bounds. The numbers m and n are
>> unsigned decimal integers with permissible values from 0 to 255 inclusive."
>>
>> The reason is that this form is quite expensive to evaluate.
>> If a length check is needed, it is better to use something else.
>> Because of that, for input verification I describe each input with 2
>> values: a regexp and a maximum length.
>>
>> proc check {s re maxlen} {
>>     if {[string length $s] > $maxlen} {return 0}
>>     return [regexp -- $re $s]
>> }
>
> That's a good workaround. Thanks.
> So if regexp allowed longer strings, would that also affect
> the performance on shorter strings?

Since nobody has done so yet, it seems the thankless task of questioning
your use of regexp is left to me.

We know too little about what exactly you do, but since you fear bugs all
over the place, it at least seems you do it a lot. Regular expressions are a
mighty tool (and often faster - sometimes a lot - than parsing a string
by other means).

On the other hand - you've surely already read or heard this at times -
using regular expressions carries some risks. The most obvious one, as
your regexps get more complicated, is: are you always sure you
understand what your regexp matches?

In the case at hand you bumped against a regexp limit. If you are the
only one who reaches a certain limit, that may (though not necessarily)
be a sign that you are overdoing something.

The point here is that - despite the usual speed of regexps - some
regexps are ridiculously _slow_. One example is this {n,m} construct.

Since you seem to care about execution speed, please measure. If you need
{0,256} or even more than 256, you can very probably do this another
way - e.g. Harald's proposal - much faster.
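
One way to do that measurement is Tcl's built-in `time` command. The sketch below is only a starting point: the test string, the simple patterns, and the iteration count of 1000 are assumptions chosen for illustration, and the numbers will vary by machine and Tcl version.

```tcl
# Illustrative input: 200 characters, within the 255 limit.
set s [string repeat x 200]

# Variant 1: the length limit lives inside the pattern.
set viaRegexp [time {regexp {^.{0,255}$} $s} 1000]

# Variant 2: check the length first, then run a pattern without a bound.
set viaLength [time {expr {[string length $s] <= 255 && [regexp {^.*$} $s]}} 1000]

puts "bound in regexp: $viaRegexp"
puts "string length:   $viaLength"
```

Both variants accept the same input here; only the reported time per iteration should differ.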


briang

Feb 15, 2019, 8:40:14 PM
A year or two ago I took a body of our code that used regexp to tease data out of the output of another tool, and rewrote it. The code was hard to read, hard to debug, and hard to maintain. I replaced regexp with a parser written using PEG from tcllib (https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/pt/pt_pegrammar.html -- thank you, Andreas). This was easy to work with in Tcl. Once I had everything working again, I used the Critcl output to create a C implementation of the parser. Now the code is much faster and easier to understand. The grammar description is much more readable than regular expressions.

There are still many places where regexp is the right tool, it just wasn't for this instance.

When there are too many bumps in the road, it's time to find a different highway.

-Brian

Alexandru

Feb 16, 2019, 4:56:39 AM
Thanks for the tip. I will probably change my code. But to me the whole limitation on regexp looks unnecessary. It's like limiting the number of iterations in a "for" loop.

Andreas Leitgeb

Feb 22, 2019, 9:56:32 AM
Alexandru <alexandr...@meshparts.de> wrote:
> But to me the whole limitation on regexp looks unnecessary.
> It's like limiting the number of iterations in the "for" loop.

In an ideal world, more than a handful of people worldwide would be
able to write a highly efficient regexp implementation.

Those who produced the currently existing implementation apparently
considered it an acceptable compromise to inline repetition ranges,
as if there were a "loop" structure in Tcl as follows:
    proc repeat {body cnt} { uplevel 1 [string repeat $body\n $cnt] }

In some ideal world, an RE containing (.){65536,1048576} would
perhaps be handled by a loop that collects an acceptable number of
consecutive matches (and would even do so without any performance hit).