REEscape()


Peter J. Farrell

Feb 2, 2011, 4:27:54 PM
to CFML Conventional Wisdom
FYI, I proposed a new function to the OpenBD folks on their tracker a
while back to help in the RegEx area of the language. The function
REEscape(string) takes a string and escapes any RegEx control characters
(like * . - braces, etc.) so the string is safe for use in a RegEx
pattern as a string literal. This function was recently added to the
OpenBD 1.5 BER. I proposed it because similar functions are available
in Ruby, PHP, .NET and Python. We're definitely missing out on this.

The current version in OpenBD escapes the following characters:

$, {, }, (, ), <, >, [, ], ^, ., *, +, ?, #, :, &, and \

If there are other control characters that are missing, please say
something -- but this list is similar or identical to the characters
that the equivalent functions in the other languages I mentioned escape.
I thought I would mention this on the list in case Railo or ACF would
like to implement a similar function.
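
A minimal sketch of the backslash-prefixing behaviour described above, written in Java on the assumption that this is roughly how an engine-side implementation would look. The class and method names are hypothetical, and the character set is copied from the list in this post, not from OpenBD's source.

```java
public class ReEscapeSketch {
    // Characters taken from the list in this post (assumption: not verified against OpenBD).
    private static final String SPECIAL = "${}()<>[]^.*+?#:&\\";

    /** Prefixes each listed character with a backslash; all other characters pass through. */
    public static String reEscape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the escaped form of a string containing several listed characters.
        System.out.println(reEscape("cost: $5.00 (approx)"));
    }
}
```

With this character set, the call in main prints cost\: \$5\.00 \(approx\).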

Best,
.Peter

Mark Jones

Feb 2, 2011, 4:35:53 PM
to cfml-convent...@googlegroups.com
Just thought I should mention that such a function would have saved me a bunch of time over the years. Write a regex for a URL, think you've got everything escaped, test... errors... find more, escape again... and in the end it's an ugly, unreadable, nearly impossible to update mess of a string.


Peter J. Farrell

Feb 2, 2011, 4:49:17 PM
to cfml-convent...@googlegroups.com
Mark Jones said the following on 02/02/2011 03:35 PM:

> Just thought I should mention that such a function would have saved me
> a bunch of time over the years. Write a regex for a URL, think you've
> got everything escaped, test... errors... find more, escape again...
> and in the end it's an ugly, unreadable, nearly impossible to update
> mess of a string.
>
Good to know that I'm not the only one that benefits from this. I took
the liberty of filing this with both ACF and Railo. We'll see what
happens there.

Railo:
https://issues.jboss.org/browse/RAILO-1168

ACF:
http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html?#bugId=86138

.pjf

Peter Boughton

Feb 2, 2011, 5:23:21 PM
to cfml-convent...@googlegroups.com
This is certainly a function that is missing from CFML's regex
functionality, but this bit has me a tad worried...

> The current version in OpenBD escapes the following characters:
>
> $, {, }, (, ), <, >, [, ], ^, ., *, +, ?, #, :, &, and \

There is no situation where < or > or : needs escaping.

If the function simply prefixes characters with a backslash, that's
the wrong way to go about it.

Some people might say "so what, it'll still work", but with regex a
backslash both removes special meaning (escapes) and *gives* special
meaning - and there is always the possibility that a future version of
the regex engine might decide to grant special meaning to something
that is incorrectly escaped.

This is particularly so in the case of < and >: \< means "start of
word" in some other regex implementations (a one-way version of \b;
similarly, \> is "end of word"), so if the regex engine gets upgraded
to include support for that, then the escaping function is suddenly
broken.

So yeah, the correct way to do this is to use the regex engine's
built-in escaping functionality, which knows what is special and what
isn't and should escape only the things that actually need it.

Assuming OpenBD is using the Apache ORO engine, like ACF does, then
there appears to be a quotemeta function that does this:
http://jakarta.apache.org/oro/api/org/apache/oro/text/regex/Perl5Compiler.html#quotemeta(java.lang.String)

Hopefully it's simple enough for the engines to alias that to
REEscape, and the function can then appear in the next release of each
engine.

Peter J. Farrell

Feb 2, 2011, 5:44:39 PM
to cfml-convent...@googlegroups.com
Peter Boughton said the following on 02/02/2011 04:23 PM:

> This is certainly a function that is missing from CFML's regex
> functionality, but this bit has me a tad worried...
>
>> The current version in OpenBD escapes the following characters:
>>
>> $, {, }, (, ), <, >, [, ], ^, ., *, +, ?, #, :, &, and \
> There is no situation where < or > or : needs escaping.
>
We could use \Q and \E to indicate a string literal, but in Java this
feature does not work correctly in JDK 1.4 and 1.5 when used in a
character class or followed by a quantifier with the java.util.regex
classes. The safest way to escape control characters in all modern RegEx
implementations is still with backslashes. Ultimately, I'd like REEscape
to be compatible no matter what the underlying RegEx engine is --
backslashes are the lowest common denominator.
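
For comparison, a small sketch of the two strategies using java.util.regex on a current JDK. It says nothing about Apache ORO or about the JDK 1.4/1.5 behaviour mentioned above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class QuoteSketch {
    /** Wraps the text in \Q...\E via the JDK's own helper. */
    public static String quote(String literal) {
        return Pattern.quote(literal);
    }

    /** True when the whole text matches the given regex. */
    public static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String literal = "1+1=2?";
        String backslashed = "1\\+1=2\\?"; // hand-escaped with backslashes only

        System.out.println(quote(literal));                      // the \Q...\E form
        System.out.println(matchesWhole(quote(literal), literal)); // literal match via \Q...\E
        System.out.println(matchesWhole(backslashed, literal));    // literal match via backslashes
        // Unescaped, + and ? act as quantifiers, so the text does not match itself.
        System.out.println(matchesWhole(literal, literal));
    }
}
```

Both escaped forms match the literal text, while the unescaped pattern does not.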

You're probably right about the < and > -- however I included them in
the ticket because Ruby and PHP escape those. They could be removed.
The ":" is used in POSIX character class names. Some escaping utilities
in Java I've seen go as far as to escape all characters except a-z, A-Z
and 0-9, which is still valid RegEx when you are dealing with string
literals.

This is a great discussion.

.Peter

Peter J. Farrell

Feb 2, 2011, 5:57:39 PM
to cfml-convent...@googlegroups.com
Peter J. Farrell said the following on 02/02/2011 04:44 PM:

> You're probably right about the < and > -- however I included them in
> the ticket because Ruby and PHP escape those.
Did a bit more digging on "<" and it can be used for negative
lookbehinds - ?<!

> The ":" is used in POSIX character class names.

Also, ":" is used in passive (non-capturing) groups as well (?:....)

.pjf

Peter Boughton

Feb 2, 2011, 6:06:51 PM
to cfml-convent...@googlegroups.com
Yikes, I'd definitely consider anything which escaped everything
except alphanumerics to be incorrect - regex can be intimidating
enough without throwing extra characters in just for the sake of it.

For Posix classes, once the double brackets are escaped, the contents
( including : ) have no meaning.
Similarly with lookbehinds and non-capturing groups - once the parens
and ? are escaped the others are just regular characters.

As above - I'd rather see limited escaping to avoid making expressions
unnecessarily long (and potentially scary).
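
The point about POSIX classes, lookbehinds and non-capturing groups can be checked with a short sketch. It uses java.util.regex purely as an assumption for illustration (the engines discussed here use Apache ORO), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class MinimalEscapeSketch {
    /** True when the whole text matches the given regex. */
    public static boolean matchesLiterally(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Only the brackets are escaped; the colons are left alone.
        System.out.println(matchesLiterally("\\[\\[:alpha:\\]\\]", "[[:alpha:]]"));
        // Only ( ? ) are escaped; ":" is left alone.
        System.out.println(matchesLiterally("\\(\\?:abc\\)", "(?:abc)"));
        // Only ( ? ) are escaped; "<" and "!" are left alone.
        System.out.println(matchesLiterally("\\(\\?<!abc\\)", "(?<!abc)"));
    }
}
```

Each pattern matches its literal text, which suggests that escaping the brackets, parens and ? is enough in these cases.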


Regarding the java.util.regex \Q...\E bugginess ... well the function
should only be concerned with its own engine, which (now I've got the
source) I can see is the Apache ORO one. Actually, not sure how the
quotemeta works with it - last time I checked CF didn't support the \Q
... \E stuff anyway?

Although... it might be nice to be able to say ReEscape(Text,For) -
thinking particularly of outputting for JavaScript here, but a
sensible argument could work for any regex target.


( I did once consider suggesting a feature where the RE~ functions can
be switched over to be powered by java.util.regex or others, by way of
an administrator setting, but never did actually go ahead and propose
it in the end. Possibly going off-topic for this discussion, but if
anyone else thinks it'd be nice to have that, maybe I should go raise
feature requests on the three engines? )

Peter J. Farrell

Feb 5, 2011, 4:32:22 PM
to CFML Conventional Wisdom
Looks like there is a discussion on REEscape on a Railo ticket now:

https://issues.jboss.org/browse/RAILO-1168

Peter J. Farrell

Feb 5, 2011, 5:58:02 PM
to cfml-convent...@googlegroups.com
Peter Boughton said the following on 02/02/2011 05:06 PM:

> As above - I'd rather see limited escaping to avoid making expressions
> unnecessarily long (and potentially scary).
Point taken here. I guess I tend to be overly cautious.

I have submitted a patch file to OpenBD to remove :, <, > and # as
escapable characters.


> Regarding the java.util.regex \Q...\E bugginess ... well the function
> should only be concerned with its own engine, which (now I've got the
> source) I can see is the Apache ORO one. Actually, not sure how the
> quotemeta works with it - last time I checked CF didn't support the \Q
> ... \E stuff anyway?

No it doesn't...


> Although... it might be nice to be able to say ReEscape(Text,For) -
> thinking particularly of outputting for JavaScript here, but a
> sensible argument could work for any regex target.

That could work, but in the end backslashes work on all modern regex
engines and I think it would be a pain for CFML engine committers to
keep other target engines in sync -- especially when we have a
universal escaping strategy (backslashes) that works. I guess the
question is -- what is there to gain by adding this feature?

I see REEscape being used like this:

<cfset matches = REMatch("([a-z]){3} #REEscape(variables.someValue)# ([0-9]){2}", variables.findMatches) />

In most cases, I think people won't even be looking at the regex
pattern -- just inserting the string literal. So the argument that it
"would be easier to read" doesn't work too well for me.


> ( I did once consider suggesting a feature where the RE~ functions can
> be switched over to be powered by java.util.regex or others, by way of
> an administrator setting, but never did actually go ahead and propose
> it in the end. Possibly going off-topic for this discussion, but if
> anyone else thinks it'd be nice to have that, maybe I should go raise
> feature requests on the three engines? )

First, I'd start a new thread on this list outlining your reasoning on
why this feature should be added. For example, Apache ORO has been
officially retired at Apache and has been put into the Apache Attic
projects (as of Sept 2010). Maybe it is time for engines to move over to
the built-in java.util.regex? IIRC, OpenBD required Java 1.5+.

.pjf

Paul Klinkenberg

Feb 6, 2011, 7:29:07 AM
to cfml-convent...@googlegroups.com
Hi Peter, and all,

Matt Woodward pointed me to the conventional wisdom mailing list yesterday, in a reply to my cfcsv post (http://www.railodeveloper.com/post.cfm/railo-custom-tag-cfcsv).
I didn't know about that, even though most of my team members are pretty active on that list :-/

Regarding the REEscape function: I saw the JIRA ticket, and thought "hey, I can build that in mere minutes, so let's do this". Without knowing any of the conv. wisdom list chatter about it. It wasn't mentioned in the JIRA ticket btw, which would have made matters more clear.

Anyway, now that I know of the mailing list, I'll be reading it, and discussing new tags / functions before releasing them.

Kind regards,

Paul Klinkenberg
www.railodeveloper.com

https://issues.jboss.org/browse/RAILO-1168


Peter J. Farrell

Feb 6, 2011, 7:51:01 AM
to cfml-convent...@googlegroups.com
Paul Klinkenberg said the following on 02/06/2011 06:29 AM:

> Hi Peter, and all,
>
> Matt Woodward pointed me to the conventional wisdom mailing list yesterday, in a reply to my cfcsv post (http://www.railodeveloper.com/post.cfm/railo-custom-tag-cfcsv).
> I didn't know about that, even though most of my team members are pretty active on that list :-/
Welcome! Glad to have you aboard the list.

A little about me. I'm the lead developer of the Mach-II framework and
a general open source advocate. I was on the CFML Advisory Committee
until I resigned due to complications to my schedule / life at the
time. Since I was serving as a community member, I felt it would be
better for somebody that had more time (and therefore energy) to be
appointed. Sadly, that effort sort of fizzled before a new community
representative was appointed. Anyways, this list is here now and a
great way for the community to gather to discuss the language as a whole.

I'm a junior Java-head and learning a lot by submitting patches to the
OpenBD folks. You'll see me bring up cross CFML engine compatibility
issues on the list or new feature ideas. I wish I had all the time in
the world so I could write patches for Railo, but I'm having enough of a
heartache figuring out OpenBD under the hood. I am, however, nice
enough to submit tickets to Railo's JIRA.

> Regarding the REEscape function: I saw the JIRA ticket, and thought "hey, I can build that in mere minutes, so let's do this". Without knowing any of the conv. wisdom list chatter about it. It wasn't mentioned in the JIRA ticket btw, which would have made matters more clear.

I will be sure to mention a thread here on the ticket in the future for
people that don't follow this list closely.

It's amazing what community involvement can make happen.
Recently a ticket came into OpenBD that I noticed about LSIsDate() being
slower than expected. I was able to organize the test cases in the
function so the more common ones were tried first. Blang! 5x
performance increase on the most common date patterns.


> Anyway, now that I know of the mailing list, I'll be reading it, and discussing new tags / functions before releasing them.

Great! Be sure to tell all of your friends.

Peter Boughton

Feb 6, 2011, 9:38:37 AM
to cfml-convent...@googlegroups.com
> That could work, but in the end backslashes work on all modern regex
> engines and I think it would be a pain for CFML engine committers to
> keep other target engines in sync -- especially when we have a
> universal escaping strategy (backslashes) that works. I guess the
> question is -- what is there to gain by adding this feature?

Hmm, I wasn't clear enough - I'm definitely not saying "don't use
backslashes" or anything similar.

What I was suggesting was simply knowing *which* characters need
escaping for each engine - and indeed with this it would make more
sense to always use backslashes if going down this route.

I'm not sure the major engines change so frequently or without warning
that it would be hard to keep up with necessary changes. (Probably no
more frequently than database drivers need updating for new versions?)

The benefit would be in being able to do stuff like this:

<cfsearch variable="LuceneResults" collection="docs" filter="^[a-z]{3} #REEscape(Text)#" />
<cfexecute variable="GrepResults" name="grep" arguments="-e/^[a-z]{3} #REEscape(Text,"gnu")#/" />
<cfset XmlResults = XmlSearch( XmlData , "fn:matches( node , '^[a-z]{3} #REEscape(Text,"xpath")#' )" ) />
<cfoutput><script>findSomething( /^[a-z]{3} #REEscape(Text,"javascript")#/g ) </script> </cfoutput>

etc.

Not sure how useful people might find that - it certainly wouldn't be
a flaw if REEscape doesn't do this, but it might be a handy
enhancement if enough people want it.

> First, I'd start a new thread on this list outlining your reasoning on
> why this feature should be added. For example, Apache ORO has been
> officially retired at Apache and has been put into the Apache Attic
> projects (as of Sept 2010). Maybe it is time for engines to move over to
> the built-in java.util.regex?  IIRC, OpenBD required Java 1.5+.

Well, a straight switch would cause broken code (e.g. java.util.regex
using $n in replacement, not \n ), hence why it'd need to be an
option, at least for a period of time, and there are some reasons why
I think a permanent option including other regex engines would
actually be a beneficial feature.
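
A small sketch of that incompatibility as seen from the java.util.regex side. The class and method names are hypothetical, and the CFML/ORO behaviour is only described in the comments, not executed.

```java
public class BackrefReplacementSketch {
    /** Applies the given replacement string to a fixed two-word input. */
    public static String swap(String replacement) {
        return "john smith".replaceAll("(\\w+) (\\w+)", replacement);
    }

    public static void main(String[] args) {
        // java.util.regex refers to captured groups with $n in the replacement.
        System.out.println(swap("$2 $1"));
        // A CFML-style \n replacement is not a group reference here:
        // the backslash just escapes the digit, which is then emitted literally.
        System.out.println(swap("\\2 \\1"));
    }
}
```

The first call prints "smith john"; the second prints "2 1". Existing code that uses \n replacements would silently produce wrong output after a straight switch.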

But yeah, I'll start a new thread here going into details, at some
point in the next few days.

Paul Klinkenberg

Feb 7, 2011, 8:34:39 AM
to cfml-convent...@googlegroups.com
Hi Peter, and all,

My bio isn't nearly that interesting, but here it goes. I have been working professionally with CFML since 2001, after switching from Perl. The switch from over-complicated to super-simple (and fast!) was great, and I have never looked back since.
I have been working as a programmer for several companies, for multinational and government clients, and also for small Dutch companies. A few years back, I started a venture in indoor floorplan management with a friend of mine. High up in a WTC tower in Amsterdam, I learned to build Flex apps, but also found out what a burnout is. That changed my life for the better in the end. Nowadays, I am working at Carlos Gallupa, a small (Railo partner) firm in The Netherlands, for 3 days a week. The other days were spent writing OSS and hacking, until I happily became a father last month. Now my baby girl Luce is taking up quite a lot of the free time, but that's perfectly okay :-)

In early 2009, I started using Railo for the hosting activities of my own small company Ongevraagd Advies (unasked advice). I fell in love with the Railo list, and the idea of CFML becoming open source. Ever since, I have been an active member on the Railo mailing list, organizing Railo meetings in The Netherlands, blogging, filing bugs and answering questions. In late 2010, I became the official voluntary Railo Extension Manager. This new title makes me focus more on the highly extensible nature of Railo, and I will try to get all of you involved in submitting extensions :-)

Thanks for the welcome Peter!

Kind regards,

Paul Klinkenberg
www.railodeveloper.com
