New RegEx BIFs


Peter J. Farrell

Jun 22, 2010, 10:42:52 PM
to CFML Conventional Wisdom
I've been working with RegEx-based string parsing a lot lately, and
there are a few places in the CFML language where we could make things
easier.

REFindAll(regex, text) -> Struct
REFindAllNoCase(regex, text) -> Struct

This function would return a struct with two keys, "pos" and "len"
(just like REFind), except that each element in those arrays would
correspond to one match in the string. This is useful if you want to
get all the matches of a particular regex.

Yes, you can do this with REFind, but you have to loop and reset the
start position like this example CFML code:

<cffunction name="reFindAll" access="private" returntype="struct" output="false"
    hint="Finds all regex matches and returns the position / length of each match.">
    <cfargument name="regex" type="string" required="true"
        hint="The regex pattern to use." />
    <cfargument name="input" type="string" required="true"
        hint="The text to search and apply the regex pattern to." />

    <cfset var results = StructNew() />
    <cfset var start = 1 />
    <cfset var match = "" />

    <!--- Setup results --->
    <cfset results.len = ArrayNew(1) />
    <cfset results.pos = ArrayNew(1) />

    <!--- Loop through input text for matches --->
    <cfloop condition="true">

        <!--- Perform search --->
        <cfset match = REFind(arguments.regex, arguments.input, start, TRUE) />

        <!--- Break if nothing matched --->
        <cfif NOT match.len[1]>
            <cfbreak />
        </cfif>

        <cfset ArrayAppend(results.len, match.len[1]) />
        <cfset ArrayAppend(results.pos, match.pos[1]) />

        <!--- Reposition start point --->
        <cfset start = match.pos[1] + match.len[1] />
    </cfloop>

    <!--- If no matches, add 0 to both arrays --->
    <cfif NOT ArrayLen(results.len)>
        <cfset results.len[1] = 0 />
        <cfset results.pos[1] = 0 />
    </cfif>

    <cfreturn results />
</cffunction>
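
For comparison, here is a rough sketch of the same pos/len result in Python, where `re.finditer` already walks every match. The function name `re_find_all` and the `ignore_case` flag are hypothetical stand-ins for the proposed REFindAll / REFindAllNoCase, not existing functions in any CFML engine.

```python
import re

def re_find_all(regex, text, ignore_case=False):
    """Return {'pos': [...], 'len': [...]} with one entry per match."""
    flags = re.IGNORECASE if ignore_case else 0
    result = {"pos": [], "len": []}
    for m in re.finditer(regex, text, flags):
        result["pos"].append(m.start() + 1)        # CFML positions are 1-based
        result["len"].append(m.end() - m.start())
    # Mirror the CFML helper above: a single 0 in each array means "no match"
    if not result["pos"]:
        result = {"pos": [0], "len": [0]}
    return result

print(re_find_all(r"\d+", "a1b22c333"))
```

One difference from the CFML loop: `re.finditer` also yields zero-length matches, whereas the loop above stops at the first one.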


RESplit(regex, text) -> array
RESplitNoCase(regex, text) -> array

This would be similar to REFindAll and REFindAllNoCase, but instead of
getting the pos/len of all the matches, this function would return an
array of the matched text. This is similar to the function available
in Python:

http://docs.python.org/library/re.html#re.split

One technical note here is that in the Python version, the last array
element is any "remainder" from the input text that did not match.
I'm not sure if this is useful, but it could be controlled by an
optional third argument to RESplit() that makes the last array element
any non-matching remainder (i.e. any text that occurs after the last
match is added to the split).
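
As a point of reference, a short Python snippet showing what `re.split` and `re.findall` each return for the same pattern. The sample text and pattern are made up for illustration.

```python
import re

text = "one, two;three"
separator = r"[,;]\s*"

# re.split returns the text between the matches, including the trailing remainder
pieces = re.split(separator, text)

# re.findall returns the matched text itself
matches = re.findall(separator, text)

print(pieces)   # ['one', 'two', 'three']
print(matches)  # [', ', ';']
```

So an "array of the matched text" corresponds to `re.findall`, while `re.split` hands back everything around the matches.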

REEscape(string) -> string
This function returns a string with all non-alphanumerics backslashed.
This is very useful if you want to use a literal string that may have
RegEx meta-characters in it.

Example using a phone number as a literal:

REEscape("555.555.1234") -> "555\.555\.1234"

Python has a similar function and I think this would be a useful
addition to the CFML language.
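
The Python function in question is `re.escape`. A minimal illustration with the same phone number, showing why escaping matters when the literal contains a dot:

```python
import re

phone = "555.555.1234"
escaped = re.escape(phone)  # the dots are backslashed, the digits are left alone

print(escaped)  # 555\.555\.1234

# Unescaped, "." matches any character, so a wrong separator slips through
print(bool(re.fullmatch(phone, "555x555x1234")))    # True
print(bool(re.fullmatch(escaped, "555x555x1234")))  # False
```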

Matthew Woodward

Jun 22, 2010, 11:35:25 PM
to cfml-convent...@googlegroups.com
On Tue, Jun 22, 2010 at 7:42 PM, Peter J. Farrell <pe...@mach-ii.com> wrote:
> REFindAll(regex, text) -> Struct
> REFindAllNoCase(regex, text) -> Struct
>
> This function would return a struct with two keys "pos" and
> "len" (just like REFind) except each element in the array would
> correspond to each match in the string. This is useful if you want to
> get all the matches of a particular regex.

I'd have to test to verify but I believe the "subexpression flag" in REFind in OpenBD does just that:
http://openbluedragon.org/manual/?/function/refind

Is that what you're after?

--
Matthew Woodward
ma...@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint, etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Matthew Woodward

Jun 22, 2010, 11:47:23 PM
to cfml-convent...@googlegroups.com
On Tue, Jun 22, 2010 at 8:35 PM, Matthew Woodward <ma...@mattwoodward.com> wrote:

> I'd have to test to verify but I believe the "subexpression flag" in REFind in OpenBD does just that:
> http://openbluedragon.org/manual/?/function/refind
>
> Is that what you're after?


My bad, I was thinking of REMatch:
http://openbluedragon.org/manual/?/function/rematch

Which I think does the first part of what you're asking.

Peter J. Farrell

Jun 23, 2010, 3:15:41 AM
to cfml-convent...@googlegroups.com
Yeah, my bad. REMatch fits my suggestion for RESplit; however, REFind() with subexpressions doesn't work like REFindAll(). The subexpressions flag only looks at the first match and returns the subexpressions within it.


Peter Boughton

Jun 23, 2010, 7:21:25 AM
to cfml-convent...@googlegroups.com
For REFindAll, I think that'd be more consistent as an extra option on
REFind - I think PHP's regex functions have something like this, a
numeric value for how many times to match, default 1, set to 0 for all
matches.

That gives a bit more flexibility/performance (for example, if you
only want the first four matches of something containing 50 matches).
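
A minimal sketch of that match-count idea in Python, assuming the semantics described above (default 1, 0 for all matches). The name `re_find_n` and its `match_count` parameter are hypothetical.

```python
import re
from itertools import islice

def re_find_n(regex, text, match_count=1):
    """Return (pos, len) pairs for up to match_count matches; 0 means all."""
    matches = re.finditer(regex, text)
    if match_count > 0:
        # finditer is lazy, so scanning stops once enough matches are found
        matches = islice(matches, match_count)
    return [(m.start() + 1, m.end() - m.start()) for m in matches]

print(re_find_n(r"\d", "a1b2c3d4e5"))                  # first match only
print(re_find_n(r"\d", "a1b2c3d4e5", match_count=3))   # first three
print(len(re_find_n(r"\d", "a1b2c3d4e5", match_count=0)))  # all matches
```

Because the iterator is consumed lazily, asking for four matches out of fifty never touches the remaining forty-six.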

> REEscape(string) -> string
> This function returns a string with all non-alphanumerics backslashed.

I was going to point out that this is overkill (not all non-alnum need
escaping), but it looks like the .quotemeta() function in the Apache
ORO library (which CF uses) does exactly what you're suggesting
anyway.

Personally I tend to use Java regex, which supports \Q..\E inline and
has a Pattern.quote() function (which handles content that might
contain \E).

If the expression still needs to be readable afterwards, the \Q..\E is
probably clearer, but those are explicitly not supported in the
current Apache ORO, so escaping each char is the only option.

Given the names of both of those, I'd probably say go with REQuote as
the name, even though REEscape is what I would have used too.

Andy Wu

Jun 23, 2010, 7:22:41 AM
to cfml-convent...@googlegroups.com
I like the idea of REFindAll, but I think it should optionally support subexpressions like REFind. How would that affect your thinking on what it should return?

Peter Boughton

Jun 23, 2010, 7:27:53 AM
to cfml-convent...@googlegroups.com
Oh, and REMatch should also have an option to return sub-expressions
(as strings rather than pos/len, of course).

In fact, there's no reason for REFind and REMatch not to have the exact
same argument structure:

( reg_expression , string , start , subexpression , matchcount )
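
To make the shared signature concrete, here is a rough Python sketch of two functions taking the same five arguments. Everything here is an assumption for illustration: the names `re_find` and `re_match`, the default `matchcount=1`, and the choice to return one entry per match with index 0 holding the whole match. No CFML engine is claimed to work this way.

```python
import re
from itertools import islice

def _matches(regex, text, start=1, matchcount=1):
    # start is 1-based, as in CFML; matchcount 0 means "all matches"
    found = re.compile(regex).finditer(text, start - 1)
    return islice(found, matchcount) if matchcount > 0 else found

def re_find(regex, text, start=1, subexpression=False, matchcount=1):
    """One {'pos': [...], 'len': [...]} dict per match; index 0 is the whole match."""
    results = []
    for m in _matches(regex, text, start, matchcount):
        groups = range(m.re.groups + 1) if subexpression else [0]
        results.append({
            "pos": [m.start(g) + 1 for g in groups],
            "len": [m.end(g) - m.start(g) for g in groups],
        })
    return results

def re_match(regex, text, start=1, subexpression=False, matchcount=1):
    """Same arguments as re_find, but returns matched strings instead of pos/len."""
    results = []
    for m in _matches(regex, text, start, matchcount):
        results.append([m.group(0), *m.groups()] if subexpression else m.group(0))
    return results

print(re_find(r"(\w+)@(\w+)", "a@b c@d", subexpression=True, matchcount=0))
print(re_match(r"(\w+)@(\w+)", "a@b c@d", subexpression=True, matchcount=0))
```

Returning one pos/len struct per match is only one possible answer to the question of what a subexpression-aware "find all" should return; a struct of nested arrays would carry the same information.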


And it would be good to replace the one/all scope of REReplace with a
numeric count too (though that would have compatibility issues).
