A Caché of Tips: Regular Expressions


Emily Haggstrom

Jul 9, 2012, 5:56:05 PM
to intersy...@googlegroups.com

This tip will cover pattern matching using the new %Regex.Matcher class, available in Caché 2012.2. This implementation of Regular Expressions comes from the International Components for Unicode (ICU). You can find details here: http://www.icu-project.org

 

Caché ObjectScript has a simple pattern matching tool (see the link below) that can be used to match whole strings. The MVBasic MATCH command is similar. However, this new tool has a much broader range of matching characters, and it can be used to locate and manipulate substrings as well as whole strings. It is also a standard approach that is implemented in many languages (with minor variations). Note that this is not for use in query languages, so it does not replace the CMQL WITH … LIKE clause.

http://docs.intersystems.com/cache20121/csp/docbook/DocBook.UI.Page.cls?KEY=GCOS_operators#GCOS_operators_pattern

 

There are some good guides to writing Regular Expressions on the web (see the links at the end of this email) but here is a brief primer. Regular Expressions are strings built of literal characters and matching characters. Here are just a few of the matching characters you can use:

 

.           Wildcard. Any single character

\d         A digit character. The numbers 0-9

\n         A newline character, $c(10)

\r          A carriage return character, $c(13)

\s         A whitespace character, including space and tab

\t          A tab character

\w         A word character. Letters, numbers, or an underscore

 

\p{prop} A character matching the Unicode Property Type {prop}. For instance:

\p{L}      Letters

\p{LL}    Lowercase Letters

\p{LU}   Uppercase Letters

\p{N}     Numbers

\p{P}     Punctuation
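To get a feel for what these property types mean, here is a minimal Python sketch. It is not Caché code, and Python's built-in re module does not accept \p{...}, so the sketch looks up each character's Unicode general category with the standard unicodedata module instead. The helper name in_category is made up for this illustration.

```python
import unicodedata

def in_category(ch: str, prefix: str) -> bool:
    """Return True if the character's Unicode general category starts with prefix.

    "L" covers all letters, "Ll" lowercase letters, "Lu" uppercase letters,
    "N" numbers and "P" punctuation, mirroring the \\p{...} names listed above.
    """
    return unicodedata.category(ch).startswith(prefix)

# Print the general category of a few sample characters.
for ch in ["a", "A", "é", "7", "!", " "]:
    print(repr(ch), unicodedata.category(ch))
```

A two-letter category such as Ll is a subset of its one-letter parent L, which is why \p{L} matches both uppercase and lowercase letters.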

 

Quantifiers can be added after a character or set of characters to define how many times it can appear in the string. The quantifiers you can use are:

 

?          1 or 0 times

+          1 or more times

*           0 or more times

{n}        exactly n times

{n,}       n or more times

{n,m}     at least n, but not more than m times

 

Anchor Characters

 

^           Matches the beginning of a string

$          Matches the end of a string
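The quantifiers and anchors above behave much the same way in most regular expression engines. Here is a small sketch using Python's re module as a stand-in (an assumption for illustration only; it is not the ICU engine that Caché uses, and the sample patterns are my own).

```python
import re

# Each entry: (pattern, text, should the whole string match?)
cases = [
    (r"^colou?r$", "color", True),    # ?     : the "u" may appear 0 or 1 times
    (r"^colou?r$", "colour", True),
    (r"^\d+$", "", False),            # +     : needs at least one digit
    (r"^\d*$", "", True),             # *     : zero digits is acceptable
    (r"^\d{3}$", "1234", False),      # {n}   : exactly three digits
    (r"^\d{2,}$", "1234", True),      # {n,}  : two or more digits
    (r"^\d{2,3}$", "1234", False),    # {n,m} : two or three digits only
]

# The ^ and $ anchors force each pattern to cover the entire string.
results = [bool(re.search(pattern, text)) for pattern, text, _ in cases]
expected = [want for _, _, want in cases]
print(results == expected)
```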

 

Here’s a simple example of a regular expression you could use to check for a valid social security number:

 

"^\d{3}-\d{2}-\d{4}$"

 

In this pattern the dashes are included as literals. If you want to search for a literal character that is also a special character in regular expressions (like ?), you can precede it with a backslash (\?).
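The difference the backslash makes is easy to see in a quick sketch. This one uses Python's re module purely as an illustration (not Caché), and the word "Ready" is just a made-up sample.

```python
import re

unescaped = re.compile(r"^Ready?$")   # ? is a quantifier here: the "y" is optional
escaped = re.compile(r"^Ready\?$")    # \? matches a literal question mark

print(bool(unescaped.search("Read")))     # "y" omitted, still matches
print(bool(unescaped.search("Ready?")))   # the literal "?" is not allowed
print(bool(escaped.search("Ready?")))     # the literal "?" is required
```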

 

To test a string against a regular expression pattern, you need to create a Matcher object. You can do this by passing the pattern to the %New() method of the %Regex.Matcher class. The resulting object can be used to test any number of strings against the initial pattern.

 

USER:;MATCHER="%Regex.Matcher"->%New("\d{3}-\d{2}-\d{4}")

USER:;PRINT MATCHER->Match("123-45-6789")

1
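Note that the pattern passed to %New() above has no ^ or $ anchors. As far as I understand it, Match() tests the whole string, so the anchors are not needed there. The rough Python equivalent (an illustration with Python's re module, not Caché code; is_ssn is a hypothetical helper name) is fullmatch(), as opposed to search(), which accepts a match anywhere in the string.

```python
import re

ssn = re.compile(r"\d{3}-\d{2}-\d{4}")

def is_ssn(text: str) -> bool:
    """Return True only if the entire string has the form ddd-dd-dddd."""
    return ssn.fullmatch(text) is not None

print(is_ssn("123-45-6789"))        # whole string matches
print(is_ssn("SSN: 123-45-6789"))   # extra text, so no whole-string match
print(ssn.search("SSN: 123-45-6789") is not None)   # a substring search still finds it
```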

 

You can also use a regular expression to locate a pattern within a string. For example, if you wanted to look for a 5 digit part number in a memo field, you might use something like this:

 

USER:;MATCHER="%Regex.Matcher"->%New("#\d{5}")

USER:;MATCHER->Text="Buyer ordered part #12345."

USER:;PRINT MATCHER->Locate()

1

 

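The position of the matching substring is stored in the Start and End properties of the matcher object. Start is the position of the first matched character, and End is the position just after the last matched character, which is why the six-character match here reports 20 and 26.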

 

USER:;PRINT MATCHER->Start

20

USER:;PRINT MATCHER->End

26
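If you are used to regular expressions in another language, watch the position conventions. As a comparison, here is the same search sketched in Python (an illustration only, not Caché code). Python counts characters from 0, while the transcript above counts from 1, so the numbers differ by one.

```python
import re

text = "Buyer ordered part #12345."
m = re.search(r"#\d{5}", text)

# m.start() is the 0-based index of "#", and m.end() is the 0-based index
# just past the final "5". Adding 1 converts each to a 1-based position.
start_1based = m.start() + 1
end_1based = m.end() + 1
matched = text[m.start():m.end()]

print(start_1based, end_1based, matched)
```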

 

It is important to note that the next call to Locate() picks up where the last one left off, so if I call it again, it returns 0 because there are no further matches.

 

USER:;PRINT MATCHER->Locate()

0

 

If you want to start at the beginning again, you can call the ResetPosition() method, or you can pass a position argument to the Locate() method to start anywhere in the string.

 

USER:;MATCHER->ResetPosition()

USER:;PRINT MATCHER->Locate()

1

USER:;PRINT MATCHER->Locate(0)

1
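The "pick up where the last one left off" behavior can be modeled with a tiny stateful wrapper. This is only a sketch in Python to show the idea. The Scanner class and its method names are hypothetical, it is not how %Regex.Matcher is implemented, and Python positions are 0-based.

```python
import re

class Scanner:
    """Hypothetical helper that remembers where the previous search ended."""

    def __init__(self, pattern: str, text: str):
        self.pattern = re.compile(pattern)
        self.text = text
        self.pos = 0  # 0-based position where the next search starts

    def locate(self, pos=None) -> bool:
        # An explicit position overrides the remembered one.
        if pos is not None:
            self.pos = pos
        found = self.pattern.search(self.text, self.pos)
        if found is None:
            return False
        # Continue after this match; step one extra character on an empty match.
        self.pos = max(found.end(), found.start() + 1)
        return True

    def reset_position(self) -> None:
        self.pos = 0

scanner = Scanner(r"#\d{5}", "Buyer ordered part #12345.")
first = scanner.locate()     # finds #12345
second = scanner.locate()    # nothing left after the first match
scanner.reset_position()
third = scanner.locate()     # finds #12345 again
fourth = scanner.locate(0)   # explicit start position
print(first, second, third, fourth)
```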

 

You can make much more complex regular expressions using logical operators and grouping.

 

()          Parentheses are used to control order of operations

|           Logical OR operator

[1234]   Square brackets match any one of the enclosed characters

[1-4]      Square brackets can also use a range of characters (letters or numbers)

 

For instance, these patterns are all equivalent ways to represent “one digit between 1 and 4”:

 

"1|2|3|4"             "[1234]"             "[1-4]"
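You can convince yourself that the three forms are equivalent by checking which single digits each one accepts. Here is a quick sketch with Python's re module (used for illustration only; it is not the ICU engine).

```python
import re

patterns = ["1|2|3|4", "[1234]", "[1-4]"]
candidates = "0123456789"

# For each pattern, collect the single digits it matches in full.
accepted = [
    {ch for ch in candidates if re.fullmatch(pattern, ch)}
    for pattern in patterns
]

print(all(digits == {"1", "2", "3", "4"} for digits in accepted))
```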

 

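Parentheses can also be used to capture substrings of a match. Here’s a pattern you could use to identify someone’s salutation (the commands that follow use only the parenthesized part, (Mrs?), so that just the salutation is matched):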

 

"(Mrs?) \p{L}+ \p{L}+"

 

USER:;MATCHER="%Regex.Matcher"->%New("(Mrs?)")

USER:;MATCHER->Text="Mrs Sally Jones"

USER:;PRINT MATCHER->Locate()

1

 

The substring that matches the part of the pattern in parentheses is stored in the multidimensional Group property. You can access the value like this:

 

USER:;PRINT MATCHER->Group(1)

Mrs

 

This also gives you more control over your text string. Let’s say Sally just got her PhD, and we want to change her salutation. We can use ReplaceAll() to replace every match of the pattern (here, just the salutation) with new text. The output is a modified copy of the string. The string in the Text property remains unchanged.

 

USER:;PRINT MATCHER->ReplaceAll("Dr")

Dr Sally Jones
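One thing to keep in mind is that the replacement covers the whole match, not just the part in parentheses. A short Python sketch shows the difference (an illustration only, not Caché code). Python's re module has no \p{L}, so [A-Za-z] stands in for it here.

```python
import re

text = "Mrs Sally Jones"

# Capture group 1 holds the salutation.
title = re.search(r"(Mrs?)", text).group(1)

# The pattern matches only the salutation, so only that part is replaced.
renamed = re.sub(r"(Mrs?)", "Dr", text)

# With the longer pattern the match covers the full name, so all of it is replaced.
collapsed = re.sub(r"(Mrs?) [A-Za-z]+ [A-Za-z]+", "Dr", text)

print(title)      # Mrs
print(renamed)    # Dr Sally Jones
print(collapsed)  # Dr
print(text)       # the original string is untouched
```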

 

Alternatively, you can use the SubstituteIn() method to use the captured logical groups in a new string. You can refer to multiple capture groups as $1,$2…

 

USER:;MATCHER="%Regex.Matcher"->%New("(Mrs?) (\p{L}+) (\p{L}+)")

USER:;MATCHER->Text="Mrs Sally Jones"

USER:;PRINT MATCHER->Locate()

1

USER:;PRINT MATCHER->SubstituteIn("Please join us for dinner $1 $3")

Please join us for dinner Mrs Jones
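For comparison, the closest Python counterpart I know of is the expand() method of a match object. This is only a sketch for readers coming from other languages, not Caché code. Python writes group references as \1, \2 and so on rather than $1, $2, and [A-Za-z] again stands in for \p{L}.

```python
import re

m = re.search(r"(Mrs?) ([A-Za-z]+) ([A-Za-z]+)", "Mrs Sally Jones")

# \1 is the salutation, \2 the first name, \3 the last name.
invitation = m.expand(r"Please join us for dinner \1 \3")

print(invitation)
```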

 

I hope I have given you a good sampling of the capability of the %Regex.Matcher tool. The full capabilities of the tool are much more varied, since there are many more matching characters available for control characters and special character sets, and boolean logic can generate very intricate patterns.

 

For more information, check out the following sites:

http://www.icu-project.org

http://en.wikipedia.org/wiki/Regular_expressions

http://www.regular-expressions.info/reference.html/

 

The following pages are available in the Caché 2012.2 documentation, though not yet published on the web:

http://localhost:57772/csp/docbook/DocBook.UI.Page.cls?KEY=GCOS_regexp

http://localhost:57772/csp/documatic/%25CSP.Documatic.cls?APP=1&LIBRARY=%25SYS&CLASSNAME=%25Regex.Matcher

Jason Warner

Jul 9, 2012, 6:11:14 PM
to intersy...@googlegroups.com
This is one of the more exciting announcements (for me) from Global
Summit. We used a wrapper and the .Net gateway to get access to regular
expressions, but it has some shortcomings.

Is it possible to pass a %Stream to the regex matcher? Some of the text
we are matching for a project I have is MUCH larger than the MAXSTRING
even with long strings enabled. Being able to use a
%Stream.GlobalCharacter or a %Stream.FileCharacter would be a welcome
change to the kluges I'm using for the moment.

Jason

On Monday, July 09, 2012 3:56:05 PM, Emily Haggstrom wrote:
> This tip will cover pattern matching using the new %Regex.Matcher
> class, available in Cache 2012.2. This implementation of Regular
> Expressions comes from the International Components for Unicode (ICU).
> You can find details here: http://www.icu-project.org
> --
> You received this message because you are subscribed to the Google
> Groups "InterSystems: MV Community" group.
> To post to this group, send email to Cac...@googlegroups.com
> To unsubscribe from this group, send email to
> CacheMV-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/CacheMV?hl=en

Ed Clark

Jul 9, 2012, 6:16:18 PM
to intersy...@googlegroups.com
Currently the %Regex.Matcher class only works with a string. Sorry, no streams at this point.

Jason Warner

Jul 9, 2012, 6:17:26 PM
to intersy...@googlegroups.com
Oh well... It will be nice to have them native to Cache instead of a
gateway callout.

Jason

Dawn Wolthuis

Jul 9, 2012, 6:29:29 PM
to intersy...@googlegroups.com
Very Cool!

Of course I really have to trade in my Grace Hopper-like preference
for human readable code were I to ever encounter a class written in
COS with regex strings in it too!

It is not a coincidence that our gender, Emily, has largely tossed the
coding profession aside -- approximately 1/3 of software developers
were women when I started out. But I digress.
smiles --dawn



--
Dawn M. Wolthuis

Take and give some delight today

Bill Farrell

Jul 10, 2012, 8:39:35 AM
to intersy...@googlegroups.com

This is a fantastic development, Emily!   I use regular expressions a lot because they can do more complex matching and editing than one can achieve with MV “Match” and “EReplace()”.   RegEx matching and editing is very common in *nix and I’ve used them quite a bit with VB Dot Net as well.  A well-constructed RegEx can make short work of even the most complex data validation problems.

 

The downside is that regular expressions are tricky to use and a bear to learn.  (I did most of the regex building at my old job.)  Fear not, there’s help.

 

There is something I can recommend that I use quite frequently.  There is a quite inexpensive product called RegexBuddy (http://www.regexbuddy.com/) that takes the pain out of creating and testing regular expressions.  Even a person who’s inexperienced with RegEx can construct some pretty nifty matching and editing strings fairly quickly with RegexBuddy.

 

Disclaimer:  I have no connection with the creators of RegexBuddy.  I just really like the product.

 

Best!

Bill

--

Ed Clark

Jul 10, 2012, 9:09:22 AM
to intersy...@googlegroups.com
I found this website http://txt2re.com/index.php3?s=crt%20a%3C1%3E,a%3Ctwo%3E,a%3C1%3E%20%3C%3C%3E%3E&-47 somewhat interesting for playing with regular expressions.

You can't use regular expressions directly in cmql with LIKE, but you can use them in a subroutine called from an i-type expression.