This tip covers pattern matching with the new %Regex.Matcher class, available in Caché 2012.2. This implementation of regular expressions comes from the International Components for Unicode (ICU). You can find details here: http://www.icu-project.org
Caché ObjectScript has a simple pattern matching tool (see below) that can be used to match whole strings, and the MVBasic MATCH command is similar. However, this new tool has a much broader range of matching characters, and it can be used to locate and manipulate substrings as well as whole strings. It is also a standard approach that is implemented in many languages (with minor variations). Note that this is not for use in query languages, so it does not replace the CMQL WITH … LIKE clause.
There are some good guides to writing Regular Expressions on the web (see the links at the end of this email) but here is a brief primer. Regular Expressions are strings built of literal characters and matching characters. Here are just a few of the matching characters you can use:
. Wildcard. Any single character (by default, line terminators are excluded)
\d A digit character. The numbers 0-9 (ICU also matches other Unicode decimal digits)
\n A newline character, $c(10)
\r A carriage return character, $c(13)
\s A whitespace character, including space and tab
\t A tab character
\w A word character. Letters, numbers, or an underscore
\p{prop} A character matching the Unicode Property Type {prop}. For instance:
\p{L} Letters
\p{Ll} Lowercase Letters
\p{Lu} Uppercase Letters
\p{N} Numbers
\p{P} Punctuation
Quantifiers can be added after a character or set of characters to define how many times it can appear in the string. The quantifiers you can use are:
? 1 or 0 times
+ 1 or more times
* 0 or more times
{n} exactly n times
{n,} n or more times
{n,m} at least n, but not more than m times
Anchor Characters
^ Matches the beginning of a string
$ Matches the end of a string
Here’s a simple example of a regular expression you could use to check for a valid social security number:
"^\d{3}-\d{2}-\d{4}$"
In this pattern the dashes are included as literals. If you want to search for a literal character that is also a special character in regular expressions (like ?), you can precede it with a backslash (\?).
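If you want to experiment with escaping outside of Caché, here is a minimal sketch in Python. It uses Python's standard re module as a stand-in, which is not the ICU engine behind %Regex.Matcher, although the backslash escape for ? works the same way in both. The helper name whole_match and the sample strings are made up for illustration.

```python
import re

# Python's re module stands in for ICU here (an assumption for illustration).
unescaped = re.compile(r"really?")   # "?" makes the preceding "y" optional
escaped = re.compile(r"really\?")    # "\?" matches a literal question mark


def whole_match(pattern, text):
    """Return True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


print(whole_match(unescaped, "reall"))    # True: the "y" is optional
print(whole_match(unescaped, "really?"))  # False: the "?" is not a literal here
print(whole_match(escaped, "really?"))    # True: the escaped "?" is a literal
print(re.escape("?"))                     # prints \?
```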
To test a string against a regular expression pattern, you need to create a Matcher object. You can do this by passing the pattern to the %New() method of the %Regex.Matcher class. The resulting object can be used to test any number of strings against the initial pattern.
USER:;MATCHER="%Regex.Matcher"->%New("\d{3}-\d{2}-\d{4}")
USER:;PRINT MATCHER->Match("123-45-6789")
1
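The "create once, test many strings" idea can be sketched in Python as well. This is only an analogy: re.compile() plays the role of %New(), and fullmatch() is assumed to correspond to Match() testing the whole string. The function name is_ssn_format is hypothetical, and the check covers the format only, not whether the number is actually valid.

```python
import re

# Compile once, then reuse for many strings, much like one Matcher object.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def is_ssn_format(text):
    """True when the whole string has the NNN-NN-NNNN shape (format only)."""
    return SSN_PATTERN.fullmatch(text) is not None


for candidate in ["123-45-6789", "123-456-789", "x123-45-6789"]:
    print(candidate, is_ssn_format(candidate))
```

Only the first candidate passes, because the other two either have the wrong grouping or carry an extra leading character.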
You can also use a regular expression to locate a pattern within a string. For example, if you wanted to look for a five-digit part number in a memo field, you might use something like this:
USER:;MATCHER="%Regex.Matcher"->%New("#\d{5}")
USER:;MATCHER->Text="Buyer ordered part #12345."
USER:;PRINT MATCHER->Locate()
1
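The starting and ending positions of the matching substring are stored in the Start and End properties of the matcher object. In this example the match #12345 occupies positions 20 through 25, so End holds the position just after the last matched character.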
USER:;PRINT MATCHER->Start
20
USER:;PRINT MATCHER->End
26
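To double-check those positions outside of Caché, here is a rough Python equivalent. It is a sketch using Python's re module rather than %Regex.Matcher. Python offsets are 0-based, so the code adds 1 to both values to line them up with the numbers above. The variable names are made up for illustration.

```python
import re

text = "Buyer ordered part #12345."
found = re.search(r"#\d{5}", text)

# Python offsets are 0-based and end() is exclusive, so add 1 to compare
# with the 1-based values printed in the post.
start_1_based = found.start() + 1
end_1_based = found.end() + 1

print(found.group(0))  # #12345
print(start_1_based)   # 20
print(end_1_based)     # 26
```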
It is important to note that the next call to Locate() will pick up where the last one left off, so if I call it again here, it returns 0 because there is no second match in the text.
USER:;PRINT MATCHER->Locate()
0
If you want to start at the beginning again, you can call the ResetPosition() method, or you can pass a position argument to the Locate() method to start anywhere in the string.
USER:;MATCHER->ResetPosition()
USER:;PRINT MATCHER->Locate()
1
USER:;PRINT MATCHER->Locate(0)
1
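The "pick up where you left off" behaviour can be imitated in Python by carrying the search position along by hand. This is a sketch of the idea only, not how %Regex.Matcher is implemented. The helper locate() is hypothetical and uses Python's 0-based positions.

```python
import re

pattern = re.compile(r"#\d{5}")
text = "Buyer ordered part #12345."


def locate(position):
    """Search from a 0-based position; return (found, next_position)."""
    found = pattern.search(text, position)
    if found is None:
        return False, position
    return True, found.end()


first, position = locate(0)          # finds #12345
second, position = locate(position)  # nothing left after the first match
again, _ = locate(0)                 # restarting from the beginning finds it again

print(first, second, again)  # True False True
```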
You can make much more complex regular expressions using logical operators and grouping.
() Parentheses are used to control order of operations
| Logical OR operator
[1234] Square brackets match any one of the enclosed characters
[1-4] Square brackets can also use a range of characters (letters or numbers)
For instance, these patterns are all equivalent ways to represent "one digit from 1 to 4":
"1|2|3|4" "[1234]" "[1-4]"
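A quick way to convince yourself that the three forms are equivalent is to run them against the same samples. Here is a small sketch in Python, whose re module is not ICU but treats alternation and square brackets the same way for this case. The sample list and the helper accepted() are made up for illustration.

```python
import re

patterns = [r"1|2|3|4", r"[1234]", r"[1-4]"]
samples = ["0", "1", "2", "3", "4", "5", "12", "a"]


def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


results = [accepted(p) for p in patterns]
print(results[0])                              # ['1', '2', '3', '4']
print(results[0] == results[1] == results[2])  # True
```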
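Parentheses can also be used to represent substrings of a match. Here's a pattern you could use to identify the salutation in someone's full name:
"(Mrs?) \p{L}+ \p{L}+"
The example below uses only the parenthesized part, (Mrs?), so that the match covers just the salutation.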
USER:;MATCHER="%Regex.Matcher"->%New("(Mrs?)")
USER:;MATCHER->Text="Mrs Sally Jones"
USER:;PRINT MATCHER->Locate()
1
The substring that matches the part of the pattern in parentheses is stored in the multidimensional Group property. You can access the value like this:
USER:;PRINT MATCHER->Group(1)
Mrs
This also gives you more control over your text string. Let's say Sally just got her PhD, and we want to change her salutation. We can use ReplaceAll() to replace every match of the pattern with new text. The output will be a modified version of the string. The string in the Text property remains unchanged.
USER:;PRINT MATCHER->ReplaceAll("Dr")
Dr Sally Jones
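For comparison, here is the same capture-and-replace idea as a Python sketch. It uses Python's re module rather than %Regex.Matcher, with group(1) standing in for Group(1) and sub() standing in for ReplaceAll(). The variable names are made up for illustration.

```python
import re

pattern = re.compile(r"(Mrs?)")
text = "Mrs Sally Jones"

found = pattern.search(text)
salutation = found.group(1)  # text captured by the first parenthesized group

# sub() replaces every match of the pattern and returns a new string.
updated = pattern.sub("Dr", text)

print(salutation)  # Mrs
print(updated)     # Dr Sally Jones
print(text)        # Mrs Sally Jones (the original string is untouched)
```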
Alternatively, you can use the SubstituteIn() method to use the captured logical groups in a new string. You can refer to multiple capture groups as $1,$2…
USER:;MATCHER="%Regex.Matcher"->%New("(Mrs?) (\p{L}+) (\p{L}+)")
USER:;MATCHER->Text="Mrs Sally Jones"
USER:;PRINT MATCHER->Locate()
1
USER:;PRINT MATCHER->SubstituteIn("Please join us for dinner $1 $3")
Please join us for dinner Mrs Jones
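A rough Python counterpart of SubstituteIn() is the expand() method of a match object. Treat this as a sketch with two differences labeled up front: Python writes group references as \1 and \3 instead of $1 and $3, and its re module does not support \p{L}, so the code swaps in [^\W\d_]+ as an approximation of "one or more letters". The names LETTERS and invitation are made up for illustration.

```python
import re

# [^\W\d_]+ is a stand-in for \p{L}+, which Python's re module does not support.
LETTERS = r"[^\W\d_]+"
pattern = re.compile(rf"(Mrs?) ({LETTERS}) ({LETTERS})")

found = pattern.search("Mrs Sally Jones")

# Python templates write group references as \1, \3 rather than $1, $3.
invitation = found.expand(r"Please join us for dinner \1 \3")
print(invitation)  # Please join us for dinner Mrs Jones
```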
I hope I have given you a good sampling of the capability of the %Regex.Matcher tool. The full capabilities of the tool are much more varied, since there are many more matching characters available for control characters and special character sets, and boolean logic can generate very intricate patterns.
For more information, check out the following sites:
http://en.wikipedia.org/wiki/Regular_expressions
http://www.regular-expressions.info/reference.html/
The following page is available in the Caché 2012.2 documentation, though not yet published on the web:
http://localhost:57772/csp/docbook/DocBook.UI.Page.cls?KEY=GCOS_regexp
This is a fantastic development, Emily! I use regular expressions a lot because they can do more complex matching and editing than one can achieve with MV “Match” and “EReplace()”. RegEx matching and editing is very common in *nix and I’ve used them quite a bit with VB Dot Net as well. A well-constructed RegEx can make short work of even the most complex data validation problems.
The downside is that regular expressions are tricky to use and a bear to learn. (I did most of the regex building at my old job.) Fear not, there's help.
There is something I use quite frequently and can recommend: an inexpensive product called RegexBuddy (http://www.regexbuddy.com/) that takes the pain out of creating and testing regular expressions. Even a person who's inexperienced with RegEx can construct some pretty nifty matching and editing strings fairly quickly with RegexBuddy.
Disclaimer: I have no connection with the creators of RegexBuddy. I just really like the product.
Best!
Bill