Where to draw the line - re-dux


David Miller

Jun 27, 2009, 6:40:33 PM
to Clojure Dev
When last we talked
(http://groups.google.com/group/clojure-dev/t/1dfd473e6921ba0d?hl=en),
we addressed the subtle question of where in
core.clj one crosses the line between providing core Clojure
functionality that ClojureCLR needs to reproduce exactly versus
providing access to the Java libs--presumably ClojureCLR does not need
to provide rewrites of all that. (As was stated.)

In that conversation, BigDecimal won. It is not clear there was any
consensus on resultset-seq.

But onward. re-dux = let's talk regular expressions.

The set of re-* functions in core.clj--re-pattern, re-matcher,
re-groups, re-seq, re-matches, re-find--are actually quite
java.util.regex-centric. There is sufficient mismatch compared to
the System.Text.RegularExpressions classes that I cannot expose them
directly and get these functions to work as advertised.

(For those who care, an example: the Java Matcher class starts out not
having made a match. The CLR Match class represents the first match
having already been made. If you do (re-groups (re-matcher #"\w+"
"abc")) in Clojure, it will throw an exception. In CLR, if re-matcher
returns a Match, the re-groups call will succeed. This one-off
problem messes up a number of these functions.)
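
A minimal sketch of the JVM side of this mismatch, assuming stock JVM Clojure (the var names are only for illustration). A fresh Matcher has not attempted a match, so re-groups throws until re-find has advanced it:

```clojure
;; JVM Clojure only: java.util.regex.Matcher starts with no match made.
(def m (re-matcher #"\w+" "abc"))

;; Calling re-groups before any find attempt throws IllegalStateException.
(def fresh-result
  (try
    (re-groups m)
    (catch IllegalStateException _ :no-match-yet)))

;; re-find advances the (mutable) matcher to the first match...
(def found (re-find m))

;; ...after which re-groups succeeds on the same matcher.
(def after-find (re-groups m))
```

A CLR Match, by contrast, would already be positioned on the first match at the point where the fresh Matcher above still throws.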

Thus, literal exposure of the CLR regex classes will not map into
these functions as described.

So, should I:

(1) create my own shim classes to fake a certain amount (I hope not
all) of the functionality of the Pattern and Matcher classes from
java.util.regex, just enough to get the re-* functions working,
(2) ignore the existing re-* functions and implement a set of re-*
functions designed to work properly with Regex, Match, etc. in CLR, or
(3) (1) + (2)?

In other words, are the re-* functions Clojure-core or are they an
interface to some JVM functionality?

Two quick notes:

(1) the re-* functions are used exactly once in core.clj, in
'find-doc. There are three occurrences in the new (GTIC) test framework.
(2) I have remarkably few instances of this kind of problem to consider.

-- David

Shawn Hoover

Jun 29, 2009, 4:13:47 PM
to cloju...@googlegroups.com
On Sat, Jun 27, 2009 at 6:40 PM, David Miller <dmill...@gmail.com> wrote:

The set of re-* functions in core.clj--re-pattern, re-matcher, re-
groups, re-seq, re-matches, re-find--are actually quite
java.util.regex-centric.    There is sufficient mismatch compared to
the System.Text.RegularExpressions  classes that I cannot expose them
directly and get these functions to work as advertised.

(For those who care, an example: the Java Matcher class starts out not
having made a match.  The CLR Match class represents the first match
having already been made.  If you do  (re-groups (re-matcher #"\w+"
"abc")) in Clojure, it will throw an exception.  In CLR, if re-matcher
returns a Match, the re-groups call will succeed.  This one-off
problem messes up a number of these functions.)

Thus, literal exposure of the CLR regex classes will not map into
these functions as described.

So, should I:

(1) create my own shim classes to fake a certain amount (I hope not
all) of the functionality of the Pattern and Matcher classes from
java.util.regex, just enough to get the re* functions working,
(2) Ignore the existing re-* functions and implement a set of re-*
functions designed to work properly with Regex, Match, etc. in CLR?
(3)  (1) + (2)

In other words, are the re-* functions Clojure-core or are they an
interface to some JVM functionality?

The literal syntax #"", re-pattern, and the Clojure data structure-based abstraction of re-seq are general and useful with the only leak being you have to use the right pattern syntax for your platform. Fortunately .NET's Regex class also supports a (?i) grouping construct in the literal so we don't have to add parameters to re-pattern for multiline or case-insensitivity.

The other re-* functions support re-seq and introduce the difficulties you mentioned.
  • re-groups: The example you gave with re-groups is not a problem if you treat re-groups as an implementation detail and just use the higher level functions--re-seq and re-find run their results through re-groups anyway. Therefore re-groups could just as well be private.
  • re-find: The two-argument form is a nice convenience to get the first match as a Clojure data structure, but I'm not sure about the one-argument form. Is it worth keeping a mutable matcher object around and looping on it when we already have re-seq?
  • re-matcher: Could be made private or stick around as a convenience for getting at Regex.Matches when you need something the core API doesn't offer. Or it could return Regex.Match or a shim if there's a good reason to keep re-find.
  • re-matches: I don't see a need for this. It sounds like a pluralized match, but it's really a test for "does the pattern match the entire input" (it returns the match or nil), which I can accomplish painlessly by anchoring the pattern with ^ and $.
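
To make the re-matches point concrete, here is a small sketch on JVM Clojure (the var names are made up for illustration). It compares re-matches with an anchored two-argument re-find:

```clojure
;; re-matches succeeds only when the whole input matches.
(def whole         (re-matches #"\d+" "123"))     ; the match itself, not a boolean
(def partial-input (re-matches #"\d+" "123abc"))  ; nil: trailing "abc" is not matched

;; Anchoring the pattern gives the same answers through re-find.
(def anchored      (re-find #"^\d+$" "123"))
(def anchored-miss (re-find #"^\d+$" "123abc"))

;; Without anchors, re-find is satisfied by a partial match.
(def unanchored    (re-find #"\d+" "123abc"))
```

Either form works for logical truth, since a failed match is nil.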
Here are the features I use from System.Text.RegularExpressions and how they could translate to a current or imaginary core api:

Regex.IsMatch(s re), new Regex(re).Match(s).Success
Just use two-argument re-find as is, which works fine for logical truth: (re-find re s)
Or rename it to be more obvious about what I'm using it for: (re-match re s)
Or (first (re-seq re s))

Regex.Replace(s re replacement)
This one doesn't exist in Clojure: (re-replace re s replacement)
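
For readers on a later JVM Clojure: the re-replace name above is hypothetical, but clojure.string/replace (added in Clojure 1.2, after this thread) covers the same ground. A sketch, assuming that namespace is available:

```clojure
(require '[clojure.string :as str])

;; Replace every match of the pattern with a fixed string.
(def replaced (str/replace "aaa-bbb" #"a+" "x"))

;; With a pattern and a string replacement, $1, $2 refer to capture groups.
(def swapped (str/replace "aaa-bbb" #"(a+)-(b+)" "$2-$1"))
```

Note the argument order (string, pattern, replacement) differs from the (re s ...) order used by the re-* functions.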

Regex.Match(s re options).Groups[1]
(when-let [[match group1 & groups] (re-find re s)]
  group1)
(map second (re-seq re s))
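
A runnable version of the two forms above on JVM Clojure, with a made-up pattern and input string chosen only for illustration. When the pattern has capture groups, re-find and re-seq return vectors of [whole-match group ...]:

```clojure
(def s  "k1=v1 k2=v2")
(def re #"(\w+)=(\w+)")

;; First capture group of the first match, via destructuring.
(def first-key
  (when-let [[match group1 & more] (re-find re s)]
    group1))

;; First capture group of every match.
(def all-keys (map second (re-seq re s)))
```

If the pattern has no capture groups, re-find returns a plain string rather than a vector, so the destructuring form only makes sense for patterns with groups.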

Regex.Match(s re options).Groups["groupName"]
Uh, shoot, this one doesn't work with re-seq. Can we make a custom seq of match objects that destructure like vectors AND are callable by group name keywords? Otherwise at this point I need to directly call Regex.Match.Groups or Regex.Matches[0].Groups.
(when-let [match (re-find #"(?<as>a*)-(?<bs>b*)" "aaa-bbb")]
  (:as match))
(when-let [{a :as, b :bs, :as groups} (re-find #"(?<as>a*)-(?<bs>b*)" "aaa-bbb")]
  a)
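
The keyword lookups above are an imagined API. As a rough JVM-side comparison only, assuming Java 7 or later (which added named groups to java.util.regex), the groups can be pulled out by name through interop and put into a map that then destructures as wished. The named-groups helper is hypothetical:

```clojure
;; Hypothetical helper: returns a map of the requested named groups for the
;; first match, or nil when there is no match. JVM only, Java 7+.
(defn named-groups [re s names]
  (let [m (re-matcher re s)]
    (when (.find m)
      (into {} (for [n names]
                 [(keyword n) (.group m (name n))])))))

(def groups (named-groups #"(?<as>a*)-(?<bs>b*)" "aaa-bbb" ["as" "bs"]))

;; The resulting map supports the keyword access sketched above.
(def a-part (:as groups))
(def b-part (let [{b :bs} groups] b))
```

This says nothing about how ClojureCLR should expose Match.Groups; it only shows that a map-shaped result is enough to support the destructuring style in the examples.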

So core re-pattern and re-seq--and two-argument re-find as a shortcut to get the first match--can pretty much work for me. I don't need re-groups or re-matches. The question remaining for me is whether and how to expose the Match used in re-find and re-matcher or force the user to use .NET for that.

Hmm, I'm not sure if that helps or makes it more confusing :) Hopefully the examples are helpful.

Shawn

David Miller

Jun 29, 2009, 10:26:34 PM
to Clojure Dev
Ironically (I suppose), the CLR Match class is a better match (so to
speak) for Clojure than the Java Matcher class, the CLR version being
immutable. It would be perfectly reasonable for re-seq to return a
lazy sequence of Match objects, allowing full access to the goodness
of that class.

At any rate, I already implemented the core re-* functions in
ClojureCLR using a fake match class (clojure.lang.JReMatcher).

Your analysis and examples are fine. Perhaps they could be the basis
for designing a contrib library of CLR-Regex goodness. Assuming
anyone wants to encourage platform-dependent libraries.

David
