regex literal syntax
From: Chouser <chou...@gmail.com>
Date: Tue, 7 Oct 2008 18:37:30 -0400
Subject: regex literal syntax

Ok, I know we've been over this before, but nothing was actually done.

For the record:
http://groups.google.com/group/clojure/browse_thread/thread/81b361a4e...

So here is my attempt to formalize a simple proposal.

The reader should take the literal contents of #"..." and pass to
Pattern.compile as a raw string, making no changes to the contents.
That means all backslashes (\) and double quotes (") would be passed
right in.  The only other thing the reader need concern itself with
is that when it sees a \" it should not treat that double-quote as the
end of the pattern, but rather keep on going until it sees a
double-quote that is not preceded by a backslash.  Nevertheless it
would pass both the quoting \ and the following " to Pattern.compile.

That's it. Simple. It works because Java's Pattern itself understands
backslash quoting, including literal chars like backslash and double
quote, hex and octal patterns, as well as other regex patterns.
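
The scanning rule described above can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: read-raw-pattern is a hypothetical helper name, and it works on a plain string holding the text that follows the opening #" rather than on the reader's real input stream.

```clojure
(defn read-raw-pattern
  "Hypothetical helper (not part of Clojure): given the source text that
  follows an opening #\", returns [pattern-text remaining-text].
  A backslash and the character after it are copied through unchanged,
  so an escaped double quote does not end the pattern."
  [s]
  (loop [cs (seq s), acc []]
    (let [[c & more] cs]
      (cond
        ;; ran out of input before the closing quote
        (nil? c) (throw (ex-info "EOF while reading regex" {:input s}))
        ;; unescaped double quote: end of the pattern
        (= c \") [(apply str acc) (apply str more)]
        ;; backslash: keep it and the following character as a pair
        (= c \\) (if-let [[escaped & after] more]
                   (recur after (conj acc c escaped))
                   (throw (ex-info "EOF while reading regex" {:input s})))
        ;; anything else is passed straight through
        :else (recur more (conj acc c))))))

;; Source text  \w*" "foo"  splits into the pattern \w* and the rest.
(read-raw-pattern "\\w*\" \"foo\"")
```

The first element of the result is what would be handed to Pattern.compile, with no characters added or removed.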

Some examples:

1. Simple text
(re-find #"foo" "foo") --> "foo"

2. Pre-defined character class
(re-find #"\w*" "foo!@#$") --> "foo"

3. Special character (regex and string)
(re-find #"\t" "\t") --> "\t"

4. Scary special character (regex only)
Note that the escape sequences available inside #"" are Java Pattern
escape sequences, and therefore by definition different from Clojure
String escape sequences.  Of course this is what you need for \w and
such to work:
(re-find #"\a" "\u0007") --> beep ""

5. Special character (string only)
The reverse of the previous example -- Clojure strings understand "\b"
as (str \backspace), but Java patterns do not, so this example uses
hex instead:
(re-find #"\x08" "\b") --> "\b"

6. Hex
(re-find #"\x31" "1") --> "1"

7. Octal
(re-find #"\061" "1") --> "1"

8. Word boundary:
(re-find #"\bfoo" "foo") --> "foo"

9. Quoting fun -- double quote, a single character:
(re-find #"\"" "\"") --> "\""

10. Quoting fun -- backslash, a single character:
(re-find #"\\" "\\") --> "\\"

11. Open paren
(re-find #"\(" "(") --> "("
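
For anyone who wants to check the examples mechanically, here is a small sketch. It assumes a Clojure build whose reader implements this proposal; proposal-examples and failing-examples are names made up for this illustration.

```clojure
;; Each entry: [pattern input expected-match], taken from the examples above.
;; Assumes the reader passes the contents of #"..." to Pattern.compile raw.
(def proposal-examples
  [[#"foo"   "foo"     "foo"]
   [#"\w*"   "foo!@#$" "foo"]
   [#"\t"    "\t"      "\t"]
   [#"\a"    "\u0007"  "\u0007"]
   [#"\x08"  "\b"      "\b"]
   [#"\x31"  "1"       "1"]
   [#"\061"  "1"       "1"]
   [#"\bfoo" "foo"     "foo"]
   [#"\""    "\""      "\""]
   [#"\\"    "\\"      "\\"]
   [#"\("    "("       "("]])

(defn failing-examples
  "Returns the entries whose re-find result differs from the expected match."
  []
  (remove (fn [[re input expected]] (= expected (re-find re input)))
          proposal-examples))

;; Should be empty if every example behaves as described.
(failing-examples)
```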

I think this demonstrates you can create any pattern you might need.
For reference, here are the above patterns expressed in the current
(not the proposed) reader syntax:

1. #"foo"
2. #"\\w*"
3. #"\t" or #"\\t"
4. #"\\a" (but #"\a" makes the reader blow up)
5. #"\\x08"
6. #"\\x31"
7. #"\061" or #"\\061"
8. #"\\bfoo" (note #"\bfoo" is legal, but doesn't do what you want)
9. #"\"" or #"\\\"" (but #"\\"" blows up the reader)
10. #"\\\\" (but #"\\" is illegal)
11. #"\\(" (but #"\(" is illegal)
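
One way to see the relationship between the two syntaxes: the text that the current syntax spells with string-style escaping is exactly the text the proposed literal hands to Pattern.compile. A rough sketch, again assuming a reader that implements the proposal (pattern-text and same-pattern? are illustrative names only):

```clojure
(defn pattern-text
  "Returns the exact text that was handed to Pattern.compile."
  [^java.util.regex.Pattern p]
  (.pattern p))

;; Each pair: the pattern text written as an ordinary string literal (the
;; escaping the current syntax needs) next to the proposed regex literal.
(def same-pattern?
  (every? (fn [[string-form literal]]
            (= string-form (pattern-text literal)))
          [["\\w*"   #"\w*"]
           ["\\x08"  #"\x08"]
           ["\\bfoo" #"\bfoo"]
           ["\\\""   #"\""]
           ["\\\\"   #"\\"]
           ["\\("    #"\("]]))
```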

Somehow I'm not sure that communicates how much I dislike the current
syntax.  Oh well, maybe others can chime in on that point. I
implemented this to provide the examples above, not because I think
this is a done deal or anything.  Please comment!

Here is a new print method to match the attached patch to LispReader:

(defmethod print-method java.util.regex.Pattern [p w]
  (.write w "#\"")
  (.write w (.pattern p))
  (.write w "\""))

That print method will take a bit more work to properly quote some
Patterns that could be created by means other than the Clojure
literal.

--Chouser

  regex-reader.patch

From: "Michael Beauregard" <mich...@insightfulminds.com>
Date: Tue, 7 Oct 2008 17:02:32 -0600
Subject: Re: regex literal syntax

I love it!


From: Chouser <chou...@gmail.com>
Date: Tue, 7 Oct 2008 23:18:25 -0400
Subject: Re: regex literal syntax
Is it bad etiquette to reply to myself?  I thought it might be useful
to compare the proposed syntax with that of other languages with good
regex support.

I tried all the examples from my previous message in Perl, Python,
Ruby, and JavaScript.  All but Python have literal regex syntax, while
Python has a raw string format that is generally used for regular
expressions.  All but JavaScript have multiple quote characters which
allowed me to use double quotes just like Clojure:

Clojure:  #"foo"
Perl:     m"foo" (although m/foo/ or just /foo/ is more common)
Python:   r"foo" (you can also use r'foo' or r"""foo""")
Ruby:    %r"foo" (or %r/foo/ or just /foo/)
JS:        /foo/

All the examples for the proposed new Clojure syntax work the same in
all these languages (with the exception of example 4 in JavaScript,
where \a means a plain letter a instead of ASCII 7).  If instead you
escape things the way you currently have to in Clojure, many of the
expressions don't work or mean something different in the other
languages.

In other words, under the proposed syntax Clojure regex literals would
be less surprising for people used to any of these other languages.

--Chouser


From: Rich Hickey <richhic...@gmail.com>
Date: Wed, 8 Oct 2008 05:08:31 -0700 (PDT)
Subject: Re: regex literal syntax

On Oct 7, 11:18 pm, Chouser <chou...@gmail.com> wrote:

Will existing Clojure regex (consumer) code need to change, i.e. will
people need to modify their existing #"" literals and if so in what
way?

Rich


From: pc <peng2che...@yahoo.com>
Date: Wed, 8 Oct 2008 05:57:59 -0700 (PDT)
Subject: Re: regex literal syntax
I think this is a great idea.

From: Chouser <chou...@gmail.com>
Date: Wed, 8 Oct 2008 11:03:14 -0400
Subject: Re: regex literal syntax

On Wed, Oct 8, 2008 at 8:08 AM, Rich Hickey <richhic...@gmail.com> wrote:

> Will existing Clojure regex (consumer) code need to change, i.e. will
> people need to modify their existing #"" literals and if so in what
> way?

Yes, many existing #"" literals would have new meaning or become
invalid under this proposal.

Some patterns wouldn't have to be changed.  Here are a couple examples:
#"foo"
#"(one) *(two)"
#"/this.*/that"

By far the most common change people would have to make would be to
remove doubled back-slashes:
old: #"\\w"         new: #"\w"
old: #"\\("         new: #"\("
old: #"\\bword\\b"  new: #"\bword\b"

Reading through Clojure's string reader and Java's Pattern docs, the
only other anomaly I've spotted is if someone was using \b to mean
\backspace.  With the proposed change, #"\b" means word boundary, so:
old: #"\b"  new: #"\x08"

Most of the time, failing to update your regex literal will result in
a valid regex that means something different.  Put another way, things
that used to match like you wanted will just stop matching.  In a few
cases (such as #"\\(") what used to be a valid regex will throw an
exception at read time, with a detailed error message pointing out the
position of the illegal paren.

Of course if this change is unacceptable, these proposed rules could
be applied to a new dispatch macro.  One option would be something
like #r/foo/ that would allow your choice of delimiters to further
reduce the need for back-slash quoting within the regex.

--Chouser


From: "J. McConnell" <jdo...@gmail.com>
Date: Wed, 8 Oct 2008 11:21:14 -0400
Subject: Re: regex literal syntax

On Wed, Oct 8, 2008 at 11:03 AM, Chouser <chou...@gmail.com> wrote:

> Of course if this change is unacceptable, these proposed rules could
> be applied to a new dispatch macro.  One option would be something
> like #r/foo/ that would allow your choice of delimiters to further
> reduce the need for back-slash quoting within the regex.

Personally, I'd vote for this. Allowing for a choice of delimiters is
a very useful feature of Ruby. That said, I'd be happy with the new
#"..." syntax proposed in this thread ... Anything to avoid the
double-escaping.

Thanks for doing this work, Chouser.

- J.


From: Bob <bobnnamt...@gmail.com>
Date: Wed, 8 Oct 2008 14:24:14 -0700 (PDT)
Subject: Re: regex literal syntax
Yes, I like this.  The double backslashing is really a pain and
confused me at first.  Get off that road now while clojure is still
pretty new.

Bob

From: Brian Watkins <WildU...@gmail.com>
Date: Thu, 9 Oct 2008 09:33:17 -0700 (PDT)
Subject: Re: regex literal syntax
Yes, dealing with multiple escaped escape sequences is a weak point of
handling regexes.  To really integrate them into the language, and that
integration is essential to success, they must be typed straight into
the code literally without baggage.

As much as flexibility and simplicity, the easy application of
powerful first class regular expressions to text processing was the
killer app for Perl.  No programmer wants to accept second best
anymore.

-Brian

From: "Stephen C. Gilardi" <squee...@mac.com>
Date: Thu, 09 Oct 2008 13:14:44 -0400
Subject: Re: regex literal syntax

On Oct 8, 2008, at 11:03 AM, Chouser wrote:

> Of course if this change is unacceptable, these proposed rules could
> be applied to a new dispatch macro.  One option would be something
> like #r/foo/ that would allow your choice of delimiters to further
> reduce the need for back-slash quoting within the regex.

I like both proposed changes: new escape rules for #" , and #r with  
arbitrary delimiters. Thanks for moving the issue along.

Comparing the two, I think the arbitrary delimiter allowed by #r is  
very attractive. The only potential downside I see with that is that  
it requires a smarter "clojure mode" in an editor to know how to find  
the end in source code. (I use emacs, so I trust I'll be all set. I  
suspect "clojure mode" in most other editors can be adapted to handle it.)

With #r in place, I would be in favor of leaving #" as it is now, and  
possibly deprecating and removing it over time.

--Steve


From: Paul Barry <pauljbar...@gmail.com>
Date: Thu, 9 Oct 2008 11:03:55 -0700 (PDT)
Subject: Re: regex literal syntax
What about having #"pattern" work like it does now, and then having
#/pattern/ work similarly to Ruby, Python, Perl, etc. regular expressions
in that they do not require double escaping of characters like '\'?  So
in other words:

    #/<em>(.*?)<\/em>/

instead of:

    #"<em>(.*?)<\\/em>"

The advantage to this is that it is backwards compatible.  I don't
think the arbitrary delimiter is as necessary as not having to include
extra escaping characters because it is a string.

From: Chouser <chou...@gmail.com>
Date: Thu, 9 Oct 2008 14:44:08 -0400
Subject: Re: regex literal syntax

On Thu, Oct 9, 2008 at 2:03 PM, Paul Barry <pauljbar...@gmail.com> wrote:

> What about having #"pattern" work like it does now, and then having
> #/pattern/ work similarly to Ruby, Python, Perl, etc. regular expressions
> in that they do not require double escaping of characters like '\'?  So
> in other words:

I would vote against any new syntax that doesn't allow for a
user-chosen delimiter.  If we introduce something new, it should solve
the problem more completely than that.

> The advantage to this is that it is backwards compatible.

That's true and good, and if Rich is open to it, I think #r/foo/ or
#~/foo/ or something would be a great choice, allowing for / or " or
perhaps even () for delimiting the regex.

> I don't think the arbitrary delimiter is as necessary as not having to include
> extra escaping characters because it is a string.

I actually prefer " over / as the only allowed delimiter. Matching
file paths with / as the delimiter is not uncommon, and rather
painful:
#/\/usr\/lib\/*.so/

The contexts where " have to be quoted often don't seem quite as bad:
#"<img src=\"file://foo/bar\">"

--Chouser


From: Rich Hickey <richhic...@gmail.com>
Date: Thu, 9 Oct 2008 12:09:59 -0700 (PDT)
Subject: Re: regex literal syntax

On Oct 9, 2:44 pm, Chouser <chou...@gmail.com> wrote:

Let's stay on track with your first simple proposal - eliminating
escaping of \ and \" being non-terminating.

Arbitrary delimiters beg the question of why not in strings too, and
I think the editor/tools issues are real, as well as the general
inability to grok code with user-defined character interpretation.

The question is, are people willing to deal with the breakage in the
short term?

Could there be a reader warning, if \\ is seen in patterns?

Is there a possible conversion or safe interpretation of existing
patterns (esp. \\\" and \\\\)?

How tricky is the print side?

Rich


From: "Paul Stadig" <p...@stadig.name>
Date: Thu, 9 Oct 2008 16:43:50 -0400
Subject: Re: regex literal syntax

Clojure is still beta-ish, no? I say either go with the breakage, or use the
#r"" without the user chosen delimiters.

The extra backquoting was admittedly a surprise for me at first. It seemed
awkward, and unexpected, so if some breakage occurs, then I think it's
warranted. If breakage is a problem, then I'd vote for going with a new #r""
and deprecating the old syntax.

Paul


From: Chouser <chou...@gmail.com>
Date: Thu, 9 Oct 2008 16:48:29 -0400
Subject: Re: regex literal syntax

On Thu, Oct 9, 2008 at 3:09 PM, Rich Hickey <richhic...@gmail.com> wrote:

> The question is, are people willing to deal with the breakage in the
> short term?

I certainly am, but that may not mean much. :-)
Nobody has spoken out against it yet -- that's got to be a good sign.

> Could there be a reader warning, if \\ is seen in patterns?

Not really; in general such patterns are valid.  For example #"\\w*"
currently means match zero or more word characters.  Proposed, #"\\w*"
would mean match a backslash followed by any number of "w"s.

I suppose we could have a warning that flags "likely" errors, and then
a way to turn it off, but that doesn't sound like a nice solution.

> Is there a possible conversion or safe interpretation of existing
> patterns (esp. \\\" and \\\\)

I'm not quite sure what you mean. Is this part of the previous question?

#"\\\"" Currently matches the single char double-quote. Proposed would
match a backslash followed by a double-quote.

#"\\\\" Currently matches the single char backslash. Proposed would
match a backslash followed by another backslash.

If you mean could old patterns be programmatically converted to new
ones, the answer is yes.  Somehow I don't think you're asking for a
code walker that spits out what it reads, except with old regex
literals replaced with new ones, but I suppose that could be done.
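
A rough sketch of what converting a single literal might look like, assuming the current reader treats the body of #"..." exactly like a string literal. The names old->new and escape-bare-quotes are hypothetical, \Q...\E quoting is not handled, and a control character such as the one produced by "\b" would come through as the raw character rather than as \x08.

```clojure
(require '[clojure.edn :as edn])

(defn escape-bare-quotes
  "Backslash-escapes any double quote that is not already part of a
  backslash pair. Ignores \\Q...\\E quoting, which needs extra care."
  [pattern-text]
  (loop [cs (seq pattern-text), acc []]
    (if-let [[c & more] cs]
      (cond
        ;; backslash pair: copy both characters untouched
        (= c \\) (recur (next more) (conj acc c (first more)))
        ;; bare double quote: add the backslash the new reader needs
        (= c \") (recur more (conj acc \\ \"))
        :else    (recur more (conj acc c)))
      (apply str acc))))

(defn old->new
  "Takes the body of an old-style literal (the text between the quotes)
  and returns the source text of an equivalent new-style literal."
  [old-body]
  (let [pattern-text (edn/read-string (str \" old-body \"))]
    (str "#\"" (escape-bare-quotes pattern-text) "\"")))

;; The old body \\bword\\b becomes the new literal #"\bword\b".
(println (old->new "\\\\bword\\\\b"))
```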

The only pattern I've found that is currently valid that wouldn't be
under the proposal was mentioned in my earlier email: #"\\(" currently
matches a literal open paren.  Proposed would be interpreted as a
backslash and the beginning of a group, but without the closing paren
Pattern/compile will throw an exception.

> How tricky is the print side?

Simple for Patterns created using the proposed literal form, as the
Pattern hangs on to the original string -- just slap #"" around it.

But I suppose that's not quite good enough, as you can build Patterns
other ways: (Pattern/compile "\"")
I think the only cause of trouble is the " char, so the Pattern's
string would have to be scanned looking for " and determining if it's
already quoted.  If quoted, leave it alone; if unquoted, insert a
backslash.  I'll code it up, unless this proposal is already dead on
one of the other points.

Thanks,
--Chouser


From: Randall R Schulz <rsch...@sonic.net>
Date: Thu, 9 Oct 2008 13:54:49 -0700
Subject: Re: regex literal syntax
On Thursday 09 October 2008 12:09, Rich Hickey wrote:

> ...

> The question is, are people willing to deal with the breakage in the
> short term?

I have no skin in this game, so what I say is to be taken with a grain
of salt, but...

If you make this change, millions of Clojure users for all eternity
_will_ thank you.

On the other hand, if you make this change, a couple of dozen users
_may_ mutter something under their breath at you once or twice and then
acknowledge that the change was a good thing.

> ...

> Rich

Randall Schulz


From: Rich Hickey <richhic...@gmail.com>
Date: Thu, 9 Oct 2008 13:57:38 -0700 (PDT)
Subject: Re: regex literal syntax

On Oct 9, 4:48 pm, Chouser <chou...@gmail.com> wrote:

Go for it.

Rich


From: Chouser <chou...@gmail.com>
Date: Thu, 9 Oct 2008 23:14:35 -0400
Subject: Re: regex literal syntax

On Thu, Oct 9, 2008 at 4:57 PM, Rich Hickey <richhic...@gmail.com> wrote:

> Go for it.

(defmethod print-method java.util.regex.Pattern [p #^Writer w]
  (.write w "#\"")
  (loop [[#^Character c & r :as s] (seq (.pattern #^java.util.regex.Pattern p))]
    (when s
      (cond
        (= c \\) (do (.append w \\)
                     (.append w #^Character (first r))
                     (recur (rest r)))
        (= c \") (do (.write w "\\\"")
                     (recur r))
        :else    (do (.append w c)
                     (recur r)))))
  (.append w \"))

--Chouser


From: Christophe Grand <christo...@cgrand.net>
Date: Fri, 10 Oct 2008 10:07:41 +0200
Subject: Re: regex literal syntax
Chouser wrote:

Hello Chouser, (btw, nice work you are doing with ClojureScript)

Do you plan to support printing of all Pattern instances or only those
created using the new literal syntax? In the first case, the story gets
more complex with \Q, \E and flags. In the second case, that's fine.

Christophe


From: Chouser <chou...@gmail.com>
Date: Fri, 10 Oct 2008 07:48:36 -0400
Subject: Re: regex literal syntax

On Fri, Oct 10, 2008 at 4:07 AM, Christophe Grand <christo...@cgrand.net> wrote:

> Hello Chouser, (btw, nice work you are doing with ClojureScript)

Thanks -- good to see you around again!

> Do you plan to support printing of all Pattern instances or only those
> created using the new literal syntax? In the first case, the story gets
> more complex with \Q, \E and flags. In the second case, that's fine.

I think I need to support all Patterns, and you're right I haven't
thought sufficiently about \Q and \E.

Interestingly, I don't think there's any way to get a plain
double-quote into a regex after a \Q under the proposal.  I don't
*think* that's a problem, as you're most likely to use \Q when
building a pattern programmatically, and if you're not you can drop
out of \Q mode to get your " in there.
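
To illustrate dropping out of \Q mode, here is a small sketch that assumes a reader implementing the proposal; the var names are invented for this example.

```clojure
;; Leave \Q...\E, match the double quote as \", then re-enter \Q...\E.
(def quoted-with-dq #"\Qa.b\E\"\Qc.d\E")

(re-find quoted-with-dq "a.b\"c.d")   ; matches the whole input
(re-find quoted-with-dq "aXb\"cXd")   ; no match: the dots are literal

;; Inside \Q...\E the backslash is literal too, so this pattern matches a
;; backslash followed by a double quote, not a lone double quote.
(def literal-backslash-dq #"\Q\"\E")

(re-find literal-backslash-dq "say \\\" now")
```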

If anyone disagrees with this conclusion, speak up!

Of course that means I need to do this for you when printing...

--Chouser


From: Chouser <chou...@gmail.com>
Date: Fri, 10 Oct 2008 17:02:12 -0400
Subject: Re: regex literal syntax

On Fri, Oct 10, 2008 at 7:48 AM, Chouser <chou...@gmail.com> wrote:

> Of course that means I need to do this for you when printing...

I've attached an updated patch with a new print method, against the
latest SVN 1058.

If you really want to dig into the ugliness, here's a test I've been
using.  Apply the attached patch, rebuild clojure, and then load up
this code:

(import '(java.util.regex Pattern)
        '(java.io PushbackReader StringReader))
(defn test-re [s & [ts]]
  (let [p1 (Pattern/compile s)
        p2 (read (PushbackReader. (StringReader. (prn-str p1))))]
    (println (str "raw str: " s))
    (print   (str "prn1:  " (prn-str p1)))
    (print   (str "prn2:  " (prn-str p2)))
    (println (if (= (prn-str p1) (prn-str p2)) "PASS" "FAIL"))
    (when ts
      (println (str "match1: " (re-find p1 ts)))
      (println (str "match2: " (re-find p2 ts)))
      (println (if (= (re-find p1 ts) (re-find p2 ts)) "PASS" "FAIL")))
    (println)))

You run it by passing in a string to be compiled by Pattern.  This is
essentially the "old format" without the leading # char:

user=> (test-re "foo")
raw str: foo
prn1:  #"foo"
prn2:  #"foo"
PASS

If my print method is correct, prn1 and prn2 should always be the
same, and thus PASS.  Also what you're seeing there is the "new
format".  In this case, both the old and new formats are the same.

If you pass in an optional second string, test-re will try to match it
with both patterns:

user=> (test-re "a \\\"\\w*\\\" please" "a \"word\" please")
raw str: a \"\w*\" please
prn1:  #"a \"\w*\" please"
prn2:  #"a \"\w*\" please"
PASS
match1: a "word" please
match2: a "word" please
PASS

Again, if my patch is correct, match1 and match2 should be
the same, and thus another PASS.  Here you can see how the new format
(prn1 and prn2) is simpler than the old format, as well as being
identical to what Pattern actually operates on (raw str).

Here's a truly nasty example prompted by Christophe Grand's comment:

user=> (test-re "a\"\\Qb\"c\\d\\Ee\"f" "a\"b\"c\\de\"f")
raw str: a"\Qb"c\d\Ee"f
prn1:  #"a\"\Qb\E\"\Qc\d\Ee\"f"
prn2:  #"a\"\Qb\E\"\Qc\d\Ee\"f"
PASS
match1: a"b"c\de"f
match2: a"b"c\de"f
PASS

If you can find *any* input that produces FAIL, please let me know.

--Chouser

  regex-reader.patch

From: Rich Hickey <richhic...@gmail.com>
Date: Wed, 15 Oct 2008 16:12:26 -0700 (PDT)
Subject: Re: regex literal syntax

On Oct 10, 5:02 pm, Chouser <chou...@gmail.com> wrote:

Patch applied - SVN rev 1070 - thanks!

If you use regex literals, this is a breaking change - you must fix
them as described above (basically, \ doesn't need to be escaped
anymore).

Rich


From: Chouser <chou...@gmail.com>
Date: Wed, 15 Oct 2008 19:41:03 -0400
Subject: Re: regex literal syntax

On Wed, Oct 15, 2008 at 7:12 PM, Rich Hickey <richhic...@gmail.com> wrote:

> Patch applied - SVN rev 1070 - thanks!

Possible reader docs:

Regex patterns (#"pattern")
A regex pattern is read and compiled at read time. The pattern is
passed directly to the java.util.regex.Pattern compile method, so its
quoting rules apply instead of the string literal quoting rules.  This
means that unlike string literals, backslashes in regex patterns do
not need to be escaped with another backslash.  For example, #"\d*"
matches zero or more digits.
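
A short usage sketch to go with the doc text, assuming a build that includes this change (the var name digits is just for illustration):

```clojure
;; The literal is already a compiled java.util.regex.Pattern when read.
(def digits #"\d*")

(instance? java.util.regex.Pattern digits)

;; The pattern text is exactly what was typed between the quotes.
(.pattern digits)

;; Leading run of digits in the input.
(re-find digits "2008-10-15")

;; Every run of one or more digits.
(re-seq #"\d+" "2008-10-15")
```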

--Chouser

