regex literal syntax


Chouser

Oct 7, 2008, 6:37:30 PM
to clo...@googlegroups.com
Ok, I know we've been over this before, but nothing was actually done.

For the record:
http://groups.google.com/group/clojure/browse_thread/thread/81b361a4e82602b7/0313c224a480a161

So here is my attempt to formalize a simple proposal.

The reader should take the literal contents of #"..." and pass to
Pattern.compile as a raw string, making no changes to the contents.
That means all backslashes (\) and double quotes (") would be passed
right in. The only other thing the reader need concern itself with,
is that when it sees a \" it should not treat that double-quote as the
end of the pattern, but rather keep on going until it sees a
double-quote that is not preceded by a backslash. Nevertheless it
would pass both the quoting \ and the following " to Pattern.compile.

That's it. Simple. It works because Java's Pattern itself understands
backslash quoting, including literal chars like backslash and double
quote, hex and octal patterns, as well as other regex patterns.
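
As a rough illustration of the scanning rule described above, here is a
minimal sketch in Clojure. It is not the code from the attached patch:
read-raw-pattern is a hypothetical helper that takes the text following
the opening #" and returns the raw pattern text that would be handed to
Pattern.compile unchanged.

```clojure
(defn read-raw-pattern
  "Hypothetical helper, not the LispReader code. Given the characters that
  follow the opening #\" of a literal, returns the raw pattern text up to
  (not including) the first double quote that is not backslash-quoted."
  [s]
  (loop [[c & more] (seq s)
         acc []]
    (cond
      ;; Ran out of input before finding the closing quote.
      (nil? c) (throw (IllegalArgumentException. "EOF while reading regex"))
      ;; An unquoted double quote ends the literal.
      (= c \") (apply str acc)
      ;; Keep the backslash AND the character after it, untouched.
      (= c \\) (recur (rest more) (conj acc c (first more)))
      :else    (recur more (conj acc c)))))

(read-raw-pattern "\\w*\" trailing text") ; => "\\w*" (the three chars \w*)
```

The string passed in the usage line is an ordinary Clojure string, so its
backslashes are doubled there; the value returned holds a single
backslash, which is what Pattern.compile would receive.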

Some examples:

1. Simple text
(re-find #"foo" "foo") --> "foo"

2. Pre-defined character class
(re-find #"\w*" "foo!@#$") --> "foo"

3. Special character (regex and string)
(re-find #"\t" "\t") --> "\t"

4. Scary special character (regex only)
Note that the escape sequences available inside #"" are Java Pattern
escape sequences, and therefore by definition different from Clojure
String escape sequences. Of course this is what you need for \w and
such to work:
(re-find #"\a" "\u0007") --> beep ""

5. Special character (string only)
The reverse of the previous example -- Clojure strings understand "\b"
as (str \backspace), but Java patterns do not, so this example uses
hex instead:
(re-find #"\x08" "\b") --> "\b"

6. Hex
(re-find #"\x31" "1") --> "1"

7. Octal
(re-find #"\061" "1") --> "1"

8. Word boundary:
(re-find #"\bfoo" "foo") --> "foo"

9. Quoting fun -- double quote, a single character:
(re-find #"\"" "\"") --> "\""

10. Quoting fun -- backslash, a single character:
(re-find #"\\" "\\") --> "\\"

11. Open paren
(re-find #"\(" "(") --> "("

I think this demonstrates you can create any pattern you might need.
For reference, here are the above patterns expressed in the current
(not the proposed) reader syntax:

1. #"foo"
2. #"\\w*"
3. #"\t" or #"\\t"
4. #"\\a" (but #"\a" makes the reader blow up)
5. #"\\x08"
6. #"\\x31"
7. #"\061" or #"\\061"
8. #"\\bfoo" (note #"\bfoo" is legal, but doesn't do what you want)
9. #"\"" or #"\\\"" (but #"\\"" blows up the reader)
10. #"\\\\" (but #"\\" is illegal)
11. #"\\(" (but #"\(" is illegal)
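
The relationship between the two lists can be checked with a small sketch,
assuming a Clojure version that already includes the change proposed in
this thread. A string literal passed to re-pattern still goes through
string escaping first, so it needs the same doubled backslashes as the
current (old) literal syntax. The names old-vs-new and same-pattern-text?
are made up for this example.

```clojure
;; Each pair: [pattern built from a string literal (doubled backslashes,
;; as in the old literal syntax), the same pattern as a raw regex literal].
(def old-vs-new
  [[(re-pattern "\\w*")   #"\w*"]
   [(re-pattern "\\x08")  #"\x08"]
   [(re-pattern "\\bfoo") #"\bfoo"]
   [(re-pattern "\\\\")   #"\\"]
   [(re-pattern "\\(")    #"\("]])

(defn same-pattern-text?
  "True when both patterns were compiled from the same pattern text."
  [[from-string from-literal]]
  (= (str from-string) (str from-literal)))

(every? same-pattern-text? old-vs-new) ; => true
```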

Somehow I'm not sure that communicates how much I dislike the current
syntax. Oh well, maybe others can chime in on that point. I
implemented this to provide the examples above, not because I think
this is a done deal or anything. Please comment!

Here is a new print method to match the attached patch to LispReader:

(defmethod print-method java.util.regex.Pattern [p w]
  (.write w "#\"")
  (.write w (.pattern p))
  (.write w "\""))

That print method will take a bit more work to properly quote some
Patterns that could be created by means other than the Clojure
literal.

--Chouser

regex-reader.patch

Michael Beauregard

Oct 7, 2008, 7:02:32 PM
to clo...@googlegroups.com
I love it!

Chouser

Oct 7, 2008, 11:18:25 PM
to clo...@googlegroups.com
Is it bad etiquette to reply to myself? I thought it might be useful
to compare the proposed syntax with that of other languages with good
regex support.

I tried all the examples from my previous message in Perl, Python,
Ruby, and JavaScript. All but Python have literal regex syntax, while
Python has a raw string format that is generally used for regular
expressions. All but JavaScript have multiple quote characters which
allowed me to use double quotes just like Clojure:

Clojure: #"foo"
Perl: m"foo" (although m/foo/ or just /foo/ is more common)
Python: r"foo" (you can also use r'foo' or r"""foo""")
Ruby: %r"foo" (or %r/foo/ or just /foo/)
JS: /foo/

All the examples for the proposed new Clojure syntax work the same in
all these languages (with the exception of example 4 in JavaScript,
where \a means a plain letter a instead of ASCII 7). If instead you
escape things the way you currently have to in Clojure, many of the
expressions don't work or mean something different in the other
languages.

In other words, under the proposed syntax Clojure regex literals would
be less surprising for people used to any of these other languages.

--Chouser

Rich Hickey

Oct 8, 2008, 8:08:31 AM
to Clojure
Will existing Clojure regex (consumer) code need to change, i.e. will
people need to modify their existing #"" literals and if so in what
way?

Rich

pc

Oct 8, 2008, 8:57:59 AM
to Clojure
I think this is a great idea.

Chouser

Oct 8, 2008, 11:03:14 AM
to clo...@googlegroups.com
On Wed, Oct 8, 2008 at 8:08 AM, Rich Hickey <richh...@gmail.com> wrote:
>
> Will existing Clojure regex (consumer) code need to change, i.e. will
> people need to modify their existing #"" literals and if so in what
> way?

Yes, many existing #"" literals would have new meaning or become
invalid under this proposal.

Some patterns wouldn't have to be changed. Here are a couple examples:
#"foo"
#"(one) *(two)"
#"/this.*/that"

By far the most common change people would have to make would be to
remove doubled back-slashes:
old: #"\\w" new: #"\w"
old: #"\\(" new: #"\("
old: #"\\bword\\b" new: #"\bword\b"

Reading through Clojure's string reader and Java's Pattern docs, the
only other anomaly I've spotted is if someone was using \b to mean
\backspace. With the proposed change, #"\b" means word boundary, so:
old: #"\b" new: #"\x08"

Most of the time, failing to update your regex literal will result in
a valid regex that means something different. Put another way, things
that used to match like you wanted will just stop matching. In a few
cases (such as #"\\(") what used to be a valid regex will throw an
exception at read time, with a detailed error message pointing out the
position of the illegal paren.

Of course if this change is unacceptable, these proposed rules could
be applied to a new dispatch macro. One option would be something
like #r/foo/ that would allow your choice of delimiters to further
reduce the need for back-slash quoting within the regex.

--Chouser

J. McConnell

Oct 8, 2008, 11:21:14 AM
to clo...@googlegroups.com
On Wed, Oct 8, 2008 at 11:03 AM, Chouser <cho...@gmail.com> wrote:
>
> Of course if this change is unacceptable, these proposed rules could
> be applied to a new dispatch macro. One option would be something
> like #r/foo/ that would allow your choice of delimiters to further
> reduce the need for back-slash quoting within the regex.

Personally, I'd vote for this. Allowing for a choice of delimiters is
a very useful feature of Ruby. That said, I'd be happy with the new
#"..." syntax proposed in this thread ... Anything to avoid the
double-escaping.

Thanks for doing this work, Chouser.

- J.

Bob

Oct 8, 2008, 5:24:14 PM
to Clojure
Yes, I like this. The double backslashing is really a pain and
confused me at first. Get off that road now while clojure is still
pretty new.

Bob


Brian Watkins

Oct 9, 2008, 12:33:17 PM
to Clojure
Yes, dealing with multiple escaped escape sequences is a weak point of
handling regexs. To really integrate them into the language, and that
integration is essential to success, they must be typed straight into
the code literally without baggage.

As much as flexibility and simplicity, the easy application of
powerful first class regular expressions to text processing was the
killer app for Perl. No programmer wants to accept second best
anymore.

-Brian

Stephen C. Gilardi

Oct 9, 2008, 1:14:44 PM
to clo...@googlegroups.com

On Oct 8, 2008, at 11:03 AM, Chouser wrote:

> Of course if this change is unacceptable, these proposed rules could
> be applied to a new dispatch macro. One option would be something
> like #r/foo/ that would allow your choice of delimiters to further
> reduce the need for back-slash quoting within the regex.

I like both proposed changes: new escape rules for #" , and #r with arbitrary delimiters. Thanks for moving the issue along.

Comparing the two, I think the arbitrary delimiter allowed by #r is very attractive. The only potential downside I see with that is that it requires a smarter "clojure mode" in an editor to know how to find the end in source code. (I use emacs, so I trust I'll be all set. I suspect "clojure mode" in most other editors can be adapted to handle it.)

With #r in place, I would be in favor of leaving #" as it is now, and possibly deprecating and removing it over time.

--Steve

Paul Barry

Oct 9, 2008, 2:03:55 PM
to Clojure
What about having #"pattern" work like it does now, and then having
#/pattern/ work similarly to Ruby, Python, Perl, etc. regular expressions
in that they do not require double escaping of characters like '\'? So
in other words:

#/<em>(.*?)<\/em>/

instead of:

#"<em>(.*?)<\\/em>"

The advantage to this is that it is backwards compatible. I don't
think the arbitrary delimiter is as necessary as not having to include
extra escaping characters because it is a string.

Chouser

Oct 9, 2008, 2:44:08 PM
to clo...@googlegroups.com
On Thu, Oct 9, 2008 at 2:03 PM, Paul Barry <paulj...@gmail.com> wrote:
>
> What about having #"pattern" work like is does now, and then having #/
> pattern/ work similarly to Ruby, Python, Perl, etc. regular expression
> in that they not require double escaping of characters like '\'? So
> in other words:

I would vote against any new syntax that doesn't allow for a
user-chosen delimiter. If we introduce something new, it should solve
the problem more completely than that.

> The advantage to this is that it is backwards compatible.

That's true and good, and if Rich is open to it, I think #r/foo/ or
#~/foo/ or something would be a great choice, allowing for / or " or
perhaps even () for delimiting the regex.

> I don't think the arbitrary delimiter is as necessary as not having to include
> extra escaping characters because it is a string.

I actually prefer " over / as the only allowed delimiter. Matching
file paths with / as the delimiter is not uncommon, and rather
painful:
#/\/usr\/lib\/*.so/

The contexts where " have to be quoted often don't seem quite as bad:
#"<img src=\"file://foo/bar\">"

--Chouser

Rich Hickey

Oct 9, 2008, 3:09:59 PM
to Clojure


Let's stay on track with your first simple proposal - eliminating
escaping of \ and \" being non-terminating.

Arbitrary delimiters begs the question of why not in strings too, and
I think the editor/tools issues are real, as well as the general
inability to grok code with user-defined character interpretation.

The question is, are people willing to deal with the breakage in the
short term?

Could there be a reader warning, if \\ is seen in patterns?

Is there a possible conversion or safe interpretation of existing
patterns (esp. \\\" and \\\\)?

How tricky is the print side?

Rich

Paul Stadig

Oct 9, 2008, 4:43:50 PM
to clo...@googlegroups.com
Clojure is still beta-ish, no? I say either go with the breakage, or use the #r"" without the user chosen delimiters.

The extra backquoting was admittedly a surprise for me at first. It seemed awkward, and unexpected, so if some breakage occurs, then I think it's warranted. If breakage is a problem, then I'd vote for going with a new #r"" and deprecating the old syntax.


Paul

Chouser

Oct 9, 2008, 4:48:29 PM
to clo...@googlegroups.com
On Thu, Oct 9, 2008 at 3:09 PM, Rich Hickey <richh...@gmail.com> wrote:
>
> The question is, are people willing to deal with the breakage in the
> short term?

I certainly am, but that may not mean much. :-)
Nobody has spoken out against it yet -- that's got to be a good sign.

> Could there be a reader warning, if \\ is seen in patterns?

Not really; in general such patterns are valid. For example #"\\w*"
currently means match zero or more word characters. Proposed, #"\\w*"
would mean match a backslash followed by any number of "w"s.
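
The silent change of meaning can be seen directly in a Clojure version
that includes the proposed reader behaviour. This is only an illustrative
sketch; the names text, doubled and single are invented for the example.

```clojure
;; The text to search: a, one backslash, three w's, a space, b.
(def text "a\\www b")

;; Under the proposed rules the doubled form is a literal backslash
;; followed by zero or more w's.
(def doubled #"\\w*")

;; The single-backslash form is the usual "zero or more word characters".
(def single #"\w*")

(re-find doubled text) ; => "\\www" (one backslash, then www)
(re-find single text)  ; => "a"
```

Both literals are valid patterns, which is why a reader warning could not
reliably tell an un-migrated literal from an intentional one.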

I suppose we could have a warning that flags "likely" errors, and then
a way to turn it off, but that doesn't sound like a nice solution.

> Is there a possible conversion or safe interpretation of existing
> patterns (esp. \\\" and \\\\)

I'm not quite sure what you mean. Is this part of the previous question?

#"\\\"" Currently matches the single char double-quote. Proposed would
match a backslash followed by a double-quote.

#"\\\\" Currently matches the single char backslash. Proposed would
match a backslash followed by another backslash.

If you mean could old patterns be programmatically converted to new
ones, the answer is yes. Somehow I don't think you're asking for a
code walker that spits out what it reads, except with old regex
literals replaced with new ones, but I suppose that could be done.

The only pattern I've found that is currently valid that wouldn't be
under the proposal was mentioned in my earlier email: #"\\(" currently
matches a literal open paren. Proposed would be interpreted as a
backslash and the beginning of a group, but without the closing paren
Pattern/compile will throw an exception.

> How tricky is the print side?

Simple for Patterns created using the proposed literal form, as the
Pattern hangs on to the original string -- just slap #"" around it.

But I suppose that's not quite good enough, as you can build Patterns
other ways: (Pattern/compile "\"")
I think the only cause of trouble is the " char, so the Pattern's
string would have to be scanned looking for " and determining if it's
already quoted. If quoted, leave it alone; if unquoted, insert a
backslash. I'll code it up, unless this proposal is already dead on
one of the other points.
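
The scan described above could look roughly like the following sketch. It
works on strings rather than on a Writer, it deliberately ignores \Q...\E
regions, and escape-bare-quotes and pattern->literal are hypothetical
names used only for this illustration.

```clojure
(defn escape-bare-quotes
  "Hypothetical helper: returns the pattern text with a backslash inserted
  before every double quote that is not already backslash-quoted.
  Does not handle \\Q...\\E regions."
  [pattern-text]
  (loop [[c & more] (seq pattern-text)
         acc []]
    (cond
      (nil? c) (apply str acc)
      ;; A backslash quotes the next character: copy both and skip ahead.
      (= c \\) (recur (rest more) (conj acc c (first more)))
      ;; A bare double quote needs a backslash in front of it.
      (= c \") (recur more (conj acc \\ \"))
      :else    (recur more (conj acc c)))))

(defn pattern->literal
  "Wraps the escaped pattern text in #\"...\" as a printer would."
  [p]
  (str "#\"" (escape-bare-quotes (str p)) "\""))

;; A Pattern built from a string holding one bare double quote
;; comes out as the five characters #"\""
(pattern->literal (re-pattern "\""))
```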

Thanks,
--Chouser

Randall R Schulz

Oct 9, 2008, 4:54:49 PM
to clo...@googlegroups.com
On Thursday 09 October 2008 12:09, Rich Hickey wrote:
> ...

>
> The question is, are people willing to deal with the breakage in the
> short term?

I have no skin in this game, so what I say is to be taken with a grain
of salt, but...

If you make this change, millions of Clojure users for all eternity
_will_ thank you.

On the other hand, if you make this change, a couple of dozen users
_may_ mutter something under their breath at you once or twice and then
acknowledge that the change was a good thing.


> ...
>
> Rich


Randall Schulz

Rich Hickey

Oct 9, 2008, 4:57:38 PM
to Clojure


Go for it.

Rich

Chouser

Oct 9, 2008, 11:14:35 PM
to clo...@googlegroups.com
On Thu, Oct 9, 2008 at 4:57 PM, Rich Hickey <richh...@gmail.com> wrote:
>
> Go for it.

(defmethod print-method java.util.regex.Pattern [p #^Writer w]
  (.write w "#\"")
  (loop [[#^Character c & r :as s] (seq (.pattern #^java.util.regex.Pattern p))]
    (when s
      (cond
        (= c \\) (do (.append w \\)
                     (.append w #^Character (first r))
                     (recur (rest r)))
        (= c \") (do (.write w "\\\"")
                     (recur r))
        :else (do (.append w c)
                  (recur r)))))
  (.append w \"))

--Chouser

Christophe Grand

Oct 10, 2008, 4:07:41 AM
to clo...@googlegroups.com
Hello Chouser, (btw, nice work you are doing with ClojureScript)

Do you plan to support printing of all Pattern instances or only those
created using the new literal syntax? In the first case, the story gets
more complex with \Q, \E and flags. In the second case, that's fine.

Christophe

Chouser

Oct 10, 2008, 7:48:36 AM
to clo...@googlegroups.com
On Fri, Oct 10, 2008 at 4:07 AM, Christophe Grand <chris...@cgrand.net> wrote:
>
> Hello Chouser, (btw, nice work you are doing with ClojureScript)

Thanks -- good to see you around again!

> Do you plan to support printing of all Pattern instances or only those
> created using the new literal syntax? In the first case, the story gets
> more complex with \Q, \E and flags. In the second case, that's fine.

I think I need to support all Patterns, and you're right I haven't
thought sufficiently about \Q and \E.

Interestingly, I don't think there's any way to get a plain
double-quote into a regex after a \Q under the proposal. I don't
*think* that's a problem, as you're most likely to use \Q when
building a pattern programmatically, and if you're not you can drop
out of \Q mode to get your " in there.

If anyone disagrees with this conclusion, speak up!

Of course that means I need to do this for you when printing...

--Chouser

Chouser

Oct 10, 2008, 5:02:12 PM
to clo...@googlegroups.com
On Fri, Oct 10, 2008 at 7:48 AM, Chouser <cho...@gmail.com> wrote:
>
> Of course that means I need to do this for you when printing...

I've attached an updated patch with a new print method, against the
latest SVN 1058.

If you really want to dig into the ugliness, here's a test I've been
using. Apply the attached patch, rebuild clojure, and then load up
this code:

(import '(java.util.regex Pattern)
        '(java.io PushbackReader StringReader))
(defn test-re [s & [ts]]
  (let [p1 (Pattern/compile s)
        p2 (read (PushbackReader. (StringReader. (prn-str p1))))]
    (println (str "raw str: " s))
    (print (str "prn1: " (prn-str p1)))
    (print (str "prn2: " (prn-str p2)))
    (println (if (= (prn-str p1) (prn-str p2)) "PASS" "FAIL"))
    (when ts
      (println (str "match1: " (re-find p1 ts)))
      (println (str "match2: " (re-find p2 ts)))
      (println (if (= (re-find p1 ts) (re-find p2 ts)) "PASS" "FAIL")))
    (println)))

You run it by passing in a string to be compiled by Pattern. This is
essentially the "old format" without the leading # char:

user=> (test-re "foo")
raw str: foo
prn1: #"foo"
prn2: #"foo"
PASS

If my print method is correct, prn1 and prn2 should always be the
same, and thus PASS. Also what you're seeing there is the "new
format". In this case, both the old and new formats are the same.

If you pass in an optional second string, test-re will try to match it
with both patterns:

user=> (test-re "a \\\"\\w*\\\" please" "a \"word\" please")
raw str: a \"\w*\" please
prn1: #"a \"\w*\" please"
prn2: #"a \"\w*\" please"
PASS
match1: a "word" please
match2: a "word" please
PASS

Again, if my patch is correct, match1 and match2 should be
the same, and thus another PASS. Here you can see how the new format
(prn1 and prn2) is simpler than the old format, as well as being
identical to what Pattern actually operates on (raw str).

Here's a truly nasty example prompted by Christophe Grand's comment:

user=> (test-re "a\"\\Qb\"c\\d\\Ee\"f" "a\"b\"c\\de\"f")
raw str: a"\Qb"c\d\Ee"f
prn1: #"a\"\Qb\E\"\Qc\d\Ee\"f"
prn2: #"a\"\Qb\E\"\Qc\d\Ee\"f"
PASS
match1: a"b"c\de"f
match2: a"b"c\de"f
PASS

If you can find *any* input that produces FAIL, please let me know.
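
For readers who want to follow the \Q...\E handling without applying the
patch, here is a string-based sketch of the idea: track whether the scan
is inside a \Q region, and when a double quote appears there, leave quote
mode with \E, emit \", and re-enter with \Q. This is an assumption-laden
illustration rather than the patch itself; printable-pattern-text is a
hypothetical name, and corner cases such as a doubled backslash directly
before E inside a \Q region have not been checked.

```clojure
(defn printable-pattern-text
  "Hypothetical helper: rewrites pattern text so that every double quote is
  backslash-quoted, stepping out of \\Q...\\E regions where necessary."
  [pattern-text]
  (loop [[c & more] (seq pattern-text)
         quoting? false   ; true while inside a \Q ... \E region
         acc []]
    (cond
      (nil? c) (apply str acc)
      ;; Copy a backslash pair untouched; \Q enters quote mode, \E leaves it.
      (= c \\) (let [c2 (first more)]
                 (recur (rest more)
                        (if quoting? (not= c2 \E) (= c2 \Q))
                        (conj acc c c2)))
      ;; Inside \Q...\E a backslash would be literal, so close the region,
      ;; emit \" and reopen it. Outside, a plain \" is enough.
      (= c \") (recur more
                      quoting?
                      (into acc (if quoting? "\\E\\\"\\Q" "\\\"")))
      :else    (recur more quoting? (conj acc c)))))

;; The "truly nasty" input from the example above:
(printable-pattern-text "a\"\\Qb\"c\\d\\Ee\"f")
;; => the text a\"\Qb\E\"\Qc\d\Ee\"f, matching prn1 in the example
```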

--Chouser

regex-reader.patch

Rich Hickey

Oct 15, 2008, 7:12:26 PM
to Clojure


Patch applied - SVN rev 1070 - thanks!

If you use regex literals, this is a breaking change - you must fix
them as described above (basically, \ doesn't need to be escaped
anymore).

Rich

Chouser

Oct 15, 2008, 7:41:03 PM
to clo...@googlegroups.com
On Wed, Oct 15, 2008 at 7:12 PM, Rich Hickey <richh...@gmail.com> wrote:
>
> Patch applied - SVN rev 1070 - thanks!

Possible reader docs:

Regex patterns (#"pattern")
A regex pattern is read and compiled at read time. The pattern is
passed directly to the java.util.regex.Pattern compile method, so its
quoting rules apply instead of the string literal quoting rules. This
means that unlike string literals, backslashes in regex patterns do
not need to be escaped with another backslash. For example, #"\d*"
matches zero or more digits.
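
A short example that could accompany such docs, assuming the reader
behaviour described in this thread. The var name digits is made up for
the illustration.

```clojure
;; The literal holds exactly the three characters \d*
(def digits #"\d*")

(str digits)                              ; the pattern text, \d*
(re-find digits "2008-10-15")             ; => "2008"
(re-seq #"\d+" "rev 1070, was 1058")      ; => ("1070" "1058")

;; The same pattern from a string literal needs the backslash doubled,
;; because string-literal quoting rules apply there.
(= (str digits) (str (re-pattern "\\d*"))) ; => true
```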

--Chouser
