For the record:
http://groups.google.com/group/clojure/browse_thread/thread/81b361a4e82602b7/0313c224a480a161
So here is my attempt to formalize a simple proposal.
The reader should take the literal contents of #"..." and pass them to
Pattern.compile as a raw string, making no changes to the contents.
That means all backslashes (\) and double quotes (") would be passed
right in. The only other thing the reader need concern itself with
is that when it sees a \" it should not treat that double-quote as the
end of the pattern, but rather keep on going until it sees a
double-quote that is not preceded by a backslash. Nevertheless it
would pass both the quoting \ and the following " to Pattern.compile.
That's it. Simple. It works because Java's Pattern itself understands
backslash quoting, including literal chars like backslash and double
quote, hex and octal patterns, as well as other regex patterns.
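The scanning rule above can be sketched in Clojure itself. This is only an illustration, not the attached LispReader patch: `read-raw-pattern` is a hypothetical name, and it works on a string holding the characters that follow the opening #" rather than on a real reader stream. It assumes a backslash always takes the next character with it, which is what lets an escaped backslash sit right before the closing quote.

```clojure
(defn read-raw-pattern
  "Hypothetical helper: src holds the characters that follow the opening
  #\" of a regex literal. Returns the raw pattern text, backslashes
  included, up to the first double quote that is not backslash-quoted."
  [^String src]
  (loop [i 0, acc []]
    (if (>= i (count src))
      (throw (IllegalArgumentException. "EOF while reading regex"))
      (let [c (.charAt src i)]
        (cond
          ;; An unquoted double quote ends the literal.
          (= c \") (apply str acc)
          ;; A backslash and the character after it are both kept as-is.
          (= c \\) (if (>= (inc i) (count src))
                     (throw (IllegalArgumentException. "EOF while reading regex"))
                     (recur (+ i 2) (conj acc c (.charAt src (inc i)))))
          :else (recur (inc i) (conj acc c)))))))

;; Source text  \w*" trailing  yields the three-character pattern \w*
(read-raw-pattern "\\w*\" trailing")   ; => "\\w*"
;; Source text  \""  yields the two-character pattern \"
(read-raw-pattern "\\\"\"")            ; => "\\\""
```

Note that the arguments here are ordinary Clojure strings, so they are themselves double-escaped; the comments show the text the reader would actually see.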
Some examples:
1. Simple text
(re-find #"foo" "foo") --> "foo"
2. Pre-defined character class
(re-find #"\w*" "foo!@#$") --> "foo"
3. Special character (regex and string)
(re-find #"\t" "\t") --> "\t"
4. Scary special character (regex only)
Note that the escape sequences available inside #"" are Java Pattern
escape sequences, and therefore by definition different from Clojure
String escape sequences. Of course this is what you need for \w and
such to work:
(re-find #"\a" "\u0007") --> beep ""
5. Special character (string only)
The reverse of the previous example -- Clojure strings understand "\b"
as (str \backspace), but Java patterns do not, so this example uses
hex instead:
(re-find #"\x08" "\b") --> "\b"
6. Hex
(re-find #"\x31" "1") --> "1"
7. Octal
(re-find #"\061" "1") --> "1"
8. Word boundary:
(re-find #"\bfoo" "foo") --> "foo"
9. Quoting fun -- double quote, a single character:
(re-find #"\"" "\"") --> "\""
10. Quoting fun -- backslash, a single character:
(re-find #"\\" "\\") --> "\\"
11. Open paren
(re-find #"\(" "(") --> "("
I think this demonstrates you can create any pattern you might need.
For reference, here are the above patterns expressed in the current
(not the proposed) reader syntax:
1. #"foo"
2. #"\\w*"
3. #"\t" or #"\\t"
4. #"\\a" (but #"\a" makes the reader blow up)
5. #"\\x08"
6. #"\\x31"
7. #"\061" or #"\\061"
8. #"\\bfoo" (note #"\bfoo" is legal, but doesn't do what you want)
9. #"\"" or #"\\\"" (but #"\\"" blows up the reader)
10. #"\\\\" (but #"\\" is illegal)
11. #"\\(" (but #"\(" is illegal)
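One way to see why the current syntax needs the doubled backslashes: the text inside #"..." is presently read with string-literal rules before Pattern sees it, so it behaves much like passing an ordinary string literal to Pattern/compile. The sketch below leans on that equivalence (so it needs no reader patch) to check items 2 and 8. The var names are made up for illustration.

```clojure
(import 'java.util.regex.Pattern)

;; Item 2: the string literal "\\w*" holds the three characters \w*
(def word-chars (Pattern/compile "\\w*"))

;; Item 8: "\\bfoo" reaches Pattern as \bfoo (word boundary, then foo)...
(def boundary-foo (Pattern/compile "\\bfoo"))
;; ...while "\bfoo" reaches Pattern as a backspace character followed by foo.
(def backspace-foo (Pattern/compile "\bfoo"))

(re-find word-chars "foo!@#$")    ; => "foo"
(re-find boundary-foo "foo")      ; => "foo"
(re-find backspace-foo "foo")     ; => nil
(re-find backspace-foo "\bfoo")   ; => "\bfoo"
```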
Somehow I'm not sure that communicates how much I dislike the current
syntax. Oh well, maybe others can chime in on that point. I
implemented this to provide the examples above, not because I think
this is a done deal or anything. Please comment!
Here is a new print method to match the attached patch to LispReader:
(defmethod print-method java.util.regex.Pattern [p w]
  (.write w "#\"")
  (.write w (.pattern p))
  (.write w "\""))
That print method will take a bit more work to properly quote some
Patterns that could be created by means other than the Clojure
literal.
--Chouser
I tried all the examples from my previous message in Perl, Python,
Ruby, and JavaScript. All but Python have literal regex syntax, while
Python has a raw string format that is generally used for regular
expressions. All but JavaScript have multiple quote characters which
allowed me to use double quotes just like Clojure:
Clojure: #"foo"
Perl: m"foo" (although m/foo/ or just /foo/ is more common)
Python: r"foo" (you can also use r'foo' or r"""foo""")
Ruby: %r"foo" (or %r/foo/ or just /foo/)
JS: /foo/
All the examples for the proposed new Clojure syntax work the same in
all these languages (with the exception of example 4 in JavaScript,
where \a means a plain letter a instead of ASCII 7). If instead you
escape things the way you currently have to in Clojure, many of the
expressions don't work or mean something different in the other
languages.
In other words, under the proposed syntax Clojure regex literals would
be less surprising for people used to any of these other languages.
--Chouser
Yes, many existing #"" literals would have new meaning or become
invalid under this proposal.
Some patterns wouldn't have to be changed. Here are a couple examples:
#"foo"
#"(one) *(two)"
#"/this.*/that"
By far the most common change people would have to make would be to
remove doubled back-slashes:
old: #"\\w" new: #"\w"
old: #"\\(" new: #"\("
old: #"\\bword\\b" new: #"\bword\b"
Reading through Clojure's string reader and Java's Pattern docs, the
only other anomaly I've spotted is if someone was using \b to mean
\backspace. With the proposed change, #"\b" means word boundary, so:
old: #"\b" new: #"\x08"
Most of the time, failing to update your regex literal will result in
a valid regex that means something different. Put another way, things
that used to match like you wanted will just stop matching. In a few
cases (such as #"\\(") what used to be a valid regex will throw an
exception at read time, with a detailed error message pointing out the
position of the illegal paren.
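To make the two failure modes concrete, here is a small sketch. It hands Pattern/compile the raw text that an un-updated literal would pass along under the proposed rules, written as ordinary string literals so it runs without the reader patch. The var names are illustrative only.

```clojure
(import '(java.util.regex Pattern PatternSyntaxException))

;; An un-updated #"\\w" would hand Pattern the raw text \\w
;; (a literal backslash followed by the letter w).
(def stale-word (Pattern/compile "\\\\w"))

(re-find stale-word "word")   ; => nil, it silently stops matching
(re-find stale-word "\\w")    ; => "\\w"

;; An un-updated #"\\(" would hand Pattern the raw text \\(
;; (a literal backslash, then an unclosed group), which Pattern rejects.
(def stale-paren-error
  (try
    (Pattern/compile "\\\\(")
    nil
    (catch PatternSyntaxException e
      (.getMessage e))))
```

After this runs, `stale-paren-error` holds the message from the PatternSyntaxException rather than nil.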
Of course if this change is unacceptable, these proposed rules could
be applied to a new dispatch macro. One option would be something
like #r/foo/ that would allow your choice of delimiters to further
reduce the need for back-slash quoting within the regex.
--Chouser
Personally, I'd vote for this. Allowing for a choice of delimiters is
a very useful feature of Ruby. That said, I'd be happy with the new
#"..." syntax proposed in this thread ... Anything to avoid the
double-escaping.
Thanks for doing this work, Chouser.
- J.
> Of course if this change is unacceptable, these proposed rules could
> be applied to a new dispatch macro. One option would be something
> like #r/foo/ that would allow your choice of delimiters to further
> reduce the need for back-slash quoting within the regex.
I would vote against any new syntax that doesn't allow for a
user-chosen delimiter. If we introduce something new, it should solve
the problem more completely than that.
> The advantage to this is that it is backwards compatible.
That's true and good, and if Rich is open to it, I think #r/foo/ or
#~/foo/ or something would be a great choice, allowing for / or " or
perhaps even () for delimiting the regex.
> I don't think the arbitrary delimiter is as necessary as not having to include
> extra escaping characters because it is a string.
I actually prefer " over / as the only allowed delimiter. Matching
file paths with / as the delimiter is not uncommon, and rather
painful:
#/\/usr\/lib\/*.so/
The contexts where " have to be quoted often don't seem quite as bad:
#"<img src=\"file://foo/bar\">"
--Chouser
I certainly am, but that may not mean much. :-)
Nobody has spoken out against it yet -- that's got to be a good sign.
> Could there be a reader warning, if \\ is seen in patterns?
Not really, in general such patterns are valid. For example #"\\w*"
currently means match zero or more word characters. Proposed, #"\\w*"
would mean match a backslash followed by any number of "w"s.
I suppose we could have a warning that flags "likely" errors, and then
a way to turn it off, but that doesn't sound like a nice solution.
> Is there a possible conversion or safe interpretation of existing
> patterns (esp. \\\" and \\\\)
I'm not quite sure what you mean. Is this part of the previous question?
#"\\\"" Currently matches the single char double-quote, Proposed would
match a backslash followed by a double-quote
#"\\\\" Currently matches the single char backslash. Proposed would
match a backslash followed by another backslash.
If you mean could old patterns be programmatically converted to new
ones, the answer is yes. Somehow I don't think you're asking for a
code walker that spits out what it reads, except with old regex
literals replaced with new ones, but I suppose that could be done.
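A rough sketch of such a conversion, for a single literal rather than a whole code walker. It assumes a Clojure build in which a quote-aware print method for Pattern is in place, and `old-body->new-literal` is a hypothetical name. The idea is to read the old literal's body the way the string reader would, compile the result, and let the printer emit the new form. Since it uses read-string, it should only be fed trusted source text.

```clojure
(defn old-body->new-literal
  "Hypothetical helper: old-body is the text between the quotes of an
  old-style regex literal. Returns the literal as the printer writes it."
  [old-body]
  ;; Step 1: apply string-literal escaping rules, as the old reader did.
  (let [pattern-text (read-string (str \" old-body \"))]
    ;; Step 2: compile, then print in the new raw form.
    (pr-str (re-pattern pattern-text))))

;; Old body \\w* (written here as a Clojure string) prints as #"\w*"
(println (old-body->new-literal "\\\\w*"))
```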
The only pattern I've found that is currently valid that wouldn't be
under the proposal was mentioned in my earlier email: #"\\(" currently
matches a literal open paren. Proposed would be interpreted as a
backslash and the beginning of a group, but without the closing paren
Pattern/compile will throw an exception.
> How tricky is the print side?
Simple for Patterns created using the proposed literal form, as the
Pattern hangs on to the original string -- just slap #"" around it.
But I suppose that's not quite good enough, as you can build Patterns
other ways: (Pattern/compile "\"")
I think the only cause of trouble is the " char, so the Pattern's
string would have to be scanned looking for " and determining if it's
already quoted. If quoted, leave it alone; if unquoted, insert a
backslash. I'll code it up, unless this proposal is already dead on
one of the other points.
Thanks,
--Chouser
I have no skin in this game, so what I say is to be taken with a grain
of salt, but...
If you make this change, millions of Clojure users for all eternity
_will_ thank you.
On the other hand, if you make this change, a couple of dozen users
_may_ mutter something under their breath at you once or twice and then
acknowledge that the change was a good thing.
> ...
>
> Rich
Randall Schulz
(defmethod print-method java.util.regex.Pattern [p #^Writer w]
  (.write w "#\"")
  (loop [[#^Character c & r :as s] (seq (.pattern #^java.util.regex.Pattern p))]
    (when s
      (cond
        (= c \\) (do (.append w \\)
                     (.append w #^Character (first r))
                     (recur (rest r)))
        (= c \") (do (.write w "\\\"")
                     (recur r))
        :else (do (.append w c)
                  (recur r)))))
  (.append w \"))
--Chouser
Do you plan to support printing of all Pattern instances or only those
created using the new literal syntax? In the first case, the story gets
more complex with \Q, \E and flags. In the second case, that's fine.
Christophe
Thanks -- good to see you around again!
> Do you plan to support printing of all Pattern instances or only those
> created using the new literal syntax? In the first case, the story gets
> more complex with \Q, \E and flags. In the second case, that's fine.
I think I need to support all Patterns, and you're right I haven't
thought sufficiently about \Q and \E.
Interestingly, I don't think there's any way to get a plain
double-quote into a regex after a \Q under the proposal. I don't
*think* that's a problem, as you're most likely to use \Q when
building a pattern programmatically, and if you're not you can drop
out of \Q mode to get your " in there.
If anyone disagrees with this conclusion, speak up!
Of course that means I need to do this for you when printing...
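A small check of the \Q behaviour described above, using Pattern/compile with ordinary string literals so it does not depend on the reader patch. The var names are illustrative.

```clojure
(import 'java.util.regex.Pattern)

;; Raw pattern text \Qa\"b\E : inside \Q...\E the backslash is literal,
;; so this matches a, backslash, double quote, b.
(def quoted-with-backslash (Pattern/compile "\\Qa\\\"b\\E"))

;; Raw pattern text \Qa\E\"\Qb\E : leave \Q mode, match the quote, re-enter.
(def quoted-with-break (Pattern/compile "\\Qa\\E\\\"\\Qb\\E"))

(re-find quoted-with-backslash "a\"b")     ; => nil
(re-find quoted-with-backslash "a\\\"b")   ; => "a\\\"b"
(re-find quoted-with-break "a\"b")         ; => "a\"b"
```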
--Chouser
I've attached an updated patch with a new print method, against the
latest SVN 1058.
If you really want to dig into the ugliness, here's a test I've been
using. Apply the attached patch, rebuild clojure, and then load up
this code:
(import '(java.util.regex Pattern)
        '(java.io PushbackReader StringReader))
(defn test-re [s & [ts]]
  (let [p1 (Pattern/compile s)
        p2 (read (PushbackReader. (StringReader. (prn-str p1))))]
    (println (str "raw str: " s))
    (print (str "prn1: " (prn-str p1)))
    (print (str "prn2: " (prn-str p2)))
    (println (if (= (prn-str p1) (prn-str p2)) "PASS" "FAIL"))
    (when ts
      (println (str "match1: " (re-find p1 ts)))
      (println (str "match2: " (re-find p2 ts)))
      (println (if (= (re-find p1 ts) (re-find p2 ts)) "PASS" "FAIL")))
    (println)))
You run it by passing in a string to be compiled by Pattern. This is
essentially the "old format" without the leading # char:
user=> (test-re "foo")
raw str: foo
prn1: #"foo"
prn2: #"foo"
PASS
If my print method is correct, prn1 and prn2 should always be the
same, and thus PASS. Also what you're seeing there is the "new
format". In this case, both the old and new formats are the same.
If you pass in an optional second string, test-re will try to match it
with both patterns:
user=> (test-re "a \\\"\\w*\\\" please" "a \"word\" please")
raw str: a \"\w*\" please
prn1: #"a \"\w*\" please"
prn2: #"a \"\w*\" please"
PASS
match1: a "word" please
match2: a "word" please
PASS
Again, if my patch is correct, match1 and match2 should be
the same, and thus another PASS. Here you can see how the new format
(prn1 and prn2) is simpler than the old format, as well as being
identical to what Pattern actually operates on (raw str).
Here's a truly nasty example prompted by Christophe Grand's comment:
user=> (test-re "a\"\\Qb\"c\\d\\Ee\"f" "a\"b\"c\\de\"f")
raw str: a"\Qb"c\d\Ee"f
prn1: #"a\"\Qb\E\"\Qc\d\Ee\"f"
prn2: #"a\"\Qb\E\"\Qc\d\Ee\"f"
PASS
match1: a"b"c\de"f
match2: a"b"c\de"f
PASS
If you can find *any* input that produces FAIL, please let me know.
--Chouser
Possible reader docs:
Regex patterns (#"pattern")
A regex pattern is read and compiled at read time. The pattern is
passed directly to the java.util.regex.Pattern compile method, so its
quoting rules apply instead of the string literal quoting rules. This
means that unlike string literals, backslashes in regex patterns do
not need to be escaped with another backslash. For example, #"\d*"
matches zero or more digits.
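A brief illustration that could accompany the wording above, assuming a reader with the proposed behaviour:

```clojure
;; No doubled backslash is needed for the predefined class \d.
(re-find #"\d*" "123abc")      ; => "123"
(re-seq #"\d+" "a1b22c333")    ; => ("1" "22" "333")
```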
--Chouser