REPL does not flush input after character errors in strings? Also feature request for #""

Darshan Shaligram

Mar 10, 2008, 10:40:33 AM

to Clojure

When fiddling with the Repl recently (on Clojure svn r733), I ran into
this weird behaviour:

$ clojure
Clojure
user=> "\b"
java.lang.Exception: ReaderError:(1,1) Unsupported escape character:
\b
at clojure.lang.LispReader.read(LispReader.java:158)
[etc.]
user=> "\\b"
"\n"
user=> java.lang.Exception: ReaderError:(2,1) Unsupported character: \
\b
at clojure.lang.LispReader.read(LispReader.java:158)
[etc.]

Looks like the reader choked on the \b in the first input, but then
didn't flush the trailing input ", so it saw the second input line as
""\\b".

I thought this might be because I run clojure.lang.Repl with jline,
but I get the same behaviour even without jline (repl started as
java -classpath $CLOJURE_HOME/clojure.jar clojure.lang.Repl).

Is this a bug? It'd be great if the Repl could flush trailing input
after a compile error.

I ran into this when I was looking for a convenient reader syntax for
regexps, found the neat #"<regex>" and assumed that #"" gave special
treatment to escape sequences, only to stub my toes on the standard
StringReader. It would be awesome if #"" could grok regex escapes
without needing doubled \\ everywhere (triply awesome with a different
reader dispatch macro like #r that allows specifying a delimiter like
#r/regex here/).

That aside, I'm finding Clojure a lot of fun to hack with. Thanks for
this great language!

Cheers,
Darshan

Rich Hickey

Mar 10, 2008, 5:48:14 PM

to Clojure

On Mar 10, 10:40 am, Darshan Shaligram <scinti...@gmail.com> wrote:
> When fiddling with the Repl recently (on Clojure svn r733), I ran into
> this weird behaviour:
>
> $ clojure
> Clojure
> user=> "\b"
> java.lang.Exception: ReaderError:(1,1) Unsupported escape character:
> \b

> user=> "\\b"
> "\n"
> user=> java.lang.Exception: ReaderError:(2,1) Unsupported character: \
> \b

>

> Looks like the reader choked on the \b in the first input, but then
> didn't flush the trailing input ", so it saw the second input line as
> ""\\b".

> Is this a bug? It'd be great if the Repl could flush trailing input
> after a compile error.
>

Yes, I'd like it to do something more useful. There's no such thing as
input flushing though, just discarding, and the question becomes, how
much to toss?
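
One way to make "how much to toss?" concrete is to discard everything up to the end of the current line after a reader error. The following is only a sketch of that single policy, not what clojure.lang.Repl does; the names DiscardRestOfLine, discardLine and restAfterDiscard are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class DiscardRestOfLine {
    // Consumes characters up to and including the next '\n' (or EOF)
    // and returns how many characters were thrown away.
    static int discardLine(Reader in) {
        try {
            int discarded = 0;
            for (int ch = in.read(); ch != -1; ch = in.read()) {
                discarded++;
                if (ch == '\n') break;
            }
            return discarded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for experimenting: returns what is left of 'pending'
    // after one line has been discarded.
    static String restAfterDiscard(String pending) {
        Reader in = new StringReader(pending);
        discardLine(in);
        StringBuilder rest = new StringBuilder();
        try {
            for (int ch = in.read(); ch != -1; ch = in.read()) {
                rest.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rest.toString();
    }
}
```

With the input from the original report, the stray " and newline left over from the first line are dropped, so the second line reaches the reader intact. A form that spans several lines would still leave debris behind, which is exactly the "how much" problem.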

> I ran into this when I was looking for a convenient reader syntax for
> regexps, found the neat #"<regex>" and assumed that #"" gave special
> treatment to escape sequences, only to stub my toes on the standard
> StringReader. It would be awesome if #"" could grok regex escapes
> without needing doubled \\ everywhere

Is that all you mean by special treatment, not handling, when reading
a regex, \ in a regex as an escape character, or is there something
more to it? I'm not much of a regex user, so I'd appreciate any input
from those who are.

> (triply awesome with a different
> reader dispatch macro like #r that allows specifying a delimiter like
> #r/regex here/).
>

Is there something special about / as a regex delimiter? Is it just a
familiarity thing? Is it rarely used inside a pattern where " is more
often? Would #/regex/ be preferable to #"regex"? Or do you really want
to swap out the delimiter arbitrarily?

Input from all interested parties welcome.

> That aside, I'm finding Clojure a lot of fun to hack with. Thanks for
> this great language!

You're welcome,

Rich

Shawn Hoover

Mar 10, 2008, 6:25:26 PM

to clo...@googlegroups.com

On Mon, Mar 10, 2008 at 2:48 PM, Rich Hickey <richh...@gmail.com> wrote:

> On Mar 10, 10:40 am, Darshan Shaligram <scinti...@gmail.com> wrote:
> > I ran into this when I was looking for a convenient reader syntax for
> > regexps, found the neat #"<regex>" and assumed that #"" gave special
> > treatment to escape sequences, only to stub my toes on the standard
> > StringReader. It would be awesome if #"" could grok regex escapes
> > without needing doubled \\ everywhere
>
> Is that all you mean by special treatment, not handling, when reading
> a regex, \ in a regex as an escape character, or is there something
> more to it? I'm not much of a regex user, so I'd appreciate any input
> from those who are.

\ is the escape character for regex, so it's an annoyance of using Java string literals to make a regex: you have to double \ all your escape sequences (especially annoying if you need a literal \ in your regex). Agreed with Darshan, it would be great if Clojure's reader could treat the \ as a regex escape (ignore it) instead of as a string escape.

> > (triply awesome with a different
> > reader dispatch macro like #r that allows specifying a delimiter like
> > #r/regex here/).
>
> Is there something special about / as a regex delimiter? Is it just a
> familiarity thing? Is it rarely used inside a pattern where " is more
> often? Would #/regex/ be preferable to #"regex"? Or do you really want
> to swap out the delimiter arbitrarily?
>
> Input from all interested parties welcome.

/regex/ is a literal regex in some languages, but it doesn't have to be that way if there's a good way for the reader to parse them and deal with the \ escapes.

Shawn

Stuart Sierra

Mar 10, 2008, 8:14:18 PM

to Clojure

On Mar 10, 5:48 pm, Rich Hickey <richhic...@gmail.com> wrote:
> Is there something special about / as a regex delimiter? Is it just a
> familiarity thing? Is it rarely used inside a pattern where " is more
> often? Would #/regex/ be preferable to #"regex"? Or do you really want
> to swap out the delimiter arbitrarily?
>
> Input from all interested parties welcome.

My 2 cents: I've never cared for /regex/ because it necessitates
escaping /'s in paths. #"regex" also serves as a reminder that the
pattern will get read as a Java string, thus one needs to escape \'s.
That said, I do make use of the arbitrary-delimiter regex syntax in
Perl and Ruby.

-Stuart

Darshan Shaligram

Mar 11, 2008, 12:19:06 AM

to Clojure

On Mar 11, 2:48 am, Rich Hickey <richhic...@gmail.com> wrote:
> On Mar 10, 10:40 am, Darshan Shaligram <scinti...@gmail.com> wrote:

> > [...] It'd be great if the Repl could flush trailing input
> > after a compile error.

> Yes, I'd like it to do something more useful. There's no such thing as
> input flushing though, just discarding, and the question becomes, how
> much to toss?

Hmm, good point, and the original problem was not a big deal anyway,
so forget I brought that up. :-)

> > It would be awesome if #"" could grok regex escapes without
> > needing doubled \\ everywhere

> Is that all you mean by special treatment, not handling, when
> reading a regex, \ in a regex as an escape character, or is there
> something more to it? I'm not much of a regex user, so I'd
> appreciate any input from those who are.

I do a lot of regexps, so excuse me if I'm more verbose than necessary
in the following screed. :-)

Regexps use a lot of backslashed sequences like \s, \w, \b, etc. In
addition, regexps give special meaning to characters like ()[]{}|, so
if you want to match such characters literally, you need to escape
them with backslashes. All this means that any non-trivial regular
expression will use a lot of backslashes.

Let's take a simple instance of a regex to match the start of a
C-style function call (this regex is not right, but close enough
for discussion):

\b\w+\s*\(

That looks like this in languages that have only C-style strings:

Java & C: "\\b\\w+\\s*\\("

Ugh, backslash proliferation. Elisp has this problem too, and it has
it worse, because Emacs regexps usually need more backslash action
than Perl-style regexps.
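
To see the doubling in action, here is a small Java sketch. The class and method names (DoubledBackslashes, firstCall) are invented for this example; only java.util.regex is real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubledBackslashes {
    // The ten-character regex \b\w+\s*\( written as a Java string literal:
    // every backslash the regex engine should see has to be typed twice.
    static final String SOURCE = "\\b\\w+\\s*\\(";

    // Returns the first piece of 'text' that looks like the start of a
    // function call, or null when nothing matches.
    static String firstCall(String text) {
        Matcher m = Pattern.compile(SOURCE).matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

SOURCE.length() is 10 even though the literal contains 14 characters between the quotes, and DoubledBackslashes.firstCall("x = printf (y);") returns the text "printf (".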

Some languages have strings that do not assign special meaning to
backslashes (or give backslashes their escape-this duty only in \\ and
\<quote>). For instance, Python and Scala have string syntax where
you can say:

Python: r"\b\w+\s*\("

Scala: """\b\w+\s*\("""

which is much cleaner than the Java equivalent (if distressingly
verbose for short regexps in the Scala version). Unfortunately, the
Python raw string has one disadvantage: there's no way to get
newlines into a raw string without awkward string concatenation or
using the more verbose multiline form (r"""<yadda>""").

Groovy has // syntax for regex-like strings, but it doesn't allow
choosing your own delimiter:

Groovy (string): /\b\w+\s*\(/
Groovy (regex) : ~/\b\w+\s*\(/

Groovy's // is just a special kind of quote, and returns an
appropriate string ("\\b\\w+\\s*\\(" in this case), but the inability
to choose your delimiter produces leaning toothpick syndrome for
filename regexps.

Ruby and Perl have quoting constructs specifically aimed at regular
expressions - they allow you to write things like:

Ruby: %r{\b\w+\s*\(}
Perl: qr|\b\w+\s*\(|

While these look like the Python equivalent, you can use \t, \n, and
friends in such expressions with their normal meaning as in
double-quoted strings. Also, %r and qr allow arbitrary delimiters - /
is a common example of a delimiter, but as Stuart notes, / is
particularly inconvenient when dealing with filenames, so the ability
to use arbitrary delimiters is valuable.

Common Lisp has access to convenient regex quoting through Edi Weitz'
CL-INTERPOL and its reader macros:

CL: #?/\b\w+\s*\(/

Or, with arbitrary delimiters (#?r) and extended regex syntax (#?rx)
for readability:

CL: #?rx! \b \w+ \s* \( !

Which style (if any) you'd like for Clojure is your call, but I rather
fancy:

would-be-neat-in-Clojure: #r{\b\w+\s*\(}

That is, #r signals regex-quote behaviour, the first character after
the #r is the delimiter (with only punctuation allowed as delimiters),
and:

1. Known escape sequences such as \t, \n are translated as in
double-quoted strings.

2. Unknown escape sequences such as \b are converted to the equivalent
of "\\b", i.e. the backslash inserts itself instead of escaping the
next character, with the exception of:

3. A backslash in front of a delimiter character inserts a literal
delimiter character (as in #r/.*\/(.*)/). I'm guessing StringReader
already does this.

4. Delimiters may be balanced - i.e., #r{ must be matched by a }, ( by
a ), and [ by a ]. Some folks also like < and >. Balanced
delimiters may be nested without escaping, as long as the nested
delimiters are also balanced: #r(www.(\w+).com)

Ideally also:

5. A non-regex quoting construct - perhaps #rs - that does the same
things as #r, but creates a string instead of a compiled regex.
That makes things like writing Windows-style paths easy:
#rs{C:\yak\sheep\llama}
Not that that's a problem these days since Windows and Java are
happy with / instead of \, but this also allows constructing regexps
in pieces:
(str #rs/^\s+/ word-we-want "$")
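
Rules 1 to 4 above can be sketched in a few lines of Java. This is only an illustration of the proposal, not existing Clojure reader code: the class DelimitedRegexSketch and its methods are hypothetical, the input is the text that would follow #r, and only \t and \n are treated as "known" escapes.

```java
final class DelimitedRegexSketch {
    // Closing delimiter for a bracket-like opener; otherwise the same character.
    static char closerFor(char open) {
        switch (open) {
            case '(': return ')';
            case '[': return ']';
            case '{': return '}';
            case '<': return '>';
            default:  return open;
        }
    }

    // 'src' starts with the delimiter, e.g. {\b\w+\s*\(} and the
    // returned string is the regex source that would be compiled.
    static String readBody(String src) {
        if (src.isEmpty()) throw new IllegalArgumentException("missing delimiter");
        char open = src.charAt(0);
        if (Character.isLetterOrDigit(open) || Character.isWhitespace(open))
            throw new IllegalArgumentException("delimiter must be punctuation: " + open);
        char close = closerFor(open);
        int depth = 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < src.length(); i++) {
            char ch = src.charAt(i);
            if (ch == '\\' && i + 1 < src.length()) {
                char next = src.charAt(++i);
                if (next == 't') sb.append('\t');              // rule 1
                else if (next == 'n') sb.append('\n');         // rule 1
                else if (next == open || next == close)
                    sb.append(next);                           // rule 3
                else sb.append('\\').append(next);             // rule 2
                continue;
            }
            if (open != close && ch == open) depth++;          // rule 4: nested opener
            else if (ch == close && --depth == 0) return sb.toString();
            sb.append(ch);
        }
        throw new IllegalArgumentException("unterminated regex literal");
    }
}
```

One wrinkle the sketch exposes: when the delimiter is itself a regex metacharacter, such as (, rule 3 turns \( into a bare ( and so changes the meaning of the pattern. Delimiters like { or / avoid that for most patterns.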

I realise this may seem like a lot of work, but it would make regex
use way more convenient. I could put together a little patch for the
#r syntax for Clojure if there's interest.

Alternatively, or additionally, we could allow the use of reader
macros from Clojure code, so we could write a Clojure equivalent for
CL-INTERPOL. :-)

Cheers, and thanks for listening,
Darshan

Chouser

Mar 11, 2008, 9:15:37 AM

to Clojure

Darshan, thanks for the complete write-up. Your suggestion has my
vote.

Since all of these simply create either a regex or a string, the code
should be able to be neatly contained in the reader.

--Chouser

Chouser

Mar 11, 2008, 10:26:27 AM

to Clojure

On Mar 11, 12:19 am, Darshan Shaligram <scinti...@gmail.com> wrote:
> Ideally also:
>
> 5. A non-regex quoting construct - perhaps #rs - that does the same
> things as #r, but creates a string instead of a compiled regex.
> That makes things like writing Windows-style paths easy:
> #rs{C:\yak\sheep\llama}
> Not that that's a problem these days since Windows and Java are
> happy with / instead of \, but this also allows constructing regexps
> in pieces:
> (str #rs/^\s+/ word-we-want "$")

On further reflection, this point worries me a little. In a regex
string \n and \s each have specific but different meaning. The first
is a C-style escape (newline) and the second a regex escape
(whitespace). But in your Windows-style path, what if some dirs start
with s and others with n? Wouldn't #rs{C:\serious\new\files} be the
same as "C:\\serious\new\\files"? Wouldn't the user be surprised by
the newline in the middle of their path?
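
The surprise is easy to reproduce. Below is a hedged Java sketch that applies only rules 1 and 2 of the proposal (\t and \n become control characters, every other backslash pair is kept); RawStringSurprise and translate are hypothetical names.

```java
final class RawStringSurprise {
    // Translates \t and \n to control characters and keeps any other
    // backslash sequence verbatim, as rules 1 and 2 of the proposal describe.
    static String translate(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                if (next == 't') sb.append('\t');
                else if (next == 'n') sb.append('\n');
                else sb.append('\\').append(next);
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

Feeding it the characters C:\serious\new\files gives a string with a real newline after "serious", the same value as the Java literal "C:\\serious\new\\files".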

I suppose it would still be useful to have the string form to allow
building up of regexes, but it would still be fairly regex-specific
even if it wasn't trying to compile each piece.

--Chouser

Darshan Shaligram

Mar 11, 2008, 10:38:43 AM

to clo...@googlegroups.com

On Tue, Mar 11, 2008 at 7:56 PM, Chouser <cho...@gmail.com> wrote:
> On Mar 11, 12:19 am, Darshan Shaligram <scinti...@gmail.com> wrote:

> > 5. A non-regex quoting construct - perhaps #rs - that does the same
> > things as #r, but creates a string instead of a compiled regex.
> > That makes things like writing Windows-style paths easy:
> > #rs{C:\yak\sheep\llama}

> On further reflection, this point worries me a little. In a regex
> string \n and \s each have specific but different meaning. The first
> is a C-style escape (newline) and the second a regex escape
> (whitespace). But in your Windows-style path, what if some dirs start
> with s and others with n?

That's a good point. #rs is then only useful as a convenience for
regexps, like Groovy's // (which is still not bad).

Cheers,
Darshan

Eric Lavigne

Mar 11, 2008, 10:39:54 AM

to clo...@googlegroups.com

> Regexps use a lot of backslashed sequences like \s, \w, \b, etc. In
> addition, regexps give special meaning to characters like ()[]{}|, so
> if you want to match such characters literally, you need to escape
> them with backslashes. All this means that any non-trivial regular
> expression will use a lot of backslashes.
>
> Let's take a simple instance of a regex to match the start of a
> C-style function call (this regex is not right, but close enough
> for discussion):
>
> \b\w+\s*\(
>
> That looks like this in languages that have only C-style strings:
>
> Java & C: "\\b\\w+\\s*\\("
>
> Ugh, backslash proliferation. Elisp has this problem too, and it has
> it worse, because Emacs regexps usually need more backslash action
> than Perl-style regexps.
>
> Some languages have strings that do not assign special meaning to
> backslashes (or give backslashes their escape-this duty only in \\ and
> \<quote>). For instance, Python and Scala have string syntax where
> you can say:

Java's backslash escaping in strings is a simple solution for handling
special characters. The backslash escaping in typical regex syntax
plays the same role and also makes a lot of sense. The only problem is
that they both use the same escape character, the backslash, when a
regex needs to be expressed within a string. I think all of these
problems would go away if we just use ~ for regexes and \ for strings,
or vice-versa. Then, in a regex, if you actually need a literal tilde
(rare) you will have to double it up ~~ just as you need to double up
backslashes \\ to get a literal backslash in a string.
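
A sketch of the translation this would imply, assuming ~X simply becomes \X before the text reaches java.util.regex.Pattern and ~~ stands for a literal tilde. TildeEscapes and toJavaRegex are hypothetical names, not part of Clojure or Java.

```java
final class TildeEscapes {
    // Rewrites a regex written with ~ as the escape character into the
    // backslash form that java.util.regex.Pattern expects.
    static String toJavaRegex(String src) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < src.length(); i++) {
            char ch = src.charAt(i);
            if (ch == '~' && i + 1 < src.length()) {
                char next = src.charAt(++i);
                if (next == '~') sb.append('~');        // ~~ is a literal tilde
                else sb.append('\\').append(next);      // ~X becomes \X
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

Written this way, the function-call regex from earlier in the thread would be typed as ~b~w+~s*~( inside an ordinary string literal, with no doubled backslashes.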

--

I like to find simple solutions to overlooked problems that actually
need to be solved, and deliver them as informally as possible,
starting with a very crude version 1, then iterating rapidly.

- Paul Graham, Six Principles for Making New Things -

Chouser

Mar 11, 2008, 10:56:41 AM

to Clojure

On Mar 11, 10:39 am, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> Java's backslash escaping in strings is a simple solution for handling
> special characters. The backslash escaping in typical regex syntax
> plays the same role and also makes a lot of sense. The only problem is
> that they both use the same escape character, the backslash, when a
> regex needs to be expressed within a string. I think all of these
> problems would go away if we just use ~ for regexes and \ for strings,
> or vice-versa.

That's an interesting point. However, both string and regex escaping
have been around for a long time, and each feels natural to a large
number of people. If there was actual overlap between the two sets,
the "naturalness" of each wouldn't carry a whole lot of weight with
me. But in fact several languages have solved the problem neatly
without resorting to "unusual" escaping for either regex or strings.

Having said all that, Clojure currently supports a very small number
of string escapes (looks like \t \r \n \\ and \"), which is a tiny
subset of what C or Java supports, or what regex requires. So if one
were to get an unusual escape char, I would vote for it to be the
C-style chars: ~t ~r etc.

--Chouser

Darshan Shaligram

Mar 11, 2008, 11:07:53 AM

to clo...@googlegroups.com

On Tue, Mar 11, 2008 at 8:09 PM, Eric Lavigne <lavign...@gmail.com> wrote:
[Regular expressions and the need for escaping backslash in strings]

> Java's backslash escaping in strings is a simple solution for handling
> special characters. The backslash escaping in typical regex syntax
> plays the same role and also makes a lot of sense. The only problem is
> that they both use the same escape character, the backslash, when a
> regex needs to be expressed within a string. I think all of these
> problems would go away if we just use ~ for regexes and \ for strings,
> or vice-versa. Then, in a regex, if you actually need a literal tilde
> (rare) you will have to double it up ~~ just as you need to double up
> backslashes \\ to get a literal backslash in a string.

Lua does something along these lines for its string patterns, which
are designed to be very similar to regexps, but a more compact (in
terms of implementation code size) subset. Lua uses pattern character
classes like %w, %s, etc., and you can use % to escape pattern
meta-characters.

Unfortunately, this does not feel like a good fit for Clojure, since
we're using Java's regular expression engine under the hood, and Java
wants backslashes, so you'd need to translate ~X to \X, and the
behind-the-scenes magic there is likely to trip up users. Further,
when you say "regular expressions", programmers instinctively think of
Perl-style regexps with backslashes. If not all programmers, at least
yours truly does. :-)

The different syntax works fine for Lua, because Lua doesn't claim its
string patterns are regular expressions, and there are some subtle
differences from Perl-style regexps, and having a difference in
notation helps make that explicit.

Cheers,
Darshan

christop...@gmail.com

Mar 11, 2008, 2:55:13 PM

to Clojure

On Mar 10, 10:48 pm, Rich Hickey <richhic...@gmail.com> wrote:
> Is that all you mean by special treatment, not handling, when reading
> a regex, \ in a regex as an escape character, or is there something
> more to it? I'm not much of a regex user, so I'd appreciate any input
> from those who are.

Handling \ as an escape character only before " (and not after an odd
sequence of \) seems to be the simplest solution.
As a bonus I'd like to append flags after the closing quotes : d
(UNIX_LINES), i (CASE_INSENSITIVE), x (COMMENTS), m (MULTILINE) etc.
(see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#UNIX_LINES
for flags characters)

With these rules, #"\\a+\""i would return Pattern.compile("\\\\a+\"",
Pattern.CASE_INSENSITIVE).
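
The flag suffix could be mapped onto Pattern's constants roughly as follows. This is a sketch under the assumption that the suffix letters mirror Java's embedded flag characters (d, i, x, m from the message above, plus s and u from the same Pattern documentation); RegexFlagSuffix, flagsFor and compile are hypothetical names.

```java
import java.util.regex.Pattern;

final class RegexFlagSuffix {
    // Turns a suffix such as "im" into the matching Pattern flag bits.
    static int flagsFor(String suffix) {
        int flags = 0;
        for (char c : suffix.toCharArray()) {
            switch (c) {
                case 'd': flags |= Pattern.UNIX_LINES; break;
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'x': flags |= Pattern.COMMENTS; break;
                case 'm': flags |= Pattern.MULTILINE; break;
                case 's': flags |= Pattern.DOTALL; break;
                case 'u': flags |= Pattern.UNICODE_CASE; break;
                default:
                    throw new IllegalArgumentException("unknown flag: " + c);
            }
        }
        return flags;
    }

    // What a reader might do once it has the pattern text and the suffix.
    static Pattern compile(String source, String suffix) {
        return Pattern.compile(source, flagsFor(suffix));
    }
}
```

RegexFlagSuffix.compile("\\\\a+\"", "i") then corresponds to the Pattern.compile call quoted above.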

> > (triply awesome with a different
> > reader dispatch macro like #r that allows specifying a delimiter like
> > #r/regex here/).
>
> Is there something special about / as a regex delimiter? Is it just a
> familiarity thing? Is it rarely used inside a pattern where " is more
> often? Would #/regex/ be preferable to #"regex"? Or do you really want
> to swap out the delimiter arbitrarily?
>
> Input from all interested parties welcome.

Swapping out the delimiter could come in handy but is not necessary.

Christophe

Eric Lavigne

Mar 11, 2008, 3:40:52 PM

to clo...@googlegroups.com

> Handling \ as an escape character only before " (and not after an odd
> sequence of \) seems to be the simplest solution.
> As a bonus I'd like to append flags after the closing quotes : d
> (UNIX_LINES), i (CASE_INSENSITIVE), x (COMMENTS), m (MULTILINE) etc.
> (see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#UNIX_LINES
> for flags characters)

Adding special cases to cover the edge cases creates more edge cases.
In your proposed scheme, how would you represent a string with only
the following two characters: \"

With slight modification, I still like my original proposal. If ~ were
used for string escape as in Common Lisp and \ were used for regex
escape as in Perl, then regexes would still look like they usually do,
but with the occasional ~ for newlines, tabs, and literal quote
characters. This is also simpler from an implementation standpoint
because Clojure could always make the (~ -> \) substitution upon
string creation, regardless of whether that string would eventually be
used as a regex. The only remaining sticky point, as far as I can see,
is that such strings would still print out with a proliferation of
backslashes. Maybe that's okay as long as it isn't in the code?

Rich Hickey

Mar 11, 2008, 4:09:27 PM

to Clojure

On Mar 11, 3:40 pm, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> > Handling \ as an escape character only before " (and not after an odd
> > sequence of \) seems to be the simplest solution.
> > As a bonus I'd like to append flags after the closing quotes : d
> > (UNIX_LINES), i (CASE_INSENSITIVE), x (COMMENTS), m (MULTILINE) etc.
> > (see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#UN...
> > for flags characters)
>
> Adding special cases to cover the edge cases creates more edge cases.
> In your proposed scheme, how would you represent a string with only
> the following two characters: \"
>

Everyone, I am following the discussion with great interest. I'd just
like to constrain the dialog with the following parameters:

Clojure regexes are Java regexes - I have no desire whatsoever to
define another format. As such, if the Java regex escape character is
'\', that's what it will be in Clojure.

So the only question is how to represent regex patterns in what are
now #"..." reader literals. There, currently, the same interpretation
of \ is in effect as is for Java string literals, i.e. special
handling of \t, \n etc, and errors for unsupported escape sequences.
It seems that that is not useful for regex reading. If \ is not a
reader escape in regexes, the question becomes how to represent " (or
whatever the delimiter is) inside the regex, since it would otherwise
terminate the expression.

Rich

Chouser

Mar 11, 2008, 5:05:29 PM

to Clojure

On Mar 11, 4:09 pm, Rich Hickey <richhic...@gmail.com> wrote

> So the only question is how to represent regex patterns in what are
> now #"..." reader literals. There, currently, the same interpretation
> of \ is in effect as is for Java string literals, i.e. special
> handling of \t, \n etc, and errors for unsupported escape sequences.
> It seems that that is not useful for regex reading.

Actually, it looks like Java's regex engine itself understands some
escape sequences, as listed in Java's Pattern docs. For example, this
returns a match:

(re-seq #"foo\\nbar" "foo\nbar")

So disallowing string escaping with '\' (as Rich suggests) hurts you
pretty much not at all.
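
The same observation in plain Java, as a small sketch (RegexOwnEscapes and matchesNewline are made-up names): the pattern source can contain either a real newline or the two characters backslash and n, and Pattern treats both as "match a newline".

```java
import java.util.regex.Pattern;

final class RegexOwnEscapes {
    // Eight characters: the regex engine receives backslash + 'n'.
    static final String TWO_CHAR = "foo\\nbar";
    // Seven characters: the Java compiler already produced a newline.
    static final String ONE_CHAR = "foo\nbar";

    // True when the given pattern source matches the text foo<newline>bar.
    static boolean matchesNewline(String patternSource) {
        return Pattern.compile(patternSource).matcher("foo\nbar").matches();
    }
}
```

Both matchesNewline(TWO_CHAR) and matchesNewline(ONE_CHAR) are true, so a regex literal that passes backslashes through untouched loses nothing for \n, \t and friends.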

> If \ is not a
> reader escape in regexes, the question becomes how to represent " (or
> whatever the delimiter is) inside the regex, since it would otherwise
> terminate the expression.

I like the idea of having a variety of quote chars, especially with
open/close pairs like (), [], etc. For example if you need a literal
", you could allow #r("). If you need a literal ), you could allow
#r")". You can even allow, as Darshan suggested, balanced interior
quotes like #r(outer(inner)).

Is that sufficient? It requires a pretty tortured case for this to
fall short. You'd have to need a regex that includes one of every
closing quote character or something. In such a tortured case, you
could even get away with concatenating two separate strings to make
your regex. Perhaps that's good enough? Otherwise, it's not at all
clear to me what a good way would be to escape the closing quote.

--Chouser

christop...@gmail.com

Mar 12, 2008, 3:59:44 AM

to Clojure

On Mar 11, 8:40 pm, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> > Handling \ as an escape character only before " (and not after an odd
> > sequence of \) seems to be the simplest solution.
> > As a bonus I'd like to append flags after the closing quotes : d
> > (UNIX_LINES), i (CASE_INSENSITIVE), x (COMMENTS), m (MULTILINE) etc.
> > (see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#UN...
> > for flags characters)
>
> Adding special cases to cover the edge cases creates more edge cases.
> In your proposed scheme, how would you represent a string with only
> the following two characters: \"

As #"\\\"".
I just propose to rely on the escaping strategy built in
Pattern.compile but making escaping " mandatory.
Pattern.compile allows escaping any character (eg \, is the escape
sequence for ,) hence " can be escaped as \" but is not required to.
Pattern.matches("\"", "\"") == true
Pattern.matches("\\\"", "\"") == true

The reader would be:
static class RegexReader extends AFn{
    public Object invoke(Object reader, Object doublequote) throws Exception{
        StringBuilder sb = new StringBuilder();
        Reader r = (Reader) reader;
        boolean esc = false;

        for(int ch = r.read(); !(esc && ch == '"'); ch = r.read()) {
            if(ch == -1)
                throw new Exception("EOF while reading regex pattern");
            sb.append((char) ch);
            esc = !esc && ch != '\\';
        }

        return Pattern.compile(sb.toString());
    }
}

Christophe

christop...@gmail.com

Mar 12, 2008, 4:37:24 AM

to Clojure

I should never code before coffee (and never post code without testing
it). Here is my correct proposal for a regex reader as close as
possible to Java's Pattern syntax:

static class RegexReader extends AFn{
    // Reads the body of a #"..." literal and compiles it unchanged.
    public Object invoke(Object reader, Object doublequote) throws Exception{
        StringBuilder sb = new StringBuilder();
        Reader r = (Reader) reader;
        // true when the previous char was a backslash that escapes the next one
        boolean esc = false;

        // stop at the first " that is not escaped by a backslash
        for(int ch = r.read(); esc || ch != '"'; ch = r.read()) {
            if(ch == -1)
                throw new Exception("EOF while reading regex pattern");
            sb.append((char) ch);
            esc = !esc && ch == '\\';
        }

        return Pattern.compile(sb.toString());
    }
}
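
For anyone who wants to try the loop outside of LispReader, here is a standalone adaptation. It is a sketch: RegexReaderSketch and readSource are hypothetical names, the input is the text that follows the opening #", and the method returns the pattern source instead of a compiled Pattern so the result is easy to inspect.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class RegexReaderSketch {
    // Returns everything up to (not including) the first unescaped ",
    // leaving all backslashes in place for Pattern.compile to interpret.
    static String readSource(String afterOpeningQuote) {
        try {
            Reader r = new StringReader(afterOpeningQuote);
            StringBuilder sb = new StringBuilder();
            boolean esc = false;
            for (int ch = r.read(); esc || ch != '"'; ch = r.read()) {
                if (ch == -1)
                    throw new IllegalStateException("EOF while reading regex pattern");
                sb.append((char) ch);
                esc = !esc && ch == '\\';
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Given the five characters \\\"" (the body and closing quote of the #"\\\"" example above), readSource returns the four characters \\\" and that pattern matches the two-character string \".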

Christophe

Rich Hickey

Mar 12, 2008, 11:25:37 AM

to Clojure

On Mar 12, 4:37 am, "christo...@cgrand.net"

<christophe.gr...@gmail.com> wrote:
> I should never code before coffee (and never post code without testing
> it). Here is my correct proposal for a regex reader as close as
> possible to Java's Pattern syntax:

If the reader wants to consider the regex pattern opaque, then the
only thing it need be concerned with is ". It could just specify that
" terminates the pattern, unless it is doubled, "", in which case you
get a single " in the pattern, and get completely out of the \
interpretation business.
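
A sketch of that doubled-quote rule in Java, for comparison with the backslash-aware loop. DoubledQuoteSketch and readSource are hypothetical names, and the input is again the text that follows the opening #".

```java
final class DoubledQuoteSketch {
    // Reads up to the terminating ". A pair "" contributes one " to the
    // pattern; backslashes are copied through without any interpretation.
    static String readSource(String afterOpeningQuote) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < afterOpeningQuote.length()) {
            char ch = afterOpeningQuote.charAt(i);
            if (ch == '"') {
                boolean doubled = i + 1 < afterOpeningQuote.length()
                        && afterOpeningQuote.charAt(i + 1) == '"';
                if (!doubled) return sb.toString();   // lone " ends the literal
                sb.append('"');
                i += 2;
                continue;
            }
            sb.append(ch);
            i++;
        }
        throw new IllegalArgumentException("EOF while reading regex pattern");
    }
}
```

The reader needs one character of lookahead after each ", but it never has to know anything about regex escapes.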

Rich

christop...@gmail.com

Mar 12, 2008, 12:52:03 PM

to Clojure

On Mar 12, 4:25 pm, Rich Hickey <richhic...@gmail.com> wrote:
> On Mar 12, 4:37 am, "christo...@cgrand.net"

> If the reader wants to consider the regex pattern opaque, then the
> only thing it need be concerned with is ". It could just specify that
> " terminates the pattern, unless it is doubled, "", in which case you
> get a single " in the pattern, and get completely out of the \
> interpretation business.

Yes, it could work as well but \" is more regular, considering both
Clojure string literals and Java regex escape sequences.

Principle of least astonishment: which syntax #"\\""\n" or #"\\\"\n"
fits the regex literal matching the string literal "\\\"\n" best?

As I already stated, \" (or \/ if you want to pick a familiar
delimiter) is even part of the Java regex spec (
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#bs
"A backslash may be used prior to a non-alphabetic character
regardless of whether that character is part of an unescaped
construct.") it doesn't feel like another foreign escaping construct
layered upon an already escaped string.

BTW, what are your plans for string literals? Do you plan to recognize
all the Java string literal escape sequences (including Unicode
escapes) in StringReader? Would you accept a patch?

- Christophe, who likes to discuss the color of the bikeshed.

Chouser

Mar 12, 2008, 1:06:02 PM

to Clojure

On Mar 12, 11:25 am, Rich Hickey <richhic...@gmail.com> wrote:
> If the reader wants to consider the regex pattern opaque, then the
> only thing it need be concerned with is ". It could just specify that
> " terminates the pattern, unless it is doubled, "", in which case you
> get a single " in the pattern, and get completely out of the \
> interpretation business.

As long as I can pick my own quote chars on a per-regex basis, I'd be
completely content with that.

Rich Hickey

Mar 12, 2008, 1:08:56 PM

to Clojure

On Mar 12, 12:52 pm, "christo...@cgrand.net"

<christophe.gr...@gmail.com> wrote:
> On Mar 12, 4:25 pm, Rich Hickey <richhic...@gmail.com> wrote:
>
> > On Mar 12, 4:37 am, "christo...@cgrand.net"
> > If the reader wants to consider the regex pattern opaque, then the
> > only thing it need be concerned with is ". It could just specify that
> > " terminates the pattern, unless it is doubled, "", in which case you
> > get a single " in the pattern, and get completely out of the \
> > interpretation business.
>
> Yes, it could work as well but \" is more regular, considering both
> Clojure string literals and Java regex escape sequences.
>
> Principle of least astonishment: which syntax #"\\""\n" or #"\\\"\n"
> fits the regex literal matching the string literal "\\\"\n" best?
>

I think we need to leave the match out of it. We should talk about
which literal yields which pattern, e.g. in your example - #"\\\"\n",
how many backslashes are in the resulting pattern?

> BTW, what are your plans for string literals? Do you plan to recognize
> all the Java string literal escape sequences (including Unicode
> escapes) in StringReader? Would you accept a patch?
>

What is involved in recognizing all escape sequences? Is there logic
or is it just a table thing? Can it be round-tripped with the printer?
Will it slow things down much?

Rich

christop...@gmail.com

Mar 12, 2008, 2:07:44 PM

to Clojure

On Mar 12, 6:08 pm, Rich Hickey <richhic...@gmail.com> wrote:
> On Mar 12, 12:52 pm, "christo...@cgrand.net"
>
> <christophe.gr...@gmail.com> wrote:
> > Principle of least astonishment: which syntax #"\\""\n" or #"\\\"\n"
> > fits the regex literal matching the string literal "\\\"\n" best?
>
> I think we need to leave the match out of it. We should talk about
> which literal yields which pattern, e.g. in your example - #"\\\"\n",
> how many backslashes are in the resulting pattern?

I'm not sure what you call the resulting pattern. Is it the string
passed to Pattern.compile or the result of compilation? If the latter:
there is one backslash in the pattern (it recognizes one backslash
followed by one double quote, followed by one newline).

If the former, this pattern can be produced by:
  in Java: Pattern.compile("\\\\\"\\n") or Pattern.compile("\\\\\"\n")
    (depending on who escapes the newline)
  in current Clojure: #"\\\\\"\\n" or #"\\\\\"\n"
    (depending on who escapes the newline)
  in Javascript or sed (should work with perl, vi etc.): /\\"\n/ or /\\\"\n/
    (optional escape sequence)
  your proposal: #"\\""\n"
  my proposal: #"\\\"\n" (similar to one legacy (sed, perl etc.) notation)
  my proposal with / instead of ": #/\\"\n/ or #/\\\"\n/
    (similar to both legacy (sed, perl etc.) notations)
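
The Java variants in this list can be checked directly. A small sketch (EquivalentSources and matchesTarget are made-up names): each pattern source below should match the three-character string backslash, double quote, newline.

```java
import java.util.regex.Pattern;

final class EquivalentSources {
    // The text to recognize: backslash, double quote, newline.
    static final String TARGET = "\\\"\n";

    // True when the whole of TARGET matches the given pattern source.
    static boolean matchesTarget(String source) {
        return Pattern.matches(source, TARGET);
    }
}
```

matchesTarget("\\\\\"\\n") and matchesTarget("\\\\\"\n") are both true, and so is matchesTarget("\\\\\\\"\\n"), where the double quote carries its own optional backslash.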

> > BTW, what are your plans for string literals? Do you plan to recognize
> > all the Java string literal escape sequences (including Unicode
> > escapes) in StringReader? Would you accept a patch?
>
> What is involved in recognizing all escape sequences? Is there logic
> or is it just a table thing?

Mainly a table thing + some logic for hexadecimal and octal notations.

> Can it be round-tripped with the printer?

Not really: multiple escape sequences for a single character (eg: new
line can be denoted by \n \u000a or \12).

> Will it slow things down much?

No: it would just add cases to the switch statement in escape
handling.
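
As a rough illustration of the "table + some logic" description above, here is a minimal Java sketch of decoding string-literal escapes. The names `EscapeSketch` and `unescape` are hypothetical, and this is not the actual StringReader code. It assumes exactly four hex digits after a backslash-u and one to three octal digits (at most \377), as in Java string literals.

```java
final class EscapeSketch {
    /** Decodes Java-style escapes found in the text between the quotes of a literal. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\') { sb.append(c); continue; }
            if (i == s.length()) throw new IllegalArgumentException("dangling backslash");
            char e = s.charAt(i++);
            switch (e) {
                // Table part: one case per single-character escape.
                case 'b': sb.append('\b'); break;
                case 't': sb.append('\t'); break;
                case 'n': sb.append('\n'); break;
                case 'f': sb.append('\f'); break;
                case 'r': sb.append('\r'); break;
                case '"': sb.append('"'); break;
                case '\'': sb.append('\''); break;
                case '\\': sb.append('\\'); break;
                // Logic part 1: hexadecimal notation, four hex digits.
                case 'u': {
                    if (i + 4 > s.length()) throw new IllegalArgumentException("truncated unicode escape");
                    int value = 0;
                    for (int k = 0; k < 4; k++) {
                        int d = Character.digit(s.charAt(i++), 16);
                        if (d < 0) throw new IllegalArgumentException("bad hex digit");
                        value = value * 16 + d;
                    }
                    sb.append((char) value);
                    break;
                }
                // Logic part 2: octal notation, three digits only if the first is 0-3.
                default: {
                    if (e < '0' || e > '7') throw new IllegalArgumentException("unsupported escape character: " + e);
                    int max = (e <= '3') ? 3 : 2;
                    int value = e - '0';
                    int digits = 1;
                    while (digits < max && i < s.length()
                            && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                        value = value * 8 + (s.charAt(i++) - '0');
                        digits++;
                    }
                    sb.append((char) value);
                }
            }
        }
        return sb.toString();
    }
}
```

In this sketch the three spellings of newline mentioned above (\n, \u000a and \12) all decode to the same character, which is why a printer cannot tell which one the source used.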

Christophe

Rich Hickey

Mar 12, 2008, 3:26:40 PM

to Clojure

On Mar 12, 2:07 pm, "christo...@cgrand.net"

<christophe.gr...@gmail.com> wrote:
> On Mar 12, 6:08 pm, Rich Hickey <richhic...@gmail.com> wrote:
>
> > On Mar 12, 12:52 pm, "christo...@cgrand.net"
>
> > <christophe.gr...@gmail.com> wrote:
> > > Principle of least astonishment: which syntax #"\\""\n" or #"\\\"\n"
> > > fits the regex literal matching the string literal "\\\"\n" best?
>
> > I think we need to leave the match out of it. We should talk about
> > which literal yields which pattern, e.g. in your example - #"\\\"\n",
> > how many backslashes are in the resulting pattern?

> If the former, this pattern can be produced by:
> in Java: Pattern.compile("\\\\\"\\n") or Pattern.compile("\\\\\"\n")
> (depending who escapes the CR)
> in current Clojure: #"\\\\\"\\n" or #"\\\\\"\n" (depending who escapes
> the CR)

That's what I'm trying to be more precise about. A pattern is a
string. It need not have come from a String literal and we don't need
a string literal to describe it. The reader will read something inside
#"..." and will create a string, and a pattern from that. Again, in
your proposal - #"\\\"\n" - is that intended to be the regex:

backslash double-quote newline

or

backslash double-quote backslash n

or

backslash backslash double-quote backslash n

or

backslash backslash backslash double-quote backslash n

i.e., what characters get passed to Pattern.compile()?

Rich

christop...@gmail.com

Mar 12, 2008, 5:22:41 PM

to Clojure

On Mar 12, 8:26 pm, Rich Hickey <richhic...@gmail.com> wrote:
> That's what I'm trying to be more precise about. A pattern is a
> string. It need not have come from a String literal and we don't need
> a string literal to describe it. The reader will read something inside
> #"..." and will create a string, and a pattern from that. Again, in
> your proposal - #"\\\"\n" - is that intended to be the regex:

> i.e., what characters get passed to Pattern.compile()?

The characters between the first and the last double quotes, *as is*,
namely:

backslash backslash backslash double-quote backslash n

What I am trying to say is that we can pick an *innocuous* character
(eg " or /) as a delimiter, scan the input for this delimiter
preceded by strictly an even number (may be zero) of backslashes. This
delimiter is the closing delimiter. Every character in-between
(including all these backslashes) can be passed *as is* to
Pattern.compile (which already knows how to deal with these
backslashes).

Christophe

Eric Lavigne

Mar 12, 2008, 8:59:20 PM

to clo...@googlegroups.com

> > That's what I'm trying to be more precise about. A pattern is a
> > string. It need not have come from a String literal and we don't need
> > a string literal to describe it. The reader will read something inside
> > #"..." and will create a string, and a pattern from that. Again, in
> > your proposal - #"\\\"\n" - is that intended to be the regex:
>
> > i.e., what characters get passed to Pattern.compile()?
>
> The characters between the first and the last double quotes, *as is*,
> namely:
>
> backslash backslash backslash double-quote backslash n

I still don't understand this scheme. Please interpret the following
Clojure code, assuming that your scheme is being used. Is it a list of
two elements (regex and string) or just one regex (which includes a
literal space)? Whichever one of these you select, how would the other
possibility be represented? Do literal spaces within the regex need to
be escaped?

(#"\\\"\n" "\"\\\"\n")

> What I am trying to say is that we can pick an *innocuous* character
> (eg " or /) as a delimiter, scan the input for this delimiter
> preceded by strictly an even number (may be zero) of backslashes. This
> delimiter is the closing delimiter. Every character in-between
> (including all these backslashes) can be passed *as is* to
> Pattern.compile (which already knows how to deal with these
> backslashes).

Though I don't understand this yet, the even-odd distinction gives me
the impression that it is overly complicated. Why can't my regular
expression include an odd number of back-slashes?

christop...@gmail.com

Mar 13, 2008, 5:04:16 AM

to Clojure

On Mar 13, 1:59 am, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> I still don't understand this scheme. Please interpret the following
> Clojure code, assuming that your scheme is being used. Is it a list of
> two elements (regex and string) or just one regex (which includes a
> literal space)? Whichever one of these you select, how would the other
> possibility be represented? Do literal spaces within the regex need to
> be escaped?
>
> (#"\\\"\n" "\"\\\"\n")

As written, that is one regex and one string.
To get just one regex, you would write: (#"\\\"\n\" \"\"\\\"\n")

> > What I am trying to say is that we can pick an *innocuous* character
> > (eg " or /) as a delimiter, scan the input for this delimiter
> > preceded by strictly an even number (may be zero) of backslashes. This
> > delimiter is the closing delimiter. Every character in-between
> > (including all these backslashes) can be passed *as is* to
> > Pattern.compile (which already knows how to deal with these
> > backslashes).
>
> Though I don't understand this yet, the even-odd distinction gives me
> the impression that it is overly complicated. Why can't my regular
> expression include an odd number of back-slashes?

For exactly the same reason that you can't end a string literal with
an odd number of backslashes: the last one would escape the closing
quote.
What's this: '("\\\"\n" "\"\\\"\n"), one string or two strings? It's
the same even-odd distinction.

Ok, here is a more detailed explanation:
The Java regex engine already features a full backslash-based escape
mechanism (including \\ \t \n \r \f, Unicode escape sequences etc.).
It lacks only one thing: delimiters (the end of the string is intended
to be the delimiter), and that is the problem with embedding Java regex
literals in a host language.
But I'm lucky: some (non-alphabetical) characters can be optionally
escaped, and among those characters there is " (and ' and / and
others). That is, the Java regexes " and \" yield the same result. So
if I ban the non-escaped form from regex literals I don't lose
expressive power, and I can decide that it (the non-escaped form) will
be the delimiter.
Ok, now, how do I recognize the end delimiter while reading the input?
Given that Java regexes are backslash-based (and that a backslash is
escaped as \\), I can say for sure that the reader is at the beginning
of a Java regex escape sequence after reading a sequence consisting of
an odd number of backslashes. So I have a criterion to tell that:
" in \" is not the end-of-literal delimiter
" in \\" is the end-of-literal delimiter
" in \\\" is not the end-of-literal delimiter
etc.
As I'm using the built-in Java regex escape mechanism, I have no
escaping to do on my own: I just need to be able to recognize the end
delimiter and pass all the read characters to the Java regex engine
for compilation.
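
The even/odd criterion above can be sketched in Java by counting the run of backslashes that immediately precedes each double quote. The names `DelimiterScan` and `closingQuote` are hypothetical; this only illustrates the rule and is not code from the Clojure reader.

```java
final class DelimiterScan {
    /**
     * Returns the index of the closing double quote, scanning from start
     * (the first character after the opening quote), or -1 if there is none.
     */
    static int closingQuote(String input, int start) {
        int backslashes = 0; // length of the backslash run just before position i
        for (int i = start; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\') {
                backslashes++;
            } else {
                // An even run (possibly empty) means the quote itself is not escaped.
                if (c == '"' && backslashes % 2 == 0) return i;
                backslashes = 0;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The text that follows the opening quote of  #"\\\"\n" "rest"
        String afterOpeningQuote = "\\\\\\\"\\n\" \"rest\"";
        int end = closingQuote(afterOpeningQuote, 0);
        // Everything before the delimiter goes to the regex engine unchanged.
        System.out.println(afterOpeningQuote.substring(0, end)); // prints \\\"\n
    }
}
```

Counting the run is equivalent to toggling a single "escaped" flag on each backslash; the counter is used here only because it maps directly onto the three cases listed above.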

I hope the explanations made my proposal clearer.

Christophe

Eric Lavigne

Mar 13, 2008, 6:42:38 AM

to clo...@googlegroups.com

> > Though I don't understand this yet, the even-odd distinction gives me
> > the impression that it is overly complicated. Why can't my regular
> > expression include an odd number of back-slashes?

Okay, so the total number of backslashes can be anything, but a
consecutive series of an odd number of backslashes doesn't make sense
at the end of a regex.

> That is the " and \" Java regexes yield the same result. So
> if I ban the non-escaped form from regex literal I don't lose
> expressive power and I can decide that it (the non-escaped form) will
> be the delimiter.

Your regular expression format takes advantage of the fact that some
strings are not valid regular expressions, and that certain strings,
though valid regular expressions, are redundant and don't need to be
allowed. Until now, I missed that point.

I like this proposal now.

Chouser

Mar 13, 2008, 9:50:55 AM

to Clojure

I thought we were closer to a solution with the idea of not scanning
the string for escapes at all. #r{\\"\n} gets passed directly to
Pattern.compile, which interprets it as (backslash double-quote
newline). If you need to match a { as well, switch to different quote
chars: #r({\\"\n) Oh, you also need to match parens? OK:
#r[({\\"\n]. How silly do you want to get? #r:!$<[({\\"\n: It's really
hard to think of a pattern that can't be easily represented this way
-- easy for the writing programmer, easy for the reading programmer,
and easy for the lisp reader.

Rich Hickey

Mar 13, 2008, 10:22:00 AM

to Clojure

Hmmm... I don't think I agreed to #r or any of the variable delimiter
concepts. When I see the end of #r:!$<[({\\"\n: I think something is
wrong. Never mind the editor support hassles.

Right now I am leaning towards " as the delimiter, and \ escaping only
". The only question is whether \" puts \" in the pattern or just "

Rich

Eric Lavigne

Mar 13, 2008, 10:23:02 AM

to clo...@googlegroups.com

The differences between Chouser's proposal and Christophe's proposal
are small and not particularly important, but I do prefer
Christophe's. Here are my thoughts on the differences.

1) In Christophe's version, the regex delimiters are always the same.
This gives it a feeling of uniformity.
2) In Christophe's version, I can create those delimiters without
first thinking about whether my preferred delimiters appear in the
regex.
3) Chouser's version allows " to appear as part of the regex without a
matching backslash. That is an advantage but not, in my opinion,
sufficient to make up for points (1) and (2).

Eric Lavigne

Mar 13, 2008, 10:25:18 AM

to clo...@googlegroups.com

> Hmmm... I don't think I agreed to #r or any of the variable delimiter
> concepts. When I see the end of #r:!$<[({\\"\n: I think something is
> wrong. Never mind the editor support hassles.
>
> Right now I am leaning towards " as the delimiter, and \ escaping only
> ". The only question is whether \" puts \" in the pattern or just "

Those two possibilities lead to identical interpretations by the Java
regex engine, which is why Christophe's idea works in the first place.
I prefer \" in the regex's printed representation (if it will have
one, which I hope it will) for consistency.
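
The claim that both forms are interpreted identically can be checked with a small Java sketch. The class and method names are hypothetical; the only library call used is java.util.regex.Pattern.compile with a matcher.

```java
import java.util.regex.Pattern;

final class QuoteEscapeEquivalence {
    /** True when the given regex source matches exactly the one-character string ". */
    static boolean matchesBareQuote(String regexSource) {
        return Pattern.compile(regexSource).matcher("\"").matches();
    }

    public static void main(String[] args) {
        // Regex source is the single character "
        System.out.println(matchesBareQuote("\""));   // prints true
        // Regex source is the two characters \"
        System.out.println(matchesBareQuote("\\\"")); // prints true
    }
}
```

This works because the regex engine allows a backslash before any non-alphabetic character, whether or not that character needs escaping.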

christop...@gmail.com

Mar 13, 2008, 11:12:28 AM

to Clojure

On Mar 13, 3:25 pm, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> > ". The only question is whether \" puts \" in the pattern or just "
>
> Those two possibilities lead to identical interpretations by the Java
> regex engine, which is why Christophe's idea works in the first place.
> I prefer \" in the regex's printed representation (if it will have
> one, which I hope it will) for consistency.

Putting \" in the pattern makes roundtripping easy. Just add these
lines to RT.print:

    else if (x instanceof Pattern) {
        w.write("#\"" + x.toString() + '"');
    }

Examples:
> Clojure
> user=> #"\""
> #"\""
> user=> #"\\\""
> #"\\\""
> user=> #"\u0041"
> #"\u0041"

(of course, it depends on this RegexReader):

static class RegexReader extends AFn{
    public Object invoke(Object reader, Object doublequote) throws Exception{
        StringBuilder sb = new StringBuilder();
        Reader r = (Reader) reader;
        // true when the previous character was an unpaired backslash
        boolean esc = false;

        // stop at the first double quote that is not escaped
        for(int ch = r.read(); esc || ch != '"'; ch = r.read()) {
            if(ch == -1)
                throw new Exception("EOF while reading regex pattern");
            // keep every character, backslashes included, for Pattern.compile
            sb.append((char) ch);
            esc = !esc && ch == '\\';
        }

        return Pattern.compile(sb.toString());
    }
}

Christophe

Rich Hickey

Mar 13, 2008, 11:19:15 AM

to Clojure

On Mar 13, 10:25 am, "Eric Lavigne" <lavigne.e...@gmail.com> wrote:
> > Hmmm... I don't think I agreed to #r or any of the variable delimiter
> > concepts. When I see the end of #r:!$<[({\\"\n: I think something is
> > wrong. Never mind the editor support hassles.
>
> > Right now I am leaning towards " as the delimiter, and \ escaping only
> > ". The only question is whether \" puts \" in the pattern or just "
>
> Those two possibilities lead to identical interpretations by the Java
> regex engine, which is why Christophe's idea works in the first place.

I understand that, but I'm concerned that even though the regex engine
is specified to accept \"s, it is unlikely to have seen many (any?) in
normal use by Java programmers, i.e. when they write \" in their
string literals and create a pattern they get " alone. I'd prefer not
to generate something that's on an exceptional code path.

Rich

christop...@gmail.com

Mar 13, 2008, 11:52:57 AM

to Clojure

On Mar 13, 4:19 pm, Rich Hickey <richhic...@gmail.com> wrote:
> I understand that, but I'm concerned that even though the regex engine
> is specified to accept \"s, it is unlikely to have seen many (any?) in
> normal use by Java programmers, i.e. when they write \" in their
> string literals and create a pattern they get " alone. I'd prefer not

java.util.regex.Pattern is not known to be so brittle [1].
There are "real world" uses:
http://www.google.com/codesearch?hl=en&lr=&q=Pattern+compile+%5B%5E%5C%5C%5D%5C%5C%28%5C%5C%5C%5C%29%2B%22+lang%3Ajava&btnG=Search
returns, at first glance, Geronimo, Nutch and JRuby.

Christophe

[1]: 11 open bugs (but some bad regressions in closed bugs... like
this one: http://bugs.sun.com/view_bug.do?bug_id=6497148
but it got detected and fixed in an RC version):
http://bugs.sun.com/search.do?process=1&category=java&bugStatus=open&subcategory=classes_util_regex&type=bug&keyword=pattern+compile+regex

christop...@gmail.com

Mar 13, 2008, 12:17:24 PM

to Clojure

On Mar 13, 4:12 pm, "christo...@cgrand.net"

<christophe.gr...@gmail.com> wrote:
> Putting \" in the pattern makes roundtripping easy.

Well... in all honesty I forgot to add the fine print that says: *as
long as the pattern comes from a literal and not from somewhere else*
(e.g. re-pattern or Pattern.compile).

I can even think of some cases where I don't know if one can possibly
derive all parameters passed to Pattern.compile from the compiled
Pattern.
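
A small Java sketch of that fine print, assuming a printer that simply wraps the pattern's source text in #"..." as proposed earlier in the thread. The names `PrintRoundTrip` and `print` are hypothetical.

```java
import java.util.regex.Pattern;

final class PrintRoundTrip {
    /** Mirrors the proposed printer: wraps the pattern's source text in #"...". */
    static String print(Pattern p) {
        return "#\"" + p.pattern() + "\"";
    }

    public static void main(String[] args) {
        // Source text \" as a regex literal would supply it: prints #"\""
        System.out.println(print(Pattern.compile("\\\"")));
        // Source text " supplied through the API: prints #""" which no longer reads back
        System.out.println(print(Pattern.compile("\"")));
        // Flags passed to Pattern.compile have no place in this printed form: prints #"a"
        System.out.println(print(Pattern.compile("a", Pattern.CASE_INSENSITIVE)));
    }
}
```

The flags remain available through Pattern.flags(), so the last case is a limit of this particular printed form rather than of the Pattern object.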

Christophe

Rich Hickey

Mar 13, 2008, 12:19:54 PM

to Clojure

On Mar 13, 11:52 am, "christo...@cgrand.net"

<christophe.gr...@gmail.com> wrote:
> On Mar 13, 4:19 pm, Rich Hickey <richhic...@gmail.com> wrote:
>
> > I understand that, but I'm concerned that even though the regex engine
> > is specified to accept \"s, it is unlikely to have seen many (any?) in
> > normal use by Java programmers, i.e. when they write \" in their
> > string literals and create a pattern they get " alone. I'd prefer not
>
> java.util.regex.Pattern is not known to be so brittle [1].

> There are "real world" uses:http://www.google.com/codesearch?hl=en&lr=&q=Pattern+compile+%5B%5E%5...

> returns at a first glance Geronimo, Nutch and JRuby.
>

Hmm... I wonder how many of those think they're matching backslash
quote by doing that:

>>> private static Pattern escapedQuote = Pattern.compile("\\\"");
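
What that declaration actually matches can be verified with a short Java sketch. The class and method names are hypothetical; the strings are the Java literals under discussion.

```java
import java.util.regex.Pattern;

final class EscapedQuotePitfall {
    /** Compiles the given string as a regex and tests it against the whole subject. */
    static boolean matches(String regexSource, String subject) {
        return Pattern.compile(regexSource).matcher(subject).matches();
    }

    public static void main(String[] args) {
        String quoteOnly = "\"";          // the one character "
        String backslashQuote = "\\\"";   // the two characters \"

        // The Java literal "\\\"" holds the two characters \" (an escaped quote regex).
        System.out.println(matches("\\\"", quoteOnly));        // prints true
        System.out.println(matches("\\\"", backslashQuote));   // prints false

        // Matching backslash-quote needs the regex source \\" (three characters).
        System.out.println(matches("\\\\\"", backslashQuote)); // prints true
    }
}
```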

Rich

christop...@gmail.com

Mar 13, 2008, 12:39:05 PM

to Clojure

On Mar 13, 5:19 pm, Rich Hickey <richhic...@gmail.com> wrote:
> Hmm... I wonder how many of those think they're matching backslash
> quote by doing that:
>
> >>> private static Pattern escapedQuote = Pattern.compile("\\\"");

Touché. This one is obvious... and on the first page.

Christophe
