On Mar 10, 10:40 am, Darshan Shaligram <scinti...@gmail.com> wrote:
> I ran into this when I was looking for a convenient reader syntax for
> regexps, found the neat #"<regex>" and assumed that #"" gave special
> treatment to escape sequences, only to stub my toes on the standard
> StringReader. It would be awesome if #"" could grok regex escapes
> without needing doubled \\ everywhere
Is that all you mean by special treatment, not handling, when reading
a regex, \ in a regex as an escape character, or is there something
more to it? I'm not much of a regex user, so I'd appreciate any input
from those who are.
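For readers who don't use regexes much, here is a minimal Java sketch of the doubling being complained about. The class and method names are made up for illustration; only Pattern.compile and Matcher.matches are real library calls.

```java
import java.util.regex.Pattern;

public class DoubledBackslashes {
    // The regex we want the engine to see is the 8 characters \d+\.\d+
    // In a Java string literal every one of those backslashes must be doubled.
    static final String SOURCE = "\\d+\\.\\d+";

    // True when the whole input is digits, a literal dot, then digits.
    static boolean matchesDecimal(String input) {
        return Pattern.compile(SOURCE).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(SOURCE);                 // the text actually handed to the regex engine
        System.out.println(matchesDecimal("3.14"));
        System.out.println(matchesDecimal("3x14"));
    }
}
```

A reader that went through the ordinary string rules first would need the same doubling, which is the stubbed toe described above.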
> (triply awesome with a different
> reader dispatch macro like #r that allows specifying a delimiter like
> #r/regex here/).
>
Is there something special about / as a regex delimiter? Is it just a
familiarity thing? Is it rarely used inside a pattern where " is more
often? Would #/regex/ be preferable to #"regex"? Or do you really want
to swap out the delimiter arbitrarily?
Input from all interested parties welcome.
> > 5. A non-regex quoting construct - perhaps #rs - that does the same
> > things as #r, but creates a string instead of a compiled regex.
> > That makes things like writing Windows-style paths easy:
> > #rs{C:\yak\sheep\llama}
> On further reflection, this point worries me a little. In a regex
> string \n and \s each have specific but different meaning. The first
> is a C-style escape (newline) and the second a regex escape
> (whitespace). But in your Windows-style path, what if some dirs start
> with s and others with n?
That's a good point. #rs is then only useful as a convenience for
regexps, like Groovy's // (which is still not bad).
Cheers,
Darshan
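A small Java sketch of why the same backslashed text cannot serve as both a raw path and a regex. The path C:\new\stuff and the class name are invented for the example; Pattern.quote is the standard way to ask the engine for a literal reading.

```java
import java.util.regex.Pattern;

public class PathVersusRegex {
    // The 12 characters C:\new\stuff, written as a Java literal.
    static final String RAW = "C:\\new\\stuff";

    // Read as a regex: \n means newline and \s means any whitespace character.
    static boolean asRegexMatches(String input) {
        return Pattern.compile(RAW).matcher(input).matches();
    }

    // Read as literal text: Pattern.quote switches off every regex escape.
    static boolean asLiteralMatches(String input) {
        return Pattern.compile(Pattern.quote(RAW)).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(asRegexMatches(RAW));             // the path does not match itself
        System.out.println(asRegexMatches("C:\new tuff"));   // but newline + "ew" + space + "tuff" does
        System.out.println(asLiteralMatches(RAW));
    }
}
```

So a raw-string form that leaves backslashes alone is safe for paths only if nothing downstream reinterprets them, and safe for regexes only because the regex engine does.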
Java's backslash escaping in strings is a simple solution for handling
special characters. The backslash escaping in typical regex syntax
plays the same role and also makes a lot of sense. The only problem is
that they both use the same escape character, the backslash, when a
regex needs to be expressed within a string. I think all of these
problems would go away if we just use ~ for regexes and \ for strings,
or vice-versa. Then, in a regex, if you actually need a literal tilde
(rare) you will have to double it up ~~ just as you need to double up
backslashes \\ to get a literal backslash in a string.
--
I like to find simple solutions to overlooked problems that actually
need to be solved, and deliver them as informally as possible,
starting with a very crude version 1, then iterating rapidly.
- Paul Graham, Six Principles for Making New Things -
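To make the tilde idea concrete, here is a rough Java sketch of the translation it would need before handing text to java.util.regex. Everything here is hypothetical: the method name, the choice to treat a trailing lone ~ as a literal, and the choice to turn a bare backslash into a literal backslash are assumptions, not part of the proposal as posted.

```java
import java.util.regex.Pattern;

public class TildeEscapes {
    // Rewrites ~X to \X and ~~ to a literal ~ so the result can be given to Pattern.compile.
    static String tildeToBackslash(String src) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '~' && i + 1 < src.length()) {
                char next = src.charAt(++i);
                if (next == '~') {
                    out.append('~');                 // doubled tilde: one literal tilde
                } else {
                    out.append('\\').append(next);   // ~d becomes \d, ~s becomes \s, ...
                }
            } else if (c == '\\') {
                out.append("\\\\");                  // assumption: a bare backslash is literal in this scheme
            } else {
                out.append(c);                       // ordinary character (or a trailing lone ~)
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String translated = tildeToBackslash("~d+~~~s");   // no backslashes in the source text
        System.out.println(translated);
        System.out.println(Pattern.compile(translated).matcher("42~ ").matches());
    }
}
```

The source literal contains no backslashes at all, which is the appeal; the cost is the hidden rewrite step.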
Lua does something along these lines for its string patterns, which
are designed to be very similar to regexps, but a more compact (in
terms of implementation code size) subset. Lua uses pattern character
classes like %w, %s, etc., and you can use % to escape pattern
meta-characters.
Unfortunately, this does not feel like a good fit for Clojure, since
we're using Java's regular expression engine under the hood, and Java
wants backslashes, so you'd need to translate ~X to \X, and the
behind-the-scenes magic there is likely to trip up users. Further,
when you say "regular expressions", programmers instinctively think of
Perl-style regexps with backslashes. If not all programmers, at least
yours truly does. :-)
The different syntax works fine for Lua because Lua doesn't claim its
string patterns are regular expressions. There are some subtle
differences from Perl-style regexps, and the difference in notation
helps make that explicit.
Cheers,
Darshan
Adding special cases to cover the edge cases creates more edge cases.
In your proposed scheme, how would you represent a string with only
the following two characters: \"
With slight modification, I still like my original proposal. If ~ were
used for string escape as in Common Lisp and \ were used for regex
escape as in Perl, then regexes would still look like they usually do,
but with the occasional ~ for newlines, tabs, and literal quote
characters. This is also simpler from an implementation standpoint
because Clojure could always make the (~ -> \) substitution upon
string creation, regardless of whether that string would eventually be
used as a regex. The only remaining sticky point, as far as I can see,
is that such strings would still print out with a proliferation of
backslashes. Maybe that's okay as long as it isn't in the code?
I still don't understand this scheme. Please interpret the following
Clojure code, assuming that your scheme is being used. Is it a list of
two elements (regex and string) or just one regex (which includes a
literal space)? Whichever one of these you select, how would the other
possibility be represented? Do literal spaces within the regex need to
be escaped?
(#"\\\"\n" "\"\\\"\n")
> What I am trying to say is that we can pick an *innocuous* character
> (eg " or /) as a delimiter, scan the input for this delimiter
> preceded by strictly an even number (may be zero) of backslashes. This
> delimiter is the closing delimiter. Every character in-between
> (including all these backslashes) can be passed *as is* to
> Pattern.compile (which already knows how to deal with these
> backslashes).
Though I don't understand this yet, the even-odd distinction gives me
the impression that it is overly complicated. Why can't my regular
expression include an odd number of backslashes?
Okay, so the total number of backslashes can be anything, but a
consecutive series of an odd number of backslashes doesn't make sense
at the end of a regex.
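The scanning rule in the quoted proposal can be sketched in a few lines of Java. This is only an illustration of the even/odd idea, not anyone's actual reader code; the method name and the convention of returning -1 for an unterminated literal are assumptions.

```java
import java.util.regex.Pattern;

public class RegexLiteralScanner {
    // Returns the index of the first " at or after start that is preceded by an
    // even number (possibly zero) of consecutive backslashes, or -1 if there is none.
    static int findClosingQuote(String src, int start) {
        int backslashes = 0;                      // length of the current run of backslashes
        for (int i = start; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                backslashes++;
            } else {
                if (c == '"' && backslashes % 2 == 0) {
                    return i;                     // unescaped quote: this closes the literal
                }
                backslashes = 0;                  // any other character ends the run
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Source text: #"a\"b\\" rest   (the body starts at index 2)
        String src = "#\"a\\\"b\\\\\" rest";
        int end = findClosingQuote(src, 2);
        String body = src.substring(2, end);      // passed as is to Pattern.compile
        System.out.println(body);
        System.out.println(Pattern.compile(body).matcher("a\"b\\").matches());
    }
}
```

An odd run of backslashes right before a quote just means the quote is escaped and the scan continues, which is why an odd run can never be the last thing in the regex.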
> That is, the " and \" Java regexes yield the same result. So
> if I ban the non-escaped form from regex literal I don't lose
> expressive power and I can decide that it (the non-escaped form) will
> be the delimiter.
Your regular expression format takes advantage of the fact that some
strings are not valid regular expressions, and that certain strings,
though valid regular expressions, are redundant and don't need to be
allowed. Until now, I missed that point.
I like this proposal now.
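The redundancy the proposal relies on is easy to check in Java: java.util.regex allows a backslash before any non-alphabetic character, so the one-character regex " and the two-character regex \" behave the same. The class name below is just for illustration.

```java
import java.util.regex.Pattern;

public class QuoteEquivalence {
    static final Pattern BARE = Pattern.compile("\"");       // regex text: "
    static final Pattern ESCAPED = Pattern.compile("\\\"");  // regex text: \"

    // True when both patterns give the same answer for a full match on the input.
    static boolean agreeOn(String input) {
        return BARE.matcher(input).matches() == ESCAPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(BARE.matcher("\"").matches());
        System.out.println(ESCAPED.matcher("\"").matches());
        System.out.println(agreeOn("x"));
    }
}
```

Since the escaped form is always available, reserving the bare " as the closing delimiter costs no expressive power.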
The differences between Chouser's proposal and Christophe's proposal
are small and not particularly important, but I do prefer
Christophe's. Here are my thoughts on the differences.
1) In Christophe's version, the regex delimiters are always the same.
This gives it a feeling of uniformity.
2) In Christophe's version, I can create those delimiters without
first thinking about whether my preferred delimiters appear in the
regex.
3) Chouser's version allows " to appear as part of the regex without a
matching backslash. That is an advantage but not, in my opinion,
sufficient to make up for points (1) and (2).
Those two possibilities lead to identical interpretations by the Java
regex engine, which is why Christophe's idea works in the first place.
I prefer \" in the regex's printed representation (if it will have
one, which I hope it will) for consistency.