destructuring-regex


Jasper

Feb 25, 2012, 9:21:32 PM
to juli...@googlegroups.com
I implemented a macro, `regex_bind`, that binds variables from regular
expressions, used in the form below. (It is in the attachments, haha, i
don't even know if mailing lists do attachments.)

@regex_bind a_string
["skip.to.regex" stuff_between [integer "[0-9]"]
grab_what_is_after]
{stuff_between, integer, grab_what_is_after}

The macro currently asserts that the pattern has `.head == :hcat`. I
chose hcat because i prefer spaces over commas in this case.

Biggest place where this is incomplete is that if it doesn't match, it
just stops and does nothing!!
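
The attached macro is not reproduced in this thread, so the following is
not its implementation. It is only a minimal sketch of the same idea as a
plain function in current Julia, with the mismatch case made explicit by
returning `nothing`. The name `regex_sequence` and the convention of
passing bare symbols, regexes, and `name => regex` pairs are assumptions
made for illustration.

```julia
# Hypothetical function version of the binding idea (not the attached macro).
# `parts` may contain:
#   a Regex          -> skip to the next match of it
#   a Symbol         -> bind the text between here and the next match (or the rest)
#   Symbol => Regex  -> bind the matched text itself
function regex_sequence(str::AbstractString, parts)
    pos = firstindex(str)
    out = Dict{Symbol,String}()
    pending = nothing                      # symbol waiting for "text up to next match"
    for part in parts
        if part isa Symbol
            pending = part
        else
            name, re = part isa Pair ? (part.first, part.second) : (nothing, part)
            m = match(re, str, pos)
            m === nothing && return nothing    # explicit mismatch result
            if pending !== nothing
                out[pending] = str[pos:prevind(str, m.offset)]
                pending = nothing
            end
            name === nothing || (out[name] = m.match)
            pos = m.offset + ncodeunits(m.match)
        end
    end
    pending === nothing || (out[pending] = str[pos:end])
    return out
end

r = regex_sequence("id: foo 42 rest",
                   [r"id: ", :word, :num => r"[0-9]+", :tail])
```

Here `r[:word]` holds the text between the two matches, `r[:num]` the
matched digits, and `r[:tail]` whatever follows.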

Of course more things could be defined. Instead of `[integer "[0-9]"]`,
perhaps allow `integer::Int64` to tell how to parse that, and also
allow negative values in this case. That could be a caveat, because
without unsigned types we can't specify otherwise. Alternatively simply
do `[integer :integer]` or `[integer integer]`. Maybe `[expr expr]` to
fetch an expression. Now that i think of it, `[output
reference_function_that_eats_string]` could be very extensible.

Otherwise one could define a lot of stuff, for instance `{}`, `[,]`
or even 'function calls' having different meanings, but that would
likely make too much of a mess.

Probably other names, like `regex_var` or `var_filling_regex`, are better
(maybe even just `regex`, though `Regex` is already taken).

One alternative to this whole thing is using a stream and having
`string_upto_regex`, `next_regex_match`, `skip_to_regex`, and
`upto,match = upto_and_match_next_regex`. But that potentially scatters
what is being matched. It also requires typing the function name
all the time (something a macro could somewhat alleviate).
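
For comparison, the stream-style alternative might look roughly like this
in current Julia. This is a sketch under assumptions: the `StringCursor`
type is made up for illustration, and only `upto_and_match_next_regex`
and `skip_to_regex` from the names above are shown.

```julia
# Hypothetical cursor over a string; `pos` is the next index to read from.
mutable struct StringCursor
    str::String
    pos::Int
end
StringCursor(s::AbstractString) = StringCursor(String(s), 1)

# Return (text before the match, the match) and advance, or `nothing`.
function upto_and_match_next_regex(c::StringCursor, re::Regex)
    m = match(re, c.str, c.pos)
    m === nothing && return nothing
    upto = c.str[c.pos:prevind(c.str, m.offset)]
    c.pos = m.offset + ncodeunits(m.match)
    return upto, String(m.match)
end

# Advance past the next match; report whether there was one.
skip_to_regex(c::StringCursor, re::Regex) =
    upto_and_match_next_regex(c, re) !== nothing

c = StringCursor("a=1, b=22;")
first_pair  = upto_and_match_next_regex(c, r"[0-9]+")
second_pair = upto_and_match_next_regex(c, r"[0-9]+")
third       = upto_and_match_next_regex(c, r"[0-9]+")
```

Each call has to name the function and the cursor again, which is the
scattering and repetition mentioned above.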

Talking about streams, is there a string-stream already? Like
http://linux.die.net/man/3/fmemopen, or the Common Lisp one. What about
macros like `with-open-file` (opening and closing a file for you) and
`with-string-stream` (same for a string)? Maybe i'll try to make a bit of
a Common Lisp-Julia rosetta stone, for both reference and inspiration
for Julia (macro) (standard) libraries.

regex-add.j
regex-trying.j

Stefan Karpinski

Feb 27, 2012, 2:22:51 AM
to juli...@googlegroups.com
Sorry, not ignoring this — I just haven't had time to look at it yet!

Stefan Karpinski

Mar 4, 2012, 1:26:32 AM
to juli...@googlegroups.com
Finally got a chance to look at this. It's clearly very powerful, but it's a little too "Lispy" for my taste. I've come to the conclusion that people's brains like syntax. Not too much of it, but more than none. The most commonly used words in every human language are always irregular; that to me is like syntax. The less common words just work according to standard grammatical rules, which is like the standard language constructs of function application, variable assignment, etc.

Ok, sorry, that was sort of a tangent, but this reminded me of it... Can you talk a little more about your motivations for this destructuring regex approach? At one point a while back, I was talking about how the problem with regular expression matching is that it has to both be able to capture all the matching metadata and bind them to variables with names *and* be able to indicate when there is no match. This leads to the return type of the match function being Union(Nothing,RegexMatch), which is not horrible, but kind of unfortunate. Every usage ends up looking like this too:

m = match(r"b(a)(r)", str)
if m != nothing
  x, y = m.captures
  # do something with the match contents
else
  # handle not matching
end

That's an awful lot of boilerplate for dealing with a regex match. Perl avoids this by having a bunch of global variables that are implicitly set: $1, $2, etc. That's effective but pretty awful. It would be nice to have something cleaner. Given the ruby-like block syntax proposed in issue 441, we could solve this much more cleanly now:

match(r"b(a)(r)", str) do x,y
  # do something with the match contents
end

Which, unfortunately, leads us to wanting an else clause; something like this:

match(r"b(a)(r)", str) do x,y
  # do something with the match contents
else
  # handle not matching
end

Maybe that could be translated as this:

match((x,y)->begin
  # do something with the match contents
end, ()->begin
  # handle not matching
end, r"b(a)(r)", str)

This would allow writing very syntax-like forms just using higher-order functions.
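
The two-function translation above can be tried as an ordinary
higher-order function. A minimal sketch, assuming a hypothetical name
`on_match` (chosen so as not to add methods to `Base.match`) and the
`do`-block form as it exists in current Julia, where the block becomes
the first argument:

```julia
# Hypothetical helper: call `found` with the captures on a match,
# otherwise call `not_found` with no arguments.
function on_match(found::Function, not_found::Function,
                  re::Regex, str::AbstractString)
    m = match(re, str)
    return m === nothing ? not_found() : found(m.captures...)
end

# The `do` block supplies `found`; the "else" branch is passed explicitly.
result = on_match(() -> "no match", r"b(a)(r)", "foobar") do x, y
    string(y, x)
end

missed = on_match((x, y) -> string(y, x), () -> "no match", r"b(a)(r)", "foo")
```

Without an `else` clause on `do`, the not-matching handler still has to
be written as a separate anonymous function.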

Stefan Karpinski

Mar 4, 2012, 1:28:50 AM
to juli...@googlegroups.com
There is a string stream construct, which can be used in a variety of manners. Basically the core is a memio object, which presents an I/O interface but just writes into memory. Later you can take the string associated with it. This is wrapped up in the print_to_string function, which takes a function with arguments and returns the output resulting from calling the function on those arguments. It's used extensively in string.j if you're interested in examples.
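
For readers on current Julia releases, which postdate this thread, the
same pattern can be sketched with `IOBuffer` and `sprint`. These are the
present-day names, not the `memio` and `print_to_string` mentioned above;
how the 2012 functions map onto them is an assumption here.

```julia
# Write into memory through the ordinary I/O interface, then take the string.
io = IOBuffer()
print(io, "x = ", 42)
write(io, '\n')
s = String(take!(io))

# `sprint` calls a function with an in-memory stream and returns its output.
t = sprint(show, [1, 2, 3])

# An IOBuffer can also wrap an existing string for reading.
input = IOBuffer("first line\nsecond line")
line1 = readline(input)
```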

Patrick O'Leary

Mar 4, 2012, 3:55:00 AM
to juli...@googlegroups.com
On Saturday, March 3, 2012 7:26:32 PM UTC-6, Stefan Karpinski wrote:
Ok, sorry, that was sort of a tangent, but this reminded me of it... Can you talk a little more about your motivations for this destructuring regex approach? At one point a while back, I was talking about how the problem with regular expression matching is that it has to both be able to capture all the matching metadata and bind them to variables with names *and* be able to indicate when there is no match. This leads to the return type of the match function being Union(Nothing,RegexMatch), which is not horrible, but kind of unfortunate. Every usage ends up looking like this too:

m = match(r"b(a)(r)", str)
if m != nothing
  x, y = m.captures
  # do something with the match contents
else
  # handle not matching
end

I don't see how it would fit in with Julia, but I do like how Haskell propagates failure with the Maybe monad, particularly when using do to hide the monadic machinery. You don't need the explicit "if" check in that case. Scala does the same thing with for-comprehensions.
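
A rough approximation of that style in Julia, treating `nothing` as the
failure value. This is only a sketch: `maybe`, `first_capture`, and
`port` are hypothetical helpers, not part of any library, and it does not
reproduce Haskell's `do` notation.

```julia
# Apply `f` only when there is a value; otherwise pass `nothing` along.
maybe(f, x) = x === nothing ? nothing : f(x)

# Each step may fail by returning `nothing`.
first_capture(re, str) = maybe(m -> m.captures[1], match(re, str))
parse_int(s) = tryparse(Int, s)

# No explicit `if` checks: a failure at any step falls through.
port(str) = maybe(parse_int, first_capture(r"port=(\d+)", str))

ok  = port("host=a port=8080")
bad = port("host=a")
```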

Jasper

Mar 8, 2012, 6:06:19 PM
to juli...@googlegroups.com
Sorry for taking a bit of time to respond.

The origin of the idea is basically Common Lisp's `destructuring-bind`. I
used the `regex` library to make `regex-list`, which makes a list of
matches when given a list of regular expressions and a string:
https://github.com/o-jasper/j-basic/blob/master/src/regex-sequence.lisp
Logically it followed that there be a `destructuring-regex`. Note that
the route via the list is (in principle) less efficient than directly
throwing the stuff into variables, like i did in the Julia macro. (I
might update the Lisp version.)

It is lispy simply because it is a direct 'translation' of the concept.
About syntax, personally i don't feel that syntax is really needed, but
i think it might be for others, and it might be needed to attract
programmers. Anyway, if syntax maps to s-expressions neatly, it doesn't
matter much, which seems to be true for Julia as far as i can see so far.

I have made an initial CL function list for 'inspiration', or learning
from its mistakes
https://github.com/JuliaLang/julia/wiki/common-lisp-rosetta-for-the-devs

To be honest, i don't have any Ruby or Perl experience, and i only
very briefly tried Haskell. Features of those languages might
be a better option than a macro, just as some of the macros in that wiki
page might be done better with object-destructors.

I gave paragraphs titles so it is slightly less wall-of-texty.

== Syntax of regular expressions with variables ==
Basically, since it is macroexpand-time, the string would not be
something calculated at run-time. Hence the 'string-inserting'
notation `"$a$b"` would not be available for it. Maybe we can use it
for indicating variables to match instead. Instead of
`["skip.to.regex" stuff_between [integer "[0-9]"] grab_what_is_after]`
do
`"skip.to.regex$(stuff_between)$(integer::"[0-9]")$grab_what_is_after"`
or some such. `$(i::Unsigned)` could be defined as `$(i::"[0-9]")`; we
could go as far as allowing every type to have its own regex. (I am a
poor example-maker.)
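
The "every type has its own regex" idea can be sketched with ordinary
dispatch in current Julia. All names here (`regex_of`, `parse_fields`)
are hypothetical, and separating fields by whitespace is an assumption
made to keep the example small.

```julia
# Hypothetical per-type regex fragments, selected by dispatch.
regex_of(::Type{<:Unsigned}) = "[0-9]+"
regex_of(::Type{<:Signed})   = "-?[0-9]+"

# Build one regex from `name => Type` pairs, match it, and parse each
# capture with the type it was declared with. Returns `nothing` on mismatch.
function parse_fields(str::AbstractString, fields::Pair...)
    groups = ["(" * regex_of(T) * ")" for (_, T) in fields]
    m = match(Regex(join(groups, "\\s+")), str)
    m === nothing && return nothing
    return Dict(name => parse(T, c) for ((name, T), c) in zip(fields, m.captures))
end

fields = parse_fields("42 -7", :count => UInt, :delta => Int)
```

Adding support for another type then only needs a new `regex_of` method
and a `parse` method for that type.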

== Mismatch condition ==
On one hand, maybe things can get messy when you have to check if stuff
is `nothing`; on the other hand, i feel we might be putting too many
features into the macro if we try to deal with it there.
The thing with the `if ... then .. end` syntax is that `then` and `end`
take two lines, making it feel a bit clunky sometimes. `.. ? .. : ..`
helps in this respect, though. (Is it identical to `if`? I don't think it
is in C.)

(Does Julia have any kind of `case` or `switch` yet?) Next to the
'basic' regex we could make a `regex_case`, each case having a
regex-with-variables input and a body; the body is just executed (with
the variables available, of course!) if it matches.
The user might also be interested in partial matches though; not sure
how to allow for that. Maybe:

regex_case input_stream_or_string
case matcher
... # Complete match
case matcher
... # Complete match (optional)
case 2 # `2` is incorrect for a matcher; it indicates 'at least two
... # variables matched of the previous one'.
else # Optionally, of course. (maybe `default`)
end

Now i notice that that would be similar to continuing with `else
match(..) do ...`. Anyway, there might also be a 'plain' version:

regex input_stream_or_string matcher
...body..
end
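
Without new syntax, the complete-match part of `regex_case` can be
approximated by a function taking `regex => handler` pairs. A minimal
sketch in current Julia; the function form and the `describe` example are
assumptions, and the partial-match `case 2` idea is not covered.

```julia
# Hypothetical function form of `regex_case`: try each regex in order and
# call the first matching handler with the captures; otherwise call `default`.
function regex_case(default::Function, str::AbstractString, cases::Pair{Regex}...)
    for (re, body) in cases
        m = match(re, str)
        m === nothing || return body(m.captures...)
    end
    return default()
end

# The `do` block plays the role of the `else` branch.
describe(s) = regex_case(s,
        r"^(\d+)-(\d+)$" => ((a, b) -> "range $a to $b"),
        r"^(\d+)$"       => (n -> "single $n")) do
    "unrecognised"
end
```

Usage: `describe("1-5")`, `describe("7")`, and `describe("x")` each take a
different branch.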

I guess Ruby blocks aren't quite the same as the macros; i'll try to look
at Ruby/Perl a bit more. I do see that you could make a function:

matcher_function(regex, if_match::Function, if_not_matched::Function)

And have `if_match` catch the arguments from regex somehow, and
`if_not_matched` can just use `matcher` again. Not sure if i am
describing what Ruby does here though.

Both `matcher_function`, and `match(..) do` require the thing that is
matched with to be repeated though. We could have both
`matcher_function`, or `match(..,str) do ..` and a macro mopping it up.

How do we optimize, or keep the route to optimization free, though? If
the start of the string to match is the same, and it didn't match
before, it won't match later, for instance. What the macro could do is
make a 'tree', each node branching where the bits of regular expression
differ. That probably requires a good understanding of how the regex
works, and a bit of care though; chopping up the regular expressions too
granularly might actually slow things down.
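
A one-level version of that 'tree' can be sketched by hand, assuming the
shared beginnings are literal strings rather than arbitrary regex
fragments. Everything here is hypothetical (`prefixed_case`, the handler
names, the example inputs); a macro would have to derive the grouping
itself.

```julia
# Hypothetical: cases are grouped under a literal prefix, so one failed
# `startswith` test rules out every regex in that group at once.
function prefixed_case(str::AbstractString, groups::Pair{String}...)
    for (prefix, cases) in groups
        startswith(str, prefix) || continue
        rest = SubString(str, 1 + ncodeunits(prefix))
        for (re, body) in cases
            m = match(re, rest)
            m === nothing || return body(m.captures...)
        end
    end
    return nothing
end

show_user(id) = "show user $id"
list_users()  = "list users"
create_user() = "create user"

route(s) = prefixed_case(s,
    "GET "  => (r"^/users/(\d+)$" => show_user, r"^/users$" => list_users),
    "POST " => (r"^/users$" => create_user,))
```

Whether this is faster than trying each full regex in turn depends on the
regex engine and on how many cases share a prefix.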

== Macros first-classy enough? ==
To be honest, i don't really like the `@macro` notation. If macros
cannot simply name themselves and `end` without a `begin`, they're more
like second-class citizens. If they can't, the above wouldn't be able to
be a standard library. Though i guess it could just be altered to start
with `@regex` or `@regex_case`, and have a slightly superfluous `begin`
in there, and the body of the `begin .. end` would behave a 'bit
strange', because we'd be looking for the variable `case` in there.

== Notes ==
Numbered (preferably local)variables are ugly, but i think they might
also be convenient. Maybe we can make them, and put them in a 'quick and
dirty' module/package/namespace. (for instance for interactive repl use)
They might also be useful for making functions passed into `map`, `anyp`
or anything else taking a function as argument. (left/right)currying
and composing can help there too, but anything more than one level of
them starts getting harder to read.
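
For reference, this is what one level of currying and composing looks
like in current Julia, using `Base.Fix1`, `Base.Fix2`, and the `∘`
operator. These postdate the thread, and the variable names are
illustrative only.

```julia
# Partial application: fix the first or the second argument of a function.
add_one = Base.Fix1(+, 1)        # x -> 1 + x
halve   = Base.Fix2(/, 2)        # x -> x / 2

# Composition applies right to left: first add_one, then halve.
pipeline = halve ∘ add_one

curried = map(pipeline, [1, 3, 5])

# The same thing as an anonymous function, for comparison.
lambda = map(x -> (1 + x) / 2, [1, 3, 5])
```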

Jasper

Mar 12, 2012, 11:48:50 AM
to juli...@googlegroups.com
Excuse the previous post being so long. I updated the CL version
https://github.com/o-jasper/j-basic/blob/master/src/destructuring-regex.lisp
to use a more efficient method, and to illustrate the idea of allowing
each type to have a regex (regex-string-of-type) and a parser
(parse-type). A `regex-case` is pretty easy to make, and the difficulty
in optimizing, as said before, is mainly in figuring out how equivalent
the regular expressions are.

Looks like i was wrong about the `"$.."` notation being like a formatting
function; quoting it i see `macrocall`. Still, i don't think non-constant
regular expressions belong in destructuring-regex, so using 'that
notation in reverse' is still an idea.
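
In current Julia an interpolated string literal parses to an expression
with head `:string` rather than the `macrocall` seen in 2012, so a macro
receives the literal pieces and the interpolated expressions separately.
A small sketch of reading the notation 'in reverse'; the macro name
`@pattern_parts` is hypothetical and it only returns the pieces as data
instead of building a matcher.

```julia
# What the parser produces for an interpolated string literal.
ex = Meta.parse(raw"\"skip$(between)$(n)\"")
# ex.head is :string and ex.args holds "skip", :between, :n

# Hypothetical macro: instead of building a string, hand back the pieces.
macro pattern_parts(s)
    parts = (s isa Expr && s.head === :string) ? s.args : Any[s]
    # Literal text stays a String; anything interpolated is quoted as data.
    return Expr(:tuple, [p isa String ? p : QuoteNode(p) for p in parts]...)
end

pieces = @pattern_parts("skip$(between)$(n::Unsigned)")
```

Neither `between` nor `n` needs to exist as a variable here, since the
macro never evaluates them.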
