announcing RubyLexer 0.6.0

vikkous

Apr 23, 2005, 11:53:03 AM

At this time, I am pleased to announce the release of RubyLexer 0.6.0,
a standalone lexer of ruby in ruby. RubyLexer attempts to completely
and correctly tokenize all valid ruby 1.8 source code, and it mostly
succeeds. In time, RubyLexer will be able to lex all ruby code. For
now, some newer features are unsupported and there are some extremely
obscure bugs involving strings, but all real world ruby code should be
supported. It is my hope to provide a high-quality lexer for all those
language tools which require one.

RubyLexer is hosted on RubyForge
(http://rubyforge.org/projects/rubylexer/).
Here's where to get the tarball:
http://rubyforge.org/frs/download.php/4191/rubylexer-0.6.0.tar.bz2

Trans

Apr 23, 2005, 12:49:46 PM

Hi,

Could you describe RubyLexer a bit more? I know very little about
lexers, so excuse me if I ask dumb questions, but... What does the output
look like? How does it compare to other projects like ParseTree? Do you
have any plans for its use?

Thanks,
T.

Florian Groß

Apr 23, 2005, 1:04:04 PM

vikkous wrote:

> At this time, I am pleased to announce the release of RubyLexer 0.6.0,
> a standalone lexer of ruby in ruby. RubyLexer attempts to completely
> and correctly tokenize all valid ruby 1.8 source code, and it mostly
> succeeds.

How extendable is this? Would you be able to add new rules to it at
run-time? If so, it could be used for writing Ruby source code
filters, which are useful for exploring new syntax.

I can also contribute a few pieces of code that I think are hard to
lex properly, if you are interested.

Peter Suk

Apr 23, 2005, 1:14:40 PM

On Apr 23, 2005, at 10:54 AM, vikkous wrote:

> At this time, I am pleased to announce the release of RubyLexer 0.6.0,

YeeHaaa!! ThankYouThankYou!

--
There's neither heaven nor hell, save what we grant ourselves.
There's neither fairness nor justice, save what we grant each other.

vikkous

Apr 23, 2005, 5:36:11 PM

A lexer, or tokenizer (they mean the same thing) divides an input
source language into words. It also removes comments and finds the
boundaries of strings. Once this is done, it's much easier to correctly
process the language in a pre-processor or parser. Here's an example.
Given this ruby code:

8+(9 *5)

a correct lexing is something like:

["8","+","(","9","*","5",")"]

(For lexing purposes, punctuation and operators count as strings as
well.)

The output of RubyLexer is actually more complicated than that... for
one thing, there are tokens for whitespace as well. For another, the
individual tokens are not Strings, but Tokens (or subclasses of it, to
be precise), a class defined in RubyLexer. Tokens do respond to to_s in
the expected way, however. (Initially, I did want to have RubyLexer
just return Strings, but it turned out I needed to distinguish
different token types, and the best way to do that is with the type
system.)
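
To make the idea concrete, here is a minimal sketch of a toy tokenizer
for arithmetic like the example above. It is not RubyLexer's code or
API; ToyToken and toy_lex are hypothetical names made up for
illustration, and the sketch only understands integers, whitespace,
and a handful of operators.

```ruby
require "strscan"

# Hypothetical token class for illustration; not RubyLexer's Token.
ToyToken = Struct.new(:kind, :text, :offset) do
  def to_s
    text
  end
end

# Splits a simple arithmetic expression into number, operator, and
# whitespace tokens. Anything else raises an ArgumentError.
def toy_lex(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    offset = scanner.pos
    if (text = scanner.scan(/\d+/))
      tokens << ToyToken.new(:number, text, offset)
    elsif (text = scanner.scan(/[-+*\/()]/))
      tokens << ToyToken.new(:operator, text, offset)
    elsif (text = scanner.scan(/\s+/))
      tokens << ToyToken.new(:whitespace, text, offset)
    else
      raise ArgumentError, "unexpected character at offset #{offset}"
    end
  end
  tokens
end

tokens = toy_lex("8+(9 *5)")
words = tokens.reject { |t| t.kind == :whitespace }.map(&:to_s)
p words
```

Keeping the whitespace tokens in the stream means the original text
can be rebuilt by joining every token's to_s.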

ParseTree is a parser, not a lexer. Parsing is the next step in a
compiler pipeline; it determines the order in which to evaluate the
operations in an expression and solves the difficult problems of
precedence and associativity. (Another way to think of parsers is as
the bit that figures out where the implicit parentheses are inserted
into the source code.) I think that the tool corresponding to RubyLexer
is Ripper, but I don't really know, so don't blame me if I'm wrong.
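
As a rough illustration of the "implicit parentheses" view, the sketch
below uses precedence climbing to make them explicit. It is a toy that
assumes the input is already split into words and contains only
numbers and the four left-associative operators listed; PRECEDENCE and
parenthesize are hypothetical names, unrelated to ParseTree or Ripper.

```ruby
# Binding strength of each binary operator; higher binds tighter.
PRECEDENCE = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

# Precedence climbing over an array of word tokens. Consumes tokens
# from the front of the array and returns the expression as a string
# with every implicit parenthesis written out.
def parenthesize(tokens, min_prec = 1)
  left = tokens.shift
  while (op = tokens.first) && PRECEDENCE.fetch(op, 0) >= min_prec
    tokens.shift
    # The right operand may only absorb operators that bind tighter,
    # which is what makes the operators left-associative.
    right = parenthesize(tokens, PRECEDENCE[op] + 1)
    left = "(#{left} #{op} #{right})"
  end
  left
end

puts parenthesize(%w[8 + 9 * 5])
puts parenthesize(%w[8 - 9 - 5])
```

The first call groups the multiplication before the addition; the
second groups the two subtractions from the left.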

I have lots of plans, of course, but being only one little programmer
with lots of big ideas, who knows if I'll ever get to them...

vikkous

Apr 23, 2005, 5:47:19 PM

> How extendable is this? Would you be able to add new rules to it
> at run-time?

Ummm... if you're really lucky, maybe. I didn't really have
extensibility in mind. It might be possible to add it, without a lot
of trouble, depending on what you want to extend. So, what do you want
to extend?

> If it is like that then it could be used for writing Ruby
> source code filters which is something that is useful for exploring
> new syntax.

One of the applications I had in mind was to create a lexer family for
ruby-like languages, but that has sort of fallen by the wayside right
now. I still like the idea, but other priorities press at the moment.

> I can also contribute a few pieces of code that I think are hard
> to lex properly if you are interested.

Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
twisted, devious, mutant syntax, I want it all for my menagerie.

vikkous

Apr 23, 2005, 5:49:07 PM

Peter Suk wrote:
> YeeHaaa!! ThankYouThankYou!

You're welcome. It's nice to be appreciated.

Hal Fulton

Apr 23, 2005, 6:48:28 PM

vikkous wrote:
>>I can also contribute a few pieces of code that I think are hard
>>to lex properly if you are interested.
>
> Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
> twisted, devious, mutant syntax, I want it all for my menagerie.

Ha... I'll see if I can dig up anything.

In the meantime, one of my favorites is an expression containing a
string that contains an interpolated expression that contains a
string containing another interpolated expression:

x = "Hi, my name is #{"Slim #{rand(4)>2?"Whitman":"Shady"}"}."

Hal
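
A small sketch of why a line like that is awkward to lex: a naive
"quote to next quote" regexp stops at the first inner quote, while
Ruby itself accepts the whole line. The source string below is the
line from this post with spaces added around the operators; the naive
regexp is an illustrative strawman, not something any of the lexers
discussed here is claimed to use.

```ruby
# Single quotes keep the #{...} parts literal so the text can be inspected.
source = 'x = "Hi, my name is #{"Slim #{rand(4) > 2 ? "Whitman" : "Shady"}"}."'

# Naive string matcher: an opening quote, then everything up to the
# next quote. It has no notion of interpolation nesting.
naive = source[/"[^"]*"/]
p naive

# Ruby itself tracks the nesting and evaluates the whole literal.
value = eval(source)
puts value
```

The naive match ends right after the first interpolation opens, which
is exactly the boundary a real lexer has to get past.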

gabriele renzi

Apr 23, 2005, 8:52:41 PM

vikkous wrote:

first let me say I think this is cool :)
Anyway, I wonder: isn't something like this included with ruby (irb's
lexer) ?
Care to explain the differences a little?

Florian Groß

Apr 23, 2005, 10:03:22 PM

vikkous wrote:

>>How extendable is this? Would you be able to add new rules to it
>>at run-time?
>
> Ummm... if you're really lucky, maybe. I didn't really have
> extensibility in mind. It might be possible to add it, without a lot
> of trouble, depending on what you want to extend. So, what do you want
> to extend?

One simple example would be adding a ".=" assign-result-of-method-call
operator as in "foo = 'bar'; foo .= reverse"

> Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
> twisted, devious, mutant syntax, I want it all for my menagerie.

See attachment.

pre.rb

vikkous

Apr 23, 2005, 10:39:56 PM

> first let me say I think this is cool :)
> Anyway, I wonder: isn't something like this included with ruby (irb's
> lexer) ?
> Care to explain the differences a little?

Irb's lexer is not as complete. I can't think of any examples, but when
developing this, I played around with irb quite a bit, trying different
syntaxes. Irb would do pretty well most of the time, but every so
often, I'd come up with something that had to be wrapped in eval %() in
order to work in irb...

vikkous

Apr 23, 2005, 10:53:13 PM

> "Hi, my name is #{"Slim #{rand(4)>2?"Whitman":"Shady"} "}."

Yes, this is the type of thing I'm thinking of! Stretch the language!
Bend it to the breaking point! <Sound of whip cracking>. But you're not
being deviant enough; you didn't break my lexer yet (tho you can never
be too sure with these string interpolations).

Here's how tricky you have to be to fool it:

p "#{<<kekerz}#{"foob"
zimpler
kekerz
}"

Here document header and body in different interpolations... tricky.

Peter Suk

Apr 24, 2005, 12:13:11 AM

Examples?

vikkous

Apr 24, 2005, 1:32:28 AM

Florian Groß wrote:
> One simple example would be adding a ".=" assign-result-of-method-call
> operator as in "foo = 'bar'; foo .= reverse"

At first, I thought, "This guy is dreaming; my code is just too rigid
to allow extensions of that kind very easily.". But of course, it
wouldn't be too hard for me to special case this one operator in for
you if you wanted to... it'd just be a quick hack in RubyLexer#dot...
in fact, it could be done in a subclass:
[warning: untested code!]

class FlorianRubyLexer < RubyLexer
  def dot(ch)
    # this is the routine in RubyLexer that handles tokens
    # beginning with '.'
    if readahead(2)=='.='
      KeywordToken.new(@file.read(2),@file.pos-2)
    else
      super
    end
  end
end

Not too bad for extensibility, eh? I think things look quite hopeful
for your idea, actually...
You do have to know RubyLexer internals to do this kind of thing, but
that's true for any library. And you probably want to add operators
that create no new ambiguities in the language. This one doesn't create
ambiguity, a sign that you've been thinking about this already. Tell me
more of the kind of thing you want, and maybe I'll write more of your
lexer for you.
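
For anyone who wants to try the subclass-and-override pattern without
RubyLexer installed, here is a self-contained sketch of the same
shape. ToyLexer and DotAssignLexer are hypothetical stand-ins and are
far simpler than RubyLexer; only the idea of overriding the handler
for tokens that begin with '.' is taken from the code above.

```ruby
require "strscan"

# Toy base lexer: dispatches on the next character and returns
# [kind, text] pairs. Hypothetical; not RubyLexer's design.
class ToyLexer
  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  def tokens
    result = []
    result << next_token until @scanner.eos?
    result
  end

  private

  def next_token
    case @scanner.peek(1)
    when "." then dot
    when /\s/ then [:whitespace, @scanner.scan(/\s+/)]
    when /\w/ then [:identifier, @scanner.scan(/\w+/)]
    else [:operator, @scanner.getch]
    end
  end

  # Handles tokens beginning with '.'.
  def dot
    [:operator, @scanner.getch]
  end
end

# Adds a ".=" operator by overriding only the '.' handler.
class DotAssignLexer < ToyLexer
  private

  def dot
    if @scanner.peek(2) == ".="
      [:operator, @scanner.scan(/\.=/)]
    else
      super
    end
  end
end

p DotAssignLexer.new("foo .= reverse").tokens
p ToyLexer.new("foo .= reverse").tokens
```

The base class splits ".=" into two operator tokens; the subclass
returns it as one and falls back to super for every other dot.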

> See attachment.
> « pre.rb »

Now that's deviant! Whitespace as a fancy string delimiter... I don't
even know if that's what breaks RubyLexer, but that's sick, man, really
sick.

Ps: what does the code do?

vikkous

Apr 24, 2005, 3:02:41 AM

> Examples?

I should have written them down, but I didn't. Next time I come across
one, I'll let you know. Some no doubt got into testdata/p.rb (in
rubylexer).

gabriele renzi

Apr 24, 2005, 5:44:48 AM

vikkous wrote:

this is what I expected, I just think you should make it clear to casual
users :)

Florian Groß

Apr 24, 2005, 3:37:33 PM

vikkous wrote:

>>One simple example would be adding a ".=" assign-result-of-method-call
>>operator as in "foo = 'bar'; foo .= reverse"
>
> At first, I thought, "This guy is dreaming; my code is just too rigid
> to allow extensions of that kind very easily.". But of course, it
> wouldn't be too hard for me to special case this one operator in for
> you if you wanted to... it'd just be a quick hack in RubyLexer#dot...
> in fact, it could be done in a subclass:
> [warning: untested code!]
>
> class FlorianRubyLexer < RubyLexer
>   def dot(ch)
>     # this is the routine in RubyLexer that handles tokens
>     # beginning with '.'
>     if readahead(2)=='.='
>       KeywordToken.new(@file.read(2),@file.pos-2)
>     else
>       super
>     end
>   end
> end

Which is exactly what I thought would be a good way of extending. This
looks good.

Another thing that I would be able to make good use of is getting the
next expression, whatever it might be.

Let's say I have this code:

z = if (x + y) * 2 > 2 then
  code here
end

It would then be very nice if I could lex until I see the 'if' then say
'give me an atomic expression' which would parse until the 'then' and
then say 'give me an atomic expression' again which would parse until
the 'end'. Basically I don't want to match paired things (parentheses,
do .. end, class definitions etc.) at the transformation level.

Yup, that sample does not introduce any new syntax -- I would like to
transform it to this:

z = if ((x + y) * 2 > 2).true? then
  code here
end

Which is why I would need to find a sub-expression.

Also note that just grabbing everything until the next 'then' would not
be good enough:

# Nonsense code, but still valid
if x > if x < 5 then 3 else 2 end then
  puts "Good!"
end

If it weren't for that point then IRB's lexer would be a more or less
nifty match already.
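
To show the kind of nesting-aware search meant here, a minimal sketch
over a flat array of word tokens follows. It is not built on IRB's
lexer or RubyLexer; matching_then and wrap_condition are hypothetical
helpers, and the sketch assumes that 'if' and 'end' are the only
nesting words in the input (no modifier 'if', no 'do', no 'begin').

```ruby
# Given word tokens and the index of an 'if', returns the index of the
# 'then' that belongs to it, skipping over nested if ... end groups.
def matching_then(tokens, if_index)
  depth = 0
  ((if_index + 1)...tokens.size).each do |i|
    case tokens[i]
    when "if"
      depth += 1
    when "end"
      depth -= 1
    when "then"
      return i if depth.zero?
    end
  end
  nil
end

# Wraps the condition of the first 'if' in parentheses and appends
# ".true?", as in the transformation described above.
def wrap_condition(tokens)
  if_index = tokens.index("if")
  then_index = matching_then(tokens, if_index)
  condition = tokens[(if_index + 1)...then_index]
  tokens[0..if_index] + ["(", *condition, ")", ".true?"] + tokens[then_index..-1]
end

simple = %w[z = if ( x + y ) * 2 > 2 then code end]
puts wrap_condition(simple).join(" ")

nested = %w[if x > if x < 5 then 3 else 2 end then puts end]
puts wrap_condition(nested).join(" ")
```

In the nested case the inner 'then' is skipped because the depth
counter is still above zero when it is seen.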

Does this sound like something that can be done without too much trouble?

For doing code transformations it is of course also important that you
can turn back the stream of tokens into a String easily. I did this with
IRB's lexer by using the .line_no and .pos methods of tokens, but that
was not too good a match, actually.

>>See attachment.
>>« pre.rb »
>
> Now that's deviant! Whitespace as a fancy string delimiter... I don't
> even know if that's what breaks RubyLexer, but that's sick, man, really
> sick.

Oh, that is still relatively simple. There's worse stuff happening under
the surface.

> Ps: what does the code do?

If you invoke it as ruby -rpre file.rb it will pre-process file.rb
before letting Ruby handle it. It parses simple directives that look
like this:

#!if rand > 0.5 then
}{}{ # Cause a Syntax Error
#!else
puts "Hello World"
#!end

That file would produce a Syntax Error at parse-time half of the time
and output Hello World in the other cases.

> "Hello"
> 1+5
> Time.now
#!gsub!(/^>/, "puts")

And that would make '>' at the beginning of a line mean 'output this: '.
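
For illustration only, the gsub!-style directive can be imitated in a
few lines. This is not the code from pre.rb; apply_directives is a
hypothetical helper that handles just the one kind of directive shown
above (a "#!" line whose rest is a method call on the text collected
so far), and it evaluates that text with instance_eval, so it should
only ever be fed trusted input.

```ruby
# Collects ordinary lines into a buffer; a line starting with "#!" is
# evaluated with the buffer as self, so "#!gsub!(...)" rewrites the
# lines seen so far.
def apply_directives(source)
  output = "".dup
  source.each_line do |line|
    if line.start_with?("#!")
      output.instance_eval(line[2..-1])
    else
      output << line
    end
  end
  output
end

source = <<~'SRC'
  > "Hello"
  > 1+5
  #!gsub!(/^>/, "puts")
SRC

puts apply_directives(source)
```

The #!if / #!else / #!end form needs real block tracking and is not
covered by this sketch.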

It's basically something like the C preprocessor, but in a more Rubyish
manner written in obscure style. I guess it is pretty useless after all.

vikkous

Apr 25, 2005, 1:30:59 AM

Florian Groß wrote:

> Which is exactly what I thought would be a good way of
> extending. This looks good.

Everything may not be as simple as this one case was. The fact that the
first example you gave turned out to be pretty easy is encouraging, but
I think we're likely to run into something really nasty before you are
happy.

> It would then be very nice if I could lex until I see the 'if' then
> say 'give me an atomic expression' which would parse until
> the 'then' and then say 'give me an atomic expression' again
> which would parse until the 'end'. Basically I don't want to
> match paired things (parentheses, do .. end, class definitions
> etc.) at the transformation level.

In general, 'get the next expression' is a problem that requires a
parser, not a lexer. Have you looked at ParseTree? Of course you have.

In this case however, you are in luck. Delimited expressions, that
start and end with ( and ), or begin and end, or whatever, are already
discovered by my lexer. (During the development of RubyLexer, I
discovered that it had to be half-a-parser as well, in order to
correctly get all the information that's needed to lex correctly.) The
information you want is already being gathered by RubyLexer, it's just
not available in a public interface. We should negotiate such an
interface since you seem to need it. What you propose, 'get the next
expression', is not one I want to do. RubyLexer does not deal in
abstractions larger than tokens... at least, not on a public level. I
am, however, willing to emit 'advisory' tokens at certain points in the
token stream, (several such types of tokens are being emitted already)
which should allow you to do what we want, if we design it carefully.

On the other hand.... the reason I chose not to emit advisory tokens
for this particular case is that the complementary tool to RubyLexer is
intended to be Reg, which can find nested pairs of braces and the like
pretty easily. Have you looked at Reg at all? I realize that I only
released it yesterday, and as of yet it's only half-working because
critical features are as yet unimplemented, but I think it might be
just the thing for the types of preprocessors you have in mind.

Reg might not be able to easily tell 'if' the postfix operator from
'if' the value in current RubyLexer output. Since one requires an end
and the other doesn't, that can be troublesome to deal with. 'do' is
also a pain, now that I think of it. All these cases are handled
correctly in RubyLexer, we just have to find an appropriate
(token-based, not expression-based) interface.

> Also note that just grabbing everything until the next 'then' would
> not be good enough:
>
> # Nonsense code, but still valid
> if x > if x < 5 then 3 else 2 end then
> puts "Good!"
> end

Don't worry about this type of thing. I have these problems well under
control, one way or another.

> Does this sound like something that can be done without
> too much trouble?

Definitely!

> For doing code transformations it is of course also important that
> you can turn back the stream of tokens into a String easily. I did
> this with IRB's lexer by using the .line_no and .pos methods of
> tokens, but that was not too good a match, actually.

So what would be a good match? I don't see why this should be a
problem. My implementation of Token implements to_s, which returns the
ruby code corresponding to the token; usually, this is exactly the
same as the code that created the token originally. There's also an
offset method, which returns the position of the token in the input
stream, relative to the very beginning. Tokens don't have a #line_no,
but you can get the same information from FileAndLineTokens.

Turning the token stream back into a big string (or file) is essentially
what one of my test programs (tokentest) does. The resulting ruby files
are legal and parse in exactly the same way. I haven't yet shown that
they are really exactly equivalent (but there's not much room for
variation); that will be the next RubyLexer release.
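
The round trip can be sketched in a few lines. This is not tokentest
and does not use RubyLexer's classes; OffsetToken, split_with_offsets,
and reassemble are hypothetical names, and the splitter is a crude
regexp rather than a real Ruby lexer. It only shows how to_s plus an
offset is enough to rebuild the input and to check that nothing was
dropped.

```ruby
# Hypothetical token: its text plus its offset from the start of input.
OffsetToken = Struct.new(:text, :offset) do
  def to_s
    text
  end
end

# Splits source into runs of word characters, runs of whitespace, or
# single punctuation characters, recording where each run starts.
def split_with_offsets(source)
  tokens = []
  source.scan(/\w+|\s+|[^\w\s]/) do |text|
    tokens << OffsetToken.new(text, Regexp.last_match.begin(0))
  end
  tokens
end

# Rebuilds the text, checking that each token starts exactly where the
# previous one ended.
def reassemble(tokens)
  tokens.each_with_object("".dup) do |token, out|
    raise "gap before offset #{token.offset}" unless token.offset == out.size
    out << token.to_s
  end
end

source = "z = if (x + y) * 2 > 2 then\n  code\nend\n"
rebuilt = reassemble(split_with_offsets(source))
puts rebuilt == source
```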

> If it weren't for that point then IRB's lexer would be a more or
> less nifty match already.

> I did this with IRB's lexer by using the .line_no and .pos
> methods of tokens, but that was not too good a match, actually.

Wait... so you wrote irb's lexer? One of my wishlist items is to
integrate RubyLexer with irb among others.... how hard do you think
this will be?

> Oh, that is still relatively simple. There's worse stuff happening
> under the surface.

Well, it was unexpected for me. Much to my embarrassment; I thought I
was an expert at this. I must say many elements of this got me very
confused at first, and obviously I never put all the pieces together.
Congratulations.

Ps: I haven't figured out why this breaks RubyLexer yet, but I will.

Pps: putting tricky stuff in eval strings and the like won't break the
lexer (yet). To the lexer, it's just a string.

> It's basically something like the C preprocessor, but in a more
> Rubyish manner written in obscure style. I guess it is pretty
> useless after all.

Not at all. Now that I know what it does, maybe I'll find a use for it,
someday.

Florian Groß

Apr 25, 2005, 9:37:01 AM

vikkous wrote:

>>Which is exactly what I thought would be a good way of
>>extending. This looks good.
>
> Everything may not be as simple as this one case was. The fact that the
> first example you gave turned out to be pretty easy is encouraging, but
> I think we're likely to run into something really nasty before you are
> happy.

Hm, that ought to be not too much of a problem. I'm okay with having a
look at some of the internals for that kind of thing.

>>It would then be very nice if I could lex until I see the 'if' then
>>say 'give me an atomic expression' which would parse until
>>the 'then' and then say 'give me an atomic expression' again
>>which would parse until the 'end'. Basically I don't want to
>>match paired things (parentheses, do .. end, class definitions
>>etc.) at the transformation level.
>
> In general, 'get the next expression' is a problem that requires a
> parser, not a lexer. Have you looked at ParseTree? Of course you have.
>
> In this case however, you are in luck. Delimited expressions, that
> start and end with ( and ), or begin and end, or whatever, are already
> discovered by my lexer. (During the development of RubyLexer, I
> discovered that it had to be half-a-parser as well, in order to
> correctly get all the information that's needed to lex correctly.) The
> information you want is already being gathered by RubyLexer, it's just
> not available in a public interface. We should negotiate such an
> interface since you seem to need it. What you propose, 'get the next
> expression', is not one I want to do. RubyLexer does not deal in
> abstractions larger than tokens... at least, not on a public level. I
> am, however, willing to emit 'advisory' tokens at certain points in the
> token stream, (several such types of tokens are being emitted already)
> which should allow you to do what we want, if we design it carefully.

Hm, I am not sure if that is enough for this case. The condition part of
an if or something else will after all not always be surrounded by ( and
) or begin and end or something similar.

Advisory tokens (which would tell me that I am now entering the
condition of if and now leaving it and now entering the action part of
it and so on) might do this. However, you are right in that this is not
usually the task of a lexer. In the past I have frequently had trouble
with the distinction of lexing and parsing in real language parsing --
most languages require you to keep some context for actually tokenizing
them. Ruby, for example, requires that your lexer knows about all kinds
of quoted Strings and where they end and interpolated expressions inside
them. I'm not sure of where to best draw the line so it's probably
better to let you decide.

> On the other hand.... the reason I chose not to emit advisory tokens
> for this particular case is that the complementary tool to RubyLexer is
> intended to be Reg, which can find nested pairs of braces and the like
> pretty easily. Have you looked at Reg at all? I realize that I only
> released it yesterday, and as of yet it's only half-working because
> critical features are as yet unimplemented, but I think it might be
> just the thing for the types of preprocessors you have in mind.

Heh, I didn't realize that you were also the author of that library so I
did not draw the connection. I have, however, marked those two threads
as something I will have to examine. (They are now colored red.)

I'm watching Reg with growing interest -- I'm not sure if I have already
told this to you (I remember telling the author of "BNF-like grammar
specified DIRECTLY in Ruby"), but I have also done something vaguely
similar -- I have done an object-oriented way of constructing and
combining Regular Expressions. What you have done is something better.

I'm especially interested in how the LALR parser, Reg and RubyLexer
might all work together. Any way of getting some sample code? I'm aware
of the fact that this is all subject to change as long as you have not
implemented all the necessary features like look-ahead, but getting a
quick overview would still be nice.

> Reg might not be able to easily tell 'if' the postfix operator from
> 'if' the value in current RubyLexer output. Since one requires an end
> and the other doesn't, that can be troublesome to deal with. 'do' is
> also a pain, now that I think of it. All these cases are handled
> correctly in RubyLexer, we just have to find an appropriate
> (token-based, not expression-based) interface.

I would be pretty much okay with the advisory tokens idea -- it sounds
like meta-tokens that tell me about the context.

>>For doing code transformations it is of course also important that
>>you can turn back the stream of tokens into a String easily. I did
>>this with IRB's lexer by using the .line_no and .pos methods of
>>tokens, but that was not too good a match, actually.
>
> So what would be a good match? I don't see why this should be a
> problem. My implementation of Token implements to_s, which returns the
> ruby code corresponding to the token; usually, this is exactly the
> same as the code that created the token originally. There's also an
> offset method, which returns the position of the token in the input
> stream, relative to the very beginning. Tokens don't have a #line_no,
> but you can get the same information from FileAndLineTokens.

This does sound good. Having an offset ought to actually be better than
separate character and line numbers as well.

>>I did this with IRB's lexer by using the .line_no and .pos
>>methods of tokens, but that was not too good a match, actually.
>
> Wait... so you wrote irb's lexer? One of my wishlist items is to
> integrate RubyLexer with irb among others.... how hard do you think
> this will be?

Nope, not really. I've just used it out of IRB. Integrating it ought to
be possible, but I'm not sure why that would be necessary.

> Well, it was unexpected for me. Much to my embarrassment; I thought I
> was an expert at this. I must say many elements of this got me very
> confused at first, and obviously I never put all the pieces together.
> Congratulations.
>
> Ps: I haven't figured out why this breaks RubyLexer yet, but I will.

Good luck. :)

> Pps: putting tricky stuff in eval strings and the like won't break the
> lexer (yet). To the lexer, it's just a string.

Yup, same for IRB.

Peter Suk

Apr 25, 2005, 12:28:27 PM

On Apr 25, 2005, at 8:37 AM, Florian Groß wrote:

>
> I'm especially interested in how the LALR parser, Reg and RubyLexer
> might all work together. Any way of getting some sample code? I'm
> aware of the fact that this is all subject to change as long as you
> have not implemented all the necessary features like look-ahead, but
> getting a quick overview would still be nice.
>

I am currently constructing an LALR parser for Ruby using RubyLexer for
the Alumina-VM project. I suspect that RubyLexer is going to make this
much cleaner.

--Peter

vikkous

Apr 25, 2005, 5:40:05 PM

> Advisory tokens (which would tell me that I am now entering
> the condition of if and now leaving it and now entering the
> action part of it and so on) might do this.

So you want to match the 'then' with its owning 'if'? That's not
something I've had to do yet, but it shouldn't be hard... How's this
for an interface:
I can add a new method to the Token class, let's call it match_id for
now. Every time there's a token like 'if', '(', 'begin', that starts a
nested context, the match_id of that token will be set to a unique
value. When the corresponding 'end' or ')' comes along, it will have a
match_id with the same value as the corresponding context opening
token. We can easily have 'then' with a match_id corresponding to its
'if' as well. This should make it pretty easy to put the pieces
together again afterward.
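
Here is a rough sketch of how such ids could be handed out with a
stack. It is only a sketch of the proposal, not anything that exists
in RubyLexer: MatchedToken, OPENERS, CLOSERS, and assign_match_ids are
hypothetical names, the input is assumed to be pre-split words, and
any closer is assumed to close the most recently opened context.

```ruby
# Hypothetical token carrying the proposed match_id.
MatchedToken = Struct.new(:text, :match_id) do
  def to_s
    text
  end
end

OPENERS = %w[if begin (].freeze
CLOSERS = %w[end )].freeze

# Gives every context-opening word a fresh id, gives the closing word
# the id of the context it closes, and tags 'then' with the id of the
# innermost open context.
def assign_match_ids(words)
  next_id = 0
  open_ids = [] # ids of contexts that are still open, innermost last
  words.map do |word|
    id =
      if OPENERS.include?(word)
        open_ids.push(next_id += 1).last
      elsif CLOSERS.include?(word)
        open_ids.pop
      elsif word == "then"
        open_ids.last
      end
    MatchedToken.new(word, id)
  end
end

tokens = assign_match_ids(%w[if x > if x < 5 then 3 else 2 end then puts end])
tokens.each { |t| puts "#{t} #{t.match_id.inspect}" }
```

Selecting the tokens that share one match_id then gives back an 'if'
together with its own 'then' and 'end', even when another 'if' is
nested in the condition.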

Hmm... but there are tokens besides 'then' that can serve the same
syntactical role: ':', ';', and newline in this case. So the same thing
would have to happen with them, I guess. Do you want to know things
like, this colon is standing in place of a then? What sorts of things
besides 'then' do you want to match to their owners?

There are complications for incremental lexing too, which isn't
something I do now, but I want to. Let me think a little about this.
You might be getting these features in a subclass of RubyLexer.

Heh. I just realized that strings now work the way you wanted
originally, but I'm going to break that in a future version to be the
way I want it.

> In the past I have frequently had trouble
> with the distinction of lexing and parsing in real language
> parsing -- most languages require you to keep some context
> for actually tokenizing them. Ruby, for example, requires that
> your lexer knows about all kinds of quoted Strings and where
> they end and interpolated expressions inside them.

You can say that again. The amount of extra (non-lexical, strictly
speaking) work to get RubyLexer working was phenomenal. You wouldn't
believe all the squirrelly little cases. It makes the language easy to
use, but hard to process programmatically. Given the choice, I'd like to
find a different way next time. If there could be one tool that does
both at once... I don't know what that would look like. Reg might be
able to do both, but in separate stages.

> Nope, not really. I've just used it out of IRB. Integrating it
> ought to be possible, but I'm not sure why that would be
> necessary.

It's necessary because I want to. Because irb's lexer is sometimes
wrong, and freaks like me who use irb to explore the syntax get fooled
sometimes. Because irb could use it to colorize input and output.
(Maybe its current lexer would serve for the last purpose...)

> > Ps: I haven't figured out why this breaks RubyLexer yet, but I
> > will.
>
> Good luck. :)

I got a little way through it... aside from the unique use of
whitespace, my big problem so far is handling the dos-style newlines. I
handle common cases of it now, but pre is anything but common. Are you
a Windows person, or did you do that just to be more deviant and make
my life difficult? :)

vikkous

Apr 25, 2005, 5:47:25 PM

Peter Suk wrote:
> On Apr 25, 2005, at 8:37 AM, Florian Groß wrote:
> > I'm especially interested in how the LALR parser, Reg and
> > RubyLexer might all work together. Any way of getting
> > some sample code? I'm aware of the fact that this is all
> > subject to change as long as you have not implemented
> > all the necessary features like look-ahead, but getting a
> > quick overview would still be nice.
>
> I am currently constructing an LALR parser for Ruby using
> RubyLexer for the Alumina-VM project. I suspect that
> RubyLexer is going to make this much cleaner.

Please see my post titled, "Lalr(n) parsing with reg". Peter's taking
the traditional approach; I've got my own weird ideas that I want to
try.

Florian Groß

Apr 25, 2005, 8:06:42 PM

vikkous wrote:

>>Advisory tokens

> So you want to match the 'then' with its owning 'if'? That's not
> something I've had to do yet, but it shouldn't be hard... How's this
> for an interface:
> I can add a new method to the Token class, let's call it match_id for
> now. Every time there's a token like 'if', '(', 'begin', that starts a
> nested context, the match_id of that token will be set to a unique
> value. When the corresponding 'end' or ')' comes along, it will have a
> match_id with the same value as the corresponding context opening
> token. We can easily have 'then' with a match_id corresponding to its
> 'if' as well. This should make it pretty easy to put the pieces
> together again afterward.

It is not so important to match the then to the if to me -- it is just
important to get the part that comes between the if and the matching
'then', ':', ';' or newline. I'm not sure if you even need to do it as
you described -- I thought having a special mode / sub-class lexer which
emits contextual tokens that are no real tokens would already do this
fairly well while also being reasonably simple. So

if condition then action end

would produce a token stream similar to

# pardon me if my way of representing this is not at all compatible
# with RubyLexer's design -- I need to get familiar with it soon
[KeyWord['if'], IfConditionStart, VariableOrMethod['condition'],
IfConditionEnd, KeyWord['then'], IfActionStart,
VariableOrMethod['action'], IfActionEnd, KeyWord['end']]

And I think that that would be easier to analyze than the non-annotated
token stream. Of course you would still have to do nesting counting to
be able to extract the sections, but I think that would be reasonable
for simplicity's sake.
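
A minimal sketch of that annotated stream, plus the nesting counting
needed to pull a section back out, might look like this. It uses plain
strings for real tokens and symbols for the contextual markers;
annotate_ifs and outer_condition are hypothetical names, and the
sketch assumes the only keywords present are 'if', 'then', and 'end'
(no 'else', no newline or ':' separators, no other end-closed
constructs).

```ruby
# Inserts contextual markers (symbols) around the condition and action
# parts of every if ... then ... end in a list of word tokens.
def annotate_ifs(words)
  out = []
  words.each do |word|
    case word
    when "if"
      out << word << :if_condition_start
    when "then"
      out << :if_condition_end << word << :if_action_start
    when "end"
      out << :if_action_end << word
    else
      out << word
    end
  end
  out
end

# Returns the real tokens that make up the condition of the outermost
# 'if', counting nested condition markers to find where it ends.
def outer_condition(annotated)
  depth = 0
  condition = []
  annotated.each do |item|
    if item == :if_condition_end
      depth -= 1
      return condition if depth.zero?
    end
    condition << item if depth > 0 && item.is_a?(String)
    depth += 1 if item == :if_condition_start
  end
  nil
end

p annotate_ifs(%w[if condition then action end])
p outer_condition(annotate_ifs(%w[if x > if x < 5 then 3 end then puts end]))
```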

>>most languages require you to keep some context
>>for actually tokenizing them.

> You can say that again. The amount of extra (non-lexical, strictly
> speaking) work to get RubyLexer working was phenomenal. You wouldn't
> believe all the squirrelly little cases. It makes the language easy to
> use, but hard to process programmatically. Given the choice, I'd like to
> find a different way next time. If there could be one tool that does
> both at once... I don't know what that would look like. Reg might be
> able to do both, but in separate stages.

Hm, why is that? Could it not use the rules it uses for parsing for
one-token-at-a-time-ahead lexing?

I'm not sure whether not having lexing and parsing more unified has
benefits or downsides with your approach. I guess I will just have to
write a Joy interpreter using all this. Do you think that that can
already be done or are there features missing that would make it wise to
delay this further?

> [Integrating the lexer with IRB]

> It's necessary because I want to. Because irb's lexer is sometimes
> wrong, and freaks like me who use irb to explore the syntax get fooled
> sometimes. Because irb could use it to colorize input and output.
> (Maybe its current lexer would serve for the last purpose...)

Heh, you must have been reading old postings of mine. IRB doing syntax
highlighting as you type has been on my wish list for a while.

That aside, I think I misunderstood you. I originally thought you wanted
to integrate IRB's lexer with your tool chain, but it appears that you
want to instead integrate your lexer with IRB.

I think such things are possible fairly easily with Ruby -- after all
you just have to emulate the method interfaces of the part you want to
replace and swap it out.
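
A small sketch of that idea, under loud assumptions: every class and method name below is hypothetical and has nothing to do with IRB's real internals. It only shows that a replacement object works as long as it answers the same messages.

```ruby
# Stand-in for the component that is already there.
class OriginalLexer
  def tokens(source)
    source.split
  end
end

# The replacement emulates the same method interface (name and arity).
class ReplacementLexer
  def tokens(source)
    source.scan(/\w+|[^\w\s]/)
  end
end

# Stand-in for the host program; it only ever calls lexer.tokens.
class Shell
  attr_writer :lexer

  def initialize(lexer = OriginalLexer.new)
    @lexer = lexer
  end

  def token_count(line)
    @lexer.tokens(line).size
  end
end

shell = Shell.new
shell.token_count("puts(1+2)")      # whitespace splitting sees one chunk
shell.lexer = ReplacementLexer.new  # swap the part out at run time
shell.token_count("puts(1+2)")      # now operators and parens are separate
```

When the host does not offer a writer like this, reopening the class and redefining the method is the other common route, with the usual risk of breaking on internal changes.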

I have done similar things with ruby-breakpoint where I overwrite parts
of IRB so that it can be split into a client and a server. The server
part does not use STDIN/STDOUT which means I can then use IRB for
debugging CGI applications and pretty much everything else as well.

> [pre.rb]

> I got a little way through it... aside from the unique use of
> whitespace, my big problem so far is handling the dos-style newlines. I
> handle common cases of it now, but pre is anything but common. Are you
> a windows person, or did you do that just to be more deviant and make
> my life difficult? :)

Heh, I'm really one of them Windows users and mostly happy so far though
I think I would not object against a free switch to Mac OS X if the
opportunity ever turned up.

Had I wanted to make this yet more difficult I would have mixed multiple
styles of newlines. ;)

Now I actually do wonder if using CRLF instead of LF does anything
special to newline-delimited literals on any platforms.

vikkous

Apr 26, 2005, 1:53:51 PM

> would produce a token stream similar to
>
> # pardon me if my way of representing this is not at all compatible
> # with RubyLexer's design -- I need to get familiar with it soon
> [KeyWord['if'], IfConditionStart, VariableOrMethod['condition'],
> IfConditionEnd, KeyWord['then'], IfActionStart,
> VariableOrMethod['action'], IfActionEnd, KeyWord['end']]
>
> And I think that that would be easier to analyze than the
> non-annotated token stream. Of course you would still have to do
> nesting counting to be able to extract the sections, but I think
> that would be reasonable for simplicity's sake.

Ok, fair enough. Maybe this way is easier after all.

> Hm, why is that? Could it not use the rules it uses for parsing
> for one-token-at-a-time-ahead lexing?

I just can't see this. The lexer rules' input is the source file, but
the parser's is the parse stack -- which comes from the lexer's output
ultimately.... this can be a very powerful way to compose pattern
matchers, but in the end different rule sets are used with 2 different
inputs.

The lexer and parser can run interleaved, and the lexer can get
information from the parser to help interpret things (this is sometimes
called "cheating", but it isn't; it's often the easiest way). But
there's still the two rule sets. I don't know if it's possible to have
1 rule set do both at once, but the idea is intriguing.
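
To make the interleaving concrete, here is a toy sketch, not RubyLexer's or Reg's code. ToyLexer, lex_all and the expect_operand flag are all invented for this example. The "parser" is reduced to one bit of state (is an operand expected next?), and the lexer uses that bit to decide whether "/" starts a regexp literal or is a division operator.

```ruby
require 'strscan'

class ToyLexer
  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # expect_operand is the information fed back from the parser side.
  def next_token(expect_operand)
    @scanner.skip(/\s+/)
    return nil if @scanner.eos?
    if expect_operand && (text = @scanner.scan(%r{/(?:\\.|[^/\\])*/}))
      [:regexp, text]
    elsif (text = @scanner.scan(/\d+/))
      [:number, text]
    elsif (text = @scanner.scan(/[a-z_]\w*/))
      [:ident, text]
    elsif (text = @scanner.scan(%r{[-+*/=]}))
      [:op, text]
    else
      raise ArgumentError, "unexpected input at #{@scanner.pos}"
    end
  end
end

# Stand-in for a parser: it only tracks whether an operand comes next.
def lex_all(source)
  lexer = ToyLexer.new(source)
  expect_operand = true
  tokens = []
  while (tok = lexer.next_token(expect_operand))
    tokens << tok
    expect_operand = (tok.first == :op)
  end
  tokens
end

p lex_all("a / b / c")  # both slashes come out as :op tokens
p lex_all("x = /b/")    # the slashes delimit a single :regexp token
```

Note that there are still two rule sets here: the scanner patterns work on characters, while the operand tracking works on tokens. Real Ruby needs far more context than one flag.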

> I'm not sure whether not having lexing and parsing more
> unified has benefits or downsides with your approach. I
> guess I will just have to write a Joy interpreter using all this.
> Do you think that that can already be done or are there
> features missing that would make it wise to delay this further?

I took a little look at Joy. Hoo-boy. I'm guessing this language is
pretty easy to parse. I would say reg is not ready for anything
significant until it has backreferences and substitutions. At that
point, it's got match-and-replace, and retrieval of arbitrary match
subexpressions. If you think you can live without those, I'd say go for
it. There are some problems with the backtracking engine, but so far as
I can see, only a whole lot of ambiguity causes the problems, so it's
_probably_ ok for most things.
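
For readers unsure what those missing features are, this is what they look like in Ruby's built-in Regexp on Strings. It is an analogy only; the helper names are made up, and nothing here is Reg syntax.

```ruby
# Backreference: \1 must repeat whatever group 1 matched.
def repeated_word(text)
  m = /\b(\w+) \1\b/.match(text)
  m && m[1]
end

# Substitution (match-and-replace): swap the first two words.
def swap_first_two_words(text)
  text.sub(/(\w+) (\w+)/, '\2 \1')
end

# Retrieval of arbitrary match subexpressions, here by group name.
def key_and_value(text)
  m = /(?<key>\w+)=(?<value>\w+)/.match(text)
  m && [m[:key], m[:value]]
end

p repeated_word("this is is a test")
p swap_first_two_words("hello world")
p key_and_value("lang=ruby")
```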

> Had I wanted to make this yet more difficult I would have
> mixed multiple styles of newlines. ;)
>
> Now I actually do wonder if using CRLF instead of LF does
> anything special to newline-delimited literals on any
> platforms.

Sure enough, I translated to unix format and the problems disappeared.
Using a dos newline as a delimiter in a fancy string is just a little
difficult for me because I had always assumed string delimiters were a
single character... hrm.
Here documents need this functionality to really support dos newlines
correctly too, I think.
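
The translation to unix format can be sketched as a preprocessing step. This is an assumption about one possible workaround, not something RubyLexer is stated to do, and normalize_newlines is a name invented for this example.

```ruby
# Convert DOS (CRLF) and lone CR line endings to Unix LF before lexing.
def normalize_newlines(source)
  source.gsub(/\r\n?/, "\n")
end

p normalize_newlines("puts 1\r\nputs 2\r\n")
```

The cost is that any string literal which really was meant to contain a carriage return gets altered too, which is why handling two-character delimiters in the lexer itself is the more faithful fix.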

Florian Groß

Apr 26, 2005, 3:51:39 PM

vikkous wrote:

>>Hm, why is that? Could it not use the rules it uses for parsing
>>for one-token-at-a-time-ahead lexing?
>
> I just can't see this. The lexer rules' input is the source file, but
> the parser's is the parse stack -- which comes from the lexer's output
> ultimately.... this can be a very powerful way to compose pattern
> matchers, but in the end different rule sets are used with 2 different
> inputs.
>
> The lexer and parser can run interleaved, and the lexer can get
> information from the parser to help interpret things (this is sometimes
> called "cheating", but it isn't; it's often the easiest way). But
> there's still the two rule sets. I don't know if it's possible to have
> 1 rule set do both at once, but the idea is intriguing.

Hm, this might be related to me thinking pretty much in Regexps as that
has turned out to be quite simple. Is it not possible to apply your
extended expressions to Strings? Perhaps by .scan(/./)?
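
A quick sketch of what .scan(/./) gives you. The prefix_match? helper is a hypothetical stand-in for an array matcher and is not Reg's API; it only shows that once a String is an Array of characters, element-wise matching applies to it.

```ruby
chars = "if x".scan(/./)
p chars  # an Array of one-character Strings

# Compare each pattern element with ===, so Strings and Regexps both work.
def prefix_match?(pattern, items)
  return false if pattern.size > items.size
  pattern.each_with_index.all? { |pat, i| pat === items[i] }
end

p prefix_match?(['i', 'f', /\s/], chars)

# Caveat: a plain /./ skips newlines; the /m flag keeps them.
p "a\nb".scan(/./)
p "a\nb".scan(/./m)
```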

>>I'm not sure whether not having lexing and parsing more
>>unified has benefits or downsides with your approach. I
>>guess I will just have to write a Joy interpreter using all this.
>>Do you think that that can already be done or are there
>>features missing that would make it wise to delay this further?
>
> I took a little look at Joy. Hoo-boy. I'm guessing this language is
> pretty easy to parse. I would say reg is not ready for anything
> significant until it has backreferences and substitutions. At that
> point, it's got match-and-replace, and retrieval of arbitrary match
> subexpressions. If you think you can live without those, I'd say go for
> it. There are some problems with the backtracking engine, but so far as
> I can see, only a whole lot of ambiguity causes the problems, so it's
> _probably_ ok for most things.

Yup, it ought to be relatively simple to parse, though I still don't
like lexing it as you don't want to handle spaces specially in Strings
and so on.

I'm not even sure if I will need non-trivial backtracking or
substitutions which is probably a sign I will need them.

>>Now I actually do wonder if using CRLF instead of LF does
>>anything special to newline-delimited literals on any
>>platforms.
>
> Sure enough, I translated to unix format and the problems disappeared.

Was this the only problem? I think that my usage of here-docs might turn
out to be quite exotic as well.