OpenToken version 3.1 preview

Stephen Leake

Jul 19, 2009, 5:23:41 PM
I've finally made some time to work on the OpenToken release.

I've created a new web page
http://www.stephe-leake.org/ada/opentoken.html

It has a tarball of the version 3.1w ('w' for working) sources. These
are also in the ada-france monotone server.

The web page also has a list of changes. I've merged the version I
used in webcheck (which required significant changes in the HTML
parser), and in my work (which uncovered some other bugs). Both of
those projects are now using this OpenToken package. It includes fixes
for the two Debian bugs against OpenToken.

Please look it over, and let me know if you'd like something changed.

If I get no responses in two weeks, I'll declare this release final,
and just drop the 'w'.

If someone could provide dates for the previous OpenToken releases,
that would be fun.

--
-- Stephe

Stephen Leake

Jul 20, 2009, 5:47:43 AM
Stephen Leake <stephe...@stephe-leake.org> writes:

> I've created a new web page
> http://www.stephe-leake.org/ada/opentoken.html
>
> It has a tarball of the version 3.1w ('w' for working) sources.

I've now added a zip version as well.

--
-- Stephe

AdaMagica

Jul 21, 2009, 9:03:22 AM
> Please look it over, and let me know if you'd like something changed.

There is a problem with Bracketed_Comment. If it extends over more
than one line, the token is correctly recognized, but the lexeme
fails.

You can take the Bracketed_Comment_Test in directory Test to verify
the wrong behaviour.

> If someone could provide dates for the previous OpenToken releases,
> that would be fun.

These should be on Ted Dennison's site.

Stephen Leake

Jul 22, 2009, 9:41:09 PM
AdaMagica <christo...@eurocopter.com> writes:

> There is a problem with Bracketed_Comment. If it extends over more
> than one line, the token is correctly recognized, but the lexeme
> fails.

The line feed characters are dropped from the lexeme, on Windows.

> You can take the Bracketed_Comment_Test in directory Test to verify
> the wrong behaviour.

I've added a test that shows the problem.

I don't suppose you have an idea of how to fix it?

It will be interesting to figure out how to make that test portable
between Windows and GNU/Linux. The easiest way I know of to identify
which line ending to use is to look at
GNAT.Directory_Operations.Dir_Separator; it's '\' on Windows (CR LF)
and '/' on GNU/Linux (LF). Don't know how to deal with Mac!
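
As a rough illustration of that heuristic, here is a minimal sketch. It
is GNAT-specific, and the helper name OS_Line_Ending is invented here;
it is not part of OpenToken. It only guesses from the directory
separator, so it shares the limitation noted above for other systems.

```ada
with Ada.Characters.Latin_1;
with Ada.Text_IO;
with GNAT.Directory_Operations;

procedure Show_Line_Ending is
   use Ada.Characters.Latin_1;

   --  Hypothetical helper (not part of OpenToken): guess the native
   --  line ending from the directory separator reported by GNAT.
   function OS_Line_Ending return String is
   begin
      if GNAT.Directory_Operations.Dir_Separator = '\' then
         return CR & LF;    --  Windows
      else
         return (1 => LF);  --  GNU/Linux and other Unix-like systems
      end if;
   end OS_Line_Ending;

   Ending : constant String := OS_Line_Ending;
begin
   Ada.Text_IO.Put_Line
     ("Native line ending is" & Integer'Image (Ending'Length)
      & " character(s) long");
end Show_Line_Ending;
```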

The Ada standard intended System.Name to deal with this, but in GNAT
that's always SYSTEM_NAME_GNAT, so that's no help.

Anyone have a better idea?

>> If someone could provide dates for the previous OpenToken releases,
>> that would be fun.
>
> These should be on Ted Dennison's site.

If you mean
http://www.telepath.com/~dennison/Ted/OpenToken/OpenToken.html, there
are a couple of dates there, thanks:

8/13/00 - Version 3.0b is now available.
1/27/00 - Version 2.0 is now available.

--
-- Stephe

AdaMagica

Jul 23, 2009, 1:09:27 AM
On Jul 23, 3:41 am, Stephen Leake <stephen_le...@stephe-leake.org>
wrote:

> AdaMagica <christoph.gr...@eurocopter.com> writes:
> > There is a problem with Bracketed_Comment. If it extends over more
> > than one line, the token is correctly recognized, but the lexeme
> > fails.
>
> The line feed characters are dropped from the lexeme, on Windows.

Also on Linux.

> I don't suppose you have an idea of how to fix it?

You guessed right - I haven't. I briefly browsed the code, but found
no simple solution.

> It will be interesting to figure out how to make that test portable
> between Windows and Gnu/Linux. The easiest way to identify which line
> ending to use that I know of is to look at
> GNAT.Directory_Operations.Dir_Separator; it's '\' for CR LF, '/' for
> LF. Don't know how to deal with Mac!

There are other OSs where an end of line is not a character in the
stream. Can OpenToken handle these?
We could do a Get_Line and insert an LF irrespective of what the OS
uses. If a lexeme comprising several lines (currently only
Bracketed_Comment, I think) were then output, the output routine would
have to translate this back to the OS's New_Line (this of course has
to be documented in the recognizer).

There is a declaration EOL_Character in package OpenToken.

Dmitry A. Kazakov

Jul 23, 2009, 4:00:10 AM

You could do what I did in the Simple Components for Ada parser. I
decoupled sources from the parser itself. The source is an abstract object
that provides basic operations like "get next line" and "forward to the
next line". The obvious advantage is that you need not care about LF or CR
in the parser, and can use files, streams, strings, GUI text buffers, etc.,
as a source for the same parser.
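
A minimal sketch of such a source abstraction, assuming Ada 2005 or
later. All names here (Sources, Source, String_Source, At_End, Get_Line,
Next_Line, Dump) are invented for illustration; this is not the actual
Simple Components API, and it is not how OpenToken is written.

```ada
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Source_Demo is

   LF : constant Character := Ada.Characters.Latin_1.LF;

   package Sources is
      --  Hypothetical abstract source: the parser only sees lines.
      type Source is abstract tagged null record;

      function At_End (S : Source) return Boolean is abstract;
      function Get_Line (S : Source) return String is abstract;
      procedure Next_Line (S : in out Source) is abstract;

      --  One concrete source: an in-memory string whose lines are
      --  separated by LF.
      type String_Source (Length : Natural) is new Source with record
         Text  : String (1 .. Length);
         First : Positive := 1;  --  index where the current line starts
      end record;

      overriding function At_End (S : String_Source) return Boolean;
      overriding function Get_Line (S : String_Source) return String;
      overriding procedure Next_Line (S : in out String_Source);
   end Sources;

   package body Sources is

      --  Index of the LF that ends the current line, or Length + 1
      --  if the current line runs to the end of the text.
      function Line_End (S : String_Source) return Positive is
      begin
         for I in S.First .. S.Length loop
            if S.Text (I) = LF then
               return I;
            end if;
         end loop;
         return S.Length + 1;
      end Line_End;

      overriding function At_End (S : String_Source) return Boolean is
      begin
         return S.First > S.Length;
      end At_End;

      overriding function Get_Line (S : String_Source) return String is
      begin
         return S.Text (S.First .. Line_End (S) - 1);
      end Get_Line;

      overriding procedure Next_Line (S : in out String_Source) is
      begin
         S.First := Line_End (S) + 1;
      end Next_Line;

   end Sources;

   --  Stand-in for a parser: it is written against Source'Class, so
   --  it never deals with line terminators itself.
   procedure Dump (S : in out Sources.Source'Class) is
   begin
      while not S.At_End loop
         Ada.Text_IO.Put_Line ("[" & S.Get_Line & "]");
         S.Next_Line;
      end loop;
   end Dump;

   Text : constant String := "/* a comment" & LF & "   on two lines */";
   Src  : Sources.String_Source (Length => Text'Length);
begin
   Src.Text := Text;
   Dump (Src);
end Source_Demo;
```

A file-backed source would override the same three operations using
Ada.Text_IO, and Dump would not need to change.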

My 2 cents.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

vlc

Jul 23, 2009, 11:19:34 AM
Stephen Leake wrote:
> Don't know how to deal with Mac!

The last time I used a Mac, it used a colon (":") to separate
directories. But that was before Mac OS X. I don't know if this has
changed with the BSD kernel.

sjw

Jul 23, 2009, 4:09:12 PM
On Jul 23, 2:41 am, Stephen Leake <stephen_le...@stephe-leake.org>
wrote:

> Don't know how to deal with Mac!

For the 99.999% of people who are using Mac OS X, treat it as Unix.

Stephen Leake

Jul 24, 2009, 6:47:04 AM
"Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> writes:

> You could do what I did in the Simple Components for Ada parser. I
> decoupled sources from the parser itself. The source is an abstract object
> that provides basic operations like "get next line" and "forward to the
> next line". The obvious advantage is that you need not to care about LF, CR
> in the parser, and can use files, streams, strings, GUI text buffers, etc,
> as a source to the same parser.

Yes, OpenToken does this; the sources are called Text_Feeders.

The provided file Text_Feeder uses Ada.Text_IO, so it does "the right
thing" for each operating system.

Which is why the LF is missing from the lexeme; Ada.Text_IO.Get_Line
consumes it.

OpenToken also provides a String Text_Feeder, which of course has no
notion of lines.

--
-- Stephe

Stephen Leake

Jul 24, 2009, 6:54:42 AM
AdaMagica <christo...@eurocopter.com> writes:

> On Jul 23, 3:41 am, Stephen Leake <stephen_le...@stephe-leake.org>
> wrote:
>> AdaMagica <christoph.gr...@eurocopter.com> writes:
>> > There is a problem with Bracketed_Comment. If it extends over more
>> > than one line, the token is correctly recognized, but the lexeme
>> > fails.
>>
>> The line feed characters are dropped from the lexeme, on Windows.
>
> Also on Linux.
>
>> I don't suppose you have an idea of how to fix it?
>
> You guessed right - I haven't. I briefly browsed the code, but found
> no simple solution.
>
>> It will be interesting to figure out how to make that test portable
>> between Windows and Gnu/Linux. The easiest way to identify which line
>> ending to use that I know of is to look at
>> GNAT.Directory_Operations.Dir_Separator; it's '\' for CR LF, '/' for
>> LF. Don't know how to deal with Mac!
>
> There are other OSs where an end of line is not a character in the
> stream. Can OpenToken handle these?

The current file Text_Feeder uses Ada.Text_IO, so it should do "the
right thing" for any OS.

> We could do a Get_Line and insert a LF irrespective of what the OS
> uses.

That's what the text feeder does now. Actually, it inserts
EOL_Character (see below).

So the LF must be dropped after that; I'll have to look harder.

> If then a lexeme was output that comprises several lines (currently
> only Bracketed_Comment I think), the output routine would have to
> translate this back to the OS's New_Line (this has of course to be
> documented in the recognizer).

Right.

> There is a declaration EOL_Character in package OpenToken.

Which has a comment to change it for your OS; not very friendly, as
it's a constant!

It's used in OpenToken.Recognizer.Character_Set.Standard_Whitespace,
OpenToken.Recognizer.Line_Comment.Analyze, and
OpenToken.Recognizer.String.Analyze.

I'll change the comment to "we use this regardless of OS, since we
need a standard way of representing an end of line in a string
buffer".

--
-- Stephe

Dmitry A. Kazakov

Jul 24, 2009, 7:11:41 AM
On Fri, 24 Jul 2009 06:47:04 -0400, Stephen Leake wrote:

> "Dmitry A. Kazakov" <mai...@dmitry-kazakov.de> writes:
>
>> You could do what I did in the Simple Components for Ada parser. I
>> decoupled sources from the parser itself. The source is an abstract object
>> that provides basic operations like "get next line" and "forward to the
>> next line". The obvious advantage is that you need not to care about LF, CR
>> in the parser, and can use files, streams, strings, GUI text buffers, etc,
>> as a source to the same parser.
>
> Yes, OpenToken does this; the sources are called Text_Feeders.
>
> The provided file Text_Feeder uses Ada.Text_IO, so it does "the right
> thing" for each operating system.
>
> Which is why the LF is missing from the lexeme; Ada.Text_IO.Get_Line
> consumes it.

What is wrong with that? I do it exactly the same way for text files.

Well, provided there is no concern about making it compatible with how the
Ada RM defines "a line" (I am not sure that an "LRM line" is always the same
as an "OS line"). Another issue could be compatibility with the GUI text editor.

Stephen Leake

Jul 24, 2009, 9:18:50 PM
Stephen Leake <stephe...@stephe-leake.org> writes:

> AdaMagica <christo...@eurocopter.com> writes:
>
>> On Jul 23, 3:41 am, Stephen Leake <stephen_le...@stephe-leake.org>
>> wrote:
>>> AdaMagica <christoph.gr...@eurocopter.com> writes:
>>> > There is a problem with Bracketed_Comment. If it extends over more
>>> > than one line, the token is correctly recognized, but the lexeme
>>> > fails.
>>>
>>> The line feed characters are dropped from the lexeme, on Windows.
>>

Here is the explanation of this symptom.

Text_Feeder uses Ada.Text_IO.Get_Line, so it never sees the "CR LF" on
DOS, nor the "LF" on Linux. It does insert an EOL_Character (= CR) for
each line break. That's why it appears to be dropping the LF.

So for a file created like this:

Text1 : constant String := "/* A comment that starts here";
Text2 : constant String := " and keeps going";
Text3 : constant String := " and finally ends here *.*..";

Create (File, Out_File, File_Name);
Put_Line (File, Text1);
Put_Line (File, Text2);
Put_Line (File, Text3);
Close (File);

the expected lexeme is:

Expected_Lexeme : constant String :=
  Text1 & OpenToken.EOL_Character &
  Text2 & OpenToken.EOL_Character &
  Text3;

I've added a test that demonstrates this, and a comment to
opentoken-recognizer-bracketed_comment.ads to document it.

If the purpose of the lexer is to just recognize comments and skip
them, this is fine.

If the purpose of the lexer is to be able to later reconstruct the
code, the reconstruction routine will need a way to turn EOL_Character
back into OS-specific newlines; using Ada.Text_IO.Put_Line will do
that.
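
A minimal sketch of such a reconstruction routine. The name Put_Lexeme
is invented for illustration and is not part of OpenToken, and the
local EOL_Character constant is a stand-in for OpenToken.EOL_Character
(assumed to be CR, as described above) so that the sketch compiles
without OpenToken installed.

```ada
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Put_Lexeme_Demo is

   --  Stand-in for OpenToken.EOL_Character (assumed to be CR).
   EOL_Character : constant Character := Ada.Characters.Latin_1.CR;

   --  Hypothetical helper: write Lexeme to File, ending the current
   --  line wherever the lexeme contains EOL_Character. Put_Line
   --  supplies the OS-specific line terminator.
   procedure Put_Lexeme
     (File   : in Ada.Text_IO.File_Type;
      Lexeme : in String)
   is
      First : Integer := Lexeme'First;  --  start of the pending piece
   begin
      for I in Lexeme'Range loop
         if Lexeme (I) = EOL_Character then
            Ada.Text_IO.Put_Line (File, Lexeme (First .. I - 1));
            First := I + 1;
         end if;
      end loop;

      --  Whatever follows the last EOL_Character stays on the
      --  current line.
      Ada.Text_IO.Put (File, Lexeme (First .. Lexeme'Last));
   end Put_Lexeme;

begin
   Put_Lexeme
     (Ada.Text_IO.Standard_Output,
      "/* A comment that starts here" & EOL_Character
      & " and keeps going */");
   Ada.Text_IO.New_Line;
end Put_Lexeme_Demo;
```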

--
-- Stephe
