Standard Practice for a Canned Lexer, Parser, Analyzer?


Liam

Jan 28, 2010, 10:59:27 PM
to Clojure
Could someone educate me about what developers normally do when faced
with having to create a lexer / parser / analyzer, say for clojure?

Why would people go with a canned solution, i.e. ready-made like soup
out of a can, instead of by hand?

E.g. why did the Counterclockwise Eclipse plug-in for Clojure use
ANTLR, and why does the Enclojure NetBeans plug-in for Clojure use
JFlex? And why does Clojure itself have a hand-written reader instead
of one built with a canned generator?

Am I naive in thinking one should do that by hand? Is this archaic
thinking like those who still prefer building websites in HTML by
hand?

What's the advantage of doing that, say for clojure or in general? You
still have to learn how a given generator works. And you may be
limited by its design. What if you want fine-grained control over how
things are parsed, to get, for example, sophisticated syntax-based
evaluation or inferences from cold code. E.g. like what Eclipse does
for Java and their “Java Models” and exhaustive “Abstract Syntax Tree”
nodes?

I hope some of you could be generous enough to enlighten me.

Laurent PETIT

Jan 29, 2010, 3:45:35 AM
to clo...@googlegroups.com
Hello,

2010/1/29 Liam <liam....@gmail.com>:


> Could someone educate me about what developers normally do when faced
> with having to create a lexer / parser / analyzer, say for clojure?
>
> Why would people go with a canned solution, i.e. ready-made like soup
> out of a can, instead of by hand?
>
> E.g. why did the Counterclockwise Eclipse plug-in for Clojure use
> ANTLR,

Concerning Counterclockwise, it's simple: a combination of reasons:
* I wanted to play with ANTLR, since we were also using it at work
* Not much of Counterclockwise was written in Clojure at that time

Note that currently there's a joint effort to provide Clojure-based
parsing for Counterclockwise: I'm working in my spare time on a port of
paredit, currently using a hand-written Clojure parser (YAHWCP: Yet
Another Hand-Written Clojure Parser; not a reader, but a parser that
produces a parse tree), and Christophe Grand is working on the more
general solution of providing a parser generator.
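To make the hand-written approach concrete, here is a minimal sketch of a recursive-descent parser for an s-expression subset (in Python rather than Clojure, purely for brevity; this is an illustration, not the parser discussed above). Unlike a reader, it produces a parse tree of tokens rather than evaluated values.

```python
# Minimal hand-written parser sketch for an s-expression subset.
# It produces a parse tree (nested lists of token strings) instead
# of reader-style values.

import re

# one token per match: "(", ")", or an atom
TOKEN = re.compile(r"\s*(?:(\()|(\))|([^\s()]+))")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m or m.end() == pos:
            break
        pos = m.end()
        tokens.append(m.group(1) or m.group(2) or m.group(3))
    return tokens

def parse(tokens):
    """Recursive-descent parse of one expression; returns (tree, rest)."""
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        children = []
        while rest and rest[0] != ")":
            child, rest = parse(rest)
            children.append(child)
        return children, rest[1:]   # drop the closing ")"
    return head, rest

tree, _ = parse(tokenize("(defn add (x y) (+ x y))"))
print(tree)  # ['defn', 'add', ['x', 'y'], ['+', 'x', 'y']]
```

A real parser for an editor would also keep whitespace, comments, and source positions in the tree; this sketch only shows the recursive shape.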

> and why does the Enclojure NetBeans plug-in for Clojure use
> JFlex? And why does Clojure itself have a hand-written reader instead
> of one built with a canned generator?
>
> Am I naive in thinking one should do that by hand? Is this archaic
> thinking like those who still prefer building websites in HTML by
> hand?

Note there is the fnparse library, too ( http://github.com/joshua-choi/fnparse )
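fnparse is a parser-combinator library; the core idea can be sketched in a few lines (Python here for brevity; the names `lit`, `seq`, and `alt` are invented for illustration and are not fnparse's API). A parser is a function from input to a (result, remaining-input) pair or None, and combinators build bigger parsers from smaller ones.

```python
# Tiny parser-combinator sketch: hypothetical API, not fnparse's.

def lit(s):
    """Parser matching the literal string s."""
    def p(inp):
        return (s, inp[len(s):]) if inp.startswith(s) else None
    return p

def seq(*parsers):
    """Run parsers one after another; fail if any fails."""
    def p(inp):
        results = []
        for sub in parsers:
            r = sub(inp)
            if r is None:
                return None
            value, inp = r
            results.append(value)
        return results, inp
    return p

def alt(*parsers):
    """Try each parser in order; return the first success."""
    def p(inp):
        for sub in parsers:
            r = sub(inp)
            if r is not None:
                return r
        return None
    return p

greeting = seq(alt(lit("hi"), lit("hello")), lit(" "), lit("world"))
print(greeting("hello world"))  # (['hello', ' ', 'world'], '')
```

The appeal for Lisps is that the "grammar" is just ordinary code, so it composes and can be extended like any other function.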

> What's the advantage of doing that, say for clojure or in general? You
> still have to learn how a given generator works. And you may be
> limited by its design. What if you want fine-grained control over how
> things are parsed, to get, for example, sophisticated syntax-based
> evaluation or inferences from cold code. E.g. like what Eclipse does
> for Java and their “Java Models” and exhaustive “Abstract Syntax Tree”
> nodes?
>
> I hope some of you could be generous enough to enlighten me.

I'm not sufficiently expert in the subject to give you this
enlightenment, sorry.

P.S.: what is a cold node?
From the vocabulary you're using, one could infer you're not a
beginner on the subject, so you could certainly do well with any tool.
I guess one purpose of such tools is to guide their users by
providing "built-in semantics" for thinking about the problem domain.
Another may be that they provide more "declarative" ways to define
grammars.

--
Laurent

Liam

Jan 29, 2010, 4:45:08 AM
to Clojure
Hello Laurent,
Thank you very much for your answers and maintaining CCW. I was
partially encouraged to use Eclipse (a while back) due to the
existence of CCW (or clojure-dev as it was then).

I wrote: […] sophisticated syntax based evaluation or inferences from
“cold code” [...] not node.

This is just my personal language for referring to evaluating code from
its parse tree. In my mind, code on the screen being edited is cold;
when run through the Clojure compiler, it is hot. My understanding of
programming is completely intuitive, not formally taught.

So, I may know some stuff by personal research, but my education is
not computer science, and my work experience is not in programming. I
have been interested in the subject for nearly 20 years now as an
“armchair programmer”. I only managed to skew my education a little
towards that, and I managed to make some of my work experience in IT
project management systems. Only this year (with the US recession) did
this interest become more involved for me than just a part-time hobby.

I often end up feeling that I don't know what other programmers seem
to “just know” due to their direct work experience or education. I
very much appreciated your encouragement above. Thanks again.

Roberto Mannai

Jan 29, 2010, 5:14:13 AM
to clo...@googlegroups.com
Hello Liam,
I'm not an expert on this subject either. :) Nevertheless, I'll give
you my position.

Formally speaking, defining a new language is very hard: you have to
define new symbols and rules. In many languages the same token can be
used in different contexts (think of "{" in Java: if blocks, method
blocks, class blocks; here the word "block" means very different
things). If you write the parser by hand you'll have to check all
these contextual points, and finally handle validation of the source
file as a whole.

Tools such as ANTLR give you the chance to just specify the grammar
(allowed symbols and rules), leaving all the validation checks to them.
You can also give them a grammar and a set of events (called
"templates" in ANTLR), which are automatically triggered by the parser
when it processes a given rule.

So this, in my opinion, is why the "by hand" way is not used very
often. The topic is related to (external) Domain-Specific Languages,
so you may find this interesting:
http://martinfowler.com/articles/codeGenDsl.html#UsingTemplatesForGeneration
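The "grammar plus triggered events" idea above can be sketched like this (a hypothetical miniature in Python; the names are invented for illustration and this is not ANTLR's actual API): the framework owns the scanning loop, and when a rule matches it fires the user-supplied callback.

```python
# Hypothetical sketch of "actions attached to grammar rules": the
# framework walks the input; when a rule matches, it fires the
# user-supplied callback for that rule.

import re

RULES = [
    ("number", re.compile(r"\d+")),
    ("word",   re.compile(r"[A-Za-z]+")),
    ("space",  re.compile(r"\s+")),
]

def scan(text, actions):
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name in actions:      # fire the attached action
                    actions[name](m.group())
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")

seen = []
scan("abc 42 def", {"number": lambda t: seen.append(int(t)),
                    "word":   lambda t: seen.append(t)})
print(seen)  # ['abc', 42, 'def']
```

The user only declares rules and actions; the validation loop (the `else: raise`) comes for free from the framework, which is the division of labour Roberto describes.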

Ciao
Roberto


Alex Osborne

Jan 29, 2010, 6:18:25 PM
to clo...@googlegroups.com
Liam <liam....@gmail.com> writes:

> Could someone educate me about what developers normally do when faced
> with having to create a lexer / parser / analyzer, say for clojure?

For parsing something like Clojure I would probably do it by hand, as
the syntax is regular enough that it's pretty easy to do. For parsing
anything more complex, or if I were trying to match a formally
specified grammar [1], I'd probably use a generator.

[1] http://python.org/doc/2.5.4/ref/grammar.txt

> Why would people go with a canned solution, i.e. ready-made like soup
> out of a can, instead of by hand?

Because parsing complex languages is rather hard to get perfect. It
becomes very easy to misinterpret something, or to accept something
that should be rejected. This can lead to strange bugs and sometimes
security problems. When your parser is just a direct translation of the
grammar, you can be a lot more confident you got everything right.


Alex Osborne

Jan 29, 2010, 7:15:24 PM
to clo...@googlegroups.com
The way most parser generators work is that you can attach arbitrary
pieces of code to be run at certain points during parsing. Usually you
use this to build up your abstract syntax tree (AST). No information
(except perhaps things like comments and whitespace, if you don't need
to preserve them) needs to be lost when you convert the source code to
an AST. Usually you would also include useful metadata in the AST, like
the line and character positions particular expressions came from, so
that when displaying an error you can easily point out the location of
the problem.
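The position-metadata idea can be sketched like this (Python, with an invented node structure; real generators record this for you):

```python
# Sketch: a tokenizer that records line/column so AST nodes can point
# back at the source when reporting errors. Structure is illustrative.

from dataclasses import dataclass

@dataclass
class Token:
    text: str
    line: int   # 1-based line number
    col: int    # 1-based column number

def tokenize(src):
    tokens = []
    for lineno, line in enumerate(src.splitlines(), start=1):
        col = 0
        for word in line.split():
            col = line.index(word, col)          # find word's column
            tokens.append(Token(word, lineno, col + 1))
            col += len(word)
    return tokens

toks = tokenize("let x = 1\nprint x")
print(toks[4])  # Token(text='print', line=2, col=1)
```

An error reporter can then say "unexpected `print` at line 2, column 1" instead of just "parse error".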

If the language is simple enough, instead of building an AST you could
directly emit byte-code (or directly do analysis) from within the
parser, but doing so limits what you can do rather a lot. For example
many compiler optimizations are transformations applied to the AST
before code generation happens.

Liam <liam....@gmail.com> writes:

> I wrote: […] sophisticated syntax based evaluation or inferences from
> “cold code” [...] not node.
>
> This is just my personal language in referring to evaluating code from
> its parse tree. In my mind, code when on the screen being edited: it
> is cold. But when run through the clojure compiler: it is hot. My
> understanding of programming is completely intuitive not formally
> taught.

I would imagine most analysis/refactoring tools work primarily on an
AST. If they need to do something with the raw text they would just
have the parser include enough information in the AST that they can
reproduce the source exactly. Alternatively, for something fairly
simple like method renaming in a static language, they could just store
the start and end locations in the text of every method-name node. Then
they search the AST looking for method names, pull out each of their
start and end locations and do the appropriate replacements in the
text.
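That span-based renaming scheme can be sketched as follows (a toy in Python, not any real tool; in a real refactoring tool the spans would come from the AST, while here we fake them with a regex):

```python
# Toy sketch of span-based renaming. Replacing from the end backwards
# keeps the earlier offsets valid while we edit the text.

import re

src = "obj.run(); other.run(); obj.runner()"

# stand-in for "search the AST for method-name nodes called 'run'":
spans = [(m.start(), m.end()) for m in re.finditer(r"\brun\b", src)]

def rename(text, spans, new_name):
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_name + text[end:]
    return text

print(rename(src, spans, "execute"))
# obj.execute(); other.execute(); obj.runner()
```

Note that `runner` is untouched: the spans identify exact name nodes, which is precisely what a text-only search-and-replace cannot guarantee.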

I don't think using a parser generator really limits you in any way
unless perhaps you have really odd syntax rules [1]. Even then you
could likely use a generator for most of the normal syntax and 'plug in'
some custom code for just the bits that aren't statically parsable.

[1] http://www.perlmonks.org/?node_id=663393

Liam

Jan 29, 2010, 10:14:52 PM
to Clojure
Roberto and Alex, thanks for your feedback. This is informative. I
appreciate your perspectives & links.

Michael Wood

Jan 30, 2010, 1:49:57 AM
to clo...@googlegroups.com
On 30 January 2010 01:18, Alex Osborne <a...@meshy.org> wrote:
> Liam <liam....@gmail.com> writes:
>
>> Could someone educate me about what developers normally do when faced
>> with having to create a lexer / parser / analyzer, say for clojure?
>
> For parsing something like Clojure I would probably do it by hand, as
> the syntax is regular-enough that it's pretty easy to do.  For parsing
> anything more complex or if I was trying to match a formally specified
> grammar [1] I'd probably use a generator.
>
> [1] http://python.org/doc/2.5.4/ref/grammar.txt

How about for things like binary network protocols? Would you treat
them the same way as e.g. source code for a language? Obviously
there's no "code generation", but you still need to parse it.

--
Michael Wood <esio...@gmail.com>

Roberto Mannai

Jan 30, 2010, 6:36:15 AM
to clo...@googlegroups.com
I would not go with an automatic parser. "Binary network protocols" can
mean a broad range of things :). If there is just a passive consumer
(like a textual HTTP browser), you could "consume" all the binary data
and then parse it, though I don't know whether a grammar for binary
symbols exists (I just found this: http://www.iwriteiam.nl/Ha_BFF.html).
More likely you are consuming a real-time stream, where the binary data
is the actual content, not a structured message.

In general, high-level network protocols are text-based: HTTP, telnet,
POP, SMTP. There are some "tricks":
- you can send binary attachments with SMTP, or consume them via HTTP,
because they are encoded in a textual form (Base64);
- the FTP protocol is two-faced: the interactive commands are sent via
a textual protocol, whereas the binary data is sent as a binary stream
(on a new TCP port).

Low-level protocols (TCP or IP) are binary protocols. The packets have
a format (http://www.faqs.org/rfcs/rfc793.html) that can be quickly
checked in a numeric way, not a symbolic one. The check is not just
on the "format" but also on the content, via checksum redundancy
bits. Parsing alone is not enough.
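As an illustration of that "numeric, not symbolic" check, here is the RFC 1071 Internet checksum carried in TCP/IP headers, sketched in Python. A receiver recomputes it over the received bytes; parsing the fields alone would not catch corruption.

```python
# RFC 1071 Internet checksum: sum the data as 16-bit big-endian words
# with end-around carry, then take the ones' complement.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                            # pad to 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

header = bytes([0x45, 0x00, 0x00, 0x1c])           # a few sample header bytes
csum = internet_checksum(header)

# A message transmitted together with its checksum verifies to zero:
assert internet_checksum(header + csum.to_bytes(2, "big")) == 0
```

This is why "just a parsing is not enough": the bytes can be perfectly well-formed and still fail the arithmetic check.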

Michael Wood

Jan 30, 2010, 7:04:06 AM
to clo...@googlegroups.com
On 30 January 2010 13:36, Roberto Mannai <robe...@gmail.com> wrote:
> I should not go with an automatic parser. Binary network protocols can
> mean a broad range of things :). If there is just a passive consumer

Yes, I suppose so :)

> (like a textual HTTP browser), you could "consume" all the binary data
> and then parse it, though I don't know if do exist a grammar for
> binary symbols (just found this: http://www.iwriteiam.nl/Ha_BFF.html).
> You likely are consuming a real time stream, so the binary data is the
> actual content, not a structured message.
>
> In general, high level network protocols are text-based: HTTP, telnet,
> POP, SMTP. There is some "trick":

Ah, you're thinking of "nice" protocols :) Although I believe telnet
has some non-text parts to it too.

I was thinking more along the lines of DNS or DICOM etc., but I
suppose the fact that it's binary does not really matter. It's still
just messages being passed back and forth and each one needs to be
parsed and dealt with and possibly responded to.

> - you can send binary attachments with SMTP or consume them via HTTP
> because they are encoded in a textual description (BASE64);
> - the FTP protocol is two-faced: the interactive command controls are
> sent via a textual protocol, whereas the binary data is sent in a
> binary stream (in a new TCP port).

Yes, thanks.

> Low level protocols (TCP or IP) are binary protocols. The packets have
> a format (http://www.faqs.org/rfcs/rfc793.html) which can be quickly
> checked in a numeric way, not in a symbolic way. The check is not just
> on the "format", but also on the content, via checksum redundancy
> bits. Just a parsing is not enough.

I suppose I should have asked a proper question :) but I didn't have a
specific question really.

What prompted my question was the following:

I have come across references to a "declarative implementation of the
DICOM-3 network protocol" written in Common Lisp and I was wondering
what that means, exactly, and how one would go about doing something
for an arbitrary network protocol.

--
Michael Wood <esio...@gmail.com>

Alex Osborne

Jan 30, 2010, 7:04:40 AM
to clo...@googlegroups.com
Michael Wood <esio...@gmail.com> writes:

> How about for things like binary network protocols? Would you treat
> them the same way as e.g. source code for a language? Obviously
> there's no "code generation", but you still need to parse it.

As Roberto points out, most common (application-level) network protocols
are textual, not binary. You can certainly use a parser generator for
textual protocols. For example Mongrel (a Ruby HTTP server) uses the
Ragel parser generator to parse HTTP requests. Here's the grammar it uses:

http://github.com/fauna/mongrel/blob/master/ext/http11/http11_parser_common.rl

Actual binary (non-textual) protocols typically aren't languages as
such. They often just consist of something like this:

[message length] [message type] [payload]

Where [payload] is a fairly fixed structure (think C struct) based on
the message type number. The payload might have a variable length, but
it's usually rigidly specified by a flag field or Pascal-style strings.
Textual parser generators probably aren't all that useful for binary
protocols; however, there *are* binary protocol parser generators. An
example would be Google protocol buffers:

http://code.google.com/p/protobuf/
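The [message length] [message type] [payload] framing described above can be sketched with Python's struct module (field sizes here are invented for illustration; a real protocol fixes them in its specification):

```python
# Sketch of a [length][type][payload] binary framing: a 2-byte
# big-endian length, a 1-byte type, then the payload bytes.

import struct

def encode(msg_type: int, payload: bytes) -> bytes:
    return struct.pack("!HB", len(payload), msg_type) + payload

def decode(buf: bytes):
    length, msg_type = struct.unpack_from("!HB", buf)   # header is 3 bytes
    payload = buf[3:3 + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return msg_type, payload, buf[3 + length:]          # rest of the stream

msg_type, payload, rest = decode(encode(7, b"hello") + encode(2, b""))
print(msg_type, payload)  # 7 b'hello'
```

The returned `rest` is the unconsumed tail of the stream, so a receive loop can keep calling `decode` until the buffer is empty: much closer to a C-struct overlay than to a grammar.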

Roberto Mannai

Jan 30, 2010, 7:49:52 AM
to clo...@googlegroups.com
On Sat, Jan 30, 2010 at 1:04 PM, Alex Osborne <a...@meshy.org> wrote:
> there *are* binary protocol parser generators.  An
> example would be Google protocol buffers:
>
> http://code.google.com/p/protobuf/

Very interesting, thank you

Roberto Mannai

Jan 30, 2010, 8:33:43 AM
to clo...@googlegroups.com
On Sat, Jan 30, 2010 at 1:04 PM, Michael Wood <esio...@gmail.com> wrote:
>
> I have come across references to a "declarative implementation of the
> DICOM-3 network protocol" written in Common Lisp and I was wondering
> what that means, exactly, and how one would go about doing something
> for an arbitrary network protocol.

Maybe you're referring to this document:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.982&rep=rep1&type=pdf

I didn't know about this (medical) protocol. Anyway, its parsing is
driven by some header control data:

<<To the PDU parser, this message is simply an uninterpreted stream of
bytes. The DICOM Upper-Layer protocol specifies a header for each
message encoding the length of the following data field and some flags
indicating whether the data field is a command message or a data
object (and additionally whether the data field contains the complete
object or is
a portion of a long data field that has been fragmented into multiple
messages).
If the message is a command, our system gives the PDU parser pointers
to the beginning and end (in the TCP data buffer) of the message. The
ruleset contains rules for parsing commands using the same conventions
and language as for parsing any other type of PDU.
If the message is a data object, however, the buffer (and begin/end
pointers) are passed to a different parser. This parser is also
data-driven, using the data dictionary to decode arbitrary data
objects.>>

Each server/client has a state machine which processes those messages:
it can tell what to do next by reading the header control data:

<<The Prism DICOM code uses a finite state machine to implement the DICOM Upper
Layer Protocol, parsing incoming PDUs, and generating outgoing PDUs
using a production rule grammar>>

<<As the system parses incoming PDUs it extends an environment which
stores these variables and their values. The environment is accessible
to any operation that
needs values of data fields in the decoded PDUs. In Lisp terms, the
environment is represented as a nested association list. That is, the
entire environment is a list of
components.>>

It seems to me that there is no automatic parser generation from a
grammar: rather, there is continuous consuming and parsing that checks
the data against the stored grammar, and after a recognized event the
machine changes state, producing the proper output.
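A header-driven state machine of the kind the paper describes can be sketched like this (Python; the message-type names are invented, loosely modeled on DICOM's associate/release exchanges, and this is not the paper's actual ruleset):

```python
# Sketch of a header-driven protocol state machine: each incoming
# message's type, read from its header, selects a transition, and the
# transition yields both the next state and the reply to send.

class Connection:
    # (current state, message type) -> (next state, reply)
    TRANSITIONS = {
        ("IDLE",       "ASSOCIATE-RQ"): ("ASSOCIATED", "ASSOCIATE-AC"),
        ("ASSOCIATED", "DATA"):         ("ASSOCIATED", "ACK"),
        ("ASSOCIATED", "RELEASE-RQ"):   ("IDLE",       "RELEASE-RP"),
    }

    def __init__(self):
        self.state = "IDLE"

    def handle(self, msg_type):
        key = (self.state, msg_type)
        if key not in self.TRANSITIONS:
            raise ValueError(f"unexpected {msg_type} in state {self.state}")
        self.state, reply = self.TRANSITIONS[key]
        return reply

conn = Connection()
print([conn.handle(m) for m in ["ASSOCIATE-RQ", "DATA", "RELEASE-RQ"]])
# ['ASSOCIATE-AC', 'ACK', 'RELEASE-RP']
```

The "declarative" flavour comes from the fact that the protocol lives entirely in the `TRANSITIONS` table: adding a message type means adding a row, not writing new control flow.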

Anyway, this Lisp implementation is very interesting, thank you. I had
never heard of such a thing :)

Michael Wood

Jan 30, 2010, 11:42:37 AM
to clo...@googlegroups.com
On 30 January 2010 14:04, Alex Osborne <a...@meshy.org> wrote:
> Michael Wood <esio...@gmail.com> writes:
>
>> How about for things like binary network protocols?  Would you treat
>> them the same way as e.g. source code for a language?  Obviously
>> there's no "code generation", but you still need to parse it.
>
> As Roberto points out, most common (application-level) network protocols
> are textual, not binary.

Yes, I'm quite familiar with various text-based protocols like IMAP,
POP3, NNTP, IRC, SMTP, etc. I've also played around a little with
implementing small subsets of DNS and DHCP in Python. It was fun, but
I'm pretty sure my approach was sub-optimal. That was a while ago,
but I was wondering about how to go about something like that in
Clojure and the "declarative" DICOM protocol implementation in Common
Lisp sounded interesting. Anyway, if I get around to this I'll try to
figure out what they did and how to map that to Clojure.

> You can certainly use a parser generator for
> textual protocols.  For example Mongrel (a Ruby HTTP server) uses the
> Ragel parser generator to parse HTTP requests.  Here's the grammar it uses:
>
> http://github.com/fauna/mongrel/blob/master/ext/http11/http11_parser_common.rl
>
> Actual binary (not-textual) protocols typically aren't languages as
> such.  They often just consist of something like this:
>
> [message length] [message type] [payload]
>
> Where [payload] is a fairly fixed structure (think C struct) based on
> the message type number.  The payload might have a variable length, but
> it's usually rigidly specified by a flag field or pascal-style strings.

Yes, I suppose you're right.

> Textual parser generators aren't probably all that useful for binary
> protocols, however there *are* binary protocol parser generators.  An
> example would be Google protocol buffers:
>
> http://code.google.com/p/protobuf/

Thanks :)

--
Michael Wood <esio...@gmail.com>

Michael Wood

Jan 30, 2010, 11:45:28 AM
to clo...@googlegroups.com
On 30 January 2010 15:33, Roberto Mannai <robe...@gmail.com> wrote:
> On Sat, Jan 30, 2010 at 1:04 PM, Michael Wood <esio...@gmail.com> wrote:
>>
>> I have come across references to a "declarative implementation of the
>> DICOM-3 network protocol" written in Common Lisp and I was wondering
>> what that means, exactly, and how one would go about doing something
>> for an arbitrary network protocol.
>
> Maybe you're referring to this document:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.982&rep=rep1&type=pdf

Yes, that looks like the one.

> I didn't know this (medical) protocol. Anyway, its parsing is driven
> by some header control data:

Yes, that's right.

Thanks for your comments.

--
Michael Wood <esio...@gmail.com>
