Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

flex scanner too huge, suggestions?

9 views
Skip to first unread message

Troy Cauble

unread,
Feb 17, 2001, 1:35:30 AM2/17/01
to
Hi all,

I have a working flex/bison implementation to parse a text
protocol. The problem is that the scanner is huge.

Made with LFLAGS = -Cfe, size gives
42068(.text) + 24(.data) + 60(.bss) + 1687886(.rodata) = 1730038

Made with LFLAGS = -Cem, size gives
42496(.text) + 24(.data) + 60(.bss) + 508326(.rodata) = 550906

This is still pretty large for some of our target environments.

I haven't tried to disable features to determine what makes
it so huge, but the protocol is case-insensitive and context
(position) sensitive. I addressed the context sensitivity
with flex start states.

My flex file has the following %option line.
%option noyywrap 8bit batch case-insensitive perf-report

I'm looking for suggestions for either tweaking my flex
implementation or switching to another scanner generator.

The corresponding bison parser is solid, so I'm not really
looking for suggestions to replace the pair. (We initially
tried several other tools for the total parsing problem but
they all choked on this large, context sensitive protocol.)

Thanks,
-troy

Scott Nicol

unread,
Feb 23, 2001, 12:01:13 AM2/23/01
to
"Troy Cauble" <tr...@bell-labs.com> wrote in message

> Made with LFLAGS = -Cfe, size gives
> 42068(.text) + 24(.data) + 60(.bss) + 1687886(.rodata) = 1730038

-Cfe is no table compression, not much use except for debugging flex.

> Made with LFLAGS = -Cem, size gives
> 42496(.text) + 24(.data) + 60(.bss) + 508326(.rodata) = 550906

OK, so you have very large tables.

> I haven't tried to disable features to determine what makes
> it so huge, but the protocol is case-insensitive and context
> (position) sensitive. I addressed the context sensitivity
> with flex start states.
>
> My flex file has the following %option line.
> %option noyywrap 8bit batch case-insensitive perf-report

None of these options will cause table explosion. case-insensitive
will shrink tables somewhat.

How many start states does your scanner have? A few won't make much
of a difference, but start states do chew up table space. If you have
lots of start states, this could be the problem.

> I'm looking for suggestions for either tweaking my flex
> implementation or switching to another scanner generator.

Try to reduce the number of states by reducing the number of rules.
For example, if you have a lot of keywords, each with a distinct rule,
it would make your tables smaller if you could combine them into one
rule. If your scanner looks like this:

const { return CONST; }
else { return ELSE; }
if { return IF; }
... 500 keywords deleted
while { return WHILE; }
[a-z][a-z0-9_]* { yylval.symbol = lookup(yytext); return
IDENTIFIER; }

you could shrink your tables by placing all of the keywords into the
IDENTIFIER rule. To do this, you would have to load a symbol table with all
of the keywords. Your lex file would then look like this:

[a-z][a-z0-9_]* { yylval.symbol = lookup(yytext); return yylval.token; }

> The corresponding bison parser is solid, so I'm not really looking
> for suggestions to replace the pair. (We initially tried several
> other tools for the total parsing problem but they all choked on
> this large, context sensitive protocol.)

flex is about as good as you'll get. It generates very small tables.
Another alternative is to write your own yylex(), by hand. It isn't
usually that difficult, and you may be able to do some things by hand
much more efficiently than you can do in flex. Lots of start states
is generally a sign that your trying to force things.

--
Scott Nicol
sni...@apk.net

Tim Josling

unread,
Feb 23, 2001, 12:00:29 AM2/23/01
to
One possibility that is often used is 'keywords'. If there is a
large class of tokens that match a generic pattern, you can have
the scanner match the pattern, then look up which one it is in a
keyword table. Most 3GL compilers do this.

Flex also has various table compression options eg -Cm (compress
and use meta equivalence classes). See the manual.

Tim Josling

RKRayhawk

unread,
Feb 23, 2001, 12:03:03 AM2/23/01
to
Troy Cauble tr...@bell-labs.com
Date: 17 Feb 2001 01:35:30 -0500
posts some stats on a build of a scanner

<<

Made with LFLAGS = -Cfe, size gives
42068(.text) + 24(.data) + 60(.bss) + 1687886(.rodata) = 1730038

Made with LFLAGS = -Cem, size gives
42496(.text) + 24(.data) + 60(.bss) + 508326(.rodata) = 550906


>>

If you have a language syntax requirement that is huge, then it is
huge. You seem to suggest this with the comments about trying to
master the 'context sensitivity' with start states. Yet seems like
the problem _might_ be elsewhere.

"Exactly what is introducing the large .rodata?", would be the
question. And here you may be looking at the libraries used for
executable construction at link time. You may wish to mention what
your OS+version is, and indicate any special libraries you are
bringing in.

You are using the word 'protocol' in your brief posts. What are you
telling us here, are you attaching any special software to the
front-end scanner? For example, are you bringing in anything to get
sockets, anything to get GUI services?

The problem could be in the one aspect that you identify most clearly,
and I do not mean to ignore it by the preceeding. The case
insensitivity _might_ bloat scanner tables; more (flex) experienced
lurkers are encouraged to comment on that. If so, then you might need
some manual conversion of strings to a single case (upper or lower)
and then drive simplistic regular expression compilation (that is,
back off from case-insensitive); but from what you have indicated that
will be a large amount of work for you.

Let me request you post an indication as to whether you have large
amounts of error processing in your program. My interest here is to
see if the bloat is the text from the error messages. (All
speculation, so ignore this if its a distraction). But if you have
hard coded your error messages internal to the scanner (applies
conceptually to parsers too), then the linked up result will look real
big. ((Engineers do this a lot. Start with a few simple messages ...
later as the program/system matures most of the executable image is
the text substance of diagnostic messages)). If this is relavant, the
solution is to externalize the text of diagnostics to a file and call
up the pieces you need, message by message, at execution time with
indexes or hashes (error numbers).

And one more risk of hurting your feelings by possibly being too
simplistic, I am just trying to pull out every rabit I can find in the
hat. If you have your compiler in debug mode, the executables can be
very large to hold symbol tables and all manner code point
identification for step-wise execution.

Hope that some of that is useful,


Robert Rayhawk
ray...@alum.calberkeley.org

Clint Olsen

unread,
Feb 25, 2001, 10:50:40 AM2/25/01
to
I strongly suggest you try using re2c instead of flex. I just converted a
flex scanner over to re2c, and the size difference is stunning.

Re2c can also easily generate reentrant C lexers which is great. Re2c
requires you to write your own buffer management, but I copied the example
they provided for a C scanner and it was terrific.

re2c -b : c file size: 25470 (!!)
flex -f (%option align): c file size: 421804

Executable with re2c: 106252 bytes
Executable with flex: 325850 bytes

Re2c has some limitations on regular expression support, but the most
important features are there. The examples provided are very helpful.

Typically re2c generates faster and smaller parsers than flex. In my case,
the speed was about on par with flex. However, I got line and column
information support in the re2c scanner.

I wish I had found re2c sooner.

-Clint

re2c
----

Version 0.9.1
Originally written by Peter Bumbulis (pe...@csg.uwaterloo.ca)
Currently maintained by Brian Young (bay...@acm.org)

The re2c distribution can be found at:

http://www.tildeslash.org/re2c/index.html

The source distribution is available from:

http://www.tildeslash.org/re2c/re2c-0.9.1.tar.gz

Ron Pinkas

unread,
Feb 25, 2001, 10:51:47 AM2/25/01
to
Troy,

> I'm looking for suggestions for either tweaking my flex
> implementation or switching to another scanner generator.

You might want to check SimpLex at:

http://sourceforge.net/projects/simplex

It is an ANSI C, open source, bison compatible, generic lexer. It is not a
generator, rather, it is data driven by an #included .slx language
definition file. It has very strong context capabilities by utilizing YACC
like rules for context analysis.

The resulting lexer is typically 1/2 to 1/4 of an equivalent Flex generated
lexer, and is just as fast (or faster).

Ron Pinkas

Troy Cauble

unread,
Mar 1, 2001, 2:33:52 AM3/1/01
to
Troy Cauble <tr...@bell-labs.com> wrote:
: I have a working flex/bison implementation to parse a text

: protocol. The problem is that the scanner is huge.

: Made with LFLAGS = -Cfe, size gives
: 42068(.text) + 24(.data) + 60(.bss) + 1687886(.rodata) = 1730038

: Made with LFLAGS = -Cem, size gives
: 42496(.text) + 24(.data) + 60(.bss) + 508326(.rodata) = 550906

: This is still pretty large for some of our target environments.

: I haven't tried to disable features to determine what makes
: it so huge, but the protocol is case-insensitive and context
: (position) sensitive. I addressed the context sensitivity
: with flex start states.


Thanks to all who replied here and via e-mail.

I thought I'd post what worked for future archive readers.

The big win for my flex scanner was rewriting INCLUSIVE start
states as EXCLUSIVE. I had about 50 start states with
1 to 5 rules each, to handle portions of the protocol.

Moving to EXTERNAL start states meant I had to add additional
rules for each start state (but a small subset of the INITIAL
rules). This resulted in more rules in the source file, but
much smaller tables. .rodata is less than half of what it was
at -Cem.

Given the table reduction, I'm *guessing* that 50 inclusive
start states generates 50 times the INITIAL (non start state)
rules. (Standard rules plus those in start state 1; standard
rules plus those in start state 2; ...)

Oddly enough, the oft-suggested "single identifier rule with
keyword lookup" actually increased my table size slightly
before I did rewrote the start states. I can't guess why.
(It wasn't the lookup function, I didn't even bother to
write that.)

-troy

0 new messages