
The remarkable similarities between Flex/Lex and XSLT


Roger L Costello

Jun 24, 2022, 9:00:44 AM
Hi Folks,

XSLT is a language for processing XML documents.

There are remarkable similarities between Flex/Lex and XSLT. Lex was created
47 years ago, long before XSLT. One wonders if some members of the XSLT 1.0
Working Group were Lex users and were influenced by its concepts?

Here are some of the similarities between Flex/Lex and XSLT:

Both are pattern-matching languages, i.e.,

pattern action
pattern action
pattern action
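
For instance (a made-up toy pair, just to show the shape), a Lex rule pairs a
regular expression with a C action:

    [0-9]+    { printf("NUMBER\n"); }

and an XSLT template rule pairs an XPath match pattern with a template body:

    <xsl:template match="price">
      <b><xsl:value-of select="."/></b>
    </xsl:template>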

Both allow you to create subsets:

The XSLT "mode" allows the program to effectively be broken into a set of
mini XSLT programs
The Flex/Lex "start condition" allows the lexer to effectively be broken into
a set of mini lexers
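
A small sketch of the parallel (element and condition names invented): an XSLT
stylesheet might process the same elements once in a "toc" mode and once in
the default mode,

    <xsl:template match="section" mode="toc">
      <li><xsl:value-of select="title"/></li>
    </xsl:template>

    <xsl:template match="section">
      <xsl:apply-templates/>
    </xsl:template>

and apply-templates with mode="toc" only considers the mode="toc" templates.
In Flex, a start condition restricts which rules are active; rules tagged
<STR> only fire after BEGIN(STR):

    %x STR
    %%
    \"          { BEGIN(STR); }
    <STR>\"     { BEGIN(INITIAL); }
    <STR>.      { /* character inside a string literal */ }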

Both have a default rule that is executed when no other rule matches
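
Concretely: Lex's default rule copies any unmatched input character to the
output (the ECHO action), and XSLT 1.0's built-in template rules do the
analogous thing for unmatched nodes -- they behave roughly as if the
stylesheet contained:

    <xsl:template match="*|/">
      <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="text()|@*">
      <xsl:value-of select="."/>
    </xsl:template>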

There are some differences, of course:

XSLT is primarily for processing XML but it can process plain text, whereas
Flex/Lex is primarily for processing plain text (esp. source code) but it can
process XML

XSLT uses XPath as its pattern-matching language, whereas Flex/Lex uses
regular expressions as its pattern-matching language

Pretty interesting, I think!

/Roger
[I would be surprised if the XSLT authors hadn't seen lex but Wikipedia suggests
it was more influenced by awk, which also has pattern action lines. -John]

gah4

Jun 24, 2022, 7:58:58 PM
On Friday, June 24, 2022 at 6:00:44 AM UTC-7, Roger L Costello wrote:

> XSLT is a language for processing XML documents.

> There are remarkable similarities between Flex/Lex and XSLT. Lex was created
> 47 years ago, long before XSLT. One wonders if some members of the XSLT 1.0
> Working Group were Lex users and were influenced by its concepts?

> Here are some of the similarities between Flex/Lex and XSLT:

> Both are pattern-matching languages, i.e.,

I had been thinking, but hadn't posted yet, about the different ways of writing
pattern-matching languages. Well, more specifically about parsing languages,
but even more about the pattern-matching part.

I wrote recently about STEP, which has an input language somewhat
different from yacc/bison for describing a parser.

And even more, whether there should be a language for writing pattern-matching
languages in.

That is, do we need a compiler-compiler-compiler?

It does seem rare that one starts from scratch in defining a new
computer language, even when it is not a general-purpose
programming language.
[Pattern-action goes back at least to RPG in 1959, and it was
based on the way plugboard accounting machines work. -John]

Christopher F Clark

Jun 25, 2022, 12:43:07 PM
I think we should have (not sure about the word "need", "should have" is
close enough) compiler-compiler-compilers.

We know enough to write a single algorithm that can generate regular
expression recognizers as either NFAs or DFAs, LL(k) parsers, LR(k)
parsers, GLR(k) parsers, PEG parsers, and can incorporate captures,
back-references, predicates, adaptive rules, permutations, dynamic
precedence rules, etc. We also know enough to include the generation of
visitors, attribute evaluators, and other next-level "assistants".
Having it accept a variety of notations is also relatively easy.

And to truly make it a compiler-compiler-compiler, one needs to make the
parts separable and able to be "generated" in the special-case forms
(e.g. to be able to recreate an LL version like ANTLR or an LR version like
Bison and, of course, the variations in between). Circa 2000 we already had
a version of our Yacc++ that could generate something close to recursive
descent code from an LR grammar, so this is just extending that concept.
ANTLR4 is doing something close to the reverse and moving to a more
state-machine-like description of (mostly) LL grammars.

Thus, creating a compiler-compiler-compiler is a feasible (although
non-trivial) task. It's on my "to-do" list.

--
******************************************************************************

Chris Clark                  email: christoph...@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

matt.ti...@gmail.com

Jun 25, 2022, 12:44:47 PM
On Friday, 24 June 2022 at 09:00:44 UTC-4, Roger L Costello wrote:
> Hi Folks,
>
> XSLT is a language for processing XML documents.
>
> There are remarkable similarities between Flex/Lex and XSLT. Lex was created
> 47 years ago, long before XSLT. One wonders if some members of the XSLT 1.0
> Working Group were Lex users and were influenced by its concepts?

It's not really about a single tool like Lex.

Before XML there was SGML, which XML was supposed to "simplify". SGML
included a schema language (DTD), which defines the hierarchical structure of a
document using regular expressions over elements. There was also a strange,
unnecessary constraint forbidding "ambiguous" (non-deterministic) content
models, which *everybody* who wrote SGML software needed to understand, and so
the idea of applying formal language techniques to SGML was inevitable.
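
For example (element names invented), a DTD content model really is a regular
expression over the child elements:

    <!ELEMENT chapter (title, (para | list)*)>

and the "ambiguity" constraint rules out content models such as
((b, c) | (b, d)): on seeing a <b> the parser cannot tell which branch it is
in without looking further ahead, so SGML rejects the model as ambiguous
(XML later kept a similar determinism rule for compatibility).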

Long before XSLT, there were a variety of attempts to define languages that
would allow users to specify an automatic translation from SGML into printed
form. Many of these languages were context-free grammars at their core, with
translation rules as actions. This is called "syntax-directed translation"
and was a well-known concept long before that.
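
For anyone who hasn't met the term: in the yacc/bison tradition,
syntax-directed translation just means attaching an action to each grammar
production, along these lines (a toy fragment, not taken from any of the SGML
tools above):

    expr : expr '+' term   { $$ = $1 + $3; }
         | term            { $$ = $1; }
         ;

The SGML formatting languages did essentially the same thing, except that
their "productions" matched element structure rather than tokens.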

With SGML, though, the problem of syntax-directed translation is different
from what it is in other contexts, and more difficult in many ways, because the
basic structures in the input are very easy to parse -- elements are delimited,
after all -- but the input was semantically marked-up text and the output was
a published document that had to follow all the ambiguously defined stylistic
rules that people use when they actually do typography. This meant
that complicated grammars, over *element trees* instead of linear text, and
lots of other ideas, needed to be applied. Lots of companies put a lot of
work into it.

So by the time XSLT came around, everyone on the committee was already familiar
with a lot of this history from SGML processing, which was based on a lot of
work rooted in the same formal language theory that goes into lexers and
parsers, and that is why some of XSLT looks a lot like Lex.

Unfortunately, XSLT kind of sucks. When the standard was written, the problem
itself had not really been solved by industry in a really acceptable way (and
it still hasn't been!), and the W3C committee fell into the trap of trying to
innovate instead of codifying best practice.
