Cheers,
Dave
True, if you forget about the white space that you silently throw away.
Below is some background information, with a few terms you can use as a
starting point for searching for more information.
The lack of recursion in lex makes it possible to generate a DFA
(deterministic finite automaton) from the regular expression(s), which allows
for *very fast* processing of the characters.
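To make that concrete, here is a minimal sketch in plain Python (hand-written, not
generator output; the state names and the match_integer helper are made up for this
example) of the kind of table-driven DFA a lexer generator produces for a pattern
like [0-9]+: one table lookup per character, no backtracking, no recursion.

# Minimal hand-written DFA for the token pattern [0-9]+ (an integer literal).
# A generated lexer works the same way: one table lookup per input character.

def char_class(ch):
    return "digit" if ch.isdigit() else "other"

# transition table: (state, character class) -> next state
TRANSITIONS = {
    ("start", "digit"): "in_number",
    ("in_number", "digit"): "in_number",
}

ACCEPTING = {"in_number"}

def match_integer(text):
    """Return the length of the longest integer literal at the start of text."""
    state = "start"
    last_accept = 0
    for i, ch in enumerate(text):
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            break                 # no transition: stop scanning
        if state in ACCEPTING:
            last_accept = i + 1   # remember the longest accepted prefix
    return last_accept

print(match_integer("12345+6"))   # -> 5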
The yacc parser, on the other hand, is slower but more powerful than regular
expressions.
(Note that in compiler theory, 'regular expressions with recursion' don't
exist; such grammars fall into the class of context-free languages in the
Chomsky hierarchy. For such languages, you need a stack in addition to the
automaton.)
This additional power allows you to recognize nested structures correctly, for
example to find the matching closing bracket in an expression.
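For example, here is a small sketch in plain Python (hand-written, not lex/yacc
output; the function name is made up for illustration) that checks whether
brackets are properly nested. No DFA can do this for arbitrary nesting depth,
because it has no way to remember how many brackets are still open; the explicit
stack below provides exactly that memory.

# Checking nested brackets needs a stack: a finite automaton alone cannot
# remember an unbounded number of unclosed brackets.

PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(text):
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)          # remember the open bracket
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False          # wrong or missing opening bracket
    return not stack                  # everything opened must be closed

print(brackets_balanced("a * (b + c[i])"))   # True
print(brackets_balanced("a * (b + c[i)]"))   # False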
The LALR(1) algorithm used by yacc is a compromise between recognition power
(enough for most languages) and keeping the number of parser states small
(although the latter is of less importance nowadays).
> I guess Lex parser is only a special case of Yacc parser (Lex output
> is list but not recursively structure like tree). So why not merge
> them into one tool?
Speed and simplicity mostly. The current division of labour means you can use
optimized algorithms for each part while reducing overall complexity.
While writing yacc rules, you don't have to think about the representation of
an integer number (was it '12345' or '0x3039' or '030071' or '0b11000000111001'?).
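As an illustration, here is a sketch of a hand-written tokenizer in plain
Python (the token names and the to_int helper are made up for this example)
that turns all of those spellings into the same NUMBER token with the same
value, so the grammar rules never see the difference.

import re

# Each alternative spelling of an integer becomes the same NUMBER token,
# so the grammar only ever deals with a token kind and its value.
TOKEN_RE = re.compile(r"""
      (?P<NUMBER> 0[xX][0-9a-fA-F]+   # hexadecimal, e.g. 0x3039
                | 0[bB][01]+          # binary,      e.g. 0b11000000111001
                | 0[0-7]+             # octal,       e.g. 030071
                | \d+ )               # decimal,     e.g. 12345
    | (?P<PLUS>   \+)
    | (?P<SKIP>   \s+)
""", re.VERBOSE)

def to_int(text):
    """Convert any of the supported notations to a Python int."""
    if text[:2].lower() in ("0x", "0b"):
        return int(text, 0)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)        # C-style octal: leading zero
    return int(text, 10)

def tokenize(source):
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        value = to_int(m.group()) if kind == "NUMBER" else m.group()
        yield (kind, value)

print(list(tokenize("12345 + 0x3039 + 030071 + 0b11000000111001")))
# -> [('NUMBER', 12345), ('PLUS', '+'), ('NUMBER', 12345), ('PLUS', '+'),
#     ('NUMBER', 12345), ('PLUS', '+'), ('NUMBER', 12345)]

That is the division of labour in a nutshell: the lexer worries about how a
number is spelled, and the parser only sees a NUMBER token and its value.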
Nonetheless, parsers that combine both phases do exist. One example is
'sglr' (a C-based solution; I am not aware of any Python-based solutions). It
uses much more powerful (and much more resource-hungry) algorithms to
recognize the text.
This power is useful in places where the lex/yacc algorithms are not enough,
e.g. with ambiguous grammars and/or textual input with several languages
combined (e.g. Cobol with embedded SQL).
Albert