destructive lexical analysis

Rainer Weikusat

Nov 14, 2022, 12:02:48 PM
I already did a posting about that a while back, but since that's such a
neat example, I thought I'd do another. In Higher Order Perl, Mark Jason
Dominus wrote about doing lexical analysis of strings based on Perl
regexes. The basic pattern he proposed looked like this:

for ($string) {
    /\G(<regex a>)/gc and do {
        .
        .
        .
        redo;
    };

    /\G(<regex b>)/gc and do {
        .
        .
        .
        redo;
    };

    /\G(<regex c>)/gc and do {
        .
        .
        .
        redo;
    };
}

The regexes are supposed to match different kinds of tokens which are
then made available to the code processing them via capture buffers. The
decoration \G /gc is necessary to match subsequent parts of the same
string, \G meaning "start matching where the last /g match left off", /g
for "match globally" and /c for "don't reset match position on
mismatch".
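A minimal runnable sketch of this pattern follows. The token set (integers, identifiers, a few single-char operators), the sub name tokenize_gc and the error handling are made up for illustration and not taken from the book:

```perl
use strict;
use warnings;

# Sketch only: token classes and names are hypothetical.
sub tokenize_gc {
    my ($string) = @_;
    my @tokens;

    for ($string) {
        # skip whitespace between tokens
        /\G\s+/gc and redo;

        /\G(\d+)/gc and do {
            push @tokens, [NUM => $1];
            redo;
        };

        /\G([A-Za-z_]\w*)/gc and do {
            push @tokens, [ID => $1];
            redo;
        };

        /\G([-+*=()])/gc and do {
            push @tokens, [OP => $1];
            redo;
        };

        # Nothing matched. Because of /c, pos() still points at the
        # place where matching stopped: either the end of the string
        # or a character no regex recognizes.
        my $pos = pos($_) // 0;
        die "unexpected input at offset $pos\n" if $pos < length($_);
    }

    return @tokens;
}

print join(' ', map { "$_->[0]:$_->[1]" } tokenize_gc('x = 42 + y1')), "\n";
# prints: ID:x OP:= NUM:42 OP:+ ID:y1
```

The string itself is never modified here, only its match position moves.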

Perl can efficiently delete characters at the beginning of a string.
Hence, assuming that it's ok to consume the string during parsing, the
example above can be simplified somewhat to

for ($string) {
    s/^(<regex a>)// and do {
        .
        .
        .
        redo;
    };
}

The matching position is fixed here: It's always at the beginning of the
string. Instead of moving the position to the right as matching
proceeds, the string itself is logically shifted to the left as each
matched token is removed from it.
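The same idea as a runnable sketch. As before, the token set and the sub name tokenize_destructive are hypothetical. The sub works on its own copy of the argument, so only that copy gets consumed:

```perl
use strict;
use warnings;

# Sketch only: token classes and names are hypothetical.
sub tokenize_destructive {
    my ($string) = @_;    # a copy, the caller's string stays intact
    my @tokens;

    for ($string) {
        # drop whitespace between tokens
        s/^\s+// and redo;

        s/^(\d+)// and do {
            push @tokens, [NUM => $1];
            redo;
        };

        s/^([A-Za-z_]\w*)// and do {
            push @tokens, [ID => $1];
            redo;
        };

        s/^([-+*=()])// and do {
            push @tokens, [OP => $1];
            redo;
        };

        # Nothing matched: whatever is still left can't be tokenized.
        die "unexpected input: '$_'\n" if length($_);
    }

    return @tokens;
}

print join(' ', map { "$_->[0]:$_->[1]" } tokenize_destructive('x = 42 + y1')), "\n";
# prints: ID:x OP:= NUM:42 OP:+ ID:y1
```

No \G, /g or /c is needed and the rest of the input is always just $_.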

Complete example of that (it extracts so-called list tokens from suricata
rules, a list token being either a sequence of \S chars or something
that's bracketed via [] and may contain nested [] to an arbitrary depth):

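# works on $_ and removes the token it returns from it
sub next_list
{
    my ($v, $lvl);

    # bracketed token: consume the opening [ and track the nesting depth
    if (s/^\[//) {
        $v = '[';
        $lvl = 1;

        {
            # copy everything up to the next bracket
            s/^([^][]+)// and $v .= $1;
            # nested opening bracket, one level deeper
            s/^\[// and $v .= '[', ++$lvl, redo;

            # otherwise, the next char is taken to be a closing ]
            s/.//;
            $v .= ']';

            --$lvl;
            return $v if $lvl == 0;

            redo;
        }
    }

    # plain token: the next run of non-whitespace chars
    s/(\S+)//;
    return $1;
}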

[The exact format/syntax for this is undocumented and can only be
determined by reading through the suricata rule parsing code]
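A possible way to drive next_list, as a sketch. The sub is repeated so that the example runs on its own. The wrapper list_tokens and the input string are made up for illustration, and the input is not claimed to be valid suricata syntax:

```perl
use strict;
use warnings;

# next_list as in the posting: works on $_ and consumes the token.
sub next_list
{
    my ($v, $lvl);

    if (s/^\[//) {
        $v = '[';
        $lvl = 1;

        {
            s/^([^][]+)// and $v .= $1;
            s/^\[// and $v .= '[', ++$lvl, redo;

            s/.//;
            $v .= ']';

            --$lvl;
            return $v if $lvl == 0;

            redo;
        }
    }

    s/(\S+)//;
    return $1;
}

# Hypothetical wrapper: split a copy of the text into list tokens.
sub list_tokens
{
    my ($text) = @_;
    my @tokens;

    for ($text) {
        while (/\S/) {
            s/^\s+//;    # a bracketed token must start at the beginning
            push @tokens, next_list();
        }
    }

    return @tokens;
}

print "$_\n" for list_tokens('[10.0.0.0/8,![10.1.0.0/16,10.2.0.0/16]] any [80,[8000:8100]]');
# prints:
# [10.0.0.0/8,![10.1.0.0/16,10.2.0.0/16]]
# any
# [80,[8000:8100]]
```

Leading whitespace is stripped by the caller here because next_list only recognizes a bracketed token if the [ is the very first char of $_.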