Questions about regex

Brian Candler

Jan 21, 2019, 5:46:25 AM
to Wazuh mailing list
I have a question about regex semantics, which has led me to some questions in the code itself.

The documentation says that if you want to match a literal "<", you have to escape it with a backslash.  But I can't see why this is the case, as I can't see any case where "<" would have a special meaning, or any reason why a bare "<" would not match.  If it were caret (^) then I could understand it, but there is no mention of \^.

I went looking through the source, and got a bit confused.

There is regexmap[][256], which is indexed by the regex class code and the next character.  Where the values are 0 and 1, this makes sense: 0 = not in class, 1 = in class.  There is a macro to test for this:

==> src/os_regex/os_regex_internal.h
#define Regex(x,y)   (regexmap[x][y] == TRUECHAR)

But there are cases where the regexmap is *not* 0 or 1, and I can't find any part of the code which uses those values.

For example, \< is  class 15:

==> src/os_regex/os_regex_compile.c
                case '<':
                    *pt = 15;
                    break;

And here is the regexmap entry for class 15:

==> src/os_regex/os_regex_maps.c
    {
        0, 0, 2, 3, 4, 5, 6, 7,
        8, 10, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 1, 61, 62, 63,              /* <<<<<< NOTE the "1," at offset 60 ('<') */
        64, 97, 98, 99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119,
        120, 121, 122, 91, 92, 93, 94, 95,
        96, 97, 98, 99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119,
        120, 121, 122, 123, 124, 125, 126, 127,
        128, 129, 130, 131, 132, 133, 134, 135,
        136, 137, 138, 139, 140, 141, 142, 143,
        144, 145, 146, 147, 148, 149, 150, 151,
        152, 153, 154, 155, 156, 157, 158, 159,
        160, 161, 162, 163, 164, 165, 166, 167,
        168, 169, 170, 171, 172, 173, 174, 175,
        176, 177, 178, 179, 180, 181, 182, 183,
        184, 185, 186, 187, 188, 189, 190, 191,
        192, 193, 194, 195, 196, 197, 198, 199,
        200, 201, 202, 203, 204, 205, 206, 207,
        208, 209, 210, 211, 212, 213, 214, 215,
        216, 217, 218, 219, 220, 221, 222, 223,
        224, 225, 226, 227, 228, 229, 230, 231,
        232, 233, 234, 235, 236, 237, 238, 239,
        240, 241, 242, 243, 244, 245, 246, 247,
        248, 249, 250, 251, 252, 253, 254, 255,
    },

I can see that character '<' is included (value '1' at offset 60), but I don't know what all the other non-zero values are doing - it looks as if they are transposing uppercase and lowercase, but I can't find anywhere in the code which uses regexmap[][] except within the Regex() macro, which only checks for TRUECHAR (1).  In other words, I think these are all treated as 'false'.
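
To make the point concrete, here is a minimal Python model of the lookup (not the actual C code). It rebuilds the class-15 row from the pattern visible in the listing, and `regex_class_match` is a hypothetical stand-in for the Regex() macro. The listing also shows a 10 at offset 9, which the model ignores on the assumption that any value other than 1 behaves the same.

```python
TRUECHAR = 1  # the only value the Regex() macro compares against


def build_class15_row():
    """Rebuild the row for the '<' class from the pattern visible in the listing."""
    row = list(range(256))            # most cells simply hold their own index
    row[0] = row[1] = 0               # offsets 0 and 1 hold 0 in the listing
    for code in range(ord("A"), ord("Z") + 1):
        row[code] = code + 32         # upper-case cells hold the lower-case code
    row[ord("<")] = TRUECHAR          # offset 60: the one real member of the class
    return row


def regex_class_match(row, ch):
    """Python stand-in for Regex(x, y): true only when the cell is exactly 1."""
    return row[ord(ch)] == TRUECHAR


ROW = build_class15_row()
members = [chr(i) for i, cell in enumerate(ROW) if cell == TRUECHAR]
print(members)
```

In this model only '<' is a member: every other cell compares unequal to 1, whatever number it holds. Note that offset 1 has to hold 0 rather than its own index, otherwise byte 0x01 would count as a member.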

And I still don't know why '<' has its own character class, as the only place I can see '<' treated specially is where it is escaped as \<:

$ grep -1R "'<'" src/os_regex
src/os_regex/os_regex_compile.c-                    break;
src/os_regex/os_regex_compile.c:                case '<':
src/os_regex/os_regex_compile.c-                    *pt = 15;

On top of this, there doesn't seem to be a way to match a literal caret (^).  Nor a literal asterisk (*) or plus (+), except as part of the \p character class.

Also: \s only seems to match space (32), not tab or carriage return or newline etc; and \S similarly matches anything apart from space (32).  This is different to normal regular expressions.

Clues gratefully received.  I am happy to hack on this, or even just add a few more source comments, but I'd like to understand the intention first.

Thanks,

Brian.

Aside: the reason I started looking in this area of code was to convince myself that, unlike normal regular expressions, a bare dot just matches a dot, whilst \. matches any character.  This does indeed seem to be the case, but maybe it's worth flagging explicitly in the documentation.

Brian Candler

Jan 21, 2019, 6:36:16 AM
to Wazuh mailing list
Hmm, I was expecting .* to match "zero or more dots".  However this decoder:

<decoder name="cisco-ios">
  <prematch>^\d+:\s+\p*\w+\s+\d+\s+\d+:\d+:\d+.*\d*\s*\w*:\s+%</prematch>
</decoder>

Does not match this string:

137: *Jan 18 11:48:01.737: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/21, changed state to up

Yet if I change .* to \p* in the regexp, it does match.  I need to dig through the regexp code some more.

(Aside: .*\d* is intended to match optional milliseconds, and \s*\w* is intended to match optional timezone.  I have seen examples with one or the other, neither or both)
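
For comparison, this is how the intended pattern behaves under conventional (backtracking) regex semantics, with Python's re module standing in for a PCRE-style engine. The translation is an assumption: \p is approximated by Python's string.punctuation, the dot is escaped so that `\.*` means "zero or more dots", and the second and third sample lines are hypothetical variants of the log line above.

```python
import re
import string

# Assumption: approximate the OS_Regex \p class with ASCII punctuation.
PUNCT = "[" + re.escape(string.punctuation) + "]"

CISCO_PREMATCH = re.compile(
    r"^\d+:\s+" + PUNCT + r"*\w+\s+\d+\s+\d+:\d+:\d+"
    r"\.*\d*"      # optional milliseconds, e.g. ".737"
    r"\s*\w*"      # optional timezone, e.g. " UTC"
    r":\s+%"
)

LINES = [
    "137: *Jan 18 11:48:01.737: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/21, changed state to up",
    "137: *Jan 18 11:48:01: %LINK-3-UPDOWN: ...",          # hypothetical: no milliseconds
    "137: *Jan 18 11:48:01.737 UTC: %LINK-3-UPDOWN: ...",  # hypothetical: with timezone
]

for line in LINES:
    print(bool(CISCO_PREMATCH.match(line)), line[:40])
```

All three lines match here, which is the behaviour the decoder was written to get.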

Brian Candler

Jan 21, 2019, 8:06:19 AM
to Wazuh mailing list
After digging through the code, the reason for the first issue is that the + and * repeat operators only apply to backslash classes, not single characters. Hence OS_Regex does not support .* to match zero or more dots.

\p* does match zero or more dots, but doesn't work in the combination that I'm using.  Specifically

\p*\d*\s*\w*:

does not match just a colon, although it should.  I believe that \p* greedily captures the colon, and then it won't backtrack across the sequence of repeated character classes.  I'd argue this is a bug in OS_Regex; but I'm not sure it's worth tinkering with this code, when there are more robust alternatives like PCRE available.
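
A toy model of that suspected behaviour, in Python. This is a sketch of "greedy, never retry" matching in general, not a port of OS_Regex: the character classes are simplified assumptions, and `match_without_backtracking` is a hypothetical helper.

```python
import re
import string

# Assumption: simplified character classes, not the real regexmap rows.
CLASSES = {
    "p": lambda c: c in string.punctuation,
    "d": lambda c: c in string.digits,
    "s": lambda c: c == " ",
    "w": lambda c: c in string.ascii_letters + string.digits + "-@_",
}


def match_without_backtracking(tokens, text):
    """Match tokens at the start of text, consuming greedily and never retrying.

    Each token is (kind, value, quantifier): kind is "class" or "literal",
    quantifier is "", "+" or "*".
    """
    pos = 0
    for kind, value, quantifier in tokens:
        if kind == "class":
            accepts = CLASSES[value]
        else:
            accepts = lambda c, literal=value: c == literal
        count = 0
        while pos < len(text) and accepts(text[pos]):
            pos += 1
            count += 1
            if quantifier == "":
                break                  # unquantified tokens take one character
        if count == 0 and quantifier != "*":
            return False               # a required token matched nothing
    return True


# \p*\d*\s*\w*:
PATTERN = [("class", "p", "*"), ("class", "d", "*"),
           ("class", "s", "*"), ("class", "w", "*"), ("literal", ":", "")]

print(match_without_backtracking(PATTERN, ":"))      # \p* swallows the colon
print(match_without_backtracking(PATTERN, ".737:"))  # the digits stop \p* in time
punct = "[" + re.escape(string.punctuation) + "]"
print(bool(re.match(punct + r"*\d* *\w*:", ":")))    # a backtracking engine
```

The first call fails because nothing is left for the final literal colon once \p* has taken it; the other two succeed. A backtracking engine gives the colon back, which is why the same pattern works under PCRE-style semantics.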

Manuel Jiménez

Jan 21, 2019, 9:39:54 AM
to Brian Candler, Wazuh mailing list

Hello Brian,

Apologies for the inconvenience. We will review the current documentation and add this information to clarify what you found.
I'd like to thank you for digging into this. Please feel free to contribute by opening a pull request against the Wazuh documentation repository, or any of our repositories, whenever you like.

Best regards,
Manuel


--
You received this message because you are subscribed to the Google Groups "Wazuh mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wazuh+un...@googlegroups.com.
To post to this group, send email to wa...@googlegroups.com.
Visit this group at https://groups.google.com/group/wazuh.
To view this discussion on the web visit https://groups.google.com/d/msgid/wazuh/5332fff8-9a15-4a4f-8b8d-cb12b62a0229%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Brian Candler

Jan 21, 2019, 10:50:04 AM
to Wazuh mailing list
Documentation PR: https://github.com/wazuh/wazuh-documentation/pull/822

On the code side, I'd still like to know what's the purpose of regexmap[][] values which are neither zero nor one.  Maybe this is a relic from OSSEC or OS_Regex.

Victor Fernandez

Jan 21, 2019, 5:02:01 PM
to Brian Candler, Wazuh mailing list
Hi Brian,

Yes, we inherit OSSEC regexes completely. We have fixed a couple of issues we found in them:
  1. A multi-regex decoder where one node matches the input completely makes the next regex produce an invalid read (getting data beyond the string limit).
  2. A regex ending with a group, like ^Test: (\S+) sent a file to (\S+), trimmed the last character of the last group. For instance, for this input string: "Test: Bob sent a file to Alice", the second group would be "Alic".
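
For reference, this is the result a conventional engine gives for that pattern and input. Python's re module is used purely as a stand-in; it says nothing about how the fix was implemented in Wazuh.

```python
import re

pattern = re.compile(r"^Test: (\S+) sent a file to (\S+)")
match = pattern.match("Test: Bob sent a file to Alice")
print(match.groups())
```

The second group keeps its final character, giving "Alice" rather than "Alic".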
However, we have indeed not improved the functionality so far. On the one hand, we don't want to break the current behavior: existing regexes have a defined behavior, even if we think it is not formally correct. On the other hand, I think moving to a standard regex library (like PCRE or POSIX) makes more sense than extending the current OSSEC regexes. In this context, we may include them as an optional syntax. For example:
<decoder name="test">
  <regex type="pcre">^[[:punct:]]*:</regex>
</decoder>
In any case, any external library we import must be compatible with every platform where Wazuh works.

Regarding your question about the regex map, I'm not an expert on that implementation, but let me explain what I think:

A backslash in a regex (\) introduces a token code (os_regex_compile.c):
/* Give the new values for each regex */
switch (*pt) {
    case 'd':
        *pt = 1;
        break;
    case 'w':
        *pt = 2;
        break;
    case 's':
        *pt = 3;
        break;
    case 'p':
        *pt = 4;
        break;
    case '(':
        *pt = 5;
        break;
    case ')':
        *pt = 6;
        break;
    case '\\':
        *pt = 7;
        break;
    case 'D':
        *pt = 8;
        break;
    case 'W':
        *pt = 9;
        break;
    case 'S':
        *pt = 10;
        break;
    case '.':
        *pt = 11;
        break;
    case 't':
        *pt = 12;
        break;
    case '$':
        *pt = 13;
        break;
    case '|':
        *pt = 14;
        break;
    case '<':
        *pt = 15;
        break;
    default:
        reg->error = OS_REGEX_BADREGEX;
        goto compile_error;
}

Per this macro:
#define Regex(x,y)   (regexmap[x][y] == TRUECHAR)
Regex(x,y) relates the token code (x) to the input character (y), and it returns true if the selected cell is exactly 1.

For instance, this is the mapping row for \w:
0, 0, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15,

16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 1, 46, 47,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 58, 59, 60, 61, 62, 63,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 91, 92, 93, 94, 1,
96, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 123, 124, 125, 126, 127,

128, 129, 130, 131, 132, 133, 134, 135,
136, 137, 138, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148, 149, 150, 151,
152, 153, 154, 155, 156, 157, 158, 159,
160, 161, 162, 163, 164, 165, 166, 167,
168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183,
184, 185, 186, 187, 188, 189, 190, 191,
192, 193, 194, 195, 196, 197, 198, 199,
200, 201, 202, 203, 204, 205, 206, 207,
208, 209, 210, 211, 212, 213, 214, 215,
216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231,
232, 233, 234, 235, 236, 237, 238, 239,
240, 241, 242, 243, 244, 245, 246, 247,
248, 249, 250, 251, 252, 253, 254, 255,
The cells equal to 1 (the matching characters) are '-', 0-9, '@', A-Z, '_' and a-z.
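
The list of matching characters can be read off mechanically. The sketch below copies the first 128 cells of the row as posted (the remaining cells are all non-1) and collects the offsets whose cell equals 1; `W_ROW` and `matching` are names chosen for this illustration.

```python
TRUECHAR = 1

# First 128 cells of the row for the w class, copied from the listing above.
W_ROW = [
    0, 0, 2, 3, 4, 5, 6, 7,
    8, 9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23,
    24, 25, 26, 27, 28, 29, 30, 31,
    32, 33, 34, 35, 36, 37, 38, 39,
    40, 41, 42, 43, 44, 1, 46, 47,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 58, 59, 60, 61, 62, 63,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 91, 92, 93, 94, 1,
    96, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 123, 124, 125, 126, 127,
]

# A character is in the class only when its cell is exactly 1.
matching = "".join(chr(i) for i, cell in enumerate(W_ROW) if cell == TRUECHAR)
print(matching)
```

This yields the hyphen, the digits, '@', the upper-case letters, the underscore and the lower-case letters, 65 characters in total.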

In this scenario, having { 0, 0, 0, 0, 0, 0, 1, 0, 0, } is equivalent to having { 0, 0, 2, 3, 4, 5, 1, 7, 8 }, but the latter is more readable.

There is a special escaping case: \<, which matches only the character '<'. I think that the reason for this is in the XML parser library (also inherited from OSSEC).

The following text:
<regex>p < q</regex>
is illegal in XML because < opens a tag. We need to escape it:
<regex>p \< q</regex>
However, the XML parser would include the escaping character into the regex. Hence, the regex accepts \< for compatibility with the XML library.

Hope this helps you.
Best regards,

Victor Manuel Fernandez-Castro 
Core Engineering | vic...@wazuh.com


Brian Candler

Jan 22, 2019, 4:32:07 AM
to Wazuh mailing list
On Monday, 21 January 2019 22:02:01 UTC, Victor Fernandez wrote:
> On the other hand, I think moving to a standard regex library (like PCRE or POSIX) makes more sense than extending the current OSSEC regexes. In this context, we may include them as an optional syntax. For example:
> <decoder name="test">
>   <regex type="pcre">^[[:punct:]]*:</regex>
> </decoder>

I think that makes sense, although it would be nice to set a default at the top of the file so you don't have to remember to repeat type="pcre" everywhere.

Or: given that rules already have <match> and <regex> for two different types of pattern, maybe it makes sense to have a third one like <pcre>

But then again, the existing rules could be mechanically converted to PCRE form, as a one-time conversion (for wazuh 4.0.0 perhaps?)
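
A rough sketch of what such a mechanical conversion might look like, in Python. It covers only the tokens whose behaviour is described in this thread; `TOKEN_MAP` and `os_regex_to_pcre` are hypothetical names, the sample input string is made up, and anything not listed (for example \p, \D, \W, or the handling of repeat operators after literals) is passed through untouched and would need manual review.

```python
import re

# Only tokens whose behaviour is described in this thread.
TOKEN_MAP = {
    "d": "[0-9]",
    "w": "[A-Za-z0-9@_-]",   # members of the w row: - 0-9 @ A-Z _ a-z
    "s": "[ ]",              # matches only a space (32)
    "S": "[^ ]",             # matches anything except a space
    ".": ".",                # backslash-dot matches any character
    "<": "<",                # backslash-< matches a literal '<'
}


def os_regex_to_pcre(pattern):
    """Translate an OS_Regex pattern into conventional regex syntax (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            out.append(TOKEN_MAP.get(nxt, "\\" + nxt))  # unknown tokens pass through
            i += 2
        elif ch == ".":
            out.append(r"\.")    # a bare dot is a literal dot in OS_Regex
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


converted = os_regex_to_pcre(r"^\d+\<;>\S+\<;>(\d+)\<;")
print(converted)
print(re.match(converted, "12<;>host<;>34<;").group(1))
```

One caveat with this sketch: in Python an unescaped dot does not match a newline by default, so the translation of \. is only equivalent for single-line input.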

 
> In any case, any external library we import must be compatible with every platform where Wazuh works.


Indeed.  As far as I can see, the Wazuh manager is only supported on Linux; but in any case PCRE is portable to Windows, as some pre-compiled Windows binaries are available on www.pcre.org.

 

> In this scenario, having { 0, 0, 0, 0, 0, 0, 1, 0, 0, } is equivalent to having { 0, 0, 2, 3, 4, 5, 1, 7, 8 }, but the latter is more readable.


Why is the latter more readable?  It obfuscates the functionality.  You'll note that offset [N] doesn't always contain value N.  For example, here's part of the regexmap for \d

const unsigned char regexmap[][256] = {
    {
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 59, 60, 61, 62, 63,
        64, 97, 98, 99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119,
        120, 121, 122, 91, 92, 93, 94, 95,
        96, 97, 98, 99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119,
        120, 121, 122, 123, 124, 125, 126, 127,


The values I highlighted are out of sequence: the 0 at offset 58, and the cells at offsets 65-90, which hold 97-122 instead of their own index.  It looks as if the table is intended to map capital letters to lower-case letters.  Except it's not: the values read from regexmap are only ever tested for being equal to 1 or not!

Therefore it could be simplified by taking all values which are not 1, and changing them to 0.  It shouldn't make any difference to the behaviour, but it saves head-scratching when trying to understand the code.
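
A small Python check of that claim, using the 128 cells of the \d row quoted above. `simplify` and `members` are hypothetical helpers; the point is only that a lookup which compares against 1 cannot tell the two tables apart.

```python
TRUECHAR = 1

# First 128 cells of the d row, written compactly but equal to the listing above.
D_ROW = (
    [0] * 48                          # offsets 0-47
    + [1] * 10                        # offsets 48-57: '0'-'9'
    + [0, 59, 60, 61, 62, 63, 64]     # offsets 58-64
    + list(range(97, 123))            # offsets 65-90 hold the codes of 'a'-'z'
    + list(range(91, 128))            # offsets 91-127 hold their own index
)


def simplify(row):
    """Keep cells equal to TRUECHAR and zero everything else."""
    return [TRUECHAR if cell == TRUECHAR else 0 for cell in row]


def members(row):
    """Characters that a Regex()-style test (cell == 1) accepts."""
    return {chr(i) for i, cell in enumerate(row) if cell == TRUECHAR}


SIMPLE = simplify(D_ROW)
print(sorted(members(D_ROW)))
print(members(D_ROW) == members(SIMPLE))
```

Both tables accept exactly the ten digits, so under the equality test the simplified row is a drop-in replacement.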

 
> There is a special escaping case: \<, which matches only the character '<'. I think that the reason for this is in the XML parser library (also inherited from OSSEC).
>
> The following text:
> <regex>p < q</regex>
> is illegal in XML because < opens a tag. We need to escape it:
> <regex>p \< q</regex>
> However, the XML parser would include the escaping character into the regex. Hence, the regex accepts \< for compatibility with the XML library.


That explanation makes sense.  Unfortunately the XML spec does not include backslash-escaping: the only way to include < is as an entity (e.g. &lt;) or in a CDATA section.

So if the parser accepts \< then it's not an XML parser - it's an OSSECML parser :-)
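
This is easy to confirm with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree; `parse_text` is a hypothetical helper that returns None when the fragment is not well-formed.

```python
import xml.etree.ElementTree as ET


def parse_text(fragment):
    """Return the element text, or None if the fragment is not well-formed XML."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError:
        return None


print(parse_text(r"<regex>p \< q</regex>"))             # backslash "escape"
print(parse_text("<regex>p &lt; q</regex>"))            # entity reference
print(parse_text("<regex><![CDATA[p < q]]></regex>"))   # CDATA section
```

The backslash form is rejected, while the entity and CDATA forms both produce the text "p < q".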

I could only find two examples in wazuh-ruleset where this feature is used:

./decoders/0340-trend-osce_decoders.xml:  <prematch>^20\d\d\d\d\d\d\<;></prematch>
./decoders/0340-trend-osce_decoders.xml:  <regex offset="after_prematch">^\d+\<;>\S+\<;>(\d+)\<;</regex>

And this indeed is provably not valid XML.

$ xmllint ./decoders/0340-trend-osce_decoders.xml
./decoders/0340-trend-osce_decoders.xml:18: parser error : StartTag: invalid element name
  <prematch>^20\d\d\d\d\d\d\<;></prematch>
                             ^
./decoders/0340-trend-osce_decoders.xml:19: parser error : StartTag: invalid element name
  <regex offset="after_prematch">^\d+\<;>\S+\<;>(\d+)\<;</regex>
                                       ^
./decoders/0340-trend-osce_decoders.xml:19: parser error : StartTag: invalid element name
  <regex offset="after_prematch">^\d+\<;>\S+\<;>(\d+)\<;</regex>
                                              ^
./decoders/0340-trend-osce_decoders.xml:19: parser error : StartTag: invalid element name
  <regex offset="after_prematch">^\d+\<;>\S+\<;>(\d+)\<;</regex>
                                                       ^

Having said that, there's a more fundamental violation anyway: an XML document has a single root element, which these don't.  I note also that double-dash (--) is not permitted inside XML comments, and this happens in a few places too.

It is claimed that ossec.conf is in XML format with a root <ossec_config> element, which sounds plausible.  But in my installation I find that /var/ossec/etc/ossec.conf has two root elements, both <ossec_config>.
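
A short Python illustration of the root-element problem, using the standard xml.etree.ElementTree parser. The configuration text is a hypothetical, heavily shortened stand-in for a two-block file, and the wrapper element is just one possible workaround for tooling that needs to read such files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a configuration with two blocks.
CONF = """
<ossec_config><global/></ossec_config>
<ossec_config><syscheck/></ossec_config>
"""


def is_well_formed(text):
    """True if a standard XML parser accepts the text as a single document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(CONF))

# Workaround sketch: wrap the content in a synthetic root before parsing.
wrapped = ET.fromstring("<wrapper>" + CONF + "</wrapper>")
print([child.tag for child in wrapped])
```

The unwrapped text is rejected because of the second top-level element; once wrapped, both blocks parse as children of the synthetic root.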

Maybe the documentation should say "XML-like" or "XML-based" rather than "XML" ?

Regards,

Brian.

Brian Candler

Jan 22, 2019, 7:46:33 AM
to Wazuh mailing list
BTW I now found https://github.com/wazuh/wazuh/issues/570, and I see there's already a plan to replace the regex engine.

Brian Candler

Jan 22, 2019, 9:37:11 AM
to Wazuh mailing list
Aside: another potential benefit of going with PCRE is the availability of named capture groups.

Before:

  <regex offset="after_prematch">^ \S+ for (\S+) from (\S+) port </regex>
  <order>user, srcip</order>

After:

  <regex offset="after_prematch">^ \S+ for (?<user>\S+) from (?<srcip>\S+) port </regex>

This is particularly helpful when capture groups are nested inside other groups, or in alternative branches (xxx|yyy)
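
A quick illustration with Python's re module, which spells named groups (?P<name>...) rather than (?<name>...); PCRE accepts the (?P<name>...) spelling as well. The input string is a hypothetical remainder of a log line after the prematch.

```python
import re

PATTERN = re.compile(r"^ \S+ for (?P<user>\S+) from (?P<srcip>\S+) port ")

# Hypothetical remainder of a log line after the prematch.
tail = " password for root from 192.0.2.10 port 22 ssh2"
fields = PATTERN.match(tail).groupdict()
print(fields)
```

The field names travel with the pattern, so there is no separate <order> list to keep in step with the group positions.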

Victor Fernandez

Jan 22, 2019, 1:53:15 PM
to Brian Candler, Wazuh mailing list
Hi Brian,

Thank you for your comments. Let me explain what is in our roadmap related to these issues:

First, you nailed it: the ossec.conf syntax is not XML but an XML-like language. We are struggling with some problems: the parser accepts configuration files that a standard XML validator would reject, starting with the lack of a single root element. What we want to do is redesign the configuration and move it to YAML. We will use a standard YAML parser (libyaml) for that.

Regarding the regex, I forgot to mention that it is indeed greedy, while a formally correct regex engine uses backtracking. I like your idea of using <pcre>. We received a proposal to extend regex to PCRE (#205), but it was only an idea; we hope to implement it ourselves soon. I've added your comment there.

Even more, I'd love to replace the field filters in rules (<srcip> / <dstuser> / <field name="...">) with a single query, like jq.

These are only ideas and issues that are part of our roadmap. I am sure that Wazuh 4.0 will include some of them.

Best regards.

Victor Manuel Fernandez-Castro 
Core Engineering | vic...@wazuh.com


Brian Candler

Jan 22, 2019, 5:29:11 PM
to Wazuh mailing list
On Tuesday, 22 January 2019 18:53:15 UTC, Victor Fernandez wrote:
> What we want to do is redesign the configuration and move it to YAML. We will use a standard YAML parser (libyaml) for that.

I think that's an excellent idea.  This will also fix a problem I have come across: the ambiguity over repeated XML elements.  In some cases in wazuh, these are treated as alternatives; in other cases they are just concatenated together.

So if I want to write a rule which triggers on two different sids, I try

    <if_sid>4312</if_sid>
    <if_sid>4313</if_sid>
 
and I find it's parsed, but is treated as <if_sid>43124313</if_sid>

With YAML it's explicit whether something is a list or not

    if_sid: 4312

    if_sid: [4312, 4313]

(and if lists are not valid in this context, you can reject them)
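
A sketch of how a loader could act on that distinction once the YAML has been parsed into native values. It works on plain Python objects, so no particular YAML library is assumed, and `normalize_if_sid` is a hypothetical helper.

```python
def normalize_if_sid(value, allow_list=True):
    """Turn an already-parsed YAML value into a list of rule IDs."""
    def is_rule_id(item):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(item, int) and not isinstance(item, bool)

    if is_rule_id(value):
        return [value]                # scalar form: if_sid: 4312
    if isinstance(value, list):
        if not allow_list:
            raise ValueError("a list is not valid for if_sid in this context")
        if not all(is_rule_id(item) for item in value):
            raise TypeError("if_sid list items must be integers")
        return list(value)            # list form: if_sid: [4312, 4313]
    raise TypeError("if_sid must be an integer or a list of integers")


print(normalize_if_sid(4312))
print(normalize_if_sid([4312, 4313]))
```

Because the parser has already decided whether the value is a scalar or a list, there is no way to end up with an accidental "43124313".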


> Even more, I'd love to replace the field filters in rules (<srcip> / <dstuser> / <field name="...">) with a single query, like jq.

Yes, that would be interesting too.

Choosing a suitable query language needs to be done carefully.  Since Wazuh is in C, I wonder if lua might be a good partner?

    sid == 4312 or sid == 4313

I don't have any experience embedding lua, but I know it works well in some other high-performance applications (e.g. powerdns).

Or maybe go the whole hog and embed python (I saw a git ticket about python 3.7 and ffi).  It would be great to be able to express logic in a "real" programming language when required.

Victor Fernandez

Feb 2, 2019, 9:46:24 PM
to Brian Candler, Wazuh mailing list
Hi Brian,

To be honest, some tags like <if_sid> append their content deliberately, to make it easier to define long options. This "feature" is not due to the XML language; other tags are designed so that each occurrence replaces the previous one.

Once more, this is an issue that the documentation should clarify.
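
The two behaviours can be modelled in a few lines of Python. This is only an illustration of the semantics described above, with a hypothetical `merge_repeated` helper; it does not reflect how the Wazuh parser is actually written.

```python
def merge_repeated(values, policy):
    """Combine the contents of a repeated tag under a given policy."""
    if policy == "append":
        return "".join(values)    # contents are concatenated in order
    if policy == "replace":
        return values[-1]         # the last occurrence wins
    raise ValueError("unknown policy: " + policy)


occurrences = ["4312", "4313"]
print(merge_repeated(occurrences, "append"))
print(merge_repeated(occurrences, "replace"))
```

Under "append" the two occurrences collapse into the single string 43124313, while under "replace" only 4313 survives, so neither policy expresses "either of these two IDs".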

On the other hand, I barely know Lua, but I've seen traces of an integration with Lua, probably used in old versions of OSSEC. I'm thinking of a new language, similar to a jq expression, to model rules. Hopefully, we'll have something like that in the future.

Best regards,

Victor Manuel Fernandez-Castro 
Core Engineering | vic...@wazuh.com
