Searching for the right regex algorithm for a match

30 views

Skip to first unread message

Dy Coder

unread,

Feb 18, 2015, 6:18:01 AM2/18/15

to re...@googlegroups.com

I'm trying to fetch a pattern matcher for some of my needs.

I will elaborate as much as possible and hopefully someone could help me out figuring what algorithm should I use.

Definitions:

Patterns:

Atom:

The basic unit of a pattern, comprised of a byte string. any of the bytes might be a "don't care" byte.
An atom might have fields - contiguous non-overlapping sub-string
An atom would match the exact byte string, allowing "don't care" bytes to match any bytes
Example in regular-expression-like notation: "\x33\x34(P<field1>....)\x35\x36"

The atom "AA..AA" for example would match AABBAA and AACCAA and all similar strings but not AAMMMAA

Sub-pattern:

Set of one or more atoms (not necessarily of the same length) and a quantification (exactly 1, 0 or 1, 0 or more, 1 or more).
A sub-pattern, according to the quantification, matches several consecutive instances of any of the atoms (not necessarily the same atom).
example in regular-expression-line notation: "(\x00|\x01\x02|\x04....)+"

The sub-pattern (AA|B.B|CC)+ would match AAAACC ad BFBBGBAACC and all similar strings but not AAABBB

Pattern:

A sequence of sub-patterns.
A pattern would match a contiguous sequence of sub-pattern matches.
Example in regular-expression-like notation: "(\x00\x01|\x02\x03)+(\x04\x05|\x06..)*"
The language is regular and is a subset of the regular expression language

The pattern (AA|BB|C.)+(X|Y|Z)*(W)? would match AABBCCXXYYZZ and AAW but not XYZW

Input:

A set of patterns as as described above.
A binary stream of arbitrary size, fed into the algorithm in chunks

Output:

A set of matches of the patterns in the stream, matches must include all instances of all patterns, including overlapping instances of the same pattern or different patterns, and wholly contained instances of one pattern inside another. Instances of one pattern inside a bigger instance of the same pattern should not be returned.
For each match a list of the sub-patterns that were matched should be returned, with the index of the atom that matched that instance, along with indices indicating the location of that atom in the matched string
If a field was specified for an atom, it would be extracted
The matches should be greedy, in the sense that the largest contiguous string that matches each single pattern must be matched

Examples (some of them rare and unlikely edge cases, atom list follows the match):

match notation is: matchedstring (patternindex subpatternindex-atomindex[start..end, fieldname=fieldvalue] ..... )

pattern (AA..AA)+, input AABBAACCAAAADDAA would result in two matches:
AABBAA (0-0[0..5] )
AACCAAAADDAA ( 0-0[5..9] 0-0[10..15] )

pattern (...), input ABCDE would result in 3 matches:
ABC ( 0 0-0[0..2] )
BCD ( 0 0-0[1..3] )
CDE ( 0 0-0[2..4] )

pattern (..)+, input ABCDE would result in 2 matches:
ABCD ( 0 0-0[0..3] )
BCDE ( 0 0-0]1..4] )

pattern (ABBA) and patern (BB), input ABBABBA would result in 4 matches:
ABBA ( 0 0-0[0..3] )
ABBA ( 0 0-0[3..6] )
BB ( 1 0-0[1..2] )
BB ( 1 0-0[4..5] )

pattern (AA(P<f1>..)|AB(P<f1>..))+(CC), input XXXXAAAAAABBABXXCCXXXX would result in a single match:
AAAAAABBABAACC ( 0-0[4..7, f1=AA] 0-0[8..11, f1=BB] 0-1[12..15, f1=XX] 1-0[16..17] )

Requirements:

Linear time - O(stream length)
Lazy matching - GetNextMatch - no need for large in memory structures to hold the matches and no need to wait for the entire match before an action could be performed.
Process the input in chunks - including matching patterns that could span two or more chunks. Important for very large streams.

Example of usage (assuming a context object to save state between calls with consecutive chunks, not a requirement):

ctx = Context()

foreach chunk in chunks:

foreach match in find_matches(ctx, chunk):

process_match(match)

Functionally equivalent matching using python re library would be to group each pattern with a lookahead assertion (?=...) so that the entire pattern would consume none of the input string, therefore searching for the pattern in the input string, in effect, is like running the same regex multiple time, each time starting one byte further into the string. This results in a very high complexity.