I'm trying to fetch a pattern matcher for some of my needs.
I will elaborate as much as possible and hopefully someone could help me out figuring what algorithm should I use.
Definitions:
Patterns:
Atom:
The basic unit of a pattern, comprised of a byte string. any of the bytes
might be a "don't care" byte.
An atom might have fields - contiguous non-overlapping sub-string
An atom would match the exact byte string, allowing "don't care"
bytes to match any bytes
Example in regular-expression-like notation:
"\x33\x34(P<field1>....)\x35\x36"
The atom "AA..AA" for example would match AABBAA and AACCAA and all
similar strings but not AAMMMAA
Sub-pattern:
Set of one or more atoms (not necessarily of the same length) and a
quantification (exactly 1, 0 or 1, 0 or more, 1 or more).
A sub-pattern, according to the quantification, matches several consecutive
instances of any of the atoms (not necessarily the same atom).
example in regular-expression-line notation:
"(\x00|\x01\x02|\x04....)+"
The sub-pattern (AA|B.B|CC)+ would match AAAACC ad BFBBGBAACC and all similar
strings but not AAABBB
Pattern:
A sequence of sub-patterns.
A pattern would match a contiguous sequence of sub-pattern matches.
Example in regular-expression-like notation:
"(\x00\x01|\x02\x03)+(\x04\x05|\x06..)*"
The language is regular and is a subset of the regular expression language
The pattern (AA|BB|C.)+(X|Y|Z)*(W)? would match AABBCCXXYYZZ and AAW but not
XYZW
Input:
- A set of patterns as as
described above.
- A binary stream of
arbitrary size, fed into the algorithm in chunks
Output:
- A set of matches of the
patterns in the stream, matches must include all instances of all
patterns, including overlapping instances of the same pattern or different
patterns, and wholly contained instances of one pattern inside another.
Instances of one pattern inside a bigger instance of the same pattern
should not be returned.
For each match a list of the sub-patterns that were matched should be
returned, with the index of the atom that matched that instance, along
with indices indicating the location of that atom in the matched string
If a field was specified for an atom, it would be extracted
The matches should be greedy, in the sense that the largest contiguous
string that matches each single pattern must be matched
Examples (some of them rare and unlikely edge cases, atom list follows the
match):
match notation is: matchedstring (patternindex
subpatternindex-atomindex[start..end, fieldname=fieldvalue] ..... )
pattern (AA..AA)+, input AABBAACCAAAADDAA would result in two matches:
AABBAA (0-0[0..5] )
AACCAAAADDAA ( 0-0[5..9] 0-0[10..15] )
pattern (...), input ABCDE would result in 3 matches:
ABC ( 0 0-0[0..2] )
BCD ( 0 0-0[1..3] )
CDE ( 0 0-0[2..4] )
pattern (..)+, input ABCDE would result in 2 matches:
ABCD ( 0 0-0[0..3] )
BCDE ( 0 0-0]1..4] )
pattern (ABBA) and patern (BB), input ABBABBA would result in 4 matches:
ABBA ( 0 0-0[0..3] )
ABBA ( 0 0-0[3..6] )
BB ( 1 0-0[1..2] )
BB ( 1 0-0[4..5] )
pattern (AA(P<f1>..)|AB(P<f1>..))+(CC), input
XXXXAAAAAABBABXXCCXXXX would result in a single match:
AAAAAABBABAACC ( 0-0[4..7, f1=AA] 0-0[8..11, f1=BB] 0-1[12..15, f1=XX]
1-0[16..17] )
Requirements:
- Linear time - O(stream
length)
- Lazy matching -
GetNextMatch - no need for large in memory structures to hold the matches
and no need to wait for the entire match before an action could be
performed.
- Process the input in
chunks - including matching patterns that could span two or more chunks.
Important for very large streams.
Example of usage (assuming a context object to save state between calls with
consecutive chunks, not a requirement):
ctx = Context()
foreach chunk in chunks:
foreach match in find_matches(ctx, chunk):
process_match(match)
Functionally equivalent matching using python re library would be to group
each pattern with a lookahead assertion (?=...) so that the entire pattern
would consume none of the input string, therefore searching for the pattern in
the input string, in effect, is like running the same regex multiple time, each
time starting one byte further into the string. This results in a very high
complexity.
Thanks!