On 2015-12-11, Doc Trins O'Grace <
doctrin...@gmail.com> wrote:
> Recently, I asked about delimiter detecting in awk. I got to thinking
> about an awk procedure that could walk through a file, developing the
> regular expression that would match every line. If any tool could
> build regular expressions for itself, certainly awk would be that
> tool.
>
> So, have any of you all heard of any efforts along these lines?
What you're describing is simply the branch union of the literal
match regexes for the lines of the file.
If the file consists of:
line1
line2
line3
...
lineN
then this is the regex
line1|line2|line3|...|lineN
(All the characters from the lines have to be literal. If we are doing
this textually using the regex surface syntax, we have to escape
anything that is a regex operator, of course.)
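As a hypothetical sketch, awk itself can build that escaped union: the
script below (filenames and the ERE metacharacter set are my assumptions,
not anything from the thread) reads lines and emits one anchored
alternation of literal matches.

```shell
# Sketch only: escape ERE metacharacters in each line, then join the
# lines with "|" into one anchored alternation.
printf 'a.b\nc+d\n' | awk '
{
    line = $0
    # prefix every ERE operator character with a backslash
    gsub(/[][\\.^$*+?(){}|]/, "\\\\&", line)
    out = (NR == 1) ? line : out "|" line
}
END { print "^(" out ")$" }
'
# prints: ^(a\.b|c\+d)$
```

The `^(...)$` anchoring makes the alternation match whole lines only,
which is what "a regex that matches every line" asks for.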
When this giant regex is converted to an NFA graph and reduced to a DFA,
all the commonality among the lines is worked out into a compact state
transition table. For instance, if all the lines start with a common
prefix and end in a common suffix, the NFA -> DFA conversion will ferret this
out, and there will be a common sequence of state transitions for those
segments of the pattern matching.
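To see the giant alternation acting as a set-membership test, here is a
small sketch (assuming the lines contain no regex operators, so no
escaping is needed; filenames are invented for illustration):

```shell
# Sketch: build the anchored union from a file of lines, then use it
# to filter another stream down to members of the set.
printf 'line1\nline2\nline3\n' > set.txt
regex="^($(paste -sd'|' set.txt))$"     # join the lines with "|"
printf 'line2\nline9\n' | grep -E "$regex"
# prints: line2
```

A DFA-based matcher runs this in time proportional to the input length,
regardless of how many branches the union has.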
When all the entries are literal, we don't have to use a regex approach
at all; we can instead use a related data structure known as a trie.
A trie compresses a set of strings by taking advantage of their common
prefixes. To test whether a string is present in a trie, we scan
through the string only once, and for each character, we traverse a node
of the trie.
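A trie is easy to sketch in awk itself, using an associative array as
the edge table; everything below (file names, array names) is my own
invention for illustration. Node 0 is the root, and child[node SUBSEP c]
holds the node reached from `node` on character `c`.

```shell
# Sketch: build a trie from the first file, then test each line of the
# second file for membership with one edge traversal per character.
printf 'line1\nline2\n' > words.txt      # the set of strings
printf 'line2\nline9\n' > queries.txt    # strings to look up
awk '
NR == FNR {                  # first file: insert the line into the trie
    node = 0
    for (i = 1; i <= length($0); i++) {
        key = node SUBSEP substr($0, i, 1)
        if (!(key in child)) child[key] = ++nextid
        node = child[key]
    }
    final[node] = 1          # mark this node as end-of-string
    next
}
{                            # second file: walk one edge per character
    node = 0
    for (i = 1; i <= length($0); i++) {
        key = node SUBSEP substr($0, i, 1)
        if (!(key in child)) { print $0 ": absent"; next }
        node = child[key]
    }
    print $0 ": " ((node in final) ? "present" : "absent")
}
' words.txt queries.txt
# prints: line2: present
#         line9: absent
```

Note how "line1" and "line2" share the nodes for their common prefix
"line" -- the same compression the NFA -> DFA construction discovers.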