On 2015-12-11, Doc Trins O'Grace <
doctrin...@gmail.com> wrote:
> Recently, I asked about delimiter detecting in awk. I got to thinking
> about an awk procedure that could walk through a file, developing the
> regular expression that would match every line. If any tool could
> build regular expressions for itself, certainly awk would be that
> tool.
>
> So, have any of you all heard of any efforts along these lines?
What you're describing is simply the branch union of the literal
match regexes for the lines of the file.
If the file consists of:
line1
line2
line3
...
lineN
then this is the regex
line1|line2|line3|...|lineN
(All the characters from the lines have to be literal. If we are doing
this textually using the regex surface syntax, we have to escape
anything that is a regex operator, of course.)
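As a hypothetical sketch, awk itself can build that escaped union: the
script below (filenames and the ERE metacharacter set are my assumptions,
not anything from the thread) reads lines and emits one anchored
alternation of literal matches.

```shell
# Sketch only: escape ERE metacharacters in each line, then join the
# lines with "|" into one anchored alternation.
printf 'a.b\nc+d\n' | awk '
{
    line = $0
    # prefix every ERE operator character with a backslash
    gsub(/[][\\.^$*+?(){}|]/, "\\\\&", line)
    out = (NR == 1) ? line : out "|" line
}
END { print "^(" out ")$" }
'
# prints: ^(a\.b|c\+d)$
```

The `^(...)$` anchoring makes the alternation match whole lines only,
which is what "a regex that matches every line" asks for.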
When this giant regex is converted to an NFA graph and reduced to a DFA,
all the commonality among the lines is worked out into a compact state
transition table. For instance, if all the lines start with a common
prefix and end in a common suffix, the NFA -> DFA conversion will ferret this
out, and there will be a common sequence of state transitions for those
segments of the pattern matching.
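To see the giant alternation acting as a set-membership test, here is a
small sketch (assuming the lines contain no regex operators, so no
escaping is needed; filenames are invented for illustration):

```shell
# Sketch: build the anchored union from a file of lines, then use it
# to filter another stream down to members of the set.
printf 'line1\nline2\nline3\n' > set.txt
regex="^($(paste -sd'|' set.txt))$"     # join the lines with "|"
printf 'line2\nline9\n' | grep -E "$regex"
# prints: line2
```

A DFA-based matcher runs this in time proportional to the input length,
regardless of how many branches the union has.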
When all the entries are literal, we don't have to use a regex approach
at all; we can instead use a related data structure known as a trie.
A trie compresses a set of strings by taking advantage of their common
prefixes. To test whether a string is present in a trie, we scan
through the string only once, and for each character, we traverse a node
of the trie.
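A trie is easy to sketch in awk itself, using an associative array as
the edge table; everything below (file names, array names) is my own
invention for illustration. Node 0 is the root, and child[node SUBSEP c]
holds the node reached from `node` on character `c`.

```shell
# Sketch: build a trie from the first file, then test each line of the
# second file for membership with one edge traversal per character.
printf 'line1\nline2\n' > words.txt      # the set of strings
printf 'line2\nline9\n' > queries.txt    # strings to look up
awk '
NR == FNR {                  # first file: insert the line into the trie
    node = 0
    for (i = 1; i <= length($0); i++) {
        key = node SUBSEP substr($0, i, 1)
        if (!(key in child)) child[key] = ++nextid
        node = child[key]
    }
    final[node] = 1          # mark this node as end-of-string
    next
}
{                            # second file: walk one edge per character
    node = 0
    for (i = 1; i <= length($0); i++) {
        key = node SUBSEP substr($0, i, 1)
        if (!(key in child)) { print $0 ": absent"; next }
        node = child[key]
    }
    print $0 ": " ((node in final) ? "present" : "absent")
}
' words.txt queries.txt
# prints: line2: present
#         line9: absent
```

Note how "line1" and "line2" share the nodes for their common prefix
"line" -- the same compression the NFA -> DFA construction discovers.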