Q: Handling a monster pattern to match one of many words

mjc

unread,

Nov 18, 2009, 1:16:37 PM11/18/09

to

I need to search a lot of files (over 1,000) and find all the lines
that contain any of a moderately large number of words (over 500).
When a line contains any of these words, I want to make them visible.

To do this, I store the words in an array words[1 to n] and build up a
pattern string of the form
patstr = "(" word[1] "|" word[2] "|" ... "|" word[n] ")".
(If I want to match only exact words, I surround them with "\\<" and "\
\>".)

Then, for each line in the file, I do this:

if ( match(line, patstr) )
{
for ( i=1; i<=n; i++ )
{
k = index(line, word[i]);
if ( k > 0 )
{
(show that word[i] is in the line and where it is, then remove
it)
}
}
}

This seems to work well, but, since I have read that actual patterns
(/.../) are faster than strings,
I was wondering if patstr could be somehow "compiled" into a pattern
and if, so,
how much faster that would run.

I realize that I could write patstr into a file and make it part of a
gawk function, but it would be
over 5,000 characters long and that might cause problems.

So, is there any way to convert this big pattern string into a real,
live pattern, or should
I just leave it like this?

Thanks,

Marty Cohen

Ed Morton

unread,

Nov 18, 2009, 1:49:36 PM11/18/09

to

On Nov 18, 12:16 pm, mjc <mjco...@acm.org> wrote:
> I need to search a lot of files (over 1,000) and find all the lines
> that contain any of a moderately large number of words (over 500).
> When a line contains any of these words, I want to make them visible.
>
> To do this, I store the words in an array words[1 to n] and build up a
> pattern string of the form
> patstr = "(" word[1] "|" word[2] "|" ... "|" word[n] ")".
> (If I want to match only exact words, I surround them with "\\<" and "\
> \>".)
>
> Then, for each line in the file, I do this:
>
> if ( match(line, patstr) )

You coul just use

> {
> for ( i=1; i<=n; i++ )
> {
> k = index(line, word[i]);
> if ( k > 0 )
> {
> (show that word[i] is in the line and where it is, then remove
> it)
> }
> }
>
> }
>
> This seems to work well, but, since I have read that actual patterns
> (/.../) are faster than strings,
> I was wondering if patstr could be somehow "compiled" into a pattern
> and if, so,
> how much faster that would run.
>
> I realize that I could write patstr into a file and make it part of a
> gawk function, but it would be
> over 5,000 characters long and that might cause problems.
>
> So, is there any way to convert this big pattern string into a real,
> live pattern, or should
> I just leave it like this?
>
> Thanks,
>
> Marty Cohen

You should stop using the word "pattern" as it has no real meaning.
You're either refering to REs or strings. match() finds an RE in a
string. index() finds a substring in a string. So, match() could find
a "pattern" in a string that index() does not and vice-versa.

In your case if "words" means at least strings that don't contain any
characters with meaning in REs, then you're probably OK interleaving
usage of match() and index() as above, but be careful you don't end up
with a "." or something, e.g. if "27.3" is a "word". I'd just avoid
doing it but then you need to decide which function to use.

Ed Morton

unread,

Nov 18, 2009, 1:50:58 PM11/18/09

to

On Nov 18, 12:49 pm, Ed Morton <mortons...@gmail.com> wrote:
> On Nov 18, 12:16 pm, mjc <mjco...@acm.org> wrote:
>
> > I need to search a lot of files (over 1,000) and find all the lines
> > that contain any of a moderately large number of words (over 500).
> > When a line contains any of these words, I want to make them visible.
>
> > To do this, I store the words in an array words[1 to n] and build up a
> > pattern string of the form
> > patstr = "(" word[1] "|" word[2] "|" ... "|" word[n] ")".
> > (If I want to match only exact words, I surround them with "\\<" and "\
> > \>".)
>
> > Then, for each line in the file, I do this:
>
> > if ( match(line, patstr) )
>
> You coul just use

Meant to say:

You could just use "if ($0 ~ patstr)"

> doing it but then you need to decide which function to use.- Hide quoted text -
>
> - Show quoted text -

Ed Morton

unread,

Nov 18, 2009, 2:12:51 PM11/18/09

to

On Nov 18, 12:16 pm, mjc <mjco...@acm.org> wrote:

...back to your original question - best I can come up with is
creating an awk script from it on the fly and calling that via system
(), e.g. something like:

BEGIN{
build patstr
print "$0 ~ " patstr " { ... }" > "tmp"
cmd = ARGV[0] " -f tmp " FILENAME
}
FNR==1 { system(cmd); close(cmd); nextfile }
END{ system("rm -f " tmp) }

Personally I probably wouldn't bother but it might be interesting to
see if it sped things up or slowed them down any.

Regards,

Ed.

Ed Morton

unread,

Nov 18, 2009, 2:16:15 PM11/18/09

to

Sorry:

BEGIN{
...build patstr...
print patstr " { ...check individual words... }" > "tmp"

cmd = ARGV[0] " -f tmp "
}

FNR==1 { system(cmd FILENAME); close(cmd FILENAME); nextfile }
END{ system("rm -f " tmp); close("rm -f " tmp }

Ed.

Ed Morton

unread,

Nov 18, 2009, 2:23:47 PM11/18/09

to

> Ed.- Hide quoted text -

>
> - Show quoted text -

This is my 5th reply to the original post with no-one else posting in
between. Must be a record and must mean I need to get some coffee.

Rather than calling system() one per file, you could just pass the
full list of file names to the created command and call it once:

BEGIN{
...build patstr...
print patstr " { ...check individual words... }" > "tmp"
cmd = ARGV[0] " -f tmp "

for (arg=1; arg<ARGC; arg++) {
cmd = cmd ARGV[arg]
}
system(cmd)
close(cmd)
system("rm -f " tmp)
close("rm -f " tmp)
exit
}

No more posts from me on this one for a while!

Ed.

Manuel Collado

unread,

Nov 18, 2009, 6:20:50 PM11/18/09

to

Ed Morton escribi�:

Another possibility is to store the words to look for as indexes of an
array. The input line can be split into individual words, and each input
word can be checked with
if (word in array) ...

Just an idea.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Anton Treuenfels

unread,

Nov 18, 2009, 11:50:09 PM11/18/09

to

"Manuel Collado" <m.co...@invalid.domain> wrote in message
news:he1vnq$efp$1...@heraldo.rediris.es...

> Another possibility is to store the words to look for as indexes of an
> array. The input line can be split into individual words, and each input
> word can be checked with
> if (word in array) ...

Yes, that was my first thought as well. Instead of intializing the array
with the equivalent of:

word[1] = "string1"
word[2] = "string2"

do it like this:

word[ "string1" ]
word[ "string2" ]

and then the main loop looks like:

{
for ( i = 1; i <= NF; i++ ) {
if ( $i in word ) {
# whatever
}
}
}

One thing the OP's code did not show being taken into account was alphabetic
case. It may be that case is relevant and matches must therefore always be
exact. If not, something like this may be an idea:

BEGIN {

while ( (getline word < "wordlist.txt") > 0 )
wordlist[ toupper(word) ]
close( "wordlist.txt" )
}

{
for ( i = 1; i <= NF; i++ ) {
if ( toupper($i) in wordlist ) {
# whatever
}
}
}

Mmm, I see I'm assuming that no target "word" contains embedded spaces. This
wouldn't affect getting them into "wordlist" but default field splitting
ensures no "$i" could ever match them.

- Anton Treuenfels