
Faster than strstr


DB

May 7, 2001, 5:25:09 PM
My current project for SPI Dynamics made extensive use of the strstr()
function to detect malicious activity in web server log files. The
performance of that search routine was quite a bottleneck while LogAlert
was detecting exploit strings. The strstr() function seemed to be part
of the problem, so I had to find a way to speed things up.

I found strchr() to be the fastest search available on the X86. This
became the front-end to my new search routine. The result is strchrstr(),
which uses a fast strchr() to find potential matches and then does a
strncmp() on the remainder of the target string.

The new function is a direct replacement for strstr(). Since this
approach made quite a difference, I made another change, and did a
second character compare before the strncmp(). This is strchrstr2(),
which is NOT an exact replacement for strstr(), because the target
string must be at least 2 characters long.

TESTING:
The search space is a 9 MB text file, searched for a nine-character
target string in a loop run 255 times. The tests were run on RHL 6.2
and 7.0. The kernel builds were 2.2.14 and 2.2.16, with glibc versions
2.1.3 and 2.1.92 respectively.

 system   | strstr | strchrstr | strchrstr2 |
----------+--------+-----------+------------+ (timed with gettimeofday())
 5x86-160 | 195.5  |   141.4   |   121.9    | RHL 6.2
 P-166    |  97.7  |    58.9   |    52.3    | RHL 7.0
 Ath-840  |  18.5  |    16.8   |    15.5    | RHL 6.2

Links to the source code for all three test programs and the alternate
search routine are at: http://www.SPIDynamics.com/speed_search.html

/* Fast, compatible strstr() replacement. */
char *strchrstr(char *haystack, char *needle)
{
    size_t sl;

    if (needle && haystack)
    {
        sl = strlen(needle);
        if (!sl) return haystack;   /* empty needle matches at the start */
        sl--;                       /* first char is handled by strchr() */
        while (1)
        {
            /* fast scan for the needle's first character */
            haystack = strchr(haystack, *needle);
            if (haystack)
            {
                if (!sl) return haystack;   /* one-character needle */
                if (!strncmp(haystack + 1, needle + 1, sl))
                    return haystack;        /* the rest matches too */
                haystack++;
            }
            else
                break;
        }
    }
    return (char *)NULL;
}
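The strchrstr2() variant described above is not shown in the post; a minimal sketch reconstructed from the description (the exact structure is my assumption -- only the name and the two-character pre-check come from the text):

```c
#include <stddef.h>
#include <string.h>

/* Like strstr(), but the needle must be at least 2 characters long:
   after strchr() locates the first character, the second character is
   compared directly before paying for a full strncmp(). */
char *strchrstr2(char *haystack, char *needle)
{
    size_t sl;

    if (!haystack || !needle)
        return NULL;
    sl = strlen(needle);
    if (sl < 2)
        return NULL;   /* caller must fall back to strstr()/strchrstr() */
    sl -= 2;           /* first two chars are checked separately */
    while ((haystack = strchr(haystack, needle[0])) != NULL)
    {
        /* cheap second-character check filters most false hits */
        if (haystack[1] == needle[1] &&
            !strncmp(haystack + 2, needle + 2, sl))
            return haystack;
        haystack++;
    }
    return NULL;
}
```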

DB
dbar...@SPIDynamics.com
--
comp.lang.c.moderated - moderation address: cl...@plethora.net

David Rubin

May 8, 2001, 11:57:12 AM
DB wrote:

[snip - strstr not fast enough]


> I found strchr() to be the fastest search available on the X86. This
> became the front-end to my new search routine. The result is strchrstr(),
> that uses a fast strchr() to find potential matches, and then does a
> strncmp() on the remainder of the target string.

...or you can try a regex package.

david

--
If 91 were prime, it would be a counterexample to your conjecture.
-- Bruce Wheeler

Douglas A. Gwyn

May 8, 2001, 11:58:34 AM
The speed of a strstr() implementation depends on many factors.
If you need to search for the same pattern repeatedly, splitting
strstr() up into separate initialization and search phases helps.
You might compare your implementation with those in Chapter 2 of
"Software Solutions in C", edited by Dale Schumacher. Its code
was included on a diskette; if you can't find a copy then send
me e-mail at work: gw...@arl.army.mil

Walter Briscoe

May 8, 2001, 11:59:08 AM
In article <clcm-2001...@plethora.net> of Mon, 7 May 2001 21:25:09
in comp.lang.c.moderated, DB <drba...@inetnow.net> writes

>My current project for SPI Dynamics made extensive use of the strstr()
>function to detect malicious activity in web server log files. The
>performance of that search routine was quite a bottleneck while LogAlert
>was detecting exploit strings. The strstr() function seemed to be part
>of the problem, I had to find a way to speed things up.
>
>I found strchr() to be the fastest search available on the X86. This
>became the front-end to my new search routine. The result is strchrstr(),
>that uses a fast strchr() to find potential matches, and then does a
>strncmp() on the remainder of the target string.
[snip]
Do you not have access to the source code for strchr and strstr? You may
be able to make a general improvement. If you can't and RHL gives you an
opaque library, you are entitled to report a problem to the supplier.

>TESTING:
>The search space is a 9 Mb text file, searching for a nine character
>target string and looped 255 times. The tests were run on RHL 6.2
>and 7.0. The kernel builds were 2.2.14 and 2.2.16 with glibc versions
>2.1.3 and 2.1.92 respectively.

This looks like a very suitable application for the Boyer-Moore string
searching method. That is a "well known" algorithm which searches a
large string for a fixed short string. It uses the properties of the
short string to optimise the searching of the longer string. It consists
of less than 50 lines of C code and is published. You will find it
fairly easily with a search for "Boyer Moore". You can whet your appetite
with http://www.cs.utexas.edu/users/best-ideas/string-searching
I have no experience with using this algorithm. It is reported to be
good. From my reading of it, I would expect you to be able to double the
speed of your tests to date (and then some!)
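For reference, a minimal sketch of the idea -- this is the simplified Horspool variant of Boyer-Moore, which keeps only the bad-character skip table; the function and variable names are mine, not from any published source:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search: setup is O(m + 256), and on a mismatch
   the pattern slides forward by up to its full length, so most of the
   haystack is never examined at all. */
char *bmh_search(const char *haystack, size_t hlen,
                 const char *needle, size_t nlen)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i;

    if (nlen == 0)
        return (char *)haystack;
    if (hlen < nlen)
        return NULL;

    /* Default shift: the whole pattern length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = nlen;
    /* Characters that occur in the pattern (except the last) shift less. */
    for (i = 0; i < nlen - 1; i++)
        skip[(unsigned char)needle[i]] = nlen - 1 - i;

    i = 0;
    while (i <= hlen - nlen)
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (char *)(haystack + i);
        /* Shift by the skip value of the haystack character aligned
           with the end of the pattern. */
        i += skip[(unsigned char)haystack[i + nlen - 1]];
    }
    return NULL;
}
```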
--
Walter Briscoe

Hans-Bernhard Broeker

May 8, 2001, 11:59:50 AM
In comp.lang.c.moderated DB <drba...@inetnow.net> wrote:

> I found strchr() to be the fastest search available on the X86. This
> became the front-end to my new search routine. The result is strchrstr(),
> that uses a fast strchr() to find potential matches, and then does a
> strncmp() on the remainder of the target string.

Your "Subject:" line is at least partly misleading. You do not have a
function faster than strstr() --- you have one faster than the
particular implementation of strstr() you're using, on one particular
architecture, for one size range of "needle", and a rather large
"haystack".

Actually, there are algorithms significantly faster than the
standardized strstr() for your particular application profile, and
also faster than yours. Do a literature search on Boyer-Moore
searching, and you'll see.

--
Hans-Bernhard Broeker (bro...@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Kevin Ashley

May 8, 2001, 7:38:29 PM
DB wrote:
>
> My current project for SPI Dynamics made extensive use of the strstr()
> function to detect malicious activity in web server log files. The
> performance of that search routine was quite a bottleneck while LogAlert
> was detecting exploit strings. The strstr() function seemed to be part
> of the problem, I had to find a way to speed things up.
>
> I found strchr() to be the fastest search available on the X86. This
> became the front-end to my new search routine. The result is strchrstr(),
> that uses a fast strchr() to find potential matches, and then does a
> strncmp() on the remainder of the target string.

There are far better ways to search for a target string in
a body of text. For a review of some of them, see an article in the
journal "Software Practice and Experience" from some considerable
number of years ago (I can't recall which, except that it must be pre-1985.)

My recollection was that the best general algorithm was not dissimilar
to yours (use strchr, then do some matching) but with some critical
differences. Primarily, it depended on knowledge of likely letter frequencies
in the searched text. A full explanation is off-topic for almost all the
groups you posted to (IMHO), but you basically search first for the letter
in the target which is least likely to occur. You also construct a jump
table which tells you, having examined a particular letter, how many tests
you can skip. E.g. you are searching for "easy". You start by looking
at the 4th letter of the sample text and see if it is "y". If it isn't,
but is (say) an "m", you can skip the next three letters and examine
position 8 to see if it is "y". And so on.

Better algorithms exist for more tightly-defined problem spaces, but this
is easy to implement and provably fast. It has the advantage that search
speed increases as the length of the target string increases. (OK, that's
sometimes an advantage and sometimes not so good.)

Kevin Ashley

Kevin D. Quitt

May 8, 2001, 7:38:48 PM
On 08 May 2001 15:59:08 GMT, Walter Briscoe <wbri...@ponle.demon.co.uk>
wrote:
>http://www.cs.utexas.edu/users/best-ideas/string-searching

That link's dead. Try

http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91351-4454 96.37% of all statistics are made up
Per the FCA, this email address may not be added to any commercial mail list

Kevin D. Quitt

May 8, 2001, 7:39:25 PM
FWIW, the name you've chosen for your function is reserved.


--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91351-4454 96.37% of all statistics are made up
Per the FCA, this email address may not be added to any commercial mail list

Edward Rosten

May 8, 2001, 7:42:21 PM
In article <clcm-2001...@plethora.net>, "David Rubin"
<dlr...@hotmail.com> wrote:

> DB wrote:
>
> [snip - strstr not fast enough]
>> I found strchr() to be the fastest search available on the X86. This
>> became the front-end to my new search routine. The result is
>> strchrstr(), that uses a fast strchr() to find potential matches, and
>> then does a strncmp() on the remainder of the target string.
>
> ...or you can try a regex package.

Surely, regex, which is more complex than strstr will be slower?

-Ed

--
You can't go wrong with psycho-rats.

u 9 8 e j r (at) e c s . o x . a c . u k

Edward Rosten

May 8, 2001, 7:42:49 PM


This URL 404's. Do you have another?

cheers

-Ed

--
You can't go wrong with psycho-rats.

u 9 8 e j r (at) e c s . o x . a c . u k

Douglas A. Gwyn

May 8, 2001, 7:43:06 PM
Walter Briscoe wrote:
> This looks like a very suitable application for the Boyer-Moore string
> searching method. ...

> I have no experience with using this algorithm. It is reported to be
> good. From my reading of it, I would expect you to be able to double the
> speed of your tests to date (and then some!)

Yes, Boyer-Moore tends to be best for the "average" string search
application; double the naive approach's speed is about right.
If one is searching for multiple occurrences of the same string,
the pattern set-up phase can be amortized over several searches.

Chris Kern

May 9, 2001, 12:15:34 AM
On 08 May 2001 23:42:21 GMT, "Edward Rosten" <lo...@my.sig> posted the
following:

>In article <clcm-2001...@plethora.net>, "David Rubin"
><dlr...@hotmail.com> wrote:
>
>> DB wrote:
>>
>> [snip - strstr not fast enough]
>>> I found strchr() to be the fastest search available on the X86. This
>>> became the front-end to my new search routine. The result is
>>> strchrstr(), that uses a fast strchr() to find potential matches, and
>>> then does a strncmp() on the remainder of the target string.
>>
>> ...or you can try a regex package.
>
>Surely, regex, which is more complex than strstr will be slower?

It would depend on the algorithm used, wouldn't it? I can imagine a
brute-force string matching algorithm going slower than a regexp
package which uses a better algorithm.

-Chris

Ben Pfaff

May 9, 2001, 12:15:43 AM
"Edward Rosten" <lo...@my.sig> writes:

> Surely, regex, which is more complex than strstr will be slower?

A Pentium III is more complex than an 8086, but which is slower?
--
"When in doubt, treat ``feature'' as a pejorative.
(Think of a hundred-bladed Swiss army knife.)"
--Kernighan and Plauger, _Software Tools_

Bengt Richter

May 9, 2001, 12:15:52 AM
On 07 May 2001 21:25:09 GMT, DB <drba...@inetnow.net> wrote:

>My current project for SPI Dynamics made extensive use of the strstr()
>function to detect malicious activity in web server log files. The
>performance of that search routine was quite a bottleneck while LogAlert
>was detecting exploit strings. The strstr() function seemed to be part
>of the problem, I had to find a way to speed things up.
>
>I found strchr() to be the fastest search available on the X86. This
>became the front-end to my new search routine. The result is strchrstr(),
>that uses a fast strchr() to find potential matches, and then does a
>strncmp() on the remainder of the target string.
>

Are you sure you were not just accidentally linking in a non-optimized
version of some kind? People have been fiddling with those routines
for a long time. They should be pretty tight.

Others have pointed out that an alternative search algorithm might
be faster for your application, but reading between the lines I'm
guessing you are calling strstr once for each of a list of "exploit
strings." If so, I suspect you could gain a LOT by generating a new
single-pass search program each time your "exploit string" list
changes, and running that.

Once you set up a make file and a simple script or two, it could be
very convenient. Make could even run the search for you.

The make would (1) run a script to translate the exploit string table
into input for flex, (2) run flex to take that and generate source for
a parsing routine that would find your strings and call whatever
action routines you define (to interface to your current app, or
maybe just print warnings), (3) compile and link (with your app if
you aren't just printing warnings etc.), and (4) run it.

You could then just type make when you wanted to do a scan, and be
sure your search included the latest "exploit strings."
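A sketch of what the flex input generated in step (1) might look like -- the exploit strings and actions here are made up for illustration; the real rules would come from your table:

```lex
%option noyywrap yylineno
%%
"cmd.exe"      { printf("exploit: cmd.exe on line %d\n", yylineno); }
"/etc/passwd"  { printf("exploit: /etc/passwd on line %d\n", yylineno); }
.|\n           ;   /* everything else is discarded */
%%
int main(void) { yylex(); return 0; }
```

flex compiles all the patterns into a single state machine, so the log is scanned once no matter how many exploit strings are in the table.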

A detail re your results: How many times in the 9 MB did the 9-char
target string occur? And how many partial matches? If only a few,
the results are mostly showing first-char search loop times. Of
course, that may be the normal (hopefully) case in your application.

A single-pass total search will make a lot of difference if you have
a hundred strings to search for, and probably even if you only have 5,
if you are searching through 9 MB.

How fast do you need to be able to do the whole job?

BTW, flex is available from gnu.org, and you can compile it for
windows as well as the usual.

HTH.

>The new function is a direct replacement for strstr(). Since this
>approach made quite a difference, I made another change, and did a
>second character compare before the strncmp(). This is strchrstr2(),
>which is NOT an exact replacement for strstr(), because the target
>string must be at least 2 characters long.
>
>TESTING:
>The search space is a 9 Mb text file, searching for a nine character
>target string and looped 255 times. The tests were run on RHL 6.2
>and 7.0. The kernel builds were 2.2.14 and 2.2.16 with glibc versions
>2.1.3 and 2.1.92 respectively.
>
> system   | strstr | strchrstr | strchrstr2 |
>----------+--------+-----------+------------+ (timed with gettimeofday())
> 5x86-160 | 195.5  |   141.4   |   121.9    | RHL 6.2
> P-166    |  97.7  |    58.9   |    52.3    | RHL 7.0
> Ath-840  |  18.5  |    16.8   |    15.5    | RHL 6.2
>
>Links to the source code for all three test programs and the alternate
>search routine are at: http://www.SPIDynamics.com/speed_search.html
>

[...]

Laurent Deniau

May 9, 2001, 10:28:54 AM
Ben Pfaff wrote:
>
> "Edward Rosten" <lo...@my.sig> writes:
>
> > Surely, regex, which is more complex than strstr will be slower?
>
> A Pentium III is more complex than an 8086, but which is slower?

1) slow down your PentiumIII to the frequency of your 8086
2) use only instructions from the 8086's set.
3) compare the speed.
You may be surprised by the result...

BTW I agree with you :-)

> --
> "When in doubt, treat ``feature'' as a pejorative.
> (Think of a hundred-bladed Swiss army knife.)"
> --Kernighan and Plauger, _Software Tools_
> --
> comp.lang.c.moderated - moderation address: cl...@plethora.net

a+, ld.

--
[ Laurent Deniau -- Data Analysis and Signal Processing ]
[ CERN -- The European Laboratory for Particle Physics ]
[ Laurent...@cern.ch - http://home.cern.ch/ldeniau ]
[ -- One becomes old when dreams become regrets -- ]
[ -- Il faut avoir une haute opinion sur ]
[ ce qu'on va faire, pas sur ce qu'on a fait -- ]

Edward Rosten

May 9, 2001, 10:29:12 AM
>> Surely, regex, which is more complex than strstr will be slower?
>
> A Pentium III is more complex than an 8086, but which is slower?

What I'm saying is that the regex has to do far more than a simple
search just to do a simple search. Surely you could get a fixed-string
search algorithm to be faster than a much more general-purpose regex
algorithm.

The extra complexity doesn't achieve anything in this case.

Which is faster at integers, a 200MHz Pentium, or a 200MHz StrongArm?

(In all the benchmarks I've seen the StrongARM is faster, and it is much
simpler)

-Ed


--
You can't go wrong with psycho-rats.

u 9 8 e j r (at) e c s . o x . a c . u k

Edward Rosten

May 9, 2001, 10:29:20 AM
>>
>>Surely, regex, which is more complex than strstr will be slower?
>
> It would depend on the algorithm used, wouldn't it? I can imagine a
> brute-force string matching algorithm going slower than a regexp package
> which uses a better algorithm.

Good point. But the fastest fixed string algorithms have to be at least
as fast as regex ones (and likely faster by quite a bit).


-Ed

--
You can't go wrong with psycho-rats.

u 9 8 e j r (at) e c s . o x . a c . u k

Dann Corbit

May 9, 2001, 3:26:14 PM
"Chris Kern" <ke...@grinnell.edu> wrote in message
news:clcm-2001...@plethora.net...


Even more so, it will be a function of the problem size. If I have ten-
character strings, then strstr() will probably do better than regexp(). If I
have thousand-megabyte strings, I doubt there is any contest.

Then again, why not try a web search to find stuff like this:
http://www.cs.purdue.edu/homes/stelo/pattern.html
--
C-FAQ: http://www.eskimo.com/~scs/C-faq/top.html
"The C-FAQ Book" ISBN 0-201-84519-9
C.A.P. FAQ: ftp://cap.connx.com/pub/Chess%20Analysis%20Project%20FAQ.htm

Bill Medland

May 9, 2001, 3:26:04 PM
Edward Rosten <lo...@my.sig> wrote in article
<clcm-2001...@plethora.net>...

> > with http://www.cs.utexas.edu/users/best-ideas/string-searching I have
>
>
> This URL 404's. Do you have another?
>
> cheers
>
> -Ed
Try http://www.cs.utexas.edu/users/moore/best-ideas/string-searching

Douglas A. Gwyn

May 10, 2001, 4:25:58 PM
Bengt Richter wrote:
> Are you sure you were not just accidentally linking in a non-optimized
> version of some kind? People have been fiddling with those routines
> for a long time. They should be pretty tight.

One would think so, but it's not always the case. When I ran timing
tests for my chapter on fast string searching in "Software Solutions
in C", I was surprised to discover that the native C library on SunOS
had an implementation of strstr() that performed *horribly*, 70 times
slower than a straight-forward, unoptimized naive implementation of
the obvious brute-force solution. How could this be? It turned out
that somebody had tried to "improve" the brute-force method to not
invoke strcmp() when the text being searched was shorter than the
pattern (this precaution is *not necessary*); they had changed the
loop termination condition from (*tx == '\0') to (strlen(tx) < pat_len).
It was the evaluation of strlen() each iteration that was the problem.

A carefully-tuned implementation of brute-force searching ran 4.3
times faster than the straightforward naive implementation, i.e. 300
times faster than the native SunOS C library! Boyer-Moore improved
upon that by a factor of 2, for the particular tests I ran.
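Gwyn's point is easy to see in a sketch of the straightforward brute-force version (hypothetical code, not the SunOS source):

```c
#include <stddef.h>
#include <string.h>

/* Naive brute-force substring search.  The loop terminates on
   (*tx == '\0'); writing (strlen(tx) >= pat_len) here instead would
   rescan the remaining text on every iteration -- the SunOS bug
   described above.  The guard is unnecessary anyway: when the text is
   shorter than the pattern, strncmp() simply fails at the text's NUL. */
char *naive_strstr(const char *tx, const char *pat)
{
    size_t pat_len = strlen(pat);

    for (; *tx != '\0'; tx++)
        if (strncmp(tx, pat, pat_len) == 0)
            return (char *)tx;
    /* An empty pattern matches anywhere, including the empty string. */
    return pat_len == 0 ? (char *)tx : NULL;
}
```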

Jonathan Leffler

May 11, 2001, 11:26:02 AM
"Douglas A. Gwyn" wrote:
> Bengt Richter wrote:
> > Are you sure you were not just accidentally linking in a non-optimized
> > version of some kind? People have been fiddling with those routines
> > for a long time. They should be pretty tight.
>
> One would think so, but it's not always the case. When I ran timing
> tests for my chapter on fast string searching in "Software Solutions
> in C", I was surprised to discover that the native C library on SunOS
> had an implementation of strstr() that performed *horribly*, 70 times
> slower than a straight-forward, unoptimized naive implementation of
> the obvious brute-force solution. How could this be? It turned out
> that somebody had tried to "improve" the brute-force method to not
> invoke strcmp() when the text being searched was shorter than the
> pattern (this precaution is *not necessary*); they had changed the
> loop termination condition from (*tx == '\0') to (strlen(tx) < pat_len).
> It was the evaluation of strlen() each iteration that was the problem.

I ran into that one too, maybe 10 years ago. I happened to be searching
for fixed strings in big hunks of data (multiple kilobytes at a time),
and what ran fast on HP-UX didn't finish on SunOS. I used profiling to
find that strlen was being called as often as strstr, and my code wasn't
calling strlen anywhere, so I knew the culprit. And by replacing the
SunOS strstr() with a crude version, I got back the high performance
found on HP-UX.

--
Yours,
Jonathan Leffler (Jonathan...@Informix.com) #include <disclaimer.h>
Guardian of DBD::Informix v1.00.PC1 -- http://www.perl.com/CPAN
"I don't suffer from insanity; I enjoy every minute of it!"

Rex Barzee

May 12, 2001, 1:54:51 AM
Kevin Ashley wrote:
>
> DB wrote:
> >
> > My current project for SPI Dynamics made extensive use of the strstr()
> > function to detect malicious activity in web server log files. The
> > performance of that search routine was quite a bottleneck while LogAlert
> > was detecting exploit strings. The strstr() function seemed to be part
> > of the problem, I had to find a way to speed things up.
> >
> > I found strchr() to be the fastest search available on the X86. This
> > became the front-end to my new search routine. The result is strchrstr(),
> > that uses a fast strchr() to find potential matches, and then does a
> > strncmp() on the remainder of the target string.
>
> There are much more optimal ways to search for a target string in
> a body of text. For a review of some of them, see an article in the
> journal "Software Practice and Experience" from some considerable
> number of years ago (can't recall, except I know it must be pre-1985.)
>

The best article that I've seen is "Fast String Searching" found in
"Software Practice and Experience", volume 21, no. 11, November 1991, pg
1221-1248. I don't know if that's really the article you mean, but
nonetheless it is a very good article comparing performance of various
algorithms, and it includes C source code.

--
Rex Barzee
Hewlett-Packard Unix Systems Enablement Lab

Brian Inglis

May 13, 2001, 2:56:43 PM
On 07 May 2001 21:25:09 GMT, DB <drba...@inetnow.net> wrote:

Have you tested an implementation of fgrep (fixed string match)
for comparison, or tried pulling the string search code from one?

Recent implementations should contain Boyer-Moore-Pratt searches
which use end first matching and a couple of arrays containing
skip offsets when unmatched and matched characters are seen to
run an order of magnitude faster than straight [split intended]
forward searches on single/multiple targets.

You can get some of the benefit in your routine by searching for
the last character in your search string and matching preceding
characters backwards to the start of the input and search
strings.

Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian....@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
fake address use address above to reply
tos...@aol.com ab...@aol.com ab...@yahoo.com ab...@hotmail.com ab...@msn.com ab...@sprint.com ab...@earthlink.com ab...@cadvision.com ab...@ibsystems.com u...@ftc.gov
spam traps

Kaz Kylheku

May 14, 2001, 1:01:50 AM
On 13 May 2001 18:56:43 GMT, Brian Inglis <Brian.do...@Compuserve.com>
wrote:

>Have you tested an implementation of fgrep (fixed string match)
>for comparison, or tried pulling the string search code from one?

fgrep and egrep are obsolete commands, roughly equivalent to grep -F and
grep -E. The grep program should be able to figure out whether or not a
pattern is a fixed string regardless of the command line options; the purpose
of -F is to disable the recognition of regular expression operators in the
search pattern, not to indicate that the search should be optimized in
some way.

Jerry Coffin

May 14, 2001, 3:53:09 PM
In article <clcm-2001...@plethora.net>,
Brian.do...@Compuserve.com says...

[ ... ]

> Recent implementations should contain Boyer-Moore-Pratt searches
> which use end first matching and a couple of arrays containing
> skip offsets when unmatched and matched characters are seen to
> run an order of magnitude faster than straight [split intended]
> forward searches on single/multiple targets.

Just FWIW, I know of:

Knuth-Morris-Pratt
Boyer-Moore
Boyer-Moore-Horspool
Sunday's variant of Boyer-Moore-Horspool

but TTBOMK, Pratt was not involved in any variant on the Boyer-Moore
search algorithm. I suppose to be complete, I could mention that
I've invented another variant of Boyer-Moore that I believe is
original, but isn't on the list above. I'm pretty sure Pratt didn't
have anything to do with it either though. <G>

--
Later,
Jerry.

The Universe is a figment of its own imagination.

Kevin Ashley

May 14, 2001, 3:53:21 PM
Rex Barzee wrote:
>
> Kevin Ashley wrote:
> >
> > DB wrote:
> > >
....

> > > I found strchr() to be the fastest search available on the X86. This
> > > became the front-end to my new search routine. The result is strchrstr(),
> > > that uses a fast strchr() to find potential matches, and then does a
> > > strncmp() on the remainder of the target string.
> >
> > There are much more optimal ways to search for a target string in
> > a body of text. For a review of some of them, see an article in the
> > journal "Software Practice and Experience" from some considerable
> > number of years ago (can't recall, except I know it must be pre-1985.)
> >
>
> The best article that I've seen is "Fast String Searching" found in
> "Software Practice and Experience", volume 21, no. 11, November 1991, pg
> 1221-1248. I don't know if that's really the article you mean, but
> nonetheless it is a very good article comparing performance of various
> algorithms, and it includes C source code.

It's not the article I mean, but I expect it is a better review
of the issue than the one I was referring to. Until 1985 I worked
somewhere that had a subscription to SP+E, and I read each month's
issue as it came in, as well as consuming all the back issues when
I started the job. I remember it as an excellent and readable journal.
That's what enabled me to put a (vague) date on the article.

At the time of the 1980's article, the author(s) were interested
in comparing software algorithms with hardware implementations
of string search that were often found on CISC machines of the time.
I think they were pleasantly surprised that their best algorithm
beat the hardware hands down. The 'hardware' implementation
would almost certainly have used the naive implementation of strstr,
albeit in microcode and hence saving a number of fetch/decode
cycles. I doubt such comparisons are of interest any more.

Kevin Ashley.

Brian Inglis

May 16, 2001, 9:50:40 AM
On 14 May 2001 19:53:09 GMT, Jerry Coffin <jco...@taeus.com>
wrote:

>In article <clcm-2001...@plethora.net>,
>Brian.do...@Compuserve.com says...
>
>[ ... ]
>
>> Recent implementations should contain Boyer-Moore-Pratt searches
>> which use end first matching and a couple of arrays containing
>> skip offsets when unmatched and matched characters are seen to
>> run an order of magnitude faster than straight [split intended]
>> forward searches on single/multiple targets.
>
>Just FWIW, I know of:
>
>Knuth-Morris-Pratt
>Boyer-Moore
>Boyer-Moore-Horspool
>Sunday's variant of Boyer-Moore-Horspool
>
>but TTBOMK, Pratt was not involved in any variant on the Boyer-Moore
>search algorithm. I suppose to be complete, I could mention that
>I've invented another variant of Boyer-Moore that I believe is
>original, but isn't on the list above. I'm pretty sure Pratt didn't
>have anything to do with it either though. <G>

Looks like my memory hashed the initials into the same bucket and
munged the keys instead of chaining them.

Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian....@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)

Brian Inglis

May 17, 2001, 11:47:21 PM
On 14 May 2001 19:53:09 GMT, Jerry Coffin <jco...@taeus.com>
wrote:

>In article <clcm-2001...@plethora.net>,

>Brian.do...@Compuserve.com says...
>
>[ ... ]
>
>> Recent implementations should contain Boyer-Moore-Pratt searches
>> which use end first matching and a couple of arrays containing
>> skip offsets when unmatched and matched characters are seen to
>> run an order of magnitude faster than straight [split intended]
>> forward searches on single/multiple targets.
>
>Just FWIW, I know of:
>
>Knuth-Morris-Pratt
>Boyer-Moore
>Boyer-Moore-Horspool
>Sunday's variant of Boyer-Moore-Horspool
>
>but TTBOMK, Pratt was not involved in any variant on the Boyer-Moore
>search algorithm. I suppose to be complete, I could mention that
>I've invented another variant of Boyer-Moore that I believe is
>original, but isn't on the list above. I'm pretty sure Pratt didn't
>have anything to do with it either though. <G>

Just saw Boyer-Moore-Gosper mentioned recently -- will have to
look that one up!

Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian....@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)

Jeffrey Turner

May 19, 2001, 11:35:03 AM
Walter Briscoe wrote:

I couldn't find the site you list, but this site has a LOT of info on
string searching: http://citeseer.nj.nec.com/context/2702/0

--Jeff

mmf

May 20, 2001, 12:48:14 AM

>My current project for SPI Dynamics made extensive use of the strstr()
>function to detect malicious activity in web server log files. The
>performance of that search routine was quite a bottleneck while LogAlert
>was detecting exploit strings. The strstr() function seemed to be part
>of the problem, I had to find a way to speed things up.

Why not move laterally? Instead of a faster way to repeatedly scan the
text, why not find a way to scan the text once?

There's something called a finite state machine that can find any number
of strings in one pass through the text.

A deterministic FSM examines each character exactly once, but it may have
an exponentially large number of states.

A nondeterministic FSM can have fewer states but may need to examine a
character multiple times.

You can use tools like lex to make FSMs, and you can also do it by hand.
This is a deterministic FSM that finds all occurrences of the substrings
plover and plugh:

/* inputfile and outputfile are FILE *'s opened elsewhere;
   cp tracks the column of the current character */
int character, state = 0, lp = 1, cp = 0;
while (state >= 0) {
    character = fgetc(inputfile);
    if (character == EOF) break;
    if (character == '\n') lp++, cp = 0; else cp++;
    switch (state) {
    case 0: switch (character) {
        case 'p': state = 'p'; continue;
        default: state = 0; continue;
        }
    case 'p': switch (character) {
        case 'p': state = 'p'; continue;
        case 'l': state = 'pl'; continue;
        default: state = 0; continue;
        }
    case 'pl': switch (character) {
        case 'p': state = 'p'; continue;
        case 'o': state = 'plo'; continue;
        case 'u': state = 'plu'; continue;
        default: state = 0; continue;
        }
    case 'plo': switch (character) {
        case 'p': state = 'p'; continue;
        case 'v': state = 'plov'; continue;
        default: state = 0; continue;
        }
    case 'plov': switch (character) {
        case 'p': state = 'p'; continue;
        case 'e': state = 'plve'; continue;
        default: state = 0; continue;
        }
    case 'plve': switch (character) {
        case 'p': state = 'p'; continue;
        case 'r':
            /* 'r' is at column cp, so "plover" spans cp-5 .. cp */
            fprintf(outputfile,
                    "found plover line %d "
                    "column %d through %d\n",
                    lp, cp - 5, cp);
            state = 0; continue;
        default: state = 0; continue;
        }
    case 'plu': switch (character) {
        case 'p': state = 'p'; continue;
        case 'g': state = 'plug'; continue;
        default: state = 0; continue;
        }
    case 'plug': switch (character) {
        case 'p': state = 'p'; continue;
        case 'h':
            /* 'h' is at column cp, so "plugh" spans cp-4 .. cp */
            fprintf(outputfile,
                    "found plugh line %d "
                    "column %d through %d\n",
                    lp, cp - 4, cp);
            state = 0; continue;
        default: state = 0; continue;
        }
    }
}

Keith Thompson

unread,
May 20, 2001, 12:41:58 PM5/20/01
to
mair_...@www.yahoo.akadns.net (mmf) writes:
[...]

> case 'pl': switch (character) {
> case 'p': state = 'p'; continue;
> case 'o': state = 'plo'; continue;
> case 'u': state = 'plu'; continue;
> default: state = 0; continue;
> }
[...]

Multi-character constants like 'plu' are non-portable. Since all you
really care about in this case is uniqueness, it makes more sense to
use an enum.

--
Keith Thompson (The_Other_Keith) k...@cts.com <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://www.sdsc.edu/~kst>
Cxiuj via bazo apartenas ni.

CBFalconer

unread,
May 20, 2001, 12:42:03 PM5/20/01
to
mmf wrote:
>
> In article <clcm-2001...@plethora.net>, DB <drba...@inetnow.net> wrote:
>
> >My current project for SPI Dynamics made extensive use of the strstr()
> >function to detect malicious activity in web server log files. The
> >performance of that search routine was quite a bottleneck while LogAlert
> >was detecting exploit strings. The strstr() function seemed to be part
> >of the problem, I had to find a way to speed things up.
>
> why not move laterally? instead of a faster way to repeatedly scan text
> why not find a way to scan text once

A technique used in ID2ID (to replace id strings in multiple
source files in one pass) inputs the candidate strings and forms a
binary tree. It uses an AVL tree to avoid imbalance, but could
also use red-black trees. Having formed that, a single pass
through the source extracts each possible identifier and does a
rapid binary search. If not found the original identifier is
output, otherwise the replacement.

There are complexities in ID2ID to avoid replacing strings with
identifiers that are already in the source and other anomalies,
none of which applies to a simple search.

Look up Ben Pfaff's libraries for AVL and red-black trees in
standard C.

In your case things should be especially simple because the input
set of strings is relatively unchanging.

--
Chuck F (cbfal...@my-deja.com) (cbfal...@XXXXworldnet.att.net)
http://www.qwikpages.com/backstreets/cbfalconer :=(down for now)
(Remove "NOSPAM." from reply address. my-deja works unmodified)
mailto:u...@ftc.gov (for spambots to harvest)

Ben Pfaff

unread,
May 20, 2001, 1:53:47 PM5/20/01
to
CBFalconer <cbfal...@my-deja.com> writes:

> mmf wrote:
> >
> > In article <clcm-2001...@plethora.net>, DB <drba...@inetnow.net> wrote:
> >
> > >My current project for SPI Dynamics made extensive use of the strstr()
> > >function to detect malicious activity in web server log files. The
> > >performance of that search routine was quite a bottleneck while LogAlert
> > >was detecting exploit strings. The strstr() function seemed to be part
> > >of the problem, I had to find a way to speed things up.
> >
> > why not move laterally? instead of a faster way to repeatedly scan text
> > why not find a way to scan text once
>
> A technique used in ID2ID (to replace id strings in multiple
> source files in one pass) inputs the candidate strings and forms a
> binary tree. It uses an AVL tree to avoid imbalance, but could

> also use red-black trees. [...]

If the table of candidate strings does not change during a
program run, then there's a good chance that a balanced tree is a
waste of time and too complicated to boot. I would probably use
qsort() followed by bsearch() for simplicity or a hand-coded
binary search for speed, or maybe a hash table or trie, myself.

Though AVL and red-black trees are pretty, there's an even
prettier algorithm for giving an ordinary binary tree perfect
balance in O(n) time and O(1) space. If you really want to use a
binary tree, then, for a tree with static content, you can't beat
this. Just build a tree that's unbalanced any which way and run
it through the balancer.

This algorithm is implemented in the latest ALPHA release of
libavl, FWIW. As always, libavl is available from
<URL:http://www.msu.edu/~pfaffben>.
--
"What is appropriate for the master is not appropriate for the novice.
You must understand the Tao before transcending structure."
--The Tao of Programming

CBFalconer

unread,
May 21, 2001, 2:42:38 PM5/21/01
to

That runs the risk that the initial build hits degenerate
conditions, such as collapsing into a linked list on sorted input.
The tree builder can then use excessive space for its own recursion.
An intrinsically balanced insertion mechanism avoids all this, and is
relatively impervious to peculiar input sets.


k

unread,
May 29, 2001, 6:06:41 PM5/29/01
to
You could try the following. It works only for patterns of length at most 32
(or whatever the number of bits in a machine word is). However, this is
easy to fix, regardless of the word length. Left as an exercise...

#include <stdio.h> /* for printf() */

int bpsuffix( unsigned char * t, int n, unsigned char * p, int m )
{
unsigned b[ 256 ];
unsigned d;
const unsigned mm = 1 << ( m - 1 );
int i, j, shift, matches = 0;
unsigned char c;

/* stupid preprocessing, rewrite it.... */

for( i = 0; i < m / 2; i++ )
{
c = p[ i ];
p[ i ] = p[ m - i - 1 ];
p[ m - i - 1 ] = c;
}

for( i = 0; i < 256; i++ ) b[ i ] = 0;

for( i = 0; i < m; i++ ) b[ p[ i ]] |= 1 << i;

for( i = 0; i < m / 2; i++ )
{
c = p[ i ];
p[ i ] = p[ m - i - 1 ];
p[ m - i - 1 ] = c;
}

i = 0;

do { j = m; shift = m; d = ~0;

do { d &= b[ t[ i + j - 1 ]];
--j;
if( d & mm )
if( j > 0 )
shift = j;
else
{
++ matches;
printf("Match at position %d\n", i );
}

} while( d <<= 1 );

/* the following test is unnecessary, if t is padded
appropriately...*/

if(( i += shift ) >= n ) break;

/* } while( t[ ( i += shift ) ] ); */
} while( t[ i ] );

return matches;
}
