Fast way to determine if a string contains a member of a list of strings

16 views
Skip to first unread message

Karch

unread,
Mar 5, 2008, 12:05:22 PM3/5/08
to
I need to find the fastest way in terms of storage and searching to
determine if a given string contains one of a member of a list of strings.
So, think of it in terms of this: I have a string such as the following:

"Smith said the dog belongs (or did belong) to the secretary, an employee of
the company."

Then the list contains the following (and this list is MUCH larger in the
real situation):

Adams
Alan
Jones
Jacobs
Smith
Thaxton
Vela
Zulu

I would need to stop the processing and return (true) as soon as Smith was
found. On the other hand, if the string was changed to the following, there
would be no match and I would return (false):

"Smitherson said the dog belongs (or did belong) to the secretary, an
employee of the company."

In the given string, do know that the matches should begin at a given point
(zero position), but I need to keep processing until I have exhausted the
candidate string in the list - as shown above - to prevent a false match.

I have played around with some different data structures, such as prefix and
suffix trees, an these work in the case that you have a string that you are
trying to match in a list, not vice versa. The approach is required to be
very performant because it will be evaluated millions of times. I am also
okay with an unsafe code approach that works. I just need the evaluations to
terminate as soon as possible rather than looping through every single item
in the list. Even an IndexOf is too slow.


amdrit

unread,
Mar 5, 2008, 12:38:38 PM3/5/08
to
I think RegEx will help you out here. As far as not iterating over the
entire list, I've no clue how you can selectively not fall through given the
only match in the list is the last item in the list or when there is no
match at all.


"Karch" <news.microsoft.com> wrote in message
news:e993ZMuf...@TK2MSFTNGP06.phx.gbl...

John Daragon

unread,
Mar 5, 2008, 1:18:48 PM3/5/08
to
Karch wrote:
> I need to find the fastest way in terms of storage and searching to
> determine if a given string contains one of a member of a list of strings.
> So, think of it in terms of this: I have a string such as the following:
>
> "Smith said the dog belongs (or did belong) to the secretary, an employee of
> the company."
>
> Then the list contains the following (and this list is MUCH larger in the
> real situation):
>
> Adams
> Alan
> Jones
> Jacobs
> Smith
> Thaxton
> Vela
> Zulu
>

Could you give us some idea of the likely length of the list? And
whether it's static or dynamic ?

jd

Peter Duniho

unread,
Mar 5, 2008, 1:48:00 PM3/5/08
to
On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.com> wrote:

> I need to find the fastest way in terms of storage and searching to
> determine if a given string contains one of a member of a list of
> strings.

I don't know if RegEx has optimizations for dealing with this sort of
thing. It could, but given how long a search string for RegEx could be if
you concatenate all of your possible matches, maybe it doesn't.

That'd be my first try though. It seems like the actual RegEx search
string would be simple (just "or" all of your possible matches together).
If it performs well enough, there you go.

If not, I'd guess there's a fair amount of research into algorithms like
this, but a fairly basic approach is a state graph. It's memory
intensive, but rewards you with good performance. This came up awhile ago
and I posted some sample code to get someone else started. You can find
that post here:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77

I've referred to this post a couple of other times, and while I've never
had anyone say it actually turned out to be useful, they've never said it
wasn't either. :)

I suppose if I'm going to keep referring people to it, maybe I ought to
clean it up a little bit more. But what's there does work and should be
enough to get you pointed in the right direction.

Pete

John Daragon

unread,
Mar 5, 2008, 1:57:23 PM3/5/08
to

If the list is static, which means that the overhead of hashing it can
occur just the once, I'd be inclined to use a HashTable.

Tokenise the string using String.Split(), then iterate over the
resulting array, using each token to access the hashtable, which will
return null if the token is not found. For large values of list length
this should be really quite efficient (assuming a decent hash algorithm
for your data type and distribution).

jd

Ben Voigt [C++ MVP]

unread,
Mar 5, 2008, 2:04:44 PM3/5/08
to

"Karch" <news.microsoft.com> wrote in message
news:e993ZMuf...@TK2MSFTNGP06.phx.gbl...

By your example, you are assuming a delimiter (so that a string containing
Smitherson does not match Smith).

Store your list in a Dictionary.
Extract the word (or each word) from the string by splitting with the
delimiter, look for it in the dictionary.


Ben Voigt [C++ MVP]

unread,
Mar 5, 2008, 2:10:37 PM3/5/08
to

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:%23mjouNv...@TK2MSFTNGP03.phx.gbl...

OTOH, if the items can contain the delimiter as in "Van Buren" things get
somewhat more complex.

Build a trie from your list. Match using the trie using a maximal length
match algorithm. So far Smitherson will still match Smith, so you need one
more step: after you make a match, check that the subsequent character is a
delimiter or you reached the end of your string.


Barry Kelly

unread,
Mar 5, 2008, 2:57:13 PM3/5/08
to
(Followups set to microsoft.public.dotnet.languages.csharp)

This sounds awfully like a college problem or pre-job technical
assignment.

Karch wrote:
> I need to find the fastest way in terms of storage and searching to
> determine if a given string contains one of a member of a list of strings.

"The fastest way in terms of storage" is a contradictory constraint.
Fastest implies time, storage implies space. Optimizing for one or the
other usually involves a tradeoff, though fastest usually requires
keeping storage down because of cache effects etc. Cache effects may be
relevant for this problem, see below.

However, even taking this into account, your problem is underspecified.
See below for details.

> So, think of it in terms of this: I have a string such as the following:
>
> "Smith said the dog belongs (or did belong) to the secretary, an employee of
> the company."
>
> Then the list contains the following (and this list is MUCH larger in the
> real situation):
>
> Adams
> Alan
> Jones
> Jacobs
> Smith
> Thaxton
> Vela
> Zulu
>
> I would need to stop the processing and return (true) as soon as Smith was
> found. On the other hand, if the string was changed to the following, there
> would be no match and I would return (false):
>
> "Smitherson said the dog belongs (or did belong) to the secretary, an
> employee of the company."

In order for 'Smith' in the list to not match against 'Smith' (followed
by 'erson') in this example, there needs to be either (1) an implied
"word-break" zero-width assertion after each element in the list
(similar to regex '$', '\W', '\>', etc.), or (2) an implied convention
for splitting up the sentence into fragments that can be matched against
the list - tokenizing, in a word. The missing information (1) can be
converted into (2) and vice versa. You need to preprocess - implicitly
via algorithm or explicitly via data munging - either the list or the
sentence to reflect the hidden constraints you haven't described.

However, once you've done that, the solution falls out fairly easily. A
prefix trie which includes an edge for the assertion along the lines of
(1) above should solve the problem trivially, or alternatively
extracting the appropriate token using the appropriate rules of (2) and
using simple hash lookup should also work. Matches should cost
proportional to the length of the matched string in both cases, while
mismatches might be discarded more cheaply using (1); however the
constant factors involved, and in particular cache locality loss from a
trie, may make (2) cheaper even for mismatches, especially if you have a
lot of spurious prefixes. You'll need to measure.

> In the given string, do know that the matches should begin at a given point
> (zero position), but I need to keep processing until I have exhausted the
> candidate string in the list - as shown above - to prevent a false match.

-- Barry

--
http://barrkel.blogspot.com/

Jesse Houwing

unread,
Mar 5, 2008, 3:41:24 PM3/5/08
to
Hello Peter,

> On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft.com> wrote:
>
>> I need to find the fastest way in terms of storage and searching to
>> determine if a given string contains one of a member of a list of
>> strings.
>>
> I don't know if RegEx has optimizations for dealing with this sort of
> thing. It could, but given how long a search string for RegEx could
> be if you concatenate all of your possible matches, maybe it doesn't.

The .NET regex engine does not optimize for partial matches, but you can
derive those yourself so that:

New York|New Zealand
becomes (new (York|Zealand))

which would perform better. I know of certain tools online used by the SpamAssassin
project which can do this for you. It shouldn't be to hard to build by hand
yourself though.

also if new Zealand is found more often statistically, you can optimize the
regex by putting that option in front so that:

New York|New Zealand
becomes (new (Zealand|York))

Normally a () will result in a group being created. This eats performance.
You have 2 ways to suppress this.
First:
Add the option RegexOptions.ExplicitCapture to the regex object's constructor
Second
add ?: in all the () like this (?:new (?:Zealand|York))

If you have devided your sub strings into the smalles overlapping subgroups,
you can also start to capture greedily using the (?>...) group construct
which will improve performance.

If you know the casing beforehand, it's probably faster to not specify the
RegexOptions.IgnoreCase. But you'd have to test that, I never tried. As in:
[Nn]ew [Yy]ork
is probably faster than
New York + RegexOptions.IgnoreCase

If you refrain from certain regex tricks you could try to see if RegexOptions.ECMACompliant
is faster than a normal regex, haven't tried either.

Lastly use the RegexOptions.Compiled option to make your regex faster. The
first call will suffer a severe hit, but the rest becomes much faster (usually).
Store the Regex in a static readonly Regex and you'll only have to do it
once.

> That'd be my first try though. It seems like the actual RegEx search
> string would be simple (just "or" all of your possible matches
> together). If it performs well enough, there you go.

And if not, there is room to optimize ;). I'm willing to help on that if
needed.

In the end the regex solution will never be the fastest solution possible.
Using unsafe string manipulation will be much faster, but also a lot harder
to implement. If you can reach the required performance using Regex, then
I say go for it. But that's your perogative.

> If not, I'd guess there's a fair amount of research into algorithms
> like this, but a fairly basic approach is a state graph. It's memory
> intensive, but rewards you with good performance. This came up awhile
> ago and I posted some sample code to get someone else started. You
> can find that post here:
> http://groups.google.com/group/microsoft.public.dotnet.languages.cshar
> p/msg/0f06f696d4500b77
>
> I've referred to this post a couple of other times, and while I've
> never had anyone say it actually turned out to be useful, they've
> never said it wasn't either. :)
>
> I suppose if I'm going to keep referring people to it, maybe I ought
> to clean it up a little bit more. But what's there does work and
> should be enough to get you pointed in the right direction.
>
> Pete
>

--
Jesse Houwing
jesse.houwing at sogeti.nl


Jack Jackson

unread,
Mar 5, 2008, 7:02:34 PM3/5/08
to
One problem I see with this is that you apparently are not doing
simply a string search, but some kind of word search, without defining
what a word is.

You say the "Smithperson said ..." should not have any matches, so
while the string Smith is a substring, it does not match.

Would it match:
(Smith
(Smith)
Smith,
Smith;
Smith-
smith

etc.

The answer to this will probably constrian your possible algorithms.
If it didn't get too many false hits, doing a string search first and
then re-checking to see if the matches are valid word matches might be
OK.

Karch

unread,
Mar 5, 2008, 8:20:44 PM3/5/08
to
The list is static (read into memory during initialization) and could be up
to approximately 1000 items (hence the need for a fast method)

"John Daragon" <jo...@argv.co.uk> wrote in message
news:47cee409$0$21865$db0f...@news.zen.co.uk...

Karch

unread,
Mar 6, 2008, 1:08:22 PM3/6/08
to
I think the StateGraph approach is going to work best. I have an
implementation up and running and should be able to get some timings done
today. I don't think Regex is an option for me because of performance and
the fact that, although the list itself is static, it will be loaded into
memory at run-time. So, the expression building would be complex, not to
mention eating cycles just to prep.

"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message
news:op.t7j3u...@petes-computer.local...

Karch

unread,
Mar 6, 2008, 1:10:14 PM3/6/08
to
These are novel approaches, but in my case - after looking into it further -
I don't think they would be a good option because of how I need to load the
list data initially and then the actual expression building step required.
Then, of course, the regex evaluation itself.

"Jesse Houwing" <jesse....@newsgroup.nospam> wrote in message
news:21effc90469a78...@news.microsoft.com...

Karch

unread,
Mar 6, 2008, 1:12:51 PM3/6/08
to
The Hashtable would work great alongside a tokenized string, but the problem
is that I don't have a distinct delimiter since the string to be matched can
span multiple words. It's really a question of "do any of these strings
appear in this source string", irrespective of whitespace.

"John Daragon" <jo...@argv.co.uk> wrote in message

news:47ceed14$0$32055$da0f...@news.zen.co.uk...

Peter Duniho

unread,
Mar 6, 2008, 1:19:51 PM3/6/08
to
On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft.com> wrote:

> I think the StateGraph approach is going to work best. I have an
> implementation up and running and should be able to get some timings done
> today.

Have you tried the dictionary approach suggested by a couple of others?

I wasn't paying attention when I first read your question and failed to
note that your search is delimited by spaces. Or, at least in the
examples you provided it was.

I think that if your data is actually restricted like that, the dictionary
lookup might be as fast as a state graph and it would be a lot simpler in
terms of implementation.

Just a thought. Obviously I really like state graphs :), but when there
is information you know about the input data that allows you to constrain
the problem a bit (e.g. by using spaces to identify the beginning and
ending of any possible match), it's often possible to solve the problem
efficiently without using a state graph.

Pete

Karch

unread,
Mar 6, 2008, 1:16:28 PM3/6/08
to

"Barry Kelly" <barry....@gmail.com> wrote in message
news:pjtts3l33v5sv7b6b...@4ax.com...

> (Followups set to microsoft.public.dotnet.languages.csharp)
>
> This sounds awfully like a college problem or pre-job technical
> assignment.

Nope. If it were, I would probably have a nice fresh memory of my data
structures and algorithms class and wouldn't be posting here.

>
> In order for 'Smith' in the list to not match against 'Smith' (followed
> by 'erson') in this example, there needs to be either (1) an implied
> "word-break" zero-width assertion after each element in the list
> (similar to regex '$', '\W', '\>', etc.), or (2) an implied convention
> for splitting up the sentence into fragments that can be matched against
> the list - tokenizing, in a word. The missing information (1) can be
> converted into (2) and vice versa. You need to preprocess - implicitly
> via algorithm or explicitly via data munging - either the list or the
> sentence to reflect the hidden constraints you haven't described.
>
> However, once you've done that, the solution falls out fairly easily. A
> prefix trie which includes an edge for the assertion along the lines of
> (1) above should solve the problem trivially, or alternatively
> extracting the appropriate token using the appropriate rules of (2) and
> using simple hash lookup should also work. Matches should cost
> proportional to the length of the matched string in both cases, while
> mismatches might be discarded more cheaply using (1); however the
> constant factors involved, and in particular cache locality loss from a
> trie, may make (2) cheaper even for mismatches, especially if you have a
> lot of spurious prefixes. You'll need to measure.

Yeah, this would work really good - but one point that I didn't mention is
that I can't really tokenize the string since the strings to match could
span multiple words and whitepace. I looked at the data and I couldn't find
a realiable way to tokenize.

Peter Duniho

unread,
Mar 6, 2008, 1:22:46 PM3/6/08
to
On Thu, 06 Mar 2008 10:12:51 -0800, Karch <news.microsoft.com> wrote:

> The Hashtable would work great alongside a tokenized string, but the
> problem
> is that I don't have a distinct delimiter since the string to be matched
> can
> span multiple words. It's really a question of "do any of these strings
> appear in this source string", irrespective of whitespace.

Then how do you distinguish "Smith" from "Smitherson", as in your
example? Can you at least make a requirement that the search pattern will
always begin and end on a space, even if it includes spaces within?

Pete

Jesse Houwing

unread,
Mar 6, 2008, 3:27:14 PM3/6/08
to
Hello Karch" news.microsoft.com,

> These are novel approaches, but in my case - after looking into it
> further - I don't think they would be a good option because of how I
> need to load the list data initially and then the actual expression
> building step required.

Keep in mind that if you the list is static after load, it would be a one
time thing. You could even use an event to trigger the reload of teh regex
if needed. But I suspect that a dedicated string algorithm is the way to
go here.

> Then, of course, the regex evaluation itself.

Only timing will tell, but my guess is that a dedicated algorithm in the
end will be faster.

Jesse

>>> ar p/msg/0f06f696d4500b77

Karch

unread,
Mar 6, 2008, 3:35:23 PM3/6/08
to
As I mentioned, I do have the ability to reorder the data, so I'll have to
make sure that sub-patterns occur after search patterns. Or separate the
searches into two passes. I don't really care so much if I find both "Smith"
AND "Smitherson" - as long as I find at least one of them in the source
string. As far as your second question, I'm not sure. For example, I am
going to have things like this in my search list:

"Joe Smith ("
"Jones"
"Jane Jones"
"Jim Stewart/"

The source string (the one to be searched) could be anything. All I care
about is if one of those string in the list appears in the source string. As
soon as I find one, I can stop the searching and return "true". I don't need
to verify all the places that any of the strings in the list occur. Does
that make more sense?


"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7lxb...@petes-computer.local...

Mufaka

unread,
Mar 6, 2008, 4:49:51 PM3/6/08
to
Karch wrote:
> The Hashtable would work great alongside a tokenized string, but the problem
> is that I don't have a distinct delimiter since the string to be matched can
> span multiple words. It's really a question of "do any of these strings
> appear in this source string", irrespective of whitespace.
>
<snip>
This is a little different than your original post, but if you can't
delimit search string, you might try something like the following:

Hash the static list of words
Keep a list of unique lengths for those words
for each unique length, iterate the string pulling out substrings of
that length
see if that substring is in your hashed list of words

It at least reduces the amount of InString's you'd have to do.

I'd be interested in seeing how this performs against large strings.
Here's my test code:

public class WordMatch
{
private Dictionary<string, string> _words = new
Dictionary<string,string>();
private Dictionary<int, int> _lengths = new Dictionary<int,int>();

public void SetWords(List<string> words)
{
foreach (string word in words)
{
string lowerWord = word.ToLower();
if (!_words.ContainsKey(lowerWord))
{
_words.Add(lowerWord, lowerWord);
}

if (!_lengths.ContainsKey(word.Length))
{
_lengths.Add(word.Length, word.Length);
}
}
}

public bool StringHasMatch(string text)
{
string lowerText = text.ToLower();
foreach (int length in _lengths.Keys)
{
if (text.Length < length) break;

int ndx = 0;

while (ndx + length <= text.Length)
{
if (_words.ContainsKey(lowerText.Substring(ndx, length)))
{
return true;
}
ndx++;
}
}
return false;
}
}

Mufaka

unread,
Mar 6, 2008, 5:07:41 PM3/6/08
to
Mufaka wrote:
> Karch wrote:
>> The Hashtable would work great alongside a tokenized string, but the
>> problem is that I don't have a distinct delimiter since the string to
>> be matched can span multiple words. It's really a question of "do any
>> of these strings appear in this source string", irrespective of
>> whitespace.
>>
> <snip>
> This is a little different than your original post, but if you can't
> delimit search string, you might try something like the following:
>
> Hash the static list of words
> Keep a list of unique lengths for those words
> for each unique length, iterate the string pulling out substrings of
> that length
> see if that substring is in your hashed list of words
>
> It at least reduces the amount of InString's you'd have to do.
>
> I'd be interested in seeing how this performs against large strings.
> Here's my test code:
<snip>
>
> if (text.Length < length) break;
<snip>
That should be continue instead of break.

Karch

unread,
Mar 6, 2008, 11:58:29 PM3/6/08
to
WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47
Execution time: 0.055008 seconds.


Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

"Karch" <news.microsoft.com> wrote in message

news:eXx$RU7fIH...@TK2MSFTNGP03.phx.gbl...

Peter Duniho

unread,
Mar 7, 2008, 3:24:13 AM3/7/08
to
On Thu, 06 Mar 2008 20:58:29 -0800, Karch <nos...@absotutely.com> wrote:

> WOW! Really disappointing timings for the state machine method. Here are
> the
> results:
>
> Method: Looping through 359 strings doing an if (IndexOf > 0) and a
> source
> string of length = 47
> Execution time: 0.055008 seconds.
>
> Method: Using the StateMachine method on 359 strings and a source string
> of
> length = 47
> Execution time: 2.359223 seconds.

Are you counting the time it takes to create the state graph? How many
trials is that? Just one? What is the test string like? Have you tried
testing the worst-case scenario, which is a source string that has zero
matches?

Basically, those times sound wrong to me. Or, at least the second time
(the first seems plausible). The only way I could see the state graph
taking almost 2.5 seconds to process a 47 character string is if you are
including the time it takes to build the state graph (which you shouldn't,
since you only have to do that once).

Even then, with only 359 strings it seems odd. Even assuming 10-character
strings on average with _no_ overlapping characters (worst-case for
allocations), that's only 35900 iterations (3590 characters, with a
10-node traversal in the graph) to build the graph. I'm not used to
seeing .NET only being able to handle ~10K of "something" per
second...that's pretty slow.

Regardless of how slow building the graph is, once that's done processing
an actual string should be extremely quick.

Can you post some sample code and test data? A simple
concise-but-complete console application that demonstrates the two
different algorithms should be sufficient, along with whatever input data
you're testing of course.

By the way, if you _are_ building the graph each time and for some reason
this isn't something you can fix, then no...a state graph might not help.
It depends a lot on how large your input data is, but if you're only
processing ~50 characters every time you start with a fresh search list,
then even if a state graph or other solution is faster than brute force,
it probably wouldn't be by much (and while a 50X disparity seems wrong to
me, it actually wouldn't surprise me for brute force to be faster than a
state graph for such short source strings if you have to rebuild the state
graph for each search).

Pete

Peter Duniho

unread,
Mar 7, 2008, 4:49:01 AM3/7/08
to
On Thu, 06 Mar 2008 20:58:29 -0800, Karch <nos...@absotutely.com> wrote:

> WOW! Really disappointing timings for the state machine method.

Well, I was right that it shouldn't have been that bad. But it was my
fault that it was that bad.

I wrote that code a long time ago, before I understood just how costly
throwing and catching an exception can be. Practically all of the time
spent in processing your string is likely to be in a single method in the
code I posted, where I use an exception to detect non-existing keys. That
was a very bad choice on my part. I've since learned my lesson, but I
didn't ever update that code.

Anyway, you should be able to dramatically improve the performance of the
state graph implementation by replacing the StateNode.NextState() method
with this code:

public StateNode NextState(TransitionKey tk, bool fAdd)
{
StateNode state = null;

if (_mpchstate.ContainsKey(tk))
{
state = _mpchstate[tk];
}
else if (fAdd)
{
state = new StateNode();
_mpchstate.Add(tk, state);
}

return state;
}

Sorry about that. I _did_ say that the code was in need of review. :)
There are other things I would do differently were I writing that code
today, but I think that's the most serious problem (most of the other
issues are stylistic in nature).

With that change on my own machine, I went from taking 1.5 seconds to
process a 50-character string (with no nodes in the graph, which is the
worst-case scenario) to taking 0.005 seconds. I think that should make it
more competitive with the brute force approach. :)

Pete

John Daragon

unread,
Mar 7, 2008, 8:02:02 AM3/7/08
to
Karch wrote:
Snip 8<

>
> Yeah, this would work really good - but one point that I didn't mention is
> that I can't really tokenize the string since the strings to match could
> span multiple words and whitepace. I looked at the data and I couldn't find
> a realiable way to tokenize.
>
Snip 8<

Is the data in your list arbitrarily complex? I guess what I'm getting
at is: if the majority of the strings to match are simple strings with
no embedded spaces or other special characters, it may well be possible
to build a hashtable (fixated, me?) that caters for the simplest
(tokenised) case, but which returns an object identifying whether this
string can be standalone or is a prefix for a more complex one.
Everything apart from the simple case would have to be handled
exceptionally, but if they're few and far between you may well still be
a long way ahead of the game adopting a technique that is based on hashing.


jd

Jesse Houwing

unread,
Mar 7, 2008, 9:37:36 AM3/7/08
to
Hello Karch,

> WOW! Really disappointing timings for the state machine method. Here
> are the results:
>
> Method: Looping through 359 strings doing an if (IndexOf > 0) and a
> source
> string of length = 47
> Execution time: 0.055008 seconds.
> Method: Using the StateMachine method on 359 strings and a source
> string of
> length = 47
> Execution time: 2.359223 seconds.

Could you post the code for both? This I'd like to see why it's so much slower...

My guess is it somehow creates massive amoutns of strings to allocate...
but I could be wrong...

Jesse

Peter Duniho

unread,
Mar 7, 2008, 1:05:24 PM3/7/08
to
On Fri, 07 Mar 2008 06:37:36 -0800, Jesse Houwing
<jesse....@newsgroup.nospam> wrote:

> Could you post the code for both? This I'd like to see why it's so much
> slower...
>
> My guess is it somehow creates massive amoutns of strings to allocate...
> but I could be wrong...

You are. :) See my other replies...the code I posted earlier and
referred to in this thread was actually kind of stupid, as it used an
exception as a normal part of processing. Very bad idea.

Hopefully the OP will actually come back here and see my fix for the
problem. Sometimes people start threads and then once they feel they've
reached a conclusion, whether useful or not, they just abandon the
newsgroup until they need help again.

Now, if only I could figure out a way to retroactively fix the post I made
before. :) I'm not entirely proud of that code for a variety of reasons,
but I'm especially embarassed that it has that serious performance bug in
it. It makes the class entirely useless for the intended purpose. :(

Pete

Karch

unread,
Mar 12, 2008, 11:54:49 PM3/12/08
to
WOW! Really disappointing timings for the state machine method. Here are the
results:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47
Execution time: 0.055008 seconds.


Method: Using the StateMachine method on 359 strings and a source string of
length = 47
Execution time: 2.359223 seconds.

In my case, I can terminate the search as soon as ONE of the strings is
found because the list is mutually exclusive. In the sample code that you
provided, at what point do I know I have a match and can terminate the
traversal?

Also, to answer your question, I do not have space delimiters - the strings
can span multiple words, etc.

"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7lw7...@petes-computer.local...

Peter Duniho

unread,
Mar 13, 2008, 2:25:16 AM3/13/08
to
On Wed, 12 Mar 2008 20:54:49 -0700, Karch <news.microsoft.com> wrote:

> WOW! Really disappointing timings for the state machine method.

Did you really just post this message? Have you read my messages in reply
to your previous (very similar) post? The state machine will work very
well, once you incorporate the bug-fix I provided.

> [...]


> In my case, I can terminate the search as soon as ONE of the strings is
> found because the list is mutually exclusive. In the sample code that you
> provided, at what point do I know I have a match and can terminate the
> traversal?

The code I posted doesn't provide that functionality. However, it's not
hard to add. You could add a property to the StateNode class that tests
to see if it's at a "terminal" state (in the code I posted, this would be
true when the private field "_lobj" is a non-zero-length list...that is,
"_lobj.Count > 0").

Note, of course, that this graph supports non-leaf nodes with
terminations. That's so you can have search strings that overlap each
other, like "Smith" and "Smitherson". You'll need to decide for yourself
what an appropriate action to take is in that situation and implement it
accordingly.

> Also, to answer your question, I do not have space delimiters - the
> strings
> can span multiple words, etc.

Okay...in that case, I think the state graph may well be one of the better
options for you.

Pete

Peter Duniho

unread,
Mar 13, 2008, 2:31:10 AM3/13/08
to
Sorry, some follow-up points (it's late, I'm getting sloppy)...

On Wed, 12 Mar 2008 23:25:16 -0700, Peter Duniho
<NpOeS...@nnowslpianmk.com> wrote:

>> [...]
>> In my case, I can terminate the search as soon as ONE of the strings is
>> found because the list is mutually exclusive. In the sample code that
>> you
>> provided, at what point do I know I have a match and can terminate the
>> traversal?
>
> The code I posted doesn't provide that functionality. However, it's not
> hard to add. You could add a property to the StateNode class that tests
> to see if it's at a "terminal" state (in the code I posted, this would
> be true when the private field "_lobj" is a non-zero-length list...that
> is, "_lobj.Count > 0").

Of course, you could also just note the return value for the
"RgobjFromNextState()" method, and if it's a non-zero-length list, that's
a terminal node. No need to modify the StateNode class at all in that
case.

> Note, of course, that this graph supports non-leaf nodes with
> terminations. That's so you can have search strings that overlap each
> other, like "Smith" and "Smitherson". You'll need to decide for
> yourself what an appropriate action to take is in that situation and
> implement it accordingly.

Okay, so...since you did point out that your search strings are "mutually
exclusive" -- I assume this means that you know for sure your source text
will only ever include one of the strings that's in the list of search
strings -- then obviously you can terminate the traversal of the state
graph as soon as you reach a state where the list of objects for a node
has a non-zero length.

Pete

Karch

unread,
Mar 13, 2008, 11:56:58 AM3/13/08
to
Sorry, I must have missed both my earlier post and your bug-fix post. Could
you please repost the bug-fix provided? I would like to try the timings
again with the additional changes.

"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7xys...@petes-computer.local...

Peter Duniho

unread,
Mar 13, 2008, 1:13:31 PM3/13/08
to
On Thu, 13 Mar 2008 08:56:58 -0700, Karch <news.microsoft.com> wrote:

> Sorry, I must have missed both my earlier post and your bug-fix post.
> Could
> you please repost the bug-fix provided? I would like to try the timings
> again with the additional changes.

I would have just pointed you to the Google Groups archive, with a link to
the message. But for some reason, Google is having trouble today. When I
first looked at this thread on their archives, I could not see the most
recent messages (Google claimed to only have the first 16). Then when I
looked again, Google sent me back an error message saying it couldn't even
get at the archive for this newsgroup.

Oh well.

In the meantime, here's my previous post with the fix, copied and pasted:

On Thu, 06 Mar 2008 20:58:29 -0800, Karch <nos...@absotutely.com> wrote:

> WOW! Really disappointing timings for the state machine method.

Well, I was right that it shouldn't have been that bad. But it was my

fault that it was that bad.

I wrote that code a long time ago, before I understood just how costly
throwing and catching an exception can be. Practically all of the time
spent in processing your string is likely to be in a single method in the
code I posted, where I use an exception to detect non-existing keys. That
was a very bad choice on my part. I've since learned my lesson, but I
didn't ever update that code.

Anyway, you should be able to dramatically improve the performance of the
state graph implementation by replacing the StateNode.NextState() method
with this code:

public StateNode NextState(TransitionKey tk, bool fAdd)
{
StateNode state = null;

if (_mpchstate.ContainsKey(tk))
{
state = _mpchstate[tk];
}
else if (fAdd)
{
state = new StateNode();
_mpchstate.Add(tk, state);
}

return state;
}

Sorry about that. I _did_ say that the code was in need of review.

There are other things I would do differently were I writing that code
today, but I think that's the most serious problem (most of the other
issues are stylistic in nature).

With that change on my own machine, I went from taking 1.5 seconds to
process a 50-character string (with no nodes in the graph, which is the
worst-case scenario) to taking 0.005 seconds. I think that should make it
more competitive with the brute force approach.

Pete

Karch

unread,
Mar 13, 2008, 6:37:45 PM3/13/08
to
New timings including changes are MUCH better. And the state machine wins:

Method: Looping through 359 strings doing an if (IndexOf > 0) and a source
string of length = 47

Execution time: 0.533385 seconds.


Method: Using the StateMachine method on 359 strings and a source string of
length = 47

Execution time: 0.351235 seconds.

34% improvement!

Thanks.

"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7yss...@petes-computer.local...

Peter Duniho

unread,
Mar 13, 2008, 6:55:57 PM3/13/08
to
On Thu, 13 Mar 2008 15:37:45 -0700, Karch <news.microsoft.com> wrote:

> New timings including changes are MUCH better. And the state machine
> wins:

I thought it might. :)

Keep in mind also that it appears you're roughly around the break-even
point between the two implementations. Depending on the actual input data
and how you're using it, the state graph can potentially be much better
than the IndexOf() implementation.

In fact, based on the numbers you posted, I suspect that you're including
the cost of creating the state graph in the total time. For best results,
you would actually just create the state graph once and then reuse the
same graph each time you need to do a search. In that case, the cost of
creating the state graph should not be included in the performance
comparison.

With 359 search strings, the IndexOf() implementation would scan the input
string 359 times as compared to having to scan it only 1 time for the
state graph (in the worst-case scenario, that is: when none of the search
strings are in the input string). Even in the best case -- when the very
first search string is in the input string -- the two would be
equivalent. The average case would have the IndexOf() implementation 180
times slower than the state graph.

In other words, a mere 34% improvement is remarkably tiny given the actual
difference in efficiency of the two algorithms. If you were to perform
the timing test using the same state graph (that is, creating it only
once) on a larger number of iterations, you'd likely find a difference of
100x or more between the two, with the state graph winning. And
presumably, if performance is actually important to you, the number of
iterations in the real-world situation would be high enough to realize
that kind of performance advantage, assuming you use the state graph
correctly by only creating it once.

Pete

Karch

unread,
Mar 13, 2008, 7:35:05 PM3/13/08
to
I only create the state graph one time and I only populate the list of
strings one time. So, those are not interfering with proper timings; I am
only timing the searches. But, just for fun, here is the same test with 1M
iterations. The ratio is about the same.

Method: Looping with IndexOf
Execution time: 53.548416 seconds.

Method: State Machine
Execution time: 31.109533 seconds.


"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7y8n...@petes-computer.local...

Ben Voigt [C++ MVP]

unread,
Mar 13, 2008, 7:41:20 PM3/13/08
to

It's now O(N) instead of O(N * M), but the constant has likely changed by a
factor of 5 or more. I'd guess the speedup would be closer to 20x than
180x.

>
> Pete


Peter Duniho

unread,
Mar 13, 2008, 7:43:51 PM3/13/08
to
On Thu, 13 Mar 2008 16:35:05 -0700, Karch <news.microsoft.com> wrote:

> I only create the state graph one time and I only populate the list of
> strings one time. So, those are not interfering with proper timings; I am
> only timing the searches. But, just for fun, here is the same test with
> 1M
> iterations. The ratio is about the same.

What is your input data? Does the input string contain any of the search
strings? If so, where is that search string in the list of search
strings? Are you doing 1 million identical searches, or are you somehow
randomizing the input?

Basically, that's a fairly minimal improvement relative to the potential
of the state graph versus a brute force IndexOf() solution. So either
your test case is somehow favorable to the IndexOf(), or there's still
some remaining problem with the state graph implementation.

Pete

Karch

unread,
Mar 13, 2008, 8:02:49 PM3/13/08
to
Yes, you are right; my input data is not as random as it could be. I just
wanted to get a quick snapshot, so it's likely that the improvement would be
greater over a larger and more varied set. I'll put together a more
realistic set of test data and report back.

"Peter Duniho" <NpOeS...@nnowslpianmk.com> wrote in message

news:op.t7zav...@petes-computer.local...

Peter Duniho

unread,
Mar 13, 2008, 8:29:56 PM3/13/08
to
On Thu, 13 Mar 2008 16:41:20 -0700, Ben Voigt [C++ MVP]
<r...@nospam.nospam> wrote:

> It's now O(N) instead of O(N * M), but the constant has likely changed
> by a
> factor of 5 or more. I'd guess the speedup would be closer to 20x than
> 180x.

You're right that my analysis was too simplistic. However, I'm not sure
that the 5x ratio is necessarily a precise estimate either. The
performance of the IndexOf() implementation depends a _lot_ on the input
data. Which string matches and where it appears in the input string plays
a big part in performance (as noted already), but also how close the
search strings match to non-matching text in the input string. IndexOf()
can waste a lot of time trying to match strings that eventually fail and
doing that can bring the cost of testing any one string much closer to the
cost of testing any one node in the state graph (which is, as you surely
noted, is where the constant difference you're talking about comes from).

That said, yes...I'd agree that a more conservative estimate of potential
speed difference might be 20x. Whatever the estimate, it's at least an
order of magnitude difference and potentially much greater, rather than
something relative small like the 34% observed here. On average, that
is...obviously for at least one particular test case, the difference is in
fact 34%. :)

Pete

Reply all
Reply to author
Forward
0 new messages