
Finding the longest common prefix over a list of strings


Ilya Zakharevich

Nov 21, 1997

In article <652nbl$kg...@eccws1.dearborn.ford.com>,
Ken Fox <kf...@ford.com> wrote:
> The last couple of "what's the best way to do it" questions put
> me into a regex hunting mood. I had a bit of code in a file
> browser class which finds the common string prefix in a list of
> strings. Here's the basic algorithm:

If *using* Text::Trie is overkill, you may try looking for the
*algorithm* used there.

Ilya

Ken Fox

Nov 21, 1997

Ilya Zakharevich <il...@math.ohio-state.edu> writes:
> Ken Fox <kf...@ford.com> wrote:
> > ... finds the common string prefix in a list of strings.

>
> If *using* Text::Trie is overkill, you may try looking for the
> *algorithm* used there.

The algorithm you use in the Trie module has much better performance
than mine. Theoretically that shouldn't be the case, but Perl is
very slow on substr() operations and your algorithm does a lot fewer
of them. Then again, computers aren't theoretical devices either. ;)

For everybody else, here's my implementation of Ilya's algorithm:

my @words = qw(the thick thesis thoroughly threw thor);

my $prefix = shift @words;
my $len = length($prefix);

foreach (@words) {
    while (substr($_, 0, $len) ne $prefix) {
        --$len;
        $prefix = substr($prefix, 0, $len);
    }
}

print "$prefix\n";

Perl's regex is *still* faster than Ilya's algorithm though, but only
by a few percent.

So, now there's a different question. Given the fact that this new
algorithm is just about as fast as the regex approach, which would you
rather use/maintain? Here's the regex again:

(join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;

print "$1\n";

- Ken

--
Ken Fox (kf...@ford.com)                 | My opinions or statements do
                                         | not represent those of, nor are
Ford Motor Company, Powertrain           | endorsed by, Ford Motor Company.
Analytical Powertrain Methods Department |
Software Development Section             | "Is this some sort of trick
                                         | question or what?" -- Calvin

Tushar Samant

Nov 22, 1997

kf...@ford.com writes:
>For everybody else, here's my implementation of Ilya's algorithm:

How about benchmarking with some kind of check here?


| my @words = qw(the thick thesis thoroughly threw thor);
|
| my $prefix = shift @words;
| my $len = length($prefix);
|
| foreach (@words) {
|
+--> last unless $len;
|
|     while (substr($_, 0, $len) ne $prefix) {
|         --$len;
|
+-------> $prefix = substr($prefix, 0, $len);
|     }
| }
|
| print "$prefix\n";
Why not just chop $prefix? After all, $len is always length($prefix) ...
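With both changes folded in, the loop would read (a minimal sketch,
same setup as above):

    foreach (@words) {
        last unless $len;
        while (substr($_, 0, $len) ne $prefix) {
            --$len;
            chop $prefix;    # keeps $len == length($prefix)
        }
    }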


>Perl's regex is *still* faster than Ilya's algorithm though, but only
>by a few percent.
>
>So, now there's a different question. Given the fact that this new
>algorithm is just about as fast as the regex approach, which would you
>rather use/maintain? Here's the regex again:
>
> (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;
>
> print "$1\n";

This is as understandable, or more so. Besides, aren't both pieces of code
going to get commented with pretty much the same thing?

You might want to put the commas in front, maybe include them in the
pattern and benchmark that...


Abigail

Nov 23, 1997

Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
<URL: news:65877a$3...@shoga.wwa.com>:
++ kf...@ford.com writes:
++ >For everybody else, here's my implementation of Ilya's algorithm:
++
++ How about benchmarking with some kind of check here?

How about trying to *analyze* both algorithms? Let's say we have
N words, and each word is at most k characters long.

++ | my @words = qw(the thick thesis thoroughly threw thor);
++ |
++ | my $prefix = shift @words;
++ | my $len = length($prefix);
++ |
++ | foreach (@words) {

So, we will have O (N) iterations through this loop.

++ |
++ +--> last unless $len;

That's O (1).

++
++ while (substr($_, 0, $len) ne $prefix) {

Clearly, a single test takes O (k) time.

++ --$len;
++ +-------> $prefix = substr($prefix, 0, $len);

With a chop here, both actions take O (1).

++ | }
++ | }

Now, note that the inner loop is iterated at most k times - the
length is always decreasing. So, the runtime for this algorithm
is bounded by O (k * N). (With $prefix = substr($prefix, 0, $len)
in the inner loop, the bound would be O (k * (N + k)).)

++ |
++ | print "$prefix\n";
++ |
++ |
++ Why not just chop $prefix? After all $len is always length($prefix) ...
++
++
++ >Perl's regex is *still* faster than Ilya's algorithm though, but only
++ >by a few percent.

Let's analyze it a bit further before we jump to conclusions, shall we?

++ >So, now there's a different question. Given the fact that this new
++ >algorithm is just about as fast as the regex approach, which would you
++ >rather use/maintain? Here's the regex again:
++ >
++ > (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;

For this, I'm going to prove a lower bound on the run time for a
specific example. I will use W for Omega, as ASCII doesn't have
Greek symbols.

Take N - 1 words of length k all sharing the same prefix, and let that
prefix be at least k/2 characters long. Let the last word not have
any prefix in common with the rest (except for the empty string).

The join will generate a string of length W (k * N). Due to the
greediness and backtracking of the regex machine, after it discovers
that a certain prefix didn't make it, the regex machine
will try a prefix one shorter, and start all over again.

After finding the common prefix for the first N - 1 words, it
has to scan W (k * N) words before it fails and needs to backtrack.
It needs to backtrack W (k) times. Hence, a total running time of
W (k^2 * N).

Clearly, Ilya's algorithm beats the regex hands down.
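To make the construction concrete, here is one way to generate such
an input (my own sketch; the values of $N and $k are arbitrary):

    my ($N, $k) = (1000, 20);
    my $shared = "a" x ($k / 2);
    # N - 1 words of length k sharing a k/2-character prefix ...
    my @words = map { $shared . sprintf("%0*d", $k / 2, $_) } 1 .. $N - 1;
    # ... and a final word with no prefix in common with the rest.
    push @words, "z" x $k;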


I am almost certain that with a bit more careful analysis, I would
be able to prove that Ilya's algorithm (with Tushar's suggestion) is
linear (and hence optimal) in the size of the input.

++ > print "$1\n";
++
++ This is as understandable, or more. Besides, aren't both pieces of code
++ going to get commented with pretty much the same thing?

Any program that needs to be maintained probably needs to be
efficient too. So, there would be no place for the regex.

(The regex is much cuter though.)

++ You might want to put the commas in front, maybe include them in the
++ pattern and benchmark that...


No need for a benchmark I would say.

Abigail
--
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'

Mark Mielke

Nov 23, 1997

abi...@fnx.com (Abigail) writes:
> Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
> ++ >Perl's regex is *still* faster than Ilya's algorithm though, but only
> ++ >by a few percent.
> Let's analyze it a bit further, before we drop to conclusions, shall we?

> ++ >So, now there's a different question. Given the fact that this new
> ++ >algorithm is just about as fast as the regex approach, which would you
> ++ >rather use/maintain? Here's the regex again:
> ++ > (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;

> [ cool interesting proof that in theory Ilya's algorithm is much faster ]

> Clearly, Ilya's algorithm beats the regex hands down.
> I am almost certain that with a bit more careful analysis, I would
> be able to prove that Ilya's algorithm (with Tushar's suggestion) is
> linear (and hence optimal) in the size of the input.

I think the difference is that the regexp code is written in optimized C.
Ilya's may be more efficient in theory, but in practice it may be the same
speed (or a tad slower) because it is written in interpreted perl.

What if Ilya's algorithm was implemented in XS code? :-)

> Any program that needs to be maintained probably needs to be
> efficient too. So, there would be no place for the regex.
> (The regex is much cuter though.)

This is true. I particularly found the:


perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'

at the end of your posts intriguing... but after thinking about it
for a while I realized that although it may be efficient for the first call,
it doesn't do any prime number caching for the second, third etc. call...
so it's kinda useless :-)
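For anyone puzzled by that one-liner: it writes the number in unary
and lets backtracking hunt for a divisor. The same idea as a function
(a sketch):

    # "1" x $n writes $n in unary. /^1?$/ matches 0 and 1 (not prime);
    # /^(11+?)\1+$/ matches iff some block of two or more 1's tiles the
    # string exactly, i.e. iff $n is composite. No match means prime.
    sub is_prime {
        my $n = shift;
        return ("1" x $n) !~ /^1?$|^(11+?)\1+$/;
    }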

> No need for a benchmark I would say.

If you want raw speed... the benchmark way IS the way to go :-) If you
care about maintenance/portability/etc... stick to the theory and use
algorithms that have a much clearer/obvious approach. You may not be
the last maintainer of your code... :-)

> Abigail
> --
> perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'

hehehe that's cool :-)

mark

-- _________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Northern Telecom Ltd. |
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | Box 3511, Station 'C' |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, ON K1Y 4H7 |
ma...@nortel.ca / al...@freenet.carleton.ca |_______________________|

Tushar Samant

Nov 23, 1997

abi...@fnx.com writes:
>Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
><URL: news:65877a$3...@shoga.wwa.com>:
>++ kf...@ford.com writes:
>++ >For everybody else, here's my implementation of Ilya's algorithm:
>++
>++ How about benchmarking with some kind of check here?
>
>How about trying to *analyze* both algorithms?

I agree with the analysis. The reasons for suggesting benchmarking
are different.

The check I suggested will make a difference depending on what the
specific sample of inputs is. If the "typical" sample consists of
words where the first two or three already have no common prefix,
then the loop will abort regardless of the size of the list.
Benchmarking is one way of getting a hold on this kind of statistical
characteristic.

We could make further refinements -- for instance, if you get a
word which is shorter than the prefix so far, you chop the prefix
down by several characters at once, OR for instance you make a
pass (still linear) and start with the shortest word -- etc etc,
but it becomes less and less clear just how much these refinements
accomplish for an input population we don't know much about.
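The first of those refinements might look like this (a sketch, using
the chop version of the loop):

    foreach (@words) {
        last unless $len;
        if (length() < $len) {
            # clamp to the current word's length in one step,
            # instead of shedding one character at a time
            $len = length();
            $prefix = substr($prefix, 0, $len);
        }
        while (substr($_, 0, $len) ne $prefix) {
            --$len;
            chop $prefix;
        }
    }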

Agreed, the regex restarts the "loop" every time the prefix gets
smaller, and the prefix gets smaller only in steps of 1. So in bad
cases (e.g. a list of 101 words which are the decreasing substrings
of a 100-letter word) it will do a horrible job getting to the
common prefix "".

But that brings up two points: 1) what are all these constants--we
are imagining that the regex engine is going through a loop similar
to ours, but surely a "single operation" in that "loop" doesn't cost
the same. 2) Once again, what sort of population is it being sicked
on--if all it ever did any one time was about 20 words of length 7
on average, benchmarking becomes pretty much the only method of
comparison.


BTW I suggested "chop" not because I think substr is more expensive
(I have no idea if it is), but because it's easier to read:

$len = length($string); $len--;
$string = substr($string, 0, $len);

vs

$len = length($string); $len--;
chop $string;


Abigail

Nov 24, 1997

Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
<URL: news:65aacm$q...@shoga.wwa.com>:
++ abi...@fnx.com writes:
++ >Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
++ ><URL: news:65877a$3...@shoga.wwa.com>:
++ >++ kf...@ford.com writes:
++ >++ >For everybody else, here's my implementation of Ilya's algorithm:

++ >++
++ >++ How about benchmarking with some kind of check here?
++ >
++ >How about trying to *analyze* both algorithms?
++
++ I agree with the analysis.

Well, I don't. Even with the chop() there, the algorithm is
still O (k * (N + k)) due to the test. Here's an optimal version:


my @words = ("word1", "word2", "word3", ...., "wordN");
my $end = length $words [0] - 1;

foreach my $w (1 .. $#words) {
$end = length $words [$w] - 1 if $end >= length $words [$w];
foreach my $i (0 .. length $words [$w] - 1) {
if (substr ($words [$w - 1], $i, 1) ne
substr ($words [$w], $i, 1)) {
$end = $i - 1;
last;
}
}
last if $end < 0;
}
$prefix = substr $words [0], 0, $end + 1;


Let word [i] have length k_i; define K = Sum_i k_i.
Let N be the number of words; assuming we have no empty
words, N <= K.

It's easy to see each letter gets involved in a comparison
at most twice. Hence, we have at most O (K) iterations of
the inner loop, which all take O (1). We have O (N) iterations
of the outer loop; each iteration takes O (1) time, plus the time
we already charged to the inner loop.

Hence the total running time is O (K), which is linear in
the input size.

++ The reasons for suggesting benchmarking
++ are different.
++
++ The check I suggested will make a difference depending on what the
++ specific sample of inputs is. If the "typical" sample consists of
++ words where the first two or three already have no common prefix,
++ then the loop will abort regardless of the size of the list.
++ Benchmarking is one way of getting a hold on this kind of statistical
++ characteristic.
++
++ We could make further refinements -- for instance, if you get a
++ word which is shorter than the prefix so far, you chop the prefix
++ down by several characters at once, OR for instance you make a
++ pass (still linear) and start with the shortest word -- etc etc,
++ but it becomes less and less clear just how much these refinements
++ accomplish for an input population we don't know much about.

Hey! In the previous paragraph you come up with "typical samples",
but now you say you don't know much about the input...

++ Agreed, the regex restarts the "loop" every time the prefix gets
++ smaller, and the prefix gets smaller only in steps of 1. So in bad
++ cases (e.g. a list of 101 words which are the decreasing substrings
++ of a 100-letter word) it will do a horrible job getting to the
++ common prefix "".
++
++ But that brings up two points: 1) what are all these constants--we
++ are imagining that the regex engine is going through a loop similar
++ to ours, but surely a "single operation" in that "loop" doesn't cost

Ah, but the problem isn't constants. The problem is that the number
of operations of the regex solution is of a different *order* than the
number of operations of Ilya's solution.

++ the same. 2) Once again, what sort of population is it being sicked
++ on--if all it ever did any one time was about 20 words of length 7
++ on average, benchmarking becomes pretty much the only method of
++ comparison.
++
++
++ BTW I suggested "chop" not because I think substr is more expensive
++ (I have no idea if it is), but because it's easier to read:

The substr was an assignment, so you had to copy the string, which
is clearly at least linear in the length of the string. A chop only
needs to decrease the length of the string, which is just a field
in the structure backing that variable.
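The claim is easy to poke at with the Benchmark module (a sketch; the
10_000-character string is an arbitrary choice):

    use Benchmark qw(timethese);

    my $long = "x" x 10_000;
    timethese(50_000, {
        'substr copy' => sub { my $s = $long; $s = substr($s, 0, length($s) - 1) },
        'chop'        => sub { my $s = $long; chop $s },
    });

Both branches pay for the initial copy of $long; only the substr
branch pays for a second full copy.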

++
++ $len = length($string); $len--;
++ $string = substr($string, 0, $len);
++
++ vs
++
++ $len = length($string); $len--;
++ chop $string;

Well, wouldn't the following even be more readable:

chop $string;
$len = length ($string);

Abigail

Nov 24, 1997

Mark Mielke (ma...@nortel.ca) wrote on 1545 September 1993 in
<URL: news:lq17m9z...@bmers2e5.nortel.ca>:

++ abi...@fnx.com (Abigail) writes:
++ > Tushar Samant (scri...@shoga.wwa.com) wrote on 1545 September 1993 in
++ > ++ >Perl's regex is *still* faster than Ilya's algorithm though, but only
++ > ++ >by a few percent.
++ > Let's analyze it a bit further, before we drop to conclusions, shall we?
++
++ > ++ >So, now there's a different question. Given the fact that this new
++ > ++ >algorithm is just about as fast as the regex approach, which would you
++ > ++ >rather use/maintain? Here's the regex again:
++ > ++ > (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;
++ > [ cool interesting proof that in theory Ilya's algorithm is much faster ]
++
++ > Clearly, Ilya's algorithm beats the regex hands down.
++ > I am almost certain that with a bit more careful analysis, I would
++ > be able to prove that Ilya's algorithm (with Tushar's suggestion) is
++ > linear (and hence optimal) in the size of the input.
++
++ I think the difference is that the regexp code is written in optimized C.
++ Ilya's may be more efficient in theory, but in practice it may be the same
++ speed (or a tad slower) because it is written in interpreted perl.

No, Ilya's will be more efficient in practice as well. The fact that the
regex code is written in optimized C just means the input size where
it becomes slower is a bit larger.

++ What if Ilya's algorithm was implemented in XS code? :-)

Would not change the complexity.

++ > Any program that needs to be maintained probably needs to be
++ > efficient too. So, there would be no place for the regex.
++ > (The regex is much cuter though.)
++
++ This is true. I particularly found the:
++ perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
++ at the end of your posts intriguing... but after thinking about it
++ for a while I realized that although it may be efficient for the first call,
++ it doesn't do any prime number caching for the second, third etc. call...
++ so it's kinda useless :-)

It's extremely inefficient - exponential in the input size, since the
number is written out in unary.

++ > No need for a benchmark I would say.
++
++ If you want raw speed... the benchmark way IS the way to go :-) If you
++ care about maintenance/portability/etc... stick to the theory and use
++ algorithms that have a much clearer/obvious approach. You may not be
++ the last maintainer of your code... :-)

If you can't do an analysis of an algorithm, how can you interpret
the outcome of benchmarks? Or do you always know all the possible
inputs beforehand?

Ken Fox

Nov 25, 1997

abi...@fnx.com (Abigail) writes:
> Here's an optimal version:
>
> my @words = ("word1", "word2", "word3", ...., "wordN");
> my $end = length $words [0] - 1;
>
> foreach my $w (1 .. $#words) {
> $end = length $words [$w] - 1 if $end >= length $words [$w];
> foreach my $i (0 .. length $words [$w] - 1) {
^^^^^^^^^^^^^^^^^^^^^^
Should be $end?

> if (substr ($words [$w - 1], $i, 1) ne
> substr ($words [$w], $i, 1)) {
> $end = $i - 1;
> last;
> }
> }
> last if $end < 0;
> }
> $prefix = substr $words [0], 0, $end + 1;

In my original posting, I gave a similar (cleaner?) algorithm:

my @words = qw(the thick thesis thoroughly threw thor);

my $offset = 0;
my $reference = shift @words;
my $prefix;

# Note: the loop ends when some word first differs from $reference,
# so a list whose words are all identical would spin forever.
MATCH: while (1) {
    $prefix = substr($reference, $offset, 1);
    foreach (@words) {
        if (substr($_, $offset, 1) ne $prefix) {
            $prefix = substr($reference, 0, $offset);
            last MATCH;
        }
    }
    ++$offset;
}

print "$prefix\n";

This algorithm is O(N) where N is the number of words in the input.
Actually, it has run-time T(k*N+f) where N is the number of words in the
input, k is the length of the common prefix, and f is the number of
false prefix matches that are tested (f < N). For those that don't
know, it's standard in Big-O notation to drop all the constants used in
the run-time bound, so T(k*N+f) ends up O(N).

Abigail's algorithm is also O(N), but mine will short circuit as soon as
possible, whereas Abigail's might do a lot of work only to throw it away
on the last word. Try both algorithms on the input set of 100 "fubar"s
with one "snafu" at the end.

I (almost) agree with Abigail's analysis of the regex:

> ++ > (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;
>

> For this, I'm going to prove a lowerbound on the run time for a
> specific example. I will use W for Omega, as ascii doesn't have
> Greek symbols.
>
> Take N - 1 words of length k all sharing the same prefix, and let that
> prefix be at least k/2 characters long. Let the last word not have
> any prefix in common with the rest (except for the empty string).
>
> The join will generate a string of length W (k * N). Due to the
> greediness and backtracking of the regex machine, after it discovers
> that a certain prefix didn't make it, the regex machine
> will try a prefix one shorter, and start all over again.
>
> After finding the common prefix for the first N - 1 words, it
> has to scan W (k * N) words before it fails and needs to backtrack.

^^^^^
Should be characters.

> It needs to backtrack W (k) times. Hence, a total running time of
> W (k^2 * N).
>

> Clearly, Ilya's algorithm beats the regex hands down.

So regex is linear in N, but geometric in k. If we know that the common
prefix is short, then it's probably ok to use it. (Keep in mind that
this is a lower bound because it ignores all sorts of ugly stuff like
repeated scanning over the non-prefix parts of words and looking for
commas.) My original algorithm and Abigail's algorithm both have a
lower bound of W(k*N) though. If k is a constant then k^2 is too... ;)

I think Abigail's example of a long prefix is a specially formulated
worst case. Will it ever appear in practice? For *my* purposes, I'm
doubtful. I find in my sample data as N becomes larger, the prefix
length becomes smaller. For short prefixes the regex algorithm becomes
(nearly) linear. Caveat emptor.

Abigail's algorithm is superior to the regex from a theoretical view,
but it fails miserably. Why? Because Big-O notation ignores
computational "constants" and "implementation details". :-) Perl is
*much* faster at regex execution than it is at "normal" code execution.
The total work done in opcode dispatch, argument passing, etc. for
Abigail's optimized solution exceeds the regex execution, even though
the regex has poorer performance bounds.

I don't know if regex ever becomes worse than Abigail's algorithm, but
for my input (filenames with a known matching partial prefix), the regex
wins all test cases I've given it.

> Tushar Samant (scri...@shoga.wwa.com) writes:
> ++ The reasons for suggesting benchmarking are different. The check I
> ++ suggested will make a difference depending on what the specific sample
> ++ of inputs is. If the "typical" sample ...

You'll surely lose this argument with Abigail. You'll have to think
like a theorist to win. You raise a valid point: to perform a valid
performance analysis we need a sample dataset, but which dataset? How
about any of them? Abigail can analyze her algorithm with any input so
your argument doesn't float. She can even pick the worst case dataset
and give you a definitive formula for the algorithm's performance.

What you need benchmarking for is computing the run-time (Big-T
notation) of an algorithm. This serves as raw data to verify that an
analysis is of the proper order. It also can be used for estimating the
constants that Big-O notation drops.

For instance, benchmarking shows that the regex solution is faster on
today's hardware, with today's Perl interpreter when given one of my
reasonable input sets. This means that the constants in Abigail's analysis
matter.

Most of the time that's all anybody really wants to know anyway. ;)

Abigail

Nov 25, 1997

Ken Fox (f...@pt0204.pto.ford.com) wrote on 1547 September 1993 in
<URL: news:65dhvt$mn...@eccws1.dearborn.ford.com>:

++ abi...@fnx.com (Abigail) writes:
++ >
++ > foreach my $w (1 .. $#words) {
++ > $end = length $words [$w] - 1 if $end >= length $words [$w];
++ > foreach my $i (0 .. length $words [$w] - 1) {
++ ^^^^^^^^^^^^^^^^^^^^^^
++ Should be $end?

Yes.

++ In my original posting, I gave a similar (cleaner?) algorithm:

My news server was down for more than a day, and I've missed many
postings. I never saw your algorithm.

++ [ Algorithm that starts with a null prefix, then builds it up ]

Nice algorithm.

++ This algorithm is O(N) where N is the number of words in the input.
++ Actually, it has run-time T(k*N+f) where N is the number of words in the
++ input, k is the length of the common prefix, and f is the number of
++ false prefix matches that are tested (f < N). For those that don't
++ know, it's standard in Big-O notation to drop all the constants used in
++ the run-time bound, so T(k*N+f) ends up O(N).

I'm not familiar with big-T. Or is that T as in Theta?

++ So regex is linear in N, but geometric in k. If we know that the common
++ prefix is short, then it's probably ok to use it. (Keep in mind that

No, no, no. My k isn't the length of the common prefix, my k is the
length of the longest word. The regex is quadratic in the word *length*,
regardless of the length of the common prefix.

++ this is a lower bound because it ignores all sorts of ugly stuff like
++ repeated scanning over the non-prefix parts of words and looking for
++ commas.) My original algorithm and Abigail's algorithm both have a
++ lower bound of W(k*N) though. If k is a constant then k^2 is too... ;)

Well, if you want to see k as a constant, then certainly all the mentioned
algorithms are linear.

++ I think Abigail's example of a long prefix is a specially formulated
++ worst case. Will it ever appear in practice? For *my* purposes, I'm
++ doubtful. I find in my sample data as N becomes larger, the prefix
++ length becomes smaller. For short prefixes the regex algorithm becomes
++ (nearly) linear. Caveat emptor.

For both my solution and the regex solution, the worst case situation
depends on the order of the words; you reach the regex worst case
if the first cN words share a prefix of length dk. (N the number of
words, k the maximum length of the words, c, d constants > 0).

Your solution is fairly independent of the order of the words.

++ Abigail's algorithm is superior to the regex from a theoretical view,
++ but it fails miserably. Why? Because Big-O notation ignores
++ computational "constants" and "implementation details". :-) Perl is
++ *much* faster at regex execution than it is at "normal" code execution.
++ The total work done in opcode dispatch, argument passing, etc. for
++ Abigail's optimized solution exceeds the regex execution, even though
++ the regex has poorer performance bounds.

Not so fast. If you don't take word lengths/prefix lengths into account,
the constant that disappears in the big-O *will* depend on the maximum
word length. For Ilya's, yours and mine, that constant is linear in the
maximum word length. For the regex solution, that constant is quadratic
in the maximum word length.

Ken Fox

Nov 26, 1997

abi...@fnx.com (Abigail) writes:
> Ken Fox (f...@pt0204.pto.ford.com) wrote:
> ++ ... it has run-time T(k*N+f) where N is the number of words in the
> ++ input, k is the length of the common prefix, and f is the number of
> ++ false prefix matches that are tested (f < N).

>
> I'm not familiar with big-T. Or is that T as in Theta?

I was taught that big-T is simply used to denote theoretical actual
run time, i.e. the sum of all operations an algorithm performs. (My
algorithm analysis coursework was only a portion of a data structures
class I took a *long* time ago. I'm probably messing up notation as
it is taught now. Sorry.)

> ++ So regex is linear in N, but geometric in k. If we know that the common
> ++ prefix is short, then it's probably ok to use it. (Keep in mind that
>
> No, no, no. My k isn't the length of the common prefix, my k is the
> length of the longest word. The regex is quadratic in the word *length*,

> regardless of the length of the common prefix.

I think this algorithm is equivalent to the regex:

PREFIX: for (i = length(word[0]); i >= 0; --i) {
    prefix = substr(word[0], 0, i)
    for (j = 1; j < N; ++j) {
        if (prefix ne substr(word[j], 0, i)) {
            next PREFIX;
        }
    }
    last;
}

It looks like this algorithm will scan the input at most length(word[0])
times. Each scan of a word does at most length(word[0]) work. So I
agree with you that the regex is quadratic in (the first) word length,
iff the input is pathological. I think in actual practice the average
run times will be linear (or nearly linear), but proving the average
case performance is beyond my ability. This seems like the same problem
QuickSort has -- quadratic worst case performance but very good average
performance.

The regex solution can also be made more stable by putting the
shortest word at the head of the array. This can be done in linear
time.
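A sketch of that linear pass:

    # one scan to find the shortest word, then swap it to the front
    my $short = 0;
    for my $i (1 .. $#words) {
        $short = $i if length($words[$i]) < length($words[$short]);
    }
    @words[0, $short] = @words[$short, 0];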

> Well, if you want to see k as a constant, then certainly all the
> mentioned algorithms are linear.

You're right, I'm not sure what I was thinking... Maybe I wasn't
thinking. ;)

> For both my solution and the regex solution, the worst case situation
> depends on the order of the words; you reach the regex worst case
> if the first cN words share a prefix of length dk. (N the number of
> words, k the maximum length of the words, c, d constants > 0).
>
> Your solution is fairly independent of the order of the words.

Agreed.

> ++ Abigail's algorithm is superior to the regex from a theoretical view,
> ++ but it fails miserably. Why? Because Big-O notation ignores
> ++ computational "constants" and "implementation details". :-) Perl is
> ++ *much* faster at regex execution than it is at "normal" code execution.
> ++ The total work done in opcode dispatch, argument passing, etc. for
> ++ Abigail's optimized solution exceeds the regex execution, even though
> ++ the regex has poorer performance bounds.
>
> Not so fast. If you don't take word lengths/prefix lengths into account,
> the constant that disappears in the big-O *will* depend on the maximum
> word length. For Ilya's, yours and mine, that constant is linear in the
> maximum word length. For the regex solution, that constant is quadratic
> in the maximum word length.

I'm only saying that the regex (with modern perl on a modern OS on a
modern machine when given normal input) outperforms your algorithm by
a large margin. How could that be when analysis clearly indicates
regex to be the worst performer? Why don't *you* explain? ;)

Abigail

Nov 27, 1997

Ken Fox (f...@pt0204.pto.ford.com) wrote on 1548 September 1993 in
<URL: news:65hv4m$a...@eccws1.dearborn.ford.com>:
++ abi...@fnx.com (Abigail) writes:
++ > Ken Fox (f...@pt0204.pto.ford.com) wrote:
++ > ++ ... it has run-time T(k*N+f) where N is the number of words in the
++ > ++ input, k is the length of the common prefix, and f is the number of
++ > ++ false prefix matches that are tested (f < N).
++ >
++ > I'm not familiar with big-T. Or is that T as in Theta?
++
++ I was taught that big-T is simply used to denote theoretical actual
++ run time, i.e. the sum of all operations an algorithm performs.

Well, you are displaying T here as a function working on functions
(and if it were similar to big-O, it would return an equivalence
class of functions). I guess you meant T (N) = O (k*N + f).

++ I think this algorithm is equivalent to the regex:
++
++ PREFIX: for (i = length(word[0]); i >= 0; --i) {
++     prefix = substr(word[0], 0, i)
++     for (j = 1; j < N; ++j) {
++         if (prefix ne substr(word[j], 0, i)) {
++             next PREFIX;
++         }
++     }
++     last;
++ }

I agree.

++ It looks like this algorithm will scan the input at most length(word[0])
++ times. Each scan of a word does at most length(word[0]) work. So I
++ agree with you that the regex is quadratic in (the first) word length,
++ iff the input is pathological. I think in actual practice the average
++ run times will be linear (or nearly linear), but proving the average
++ case performance is beyond my ability. This seems like the same problem
++ QuickSort has -- quadratic worst case performance but very good average
++ performance.

Yes and no. As we all know, quicksort's worst case behaviour is a sorted
array. A sorted list of words is more likely to share a long common
prefix on the first fraction of words - more likely than a randomly
shuffled list of words. You can calculate the expected run time of the
algorithm for a certain distribution of inputs - but do you know your
actual input has that distribution? Furthermore, even if the expected
case if "fast", there will also be cases were it will be "slow". Will
that be acceptable for the application?

++ The regex solution can also be made more stable by putting the
++ shortest word at the head of the array. This can be done in linear
++ time.

True, but that doesn't buy you much if all words are of (almost)
the same length.

++ > Not so fast. If you don't take word lengths/prefix lengths into account,
++ > the constant that disappears in the big-O *will* depend on the maximum
++ > word length. For Ilya's, yours and mine, that constant is linear in the
++ > maximum word length. For the regex solution, that constant is quadratic
++ > in the maximum word length.
++
++ I'm only saying that the regex (with modern perl on a modern OS on a
++ modern machine when given normal input) outperforms your algorithm by
++ a large margin. How could that be when analysis clearly indicates
++ regex to be the worst performer? Why don't *you* explain? ;)

Can I pick the input?

Abigail

Dec 2, 1997

Ken Fox (f...@pt0204.pto.ford.com) wrote on 1554 September 1993 in
<URL: news:65vo3f$g4...@eccws1.dearborn.ford.com>:
++
++ abi...@fnx.com (Abigail) writes:
++ > Ken Fox (f...@pt0204.pto.ford.com) wrote:
++ > ++ I was taught that big-T is simply used to denote theoretical actual
++ > ++ run time, i.e. the sum of all operations an algorithm performs.
++ >
++ > Well, you are displaying T here as a function working on functions
++ > (and if it were similar to big-O, it would return an equivalence
++ > class of functions). I guess you meant T (N) = O (k*N + f).
++
++ I see what you're saying, but that's not what I meant. T is time, not an
++ equivalence function, so I guess the notation should really just be
++ T = k*N+f. What notation is used to compare algorithms with the same big-O?

That is extremely hard. Then you need to take the actual constants into
account, and they can depend on a large range of variables, including
processor type and the compiler used to compile perl. It wouldn't be
impossible to have 2 algorithms A1, A2 with the same upper bound, where A1
is faster for any input than A2 on one machine, but on another machine,
A2 beats A1. What if one algorithm has actual run time 10 * int (n/2),
the other 5 * n - 2? Which one is "better"? What if the run time isn't
continuous? Not to mention the actual runtime is highly dependent on
the input. For a specific input, sometimes the one, sometimes the other
will be faster. Quicksort and bubblesort have the same worst case
complexity. Quicksort has a way better expected runtime; but on a
sorted array, bubblesort can outperform quicksort. Which one is better?
How do you express T if the expected run time is O (n log n), but
the worst case run time is quadratic?

How do you express the running time T of the following algorithm:

@array = ( .... ); # Some set of values.
foreach (@array) {last if $_ eq 7; $_ ++;}

Certainly not as T = c * N + d, as the actual running time highly depends
on the distribution of 7's over the input.

Now, you might give an upper bound, T <= c * N + d, and find appropriate
c and d. But how do you compare T1 and T2, where T1 <= c1 * N + d1,
T2 <= c2 * N + d2? Is T1 better than T2 iff c1 <= c2? Maybe, but what
if we compare the above algorithm with:

@array = ( .... ); # As above.
foreach (@array) {last unless $_ == 713812345124; "aaaab" =~ /^((((a+)*)*)*)c/;}

Clearly T2 <= c2 * N + d2, and c1 <= c2. But which will run better
in practice?

++ Certainly O(2*N) should be worse than O(N), but big-O doesn't mean that.

Writing O (2*N) is a sign of not understanding big-O. ;)

The nice thing about a tight big-O complexity is that it will show
you how well an algorithm scales. Even if the regex algorithm was
ten times faster, the quadratic dependency would still make it
perform worse than the other algorithms for larger inputs.

++ Here's some more raw data. It's interesting to see how different
++ algorithms perform so inconsistently on different input -- and this
++ is a trivial problem too! All these tests were run 20_000 times.

Interesting results, but for completeness, you should also include
cases where all the words are relatively large.

Ken Fox

Dec 2, 1997

Here are some benchmark results. This supports Abigail's point pretty
strongly -- I'll even admit this shows she's right. Ilya's algorithm
(named "shrink prefix" below) is very stable and consistently comes out
on top. I guess he learned something writing that trie module. ;)

IMO, the right way to answer my original question is to write a subroutine
and hide the implementation. Then when somebody eventually posts "yet
another way to do it" you can quietly replace your code. :)
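For instance (a sketch; the subroutine name is mine), with Ilya's loop
behind the interface:

    sub common_prefix {
        my @words  = @_;
        my $prefix = shift @words;
        my $len    = length $prefix;
        foreach (@words) {
            while (substr($_, 0, $len) ne $prefix) {
                --$len;
                $prefix = substr($prefix, 0, $len);
            }
        }
        return $prefix;
    }

    print common_prefix(qw(the thick thesis thoroughly threw thor)), "\n";  # th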

I've included another algorithm here just for fun. Gurusamy Sarathy has a
really cool algorithm based on treating strings as bit vectors. All the
code is at the end.
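The trick, in two lines (my gloss): xor-ing two strings leaves "\0"
exactly where the bytes agree, so the run of leading NULs is the
common prefix of the pair:

    my ($x, $y) = ("fubar1quirky", "fubar2quirky");
    ($x ^ $y) =~ /^(\0*)/;
    print substr($x, 0, length $1), "\n";    # prints "fubar"

The "bit string" entry below folds this over the whole list.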

This table shows the performance of several algorithms used to
find the common prefix. The input array is a list of words in
this form:

words = (fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky, ...
fubar.$N.quirky, fuquirky)

The common prefix is "fu". The size of the array varies from N=200 to
N=200_000. All times shown are CPU seconds. All the tests were run 200
times.

            N:     200    2_000    20_000    40_000    200_000

shrink prefix:    0.39     3.53     35.51     69.17     354.95
  grow prefix:    0.66     6.55     66.65    130.24     652.83
   bit string:    1.15    11.00    108.20    219.01    1079.07
        regex:    1.12    12.28    141.22         -          -

I couldn't get the regex test to run over the array of length 40_000
or more. I'm still running 5.004_01 though so the limit might be
different in newer perls. Abigail hasn't mentioned that the regex
algorithm uses a *lot* more temporary space. It creates a temp string
proportional to the input size plus all the space used for backtracking
and the regex itself. All the others use nearly constant space.

Only "grow prefix" and "regex" are sensitive to the common prefix
length. Doubling the common prefix to "fuba" will double the time
that "grow prefix" takes and reduce the time regex takes. Conversely,
halving the common prefix to "f" will halve the time that "grow prefix"
takes and increase the time regex takes.


abi...@fnx.com (Abigail) writes:
> Ken Fox (f...@pt0204.pto.ford.com) wrote:
> ++ I was taught that big-T is simply used to denote theoretical actual
> ++ run time, i.e. the sum of all operations an algorithm performs.
>
> Well, you are displaying T here as a function working on functions
> (and if it were similar to big-O, it would return an equivalence
> class of functions). I guess you meant T (N) = O (k*N + f).

I see what you're saying, but that's not what I meant. T is time, not an
equivalence function, so I guess the notation should really just be
T = k*N+f. What notation is used to compare algorithms with the same big-O?

Certainly O(2*N) should be worse than O(N), but big-O doesn't mean that.

Here's some more raw data. It's interesting to see how different
algorithms perform so inconsistently on different input -- and this
is a trivial problem too! All these tests were run 20_000 times.


words = (fubar0, fubar1, fubar2, fubar3, fubar4, fubar5
fubar6, fubar7, fubar8, fubar9, fubar10, fubar11
fubar12, fubar13, fubar14, fubar15, fubar16, fubar17
fubar18, fubar19)

regex: 3 secs ( 3.31 usr 0.00 sys = 3.31 cpu)
shrink prefix: 4 secs ( 3.82 usr 0.00 sys = 3.82 cpu)
bit string: 12 secs (11.38 usr 0.00 sys = 11.38 cpu)
grow prefix: 13 secs (11.87 usr 0.00 sys = 11.87 cpu)

words = (fubar0, fubar1, fubar2, fubar3, fubar4, fubar5
fubar6, fubar7, fubar8, fubar9, fubar10, fubar11
fubar12, fubar13, fubar14, fubar15, fubar16, fubar17
fubar18, fubar19, x)

grow prefix: 3 secs ( 2.60 usr 0.00 sys = 2.60 cpu)
shrink prefix: 5 secs ( 4.83 usr 0.00 sys = 4.83 cpu)
bit string: 12 secs (11.90 usr 0.00 sys = 11.90 cpu)
regex: 14 secs (14.47 usr 0.00 sys = 14.47 cpu)

words = (x, fubar0, fubar1, fubar2, fubar3, fubar4
fubar5, fubar6, fubar7, fubar8, fubar9, fubar10
fubar11, fubar12, fubar13, fubar14, fubar15, fubar16
fubar17, fubar18, fubar19)

grow prefix: 0 secs ( 0.66 usr 0.00 sys = 0.66 cpu)
bit string: 1 secs ( 0.96 usr 0.00 sys = 0.96 cpu)
regex: 4 secs ( 3.42 usr 0.00 sys = 3.42 cpu)
shrink prefix: 4 secs ( 3.80 usr 0.00 sys = 3.80 cpu)

words = (fubar0, fubar1, fubar2, fubar3, fubar4, fubar5
fubar6, fubar7, fubar8, fubar9, fubar10, fubar11
fubar12, fubar13, fubar14, fubar15, fubar16, fubar17
fubar18, fubar19, fu)

shrink prefix: 5 secs ( 4.89 usr 0.00 sys = 4.89 cpu)
grow prefix: 8 secs ( 8.16 usr 0.00 sys = 8.16 cpu)
regex: 11 secs (10.84 usr 0.00 sys = 10.84 cpu)
bit string: 12 secs (11.91 usr 0.00 sys = 11.91 cpu)

words = (fu, fubar0, fubar1, fubar2, fubar3, fubar4
fubar5, fubar6, fubar7, fubar8, fubar9, fubar10
fubar11, fubar12, fubar13, fubar14, fubar15, fubar16
fubar17, fubar18, fubar19)

regex: 3 secs ( 3.52 usr 0.00 sys = 3.52 cpu)
shrink prefix: 4 secs ( 3.80 usr 0.00 sys = 3.80 cpu)
grow prefix: 6 secs ( 5.56 usr 0.00 sys = 5.56 cpu)
bit string: 12 secs (11.72 usr 0.00 sys = 11.72 cpu)

words = (fubar0, fubar1, fubar2, fubar3, fu, fubar5
fubar6, fubar7, fubar8, fubar9, fubar10, fubar11
fubar12, fubar13, fubar14, fubar15, fubar16, fubar17
fubar18, fubar19)

shrink prefix: 5 secs ( 4.22 usr 0.00 sys = 4.22 cpu)
regex: 4 secs ( 4.58 usr 0.00 sys = 4.58 cpu)
grow prefix: 6 secs ( 5.58 usr 0.00 sys = 5.58 cpu)
bit string: 11 secs (10.76 usr 0.00 sys = 10.76 cpu)

words = (fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky, fubar4quirky,
fubar5quirky fubar6quirky, fubar7quirky, fubar8quirky, fubar9quirky,
fubar10quirky, fubar11quirky fubar12quirky, fubar13quirky,
fubar14quirky, fubar15quirky, fubar16quirky, fubar17quirky
fubar18quirky, fubar19quirky)

regex: 3 secs ( 4.14 usr 0.00 sys = 4.14 cpu)
shrink prefix: 5 secs ( 4.96 usr 0.00 sys = 4.96 cpu)
bit string: 11 secs (11.15 usr 0.00 sys = 11.15 cpu)
grow prefix: 12 secs (12.32 usr 0.00 sys = 12.32 cpu)

words = (fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky, fubar4quirky,
fubar5quirky fubar6quirky, fubar7quirky, fubar8quirky, fubar9quirky,
fubar10quirky, fubar11quirky fubar12quirky, fubar13quirky,
fubar14quirky, fubar15quirky, fubar16quirky, fubar17quirky
fubar18quirky, fubar19quirky, xquirky)

grow prefix: 2 secs ( 2.67 usr 0.00 sys = 2.67 cpu)
shrink prefix: 6 secs ( 6.14 usr 0.00 sys = 6.14 cpu)
bit string: 12 secs (11.60 usr 0.00 sys = 11.60 cpu)
regex: 17 secs (16.83 usr 0.00 sys = 16.83 cpu)

words = (xquirky, fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky,
fubar4quirky fubar5quirky, fubar6quirky, fubar7quirky, fubar8quirky,
fubar9quirky, fubar10quirky fubar11quirky, fubar12quirky,
fubar13quirky, fubar14quirky, fubar15quirky, fubar16quirky
fubar17quirky, fubar18quirky, fubar19quirky)

grow prefix: 0 secs ( 0.65 usr 0.00 sys = 0.65 cpu)
bit string: 1 secs ( 0.94 usr 0.00 sys = 0.94 cpu)
regex: 5 secs ( 4.27 usr 0.00 sys = 4.27 cpu)
shrink prefix: 5 secs ( 5.03 usr 0.00 sys = 5.03 cpu)

words = (fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky, fubar4quirky,
fubar5quirky fubar6quirky, fubar7quirky, fubar8quirky, fubar9quirky,
fubar10quirky, fubar11quirky fubar12quirky, fubar13quirky,
fubar14quirky, fubar15quirky, fubar16quirky, fubar17quirky
fubar18quirky, fubar19quirky, fuquirky)

shrink prefix: 5 secs ( 5.75 usr 0.00 sys = 5.75 cpu)
grow prefix: 7 secs ( 7.60 usr 0.00 sys = 7.60 cpu)
regex: 13 secs (12.01 usr 0.00 sys = 12.01 cpu)
bit string: 14 secs (13.70 usr 0.00 sys = 13.70 cpu)

*** The following data set is what I consider most representative of
a sorted directory listing.

words = (fuquirky, fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky,
fubar4quirky fubar5quirky, fubar6quirky, fubar7quirky, fubar8quirky,
fubar9quirky, fubar10quirky fubar11quirky, fubar12quirky,
fubar13quirky, fubar14quirky, fubar15quirky, fubar16quirky
fubar17quirky, fubar18quirky, fubar19quirky)

regex: 4 secs ( 4.11 usr 0.00 sys = 4.11 cpu)
shrink prefix: 5 secs ( 4.94 usr 0.00 sys = 4.94 cpu)
grow prefix: 6 secs ( 5.85 usr 0.00 sys = 5.85 cpu)
bit string: 10 secs (11.43 usr 0.00 sys = 11.43 cpu)

words = (fubar0quirky, fubar1quirky, fubar2quirky, fubar3quirky, fuquirky,
fubar5quirky fubar6quirky, fubar7quirky, fubar8quirky, fubar9quirky,
fubar10quirky, fubar11quirky fubar12quirky, fubar13quirky,
fubar14quirky, fubar15quirky, fubar16quirky, fubar17quirky
fubar18quirky, fubar19quirky)

shrink prefix: 5 secs ( 5.50 usr 0.00 sys = 5.50 cpu)
regex: 6 secs ( 5.51 usr 0.00 sys = 5.51 cpu)
grow prefix: 6 secs ( 5.61 usr 0.00 sys = 5.61 cpu)
bit string: 11 secs (10.88 usr 0.00 sys = 10.88 cpu)

- Ken

--
Ken Fox (kf...@ford.com)                 | My opinions or statements do
                                         | not represent those of, nor are
Ford Motor Company, Powertrain           | endorsed by, Ford Motor Company.
Analytical Powertrain Methods Department |
Software Development Section             | "Is this some sort of trick
                                         | question or what?" -- Calvin

# ----------------------------------------------------------------------
use Benchmark qw(timethese);

#my @words = qw(the thick thesis thoroughly threw thor);

my @words;
for (my $i = 0; $i < 20; ++$i) {
    push @words, "fubar$i";
}

my $suffix = "";

if (defined($ARGV[0])) {
    if ($ARGV[0] > 5) {
        $ARGV[0] -= 6;
        $suffix = "quirky";
        foreach (@words) {
            $_ .= $suffix;
        }
    }

    if    ($ARGV[0] == 1) { push @words, ("x".$suffix) }
    elsif ($ARGV[0] == 2) { unshift @words, ("x".$suffix) }
    elsif ($ARGV[0] == 3) { push @words, ("fu".$suffix) }
    elsif ($ARGV[0] == 4) { unshift @words, ("fu".$suffix) }
    elsif ($ARGV[0] == 5) { $words[4] = "fu".$suffix }
}

print "\n\n\nwords = (";
my $columns = 6;
my $pos = 0;
while ($pos < ($#words - $columns)) {
    print join(', ', @words[$pos .. $pos+$columns-1]), "\n ";
    $pos += $columns;
}
print join(', ', @words[$pos..$#words]), ")\n\n";

my @test_words = @words;
my $reference = shift @test_words;

timethese(20_000, {
    'regex' =>
        sub {
            my $prefix;
            (join(',', @words).',') =~ /^(\w*)\w*,(\1\w*,)*$/;
            $prefix = $1;
        },

    'bit string' =>
        sub {
            my $prefix = $reference;
            foreach my $w (@test_words) {
                last if $prefix eq '';
                ($prefix ^ $w) =~ /^([\0]*)/;
                $prefix = $reference & ~$1;
            }
        },

    'shrink prefix' =>
        sub {
            my $len = length($reference);
            my $prefix = $reference;
            foreach (@test_words) {
                while (substr($_, 0, $len) ne $prefix) {
                    --$len;
                    $prefix = substr($prefix, 0, $len);
                }
            }
        },

    'grow prefix' =>
        sub {
            my $offset = 0;
            my $prefix;
            MATCH: while (1) {
                $prefix = substr($reference, $offset, 1);
                foreach (@test_words) {
                    if (substr($_, $offset, 1) ne $prefix) {
                        $prefix = substr($reference, 0, $offset);
                        last MATCH;
                    }
                }
                ++$offset;
            }
        }
});

Justin Vallon

Dec 20, 1997

This may be dead, but: since the common prefix of the whole list is just
the common prefix of its lexicographically smallest and largest words,
one pass finds $min and $max, and a binary search then finds the prefix
length:

local @words = qw(silly sally sorted seven sequences);

local $min = $words[0];
local $max = $words[0];

for $w (@words) {
    if ($w lt $min) { $min = $w; }
    if ($w gt $max) { $max = $w; }
}

local $lower_len = 0;
local $upper_len = length($min);
if ($upper_len < length($max)) { $upper_len = length($max); }

while ($lower_len < $upper_len) {
    local $mid_len = int(($lower_len + $upper_len + 1)/2);
    if (substr($min, 0, $mid_len) ne substr($max, 0, $mid_len)) {
        $upper_len = $mid_len - 1;
    } else {
        $lower_len = $mid_len;
    }
}

local $prefix = substr $min, 0, $lower_len;
print "$prefix\n";

Run time: O(N + ln k) string comparisons; T = 1*N + ln k.
(Each comparison is itself up to O(k) character work.)

--
-Justin
val...@mindspring.com
