Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[QUIZ] Longest Repeated Substring (#153)

85 views
Skip to first unread message

Ruby Quiz

unread,
Jan 18, 2008, 8:23:40 AM1/18/08
to
The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a script that finds the longest repeated
substring in a given text.

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Example:

$ echo banana | ruby longest_repeated_substring.rb
an

OR

$ echo banana | ruby longest_repeated_substring.rb
na

Make sure your code runs efficiently when passed a large text.

James Gray

unread,
Jan 18, 2008, 9:30:00 AM1/18/08
to
On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

> This week's Ruby Quiz is to write a script that finds the longest
> repeated
> substring in a given text.

> Make sure your code runs efficiently when passed a large text.

Posting the strings you find in a given text and/or timings is not a
spoiler.

James Edward Gray II


Denis Hennessy

unread,
Jan 18, 2008, 10:49:47 AM1/18/08
to
On 18 Jan 2008, at 13:23, Ruby Quiz wrote:
> This week's Ruby Quiz is to write a script that finds the longest
> repeated
> substring in a given text.
>
> Your program will be passed some text on STDIN and is expected to
> print the
> longest repeated substring within that text to STDOUT.
>
> Repeated substrings may not overlap. If more than one substring is
> repeated
> with the same length, you may print any of them. If there is no
> repeated
> substring, the result is an empty string (print nothing).
>
> Example:
>
> $ echo banana | ruby longest_repeated_substring.rb
> an
>
> OR
>
> $ echo banana | ruby longest_repeated_substring.rb
> na
>
> Make sure your code runs efficiently when passed a large text.
>

Should white space and punctuation be treated as part of the
substring, or are they to be ignored?

/dh

James Gray

unread,
Jan 18, 2008, 10:53:15 AM1/18/08
to

I vote that we treat all characters equally.

James Edward Gray II

Dave Thomas

unread,
Jan 18, 2008, 11:37:08 AM1/18/08
to

On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

> Repeated substrings may not overlap. If more than one substring is
> repeated
> with the same length, you may print any of them. If there is no
> repeated
> substring, the result is an empty string (print nothing).


Must they be adjacent, or does "aaabaaa" contain the repeating
substring "aaa"?


Dave

James Gray

unread,
Jan 18, 2008, 11:43:11 AM1/18/08
to

They do not have to be adjacent. aaa would be a valid answer for
aaabaaa.

James Edward Gray II


Philip Gatt

unread,
Jan 18, 2008, 11:50:03 AM1/18/08
to
I hope I get some time to take a shot at this one. It looks like fun.

Ken Bloom

unread,
Jan 18, 2008, 3:38:06 PM1/18/08
to

First testcase:
"your banana my banana" should give " banana"

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

yermej

unread,
Jan 18, 2008, 3:59:57 PM1/18/08
to
On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:
> First testcase:
> "your banana my banana" should give " banana"
>
> --Ken
>
> --
> Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
> Department of Computer Science. Illinois Institute of Technology.http://www.iit.edu/~kbloom1/

And a second:
"aaaaaa" should give "aaa"

Right?

Radosław Bułat

unread,
Jan 18, 2008, 4:10:12 PM1/18/08
to

It should, otherwise "banana" would give "ana" rather than "an".

My question is (I'm not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?


--
Radosław Bułat

http://radarek.jogger.pl - mój blog

Radosław Bułat

unread,
Jan 18, 2008, 4:14:42 PM1/18/08
to
> It should, otherwise "banana" would give "ana" rather than "an".

Or even better it would be the same string.

James Gray

unread,
Jan 18, 2008, 4:16:49 PM1/18/08
to
On Jan 18, 2008, at 3:10 PM, Radosław Bułat wrote:

> On Jan 18, 2008 10:00 PM, yermej <yer...@gmail.com> wrote:
>> On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:
>
>> And a second:
>> "aaaaaa" should give "aaa"
>>
>> Right?
>
> It should, otherwise "banana" would give "ana" rather than "an".
>
> My question is (I'm not familiar with RubyQuiz too much :)): episode
> focus on algorithm (speed) or source code (readable)?

Hopefully both. :)

James Edward Gray II

Robert Dober

unread,
Jan 18, 2008, 4:18:30 PM1/18/08
to
2008/1/18 Radosław Bułat <radek...@gmail.com>:

> On Jan 18, 2008 10:00 PM, yermej <yer...@gmail.com> wrote:
> > On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:
>
> > And a second:
> > "aaaaaa" should give "aaa"
> >
> > Right?
>
> It should, otherwise "banana" would give "ana" rather than "an".
>
> My question is (I'm not familiar with RubyQuiz too much :)): episode
> focus on algorithm (speed) or source code (readable)?
Normally this is not very important all aspects can be of interest but
please be aware that this time James has explicitly asked for
solutions that are reasonable fast for longer input.
Great to have you join in BTW. The most important thing is to
participate of course...

Cheers
Robert
--
http://ruby-smalltalk.blogspot.com/

---
Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein

tho_mica_l

unread,
Jan 18, 2008, 4:54:37 PM1/18/08
to
> First testcase:
> "your banana my banana" should give " banana"

For the GNU GENERAL PUBLIC LICENSE Version 2, June 1991, I get:

"customarily used for software interchange; or, "... some more
whitespace

For the GPL3 I get:

") Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by "...

If somebody can verify this.

I think this really starts to make fun when running the script over
Mark Twains entire work (from Gutenberg) or something similar.

yermej

unread,
Jan 18, 2008, 5:30:58 PM1/18/08
to

I agree on GPL3.

How much whitespace followed your GPL2 result? I ended up with
" 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA\n"
which is 57 characters. Maybe just a difference in the text files we
used? I used 'ruby-1.8.6/GPL'.

tho_mica_l

unread,
Jan 18, 2008, 5:53:13 PM1/18/08
to
> " 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA\n"

It seems there is another version of this document around (at least on
my harddisk) in which the line in the header reads " 59 Temple Place,
Suite 330, Boston, MA 02111 USA\n" instead.

Thanks for verifying.

Thomas.

Radosław Bułat

unread,
Jan 18, 2008, 4:42:18 PM1/18/08
to
> > My question is (I'm not familiar with RubyQuiz too much :)): episode
> > focus on algorithm (speed) or source code (readable)?
>
> Hopefully both. :)

How long input string I could expect?

ara.t.howard

unread,
Jan 18, 2008, 6:05:05 PM1/18/08
to

i would think that

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

a @ http://codeforpeople.com/
--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

James Gray

unread,
Jan 18, 2008, 6:06:34 PM1/18/08
to
On Jan 18, 2008, at 3:42 PM, Radosław Bułat wrote:

>>> My question is (I'm not familiar with RubyQuiz too much :)): episode
>>> focus on algorithm (speed) or source code (readable)?
>>
>> Hopefully both. :)
>
> How long input string I could expect?

Anybody found a good long text to work with yet? The text of the U.S.
Constitution perhaps? Tristram Shandy?

I would say that you should handle the biggest text you possibly
can. :)

James Edward Gray II

Denis Hennessy

unread,
Jan 18, 2008, 7:18:49 PM1/18/08
to
On 18 Jan 2008, at 23:05, ara.t.howard wrote:

>> And a second:
>> "aaaaaa" should give "aaa"
>>
>> Right?
>
> i would think that
>
> 'aaaaaa'.longest_repeating_substring #=> 'aaaaa'
>
> the quiz did not say that the two strings could not overlap.
>
> is this correct james?

Actually the spec says "Repeated substrings may not overlap." so the
correct answer for "aaaaaa" should be "aaa".

/dh

James Gray

unread,
Jan 18, 2008, 7:20:28 PM1/18/08
to
On Jan 18, 2008, at 5:05 PM, ara.t.howard wrote:

>
> On Jan 18, 2008, at 2:00 PM, yermej wrote:
>
>> On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:
>>> First testcase:
>>> "your banana my banana" should give " banana"
>>>
>>> --Ken
>>>
>>> --
>>> Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
>>> Department of Computer Science. Illinois Institute of Technology.http://www.iit.edu/~kbloom1/
>>
>> And a second:
>> "aaaaaa" should give "aaa"
>>
>> Right?
>
> i would think that
>
> 'aaaaaa'.longest_repeating_substring #=> 'aaaaa'
>
> the quiz did not say that the two strings could not overlap.
>
> is this correct james?

Actually, it did. Here's the relevant line from the quiz:

"Repeated substrings may not overlap."

James Edward Gray II

Radosław Bułat

unread,
Jan 18, 2008, 7:23:21 PM1/18/08
to
> 'aaaaaa'.longest_repeating_substring #=> 'aaaaa'
>
> the quiz did not say that the two strings could not overlap.
>
> is this correct james?

No, it isn't. Look at example from James:

Example:

$ echo banana | ruby longest_repeated_substring.rb
an

OR

$ echo banana | ruby longest_repeated_substring.rb
na

Following your reasoning 'banana' would give 'ana', rather than 'an' (or 'na').

ara.t.howard

unread,
Jan 18, 2008, 8:17:13 PM1/18/08
to

On Jan 18, 2008, at 5:20 PM, James Gray wrote:

> "Repeated substrings may not overlap."

sorry. missed that...

cheers.

a @ http://drawohara.com/
--
sleep is the best meditation.

ara.t.howard

unread,
Jan 18, 2008, 9:12:31 PM1/18/08
to

http://www.fordham.edu/halsall/ancient/homer-illiad.txt

we can deny everything, except that we have the possibility of being
better. simply reflect on that.

fw

unread,
Jan 18, 2008, 9:27:57 PM1/18/08
to

There's also War and Peace: http://www.gutenberg.org/dirs/etext01/wrnpc12.txt

$ wc wrnpc12.txt
67403 564514 3285165 wrnpc12.txt


yermej

unread,
Jan 18, 2008, 10:42:38 PM1/18/08
to
On Jan 18, 8:12 pm, "ara.t.howard" <ara.t.how...@gmail.com> wrote:
> On Jan 18, 2008, at 4:06 PM, James Gray wrote:
>
>
>
>
>
>
>
> > On Jan 18, 2008, at 3:42 PM, Radosław Bułat wrote:
>
> >>>> My question is (I'm not familiar with RubyQuiz too much :)):
> >>>> episode
> >>>> focus on algorithm (speed) or source code (readable)?
>
> >>> Hopefully both. :)
>
> >> How long input string I could expect?
>
> > Anybody found a good long text to work with yet? The text of the
> > U.S. Constitution perhaps? Tristram Shandy?
>
> > I would say that you should handle the biggest text you possibly
> > can. :)
>
> > James Edward Gray II
>
> http://www.fordham.edu/halsall/ancient/homer-illiad.txt
>
> a @http://codeforpeople.com/

> --
> we can deny everything, except that we have the possibility of being
> better. simply reflect on that.
> h.h. the 14th dalai lama

For The Illiad, I got:
' who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives' (168 characters)

yermej

unread,
Jan 18, 2008, 10:55:34 PM1/18/08
to
On Jan 18, 8:27 pm, fw <fwmailingli...@gmail.com> wrote:
> On Sat, 2008-01-19 at 11:12 +0900, ara.t.howard wrote:
> > On Jan 18, 2008, at 4:06 PM, James Gray wrote:
>
> > > On Jan 18, 2008, at 3:42 PM, Radosław Bułat wrote:
>
> > >>>> My question is (I'm not familiar with RubyQuiz too much :)):
> > >>>> episode
> > >>>> focus on algorithm (speed) or source code (readable)?
>
> > >>> Hopefully both. :)
>
> > >> How long input string I could expect?
>
> > > Anybody found a good long text to work with yet? The text of the
> > > U.S. Constitution perhaps? Tristram Shandy?
>
> > > I would say that you should handle the biggest text you possibly
> > > can. :)
>
> > > James Edward Gray II
>
> >http://www.fordham.edu/halsall/ancient/homer-illiad.txt
>
> > a @http://codeforpeople.com/

> > --
> > we can deny everything, except that we have the possibility of being
> > better. simply reflect on that.
> > h.h. the 14th dalai lama
>
> There's also War and Peace:http://www.gutenberg.org/dirs/etext01/wrnpc12.txt
>
> $ wc wrnpc12.txt
> 67403 564514 3285165 wrnpc12.txt

For War & Peace:
'

The Project Gutenberg Literary Archive Foundation has been ' (63
characters)

That one took awhile - 5m29s real time vs 1m2s real time for the
Illiad - though the increase is close to linear given the number of
characters in each text (3285165 vs. 806921).

Robert Dober

unread,
Jan 19, 2008, 6:29:26 AM1/19/08
to
2008/1/19 yermej <yer...@gmail.com>:
So we need to take care of unicode too?

> - though the increase is close to linear given the number of
> characters in each text (3285165 vs. 806921).
>
>

--

Dave Thomas

unread,
Jan 19, 2008, 11:21:34 AM1/19/08
to

On Jan 18, 2008, at 9:59 PM, yermej wrote:

> That one took awhile - 5m29s real time vs 1m2s real time for the
> Illiad - though the increase is close to linear given the number of
> characters in each text (3285165 vs. 806921).


I've got the same output for both examples. Current times are 2.767s
and 12.894s elapsed on a 3GHz processor.

Ruby's clever copy-on-write string handling really shines here.

Dave


James Gray

unread,
Jan 19, 2008, 12:31:23 PM1/19/08
to
On Jan 19, 2008, at 5:29 AM, Robert Dober wrote:

> So we need to take care of unicode too?

As always, it's your choice. That's a nice bonus though.

James Edward Gray II


Denis Hennessy

unread,
Jan 19, 2008, 4:02:54 PM1/19/08
to

Hmm, I was happy with my solution (similiar times to yermej on 1.8GHz)
until I saw Dave's times. I'm really looking forward to seeing the code.

/dh

>

Raf Coremans

unread,
Jan 20, 2008, 9:05:52 AM1/20/08
to
2008/1/18, James Gray <ja...@grayproductions.net>:


My solution offers speed nor readability; I went for a one-liner (if you
discount the require):

Raf Coremans

unread,
Jan 20, 2008, 9:09:10 AM1/20/08
to
2008/1/20, Raf Coremans <rra...@gmail.com>:

(Oops, pushed the wrong button.)

$ cat rq153.rb
#!/usr/bin/ruby -w

require 'enumerator'

p(
ARGV.
join( ' ').
reverse.
to_enum( :each_byte).
inject( []){ |a, b| a << (a.last || "") + b.chr}.
map{ |a| a.reverse}.
inject( []){ |a, b| a << b.match( /^(.+).*\1/).to_a.pop }.
flatten.
compact.
sort_by{ |a| a.size}.
last
)

$ ./rq153.rb banana
"an"
$ ./rq153.rb aaabaaa
"aaa"
$ ./rq153.rb aaaaaa
"aaa"
$ ./rq153.rb ambidextrous
nil

Don't even think of running it on a reasonably-sized text, lest you want
your PC to grind to a halt.
Morale: ruby code *can* be ugly if you put your mind to it.


Regards,
Raf

Denis Hennessy

unread,
Jan 20, 2008, 9:13:40 AM1/20/08
to
Here's my solution to the quiz. It's has O(n) behaviour and takes
about 1 minute on the illiad and 5 minutes on war and peace on my
1.8GHz linux box (much longer on my powerbook).

It works by constructing a map from every letter in the text to an
array of its locations. It then iterates, increasing each string
(which sometimes creates splits) and removing strings which don't have
at least 2 locations. Thus, for each length n, the algorithm only has
to deal with the patterns which already matched with length n-1. This
is easiest to see by running it with the verbose option:

$ echo banana | ruby -v find_repeats.rb
ruby 1.8.6 (2007-09-23 patchlevel 110) [powerpc-darwin9.0.0]
Initial: {"a"=>[1, 3, 5], "b"=>[0], "n"=>[2, 4], "\n"=>[6]}
Filter (len=1): {"a"=>[1, 3, 5], "n"=>[2, 4]}
Grow (len=2): {"an"=>[1, 3], "na"=>[2, 4], "a\n"=>[5]}
Filter (len=2): {"an"=>[1, 3], "na"=>[2, 4]}
Grow (len=3): {"na\n"=>[4], "nan"=>[2], "ana"=>[1, 3]}
Filter (len=3): {}
an


Cheers,
Denis

find_repeats.rb:

#!/usr/bin/env ruby

text = ARGF.read
size = text.size

# Build a map from each (1-character) string to a list of its positions
roots = {}
size.times do |o|
s = text[o,1]
if roots.has_key? s
roots[s] << o
else
roots[s] = [o]
end
end

puts "Initial: #{roots.inspect}" if $VERBOSE
len = 1
first = nil
while true do
# Remove entries which don't have at least 2 non-overlapping
occurances
roots.delete_if do |s,offsets|
count = 0
last = nil
offsets.each do |o|
next if last && last+len > o
last = o
count += 1
end
count < 2
end
puts "Filter (len=#{len}): #{roots.inspect}" if $VERBOSE
break if roots.size == 0
first = roots[roots.keys[0]][0]

# Increase len by 1 and replace each existing root with the set of
longer roots
len += 1
new_roots = {}
roots.each do |s,offsets|
offsets.each do |o|
next if o > size - len
s = text[o,len]
if new_roots.has_key? s
new_roots[s] << o
else
new_roots[s] = [o]
end
end
end
roots = new_roots
puts "Grow (len=#{len}): #{roots.inspect}" if $VERBOSE
end

exit if first == nil

puts text[first,len-1]


Justin Ethier

unread,
Jan 20, 2008, 9:30:14 AM1/20/08
to
[Note: parts of this message were removed to make it a legal post.]

Here is my solution. It is easy to follow but breaks down on larger inputs:


# Find first non-overlapping repeated substring contained in the input
string.
# Search space is smaller for longer substrings, so search for longest ones
first.
# Returns - Longest repeated substring, or nil if none
def longest_repeated_substring(input)
len = input.size / 2 # Max size is half total length, since strings cannot
overlap

while len > 0
# Find all substrings of given length
sub_strings = {}
for i in 0...input.size-len
sub_str = input[i..i+len]

if not sub_strings.has_key?(sub_str)
sub_strings[sub_str] = i+len # Add to list, track end pos for
overlaps
elsif sub_strings[sub_str] < i
return sub_str # First non-overlapping match ties for longest
end
end

len -= 1
end

nil
end

puts longest_repeated_substring(ARGV[0])


Thanks,

Justin

Dave Thomas

unread,
Jan 20, 2008, 12:04:03 PM1/20/08
to
My hack at the substring problem (nice one, James) is based on suffix
trees.

Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For "banana" this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
"nan", for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

A suffix array is basically this data structure. For efficiency,
though, it doesn't store the actual strings. Instead it stores the
offset of the start of the string (for our example, this would be [6,
4, 2, 1, 5, 3]). You'll find a whole bunch of stuff on using suffix
arrays for searching and string matching, particularly in the area of
DNA sequencing.

It turns out that in Ruby, you don't always have to go to the trouble
of constructing the suffix array. When you take a substring in Ruby,
it doesn't copy the string data. Instead, it just constructs a new
string object (just a few words of memory) that references into the
original string. Only when either string is modified does the string
content get copied. This means the code for constructing the suffix
list is both simple and relatively efficient:

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end

The only ugly part is the use of [i, len] to create the substring.
Technically it should be 1..len, but that constructs a new,
unnecessary range object. Using 'len' as the final index does the same
thing, because the substring stops at the end of the input string, but
it can be misleading to read.

The code to scan the suffix list looks at each adjacent pair, and sees
if it can find a longer matching substring at the start of each that
it has previously found. Because James disallowed overlapping
substrings, you have to be careful to look at no more that 1/2 of the
longest string. The basic loop looks like this:


s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list[i]
max_possible = s2.length / 2 # stop them overlapping

while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end

The while loop has three conditions. The last one is the money earner:
it looks for a match at the start of the two strings which is one
longer that the longest match found so far. If it finds it, it records
this as the new longest match, and then tries for a even longer one
with the current pair.

The second condition on the while loop stops us considering more than
1/2 the second string. Because our strings are sorted, if thereis
overlap, the second string will always be the one that contains the
overlapping segment of the first.

The first condition is just an optimization: it stops us doing the
more expensive substring comparison if the last characters of the two
substrings we're comparing don't match.

So, put it all together, and we have

# - - - - - - - - - - - - -

class CommonSeq

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end

def find_substrings
max_length_so_far = 0
max_plus_one = 1 # save a little math in the loop
found_at = nil

# Look at all adjacent pairs of suffices.
s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list[i]
max_possible = s2.length / 2 # stop them overlapping

while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end

if found_at
suffix = @suffix_list[found_at]
[max_length_so_far, suffix[0, max_length_so_far]]
else
nil
end
end
end

if __FILE__ == $0
seq = CommonSeq.new(STDIN.read.chomp)
puts seq.find_substrings
end

# - - - - - - - - - - - - -


Run times are shown for the Illiad and War and Peace:

dave[tmp/seq] time ruby seq.rb <homer.txt
168


who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives

ruby seq.rb < homer.txt 2.82s user 0.05s system 99% cpu 2.872 total


dave[tmp/seq] time ruby seq.rb <war.txt
63


The Project Gutenberg Literary Archive Foundation has been

ruby seq.rb < war.txt 12.49s user 0.17s system 99% cpu 12.737 total


Here's the somewhat basic set of tests I used


# - - - - - - - - - - - - -
require 'seq'
require 'test/unit'

class CommonSeq
attr_reader :suffix_list
end

class TestSeq < Test::Unit::TestCase

def test_basic_suffix_creation
cs = CommonSeq.new("banana")
assert_equal(%w{a ana anana banana na nana}, cs.suffix_list)
end

def test_empty
cs = CommonSeq.new("")
assert_nil cs.find_substrings
end

def test_length_one
cs = CommonSeq.new("a")
assert_nil cs.find_substrings
end

def test_length_two_no_match
cs = CommonSeq.new("ab")
assert_nil cs.find_substrings
end

def test_length_two_with_match
cs = CommonSeq.new("aa")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_length_three_no_match
cs = CommonSeq.new("abc")
assert_nil cs.find_substrings
end

def test_length_three_adjacent_match
cs = CommonSeq.new("aab")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_length_three_separated_match
cs = CommonSeq.new("aba")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_one
cs = CommonSeq.new("aaa")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_three
cs = CommonSeq.new("aaaa")
assert_equal [ 2, "aa"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_two
cs = CommonSeq.new("ababa")
assert_equal [ 2, "ab"], cs.find_substrings
end

def test_banana
cs = CommonSeq.new("banana")
assert_equal [ 2, "an"], cs.find_substrings
end
end
# - - - - - - - - - - - - -


I wouldn't be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong.. :)


Dave

Guillaume Carbonneau

unread,
Jan 20, 2008, 1:10:34 PM1/20/08
to

Here is my solution using Suffix arrays

class SuffixArray
attr_accessor :suffix_array
def initialize(the_string)
@the_string = the_string
@suffix_array = Array.new
#build the suffixes
last_index = the_string.length-1
(0..last_index).each do |i|
the_suffix = the_string[i..last_index]
the_position = i
# << is the append (or push) operator for arrays in Ruby
@suffix_array << { :suffix=>the_suffix, :position=>the_position }
end

#sort the suffix array
@suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }
end

end

text = STDIN.read

highest_count = 0
longest_string = ""
sa = SuffixArray.new(text)
sa.suffix_array.each_with_index do |s,i|
j = 1
if sa.suffix_array[i+1]
while sa.suffix_array[i][:suffix][0,j] ==
sa.suffix_array[i+1][:suffix][0,j]
if j > highest_count
highest_count = j
longest_string = sa.suffix_array[i][:suffix][0,j]
end
j += 1
end
end

end
p longest_string

---------------------


I get the following times

ILLIAD :
$ time cat homer-illiad.txt | ruby suffix.rb
" who are\nperishing and coming to a bad end. We will, however, since you
so\nbid us, refrain from actual fighting, but we will make
serviceable\nsuggestions to the Argives"

real 1m15.566s
user 1m14.810s
sys 0m0.410s

WAR AND PEACE
$ time cat wrnpc12.txt | ruby suffix.rb
"\r\n\r\nThe Project Gutenberg Literary Archive Foundation has been "

real 4m55.145s
user 4m49.979s
sys 0m1.951s


--
View this message in context: http://www.nabble.com/-QUIZ--Longest-Repeated-Substring-%28-153%29-tp14953800p14984651.html
Sent from the ruby-talk mailing list archive at Nabble.com.


Eric I.

unread,
Jan 20, 2008, 2:23:13 PM1/20/08
to
On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:
> I wouldn't be surprised if the idea of searching only 1/2 of the  
> second string to prevent overlaps is wrong.. :)

I think you're right in that it's wrong. ;)

If you submit the string "ababab" to your program, it comes back with
"aba" as the longest non-overlapping substring, but the answer should
be "ab". When you compare the consecutive sorted suffixes "abab" and
"ababab", you allow strings up to 3 ("ababab".size / 2) to be used,
but in fact, they overlap in all but 2 characters.

I'll post my solution in a reply, which is very similar to your except
in the overlap prevention code, which, I have to admit, is pretty
ugly. And I'm not even convinced that I got it right!

Eric

Eric I.

unread,
Jan 20, 2008, 2:24:18 PM1/20/08
to
# A solution to RubyQuiz #153.
#
# Finds the longest, non-overlapping repeated substring in its input.
#
# See http://www.rubyquiz.com/quiz153.html for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-153.shtml .

# When run, the input can be on the command line, come from standard
# input, or come from a file:
#
# ruby lrs.rb banana
# ruby lrs.rb "Madam I'm Adam."
# ruby lrs.rb -f homer-illiad.txt
# cat homer-illiad.txt | ruby lrs.rb

# The basic technique used by this solution is to create an array of
# all suffixes of the data. So if the input were "banana", the array
# would contain ["banana", "anana", "nana", "ana", "na", "a"]. Then
# we sort this array, so it would now contain ["a", "ana", "anana",
# "banana", "na", "nana"]. Finally we can compare neighboring entries
# in the array to see if they share a long enough prefix to beat the
# current best.

# Extra care must be taken if the substrings are not allowed to
# overlap. Consider the input "ananana"; the longest non-overlapping
# substring is "ana". The array of sorted suffixes of is ["a", "ana",
# "anana", "ananana", "na", "nana", "nanana"]. The 2nd and 3rd items
# can only have a match of "an" because the "ana" would overlap, and
# the same is true with the 3rd and 4th items. However by comparing
# the 2nd and 4th items we can get the desired result of "ana". So
# under certain circumstances we have to compare an item with more
# than just its immediate predecessor.

# This program seems to run reasonably fast. It should run in O(n *
# log n) time in most cases, assuming that Array's sort method
# provides that performance. Due to the rare cases when the program
# cannot just compare an item and its immediate predecessor, there may
# be some strange cases where it requires O(n ** 2). Because Ruby
# allows a computed substring to share the data with the original
# string (until one of the strings is altered, i.e., "copy on write"),
# the memory used is linear to the input size.


# returns the maximum of the two parameters
def max(a, b)
a >= b ? a : b
end


# Return the longest common prefix between two strings. If max is
# specified then the longest common prefix cannot exceed it
def longest_common_prefix(s1, s2, max = nil)
l1, l2 = s1.size, s2.size
min = l1 < l2 ? l1 : l2
min = min < max ? min : max if max
min.times do |i|
return s1.slice(0, i) if s1[i] != s2[i]
end
return s1.slice(0, min)
end


# Returns the longest repeated substring in a given string.
def longest_repeated_substring(string)
size = string.length

# put every possible suffix into an array
suffixes = Array.new(size)
size.times do |i|
suffixes[i] = string.slice(i, size)
end

# sort the array of suffixes, so common substrings (i.e., prefixes
# of suffixes) will be found in neighboring elements of the array
suffixes.sort!

best = ""
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

# compare pairs of consecutive suffixes and see how much initial
# commonality there is
# (size - 1).times do |i|
(1...size).each do |i|
# p [i, neighbors_to_check]
s1 = suffixes[i]

# generally we will only need to compare the ith item and the one
# preceding it; however if we were in a position to reject a long
# enough common substring due to overlap issues, then we may have
# to compare an ith item with additional preceding items;
# neighbors_to_check tracks how many neighbors we need to check
neighbors_to_check.downto(1) do |neighbor|
s2 = suffixes[i - neighbor]

# make sure that these to suffixes further apart than the size
# of the current best; we don't explicitly track the index of
# these suffixes, but since all suffixes go to the end of the
# initial string, the size can be used as a proxy
distance = (s1.size - s2.size).abs
if distance < at_least_size
if s1.size >= at_least_size &&
s2.size >= at_least_size &&
s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = max(neighbors_to_check, neighbor + 1)
else
neighbors_to_check = neighbor
end
next
end

# if neighboring suffixes don't at least match as far as the
best,
# no need to check more carefully
unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = neighbor
next
end

# get the longest common prefix that's no larger than distance,
# since at that point the substrings overlap
best = longest_common_prefix(s1, s2, distance)
at_least_size = best.size + 1
if best.size == distance
neighbors_to_check = max(neighbors_to_check, neighbor + 1)
else
neighbors_to_check = neighbor
end
end
end

best
end


if $0 == __FILE__
string = nil
if ARGV[0] == "-f"
open(ARGV[1]) do |f|
string = f.read
end
elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts "usage:"
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts result && "\"#{result}\" (#{result.length} characters)" ||
"none"
end

Eric I.

unread,
Jan 20, 2008, 2:35:38 PM1/20/08
to
Oh, and the run times of my solution for _The Illiad_ and _War and
Peace_ data sets are 9.179s and 43.875s respectively. This is running
on a 2.33GHz processor.

Eric

JJ

unread,
Jan 20, 2008, 5:18:12 PM1/20/08
to
On Jan 20, 2:23 pm, "Eric I." <rubytrain...@gmail.com> wrote:
> On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:
>
> > I wouldn't be surprised if the idea of searching only 1/2
> > of the second string to prevent overlaps is wrong.. :)
>
> I think you're right in that it's wrong. ;)

...snip

> I'll post my solution in a reply, which is very similar to your
> except in the overlap prevention code, which, I have to admit, is
> pretty ugly. And I'm not even convinced that I got it right!

Dave's code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

if prefix.length + s1.length > s2.length then
# Overlap
end

where "prefix" is the current prefix being checked in the two adjacent
suffix strings.

Here is a picture. Pretend in the "ababab" case, we are checking the
adjacent strings "abab" and "ababab". Since one is a suffix of the
other, they can be lined up as they appeared in the original string
(in your mind):

abab
ababab

Now, the prefix being checked might be "aba". It matches both strings,
but if you check "aba".length + s1.length (7), it's too long to fit in
s2.length (6). In other words, they line up like this:

ababab # s2
abab # s1
aba # prefix, lined up with s2
^
`---- # overlap exists because the prefix
# as lined up with s2 overlaps with s1
# when s1 is lined up with s2 as they
# appear in the original string. In other
# words, the "aba" in s2 goes past the
# beginning of the "aba" in s1.

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave's
solution and shouldn't be much more complicated.

-JJ

Dido Sevilla

unread,
Jan 21, 2008, 3:01:04 AM1/21/08
to
On Jan 18, 2008 9:23 PM, Ruby Quiz <ja...@grayproductions.net> wrote:
> Your program will be passed some text on STDIN and is expected to print the
> longest repeated substring within that text to STDOUT.

I'm guessing that someone wants to break a Vigenere cipher. The
longest repeated substring in text enciphered with a Vigenere cipher
has a close relationship to the length of the key, and once the key
length is determined, the cipher is halfway broken. At least that's
one application for this sort of algorithm...

--
普通じゃないのが当然なら答える私は何ができる?
普通でも普通じゃなくて感じるまま感じることだけをするよ!
http://stormwyrm.blogspot.com

Todd Benson

unread,
Jan 21, 2008, 9:00:32 AM1/21/08
to

Nice code. From the "Lost World" by Sir Arthur Conan Doyle, I get...

"ve had the escape of your life, young fellah my lad"

Todd

Jan

unread,
Jan 21, 2008, 9:28:13 AM1/21/08
to
[Note: parts of this message were removed to make it a legal post.]

How many memory do you guys use when running the code? I run of of memory
when read illiad ... I have 1G ddr.

Thanks,
Jan


--
Fly on wheels.

Todd Benson

unread,
Jan 21, 2008, 10:12:20 AM1/21/08
to
2008/1/18 yermej <yer...@gmail.com>:

>
> For The Illiad, I got:
> ' who are
> perishing and coming to a bad end. We will, however, since you so
> bid us, refrain from actual fighting, but we will make serviceable
> suggestions to the Argives' (168 characters)
>
>

Using Eric's code, for "The Iliad" downloaded from the Gutenberg project...

"On this the rest of the Acheans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. " (193
characters)

Todd

Denis Hennessy

unread,
Jan 21, 2008, 10:49:30 AM1/21/08
to

In my copy of the Illiad (there's a phrase I never thought I'd use!),
those two sections differ in whitespace / linebreaks:

On this the rest of the Achaeans with one voice were for


respecting the priest and taking the ransom that he offered; but not
so Agamemnon, who spoke fiercely to him and sent him roughly away.

"Old man," said he, "let me not find you tarrying about our ships, nor
yet coming hereafter.

and

"On this the rest of the Achaeans with one voice were for respecting


the priest and taking the ransom that he offered; but not so

Agamemnon, who spoke fiercely to him and sent him roughly away. So
he went back in anger, and Apollo, who loved him dearly, heard his
prayer.

Did you get yours from http://www.fordham.edu/halsall/ancient/homer-illiad.txt
?

/dh

tho_mica_l

unread,
Jan 21, 2008, 2:32:58 PM1/21/08
to
Hi,

With some delay, here is my solution. I started on Friday with my
take
on making suffix trees respect overlaps but then stopped. I have to
admit that it was Eric's solution that made take another look on
this.
The current version is about 3 times slower than Eric's submission.
It
passes all my test cases though. It also takes the idea from (IIRC)
Dave
Thomas's submission to calculate the original start position on the
basis of the string's length.

BTW it seems to me that ruby19 (cygwin, win xp) uses a __lot__ more
memory for this.

Regards,
Thomas.


#!/usr/bin/env ruby

class String
def longest_substring_st4
suffixes = Array.new(size) {|i| self[i..size]}
suffixes.sort!
common = ''
comlen = 0
suffixes.each_with_index do |curr, index|
next if index == 0
curridx = size - curr.size
pindex = index - 1
pindex.downto(pindex - comlen) do |i|
pred = suffixes[i]
psize = pred.size
predidx = size - psize
maxlen = [(predidx - curridx).abs, psize].min - 1
next if maxlen < comlen
prefix = pred[0 .. comlen]
break if prefix != curr[0..comlen]
(comlen + 1 .. maxlen).each do |i|
p = pred[i]
c = curr[i]
if p == c
prefix << p
else
break
end
end
common = prefix
comlen = common.size
break if comlen <= maxlen
end
end
return common
end
end


if __FILE__ == $0
if ARGV.empty?
puts String.new(STDIN.read).longest_substring_st4
else
ARGV.each {|f| puts
String.new(File.read(f)).longest_substring_st4}
end
end


Adam Shelly

unread,
Jan 21, 2008, 5:07:00 PM1/21/08
to
I had the idea to use suffix trees too, but I literally built a tree.
Each node holds a start index and length for a suffix.
As I add more substrings, I count the number of matching characters
with any previous suffixes, and record the max as needed (after
handling overlap).

Unfortunately, this turns out to be slower that the ones that just
sort all the strings,
and runs out of memory on the Illiad...

-Adam

---

class String
#helper - counts number of matching characters.
def matchlen other
i=0
other.each_byte{|b|
break if b!=self[i]
i+=1
}
i
end
end

class Node
def initialize start,len,tail=nil
@i=start
@l=len
@h=tail ? tail : {}
end

def insert idx,len,matched=0
match = @h[$STR[idx]]
#add or expand child
if match
match.expand(idx,len,matched)
else
@h[$STR[idx]]=Node.new(idx,len)
end
end

def expand idx,len,matched
#count matching characters
matchlen = $STR[@i,@l].matchlen($STR[idx,len])

updateMax(idx-matched, matchlen+matched) if matchlen+matched > $max
if matchlen < @l
#split if partial match
split_at(matchlen, idx,len)
else
#else add remainder of unmatched characters
insert(idx+@l, len-@l, matchlen+matched)
end
end

def split_at point, idx,len
#one child contains the remainder of the original string(s)
newchild = Node.new(@i+point,@l-point, @h)
@h={}
@h[$STR[@i+point]]=newchild
#the other child has the remainder of the new str
@h[$STR[idx+point]]=Node.new(idx+point, len-point)
@l=point #shorten our length
end

def updateMax idx,matchlen
#if our string ends past the beginining of the match,
# discount the overlap
overlap = @i+@l-idx-1
matchlen-=overlap if overlap > 0
if matchlen>$max
$max = matchlen
$maxpt = idx
end
end
end


$STR=ARGF.read
$max=0
$maxpt=0
slen = $STR.length
half= (slen/2.0).ceil
@root=Node.new(0,0)

#we don't have to look at any substrings longer than half the input
(0..half).each{|i|
@root.insert(i,half)
}
#and the ones that start in the second half of the input are shorter still
(half+1..slen).each{|i|
len = slen-i
break if len<$max
@root.insert(i,len)
}

puts $STR[$maxpt,$max].inspect
puts "\n#{$max} chars"

James Koppel

unread,
Jan 21, 2008, 5:37:07 PM1/21/08
to
My solution finds substrings of length n, groups the identical ones together in a hash of their starting indices, prunes out cases where two of the "repeated" substring overlap each other, and then looks in the same places for substrings of length n+1.

The aggressive pruning does, however, cause it to fail in the case where a substring that overlaps at one length does not overlap at a higher length. For example, take the string "aaaaaa". The substring "aa" exists at indicies [0,1,2,3,4]. My program will prune that down to [0,2,4]. It will then record the substring of length 3 that starts at each of [0,2,4], prune out the ones that start at 2 (overlaps with the one startign at 0) and 4 (goes out of bounds), leaving only an unrepeated substring, and would thus fall back and print "aa," even though the correct answer is "aaa," because the index of 3 had already been pruned.

On the other hand, the aggressive pruning does make my program somewhat fast. It takes about 45s on Also Sprach Zarathustra.

text = ARGF.read

substr_locs = {}

text.split(//).each_with_index do |c,idx|
if substr_locs[c]
substr_locs[c] << idx
else
substr_locs[c] = [idx]
end
end

substr_locs.each_pair do |substr,locs|
substr_locs.delete(substr) if locs.length == 1
end

len = 1
done = false
until done
len += 1
new_substr_locs = {}
substr_locs.each_pair do |substr,locs|
locs.each do |idx|
s = text[idx,len]
next if idx+len>text.length
if new_substr_locs[s]
new_substr_locs[s] << idx
else
new_substr_locs[s] = [idx]
end
end
end
##Must reduce for the case where a substring overlaps itself
new_substr_locs.each_key do |s|
locs = new_substr_locs[s]
idx = 1
while idx < locs.length
if locs[idx]-locs[idx-1] < len
locs.delete_at(idx)
else
idx += 1
end
end
new_substr_locs.delete(s) if locs.length == 1
end
unless new_substr_locs.keys.empty?
substr_locs = new_substr_locs
else
done = true
end
end
puts substr_locs.keys.first

____________________________________________________________________________________
Looking for last minute shopping deals?
Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping


Todd Benson

unread,
Jan 21, 2008, 7:55:24 PM1/21/08
to

http://www.gutenberg.org/dirs/etext00/iliad10.txt

It goes to show that any text on the "web" is spurious :-)

Side note and totally off topic... why do people like to use two el's
in "The Iliad"?

> ?
>
> /dh

Todd

Alex Shulgin

unread,
Jan 22, 2008, 3:08:42 AM1/22/08
to

Hi Dave,

Thanks for the brilliant idea--this time I really feel like I've
learned something new! :)

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby...

However, with the algorithm presented, I don't see how one can obtain
the correct result for the "aaaa" case. Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is--in order to obtain the correct result, that is "aa",
one needs to observe "aa" and "aaaa", but these are not adjacent. In
this case two pairs of adjacent suffixes ("aa" and "aaa", "aaa" and
"aaaa") might give "aa", but they do overlap.

So do we need a special-case treatment here?

--
Cheers,
Alex

Jan

unread,
Jan 22, 2008, 3:51:29 AM1/22/08
to
[Note: parts of this message were removed to make it a legal post.]

Hi Alex,

Have the same problem with you :)

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

Thanks
Jan


--
Fly on wheels.

Urs Meyer

unread,
Jan 22, 2008, 7:58:28 AM1/22/08
to
Hi there,

thought first of two loops, one that moves the start position
of the candidate substring, and another that makes it longer.
It turned out that one loop that combines both is sufficient.

Without considering searching (text.index) the loop is O(n).
However, the index method is O(n). Thus, worst case (for my
program) is O(n^2).

Can one do it in O(n*log(n))?

--

#!/usr/bin/ruby
#
# ruby quiz 153 - longest repeated substring
#
# print the first one found if several such substrings exist
#
# Urs Meyer, 2008-01-22

--

# read text from STDIN, convert to lower case, as you like
text = STDIN.read.tr("\nA-Z"," a-z")

# start position, determines the remaining search space
start = 0
longest_substring = ""

# search while remaining search space is at least twice the
# the size of the currently longest substring

while (2 * longest_substring.size) < (text.length - start)

# generate substring to search for with size is one bigger
# than longest found so far
substring = text[start...(start+longest_substring.size+1)]

# search for it
i = text.index(substring, start+substring.size)

if i.nil?
# nothing found, advance start position
start += 1
else
# found a longer one, record it
longest_substring = substring
end
end

puts longest_substring

--

$ echo banana | ruby lrs.rb
an
$ echo aa | ruby lrs.rb
a
$ echo aaa | ruby lrs.rb
a
$ echo aaaa | ruby lrs.rb
aa
$

some timings...

using random characters:
$ ruby -e "puts Array.new(100000){('a'[0]+rand(26)).chr}.join" | ( time ruby lrs.rb > /dev/null )
21.015u 0.046s 0:21.39 98.4% 0+0k 0+0io 1636pf+0w

with just a bunch of 'a's the job is easier:
$ ruby -e "puts 'a'*100000" | ( time ruby lrs.rb > /dev/null )
3.390u 0.015s 0:03.50 97.1% 0+0k 0+0io 3705pf+0w

lets see whether the result is correct:
$ ruby -e "puts 'a'*100000" | ruby lrs.rb | wc -c
50001

oops, but
$ echo "a" | wc -c
2

o.k. we are fine

if you are curious, I ran it a few times on 100000 random chars:
lnxmfqu
nvpban
ibhwilj
eggfur
mbljqx
gzvxhw
..

~~
Regards
Urs


--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

Denis Hennessy

unread,
Jan 22, 2008, 8:06:16 AM1/22/08
to
On 22 Jan 2008, at 00:55, Todd Benson wrote:
> Side note and totally off topic... why do people like to use two el's
> in "The Iliad"?

Long version: because the capital-I looks identical to the lower case
L so people think they've seen two 'L's. Of course they also heard an
'I' so they insert that first.

Short version: Iggnorance.

/dh

Eric I.

unread,
Jan 22, 2008, 1:16:28 PM1/22/08
to
On Jan 21, 9:28 am, Jan <jan.h....@gmail.com> wrote:
> [Note:  parts of this message were removed to make it a legal post.]
>
> How many memory do you guys use when running the code? I run of of memory
> when read illiad ... I have 1G ddr.

Hi Jan,

I have 2G. I'm surprised that 1G would be a problem, though. I can
run my code on the War and Peace text, which is 4 times larger than
The Illiad.

The problem with pasting code here is that there can be line break
issues depending on the reader you're using. I'm wondering whether a
bad line break broke something. One way around this potential issue
might be to pull the code from:

http://learnruby.com/examples/ruby-quiz-153.shtml

Eric

Eric I.

unread,
Jan 22, 2008, 1:26:16 PM1/22/08
to
On Jan 20, 5:18 pm, JJ <jjnoa...@gmail.com> wrote:
> On Jan 20, 2:23 pm, "Eric I." <rubytrain...@gmail.com> wrote:
>
> > On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:
>
> > > I wouldn't be surprised if the idea of searching only 1/2
> > > of the second string to prevent overlaps is wrong.. :)
>
> > I think you're right in that it's wrong.  ;)
>
> ...snip
>
> > I'll post my solution in a reply, which is very similar to your
> > except in the overlap prevention code, which, I have to admit, is
> > pretty ugly.  And I'm not even convinced that I got it right!
>
> Dave's code can be corrected by realizing that since all suffix
> strings end at the same place (the end of the string), then of the two
> adjacent strings being tested, one is a suffix of the other.
>
> This means that to detect overlap, the following test can be used:
>
>   if prefix.length + s1.length > s2.length then
>     # Overlap
>   end
>
> where "prefix" is the current prefix being checked in the two adjacent
> suffix strings.

...

> Adding this test (instead of the s2.length / 2 test) and also testing
> adjacent strings that start with the prefix currently being searched
> (to find later matches if earlier ones overlap) would correct Dave's
> solution and shouldn't be much more complicated.

That's very close to what I do, but I go in the opposite direction. I
(try to) track the size window of how many previous sorted suffixes
must be checked against the last in the sequence.

So if I'm comparing substrings a and b in [..., a, b, ...], and if the
amount of non-overlap is smaller than the longest found substring so
far, then I enlarge the window. So now when I'm focusing on c in
[..., a, b, c, ...] I'll compare it to both a and b.

There needs to be a way to shrink the window as well, so it doesn't
grow monotonically, and so if the a and c don't have the same prefix
which exceeds the length of the longest found substring so far, the
window shrinks.

Part of my complexity comes from trying to detect cases where I can
avoid the full comparison early, and so every time I try to do the
avoid (by using "next" to start the next iteration), I must also do
some window management.

Maybe re-orienting the code in your way would simplify things. If you
have the time, I'd be very curious to see what you would come up with.

Eric

tho_mica_l

unread,
Jan 22, 2008, 1:39:25 PM1/22/08
to
I noticed a huge difference in memory use on whether I run the code
with ruby 1.8 or 1.9.

With ruby 1.8, the suffix-array-based/-like solutions require about
120mb for the Iliad. With 1.9 ... I get an "failed to allocate memory
(NoMemoryError)" error too (virtual memory being disabled, 1GB).

Thomas.

Todd Benson

unread,
Jan 22, 2008, 4:25:18 PM1/22/08
to

Okay, this is an ignorant reply. I'm really sorry to the people that
actually want to program. Just delete this :)

Your previous link has two "L's". And many of the other posts do as
well. And, I am pretty sure it has nothing to do with fonts.

Todd

Todd Benson

unread,
Jan 22, 2008, 4:53:57 PM1/22/08
to
On Jan 22, 2008 3:25 PM, Todd Benson <cadu...@gmail.com> wrote:
>
> On Jan 22, 2008 7:06 AM, Denis Hennessy <de...@hennessynet.com> wrote:
> > On 22 Jan 2008, at 00:55, Todd Benson wrote:
> > > Side note and totally off topic... why do people like to use two el's
> > > in "The Iliad"?
> >
> > Long version: because the capital-I looks identical to the lower case
> > L so people think they've seen two 'L's. Of course they also heard an
> > 'I' so they insert that first.
> >
> > Short version: Iggnorance.
>
> Okay, this is an ignorant reply. I'm really sorry to the people that
> actually want to program. Just delete this :)

I meant "this will be an ignorant reply". It is hard to throw candor
correctly over the internet.

Has anyone figured out why my text result differs from others?
Spacing/newlines, obviously, is a factor, but I get the same result on
every machine. But then, if you think about it, then to use something
like this, you _really_ have to trust the data.

Todd

tho_mica_l

unread,
Jan 22, 2008, 5:03:30 PM1/22/08
to
> Your previous link has two "L's". And many of the other posts do as
> well. And, I am pretty sure it has nothing to do with fonts.

I think, it's "Ilias" in ancient Greek anyway. It's quite a common
mistake/transliteration.

Alex Shulgin

unread,
Jan 23, 2008, 6:38:27 AM1/23/08
to
On Jan 22, 10:51 am, Jan <jan.h....@gmail.com> wrote:
> [Note: parts of this message were removed to make it a legal post.]
>
> Hi Alex,
>
> Have the same problem with you :)
>
> I think we should understand the 'adjacent' as 'not overlapped adjacent'
> here. Each time we take two adjacent suffix, we check whether they
> overlapped first - if so, make s2 point to next suffix, until we found a not
> overlapped one.

Hm... but every pair of suffixes _are_ overlapped just 'cause all of
them last to the end of the text, so we cannot 'check first'.
Anyway...

> For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
> we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

This might be an appropriate hack--I'll try it when I have a
minute. :)

--
Cheers,
Alex

Jan

unread,
Jan 23, 2008, 10:05:06 AM1/23/08
to
[Note: parts of this message were removed to make it a legal post.]

On Jan 23, 2008 7:39 PM, Alex Shulgin <alex.s...@gmail.com> wrote:

> On Jan 22, 10:51 am, Jan <jan.h....@gmail.com> wrote:
> > [Note: parts of this message were removed to make it a legal post.]
> >
> > Hi Alex,
> >
> > Have the same problem with you :)
> >
> > I think we should understand the 'adjacent' as 'not overlapped adjacent'
> > here. Each time we take two adjacent suffix, we check whether they
> > overlapped first - if so, make s2 point to next suffix, until we found a
> not
> > overlapped one.
>
> Hm... but every pair of suffixes _are_ overlapped just 'cause all of
> them last to the end of the text, so we cannot 'check first'.
> Anyway...


Yes u'r right ... I should have described more clear: take two adjacent,
check whether the "head" of them overlapped:

overlap?(s1[0, current_longest_substr.length], s2[0,
current_longest_substr.length])

Thanks
Jan


>
> > For aaaa: on the step of comparing aa and aaa, we found they overlapped,
> so
> > we keep s1 = aa, but make s2 = aaaa, the one next to aaa.
>
> This might be an appropriate hack--I'll try it when I have a
> minute. :)
>
> --
> Cheers,
> Alex
>
>


--
Fly on wheels.

Raffa

unread,
Jan 23, 2008, 12:32:06 PM1/23/08
to
cannot use regex, i suppose:

(.{aNumber}).*\1

where aNumber = 2..x (x=text.length/2), until regex 'response' is
'nil'

James Gray

unread,
Jan 23, 2008, 1:08:47 PM1/23/08
to

You can use whatever you like.

Try it out on War & Peace though and see how it does. :)

James Edward Gray II


SERGEY VOLKOV

unread,
Jan 23, 2008, 1:21:31 PM1/23/08
to
Why not?
The simplest solution could be:

def longest_repeated_substring str
(str.size/2).downto(1) { |i|
/(.{#{i}}).*\1/m =~ str and return $1
}
nil
end

but it's too slow;

Ken Bloom

unread,
Jan 24, 2008, 12:38:05 AM1/24/08
to

Well, one would think that the absolute simplest solution is /(.*+).*\1/,
but that's what I was testing when I invented the "your banana my banana"
test case. (Which it failed, returning only "y" as the repeated
substring).

Yours is a nice extension of this basic idea.

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

SERGEY VOLKOV

unread,
Jan 24, 2008, 7:17:26 AM1/24/08
to
It's linear search for repeated substring length,
I've tried binary search, too slow too

Ruby Quiz

unread,
Jan 24, 2008, 11:11:46 AM1/24/08
to
I first saw this problem years ago. It was one of the Perl's Quiz of the Week
problems. I remember being surprised at how the solution was sort of
counter-intuitive to me. I mean that my first thought was: start at half the
string size, try to find a string of that length that occurs twice in the full
text, reduce the length by one character and recheck until you find a match.
Yoan Blanc coded that strategy up for us:

text = STDIN.read

(text.length/2).downto 1 do |l|
match = Regexp.new("(.{#{l}})\\1").match(text)
if match
puts text[match.offset(1)[0]..(match.offset(1)[1]-1)]
break
end
end

Technically the regular expression engine will choke on this solution for a
significantly large text. The reason is that quantifiers in {…} limits are
only allowed to be so large. The theory is as I described though.

However, even if it could match any size, this solution turns out to be far too
slow to be practical. The reason is that given a megabyte or more of text, the
longest repeated substring is typically less than 200 bytes. That means you do
a whole lot of wasted checks before you are even in the range that a match can
occur.

Yoan realized this and submitted a second solution that flipped the search
around. Starting with a match of zero characters and finding the next biggest
puts you in the right neighborhood from the get go.

The other key realized in Yoan's second solution is that if we know there is a
four character repeated substring at some index, we can also count on the fact
that there are one, two, and three character repeated substrings at the same
index. Given that, you can keep checking at the same index until you don't find
a match for a given size. The size found just before that was the biggest at
that index.

The Perl guys found further means to optimize this approach, by trimming the
search space. However, this too breaks down in the face of significantly large
inputs. That's why we have suffix trees.

The idea of a suffix tree is to build a list of every suffix that occurs in the
text. Thus the quiz example "banana" breaks down to:

banana
anana
nana
ana
na
a

That may not look like a lot of help yet, but watch what happens if we sort the
entries:

a
ana
anana
banana
na
nana

See how the common prefixes group together? We can now compare adjacent entries
of our suffix tree for prefixes they have in common. In this case, the second
and third element share an "an" and that's one of the possible answers. Note
that the second and third share "ana" as well, but selecting that one causes
overlap:

[ana]na
[ana]

There's a catch though. Consider this example input from Eric I.: ananana.
The suffix tree is:

ananana
nanana


anana
nana
ana
na
a

That sorts into:

a
ana
anana
ananana
na
nana
nanana

In this case comparing just the second and third entries gives us "an", because
"ana" would overlap. But, if you compare the second and fourth entries, "ana"
becomes a valid answer. That tells us that it's sometimes necessary to look
ahead more than just one step.

Enough explaining, let's read some code. I'm going to show Eric's solution
below, because it was very well commented and that helped me understand it.
However, I'm going to remove those comment in the following listings in the
interests of space. I also made a couple of tiny edits that I felt simplified
the code a touch.

First we have a simple helper method:

def longest_common_prefix(s1, s2, max = nil)

min = [s1.size, s2.size, max].compact.min


min.times do |i|
return s1.slice(0, i) if s1[i] != s2[i]
end
return s1.slice(0, min)
end

# ...

Given two Strings, this code will find the longest prefix they share. It begins
by finding the smallest length among the passed Strings and an optional provided
maximum. It then walks those indexes looking for mismatched characters.

When it finds a mismatch, it returns what came before. How it does that is a
little clever though, since it seems to use the same numbers. Consider that we
are comparing "abc" and "abd" on the third character. That means i = 2 and
s1[i] != s2[i]. That will cause the code to return s1.slice(0, i), but remember
that i is used as a length there, not an index. That's why we get the first two
characters back ("ab").

If no mismatches arise during the search, they match all the way to our limit
and that is returned.

The next method is where all the action is and it's a doozy, so we will take it
in pieces. The first chunk builds and sorts the suffix tree:

# ...



def longest_repeated_substring(string)
size = string.length

suffixes = Array.new(size)
size.times do |i|
suffixes[i] = string.slice(i, size)
end

suffixes.sort!

# ...

This code might seem wildly inefficient (memory-wise) at first glance, but Ruby
will surprise you here. When you slice() a substring like this, Ruby cheats and
doesn't actually make a copy of the text. Instead, the new object is a pointer
into the old object. Now, if you change either String down the road, Ruby must
and will finish the job, turning it into a full copy. We call this behavior
"Copy on Write." Since we won't be changing these Strings though, just
examining them, we're safe with the tiny pointers for the duration.

Note that the size variable is always used as the substring length here, though
it's too long in all but the first case. Ruby will stop at the end of the
String even if you provide a longer length, so it does the same thing and avoids
unneeded math or Range object construction.

OK, let's begin the search through the tree:

# ...



best = ""
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

(1...size).each do |i|
s1 = suffixes[i]



neighbors_to_check.downto(1) do |neighbor|
s2 = suffixes[i - neighbor]

distance = (s1.size - s2.size).abs
if distance < at_least_size
if s1.size >= at_least_size &&
s2.size >= at_least_size &&
s1.slice(0, at_least_size) == s2.slice(0, at_least_size)

neighbors_to_check = [neighbors_to_check, neighbor + 1].max


else
neighbors_to_check = neighbor
end
next
end

# ...

First, some variables are set to hold the best match so far, the size a match
would need to be to beat it, the distance between the substrings, and how far
down the tree we need to search. After, that we enter a simple loop trying each
suffix.

The neighbors_to_check iteration is our look ahead for similar suffixes that
share our prefix. We usually only need to look one step ahead, but these checks
watch for overlap cases and extends our search when needed.

Note that we figure a distance between substrings here, even though we didn't
remember where they started. That's possible because all suffixes are anchored
to the far edge of our original text. Knowing that, comparing lengths is the
same as comparing starting indexes.

A distance below our needed match size warns us that there is overlap and we may
need to look further ahead in this case. The if conditions just ensure the
Strings are of the length we should even consider and that they match at least
that much. If all of that turns out to be true, our search ahead is extended.

Now we are ready to do the actual comparisons:

# ...



unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = neighbor
next
end

best = longest_common_prefix(s1, s2, distance)
at_least_size = best.size + 1
if best.size == distance

neighbors_to_check = [neighbors_to_check, neighbor + 1].max


else
neighbors_to_check = neighbor
end
end
end

best
end

# ...

The unless check we see here is just a short circuit optimization. There's no
need to check for common prefixes if the Strings aren't a match at least as long
as our current best.

If we've made it this far, we know the current two Strings share a prefix that
meets or exceeds our current best. A hand-off is made to
longest_common_prefix() to grab our new best and the rest of the iterator resets
the search parameters.

The best we found in the entire search is then returned as the solution.

Here's the interface code that wraps this method:

# ...



if $0 == __FILE__
string = nil
if ARGV[0] == "-f"

string = File.read(ARGV[1])


elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts "usage:"
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)

puts %Q{"#{result}" (#{result.length} characters)}
end

Most of the above code just sorts out the command-line arguments. The checks
determine whether to read the text from a file, STDIN, or the arguments
themselves.

The last two lines call the workhorse method we just finished examining and
print details about the best match found.

My thanks to all who stretched my brain with all of the excellent discussion
about this week's problem.

Tomorrow we will try Ruby's hand at coin counting...

0 new messages