Pimp my algorithm: finding repeating subsequences

687 views
Skip to first unread message

Mark Fredrickson

unread,
Feb 13, 2011, 1:11:31 PM2/13/11
to Clojure
Hello friends,

I am writing a program to generate directions for humans to read. The
directions are composed of a collection with a series of steps, some
of which may be duplicated:

[:a :b :c :a :b :c]

I would like to compress them by indicating repeats, using this
notation:

[[2 [:a :b :c]]

Since the directions will be read by humans (after some formatting), I
am limiting nesting to one level. So the following would be an invalid
transformation:

[:a :b :a :b :c :a :b :a :b :c] => [[2 [2 [:a :b]] :c]]

My current algorithm for the compress has O(2^n) complexity, which
makes it useless except for the most trivial cases. I'm looking for
something that runs in polynomial time or better. To give some
context, my current algorithm is as follows:

1. Generate the set indices for the power set of the directions (e.g.
(powerset [:a :b :c]) => [[] [1 0 0] [1 1 0] ...]).
2. For each index, use partition-by to generate a set of groups of the
vector (e.g. [1 1 0] => [[:a :b] [:c]]). Since [1 1 0] and [0 0 1] are
identical for my needs, take the distinct groupings (only generating
the distinct groups would only be O(2^{n-1]), so I haven't bothered to
shave computations here).
3. For each grouping, merge identical groups from left to right.
4. Score all the groupings by counting the number of steps (symbols in
these examples). The grouping with the fewest steps is the "best".

I've looked at some string searching algorithms (Boyer-Moore, e.g.),
but these seem geared towards finding a specific substring in a larger
string. Perhaps these algorithms would apply to my problem, but I'm
just not seeing it.

Thanks in advance for your help. All suggestions are welcome.

-Mark

Mark Engelberg

unread,
Feb 13, 2011, 2:00:59 PM2/13/11
to clo...@googlegroups.com
This is essentially a compression problem.  I think you want to research the topic of "Run-length encoding", and then look at algorithms like LZ77 which handle runs of repeated characters.  LZ77 finds repeated blocks that are some distance apart; so, for example, you could restrict the algorithm to look at distances which match the block size, to locate the blocks that immediately follow one another.

I think you're right that finding the truly optimal instruction set would be hopelessly time consuming, but it should be possible to draw inspiration from compression algorithms and write a program that finds a satisfyingly good instruction set in polynomial time.

I hope that gives you some ideas to explore!

Andy Fingerhut

unread,
Feb 13, 2011, 2:41:43 PM2/13/11
to clo...@googlegroups.com

Here are some ideas that won't necessarily give an optimal answer,
i.e. a shortest one, but sometimes it will, and even when it doesn't
it should give something reasonable.

There are O(n^2) subsequences of your original sequence.

Step 1. For each one, you can in linear time determine whether that
subsequence repeats immediately afterwards, and if so, how many times
it repeats. Calculate those answers R[i,j] for each of the O(n^2)
subsequences and save them, e.g. in a 2-d array indexed by start
position i and end position j. That should take O(n^3) time in the
worst case, but often closer to O(n^2) depending upon the sequence you
have.

Example: I'll use just strings of characters to represent the
sequences, to save typing. So your sequence
[:a :b :a :b :c :a :b :a :b :c] above I will write as: ababcababc

If we use [i,j] to represent a subsequence starting at index i (0-
based, so the first element of the sequence is at 0) and up to but not
including the element at index j, then [1,4] represents the
subsequence "bab".

Here is part of the table calculated in part 1:

R[0,1] 1 (i.e. "a" repeats 1 time starting at index 0, consecutively.
It does not immediately repeat after that)
R[0,2] 2 (i.e. "ab" repeats 2 times starting at index 0, consecutively.)
R[0,3] 1 ("aba" appears only 1 time starting at index 0, consecutively.)
R[0,4] 1
R[0,5] 2
R[0,6] 1
etc.

Step 2. Now go over every subsequence again, this time looking for the
shortest way to represent that subsequence using your notation. This
will depend upon exactly what you consider the quantitative "savings"
is for your repetition notation. In this example, I'll use the number
of characters used to write down the notation.

So ababcababc has "cost" 10
but 2[ababc] has cost 8 (I'm counting the digits and the square
brackets, too -- you can choose differently in whatever algorithm you
choose to implement)

For [0,10], we can either represent it as its original 10-element
sequence (cost 10), or maybe as 2 repetitions of [0,5], or maybe as 5
repetitions of [0,2]. In general, a sequence of length N can either
be represented as itself, or for each value K that evenly divides N,
as (N/K) repetitions of the sequence of length K at its beginning.
For each one of those possibilities, use table R to tell in constant
time which of those is possible and which is not.

[0,10] - cost 10
2 repetitions of [0,5]? - Yes, this is possible, because R[0,5] >= 2.
5 repetitions of [0,2]? - No, impossible, because R[0,2] < 5.

So for [0,10], there are two ways to represent it: cost 10 in its
original unrepeated form, or cost 8 for "2[ababc]".

Let us make a table S[i,j] for all subsequences that tells us how much
we can save by representing [i,j] (the whole thing, not just part of
it) as a repetition.

So S[i,j]=10-8=2, because we can save 2 characters by using the
repetition 2[ababc] in place of writing out the full sequence.

Part of table S would look like this:

S[0,1]=0 No shorter way than writing out "ab", so no savings
S[0,2]=0 No way to write _all_ of "aba" as a repetition, so no
savings possible.
S[0,3]=0 We could write "abab" as that or as "2[ab]", but both have
cost 4, so no savings possible.
S[0,4]=0 Same as S[0,2]
S[0,5]=0
S[0,6]=0
S[0,7]=0
S[0,8]=0
S[0,9]=0
S[0,10]=2

I believe that step 2 would take at most O(n^3) time as well, if you
precomputed a table of factors once.


Step 3. Steps 1 and 2 give exact answers for what they calculate, but
in this step, I don't yet see an efficient algorithm that is
guaranteed to find a "least cost" way of representing the original
sequence. Maybe someone else will.

You start with the whole sequence, and do a greedy algorithm on the
table S looking for big savings. So you find the largest number in S,
and say "I'm going to use that cost savings to represent that part of
the original sequence, since it saves so much". Let's say you picked
S[3,9] because it was a largest element of S. After that, you cannot
pick any subsequence of [3,9], or any subsequence that overlaps with
[3,9] at all, and use their savings, because you can't overlap or nest
repetitions. So you look for the largest savings possible among all
remaining subsequences that do not overlap [3,9]. Keep doing that
until you've completely covered the original sequence, or until the
only remaining choices give 0 savings.

Andy


Fred Concklin

unread,
Feb 13, 2011, 3:07:12 PM2/13/11
to clo...@googlegroups.com
DEFLATE is a compression method that combines LZ77 with Huffman
coding. It is implemented in java.util.zip.

Read about it here:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/zip/package-summary.html

fpc

Mark Fredrickson

unread,
Feb 13, 2011, 10:56:04 PM2/13/11
to Clojure
Thanks for the hint on LZ77. That was exactly what I was looking for.

Thank you also to the others who provided this and other answers.

Thanks!
-Mark

Tyler Perkins

unread,
Feb 19, 2011, 2:05:17 PM2/19/11
to Clojure
Interesting. This problem is like compression, but the first thing I
thought of was pattern recognition. If your tokens are just characters
in a string, you can let regular expressions do all the heavy lifting:

(defn rep-groups [s] (re-seq #"(.+)\1+|." s))

Then (rep-groups ".,ababcababc.,cc,.abab.") for example, gives

(["." nil] ["," nil] ["ababcababc" "ababc"] ["." nil] ["," nil] ["cc"
"c"] ["," nil] ["." nil] ["abab" "ab"] ["." nil])

Now we just need to convert each of those match vectors into the form
you want, like this:

(map (fn [[mtch subst]] (if subst
[(/ (count mtch) (count subst)) subst]
mtch))
*1) ;; Where *1 is the match results, above.

Gives

("." "," [2 "ababc"] "." "," [2 "c"] "," "." [2 "ab"] ".")

Cool! Pretty close to what you need. But you wanted to use arbitrary
tokens, like :a, :b, etc. So let's just represent each token by a
unique character. (There are LOTS of characters available.) Here's a
function that creates two 1-to-1 maps -- one that converts tokens to
characters, and the other its inverse, converting characters to their
tokens:

(defn token-to-char-map-and-inverse [token-collection]
(let [tokens (seq (set token-collection))
chars (map char (range (count tokens)))]
[(zipmap tokens chars) (zipmap chars tokens)]))

We'll need a presentation function like the (fn ...) above. Since it
will also convert the characters back to tokens, it will also depend
upon the char-to-token map. So the following function takes such a map
and returns such a function:

(defn count-subst-fn [c-to-t]
(fn [[mtch subst]]
(if subst
[(/ (count mtch) (count subst)) (map c-to-t subst)]
(apply c-to-t mtch))))

Finally, put it all together:

(defn parse-reps [token-seq]
(let
[[t-to-c c-to-t] (token-to-char-map-and-inverse token-seq)
s (apply str (map t-to-c token-seq))]
(map (count-subst-fn c-to-t) (rep-groups s))))

For example, (parse-reps [:_ :- :a :b :c :a :b :c :a :a :_]) gives

(:_ :- [2 (:a :b :c)] [2 (:a)] :_)

Notice that the input itself is used to define the set of tokens we're
using. We could even us a string as input. (parse-reps
".,ababcababc.,cc,.abab.") gives

(\. \, [2 (\a \b \a \b \c)] \. \, [2 (\c)] \, \. [2 (\a \b)] \.)

These four (defn ...) provide a simple solution to your problem. Using
regular expressions also yields a nice separation of concerns, I
think, and it allows for experimentation with the pattern. As defined
above, (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives

([2 (:a :b :a :b :c)])

But let's try redefining the regular expression (note the "?"):

(defn rep-groups [s] (re-seq #"(.+?)\1+|." s))

Then the same (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives the
less greedy result,

([2 (:a :b)] :c [2 (:a :b)] :c)

Mark Fredrickson

unread,
Feb 19, 2011, 5:10:40 PM2/19/11
to Clojure
Very interesting solution. I implemented an LZ77 like solution, which
appears to be working, but its good to have alternative solutions. As
you point out, different match patterns might be interesting. I tune
the matching by disallowing small matches based on length, but the
pattern string would be more flexible than hard coded logic rules.

Thanks,
-M

Jules

unread,
Feb 20, 2011, 11:18:37 AM2/20/11
to Clojure
Here is my solution that solves the problem optimally. Observe that
given a sequence we can do one of two things: (1) represent it as a
repetition or (2) represent it as a concatenation of repetitions. If
we had two functions to calculate all the ways of doing (1) and (2) we
could solve the problem as follows:

(defn represent [s] (apply min-key cost (concat (reps s) (concats
s))))

That is, we pick out the minimum cost way of representing s as a
repetition or as a concatenation of repetitions, where reps returns
all repetitions and concats returns all concatenations.

The cost function to count the number of symbols is:

(defn cost [repr] (apply + (map #(count (% 1)) repr)))

Of course you can use another cost function.

The function reps can be defined as follows:

(defn reps [s]
(map vector (filter #(= s (apply concat (repeat (first %) (second
%))))
(map #(vector (/ (count s) %) (take % s)) (range 1 (+ 1 (count
s)))))))

For example, reps on ababab gives: [3 ab] and [1 ababab]. Note that
this algorithm is very naive: it tries all repetitions [n (take n s)]
and eliminates invalid ones. The code may look involved but the
algorithm is rather simplistic.

The function concats is as follows:

(defn concats [s]
(map #(apply concat (map represent (split-at % s))) (range 1 (-
(count s) 1))))

It works like this. We split s at all possible boundaries: ababab => a
babab, ab abab, aba bab, abab ab, ababa b. Then we recursively
calculate the optimal representation of each of the pairs, and
concatenate the representations. This gives us all ways to represent s
as a concatenation of optimal representations.

That is all it takes :)

(represent '(a b b a b b a)) => ([1 (a)] [2 (b b a)])

Unfortunately the algorithm is exponential time. We can fix this by
memoizing represent:

(memoize represent)

Now the algorithm is polynomial time. It is still rather slow for
large sequences (but there is a lot of room for optimization), and
gives stack overflows. You can fix this by eliminating the recursion
and filling the memoization table iteratively.

Jules

Jules

unread,
Feb 21, 2011, 10:21:01 AM2/21/11
to Clojure
Sorry that should be:

(def represent (memoize represent))
Reply all
Reply to author
Forward
0 new messages