Here are some ideas that won't necessarily give an optimal answer,
i.e. a shortest one, but sometimes it will, and even when it doesn't
it should give something reasonable.
There are O(n^2) subsequences of your original sequence.
Step 1. For each one, you can in linear time determine whether that
subsequence repeats immediately afterwards, and if so, how many times
it repeats. Calculate those answers R[i,j] for each of the O(n^2)
subsequences and save them, e.g. in a 2-d array indexed by start
position i and end position j. That should take O(n^3) time in the
worst case, but often closer to O(n^2) depending upon the sequence you
have.
Example: I'll use just strings of characters to represent the
sequences, to save typing. So your sequence
[:a :b :a :b :c :a :b :a :b :c] above I will write as: ababcababc
If we use [i,j] to represent a subsequence starting at index i (0-
based, so the first element of the sequence is at 0) and up to but not
including the element at index j, then [1,4] represents the
subsequence "bab".
Here is part of the table calculated in part 1:
R[0,1] 1 (i.e. "a" repeats 1 time starting at index 0, consecutively.
It does not immediately repeat after that)
R[0,2] 2 (i.e. "ab" repeats 2 times starting at index 0, consecutively.)
R[0,3] 1 ("aba" appears only 1 time starting at index 0, consecutively.)
R[0,4] 1
R[0,5] 2
R[0,6] 1
etc.
Step 2. Now go over every subsequence again, this time looking for the
shortest way to represent that subsequence using your notation. This
will depend upon exactly what you consider the quantitative "savings"
is for your repetition notation. In this example, I'll use the number
of characters used to write down the notation.
So ababcababc has "cost" 10
but 2[ababc] has cost 8 (I'm counting the digits and the square
brackets, too -- you can choose differently in whatever algorithm you
choose to implement)
For [0,10], we can either represent it as its original 10-element
sequence (cost 10), or maybe as 2 repetitions of [0,5], or maybe as 5
repetitions of [0,2]. In general, a sequence of length N can either
be represented as itself, or for each value K that evenly divides N,
as (N/K) repetitions of the sequence of length K at its beginning.
For each one of those possibilities, use table R to tell in constant
time which of those is possible and which is not.
[0,10] - cost 10
2 repetitions of [0,5]? - Yes, this is possible, because R[0,5] >= 2.
5 repetitions of [0,2]? - No, impossible, because R[0,2] < 5.
So for [0,10], there are two ways to represent it: cost 10 in its
original unrepeated form, or cost 8 for "2[ababc]".
Let us make a table S[i,j] for all subsequences that tells us how much
we can save by representing [i,j] (the whole thing, not just part of
it) as a repetition.
So S[i,j]=10-8=2, because we can save 2 characters by using the
repetition 2[ababc] in place of writing out the full sequence.
Part of table S would look like this:
S[0,1]=0 No shorter way than writing out "ab", so no savings
S[0,2]=0 No way to write _all_ of "aba" as a repetition, so no
savings possible.
S[0,3]=0 We could write "abab" as that or as "2[ab]", but both have
cost 4, so no savings possible.
S[0,4]=0 Same as S[0,2]
S[0,5]=0
S[0,6]=0
S[0,7]=0
S[0,8]=0
S[0,9]=0
S[0,10]=2
I believe that step 2 would take at most O(n^3) time as well, if you
precomputed a table of factors once.
Step 3. Steps 1 and 2 give exact answers for what they calculate, but
in this step, I don't yet see an efficient algorithm that is
guaranteed to find a "least cost" way of representing the original
sequence. Maybe someone else will.
You start with the whole sequence, and do a greedy algorithm on the
table S looking for big savings. So you find the largest number in S,
and say "I'm going to use that cost savings to represent that part of
the original sequence, since it saves so much". Let's say you picked
S[3,9] because it was a largest element of S. After that, you cannot
pick any subsequence of [3,9], or any subsequence that overlaps with
[3,9] at all, and use their savings, because you can't overlap or nest
repetitions. So you look for the largest savings possible among all
remaining subsequences that do not overlap [3,9]. Keep doing that
until you've completely covered the original sequence, or until the
only remaining choices give 0 savings.
Andy
Read about it here:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/zip/package-summary.html
fpc