Rule #2 requires a bit of explanation. If a chosen pattern requires 40
bits to store it based on its length and overall number of combinations
chosen and this pattern is represented in text enough times to justify
its pick, then we take it.
Problem statement:
Can somebody think of an algorithm to efficiently apply found patterns
below to the original sample, and using some cost formula as explained
in #2, derive an optimal combination.
My thinking:
It's not easy to do, because you don't know whether or not remaining
patterns will make it into a final selection. Thus, when judiging some
pattern I need to make an optimal (locally) decision based on partial
info, and adjust it as it proceeds? There is a huge amount of info to
keep while the algorithm is crunching even on a small data sample.
If anyone is interested in the code for PatternRecognizer, drop me a
line.
vort...@hotmail.com
-- EXAMPLE --
( run on a sample "LOVE YOU" )
Found patterns:=35
Pattern#, PatternLength, PatternBegins, Occurences, Pattern (in
brackets)
1 , p_len:=1 , p_beg:=2 , freq:=2 , <O>
2 , p_len:=8 , p_beg:=1 , freq:=1 , <LOVE YOU>
3 , p_len:=7 , p_beg:=2 , freq:=1 , <OVE YOU>
4 , p_len:=7 , p_beg:=1 , freq:=1 , <LOVE YO>
5 , p_len:=6 , p_beg:=3 , freq:=1 , <VE YOU>
6 , p_len:=6 , p_beg:=2 , freq:=1 , <OVE YO>
7 , p_len:=6 , p_beg:=1 , freq:=1 , <LOVE Y>
8 , p_len:=5 , p_beg:=4 , freq:=1 , <E YOU>
9 , p_len:=5 , p_beg:=3 , freq:=1 , <VE YO>
10 , p_len:=5 , p_beg:=2 , freq:=1 , <OVE Y>
11 , p_len:=5 , p_beg:=1 , freq:=1 , <LOVE >
12 , p_len:=4 , p_beg:=5 , freq:=1 , < YOU>
13 , p_len:=4 , p_beg:=4 , freq:=1 , <E YO>
14 , p_len:=4 , p_beg:=3 , freq:=1 , <VE Y>
15 , p_len:=4 , p_beg:=2 , freq:=1 , <OVE >
16 , p_len:=4 , p_beg:=1 , freq:=1 , <LOVE>
17 , p_len:=3 , p_beg:=6 , freq:=1 , <YOU>
18 , p_len:=3 , p_beg:=5 , freq:=1 , < YO>
19 , p_len:=3 , p_beg:=4 , freq:=1 , <E Y>
20 , p_len:=3 , p_beg:=3 , freq:=1 , <VE >
21 , p_len:=3 , p_beg:=2 , freq:=1 , <OVE>
22 , p_len:=3 , p_beg:=1 , freq:=1 , <LOV>
23 , p_len:=2 , p_beg:=7 , freq:=1 , <OU>
24 , p_len:=2 , p_beg:=6 , freq:=1 , <YO>
25 , p_len:=2 , p_beg:=5 , freq:=1 , < Y>
26 , p_len:=2 , p_beg:=4 , freq:=1 , <E >
27 , p_len:=2 , p_beg:=3 , freq:=1 , <VE>
28 , p_len:=2 , p_beg:=2 , freq:=1 , <OV>
29 , p_len:=2 , p_beg:=1 , freq:=1 , <LO>
30 , p_len:=1 , p_beg:=8 , freq:=1 , <U>
31 , p_len:=1 , p_beg:=6 , freq:=1 , <Y>
32 , p_len:=1 , p_beg:=5 , freq:=1 , < >
33 , p_len:=1 , p_beg:=4 , freq:=1 , <E>
34 , p_len:=1 , p_beg:=3 , freq:=1 , <V>
35 , p_len:=1 , p_beg:=1 , freq:=1 , <L>
Well, why not use a rating system by savings gained (with a fuzzy middle
ground for ambiguous smaller pattern overlaps).
For example, I assume your example's prime savings is gained by
recognizing the "LOVE YOU " phrase is repeated N number of times. Given
that in your list #2 matches the obvious maximal savings, that should be
rated the highest in comparison to the smaller strings.
Obvious pattern rating check would be NUMBER OF OCCURANCES (shouldn't
be zero) divided by STRING LENGTH. Obvious flaw in that would be a
potential "LOVE YOU LOVE YOU " getting rated higher than "LOVE YOU " if the
occurances is dividable evenly by 2 (which is non-optimal) and the final
"LOVE YOU" has no ending space or a grammatical exclaimation ( '.', ','
'!' ).
Logically the single "LOVE YOU " should get a much higher rating for
your example everytime otherwise so the truth is that the rating that is
closest to equalling 1.0 is your most desireable choice.
Example,
========
"LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU "
1 2 3 4 5
6
"LOVE YOU " = 6 occurances / 9 characters = RATING 0.67
========
"LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU"
1 2 3
(no space at end)
"LOVE YOU LOVE YOU " = 3 occurances / 18 characters = RATING 0.17
=========
"LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU"
1L 1Y 2L 2Y 3L 3Y 4L 4Y 5L 5Y 6L
6Y 7L 7Y
"LOVE" (no spaces) = 7 occurances / 4 characters = RATING 1.4
"YOU" (no spaces) = 7 occurances / 3 characters = RATING 2.33
" " (space) = 13 occurances / 1 character = RATING 13.0
"LOVE YOU" (ignoring the end space) = 7 occurances / 8 characters = RATING
0.875
"LOVE YOU "(with end space) = 6 occurances / 9 characters = RATING 0.67
("LOVE YOU" is optimal here" with the rating of 0.875)
=========
"LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU "
1L 1Y 2L 2Y 3L 3Y 4L 4Y 5L 5Y 6L
6Y 7L 7Y
"LOVE" (no spaces) = 7 occurances / 4 characters = RATING 1.75
"YOU" (no spaces) = 7 occurances / 3 characters = RATING 2.33
" " (space) = 13 occurances / 1 character = RATING 13.0
"LOVE YOU" (ignoring the end space) = 7 occurances / 8 characters = RATING
0.875
"LOVE YOU "(with end space) = 7 occurances / 9 characters = RATING 0.78
("LOVE YOU " is more optimal here, but gets a worse rating.)
=========
So I suggest a simple rating overrule, if the target string "LOVE YOU
LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU " is cleared out
completely by the potential optimals (0 characters left over or fewest) then
that gets the prize over the one that leaves bits remaining requiring
compression.
==========
"LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU LOVE YOU "
"LOVE YOU" (ignoring the end space) = 7 occurances / 8 characters = RATING
0.875
(Remaining characters once applied = " " <<space>> at the end = 1)
"LOVE YOU "(with end space) = 7 occurances / 9 characters = RATING 0.78
(Remaining characters once applied = <<NULL>> = 0) OVERRIDE = "LOVE YOU "
gets the prize for optimal clumping)
==========