collectio...@googlemail.com writes: > On Aug 12, 9:42 pm, Ashley Labowitz <sporecoun...@gmail.com> wrote: > > Thanks guys, I do appreciate that point of view. But the fact of the > > matter is that if you have a universal compressor, it is it's own > > proof. It seems a little anti-intellectual to assert that all ideas > > shouldn't be able to stand on their own and that one is necessarily > > better than another. I would just like to have a honest scientific > > discussion, the way comp.compression was intended to operate.
> > As a matter of fact, I've almost finished the coding. It went fast as > > a result of using high-level string and fractal packages, which means > > there will be some need for optimization later. If anyone has some > > short test cases for me to run please send them my way.
> Just use any good source of random data?
Better, use a completely predictable source of sets of predictable data.
> Otherwise, you could just try using data generated from C++, via > rand(). Make sure you seed the random number with the current time, > the time (in as high resolution as you can acheive, microseconds > usually), is almost always a good source of randomness. This explains > how: http://www.daniweb.com/forums/thread1769.html
If it's a truncated MC/LC, then it's pretty predictable. If it's not truncated, it's utterly predictable.
Phil -- Dear aunt, let's set so double the killer delete select all. -- Microsoft voice recognition live demonstration
> collectio...@googlemail.com writes: > > On Aug 12, 9:42 pm, Ashley Labowitz <sporecoun...@gmail.com> wrote: > > > Thanks guys, I do appreciate that point of view. But the fact of the > > > matter is that if you have a universal compressor, it is it's own > > > proof. It seems a little anti-intellectual to assert that all ideas > > > shouldn't be able to stand on their own and that one is necessarily > > > better than another. I would just like to have a honest scientific > > > discussion, the way comp.compression was intended to operate.
> > > As a matter of fact, I've almost finished the coding. It went fast as > > > a result of using high-level string and fractal packages, which means > > > there will be some need for optimization later. If anyone has some > > > short test cases for me to run please send them my way.
> > Just use any good source of random data?
> Better, use a completely predictable source of sets of predictable data.
The program extracts approximate roots of the high-degree fractal polynomials using Newton's method (quadratic convergence) and saves off just enough digits.
For files with structured data, there is generally a way to "align" the bytes so that they match up with edges of a fractal, particularly so when many degrees of freedom are permitted. Consider the following class of polynomials:
z^n + a cos(z)^n + b cos(sin(z)) + c
When iterated, this takes on periodic and chaotic orbits over the complex plane, and by basing the fractal edges on whether the point diverges, we can describe a region with unlimited self-similarity on progressively smaller scales. That is basically how Mandelbrot first discovered his set, but he didn't discuss this particular polynomial.
For instance, given a fragment of data, say 74 6f 20 62 65 20 6f 72 20 6e 6f 74 20 74 6f 20 62 65 fractals identify the small-scale patterns (such as repeated 20's) and large-scale patterns (the byte spectrum, generating grammar and common stems). For these reason, it may outperform gzip or related simple- minded compressors.
By contrast, for random data it is almost as simple. We use knapsacking to optimizing the placement of fractal curve alignments to fully cover the data, taking advantage of "found order", or the correlations that can be discovered. Any random file will contain many instances of the same byte for example. This by itself is not enough however. If it was, we could simply sort the bytes in the file, run-length encode it, and send the arrangement of the bytes separately. However, it is the arranging that almost all the information is contained!
Instead, we use Taylor series from calculus to manage the arrangement complexity. To illustrate an example, I'll use a random 16 bytes:
$ python -c "f = open('/dev/random', 'r'); print f.read(36)" | hexdump -C 00000000 fa 41 c2 72 55 8d c8 27 b1 4a ac a2 13 68 37 d9 |.A.rU..'.J...h7.| 00000010 d5 60 de af d8 85 9c d9 3c 7e 1e ca c1 59 c5 cd |.`......<~...Y..| 00000020 dd 1e 3d f9 0a |..=..|
In effect, the fractals treat this as floating point numbers and use approximations from the fractal dimension and polynomial roots (of which there are usually an infinite number). I can use Euler's theorem to find these quickly, e.g.,
pi^2 - pi^1.7 + cos(sin(pi/1.7)) = 3.44071322 (variant of polynomial above) e^2 - e^1.7 + cos(sin(e/1.7)) = 2.45574537 and "fa 41 c2 72" from the above is 4198613618 = 2 * 179 * 11727971. Notice the relationship between these 3 primes the sequence of polynomial values taken as follows: 1464905053 = 18959 * 77267 1756147005 = 3 * 3 * 5 * 17 * 907 * 2531 1815095743 = 359 * 5055977 417443012 = 2 * 2 * 7 * 227 * 65677 Factoring could be a problem, but there are fast methods for this. I'm using an EC method now (yay for libraries doing all the work for you!). Right now, I'm not sure how to encode prime numbers, but even if I have to add one, then factor, this extra flag doesn't offset the compression savings.
In this case, the 36 bytes above compress down to d1 1f 4e 99 4b 3d db 22 0e c3 b0 6d ae 4e e8 cc 23 a4 81 17 10 f5 c8 5f 72 ea b3 c7 07 76 f2 f5 which is only 32 bytes, and it contains all the information needed to reconstruct the original 36.
Thanks for the idea, collectio...@googlemail.com, but I was already using /dev/random for random data. I just need some sort of corpus for structured data. I found the Calgary Corpus, but I would like something smaller to make better examples in this thread (and because the algorithm is slow for big file :(, not as bad as PAQAR, but still).
Wow, there were a lot of messages while I writing up that Mathematical post.
All I will say is that you folks seem overly skeptical of what I would think are clear mathematical truths. It is clear to me now that comp.compression attracts inordinate numbers of that type (why is this anyway? Most Math and CS groups have no such issue). No hard feelings though, I'm the live and let live kind :) Patience for now, I'm running some test cases...
collectio...@googlemail.com writes: > On Aug 12, 10:27 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk> > wrote: > > Compress these 2 1-bit strings: > > 0 > > 1
> Sure.
> 0.5 > 0.2
> ?
> Did that work? :o)
Mathematically, yes. As "compression", as the man in the street would know it, no. And depending on your encoding, you do realise that you've "compressed" '1' to an infinite string! (0.001100110011... )
Phil -- Dear aunt, let's set so double the killer delete select all. -- Microsoft voice recognition live demonstration
> collectio...@googlemail.com writes: > > On Aug 12, 10:27 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk> > > wrote: > > > Compress these 2 1-bit strings: > > > 0 > > > 1
> > Sure.
> > 0.5 > > 0.2
> > ?
> > Did that work? :o)
> Mathematically, yes. As "compression", as the man in the > street would know it, no. And depending on your encoding, > you do realise that you've "compressed" '1' to an infinite > string! (0.001100110011... )
On Aug 11, 4:27 pm, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> The square has to > be at least 6 x 6 (that is, 36 bytes) before space is saved from all > files and probably at least 1K before 12% is seen.
"space is saved from all files" implies you can take your compressor's output and run it through the compressor again. Are you now also claiming that, through this method, you can compress any file down to 36 bytes?
> I'm looking for short cases because of the program's running time and > because I'm still debugging it.
I advise testing all values of a short range but with 36 bytes as minimum input lenght that's to much. Otherwise testing some 1024KB lenght random strings or many shorter lenght random test sets to be sure the compression algoritme is not tuned at some short random sequences. Don't forget to test decompressor as soon as possible to be sure decompressor has all data in all cases to get the original input back. Good luck :-)
Hi as the group knows me quite well ;) i think i should give my opinion!
On 12 Aug, 22:55, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> The program extracts approximate roots of the high-degree fractal > polynomials using Newton's method (quadratic convergence) and saves > off just enough digits.
the convergence of the fractals taylor series chosen will decide the number of significant coefficients, and how acurate they have to be stored. so a faster converging fractal would need less coefficients to represent the same acuracy, BUT you'd also have to show that there is no overhead in coefficient storage because more bits are required in the coefficient field.
> For files with structured data, there is generally a way to "align" > the bytes so that they match up with edges of a fractal, particularly > so when many degrees of freedom are permitted. Consider the following > class of polynomials:
define align, as i seem to think you are quantizing the complex plane into boxes, and placing one datum byte in each box, and you have not explained how the datum is constructed by the decompressor back from 'sample' points within the boxes.
> z^n + a cos(z)^n + b cos(sin(z)) + c
can not be bothered to show any convergence properties?
> When iterated, this takes on periodic and chaotic orbits over the > complex plane, and by basing the fractal edges on whether the point > diverges, we can describe a region with unlimited self-similarity on > progressively smaller scales. That is basically how Mandelbrot first > discovered his set, but he didn't discuss this particular polynomial.
how does this relate to a byte within a box on the complex plane?
> For instance, given a fragment of data, say > 74 6f 20 62 65 20 6f 72 20 6e 6f 74 20 74 6f 20 62 65 > fractals identify the small-scale patterns (such as repeated 20's) and > large-scale patterns (the byte spectrum, generating grammar and common > stems). For these reason, it may outperform gzip or related simple- > minded compressors.
maybe it could but so what? does it work on its own output?
> By contrast, for random data it is almost as simple. We use > knapsacking to optimizing the placement of fractal curve alignments to > fully cover the data, taking advantage of "found order", or the > correlations that can be discovered. Any random file will contain > many instances of the same byte for example. This by itself is not > enough however. If it was, we could simply sort the bytes in the > file, run-length encode it, and send the arrangement of the bytes > separately. However, it is the arranging that almost all the > information is contained!
shouldn't be any more complicated?! does the above paragraph imply you remove a fractal, and calculate a set of residuals, so another fratal can be found? an why do the roots have to be high order, is this due to convergence speed requirements of a taylor series? if you factor a file the sorting destroys no information, but as bytes are not commutative within the file then this RLE process would fail.
> Instead, we use Taylor series from calculus to manage the arrangement > complexity. To illustrate an example, I'll use a random 16 bytes:
> $ python -c "f = open('/dev/random', 'r'); print f.read(36)" | hexdump > -C > 00000000 fa 41 c2 72 55 8d c8 27 b1 4a ac a2 13 68 37 d9 > |.A.rU..'.J...h7.| > 00000010 d5 60 de af d8 85 9c d9 3c 7e 1e ca c1 59 c5 cd > |.`......<~...Y..| > 00000020 dd 1e 3d f9 0a |..=..|
> In effect, the fractals treat this as floating point numbers and use > approximations from the fractal dimension and polynomial roots (of > which there are usually an infinite number). I can use Euler's > theorem to find these quickly, e.g.,
> pi^2 - pi^1.7 + cos(sin(pi/1.7)) = 3.44071322 (variant of polynomial > above) > e^2 - e^1.7 + cos(sin(e/1.7)) = 2.45574537 > and "fa 41 c2 72" from the above is 4198613618 = 2 * 179 * 11727971. > Notice the relationship between these 3 primes the sequence of > polynomial values taken as follows: > 1464905053 = 18959 * 77267 > 1756147005 = 3 * 3 * 5 * 17 * 907 * 2531 > 1815095743 = 359 * 5055977 > 417443012 = 2 * 2 * 7 * 227 * 65677
whats this got to do with roots?
> Factoring could be a problem, but there are fast methods for this. > I'm using an EC method now (yay for libraries doing all the work for > you!). Right now, I'm not sure how to encode prime numbers, but even > if I have to add one, then factor, this extra flag doesn't offset the > compression savings.
try shanks or pollard rho or lenstras factor methods. or get a quantum computer :).
> In this case, the 36 bytes above compress down to > d1 1f 4e 99 4b 3d db 22 0e c3 b0 6d ae 4e e8 cc 23 a4 81 17 10 f5 c8 > 5f 72 ea b3 c7 07 76 f2 f5 > which is only 32 bytes, and it contains all the information needed to > reconstruct the original 36.
are you 100% sure?
looking at your post, the only things which relate to anything i have thought about before are the factorization to introduce commutivity in a data stream, the 2x+1 to obtain another factor set from a prime, and residual calculations as in 1 bit dacs which at four times oversampling (4 bits) can be made to output 16 bits resolution.
so how is the number in the byte box on the complex plane calculated?
describe the decopressor as this involves no patterm matching, just pattern construction.
I've taken the liberty of compressing some small files as a demonstration of the algorithm. I'm still writing up a formal document describing the method, but please feel free to "reverse engineer" it from the samples. More to come on the algorithm.
Here, I've used as test cases the du man page (chosen because it is short), 500 random bytes, 500 null bytes, a JPG image, and the text of the Magna Carta (translated by Project Gutenberg). Here are the results and the initial bytes of each compressed file.
du man page 66% savings 91 79 4a 1b d3 05 7e f3 ef 96 13 5d 0b d3 69 8a d0 cb 8f 05 0c ac 63 a5 07 ec 3e a2 76 5d 02 4c ca 47 ff d3 2a 6c 0d 91 1a 83 5d 35 ac 3f 80 11 ab bb 9d 68 cf 11 95 5c 0a 5b 1b 78 9f 6f cc 61 1b 3a 70 51 08 f7 13 91 c3 e8 33 07 01 b9 cb b9 5d 16 84 6f 97 6d 05 08 de 04 cf 53 16 8b 7d 31 de e8 3b 0c 2b c1 c3 ed 6a 6f 69 37 a4 b1 91 ca 35 a1 99 dc e6 27 c0 8c f8 c3 fd b7 fe e3 2f 67 5c 5d 11 23 42 85 80 71 01 1d 65 62 44 58 ba 81
500 random bytes 14% savings
40 5a 27 90 04 f1 17 9e 02 84 ce 6e 6f 61 46 f0 95 ff d7 1b a3 cf 42 1b a8 15 05 eb ad 29 dc 93 ce 68 e3 8d 8c e9 38 73 67 3d 8e aa 37 9b 1e 8e 22 93 74 28 c2 95 a2 c6 56 8a f0 00 e0 a4 27 53 49 4d f9 43 f4 02 f9 15 44 b9 5e f0 60 b6 fc 98 8e fa 82 e2 72 18 88 f3 17 a0 30 51 ec 29 37 f2
It seems fitting that the Magna Carta should have some place in a compression breakthrough of this nature!
However, in the interests of full disclosure (as they say on Bugtraq), I should point out that I'm still shaking the bugs out of the decompressor: For some reason the files come out garbled. I've been going a little bit "wild" with this compressor on all kinds of files and I'm really pleased with the results, I just haven't take the time to properly debug the decompressor (I needed the compressor asap, expanding can wait).
On Aug 16, 1:21 pm, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> I should point out that I'm still shaking the bugs out of the > decompressor: For some reason the files come out garbled.
Ah yes, that pesky little detail.
It is as this point that we will probably never hear from you again. Most people are too embarrassed to come back here and report that there was a flaw in their thinking.
Well, Jim, I don't know "most people" are about, but in this case I really would like the decompressor. What motivated all this was a lack of space on my Hard Drive where none of the off-the-shelf compressors could save enough space. The fractal compressor I wrote was perfect (and I could prove its correctness too). I was able to squeeze down several Gigabytes of data to fix my space issues for now.
The files in question are a lot of personal correspondance, research, and coursework archives. It's nothing terribly critical, but still it's stuff I've saved for years and which I'd really like to get back : (. The decompressor must be out there, it *has* to be. I don't know what these other people you're talking about did, but no one will try harder than I will to write it.
I'm very optimistic (I still have all the compressed files sitting right here, the data is in them in some form), but I'll keep you guys posted no matter what. You deserve no less, I suppose.
> > It's nothing terribly critical, but still > > it's stuff I've saved for years and which I'd really like to get back
> You deleted the originals without testing a working decompresser > first?
> (I'd write more, but I lack the wit and eloquence necessary to make it > resonate.)
It's not so silly when you consider that there is no need for a decompressor just yet: It's not stuff I look at every day. Besides, if it does happen that I need to retrieve some files soon, before the decompressor is finished, I can just finish it then (extra motivation). I mean, what do people normally do in this situation?
Also, I wouldn't say I deleted the files per se, the compression process simply replaces them with smaller versions. I'll add that the savings were much larger than I could get with bzip2 (2.2ish GB -> 400ish MB).
On 16 aug, 22:28, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> I was able to > squeeze down several Gigabytes of data to fix my space issues for now.
First you want little test files then compress GB's.... Then delete originals without building and testing a decompressor to save HD space while Live hotmail offer 5GB free en GMail 2.8GB free disk space...
On Thu, 16 Aug 2007 15:16:04 -0700, in comp.compression, Ashley Labowitz
<sporecoun...@gmail.com> wrote: >It's not so silly when you consider that there is no need for a >decompressor just yet: It's not stuff I look at every day. Besides, >if it does happen that I need to retrieve some files soon, before the >decompressor is finished, I can just finish it then (extra
Heh. I don't think you've really thought this all the way through.
>motivation). I mean, what do people normally do in this situation?
If your data is worth keeping? Buy more hard drives. Seriously.
>Also, I wouldn't say I deleted the files per se, the compression >process simply replaces them with smaller versions.
Thank you. That's the best laugh I've had all day.
> > > It's nothing terribly critical, but still > > > it's stuff I've saved for years and which I'd really like to get back
> > You deleted the originals without testing a working decompresser > > first?
> > (I'd write more, but I lack the wit and eloquence necessary to make it > > resonate.)
> It's not so silly when you consider that there is no need for a > decompressor just yet: It's not stuff I look at every day. Besides, > if it does happen that I need to retrieve some files soon, before the > decompressor is finished, I can just finish it then (extra > motivation). I mean, what do people normally do in this situation?
> Also, I wouldn't say I deleted the files per se, the compression > process simply replaces them with smaller versions. I'll add that the > savings were much larger than I could get with bzip2 (2.2ish GB -> > 400ish MB).
I have a hard time figuring out if Ashley is serious or not.
On Aug 16, 5:16 pm, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> I mean, what do people normally do in this situation?
They test their theories with a working decompresser before deleting only copies of files.
I'll just break the news now to save you the frustration later: Your theory was flawed, and most if not all of your compressed data lacks the necessary information to be reconstructed losslessly.
Your goal was to save space on your hard drive; you have achieved that goal twofold, because now you can simply delete the "compressed" files. They are useless.
Well this is frustrating. Everytime I think I got the bugs out, the files still come out wrong. The bug affects even the simple test files I wrote about in an earlier post (even the 500 null byte file comes back with lots of non-null bytes). I've been coding nonstop all week, so I'm going to take some time to empty my mind and get a fresh start.
collectio...@googlemail.com, rest assured that it seems very serious from my PoV.
Jim, I know you're just trying to help, but my data wasn't random. As much as I flatter myself that my data was valuable, it had plenty of redundancy so it should be possible to decode, even if we grant your guys' pigeonhole argument. In addition, I was able to prove the compressor worked: that it makes all files at least 6% smaller, with an average of twice that.
On Aug 17, 11:54 am, Ashley Labowitz <sporecoun...@gmail.com> wrote:
> In addition, I was able to prove the > compressor worked: that it makes all files at least 6% smaller, with > an average of twice that.
You did not prove your compressor worked because you never proved your DECOMPRESSER worked. Your compressor output data that was smaller than the original -- but without a working decompresser, you have no idea if the compressed data was garbled during creation.
The above has nothing to do with the merits of your method, flawed or not. It is just basic testing theory.
Ashley Labowitz <sporecoun...@gmail.com> writes: > Well this is frustrating. Everytime I think I got the bugs out, the > files still come out wrong. The bug affects even the simple test > files I wrote about in an earlier post (even the 500 null byte file > comes back with lots of non-null bytes). I've been coding nonstop all > week, so I'm going to take some time to empty my mind and get a fresh > start.
> collectio...@googlemail.com, rest assured that it seems very serious > from my PoV.
> Jim, I know you're just trying to help, but my data wasn't random. As > much as I flatter myself that my data was valuable, it had plenty of > redundancy so it should be possible to decode, even if we grant your > guys' pigeonhole argument. In addition, I was able to prove the > compressor worked: that it makes all files at least 6% smaller, with > an average of twice that.
That doesn't prove that the compressor works in any meaningful sense.
I can write a compressor that reduces the size of all input files by 50%. It just discards every other byte of the file. But it's not a working compressor unless I can retrieve the original file, unchanged.
It's very likely that you've done a somewhat more sophisticated version of this, that your input files are irretrievably lost. The *only* way to prove otherwise would have been to successfully decompress the compressed files, recreating the original files without error. By discarding your original files before you demonstrated an ability to retrieve them, you've probably made a terrible mistake. (If you're very lucky, you may be able to retrieve *some* of your data, but that's likely to be more difficult than writing a correctly working compressor/decompressor pair would have been.)
Next time, invest in a bigger hard drive or a DVD burner.
-- Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst> San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst> "We must do something. This is something. Therefore, we must do this." -- Antony Jay and Jonathan Lynn, "Yes Minister"