Good compression

sundaresh...@gmail.com
Jun 10, 2020, 10:03:10 AM

The web site is www.complimentaryware.unaux.com . I have an excellent method on
PDF, an unpublished research of my own but this newsgroup does not allow attachments.

Elhana
Jun 16, 2020, 9:11:30 AM

sundaresh...@gmail.com:

> The web site is www.complimentaryware.unaux.com . I have an excellent method on
> PDF, an unpublished research of my own but this newsgroup does not allow attachments.

Show us the code.

sundaresh...@gmail.com
Jun 16, 2020, 11:37:42 AM

I started coding but shelved it and put it in the background, since I am also simultaneously developing and coding other software. But why do you ask? I am quite certain there is nothing wrong with the method.

Helm
Jun 17, 2020, 5:58:47 PM

You can always attach it if you use a newsreader instead of the Google
Groups interface (which also puts your IP in the headers, by the way).
Don't do that, however, since the free news servers won't be able to
hold your post, which reduces its visibility and reach. Instead, upload
it to a website like WeTransfer and post the link in the newsgroup.

Good research though! Look forward to learning more.

Eli the Bearded
Jun 17, 2020, 6:02:07 PM

This group is littered with the dead claims of people who said they had
a better compression method but could never prove it with code.

Elijah
------
"compression" seems to attrack kooks as much as "perpetual motion"

sundaresh...@gmail.com
Jun 17, 2020, 11:59:27 PM

Well, thanks! One encouraging response. Here is the link:

https://drive.google.com/file/d/1w5sNV_YVM9XGymz8A0MT0V_cubxMPYCM/view?usp=sharing

If it does not work, let me know.

sundaresh...@gmail.com
Jun 18, 2020, 12:58:39 AM

The little (of the other stuff) I have coded works extremely well, just as I expected; in fact, in most cases it works surprisingly better. All I am saying is that the proof of the recipe is different from the proof of the pudding, a truth which is even more exacting with respect to recipes and puddings in the world of computing, or, as you may call it, the virtual world. Speaking only for myself, and in all candor, I will admit that, having tried my hand at cooking, I am a much better cook today than I was when I started off, both at my kitchen table and at my computer table.

Scott
Jun 18, 2020, 12:13:44 PM

On Wed, 17 Jun 2020 22:02:05 +0000 (UTC), Eli the Bearded
<*@eli.users.panix.com> wrote:

>In comp.compression, <sundaresh...@gmail.com> wrote:
>> I started coding but shelved it and put it in the background since l
>> am also simultaneously developing and coding other s/w as well. But
>> why do you ask ? I am quite certain there is nothing wrong with the
>> method.
>
>This group is littered with the dead claims of people who said they had
>a better compression method but could never prove it with code.

Oh, it's not that bad. Recursive compression is actually pretty easy
to do. The decompressor turns out to be a good deal trickier, though.

BGB
Jun 18, 2020, 3:39:13 PM

Yep.


Luckily at least, data compression is more objectively measurable.


Things like audio and video compression also involve a lot of subjective
tradeoffs, such as how the artifacts of one codec look or sound relative
to another, and I have noticed that my perceptions don't really seem to
agree with other people's.

E.g., I accept a loss of dynamic range and an increase in "sharpness" a
lot more readily than a loss of detail and blurring, whereas many other
people would apparently much rather have a blurry mess than sharp,
high-contrast edges.

Similarly for audio: I prefer codecs which decay into a "metallic" sound
over ones which introduce a bunch of warbling and whistling at lower
bitrates.

E.g., MP3 below ~60 kbps sounds a bit unpleasant with all the whistling
and warbling, whereas 11 kHz ADPCM still sounds pretty reasonable in
comparison.

Partly as a result, a lot of my own codecs use color-cell and VQ-style
techniques, along with ADPCM-style audio codecs, which apparently a lot
of other people perceive as looking and sounding awful.



But, yeah, as for data compression, I don't have all that much that is notable.

I had a few compressors which did pretty well (vs., e.g., LZ4 and Zstd)
at being fast on my past computers (K10 and FX), but after switching to
a Ryzen, both LZ4 and Zstd got a pretty significant speed-up.

Getting high compression ratios is, granted, a little harder...


Recently, I managed to pull off something which gets slightly better
compression at decode speeds comparable to LZ4 (on my Ryzen), and which
is faster to decode than LZ4 on my own ISA (BJX2).

Decode speeds on a 3.7 GHz Ryzen are in both cases a little over
2 GB/sec. I haven't yet gotten around to running tests on ARM.


Compression for many files seems to be part-way between LZ4 and Deflate.
Note that, like LZ4, it is byte oriented with no entropy coding.

<--

BtRP2 (Transposed, LE):
* dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
* dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191)
* dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071)
* rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
* * rrr01111 (Long Match)
* rr011111 (r=1..3 bytes, 0=EOB)
* rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)
** d: Distance
** l: Match Length
** r: Literal Length

Values are encoded in little-endian order, with tag bits located in the
LSB. Bits will be contiguous within the value, with shift-and-mask being
used to extract individual elements.

Long Match will encode length and distance using variable-length
encodings directly following the initial tag byte.

Length VLN:
lllllll0, 4.. 131
llllllll-llllll01, 132..16383

Distance VLN:
dddddddd-ddddddd0, 32K (0..32767)
dddddddd-dddddddd-dddddd01, 4M

-->
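
As a rough illustration (this is my own sketch with my own names, not
BGB's actual decoder), the shortest token form "dddddddd-dlllrrr0" can
be unpacked with a 16-bit little-endian load followed by shift-and-mask,
as described above:

  /* Sketch only: unpack the 16-bit short token "dddddddd-dlllrrr0",
     tag bit in the LSB. */
  #include <stdint.h>

  static uint16_t load_u16le(const uint8_t *p)
  {
      return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
  }

  /* Returns 1 and fills the fields if *p starts a short token,
     0 if the low bit is set (one of the longer forms). */
  static int decode_short_token(const uint8_t *p,
      int *lit_len, int *match_len, int *dist)
  {
      uint16_t v = load_u16le(p);
      if (v & 1)
          return 0;
      *lit_len   = (v >> 1) & 7;        /* r: 0..7 literal bytes   */
      *match_len = ((v >> 4) & 7) + 3;  /* l: 3..10 match bytes    */
      *dist      = (v >> 7) & 511;      /* d: 0..511 back distance */
      return 1;
  }

The longer forms would presumably be handled the same way: count how
many of the low bits are set, load 24 or 32 bits, and shift out the
corresponding fields.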


The compression could be improved a little more in the Long Match case
via a few more conjoined cases, but the gain in compression would be
pretty small relative to the amount of code added (as the first few
cases handle the vast majority of LZ matches).

I had also left out adding 1 to the distance, mostly because in my tests
this had very little effect on compression, but a somewhat more
noticeable effect on decode speed.

icep...@gmail.com
Jul 2, 2020, 9:38:23 AM

On Thursday, 18 June 2020 at 05:59:27 UTC+2, sundaresh...@gmail.com wrote:

> > > The web site is www.complimentaryware.unaux.com . I have an excellent method on
> > > PDF, an unpublished research of my own but this newsgroup does not
> > > allow attachments.
> >
> > You can always attach it if you use a newsreader instead of the Google
> > Groups interface (that also puts your IP in the headers by the way).
> > Don't do it however since all the free newsservers won't be able to
> > hold your post thus reducing the visibility and reach. Instead, upload
> > it to a website like WeTransfer and post the link in the newsgroup.
> >
> > Good research though! Look forward to learning more.
>
> Well thanks ! One encouraging response . Here is the link
>
> https://drive.google.com/file/d/1w5sNV_YVM9XGymz8A0MT0V_cubxMPYCM/view?usp=sharing
>
> If it does not work, let me know.

It doesn't work.

sundaresh...@gmail.com
Jul 2, 2020, 11:38:42 PM

Really? Have you tried coding it? If you have, can I see the code? You can do just what I have done. In the meantime I have made significant improvements to it; they should be visible at the link, and I will be coding this new version.

Földes László
Jul 3, 2020, 6:58:45 AM

This paper talks about a method with an example applied to a single code, but a single code has no entropy (which is what compression methods work on).

icep...@gmail.com
Jul 3, 2020, 9:28:11 AM

Please post the code you have then, and the results.

You are the one making the claim that a large number M can be written as a smaller number S by transforming it in one of several ways, which, as the pigeonhole principle shows, cannot work for all numbers from 0 to M, since they can't all fit in S.

You can find S numbers out of the M that do compress, but even specifying which set of S values out of the range 0..M you chose will cost bits.
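
(Spelling out the counting behind that, as a standard argument rather than anything taken from the paper: for n-bit inputs,)

\[
  \#\{\text{$n$-bit strings}\} = 2^{n},
  \qquad
  \#\{\text{strings shorter than $n$ bits}\} = \sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1,
\]

so a lossless scheme that shortened every n-bit input would have to map 2^n distinct inputs onto at most 2^n - 1 distinct outputs, forcing at least two inputs to share an output.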

Sundaresh Venugopal
Aug 19, 2020, 5:45:24 AM

I will see what I an do.

Sundaresh Venugopal
Aug 19, 2020, 6:23:21 AM

Sorry. Meant, I will see what I can do, but not to convince people like you.

icep...@gmail.com
Aug 19, 2020, 12:59:45 PM

> > > Please post the code you have then, and the results.
> > >
> > > You are the one making a claim that a large number M can be written as a smaller number S by transforming it in one of several ways, which the pigeonhole principle will show anyone will not work for all numbers from 0 to M, since they can't all fit in S.
> > >
> > > You can find S numbers out of M that do, but even selecting which set of Ses is out of the set 0->M will cost bits.
> > I will see what I an do.
> Sorry. Meant, I will see what I can do, but not to convince people like you.

I am not expecting you to do any work for me; I was suggesting you do the work for *you*, so you can see how it does, or does not, work for some numbers or for all numbers.

It is quite common for people to assume things like "I can find my wanted large number in the decimals of PI and refer to it by its position", which is true, but it is not usable for compression, since the pointer will be as long as, or longer than, the number when you take all possible combinations into account.

If you are only envisioning "I will look for the number 141592653589", then yes, you can find a very short index for it. If you instead look for the far shorter number "93774", it will not appear in the first 100000 digits.
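
(A back-of-the-envelope way to see why, my own heuristic rather than anything from the post: in a digit expansion that behaves like random digits, a specific k-digit string typically first appears somewhere around position 10^k, so the index you would have to store needs about

\[
  \log_{10}\!\left(10^{k}\right) = k \ \text{digits},
\]

i.e. roughly as many digits as the number you were hoping to compress away.)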

So what you need to do in order to create fame for yourself with a novel way of compressing data is 1) understand the pigeonhole principle (everyone in compression research must understand this one) and 2) make some kind of implementation of your idea to find out when it will *not* work.

Good compressors work on something like 1% of possible inputs; they just choose the 1% which is commonly found in files on computers. On the other 99% of data, which is uncommon, they will expand the input, but if you don't hand such data to your compressor, no one notices and all is fine.

So by making the implementation, you have a good chance of finding out what your "99%" are.

Sundaresh Venugopal
Aug 19, 2020, 8:08:54 PM

I am not the one making a false claim here. Do not mislead people.

Matthias Waldhauer
Aug 26, 2020, 3:51:01 AM

BGB wrote on 18th June 2020 at 21:39:13 UTC+2:
> Compression for many files seems to be part-way between LZ4 and Deflate.
> Note that, like LZ4, it is byte oriented with no entropy coding.
>
> <--
>
> BtRP2 (Transposed, LE):
> * dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
> * dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191)
> * dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071)
> * rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
> * * rrr01111 (Long Match)
> * rr011111 (r=1..3 bytes, 0=EOB)
> * rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)
> ** d: Distance
> ** l: Match Length
> ** r: Literal Length
>
> Values are encoded in little-endian order, with tag bits located in the
> LSB. Bits will be contiguous within the value, with shift-and-mask being
> used to extract individual elements.

Interesting tokens! You're using unary coding for the different tokens. In the last one (Long Raw) you might switch the 0 for another r bit, as a limit on code length (6 bits, 111111) would be enough to differentiate the last 2 cases.

I guess with some effort (PhD thesis), one could develop a method to formally describe all possible token variants via a grammar. And based on that, both the efficiency and the decoder complexity could be calculated and optimized.

M.

Janne Johansson
Aug 26, 2020, 6:07:10 AM

On Thursday, 20 August 2020 at 02:08:54 UTC+2, Sundaresh Venugopal wrote:
> On Wednesday, August 19, 2020 at 10:29:45 PM UTC+5:30, icep...@gmail.com wrote:
> > > > > Please post the code you have then, and the results.
> > > > >
> > > > > You are the one making a claim that a large number M can be written as a smaller number S by transforming it in one of several ways, which the pigeonhole principle will show anyone will not work for all numbers from 0 to M, since they can't all fit in S.

> > So by making the implementation, you have a good chance of finding out what your "99%" are.
> I am not the one making a false claim here. Do not mislead people.

False or not remains to be seen. You are making a claim though. We are waiting for some proof of it.

BGB
Aug 26, 2020, 3:29:13 PM

On 8/26/2020 2:50 AM, Matthias Waldhauer wrote:
> BGB wrote on 18th June 2020 at 21:39:13 UTC+2:
>> Compression for many files seems to be part-way between LZ4 and Deflate.
>> Note that, like LZ4, it is byte oriented with no entropy coding.
>>
>> <--
>>
>> BtRP2 (Transposed, LE):
>> * dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
>> * dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191)
>> * dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071)
>> * rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
>> * * rrr01111 (Long Match)
>> * rr011111 (r=1..3 bytes, 0=EOB)
>> * rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)
>> ** d: Distance
>> ** l: Match Length
>> ** r: Literal Length
>>
>> Values are encoded in little-endian order, with tag bits located in the
>> LSB. Bits will be contiguous within the value, with shift-and-mask being
>> used to extract individual elements.
>
> Interesting tokens! You're using unary coding for the different tokens. In the last one (Long Raw) you might switch the 0 for another r bit, as a limit on code length (6 bits, 111111) would be enough to differentiate the last 2 cases.
>

Leaving it as 0 leaves room for possible further expansion; also, the
"Long Raw" case is infrequent enough (and big enough) that 1 more bit
won't make a big difference (worst-case overhead is ~0.05%).

> I guess with some effort (PhD thesis), one could develop a method to formally describe all possible token variants via a grammar. And based on that, both the efficiency and the decoder complexity could be calculated and optimized.
>

Dunno.

As noted, compression ratios tend to be slightly better than LZ4 in this
case (including when encoding with the same match algorithm and the same
match length/distance settings).

This seems to be more so for "general data compression"; however, for
compressing PE/COFF binaries for my ISA, the compression ratio seems to
be roughly break-even with LZ4.


An older "fast" LZ encoder, which did everything in terms of 32-bit
DWORD values, also did reasonably well for binaries, but worse for most
other data compression.

A likely factor, though, is that the ISA is mostly a mix of 16- and
32-bit instructions, and sequences of 32-bit instructions also tend to
be 32-bit aligned (though this is not enforced, as doing so in general
has an adverse effect on code density).
