more "information" in a sorted list ? or not ?

7 views
Skip to first unread message

Goulu

unread,
Nov 9, 2006, 10:43:29 AM11/9/06
to
I'd like to know how the fact that data is sorted affects its
"information" content, from a theoretical point of view.
>From what I remember from university, the string "abcdef" contains less
information than "cdfeab" because it can be compressed in less bits,
using for example a single "delta compression" algo.
Is it true ? Then sorting destroys information ?
Thanks

Jussi Piitulainen

unread,
Nov 9, 2006, 11:20:37 AM11/9/06
to
Goulu writes:

"abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then
a a affects algo. be because bits, can compressed compression"
contains content, data destroys example fact for from from how in
information information is it it its know less less like of point
remember single sorted sorting string than that the the theoretical to
true university, using view. what

A.L.

unread,
Nov 9, 2006, 11:32:27 AM11/9/06
to
On 09 Nov 2006 18:20:37 +0200, Jussi Piitulainen
<jpii...@ling.helsinki.fi> wrote:

Less information?...

A.L.

Ben Bacarisse

unread,
Nov 9, 2006, 11:34:11 AM11/9/06
to
"Goulu" <goo...@goulu.net> writes:

> I'd like to know how the fact that data is sorted affects its
> "information" content, from a theoretical point of view.
>>From what I remember from university, the string "abcdef" contains less
> information than "cdfeab" because it can be compressed in less bits,

No. They can both be compressed to one bit. If there are two
messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
second as 1.

Do you see where this is going?

(You are right about the general link between information content
and compressability but you have to be very careful about terms and
definitions if you want to avoid confusion.)

--
Ben.

Paul E. Black

unread,
Nov 9, 2006, 2:54:55 PM11/9/06
to
On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:

> "Goulu" <goo...@goulu.net> writes:
>
>> I'd like to know how the fact that data is sorted affects its
>> "information" content, from a theoretical point of view.
>>>From what I remember from university, the string "abcdef" contains less
>> information than "cdfeab" because it can be compressed in less bits,
>
> No. They can both be compressed to one bit. If there are two
> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
> second as 1.

> ...


> (You are right about the general link between information content
> and compressability but you have to be very careful about terms and
> definitions if you want to avoid confusion.)

I agree with Ben: you have to be very careful with definitions to
avoid confusion.

Given that, in general (that is, for most definitions, in my opinion),
a sorted string would be considered to have less information than a
"random" string. Here are some reasons I feel justified in making
such a generalization:

* For a given length, L, and number of characters, n, there are n^L
strings, but there are, uh, a lot fewer sorted strings. (Its about
n^L/L! but I haven't finished my combinatorics book, yet) So it
takes fewer bits to indicate a string, given that it is sorted, than
to indicate a string for which no order is know.

* Sort algorithms cannot be reversed. "abc" might be the output of
sorting "abc" or "acb" or "cba" or ... Some information (the
original order) has been thrown away.

* A random shuffling algorithm needs a source of random bits.

-paul-
--
Paul E. Black (p.b...@acm.org)

Goulu

unread,
Nov 10, 2006, 1:35:22 AM11/10/06
to
Paul Black wrote:
> * Sort algorithms cannot be reversed. "abc" might be the output of
> sorting "abc" or "acb" or "cba" or ... Some information (the
> original order) has been thrown away.

Thanks. That's exactely what I was looking for. I was aware my question
lacked some strict definition of "information", it was precisely my
goal to obtain some clearer explanation.

Ben Bacarisse

unread,
Nov 10, 2006, 9:12:23 AM11/10/06
to p.b...@acm.org
"Paul E. Black" <p.b...@acm.org> writes:

> On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:
>
>> "Goulu" <goo...@goulu.net> writes:
>>
>>> I'd like to know how the fact that data is sorted affects its
>>> "information" content, from a theoretical point of view.
>>>>From what I remember from university, the string "abcdef" contains less
>>> information than "cdfeab" because it can be compressed in less bits,
>>
>> No. They can both be compressed to one bit. If there are two
>> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
>> second as 1.
>> ...
>> (You are right about the general link between information content
>> and compressability but you have to be very careful about terms and
>> definitions if you want to avoid confusion.)
>
> I agree with Ben: you have to be very careful with definitions to
> avoid confusion.
>
> Given that, in general (that is, for most definitions, in my opinion),
> a sorted string would be considered to have less information than a
> "random" string.

I am sorry if I sounded like a pedant and I apologise in advance
because I am going to sound like one again, but you can't, in a any
rigorous way, talk about information in "a" string. From what you
later say, I think you are talking about information entropy -- which,
roughly speaking, measures the "uncertainty" of a set of events. See
later comments.

BTW, if you are talking about Kolmogorov complexity as a measure of
information then you *can* talk about the "information" in a particular
string, but then sorted strings are much more harder to analyse
because the sorting order (which is quit arbitrary) is part of the
complexity measure as well.

> Here are some reasons I feel justified in making
> such a generalization:
>
> * For a given length, L, and number of characters, n, there are n^L
> strings, but there are, uh, a lot fewer sorted strings. (Its about
> n^L/L! but I haven't finished my combinatorics book, yet) So it
> takes fewer bits to indicate a string, given that it is sorted, than
> to indicate a string for which no order is know.

I don't know the formula either but for n=2 there are L+1 sorted
strings of length L so n^L/L! is way off (for L > 2). Anyway, I think
it is a mistake to focus on the sorting. Smaller sets have, in lots of
senses, less information than large ones -- they need fewer samplings
to "know" them (the information entropy meaning of information), they
need more complex descriptions (the Kolmogorov complexity meaning) and
they carry more information if used a symbols in a communication
channel (the Shannon theory meaning of information).

> * Sort algorithms cannot be reversed. "abc" might be the output of
> sorting "abc" or "acb" or "cba" or ... Some information (the
> original order) has been thrown away.

Yes. Sorting can't be reversed because the set of sorted strings is
smaller. Focusing on the sorting aspect can make it seem special
because we recognise the pattern, but as far as information theory is
concerned, any set that is restricted in some way has a pattern even
if we can't see it.

When I said to the OP, "Do you see where this is going?" (rather too
cryptically I now see) I was inviting him to count the sets.

BTW, I am not an expert in this field so there is a reasonable chance
that I am talking bollocks. That is just one of the pitfalls of
Usenet.

--
Ben.

kaby

unread,
Nov 11, 2006, 4:06:54 AM11/11/06
to
I consider it a special algorithms for special circumstance.
In my opinion the problem is de facto about statistics distribution,
not entropy.

Goulu

unread,
Nov 14, 2006, 9:46:58 AM11/14/06
to

Jussi Piitulainen wrote:
> "abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then
> a a affects algo. be because bits, can compressed compression"
> contains content, data destroys example fact for from from how in
> information information is it it its know less less like of point
> remember single sorted sorting string than that the the theoretical to
> true university, using view. what

I didn't understand at first. Though it was a joke. But in fact it is a
very subtle answer to my question :-) Thanks !

jacko

unread,
Nov 14, 2006, 12:48:18 PM11/14/06
to

look up burrows wheeler transform if information sorting and
compression are of interest.

Reply all
Reply to author
Forward
0 new messages