7 views

Skip to first unread message

Nov 9, 2006, 10:43:29 AM11/9/06

to

I'd like to know how the fact that data is sorted affects its

"information" content, from a theoretical point of view.

>From what I remember from university, the string "abcdef" contains less

information than "cdfeab" because it can be compressed in less bits,

using for example a single "delta compression" algo.

Is it true ? Then sorting destroys information ?

Thanks

"information" content, from a theoretical point of view.

>From what I remember from university, the string "abcdef" contains less

information than "cdfeab" because it can be compressed in less bits,

using for example a single "delta compression" algo.

Is it true ? Then sorting destroys information ?

Thanks

Nov 9, 2006, 11:20:37 AM11/9/06

to

Goulu writes:

"abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then

a a affects algo. be because bits, can compressed compression"

contains content, data destroys example fact for from from how in

information information is it it its know less less like of point

remember single sorted sorting string than that the the theoretical to

true university, using view. what

Nov 9, 2006, 11:32:27 AM11/9/06

to

On 09 Nov 2006 18:20:37 +0200, Jussi Piitulainen

<jpii...@ling.helsinki.fi> wrote:

<jpii...@ling.helsinki.fi> wrote:

Less information?...

A.L.

Nov 9, 2006, 11:34:11 AM11/9/06

to

"Goulu" <goo...@goulu.net> writes:

> I'd like to know how the fact that data is sorted affects its

> "information" content, from a theoretical point of view.

>>From what I remember from university, the string "abcdef" contains less

> information than "cdfeab" because it can be compressed in less bits,

No. They can both be compressed to one bit. If there are two

messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the

second as 1.

Do you see where this is going?

(You are right about the general link between information content

and compressability but you have to be very careful about terms and

definitions if you want to avoid confusion.)

--

Ben.

Nov 9, 2006, 2:54:55 PM11/9/06

to

On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:

> "Goulu" <goo...@goulu.net> writes:

>

>> I'd like to know how the fact that data is sorted affects its

>> "information" content, from a theoretical point of view.

>>>From what I remember from university, the string "abcdef" contains less

>> information than "cdfeab" because it can be compressed in less bits,

>

> No. They can both be compressed to one bit. If there are two

> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the

> second as 1.

> ...

> (You are right about the general link between information content

> and compressability but you have to be very careful about terms and

> definitions if you want to avoid confusion.)

I agree with Ben: you have to be very careful with definitions to

avoid confusion.

Given that, in general (that is, for most definitions, in my opinion),

a sorted string would be considered to have less information than a

"random" string. Here are some reasons I feel justified in making

such a generalization:

* For a given length, L, and number of characters, n, there are n^L

strings, but there are, uh, a lot fewer sorted strings. (Its about

n^L/L! but I haven't finished my combinatorics book, yet) So it

takes fewer bits to indicate a string, given that it is sorted, than

to indicate a string for which no order is know.

* Sort algorithms cannot be reversed. "abc" might be the output of

sorting "abc" or "acb" or "cba" or ... Some information (the

original order) has been thrown away.

* A random shuffling algorithm needs a source of random bits.

-paul-

--

Paul E. Black (p.b...@acm.org)

Nov 10, 2006, 1:35:22 AM11/10/06

to

Paul Black wrote:

> * Sort algorithms cannot be reversed. "abc" might be the output of

> sorting "abc" or "acb" or "cba" or ... Some information (the

> original order) has been thrown away.

> * Sort algorithms cannot be reversed. "abc" might be the output of

> sorting "abc" or "acb" or "cba" or ... Some information (the

> original order) has been thrown away.

Thanks. That's exactely what I was looking for. I was aware my question

lacked some strict definition of "information", it was precisely my

goal to obtain some clearer explanation.

Nov 10, 2006, 9:12:23 AM11/10/06

to p.b...@acm.org

"Paul E. Black" <p.b...@acm.org> writes:

> On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:

>

>> "Goulu" <goo...@goulu.net> writes:

>>

>>> I'd like to know how the fact that data is sorted affects its

>>> "information" content, from a theoretical point of view.

>>>>From what I remember from university, the string "abcdef" contains less

>>> information than "cdfeab" because it can be compressed in less bits,

>>

>> No. They can both be compressed to one bit. If there are two

>> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the

>> second as 1.

>> ...

>> (You are right about the general link between information content

>> and compressability but you have to be very careful about terms and

>> definitions if you want to avoid confusion.)

>

> I agree with Ben: you have to be very careful with definitions to

> avoid confusion.

>

> Given that, in general (that is, for most definitions, in my opinion),

> a sorted string would be considered to have less information than a

> "random" string.

I am sorry if I sounded like a pedant and I apologise in advance

because I am going to sound like one again, but you can't, in a any

rigorous way, talk about information in "a" string. From what you

later say, I think you are talking about information entropy -- which,

roughly speaking, measures the "uncertainty" of a set of events. See

later comments.

BTW, if you are talking about Kolmogorov complexity as a measure of

information then you *can* talk about the "information" in a particular

string, but then sorted strings are much more harder to analyse

because the sorting order (which is quit arbitrary) is part of the

complexity measure as well.

> Here are some reasons I feel justified in making

> such a generalization:

>

> * For a given length, L, and number of characters, n, there are n^L

> strings, but there are, uh, a lot fewer sorted strings. (Its about

> n^L/L! but I haven't finished my combinatorics book, yet) So it

> takes fewer bits to indicate a string, given that it is sorted, than

> to indicate a string for which no order is know.

I don't know the formula either but for n=2 there are L+1 sorted

strings of length L so n^L/L! is way off (for L > 2). Anyway, I think

it is a mistake to focus on the sorting. Smaller sets have, in lots of

senses, less information than large ones -- they need fewer samplings

to "know" them (the information entropy meaning of information), they

need more complex descriptions (the Kolmogorov complexity meaning) and

they carry more information if used a symbols in a communication

channel (the Shannon theory meaning of information).

> * Sort algorithms cannot be reversed. "abc" might be the output of

> sorting "abc" or "acb" or "cba" or ... Some information (the

> original order) has been thrown away.

Yes. Sorting can't be reversed because the set of sorted strings is

smaller. Focusing on the sorting aspect can make it seem special

because we recognise the pattern, but as far as information theory is

concerned, any set that is restricted in some way has a pattern even

if we can't see it.

When I said to the OP, "Do you see where this is going?" (rather too

cryptically I now see) I was inviting him to count the sets.

BTW, I am not an expert in this field so there is a reasonable chance

that I am talking bollocks. That is just one of the pitfalls of

Usenet.

--

Ben.

Nov 11, 2006, 4:06:54 AM11/11/06

to

I consider it a special algorithms for special circumstance.

In my opinion the problem is de facto about statistics distribution,

not entropy.

In my opinion the problem is de facto about statistics distribution,

not entropy.

Nov 14, 2006, 9:46:58 AM11/14/06

to

Jussi Piitulainen wrote:

> "abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then

> a a affects algo. be because bits, can compressed compression"

> contains content, data destroys example fact for from from how in

> information information is it it its know less less like of point

> remember single sorted sorting string than that the the theoretical to

> true university, using view. what

I didn't understand at first. Though it was a joke. But in fact it is a

very subtle answer to my question :-) Thanks !

Nov 14, 2006, 12:48:18 PM11/14/06

to

look up burrows wheeler transform if information sorting and

compression are of interest.

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu