Google 網路論壇不再支援新的 Usenet 貼文或訂閱項目,但過往內容仍可供查看。

Re: more "information" in a sorted list ? or not ?

瀏覽次數:10 次
跳到第一則未讀訊息

Jussi Piitulainen

未讀,
2006年11月9日 上午11:20:372006/11/9
收件者:
Goulu writes:

> I'd like to know how the fact that data is sorted affects its
> "information" content, from a theoretical point of view. From what
> I remember from university, the string "abcdef" contains less
> information than "cdfeab" because it can be compressed in less bits,
> using for example a single "delta compression" algo. Is it true ?
> Then sorting destroys information ? Thanks

"abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then
a a affects algo. be because bits, can compressed compression"
contains content, data destroys example fact for from from how in
information information is it it its know less less like of point
remember single sorted sorting string than that the the theoretical to
true university, using view. what

Goulu

未讀,
2006年11月9日 上午10:43:292006/11/9
收件者:

A.L.

未讀,
2006年11月9日 上午11:32:272006/11/9
收件者:
On 09 Nov 2006 18:20:37 +0200, Jussi Piitulainen
<jpii...@ling.helsinki.fi> wrote:

Less information?...

A.L.

Ben Bacarisse

未讀,
2006年11月9日 上午11:34:112006/11/9
收件者:
"Goulu" <goo...@goulu.net> writes:

> I'd like to know how the fact that data is sorted affects its
> "information" content, from a theoretical point of view.
>>From what I remember from university, the string "abcdef" contains less
> information than "cdfeab" because it can be compressed in less bits,

No. They can both be compressed to one bit. If there are two
messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
second as 1.

Do you see where this is going?

(You are right about the general link between information content
and compressability but you have to be very careful about terms and
definitions if you want to avoid confusion.)

--
Ben.

Paul E. Black

未讀,
2006年11月9日 下午2:54:552006/11/9
收件者:
On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:

> "Goulu" <goo...@goulu.net> writes:
>
>> I'd like to know how the fact that data is sorted affects its
>> "information" content, from a theoretical point of view.
>>>From what I remember from university, the string "abcdef" contains less
>> information than "cdfeab" because it can be compressed in less bits,
>
> No. They can both be compressed to one bit. If there are two
> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
> second as 1.

> ...


> (You are right about the general link between information content
> and compressability but you have to be very careful about terms and
> definitions if you want to avoid confusion.)

I agree with Ben: you have to be very careful with definitions to
avoid confusion.

Given that, in general (that is, for most definitions, in my opinion),
a sorted string would be considered to have less information than a
"random" string. Here are some reasons I feel justified in making
such a generalization:

* For a given length, L, and number of characters, n, there are n^L
strings, but there are, uh, a lot fewer sorted strings. (Its about
n^L/L! but I haven't finished my combinatorics book, yet) So it
takes fewer bits to indicate a string, given that it is sorted, than
to indicate a string for which no order is know.

* Sort algorithms cannot be reversed. "abc" might be the output of
sorting "abc" or "acb" or "cba" or ... Some information (the
original order) has been thrown away.

* A random shuffling algorithm needs a source of random bits.

-paul-
--
Paul E. Black (p.b...@acm.org)

Goulu

未讀,
2006年11月10日 凌晨1:35:222006/11/10
收件者:
Paul Black wrote:
> * Sort algorithms cannot be reversed. "abc" might be the output of
> sorting "abc" or "acb" or "cba" or ... Some information (the
> original order) has been thrown away.

Thanks. That's exactely what I was looking for. I was aware my question
lacked some strict definition of "information", it was precisely my
goal to obtain some clearer explanation.

Ben Bacarisse

未讀,
2006年11月10日 上午9:12:232006/11/10
收件者:p.b...@acm.org
"Paul E. Black" <p.b...@acm.org> writes:

> On Thursday 09 November 2006 11:34, Ben Bacarisse wrote:
>
>> "Goulu" <goo...@goulu.net> writes:
>>
>>> I'd like to know how the fact that data is sorted affects its
>>> "information" content, from a theoretical point of view.
>>>>From what I remember from university, the string "abcdef" contains less
>>> information than "cdfeab" because it can be compressed in less bits,
>>
>> No. They can both be compressed to one bit. If there are two
>> messages, "abcdef" and "cdfeab", the first can be encoded as 0 and the
>> second as 1.
>> ...
>> (You are right about the general link between information content
>> and compressability but you have to be very careful about terms and
>> definitions if you want to avoid confusion.)
>
> I agree with Ben: you have to be very careful with definitions to
> avoid confusion.
>
> Given that, in general (that is, for most definitions, in my opinion),
> a sorted string would be considered to have less information than a
> "random" string.

I am sorry if I sounded like a pedant and I apologise in advance
because I am going to sound like one again, but you can't, in a any
rigorous way, talk about information in "a" string. From what you
later say, I think you are talking about information entropy -- which,
roughly speaking, measures the "uncertainty" of a set of events. See
later comments.

BTW, if you are talking about Kolmogorov complexity as a measure of
information then you *can* talk about the "information" in a particular
string, but then sorted strings are much more harder to analyse
because the sorting order (which is quit arbitrary) is part of the
complexity measure as well.

> Here are some reasons I feel justified in making
> such a generalization:
>
> * For a given length, L, and number of characters, n, there are n^L
> strings, but there are, uh, a lot fewer sorted strings. (Its about
> n^L/L! but I haven't finished my combinatorics book, yet) So it
> takes fewer bits to indicate a string, given that it is sorted, than
> to indicate a string for which no order is know.

I don't know the formula either but for n=2 there are L+1 sorted
strings of length L so n^L/L! is way off (for L > 2). Anyway, I think
it is a mistake to focus on the sorting. Smaller sets have, in lots of
senses, less information than large ones -- they need fewer samplings
to "know" them (the information entropy meaning of information), they
need more complex descriptions (the Kolmogorov complexity meaning) and
they carry more information if used a symbols in a communication
channel (the Shannon theory meaning of information).

> * Sort algorithms cannot be reversed. "abc" might be the output of
> sorting "abc" or "acb" or "cba" or ... Some information (the
> original order) has been thrown away.

Yes. Sorting can't be reversed because the set of sorted strings is
smaller. Focusing on the sorting aspect can make it seem special
because we recognise the pattern, but as far as information theory is
concerned, any set that is restricted in some way has a pattern even
if we can't see it.

When I said to the OP, "Do you see where this is going?" (rather too
cryptically I now see) I was inviting him to count the sets.

BTW, I am not an expert in this field so there is a reasonable chance
that I am talking bollocks. That is just one of the pitfalls of
Usenet.

--
Ben.

kaby

未讀,
2006年11月11日 凌晨4:06:542006/11/11
收件者:
I consider it a special algorithms for special circumstance.
In my opinion the problem is de facto about statistics distribution,
not entropy.

Goulu

未讀,
2006年11月14日 上午9:46:582006/11/14
收件者:

Jussi Piitulainen wrote:
> "abcdef" "cdfeab" "delta "information" >From ? ? I I'd Is Thanks Then
> a a affects algo. be because bits, can compressed compression"
> contains content, data destroys example fact for from from how in
> information information is it it its know less less like of point
> remember single sorted sorting string than that the the theoretical to
> true university, using view. what

I didn't understand at first. Though it was a joke. But in fact it is a
very subtle answer to my question :-) Thanks !

jacko

未讀,
2006年11月14日 中午12:48:182006/11/14
收件者:

look up burrows wheeler transform if information sorting and
compression are of interest.

0 則新訊息