
Complex Specified Information - Pitman Formula


Seanpit

Jul 23, 2007, 3:53:39 PM
After a bit of discussion and revision of my initial effort, the
following is my formula for calculating my version of complex
specified information (CSI):

For X = 2:

CSI: X^n - (n! / (n-hd)! hd!)

For X > 2:

CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)

X = number of possible characters per position
n = size of the sequence
hd = Hamming Distance
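
A minimal Python sketch of the X = 2 formula exactly as written above, assuming hd is the Hamming distance to some chosen reference string of the same length. The function name `csi_binary` is only illustrative and does not appear in the original post.

```python
from math import comb

def csi_binary(n: int, hd: int) -> int:
    """X^n - n! / ((n - hd)! * hd!) for X = 2, as posted above."""
    if not 0 <= hd <= n:
        raise ValueError("hd must lie between 0 and n")
    # comb(n, hd) counts the binary strings at Hamming distance hd
    # from a fixed reference string of length n.
    return 2 ** n - comb(n, hd)

# Tabulate the value for every possible hd when n = 8.
for hd in range(9):
    print(hd, csi_binary(8, hd))
```

Under this reading the value is largest at hd = 0 and hd = n and smallest at hd = n/2, because the subtracted binomial coefficient peaks in the middle.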

And yes, I have tried it out and it does seem to work quite well
regardless of string size or the number of potential characters per
position. Hopefully no further revisions will be necessary, but
that's why I'm presenting it here to see if anyone can find anything
wrong with the formula.

As it currently stands, the greater the CSI number, the better the
odds of non-random production. This is especially true when compared
to reference sequences with no repeating patterns, like pi, regardless
of the assumed distribution of the origin of the symbols in the test
sequence.

I also want to note, one more time, that the detecting of high CSI, by
itself, does not equal the detection of ET or ID. That sort of
hypothesis requires additional knowledge concerning the material in
which the pattern is carried as well as how this material interacts
with various deliberate and non-deliberate forces of nature.

Sean Pitman
www.DetectingDesign.com

snex

Jul 23, 2007, 4:14:45 PM
On Jul 23, 2:53 pm, Seanpit <seanpitnos...@naturalselection.

when are you going to apply this function to the bit strings in my
"CSI Challenge" thread?

hersheyh

Jul 23, 2007, 4:25:41 PM
On Jul 23, 3:53 pm, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> After a bit of discussion and revision of my initial effort, the
> following is my formula for calculating my version of complex
> specified information (CSI):

What, other than generating numbers that you like, are these formulas
supposed to represent? Specifically, how do you *determine* the hd
number? Arbitrary choice? Choosing the value n/2 (the average
sequence difference between all the sites being the same and all the
sites being different)? And isn't the *minimum* difference between a
'reference' sequence and a 'target' sequence still going to be hd=1 no
matter what you say and no matter how large n is?

Mark VandeWettering

Jul 23, 2007, 4:28:05 PM

Well, the primary problem with this formula remains the same as the
problem with its predecessor: it only works in comparison to a target,
which can only be viewed as "begging the question" that it was supposed
to answer.

One need not really go beyond that, but here are some additional things
to ponder:

1. In what sense does this "work well"? Can you provide some examples
which demonstrate it working?
2. How can you show that "the greater the CSI number, the better the
odds of non-random production"? Presumably that requires some mapping
from CSI to probabilities, which seems to be absent here. Absent that
derivation, how can you be sure that the mapping of values of CSI to
probability is monotonically increasing?
3. It is not difficult to think of sequences of length n which are very
closely related but whose Hamming distance is n (for instance, rotates
or complements of the target sequence). Can you justify the fact that
the value of CSI depends in large part upon how you define your target
alphabet and where you try to begin to match your sequence, rather than
on any property of the sequence itself?
4. There is an odd incongruity between the two formulas presented. It
seems rather odd to me that the first term in both cases is merely the
size of the set of all strings from the alphabet of length n. (Over
a binary alphabet, this is merely 2^n). The second term in the case
where X = 2 is equal to the number of strings which have Hamming
distance Hd. This is not the same as the second term in the second
formula. The appropriate analog would appear to be (X - 1) *
n! / ((n-hd!) hd!).

> Sean Pitman
> www.DetectingDesign.com

Mark VandeWettering

Jul 23, 2007, 5:15:46 PM
On 2007-07-23, Mark VandeWettering <wett...@attbi.com> wrote:

> 4. There is an odd incongruity between the two formulas presented. It
> seems rather odd to me that the first term in both cases is merely the
> size of the set of all strings from the alphabet of length n. (Over
> a binary alphabet, this is merely 2^n). The second term in the case
> where X = 2 is equal to the number of strings which have Hamming
> distance Hd. This is not the same as the second term in the second
> formula. The appropriate analog would appear to be (X - 1) *
> n! / ((n-hd!) hd!).

A moment's further reflection indicates that I have made a mistake in this
as well, along with a typo (the (n-hd!) term should actually be (n-hd)!).

Let comb(a, b) = a! / ((a-b)! b!). If there are hd places where the
two strings differ, then there are comb(n, hd) ways to choose
which sites differ. At each such site, there are (X-1) ways to assign a
differing value, so the number of strings which have
Hamming distance hd is

(X-1)^hd * comb(n, hd).
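
The count above can be checked by brute force for small cases. The following Python sketch enumerates every string over an alphabet of size X and compares the tally with the closed form; the helper names are made up for illustration.

```python
from itertools import product
from math import comb

def count_at_distance(alphabet_size: int, n: int, hd: int) -> int:
    """Brute-force count of length-n strings at Hamming distance hd from a reference."""
    reference = (0,) * n  # any reference gives the same count by symmetry
    return sum(
        1
        for s in product(range(alphabet_size), repeat=n)
        if sum(a != b for a, b in zip(s, reference)) == hd
    )

def closed_form(alphabet_size: int, n: int, hd: int) -> int:
    """(X-1)^hd * comb(n, hd)."""
    return (alphabet_size - 1) ** hd * comb(n, hd)

# Compare both counts for a few small alphabets and every hd from 0 to n = 5.
for X in (2, 3, 4):
    for hd in range(6):
        assert count_at_distance(X, 5, hd) == closed_form(X, 5, hd)
```

Summing the closed form over all hd from 0 to n gives X^n, as it must, since every string lies at exactly one distance from the reference.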

>> Sean Pitman
>> www.DetectingDesign.com
>

Perplexed in Peoria

Jul 23, 2007, 9:39:38 PM

"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message news:1185220419.1...@x35g2000prf.googlegroups.com...

> After a bit of discussion and revision of my initial effort, the
> following is my formula for calculating my version of complex
> specified information (CSI):
>
> For X = 2:
>
> CSI: X^n - (n! / (n-hd)! hd!)
>
> For X > 2:
>
> CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> X = number of possible characters per position
> n = size of the sequence
> hd = Hamming Distance

Er. Hamming distance from what? From a string providing a "complex
specification"? Does the measured string have to exactly match the
specifying string in length? However would you know in practice
what the specification string is?

Also your definition for the X > 2 case makes no sense to me, perhaps
because the parentheses don't balance.

But it doesn't make much sense to me to call your definition 'information'
even in the case of X=2. I would have thought that the CSI in a
string which exactly matches the specification would be the information
in the specification. That is, CSI = n. But you define it as 2^n - 1.
(In the general case, I would have defined it as
(log(base 2) X) * (n - (X/(X-1)) * hd),
regardless of X.)
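
A short Python sketch of the alternative definition just given, to show how it behaves at the two ends. The function name `bits_matched` is hypothetical; the formula itself is the one in the paragraph above.

```python
from math import log2

def bits_matched(X: int, n: int, hd: int) -> float:
    """(log2 X) * (n - (X / (X - 1)) * hd), the alternative suggested above."""
    return log2(X) * (n - (X / (X - 1)) * hd)

# An exact match (hd = 0) scores n * log2(X) bits.
print(bits_matched(2, 8, 0))
# For X = 2, hd = n/2 is what a uniformly random string gives on average.
print(bits_matched(2, 8, 4))
# For X = 4 and n = 8, the chance-level distance is n * (X - 1) / X = 6.
print(bits_matched(4, 8, 6))
```

The point of the X/(X-1) factor is that a uniformly random string differs from the reference at n(X-1)/X sites on average, and the measure is zero (up to rounding) at exactly that distance.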

Very weird, Sean! What exactly are you trying to accomplish? I know what
you ARE accomplishing, but I am too polite to say it.

R. Baldwin

Jul 24, 2007, 12:10:17 AM
"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message
news:1185220419.1...@x35g2000prf.googlegroups.com...
> After a bit of discussion and revision of my initial effort, the
> following is my formula for calculating my version of complex
> specified information (CSI):
>
> For X = 2:
>
> CSI: X^n - (n! / (n-hd)! hd!)
>
> For X > 2:
>
> CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> X = number of possible characters per position
> n = size of the sequence
> hd = Hamming Distance

You might want to correct the unbalanced parentheses in the X>2 formula.

You might also want to spell out how you are computing the Hamming Distance;
that is, between what and what?

>
> And yes, I have tried it out and it does seem to work quite well
> regardless of string size or the number of potential characters per
> position. Hopefully no further revisions will be necessary, but
> that's why I'm presenting it here to see if anyone can find anything
> wrong with the formula.
>
> As it currently stands, the greater the CSI number, the better the
> odds of non-random production. This is especially true when compared
> to reference sequences with no repeating patterns, like pi, regardless
> of the assumed distribution of the origin of the symbols in the test
> sequence.

So far, you have not provided anything to corroborate your claims about how
the CSI number behaves with respect to non-random production. I'm sure it
seems obvious to you, as always, but I'm afraid that is not good enough.

How do you know the CSI number correlates with non-random production? By
what relationship?

>
> I also want to note, one more time, that the detecting of high CSI, by
> itself, does not equal the detection of ET or ID. That sort of
> hypothesis requires additional knowledge concerning the material in
> which the pattern in carried as well as how this material interacts
> with various deliberate and non-deliberate forces of nature.

What material? Why does a pattern have to involve material?


_Arthur

Jul 24, 2007, 8:40:30 AM
Seanpit wrote:

> As it currently stands, the greater the CSI number, the better the
> odds of non-random production. This is especially true when compared
> to reference sequences with no repeating patterns, like pi, regardless
> of the assumed distribution of the origin of the symbols in the test
> sequence.

Most excellent, Professor Pitmann !

Now, could you measure the CSI of the elephant trunk ?

You pick the elephant. I suggest Loxodonta africana africana.

Is the CSI high, low, or middlin' ?

Please explain carefully how you come up with the proper target
sequence and reference sequence.

I wanna know if She created the trunk, or if a crocodile stretched an
elephant nose.

_Arthur.

fropome

Jul 24, 2007, 9:59:29 AM
On 23 Jul, 20:53, Seanpit <seanpitnos...@naturalselection.0catch.com>
wrote:

> After a bit of discussion and revision of my initial effort, the
> following is my formula for calculating my version of complex
> specified information (CSI):
>
> For X = 2:
>
> CSI: X^n - (n! / (n-hd)! hd!)
>
> For X > 2:
>
> CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> X = number of possible characters per position
> n = size of the sequence
> hd = Hamming Distance
>

You haven't got your brackets right here so this could be the only
problem, but looking at your denominator, if x = 3 and n = 2
(log(base 2)(X^n) - hd)!
= (log(base 2) (3^2) - hd)!
= (log(base2)(9 - hd)) !

which makes no sense if hd is, say, 2 (can you see why?).

If you meant to write:
(log(base 2)(X^n) - hd!) * hd!)
then this also does not work since hd! could be greater than
log(base 2)(X^n). This would make the fraction negative, which would
create a CSI > X^n. This doesn't make sense from a conceptual standpoint.


> And yes, I have tried it out and it does seem to work quite well
> regardless of string size or the number of potential characters per
> position. Hopefully no further revisions will be necessary, but
> that's why I'm presenting it here to see if anyone can find anything
> wrong with the formula.
>

see above. You really tried this out for x <> 2 ?


> As it currently stands, the greater the CSI number, the better the
> odds of non-random production. This is especially true when compared
> to reference sequences with no repeating patterns, like pi, regardless
> of the assumed distribution of the origin of the symbols in the test
> sequence.
>
> I also want to note, one more time, that the detecting of high CSI, by
> itself, does not equal the detection of ET or ID. That sort of
> hypothesis requires additional knowledge concerning the material in
> which the pattern in carried as well as how this material interacts
> with various deliberate and non-deliberate forces of nature.
>
> Sean Pitman
> www.DetectingDesign.com

How are you claiming to have derived this? By using experimental data
or by mathematical proof? I can't see how you could possibly have
used either. This means that it looks like you don't have any process
and that you're guessing. Why don't you tell us how you are getting to
these equations and maybe we can help.
You seem to have pulled this formula out of your behind. You've just
created a random formula which roughly outputs what you want and is
complex enough to look good to a layman. It's an ad hoc justification
for something you've already decided. Unless you can show otherwise
nothing you say will be convincing.

hersheyh

Jul 24, 2007, 11:25:07 AM

Nothing Sean says *is* convincing. You are quite right to suspect
that he has simply pulled this formula out of his arse. For starters,
the X^n part is there merely because it increases the size of the
numbers. In terms of telling us how close two sequences are to each
other, the term is irrelevant. The only *relevant* part of his
equation is the part that has the term hd. And he is furiously waving
his hands implying that the size of hd is something other than his
arbitrary choice, that he actually has a way of determining what the
'reference' sequence was. He also resolutely fails to specify what
his operational assumptions for this equation are, even though they
are quite clear: movement is only between sequences of the same size,
change always only affects a single unit of sequence, change is random
and without selective benefit until the final change occurs. In fact,
he repeatedly denies that his model involves these assumptions that
are implicit in his equation.

Bob Berger

Jul 24, 2007, 4:19:42 PM
In article <1185290707.7...@n60g2000hse.googlegroups.com>, hersheyh
says...


On the other hand, when one applies Sean's formula to his own posts to this news
group, it invariably returns a negative value, which is what we'd expect.

Bob

Seanpit

Jul 24, 2007, 6:33:41 PM
On Jul 23, 1:14 pm, snex <s...@comcast.net> wrote:

> when are you going to apply this function to the bit strings in my
> "CSI Challenge" thread?

If the reference string works, then positive results are useful.
However, negative results do not rule out the possibility of a very
non-random bias. So, all you have to do is pick various reference
strings that you know to be non-random in production and then compare
your unknown sequences to them.

For example, in testing your strings for possible bias, let's use as a
reference string a binary string that is made up of all zeros and is
1,324 characters in length (the same size as your test strings). Let's
now compare your Test String A to the reference string:

Test String A:
110100000110111100101110010100100111101010101011001011001000111010000100
- 35
110110010000011101110100111001010111100000110101001001001111100100011001
- 36
100110011000101100001000001010011010111001000001111101010000100111110010
- 32
000110010110100001001100100110001001110111000001010011001001001100000101
- 29
101111111110110010000011111000101011000000101001011100010111100001111101
- 39
110101000010000110011111001111000101110100010110100001110111101111100110
- 39
101001011100111010110110110000111110101100000001011000100100011111010111
- 38
011100101101111001001011101111001101001001001010111011110101100000100001
- 38
100111100010000110101101011110110111100100011010101000100111000111110001
- 38
101001001011110110110011011111100110000111011000111100001010011111101101
- 42
011110011110000110111100010111110111000100010110111100100100111001011100
- 40
110111011001101100010100011111001001110100110010111001110010001100110100
- 38
001101010010110011100111101000010101010101010011001100010010111001011000
- 34
000101001110110100111110010001001111011100010101101010000000100111011110
- 36
010001001111110011011100101111101110110001000110011011111111100100000100
- 40
111010100010010001111011010110101110111001100110101111101010101010110110
- 42
100010101010011000111000101010001000111011011011000010001001010100110011
- 32
111100111111100000000110110100001001010101110101000111111010011010101001
- 38
1011110110000010101000110111
- 15

Signal A: 18 x 72 + 28 = 1324 characters, Hamming Distance: 681
(expected 662)

The expected number of 1's in this string, given the assumption of
uniform distribution for random generation, is 662. The actual
measured Hamming Distance is 681.
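
As a small illustration of the counting step, here is a Python sketch that computes the Hamming distance to an all-zeros reference, which is the same as counting the 1s. It uses only the final 28-character row of Test String A as input; the function name is an assumption made for this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

# Last (28-character) row of Test String A, copied from above.
row = "1011110110000010101000110111"
reference = "0" * len(row)

# Against an all-zeros reference the distance equals the number of 1s.
print(hamming_distance(row, reference), row.count("1"))
```

Applying the same function row by row and summing the results gives the total for the full 1,324-character string.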

At this point, I'm going to use a slightly modified version of my CSI
formula that seems to work more easily (at least for me).

CSI: n! / ((n - |n/2 - hd|)! * |n/2 - hd|!)

If the HD were zero or maximum (i.e., a perfect match or inverse
match) the CSI would be maximum either way:

CSImax: 1324! / ((1324 - (1324/2 - 0))! * (662 - 0)!)
= 1324! / (662! * 662!)
= 8.028387e+396

CSI for the expected random outcome:

CSIexp: 1324! / ((1324 - (662 - 662))! * (662 - 662)!)
= 1324! / (1324! * 0!)
= 1

The actual CSI for Test String A:

CSI - A: 1324! / ((1324 - |662 - 681|)! * |662 - 681|!)
= 1324! / (1305! * 19!)
= 1.49424e+42
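
The three values above can be reproduced with exact integer arithmetic. This is a minimal Python sketch, assuming even n so that n/2 is a whole number; the name `csi_modified` is hypothetical.

```python
from math import comb, log10

def csi_modified(n: int, hd: int) -> int:
    """n! / ((n - d)! * d!) with d = |n/2 - hd|, for even n."""
    if n % 2:
        raise ValueError("this sketch assumes even n so that n/2 is an integer")
    d = abs(n // 2 - hd)
    return comb(n, d)

n = 1324
for label, hd in (("max", 0), ("expected", 662), ("A", 681)):
    value = csi_modified(n, hd)
    # The integers are too large for float(); report the base-10 logarithm instead.
    print(label, hd, round(log10(value), 2))
```

The largest of these values exceeds the range of a double-precision float, which is why the sketch prints logarithms rather than the numbers themselves.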

Now, this might seem like a large number, but compared to the maximum
CSI value of about 8e+396, a number like 1.49e+42 isn't a very
reliable match. Therefore, although it isn't quite the CSI value one
would expect from a randomly generated sequence with a uniform
distribution, the biased nature of the sequence has relatively low
reliability.

Test String B:
010101010110100000100110011010010110010001100001110100010010000000110000
- 27
010101111000000101100100111101000010100101111100010101010011111101110010
- 37
001101100110101100011000011000000110000000110110100101000111010001101001
- 30
111011010110111101111100001010101101000110100100011011011001000101111111
- 42
000110110010101011110001111001101111011111001000011110101100111100001010
- 40
101000001010010101100100001000110110010001101101000100100100000001100101
- 27
001000101001010101100111001100110010111110000011010101001010000010011001
- 32
011100110000101100110111001000101101010110101100100110000110000011110010
- 34
010101101110000001010100101100101100011001100010011011010001011010100101
- 33
010000000110000110000000011011100011110100000111001001011110001001100100
- 28
000000110010101100100101001110110010001100100000011110001010101000110011
- 30
000011000000101110000001011000011110011010001001010111110011110001101100
- 32
111011000101001101100011000010000101001010111010011111111100010011110110
- 38
101011000110111001110100010111010100101010110000011000100111011101100011
- 37
011000010100110000010110000110011000000000101000110011010010110000101100
- 26
010110110101001111100011001101100000101000100001001010000001001100000111
- 30
110000101010101001101001001110100000010000100010010111010111011110000101
- 32
111001100110010010011001011101101111010001000100101001101101111101110000
- 38
1111110001110000100110101010 - 15

Signal B: 18 x 72 + 28 = 1324 characters, Hamming Distance: 608
(expected 662)

The expected number of 1's in this string, given the assumption of
uniform distribution for random generation, is 662. The actual
measured Hamming Distance is 608.

CSI: n! / ((n - |n/2 - hd|)! * |n/2 - hd|!)

= 1324! / ((1324 - (662 - 608))! * (662 - 608)!)

= 1324! / (1270! * 54!)

= 5.5307e+96

As mentioned above, a CSI of 5.5e+96 might not seem very reliable when
compared to the max CSI value of 8e+396. However, compared to the CSI
value for Test String A (CSI of 1.49e+42) the bias of B is much more
significant compared to A. Therefore, between the two strings A and
B, as compared to the reference sequence, B is more likely to have
been the result of non-random generation.
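
For comparison only, and not as part of the CSI formula, the deviation of each count from 662 can also be expressed as a probability. The Python sketch below uses a standard two-sided binomial tail under the same uniform assumption (each bit is 1 with probability 1/2); this is a substituted textbook technique, and the function name is made up for the example.

```python
from math import comb

def two_sided_tail(n: int, hd: int) -> float:
    """P(|K - n/2| >= |hd - n/2|) for K ~ Binomial(n, 1/2)."""
    d = abs(hd - n / 2)
    favourable = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= d)
    # Dividing two Python integers gives an accurately rounded float,
    # even when both are far too large to be floats themselves.
    return favourable / 2 ** n

print("A:", two_sided_tail(1324, 681))
print("B:", two_sided_tail(1324, 608))
```

A smaller tail probability means the observed count of 1s is less typical of a fair coin, so this gives the same ordering of A and B on a scale that can be read directly as odds.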

Of course, if the test string happened to be pi, the biased nature of
the test string would not have been reliably picked up by comparison
to the reference string used in this case as having maximum CSI. That
is why a negative finding does not rule out the possibility of highly
predictable non-random generation from a relatively simple algorithm
that simply hasn't been included as one of the reference algorithms/
strings.

One other potential problem is one of where a random sequence is in
fact produced, but has a non-uniform distribution. This test will
reliably detect the non-uniform nature of the distribution via the
relative increase in CSI above the expected. The formula can be
applied by equalizing the number of characters in the sequence to
remove the non-uniform bias, and then comparing the resulting pattern
of 0s and 1s in comparison to various reference strings that may
detect various forms of symmetry etc.

Again, none of these methods will be able to detect all forms of bias
that could in fact be reduced to a relatively simple algorithm if it
were only known. In fact, it is impossible to rule out the
possibility of non-random bias for a sequence that is apparently
random from the perspective of the test. A positive test result can
still be quite helpful.

Sean Pitman
www.DetectingDesign.com

Seanpit

Jul 24, 2007, 6:41:42 PM
On Jul 23, 1:25 pm, hersheyh <hershe...@yahoo.com> wrote:
> On Jul 23, 3:53 pm, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> > After a bit of discussion and revision of my initial effort, the
> > following is my formula for calculating my version of complex
> > specified information (CSI):
>
> What, other than generating numbers that you like, are these formulas
> supposed to represent? Specifically, how do you *determine* the hd
> number? Arbitrary choice?

The Hamming Distance is defined as the number of character differences
that exist between the test sequence and the target sequence.

> Choosing the value n/2 (the average
> sequence difference between all the
> sites being the same and all the
> sites being different)?

That number, the average HD, has the lowest CSI value . . .

> And isn't the *minimum* difference between a
> 'reference' sequence and a 'target'
> sequence still going to be hd=1 no
> matter what you say and no matter
> how large n is?

The minimum HD is actually zero - or complete identity. The maximum
HD also produces the same CSI value (1) for a given string as does the
minimum HD.

Again, Howard, this particular topic isn't about how long it would
take to find an unknown target sequence in sequence or structure
space. It is about identifying whether a target that has been found has a
non-random bias that can actually be reliably detected. This is a
different concept, Howard. You're drifting way off topic again.

Sean Pitman
www.DetectingDesign.com

Seanpit

Jul 24, 2007, 7:02:39 PM
On Jul 23, 1:28 pm, Mark VandeWettering <wetter...@attbi.com> wrote:

> On 2007-07-23, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:

> > After a bit of discussion and revision of my initial effort, the
> > following is my formula for calculating my version of complex
> > specified information (CSI):
>
> > For X = 2:
>
> > CSI: X^n - (n! / (n-hd)! hd!)
>
> > For X > 2:
>
> > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> > X = number of possible characters per position
> > n = size of the sequence
> > hd = Hamming Distance
>

> Well, the primary problem with this formula remains the same as the
> problem with its predecessor: it only works in comparison to a target,
> which can only be viewed as "begging the question" that it was supposed
> to answer.

Not at all. This is how all forms of artifact detection begin.
They all start with the detection of bias and all of these forms of
detection are based on reference to a string that is known to be the
product of biased non-random production. There simply is no other way
to detect non-randomness other than to use a reference. The use of a
reference is also the basis for related concepts like Kolmogorov/
Chaitin complexity, Shannon information, etc.

> One need not really go beyond that, but here are some additional things
> to ponder:
>
> 1. In what sense does this "work well"?
> Can you provide some examples
> which demonstrate it working?

Try it out yourself . . . except it might be easier to use the
following formula for CSI:

Binary Strings:
CSI: n! / ((n - |n/2 - hd|)! * |n/2 - hd|!)

Strings where X > 2:
CSI: log(base2)(X^n)! / ((log(base2)(X^n) - |log(base2)(X^n)/2 - hd|)! *
|log(base2)(X^n)/2 - hd|!)

> 2. How can you show that "the greater the CSI number, the better the
> odds of non-random production"? Presumably that requires some mapping
> from CSI to probabilities, which seems to be absent here. Absent that
> derivation, how can you be sure that the mapping of values of CGI to
> probability are monotically increasing?

Try it yourself and see. It should be rather self evident to you
anyway. Short strings with a HD of 0 are not as reliably non-random
as are longer strings with the same HD of 0. That's a practically self-
evident statement.

> 3. It is not difficult to think of sequences of length n which are very
> closely related but whose Hamming distance is n (for instance, rotates
> or complements of the target sequence).

Complements of the target sequence would have the same CSI value as
the target sequence itself. Try out the formula and see.

> Can you justify the fact that
> the value of CSI depends in large part upon how you define your target
> alphabet and where you try to begin to match your sequence, rather than
> on any property of the sequence itself?
>
> 4. There is an odd incongruity between the two formulas presented. It
> seems rather odd to me that the first term in both cases is merely the
> size of the set of all strings from the alphabet of length n. (Over
> a binary alphabet, this is merely 2^n). The second term in the case
> where X = 2 is equal to the number of strings which have Hamming
> distance Hd. This is not the same as the second term in the second
> formula. The appropriate analog would appear to be (X - 1) *
> n! / ((n-hd!) hd!).

I guess I don't understand this question . . . ?

Sean Pitman
www.DetectingDesign.com

Seanpit

Jul 24, 2007, 7:11:01 PM
On Jul 23, 6:39 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
wrote:
> "Seanpit" <seanpitnos...@naturalselection.0catch.com> wrote in message news:1185220419.1...@x35g2000prf.googlegroups.com...

> > After a bit of discussion and revision of my initial effort, the
> > following is my formula for calculating my version of complex
> > specified information (CSI):
>
> > For X = 2:
>
> > CSI: X^n - (n! / (n-hd)! hd!)
>
> > For X > 2:
>
> > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> > X = number of possible characters per position
> > n = size of the sequence
> > hd = Hamming Distance
>
> Er. Hamming distance from what? From a
> string providing a "complex
> specification"?

From a string or multiple strings that is/are of known non-random
production.

> Does the measured string
> have to exactly match the
> specifying string in length?

To achieve maximum CSI - yes, but not to achieve useful CSI.

> However would you know in practice
> what the specification string is?

By picking a string of known non-random production, like pi.

> Also your definition for the X > 2
> case makes no sense to me, perhaps
> because the parentheses don't balance.

For binary strings, X = 2. For strings that have more than two
possible characters per position, X > 2.

> But it doesn't make much sense to me to call your definition 'information'
> even in the case of X=2. I would have thought that the CSI in a
> string which exactly matches the specification would be the information
> in the specification. That is, CSI = n. But you define it as 2^n - 1.
> (In the general case, I would have defined it as
> (log(base 2) X) * (n - (X/(X-1)) * hd),
> regardless of X.)

I'm not sure what your question is here . . .

> Very weird, Sean! What exactly are you trying to accomplish? I know what
> you ARE accomplishing, but I am too polite to say it.

Think about it for a bit . . .

Sean Pitman
www.DetectingDesign.com


hersheyh

Jul 24, 2007, 7:25:26 PM
On Jul 24, 6:41 pm, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> On Jul 23, 1:25 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > On Jul 23, 3:53 pm, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> > > After a bit of discussion and revision of my initial effort, the
> > > following is my formula for calculating my version of complex
> > > specified information (CSI):
>
> > What, other than generating numbers that you like, are these formulas
> > supposed to represent? Specifically, how do you *determine* the hd
> > number? Arbitrary choice?
>
> The Hamming Distance is defined by the number of character difference
> exist between the test sequence and the target sequence.

I *know* how hamming distance is *defined*. That wasn't the
question. The question was how you *determine* the number you use.
And I suggested that you made an arbitrary choice.


>
> > Choosing the value n/2 (the average
> > sequence difference between all the
> > sites being the same and all the
> > sites being different)?
>
> That number, the average HD, has the lowest CSI value . . .

SFW? Is that the number you choose by choosing your 'reference'
sequence? How do you determine the hd except by your choice of
'reference' sequence? Quit beating around the bush and tell us how
you *choose* the 'reference' sequence and thus *choose* the hd number?


>
> > And isn't the *minimum* difference between a
> > 'reference' sequence and a 'target'
> > sequence still going to be hd=1 no
> > matter what you say and no matter
> > how large n is?
>
> The minimum HD is actually zero - or complete identity.

I said minimum *difference*. Zero or identity is a state of 'no
difference'.

> The maximum
> HD also produces the same CSI value (1) for a given string as does the
> minimum HD.

And again you are merely *defining* the hd, not telling us shit about
how you *determine* it (or the 'reference' sequence, which does the
same thing).

> Again, Howard, this particular topic isn't about how long it would
> take to find an unknown target sequence in sequence or structure
> space.

It certainly is about how easy it is to generate a 'target' sequence
from a 'reference' sequence, not how deviant a sequence is from some
'average' sequence.

Again. I *asked* you how you *determine* the hd values you come up
with. And you beat around the bush *pretending* to say something
instead of telling the truth -- you make up any hd value to support
whatever position you want and have no way of actually *using* this
equation to describe nature. IOW, the equation is merely numerology
from which you can extract whatever hypothetical state you want.

> It is about identifying of a target that has been found has a
> non-random bias that can actually be reliably detected. This is a
> different concept Howard. You're drifting way off topic again.

Not if there is no way to *determine* hd values except by arbitrary
decision. That makes any determination you generate nothing but a
meaningless number of your choice, a mere reflection of your arbitrary
choice of 'reference sequence' or 'hd number'. The only thing it
tells us is which numbers you wanted to produce. Obviously this is
the case or you would not be spewing *definitions* when I asked for
*determinations*.

There certainly already are better ways of determining similarity or
degree of identity between sequences.
>
> Sean Pitman
> www.DetectingDesign.com


Seanpit

Jul 24, 2007, 8:46:53 PM
On Jul 24, 6:59 am, fropome <monk...@hornsandhalos.co.uk> wrote:

> > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>
> > X = number of possible characters per position
> > n = size of the sequence
> > hd = Hamming Distance
>
> You haven't got your brackets right here so this could be the only
> problem, but looking at your denominator, if x = 3 and n = 2
> (log(base 2)(X^n) - hd)!
> = (log(base 2) (3^2) - hd)!
> = (log(base2)(9 - hd)) !
>
> which makes no sense if hd is- say- 2 (can you see why?).

Not really - - You just put the brackets in the wrong place.

(Log(base2)(3^2) - hd)!
= (Log(base2)9 - hd)!
= (3.17 - 2)!
= 1.17!

That means that:

= 9 - (3.17! / (1.17! * 2!))
= 9 - 3.43945
CSI = 5.56055

Using the modified formula for CSI:

(log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
(log(base2)(X^n)/2) -hd)|!)

= 3.17! / ((3.17 - |3.17 / 2 - 2|)! * |3.17 / 2 - 2|!)

= 3.17! / ((3.17 - 0.415)! * 0.415!)

= 3.17! / (2.755! * 0.415!)

= 1.90
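
The arithmetic above can be checked with a short Python sketch. It assumes that a fractional factorial x! is meant as the gamma function Gamma(x + 1), and that the denominator groups as (L - hd)! * hd!; neither is stated in the post, but both reproduce the posted numbers. The function names are illustrative only.

```python
from math import gamma, log2

def frac_factorial(x):
    # Assumption: x! for non-integer x is read as gamma(x + 1).
    return gamma(x + 1.0)

def csi_original(X, n, hd):
    # X^n - L! / ((L - hd)! * hd!), with L = log2(X^n).
    L = log2(X ** n)
    return X ** n - frac_factorial(L) / (frac_factorial(L - hd) * frac_factorial(hd))

def csi_modified(X, n, hd):
    # L! / ((L - m)! * m!), with m = |L/2 - hd|.
    L = log2(X ** n)
    m = abs(L / 2.0 - hd)
    return frac_factorial(L) / (frac_factorial(L - m) * frac_factorial(m))

# The case worked above: X = 3, n = 2, hd = 2.
print(round(csi_original(3, 2, 2), 2))
print(round(csi_modified(3, 2, 2), 2))
```

For this case the sketch gives values close to the 5.56 and 1.90 posted above; small differences come from rounding log2(9) to 3.17.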

< snip >

Sean Pitman
www.DetectingDesign.com


Perplexed in Peoria

Jul 24, 2007, 8:47:19 PM

"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message news:1185318661.1...@g12g2000prg.googlegroups.com...

> On Jul 23, 6:39 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
> wrote:
> > "Seanpit" <seanpitnos...@naturalselection.0catch.com> wrote in message news:1185220419.1...@x35g2000prf.googlegroups.com...
> > > After a bit of discussion and revision of my initial effort, the
> > > following is my formula for calculating my version of complex
> > > specified information (CSI):
> >
> > > For X = 2:
> >
> > > CSI: X^n - (n! / (n-hd)! hd!)
> >
> > > For X > 2:
> >
> > > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
> >
> > > X = number of possible characters per position
> > > n = size of the sequence
> > > hd = Hamming Distance
> >
> > Er. Hamming distance from what? From a
> > string providing a "complex
> > specification"?
>
> From a string or multiple strings that is/are of known non-random
> production.
>
> > Does the measured string
> > have to exactly match the
> > specifying string in length?
>
> To achieve maximum CSI - yes, but not to achieve useful CSI.

So how do you define the Hamming distance between two sequences of
different length?

> > However would you know in practice
> > what the specification string is?
>
> By picking a string of known non-random production, like pi.

So this might be useful if someone is sending us a message such as
pi, which is a pretty stupid message, but that is ok because it isn't
intended as a message. It is an intelligence test, and the message
sender inserted just enough mistakes to make it a challenging test.

> > Also your definition for the X > 2
> > case makes no sense to me, perhaps
> > because the parentheses don't balance.
>
> For binary strings, X = 2. For strings that have more than two
> possible characters per position, X > 2.

Yes, Sean, I already understood that, perhaps because you wrote


X = number of possible characters per position

I meant that the formula for CSI which you suggested in the X>2
case makes no sense.

> > But it doesn't make much sense to me to call your definition 'information'
> > even in the case of X=2. I would have thought that the CSI in a
> > string which exactly matches the specification would be the information
> > in the specification. That is, CSI = n. But you define it as 2^n - 1.
> > (In the general case, I would have defined it as
> > (log(base 2) X) * (n - (X/(X-1)) * hd),
> > regardless of X.)
>
> I'm not sure what your question is here . . .

There was no question. I simply told you how I would have done it
and why. It makes the answer come out as information - measured in
bits. Zero bits of CSI for a string whose only matches to the
specification string are those that would result from chance. n bits
of CSI for an exact match to an n bit binary string.
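
As a rough illustration of the bits-based measure described here, the following Python sketch evaluates (log2 X) * (n - (X/(X-1)) * hd). The function name is made up for the example; the formula is the one quoted from the earlier post.

```python
from math import log2

def csi_bits(X, n, hd):
    # (log2 X) * (n - (X / (X - 1)) * hd), as proposed in the post.
    # X  = number of possible characters per position
    # n  = length of the string
    # hd = Hamming distance to the specification string
    return log2(X) * (n - (X / (X - 1)) * hd)

# Exact match to an 8-bit binary specification: hd = 0.
print(csi_bits(2, 8, 0))
# Chance-level match for binary strings: about half the positions differ.
print(csi_bits(2, 8, 4))
```

A uniformly random string is expected to mismatch in n * (X - 1) / X positions, which is exactly the hd at which this expression reaches zero.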

> > Very weird, Sean! What exactly are you trying to accomplish? I know what
> > you ARE accomplishing, but I am too polite to say it.
>
> Think about it for a bit . . .

Ah, I see. You are presenting us with a CSI problem. You (the supposedly
intelligent designer) won't reveal your purpose, but we are expected to
infer intelligent purpose just from the patterns in the algebra you
provide. Of course, what you provide isn't actually intelligent at
all - the parentheses don't even balance for krissake - but it may be
close to intelligent.

Interesting, but I don't think I will play.

Seanpit

Jul 24, 2007, 9:12:51 PM
On Jul 24, 5:47 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
wrote:

> > > Does the measured string
> > > have to exactly match the
> > > specifying string in length?
>
> > To achieve maximum CSI - yes, but not to achieve useful CSI.
>
> So how do you define the Hamming distance between two sequences of
> different length?

Use the "Modified HD", or trim down the reference string to
match the size of the test string.
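
A minimal Python sketch of the trimming option, assuming the reference is simply cut to its first len(test) characters (the post does not say which part of the reference to keep). The "Modified HD" is not defined in this exchange, so it is not covered; the function name is illustrative.

```python
def hamming_distance_trimmed(test, reference):
    # Assumption: the reference is truncated from the right so that it has
    # the same length as the test string.
    if len(reference) < len(test):
        raise ValueError("reference must be at least as long as the test string")
    trimmed = reference[:len(test)]
    # Count the positions at which the two strings differ.
    return sum(a != b for a, b in zip(test, trimmed))

reference = "3141592653"  # leading decimal digits of pi
print(hamming_distance_trimmed("31415", reference))
print(hamming_distance_trimmed("31425", reference))
```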

> > > However would you know in practice
> > > what the specification string is?
>
> > By picking a string of known non-random production, like pi.
>
> So this might be useful if someone is sending us a message such as
> pi, which is a pretty stupid message, but that is ok because it isn't
> intended as a message. It is an intelligence test, and the message
> sender inserted just enough mistakes to make it a challenging test.

It isn't an intelligence test. It is a test for non-random bias. And,
one can have as many known reference strings/algorithms as one wants -
pi being just one of many. A demonstration of high CSI relative to
any of the chosen reference strings would indicate a bias.

> > > Also your definition for the X > 2
> > > case makes no sense to me, perhaps
> > > because the parentheses don't balance.
>
> > For binary strings, X = 2. For strings that have more than two
> > possible characters per position, X > 2.
>
> Yes, Sean, I already understood that, perhaps because you wrote
> X = number of possible characters per position
> I meant that the formula for CSI which you suggested in the X>2
> case makes no sense.

It works fine. It is nothing more than a conversion formula to turn a
string with X > 2 into a binary formula.

> > > But it doesn't make much sense to me to call your definition 'information'
> > > even in the case of X=2. I would have thought that the CSI in a
> > > string which exactly matches the specification would be the information
> > > in the specification. That is, CSI = n. But you define it as 2^n - 1.
> > > (In the general case, I would have defined it as
> > > (log(base 2) X) * (n - (X/(X-1)) * hd),
> > > regardless of X.)
>
> > I'm not sure what your question is here . . .
>
> There was no question. I simply told you how I would have done it
> and why. It makes the answer come out as information - measured in
> bits. Zero bits of CSI for a string whose only matches to the
> specification string are those that would result from chance. n bits
> of CSI for an exact match to an n bit binary string.

That's what my formula already does. See the reply to snex above for
an example of how my formulas work.

< snip rest >

Sean Pitman
www.DetectingDesign.com

Mark VandeWettering

Jul 24, 2007, 9:25:50 PM
On 2007-07-24, Seanpit <seanpi...@naturalselection.0catch.com> wrote:
> On Jul 23, 1:28 pm, Mark VandeWettering <wetter...@attbi.com> wrote:
>> On 2007-07-23, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
>
>> > After a bit of discussion and revision of my initial effort, the
>> > following is my formula for calculating my version of complex
>> > specified information (CSI):
>>
>> > For X = 2:
>>
>> > CSI: X^n - (n! / (n-hd)! hd!)
>>
>> > For X > 2:
>>
>> > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>>
>> > X = number of possible characters per position
>> > n = size of the sequence
>> > hd = Hamming Distance
>>
>> Well, the primary problem with this formula remains the same as the
>> problem with its predecessor: it only works in comparison to a target,
>> which can only be viewed as "begging the question" that it was supposed
>> to answer.
>
> Not at all. This is how all forms of detecting of artifact begin.

We were talking about CSI. I don't know what "detecting of artifact" is,
or what you think that formula has to do with it.

> They all start with the detection of bias

Given that you pick the target against which it is compared, and that this
is obviously the source of basically all the variation in the number returned,
isn't the bias obviously in the choice of targets?

> and all of these forms of
> detection are based on reference to a string that is known to be the
> product of biased non-random production. There simply is no other way
> to detect non-randomness other than to use a reference. The use of a
> reference is also the basis for related concepts like Kolmogorov/
> Chaitin complexity, Shannon information, etc.

Shannon information is certainly not defined that way. It is defined
merely upon the basis of the observed distribution of messages, not in
reference to any "base" set.

>> One need not really go beyond that, but here are some additional things
>> to ponder:
>>
>> 1. In what sense does this "work well"?
>> Can you provide some examples
>> which demonstrate it working?
>
> Try it out yourself . . . except it might be easier to use the
> following formula for CSI:
>
> Binary Strings:
> CSI: n! / (n-|(n/2-hd))|! * |(n/2-hd)|!)
>
> Strings where X > 2:
> (log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
> (log(base2)(X^n)/2) -hd)|!)

So, the answer was "no", you can't provide any examples of it working.

>> 2. How can you show that "the greater the CSI number, the better the
>> odds of non-random production"? Presumably that requires some mapping
>> from CSI to probabilities, which seems to be absent here. Absent that
>> derivation, how can you be sure that the mapping of values of CSI to
>> probability is monotonically increasing?
>
> Try it yourself and see. It should be rather self evident to you
> anyway. Short strings with a HD of 0 are not as reliably non-random
> as are longer strings with the same HD of 0. That's a practically self-
> evident statement.

So the answer is that you can't show that the greater the CSI number,
the better the odds of non-random production.
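
One way such a mapping to probabilities could look, sketched in Python under an assumption that is not part of the posted formulas: the test string is modelled as n independent symbols drawn uniformly from X characters. The function name is illustrative.

```python
from math import comb

def chance_match_probability(X, n, hd):
    # Probability that a uniformly random string of length n over X symbols
    # lies within Hamming distance hd of a fixed reference string.
    # Strings at distance exactly k number comb(n, k) * (X - 1)**k.
    near = sum(comb(n, k) * (X - 1) ** k for k in range(hd + 1))
    return near / X ** n

# An exact match (hd = 0) is far less likely by chance for longer strings.
print(chance_match_probability(2, 4, 0))
print(chance_match_probability(2, 40, 0))
```

Under this model the probability decreases as hd shrinks or n grows, which gives a concrete sense in which a longer exact match is stronger evidence against uniform chance.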

>> 3. It is not difficult to think of sequences of length n which are very
>> closely related but whose Hamming distance is n (for instance, rotates
>> or complements of the target sequence).
>
> Compliments of the target sequence would have the same CSI value as
> the target sequence itself. Try out the formula and see.

Your criticism is valid with respect to complements (somewhat accidently,
I suspect) but the comment with respect to rotates is valid.
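
The point about rotations can be illustrated with a small Python sketch. The helper names and the example strings are chosen for the illustration and do not come from the thread.

```python
def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))

def rotate(s, k):
    # Cyclic left rotation of s by k positions.
    k %= len(s)
    return s[k:] + s[:k]

def complement(bits):
    # Swap 0 and 1 in a binary string.
    return "".join("1" if c == "0" else "0" for c in bits)

target = "01" * 8
print(hamming(target, rotate(target, 1)))    # rotation by one position
print(hamming(target, complement(target)))   # bitwise complement

digits = "3141592653"
print(hamming(digits, rotate(digits, 1)))    # rotated decimal digits
```

In each case the rotated or complemented string is trivially derived from the target, yet the Hamming distance is at or near its maximum.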

>
>> Can you justify the fact that
>> the value of CSI depends in large part upon how you define your target
>> alphabet and where you try to begin to match your sequence, rather than
>> on any property of the sequence itself?

No comment?

>> 4. There is an odd incongruity between the two formulas presented. It
>> seems rather odd to me that the first term in both cases is merely the
>> size of the set of all strings from the alphabet of length n. (Over
>> a binary alphabet, this is merely 2^n). The second term in the case
>> where X = 2 is equal to the number of strings which have Hamming
>> distance Hd. This is not the same as the second term in the second
>> formula. The appropriate analog would appear to be (X - 1) *
>> n! / ((n-hd)! hd!).
>
> I guess I don't understand this question . . . ?

I suspect you don't understand the formula.
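
The counting argument in point 4 can be checked by brute force. The Python sketch below enumerates every string of length n over X symbols and counts those at a given Hamming distance from a fixed reference. It compares the result with the standard closed form comb(n, hd) * (X - 1)**hd, which matches the expression quoted above whenever X = 2 or hd = 1. Function names are illustrative.

```python
from itertools import product
from math import comb

def count_at_distance(X, n, hd):
    # Brute force: enumerate all X**n strings and count those that differ
    # from the all-zero reference in exactly hd positions.
    reference = (0,) * n
    return sum(
        1
        for s in product(range(X), repeat=n)
        if sum(a != b for a, b in zip(s, reference)) == hd
    )

def closed_form(X, n, hd):
    # Choose which hd positions differ, then one of (X - 1) other symbols
    # for each of those positions.
    return comb(n, hd) * (X - 1) ** hd

for X in (2, 3, 4):
    for hd in range(5):
        print(X, hd, count_at_distance(X, 4, hd), closed_form(X, 4, hd))
```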

Mark
>
> Sean Pitman
> www.DetectingDesign.com
>

Seanpit

Jul 24, 2007, 9:32:04 PM
On Jul 24, 4:25 pm, hersheyh <hershe...@yahoo.com> wrote:

> > > What, other than generating numbers that you like, are these formulas
> > > supposed to represent? Specifically, how do you *determine* the hd
> > > number? Arbitrary choice?
>
> > The Hamming Distance is defined as the number of character differences
> > that exist between the test sequence and the target sequence.
>
> I *know* how hamming distance is *defined*. That wasn't the
> question. The question was how you *determine* the number you use.
> And I suggested that you made an arbitrary choice.

Your question doesn't make any sense to me. The Hamming Distance
"number" is determined by the number of mismatched character positions
between the reference string and the test string. It is an absolute
number; not an arbitrary or subjective choice.

Also, the selection of the reference string must be done without any
knowledge ahead of time of the test string. The choice must be
completely independent. Useful choices would include strings that are
known to be the result of non-random simple algorithms, like pi or a
repeat of a simple pattern - like 01010101 . . ., etc.

> > > Choosing the value n/2 (the average
> > > sequence difference between all the
> > > sites being the same and all the
> > > sites being different)?
>
> > That number, the average HD, has the lowest CSI value . . .
>
> SFW? Is that the number you choose by choosing your 'reference'
> sequence? How do you determine the hd except by your choice of
> 'reference' sequence?

That's exactly how you determine the HD, by comparison to your
previously chosen reference sequence.

> Quit beating around the bush and tell us how
> you *choose* the 'reference' sequence and
> thus *choose* the hd number?

The reference sequences are chosen ahead of time before the test
sequence is analyzed. The choice of reference or references is not
based on the test sequence, but on sequences that are already known to
be the result of non-random production.

> > > And isn't the *minimum* difference between a
> > > 'reference' sequence and a 'target'
> > > sequence still going to be hd=1 no
> > > matter what you say and no matter
> > > how large n is?
>
> > The minimum HD is actually zero - or complete identity.
>
> I said minimum *difference*. Zero or identity is a state of 'no
> difference'.

Of course - but making the point for a minimum HD difference being 1
is irrelevant in this particular discussion.

< snip repetitive >

Sean Pitman
www.DetectingDesign.com


_Arthur

Jul 24, 2007, 9:41:21 PM
On Jul 24, 4:19 pm, Bob Berger <s...@eskimo.com> wrote:
> On the other hand, when one applies Sean's formula to his own posts to this news
> group, it invariably returns a negative value, which is what we'd expect.
>
> Bob

That would mean his posts have NO information content, and are pure
noise.


Seanpit

Jul 24, 2007, 9:57:55 PM
On Jul 24, 6:25 pm, Mark VandeWettering <wetter...@attbi.com> wrote:

> >> Well, the primary problem with this formula remains the same as the
> >> problem with its predecessor: it only works in comparison to a target,
> >> which can only be viewed as "begging the question" that it was supposed
> >> to answer.
>
> > Not at all. This is how all forms of detecting of artifact begin.
>
> We were talking about CSI. I don't know what "detecting of artifact" is,
> or what you think that formula has to do with it.
>
> > They all start with the detection of bias
>
> Given that you pick the target against which it is compared, and that this
> is obviously the source of basically all the variation in the number returned,
> isn't the bias obviously in the choice of targets?

No, because the reference strings are chosen independent of the test
strings. Therefore a significant match between a reference and a test
string is good evidence of non-random production.

> > and all of these forms of
> > detection are based on reference to a string that is known to be the
> > product of biased non-random production. There simply is no other way
> > to detect non-randomness other than to use a reference. The use of a
> > reference is also the basis for related concepts like Kolmogorov/
> > Chaitin complexity, Shannon information, etc.
>
> Shannon information is certainly not defined that way. It is defined
> merely upon the basis of the observed distribution of messages, not in
> reference to any "base" set.

Actually, a string with maximum Shannon information is a "random"
string. And, how does one define randomness? By comparison to a
reference computer - a UTM.

> >> One need not really go beyond that, but here are some additional things
> >> to ponder:
>
> >> 1. In what sense does this "work well"?
> >> Can you provide some examples
> >> which demonstrate it working?
>
> > Try it out yourself . . . except it might be easier to use the
> > following formula for CSI:
>
> > Binary Strings:
> > CSI: n! / (n-|(n/2-hd))|! * |(n/2-hd)|!)
>
> > Strings where X > 2:
> > (log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
> > (log(base2)(X^n)/2) -hd)|!)
>
> So, the answer was "no", you can't provide any examples of it working.

I'm really not sure of what type of example you want? These formulas
will pick up any match to a predetermined sequence quite easily. The
greater the match the greater the CSI and the more likely the test
string is not the result of random production. It is really quite a
simple concept that should be rather self-evident to you. Of course,
you are the one who thinks pi is not at all computable, so you may
have more difficulties.

> >> 2. How can you show that "the greater the CSI number, the better the
> >> odds of non-random production"? Presumably that requires some mapping
> >> from CSI to probabilities, which seems to be absent here. Absent that
> >> derivation, how can you be sure that the mapping of values of CSI to
> >> probability is monotonically increasing?
>
> > Try it yourself and see. It should be rather self evident to you
> > anyway. Short strings with a HD of 0 are not as reliably non-random
> > as are longer strings with the same HD of 0. That's a practically self-
> > evident statement.
>
> So the answer is that you can't show that the greater the CSI number,
> the better the odds of non-random production.

What do you want?! If my reference string is 010101 . . . x 1 million
digits and you bring me a test string with a HD of 1 million relative
to the reference string, would you honestly question the concept that
such a match would strongly indicate a non-random origin for your test
string?

> >> 3. It is not difficult to think of sequences of length n which are very
> >> closely related but whose Hamming distance is n (for instance, rotates
> >> or complements of the target sequence).
>
> > Compliments of the target sequence would have the same CSI value as
> > the target sequence itself. Try out the formula and see.
>
> Your criticism is valid with respect to complements (somewhat accidently,
> I suspect) but the comment with respect to rotates is valid.

You are the one who mistakenly tried to use "complimentary strings"
to challenge my formula . . . accidently I'm sure ; )

I'm not sure I know what you mean by string "rotation"?

Regardless, no method is able to catch all possible non-random
permutations of a string. That's simply impossible and is not a valid
test of the usefulness of an algorithm.

< snip >

Sean Pitman
www.DetectingDesign.com

R. Baldwin

Jul 24, 2007, 11:20:20 PM
"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message
news:1185328675.4...@i38g2000prf.googlegroups.com...

> On Jul 24, 6:25 pm, Mark VandeWettering <wetter...@attbi.com> wrote:
>
>> >> Well, the primary problem with this formula remains the same as the
>> >> problem with its predecessor: it only works in comparison to a target,
>> >> which can only be viewed as "begging the question" that it was
>> >> supposed
>> >> to answer.
>>
>> > Not at all. This is how all forms of detecting of artifact begin.
>>
>> We were talking about CSI. I don't know what "detecting of artifact" is,
>> or what you think that formula has to do with it.
>>
>> > They all start with the detection of bias
>>
>> Given that you pick the target against which it is compared, and that
>> this
>> is obviously the source of basically all the variation in the number
>> returned,
>> isn't the bias obviously in the choice of targets?
>
> No, because the reference strings are chosen independent of the test
> strings. Therefore a significant match between a reference and a test
> string is good evidence of non-random production.

That is not necessarily true. It is good evidence that the test string was
not produced by a stationary random process with Uniform distribution, which
is a much more restricted case.

>
>> > and all of these forms of
>> > detection are based on reference to a string that is known to be the
>> > product of biased non-random production. There simply is no other way
>> > to detect non-randomness other than to use a reference. The use of a
>> > reference is also the basis for related concepts like Kolmogorov/
>> > Chaitin complexity, Shannon information, etc.
>>
>> Shannon information is certainly not defined that way. It is defined
>> merely upon the basis of the observed distribution of messages, not in
>> reference to any "base" set.
>
> Actually, a string with maximum Shannon information is a "random"
> string. And, how does one define randomness? By comparison to a
> reference computer - a UTM.

No, that is not right, Sean. A string with maximum Shannon information has
equiprobable symbols, random or not. Shannon's theory models information
sources as random variables to make the math easy. That does not mean
information sources are actually random, nor does it mean a string with
equiprobable symbols is random.

Strictly speaking, with a single string, the Shannon information is only
estimated.
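
A short Python sketch of such an estimate, assuming the simplest possible model: symbol probabilities are taken from the observed frequencies in the one string at hand, with no dependence between positions. The function name is illustrative.

```python
from collections import Counter
from math import log2

def empirical_entropy(s):
    # Zeroth-order estimate of Shannon entropy in bits per symbol, using
    # observed symbol frequencies as the probabilities.
    counts = Counter(s)
    total = len(s)
    return sum((c / total) * log2(total / c) for c in counts.values())

print(empirical_entropy("01" * 50))   # strictly alternating, clearly patterned
print(empirical_entropy("0" * 100))   # a single repeated symbol
```

The alternating string has equiprobable symbols and so reaches the maximum of this estimate for a binary alphabet, even though it is obviously not random.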

[snip rest]


Mark VandeWettering

Jul 24, 2007, 11:20:40 PM
On 2007-07-25, Seanpit <seanpi...@naturalselection.0catch.com> wrote:
> On Jul 24, 6:25 pm, Mark VandeWettering <wetter...@attbi.com> wrote:
>
>> >> Well, the primary problem with this formula remains the same as the
>> >> problem with its predecessor: it only works in comparison to a target,
>> >> which can only be viewed as "begging the question" that it was supposed
>> >> to answer.
>>
>> > Not at all. This is how all forms of detecting of artifact begin.
>>
>> We were talking about CSI. I don't know what "detecting of artifact" is,
>> or what you think that formula has to do with it.
>>
>> > They all start with the detection of bias
>>
>> Given that you pick the target against which it is compared, and that this
>> is obviously the source of basically all the variation in the number returned,
>> isn't the bias obviously in the choice of targets?
>
> No, because the reference strings are chosen independent of the test
> strings.

You know, the funny thing is that I used to think that you were sincere
and had at least more brain cells than teeth.

Do you expect anyone to actually swallow this claptrap?

> Therefore a significant match between a reference and a test
> string is good evidence of non-random production.
>
>> > and all of these forms of
>> > detection are based on reference to a string that is known to be the
>> > product of biased non-random production. There simply is no other way
>> > to detect non-randomness other than to use a reference. The use of a
>> > reference is also the basis for related concepts like Kolmogorov/
>> > Chaitin complexity, Shannon information, etc.
>>
>> Shannon information is certainly not defined that way. It is defined
>> merely upon the basis of the observed distribution of messages, not in
>> reference to any "base" set.
>
> Actually, a string with maximum Shannon information is a "random"
> string.

Indeed. And that randomness can be judged completely on its own, without
reference to any target.

> And, how does one define randomness? By comparison to a
> reference computer - a UTM.

Wow. You really don't have a fucking clue what you are talking about.

>> >> One need not really go beyond that, but here are some additional things
>> >> to ponder:
>>
>> >> 1. In what sense does this "work well"?
>> >> Can you provide some examples
>> >> which demonstrate it working?
>>
>> > Try it out yourself . . . except it might be easier to use the
>> > following formula for CSI:
>>
>> > Binary Strings:
>> > CSI: n! / (n-|(n/2-hd))|! * |(n/2-hd)|!)
>>
>> > Strings where X > 2:
>> > (log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
>> > (log(base2)(X^n)/2) -hd)|!)
>>
>> So, the answer was "no", you can't provide any examples of it working.
>
> I'm really not sure of what type of example you want?

You know, an example. One of it "working". Not how to calculate it,
that is simple enough. I want you to demonstrate it doing what you
say it can do.

> These formulas
> will pick up any match to a predetermined sequence quite easily.

Wow. That really is amazing. I don't see how it has anything to do with
your claim.

> The greater the match the greater the CSI and the more likely the test
> string is not the result of random production.

Surely, you can provide us an example then.

> It is really quite a simple concept that should be rather self-evident
> to you. Of course, you are the one who thinks pi is not at all
> computable, so you may have more difficulties.

Pi isn't "all computable". Since it is an endless, never repeating sequence,
we can never compute it all.

>> >> 2. How can you show that "the greater the CSI number, the better the
>> >> odds of non-random production"? Presumably that requires some mapping
>> >> from CSI to probabilities, which seems to be absent here. Absent that
>> >> derivation, how can you be sure that the mapping of values of CSI to
>> >> probability is monotonically increasing?
>>
>> > Try it yourself and see. It should be rather self evident to you
>> > anyway. Short strings with a HD of 0 are not as reliably non-random
>> > as are longer strings with the same HD of 0. That's a practically self-
>> > evident statement.
>>
>> So the answer is that you can't show that the greater the CSI number,
>> the better the odds of non-random production.
>
> What do you want?!

An actual example. As I asked for above. Anyone can run the math (absurd
as it is), but it doesn't actually mean what you claim it does. If it does,
you can go ahead and demonstrate it.

> If my reference string is 010101 . . . x 1 million
> digits and you bring me a test string with a HD of 1 million relative
> to the reference string, would you honestly question the concept that
> such a match would strongly indicate a non-random origin for your test
> string?

This might be true, but it isn't surprising, and it isn't equivalent to
the grandiose claim you make for your technique.

>> >> 3. It is not difficult to think of sequences of length n which are very
>> >> closely related but whose Hamming distance is n (for instance, rotates
>> >> or complements of the target sequence).
>>
>> > Compliments of the target sequence would have the same CSI value as
>> > the target sequence itself. Try out the formula and see.
>>
>> Your criticism is valid with respect to complements (somewhat accidently,
>> I suspect) but the comment with respect to rotates is valid.
>
> You are the one who mistakenly tried to use "complimentary strings"
> to challenge my formula . . . accidently I'm sure ; )

I can at least distinguish between "complements" and "compliments".

> I'm not sure I know what you mean by string "rotation"?
>
> Regardless, no method is able to catch all possible non-random
> permutations of a string. That's simply impossible and is not a valid
> test of the usefulness of an algorithm.

We've yet to see any use at all for your algorithm.

>
>< snip >
>
> Sean Pitman
> www.DetectingDesign.com
>

Perplexed in Peoria

Jul 25, 2007, 1:02:03 AM

"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message news:1185325971.4...@e16g2000pri.googlegroups.com...

> On Jul 24, 5:47 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
> wrote:
>
> > > > Does the measured string
> > > > have to exactly match the
> > > > specifying string in length?
> >
> > > To achieve maximum CSI - yes, but not to achieve useful CSI.
> >
> > So how do you define the Hamming distance between two sequences of
> > different length?
>
> Use the "Modified HD", or trim down the reference string to
> match the size of the test string.
>
> > > > However would you know in practice
> > > > what the specification string is?
> >
> > > By picking a string of known non-random production, like pi.
> >
> > So this might be useful if someone is sending us a message such as
> > pi, which is a pretty stupid message, but that is ok because it isn't
> > intended as a message. It is an intelligence test, and the message
> > sender inserted just enough mistakes to make it a challenging test.
>
> It isn't an intelligence test. It is a test for non-random bias. And,
> one can have as many known reference strings/algorithms as one wants -
> pi being just one of many. A demonstration of high CSI relative to
> any of the chosen reference strings would indicate a bias.

As many as one wants? I believe you may want to think about that Sean.
We might agree that the first 100 decimal digits of pi is a fine reference
string. And the 2nd 100 digits is just as good (though a little more
obscure). As is the 3rd hundred and the 100th hundred and the 679th
hundred, etc. But as the number of available reference strings gets
large, it becomes more and more likely that you are going to have a
close match to one of them just by chance.

Incidentally, if you were measuring your CSI's in bits, this would be
easy to fix. Simply subtract (log(base 2) R) where R is the number
of reference strings available. For your measure, maybe dividing by
R works.
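
The suggested correction can be sketched in Python as follows. It assumes the bits-based measure (log2 X) * (n - (X/(X-1)) * hd) from earlier in the thread, takes the best score over all reference strings, and subtracts log2(R). The function names and the tiny reference list are illustrative.

```python
from math import log2

def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))

def match_bits(test, reference, X):
    # Bits-based score: (log2 X) * (n - (X / (X - 1)) * hd).
    n = len(test)
    return log2(X) * (n - (X / (X - 1)) * hamming(test, reference))

def best_corrected_bits(test, references, X):
    # Best score over all references, reduced by log2(R) to account for
    # having had R chances to find a close match.
    best = max(match_bits(test, r, X) for r in references)
    return best - log2(len(references))

references = ["0101", "1111"]
print(best_corrected_bits("0101", references, 2))
```

With a single reference the correction is zero; each doubling of the number of references costs one bit.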

I see you coming up with numbers on the order of 2^n for n-bit strings.
Which tells me that you are not measuring information in bits.

fropome

Jul 25, 2007, 4:53:32 AM
On 25 Jul, 01:46, Seanpit <seanpitnos...@naturalselection.0catch.com>
wrote:

What?

oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry it
took me so long. Would you like the opportunity to explain the joke to
the lurkers? Or would you prefer it if I pointed out how what you've
written makes absolutely no sense?

Seanpit

Jul 25, 2007, 10:44:03 AM
On Jul 24, 8:20 pm, Mark VandeWettering <wetter...@attbi.com> wrote:

> >> Given that you pick the target against which it is compared, and that this
> >> is obviously the source of basically all the variation in the number returned,
> >> isn't the bias obviously in the choice of targets?
>
> > No, because the reference strings are chosen independent of the test
> > strings.
>
> You know, the funny thing is that I used to think that you were sincere
> and had at least more brain cells than teeth.
> Do you expect anyone to actually swallow this claptrap?

How many teeth did you say you have? ; )

> > Therefore a significant match between a reference and a test
> > string is good evidence of non-random production.
>
> >> > and all of these forms of
> >> > detection are based on reference to a string that is known to be the
> >> > product of biased non-random production. There simply is no other way
> >> > to detect non-randomness other than to use a reference. The use of a
> >> > reference is also the basis for related concepts like Kolmogorov/
> >> > Chaitin complexity, Shannon information, etc.
>
> >> Shannon information is certainly not defined that way. It is defined
> >> merely upon the basis of the observed distribution of messages, not in
> >> reference to any "base" set.
>
> > Actually, a string with maximum Shannon information is a "random"
> > string.
>
> Indeed. And that randomness can be judged completely on its own, without
> reference to any target.

Oh really? How so? How is "randomness judged completely on its own
without reference to any target"? Please do list your relevant quote
and reference for this assertion.

> > And, how does one define randomness? By comparison to a
> > reference computer - a UTM.
>
> Wow. You really don't have a fucking clue what you are talking about.

We'll see. Where is your reference that randomness can be judged "on
its own"? How can you tell if a string is "random" or not?

> >> >> One need not really go beyond that, but here are some additional things
> >> >> to ponder:
>
> >> >> 1. In what sense does this "work well"?
> >> >> Can you provide some examples
> >> >> which demonstrate it working?
>
> >> > Try it out yourself . . . except it might be easier to use the
> >> > following formula for CSI:
>
> >> > Binary Strings:
> >> > CSI: n! / (n-|(n/2-hd))|! * |(n/2-hd)|!)
>
> >> > Strings where N > 2:
> >> > (log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
> >> > (log(base2)(X^n)/2) -hd)|!)
>
> >> So, the answer was "no", you can't provide any examples of it working.
>
> > I'm really not sure of what type of example you want?
>
> You know, an example. One of it "working". Not how to calculate it,
> that is simple enough. I want you to demonstrate it doing what you
> say it can do.

Look at the #3 post I made in this thread to snex. Is that a good
enough example for you?

> > These formulas
> > will pick up any match to a predetermined sequence quite easily.
>
> Wow. That really is amazing. I don't see how it has anything to do with
> your claim.

What do you think my claim is?

> > The greater the match the greater the CSI and the more likely the test
> > string is not the result of random production.
>
> Surely, you can provide us an example then.
>
> > It is really quite a simple concept that should be rather self-evident
> > to you. Of course, you are the one who thinks pi is not at all
> > computable, so you may have more difficulties.
>
> Pi isn't "all computable". Since it is an endless, never repeating sequence,
> we can never compute it all.

That's not the definition of a non-computable number.

"He [Alan Turing] defined a computable number as a real number whose
decimal expansion could be produced by a Turing machine starting with
a blank tape. He was able to demonstrate that the irrational number pi
was computable."

http://www.amt.canberra.edu.au/turingb.html

< snip >

> > If my reference string is 010101 . . . x 1 million
> > digits and you bring me a test string with a HD of 1 million relative
> > to the reference string, would you honestly question the concept that
> > such a match would strongly indicate a non-random origin for your test
> > string?
>
> This might be true, but it isn't surprising, and it isn't equivalent to
> the grandiose claim you make for your technique.

That's the only claim I make for my technique. I make no "grandiose"
claims for it.

> >> >> 3. It is not difficult to think of sequences of length n which are very
> >> >> closely related but whose Hamming distance is n (for instance, rotates
> >> >> or complements of the target sequence).
>
> >> > Compliments of the target sequence would have the same CSI value as
> >> > the target sequence itself. Try out the formula and see.
>
> >> Your criticism is valid with respect to complements (somewhat accidentally,
> >> I suspect) but the comment with respect to rotates is valid.
>
> > You are the one who mistakenly tried to use "complimentary strings"
> > to challenge my formula . . . accidentally I'm sure ; )
>
> I can at least distinguish between "complements" and "compliments".

Well, at least you can do that . . . ; )

< snip >

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 25, 2007, 10:46:22 AM7/25/07
to
On Jul 25, 1:53 am, fropome <monk...@hornsandhalos.co.uk> wrote:

> What?
>
> oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry it
> took me so long. Would you like the opportunity to explain the joke to
> the lurkers? Or would you prefer it if I pointed out how what you've
> written makes absolutely no sense?

Oh please, do point the "joke" out. I'm sure at least some of the
lurkers don't get whatever you seem to be "getting". I sure don't.

Sean Pitman
www.DetectingDesign.com


fropome

unread,
Jul 25, 2007, 11:04:00 AM7/25/07
to

Really? oh dear...

In that case can you tell me how you got from:

> = 3.17! / 2.755! * 0.415!

to:

> = 1.90
?

any sort of working would be fine.

Seanpit

unread,
Jul 25, 2007, 11:12:31 AM7/25/07
to
On Jul 24, 8:20 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
wrote:

> > No, because the reference strings are chosen independent of the test
> > strings. Therefore a significant match between a reference and a test
> > string is good evidence of non-random production.
>
> That is not necessarily true. It is good evidence that the test string was
> not produced by a stationary random process with Uniform distribution, which
> is a much more restricted case.

If a perfect match happened to be to a reference string that had to
regular character repeats, like pi, this would be good evidence the
test string was not produced by a random process with uniform or non-
uniform distribution.

> > Actually, a string with maximum Shannon information is a "random"
> > string. And, how does one define randomness? By comparison to a
> > reference computer - a UTM.
>
> No, that is not right, Sean. A string with maximum Shannon information has
> equiprobable symbols, random or not. Shannon's theory models information
> sources as random variables to make the math easy. That does not mean
> information sources are actually random, nor does it mean a string with
> equiprobable symbols is random.

Shannon information is determined by reference to a known "random"
source of string production - a source that produces maximum Shannon
information.

"In the Shannon approach, however, the method of encoding objects is
based on the presupposition that the objects to be encoded are
outcomes of a known random source; it is only the characteristics of
that random source that determine the encoding, not the
characteristics of the objects that are its outcomes."

http://homepages.cwi.nl/~paulv/papers/info.pdf

This means that Shannon information is more about the type of source
it will take to transmit a particular type of string rather than the
string itself. So, to transmit a number like Pi, where all the
symbols seem to appear with equal frequency, the source needed to
transmit a sequence like pi will have to be able to produce all
possible numbers with a similar character frequency. In other words,
this source must be able to produce not only pi, but all possible
numbers in infinite sequence space - to include truly "random" and
"non-computable" numbers like sigma.

Again, it is all about the source or the reference that is chosen.

> Strictly speaking, with a single string, the Shannon information is only
> estimated.

That's true, but the estimate is based on the type of source needed to
produce such a string.

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 25, 2007, 11:18:46 AM7/25/07
to
On Jul 24, 10:02 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
wrote:

> > It isn't an intelligence test. It is a test for non-random bias. And,
> > one can have as many known reference strings/algorithms as one wants -
> > pi being just one of many. A demonstration of high CSI relative to
> > any of the chosen reference strings would indicate a bias.
>
> As many as one wants? I believe you may want to think about that Sean.
> We might agree that the first 100 decimal digits of pi is a fine reference
> string. And the 2nd 100 digits is just as good (though a little more
> obscure). As is the 3rd hundred and the 100th hundred and the 679th
> hundred, etc. But as the number of available reference strings gets
> large, it becomes more and more likely that you are going to have a
> close match to one of them just by chance.

The odds of a chance match within HD values of 1 or 2 drop so
dramatically with increasing sequence size that the number of reference
strings you can possibly include (based on simple algorithmic non-random
productions) is quickly outpaced, well before a length of just a few
hundred digits is reached.
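
The size of that drop can be estimated with a short sketch. It assumes the test string is drawn uniformly at random with independent symbols (the restricted case R. Baldwin describes elsewhere in the thread), and the function names p_within and max_p_any are made up for illustration.

```python
import math

def p_within(n, d, alphabet=2):
    """Chance that a uniformly random length-n string lies within
    Hamming distance d of one fixed reference string."""
    # Count the strings that differ from the reference in at most d positions.
    ball = sum(math.comb(n, i) * (alphabet - 1) ** i for i in range(d + 1))
    return ball / alphabet ** n

def max_p_any(n, d, num_refs, alphabet=2):
    """Union bound on the chance of being within d of any of num_refs references."""
    return min(1.0, num_refs * p_within(n, d, alphabet))

# Assumed illustration: HD <= 2 against one million reference strings.
for n in (20, 100, 300):
    print(n, p_within(n, 2), max_p_any(n, 2, 10**6))
```

The union bound is an upper limit only, and it says nothing about test strings produced by a non-uniform or non-stationary process.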

< snip >

Sean Pitman
www.DetectingDesign.com


David Wilson

unread,
Jul 25, 2007, 11:29:05 AM7/25/07
to
In article <1185353612....@k79g2000hse.googlegroups.com> on

It does (sort of), provided you understand the Pitmanese dialect of
mathematics. Dr Sean seems to be using the factorial sign as a
synonym for its generalisation to arbitrary non-negative real numbers
--i.e the Gamma function evaluated at the factorial's argument
incremented by one: x! =pitmandef Gamma( x+1 ). So 3.17! / 2.755! * 0.415!,
for instance, is just Gamma( 4.17 )/( Gamma( 3.755 ) * Gamma( 1.415 ) )
= 7.45836852.../(4.4492345954... * 0.8865489992441 ...) = 1.8908443883...
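
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under the reading given here, namely that x! stands for Gamma( x+1 ); the helper name gen_factorial is invented for the example.

```python
import math

def gen_factorial(x):
    """Factorial extended to non-negative reals: x! read as Gamma(x + 1)."""
    return math.gamma(x + 1.0)

numerator = gen_factorial(3.17)                             # Gamma(4.17)
denominator = gen_factorial(2.755) * gen_factorial(0.415)   # Gamma(3.755) * Gamma(1.415)
ratio = numerator / denominator
print(round(ratio, 4))
```

For whole numbers gen_factorial agrees with the ordinary factorial, e.g. gen_factorial(5) gives 120.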

-----------------------------------------------------------------------------
David Wilson

SPAMMERS_fingers@WILL_BE_fwi_PROSECUTED_.net.au
(Remove underlines and upper case letters to obtain my email address)

Seanpit

unread,
Jul 25, 2007, 11:26:04 AM7/25/07
to
On Jul 25, 8:04 am, fropome <monk...@hornsandhalos.co.uk> wrote:

> Really? oh dear...
>
> In that case can you tell me how you got from:
>
> > = 3.17! / 2.755! * 0.415!
>
> to:
>
> > = 1.90
>
> ?
>
> any sort of working would be fine.


log(base2)(X^n)! / ((log(base2)(X^n) - |log(base2)(X^n)/2 - hd|)! *
|log(base2)(X^n)/2 - hd|!)

= 3.17! / ((3.17 - |3.17/2 - 2|)! * |3.17/2 - 2|!)
= 3.17! / ((3.17 - 0.415)! * 0.415!)
= 3.17! / (2.755! * 0.415!)
= ~1.90

Thanks for the clarification - however minor. Hope this helps.
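
For anyone who wants to repeat the working, here is a rough Python sketch of the X > 2 formula. It assumes non-integer factorials are evaluated with the gamma function, and it assumes X = 3 and n = 2 (so that log(base2)(X^n) is about 3.17) with hd = 2; those inputs and the name csi_general are illustrative guesses, not values stated in the post.

```python
import math

def csi_general(X, n, hd):
    """CSI for X > 2 as written above, with x! read as Gamma(x + 1).

    m = log2(X^n) = n * log2(X)
    k = |m/2 - hd|
    CSI = m! / ((m - k)! * k!)
    """
    m = n * math.log2(X)
    k = abs(m / 2.0 - hd)
    return math.gamma(m + 1.0) / (math.gamma(m - k + 1.0) * math.gamma(k + 1.0))

# Assumed inputs: 3 possible characters, 2 positions, Hamming distance 2.
value = csi_general(3, 2, 2)
print(f"{value:.2f}")
```

math.gamma overflows for arguments much above 170, so long strings would need math.lgamma and a result kept in logarithmic form.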

Sean Pitman
www.DetectingDesign.com

Mark Isaak

unread,
Jul 25, 2007, 11:33:03 AM7/25/07
to
On Tue, 24 Jul 2007 16:02:39 -0700, Seanpit wrote:

> On Jul 23, 1:28 pm, Mark VandeWettering <wetter...@attbi.com> wrote:
>> On 2007-07-23, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
>
>> > After a bit of discussion and revision of my initial effort, the
>> > following is my formula for calculating my version of complex
>> > specified information (CSI):
>>
>> > For X = 2:
>>
>> > CSI: X^n - (n! / (n-hd)! hd!)
>>
>> > For X > 2:
>>
>> > CSI: X^n - (( log(base2)(X^n)! / (log(base 2)(X^n) - hd)! * hd!)
>>
>> > X = number of possible characters per position
>> > n = size of the sequence
>> > hd = Hamming Distance
>>

> [...]


> This is how all forms of detecting of artifact begin.

You have never in your life actually tried to detect artifacts, have you?

--
Mark Isaak eciton (at) earthlink (dot) net
"Voice or no voice, the people can always be brought to the bidding of
the leaders. That is easy. All you have to do is tell them they are
being attacked, and denounce the pacifists for lack of patriotism and
exposing the country to danger." -- Hermann Goering

Seanpit

unread,
Jul 25, 2007, 11:45:29 AM7/25/07
to
On Jul 25, 8:33 am, Mark Isaak <eci...@earthlink.net> wrote:

> > This is how all forms of detecting of artifact begin.
>
> You have never in your life actually tried to detect artifacts, have you?

It depends on what you think forensics is all about . . . or finding
an arrowhead in a field amongst other naturally formed rocks.

> Mark Isaak

Sean Pitman
www.DetectingDesign.com


Mark VandeWettering

unread,
Jul 25, 2007, 11:53:35 AM7/25/07
to

Perhaps it is only visible to people who possess a different viewpoint.

You might be able to remedy the situation using a mirror.

Mark
>
> Sean Pitman
> www.DetectingDesign.com
>
>

fropome

unread,
Jul 25, 2007, 11:58:57 AM7/25/07
to
On 25 Jul, 16:29, David Wilson <see_sig@for_my.address> wrote:
> In article <1185353612.840766.3...@k79g2000hse.googlegroups.com> on
> -----------------------------------------------------------------------------
> David Wilson
>
> SPAMMERS_fingers@WILL_BE_fwi_PROSECUTED_.net.au
> (Remove underlines and upper case letters to obtain my email address)

I actually suspect that he was going to _say_ something like that, but
had actually just been working it out on his Windows calculator and
didn't realise that n! is actually undefined for non-integer n.
My next question was going to be why he thinks the gamma function is
useful here, other than because it generates large numbers. Either his
maths is very good - in which case he would have a proof for this
formula which he can show us - or his maths is very bad. Using the
gamma function without saying that he is makes me suspect rather
strongly that his maths is very bad; using it without having a proof
of why it's there would make me _know_ his maths was bad.


Mark VandeWettering

unread,
Jul 25, 2007, 12:01:01 PM7/25/07
to
On 2007-07-25, Seanpit <sea...@gmail.com> wrote:

In Shannon information theory, "random" merely means that all messages are
equiprobable.

>> > And, how does one define randomness? By comparison to a
>> > reference computer - a UTM.
>>
>> Wow. You really don't have a fucking clue about what you are talking.
>
> We'll see. Where is your reference that randomness can be judged "on
> its own"? How can you tell if a string is "random" or not?

In Shannon information theory, "random" merely means that all messages are
equiprobable.

>> >> >> One need not really go beyond that, but here are some additional things
>> >> >> to ponder:
>>
>> >> >> 1. In what sense does this "work well"?
>> >> >> Can you provide some examples
>> >> >> which demonstrate it working?
>>
>> >> > Try it out yourself . . . except it might be easier to use the
>> >> > following formula for CSI:
>>
>> >> > Binary Strings:
>> >> > CSI: n! / (n-|(n/2-hd))|! * |(n/2-hd)|!)
>>
>> >> > Strings where N > 2:
>> >> > (log(base2)(X^n)! / (log(base2)(X^n) - |(log(base2)(X^n) / 2-hd)|! * |
>> >> > (log(base2)(X^n)/2) -hd)|!)
>>
>> >> So, the answer was "no", you can't provide any examples of it working.
>>
>> > I'm really not sure of what type of example you want?
>>
>> You know, an example. One of it "working". Not how to calculate it,
>> that is simple enough. I want you to demonstrate it doing what you
>> say it can do.
>
> Look at the #3 post I made in this thread to snex. Is that a good
> enough example for you?

No. Because it isn't an example of it "working".

>
>> > These formulas
>> > will pick up any match to a predetermined sequence quite easily.
>>
>> Wow. That really is amazing. I don't see how it has anything to do with
>> your claim.
>
> What do you think my claim is?
>
>> > The greater the match the greater the CSI and the more likely the test
>> > string is not the result of random production.
>>
>> Surely, you can provide us an example then.
>>
>> > It is really quite a simple concept that should be rather self-evident
>> > to you. Of course, you are the one who thinks pi is not at all
>> > computable, so you may have more difficulties.
>>
>> Pi isn't "all computable". Since it is an endless, never repeating sequence,
>> we can never compute it all.
>
> That's not the definition of a non-computable number.
>
> "He [Alan Turing] defined a computable number as a real number whose
> decimal expansion could be produced by a Turing machine starting with
> a blank tape. He was able to demonstrate that the irrational number pi
> was computable."
>
> http://www.amt.canberra.edu.au/turingb.html

I'm sorry, but a Turing machine can't compute pi, it computes approximations
to pi. By definition, anything which is computable is computable by a
Turing machine _that halts_. Since the decimal expansion of pi is infinite
and non-repeating, it is by definition uncomputable.

Mark VandeWettering

unread,
Jul 25, 2007, 12:04:42 PM7/25/07
to
On 2007-07-25, Seanpit <sea...@gmail.com> wrote:
> On Jul 24, 8:20 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
> wrote:
>
>> > No, because the reference strings are chosen independent of the test
>> > strings. Therefore a significant match between a reference and a test
>> > string is good evidence of non-random production.
>>
>> That is not necessarily true. It is good evidence that the test string was
>> not produced by a stationary random process with Uniform distribution, which
>> is a much more restricted case.
>
> If a perfect match happened to be to a reference string that had to
> regular character repeats, like pi, this would be good evidence the
> test string was not produced by a random process with uniform or non-
> uniform distribution.

In what sense do you think that pi has "regular character repeats"?

>> > Actually, a string with maximum Shannon information is a "random"
>> > string. And, how does one define randomness? By comparison to a
>> > reference computer - a UTM.
>>
>> No, that is not right, Sean. A string with maximum Shannon information has
>> equiprobable symbols, random or not. Shannon's theory models information
>> sources as random variables to make the math easy. That does not mean
>> information sources are actually random, nor does it mean a string with
>> equiprobable symbols is random.
>
> Shannon information is determined by reference to a known "random"
> source of string production - a source that produces maximum Shannon
> information.

You've got this entirely backwards.

> "In the Shannon approach, however, the method of encoding objects is
> based on the presupposition that the objects to be encoded are
> outcomes of a known random source; it is only the characteristics of
> that random source that determine the encoding, not the
> characteristics of the objects that are its outcomes."
>
> http://homepages.cwi.nl/~paulv/papers/info.pdf

Yes. Luckily, we can know a lot about the random source, not by examining
an independent reference or a UTM, but by merely taking large chunks of it
and measuring statistics of it.

> This means that Shannon information is more about the type of source
> it will take to transmit a particular type of string rather than the
> string itself. So, to transmit a number like Pi, where all the
> symbols seem to appear with equal frequency, the source needed to
> transmit a sequence like pi will have to be able to produce all
> possible numbers with a similar character frequency. In other words,
> this source must be able to produce not only pi, but all possible
> numbers in infinite sequence space - to include truly "random" and
> "non-computable" numbers like sigma.
>
> Again, it is all about the source or the reference that is chosen.
>
>> Strictly speaking, with a single string, the Shannon information is only
>> estimated.
>
> That's true, but the estimate is based on the type of source needed to
> produce such a string.

No. It is based upon measurements of the string itself. That is why
it is an estimate.
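
As a sketch of what "measurements of the string itself" can mean, the following Python computes the simplest such estimate, the first-order (single-symbol frequency) entropy. The function name empirical_entropy is invented here, and real estimators often also look at blocks of symbols.

```python
import math
from collections import Counter

def empirical_entropy(s):
    """Plug-in estimate of Shannon entropy in bits per symbol,
    computed only from the symbol frequencies observed in s."""
    counts = Counter(s)
    total = len(s)
    # Each symbol contributes p * log2(1/p), where p is its observed frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(empirical_entropy("01" * 500))   # strictly alternating string
print(empirical_entropy("0" * 1000))   # constant string
print(empirical_entropy("0001"))       # skewed symbol frequencies
```

The alternating string reaches the maximum of 1 bit per symbol at this order even though it is obviously patterned, which is the sense in which equiprobable symbols do not by themselves imply randomness.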

> Sean Pitman
> www.DetectingDesign.com

fropome

unread,
Jul 25, 2007, 12:14:53 PM7/25/07
to

You've probably seen David's post just below and my reply- can you
confirm that you are using the gamma function here? If so, how do you
justify using it? Unlike a normal factorial it's far from intuitive-
even if you could justify coming up with your formula by inspection if
it just contained factorials, this would surely be impossible using
the gamma function.
Unless you're making this up as you go, you _must_ have a proof
written down nearby which you used to get to your formula. Would you
show it to us? Pretty please?

Seanpit

unread,
Jul 25, 2007, 12:19:23 PM7/25/07
to
On Jul 25, 9:04 am, Mark VandeWettering <wetter...@attbi.com> wrote:

> >> > No, because the reference strings are chosen independent of the test
> >> > strings. Therefore a significant match between a reference and a test
> >> > string is good evidence of non-random production.
>
> >> That is not necessarily true. It is good evidence that the test string was
> >> not produced by a stationary random process with Uniform distribution, which
> >> is a much more restricted case.
>
> > If a perfect match happened to be to a reference string that had to
> > regular character repeats, like pi, this would be good evidence the
> > test string was not produced by a random process with uniform or non-
> > uniform distribution.
>
> In what sense do you think that pi has "regular character repeats"?

That was a typo - should read "no regular character repeats".

> >> > Actually, a string with maximum Shannon information is a "random"
> >> > string. And, how does one define randomness? By comparison to a
> >> > reference computer - a UTM.
>
> >> No, that is not right, Sean. A string with maximum Shannon information has
> >> equiprobable symbols, random or not. Shannon's theory models information
> >> sources as random variables to make the math easy. That does not mean
> >> information sources are actually random, nor does it mean a string with
> >> equiprobable symbols is random.
>
> > Shannon information is determined by reference to a known "random"
> > source of string production - a source that produces maximum Shannon
> > information.
>
> You've got this entirely backwards.
>
> > "In the Shannon approach, however, the method of encoding objects is
> > based on the presupposition that the objects to be encoded are
> > outcomes of a known random source; it is only the characteristics of
> > that random source that determine the encoding, not the
> > characteristics of the objects that are its outcomes."
>
> >http://homepages.cwi.nl/~paulv/papers/info.pdf
>
> Yes. Luckily, we can know a lot about the random source, not by examining
> an independent reference or a UTM, but by merely taking large chunks of it
> and measuring statistics of it.

Not true. The randomness of the source is based on comparison to a
reference - like a reference UTM. You cannot determine randomness by
"measuring statistics of it". I mean really, what statistics, in
particular would determine "randomness" for you?

> > This means that Shannon information is more about the type of source
> > it will take to transmit a particular type of string rather than the
> > string itself. So, to transmit a number like Pi, where all the
> > symbols seem to appear with equal frequency, the source needed to
> > transmit a sequence like pi will have to be able to produce all
> > possible numbers with a similar character frequency. In other words,
> > this source must be able to produce not only pi, but all possible
> > numbers in infinite sequence space - to include truly "random" and
> > "non-computable" numbers like sigma.
>
> > Again, it is all about the source or the reference that is chosen.
>
> >> Strictly speaking, with a single string, the Shannon information is only
> >> estimated.
>
> > That's true, but the estimate is based on the type of source needed to
> > produce such a string.
>
> No. It is based upon measurements of the string itself. That is why
> it is an estimate.

Not true.

Sean Pitman
www.DetectingDesign.com


Seanpit

unread,
Jul 25, 2007, 12:36:19 PM7/25/07
to
On Jul 25, 9:01 am, Mark VandeWettering <wetter...@attbi.com> wrote:

> >> Indeed. And that randomness can be judged completely on its own, without
> >> reference to any target.
>
> > Oh really? How so? How is "randomness judged completely on its own
> > without reference to any target"? Please do list your relevant quote
> > and reference for this assertion.
>
> In Shannon information theory, "random" merely means that all messages are
> equiprobable.

And that depends upon what source you choose to produce your messages
- i.e., a "random" source. The problem is that randomness can never
be proven or judged "on its own" without reference to any target. In
short, no string or string source can be known to be absolutely
"random". That's an impossibility. The best that can be said is that
the string and/or string source appears to be random from the
perspective of a given reference - like a UTM.

> >> > And, how does one define randomness? By comparison to a
> >> > reference computer - a UTM.
>
> >> Wow. You really don't have a fucking clue about what you are talking.
>
> > We'll see. Where is your reference that randomness can be judged "on
> > its own"? How can you tell if a string is "random" or not?
>
> In Shannon information theory, "random" merely means that all messages are
> equiprobable.

Yeah, and how can you apply that? How can you tell if a particular
string or string source is random without comparison to any reference?
- based just on the string itself?

> >> You know, an example. One of it "working". Not how to calculate it,
> >> that is simple enough. I want you to demonstrate it doing what you
> >> say it can do.
>
> > Look at the #3 post I made in this thread to snex. Is that a good
> > enough example for you?
>
> No. Because it isn't an example of it "working".

That simply makes no sense to me. Please do provide an example of
your own where my formulas, if applied, would actually be "working".

> >> Pi isn't "all computable". Since it is an endless, never repeating sequence,
> >> we can never compute it all.
>
> > That's not the definition of a non-computable number.
>
> > "He [Alan Turing] defined a computable number as a real number whose
> > decimal expansion could be produced by a Turing machine starting with
> > a blank tape. He was able to demonstrate that the irrational number pi
> > was computable."
>
> >http://www.amt.canberra.edu.au/turingb.html
>
> I'm sorry, but a Turing machine can't compute pi, it computes approximations
> to pi. By definition, anything which is computable is computable by a
> Turing machine _that halts_. Since the decimal expansion of pi is infinite
> and non-repeating, it is by definition uncomputable.

You are countering Alan Turing himself on this one. Do you have any
reference to back your own notion of computability up on this one? I
doubt it. Pardon the double negative, but the fact is that pi is not
uncomputable.

"What does "uncomputable" mean? Something is uncomputable if it
can't be represented in terms of a finite-sized computer program. For
instance, the number Pi=3.1415926535.... is *not* uncomputable. Even
though it goes on forever, and never repeats itself, there is a simple
computer program that will generate it. True, this computer program
can never generate *all* of Pi, because to do so it would have to run
on literally *forever* - it can only generate each new digit at a
finite speed. But still, there is a program with the property that,
*if* you let it run forever, then it *would* generate all of Pi, and
because of this the number Pi is not considered uncomputable. What's
fascinating about Pi is that even though it goes on forever and
doesn't repeat itself, in a sense it only contains a finite amount of
information -- because it can be compactly represented by the computer
program that generates it."

http://www.goertzel.org/benzine/QuantumComputingArticle.htm

You need to look up the definition of computability. Your definition
doesn't seem to be correct. But, I'd be very interested if you could
in fact find a reference that supports your definition. Good
luck . . .
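
To make the quoted point concrete, here is a sketch of a finite program that emits the decimal digits of pi one at a time for as long as it is left running. It uses Gibbons' unbounded spigot algorithm, chosen purely for illustration; neither quoted article specifies a particular program, and the name pi_digits is made up.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi indefinitely (Gibbons' unbounded spigot).

    Uses only exact integer arithmetic, so no digit is ever an approximation.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            # Not enough information yet: absorb one more term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

first_digits = list(islice(pi_digits(), 10))
print(first_digits)
```

The generator never halts on its own, yet any requested digit is reached after finitely many steps, which is the property the definition of a computable number asks for.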

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 25, 2007, 12:39:24 PM7/25/07
to
On Jul 25, 8:53 am, Mark VandeWettering <wetter...@attbi.com> wrote:

> On 2007-07-25, Seanpit <sean...@gmail.com> wrote:
>
> > On Jul 25, 1:53 am, fropome <monk...@hornsandhalos.co.uk> wrote:
>
> >> What?
>
> >> oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry it
> >> took me so long. Would you like the opportunity to explain the joke to
> >> the lurkers? Or would you prefer it if I pointed out how what you've
> >> written makes absolutely no sense?
>
> > Oh please, do point the "joke" out. I'm sure at least some of the
> > lurkers don't get whatever you seem to be "getting". I sure don't.
>
> Perhaps it is only visible to people who possess a different viewpoint.
>
> You might be able to remedy the situation using a mirror.

Perhaps you have one? - a mirror I could use? Otherwise, how are
comments like this remotely helpful to me or to any lurkers that might
be interested in your "mirror"?

> Mark

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 25, 2007, 12:58:16 PM7/25/07
to
On Jul 25, 9:14 am, fropome <monk...@hornsandhalos.co.uk> wrote:

> You've probably seen David's post just below and my reply- can you
> confirm that you are using the gamma function here?

Sure - what's wrong with using the gamma function?

> If so, how do you
> justify using it? Unlike a normal factorial it's far from intuitive-
> even if you could justify coming up with your formula by inspection if
> it just contained factorials, this would surely be impossible using
> the gamma function.

I don't understand your question. The gamma function is used because
it "fills in" the factorial function for non-integer (positive values)
and complex values of n. This "filling in" effect satisfies the
control in my binary formula in that all of the values of HD when
applied to the formula:

n! / ((n-hd)! hd!)

produce, when added together, the total value of sequence space (i.e.,
X^n).

The same thing is true for the formula for X>2.
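
For whole-number values of n and hd, the control described for the binary formula can be checked directly. The sketch below covers only that integer, binary case; the function name binomial_total is invented, and n = 1324 is borrowed from the test strings discussed in this thread.

```python
import math

def binomial_total(n):
    """Sum of n! / ((n - hd)! * hd!) over every Hamming distance hd = 0..n."""
    return sum(math.comb(n, hd) for hd in range(n + 1))

n = 1324
total = binomial_total(n)
print(total == 2 ** n)
```

Python integers are arbitrary precision, so the comparison is exact rather than a floating-point approximation.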

> Unless you're making this up as you go, you _must_ have a proof
> written down nearby which you used to get to your formula. Would you
> show it to us? Pretty please?

I'm not sure what other "proof" you are looking for . . .?

Sean Pitman
www.DetectingDesign.com

snex

unread,
Jul 25, 2007, 1:19:11 PM7/25/07
to
On Jul 24, 5:33 pm, Seanpit <seanpitnos...@naturalselection.
0catch.com> wrote:
> On Jul 23, 1:14 pm, snex <s...@comcast.net> wrote:
>
> > when are you going to apply this function to the bit strings in my
> > "CSI Challenge" thread?
>
> If the reference string works, then positive results are useful.
> However, negative results do not rule out the possibility of a very
> non-random bias. So, all you have to do is pick various reference
> strings that you know to be non-random in production and then compare
> your unknown sequences to them.
>
> For example, in testing your strings for possible bias let's use as a
> reference string a binary string that is made up of all zeros that is
> 1,324 characters in length (the same size as your test strings). Let's
> now compare your Test String A to the reference string:

why did you use a reference string made of all zeroes?

>
> Test String A:
> 110100000110111100101110010100100111101010101011001011001000111010000100
> - 35
> 110110010000011101110100111001010111100000110101001001001111100100011001
> - 36
> 100110011000101100001000001010011010111001000001111101010000100111110010
> - 32
> 000110010110100001001100100110001001110111000001010011001001001100000101
> - 29
> 101111111110110010000011111000101011000000101001011100010111100001111101
> - 39
> 110101000010000110011111001111000101110100010110100001110111101111100110
> - 39
> 101001011100111010110110110000111110101100000001011000100100011111010111
> - 38
> 011100101101111001001011101111001101001001001010111011110101100000100001
> - 38
> 100111100010000110101101011110110111100100011010101000100111000111110001
> - 38
> 101001001011110110110011011111100110000111011000111100001010011111101101
> - 42
> 011110011110000110111100010111110111000100010110111100100100111001011100
> - 40
> 110111011001101100010100011111001001110100110010111001110010001100110100
> - 38
> 001101010010110011100111101000010101010101010011001100010010111001011000
> - 34
> 000101001110110100111110010001001111011100010101101010000000100111011110
> - 36
> 010001001111110011011100101111101110110001000110011011111111100100000100
> - 40
> 111010100010010001111011010110101110111001100110101111101010101010110110
> - 42
> 100010101010011000111000101010001000111011011011000010001001010100110011
> - 32
> 111100111111100000000110110100001001010101110101000111111010011010101001
> - 38
> 1011110110000010101000110111
> - 15
>
> Signal A: 18 x 72 + 28 = 1324 characters, Hamming Distance: 681
> (expected 662)
>
> The expected number of 1's in this string, given the assumption of
> uniform distribution for random generation, is 662. The actual
> measured Hamming Distance is 681.
>
> At this point, I'm going to use a slightly modified version of my CSI
> formula that seems to work more easily (at least for me).
>
> CSI: (n! / (n-|(n/2-hd))|! |(n/2-hd)|!)
>
> If the HD were zero or maximum (i.e., a perfect match or inverse
> match) the CSI would be maximum either way:
>
> CSImax: (1324! / (1324 -(1324/2 - 0))! (662-0)!)
> = 1324! / 662! 662!
> = 8.028387e+396
>
> CSI for the expected random outcome:
>
> CSIexp: (1324! / (1324 -(662 - 662))! (662-662)!)
> = 1324! / 1324! 0!
> = 1
>
> The actual CSI for Test String A:
>
> CSI - A: (1324! / (1324 -|(662 - 681))|! |(662 - 681)|!)
> = 1324! / 1305! 19!
> = 1.49424e+42
>
> Now, this might seem like a large number, but compared to the maximum
> CSI value of about 8e+396, a number like 1.49e+42 isn't a very
> reliable match. Therefore, although it isn't quite the CSI value one
> would expect from a randomly generated sequence with a uniform
> distribution, the biased nature of the sequence has relatively low
> reliability.
>
> Test String B:
> 010101010110100000100110011010010110010001100001110100010010000000110000
> - 27
> 010101111000000101100100111101000010100101111100010101010011111101110010
> - 37
> 001101100110101100011000011000000110000000110110100101000111010001101001
> - 30
> 111011010110111101111100001010101101000110100100011011011001000101111111
> - 42
> 000110110010101011110001111001101111011111001000011110101100111100001010
> - 40
> 101000001010010101100100001000110110010001101101000100100100000001100101
> - 27
> 001000101001010101100111001100110010111110000011010101001010000010011001
> - 32
> 011100110000101100110111001000101101010110101100100110000110000011110010
> - 34
> 010101101110000001010100101100101100011001100010011011010001011010100101
> - 33
> 010000000110000110000000011011100011110100000111001001011110001001100100
> - 28
> 000000110010101100100101001110110010001100100000011110001010101000110011
> - 30
> 000011000000101110000001011000011110011010001001010111110011110001101100
> - 32
> 111011000101001101100011000010000101001010111010011111111100010011110110
> - 38
> 101011000110111001110100010111010100101010110000011000100111011101100011
> - 37
> 011000010100110000010110000110011000000000101000110011010010110000101100
> - 26
> 010110110101001111100011001101100000101000100001001010000001001100000111
> - 30
> 110000101010101001101001001110100000010000100010010111010111011110000101
> - 32
> 111001100110010010011001011101101111010001000100101001101101111101110000
> - 38
> 1111110001110000100110101010
> - 15
>
> Signal B: 18 x 72 + 28 = 1324 characters, Hamming Distance: 608
> (expected 662)
>
> The expected number of 1's in this string, given the assumption of
> uniform distribution for random generation, is 662. The actual
> measured Hamming Distance is 608.
>
> CSI: (n! / (n-(n/2-hd))! (n/2-hd)!)
>
> = 1324! / ((1324 - (662 - 608))! (662 - 608)!)
>
> = 1324! / 1270! 54!
>
> = 5.5307e+96
>
> As mentioned above, a CSI of 5.5e+96 might not seem very reliable when
> compared to the max CSI value of 8e+396. However, compared to the CSI
> value for Test String A (CSI of 1.49e+42) the bias of B is much more
> significant compared to A. Therefore, between the two strings A and
> B, as compared to the reference sequence, B is more likely to have
> been the result of non-random generation.
>
> Of course, if the test string happened to be pi, the biased nature of
> the test string would not have been reliably picked up by comparison
> to the reference string used in this case as having maximum CSI. That
> is why a negative finding does not rule out the possibility of highly
> predictable non-random generation from a relatively simple algorithm
> that simply hasn't been included as one of the reference algorithms/
> strings.
>
> One other potential problem is one of where a random sequence is in
> fact produced, but has a non-uniform distribution. This test will
> reliably detect the non-uniform nature of the distribution via the
> relative increase in CSI above the expected. The formula can be
> applied by equalizing the number of characters in the sequence to
> remove the non-uniform bias, and then comparing the resulting pattern
> of 0s and 1s in comparison to various reference strings that may
> detect various forms of symmetry etc.
>
> Again, none of these methods will be able to detect all forms of bias
> that could in fact be reduced to a relatively simple algorithm if it
> were only known. In fact, it is impossible to rule out the
> possibility of non-random bias for a sequence that is apparently
> random from the perspective of the test. A positive test result can
> still be quite helpful.
>
> Sean Pitman
> www.DetectingDesign.com
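
The arithmetic in the quoted post can be reproduced with a short sketch. It assumes the modified formula reduces to the binomial coefficient C(n, |n/2 - hd|), which is how the worked values above are evaluated; the function name `modified_csi` is a label chosen for this illustration, not a name from the thread. Values are reported as log10 because the maximum (about 8e+396) is too large for a floating-point number.

```python
import math

def modified_csi(n: int, hd: int) -> int:
    """Binomial coefficient C(n, |n/2 - hd|), as the modified formula is evaluated above."""
    if n % 2:
        raise ValueError("this sketch assumes an even n so that n/2 is an integer")
    d = abs(n // 2 - hd)
    return math.comb(n, d)

n = 1324
for label, hd in [("max", 0), ("expected", 662), ("A", 681), ("B", 608)]:
    value = modified_csi(n, hd)
    # Report log10 because the largest value overflows a float.
    print(f"{label:>8}: hd={hd:4d}  log10(CSI) = {math.log10(value):.2f}")
```

Working with exact integers and taking the logarithm only for display avoids any rounding in the factorials themselves.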


fropome

unread,
Jul 25, 2007, 1:33:59 PM7/25/07
to


Let's start from scratch.
In mathematics, you can only write something if it clearly follows
from what has come before. Everything follows from very basic axioms
(which I'm not about to go into here).
This means that every formula can be shown to be correct or not only
by a mathematical proof, which is a set of steps which show how it
follows from what is already proven (or assumed, in the case of
axioms).

It is possible to write down formulae which work in the real world
without doing all the work- this is because you are approximating. You
could say that as intelligence increases, so does income. You could
even write it in the form of an equation and claim that it was
true. People may argue with you, but they may struggle to definitively
prove you wrong.

However when dealing only with numbers (as you are, with your strings)
you have ventured out of the real world and into the world of
mathematics. This requires a formal proof of every formula.
Please note that this is not an arbitrary demand, but is required. We
_cannot_ know anything in mathematics without a proof- trying a few
numbers does not count.

You might have gotten away with not writing a proof had you not used
the gamma function, because you could have claimed that the formula
was intuitively true- i.e. you had worked out each component of the
equation as you went. However the gamma function is a very abstract
piece of mathematics which does not intuitively fit into the equation.

Even if you had been able to formulate the equation intuitively, you
would still require a mathematical proof before it would be accepted
since it is very much not intuitive to anyone else.

Now your formula may be correct. It may be that it is mathematically
provable. However unlike in science, we cannot say that it's a good
theory and use it until we think of something better (even if we think
it is any good at all), rather we must prove it. The fact that you
don't even seem to be aware of the concept of a mathematical proof
rather suggests that you have not done so. Personally, beyond quoting
a couple of things about filling in gaps in the factorial function
from wikipedia I doubt that you understood what a factorial was when
you wrote the formula down, let alone what the gamma function is.

Basically, you have a choice. You say that you made the formula up as
a post-hoc rationalisation, you say that it is based on experimental
evidence (and you need to show how you got from the evidence to the
formula rather than the other way around- ie no other formula fits
better) or you come up with a formal proof.
(Other, less honourable, options include not replying or claiming that
you aren't doing mathematics, but I'm sure you wouldn't do that)

Now which is it to be?

Seanpit

unread,
Jul 25, 2007, 1:53:19 PM7/25/07
to

All the formula is intended to show is the number of sequences in
sequence space at a given Hamming Distance. It is quite easy to prove
that the formula does this. Just add up all the sequences for each
Hamming Distance and compare this total to the total size of sequence
space (X^n). They match. What other proof do you need?

Sean Pitman
www.DetectingDesign.com
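
The check described in this post can be sketched as follows. For binary strings the number of sequences at Hamming distance hd from a fixed reference is C(n, hd). For an alphabet of X symbols the standard combinatorial count is C(n, hd) * (X-1)^hd, which is a textbook result and not the X > 2 formula given at the top of the thread. The function names are chosen for this illustration. The sum over all distances equals X^n by the binomial theorem, since (1 + (X-1))^n = X^n; the brute-force comparison below only confirms it for small cases.

```python
from itertools import product
from math import comb

def count_at_distance(alphabet_size: int, n: int, hd: int) -> int:
    """Standard count of length-n strings at Hamming distance hd from a fixed string."""
    return comb(n, hd) * (alphabet_size - 1) ** hd

def brute_force_counts(alphabet_size: int, n: int) -> list[int]:
    """Enumerate every string and tally its distance from the all-zero reference."""
    counts = [0] * (n + 1)
    for s in product(range(alphabet_size), repeat=n):
        counts[sum(1 for c in s if c != 0)] += 1
    return counts

for x, n in [(2, 6), (4, 5)]:
    formula = [count_at_distance(x, n, hd) for hd in range(n + 1)]
    assert formula == brute_force_counts(x, n)  # formula agrees with enumeration
    assert sum(formula) == x ** n               # counts partition sequence space
    print(x, n, formula)
```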

Seanpit

unread,
Jul 25, 2007, 2:10:14 PM7/25/07
to
On Jul 25, 10:19 am, snex <s...@comcast.net> wrote:

> > For example, in testing your strings for possible bias let's use as a
> > reference string a binary string that is made up of all zeros that is
> > 1,324 characters in length (the same size as your test strings). Let's
> > now compare your Test String A to the reference string:
>
> why did you use a reference string made of all zeroes?

Any non-random reference string with a uniform distribution would work
in this case - like a string of 1s, 01s or 11001100s etc.

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 25, 2007, 2:14:34 PM7/25/07
to
On Jul 25, 8:58 am, fropome <monk...@hornsandhalos.co.uk> wrote:

> > > What?
>
> > > oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry it
> > > took me so long. Would you like the opportunity to explain the joke to
> > > the lurkers? Or would you prefer it if I pointed out how what you've
> > > written makes absolutely no sense?
>
> > It does (sort of), provided you understand the Pitmanese dialect of
> > mathematics. Dr Sean seems to be using the factorial sign as a
> > synonym for its generalisation to arbitrary non-negative real numbers
> > --i.e the Gamma function evaluated at the factorial's argument
> > incremented by one: x! =pitmandef Gamma( x+1 ). So 3.17! / 2.755! * 0.415!,
> > for instance, is just Gamma( 4.17 )/( Gamma( 3.755 ) * Gamma( 1.415 )
> > = 7.45836852.../(4.4492345954... * 0.8865489992441 ...) = 1.8908443883...
>

> > -----------------------------------------------------------------------------


> > David Wilson
>
> > SPAMMERS_fingers@WILL_BE_fwi_PROSECUTED_.net.au
> > (Remove underlines and upper case letters to obtain my email address)
>
> I actually suspect that he was going to _say_ something like that, but
> had actually just been working it out on his windows calculator and
> didn't realise that n! is actually undefined for n <> integer.
> My next question was going to be why he thinks the gamma function is
> useful here, other than because it generates large numbers. Either his
> maths is very good- in which case he would have a proof for this
> formula which he can show us- or his maths is very bad. Using the
> gamma function without saying that he is makes me suspect rather
> strongly that his maths is very bad, using it without having a proof
> of why it's there would make me _know_ his maths was bad.

Are you asking me to "prove" that the gamma function actually works
for filling in the gaps between factorials? I'm not sure what the
problem is here? What are you asking me to prove exactly? If the
gamma function actually does fill in the gaps appropriately, and it
does seem to indeed, what's the problem?

What's your proof for 2+2=4 ?

Sean Pitman
www.DetectingDesign.com
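
For readers following the gamma-function exchange, the quoted example 3.17! / (2.755! * 0.415!) can be reproduced with Python's standard `math.gamma`, using the convention x! = Gamma(x + 1) described in the quoted text. This is a minimal sketch; the helper names are chosen for this illustration. It shows only that the expression can be evaluated, not that the result counts anything.

```python
import math

def generalized_factorial(x: float) -> float:
    """x! extended to non-integer x via Gamma(x + 1)."""
    return math.gamma(x + 1.0)

def generalized_binomial(n: float, k: float) -> float:
    """n! / ((n - k)! * k!) with factorials replaced by Gamma(x + 1)."""
    return generalized_factorial(n) / (
        generalized_factorial(n - k) * generalized_factorial(k)
    )

# The example from the quoted post: 3.17! / (2.755! * 0.415!)
print(generalized_binomial(3.17, 0.415))
```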

Mark Isaak

unread,
Jul 25, 2007, 2:15:19 PM7/25/07
to
On Wed, 25 Jul 2007 15:45:29 +0000, Seanpit wrote:

> On Jul 25, 8:33 am, Mark Isaak <eci...@earthlink.net> wrote:
>
>> > This is how all forms of detecting of artifact begin.
>>
>> You have never in your life actually tried to detect artifacts, have
>> you?
>
> It depends on what you think forensics is all about . . . or finding an
> arrowhead in a field amongst other naturally formed rocks.

So when you have a field of naturally formed rocks and possibly some
artifacts such as arrowheads, how do you detect the artifacts?

I gather, from your writings to date, the procedure would have to be
something like this:
1) Gather all rocks in an area. Let's say this gets you 1000 rocks.
2) Create a digital representation of each rock. (Do you do this with
some kind of 3-D scan, or with digital photos from three angles?)
3) Compute Hamming distances between the digital representations of each
pair of rocks, and from them, compute the CSI values.
4) I'm not sure of the next step. Now that you have 500,000 CSI values
computed, how do you use them to determine which of the rocks are
artifacts?

Now, when I was doing the same task, these are the steps I took:
1) Learn from an archaeologist what properties one tends to find on human
artifacts and not on naturally occurring rocks. The archaeologist
learned about such features, in large part, from observations of the work
of people actually making arrowheads from flint or (as applied to where I
was searching) chert.
2) For each rock, look for those human-made features.

Can you give me *any* reason to prefer your method over mine?

Mark VandeWettering

unread,
Jul 25, 2007, 2:29:30 PM7/25/07
to
On 2007-07-25, Seanpit <sea...@gmail.com> wrote:

They are considerably more helpful than your ideas about CSI.

>> Mark
>
> Sean Pitman
> www.DetectingDesign.com
>

snex

unread,
Jul 25, 2007, 2:38:22 PM7/25/07
to

are you claiming that any non-random reference string with a uniform
distribution would yield the same results (i.e. one string has more
CSI than the other)? i would like to see you demonstrate this.

>
> Sean Pitman
> www.DetectingDesign.com
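
The dependence on the reference string raised here can be illustrated with a small sketch. The 16-character test string and the list of reference patterns are hypothetical, chosen only to show how the Hamming distance of one fixed test string changes with the reference.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

def repeat_to_length(pattern: str, n: int) -> str:
    """Repeat a short pattern and truncate it to exactly n characters."""
    return (pattern * (n // len(pattern) + 1))[:n]

test = repeat_to_length("0011", 16)  # hypothetical, clearly non-random test string
for pattern in ["0", "1", "01", "0011", "1100"]:
    ref = repeat_to_length(pattern, len(test))
    print(f"reference {pattern!r:>7}: hd = {hamming_distance(test, ref)}")
```

Against the all-0, all-1 and 01 references this test string sits at exactly n/2, the distance expected for a random string, while the 0011 and 1100 references give 0 and n.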


fropome

unread,
Jul 25, 2007, 3:18:43 PM7/25/07
to

1) This is not a proof. It is a demonstration that the formula works
for particular values. x+x = x * x works for x = 0, this does not mean
that it's always true.
2)


For X = 2:
CSI: X^n - (n! / (n-hd)! hd!)

Let us take x = 2, n = 2 (binary string length 2), hd = 2

CSI = 2^2 - (2! / (0! * 2!))
CSI = 4 - 1 = 3

string 1: 11

what are the other 3 strings with hd of 2 from this string?

3) Earlier you gave a result of your formula as:
CSI
= ~1.90

How exactly can this be true when:


> All the formula is intended to show is the number of sequences in
> sequence space at a given Hamming Distance.

what exactly could a number of ~1.90 sequences mean?
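
The n = 2, hd = 2 case above is small enough to enumerate directly. The sketch below lists every binary string at a given Hamming distance from a reference and prints the result next to the two expressions under discussion; `strings_at_distance` is a helper name chosen for this illustration.

```python
from itertools import product
from math import comb

def strings_at_distance(reference: str, hd: int, alphabet: str = "01") -> list[str]:
    """All strings over the alphabet at exactly Hamming distance hd from reference."""
    n = len(reference)
    return [
        "".join(s)
        for s in product(alphabet, repeat=n)
        if sum(a != b for a, b in zip(s, reference)) == hd
    ]

reference, hd = "11", 2
n = len(reference)
found = strings_at_distance(reference, hd)
print("strings found:     ", found)
print("n! / ((n-hd)! hd!):", comb(n, hd))
print("X^n - that value:  ", 2 ** n - comb(n, hd))
```

Only one string is at distance 2 from 11, which matches the binomial coefficient and not the value 3 produced by the full expression.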

Seanpit

unread,
Jul 25, 2007, 6:05:45 PM7/25/07
to
On Jul 25, 11:38 am, snex <s...@comcast.net> wrote:

> are you claiming that any non-random reference string with a uniform
> distribution would yield the same results (i.e. one string has more
> CSI than the other)? i would like to see you demonstrate this.

I'm not saying that any reference string would yield the same result.
That's not true. What I am saying is that the odds are poor that the
simple algorithm chosen to produce the predictable reference string
will match a randomly produced test string vs. a non-randomly produced
test string.

The reason for this is that a randomly produced test string with
uniform distribution would have equal odds of a match at each position
to any independently produced reference string. However, a test
string that is the result of non-random production will have a
predictably biased distribution relative to certain reference strings.

That is why a high CSI is helpful, while a low CSI doesn't mean much.

Sean Pitman
www.DetectingDesign.com
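
One conventional way to quantify "equal odds of a match at each position" is an exact two-sided binomial tail probability. This is standard statistics, not the CSI formula from this thread, and it assumes each position matches the reference independently with probability 1/2. The sketch applies it to the Hamming distances reported earlier for Test Strings A (681) and B (608) with n = 1324; the function name is chosen for this illustration.

```python
from math import comb

def two_sided_tail(n: int, hd: int) -> float:
    """P(|K - n/2| >= |hd - n/2|) for K ~ Binomial(n, 1/2), computed exactly."""
    if n % 2:
        raise ValueError("this sketch assumes an even n")
    d = abs(hd - n // 2)
    favourable = sum(comb(n, k) for k in range(n + 1) if abs(k - n // 2) >= d)
    # Exact integer arithmetic; only the final ratio is converted to a float.
    return favourable / 2 ** n

for label, hd in [("A", 681), ("B", 608)]:
    print(f"Test String {label}: hd={hd}, tail probability = {two_sided_tail(1324, hd):.4f}")
```

A small tail probability says only that the deviation from n/2 would be unusual under the uniform random assumption. It says nothing about what produced the string.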

Seanpit

unread,
Jul 25, 2007, 6:23:33 PM7/25/07
to
On Jul 25, 11:15 am, Mark Isaak <eci...@earthlink.net> wrote:
> On Wed, 25 Jul 2007 15:45:29 +0000, Seanpit wrote:
> > On Jul 25, 8:33 am, Mark Isaak <eci...@earthlink.net> wrote:
>
> >> > This is how all forms of detecting of artifact begin.
>
> >> You have never in your life actually tried to detect artifacts, have
> >> you?
>
> > It depends on what you think forensics is all about . . . or finding an
> > arrowhead in a field amongst other naturally formed rocks.
>
> So when you have a field of naturally formed rocks and possibly some
> artifacts such as arrowheads, how do you detect the artifacts?
>
> I gather, from your writings to date, the procedure would have to be
> something like this:
> 1) Gather all rocks in an area. Let's say this gets you 1000 rocks.
> 2) Create a digital representation of each rock. (Do you do this with
> some kind of 3-D scan, or with digital photos from three angles?)
> 3) Compute Hamming distances between the digital representations of each
> pair of rocks, and from them, compute the CSI values.
> 4) I'm not sure of the next step. Now that you have 500,000 CSI values
> computed, how do you use them to determine which of the rocks are
> artifacts?

You don't. CSI, by itself, doesn't say anything about the artifactual
or non-artifactual origin of anything. CSI only detects non-random
bias, not artifact, ET, or ID. I've pointed this out several times
now.

> Now, when I was doing the same task, these are the steps I took:
> 1) Learn from an archaeologist what properties one tends to find on human
> artifacts and not on naturally occurring rocks. The archaeologist
> learned about such features, in large part, from observations of the work
> of people actually making arrowheads from flint or (as applied to where I
> was searching) chert.
>
> 2) For each rock, look for those human-made features.
>
> Can you give me *any* reason to prefer your method over mine?

You just used CSI here. If the rocks you found all looked like they
were the result of random production and had no evident bias that you
could detect, you wouldn't be able to hypothesize any sort of artifact
at all. You see, you must first detect bias before you can move on to
propose any sort of origin for that bias.

Since bias is evident in certain types of artifacts, this bias can be
analyzed by comparing it to what known non-deliberate vs. deliberate
forces in nature can achieve. If the bias goes significantly beyond
what known non-deliberate forces can achieve, yet remains within the
realm of what deliberate forces can achieve, your hypothesis of
deliberate artifact gains useful predictive value.

Again, you can't detect artifact until you first detect non-random
bias.

> Mark Isaak eciton (at) earthlink (dot) net
> "Voice or no voice, the people can always be brought to the bidding of
> the leaders. That is easy. All you have to do is tell them they are
> being attacked, and denounce the pacifists for lack of patriotism and
> exposing the country to danger." -- Hermann Goering

Sean Pitman
www.DetectingDesign.com


Seanpit

unread,
Jul 25, 2007, 7:02:22 PM7/25/07
to
On Jul 25, 12:18 pm, fropome <monk...@hornsandhalos.co.uk> wrote:

> 1) This is not a proof. It is a demonstration that the formula works
> for particular values. x+x = x * x works for x = 0, this does not mean
> that it's always true.
> 2)
> For X = 2:
> CSI: X^n - (n! / (n-hd)! hd!)
>
> Let us take x = 2, n = 2 (binary string length 2), hd = 2
>
> CSI = 2^2 - (2! / (0! * 2!))
> CSI = 4 - 1 = 3
>
> string 1: 11
>
> what are the other 3 strings with hd of 2 from this string?

The number of strings at a particular HD is given by the formula:

n! / ((n-hd)! hd!)

In other words, for x=2, n=2 the total number of sequences at hd=2 is:

2! / ((2-2)! 2!) = 2!/2! = 1

This formula works for all binary strings. I really don't see the need
for further proof.

The CSI formula for binary strings (modified version):

n! / ((n - |n/2 - hd|)! * |n/2 - hd|!)

Is kind of an inverse value where the Hamming Distances with the
greatest number of strings have the least CSI value and those HDs with
the least number of strings have the greatest CSI value - creating an
inverse relationship.

> 3) Earlier you gave a result of your formula as:
> CSI = ~1.90
>
> How exactly can this be true when:
>
> > All the formula is intended to show is the number of sequences in
> > sequence space at a given Hamming Distance.
>
> what exactly could a number of ~1.90 sequences mean?

It is an odds ratio in the context of X>2 where the CSI value is still
useful in understanding the likelihood of a match to the reference
string given the hypothesis of a random origin of the test string.

Sean Pitman
www.DetectingDesign.com
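
The inverse relationship described in this post can be seen in a small table. The sketch assumes the modified formula is evaluated as C(n, |n/2 - hd|) and uses n = 8 as an arbitrary illustrative length; the function names are chosen for this example.

```python
from math import comb

def strings_at_hd(n: int, hd: int) -> int:
    """Number of binary strings at Hamming distance hd from a fixed reference."""
    return comb(n, hd)

def modified_csi(n: int, hd: int) -> int:
    """Modified formula, read as C(n, |n/2 - hd|); assumes an even n."""
    return comb(n, abs(n // 2 - hd))

n = 8
print(" hd  strings  modified CSI")
for hd in range(n + 1):
    print(f"{hd:3d}  {strings_at_hd(n, hd):7d}  {modified_csi(n, hd):12d}")
```

The distance with the most strings (hd = n/2) gets the smallest value, and the two extremes (hd = 0 and hd = n) get the largest.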

Seanpit

unread,
Jul 25, 2007, 7:03:38 PM7/25/07
to
On Jul 25, 11:29 am, Mark VandeWettering <wetter...@attbi.com> wrote:

> They are considerably more helpful than your ideas about CSI.

As helpful as your idea that pi is not computable?

Sean Pitman
www.DetectingDesign.com


hersheyh

unread,
Jul 25, 2007, 7:22:56 PM7/25/07
to
On Jul 24, 9:32 pm, Seanpit <seanpitnos...@naturalselection.
0catch.com> wrote:
> On Jul 24, 4:25 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > > What, other than generating numbers that you like, are these formulas
> > > > supposed to represent? Specifically, how do you *determine* the hd
> > > > number? Arbitrary choice?
>
> > > The Hamming Distance is defined by the number of character differences
> > > that exist between the test sequence and the target sequence.
>
> > I *know* how hamming distance is *defined*. That wasn't the
> > question. The question was how you *determine* the number you use.
> > And I suggested that you made an arbitrary choice.
>
> Your question doesn't make any sense to me.

That doesn't surprise me.

> The Hamming Distance
> "number" is determined by the number of mismatched character positions
> between the reference string and the test string. It is an absolute
> number; not an arbitrary or subjective choice.

And, if what I have is a 'target' sequence *or* a 'reference' sequence
but not both, how do I determine what the other sequence is? Do you
*intentionally* choose and rig the choice of 'target' or 'reference'
to get the hd you want and the CSI number you want? *That* is the
problem. I *know* what sequences actually exist or possibly could
exist. They don't come with labels of 'target' or 'reference'. How
do you determine, out of all the sequences that currently exist or
could have existed in the past, what the 'target' and 'reference'
sequences are? That is rather crucial to determining hd.

> Also, the selection of the reference string must be done without any
> knowledge ahead of time of the test string. The choice must be
> completely independent.

IOW, the *reference* string must be a *randomly* chosen sequence out
of total sequence space. If you are going to choose a reference
sequence without any pre-knowledge of the test string, aren't you
choosing a sequence at random among all the possibilities? So that
means that one would *expect* n/2 differences. Hell, if you had a
more than binary system, you might expect even *more* differences.
Random unrelated aa sequences have about 15-20% identity (it is
somewhat more than one expects given 20 aa's, because different aa's
occur at different %s), so I presume that is what one would expect
unless (because the choice has to be *random* wrt the 'target') one
was quite lucky.

> Useful choices would include strings that are
> known to be the result of non-random simple algorithms, like pi or a
> repeat of a simple pattern - like 01010101 . . ., etc.

But that would be an *arbitrary* choice of 'reference'. You would be
*selectively* and rather *arbitrarily* choosing some modern functional
sequence as the 'reference'. What then would be the 'target'? How
would you choose the 'target' *after* you have "independently" chosen
the 'reference'? Your 'reference' isn't a random sequence from total
sequence space and certainly would also not even be the putative
'ancestral' protein to some 'target', since both would be *modern*
proteins derived from a common ancestor of uncertain *sequence*. I
have already mentioned that, for some deeply divergent sequences (ones
that have had separate lineages for more than, say 400 million years),
one has almost no detectable *sequence* homology but has extremely
good *structure* homology and *function* homology, so *sequence* is an
especially poor indicator of ancestry.

> > > > Choosing the value n/2 (the average
> > > > sequence difference between all the
> > > > sites being the same and all the
> > > > sites being different)?
>
> > > That number, the average HD, has the lowest CSI value . . .
>
> > SFW? Is that the number you choose by choosing your 'reference'
> > sequence? How do you determine the hd except by your choice of
> > 'reference' sequence?
>
> That's exactly how you determine the HD, by comparison to your
> previously chosen reference sequence.

And you chose the 'reference' by...?
>
> > Quit beating around the bush and tell us how
> > you *choose* the 'reference' sequence and
> > thus *choose* the hd number?
>
> The reference sequences are chosen ahead of time before the test
> sequence is analyzed. The choice of reference or references is not
> based on the test sequence, but on sequences that are already known to
> be the result of non-random production.

IOW, the 'reference' is some *arbitrarily* chosen modern protein or
sequence that has a function, not a random sequence taken from total
sequence space. Then after you have *chosen* this *arbitrary*
sequence to be the 'reference', how do you determine the 'target'
sequence from total sequence space? And how do you recognize or deal
with the fact that neither sequence is the ancestral sequence and you
have no way of determining *function* (or lack thereof) of any of the
intermediate steps?

I am having trouble not only seeing the relevance of this number you
calculate to CSI, but its relevance to evolution or to any *real
world* discussion about anything.

> > > > And isn't the *minimum* difference between a
> > > > 'reference' sequence and a 'target'
> > > > sequence still going to be hd=1 no
> > > > matter what you say and no matter
> > > > how large n is?
>
> > > The minimum HD is actually zero - or complete identity.
>
> > I said minimum *difference*. Zero or identity is a state of 'no
> > difference'.
>
> Of course - but making the point for a minimum HD difference being 1
> is irrelevant in this particular discussion.

I agree. But *if* one is talking about the evolution of some new
sequence, the minimum HD difference is 1. But that one step can be
the generation of a second copy of the entire initial sequence.
Duplication and divergence (or specialization) is a common
evolutionary mechanism where the first step (duplication) is both
common and often selectively neutral.
>
> < snip repetitive >
>
> Sean Pitman
> www.DetectingDesign.com


Seanpit

unread,
Jul 25, 2007, 7:40:10 PM7/25/07
to
On Jul 25, 4:22 pm, hersheyh <hershe...@yahoo.com> wrote:

> > The Hamming Distance
> > "number" is determined by the number of mismatched character positions
> > between the reference string and the test string. It is an absolute
> > number; not an arbitrary or subjective choice.
>
> And, if what I have is a 'target' sequence *or* a 'reference' sequence
> but not both, how do I determine what the other sequence is? Do you
> *intentionally* choose and rig the choice of 'target' or 'reference'
> to get the hd you want and the CSI number you want? *That* is the
> problem. I *know* what sequences actually exist or possibly could
> exist. They don't come with labels of 'target' or 'reference'. How
> do you determine, out of all the sequences that currently exist or
> could have existed in the past, what the 'target' and 'reference'
> sequences are? That is rather crucial to determining hd.

The reference sequences are determined, by you, ahead of time before
you go out to analyze any other sequences. The reference sequences
are based on non-random strings that are known to be produced by
simple algorithms - like pi or like 0101010 . . .

After you have your set of reference strings, you can compare incoming
sequences to your set of reference sequences to see if an incoming
sequence is likely to be non-random in origin.

> > Also, the selection of the reference string must be done without any
> > knowledge ahead of time of the test string. The choice must be
> > completely independent.
>
> IOW, the *reference* string must be a *randomly* chosen sequence out
> of total sequence space.

No. The reference string must be chosen based on knowledge that it is
not random - i.e., the reproducible product of a simple algorithm.

< snip >

> > Useful choices would include strings that are
> > known to be the result of non-random simple algorithms, like pi or a
> > repeat of a simple pattern - like 01010101 . . ., etc.
>
> But that would be an *arbitrary* choice of 'reference'. You would be
> *selectively* and rather *arbitrarily* choosing some modern functional
> sequence as the 'reference'.

That's right . . .

> What then would be the 'target'?

There is no "target". There are only test strings that you compare to
your reference strings. If the test strings match one of your
reference strings, to a high level of CSI, the hypothesis of non-
random origin is supported.

> How
> would you choose the 'target' *after*
> you have "independently" chosen
> the 'reference'?

You don't choose the test string. Any string could be tested by the
reference strings - any string at all.

< snip >

> > > > > And isn't the *minimum* difference between a
> > > > > 'reference' sequence and a 'target'
> > > > > sequence still going to be hd=1 no
> > > > > matter what you say and no matter
> > > > > how large n is?
>
> > > > The minimum HD is actually zero - or complete identity.
>
> > > I said minimum *difference*. Zero or identity is a state of 'no
> > > difference'.
>
> > Of course - but making the point for a minimum HD difference being 1
> > is irrelevant in this particular discussion.
>
> I agree. But *if* one is talking about the evolution of some new
> sequence, the minimum HD difference is 1. But that one step can be
> the generation of a second copy of the entire initial sequence.
> Duplication and divergence (or specialization) is a common
> evolutionary mechanism where the first step (duplication) is both
> common and often selectively neutral.

We aren't talking about finding target sequences in this discussion,
Howard. That's a different topic altogether. Your notion that the
minimum possible HD (i.e., 1) is always the likely distance is your
stumbling block when it comes to your ability to grasp the fundamental
problem with finding unknown target sequences that exist in sequence/
structure space with different average distances between them -
average distances that are directly related to minimum structural
threshold requirements.

Again though, that is a different topic from the one being discussed
here - for the umpteenth time.

Sean Pitman
www.DetectingDesign.com

hersheyh

unread,
Jul 25, 2007, 8:32:11 PM7/25/07
to
On Jul 25, 7:40 pm, Seanpit <seanpitnos...@naturalselection.
0catch.com> wrote:
> On Jul 25, 4:22 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > The Hamming Distance
> > > "number" is determined by the number of mismatched character positions
> > > between the reference string and the test string. It is an absolute
> > > number; not an arbitrary or subjective choice.
>
> > And, if what I have is a 'target' sequence *or* a 'reference' sequence
> > but not both, how do I determine what the other sequence is? Do you
> > *intentionally* choose and rig the choice of 'target' or 'reference'
> > to get the hd you want and the CSI number you want? *That* is the
> > problem. I *know* what sequences actually exist or possibly could
> > exist. They don't come with labels of 'target' or 'reference'. How
> > do you determine, out of all the sequences that currently exist or
> > could have existed in the past, what the 'target' and 'reference'
> > sequences are? That is rather crucial to determining hd.
>
> The reference sequences are determined, by you, ahead of time before
> you go out to analyze any other sequences. The reference sequences
> are based on non-random strings that are known to be produced by
> simple algorithms - like pi or like 0101010 . . .

IOW, you would know it if the SETI signal were repeated digits of pi
in base 10, but would not be able to recognize pi in base 2 or the
other reference you give. Using *your* idea, you would declare any
other signal as "random" and unrelated to the 'reference'. Is *that*
what you claim that SETI is doing?

> After you have your set of reference strings, you can compare incoming
> sequences to your set of reference sequences to see if an incoming
> sequence is likely to be non-random in origin.

Again, you would only be able to detect 'targets' that were near
enough to your *biased* selection of 'reference' sequences to register
as 'sufficiently close'.

> > > Also, the selection of the reference string must be done without any
> > > knowledge ahead of time of the test string. The choice must be
> > > completely independent.
>
> > IOW, the *reference* string must be a *randomly* chosen sequence out
> > of total sequence space.
>
> No. The reference string must be chosen based on knowledge that it is
> not random - i.e., the reproducible product of a simple algorithm.

Fractals are generated by simple algorithms. So, for that matter, is
a pathway in which you have occasional random mutation and fixation of
the result as a second rarer event. But those are not
*reproducible*. Does that mean you are *specifically* ruling out
evolutionary algorithms *arbitrarily* by requiring a determinative
result rather than a probabilistic one? What would be the 'reference'
sequence for proteins? Because the same functional protein in
different organisms has different sequences, and sometimes
dramatically different ones, you cannot claim that any particular
available sequence is a reproducibly determined product of a simple
algorithm. If your CSI calculation is going to have meaning for
evolution, you do have to tell us *which* sequence for, say, beta
globin of hemoglobin is the "reproducible result of a simple
algorithm", don't you? Is it the human gamma-G? gamma-A? Embryonic?
Adult?

> < snip >
>
> > > Useful choices would include strings that are
> > > known to be the result of non-random simple algorithms, like pi or a
> > > repeat of a simple pattern - like 01010101 . . ., etc.
>
> > But that would be an *arbitrary* choice of 'reference'. You would be
> > *selectively* and rather *arbitrarily* choosing some modern functional
> > sequence as the 'reference'.
>
> That's right . . .
>
> > What then would be the 'target'?
>
> There is no "target". There are only test strings that you compare to
> your reference strings. If the test strings match one of your
> reference strings, to a high level of CSI, the hypothesis of non-
> random origin is supported.

Then the result you get is entirely dependent on which sequences you
*arbitrarily* chose as your 'reference' strings. How can you be sure
that your choice of all the 'reference' strings you *arbitrarily*
chose to look at will catch the 'intelligently designed' sequence you
'test'? And how will you, if you are too broad in your *arbitrary*
choices, prevent false positives? And if you are too narrow in your
*arbitrary* choices of 'references' aren't you going to ensure many
false negatives?


>
> > How would you choose the 'target' *after* you have "independently"
> > chosen the 'reference'?
>
> You don't choose the test string. Any string could be tested by the
> reference strings - any string at all.

You must *really* be brilliant if you can think of all the 'reference'
strings that not only a non-human ET might send as a signal, but also
all the protein 'reference' sequences that have *ever* existed.
Otherwise I cannot think of any way that your test would not wind up
being hit-or-miss and not much better than dumb luck.

R. Baldwin

Jul 25, 2007, 10:07:34 PM
"Seanpit" <sea...@gmail.com> wrote in message
news:1185376351....@z24g2000prh.googlegroups.com...

> On Jul 24, 8:20 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
> wrote:
>
>> > No, because the reference strings are chosen independent of the test
>> > strings. Therefore a significant match between a reference and a test
>> > string is good evidence of non-random production.
>>
>> That is not necessarily true. It is good evidence that the test string
>> was not produced by a stationary random process with Uniform
>> distribution, which is a much more restricted case.
>
> If a perfect match happened to be to a reference string that had to
> regular character repeats, like pi, this would be good evidence the
> test string was not produced by a random process with uniform or non-
> uniform distribution.

I note that you've corrected this to "no regular character repeats".

I would accept that, happening upon the digits of pi, one probably has an
artificial pattern.

I would not accept that the lack of regular character repeats in general
implies what you say it does. Computable transcendental numbers are a
special case. Most real numbers are algorithmically random, as defined by
Chaitin, and there are no finite algorithms to compute their digits. These
are the numbers that lack regular character repeats.

>
>> > Actually, a string with maximum Shannon information is a "random"
>> > string. And, how does one define randomness? By comparison to a
>> > reference computer - a UTM.
>>
>> No, that is not right, Sean. A string with maximum Shannon information
>> has equiprobable symbols, random or not. Shannon's theory models
>> information sources as random variables to make the math easy. That
>> does not mean information sources are actually random, nor does it
>> mean a string with equiprobable symbols is random.
>
> Shannon information is determined by reference to a known "random"
> source of string production - a source that produces maximum Shannon
> information.

No, no, no. That is very badly wrong. As I said above, Shannon modeled
information sources as if they were random variables. Specifically, Markov
random variables. That does not mean information sources really are Markov
random variables. With very few exceptions, they are *not* Markov random
variables. Shannon's theory only *pretends* information sources are Markov
random variables because it makes the math easy, and it is a decent
approximation that works pretty well.

Furthermore, the amount of information produced by an information source
depends on how surprised a receiver will be. This depends on the relative
probabilities of the different symbols the information source can produce.
There *is* no maximum on Shannon information. If you want more information,
just watch the random variable for a longer time.

It is the information *entropy* (average information) that can have a
maximum. The entropy is maximum if the information source produces symbols
equiprobably. For a binary source, this means producing either 0's or 1's
with a 50% probability. For a decimal source, this means producing any of
the 10 digits with a 10% probability.
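
As a worked illustration of the last two paragraphs, here is a small sketch of the entropy calculation H = -sum(p * log2(p)). The example distributions are chosen for illustration only:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits per symbol, of a symbol distribution."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_binary = entropy_bits([0.5, 0.5])      # equiprobable 0's and 1's
biased_binary = entropy_bits([0.9, 0.1])    # a source that favours one symbol
uniform_decimal = entropy_bits([0.1] * 10)  # equiprobable decimal digits

def total_information_bits(n_symbols, bits_per_symbol):
    """Information accumulated over n symbols: it grows without bound in n."""
    return n_symbols * bits_per_symbol

print(fair_binary, biased_binary, uniform_decimal)
print(total_information_bits(1000, fair_binary))
```

The per-symbol entropy tops out at 1 bit for a binary source and log2(10) bits for a decimal source, while the total keeps growing for as long as one watches the source.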

Shannon's theory also defines the information as it is received, distinct
from the information that left the transmitter, when noise is present.

>
> "In the Shannon approach, however, the method of encoding objects is
> based on the presupposition that the objects to be encoded are
> outcomes of a known random source. It is only the characteristics of
> that random source that determine the encoding, not the
> characteristics of the objects that are its outcomes."
>
> http://homepages.cwi.nl/~paulv/papers/info.pdf

That quote is about Shannon's Coding Theorem, and is not relevant to the
definitions of or calculations for information or entropy under Shannon's
Mathematical Theory of Information. The quote is a reference to a means for
recoding the output of a random variable to maximize its entropy and
optimize channel usage.

>
> This means that Shannon information is more about the type of source
> it will take to transmit a particular type of string rather than the
> string itself.

No. Shannon information is about the symbol probabilities of the source in
question (not the "type" of source), the rate at which they are delivered,
and the interest of the receiver. You need a receiver attempting to copy the
information source for information to exist at all, in Shannon's model.

> So, to transmit a number like Pi, where all the
> symbols seem to appear with equal frequency, the source needed to
> transmit a sequence like pi will have to be able to produce all
> possible numbers with a similar character frequency.

No. That is absolute hogwash. A PC hooked up to the Internet can output any
8-bit character frequency pattern you program into it. That is not a
requirement on a source in order to produce the digits of pi. You simply have
to program a source to produce the digits of pi, in whatever numeral system
you decide to use. If it produces pi, the symbols output while pi is running
will tend toward uniform distribution simply because pi has that property.
It has nothing to do with the abilities of the source.

> In other words,
> this source must be able to produce not only pi, but all possible
> numbers in infinite sequence space - to include truly "random" and
> "non-computable" numbers like sigma.

Utter nonsense. Pi is a computable transcendental. A finite algorithm can
produce as many digits of pi as you like. Sources that produce the digits of
pi, to any arbitrary precision, can be realized by finite algorithms. That
is not true for uncomputable numbers.
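
To make "a finite algorithm can produce as many digits of pi as you like" concrete, here is a sketch of a streaming ("spigot") digit generator in the commonly circulated form attributed to Jeremy Gibbons. It uses only integer arithmetic; the function name `pi_digits` is just a label chosen here:

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time: 3, 1, 4, 1, 5, ..."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet; fold in another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )

print("".join(str(d) for d in islice(pi_digits(), 20)))
```

The program text stays the same size no matter how many digits are requested, which is the sense in which pi is computable.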

By the way, where you wrote "sigma" did you possibly mean "omega"?

>
> Again, it is all about the source or the reference that is chosen.
>
>> Strictly speaking, with a single string, the Shannon information is
>> only estimated.
>
> That's true, but the estimate is based on the type of source needed to
> produce such a string.

No. It is based on the measured digits in a realized string. That allows an
estimate of the probabilities of the symbols produced by the source.
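
A minimal sketch of that estimate, assuming the simplest (zeroth-order) model in which each symbol's probability is taken to be its observed frequency in the string:

```python
from collections import Counter
import math

def estimate_entropy(s):
    """Estimate symbol probabilities and entropy (bits/symbol) from one string."""
    counts = Counter(s)
    n = len(s)
    probs = {sym: c / n for sym, c in counts.items()}
    h = -sum(p * math.log2(p) for p in probs.values())
    return probs, h

for sample in ("0101010101", "0000000000", "0001"):
    print(sample, estimate_entropy(sample))
```

Note that the strictly alternating string comes out at the full 1 bit per symbol under this estimate even though it is perfectly predictable, which illustrates why equiprobable symbols alone do not make a string random.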

R. Baldwin

Jul 25, 2007, 11:03:08 PM
"Seanpit" <sea...@gmail.com> wrote in message
news:1185381379.4...@i13g2000prf.googlegroups.com...

> On Jul 25, 9:01 am, Mark VandeWettering <wetter...@attbi.com> wrote:
[snip]

I have to side with Sean on this one. In Turing's usage, pi falls under the
definition of a computable number. Turing explicitly states this in
paragraph 2 of his classic paper, "On Computable Numbers, with an
Application to the Entscheidungsproblem."

"In sections 9, 10 I give some arguments with the intention of showing that
the computable numbers include all numbers which could naturally be regarded
as computable. In particular, I show that certain large classes of numbers
are computable. They include, for instance, the real parts of all algebraic
numbers, the real parts of the zeros of the Bessel functions, the numbers
pi, e, etc. The computable numbers do not, however, include all definable
numbers, and an example is given of a definable number which is not
computable."
http://www.abelard.org/turpap2/tp2-ie.asp

Turing's definition of computable number is a bit farther down, but I find
its language a bit awkward. The Wikipedia article has a more directly stated
definition:

"In mathematics, theoretical computer science and mathematical logic, the
computable numbers, also known as the recursive numbers or the computable
reals, are the real numbers that can be computed to within any desired
precision by a finite, terminating algorithm."
http://en.wikipedia.org/wiki/Computable_number


Bob Berger

Jul 26, 2007, 12:15:58 AM
In article <1185406810....@e9g2000prf.googlegroups.com>, Seanpit
says...

Well then, your formula has a problem. Every string of finite length can be
generated by an infinite number of simple algorithms, depending of course on
what you mean by a "simple" algorithm.

Challenge: Provide us with a 10 position string that is not the reproducible
product of any simple algorithm.

Bob

< snip >

>Sean Pitman
>www.DetectingDesign.com
>

fropome

Jul 26, 2007, 6:08:25 AM
On 26 Jul, 00:02, Seanpit <seanpitnos...@naturalselection.0catch.com>
wrote:


I see- so when you said:
"All the formula is intended to show is the number of sequences in
sequence space at a given Hamming Distance."

you only meant the right hand part of the formula.
The whole formula for X = 2 is therefore the total number of possible
sequences minus the number of sequences at a given HD from a given
string. What is that supposed to represent? Wouldn't it be more useful
to use a ratio? You claim that your X > 2 formula produces an 'odds
ratio'- what is it a ratio of? Why doesn't the formula just have that
ratio instead of being A - B ?

Your formula for X > 2 is probably supposed to be a generalisation of
your formula for X = 2, but it isn't. Why have you used log(base2)? It
appears that you have chosen it just because it simplifies into your X
= 2 formula, but this isn't really enough. You need to prove it does
what you say in order for people to accept that the formula does what
you say.
For example, I could claim that the formula:
CSI: X^n - (( log(base3)(X^n)! / (log(base 3)(X^n) - hd)! * hd!)

was the correct formula when X > 2.

how would you know which of us was right? This is why you must _prove_
what you say. Seriously- I'm not making this up to be awkward... well
not just to be awkward anyway. If you've taken it from somewhere else
then show us where.
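
For small cases the question can be settled by brute force. The sketch below counts, by enumeration, the strings at a given Hamming distance from a fixed reference and compares that with two candidates. One is the standard combinatorial count C(n, hd) * (X - 1)^hd, which is not from this thread. The other is the thread's X > 2 expression, read here as the binomial coefficient C(log2(X^n), hd); that reading is an assumption, and it is only evaluated for X a power of two so that log2(X^n) is an integer:

```python
from itertools import product
from math import comb

def brute_count(X, n, hd):
    """Enumerate all X^n strings; count those at distance hd from 00...0."""
    ref = (0,) * n
    return sum(
        1
        for s in product(range(X), repeat=n)
        if sum(a != b for a, b in zip(s, ref)) == hd
    )

def closed_form(X, n, hd):
    """Choose the hd mismatched positions, then one of X-1 wrong symbols each."""
    return comb(n, hd) * (X - 1) ** hd

def thread_count(X, n, hd):
    """The thread's expression, read as C(log2(X^n), hd) (an assumed reading)."""
    if X < 2 or X & (X - 1):
        raise ValueError("only evaluated here for X a power of two")
    m = n * (X.bit_length() - 1)  # log2(X^n) = n * log2(X)
    return comb(m, hd)

for X, n in ((2, 5), (4, 3)):
    for hd in range(n + 1):
        print(X, n, hd, brute_count(X, n, hd), closed_form(X, n, hd),
              thread_count(X, n, hd))
```

For X = 2 all three columns agree. For X = 4, n = 3, hd = 1 the enumeration and the standard count both give 9, while the log(base2) reading gives 6, and the log-based counts summed over hd = 0..3 come to 42 rather than the 64 strings that exist.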

I'm not arguing with you because I'm scared of being wrong- even if
your formula can be proven it's got a huge distance to go before it
does what you claim- I'm arguing with you because until you've proven
that formula you can't even begin.


People have criticised creationists for starting with their conclusion
and then trying to prove it. Creationists claim that they aren't doing
this- that they have a valid theory and that they are testing it...
they claim it fits the information better than evolution.
In science you have to make a hypothesis and then you test it. This
hypothesis should be based on the evidence, but you are making a leap
and then trying to prove that you are right or wrong (I think the 'or
wrong' bit is where creationists are getting confused). You've taken a
poor version of this approach and applied it to mathematics, but this
doesn't work. In maths you don't make a guess and then try to back it
up- you build upwards from what you already know. We don't guess our
formulae and then try to show that they work for all the numbers we
try- we prove that they are true for all cases by using algebra.

It is possible that you have done this, or that you are using someone
else's proof- but you _must_ show us this proof before what you say has
any mathematical validity.

jensp...@hotmail.com

Jul 26, 2007, 6:28:46 AM
Hey smartass, why didn't you comment on my answer last time you tried
starting a thread on this subject?
That would have saved you a lot of thinking as I actually have a clue
about what I'm talking about.

Similarity of two symbol strings can be measured as the size of the
smallest general Turing machine able to convert one string into the
other. I.e. by Kolmogorov complexity.

J.O.

R. Baldwin

Jul 26, 2007, 9:58:09 AM
<jensp...@hotmail.com> wrote in message
news:1185445726....@r34g2000hsd.googlegroups.com...

The Kolmogorov Complexity of a string is the length of the shortest program
on a reference Universal Turing Machine that produces that string. The size
of the Universal Turing Machine is not part of the equation. You can't
represent the size of a Universal Turing Machine without bringing another
machine into the discussion. The Universal Turing Machine converts the
program into the string, but the program and the string are not, in general,
similar.

Having similar Kolmogorov Complexity does not generally translate into
string similarity. For example, if we feed into a Universal Turing Machine
the kerned program
(p)(n), where (p) is the algorithm for a hash table and (n) is an argument
to the program, we would hardly expect similar strings to be produced for
n=2 vs. n=3. Neither would we get similar strings if (p) was a function like
"n squared". If (p) was an algorithm of the type "print the first n symbols
of a sequence" or "print the first million symbols of a sequence, then erase
n symbols" we would expect a tendency for similar Kolmogorov Complexity for
similar strings, but it won't always hold because there are so many other
ways to produce each substring in the sequence.
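
The first case can be sketched as follows, using SHA-256 as a stand-in for the hashing program (p); that choice, and the helper names, are assumptions made for illustration. The two "programs" differ only in the argument n, so their descriptions are almost the same length, yet their outputs are compared position by position:

```python
import hashlib

def run_program(n):
    """Stand-in for the program (p)(n): hash the decimal text of n."""
    return hashlib.sha256(str(n).encode("ascii")).hexdigest()

def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

out2, out3 = run_program(2), run_program(3)
print(out2)
print(out3)
print(hamming_distance(out2, out3), "of", len(out2), "hex characters differ")
```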

Don Cates

Jul 26, 2007, 10:12:36 AM
On Wed, 25 Jul 2007 17:32:11 -0700, hersheyh <hers...@yahoo.com>
posted:

It occurs to me that according to the 'Pitman definition', a high CSI
in application to genetic sequences supports the evolutionary scenario.
After all, if a 'test' (is it designed?) sequence is similar to other
existing (not designed?) sequences then it has a high 'Pitman' CSI and
is most likely to have a similar background (not designed). Whereas
having a small 'Pitman' CSI doesn't tell us anything.

So says Pitman: "That is why a high CSI is helpful, while a low CSI
doesn't mean much."

>> < snip >

--
Don Cates ("he's a cunning rascal" - PN)

Mark VandeWettering

Jul 26, 2007, 11:15:50 AM

I will cede the point, and apologize for the criticism. While I find
the terminology rather odd, I understand why it arises: Turing was trying
to demonstrate that the digits of pi are in fact computable, while the
vast majority of real numbers are not so computable. That is either a
profound or a simple result, depending upon your perspective.

Mark

Seanpit

Jul 26, 2007, 11:31:38 AM
On Jul 25, 9:15 pm, Bob Berger <s...@eskimo.com> wrote:
> In article <1185406810.805682.60...@e9g2000prf.googlegroups.com>, Seanpit

A simple algorithm is one that requires significantly fewer bits than
the string itself and does not require more bits as the string it
produces grows in size - like the various algorithms for producing a
string like pi or the square root of 2 or a string of all zeros or
01s. These are all examples of "simple" algorithms.
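
A rough way to probe the "fewer bits than the string itself" part of this definition is to run a general-purpose compressor over the string. The sketch below uses zlib for that; this is only a proxy and an assumption of this example, since the compressed size is merely an upper bound on the length of the shortest description:

```python
import random
import zlib

def compressed_size(data):
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))

repetitive = b"01" * 500  # 1000 bytes of a repeating pattern

rng = random.Random(0)    # fixed seed so the run is repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print(len(repetitive), compressed_size(repetitive))
print(len(noisy), compressed_size(noisy))
```

The repeating pattern shrinks to a small fraction of its length while the pseudo-random bytes do not shrink at all. A compressor of this kind is not expected to discover a short program for the digits of pi, so a poor compression ratio does not show that no simple algorithm exists.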

> Challenge: Provide us with a 10 position string that is not the reproducible
> product of any simple algorithm.
>
> Bob

Sean Pitman
www.DetectingDesign.com

Seanpit

Jul 26, 2007, 11:57:12 AM
On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > The reference sequences are determined, by you, ahead of time before
> > you go out to analyze any other sequences. The reference sequences
> > are based on non-random strings that are known to be produced by
> > simple algorithms - like pi or like 0101010 . . .
>
> IOW, you would know it if the SETI signal were repeated digits of pi
> in base 10, but would not be able to recognize pi in base 2 or the
> other reference you give. Using *your* idea, you would declare any
> other signal as "random" and unrelated to the 'reference'. Is *that*
> what you claim that SETI is doing?

As I've pointed out many times now, a match to a reference string, by
itself, is not enough to detect ET. A maximum Pitman CSI number does
NOT equal ET or ID for that matter. What it does indicate is non-
random bias. Try to remember this point this time.

Now, is it possible for a set of reference strings to miss a non-
random sequence? Certainly! In fact, it is impossible to rule out
this possibility. No one can do it - not SETI scientists, not
anthropologists, biologist, chemists, physicists, or even IDists. No
one. It is impossible.

> > After you have your set of reference strings, you can compare incoming
> > sequences to your set of reference sequences to see if the incoming
> > sequences are likely to be non-random in origin.
>
> Again, you would only be able to detect 'targets' that were near
> enough to your *biased* selection of 'reference' sequences to register
> as 'sufficiently close'.

That's right . . .

> > > > Also, the selection of the reference string must be done without any
> > > > knowledge ahead of time of the test string. The choice must be
> > > > completely independent.
>
> > > IOW, the *reference* string must be a *randomly* chosen sequence out
> > > of total sequence space.
>
> > No. The reference string must be chosen based on knowledge that it is
> > not random - i.e., the reproducible product of a simple algorithm.
>
> Fractals are generated by simple algorithms. So, for that matter, is
> a pathway in which you have occasional random mutation and fixation of
> the result as a second rarer event. But those are not
> *reproducible*. Does that mean you are *specifically* ruling out
> evolutionary algorithms *arbitrarily* by requiring a determinative
> result rather than a probabilistic one? What would be the 'reference'
> sequence for proteins? Because the same functional protein in
> different organisms has different sequences, and sometimes
> dramatically different ones, you cannot claim that any particular
> available sequence is a reproducibly determined product of a simple
> algorithm. If your CSI calculation is going to have meaning for
> evolution, you do have to tell us *which* sequence for, say, beta
> globin of hemoglobin is the "reproducible result of a simple
> algorithm", don't you? Is it the human gamma-G? gamma-A? Embryonic?
> Adult?

Functional systems are known to be non-random because of their
functional properties - properties that cannot be produced by just any
randomly produced sequence.

Beyond this, although a bit more complicated, a significant match for
a reference string that is "apparently random" does indicate non-
random production of one or the other or both the test string and the
reference string.

> > There is no "target". There are only test strings that you compare to
> > your reference strings. If the test strings match one of your
> > reference strings, to a high level of CSI, the hypothesis of non-
> > random origin is supported.
>
> Then the result you get is entirely dependent on which sequences you
> *arbitrarily* chose as your 'reference' strings. How can you be sure
> that your choice of all the 'reference' strings you *arbitrarily*
> chose to look at will catch the 'intelligently designed' sequence you
> 'test'?

First off, the CSI calculation isn't about detecting ET or ID. Let me
make that very clear once more. It is about detecting non-random
bias.

Beyond this, you can't be sure to catch all non-randomly produced
sequences. It is actually impossible to detect all such sequences as
noted above. That is the nature of science. Perfection cannot be
achieved. That is what makes science useful. If perfection could be
achieved, science would no longer be needed.

> And how will you, if you are too broad in your *arbitrary*
> choices, prevent false positives?

You can't prevent false positives with absolute perfection - only with
a high degree of predictive value.

> And if you are too narrow in your *arbitrary* choices of 'references'
> aren't you going to ensure many false negatives?

You can't insure against false negatives with perfection either.
Again, that's impossible in science.

> > > How would you choose the 'target' *after* you have "independently"
> > > chosen the 'reference'?
>
> > You don't choose the test string. Any string could be tested by the
> > reference strings - any string at all.
>
> You must *really* be brilliant if you can think of all the 'reference'
> strings that not only a non-human ET might send as a signal, but also
> all the protein 'reference' sequences that have *ever* existed.
> Otherwise I cannot think of any way that your test would not wind up
> being hit-or-miss and not much better than dumb luck.

Though not perfect, the hits are much better than dumb luck - and
that's the value of science. Science does not require perfection or
rule out the possibility of being wrong. As a "scientist" you should
know this already.

Sean Pitman
www.DetectingDesign.com

Mark Isaak

Jul 26, 2007, 3:14:39 PM
On Wed, 25 Jul 2007 22:23:33 +0000, Seanpit wrote:

> On Jul 25, 11:15 am, Mark Isaak <eci...@earthlink.net> wrote:

>> [...]


>> 2) For each rock, look for those human-made features.
>>
>> Can you give me *any* reason to prefer your method over mine?
>
> You just used CSI here. If the rocks you found all looked like they
> were the result of random production and had no evident bias that you
> could detect, you wouldn't be able to hypothesize any sort of artifact
> at all. You see, you must first detect bias before you can move on to
> propose any sort of origin for that bias.

I challenge you to find one rock, anywhere in the world, which *is* the
result of random production and has no evident bias. They all show
biases in composition, even if you manage to find some without biased
fracture planes, grain size, layering, erosion patterns, etc.



> Since bias is evident in certain types of artifacts, this bias can be
> analyzed by comparing it to what known non-deliberate vs. deliberate
> forces in nature can achieve. If the bias goes significantly beyond
> what known non-deliberate forces can achieve, yet remains within the
> realm of what deliberate forces can achieve, your hypothesis of
> deliberate artifact gains useful predictive value.
>
> Again, you can't detect artifact until you first detect non-random bias.

Bias is everywhere, already, by nature. You have narrowed your starting
point to, literally, the entire universe. How is that useful? The
starting point was smaller *before* applying CSI. CSI *removes*
predictive value.

--

Seanpit

Jul 26, 2007, 3:56:15 PM
On Jul 25, 7:07 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
wrote:
> "Seanpit" <sean...@gmail.com> wrote in message
> news:1185376351....@z24g2000prh.googlegroups.com...
>
> > On Jul 24, 8:20 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
> > wrote:
>
> >> > No, because the reference strings are chosen independent of the test
> >> > strings. Therefore a significant match between a reference and a test
> >> > string is good evidence of non-random production.
>
> >> That is not necessarily true. It is good evidence that the test string
> >> was not produced by a stationary random process with Uniform
> >> distribution, which is a much more restricted case.
>
> > If a perfect match happened to be to a reference string that had no
> > regular character repeats, like pi, this would be good evidence the
> > test string was not produced by a random process with uniform or non-
> > uniform distribution.
>
> I note that you've corrected this to "no regular character repeats".
>
> I would accept that, happening upon the digits of pi, one probably has an
> artificial pattern.
>
> I would not accept that the lack of regular character repeats in general
> implies what you say it does. Computable transcendental numbers are a
> special case. Most real numbers are algorithmically random, as defined by
> Chaitin, and there are no finite algorithms to compute their digits. These
> are the numbers that lack regular character repeats.

Although most real numbers are algorithmically random, finding a match
to one would be just as significant as finding a match to pi - to the
same number of digits. A match in either case to an independently
established reference string would indicate some form of non-random
bias.

> > Shannon information is determined by reference to a known "random"
> > source of string production - a source that produces maximum Shannon
> > information.
>
> No, no, no. That is very badly wrong. As I said above, Shannon modeled
> information sources as if they were random variables. Specifically, Markov
> random variables. That does not mean information sources really are Markov
> random variables. With very few exceptions, they are *not* Markov random
> variables. Shannon's theory only *pretends* information sources are Markov
> random variables because it makes the math easy, and it is a decent
> approximation that works pretty well.

Just because the reference is a "pretend" or hypothetical reference
doesn't mean it isn't a reference. A pretend reference is still a
reference.

> Furthermore, the amount of information produced by an information source
> depends on how surprised a receiver will be. This depends on the relative
> probabilities of the different symbols the information source can produce.

You mean the "pretend" information source? Right? Again, the amount
of receiver "surprise" depends upon the receiver's comparing what is
received to what is expected to be produced by the pretend information
source. And, there you have it - a reference is indeed required.

> There *is* no maximum on Shannon information. If you want more information,
> just watch the random variable for a longer time.

There is maximum Shannon information for a set finite period of time -
that's the point.

> It is the information *entropy* (average information) that can have a
> maximum.

You mean informational entropy has a maximum regardless of the period
of time involved. That's because there is in fact maximum SI for each
point in time. If there weren't, you couldn't calculate an average
over a span of time.

> The entropy is maximum if the information source produces symbols
> equiprobably.

Exactly. . . Again reference to this "source" is required.

> For a binary source, this means producing either 0's or 1's
> with a 50% probability. For a decimal source, this means producing any of
> the 10 digits with a 10% probability.

Right . . . Which is simply assumed to be the case via use of an
imaginary source that actually does this. In real life, however, this
cannot be perfectly assumed.

> Shannon's theory also defines the information as it is received, distinct
> from the information that left the transmitter, when noise is present.

Yes - also assuming the character of the "noise".

> > "In the Shannon approach, however, the method of encoding objects is
> > based on the presupposition that the objects to be encoded are
> > outcomes of a known random source. It is only the characteristics of
> > that random source that determine the encoding, not the
> > characteristics of the objects that are its outcomes."
>
> >http://homepages.cwi.nl/~paulv/papers/info.pdf
>
> That quote is about Shannon's Coding Theorem, and is not relevant to the
> definitions of or calculations for information or entropy under Shannon's
> Mathematical Theory of Information. The quote is a reference to a means for
> recoding the output of a random variable to maximize its entropy and
> optimize channel usage.

I don't see where you get this from this passage since the preceding
passage reads:

"Both theories aim at providing a means for measuring
'information'. They use the same unit to do this: the bit. In both
cases, the amount of information in an object may be interpreted as
the length of a description of the object."

This sounds to me like the authors are indeed talking about
definitions and measurements of "information". It is just that
Shannon information is concerned with the source (pretend or not)
while Kolmogorov complexity is concerned with the resulting string or
"object".

> > This means that Shannon information is more about the type of source
> > it will take to transmit a particular type of string rather than the
> > string itself.
>
> No. Shannon information is about the symbol probabilities of the source in
> question (not the "type" of source), the rate at which they are delivered,
> and the interest of the receiver. You need a receiver attempting to copy the
> information source for information to exist at all, in Shannon's model.

In order to propose symbol probabilities, you have to propose
something about a source that is able to produce said probabilities.

> > So, to transmit a number like Pi, where all the
> > symbols seem to appear with equal frequency, the source needed to
> > transmit a sequence like pi will have to be able to produce all
> > possible numbers with a similar character frequency.
>
> No. That is absolute hogwash. A PC hooked up to the Internet can output any
> 8-bit character frequency pattern you program into it.

Yes, and it can also output pi to the same number of digits.

> That is not a
> requirement on a source in order to produce the digits of pi. You simply have
> to program a source to produce the digits of pi, in whatever numeral system
> you decide to use. If it produces pi, the symbols output while pi is running
> will tend toward uniform distribution simply because pi has that property.
> It has nothing to do with the abilities of the source.

"The fundamental problem of communication is that of reproducing
at one point either exactly or approximately a message selected at
another point. Frequently the messages have meaning; that is they
refer to or are correlated according to some system with certain
physical or conceptual entities. These semantic aspects of
communication are irrelevant to the engineering problem. The
significant aspect is that the actual message is one selected from a
set of possible messages. The system must be designed to operate for
each possible selection, not just the one which will actually be
chosen since this is unknown at the time of design."

So, you see, if a system must be able to operate regardless of whether
pi or some other, equally probable number was chosen, it must be set
up to handle all possibilities that could be chosen at random.
Therefore, from the perspective of a receiver who does not yet know
what sequence is going to be sent, the receiver has to be able to
receive not only pi, but all other sequences that are equally
probable.

> > In other words,
> > this source must be able to produce not only pi, but all possible
> > numbers in infinite sequence space - to include truly "random" and
> > "non-computable" numbers like sigma.
>
> Utter nonsense. Pi is a computable transcendental. A finite algorithm can
> produce as many digits of pi as you like. Sources that produce the digits of
> pi, to any arbitrary precision, can be realized by finite algorithms. That
> is not true for uncomputable numbers.

That's true, but it seems like you are moving into KC here. Pi is
only one of an ensemble of possible messages where all messages are
equally probable. The fact that pi can be compressed into a simple
algorithm seems irrelevant from the perspective of SI.

"Shannon's classical information theory assigns a quantity of
information to an ensemble of possible messages. All messages in the
ensemble being equally probable, this quantity is the number of bits
needed to count all possibilities. This expresses the fact that each
message in the ensemble can be communicated using this number of bits.
However, it does not say anything about the number of bits needed to
convey any individual message in the ensemble."

http://homepages.cwi.nl/~paulv/papers/info.pdf

It seems like you are trying to do just that. In your argument for pi
being compressible to a finite algorithm, it seems like you are
arguing that the number of bits needed to convey pi is smaller than
the digit string of pi itself. While this is true, Shannon information
theory "does not say anything about the number of bits needed to
convey" an individual message such as pi. That notion seems to require
that one move into the realm of KC.
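
To make the quoted statement concrete, here is a minimal sketch of the ensemble quantity: with all messages equally probable, the number of bits is just log2 of the number of possible messages. The function name and the example sizes are illustrative assumptions, not anything taken from the paper.

```python
import math

def bits_for_ensemble(alphabet_size: int, length: int) -> float:
    """Bits needed to index every string of `length` symbols drawn from
    an alphabet of `alphabet_size` symbols, all strings equally probable.
    log2(alphabet_size ** length) == length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)

# 100 decimal digits (for example, the first 100 digits of pi, or any
# other 100-digit string: the ensemble measure does not distinguish them).
print(bits_for_ensemble(10, 100))

# 100 binary digits.
print(bits_for_ensemble(2, 100))
```

Note that the result depends only on the size of the ensemble. It is the same for the digits of pi as for any other string of that length, which is the distinction between the ensemble measure and the description length of one individual string.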

> By the way, where you wrote "sigma" did you possibly mean "omega"?

Yes . . . Omega.

> > Again, it is all about the source or the reference that is chosen.
>
> >> Strictly speaking, with a single string,
> >> the Shannon information is only
> >> estimated.
>
> > That's true, but the estimate is based on the type of source needed to
> > produce such a string.
>
> No. It is based on the measured digits in a realized string. That allows an
> estimate of the probabilities of the symbols produced by the source.

I'm not sure I follow. It almost seems like you are saying what I just
said in a different way.
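
A minimal sketch of what "estimating from the measured digits" could look like, assuming the simplest possible model (independent symbols, probabilities taken from observed frequencies). The function name is hypothetical and this is only one way to form such an estimate.

```python
from collections import Counter
import math

def empirical_entropy(s: str) -> float:
    """Estimate bits per symbol from the symbol frequencies in one string.
    This is a zeroth-order estimate: it ignores any ordering or pattern."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(empirical_entropy("0101010101"))  # two symbols, equally frequent
print(empirical_entropy("0000000000"))  # one symbol only
```

Under this assumption the perfectly regular string 0101010101 gets the same per-symbol estimate as a fair coin, because only the frequencies are measured, not the order.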

Sean Pitman
www.DetectingDesign.com

jensp...@hotmail.com

unread,
Jul 26, 2007, 4:46:51 PM7/26/07
to
On Jul 26, 3:58 pm, "R. Baldwin" <res0k...@nozirevBACKWARDS.net>
wrote:
> <jenspol...@hotmail.com> wrote in message

>
> news:1185445726....@r34g2000hsd.googlegroups.com...
>
> > Hey smartass, why didn't you comment on my answer last time you tried
> > starting a thread on this subject?
> > That would have saved you a lot of thinking as I actually have a clue
> > about what I'm talking about.
>
> > Similarity of two symbol strings can be measured as the size of the
> > smallest general Turing machine able to convert one string into the
> > other. I.e. by Kolmogorov complexity.
>
> > J.O.
>
> The Kolmogorov Complexity of a string is the length of the shortest program
> on a reference Universal Turing Machine that produces that string.

Okay, I wrote it briefly and imprecisely. But I see that you understood it
anyway. So no big need to complain.

> Having similar Kolmogorov Complexity does not generally translate into
> string similarity.

That's correct.
What I wrote is that similarity can be measured as the Kolmogorov
complexity of the program needed to turn one string into the other.
I guess it's really just a generalisation/formalisation of the widely
used measure of editing distance.
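
For reference, the ordinary editing distance mentioned here can be sketched as follows. This is the standard Levenshtein distance (insertions, deletions, substitutions), shown as an illustration only; it is not the Kolmogorov-based measure, which is not computable in general.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or keep
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```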

J.O.

jensp...@hotmail.com

unread,
Jul 26, 2007, 4:52:05 PM7/26/07
to

I guess it's really just a generalisation/formalisation of the widely
used measure of editing distance.

Anyway. What would be interesting would be to see Seanpit's comment on
this. He likes to pretend that this subject interests him, so he ought
to be interested in an insight into this that's deeper than anything
he has provided.

J.O.


hersheyh

unread,
Jul 26, 2007, 5:08:00 PM7/26/07
to
On Jul 26, 11:57 am, Seanpit <seanpitnos...@naturalselection.

0catch.com> wrote:
> On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
>
>
> > > The reference sequences are determined, by you, ahead of time before
> > > you go out to analyze any other sequences. The reference sequences
> > > are based on non-random strings that are known to be produced by
> > > simple algorithms - like pi or like 0101010 . . .
>
> > IOW, you would know it if the SETI signal were repeated digits of pi
> > in base 10, but would not be able to recognize pi in base 2 or the
> > other reference you give. Using *your* idea, you would declare any
> > other signal as "random" and unrelated to the 'reference'. Is *that*
> > what you claim that SETI is doing?
>
> As I've pointed out many times now, a match to a reference string, by
> itself, is not enough to detect ET. A maximum Pitman CSI number does
> NOT equal ET or ID for that matter. What it does indicate is non-
> random bias. Try to remember this point this time.

'Non-random bias' is apparently nothing more than 'degree of
similarity'. The more similar the 'reference' and 'target' sequences
(with both being nothing other than an arbitrary choice on your part)
are to each other, the higher the Pitman CSI number. Of course, you
do muck it up with your first term, the size of total sequence space,
which is basically irrelevant and tells us nothing of any utility.

> Now, is it possible for a set of reference strings to miss a non-
> random sequence?

Huh? You aren't claiming that your set of reference strings is able
to *identify* a 'random' sequence at all unless your claim is that you
can identify and use all 'non-random' (whatever that means) sequences
as your reference set. I say whatever you mean by 'non-random'
because the numbers in the sequence for pi are about as random as they
come. Take any stretch of numbers in pi and see if they can predict
any other similarly sized non-overlapping stretch of numbers in pi.
There is no repeatability in pi and thus the string of numbers in pi
is pretty much *random* despite being predictable by a simple
algorithm. And simple algorithms produce fractals. And a pattern of
random mutation and neutral drift over time would also not produce a
*predictable* determinative single result. If you ran the experiment
over again you would get a different result for such a process.

All your numerology can *really* do is identify whether or not a
sequence is reasonably close to one or another of the sequences you
chose to be 'reference' sequences; that is, it identifies degree of
similarity.

There actually are programs that can do this much better. And they
can handle more than a single 'reference' and a single 'target'. In
fact, they can arrange sequences in a nested hierarchy of similarity
on the assumption of the number of single event changes required to
produce a pattern. They come with terms like "maximal parsimony".
And these programs are actually used to identify the nature of changes
in actual proteins (usually controlled for function) in actual
organisms. Again, the problem for your brand of creationism is not,
in general, the rarity of 'novel' functions. Those are very rare
indeed. It is the vast amount of difference that is selectively
effectively neutral but which produces change in patterns *of
similarity* that are so closely related to each other that they
*cannot* be due to chance. The only non-chance explanations that make
sense are historical (and largely vertical) descent, which requires
the time-frames that geology gives us, and deliberate deception by a
designer, if the time-frame is too short.

> Certainly! In fact, it is impossible to rule out
> this possibility. No one can do it - not SETI scientists, not
> anthropologists, biologists, chemists, physicists, or even IDists. No
> one. It is impossible.

IOW, you will, in fact, generate *many* false negatives where you will
claim that some 'target' sequence cannot be derived from any of the
'reference' sequences you have tested because you have no idea what
'reference' sequences to use and are simply pulling them out of yer
arse in the first place. Let's try the reference sequence for pi in
base ten! No. I don't see any signal sufficiently close to that.
Let's try the reference sequence for pi to the base two. No. I don't
see any signals close to that. Let's try pi to the base seven...

> > > After you have your set of reference strings, you can compare incoming
> > > sequences to your set of reference sequences to see if the incoming
> > > sequence is likely to be non-random in origin.
>
> > Again, you would only be able to detect 'targets' that were near
> > enough to your *biased* selection of 'reference' sequences to register
> > as 'sufficiently close'.
>
> That's right . . .

And more importantly you are NOT, repeat NOT, determining anything
about the 'randomness' of the 'tested' sequences. You are ONLY
determining how close they are to one or another of your 'reference'
sequences. Closeness to a 'reference' sequence is NOT a measure of
the randomness of the 'target' sequence. It is ONLY a measure of
similarity between the two sequences.
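
For concreteness, the similarity measure at issue, the Hamming distance (hd), can be sketched in a few lines. The function name and example strings are illustrative assumptions, not anything posted in this thread.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

reference = "0101010101"
test_string = "0101110101"
print(hamming_distance(reference, test_string))
```

The function compares a test string to one chosen reference and nothing else, so the number it returns says how close the two strings are, not how either string was produced.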

Actually, and in fact, *functional* systems can be produced randomly.
A random sample of about 10^17 50-mer RNAs has, almost certainly, a
molecule with RNA ligase activity, one with polynucleotide kinase
activity, and undoubtedly several other activities. You will note
that 10^17 molecules is only about 0.17 micromole of RNA. Proteinoids,
generated by heat, also have enzymatic activities.

What you mean is that not all sequences have utility to a particular
organism. That doesn't tell us anything about the 'randomness' of the
sequence as a sequence.

> Beyond this, although a bit more complicated, a significant match for
> a reference string that is "apparently random" does indicate non-
> random production of one or the other or both the test string and the
> reference string.

You mean like the fact that many *different* proteins and even more
proteins with *different* sequences serving different but related
functions (like myoglobin, alpha globin, beta globin) have some degree
of *sequence* homology beyond that expected by chance alone or that
many different proteins share structural moieties more than would be
expected by chance alone is evidence that they were produced from
common ancestors? [Again, there is more structural homology than
sequence homology.]

> > > There is no "target". There are only test strings that you compare to
> > > your reference strings. If the test strings match one of your
> > > reference strings, to a high level of CSI, the hypothesis of non-
> > > random origin is supported.
>
> > Then the result you get is entirely dependent on which sequences you
> > *arbitrarily* chose as your 'reference' strings. How can you be sure
> > that your choice of all the 'reference' strings you *arbitrarily*
> > chose to look at will catch the 'intelligently designed' sequence you
> > 'test'.
>
> First off, the CSI calculation isn't about detecting ET or ID. Let me
> make that very clear once more. It is about detecting non-random
> bias.

No, it is about identifying degree of similarity with a 'reference'
sequence. I have no idea of what sort of "randomness" or "non-random
bias" you are talking about. If I choose a random sequence as my
'reference', I may or may not find another sequence with some
similarity to that random sequence. How does the CSI number tell me
shit about "non-random bias" in such a case? It would only tell me
that the sequence I call the 'target' is or is not close to the
sequence I call the 'reference'.

But since you do say that 'reference' sequences are not *randomly*
chosen, but are only chosen on the basis of their being produced as a
single determinative *sequence* by a simple algorithm (which excludes
sequences produced by evolution by neutral drift, even though
evolution by neutral drift is a simple algorithm, because evolution
does not produce a single determinative sequence; it produces a
different sequence each time in a probabilistic fashion), I fail to
see the utility of this number for anything having to do with biology.

> Beyond this, you can't be sure to catch all non-randomly produced
> sequences. It is actually impossible to detect all such sequences as
> noted above. That is the nature of science. Perfection cannot be
> achieved. That is what makes science useful. If perfection could be
> achieved, science would no longer be needed.

You need to convince me that your CSI is a better way of detecting
sequence similarity (which is all it seems to be able to do) than hd
alone is. And then you need to convince me that it is better at
identifying the patterns of protein sequence similarity than maximal
parsimony methods are.

> > And how will you, if you are too broad in your *arbitrary*
> > choices, prevent false positives?
>
> You can't prevent false positives with absolute perfection - only with
> a high degree of predictive value.
>
> > And if you are too narrow in your
> > *arbitrary* choices of 'references'
> > aren't you going to ensure many
> > false negatives?
>
> You can't insure against false negatives with perfection either.
> Again, that's impossible in science.

But since all you are measuring is degree of similarity between
sequences, the choice of 'reference' sequence is crucial. And making
extravagant claims that *evolutionary mechanisms* cannot work based on
such limited knowledge is a bit of intellectual arrogance.


>
> > > > How
> > > > would you choose the 'target' *after*
> > > > you have "independently chosen
> > > > the 'reference'?
>
> > > You don't choose the test string. Any string could be tested by the
> > > reference strings - any string at all.
>
> > You must *really* be brilliant if you can think of all the 'reference'
> > strings that not only a non-human ET might send as a signal, but also
> > all the protein 'reference' sequences that have *ever* existed.
> > Otherwise I cannot think of any way that your test would not wind up
> > being hit-or-miss and not much better than dumb luck.
>
> Though not perfect, the hits are much better than dumb luck - and
> that's the value of science. Science does not require perfection or
> rule out the possibility of being wrong. As a "scientist" you should
> know this already.

Yet you *regularly* reject the repeated evidence of sequence
similarity as evidence of *evolutionary* relatedness. Isn't that a
bit more than a double standard on your part? Isn't it delusional?
>
> Sean Pitman
> www.DetectingDesign.com

Seanpit

unread,
Jul 26, 2007, 5:34:42 PM7/26/07
to
On Jul 26, 12:14 pm, Mark Isaak <eci...@earthlink.net> wrote:

> >> Can you give me *any* reason to prefer your method over mine?
>
> > You just used CSI here. If the rocks you found all looked like they
> > were the result of random production and had no evident bias that you
> > could detect, you wouldn't be able to hypothesize any sort of artifact
> > at all. You see, you must first detect bias before you can move on to
> > propose any sort of origin for that bias.
>
> I challenge you to find one rock, anywhere in the world, which *is* the
> result of random production and has no evident bias. They all show
> biases in composition, even if you manage to find some without biased
> fracture planes, grain size, layering, erosion patterns, etc.

Biases are certainly produced in many materials naturally. However,
certain materials do not show CSI that comes very close to the maximum
possible CSI - like the surface irregularities of a rock made out of,
say, granite.

> > Since bias is evident in certain types of artifacts, this bias can be
> > analyzed by comparing it to what known non-deliberate vs. deliberate
> > forces in nature can achieve. If the bias goes significantly beyond
> > what known non-deliberate forces can achieve, yet remains within the
> > realm of what deliberate forces can achieve, your hypothesis of
> > deliberate artifact gains useful predictive value.
>
> > Again, you can't detect artifact until you first detect non-random bias.
>
> Bias is everywhere, already, by nature. You have narrowed your starting
> point to, literally, the entire universe. How is that useful? The
> starting point was smaller *before* applying CSI. CSI *removes*
> predictive value.

Not at all. If everything truly appeared random to you to the same
degree, you wouldn't be able to detect artifact at all. It is because
different materials express different amounts of bias that you can in
fact detect artifact with better than even odds of success.

> Mark Isaak eciton (at) earthlink (dot) net
> "Voice or no voice, the people can always be brought to the bidding of
> the leaders. That is easy. All you have to do is tell them they are
> being attacked, and denounce the pacifists for lack of patriotism and
> exposing the country to danger." -- Hermann Goering

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 26, 2007, 6:01:24 PM7/26/07
to
On Jul 26, 2:08 pm, hersheyh <hershe...@yahoo.com> wrote:
> On Jul 26, 11:57 am, Seanpit <seanpitnos...@naturalselection.
>
> 0catch.com> wrote:
> > On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > > The reference sequences are determined, by you, ahead of time before
> > > > you go out to analyze any other sequences. The reference sequences
> > > > are based on non-random strings that are known to be produced by
> > > > simple algorithms - like pi or like 0101010 . . .
>
> > > IOW, you would know it if the SETI signal were repeated digits of pi
> > > in base 10, but would not be able to recognize pi in base 2 or the
> > > other reference you give. Using *your* idea, you would declare any
> > > other signal as "random" and unrelated to the 'reference'. Is *that*
> > > what you claim that SETI is doing?
>
> > As I've pointed out many times now, a match to a reference string, by
> > itself, is not enough to detect ET. A maximum Pitman CSI number does
> > NOT equal ET or ID for that matter. What it does indicate is non-
> > random bias. Try to remember this point this time.
>
> 'Non-random bias' is apparently nothing more than 'degree of
> similarity'. The more similar the 'reference' and 'target' sequences
> (with both being nothing other than an arbitrary choice on your part)
> are to each other, the higher the Pitman CSI number.

How many times do I have to tell you that there is no "target" string
and that I only choose the reference string, not the test string? The
test and the reference strings are chosen independently . . . AND I
don't pick the test string(s).

> Of course, you
> do muck it up with your first term, the size of total sequence space,
> which is basically irrelevant and tells us nothing of any utility.

It tells the odds of a randomly produced test string ending up with a
match to the reference string. The larger the sequence space size, the
lower the odds of a randomly produced match.
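
A rough sketch of the odds in question, assuming every position of the test string is drawn uniformly and independently from X possible characters. It uses the standard count of strings within a given Hamming distance of a fixed reference, which is not necessarily identical to the CSI formula posted earlier; the function names are hypothetical.

```python
from math import comb

def strings_within(n: int, x: int, hd: int) -> int:
    """Number of length-n strings over x characters that differ from a
    fixed reference in at most hd positions."""
    return sum(comb(n, k) * (x - 1) ** k for k in range(hd + 1))

def match_probability(n: int, x: int, hd: int) -> float:
    """Chance that a uniformly random string lands within hd of the
    reference: favourable strings divided by sequence space size x**n."""
    return strings_within(n, x, hd) / x ** n

# Binary strings of length 100, allowing up to 5 mismatches.
print(match_probability(100, 2, 5))

# Same length and tolerance with a 4-letter alphabet.
print(match_probability(100, 4, 5))
```

For a fixed tolerance hd, the probability falls as either n or X grows, which is the sense in which a larger sequence space lowers the odds of a chance match.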

> > Now, is it possible for a set of reference strings to miss a non-
> > random sequence?
>
> Huh? You aren't claiming that your set of reference strings is able
> to *identify* a 'random' sequence at all

The method isn't set up to identify random sequences, but non-random
biased test sequences.

> unless your claim is that you
> can identify and use all 'non-random'
> (whatever that means) sequences
> as your reference set.

I specifically explained that identification of all non-random or
"biased" sequences is impossible.

> I say whatever you mean by 'non-random'
> because the numbers in the sequence for pi are about as random as they
> come.

That's not true. The numbers in the sequence for pi have a uniform
distribution, but the sequence itself is not random. It is perfectly
predictable and computable.
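
Both halves of this claim can be checked with a short program. The sketch below uses a well-known unbounded spigot algorithm for the decimal digits of pi (attributed to Gibbons), chosen here purely for illustration, and then tallies how often each digit appears. The 1000-digit sample size is an arbitrary assumption.

```python
from collections import Counter
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time
    using exact integer arithmetic (unbounded spigot algorithm)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit is now certain
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            # not enough information yet: consume another series term
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

digits = list(islice(pi_digits(), 1000))
counts = Counter(digits)
print(digits[:10])
print(sorted(counts.items()))
```

A few lines of code regenerate the same digits every time, while the tally shows each digit appearing roughly equally often in the sample.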

> Take any stretch of numbers in pi and see if they can predict
> any other similarly sized non-overlapping stretch of numbers in pi.
> There is no repeatability in pi and thus the string of numbers in pi
> is pretty much *random* despite being predictable by a simple
> algorithm.

The definition of "random" is "non-predictable". Therefore, since pi
is in fact predictable, it is non-random.

> And simple algorithms produce fractals.

Fractals are not random.

> And a pattern of
> random mutation and neutral drift over time would also not produce a
> *predictable* determinative single result.

There is no "also" since random mutations and neutral drift would
produce a non-predictable result - unlike fractals that are produced
by repetitions of a simple algorithm.

> If you ran the experiment
> over again you would get a different result for such a process.

If the result is not predictable, it is random from that perspective.

> All your numerology can *really* do is identify whether or not a
> sequence is reasonably close to one or another of the sequences you
> chose to be 'reference' sequences; that is, it identifies degree of
> similarity.

That's right. And this greater degree of similarity of the unknown
compared to the known is good evidence of non-random biased production
of one or the other or both.

> There actually are programs that can do this much better. And they
> can handle more than a single 'reference' and a single 'target'.

My program can handle as many references and targets as your computer
can hold.

> In
> fact, they can arrange sequences in a nested hierarchy of similarity
> on the assumption of the number of single event changes required to
> produce a pattern. They come with terms like "maximal parsimony".
> And these programs are actually used to identify the nature of changes
> in actual proteins (usually controlled for function) in actual
> organisms. Again, the problem for your brand of creationism is not,
> in general, the rarity of 'novel' functions. Those are very rare
> indeed. It is the vast amount of difference that is selectively
> effectively neutral but which produces change in patterns *of
> similarity* that are so closely related to each other that they
> *cannot* be due to chance. The only non-chance explanations that make
> sense are historical (and largely vertical) descent, which requires
> the time-frames that geology gives us, and deliberate deception by a
> designer, if the time-frame is too short.

I agree that the similar patterns are not due to chance - that they
are indeed biased since they have a relatively high CSI value. Of
course, as explained before, a high CSI, by itself, says nothing about
the likely origin of the bias. Your assumption that the bias was the
result of random mutation and function-based selection is not the only
option for the production of bias.

> > Certainly! In fact, it is impossible to rule out
> > this possibility. No one can do it - not SETI scientists, not
> > anthropologists, biologists, chemists, physicists, or even IDists. No
> > one. It is impossible.
>
> IOW, you will, in fact, generate *many* false negatives where you will
> claim that some 'target' sequence cannot be derived from any of the
> 'reference' sequences you have tested because you have no idea what
> 'reference' sequences to use and are simply pulling them out of yer
> arse in the first place.

The false positives will be extremely few relative to the true
positives. That's the strength of the CSI calculation. Again, the
reference sequences are chosen before the test sequences are presented
- completely independently.

> Let's try the reference sequence for pi in
> base ten! No. I don't see any signal sufficiently close to that.

But, if you did happen to see such a radio signal coming from outer space,
you would know that this signal was not the result of some random
process. That would be very useful if it ever happened.

Again, you have to be able to detect significant bias before you can
hope to detect any kind of artifact like ETI or ID of any kind.

> Let's try the reference sequence for pi to the base two. No. I don't
> see any signals close to that. Let's try pi to the base seven...

There are many different references you can include - not just based
on pi. It is just that they have to be independently derived.

> > > > After you have your set of reference strings, you can compare incoming
> > > > sequences to your set of reference sequences to see if the incoming
> > > > sequence is likely to be non-random in origin.
>
> > > Again, you would only be able to detect 'targets' that were near
> > > enough to your *biased* selection of 'reference' sequences to register
> > > as 'sufficiently close'.
>
> > That's right . . .
>
> And more importantly you are NOT, repeat NOT, determining anything
> about the 'randomness' of the 'tested' sequences. You are ONLY
> determining how close they are to one or another of your 'reference'
> sequences. Closeness to a 'reference' sequence is NOT a measure of
> the randomness of the 'target' sequence. It is ONLY a measure of
> similarity between the two sequences.

Similarity between two independently derived sequences is in fact
evidence of bias. This is in fact your own conclusion when you see
similarities between biological sequences. You assume a non-random
biased origin.

You can't have it both ways Howard. If you yourself use sequence
similarity as evidence of common origin then you can't argue that
sequence similarities are not a measure of non-random origin.

This really isn't that hard Howard. You're turning yourself into a
pretzel here.

< snip rest >

Sean Pitman
www.DetectingDesign.com


Seanpit

unread,
Jul 26, 2007, 6:01:42 PM7/26/07
to
On Jul 26, 2:08 pm, hersheyh <hershe...@yahoo.com> wrote:
> On Jul 26, 11:57 am, Seanpit <seanpitnos...@naturalselection.
>
> 0catch.com> wrote:
> > On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > > The reference sequences are determined, by you, ahead of time before
> > > > you go out to analyze any other sequences. The reference sequences
> > > > are based on non-random strings that are known to be produced by
> > > > simple algorithms - like pi or like 0101010 . . .
>
> > > IOW, you would know it if the SETI signal were repeated digits of pi
> > > in base 10, but would not be able to recognize pi in base 2 or the
> > > other reference you give. Using *your* idea, you would declare any
> > > other signal as "random" and unrelated to the 'reference'. Is *that*
> > > what you claim that SETI is doing?
>
> > As I've pointed out many times now, a match to a reference string, by
> > itself, is not enough to detect ET. A maximum Pitman CSI number does
> > NOT equal ET or ID for that matter. What it does indicate is non-
> > random bias. Try to remember this point this time.
>
> 'Non-random bias' is apparently nothing more than 'degree of
> similarity'. The more similar the 'reference' and 'target' sequences
> (with both being nothing other than an arbitrary choice on your part)
> are to each other, the higher the Pitman CSI number.

How many times do I have to tell you that there is no "target" string


and that I only choose the reference string, not the test string? The
test and the reference strings are chosen independently . . . AND I
don't pick the test string(s).

> Of course, you


> do muck it up with your first term, the size of total sequence space,
> which is basically irrelevant and tells us nothing of any utility.

It tells the odds of a randomly produced test string ending up with a


match to the reference string. The larger the sequence space size, the
lower the odds of a randomly produced match.

> > Now, is it possible for a set of reference strings to miss a non-


> > random sequence?
>
> Huh? You aren't claiming that your set of reference strings are able
> to *identify* a 'random' sequence at all

The method isn't set up to identify random sequences, but non-random
biased test sequences.

> unless your claim is that you


> can identify and use all 'non-random'
> (whatever that means) sequences
> as your reference set.

I specifically explained that identification of all non-random or
"biased" sequences is impossible.

> I say whatever you mean by 'non-random'


> because the numbers in the sequence for pi are about as random as they
> come.

That's not true. The numbers in the sequence for pi have a uniform


distribution, but the sequence itself is not random. It is perfectly
predictable and computable.

> Take any stretch of numbers in pi and see if they can predict


> any other similarly sized non-overlapping stretch of numbers in pi.
> There is no repeatablility in pi and thus the string of numbers in pi
> is pretty much *random* despite being predictable by a simple
> algorithm.

The definition of "random" is "non-predictable". Therefore, since pi


is in fact predictable, it is non-random.

> And simple algorithms produce fractals.

Fractals are not random.

> And a pattern of
> random mutation and neutral drift over time would also not produce a
> *predictable* determinative single result.

There is no "also" since random mutations and neutral drift would


produce a non-predictable result - unlike fractals that are produced
by repetitions of a simple algorithm.

> If you ran the experiment


> over again you would get a different result for such a process.

If the result is not predictable, it is random from that perspective.

> All your numerology can *really*do is identify whether or not a


> sequence is reasonably close to one or another of the sequences you
> chose to be 'reference' sequences; that is, it identifies degree of
> similarity.

That's right. And this greater degree of similarity of the unknown
compared to the known is good evidence of non-random biased production
of one or the other or both.

> There actually are programs that can do this much better. And they
> can handle more than a single 'reference' and a single 'target'.

My program can handle as many references and targets as your computer
can hold.

> In


> fact, they can arrange sequences in a nested hierarchy of similarity
> on the assumption of the number of single event changes required to
> produce a pattern. They come with terms like "maximal parsimony".
> And these programs are actually used to identify the nature of changes
> in actual proteins (usually controlled for function) in actual
> organisms. Again, the problem for your brand of creationism is not,
> in general, the rarity of 'novel' functions. Those are very rare
> indeed. It is the vast amount of difference that is selectively
> effectively neutral but which produces change in patterns *of
> similarity* that are so closely related to each other that they
> *cannot* be due to chance. The only non-chance explanations that make
> sense are historical (and largely vertical) descent, which requires
> the time-frames that geology gives us, and deliberate deception by a
> designer, if the time-frame is too short.

I agree that the similar patterns are not due to chance - that they


are indeed biased since they have a relatively high CSI value. Of
course, as explained before, a high CSI, by itself, says nothing about
the likely origin of the bias. Your assumption that the bias was the
result of random mutation and function-based selection is not the only
option for the production of bias.

> > Certainly! In fact, it is impossible to rule out
> > this possibility. No one can do it - not SETI scientists, not
> > anthropologists, biologists, chemists, physicists, or even IDists. No
> > one. It is impossible.
>
> IOW, you will, in fact, generate *many* false negatives where you will
> claim that some 'target' sequence cannot be derived from any of the
> 'reference' sequences you have tested because you have no idea what
> 'reference' sequences to use and are simply pulling them out of yer
> arse in the first place.

The false positives will be extremely few relative to the true
positives. That's the strength of the CSI calculation. Again, the
reference sequences are chosen before the test sequences are presented
- completely independently.

> Let's try the reference sequence for pi in
> base ten! No. I don't see any signal sufficiently close to that.

But, if you did happen to see a radiosignal coming from outer space,
you would know that this signal was not the result of some random
process. That would be very useful if it ever happened.

Again, you have to be able to detect significant bias before you can
hope to detect any kind of artifact like ETI or ID of any kind.

> Let's try the reference sequence for pi to the base two. No. I don't
> see any signals close to that. Let's try pi to the base seven...

There are many different references you can include - not just based
on pi. It is just that they have to be independently derived.

> > > > After you have your set of reference strings, you can compare incoming
> > > > sequences to your set of reference sequences to see if the incoming
> > > > sequence is likely to be non-random in origin.
>
> > > Again, you would only be able to detect 'targets' that were near
> > > enough to your *biased* selection of 'reference' sequences to register
> > > as 'sufficiently close'.
>
> > That's right . . .
>
> And more importantly you are NOT, repeat NOT, determining anything
> about the 'randomness' of the 'tested' sequences. You are ONLY
> determining how close they are to one or another of your 'reference'
> sequences. Closeness to a 'reference' sequence is NOT a measure of
> the randomness of the 'target' sequence. It is ONLY a measure of
> similarity between the two sequences.

Similarity between two independently derived sequences is in fact
evidence of bias.

Mark Isaak

unread,
Jul 26, 2007, 6:26:02 PM7/26/07
to
On Thu, 26 Jul 2007 21:34:42 +0000, Seanpit wrote:

> On Jul 26, 12:14 pm, Mark Isaak <eci...@earthlink.net> wrote:
>>
>> Bias is everywhere, already, by nature. You have narrowed your
>> starting point to, literally, the entire universe. How is that
>> useful? The starting point was smaller *before* applying CSI. CSI
>> *removes* predictive value.
>
> Not at all. If everything truly appeared random to you to the same
> degree, you wouldn't be able to detect artifact at all.

In other words, if artifacts looked nothing like artifacts, they would
look nothing like artifacts. Duh.

If everything *except* artifacts appeared random, then detecting
artifacts would be trivial. In fact, this false assumption is why you
expect CSI to be useful. But as I noted above, your assumption is not
merely false, it is so egregiously false that it is worse than no
assumption.

> It is because different materials express different amounts of bias
> that you can in fact detect artifact with better than even odds of
> success.

Wrong. It is because we know from experience which biases are produced
by humans and which are produced by natural processes that we can detect
artifacts with better than even odds. Amount of bias has nothing to do
with it. There are plenty of natural artifacts with extremely little
bias which we can easily distinguish from much more random human
artifacts.

If I give you, on one hand, a whole-wheat bagel with a bite taken from
it, and on the other, a rounded quartzite pebble from a stream bed, your
method would identify the pebble as (most likely) the artifact, but just
about any grade school student, wisely ignoring your counsel, would
correctly say that the bagel was man-made. How? Not by knowing about
randomness and biases, but by knowing about streams and bagels.

--

hersheyh

unread,
Jul 26, 2007, 9:52:25 PM7/26/07
to
On Jul 26, 6:01 pm, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> On Jul 26, 2:08 pm, hersheyh <hershe...@yahoo.com> wrote:
>
>
>
> > On Jul 26, 11:57 am, Seanpit <seanpitnos...@naturalselection.0catch.com> wrote:
> > > On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > > > The reference sequences are determined, by you, ahead of time before
> > > > > you go out to analyze any other sequences. The reference sequences
> > > > > are based on non-random strings that are known to be produced by
> > > > > simple algorithms - like pi or like 0101010 . . .
>
> > > > IOW, you would know it if the SETI signal were repeated digits of pi
> > > > in base 10, but would not be able to recognize pi in base 2 or the
> > > > other reference you give. Using *your* idea, you would declare any
> > > > other signal as "random" and unrelated to the 'reference'. Is *that*
> > > > what you claim that SETI is doing?
>
> > > As I've pointed out many times now, a match to a reference string, by
> > > itself, is not enough to detect ET. A maximum Pitman CSI number does
> > > NOT equal ET or ID for that matter. What it does indicate is non-
> > > random bias. Try to remember this point this time.
>
> > 'Non-random bias' is apparently nothing more than 'degree of
> > similarity'. The more similar the 'reference' and 'target' sequences
> > (with both being nothing other than an arbitrary choice on your part)
> > are to each other, the higher the Pitman CSI number.
>
> How many times do I have to tell you that there is no "target" string
> and that I only choose the reference string, not the test string?

As I remember, 'target' was your term, not mine. But since you prefer
'tested', I will use that term now.

> The
> test and the reference strings are chosen independently . . . AND I
> don't pick the test string(s).

But you do arbitrarily choose the set of 'reference' strings.

> > Of course, you
> > do muck it up with your first term, the size of total sequence space,
> > which is basically irrelevant and tells us nothing of any utility.
>
> It tells the odds of a randomly produced test string ending up with a
> match to the reference string.

Not really. The odds of a randomly produced test string ending up as
a match to the reference string is 1/total sequence space size, not
total sequence space minus some value involving hd. The reference
string is already arbitrarily chosen, so the odds of the reference
string is 1. The odds of any other sequence (or even the same
sequence), chosen randomly from a universe of total sequence space
that matches that reference sequence (assuming each sequence is
present only once), is 1/the size of total sequence space. There
would be no subtraction of anything having to do with hd for the
calculation of the odds of some randomly chosen sequence matching a
pre-chosen reference sequence.
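
To put numbers on that distinction, here is a minimal Python sketch. It assumes the X = 2 form of the formula as posted at the top of the thread; the function names are mine, purely for illustration.

```python
from fractions import Fraction
from math import comb


def pitman_csi_binary(n: int, hd: int) -> int:
    """The posted X = 2 formula: 2^n - n! / ((n - hd)! * hd!)."""
    return 2 ** n - comb(n, hd)


def exact_match_odds(alphabet_size: int, n: int) -> Fraction:
    """Odds that one uniformly random string equals a fixed reference."""
    return Fraction(1, alphabet_size ** n)


n = 10
print(pitman_csi_binary(n, 0))   # 1023, a count, not a probability
print(exact_match_odds(2, n))    # 1/1024
```

The first number grows with n, the second shrinks with n, and only the second is the odds of a random match.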

> The larger the sequence space size, the
> lower the odds of a randomly produced match.

I certainly agree that the odds of a match are lower as the size of
total sequence space increases. The odds of picking a match for any
arbitrarily chosen (even randomly chosen) 'reference' sequence (again
assuming that each sequence is present only once) is always going to
be 1/the total number of sequences in sequence space.

But then what is the term involving hd doing as a subtraction? If you
want to express the odds of finding any sequence within x hd units
away from your reference sequence, you would calculate the number of
sequences that are 0, 1, 2, 3, and .... x hd units away from your
reference sequence and *divide* that number by the size of total
sequence space. This would, however, be the same calculation and give
the same result regardless of whether the 'reference' sequence were a
sequence that has some special meaning to you (such as the first 100
digits of pi in base 10) or was randomly chosen from total sequence
space.
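
The calculation just described can be sketched as follows (Python, assuming every string in sequence space is equally likely; the function names are illustrative only). The number of strings at exactly d mismatches from any fixed reference of length n over X symbols is C(n, d) * (X - 1)^d.

```python
from fractions import Fraction
from math import comb


def strings_within(alphabet_size: int, n: int, x: int) -> int:
    """Count length-n strings at Hamming distance 0..x from a fixed reference."""
    return sum(comb(n, d) * (alphabet_size - 1) ** d for d in range(x + 1))


def odds_within(alphabet_size: int, n: int, x: int) -> Fraction:
    """Chance that a uniformly random string lands within x of the reference."""
    return Fraction(strings_within(alphabet_size, n, x), alphabet_size ** n)


print(strings_within(2, 10, 2))   # 1 + 10 + 45 = 56
print(odds_within(2, 10, 2))      # 7/128, i.e. 56/1024
```

Note that the reference string itself never appears as an argument, only its length and alphabet.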

But you seem to be saying that by choosing 'reference' sequences that
have meaning to you, you somehow change the odds of finding a match,
be that match a perfect match or one that includes any sequence within
x hd units of the 'reference'. That simply is not true.
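
A brute-force check of that point, as a small Python sketch (the two reference strings and the function name are made up for illustration): count every binary string of length 8 that falls within 2 mismatches of a patterned reference, then do the same for an arbitrary one.

```python
from itertools import product


def count_within(reference: str, alphabet: str, x: int) -> int:
    """Brute-force count of strings within Hamming distance x of reference."""
    total = 0
    for candidate in product(alphabet, repeat=len(reference)):
        mismatches = sum(a != b for a, b in zip(candidate, reference))
        if mismatches <= x:
            total += 1
    return total


patterned = "01010101"   # a reference with an obvious pattern
arbitrary = "01101110"   # an arbitrary string standing in for a random pick
print(count_within(patterned, "01", 2))   # 37
print(count_within(arbitrary, "01", 2))   # 37
```

Both counts are 1 + 8 + 28 = 37 out of 256, so the odds of a uniformly random string landing that close are the same for either reference.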

So, again, what *exactly* are you measuring here? Why is there this
subtraction and why do you think that by choosing particular
'reference' sequences you are somehow changing the odds of a match?

At best (and I find your equation pretty much meaningless), the hd
part of that equation does show us the degree of similarity of a test
sequence to a reference sequence. But there are certainly better ways
to do that.
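
For reference, the similarity part on its own is only a few lines. This is a minimal Python sketch with hypothetical function names; it handles equal-length strings only and is the crudest such measure, not one of the better alignment-based tools alluded to above.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def fraction_identical(a: str, b: str) -> float:
    """Share of positions that match (1.0 means identical sequences)."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return 1 - hamming_distance(a, b) / len(a)


test_seq = "3141592654"        # made-up test sequence
reference = "3141592653"       # first ten digits of pi
print(hamming_distance(test_seq, reference))    # 1
print(fraction_identical(test_seq, reference))
```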

> > > Now, is it possible for a set of reference strings to miss a non-
> > > random sequence?
>
> > Huh? You aren't claiming that your set of reference strings are able
> > to *identify* a 'random' sequence at all
>
> The method isn't set up to identify random sequences, but non-random
> biased test sequences.

It looks like it is set up to identify the degree of identity between
two sequences with the addition of a term that looks like a
calculation of total sequence space thrown in for no apparent reason.
But do explain what each part of your equation is measuring and why it
is relevant. And why you think that your choice of 'reference'
sequence changes the odds of something or other.

> > unless your claim is that you
> > can identify and use all 'non-random'
> > (whatever that means) sequences
> > as your reference set.
>
> I specifically explained that identification of all non-random or
> "biased" sequences is impossible.
>
> > I say whatever you mean by 'non-random'
> > because the numbers in the sequence for pi are about as random as they
> > come.
>
> That's not true. The numbers in the sequence for pi have a uniform
> distribution, but the sequence itself is not random. It is perfectly
> predictable and computable.
>
> > Take any stretch of numbers in pi and see if they can predict
> > any other similarly sized non-overlapping stretch of numbers in pi.
> > There is no repeatability in pi and thus the string of numbers in pi
> > is pretty much *random* despite being predictable by a simple
> > algorithm.
>
> The definition of "random" is "non-predictable". Therefore, since pi
> is in fact predictable, it is non-random.

Pi is calculated. The sequence of numbers in pi is not predictable as
a sequence.


>
> > And simple algorithms produce fractals.
>
> Fractals are not random.
>
> > And a pattern of
> > random mutation and neutral drift over time would also not produce a
> > *predictable* determinative single result.
>
> There is no "also" since random mutations and neutral drift would
> produce a non-predictable result - unlike fractals that are produced
> by repetitions of a simple algorithm.
>
> > If you ran the experiment
> > over again you would get a different result for such a process.
>
> If the result is not predictable, it is random from that perspective.

And the sequence of pi is random in that you cannot predict one
stretch from another. There is no repetitiveness.


>
> > All your numerology can *really*do is identify whether or not a
> > sequence is reasonably close to one or another of the sequences you
> > chose to be 'reference' sequences; that is, it identifies degree of
> > similarity.
>
> That's right. And this greater degree of similarity of the unknown
> compared to the known is good evidence of non-random biased production
> of one or the other or both.

Then why the heck do you need the term that involves total sequence
space as an addition (as opposed to a division, where it would at
least make *some* sense)?

Well, again, I have specifically not ruled out a particularly
malicious designer intent on misleading us into thinking that the
process is historical descent (which largely involves random mutation
and neutral drift at the sequence level). Again, the pattern
*specifically* looks at proteins that all have the *same* function, so
selection is effectively irrelevant or a minor feature of the
pattern. And the pattern that arises is the one that would be
predicted by random mutation (which certainly exists) and neutral
fixation (which is unpreventable *except* by selection). That the
*same* pattern repeats again and again and again for different
proteins, each with a different function, shows that the pattern is
not strongly related to *function*.

> > > Certainly! In fact, it is impossible to rule out
> > > this possibility. No one can do it - not SETI scientists, not
> > > anthropologists, biologists, chemists, physicists, or even IDists. No
> > > one. It is impossible.
>
> > IOW, you will, in fact, generate *many* false negatives where you will
> > claim that some 'target' sequence cannot be derived from any of the
> > 'reference' sequences you have tested because you have no idea what
> > 'reference' sequences to use and are simply pulling them out of yer
> > arse in the first place.
>
> The false positives will be extremely few relative to the true
> positives. That's the strength of the CSI calculation. Again, the
> reference sequences are chosen before the test sequences are presented
> - completely independently.
>
> > Let's try the reference sequence for pi in
> > base ten! No. I don't see any signal sufficiently close to that.
>
> But, if you did happen to see a radiosignal coming from outer space,
> you would know that this signal was not the result of some random
> process. That would be very useful if it ever happened.

Searching for a needle in a haystack is very difficult if you don't
have a magnet.


>
> Again, you have to be able to detect significant bias before you can
> hope to detect any kind of artifact like ETI or ID of any kind.

So, again, the *fact* that all the different beta globins of
hemoglobin in many different organisms have highly significant
similarity/identity tells me what? And that they don't have identical
CSI despite having the same function in different organisms tells me
what? And that the pattern of changes in sequence that is far and
away the best fit for the observed sequence differences also largely
fits the branching pattern of other proteins and also the
morphological branching (humans and chimps closest, other primates
more distant, reptiles more distant, etc.) proposed by historical
divergences tells me what? Oh, I know...all of this was designed by
an evil designer to fool us into thinking historical descent.

> > Let's try the reference sequence for pi to the base two. No. I don't
> > see any signals close to that. Let's try pi to the base seven...
>
> There are many different references you can include - not just based
> on pi. It is just that they have to be independently derived.

Can you choose a random sequence as the 'reference'? How would that
change the CSI calculation? How would it change the odds of any
randomly chosen sequence being within x hd units away (a value which
your CSI does NOT calculate)?

> > > > > After you have your set of reference strings, you can compare incoming
> > > > > sequences to your set of reference sequences to see if the incoming
> > > > > sequence is likely to be non-random in origin.
>
> > > > Again, you would only be able to detect 'targets' that were near
> > > > enough to your *biased* selection of 'reference' sequences to register
> > > > as 'sufficiently close'.
>
> > > That's right . . .
>
> > And more importantly you are NOT, repeat NOT, determining anything
> > about the 'randomness' of the 'tested' sequences. You are ONLY
> > determining how close they are to one or another of your 'reference'
> > sequences. Closeness to a 'reference' sequence is NOT a measure of
> > the randomness of the 'target' sequence. It is ONLY a measure of
> > similarity between the two sequences.
>
> Similarity between two independently derived sequences is in fact
> evidence of bias.

NO. It is a measure of similarity or dissimilarity. But if by bias
you mean similarity or dissimilarity, why call it 'bias'?

> This is in fact your own conclusion when you see
> similarities between biological sequences. You assume a non-random
> biased origin.

When I see similarity between biological sequences, I see
*similarity*. One possible explanation of *similarity* is common
ancestry. Another is common design. But in addition to *similarity*,
I also see differences. That is proteins that perform the same
*function* can differ from 'not at all' to 'so much that they don't
look like the same sequence at all' (usually, of course, without
changing structure nearly as much). Moreover, I can examine the
pattern of changes and determine the pathways that would produce those
differences that is most parsimonious. And I can do that again for a
different protein. And I can do it for morphology (but less well).
The most parsimonious explanation consistent with all this evidence is
common descent. For sequence information, specifically, the proposed
mechanism is largely mutation (which certainly exists) and neutral
fixation over long time frames. This is a process which *certainly*
happens in the absence of selection to prevent it. Selection actually
works, largely -- but crucially, not always, to *prevent* evolutionary
change. Sequence change, if any fraction of a sequence is selectively
neutral (and the existence of many different sequences that have the
same function is clear evidence that a substantial amount of effective
neutrality exists), cannot be prevented given sufficient time.

> You can't have it both ways Howard. If you yourself use sequence
> similarity as evidence of common origin then you can't argue that
> sequence similarities are not a measure of non-random origin.

The question is not that one can or cannot measure sequence
similarity. One certainly can (and that is the basis of sequence
homologies that support common descent -- or a deceiver designer). It
is whether what *you* are measuring with *your* CSI formula has any
meaning at all, other than some vague resemblance to measuring
something like similarity. I look at your formula as either a botched
attempt to re-invent the wheel and have some measure of 'sequence
similarity' you can call CSI or some bizarre idea that what you are
measuring really does change the odds depending on your arbitrary
choice of 'reference' to be just those sequences that have meaning to
you. Frankly, I don't have a clue as to what you think that formula
does. It looks like GIGO designed (and I know who this designer is
and what his motivations are) to produce appropriate hypothetical
numbers to the designer's need of the moment to me.


>
> This really isn't that hard Howard. You're turning yourself into a
> pretzel here.

No I'm not. I have explicitly said that there already are good
measures of sequence similarity, that they already have been used and
are used, and that I don't think your formula is worth shit.
>
> < snip rest >
>
> Sean Pitman
> www.DetectingDesign.com


R. Baldwin

unread,
Jul 26, 2007, 10:49:29 PM7/26/07
to
"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message
news:1185479775.7...@q75g2000hsh.googlegroups.com...

So finding a match to any number at all is significant? Why?

>
>> > Shannon information is determined by reference to a known "random"
>> > source of string production - a source that produces maximum Shannon
>> > information.
>>
>> No, no, no. That is very badly wrong. As I said above, Shannon modeled
>> information sources as if they were random variables. Specifically,
>> Markov
>> random variables. That does not mean information sources really are
>> Markov
>> random variables. With very few exceptions, they are *not* Markov random
>> variables. Shannon's theory only *pretends* information sources are
>> Markov
>> random variables because it makes the math easy, and it is a decent
>> approximation that works pretty well.
>
> Just because the reference is a "pretend" or hypothetical reference
> doesn't mean it isn't a reference. A pretend reference is still a
> reference.

You are not comprehending here. It is not a pretend information source. It
is pretended (i.e., assumed, or modeled) that the information source is
Markov.

Further, the information source is not a "reference." You are improperly
crossing over different theories.

>
>> Furthermore, the amount of information produced by an information source
>> depends on how surprised a receiver will be. This depends on the relative
>> probabilities of the different symbols the information source can
>> produce.
>
> You mean the "pretend" information source? Right? Again, the amount
> of receiver "surprise" depends upon the receiver's comparing what is
> received to what is expected to be produced by the pretend information
> source. And, there you have it - a reference is indeed required.

No, I do not mean a pretend information source. Kindly re-read my
explanation, because you have misconstrued it. There is no pretend
information source.

It certainly appears that you have an incorrect understanding of information
sources and receivers under Shannon's Information Theory, because you are
using language from Algorithmic Information Theory. This implies you are
assuming the receiver compares the received string to an expected string.
This is an invalid interpretation of Shannon's theory. The only surprisal in
a receiver is symbol by symbol. If a binary information source produces
digit 1 90% of the time, then the receiver obtains more information by
observing a digit 0 than a digit 1. Since the theory treats the information
source as Markov, the remainder of the string is irrelevant.
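
The 90% example works out as follows. This is a minimal Python sketch using the standard surprisal definition, -log2(p); the function name is illustrative.

```python
from math import log2


def surprisal_bits(p: float) -> float:
    """Information, in bits, gained by observing a symbol of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)


print(surprisal_bits(0.9))   # about 0.152 bits for the common digit 1
print(surprisal_bits(0.1))   # about 3.32 bits for the rare digit 0
```

Each value depends only on the probability of the symbol just observed, not on any expected string.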

The only *reference*, which is not accepted terminology for this field, is
the set of individual symbol properties for the information source. There is
no reference string, and no reference computer, contrary to the
implications of your version.

>
>> There *is* no maximum on Shannon information. If you want more
>> information,
>> just watch the random variable for a longer time.
>
> There is maximum Shannon information for a set finite period of time -
> that's the point.

The point is, you posted several badly incorrect statements about
Information Theory. If you persist in doing so, you might as well get used
to being corrected, because quite a number of talk.origins regulars are
well versed in the topic.

And no, there need not be a "maximum Shannon information for a set finite
period of time." Information sources can quite easily operate at less than a
maximum information rate for indefinitely long periods.

>
>> It is the information *entropy*
>> (average information) that can have a
>> maximum.
>
> You mean informational entropy has a maximum regardless of the period
> of time involved. That's because there is in fact maximum SI for each
> point in time. If there weren't, you couldn't calculate an average
> over a span of time.

No, that is very wrong. Informational entropy has a maximum possible value
because of the form of the equation, H = - sum p log p. With the logarithm
taken to the base of the number of symbols, the H value is constrained to
the interval [0,1]. You can never get less than 0 or more than 1.

The *only way* to get an H value of 1 is to have an information source
produce equiprobable symbols. If it does, H = 1. If it does not, H < 1.
That's it, period.
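
A minimal Python sketch of that equation, assuming the logarithm is taken to the base of the number of symbols (that normalization is what keeps H inside [0,1]; in plain bits a ten-symbol source would top out at log2(10) instead). The function name is illustrative.

```python
from math import log


def normalized_entropy(probs: list[float]) -> float:
    """H = -sum p log_k p, where k is the number of symbols."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two symbols")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing to the sum.
    return -sum(p * log(p, k) for p in probs if p > 0)


print(normalized_entropy([0.5, 0.5]))    # equiprobable binary source: 1.0
print(normalized_entropy([0.9, 0.1]))    # biased binary source: about 0.469
print(normalized_entropy([0.1] * 10))    # equiprobable decimal source: about 1.0
```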

>
>> The entropy is maximum if the
>> information source produces symbols
>> equiprobably.
>
> Exactly. . . Again reference to this "source" is required.

You just agreed with an explanation you contradicted one paragraph back.

In the Shannon model, it is *required* to have an information source, a
channel, and a receiver. His math only makes sense with respect to this
model. Why you are going on and on about "reference" makes no sense at all.

>
>> For a binary source, this means producing either 0's or 1's
>> with a 50% probability. For a decimal source, this means producing any of
>> the 10 digits with a 10% probability.
>
> Right . . . Which is simply assumed to be the case via use of an
> imaginary source that actually does this. In real life, however, this
> cannot be perfectly assumed.

No. It is easy to construct real, physical information sources that produce
equiprobable symbols, to an accuracy tighter than we would care about.

>
>> Shannon's theory also defines the information as it is received, distinct
>> from the information that left the transmitter, when noise is present.
>
> Yes - also assuming the character of the "noise".

Your statement makes no sense. What do you mean by "assuming the character
of the noise"?

>
>> > "In the Shannon approach, however, the method of encoding objects is
>> > based on the presupposition that the objects to be encoded are
>> > outcomes of a known random source. It is only the characteristics of
>> > that random source that determine the encoding, not the
>> > characteristics of the objects that are its outcomes."
>>
>> >http://homepages.cwi.nl/~paulv/papers/info.pdf
>>
>> That quote is about Shannon's Coding Theorem, and is not relevant to the
>> definitions of or calculations for information or entropy under Shannon's
>> Mathematical Theory of Information. The quote is a reference to a means
>> for
>> recoding the output of a random variable to maximize its entropy and
>> optimize channel usage.
>
> I don't see where you get this from this passage since the preceding
> passage reads:
>
> "Both theories aim at providing a means for measuring
> 'information'. They use the same unit to do this: the bit. In both
> cases, the amount of information in an object may be interpreted as
> the length of a description of the object."
>
> This sounds to me like the authors are indeed talking about
> definitions and measurements of "information". It is just that
> Shannon information is concerned with the source (pretend or not)
> while Kolmogorov complexity is concerned with the resulting string or
> "object".

I get it because I've been using Shannon's theories since 1981. The only
context for "encoding" with respect to Shannon is the recoding of an
information source for which
H < 1, such that H = 1 at the output of the encoder. This is a scheme for
making optimal use of a channel. Shannon's Coding Theorem provides the
means. It is based on the symbol probabilities of the original source.
Because I already knew this, as would anyone with experience in the field, I
can tell by the exact phrase "the method of encoding objects is based on the
presupposition that the objects to be encoded are outcomes of a known random
source" that the authors refer to Shannon's Coding Theorem, which provides
the mathematical framework for accomplishing the encoding.

>
>> > This means that Shannon information is more about the type of source
>> > it will take to transmit a particular type of string rather than the
>> > string itself.
>>
>> No. Shannon information is about the symbol probabilities of the source
>> in
>> question (not the "type" of source), the rate at which they are
>> delivered,
>> and the interest of the receiver. You need a receiver attempting to copy
>> the
>> information source for information to exist at all, in Shannon's model.
>
> In order to propose symbol probabilities, you have to propose
> something about a source that is able to produce said probabilities.

The only proposition in Shannon's theory about the source is that the source
is Markov, with a set of symbol probabilities.

>
>> > So, to transmit a number like Pi, where all the
>> > symbols seem to appear with equal frequency, the source needed to
>> > transmit a sequence like pi will have to be able to produce all
>> > possible numbers with a similar character frequency.
>>
>> No. That is absolute hogwash. A PC hooked up to the Internet can output
>> any
>> 8-bit character frequency pattern you program into it.
>
> Yes, and it can also output pi to the same number of digits.

That is beside the point. This is not a restriction on the ability of the
source. You said it was. That is the error I am correcting.

Sean, this is beyond wrong. In the Shannon model, a Markov information
source produces one symbol at a time. The receiver receives one symbol at a
time. The receiver is *inherently* able to receive any string, no matter how
improbable. A Markov source is *inherently* able to send any sequence. The
probabilities of the ensemble of output strings depend on the symbol
probabilities, but any string may be sent or received in a theoretical
Shannon communication system. Real communication systems are band limited
and time limited, and usually don't have a Markov source, so they are not
able to send *any* sequence.
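
As a rough illustration, here is a Python sketch that assumes the simplest case, a memoryless source where each symbol is drawn independently (a simplification of the general Markov model; the names are illustrative). The probability of a string is then just the product of its symbol probabilities, and it is never zero as long as every symbol has nonzero probability.

```python
from math import prod


def string_probability(s: str, symbol_probs: dict[str, float]) -> float:
    """Probability that a memoryless source emits exactly the string s."""
    return prod(symbol_probs[ch] for ch in s)


source = {"0": 0.1, "1": 0.9}              # the 90%-ones source from above
print(string_probability("1111", source))  # roughly 0.6561
print(string_probability("0000", source))  # roughly 0.0001, unlikely but possible
```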

>
>> > In other words,
>> > this source must be able to produce not only pi, but all possible
>> > numbers in infinite sequence space - to include truly "random" and
>> > "non-computable" numbers like sigma.
>>
>> Utter nonsense. Pi is a computable transcendental. A finite algorithm can
>> produce as many digits of pi as you like. Sources that produce the digits
>> of
>> pi, to any arbitrary precision, can be realized by finite algorithms.
>> That
>> is not true for uncomputable numbers.
>
> That's true, but it seems like you are moving into KC here. Pi is
> only one of an ensemble of possible messages where all messages are
> equally probable. The fact that pi can be compressed into a simple
> algorithm seems irrelevant from the perspective of SI.

No, I am not moving into KC. I moved to computable and uncomputable numbers,
because you already wrote about them and made an error. It is possible to
discuss these concepts without worrying about shortest possible algorithms.
I did not, hence I am not moving into KC.
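
To show what "computable" means here, a short Python generator will do. It uses Gibbons' unbounded spigot algorithm, which is my choice for illustration and not something cited in this thread, and it makes no claim to being the shortest possible program.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet; fold in another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


print(list(islice(pi_digits(), 10)))   # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

The program is finite and fixed, yet it will keep producing digits for as long as you let it run. No such program exists for an uncomputable number.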

It is true that the compressibility of pi is not relevant to Shannon
Information. I never said it was. I was disputing a different issue.

It is not true that an actual, realizable information source "must be able
to produce not only pi, but all possible numbers in infinite sequence
space," as you stated. It is not true that "the source needed to transmit a
sequence like pi will have to be able to produce all possible numbers with a
similar character frequency," as you stated. In other words, an information
source *does not need* to be able to output uncomputable numbers.

An infinite number of theoretical Markov sources can access any string in
{0,1}^inf, or a single theoretical Markov source can access one string in
{0,1}^inf, but actual information sources are seldom truly Markov, and a
Markov source is not required to output pi. Also, realizable information
sources cannot access any string in {0,1}^inf or even {0,1}^*, because that
would require unlimited time and bandwidth, which is not possible to
realize.

Terminology:
{0,1}^inf is the set of all infinite binary strings
{0,1}^* is the set of all finite binary strings

>
> "Shannon's classical information theory assigns a quantity of
> information to an ensemble of possible messages. All messages in the
> ensemble being equally probable, this quantity is the number of bits
> needed to count all possibilities. This expresses the fact that each
> message in the ensemble can be communicated using this number of bits.
> However, it does not say anything about the number of bits needed to
> convey any individual message in the ensemble."
>
> http://homepages.cwi.nl/~paulv/papers/info.pdf
>
> It seems like you are trying to do just that. In your argument for pi
> being compressible to a finite algorithm, it seems like you are
> arguing that the number of bits needed to convey pi to the ensemble is
> smaller than pi. While this is true, Shannon information theory "does
> not say anything about the number of bits needed to convey pi". That
> notion seems to require that one move into the realm of KC.

Look, you moved the discussion into computability. Why are you complaining
about it?

I quite agree that Shannon's theory does not say anything about
computability. You have again misconstrued my argument. I was pointing out
that you made a false assertion about the necessary capabilities of
information sources. They are not required to be capable of producing
uncomputable numbers.

>
>> By the way, where you wrote "sigma" did you possibly mean "omega"?
>
> Yes . . . Omega.
>
>> > Again, it is all about the source or the reference that is chosen.
>>
>> >> Strictly speaking, with a single string,
>> >> the Shannon information is only
>> >> estimated..
>>
>> > That's true, but the estimate is based on the type of source needed to
>> > produce such a string.
>>
>> No. It is based on the measured digits in a realized string. That allows
>> an
>> estimate of the probabilities of the symbols produced by the source.
>
> I'm not sure I follow. It almost seems like you are saying what I just
> said in a different way.

You said "the estimate is based on the type of source needed to produce such
a string." If we use standard English parsing, you are saying that
characteristics of the source are the basis for the estimate. This is wrong.
The characteristics of the source are unknown. The characteristics of the
source are estimated based on measurements of the string. You got it
backwards.
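
To make the direction of inference concrete, here is a minimal sketch of
estimating symbol probabilities, and from them a per-symbol Shannon entropy,
purely from a realized string. The function name and the assumption that
symbols are independent and identically distributed (a zeroth-order model)
are illustrative choices for this sketch, not something fixed by the
discussion above.

```python
from collections import Counter
from math import log2


def estimated_entropy_bits_per_symbol(observed):
    """Estimate source entropy from the measured symbol frequencies of one string.

    Assumption: symbols are treated as i.i.d. draws, so only single-symbol
    frequencies are used. Nothing about the source is known in advance.
    """
    counts = Counter(observed)
    total = len(observed)
    # Each relative frequency is the estimate of that symbol's probability.
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for sample in ("0101010101", "0000000000", "abcdabcd"):
        print(sample, estimated_entropy_bits_per_symbol(sample))
```

Note that the only input is the string itself; the properties of the source
are the output of the estimate, not its basis.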


hersheyh

unread,
Jul 27, 2007, 11:03:58 AM7/27/07
to
On Jul 26, 9:52 pm, hersheyh <hershe...@yahoo.com> wrote:
> On Jul 26, 6:01 pm, Seanpit <seanpitnos...@naturalselection.
>
> 0catch.com> wrote:
> > On Jul 26, 2:08 pm, hersheyh <hershe...@yahoo.com> wrote:
>
> > > On Jul 26, 11:57 am, Seanpit <seanpitnos...@naturalselection.
>
> > > 0catch.com> wrote:
> > > > On Jul 25, 5:32 pm, hersheyh <hershe...@yahoo.com> wrote:
>
[snip]

>
> > The method isn't set up to identify random sequences, but non-random
> > biased test sequences.
>
> It looks like it is set up to identify the degree of identity between
> two sequences with the addition of a term that looks like a
> calculation of total sequence space thrown in for no apparent reason.
> But do explain what each part of your equation is measuring and why it
> is relevant. And why you think that your choice of 'reference'
> sequence changes the odds of something or other.
>
[snip repetitive stuff]

In fact, I am explicitly saying that your equation is worthless and a
re-invention of the wheel, except for the minor detail that your re-
invented 'wheel' (your formula) is square and has no axle and is thus
worthless as a wheel. I have no problem at all with identifying a
degree of sequence similarity that is non-chance (that is, is unlikely
to be due to chance but, instead, must have a *causal* explanation
that links the 'test' sequence to a particular 'reference' sequence).

In fact, dope slap to me, looking at sequence identity or homology is
one of the first things that *real* biologists do when they get the
sequence of a 'new' or 'novel' protein. They perform a BLAST or
similar sequence analysis to compare the new or 'test' sequence to the
population (and I mean population and not sample) of sequences that
have already been analysed. This is like comparing a new word, say
"bullshit", to a dictionary of English words and seeing if there are
any sequence matches. I suspect that you would find at least two
words that have statistically significant similarity to parts of this
'test' word. That is, scientists compare the new sequence to a
dictionary of sequences that have already been found to be present
(and which are often useful, although present is quite good enough) in
other living organisms. This is, in fact, a comparison to a
'reference' collection of sequences. They set parameters to exclude
any matches which are due to chance (for proteins, this amounts to
sequence identity significantly higher than about 15-20% identity; the
reason why this is higher than the 5% you would use is because aa's
are not present in equimolar amounts in real proteins, but since this
makes it *more* difficult to get a significant match, you can hardly
complain) and look for stretches of significant matching within a
sequence as well as overall matching.

Not surprisingly to me, but apparently to you, very often *real*
scientists find significant or highly significant matches (non-chance
relationship between sequences) between new sequences and previously
recorded sequences. There are certain general features to these
matches. First, evolutionary closeness of the organisms in time since
divergence is the most significant factor. If one examines any
sequence (functional or not) in humans, there is an extremely high
probability of finding a sequence in chimpanzees that is so similar
that the similarity cannot possibly be attributed to chance. The
further back the divergence of the
organisms (in standard evolutionary terms), the greater the degree of
sequence dissimilarity, even for proteins that perform the very same
function in all the organisms. At some point, identifying similarity
becomes difficult for *some* sequences that perform a particular
function. However, often these proteins have retained similar
*structure* in addition to *function*. That is why *sequence* is less
informative for *function* than *structure* is.

Moreover, they *often* find *sequence* similarity (often in smaller
patches or moieties within a larger sequence) even in proteins that no
longer perform the same *function* (the two globins of hemoglobin and
myoglobin for example, or the different flagellin proteins of the
eubacterial flagella). Typically this similarity is more pronounced
when one looks at structure. Some of the *sequence* differences in
*these* comparisons are undoubtedly due to selection for the small
number of sites where selectively favored change occurred. But much
of the difference is undoubtedly just more of the random fixation of
neutral changes that selection does not affect nor prevent.

Now tell me again why your formula is better than using a BLAST
program to search for similarity of a test sequence (any new sequence)
with a reference dictionary of sequences that have already been
discovered in organisms? What will you learn by your formula that
would not be better analysed by the existing systems for identifying
*statistically significant* similarity?

josephus

unread,
Jul 28, 2007, 9:16:12 PM7/28/07
to
Sean
You are simplifying a difficult task. Suppose you receive some data. It
is received and filtered from a RADIO TELESCOPE. Unknown to you,
someone has transmitted a message encoded with PI to base 13. How would
you identify this encoded message? You have a message, you may KNOW
that a message is present, but you still must recognize it and you must
decode it before you can say I HAVE A MESSAGE.
Also, just because we use base 2 or 10 or 16, the aliens could use base
23 or 32 or even 101. Why? The aliens would not know why WE use 2, 10,
or 16.

You want to suggest that PI could exist in Genome DATA. YOU have to
show an EXAMPLE. And if you CANNOT find one, please shut up.
>
josephus

Bobby Bryant

unread,
Jul 29, 2007, 4:07:38 PM7/29/07
to
In article <1185328675.4...@i38g2000prf.googlegroups.com>,
Seanpit <seanpi...@naturalselection.0catch.com> writes:

> No, because the reference strings are chosen independent of the test
> strings. Therefore a significant match between a reference and a
> test string is good evidence of non-random production.

Wouldn't it be evidence of non-random choice of a reference string?

What about the thought experiment I posted a week or so ago, where you
have a genuinely random string and a copy of it? Since one is
completely random and the other is completely manufactured, shouldn't
they give vastly different CSI values?

How do we know what reference strings to use for them? What if we
don't know which one is the original and which is the copy?

Why should we believe your calculation gives the slightest bit of
evidence that something is non-random?

--
Bobby Bryant
Reno, Nevada

Remove your hat to reply by e-mail.

Bobby Bryant

unread,
Jul 29, 2007, 4:13:48 PM7/29/07
to
In article <13aaut4...@news.supernews.com>,
"R. Baldwin" <res0...@nozirevBACKWARDS.net> writes:

> You might also want to spell out how you are computing the Hamming
> Distance; that is, between what and what?

Fact and fantasy, I think.

Bobby Bryant

unread,
Jul 29, 2007, 4:11:43 PM7/29/07
to
In article <1185318661.1...@g12g2000prg.googlegroups.com>,
Seanpit <seanpi...@naturalselection.0catch.com> writes:
> On Jul 23, 6:39 pm, "Perplexed in Peoria" <jimmene...@sbcglobal.net>
> wrote:

>> However would you know in practice what the specification string is?
>
> By picking a string of known non-random production, like pi.

Do we use the digits of pi for everything?

Could you work some examples for us, e.g. what does your formula produce
when you use pi as the reference string and calculate the CSI of pi^2, e,
1, the Encyclopedia Britannica, and a few random strings?


Do you really believe all the crap you post here?

Bobby Bryant

unread,
Jul 29, 2007, 4:16:49 PM7/29/07
to
In article <1185290707.7...@n60g2000hse.googlegroups.com>,
hersheyh <hers...@yahoo.com> writes:

> You are quite right to suspect that he has simply pulled this
> formula out of his arse.

So... does that lead us to expect high CSI or low? Knowing that, we
could test his formula on itself.

Bobby Bryant

unread,
Jul 29, 2007, 4:21:00 PM7/29/07
to
In article <slrnfaesnv.4...@fishtank.brainwagon.org>,
Mark VandeWettering <wett...@attbi.com> writes:
> On 2007-07-25, Seanpit <sea...@gmail.com> wrote:
>> On Jul 25, 1:53 am, fropome <monk...@hornsandhalos.co.uk> wrote:
>>
>>> What?
>>>
>>> oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry
>>> it took me so long. Would you like the opportunity to explain the
>>> joke to the lurkers? Or would you prefer it if I pointed out how
>>> what you've written makes absolutely no sense?
>>
>> Oh please, do point the "joke" out. I'm sure at least some of the
>> lurkers don't get whatever you seem to be "getting". I sure don't.
>
> Perhaps it is only visible to people who possess a different viewpoint.
>
> You might be able to remedy the situation using a mirror.

Ah, but the symmetry would increase the CSI of the situation, and might
make it meaningful instead of jokish.

Bobby Bryant

unread,
Jul 29, 2007, 4:22:05 PM7/29/07
to
In article <1185374782.3...@z24g2000prh.googlegroups.com>,
Seanpit <sea...@gmail.com> writes:
> On Jul 25, 1:53 am, fropome <monk...@hornsandhalos.co.uk> wrote:
>
>> What?
>>
>> oh I get it, you're having a laugh! ho ho ho! ha ha ha! Ah, sorry
>> it took me so long. Would you like the opportunity to explain the
>> joke to the lurkers? Or would you prefer it if I pointed out how
>> what you've written makes absolutely no sense?
>
> Oh please, do point the "joke" out. I'm sure at least some of the
> lurkers don't get whatever you seem to be "getting". I sure don't.

Do lurkers bother to follow this nonsense?

Seanpit

unread,
Jul 30, 2007, 7:41:57 PM7/30/07
to
On Jul 28, 6:16 pm, josephus <dogb...@earthlink.net> wrote:

> sean
> You are simplifying a difficult task. suppose you receive some data. It
> is received and filtered from a RADIO TELESCOPE. Unknown to you
> someone has transmitted a message encoded with PI to base 13. How would
> you identify this encoded message? you have a message, you may KNOW
> that a message is present but you still must recognize it and you must
> decode it before you can say I HAVE A MESSAGE.
> also just because we use base 2 or 10 or 16, the aliens could use base
> 23 or 32 or even 101 why? The aliens would not know why WE use 2 ,10,
> or 16.

That's right. It is actually impossible to be able to detect all
possible non-random potentially biased sequences. No one can do it.
That is why a negative test result does not rule out the possibility
of bias or deliberate design. This is also why a positive result
means much more than a negative result.

> you want to suggest that PI could exist in Genome DATA. YOU have to
> show an EXAMPLE. and if you CANNOT find one. please shut up.

Again, this particular discussion is only about the detection of
bias. The detection of bias, by itself, says nothing about the origin
of the bias. Demonstration of bias is no problem in biosystems. All
biosystems have a great deal of genetic bias. What is the origin of
this bias? That's a completely different discussion.

> josephus

Sean Pitman
www.DetectingDesign.com

Seanpit

unread,
Jul 30, 2007, 7:34:56 PM7/30/07
to
On Jul 26, 6:52 pm, hersheyh <hershe...@yahoo.com> wrote:
> On Jul 26, 6:01 pm, Seanpit <seanpitnos...@naturalselection.

< snip >

> > > Of course, you
> > > do muck it up with your first term, the size of total sequence space,
> > > which is basically irrelevant and tells us nothing of any utility.
>
> > It tells the odds of a randomly produced test string ending up with a
> > match to the reference string.
>
> Not really. The odds of a randomly produced test string ending up as
> a match to the reference string is 1/total sequence space size, not
> total sequence space minus some value involving hd. The reference
> string is already arbitrarily chosen, so the odds of the reference
> string is 1. The odds of any other sequence (or even the same
> sequence), chosen randomly from a universe of total sequence space
> that matches that reference sequence (assuming each sequence is
> present only once), is 1/the size of total sequence space. There
> would be no subtraction of anything having to do with hd for the
> calculation of the odds of some randomly chosen sequence matching a
> pre-chosen reference sequence.

This method would only tell you the odds of having a perfect match.
It wouldn't say anything about the odds of having a match that was
only 1 character off, or two characters . . . etc. This is where my
method comes into play - it determines the relative odds of matches at
different Hamming Distances and different sequence lengths.

< snip >

> > > Take any stretch of numbers in pi and see if they can predict
> > > any other similarly sized non-overlapping stretch of numbers in pi.
> > > There is no repeatablility in pi and thus the string of numbers in pi
> > > is pretty much *random* despite being predictable by a simple
> > > algorithm.
>
> > The definition of "random" is "non-predictable". Therefore, since pi
> > is in fact predictable, it is non-random.
>
> Pi is calculated. The sequence of numbers in pi is not predictable as
> a sequence.

LOL - What?!

The sequence of numbers in pi is most certainly predictable. A
"calculation" or algorithm that produces pi every time is the means by
which one can indeed prove that pi is in fact predictable and
therefore not random.

> > If the result is not predictable, it is random from that perspective.
>
> And the sequence of pi is random in that you cannot predict one
> stretch from another. There is no repetitiveness.

Pi is *not* random. Look it up. Also, it might help if you looked up
the definition for randomness. If you don't believe me, try asking
someone like R. Baldwin if you are correct in this particular notion
of yours.

The fact is that when it comes to pi, you most certainly *can* predict
one stretch after another. Just because there are no repetitive
patterns doesn't mean pi is therefore not predictable. It is
perfectly predictable. It just isn't predictable from all
perspectives (like from a UTM that doesn't have access to the
algorithm for pi). It is like any number sequence that may seem
"random" to you at first approximation. However, once you find that
there is in fact a much smaller formula that can in fact produce the
same sequence perfectly, you can know that the sequence in question
most likely was not the result of random production. The same thing
is true for pi. The resulting sequence produced by algorithms for pi
is not random at all - even though it doesn't have a repeating
pattern.

Shane

unread,
Jul 30, 2007, 9:27:08 PM7/30/07
to


So predict the next five digits, from this 30 digit sequence.
724587006606315588174881520920.....

Oh and you are not allowed to use a calculation for pi, merely use the
given sequence. If required, I can make the sequence longer, but I
suspect that will not help.

> A
> "calculation" or algorithm that produces pi every time is the means by
> which one can indeed prove that pi is in fact predictable and
> therefore not random.

Come on Sean, surely you can do better than that? You have just
reinforced that pi is, as Howard posted, calculable. You have not
demonstrated that it is predictable based on the previous sequence, no
matter how long that sequence is. I'm not any form of mathematician,
and yet I can see the glaring distinction between what Howard claimed
about pi and what you say he claimed.

[...]

shel...@msn.com

unread,
Jul 30, 2007, 9:49:07 PM7/30/07
to

09628

>
> Oh and you are not allowed to use a calculation for pi, merely use the
> given sequence. If required, I can make the sequence longer, but I
> suspect that will not help.
>
> > A
> > "calculation" or algorithm that produces pi every time is the means by
> > which one can indeed prove that pi is in fact predictable and
> > therefore not random.
>
> Come on Sean, surely you can do better than that? You have just
> reinforced that pi is, as Howard posted, calculable. You have not
> demonstrated that it is predictable based on the previous sequence, no
> matter how long that sequence is. I'm not any form of mathematician,
> and yet I can see the glaring distinction between what Howard claimed
> about pi and what you say he claimed.
>

So can anyone, see that you included "based on the previous sequence",
where neither Howard nor Sean made that claim.
Don't let the pot burn your ass on the way out.

hersheyh

unread,
Jul 30, 2007, 9:51:05 PM7/30/07
to
On Jul 30, 7:34 pm, Seanpit <seanpitnos...@naturalselection.

No, it doesn't. The odds of having a match with 1 character off is
simply not your formula, Sean. It is a calculation of the number
of sequences that are 0 off (which is 1 sequence) plus the number that
are one character off (which is n*(X-1), since each site (n) in the
sequence can have any alternative but one). Then *divide*
that by the size of total sequence space. Not *subtract* some other
number from total sequence space.

For two differences, the odds of a randomly chosen sequence being
within two character differences of the original would be the sum of
the first two values above plus [(n-1)*(X-1)*n*(X-1)/2], since each of
the sequences that differ from the original sequence by one (there are
n*(X-1) of these sequences) can itself differ from the original
sequence at some other site (there are n-1 other sites with X-1
possibilities at each of those sites), but only half of these
sequences will be unique, since the same sequence is reached whether
one of its two changed sites is altered first and the other second,
or the other way round. Then that summation would be *divided* by the
size of total sequence space. NOT be subtracted from the size of total
sequence space. *That* (and I stress that math is not my strong suit,
so someone better qualified should check up on me) would give you the
odds of a random sequence being within 0, 1, or 2 hd of the original
sequence.
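
The counting argument above generalizes to any Hamming distance: the number
of sequences at exactly k differences is C(n, k) * (X-1)^k, and the odds are
the running sum divided by X^n. A minimal sketch, assuming uniformly random
sequences and using hypothetical function names chosen for this example:

```python
from fractions import Fraction
from math import comb


def sequences_within(n, X, d):
    """Count length-n sequences over X symbols at Hamming distance <= d
    from one fixed reference sequence."""
    # Choose which k sites differ, then one of X-1 alternatives at each.
    return sum(comb(n, k) * (X - 1) ** k for k in range(d + 1))


def match_probability(n, X, d):
    """Odds that a uniformly random sequence lands within distance d."""
    return Fraction(sequences_within(n, X, d), X ** n)


if __name__ == "__main__":
    n, X = 10, 4  # e.g. a 10-site sequence over a 4-letter alphabet
    for d in range(3):
        print(d, sequences_within(n, X, d), float(match_probability(n, X, d)))
```

For k = 1 and k = 2 the terms reduce to n*(X-1) and n*(n-1)*(X-1)^2/2, which
are the two values worked out by hand above.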

Frankly, I have no idea what *your* formula does or shows.

And, as I mentioned, since there are programs like BLAST that already
analyze and look for significant homology between a 'test' sequence
and the total population of currently known 'reference' sequences, I
fail to see why your formula, whatever you think it says, adds
anything useful at all.

Oh, and did I mention that the already available programs do indeed
often find homology between 'test' sequences and the 'reference'
dictionary of *known* sequences present in living organisms. And that
what one sees is a pattern largely consistent with evolution and the
acquisition of modified function by modification of old sequence with
related function and structure and, often, sequence.

[snip stuff about pi, which is uninteresting because, unlike the
'reference' dictionary of sequences present in other life forms (a
dictionary that gets larger every day) that current homology testing
methods uses, there is no such dictionary of mathematical algorithms
that is universally meaningful.]

An interesting test of your ideas: There should be less likelihood of
a statistically significant match between large 'test' protein
sequences and any other sequences in the reference dictionary than
for smaller (<300 aa?) proteins, because large proteins would have to
arise by longer random walks.

R. Baldwin

unread,
Jul 30, 2007, 10:24:26 PM7/30/07
to
"Seanpit" <seanpi...@naturalselection.0catch.com> wrote in message
news:1185838496....@x40g2000prg.googlegroups.com...

> On Jul 26, 6:52 pm, hersheyh <hershe...@yahoo.com> wrote:
>> On Jul 26, 6:01 pm, Seanpit <seanpitnos...@naturalselection.
>
[snip]

>
>> > > Take any stretch of numbers in pi and see if they can predict
>> > > any other similarly sized non-overlapping stretch of numbers in pi.
>> > > There is no repeatablility in pi and thus the string of numbers in pi
>> > > is pretty much *random* despite being predictable by a simple
>> > > algorithm.
>>
>> > The definition of "random" is "non-predictable". Therefore, since pi
>> > is in fact predictable, it is non-random.
>>
>> Pi is calculated. The sequence of numbers in pi is not predictable as
>> a sequence.
>
> LOL - What?!
>
> The sequence of numbers in pi is most certainly predictable. A
> "calculation" or algorithm that produces pi every time is the means by
> which one can indeed prove that pi is in fact predictable and
> therefore not random.
>
>> > If the result is not predictable, it is random from that perspective.
>>
>> And the sequence of pi is random in that you cannot predict one
>> stretch from another. There is no repetitiveness.
>
> Pi is *not* random. Look it up. Also, it might help if you looked up
> the definition for randomness. If you don't believe me, try asking
> someone like R. Baldwin if you are correct in this particular notion
> of yours.

You are both right. Whether you can call pi random depends on what
definition of random you are using. Pi is not algorithmically random. It can
be computed deterministically, so we generally don't think of it as
stochastically random; but there are stochastic random variables that
converge to pi (for example, variations of experiments that drop objects
into circles). It is not known whether pi is a (Borel) normal number, but
that is a strong conjecture and pi is generally believed to be a normal
number. Normal numbers have certain properties we like to call random. For
example, no matter what radix is chosen, the digits follow a Uniform
distribution. Finite state martingales cannot succeed on normal numbers. I
suspect it was this latter aspect that Howard was getting at.
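
As an illustration of a stochastic random variable that converges to pi, here
is a minimal Monte Carlo sketch: random points are dropped into the unit
square and the fraction landing inside the quarter circle is scaled by 4. The
function name, the sample count, and the fixed seed are assumptions made for
this example only.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of unit square = pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

Each run with a different seed gives a different value, yet the estimates
cluster ever more tightly around the deterministic constant as the sample
count grows.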

>
> The fact is that when it comes to pi, you most certainly *can* predict
> one stretch after another. Just because there are no repetitive
> patterns doesn't mean pi is therefore not predictable. It is
> perfectly predictable. It just isn't predictable from all
> perspectives (like from a UTM that doesn't have access to the
> algorithm for pi). It is like any number sequence that may seem
> "random" to you at first approximation. However, once you find that
> there is in fact a much smaller formula that can in fact produce the
> same sequence perfectly, you can know that the sequence in question
> most likely was not the result of random production. The same thing
> is true for pi. The resulting sequence produced by algorithms for pi
> is not random at all - even though it doesn't have a repeating
> pattern.

It seems to me that, to predict the sequence of pi, you need to know where
you are starting, what numeral system you are using, and that it is indeed
pi. Given these facts, its digits are deterministic and 100% predictable.

On the other hand, if pi does turn out to be a normal number, any finite
sequence will occur within pi an infinite number of times with asymptotic
frequency depending on the length of the finite sequence. The implication is
that, if pi is normal (and it is believed to be), you would have to know the
starting point in order to predict the rest of the sequence. Absent that
knowledge, the sequence is unpredictable.
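
The distinction can be sketched in code. Given the knowledge that a fragment
comes from the decimal expansion of pi, one can compute digits, locate the
fragment, and read off what follows; the fragment alone supplies none of
that. The sketch below uses Gibbons' unbounded spigot algorithm as one
convenient way to generate digits; the function names and the search limit
are assumptions of this example.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time,
    using Gibbons' unbounded spigot algorithm."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


def continuation(fragment, count, search_limit=1000):
    """Find the first occurrence of fragment in the leading digits of pi and
    return (index, next `count` digits), or None if it is not found."""
    digits = "".join(str(d) for d in islice(pi_digits(), search_limit))
    start = digits.find(fragment)
    if start < 0:
        return None
    end = start + len(fragment)
    return start, digits[end:end + count]


if __name__ == "__main__":
    print(continuation("724587006606315588174881520920", 5))
```

Applied to the 30-digit fragment posted earlier in the thread, the digits
that follow are 96282. The code gets there only because it was told the
fragment belongs to pi in base 10 and could search for its position.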

Shane

unread,
Jul 30, 2007, 11:17:38 PM7/30/07
to

Please, do feel free to show your work. And remember you had to
predict them, not calculate them.

>> Oh and you are not allowed to use a calculation for pi, merely use the
>> given sequence. If required, I can make the sequence longer, but I
>> suspect that will not help.
>>
>>> A
>>> "calculation" or algorithm that produces pi every time is the means by
>>> which one can indeed prove that pi is in fact predictable and
>>> therefore not random.
>>
>> Come on Sean, surely you can do better than that? You have just
>> reinforced that pi is, as Howard posted, calculable. You have not
>> demonstrated that it is predictable based on the previous sequence, no
>> matter how long that sequence is. I'm not any form of mathematician,
>> and yet I can see the glaring distinction between what Howard claimed
>> about pi and what you say he claimed.
>>
> So can anyone,

Thanks for agreeing that Sean misinterpreted Howard.

> see that you included "based on the previous sequence",
> where neither Howard or Sean made that claim.

Please, do feel free to give *your* interpretation of these words of
Howard's:


"The sequence of numbers in pi is not predictable as a sequence."

> Don't let the pot burn your ass on the way out.

I'm not on the way out. I plan to stick around and watch you either
run away, or try to explain your above evidenced remarkable ability to
*predict* the digits in pi, and your similarly remarkable inability to
comprehend the English language.

Shane

unread,
Jul 30, 2007, 11:31:14 PM7/30/07
to
On Mon, 30 Jul 2007 18:49:07 -0700, shel...@msn.com wrote:
> On Jul 30, 6:27 pm, Shane <remarcsdNOS...@ozemail.com.au> wrote:
>> On Mon, 30 Jul 2007 16:34:56 -0700, Seanpit wrote:
>>> On Jul 26, 6:52 pm, hersheyh <hershe...@yahoo.com> wrote:
>>>> On Jul 26, 6:01 pm, Seanpit <seanpitnos...@naturalselection.

[...]

>>>>> > Take any stretch of numbers in pi and see if they can predict
>>>>> > any other similarly sized non-overlapping stretch of numbers in pi.
>>>>> > There is no repeatablility in pi and thus the string of numbers in pi
>>>>> > is pretty much *random* despite being predictable by a simple
>>>>> > algorithm.
>>
>>>>> The definition of "random" is "non-predictable". Therefore, since pi
>>>>> is in fact predictable, it is non-random.
>>
>>>> Pi is calculated. The sequence of numbers in pi is not predictable as
>>>> a sequence.
>>
>>> LOL - What?!
>>
>>> The sequence of numbers in pi is most certainly predictable.
>>
>> So predict the next five digits, from this 30 digit sequence.
>> 724587006606315588174881520920.....
> ;
> 09628
>

Missed this in my previous response. Your prediction failed, the
correct answer is;

96282

HTH

HAND
