
Genomes


Mok-Kong Shen

Jan 1, 2001, 6:29:00 AM

Does anyone happen to know the statistical properties of the
genome sequences in general? Are they sufficiently 'random'?

BTW, since the code is base 4, one can use the same to readily
transcribe any given binary sequences. This could have some
steganographical benefit, I suppose. For a paragraph that is
gibberish easily gives rise to suspicion of crypto, while the
same in the alphabet AGCT is presumably difficult to
distinguish from the result of a genetic research, if
appropriately enveloped. One could perhaps also hide information
in genuine genome sequences through modifications analogous to
what is done with graphical files.
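
The base-4 transcription described above can be sketched in a few lines. This is my own illustration; the particular two-bits-per-base mapping is an arbitrary assumption, not any standard encoding:

```python
# Illustrative sketch: transcribe a binary string into the DNA alphabet
# by mapping each pair of bits to one base. The mapping below is an
# arbitrary choice, not a standard encoding.
BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def to_dna(bits: str) -> str:
    """Encode an even-length bit string as a sequence over A, G, C, T."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def from_dna(seq: str) -> str:
    """Recover the original bit string from a transcribed sequence."""
    return "".join(BASE_TO_BITS[base] for base in seq)

print(to_dna("0111001011"))   # GTACT
print(from_dna("GTACT"))      # 0111001011
```

A convincing embedding would of course also have to mimic the statistics of natural sequences, which, as the replies below note, are far from uniform.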

M. K. Shen

wtshaw

Jan 1, 2001, 4:27:43 PM
In article <3A5069FC...@t-online.de>, Mok-Kong Shen
<mok-ko...@t-online.de> wrote:

> Does anyone happen to know the statistical properties of the
> genome sequences in general? Are they sufficiently 'random'?
>

Since all life is similar, so is the coded information.
--
History repeats itself when given the opportunity.
Question repeating old mistakes.
Be certain of the outcome.

Andrew Demma

Jan 1, 2001, 7:47:47 PM
On Mon, 01 Jan 2001 12:29:00 +0100, Mok-Kong Shen
<mok-ko...@t-online.de> wrote:

>
>Does anyone happen to know the statistical properties of the
>genome sequences in general? Are they sufficiently 'random'?
>

Genomes are, in general, very nonrandom. For example, some bacterial
classifications are based on whether species are AT-rich or GC-rich, so
there is not necessarily an equal 25% of each base. Also, horizontal gene
transfer between species can be detected by the differences in
nucleotide biases. The same applies to higher organisms, though I am
not familiar enough with them to have any examples off hand.
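
The AT-rich/GC-rich distinction mentioned above comes down to GC content, which is a trivial computation (a minimal sketch of my own):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# A GC-rich organism's sequences sit well above 0.5; an AT-rich
# organism's sit well below.
print(gc_content("ATATATGC"))  # 0.25
```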

Trevor L. Jackson, III

Jan 1, 2001, 11:35:18 PM
Mok-Kong Shen wrote:

This may become less useful in the near future. If a distinguisher is
found for valid (i.e., natural) introns then it becomes hard to hide
random information as if it were an intron. Such a distinguisher may be
feasible based on the fact that most of the genetic code for mammals is
shared with all multi-cellular organisms. I suspect the introns would
be shared too.


Mathew Hendry

Jan 2, 2001, 9:34:23 AM
On Mon, 01 Jan 2001 12:29:00 +0100, Mok-Kong Shen <mok-ko...@t-online.de>
wrote:

>Does anyone happen to know the statistical properties of the
>genome sequences in general? Are they sufficiently 'random'?

They're not random or they almost certainly wouldn't work. :) But the recent
thread "compression of DNA sequences" in news:comp.compression might be of
interest. The thread starts at Message-ID
<Pine.GSO.4.21.001013...@chopin.ifp.uiuc.edu>

-- Mat.

Pealco

Jan 2, 2001, 8:52:38 PM
Actually I believe this has been done. The winner of 2000's Intel Science
Talent Search did her project on DNA-Based Steganography.

Here's the link: http://www.intel.com/pressroom/archive/releases/ed031300.htm

--Pedro

Mok-Kong Shen

Jan 3, 2001, 12:59:04 PM

"Trevor L. Jackson, III" wrote:
>
> Mok-Kong Shen wrote:
>
> > Does anyone happen to know the statistical properties of the
> > genome sequences in general? Are they sufficiently 'random'?
> >
> > BTW, since the code is base 4, one can use the same to readily
> > transcribe any given binary sequences. This could have some
> > steganographical benefit, I suppose. For a paragraph that is
> > gibberish easily gives rise to suspicion of crypto, while the
> > same in the alphabet AGCT is presumably difficult to
> > distinguish from the result of a genetic research, if
> > appropriately enveloped. One could perhaps also hide information
> > in genuine genome sequences through modifications analogous to
> > what is done with graphical files.

> This may become less useful in the near future. If a distinguisher is
> found for valid (i.e., natural) introns then it becomes hard to hide
> random information as if it were an intron. Such a distinguisher may be
> feasible based on the fact that most of the genetic code for mammals is
> shared with all multi-cellular organisms. I suspect the introns would
> be shared too.

I have no expert knowledge, but from what I read in a book it
seems that introns vary quite significantly among
related genomes. So if a sequence is claimed to be
from a yet-unstudied gene, it would be difficult to
detect the fraud, I guess.

M. K. Shen

John Feth

Jan 5, 2001, 1:54:46 PM
I looked at about 700,000 bases ("base" here refers to the chemical
components A, C, G, T, not a numerical base) on a gene and
found the A:C:G:T ratios to be very close to 3:2:2:3. An Allan deviation
analysis shows that the order looks like noise in strings of up to about
1,000 bases but carries information in strings from 1,000 to 10,000 bases
long. I believe an A always pairs with a T, so steganography in DNA might
be a little different from that in photos or music.
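
Checking a base composition like the 3:2:2:3 figure above is a simple counting exercise (my own sketch, not from the post):

```python
from collections import Counter

def base_ratios(seq: str) -> dict:
    """Return the fraction of each of A, C, G, T in a DNA sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

# A 3:2:2:3 composition, as in the A:C:G:T ratios reported above:
print(base_ratios("AAACCGGTTT"))
# {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
```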

John Feth

Mok-Kong Shen <mok-ko...@t-online.de> wrote in article
<3A5069FC...@t-online.de>...

Terry Ritter

Jan 5, 2001, 10:06:00 PM

On 5 Jan 2001 18:54:46 GMT, in
<01c0774a$05965f40$2104...@cdc11q71.cas.honeywell.com>, in sci.crypt
"John Feth" <John...@honeywell.com> wrote:

>I looked at about 700,000 bases ("base" here refers to the chemical
>components A, C, G, T, not a numerical base) on a gene and
>found the A:C:G:T ratios to be very close to 3:2:2:3. An Allan deviation
>analysis shows that the order looks like noise in strings of up to about
>1,000 bases but carries information in strings from 1,000 to 10,000 bases
>long. I believe an A always pairs with a T, so steganography in DNA might
>be a little different from that in photos or music.

Although I am somewhat familiar with "Allan variance," and continue to
read the many papers available on the web, I am confused about the
implication that it can be relied upon to distinguish between noise
and information. Perhaps you would care to describe your experiments
in detail.

---
Terry Ritter rit...@io.com http://www.io.com/~ritter/
Crypto Glossary http://www.io.com/~ritter/GLOSSARY.HTM

Mok-Kong Shen

Jan 8, 2001, 12:32:41 PM

Terry Ritter wrote:
>
[snip]


> Although I am somewhat familiar with "Allan variance," and continue to
> read the many papers available on the web, I am confused about the
> implication that it can be relied upon to distinguish between noise
> and information.

[snip]

Could you or someone else kindly give a good reference of
Allan variance or a tiny summary of it? I failed to find
pointers from a couple of well-known and very comprehensive
reference materials of statistical sciences in the library.

Thanks.

M. K. Shen

Douglas A. Gwyn

Jan 8, 2001, 3:08:04 PM
Mok-Kong Shen wrote:
> Could you or someone else kindly give a good reference of
> Allan variance or a tiny summary of it?

http://www.allanstime.com/AllanVariance/

It's essentially a 2-point variance, used for oscillators.

Terry Ritter

Jan 9, 2001, 1:29:52 AM

On Mon, 08 Jan 2001 18:32:41 +0100, in
<3A59F9B9...@t-online.de>, in sci.crypt Mok-Kong Shen
<mok-ko...@t-online.de> wrote:

>[...]


>Could you or someone else kindly give a good reference of
>Allan variance or a tiny summary of it? I failed to find
>pointers from a couple of well-known and very comprehensive
>reference materials of statistical sciences in the library.


VARIANCE

We recall from descriptive statistics that a "variance" statistic
attempts to capture (or "model") -- in one value -- the extent to
which data vary from some basis. The square root of variance is
"deviation," which is the expected difference each sample has from the
base value.

Common (or "classic") variance is based on the mean, the arithmetic
average of sampled values (here please pardon my pseudocode):

| mean := SUM(x[i]) / n;
| var := SUM( SQR(x[i] - mean) ) / (n - 1);
| sdev := SQRT( var );

for an array of n sample values x[].

In contrast, Allan variance is based on the value of the previous
sample:

| allanvar := SUM( SQR(x[i] - x[i-1]) ) / (2*(n-1));
| allandev := SQRT( allanvar );

The value "2" in the denominator is apparently intended to produce the
same result as classical variance over white noise. Note that an
n-element array implies only n-1 difference values.

There is also a "mean deviation" or "absolute deviation" which uses
the absolute value of the difference, which thus avoids the squaring
operation and is also supposedly "more robust":

| adev := SUM( ABS(x[i] - mean) ) / n;
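
The three statistics above are directly runnable; here is a plain Python transcription of the pseudocode (my own, assuming at least two samples):

```python
import math

def classic_var(x):
    """Classic sample variance about the mean."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)

def allan_var(x):
    """Allan variance, based on differences between successive samples.
    An n-element array yields only n-1 differences; the factor 2 in the
    denominator makes the result agree with classic variance on white noise."""
    n = len(x)
    return sum((x[i] - x[i - 1]) ** 2 for i in range(1, n)) / (2 * (n - 1))

def mean_abs_dev(x):
    """Mean (absolute) deviation about the mean."""
    n = len(x)
    mean = sum(x) / n
    return sum(abs(v - mean) for v in x) / n

x = [1.0, 2.0, 3.0, 4.0]
print(classic_var(x), math.sqrt(classic_var(x)))  # 5/3 and its square root
print(allan_var(x), math.sqrt(allan_var(x)))      # 0.5 and its square root
print(mean_abs_dev(x))                            # 1.0
```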

Other types of variance include "Hadamard variance," a related form
called "SIGMA-Z," and probably many other types as well. Each of
these presumably provides a unique view of the differences in sampled
data, and none is likely to be ideal for every application.


ADVANTAGE

The first role of Allan variance is fairly conventional: to provide a
measure of variation. In frequency measurement work, measured
frequency may be sampled at some rate. The resulting Allan deviation
over the sample values is a general measure of frequency stability at
the sampling rate. And by averaging each m adjacent samples, we can
get an Allan variance for (synthetic) slower sampling rates.

It is also possible to measure time differences between two sources
and then compute the Allan deviation from a slightly more complex
formula.

The more interesting role of Allan variance is to assist in the
analysis of residual noise. In frequency measurement work, five
different types of noise are defined: white noise phase modulation,
flicker noise phase modulation, white noise frequency modulation,
flicker noise frequency modulation, and random walk frequency
modulation. A log-log plot of Allan variance versus sample period
produces approximate straight line values of different slopes in four
of the five possible cases. A different (more complex) form called
"modified Allan deviation" can distinguish between the remaining two
cases. The result is a powerful basis for identifying problems and
engineering improved designs.
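
As a quick numerical illustration of that slope behavior (my own sketch, not from the post): averaging each m adjacent samples to synthesize slower sampling rates, the Allan deviation of white noise shrinks as m grows, while that of a random walk grows.

```python
import random

def allan_dev(x):
    """Allan deviation of a sample sequence."""
    n = len(x)
    avar = sum((x[i] - x[i - 1]) ** 2 for i in range(1, n)) / (2 * (n - 1))
    return avar ** 0.5

def average_m(x, m):
    """Average each group of m adjacent samples (synthetic slower rate)."""
    usable = len(x) - len(x) % m
    return [sum(x[i:i + m]) / m for i in range(0, usable, m)]

random.seed(1)
white = [random.gauss(0.0, 1.0) for _ in range(10000)]
walk, s = [], 0.0
for _ in range(10000):
    s += random.gauss(0.0, 1.0)
    walk.append(s)

# On a log-log plot of these values against m, the two noise types
# produce straight lines of opposite slope.
for m in (1, 10, 100):
    print(m, allan_dev(average_m(white, m)), allan_dev(average_m(walk, m)))
```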


SOURCES

If you go to www.google.com and type in "Allan variance" or "Allan
deviation" you should get several pages of links to web pages. Some
of those merely use the term in a particular project, and some purport to
be definitions but are confusing, but overall one can develop an
understanding of the concept.

There is a lot of mention of Allan variance in the literature
surrounding precision frequency measurement, e.g., in the
proceedings of the annual IEEE International Frequency Control
Symposium.


EXAMPLE REFERENCES

Allan, D. and J. Barnes. 1981. A Modified "Allan Variance" with
Increased Oscillator Characterization Ability. Proceedings of the
35th Annual Frequency Control Symposium. 470-475.

Greenhall, C. 1992. A Shortcut for Computing the Modified Allan
Variance. 1992 IEEE Frequency Control Symposium. 262-264.

Ferre-Pikal, E., et al. 1997. Draft Revision of IEEE Std 1139-1988
Standard Definitions of Physical Quantities for Fundamental Frequency
and Time Metrology -- Random Instabilities. 1997 IEEE International
Frequency Control Symposium. 338-357.

Respero. 1999. Allan variance: variations and application to
metrology gauge data. http://huey.jpl.nasa.gov/~respero/allan-var/

Riley, W. 2001. The Calculation of Time Domain Frequency Stability.
http://www.ieee-uffc.org/freqcontrol/paper1ht.html
