Estimating entropy of a stream

bmearns

unread,

Nov 16, 2009, 8:43:07 PM11/16/09

to

I'm trying to set up a system to use audio static from a clock radio
as a
random number generator (getting that static through the sound card).
I have
a couple questions on this, but the biggest one is how to calculate/
estimate
the entropy of my data. I'm familiar with the general entropy
calculation,
but I'm not sure what to calculate it on: per bit, per overlapping
bytes,
non overlapping bytes?

My general plan is as follows: read data in from the sound card,
estimate
the total entropy of the stream and feed it into a cryptographic hash
until I've read enough data such that the total entropy is at least
equal
to the output of the hash, then use the output as my random bits.

So in addition to my first question about how best to estimate the
entropy
of the data, I have a few more, if some kind soul would like to shed
some
light on my situation.

First, does it make sense to do any preprocessing on the data before
feeding it into the hash? For instance, I'm thinking about collapsing
each 16
bit audio sample into a single bit by XORing all 16 bits together,
thinking
that this should destroy any audio based correlation that may exist,
but I may
be completely off the mark. I was also thinking about doing a Von
Neumann
mono-bit filter on the data: considering the data two-bits at a time
and
discarding any pairs that have both bits the same. So does this make
sense at
all, or is it a waste of time (or even counter-productive) considering
it's
all just going into the hash anyway?

Similarly, what about post processing? Again, I was thinking of the
mono-bit
filter. I was also thinking of running the FIPS tests on it (probably
just
passing it through rngtest in pipe mode) and discarding any sets that
fail.
Is it worth the extra processing and the possibility of discarding
such large
sets, or should I just trust a good hash function?

Finally, if I do any pre-processing on the data, do I measure the
entropy of
the original data, or the processed data? My instinct tells me to use
the
processed data, but my instinct has been wrong in the past.

Any help, comments, or other insight would be greatly appreciated.
Thanks in
advance.

-Brian

Robert Scott

unread,

Nov 17, 2009, 7:40:12 AM11/17/09

to

On Mon, 16 Nov 2009 17:43:07 -0800 (PST), bmearns <mear...@gmail.com> wrote:

>I'm trying to set up a system to use audio static from a clock radio
>as a
>random number generator (getting that static through the sound card).
>I have
>a couple questions on this, but the biggest one is how to calculate/
>estimate

>the entropy of my data. ..

Entropy is the measure of how unpredicable any one particular part of the data
stream might be, given all the other parts. Predictability depends on your
ability to recognize correlations in the data. Simple correlations, like the
overall frequecy of '0' bits vs '1' bits, are easy to calculate. But for
cryptographic purposes, you must consider all forms of correlation.

For example, suppose your static-based generator was being influenced by an
external signal in such a way that every 41 bits saw a bit that was always the
inverse of the bit that occurred 129 bits ago. This correlation might not be
detected in any statistical randomness test. But it does affect the entropy,
because in every 41 bits we find a bit that is absolutely predicatable to those
who know what to look for. I have no idea how an external source might produce
such an effect on your audio static source, but it is conceivable in prinicple.

Therefore the calcuation of entropy is simlar to cracking a cipher. If you are
clever enough to recognize the hidden correlation, then you can declare a lower
entropy. If you can't, then you will have to declare a higher entropy.

Now you can hope that such complex correlations simply can't occur in your
setup. But if I were an attacker wishing to pollute your random number
generator, I might look for a way to introduce correlations that would escape
detection by ordinary statistical means, and thus lower the entropy of your
generator without your knowing it.

Robert Scott
Ypsilanti, Michigan

bmearns

unread,

Nov 17, 2009, 10:09:51 AM11/17/09

to

On Nov 17, 7:40 am, n...@dont-mail-me.com (Robert Scott) wrote:

Thank you, Robert. That was very informative, but unfortunately, not
really that helpful to me: I'm still not sure of the best approach to
estimate the entropy of my data. I understand that I can't determine
the entropy exactly, for the reasons you very nicely illustrated. But
obviously entropy estimates are done, for instance by the Linux kernel
when deciding how much entropy it has in the /dev/random pool.

Based on a quick web search it would appear that there may not be any
one good answer to this, and that a more elaborate study of the data
and its source would be appropriate. However, I'm really not looking
for anything that involved, I'm planning to drastically underestimate
the entropy anyway (i.e., feeding two or three times more data than my
entropy calculation actually suggests into the hash function before
taking the output). But I would like to get a reasonable estimate, and
I'm just wondering what the best way is to do that.

On the one hand, I feel like calculating the entropy bit by bit (i.e.,
based on the Bernouli distribution of 1's and 0's) would give me a bad
overestimate as it does not account for any long-term trends. For
instance, a stream consisting entirely of alternating 1's and 0's
would have extremely high entropy by this measure (1 Sh/bit), but
obviously it's a completely non-random stream.

This leads me to think that using very large sequences of bits as the
elements over which I calculate the sample-entropy would be better.
But on the other hand, this reduces the amount of samples I have to
work with. For instance, if I have a 1024 bit stream and I take the
elements to be overlapping sequences of 1000 bits, then I only have 24
elements to work with, which doesn't seem like enough to give me a
reasonable estimate. Is there a sweet spot somewhere in the middle, or
should I use some combination of different sample-entropy calculations
(e.g. a weighted average)? I guess it makes sense that the more
relationships I can account for, the better, so I should calculate
sample-entropies per-bit, and per-byte, and per- every-other bit, and
per- bit with the bit 120 before it, and per- whatever other
relationships I can come up with. But then how do I combine them into
a unified estimate?

Thanks for the help, any more would be greatly appreciated.
-Brian

rossum

unread,

Nov 17, 2009, 11:43:39 AM11/17/09

to

On Mon, 16 Nov 2009 17:43:07 -0800 (PST), bmearns <mear...@gmail.com>
wrote:

>I'm trying to set up a system to use audio static from a clock radio
>as a
>random number generator (getting that static through the sound card).
>I have
>a couple questions on this, but the biggest one is how to calculate/
>estimate
>the entropy of my data. I'm familiar with the general entropy
>calculation,
>but I'm not sure what to calculate it on: per bit, per overlapping
>bytes,
>non overlapping bytes?

One way to estimate entropy is to look at compressibility - the more
compressible a stream is the less entropy it will have. Get a chunk
of input and zip it. Compare before and after sizes. If the
compressed data is too small, for wherever level of "too small" you
set, then reject it. Otherwise accept the data and pass it into your
hash function.

If you are using a good cryptographic hash then the preconditioning is
wasted effort, providing you have enough entropy in the input.

>
>Similarly, what about post processing? Again, I was thinking of the
>mono-bit
>filter. I was also thinking of running the FIPS tests on it (probably
>just
>passing it through rngtest in pipe mode) and discarding any sets that
>fail.
>Is it worth the extra processing and the possibility of discarding
>such large
>sets, or should I just trust a good hash function?

You will need to do some testing because with a physical device such
as a sound card there is always the possibility of component failure.
You do not neccessarily need to test every output, but you do need to
test output frequently enough to reassure yourself to the level of
security you require.

>
>Finally, if I do any pre-processing on the data, do I measure the
>entropy of
>the original data, or the processed data? My instinct tells me to use
>the
>processed data, but my instinct has been wrong in the past.

In theory both will have the same amount of entropy. In practice I
would be inclined to test the raw input. It is possible that later
processing may mask problems.

>
>Any help, comments, or other insight would be greatly appreciated.
>Thanks in
>advance.
>
>-Brian

You may find some useful input in John Denker's piece on a High
Entropy Randomness Generator:
http://www.av8n.com/turbid/paper/turbid.htm

rossum

Unruh

unread,

Nov 17, 2009, 1:06:56 PM11/17/09

to

rossum <ross...@coldmail.com> writes:

>On Mon, 16 Nov 2009 17:43:07 -0800 (PST), bmearns <mear...@gmail.com>
>wrote:

>>I'm trying to set up a system to use audio static from a clock radio
>>as a
>>random number generator (getting that static through the sound card).
>>I have
>>a couple questions on this, but the biggest one is how to calculate/
>>estimate
>>the entropy of my data. I'm familiar with the general entropy
>>calculation,
>>but I'm not sure what to calculate it on: per bit, per overlapping
>>bytes,
>>non overlapping bytes?
>One way to estimate entropy is to look at compressibility - the more
>compressible a stream is the less entropy it will have. Get a chunk
>of input and zip it. Compare before and after sizes. If the
>compressed data is too small, for wherever level of "too small" you
>set, then reject it. Otherwise accept the data and pass it into your
>hash function.

Compression is an extremely bad way of estimating entropy. An RC4 stream
has very very low entropy ( the key length or less, about 130 bits no matter how long the stream is) but
no compression engine would ever discover that ( and as far as is know,
no human could either). Ie, compression engines can find extremely
common and obvious correlations, but anything at all subtle completely
escapes them.

bmearns

unread,

Nov 17, 2009, 1:22:42 PM11/17/09

to

On Nov 17, 1:06 pm, Unruh <unruh-s...@physics.ubc.ca> wrote:
> rossum <rossu...@coldmail.com> writes:
> >On Mon, 16 Nov 2009 17:43:07 -0800 (PST), bmearns <mearn...@gmail.com>

Hello, again, Unruh. I hope you're not offended that I widened my
scope a little. I genuinely appreciate your insights on this as well.

I knew from our earlier discussion that compression was not a good
route as it would give me such a ridiculously high estimate of the
entropy as to do more harm than good. I was just reading about Fortuna
and how they do away with entropy estimates altogether through some
clever mechanism with different size entropy pools. This suggests that
they don't trust entropy estimates much at all. On the other hand,
it's designed to take "alleged entropy" from any number of unknown and
untrusted sources, so it's understandable that they wouldn't trust
estimates provided by the same. Seeing as how my entropy is coming
from a known (though not necessarily trusted) source, I still hold out
hope that there is a reasonable way to estimate the entropy of it's
data.

-Brian

Mok-Kong Shen

unread,

Nov 17, 2009, 1:37:52 PM11/17/09

to

Unruh wrote:

> Compression is an extremely bad way of estimating entropy. An RC4 stream
> has very very low entropy ( the key length or less, about 130 bits no matter how long the stream is) but
> no compression engine would ever discover that ( and as far as is know,
> no human could either). Ie, compression engines can find extremely
> common and obvious correlations, but anything at all subtle completely
> escapes them.

Very true. What applies in the above to RC4 apparently applies also
to any block cipher. AES in e.g. the counter mode would be able to
generate a very long sequence unlikely to be highly compressible and
yet has certainly no more than 128 bits (the key length). Looking at
this fact in the reverse direction, it seems to be an entirely
nontrivial question to know/decide how much entropy one would need
as a minimum in a bit sequence that one uses for a particular purpose
(or am I seeing a problem where there is in fact none?).

Thanks,

M. K. Shen

Unruh

unread,

Nov 17, 2009, 1:42:49 PM11/17/09

to

bmearns <mear...@gmail.com> writes:

>On Nov 17, 1:06=A0pm, Unruh <unruh-s...@physics.ubc.ca> wrote:
>> rossum <rossu...@coldmail.com> writes:
>> >On Mon, 16 Nov 2009 17:43:07 -0800 (PST), bmearns <mearn...@gmail.com>
>> >wrote:
>> >>I'm trying to set up a system to use audio static from a clock radio
>> >>as a
>> >>random number generator (getting that static through the sound card).
>> >>I have
>> >>a couple questions on this, but the biggest one is how to calculate/
>> >>estimate
>> >>the entropy of my data. I'm familiar with the general entropy
>> >>calculation,
>> >>but I'm not sure what to calculate it on: per bit, per overlapping
>> >>bytes,
>> >>non overlapping bytes?
>> >One way to estimate entropy is to look at compressibility - the more

>> >compressible a stream is the less entropy it will have. =A0Get a chunk
>> >of input and zip it. =A0Compare before and after sizes. =A0If the

>> >compressed data is too small, for wherever level of "too small" you

>> >set, then reject it. =A0Otherwise accept the data and pass it into your

>> >hash function.
>>
>> Compression is an extremely bad way of estimating entropy. An RC4 stream

>> has very very low entropy ( the key length or less, about 130 bits no mat=

>ter how long the stream is) but
>> no compression engine would ever discover that ( and as far as is know,
>> no human could either). Ie, compression engines can find extremely
>> common and obvious correlations, but anything at all subtle completely
>> escapes them.

>Hello, again, Unruh. I hope you're not offended that I widened my
>scope a little. I genuinely appreciate your insights on this as well.

>I knew from our earlier discussion that compression was not a good
>route as it would give me such a ridiculously high estimate of the
>entropy as to do more harm than good. I was just reading about Fortuna
>and how they do away with entropy estimates altogether through some
>clever mechanism with different size entropy pools. This suggests that
>they don't trust entropy estimates much at all. On the other hand,
>it's designed to take "alleged entropy" from any number of unknown and
>untrusted sources, so it's understandable that they wouldn't trust
>estimates provided by the same. Seeing as how my entropy is coming
>from a known (though not necessarily trusted) source, I still hold out
>hope that there is a reasonable way to estimate the entropy of it's
>data.

Certainly examine it. Look for biases in the bit values. Take a long
sequence and fourier transform it, looking for peaks in the transform
which are relatively independent of the frequency. Look for 1/f noise in
the fourier transform (ie noise which rises at low frequencies).
That will at least give you an initial estimate.
Strip the data down to its lowest one or two bits, and do that same
thing for that data stream, to see if the lowest bits have biases and
correleations.

bmearns

unread,

Nov 17, 2009, 1:53:29 PM11/17/09

to

On Nov 17, 1:42 pm, Unruh <unruh-s...@physics.ubc.ca> wrote:

[snip]

Sorry for being dense, but I'm not seeing how a Fourier xform gives me
an estimate. I understand the basic concepts of looking for
correlations in it, but as far as getting an actual numeric estimate,
I'm still in the dark.

Thanks,
-Brian

Mok-Kong Shen

unread,

Nov 17, 2009, 3:18:24 PM11/17/09

to

bmearns wrote:

> Unruh <unruh-s...@physics.ubc.ca> wrote:
>> Certainly examine it. Look for biases in the bit values. Take a long
>> sequence and fourier transform it, looking for peaks in the transform
>> which are relatively independent of the frequency. Look for 1/f noise in
>> the fourier transform (ie noise which rises at low frequencies).
>> That will at least give you an initial estimate.
> [snip]
>
> Sorry for being dense, but I'm not seeing how a Fourier xform gives me
> an estimate. I understand the basic concepts of looking for
> correlations in it, but as far as getting an actual numeric estimate,
> I'm still in the dark.

As a layman I was once interested in the issue of entropy and had
cast a glance to quite a number of books in the library. I don't
remember to have seen any mention of Fourier transform in that
connection. So, barring the expert's refutation, I surmise that
that must be a very new, possibly yet unpublished, methodology
of entropy computation.

On the other hand, I wonder why do you strongly require to compute
the value of entropy. There are good comprehensive test suites
for randomness. (I must say that I have no experience with them,
but I would personally include the Maurer's test anyway.) If your
bit sequences pass all the tests that are (reasonably) available
and you are sure that your hardware source has no technical defects
or malicious manipulations, then I would in your place simply use
such sequences for your intended purpose.

M. K. Shen

amzoti

unread,

Nov 19, 2009, 11:19:18 AM11/19/09

to

Look at the section "Calculating the entropy" at: http://www.comscire.com/Products/PCQNG20/

Also: http://www.comscire.com/Products/PCQNG20/HCalc.aspx

Maybe this helps.

~A

Joe Green

unread,

Dec 20, 2009, 12:08:28 AM12/20/09

to

Data examination, spectrum analysis and intimate knowledge of your randomness source is what is needed. Thus if you calculate 5 bits of randomness per sample, arithmetic says output 32 bits every 32/5 (round up to 7) times. Then double or triple that. I would go with 32 bits every 16 samples.

//Do it like this
hash_work_area tbl;
for(;;) {
for(j=0;j<16;j++) {
augment tbl with randomness source
tbl = someHash(tbl);
}
output 32 bits from table
}

After program start or power up be sure to toss the first outputs equivalent to at least 3 times the hash area size / bits of randomness per sample.

I will soon have a < US$100 USB TRNG that produces > 50K true random bytes per second. My processor is a 32-bit machine that runs the "augment tbl with randomness source and tbl = someHash(tbl)" steps for two samples in just over 3 microseconds. My hash work area is over 1200 bits with excellent and efficient augmentation and mixing.

Or you could cheat and no one could tell. (My USB TRNG does not cheat.)

lukeo

unread,

Dec 22, 2009, 7:11:55 AM12/22/09

to

I recently blogged about a USB Entropy Stick, and they use Maurer's
Universal statistical test which converges to the entropy
of a stream. Also see the other referenced paper which cleans up some
of Maurer's estimates/constants. Post is here

http://lukenotricks.blogspot.com/2009/12/usb-entropy-drive.html

rgs Luke

unruh

unread,

Dec 22, 2009, 1:38:58 PM12/22/09

to

On 2009-12-22, lukeo <lukejame...@gmail.com> wrote:
> On Dec 20, 6:08?am, Joe Green <joe.gr...@hmtown.com> wrote:
>> Data examination, spectrum analysis and intimate knowledge of your randomness source is what is needed. ?Thus if you calculate 5 bits of randomness per sample, arithmetic says output 32 bits every 32/5 (round up to 7) times. ?Then double or triple that. ?I would go with 32 bits every 16 samples. ?

>>
>> //Do it like this
>> hash_work_area tbl;
>> for(;;) {

>> ? for(j=0;j<16;j++) {
>> ? ? augment tbl with randomness source
>> ? ? tbl = someHash(tbl);
>> ? }
>> ? output 32 bits from table

>>
>> }
>>
>> After program start or power up be sure to toss the first outputs equivalent to at least 3 times the hash area size / bits of randomness per sample.
>>

>> I will soon have a < US$100 USB TRNG that produces > 50K true random bytes per second. ?My processor is a 32-bit machine that runs the "augment tbl with randomness source and tbl = someHash(tbl)" steps for two samples in just over 3 microseconds. ?My hash work area is over 1200 bits with excellent and efficient augmentation and mixing. ?
>>
>> Or you could cheat and no one could tell. ?(My USB TRNG does not cheat.)

>
> I recently blogged about a USB Entropy Stick, and they use Maurer's
> Universal statistical test which converges to the entropy
> of a stream. Also see the other referenced paper which cleans up some

No it does not converge to the entropy of a stream. Consider the stream
1 2 3 4 5 6 7 8.... run through say MD4. The result will fool any
entropy test on the output into reporting a maximal entropy stream. But
it clearly is not. It has almost the same entropy ( a few bytes more) as the stream 1 2 3 4 5 6 7 8
...

Maaartin

unread,

Dec 22, 2009, 2:19:04 PM12/22/09

to graj...@seznam.cz

On Dec 22, 7:38 pm, unruh <un...@wormhole.physics.ubc.ca> wrote:

> On 2009-12-22, lukeo <lukejamesocon...@gmail.com> wrote:
>
> No it does not converge to the entropy of a stream. Consider the stream
> 1 2 3 4 5 6 7 8.... run through say MD4. The result will fool any
> entropy test on the output into reporting a maximal entropy stream. But
> it clearly is not. It has almost the same entropy ( a few bytes more) as the stream 1 2 3 4 5 6 7 8

I think it will.... after seeing md4(1) the probabilty of seeing md4
(2) is very high (it'd be 1 for a block cipher, it may be 1, 1/2, 1/3
or anything like this for a hash, nobody knows). But for this you need
to work with blocks of size 128 bits and await the next occurence of
md4(1), which is with time and memory complexity of 2**128 completely
impossible. So there's convergence, but of no use to anybody. Or am I
wrong?

unruh

unread,

Dec 22, 2009, 4:13:27 PM12/22/09

to

On 2009-12-22, Maaartin <graj...@seznam.cz> wrote:

> On Dec 22, 7:38?pm, unruh <un...@wormhole.physics.ubc.ca> wrote:
>> On 2009-12-22, lukeo <lukejamesocon...@gmail.com> wrote:
>>
>> No it does not converge to the entropy of a stream. Consider the stream
>> 1 2 3 4 5 6 7 8.... run through say MD4. The result will fool any
>> entropy test on the output into reporting a maximal entropy stream. But

>> it clearly is not. It has almost the same entropy ( a few bytes more) ?as the stream 1 2 3 4 5 6 7 8

>
> I think it will.... after seeing md4(1) the probabilty of seeing md4
> (2) is very high (it'd be 1 for a block cipher, it may be 1, 1/2, 1/3
> or anything like this for a hash, nobody knows). But for this you need
> to work with blocks of size 128 bits and await the next occurence of
> md4(1), which is with time and memory complexity of 2**128 completely
> impossible. So there's convergence, but of no use to anybody. Or am I
> wrong?

Depends on how md4 is applied to the stream of successive integers.
Remember it can hash an arbitrary length. Thus there is no "recycling"
after 2^128 elements. And md4(1) is also equal to md4(r) for an
infinite number of r, and 1 never reoccurs. Ie, in the stream, the
probability of seeing md4(2) occur after md4(1) is tiny ( someting like
2^(-128).

Maaartin

unread,

Dec 22, 2009, 7:49:53 PM12/22/09

to graj...@seznam.cz

On Dec 22, 10:13 pm, unruh <un...@wormhole.physics.ubc.ca> wrote:
>
> Depends on how md4 is applied to the stream of successive integers.
> Remember it can hash an arbitrary length. Thus there is no "recycling"
> after 2^128 elements. And md4(1) is also equal to md4(r) for an
> infinite number of r, and 1 never reoccurs. Ie, in the stream, the
> probability of seeing md4(2) occur after md4(1) is tiny ( someting like
> 2^(-128).

Yes, but...

Let's assume, the 128 least significant bits come last. The internal
state is finite, so after seeing
md4(1), md4(2), and md4(3) in row again, you can bet the next sample
will be md4(4).
There's (afaik) no period, but the whole sequence consists of "only"
2**128 chunks of length 2**128.
Knowing this, you'd need "only" memory of 2**128 samples and about
2*256 time,
but I guess the universal statistical test would need 2**256 of both,

Actually, I don't think it converges to the right value of 0 bits per
sample,
from the above I assume it leads to no more than 128/(2**128) bits per
sample.

Let's assume, the 128 least significant bits come first.
The first 2**256 samples are the same as in the previous case, just
reordered,
but this would mean that the test had to look at values 2**128 samples
apart,
which is even more crazy. But maybe there's a better way.

I ignored the padding since considering it would need a precise
description of the representation of unlimited integers.

unruh

unread,

Dec 23, 2009, 12:48:04 AM12/23/09

to

On 2009-12-23, Maaartin <graj...@seznam.cz> wrote:

> On Dec 22, 10:13?pm, unruh <un...@wormhole.physics.ubc.ca> wrote:
>>
>> Depends on how md4 is applied to the stream of successive integers.
>> Remember it can hash an arbitrary length. Thus there is no "recycling"
>> after 2^128 elements. And md4(1) is also equal to md4(r) for an
>> infinite number of r, and 1 never reoccurs. Ie, in the stream, the
>> probability of seeing md4(2) occur after md4(1) is tiny ( someting like
>> 2^(-128).
>
> Yes, but...
>
> Let's assume, the 128 least significant bits come last. The internal
> state is finite, so after seeing
> md4(1), md4(2), and md4(3) in row again, you can bet the next sample
> will be md4(4).
> There's (afaik) no period, but the whole sequence consists of "only"
> 2**128 chunks of length 2**128.
> Knowing this, you'd need "only" memory of 2**128 samples and about
> 2*256 time,
> but I guess the universal statistical test would need 2**256 of both,
>
> Actually, I don't think it converges to the right value of 0 bits per
> sample,
> from the above I assume it leads to no more than 128/(2**128) bits per
> sample.
>
> Let's assume, the 128 least significant bits come first.

So what? That is not how md4 works. It depends on the all of the bits in
the input. So if the input is 100000 bits long, the output depends on
all of those input bits.

Maaartin

unread,

Dec 23, 2009, 1:39:20 AM12/23/09

to

On Dec 23, 6:48 am, unruh <un...@wormhole.physics.ubc.ca> wrote:
>
> So what? That is not how md4 works. It depends on the all of the bits in
> the input. So if the input is 100000 bits long, the output depends on
> all of those input bits.

Sure, it does. But I don't see
- what makes you believe I don't know it
- what part of what I wrote do you consider to be wrong
- how this should be an argument against it