
recognising plaintext


Iain

Nov 28, 2002, 11:28:59 AM

I am trying to get my head around statistical cryptanalysis - with
somewhat limited success :=)

I understand the theory of recognising plaintext using statistics of
common letter occurrences (mono/di/tri-grams, etc.)

My problem is that whilst plenty of sites describe the maths involved
(Shannon's Information Theory, unicity distance, etc), I cannot find one
site which explains how to actually *use* these in implementing a brute
force cracking machine.
How does one actually implement a function which will
recognise a valid block of English text without actually displaying it
onscreen and asking the user to decide if it is the plaintext?

Does it involve building an ascii character lookup table? Or can it be
done mathematically just by adding through the results and dividing in
some way?

I would be really grateful if someone could help me with this.

Regards

Iain A. Mcleod


Peter Fairbrother

Nov 28, 2002, 11:59:45 AM
Iain wrote

> I am trying to get my head around statistical cryptanalysis - with
> somewhat limited success :=)
>
> I understand the theory of recognising plaintext using statistics of
> common letter occurrences (mono/di/tri-grams, etc.)
>
> My problem is that whilst plenty of sites describe the maths involved
> (Shannon's Information Theory, unicity distance, etc), I cannot find one
> site which explains how to actually *use* these in implementing a brute
> force cracking machine.
> How does one actually implement a function which will
> recognise a valid block of English text without actually displaying it
> onscreen and asking the user to decide if it is the plaintext?
>
> Does it involve building an ascii character lookup table? Or can it be
> done mathematically just by adding through the results and dividing in
> some way?
>
> I would be really grateful if someone could help me with this.

Practically speaking, it's easier to look for spaces between words. Most
plaintext has spaces these days. Decrypt say 128 bytes and count the spaces.
If there are more than twelve then you have a likely candidate.

Or you can count "e"'s instead, but don't bother with a whole table.
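
A minimal sketch of that space count in C (the 128-byte block and the
twelve-space threshold are just the figures above, and the trial decryption
is assumed to be sitting in a buffer already):

#include <stddef.h>

/* Cheap plaintext test: count the spaces in a trial decryption.
   More than about twelve spaces per 128 bytes marks a candidate. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i, spaces = 0;

    for (i = 0; i < len; i++)
        if (buf[i] == ' ')
            spaces++;

    return spaces > (len * 12) / 128;
}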
--
Peter Fairbrother

Iain

Nov 28, 2002, 12:48:23 PM

Thanks for your prompt reply.

Are you sure that's enough to do it?

Rgds.

Iain

"Peter Fairbrother" <zenad...@zen.co.uk> wrote in message
news:BA0BFC01.2857C%zenad...@zen.co.uk...

Peter Fairbrother

Nov 28, 2002, 1:26:56 PM
Iain wrote

> Thanks for your prompt reply.
>
> Are you sure that's enough to do it?

It's usually enough, but it may not be the best way. It depends on what you
know about the plaintext and the cypher.

If there are no spaces (unusual) then checking for them won't work; if it's
in a foreign encoding or language then you'll have to work around that. If
it's an image then you have different problems.

If you know that it's all-ascii plaintext then often you can just check
whether or not the bytes are ascii bytes; that's probably the best way. Go
through the bytes one at a time and discard the decryption as soon as you
hit a non-ascii byte. If as many bytes as there are bits in the key all turn
out to be ascii, get a human to do a final check.

It's a good idea to eliminate impossible texts first using a simple check,
then do more complex checks on decryptions that pass the first test.
Whatever, it isn't usually necessary to create bigram tables (or even letter
frequency tables). Just count the most common characters.
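
A minimal sketch of that byte-by-byte check in C (assuming "ascii" here
means printable characters plus ordinary whitespace - tighten the range to
suit what you know about the plaintext):

#include <stddef.h>

/* Reject a trial decryption as soon as a byte falls outside printable
   ascii or common whitespace. */
int all_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (!((c >= 0x20 && c < 0x7f) || c == '\n' || c == '\r' || c == '\t'))
            return 0;   /* discard this decryption, try the next key */
    }
    return 1;           /* survived: hand it to a human for a final check */
}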

--
Peter Fairbrother

Douglas A. Gwyn

Nov 28, 2002, 1:30:26 PM
Iain wrote

>how to actually *use* these in implementing a brute
>force cracking machine.

Just compute the delta I.C., and reject when the bulge is
less than half what you'd expect for plaintext. That's
good enough to weed out almost all incorrect keys. Then
for the remaining candidates correlate the uniliteral
frequencies (which you have anyway from the delta I.C.
computation) against the reference plaintext source
probabilities, and keep track of the key with the so-far
highest correlation. You could try for a higher-order
match but it's rarely necessary.
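
A rough sketch of the two steps in C (a sketch only: the reference letter
probabilities are approximate, a 26-letter alphabet is assumed, and "bulge"
is read as the excess of the delta I.C. over the 1.0 expected for random
text, roughly 0.73 for English):

#include <stddef.h>

/* Approximate English letter probabilities, a..z (illustration only). */
static const double eng[26] = {
    .082, .015, .028, .043, .127, .022, .020, .061, .070, .002, .008, .040,
    .024, .067, .075, .019, .001, .060, .063, .091, .028, .010, .024, .002,
    .020, .001
};

/* Delta I.C. of the letters in buf, normalised so random text is about 1.0
   and English about 1.73; also fills counts[] for the correlation step. */
double delta_ic(const unsigned char *buf, size_t len, long counts[26])
{
    size_t i;
    long n = 0;
    double sum = 0.0;

    for (i = 0; i < 26; i++)
        counts[i] = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c >= 'A' && c <= 'Z') c += 'a' - 'A';
        if (c >= 'a' && c <= 'z') { counts[c - 'a']++; n++; }
    }
    for (i = 0; i < 26; i++)
        sum += (double)counts[i] * (counts[i] - 1);
    return n > 1 ? 26.0 * sum / ((double)n * (n - 1)) : 0.0;
}

/* Correlate the uniliteral frequencies against the reference probabilities;
   keep the candidate key with the highest score. */
double freq_score(const long counts[26])
{
    int i;
    double s = 0.0;

    for (i = 0; i < 26; i++)
        s += eng[i] * counts[i];
    return s;
}

Reject a trial key when the bulge (delta_ic() minus 1.0) comes out below
about half of 0.73, and keep the best freq_score() among the survivors.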


Bryan Olson

Nov 28, 2002, 8:02:34 PM
"Iain" asked:

> Are you sure that's enough to do it?

Quite often it is. For more advanced methods see:


http://www.cs.umbc.edu/~sherman/Rtopics/Crypto/plain.html


--
--Bryan

Criptyk Hayz

Nov 28, 2002, 8:22:30 PM
I would suggest iterating through the text and counting the number of
common trigrams: the number of matches divided by the length of the text
gives you a percentage to compare.

In English you'd expect to see a fair number of matches. "the", "ing" and
"and" should all be common; "the" alone should probably be close to, or
over, 1% of the trigrams.

If the text were random (as a bad decryption would be), this number would
be extremely low.

Even easier, count the Es. Around 10-12% of the text should be the letter E
(closer to 10% if you count spaces, which you probably would).

Random text would be way, way lower - maybe 0.4% or so (assuming a range
of 256 possible byte values).
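
A minimal sketch of the trigram count in C (checking only a handful of
common trigrams, case-sensitively against lower-case text; the roughly-1%
figure for "the" is the ballpark above):

#include <stddef.h>
#include <string.h>

/* Fraction of positions in buf that start one of a few common English
   trigrams.  Extend the list, or use a full trigram table, if you need
   more discrimination. */
double trigram_hits(const unsigned char *buf, size_t len)
{
    static const char *common[] = { "the", "ing", "and", "ion", "ent" };
    size_t i, hits = 0;
    int t;

    if (len < 3)
        return 0.0;
    for (i = 0; i + 3 <= len; i++)
        for (t = 0; t < 5; t++)
            if (memcmp(buf + i, common[t], 3) == 0) {
                hits++;
                break;
            }
    return (double)hits / (double)(len - 2);
}

English text should give a noticeably higher value than the near-zero you
get from a bad decryption.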

"Iain" <no@spam> wrote in message news:3de65295$1...@post.newsfeed.com...

Douglas A. Gwyn

Nov 29, 2002, 2:05:22 AM
Douglas A. Gwyn wrote:
> Just compute the delta I.C., and ...

I should point out that it isn't necessary to compute
the full formula (which has some normalizing factors)
when the normalizing factors remain the same for every
computation, as they would in this case. A simple sum
of f*(f-1) is enough for relative comparison of the
keys. This general principle is used a lot to speed
things up in practice.
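
In C the shortcut amounts to a few lines (a sketch; counts[] is a 26-entry
letter tally from the trial decryption):

/* Unnormalised sum of f*(f-1): fine for ranking trial keys against each
   other when they are all scored on the same amount of ciphertext. */
long ic_sum(const long counts[26])
{
    int i;
    long s = 0;

    for (i = 0; i < 26; i++)
        s += counts[i] * (counts[i] - 1);
    return s;
}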

Rob Marston

Nov 28, 2002, 12:38:13 PM
Simple - let's look at trigrams:

1) Get some sample text and calculate the trigram frequencies of that text.
2) Get your cipher text from a known encryption system.
3) Guess a key.
4) Decode the cipher text into plain text using the key and the decoding
algorithm.
5) Work out the trigram frequencies of this new text.
6) Score the result by comparing these trigram frequencies with the known
ones.
7) Change something in the key.
8) Do (4) (5) (6) again.
9) If the score is better keep the change, else throw it away.
10) Repeat 8 & 9 until you have the best score you can generate.
11) Start again at (3) with a new random key.

(7) is where it gets clever, and the faster you can make (8) & (9) the
better - see the sketch below.

Have a look at my program http://www.de-crypt.co.uk/ to see such a system
working
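
A bare-bones sketch of the loop in C (the decrypt() and trigram_score()
functions are hypothetical stand-ins for steps 4-6, the 16-byte key and the
iteration count are arbitrary choices, and the single-bit flip in step 7 is
deliberately naive - a real climber would be smarter):

#include <stdlib.h>
#include <string.h>

extern void   decrypt(const unsigned char *ct, size_t len,
                      const unsigned char key[16], unsigned char *pt);
extern double trigram_score(const unsigned char *pt, size_t len);

/* One hill-climbing run from a random key (steps 3-10); assumes len fits
   in the pt buffer and that rand() has been seeded elsewhere. */
double climb(const unsigned char *ct, size_t len, unsigned char best_key[16])
{
    unsigned char key[16], trial[16], pt[4096];
    double best, score;
    int i;

    for (i = 0; i < 16; i++)                     /* step 3: guess a key    */
        key[i] = rand() & 0xff;
    decrypt(ct, len, key, pt);                   /* steps 4-6: first score */
    best = trigram_score(pt, len);

    for (i = 0; i < 100000; i++) {
        memcpy(trial, key, 16);
        trial[rand() % 16] ^= 1 << (rand() % 8); /* step 7: tweak the key  */
        decrypt(ct, len, trial, pt);             /* step 8: rescore        */
        score = trigram_score(pt, len);
        if (score > best) {                      /* step 9: keep if better */
            best = score;
            memcpy(key, trial, 16);
        }
    }
    memcpy(best_key, key, 16);                   /* step 10: best so far   */
    return best;
}

Call climb() repeatedly (step 11) and keep the best result across runs.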

Rob


Peter Fairbrother

Nov 29, 2002, 6:23:39 AM
Rob Marston wrote

> Simple, lets look at Trigrams,
>
> 1) Get some sample text and calculate the trigram frequency of the text.
> 2) Get your cipher text of a known encryption system.
> 3) Guess a key.
> 4) Decode the cipher text into plain text using the key and decoding
> algorithm
> 5) Work out the trigram frequency of this new text
> 6) Score the result by comparing with known trigram score with this one
> 7) Change something in the key
> 8) Do (4) (5) (6) again
> 9) If the score is better keep the change, else throw it
> 10) Repeat 8 & 9 until you have the best score you can generate
> 11) Start again at (3) with a new random key
>
> (7) is where it gets clever,and the faster you can make (8) & (9) the better

This won't work with a well-designed cypher (ok it will eventually, but it
will be sloowwwww).

Just count spaces or "e"s. There is no need to mess with letter frequency
tables, bigrams etc, it's usually a complete waste of time.

If the plaintext is known to be ascii you can eliminate 3/4 of decryptions
(on a 32-bit processor) with just two operations - an AND on the first bits
of each byte of trial plaintext, and a Jcc to a further test on the next 31
bits if the result is zero - otherwise discard decryption and increment
trial key.

Depending on the cypher you may be able to do trial decryptions quicker by
other methods than incrementing the key, eg by keeping some unchanging
key-dependent s-boxes, doing parallel computations on related keys, or
things like that. It's worth looking into.

If you are counting spaces or "e"s you can do something similar. Try testing
packed values.
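
A rough C illustration of the word-at-a-time idea (this is one reading of
the AND-and-branch trick above, not the actual assembler; the mask tests the
top bit of four bytes of trial plaintext at once):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Quick reject: any byte with its top bit set cannot be ascii. */
int ascii_word_check(const unsigned char *buf, size_t len)
{
    size_t i;
    uint32_t w;

    for (i = 0; i + 4 <= len; i += 4) {
        memcpy(&w, buf + i, 4);
        if (w & 0x80808080u)
            return 0;           /* discard decryption, increment trial key */
    }
    for (; i < len; i++)
        if (buf[i] & 0x80)
            return 0;
    return 1;                   /* passed the cheap test; run finer checks */
}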

It is generally better to write this stuff directly in assembler (assuming
you know a bit about optimisation, branch hints and the like). If you can
keep it all in L1 cache (not too hard, Athlon is better than Pentium for
this) it should really fly - but brute forces are inherently long.

These tests are simple, effective if you know a little about the plaintext
(language, encoding), and _very_ much faster than the average trial
decryption. They will add little to the overall time required. Put your
effort into doing trial decryptions quicker instead.


--
Peter Fairbrother

Rob Marston

Nov 29, 2002, 12:04:33 PM
> Just count spaces or "e"s. There is no need to mess with letter frequency
> tables, bigrams etc, it's usually a complete waste of time.


What if the plaintext doesn't have an E?

Rob


Douglas A. Gwyn

Nov 30, 2002, 4:25:48 AM
Rob Marston wrote:
> What if the plaintext doesn't have an E?

That's why it is better to use something like delta IC.

Peter Fairbrother

Nov 30, 2002, 7:28:47 AM
Douglas A. Gwyn wrote

> Rob Marston wrote:
>> What if the plaintext doesn't have an E?
>

Then it's almost certainly not in English (unless it's very short).

You can total the "common letters" instead of just "e"'s. This can be
quicker overall (by saving the time needed to trial decrypt extra
cyphertext).


> That's why it is better to use something like delta IC.

A single-letter index of coincidence test may sometimes be advantageous,
especially if the language is unknown. It can pick up a lot of weirdness.
Bigram and trigram counts are not usually needed.

For reliable results in English it will need around 200 letters of trial
plaintext though, and counting spaces and/or common letters will give
reliable results with less.

I just reread that last paragraph (it's from a quote), and I'm not sure it's
correct. Will do some math, and let you know. Any ref's, before I start?

If you know the language used then a weighted (against expected) frequency
count is probably better than an IC.

The fastest-overall method depends on what you know about the plaintext, the
cost of the tests, their reliability, the highly cypher-dependent cost of
trial decryptions, and so on.

But for a simple method that usually works, just count spaces, or common
letters, or even the total of both.

--

Peter Fairbrother

rjh

Nov 30, 2002, 7:36:19 AM
Peter Fairbrother wrote:

> Douglas A. Gwyn wrote
>
>> Rob Marston wrote:
>>> What if the plaintext doesn't have an E?
>>
>
> Then it's almost certainly not in English (unless it's very short).

Almost right. A fairly big book has shown that it is not /always/ so.

(I forget the title, but I have every respect for the author of that fairly
big book, which did not contain a single letter 'e'. See those first two
sentences above? It took me several minutes to compose them.)

--
Richard Heathfield : bin...@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton

Michael Scott

Nov 30, 2002, 8:07:40 AM

"rjh" <bin...@eton.powernet.co.uk> wrote in message
news:asabc2$991$1...@helle.btinternet.com...

> Peter Fairbrother wrote:
>
> > Douglas A. Gwyn wrote
> >
> >> Rob Marston wrote:
> >>> What if the plaintext doesn't have an E?
> >>
> >
> > Then it's almost certainly not in English (unless it's very short).
>
> Almost right. A fairly big book has shown that it is not /always/ so.
>
> (I forget the title, but I have every respect for the author of that fairly
> big book, which did not contain a single letter 'e'. See those first two
> sentences above? It took me several minutes to compose them.)
>

The book is called Gadsby, and you can read about here

http://gadsby.hypermart.net/

Mike Scott

Joe Peschel

Nov 30, 2002, 8:22:22 AM
"Michael Scott" msc...@indigo.ie writes:

>The book is called Gadsby, and you can read about here
>
>http://gadsby.hypermart.net/

Yes, Wright's horrible novella, tossed together in less than a year.

J
__________________________________________

Joe Peschel
D.O.E. SysWorks
http://members.aol.com/jpeschel/index.htm
__________________________________________

Mok-Kong Shen

Nov 30, 2002, 12:08:13 PM

Michael Scott wrote:
>

> The book is called Gadsby, and you can read about here
>
> http://gadsby.hypermart.net/
>

It says that the text is available online. I clicked on the link
'Introduction by Ernest Wright and chapters', but it didn't seem to work.

M. K. Shen

rjh

Nov 30, 2002, 12:21:13 PM
Mok-Kong Shen wrote:

I thought that too at first, but I eventually found the text (using
Konqueror, so it's not IE-specific). Here are the first three paragraphs:

"IF YOUTH, THROUGHOUT all history, had had a champion to stand up for it; to
show a doubting world that a child can think; and, possibly, do it
practically; you wouldn't constantly run across folks today who claim that
"a child don't know anything." A child's brain starts functioning at birth;
and has, amongst its many infant convolutions, thousands of dormant atoms,
into which God has put a mystic possibility for noticing an adult's act,
and figuring out its purport.

Up to about its primary school days a child thinks, naturally, only of play.
But many a form of play contains disciplinary factors. "You can't do this,"
or "that puts you out," shows a child that it must think, practically or
fail. Now, if, throughout childhood, a brain has no opposition, it is plain
that it will attain a position of "status quo," as with our ordinary
animals. Man knows not why a cow, dog or lion was not born with a brain on
a par with ours; why such animals cannot add, subtract, or obtain from
books and schooling, that paramount position which Man holds today.

But a human brain is not in that class. Constantly throbbing and pulsating,
it rapidly forms opinions; attaining an ability of its own; a fact which is
startlingly shown by an occasional child "prodigy" in music or school work.
And as, with our dumb animals, a child's inability convincingly to impart
its thoughts to us, should not class it as ignorant."

Kevin Buhr

Dec 5, 2002, 9:41:07 PM
"Iain" <no@spam> writes:
>
> I understand the theory of recognising plaintext using statistics of
> common letter occurances (mono/di/tri-grams, etc.)

You definitely don't have to get that fancy. Assuming the encryption
algorithm is reasonably strong (i.e., something more sophisticated
than a polyalphabetic cipher or XORing with a repeated key), text
"decrypted" with the wrong key will have a flat frequency
profile---every byte will be equally likely. Text decrypted with the
right key will have a non-flat profile (byte frequencies will match
something else, like English text, an executable file, or whatever).

You can use several tests to determine when the frequency profile is
generically "not flat". A really good test is the chi-square goodness
of fit test. For each trial decryption, you calculate the statistic:

chisq = sum_{i=0 to 255} (O_i-E_i)^2/E_i

where O_i is the observed number of bytes with value "i" and E_i is
the expected number of such bytes under the flat profile (that is, the
total length of the ciphertext divided by 256).

Then, you choose a critical value (from a table of chi-squared values)
depending on how many false positives you're willing to tolerate, and
you try keys until the statistic exceeds that critical value. When it
does, you've got the right decryption (or you've got a false positive).

In practice, you can make the probability of a false positive
ridiculously small. For example, if you're only willing to tolerate
one false positive in a million million decryptions, the critical
value is 448, independent of the size of the ciphertext.

In practice, even a small English plaintext will produce a chi-squared
statistic far in excess of that critical value when it is properly
decrypted.
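
In C the statistic itself is only a few lines (a sketch; the flat profile is
the 256-bin uniform expectation described above):

#include <stddef.h>

/* Chi-squared of a trial decryption's byte counts against a flat profile.
   Large values mean the bytes are far from uniformly distributed. */
double chisq_flat(const unsigned char *buf, size_t len)
{
    long counts[256] = { 0 };
    double expected = (double)len / 256.0;
    double chisq = 0.0;
    size_t i;

    for (i = 0; i < len; i++)
        counts[buf[i]]++;
    for (i = 0; i < 256; i++) {
        double d = (double)counts[i] - expected;
        chisq += d * d / expected;
    }
    return chisq;   /* compare against your chosen critical value, e.g. 448 */
}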

For example, I took the text of your article (952 bytes) and encrypted
it with the OpenSSL DES implementation using the key "test". Then, I
decrypted it using the incorrect keys "wrong", "bad", and "notclose"
as well as the correct key "test" and calculated the chi-squared
values of the resulting "plaintext". Here are the results:

key         chisq
-----------------
wrong         241
bad           252
notclose      255
test        13473

It would take about a million million wrong keys before I saw one that
gave a chisq of more than 448, and yet the correct key gives a
chi-square of more than 13,000!

And this algorithm isn't limited to plaintext. The "openssl" binary,
when encrypted and then decrypted with the wrong key, has a
chi-squared value of 249. (That's not surprising. Regardless of the
length of the plaintext, an incorrectly decrypted version should have
a statistic of about 255, the number of degrees of freedom.) If I
decrypt it with the right key, the plaintext has a chi-squared value
of 3,019,119. This gigantic number has more to do with the size of
the plaintext (about 260k) than the fact it is an executable versus an
English text, though.

Even if your plaintext is quite small, like the size of this sentence,
you'll be in pretty good shape. That last sentence had a chi-squared
value of 1790, so even at our very high 448 critical value (that only
finds a false positive one in a million million times), we would have
caught it.

--
Kevin <bu...@telus.net>
