
New way to post binaries


Bill Silvert

Oct 18, 1986, 10:59:09 AM
Here are some opinions about the problem of posting binaries, along with
a draft solution. There should be some discussion on the net before it
gets implemented.

Sources are no substitute for binaries, since not everyone has the same
compiler, or even language, on micros.

Binaries have to be encoded as ASCII files. But there is no reason why
we have to use uuencode! There are evidently problems with it, and we
should feel free to invent an alternate encoding method which avoids the
problems with uuencode. These problems, aside from the minor one that
uuencode is designed for the Unix environment, are that some characters
(such as curly braces {}) do not make it through all nodes unscathed
(IBM machines and others with EBCDIC codes appear to be the culprits),
and that long files have to be combined in an editor before decoding.
Another problem is that uudecode is a complicated program which a lot of
users have trouble getting or rewriting.

I propose that we develop an encoding method for microcomputers that
meets these requirements:

> So simple that users can easily learn the protocol and write their own
version of the decoding program. Uudecode is relatively easy to write
in C, but gets tricky in languages that do not have low-level bit
operations.

> Moderately compact, to keep the traffic volume down.

> Reasonably good error trapping to check for damaged files.

> Convenient to use, preferably not requiring the use of an editor even
for multi-part postings.

One possibility would be to post hex files, but these are very bulky, at
least twice as long as the binary being posted. However, a
generalization of posting hex will work -- if we encounter the letter G
in a hex file we know it is an error, but we can also adopt the
convention that the letters G-Z do not have to be encoded, so that they
are represented by one byte in the encoded file instead of two. This
can save a lot of space. Based on this, here is my proposal:

*** TO ENCODE A FILE ***

Read through the file a byte at a time, and classify each byte as
follows:

>OK, pass through unchanged

>TRANSFORM to a single byte

>ENCODE as a pair of bytes

The encoding I propose is a modified hex, using the letters A-P instead
of the usual hex 0-9A-F -- the reason for this is that it is trivial to
map this way, e.g., value = char - 'A'. The rest of the upper case letters,
Q-Z, can be used for error checking and for 1-byte transformations of
common non-graphic bytes, such as NULL and NEWLINE. Thus the actual
encoding rules could be:

>OK includes digits 0-9, lower case alphabet, and punctuation marks.

>TRANSFORM \0 -> Q, \r -> R, space -> S, \t -> T, etc.

>ENCODE all upper case letters and other characters into modified hex
codes, AA to PP.
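
To make the rules concrete, here is a minimal sketch of the encoder in
C. Assumptions are flagged: the TRANSFORM table covers only the four
pairs given above (the "etc." is left open), and a real version might
want to move fragile punctuation such as {} into the ENCODE class. The
line wrapping leans on the convention, adopted below, that white space
in the encoded file is ignored.

/* sketch only -- not a finished program */
#include <stdio.h>
#include <ctype.h>

static int col = 0;

static void put(int ch)              /* emit one byte, wrap long lines */
{
    putchar(ch);
    if (++col >= 60) {
        putchar('\n');
        col = 0;
    }
}

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (isdigit(c) || islower(c) || ispunct(c))
            put(c);                      /* OK: pass through unchanged */
        else if (c == '\0') put('Q');    /* TRANSFORM to one byte */
        else if (c == '\r') put('R');
        else if (c == ' ')  put('S');
        else if (c == '\t') put('T');
        else {                           /* ENCODE as modified hex, AA-PP */
            if (col >= 59) {             /* never split a pair */
                putchar('\n');
                col = 0;
            }
            put('A' + ((c >> 4) & 0x0F));
            put('A' + (c & 0x0F));
        }
    }
    if (col > 0)
        putchar('\n');
    return 0;
}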

I have done this encoding on a number of files using a crude set of
programs that I wrote a while back when I didn't have xmodem working on
my net machine and couldn't get uudecode working on my micro -- the
files were generally no larger than uuencoded files, often smaller.

To avoid very long lines, adopt the convention that white space is
ignored, so that you can put in newlines wherever you want (probably not
in the middle of a hex pair though).

To decode a file, one simply reverses the process. Read through the
file a byte at a time, and use switch or a set of ifs to do the
following:

>letter A-P? Read next byte and output 16*(first-'A') + (second - 'A')

>letter Q-Z? Output \0, \r, etc., according to above table.

>anything else? Output it as it stands.
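
A matching sketch of the decoder, again in C. Only the four published
TRANSFORM entries are reversed (the rest of Q-Z is left open above),
and a second character outside A-P is treated as damage -- a crude
version of the error trapping asked for earlier.

#include <stdio.h>

int main(void)
{
    int c, lo;
    while ((c = getchar()) != EOF) {
        if (c == ' ' || c == '\t' || c == '\n')
            continue;                         /* white space is ignored */
        if (c >= 'A' && c <= 'P') {           /* modified-hex pair */
            lo = getchar();
            if (lo < 'A' || lo > 'P') {
                fprintf(stderr, "damaged pair\n");
                return 1;
            }
            putchar(16 * (c - 'A') + (lo - 'A'));
        }
        else if (c == 'Q') putchar('\0');     /* TRANSFORM table reversed; */
        else if (c == 'R') putchar('\r');     /* U-Z are reserved for more */
        else if (c == 'S') putchar(' ');      /* entries ("etc.")          */
        else if (c == 'T') putchar('\t');
        else putchar(c);                      /* pass through as it stands */
    }
    return 0;
}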

*** REFINEMENTS ***

I haven't said anything yet about error checking, convenience, etc.
Note that there are several byte combinations that are not used in this
scheme of things, specifically a letter A-P followed by Q-Z. These can
be used to add such features. For example, an encoded file should
begin with the pair AZ and end with PZ, similar to the begin and end
lines used by uuencode. However, we could also adopt the convention
that when a file is broken into parts, the first part ends with BZ, the
next begins with CZ, and so on. This way one could simply decode a set
of files without first combining them -- the program would start at the
AZ flag, and stop when it found BZ. Then it would go on to the next
file and search for CZ, etc. If it didn't find PZ at the end of the
last file, or if the codes were out of order, it would complain.
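
The marker check by itself is tiny. A sketch in C of just that part,
scanning a stream of concatenated parts and complaining as described.
Since a letter A-P followed by Z never occurs in encoded data, the scan
cannot be fooled by the data itself; PZ is accepted as the terminator
wherever it falls, since the last part ends with PZ whatever its number.

#include <stdio.h>

int main(void)
{
    int c, prev = EOF, expect = 'A';   /* AZ must come first */
    while ((c = getchar()) != EOF) {
        if (c == 'Z' && prev >= 'A' && prev <= 'P') {
            if (prev == 'P') {
                puts("archive complete");
                return 0;
            }
            if (prev != expect) {
                fprintf(stderr, "marker %cZ out of order\n", prev);
                return 1;
            }
            expect++;                  /* next expected: BZ, CZ, ... */
        }
        prev = c;
    }
    fprintf(stderr, "no PZ at end of last part\n");
    return 1;
}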

Further refinements would be to add various checksums, set off by other
unused code pairs. I'll pass on this one, since it sounds like a good
idea, but adds to the complication. Perhaps it could be made optional,
such as writing a checksum after each termination code like BZ ... PZ.

If this idea seems reasonable, perhaps net moderators could carry the
ball from here. Unfortunately this site is not very reliable for news
and mail.

Computer Sci Club

Oct 21, 1986, 11:22:12 AM
I mentioned in a previous article that I had written some arbitrary 8 bit
encoding programs that I could easily turn into an archiver specifically
designed for network transfer. Since there seems to be some interest in
the topic in general, I'll tell you what my programs do.

In the most general case, my encoding routines are given a list of characters
that the transmission medium does not want to see. They will then never
generate those characters. They accept a stream of arbitrary 8 bit bytes,
and produce an encoded or decoded stream.

The actual encoder operates in two modes: eight bit encoding (M8) and
seven bit encoding (M7). The encoder decides what mode it should be running
in, depending on the cost of the mode.

In seven bit encoding mode, any seven bit character can be transmitted. Any
character the encoder has been told is pathological is mapped into a two
character escape code. This mode is obviously very cheap for text. Any
transmittable character is sent as itself.

In eight bit mode, the encoder can accept any 8 bit value. Each group of
three input bytes is turned into four output 6 bit values. These six bit
values are then mapped onto the ASCII characters "a-zA-Z,.".
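
A sketch of that packing step in C. One caveat: the character list as
posted comes to 54 symbols, while 6-bit values need 64, so the alphabet
below adds the digits 0-9 as an assumption to round out the set.

#include <stdio.h>

static const char alphabet[65] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.";

int main(void)
{
    unsigned char in[3];
    size_t n;
    while ((n = fread(in, 1, 3, stdin)) > 0) {
        unsigned long v = 0;
        size_t i;
        for (i = 0; i < 3; i++)           /* pad a short final group */
            v = (v << 8) | (i < n ? in[i] : 0);
        for (i = 0; i < n + 1; i++)       /* n input bytes -> n+1 output */
            putchar(alphabet[(v >> (18 - 6 * i)) & 0x3F]);
    }
    putchar('\n');
    return 0;
}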

Run length encoding is performed in both modes.

The archive format I was considering would have a special archive control
character (something non-controversial) which would never be generated
by the encoder. The archive control characters would signal the beginning
of easily parsed text strings that would describe the beginning and end
of archived files, their CRCs and lengths. It would be possible to
generate checkpoints in the archive. The archiver could extract all
undamaged files from a damaged archive.

The checkpoints could be used for retransmitting parts of the archive. If
an archive was damaged, the archiver could tell the user which part of the
archive needed to be replaced. Anyone else who had a complete archive could use
that information to generate just the data needed to repair the broken archive.

The archiver will treat a set of unordered files as an archive. Each
file would be searched for a header. The header would identify the archive
and be used to order the parts. It would then read the files in the
correct order. The archive creation command would automatically generate
a numbered set of files for posting, according to a maximum size constraint.
No more editing and cat'ing of news articles.

My experiments show that this program encodes a.out files slightly more
cheaply than uuencode, and that text files are very cheap. The experimental version
does not transmit any of the following characters: all control characters,
"<>{}[]^|\\~", and del.

If there is any interest, another fellow here, Mike Gore, and I will
put this together out of code we already have (for encoding, CRC checking)
and we will post the source (for UNIX and Atari) and uuencoded versions for
the Atari ST.

It would be written to be portable.

Comments?

Tracy Tims
mail to ihnp4!watmath!unit36!tracy

Computer Sci Club

Oct 21, 1986, 5:53:19 PM
As a followup to my previous article, I have come across a new encoding
technique (courtesy of the Math Faculty Computing Facility here at
the university) which has the following properties:

- eight bit input data
- very restricted character set
- generally compresses files rather than expanding them
(including binaries)
- needs no bit-level twiddling
- involves only table lookup

The encoding system is easy to implement. I am whipping one up now.

This should make the transmittable archive format very easy to implement.
My co-developer, Mike Gore, is planning to do a Basic (gack spew) version
after I write one in C.

Still: comments?

Ken Thompson

Oct 22, 1986, 9:58:37 AM
I, for one, am strongly opposed to trying to develop a new "standard"
decoding scheme. For one thing, it will be next to impossible to get
agreement from such a large group on what it should be. The effort will
cause more problems and confusion than we already have.

I have never experienced the problems with curly braces but it is possible
I suppose that we don't ever get messages that pass through EBCDIC machines
here. Certainly, I have never gotten any C source that had the curly braces
corrupted.

I have versions of uudecode in Turbo Pascal, C, and Microsoft BASIC. This
is pretty wide availability, and the BASIC, while slow, should be easily
adaptable to most machines. I don't think there will be too many problems
getting a version. Just ask in net.sources.wanted.

The problem with having to use the editor has nothing to do with
uuencode/uudecode. The problem is that some news software running on some
machines on the net truncates files longer than 64K bytes. Unless the mail
software is changed, you will always need to get rid of the mail
header/signature information put in by mail, as this is ASCII too. You will
probably have to do this with an editor. I failed to find this task difficult.

I don't know about other sites, but 99% of the problems we have here are
files damaged on their way in, either because the sender posted files
larger than 64K and they got truncated, or because some of the information
was simply corrupted in transit. A new scheme is not going to fix this.

I vote that we stick with uudecode/encode. If you have problems with these,
I am sure someone on the net will be glad to help you get them worked out.
I receive files all the time that have been arced and then uuencoded and
am able to reverse the process without problems.


--
Ken Thompson Phone : (404) 894-7089
Georgia Tech Research Institute
Georgia Institute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!thomps

braner

Oct 24, 1986, 12:09:58 AM
[]

I've been thinking recently about transferring ST screen-dumps (32K
bit-maps, one bit deep in "hires" (monochrome) mode) over the modem
(to a VAX running UNIX, to print the graphics on a laser-printer).

Since my applications are concerned with B&W line drawings which are
mostly white space, it would be very inefficient to send the whole
32K. Also I'd like some compression method to apply before I
ever save it on the ST disk.

I guess I could use ARC or some other standard, general purpose
compression program. But it seems that an algorithm designed
especially for this purpose should beat general ones handily!
I am thinking about an algorithm that would compare bytes (or words)
VERTICALLY DOWN THE SCREEN, rather than along consecutive RAM.
Such an algorithm would find more runs of identical bytes. It should
also work for color.
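
For what it's worth, here is a rough C sketch of that vertical scan,
just to make the idea concrete. The geometry assumes the ST hires
screen (400 rows of 80 bytes), and the run format -- a count and a
byte value per output line -- is invented purely for illustration.

#include <stdio.h>

#define ROWS 400
#define COLS 80     /* bytes per scan line in hires mode */

int main(void)
{
    static unsigned char screen[ROWS * COLS];
    if (fread(screen, 1, sizeof screen, stdin) != sizeof screen) {
        fprintf(stderr, "short read\n");    /* expects a raw 32K dump */
        return 1;
    }
    for (int col = 0; col < COLS; col++) {  /* walk down each column */
        int row = 0;
        while (row < ROWS) {
            unsigned char b = screen[row * COLS + col];
            int run = 1;
            while (row + run < ROWS && run < 255 &&
                   screen[(row + run) * COLS + col] == b)
                run++;                      /* extend the vertical run */
            printf("%d %02x\n", run, b);
            row += run;
        }
    }
    return 0;
}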

Is such a thing available (preferably PD)? Is anybody working on one?
Or should I do it myself?

- Moshe Braner

bar...@maggot.applicon.uucp

Oct 24, 1986, 3:40:00 PM

Tracy, I followed your idea from your last posting of doing strange things
with restricted character sets, and I think I have found a nice way to do
encoding/decoding without too much overhead. I'll send you the sources if you want.

Encoding scheme:

Assumptions:

The character set contains the digits 0-9,
and at LEAST the letters A-P.

The letters A-P are the hex values 0-15. Two hex "digits" always go
together, so AA is 0 and PP is 255 (standard hex, with different characters).
The digits 0-9 are used as a decimal repeat count. The count is 1 less than
the number of characters to repeat. This implies that a repeat count of 1
is interpreted as "two of the following" (obviously no repeat count implies
that there is only one of the following).

Caching is implemented by using other characters (except whitespace) as
"cache" values for the most frequently used byte values. This requires a
two-pass encoding scheme, but a one-pass decoder.

The encoded file has a series of lines at the head, separated from the body
of the encoded text by a blank line. This is the only whitespace dependency.
Each line in the head contains three characters (plus the newline):
a character for the cached representation, and a two-hex-digit code that the
cache character represents. Since the caches can be anything besides 0-9, A-P
and whitespace, that leaves a lot of choices. Plus it can even be dynamic.
No caching leads right to a hex encoder (with its two-for-one expansion),
but with caching characters enabled, the size of the encoded output is
smaller.

An example of encoding:

The line:

This is a test of the emergency broadcasting system <52 characters>

Becomes:

<blank line>
FEGIGJHDCAGJHDCAGBCAHEGFHDHECAGPGGCAHEGIGF
CAGFGNGFHCGHGFGOGDHJCAGCHCGPGBGEGDGBHDHEGJ
GOGHCAHDHJHDHEGFGNAK <108 characters>

Which is like BinHex (in a way), but if I use the letters
Q-Z as caching values, the file changes to:

QCA <Q is caching the hex value CA>
UGB <U is caching the hex value GB>
WGD < Etc... >
RGF
XGH
YGI
VGJ
ZGN
SHD
THE

FEYVSQVSQUQTRSTQGPGGQTYRQRZRHCXRGOWHJQGCHC
GPUGEWUSTVGOXQSHJSTRZAK

Which is the same size (108 characters), but since the input is short
it does not give the caching a chance to be effective.
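
The one-pass decoder for this format is straightforward. A hedged
sketch in C -- one open question is whether repeat counts can run to
more than one digit, which isn't spelled out above, so this reads a
single digit:

#include <stdio.h>

static int cache[256];          /* cache char -> byte value, or -1 */

int main(void)
{
    char line[16];
    int c, i, repeat = 0;

    for (i = 0; i < 256; i++)
        cache[i] = -1;

    /* head: one cache definition per line ("QCA" = Q stands for the
       pair CA), terminated by the blank line */
    while (fgets(line, sizeof line, stdin) && line[0] != '\n')
        cache[(unsigned char)line[0]] =
            16 * (line[1] - 'A') + (line[2] - 'A');

    while ((c = getchar()) != EOF) {
        int byte;
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;                              /* ignore whitespace */
        if (c >= '0' && c <= '9') {                /* repeat count */
            repeat = c - '0';
            continue;
        }
        if (c >= 'A' && c <= 'P')                  /* plain hex pair */
            byte = 16 * (c - 'A') + (getchar() - 'A');
        else if (cache[c] >= 0)                    /* cached value */
            byte = cache[c];
        else
            continue;                              /* unknown: skip */
        do putchar(byte); while (repeat-- > 0);    /* count+1 copies */
        repeat = 0;
    }
    return 0;
}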

I'll compare my encoding scheme to uuencode and report on the results.
On to gigantic programs, and since it's for binaries, I used my Sun 3.0
kernel as the acid test:

-rwxr-xr-x 1 root 493853 Apr 8 1986 /pub/vmunix <plain>
-rw-r--r-- 1 barada 680443 Oct 24 14:05 vmunix.uuencoded <uuencode>
-rw-r--r-- 1 barada 670910 Oct 24 14:07 vmunix.encoded <encoded>
Note: the cache values used were Q-Z and a-z (36 cache values)
-rw-r--r-- 1 barada 600208 Oct 24 14:31 vmunix.encoded <encoded>
Note: the cache values used were Q-Z, a-z, and:
!@#$%^&*()-_+=|{[}]:;<>,./ (62 cache values)

As can be seen, my encoding scheme can produce a smaller output than
uuencode, but I don't have any checksums in my scheme.

Compressing the above encodings produce the following results:

-rw-r--r-- 1 barada 387571 Oct 24 14:05 vmunix.uuencoded.Z
-rw-r--r-- 1 barada 346875 Oct 24 14:07 vmunix.encoded.Z

So my encoding scheme expands a binary file by 35.85% compared to
uuencode's 37.78%, which isn't much different.

What is really useful is the ability to limit the caching values to only
those that can pass through the networks, and to be almost completely
whitespace independent. It doesn't perform badly, either.

Any comments or ideas???


--
Peter Barada | (617)-671-9905
Applicon, Inc. A division of Schlumberger Ltd. | Billerica MA, 01821

UUCP: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!barada
{amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!barada

Sanity is only a state of mind.

Kenneth Ng

Oct 24, 1986, 8:52:03 PM
In article <20...@dalcs.UUCP>, sil...@dalcs.UUCP (Bill Silvert) writes:
> Here are some opinions about the problem of posting binaries, along with
> a draft solution. There should be some discussion on the net before it
> gets implemented.
>
> Binaries have to be encoded as ASCII files. But there is no reason why
> we have to use uuencode! There are evidently problems with it, and we
> should feel free to invent an alternate encoding method which avoids the
> problems with uuencode. These problems, aside from the minor one that
> uuencode is designed for the Unix environment, are that some characters
> (such as curly braces {}) do not make it through all nodes unscathed
> (IBM machines and others with EBCDIC codes appear to be the culprits),
> and that long files have to be combined in an editor before decoding.

The problem is square brackets, which do not exist in EBCDIC in any
standard form. This becomes a real pain even with source programs
written in C and Pascal.

--
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey 07102
uucp !ihnp4!allegra!bellcore!argus!ken
*** WARNING: NOT k...@bellcore.uucp ***
!psuvax1!cmcl2!ciap!andromeda!argus!ken
bitnet (preferred) k...@orion.bitnet

McCoy: "This won't hurt a bit"
Chekov: "That's what you said last time"
McCoy: "Did it?"
Chekov: "Yes"

Robert E. Fortin

Oct 26, 1986, 9:01:03 PM
In article <12...@batcomputer.TN.CORNELL.EDU> bra...@batcomputer.UUCP (braner) writes:
>[]
>
>I've been thinking recently about transferring ST screen-dumps (32K
>bit-maps, one bit deep in "hires" (monochrome) mode) over the modem
>(to a VAX running UNIX, to print the graphics on a laser-printer).
>
>Since my applications are concerned with B&W line drawings which are
>mostly white space, it would be very inefficient to send the whole
>32K. Also I'd like some compression method to compress it before I
>ever save it on the ST disk.
>

I have the C source for a compression algorithm that reduces files to
about 20-30% of their original size. It uses the LZW algorithm (if you
know what that is - I don't). It would be great to compress your files,
but you might want to uuencode them before transmitting them. The only
problem is that I don't have a C-compiler yet. If anyone is interested,
I could get it to work on Unix 4.3 and then someone could translate it
to the ST. You would need to keep the uncompress algorithm on your host
system anyway.


Bob Fortin
{allegra seismo decvax}!rochester!ref0070

j...@mitre-bedford.arpa

Oct 28, 1986, 6:56:45 AM
One disadvantage to the method mentioned in the above-referenced message is
that each encoded byte transmits only 4 bits of information.

By comparison, uuencoded data transmits 6 bits per encoded byte, obviously
50% more for your money.

The problem with uuencode is that it uses certain troublesome characters such
as spaces to encode data. This problem would be immediately eliminated if we
were to revise uuencode to use only non-troublesome characters. For example,
the set a-z, A-Z, 0-9, '.', and '/', or any other set of 64 noncontroversial
characters would do the trick. However, it would make programming simpler if
the selected characters were in two contiguous ranges, such as a-z and
56789:;<=>?@ plus A-Z. (I think that adds up to 64 of 'em!)
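
It does: 26 + 26 + 10 + 2 = 64, and the second suggestion is one
contiguous ASCII run from '5' up through 'Z' (38 characters) plus a-z
(26). A tiny C check of that mapping, just to make the arithmetic
visible:

#include <stdio.h>

int main(void)
{
    /* print the proposed 64-character alphabet: a-z, then the
       contiguous ASCII run from '5' through 'Z' */
    for (int v = 0; v < 64; v++)
        putchar(v < 26 ? 'a' + v : '5' + (v - 26));
    putchar('\n');
    return 0;
}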

John Sangster
j...@mitre-bedford.arpa

braner

Oct 29, 1986, 12:58:44 PM
[]

By now I have a working assembly-language RAM-resident program that, upon Alt-Help,
saves the screen in a file in a format that is BOTH compressed and
in a text (modem-able) form. Doing both in one algorithm
is not only convenient, it is essential for getting the most compact
final product. A typical desktop yields a TEXT FILE of about 7600
chars: 25% of the length of the bit map!

(Details of the coding algorithm do not belong here. It is similar
to uuencode, but does not use space characters, nor is it sensitive
to added or deleted control characters or spaces.)

I am now working on a decoding program, to view such files, and on
a translator from the compressed format to Postscript (my intended
use of this whole mess is to send ST graphics to be printed on a
remote Laserwriter...). I'll post it all when done.

Problems:

The coding program (I call it scode, or perhaps sencode?)
works fine from the desktop, but hangs if Alt-Help is pressed inside
Micro-C-Shell. Why? (When first run, the program replaces the screen
dump vector at $502, then does Ptermres() to stay in RAM.)
(I also bite hard and do OS TRAP calls from my code, even though it
is called from the Alt-Help interrupt handler. I don't really have
a choice, do I?)

Also: At the end of the ROM screen dump routine, it has:

ADDQ.L #4,A7
RTS

If I end my dump routine with the same or with just plain RTS, it
works the same. What's going on?

Any advice would be appreciated.

- Moshe Braner

Paul Smee

Oct 31, 1986, 11:40:08 AM

I'd agree with Ken Thompson. uuencode may not be great, but it is usable --
sometimes with a little thought, using inspiration derived from the
uuencode doc. And it is more-or-less standardly available on Unix systems.
If we invent a new encode/decode technique, I foresee continual requests from
new people joining the net for yet another retransmission of fredcode -- or
whatever it's called.

From this standpoint, even hex encoding is optimal, as it is obvious when
you see a hex file what you should do to unpack it. I fear new baroque
and clever coding techniques will cause more confusion than they clear up.
(Unless, of course, you can manage to get the new stuff packaged in as a
standard bit of Unix systems.)

(Also, a quick comment on the original version of the proposal --
The character set A-P is only contiguous on ASCII machines. Fine for
my purposes, but not that handy for EBCDIC users. Of course, I've been
known to argue that EBCDIC users deserve whatever they get, but I'm
prepared to accept that others might disagree; and besides, it's not clear
there *is* any nice subset of chars in EBCDIC.)

Computer Sci Club

Nov 5, 1986, 10:32:08 PM
I (with some others) am in the process of building a printable character
archiver, called "earthpig" (it's an aarchiver, :-)). The following is an
example of the compression its printable character encoding algorithm gets.

The "test" file is /bin/vi. I have shown the compression ratios for
the various interesting files. A file ending in "pig" is an earthpig
encoded file. A file ending in "uue" is a uuencoded file. A file ending
in 'Z' is a "compress"ed file.

test 131338
test.Z 70103 (0.533 of test)
test.pig 143087 (1.090 of test)
test.uue 180979 (1.378 of test)
test.pig.Z 73175
test.uue.Z 94691
test.Z.pig 96698 (0.736 of test)
test.Z.uue 96611 (0.735 of test)

For compressed data, it does about as well as uuencode. For uncompressed
data it's quite a lot better. Small binaries (under 20K) shrink slightly.
C programs and text files shrink slightly or stay about the same.

Earthpig uses only the characters

+-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
@.,;:=?*"'/!()_%&

This character set should make it through almost anything unchanged. The
algorithm only uses table lookup: no bit masking or shifting.

When we finish the archiver we will post various versions of it to the
net.

What it does:

- can generate correction requests from errors
- can generate patches from correction requests
- CRC checking on two levels
- supports OS-independent hierarchical file names
- high immunity to format changes and noise characters (space/control)
- close to 1:1 encoding on uncompressed data

The basic goal of earthpig is to provide a single tool that will allow
the transfer of arbitrary data and software around the network while
providing a very high level of confidence that the data arrived correctly.

0 new messages