idea : UTF-8 inspired retroImage encoding

Michal Wallace

Nov 6, 2012, 11:19:17 AM
to retr...@googlegroups.com
Hey all,

I've been thinking about using a variable-length encoding to compress the retro image, especially for transmission over HTTP.

If you look at vm/web/html5/retroImage.js ( a human-readable dump of retroImage ), almost all of the numbers require WAY fewer than 32 bits.

UTF-8 uses a variable-length encoding that would shrink each opcode and ASCII character ( or any number below 128 ) down to a single byte of storage. Any larger number would require multiple bytes, and the very highest numbers would need a sequence of 4 to 7 bytes.

Basically, in UTF-8, the upper 128 values of the 8-bit range are reserved for lead and continuation bytes that signal the shifts and additions needed to rebuild a larger value, and they're arranged very logically.
Also, the byte classes can only appear in specific orders, so most random multi-byte sequences are invalid.
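
For reference, the standard UTF-8 byte ranges break down like this:

    0x00..0x7F : a value from 0..127, standing alone ( plain ASCII )
    0x80..0xBF : continuation bytes, each carrying 6 more bits
    0xC0..0xDF : lead byte of a 2-byte sequence ( values up to 0x7FF )
    0xE0..0xEF : lead byte of a 3-byte sequence ( values up to 0xFFFF )
    0xF0..0xF7 : lead byte of a 4-byte sequence ( values up to 0x1FFFFF )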

For example, if you look at a UTF-8 encoded retroImage in a hex editor and see the byte 0x71, you know it represents either the letter "q" or the number 113... that byte never appears for any other reason. Compare the straight 32-bit binary encoding, where 0x71 also shows up inside 29,162 ( 0x71EA ); in UTF-8, you would encode that number as E7 87 AA :

   0xE7 --> contributes 0x7 << 12 = 28672 ( 0x7000 ); E0..EF always indicates a 3-byte sequence
   0x87 --> contributes 0x07 << 6 = 448 ( the next 6 bits )
   0xAA --> contributes 0x2A = 42 ; 28672 + 448 + 42 = 29,162
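
Decoding is just a handful of shifts and masks. Here's a rough sketch of what it might look like on the ngaro.js side ( the function name and the plain-array output are just placeholders, and a real decoder would also want to reject malformed sequences ):

    // turn a UTF-8 style byte stream back into 32-bit image cells (sketch only)
    function decodeImage( bytes ) {
      var cells = [], i = 0, n, extra;
      while ( i < bytes.length ) {
        var b = bytes[ i++ ];
        if ( b < 0x80 ) { cells.push( b ); continue; }            // 0xxxxxxx : literal 0..127
        if ( b < 0xC0 ) { throw "unexpected continuation byte"; }  // 10xxxxxx can't start a value
        if      ( b < 0xE0 ) { n = b & 0x1F; extra = 1; }          // 110xxxxx : 2-byte sequence
        else if ( b < 0xF0 ) { n = b & 0x0F; extra = 2; }          // 1110xxxx : 3-byte sequence
        else                 { n = b & 0x07; extra = 3; }          // 11110xxx : 4-byte sequence
        while ( extra-- ) { n = ( n << 6 ) | ( bytes[ i++ ] & 0x3F ); } // fold in 6 bits per byte
        cells.push( n );
      }
      return cells;
    }

Feeding it [ 0xE7, 0x87, 0xAA ] gives [ 29162 ], matching the worked example above.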

One potential glitch is negative numbers. Unicode doesn't account for negative numbers, obviously, so -1 would require seven bytes under that scheme. I think I will probably just use byte values that are invalid in UTF-8 to indicate negative numbers ( maybe 0xFF = true / -1, 0xFE = a general negative marker ).
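
To make that concrete, the per-cell encoder might look something like this ( the 0xFF / 0xFE markers are just the tentative idea above, not part of UTF-8, and I've only sketched sequences up to 4 bytes ):

    // encode one cell; 0xFF stands for true / -1, 0xFE means "negate the value that follows"
    function encodeCell( n, out ) {
      if ( n === -1 ) { out.push( 0xFF ); return; }
      if ( n < 0 )    { out.push( 0xFE ); n = -n; }
      if      ( n < 0x80     ) { out.push( n ); }                          // 1 byte
      else if ( n < 0x800    ) { out.push( 0xC0 | ( n >>> 6 ),
                                           0x80 | ( n & 0x3F ) ); }        // 2 bytes
      else if ( n < 0x10000  ) { out.push( 0xE0 | ( n >>> 12 ),
                                           0x80 | ( ( n >>> 6 ) & 0x3F ),
                                           0x80 | ( n & 0x3F ) ); }        // 3 bytes
      else if ( n < 0x200000 ) { out.push( 0xF0 | ( n >>> 18 ),
                                           0x80 | ( ( n >>> 12 ) & 0x3F ),
                                           0x80 | ( ( n >>> 6 ) & 0x3F ),
                                           0x80 | ( n & 0x3F ) ); }        // 4 bytes
      else { throw "would need the longer 5-7 byte sequences"; }
    }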

Anyway, as a transport encoding for JavaScript, I think this is an all-around win, and I'm very likely going to implement it for ngaro.js... but I'm curious whether the same system could be used elsewhere... For example, the character-oriented file device could automatically handle UTF-8, and the character generator could be extended to support code pages.

BTW: all of this would be transparent to retro. Once code got into RAM, each character and instruction would still be 32 bits, padded with zeros. It would just be an optional feature in the ngaro implementations.

Thoughts?

Helpful links:

    http://en.wikipedia.org/wiki/UTF-8 ( has a nice 16 x 16 table of the encoding scheme )
    http://www.endmemo.com/convert/EMUnicode.php  ( online decimal converter )
