i wrote a patch that provides UTF-8 + binary in one codec with no
hand-waving, using Markus Kuhn's brilliant proposal to represent each
invalid byte 0xyz as the unpaired low surrogate U+DCyz. this means there
need not be a text/binary distinction for UTF-8-using programs. valid
UTF-8 decodes/encodes correctly, and all other bytes are handled as
"opaque" U+DCxx on input and serialized back to the original bytes on
output. so one can once
again consider editing a binary format with a "notepad"-type editor
without sacrificing internationalization support.
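
a rough way to see the round trip without applying the patch: the python
sketch below does not use the libiconv patch at all. it assumes python 3's
built-in "surrogateescape" error handler, which should map undecodable
bytes 0x80..0xFF to U+DC80..U+DCFF in the same spirit. the helper names
decode_utf8b and encode_utf8b are made up for this example.

```python
# sketch only: uses python's "surrogateescape" handler as a stand-in,
# NOT the libiconv utf-8b patch. helper names are hypothetical.

def decode_utf8b(data: bytes) -> str:
    # valid UTF-8 decodes normally; each invalid byte 0xyz becomes U+DCyz
    return data.decode("utf-8", errors="surrogateescape")

def encode_utf8b(text: str) -> bytes:
    # normal characters encode as UTF-8; each U+DCyz goes back to byte 0xyz
    return text.encode("utf-8", errors="surrogateescape")

# valid UTF-8 ("caf" + e-acute + space) followed by three invalid bytes
raw = b"caf\xc3\xa9 \xff\xfe\x80"
text = decode_utf8b(raw)

# show the code points, including the "opaque" surrogates
print([hex(ord(c)) for c in text])

# the original bytes come back unchanged
assert encode_utf8b(text) == raw
```

the point of the design is that the decoded string can be edited like any
other text, and the untouched "opaque" characters still serialize to the
exact bytes they came from.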

Markus Kuhn's description of the idea (search for "option d"):
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

the patch:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

enjoy! (not sure how/whether this fits into the official distro, but i
hope it gets used)
-ben
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/