On 8/28/15 7:34 AM, Gerhard Wolf wrote:
> Hi,
> I'm stuck with handling multibyte characters in text files.
> For example files with german umlauts.
>
> ----- File1 3 Bytes -----
> aüb
> -------------------------
> in Hex 0x61 0xFC 0x62
>
>
> ----- File2 4 Bytes -----
> aüb
> -------------------------
> in Hex 0x61 0x3C 0xBC 0x62
>
Your basic issue is that a text file is never just a text file: to
process it, your program needs to know how the file is encoded.
In general this isn't trivial, as there is often nothing in or beside
the file that tells you which encoding was used.
Your first file appears to be encoded with a single-byte 8-bit
character set, which requires you to know which character set is in
use (often called the 'codepage'); in this case it is likely some
variant of ISO/IEC 8859, since in ISO 8859-1 (Latin-1) 0xFC is ü.
The second appears to use a multibyte encoding (and if that second
byte is actually 0xC3 rather than 0x3C, it would be UTF-8, i.e.
Unicode).
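For reference, that is exactly how UTF-8 stores this character: ü is
U+00FC, which UTF-8 encodes as the two bytes 0xC3 0xBC, i.e.
110_00011 10_111100; concatenating the payload bits gives
000_1111_1100 = 0xFC, the same value Latin-1 stores in one byte.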
The problem is that the second file would also be a valid file if
interpreted in the same codepage as the first, just with different
characters. If it were 8859-1, for example, the file would read aÃ¼b
(assuming those characters pass through usenet properly).
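To see that concretely, here is a quick Python 3 sketch (assuming the
second byte really is 0xC3) decoding the same four bytes both ways:

    data = bytes([0x61, 0xC3, 0xBC, 0x62])
    print(data.decode('utf-8'))    # aüb  - 0xC3 0xBC is one character
    print(data.decode('latin-1'))  # aÃ¼b - every byte is one character

Both decodes succeed; the bytes alone can't tell you which reading the
author intended.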
Ideally, something will tell you what the encoding of the file is
(and some document formats, such as XML or HTML, can declare it inside
the file itself). If not, you really need to figure it out, which is
mostly guessing.
One 'trick' that is sometimes used, if you really can't know the
encoding ahead of time, is to first check whether the file is valid
UTF-8, and if so assume that it is. Due to the (intentional)
redundancy built into UTF-8, it is unlikely that a normal non-UTF-8
file will validate as UTF-8. The one major exception is a file with no
high bits set, but detecting that as UTF-8 just amounts to assuming
standard ASCII (some codepages change the meaning of some of the first
128 characters, but that is fairly unusual).
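In Python 3 that check is just an attempted strict decode; a minimal
sketch of the idea (the function name is my own):

    def looks_like_utf8(raw: bytes) -> bool:
        # A strict UTF-8 decode fails on any byte sequence that
        # violates the redundancy rules of the encoding.
        try:
            raw.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False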
If it doesn't validate as UTF-8, you will mostly need to guess;
generally you assume the codepage that is the default for your system.
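Putting the two steps together, a hedged sketch of that strategy (the
helper name read_guessed is mine, and locale.getpreferredencoding() is
just one way to ask for the system default):

    import locale

    def read_guessed(path):
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            # First see whether the bytes validate as UTF-8.
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Otherwise fall back to the system default codepage.
            return raw.decode(locale.getpreferredencoding(False))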