On 8/28/15 7:34 AM, Gerhard Wolf wrote:
> Hi,
> I'm stuck with handling multibyte characters in text files.
> For example files with german umlauts.
>
> ----- File1 3 Bytes -----
> aüb
> -------------------------
> in Hex 0x61 0xFC 0x62
>
>
> ----- File2 4 Bytes -----
> aüb
> -------------------------
> in Hex 0x61 0x3C 0xBC 0x62
>
Your basic issue is that a text file is never just a text file: to
process it, your program needs to know how the file is encoded.
In general this isn't trivial, as there is often nothing in or beside
the file that tells you which encoding was used.
Your first file appears to be encoded with a single-byte 8-bit
character set, which requires you to know which character set is in
use (often called the 'codepage'); in this case it is likely some
variant of ISO/IEC 8859, since in ISO 8859-1 (Latin-1) 0xFC is ü.
The second appears to use a multibyte encoding (and if that second
byte is actually 0xC3 rather than 0x3C, it would be UTF-8, i.e.
Unicode).
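For reference, that is exactly how UTF-8 stores this character: ü is
U+00FC, which UTF-8 encodes as the two bytes 0xC3 0xBC, i.e.
110_00011 10_111100; concatenating the payload bits gives
000_1111_1100 = 0xFC, the same value Latin-1 stores in one byte.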
The problem is that the second file would also be a valid file if
interpreted in the same codepage as the first, just with different
characters. If it were 8859-1, for example, the file would read aÃ¼b
(assuming those characters pass through usenet properly).
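To see that concretely, here is a quick Python 3 sketch (assuming the
second byte really is 0xC3) decoding the same four bytes both ways:

    data = bytes([0x61, 0xC3, 0xBC, 0x62])
    print(data.decode('utf-8'))    # aüb  - 0xC3 0xBC is one character
    print(data.decode('latin-1'))  # aÃ¼b - every byte is one character

Both decodes succeed; the bytes alone can't tell you which reading the
author intended.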
Ideally, something will tell you what the encoding of the file is
(and some document formats, such as XML or HTML, can declare it inside
the file itself). If not, you really need to figure it out, which is
mostly guessing.
One 'trick' that is sometimes used, if you really can't know the
encoding ahead of time, is to first check whether the file is valid
UTF-8, and if so assume that it is. Due to the (intentional)
redundancy built into UTF-8, it is unlikely that a normal non-UTF-8
file will validate as UTF-8. The one major exception is a file with no
high bits set, but detecting that as UTF-8 just amounts to assuming
standard ASCII (some codepages change the meaning of some of the first
128 characters, but that is fairly unusual).
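In Python 3 that check is just an attempted strict decode; a minimal
sketch of the idea (the function name is my own):

    def looks_like_utf8(raw: bytes) -> bool:
        # A strict UTF-8 decode fails on any byte sequence that
        # violates the redundancy rules of the encoding.
        try:
            raw.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False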
If it doesn't validate as UTF-8, you will mostly need to guess;
generally you assume the codepage that is the default for your system.
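Putting the two steps together, a hedged sketch of that strategy (the
helper name read_guessed is mine, and locale.getpreferredencoding() is
just one way to ask for the system default):

    import locale

    def read_guessed(path):
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            # First see whether the bytes validate as UTF-8.
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Otherwise fall back to the system default codepage.
            return raw.decode(locale.getpreferredencoding(False))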