
how to read utf-16 file using streams


Paul Nolan · Aug 17, 2001, 9:15:45 AM
Hi,

Does anybody know how to read a UTF-16 encoded file using ifstream/wifstream?

rgds,

- Paul.

John Harrison · Aug 17, 2001, 3:14:28 PM

"Paul Nolan" <pfn...@ireland.com> wrote in message
news:a3415c1c.01081...@posting.google.com...

I don't think you have any alternative but to implement the UTF-16
algorithm by hand. I don't think C++ is going to do it for you.

john

Ron Natalie · Aug 17, 2001, 3:37:34 PM

John Harrison wrote:
>
> I don't think you have any alternative but to implement the UTF-16
> algorithm by hand. I don't think C++ is going to do it for you.
>

If he is lucky, his implementation uses UTF-16 to encode wchar's.

What do you mean by "UTF-16 algorithm"?

John Harrison · Aug 17, 2001, 4:56:12 PM

"Ron Natalie" <r...@sensor.com> wrote in message
news:3B7D727E...@sensor.com...

The encoding/decoding scheme for UTF-16

john

Ron Natalie · Aug 17, 2001, 5:30:37 PM

I know what UTF-16 is; I still have no clue what "scheme"
you are talking about.

John Harrison · Aug 17, 2001, 5:24:31 PM
> >
> > What do you mean by "UTF-16 algorithm"?
>
> The encoding/decoding scheme for UTF-16
>

I beg your pardon, the UTF-8 scheme used to encode 16-bit characters. I
presume that this was what the OP was asking about; I might have
presumed wrong.

john

Paul Nolan · Aug 18, 2001, 11:03:12 AM
"John Harrison" <jah...@bigfoot.com> wrote in message news:<ZXff7.6885$0c2.1...@news2-win.server.ntlworld.com>...

Hi,

What happens is that when I read in a line from a UTF-16 file, I get
something similar to the following:

00 34 00 56 00 67 .....

How do I convert this to a wide string ??

rgds,

- Paul.

John Harrison · Aug 18, 2001, 11:24:42 AM

"Paul Nolan" <pfn...@ireland.com> wrote in message
news:a3415c1c.01081...@posting.google.com...

Since I completely misunderstood your original question, I'm not sure
I'm the best person to answer this.

How are you reading it currently? What happens if you read it using a
wide stream?

john

Dietmar Kuehl · Aug 18, 2001, 8:40:27 PM
Hi,

Paul Nolan wrote:
> Does anybody know how to read a UTF-16 encoded file using ifstream/wifstream?

Sure, here we go:

std::wifstream in;
in.imbue(utf16_locale);
in.open("your file name goes here");
// ... and now just use 'in' to read your UTF-16 file

The tricky part here is, of course, the line with 'imbue()' and
where the argument comes from. Basically, 'utf16_locale' has to
be an 'std::locale' object which uses an
'std::codecvt<wchar_t, char, std::mbstate_t>' facet converting between
UTF-16 external representation and the internal representation.

If you are lucky and your library vendor has implemented this (I would
expect commercial library vendors to do this for UTF-8 and UTF-16),
you can get such a locale object with something like this:

std::locale utf16_locale("en_US.utf16");

Of course, the C++ standard does not provide any naming scheme for
locale names and thus this name is mostly just made up: You have to
look through your C++ library implementation and/or scan through
support files. How such a locale is named is up to the vendor or
maybe some other standard I'm not aware of.

If you are less lucky, you would have to implement the corresponding
facet yourself. This would mean you derive a class from
'std::codecvt<wchar_t, char, std::mbstate_t>' and implement the
various members. If you really just need input, you might get away with
implementing only a subset, but I would recommend implementing all
of the members. If you need help with this, post a corresponding
question to comp.lang.c++.moderated (and probably you should also
CC it to me so that I don't miss it :-)
--
<mailto:dietma...@yahoo.com> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
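
(For illustration - a minimal sketch of the facet Dietmar describes,
not his code. It assumes a little-endian file, a wchar_t of at least
16 bits, and BMP-only text, i.e. no surrogate-pair handling; the name
'utf16le_codecvt' is made up. Only the input side is overridden, so a
full implementation would also cover do_out, do_unshift, and
do_length, as he recommends.)

#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>

class utf16le_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit utf16le_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
protected:
    // External bytes -> internal wide characters.
    virtual result do_in(std::mbstate_t&,
        const char* from, const char* from_end, const char*& from_next,
        wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
    {
        from_next = from;
        to_next = to;
        // Consume complete two-byte units only; an odd trailing byte
        // waits for the next call. A leading BOM (FF FE) is passed
        // through as U+FEFF rather than stripped.
        while (from_end - from_next >= 2 && to_next != to_end) {
            unsigned char lo = static_cast<unsigned char>(from_next[0]);
            unsigned char hi = static_cast<unsigned char>(from_next[1]);
            *to_next++ = static_cast<wchar_t>(lo | (hi << 8));
            from_next += 2;
        }
        return from_next == from_end ? ok : partial;
    }
    virtual bool do_always_noconv() const throw() { return false; }
    virtual int do_encoding() const throw() { return 2; } // fixed: 2 bytes per character
    virtual int do_max_length() const throw() { return 2; }
};

Plugged into the recipe above (imbue before open):

    std::wifstream in;
    in.imbue(std::locale(std::locale::classic(), new utf16le_codecvt));
    in.open("your file name goes here", std::ios_base::binary);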

Michael Rubenstein · Aug 19, 2001, 7:16:45 PM
On Fri, 17 Aug 2001 17:30:37 -0400, Ron Natalie <r...@sensor.com>
wrote:

He probably means the algorithm used to convert a UTF-16 string
to fixed-width characters.
--
Michael M Rubenstein

Ron Natalie · Aug 20, 2001, 9:15:21 AM

Paul Nolan wrote:

> What happens is that when I read in a line from a UTF-16 file, I get
> something similar to the following:
>
> 00 34 00 56 00 67 .....
>
> How do I convert this to a wide string ??

Use the wifstream into wchar_t or wstring.
That data looks like the UTF-16 encoding of 8-bit ASCII already.
(Presuming your wchar_t's are 16 bits.)

Ron Natalie · Aug 20, 2001, 9:16:09 AM

> >
> >I know what UTF-16 is; I still have no clue what "scheme"
> >you are talking about.
>
> He probably means the algorithm used to convert a UTF-16 string
> to fixed-width characters.

Unless you're in really weird character spaces, UTF-16 will
map directly into 16-bit wchar_t characters.

Paul Nolan · Aug 21, 2001, 3:58:57 AM
Ron Natalie <r...@sensor.com> wrote in message news:<3B810D99...@sensor.com>...

The problem is that when I do a getline or a read into a buffer of
wchar_t's, the representation is exactly like it is in the file i.e.
the first wchar_t contains 34, the second 00, the third 56, the fourth
00 etc...

So when I try to convert this to a wstring, as soon as it encounters
the first "00" (second wchar_t) it assumes it is a null terminating
character and the result is a string of size 1.

The problem is that the read/getline on a wifstream should take the
first two bytes and put them into one wchar_t, and so on.

This is what I am trying to achieve.
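
(For reference, the pairing Paul describes can also be done by hand
with a narrow stream, sidestepping locales entirely. A sketch, not
from the thread: it assumes the big-endian byte order of the dump
above and a wchar_t of at least 16 bits. Building a wstring avoids
the premature-null problem, since wstring carries an explicit length.)

#include <fstream>
#include <string>

// Pack each big-endian byte pair (e.g. 00 34 -> 0x0034) into one wchar_t.
std::wstring read_utf16be(const char* fname)
{
    std::ifstream in(fname, std::ios_base::binary);
    std::wstring result;
    char pair[2];
    while (in.read(pair, 2)) {
        unsigned char hi = static_cast<unsigned char>(pair[0]);
        unsigned char lo = static_cast<unsigned char>(pair[1]);
        result += static_cast<wchar_t>((hi << 8) | lo);
    }
    return result;
}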

Michael Rubenstein · Aug 21, 2001, 8:33:19 AM
On Mon, 20 Aug 2001 09:16:09 -0400, Ron Natalie <r...@sensor.com>
wrote:


In other words, unless you are using Unicode and the stream
includes some characters that require more than 16 bits.

I'm sorry, I thought we were talking about methods that work, not
methods that sometimes work.
--
Michael M Rubenstein

P.J. Plauger · Aug 21, 2001, 9:54:26 AM
"Paul Nolan" <pfn...@ireland.com> wrote in message news:a3415c1c.01082...@posting.google.com...

> The problem is that when I do a getline or a read into a buffer of
> wchar_t's, the representation is exactly like it is in the file i.e.
> the first wchar_t contains 34, the second 00, the third 56, the fourth
> 00 etc...
>
> So when I try to convert this to a wstring, as soon as it encounters
> the first "00" (second wchar_t) it assumes it is a null terminating
> character and the result is a string of size 1.
>
> The problem is that the read/getline on a wifstream should take the
> first two bytes and put them into one wchar_t, and so on.
>
> This is what I am trying to achieve.

I can't take this any more. Look, what you probably need is a version
of codecvt<wchar_t, char, mbstate_t> that converts between little-endian
UTF-16 in a file and UCS-2 (16-bit UNICODE, roughly) stored in a wchar_t
inside the program. We've got such a critter, but it's proprietary code
that's destined for a future release of our libraries. My bet, however,
is that you can probably make do with a much simpler alternative. Pasted
below is a dirt-simple codecvt replacement, and a sample bit of code for
putting it to use. It reads two-byte integers from a file, in native
byte order, and packs 'em into wchar_t elements. If you're not using
one of our libraries, you can replace the line that uses the _ADDFAC
macro with:

locale loc(locale::classic(), new Simple_codecvt);


For more information, see the article from which I cribbed the code:
P.J. Plauger, ``Standard C/C++: Unicode Files,'' C/C++ Users Journal,
April 1999. It's doubtless available on the CD-Rom that CUJ flogs.

HTH,

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


----------
using namespace std;

typedef codecvt<wchar_t, char, mbstate_t> Mybase;

// CLASS Simple_codecvt
class Simple_codecvt : public Mybase {
public:
    typedef wchar_t _E;
    typedef char _To;
    typedef mbstate_t _St;

    explicit Simple_codecvt(size_t _R = 0)
        : Mybase(_R) {}

protected:
    virtual result do_in(_St& _State,
        const _To *_F1, const _To *_L1, const _To *& _Mid1,
        _E *_F2, _E *_L2, _E *& _Mid2) const
        {return (noconv); }

    virtual result do_out(_St& _State,
        const _E *_F1, const _E *_L1, const _E *& _Mid1,
        _To *_F2, _To *_L2, _To *& _Mid2) const
        {return (noconv); }

    virtual result do_unshift(_St& _State,
        _To *_F2, _To *_L2, _To *& _Mid2) const
        {return (noconv); }

    virtual int do_length(_St& _State, const _To *_F1,
        const _To *_L1, size_t _N2) const _THROW0()
        {return (_N2 < (size_t)(_L1 - _F1)
            ? _N2 : _L1 - _F1); }

    virtual bool do_always_noconv() const _THROW0()
        {return (true); }

    virtual int do_max_length() const _THROW0()
        {return (2); }

    virtual int do_encoding() const _THROW0()
        {return (2); }
};

---------------

const char *fname = "filename.txt"; // or whatever

locale loc = _ADDFAC(locale::classic(), new Simple_codecvt);

wofstream myostr;
myostr.imbue(loc);
myostr.open(fname, ios_base::binary);
if (!myostr.is_open())
    cerr << "can't write to " << fname << endl;
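
(A note on the sample above: it opens a wofstream for writing. For the
original question - reading - the same locale can be imbued on a
wifstream before the open. A sketch along the same lines, not part of
the original post; needs <fstream> and <string>:)

wifstream myistr;
myistr.imbue(loc);
myistr.open(fname, ios_base::binary);

wstring line;
while (getline(myistr, line))
    ; // process each line as a wstring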

Paul Nolan · Aug 21, 2001, 1:26:23 PM
"P.J. Plauger" <p...@dinkumware.com> wrote in message news:<3b82680f$0$23...@wodc7nh0.news.uu.net>...

A breath of fresh air... thanks a lot.

- Paul.

Paul Nolan · Aug 23, 2001, 6:51:46 AM
"P.J. Plauger" <p...@dinkumware.com> wrote in message news:<3b82680f$0$23...@wodc7nh0.news.uu.net>...
> "Paul Nolan" <pfn...@ireland.com> wrote in message news:a3415c1c.01082...@posting.google.com...
>
> > The problem is that when I do a getline or a read into a buffer of
> > wchar_t's, the representation is exactly like it is in the file i.e.
> > the first wchar_t contains 34, the second 00, the third 46, the fourth
> > 00 etc...
> >
> > So when I try to convert this to a wstring, as soon as it encounters
> > the first "00" (second wchar_t) it assumes it is a null terminating
> > character and the result is a string of size 1.
> >
> > The problem is that the read/getline on a wifstream should take the
> > first four characters and put them into one wchar_t, and so on.
> >
> > This is what I am trying to achieve.
>
> I can't take this any more. Look, what you probably need is a version
> of codecvt<wchar_t, char, mbstate_t> that converts between little-endian
> UTF-16 in a file and UCS-2 (16-bit UNICODE, roughly) stored in a wchar_t
> inside the program. We've got such a critter, but it's proprietary code
> that's destined for a future release of our libraries. My bet, however,
> is that you can probably make do with a much simpler alternative. Pasted
> below is a dirt-simple codecvt replacement, and a sample bit of code for
> putting it to use. It reads two-byte integers from a file, in native
> byte order, and packs 'em into wchar_t elements. If you're not using
> one of our libraries, you can replace the line that uses the _ADDFAC
> macro with:

[snip]

ok, first off I cannot believe it is so complicated to read a standard
format from a file, manipulate it, and then write it out again in the
same format! I mean, UTF-16 is a well-known standard, isn't it? The
fact that you can actually charge for code to do this (as well as lots
of other stuff of course) tells me that information on this subject is
not readily available. Does C++ not support Unicode? Java does...

Secondly, the code you gave me does not work....or else I am not using
it correctly. Here is how I used it:

#ifdef WIN32
locale loc = _ADDFAC(locale::classic(), new UTF16Locale);
#endif

std::wifstream xmlFile(fileName.c_str(), std::ios_base::binary);
xmlFile.imbue(loc);
wchar_t* buf = new wchar_t[256];

assert(xmlFile.is_open());
// Read some stuff in
xmlFile.read(buf, 100);
std::wcerr << buf << endl;

The result of the std::wcerr is "<" i.e. the first character of the
first line; it interprets the second byte as a null terminating char.

All I want to do is be able to read in the file, line by line,
manipulate it as a string and then write it out again! I have searched
high and low for information on this and it seems severely lacking.

rgds,

- Paul.

Paul Nolan · Aug 23, 2001, 7:39:47 AM
"P.J. Plauger" <p...@dinkumware.com> wrote in message news:<3b82680f$0$23...@wodc7nh0.news.uu.net>...
> "Paul Nolan" <pfn...@ireland.com> wrote in message news:a3415c1c.01082...@posting.google.com...
>
> > The problem is that when I do a getline or a read into a buffer of
> > wchar_t's, the representation is exactly like it is in the file i.e.
> > the first wchar_t contains 34, the second 00, the third 46, the fourth
> > 00 etc...
> >
> > So when I try to convert this to a wstring, as soon as it encounters
> > the first "00" (second wchar_t) it assumes it is a null terminating
> > character and the result is a string of size 1.
> >
> > The problem is that the read/getline on a wifstream should take the
> > first four characters and put them into one wchar_t, and so on.
> >
> > This is what I am trying to achieve.
>
> I can't take this any more. Look, what you probably need is a version
> of codecvt<wchar_t, char, mbstate_t> that converts between little-endian
> UTF-16 in a file and UCS-2 (16-bit UNICODE, roughly) stored in a wchar_t
> inside the program. We've got such a critter, but it's proprietary code
> that's destined for a future release of our libraries. My bet, however,
> is that you can probably make do with a much simpler alternative. Pasted
> below is a dirt-simple codecvt replacement, and a sample bit of code for
> putting it to use. It reads two-byte integers from a file, in native
> byte order, and packs 'em into wchar_t elements. If you're not using
> one of our libraries, you can replace the line that uses the _ADDFAC
> macro with:
>
[snip]

Apologies! I was opening the file before calling imbue(); it now seems to work.

Thanks very much!

- Paul.

Pete Becker · Aug 23, 2001, 8:47:36 AM
Paul Nolan wrote:
>
> Does C++ not support Unicode? Java does...
>

Java supports Unicode 2.0. Now that Unicode 3.1 has 94,140 characters,
Java's 16-bit char type can't handle it. C and C++ have opted for more
complexity here in order to provide flexibility in a rapidly changing
environment. Java made it simple, and now has a serious problem.

--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

P.J. Plauger · Aug 23, 2001, 9:46:05 AM
"Paul Nolan" <pfn...@ireland.com> wrote in message news:a3415c1c.01082...@posting.google.com...

> Apologies! I was opening the file before calling imbue(); it now seems to work.

Not a problem. Microsoft got enough gripes about that behavior that
they asked us to change it, and we did. Later versions of our library
let you change the imbued locale after the file is open.

P.J. Plauger · Aug 23, 2001, 9:58:51 AM
"Paul Nolan" <pfn...@ireland.com> wrote in message news:a3415c1c.01082...@posting.google.com...

> ok, first off I cannot believe it is so complicated to read a standard
> format from a file, manipulate it, and then write it out again in the
> same format! I mean, UTF-16 is a well-known standard, isn't it? The
> fact that you can actually charge for code to do this (as well as lots
> of other stuff of course) tells me that information on this subject is
> not readily available. Does C++ not support Unicode? Java does...

The information is readily available -- it's just not all that easy to
apply. Pete Becker and I have spent months, here at Dinkumware, working
on code to do various ``obvious'' code interconversions in our C, C++,
and Java libraries, and we've still got more to do. There are many
levels of ``support.'' As Pete pointed out separately, Java cut corners
several years ago to offer simplified Unicode support and now faces real
problems in adapting to the latest Unicode specification. UTF-16 is a
clever kludge currently being promoted as a palliative, but it introduces
problems of its own. That's partly why I gave you a simplified converter,
rather than the full-bore UTF-16 guy.

> Secondly, the code you gave me does not work....or else I am not using
> it correctly.

I'm glad you eventually got it working, as your separate posting indicates.

> All I want to do is be able to read in the file, line by line,
> manipulate it as a string and then write it out again! I have searched
> high and low for information on this and it seems severely lacking.

Uh huh.

More and more people are hitting the same snags. That's why we're working
on additions to our Standard C and Standard C++ libraries to make a few of
these ``obvious'' chores somewhat easier to carry out.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Tom · Aug 23, 2001, 11:01:43 AM
pfn...@ireland.com (Paul Nolan) wrote in message news:<a3415c1c.01082...@posting.google.com>...

> ok, first off I cannot believe it is so complicated to read a standard
> format from a file, manipulate it, and then write it out again in the
> same format! I mean, UTF-16 is a well-known standard, isn't it? The
> fact that you can actually charge for code to do this (as well as lots
> of other stuff of course) tells me that information on this subject is
> not readily available. Does C++ not support Unicode? Java does...

C++ supports all character encodings - which ones a particular
implementation actually provides is the problem you're suffering from. I
don't know what UTF-16 is, but ordinary 16-bit Unicode is supported by
Dinkumware's lib, and it sounds like they are in the process of adding
support for some other character encodings, including UTF-16.

>
> Secondly, the code you gave me does not work....or else I am not using
> it correctly.

The latter, I believe.

> std::wifstream xmlFile(fileName.c_str(), std::ios_base::binary);
> xmlFile.imbue(loc);

Change it to (like you were told by both Dietmar and PJP):

std::wifstream xmlFile;
xmlFile.imbue(loc);
xmlFile.open(fileName.c_str(), std::ios_base::binary);

The imbue must happen before the file is opened, or it is not
guaranteed to work. I think your code would work on some libraries,
but not the Dinkumware one (I seem to recall heated arguments between
Plauger and Kuehl about this very thing - most users reasonably expect
your code to work, but the C++ standard doesn't mandate this.)

Tom
