Writing unicode text to file

zakharin

Jan 7, 2003, 3:18:07 PM

I've been looking for this all over the place, and it seems no one has an answer other than to modify the locale, which no one seems to know how to do (for free, anyway). I need to write Unicode strings to a file in such a way that they retain their meaning. I don't care what the physical encoding is as long as I can read back what I have written. If there is an easy locale change for this, I'd be grateful if someone posted an example of how to use it. If there is a free library, I'd like to know where I can get it. If this can easily be done with something other than the STL (i.e. some Windows functions), that would be OK as a last resort.

Thanks in advance
Boris Zakharin

tom_usenet

Jan 7, 2003, 3:50:39 PM

On Tue, 7 Jan 2003 15:18:07 -0500, "zakharin"
<zakh...@seas.upenn.edu> wrote:

>
>I've been looking for this all over the place and it seems no one has
>the answer other than to modify the locale, which no one seems to know
>how to do (for free, anyway). I need to write unicode strings to file in
>such a way that they retain their meaning. I don't care what the
>physical encoding is as long as I can read back what I have written. If
>there is an easy locale change for this, I'd be grateful if someone
>posts an example of how to use it. If there is a free library, I'd like
>to know where I can get it. If this can easily be done with something
>other than STL (ie some Windows functions) that would be ok as a last
>resort.

Boost just had something posted to the files section.

http://groups.yahoo.com/group/boost/files/codecvt.zip

You'll have to create a yahoo id first if you haven't got one, and
subscribe to the boost yahoo group.

Apparently no documentation yet though. See also www.boost.org and the
boost mailing list.

Tom

Reginald Blue

Jan 7, 2003, 4:31:26 PM

"tom_usenet" <tom_u...@hotmail.com> wrote in message
news:3e1b3ab9....@news.easynet.co.uk...

> On Tue, 7 Jan 2003 15:18:07 -0500, "zakharin"
> <zakh...@seas.upenn.edu> wrote:
>
> >I've been looking for this all over the place and it seems no one has
> >the answer other than to modify the locale, which no one seems to know
> >how to do (for free, anyway). I need to write unicode strings to file in
> >such a way that they retain their meaning. I don't care what the
> >physical encoding is as long as I can read back what I have written. If
> >there is an easy locale change for this, I'd be grateful if someone
> >posts an example of how to use it. If there is a free library, I'd like
> >to know where I can get it. If this can easily be done with something
> >other than STL (ie some Windows functions) that would be ok as a last
> >resort.
>
> Boost just had something posted to the files section.
>
> http://groups.yahoo.com/group/boost/files/codecvt.zip

Darn. I was going to post this as well, but you beat me to it. (By the
way, thanks for pointing me, a while ago, to the not-yet-released Boost
libraries for TCP/IP streaming.)

The one interesting thing that this brings to my mind, at least, is whether
such a facet that understands unicode would generate a file "correctly". It
will do it in such a way as to meet zakharin's needs, but I don't think it
will generate a "unicode text file" with the appropriate signal byte at the
beginning.

Anyone know if that's right?

Alberto Barbati

Jan 7, 2003, 5:56:24 PM

Reginald Blue wrote:
> "tom_usenet" <tom_u...@hotmail.com> wrote in message

>><zakh...@seas.upenn.edu> wrote:
>>Boost just had something posted to the files section.
>>
>>http://groups.yahoo.com/group/boost/files/codecvt.zip

In fact that submission dates back to 09/26/2001. What has been *just*
posted is here:

http://groups.yahoo.com/group/boost/files/utf/

(it's three days old!). The new submission is more complete, in the
sense that it can read and write UTF-8, UTF-16 and UTF-32 in all
endianness variants, while the old one handles only UTF-8 and UCS-2 (which
is similar to UTF-16 but lacks proper handling of surrogates and
non-characters).
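
To make the UCS-2 versus UTF-16 difference concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. It is not taken from either Boost submission; the function name encode_utf16 is made up for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units (hypothetical helper).
// Returns an empty vector for values that are not valid scalar values:
// surrogate code points, or anything above U+10FFFF.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {};
    }
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one unit, identical to UCS-2.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    // This is the part that plain UCS-2 cannot express.
    const std::uint32_t offset = cp - 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (offset >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));
    return {high, low};
}
```

For example, U+0416 stays a single unit, while U+1F600 becomes the pair D83D DE00.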

> The one interesting thing that this brings to my mind, at least, is whether
> such a facet that understands unicode would generate a file "correctly". It
> will do it in such a way as to meet zakharin's needs, but I don't think it
> will generate a "unicode text file" with the appropriate signal byte at the
> beginning.
>
> Anyone know if that's right?

That's very easy... to write a Unicode file you do:

// open wide-char file (yes: Unicode files must be opened as binary!)
std::wofstream file("myfile.uni", std::ios_base::binary);

// choose UTF-16LE encoding
boost::utf::imbue_utf16le(file);

// write byte order mark
file << boost::utf::bom;

To read:

std::wifstream file("myfile.uni", std::ios_base::binary);
boost::utf::imbue_utf16le(file);
file >> boost::utf::bom;

You can use any UTF-* encoding on both reading and writing (UTF-16LE is
the one used by Windows, but other platforms may use different ones).
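
For readers without the proposed library, the same file layout (byte order mark, then UTF-16LE code units) can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: it bypasses facets entirely and writes the bytes by hand, and the names write_utf16le and read_utf16le are invented here, not part of boost::utf.

```cpp
#include <fstream>
#include <string>

// Write UTF-16 code units as little-endian bytes, preceded by the BOM U+FEFF.
bool write_utf16le(const std::string& path, const std::u16string& text) {
    std::ofstream out(path, std::ios::binary);  // binary: no newline translation
    if (!out) return false;
    auto put_unit = [&out](char16_t unit) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte first
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };
    put_unit(0xFEFF);  // byte order mark: appears as bytes FF FE on disk
    for (char16_t unit : text) put_unit(unit);
    return static_cast<bool>(out);
}

// Read the file back, dropping a leading BOM if one is present.
std::u16string read_utf16le(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::u16string text;
    char bytes[2];
    while (in.read(bytes, 2)) {
        const char16_t unit = static_cast<char16_t>(
            static_cast<unsigned char>(bytes[0]) |
            (static_cast<unsigned char>(bytes[1]) << 8));
        text.push_back(unit);
    }
    if (!text.empty() && text.front() == 0xFEFF) text.erase(0, 1);
    return text;
}
```

A round trip such as write_utf16le("out.txt", u"Hi \u0416") followed by read_utf16le("out.txt") should give back the original string, and the file should open in editors that honor the BOM.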

On input, you can also detect the correct encoding automatically by
reading the file BOM, with this:

boost::utf::imbue_detect_from_bom(file);
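
How the proposed library implements the detection is not shown in this thread. As a rough sketch of the general idea only, BOM sniffing compares the first few bytes of the file against the known marks; the names Encoding, has_prefix and detect_bom are hypothetical.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True if `data` begins with the `n` bytes at `bom`.
static bool has_prefix(const std::string& data, const char* bom, std::size_t n) {
    return data.size() >= n && data.compare(0, n, bom, n) == 0;
}

// Guess the encoding from the first bytes of a file.
Encoding detect_bom(const std::string& head) {
    // UTF-32 is tested first: its little-endian BOM begins with the UTF-16LE one.
    if (has_prefix(head, "\xFF\xFE\x00\x00", 4)) return Encoding::Utf32LE;
    if (has_prefix(head, "\x00\x00\xFE\xFF", 4)) return Encoding::Utf32BE;
    if (has_prefix(head, "\xEF\xBB\xBF", 3)) return Encoding::Utf8;
    if (has_prefix(head, "\xFF\xFE", 2)) return Encoding::Utf16LE;
    if (has_prefix(head, "\xFE\xFF", 2)) return Encoding::Utf16BE;
    return Encoding::Unknown;  // no BOM: the caller has to pick a default
}
```

One known ambiguity of this approach: a UTF-16LE file whose first character after the BOM is U+0000 looks exactly like a UTF-32LE BOM.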

Please note that this library was submitted to Boost very
recently and has not yet been reviewed, nor has it been formally
accepted. Thus it's not endorsed by Boost. For any problem, just blame
the author (that's me ;).

Alberto


Reginald Blue

Jan 7, 2003, 6:09:56 PM

"Alberto Barbati" <abar...@iaanus.com> wrote in message
news:3e1b5...@corp.newsgroups.com...

> > The one interesting thing that this brings to my mind, at least, is whether
> > such a facet that understands unicode would generate a file "correctly". It
> > will do it in such a way as to meet zakharin's needs, but I don't think it
> > will generate a "unicode text file" with the appropriate signal byte at the
> > beginning.
> >
> > Anyone know if that's right?
>
> That's very easy... to write a Unicode file you do:
>

<snip>

Wow. Just wow.

If they don't accept it...well...how could they not? :-)

zakharin

Jan 8, 2003, 11:07:33 AM

So I try the following code in a class derived from wofstream in the open method:

wofstream::open(name, ios::binary);
this->rdbuf()->pubimbue(_ADDFAC(locale(), new utf8_conversion()));

After using the << operator to write to the file, it still writes the text as single chars (as far as I can tell),
except that \n does not add a linefeed (because it's a binary stream).

Am I missing something, or do I need to use binary output functions? If so, what's the point of
this facet anyway?

Igor Tandetnik

Jan 8, 2003, 11:45:19 AM

UTF-8 is a multibyte encoding, where a single Unicode character can be
represented by 1 to 4 bytes (the original definition allowed sequences of
up to 6). All 7-bit ASCII characters (code points 0 through 127) are
represented by a single byte equal to the character code. Try outputting
some characters outside the ASCII set. Or use UTF-16 -
it seems that's the encoding you are after.
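
To see why ASCII-only test data looks unchanged in the output file, here is a small sketch of the UTF-8 encoding rule itself, following the current four-byte definition (RFC 3629). The function name encode_utf8 is made up for this illustration and is unrelated to the utf8_conversion facet discussed in the thread.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (hypothetical helper).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x80) {
        // ASCII: one byte, equal to the character code.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this rule 'A' (U+0041) stays the single byte 41, while U+00E9 becomes C3 A9 and U+20AC becomes E2 82 AC.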
--
With best wishes,
Igor Tandetnik

"For every complex problem, there is a solution that is simple, neat,
and wrong." H.L. Mencken

"zakharin" <zakh...@seas.upenn.edu> wrote in message
news:O$29$$ytCHA.1632@TK2MSFTNGP12...

P.J. Plauger

Jan 8, 2003, 12:36:33 PM

"zakharin" <zakh...@seas.upenn.edu> wrote in message news:O$29$$ytCHA.1632@TK2MSFTNGP12...

So I try the following code in a class derived from wofstream in the open method:

wofstream::open(name, ios::binary);
this->rdbuf()->pubimbue(_ADDFAC(locale(), new utf8_conversion()));

After using the << operator to write to file, it still writes the text as a single char (far as I can tell)
except the \n does not add a linefeed (because it's a binary stream)

Am I missing something or do I need to use binary output functions? If so, what's the point of
this facet anyway?

[pjp] You need to do the imbue before the open, at least with older versions
of our library. If you want \n to be output as \r\n, open the file as text,
not binary. You should also see the Dinkum CoreX Library at our web site.
It costs $, but it does way more, and it has been made to work properly
with most popular libraries, including all the currently used versions of
VC++.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

tom_usenet

Jan 8, 2003, 2:34:54 PM

On Wed, 8 Jan 2003 11:07:33 -0500, "zakharin"
<zakh...@seas.upenn.edu> wrote:

>So I try the following code in a class derived from wofstream in the
>open method:
>
> wofstream::open(name, ios::binary);
> this->rdbuf()->pubimbue(_ADDFAC(locale(), new utf8_conversion()));

Try:

this->rdbuf()->pubimbue(_ADDFAC(locale(), new utf8_conversion()));

wofstream::open(name, ios::binary);

I think you have to open the file after imbuing the locale. Apparently
the standard allows filebuf to ignore imbue calls. AFAIK Dinkumware
ignores all calls to imbue on open files (recalling a Kuehl vs Plauger
argument long ago in this very group).
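
The imbue-then-open ordering can be sketched in portable standard C++ as follows. This is only an illustration under assumptions: std::codecvt_utf8 stands in for the utf8_conversion facet from the Boost upload (it is deprecated since C++17 and removed in C++26, so an older language mode may be needed), the standard std::locale(locale, facet) constructor replaces the vendor-specific _ADDFAC macro, and the names write_utf8 and read_bytes are invented here.

```cpp
#include <codecvt>   // std::codecvt_utf8: deprecated since C++17, removed in C++26
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write wide text as UTF-8, imbuing the locale BEFORE the file is opened.
bool write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // The locale takes ownership of the facet; no delete is needed.
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);  // open only after the imbue
    if (!out) return false;
    out << text;
    out.close();
    return !out.fail();
}

// Read the raw bytes back, to check what actually landed on disk.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing L"caf\u00E9" this way should produce the five bytes 63 61 66 C3 A9, which is a quick check that the facet is really being used.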

>After using the << operator to write to file, it still writes the text
>as a single char (far as I can tell)
>except the \n does not add a linefeed (because it's a binary stream)
>
>Am I missing something or do I need to use binary output functions? If
>so, what's the point of
>this facet anyway?

UTF8 outputs single chars for standard ascii. It uses multiple chars
for non-ascii, so try those if you aren't already. Alternatively, you
could use UTF16.

Tom

zakharin

Jan 8, 2003, 6:57:39 PM

Thanks, that does work for the output stream. On input, though, the program
seems to think EOF comes early and refuses to read beyond a certain point
with the extraction operator. I am not sure what triggers this, but I do know
that Notepad, Emacs, and Hex Editor display the file in its entirety
(although with some understandable gibberish where non-English characters
appear).

"tom_usenet" <tom_u...@hotmail.com> wrote in message news:3e1c78e2....@news.easynet.co.uk...

tom_usenet

Jan 9, 2003, 5:32:47 AM

On Wed, 8 Jan 2003 18:57:39 -0500, "zakharin"
<zakh...@seas.upenn.edu> wrote:

>Thanks, that does work for output stream. On input, though, the program
>seems to think EOF comes early and refuses to read beyond a certain point
>with the extraction operator. I am not sure what triggers this, but I do know
>that notepad, EMACS, and Hex Editor display the file in its entirety
>(although with some understandable gibberish where non-english characters
>appear)

I suggest you get into e-mail contact with the author, who has posted
in this thread: Alberto Barbati <abarbatiR...@iaanus.com>

Tom

SKV

Jan 9, 2003, 2:06:27 PM

"zakharin" <zakh...@seas.upenn.edu> wrote in message news:<O$29$$ytCHA.1632@TK2MSFTNGP12>...

When I was facing the same problem I ended up using the
Win32 API MultiByteToWideChar(...) to translate everything to Unicode,
and WideCharToMultiByte(...) for the reverse conversion.

The string of choice was BSTR with the ATL CComBSTR wrapper.
I was using the Win32 API WriteFile(...)/ReadFile(...).
The idea was to prepare a buffer and flush it to the file instead of
writing small chunks.

Probably this is not the best way, but this is how the MS OS works,
along with most MS products, including Internet Explorer.
-Sergey Karpov

> So I try the following code in a class derived from wofstream in the
> open method:
>
> wofstream::open(name, ios::binary);
> this->rdbuf()->pubimbue(_ADDFAC(locale(), new utf8_conversion()));