wstring, wofstream, and encodings

Jeffrey Walton

Apr 12, 2008, 1:40:48 AM

Hi All,

I'm attempting to write a wstring to a file by way of wofstream. I'm
getting compression on the stream (I presumed it is UTF-8, but maybe
not). How/where do I invoke an alternate constructor so that the
stream stays wide (UTF-16)?

This may be broken. wstring ws = L"wide" produces:
77 69 64 65 ('wide' using 7 bit/8 bit ASCII)

When I change wstring ws = L"wide" to wstring ws = L"wide\u9aa8"
(added the wchar_t for U+9AA8), the file is as follows:
77 69 64 65 6F ('wide' with garbage 6F)

Any ideas?
Jeff
Jeffrey Walton

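== Sample ==
#include <fstream>
#include <string>

// Hypothetical wrapper function around the fragment.
void WriteSample()
{
    std::wstring ws = L"wide";

    // Open for binary output, truncating any existing file.
    std::wofstream ofs;
    ofs.open("wide.dat", std::ios::binary | std::ios::trunc);
    if (!ofs.good()) { return; }

    ofs << ws;
    ofs.close();
}
== End Sample ==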

Alex Blekhman

Apr 12, 2008, 3:56:31 AM

"Jeffrey Walton" wrote:
> I'm attempting to write a wstring to a file by way of wofstream.
> I'm getting compression on the stream (I presumed it is UTF-8,
> but maybe not). How/where do I invoke an alternate constructor
> so that the stream stays wide (UTF-16)?

This is a known problem with standard streams. Google for "wofstream
unicode file" to see various solutions. Here's the thread that
discusses the Boost approach to this problem:

"Writing unicode text to file"
http://groups.google.com/group/microsoft.public.vc.stl/browse_frm/thread/1d191588f45632ed/40c0b0ce46ca6e65

Also, this article may help:

"Upgrading an STL-based application to use Unicode."
http://www.codeproject.com/KB/stl/upgradingstlappstounicode.aspx

HTH
Alex
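
One way to keep the file wide, sketched here under assumptions and not taken from either link above: skip wofstream, encode the wstring as UTF-16LE bytes by hand, and write those bytes through a narrow std::ofstream opened in binary mode. The helper names AppendUnitLe, ToUtf16Le and WriteBytes are hypothetical, and the code assumes wchar_t holds either UTF-16 code units (16-bit, as on Windows) or whole code points (32-bit).

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: append one 16-bit code unit, low byte first.
static void AppendUnitLe(std::string& out, std::uint32_t unit)
{
    out += static_cast<char>(unit & 0xFF);
    out += static_cast<char>((unit >> 8) & 0xFF);
}

// Hypothetical helper: encode a wide string as UTF-16LE bytes (no BOM).
// 16-bit wchar_t values are copied unit by unit; 32-bit values above
// U+FFFF are split into a surrogate pair.
std::string ToUtf16Le(const std::wstring& ws)
{
    std::string out;
    for (wchar_t wc : ws) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            AppendUnitLe(out, 0xD800 + (cp >> 10));
            AppendUnitLe(out, 0xDC00 + (cp & 0x3FF));
        } else {
            AppendUnitLe(out, cp);
        }
    }
    return out;
}

// Hypothetical helper: write the bytes unchanged; returns false on failure.
bool WriteBytes(const char* path, const std::string& bytes)
{
    std::ofstream ofs(path, std::ios::binary | std::ios::trunc);
    if (!ofs) { return false; }
    ofs.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return ofs.good();
}
```

With these helpers, WriteBytes("wide.dat", ToUtf16Le(L"wide\u9aa8")) should leave 77 00 69 00 64 00 65 00 A8 9A in the file. A reader that expects a byte order mark would need FF FE written first; the sketch leaves that choice out.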

David Wilkinson

Apr 12, 2008, 6:43:01 AM

Jeffrey:

What I do is convert the strings to UTF-8, and write them using ofstream.

--
David Wilkinson
Visual C++ MVP
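
A minimal sketch of the convert-then-write approach described above, not David's actual code. The helper names ToUtf8 and WriteUtf8 are hypothetical, and the encoder is written by hand so that it does not depend on any platform API. It assumes wchar_t holds UTF-16 code units (16-bit) or whole code points (32-bit); an unpaired surrogate is passed through as a three-byte sequence instead of being rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8.
std::string ToUtf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: write the UTF-8 bytes through a narrow stream.
bool WriteUtf8(const char* path, const std::wstring& ws)
{
    std::ofstream ofs(path, std::ios::binary | std::ios::trunc);
    if (!ofs) { return false; }
    const std::string bytes = ToUtf8(ws);
    ofs.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return ofs.good();
}
```

For the string from the original post, ToUtf8(L"wide\u9aa8") yields 77 69 64 65 E9 AA A8, so U+9AA8 survives as three bytes. On Windows, WideCharToMultiByte with CP_UTF8 is the usual way to do the same conversion.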

Jeffrey Walton

Apr 12, 2008, 7:47:56 AM

Hi Alex,

On Apr 12, 3:56 am, "Alex Blekhman" <tkfx.REM...@yahoo.com> wrote:
> "Jeffrey Walton" wrote:
> > I'm attempting to write a wstring to a file by way of wofstream.
> > I'm getting compression on the stream (I presumed it is UTF-8,
> > but maybe not). How/where do I invoke an alternate constructor
> > so that the stream stays wide (UTF-16)?
>
> This is a known problem with standard streams. Google for "wofstream
> unicode file" to see various solutions. Here's the thread that
> discusses the Boost approach to this problem:
>
> "Writing unicode text to file"
> http://groups.google.com/group/microsoft.public.vc.stl/browse_frm/thr...
>
> Also, this article may help:
>
> "Upgrading an STL-based application to use Unicode."
> http://www.codeproject.com/KB/stl/upgradingstlappstounicode.aspx
>
> HTH
> Alex

> This is a known problem with standard streams. Google for
> "wofstream unicode file"
Damn... My searching skills are getting really bad. Believe it or not,
I did search. It just was not with a term as broad as "Unicode". I try
to get as specific as possible to avoid all the incorrect answers [and
the scourge of the modern Internet - the BLOG].

Jeff

Jeffrey Walton

Apr 12, 2008, 8:25:10 AM

Hi David,

On Apr 12, 6:43 am, David Wilkinson <no-re...@effisols.com> wrote:
> Jeffrey Walton wrote:
> > Hi All,
>
> > I'm attempting to write a wstring to a file by way of wofstream. I'm
> > getting compression on the stream (I presumed it is UTF-8, but maybe
> > not). How/where do I invoke an alternate constructor so that the
> > stream stays wide (UTF-16)?
>
> > This may be broken. wstring ws = L"wide" produces:
> > 77 69 64 65 ('wide' using 7 bit/8 bit ASCII)
>
> > When I change wstring ws = L"wide" to wstring ws = L"wide\u9aa8"
> > (added the wchar_t for U+9AA8), the file is as follows:
> > 77 69 64 65 6F ('wide' with garbage 6F)
>
> > [SNIP]
>
> Jeffrey:
>
> What I do is convert the strings to UTF-8, and write them using ofstream.
>
> --
> David Wilkinson
> Visual C++ MVP
>

Thanks. I'm actually working on a Crypto interop project. Is there
any way to stop it in its entirety? I guess you can imagine the issues
this is causing on encrypted data.

Last night I read pertinent sections of Stroustrup's C++ Programming
Language [1] (strings, streams, locales) and his Appendix D - Locales
[and Facets] [2]. I also read Schmitt's 'International Programming for
Microsoft Windows' [3]. None seem to address this issue.

Is this specified in ISO? Unfortunately, I'm not too familiar with the
standard, and I do not have a copy of it. If so, I could understand
the compression (if specified). I can deal with/without the BOM and
The Unicode Consortium's standards and recommendations, and ISO/IEC
10646. But the loss of data on the code point U+9AA8 is absolutely
unacceptable.

Jeff
Jeffrey Walton

[1] B Stroustrup, The C++ Programming Language, Addison Wesley Inc.,
ISBN 0-201-70073-5
[2] B Stroustrup, The C++ Programming Language, Appendix D: Locales,
Addison Wesley Inc., ISBN 0-201-70073-5
[3] International Programming for Microsoft Windows, Microsoft Press,
ISBN 1572319569

David Wilkinson

Apr 12, 2008, 8:30:47 AM

Jeffrey Walton wrote:
>> This is a known problem with standard streams. Google for
>> "wofstream unicode file"
> Damn... My searching skills are getting really bad. Believe it or not,
> I did search. It just was not with a term as broad as "Unicode". I try
> to get as specific as possible to avoid all the incorrect answers [and
> the scourge of the modern Internet - the BLOG].

Jeff:

Do your search on Google Groups, not Google. It doesn't cover the web forums,
but most of the good C++ stuff is still on newsgroups.

Blogs can be neat, but personally I find it too time-consuming to keep up with
all the blogs that might be of interest to me. Unfortunately they are becoming
a repository for useful information that cannot be found elsewhere, a role for
which they are very unsuited, IMHO.

David Wilkinson

Apr 12, 2008, 8:38:18 AM

Jeffrey Walton wrote:
> Thanks. I'm actually working on a Crypto interop project. Is there
> any way to stop it in its entirety? I guess you can imagine the issues
> this is causing on encrypted data.
>
> Last night I read pertinent sections of Stroustrup's C++ Programming
> Language [1] (strings, streams, locales) and his Appendix D - Locales
> [and Facets] [2]. I also read Schmitt's 'International Programming for
> Microsoft Windows' [3]. None seem to address this issue.
>
> Is this specified in ISO? Unfortunately, I'm not too familiar with the
> standard, and I do not have a copy of it. If so, I could understand
> the compression (if specified). I can deal with/without the BOM and
> The Unicode Consortium's standards and recommendations, and ISO/IEC
> 10646. But the loss of data on the code point U+9AA8 is absolutely
> unacceptable.
>
> Jeff
> Jeffrey Walton
>
> [1] B Stroustrup, The C++ Programming Language, Addison Wesley Inc.,
> ISBN 0-201-70073-5
> [2] B Stroustrup, The C++ Programming Language, Appendix D: Locales,
> Addison Wesley Inc., ISBN 0-201-70073-5
> [3] International Programming for Microsoft Windows, Microsoft Press,
> ISBN 1572319569

Jeffrey:

Not sure what your question is here. As I said, I convert my wide character
strings to UTF-8 and write them using std::ofstream.

Alex Blekhman

Apr 12, 2008, 9:22:08 AM

"David Wilkinson" wrote:
> Blogs can be neat, but personally I find it too time-consuming
> to keep up with all the blogs that might be of interest to me.
> Unfortunately they are becoming a repository for useful
> information that cannot be found elsewhere, a role for which
> they are very unsuited, IMHO.

Hear hear! It's especially frustrating to see how MSDN returns
MSFT employees' blogs as search results, as if these blogs were
valid documentation. Don't get me wrong, I infinitely appreciate
MSFT people who bother to write entries with valuable info
that is impossible to obtain elsewhere. What I don't appreciate
is the assumption that a blog entry is sufficient documentation.
If a blog entry of Raymond Chen or Larry Osterman contains
something that can't be found in MSDN library, then the entry
should make its way to the library. Having to poke around blogs is
a documentation fiasco, as I see it.

Alex

David Lowndes

Apr 12, 2008, 10:04:02 AM

>Blogs can be neat, but personally I find it too time-consuming to keep up with
>all the blogs that might be of interest to me. Unfortunately they are becoming
>a repository for useful information that cannot be found elsewhere, a role for
>which they are very unsuited, IMHO.

I couldn't agree more.

So much useful information that ought to be in/on MSDN is only finding
its way out onto MS employee blogs. Since (I assume) some of the
content of those blogs must be getting vetted by MS (you occasionally
see a note to that effect), they ought to be properly catalogued,
archived, and available from MSDN.

Dave

Carl Daniel [VC++ MVP]

Apr 12, 2008, 10:50:00 AM

Jeffrey Walton wrote:
> Thanks. I'm actually working on a Crypto interop project. Is there
> any way to stop it in its entirety? I guess you can imagine the issues
> this is causing on encrypted data.

Note that cryptographic algorithms work on bytes, not characters. If you're
going to be running the data you read (or write) through a cryptographic
algorithm, you should avoid C++ stream I/O entirely, since it's designed to
handle character I/O, potentially doing all sorts of transformations on the
fly during reading and/or writing. If you need to write out encrypted data
in a text form, convert it to Base64 encoding or something along those
lines.

-cd
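
A minimal sketch of the Base64 suggestion above, assuming the standard alphabet with '=' padding. The function name Base64Encode is hypothetical, and the input is taken as raw bytes in a std::vector<unsigned char> so that no character conversion can touch it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: Base64-encode raw bytes (standard alphabet, '=' padding).
std::string Base64Encode(const std::vector<unsigned char>& data)
{
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);

    std::size_t i = 0;
    // Each full 3-byte group becomes 4 output characters.
    for (; i + 2 < data.size(); i += 3) {
        const std::uint32_t n = (static_cast<std::uint32_t>(data[i]) << 16) |
                                (static_cast<std::uint32_t>(data[i + 1]) << 8) |
                                static_cast<std::uint32_t>(data[i + 2]);
        out += kAlphabet[(n >> 18) & 0x3F];
        out += kAlphabet[(n >> 12) & 0x3F];
        out += kAlphabet[(n >> 6) & 0x3F];
        out += kAlphabet[n & 0x3F];
    }

    // A trailing group of 1 or 2 bytes is padded with '='.
    const std::size_t rest = data.size() - i;
    if (rest == 1) {
        const std::uint32_t n = static_cast<std::uint32_t>(data[i]) << 16;
        out += kAlphabet[(n >> 18) & 0x3F];
        out += kAlphabet[(n >> 12) & 0x3F];
        out += "==";
    } else if (rest == 2) {
        const std::uint32_t n = (static_cast<std::uint32_t>(data[i]) << 16) |
                                (static_cast<std::uint32_t>(data[i + 1]) << 8);
        out += kAlphabet[(n >> 18) & 0x3F];
        out += kAlphabet[(n >> 12) & 0x3F];
        out += kAlphabet[(n >> 6) & 0x3F];
        out += '=';
    }
    return out;
}
```

The result contains only 7-bit ASCII, so it should pass through a narrow text stream unchanged. If the ciphertext does not need to be text at all, writing the raw bytes with a binary-mode std::ofstream and write() avoids the encoding step.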