CFile and Unicode

MFC

Jan 11, 2006, 11:58:03 AM

If I build my project in Unicode, does that automatically write out everything
in Unicode?
I didn't change anything in my code, just defined UNICODE and _UNICODE in the
preprocessor, and now CFile::Write( string ) is writing the file in Unicode,
which was in ANSI before...
thoughts??

thanks

Tom Serface

Jan 11, 2006, 12:30:54 PM

CFile writes what you have so it's likely your string is now in Unicode.

Tom

"MFC" <M...@discussions.microsoft.com> wrote in message
news:B4EB8819-EDBD-4E8D...@microsoft.com...

Joseph M. Newcomer

Jan 11, 2006, 3:18:08 PM

CFile::Write writes bytes, not characters. So if you have a CString (not a CStringA or
CStringW), then the CString will be 8-bit characters in a non-Unicode app and 16-bit
characters in a Unicode app.

It is not possible to code
    CFile f;
    ...
    f.Write(string);

because Write takes TWO arguments, a pointer and a length. You have also failed to say
what the type of 'string' is, which makes it impossible to evaluate what you are doing.

What I would do is code it as
    CString string;
    string = ...;
    f.Write((LPCTSTR)string, string.GetLength() * sizeof(TCHAR));

If 'string' is some other type, then you need to tell us what it is, so we can tell you
how to write the code. It is very important to recognize that bytes and characters are
different, and operations like Write work in terms of bytes, and ONLY bytes.
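As a minimal sketch of the bytes-versus-characters point above, the following portable C++ models a byte-oriented sink that, like Write, receives only a pointer and a byte count. The names writeBytes and byteLength are hypothetical, and char16_t is used as a stand-in for the 16-bit TCHAR of a Unicode build; this is an illustration, not MFC code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a byte-oriented sink: it receives a pointer and
// a count of BYTES and has no idea whether those bytes are characters.
void writeBytes(std::vector<unsigned char>& sink, const void* data, std::size_t byteCount)
{
    assert(data != nullptr || byteCount == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    sink.insert(sink.end(), p, p + byteCount);
}

// Bytes occupied by the characters of a string, excluding the terminator.
// CharT plays the role of TCHAR: char in an ANSI build, a 16-bit type in a
// Unicode build.
template <typename CharT>
std::size_t byteLength(const std::basic_string<CharT>& s)
{
    return s.size() * sizeof(CharT);
}
```

With these helpers, the same three-character string occupies 3 bytes as 8-bit text and 6 bytes as 16-bit text, which is why the length passed to the sink has to be multiplied by the character size.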

You might also consider whether you want to write a Byte Order Mark (BOM) at the start of a
Unicode text file. This is optional, but is often useful when a program opens a text
file. The presence of the BOM tells it to read the file as Unicode text, not 8-bit text.

The little-endian BOM consists of the byte 0xFF followed by the byte 0xFE at the very start
of the file; that is, L'\xFEFF' is the character you need to write.
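To illustrate the byte order described above, here is a small portable sketch that serializes 16-bit text with a leading BOM, low byte first. The helper name toUtf16LeWithBom is hypothetical (it is not an MFC or Win32 function), and std::u16string stands in for a Unicode-build CString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the bytes of `text` as little-endian 16-bit code units,
// preceded by the byte order mark U+FEFF.
std::vector<unsigned char> toUtf16LeWithBom(const std::u16string& text)
{
    std::vector<unsigned char> out;
    out.reserve(2 + 2 * text.size());

    // Emit one 16-bit code unit, low byte first (little-endian).
    auto put = [&out](char16_t unit) {
        out.push_back(static_cast<unsigned char>(unit & 0xFF));
        out.push_back(static_cast<unsigned char>((unit >> 8) & 0xFF));
    };

    put(0xFEFF);                 // the BOM: stored as 0xFF, then 0xFE
    for (char16_t unit : text)
        put(unit);

    assert(out.size() == 2 + 2 * text.size());
    return out;
}
```

For the one-character string "A" this yields the four bytes FF FE 41 00, which is what a little-endian machine produces when it writes L'\xFEFF' followed by L'A' as raw memory.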

Note that some other programs will not work right if the BOM is present. You can use the
BOM in your own program to determine what you are reading. For example,

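    #define BOM L'\xFEFF'

    USES_CONVERSION;   // required by the W2T/A2T conversion macros
    CString s;

    WCHAR ch;
    f.Read(&ch, sizeof(WCHAR));   // note: the return value is not checked here
    if(ch == BOM)
       { /* Unicode file */
        WCHAR buffer[MAX_LENGTH];
        // read at most MAX_LENGTH - 1 characters, leaving room for the NUL
        UINT n = f.Read(buffer, sizeof(buffer) - sizeof(WCHAR));
        buffer[n / sizeof(WCHAR)] = L'\0';
        s = W2T(buffer);
       } /* Unicode file */
    else
       { /* ANSI file */
        f.SeekToBegin();   // the two bytes already read are text, not a BOM
        CHAR buffer[MAX_LENGTH];
        UINT n = f.Read(buffer, sizeof(buffer) - sizeof(CHAR));
        buffer[n] = '\0';
        s = A2T(buffer);
       } /* ANSI file */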
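A portable sketch of the same BOM test, working on a buffer that has already been read rather than on a CFile. The function name hasUtf16LeBom is hypothetical. It compares bytes instead of a WCHAR, so it does not depend on the byte order of the machine, and it checks the length first so that a file shorter than two bytes is simply treated as having no BOM.

```cpp
#include <cassert>
#include <cstddef>

// True if the buffer starts with the little-endian BOM (0xFF, 0xFE).
// Buffers shorter than two bytes cannot contain a BOM.
bool hasUtf16LeBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size < 2)
        return false;
    return data[0] == 0xFF && data[1] == 0xFE;
}
```

A caller would read the first bytes of the file into a buffer, call this on them, and then either skip two bytes and treat the remainder as 16-bit text or treat the whole buffer as 8-bit text.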

joe

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Eddie Pazz

Jan 11, 2006, 4:40:47 PM

I ran into this problem when I updated my application. I have a general log
that apps write to and it's always been ANSI. My new app writes to this log
but it uses Unicode, so I wanted to keep it as ANSI. I came up with the
following (not CFile but the Win APIs which CFile encapsulates):

    // Write the line to the log
    DWORD dwSize = 0;      // number of bytes to write
    DWORD dwWritten = 0;
    BOOL bRes = FALSE;

    #ifndef _UNICODE
        dwSize = (DWORD)(_tcslen( szLine ) * sizeof( TCHAR ));
        bRes = ::WriteFile( hFile, szLine, dwSize, &dwWritten, NULL );
    #else
        char szTmp[MAX_LINE_LEN] = {0};
        lstrcpyA( szTmp, CW2A( szLine ) );
        dwSize = (DWORD) lstrlenA( szTmp ) * sizeof( char );
        bRes = ::WriteFile( hFile, szTmp, dwSize, &dwWritten, NULL );
    #endif

If my app is not UNICODE, I do a simple write. However, if Unicode is
defined, I use the ATL class (CW2A) to convert to Ansi before it writes. I
was concerned about performance since this is a log being written to quite a
lot. I didn't see a bad performance hit (measured with GetTickCount and
10000 lines in a loop) doing the conversion.

Hope this helps.

Eddie

"MFC" <M...@discussions.microsoft.com> wrote in message
news:B4EB8819-EDBD-4E8D...@microsoft.com...

Joseph M. Newcomer

Jan 12, 2006, 12:16:10 AM

This is amazingly complicated code for a simple problem. For example, why do you need a
fixed-size character buffer of MAX_LINE_LEN, and why do you need to initialize it? Why do
you need to use something as completely unsafe as a non-bounds-checked lstrcpyA? Why,
having done CW2A, do you not just use the pointer it returns directly?

Why do you need to initialize bRes when there is no control path that does not assign to
it?

This code is dangerous, and exhibits the kind of coding that produces buffer overflow
exploits. Pretend you never heard of strcpy, lstrcpy, strcat, etc. They are now
considered obsolete and dangerous. Use strsafe.h if you must do copies, but in this case,
none of the copies are necessary; they merely waste time and space and make the program
dangerous.

BOOL bRes = ::WriteFile(hFile, T2A(line), lstrlen(line), &bytesWritten, NULL);

is sufficient! You've added a lot of meaningless and dangerous code that serves no
purpose.
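One detail worth checking in a one-line version like the call above: lstrlen(line) counts TCHARs in the original string, while WriteFile expects the number of bytes in the converted buffer. The two agree only when every character converts to exactly one byte, which is not guaranteed on double-byte code pages such as Shift-JIS. The portable sketch below shows the safer pattern of converting once and taking the length from the converted text. The names appendNarrowed and Narrower are hypothetical, and the converter is passed in as a parameter; on Windows it would presumably wrap something like CW2A or WideCharToMultiByte.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Any function that turns 16-bit text into 8-bit text (assumption: supplied
// by the caller; this sketch does not implement a real code-page conversion).
using Narrower = std::function<std::string(const std::u16string&)>;

// Converts `line`, appends the converted bytes to `sink`, and returns the
// number of bytes appended.
std::size_t appendNarrowed(std::vector<char>& sink,
                           const std::u16string& line,
                           const Narrower& narrow)
{
    assert(narrow && "a converter is required");
    const std::string bytes = narrow(line);   // convert exactly once
    sink.insert(sink.end(), bytes.begin(), bytes.end());
    return bytes.size();                      // length of the CONVERTED text
}
```

The design choice is that the byte count is never computed from the wide string at all, so a character that narrows to two bytes cannot cause a short write.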
joe

Norman Diamond

Jan 12, 2006, 7:41:53 PM

If you have a CString (not a CStringA or CStringW), then the CString will be
8-bit bytes in a non-Unicode app and 16-bit wchar_ts in a Unicode app. Each
character will consist of one or more 8-bit bytes in a non-Unicode app.
Each character other than those requiring surrogate pairs will consist of
exactly one wchar_t in a Unicode app.
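A short sketch of the surrogate-pair point, assuming 16-bit code units as on Windows (std::u16string stands in for a Unicode-build CString, and countCodePoints is a hypothetical helper). It counts characters in the sense of Unicode code points, which can be fewer than the number of 16-bit units.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in 16-bit text. A high surrogate
// (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) is one
// code point stored in two units; everything else is one unit each.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool isHigh = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextIsLow = i + 1 < s.size()
                            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow)
            ++i;                  // skip the low half of the pair
        ++count;
    }
    assert(count <= s.size());
    return count;
}
```

For text made of the letter A followed by U+1F600, the string holds three 16-bit units but only two code points, so neither GetLength-style unit counts nor byte counts can be read as a character count in general.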

MSDN's wording implies that in some cases CFile::Write will convert to ANSI
when writing in a non-Unicode app. I haven't tested it after some
discussion a few months ago. If anyone else knows, they aren't saying. As
likely as it looks that MSDN is probably wrong in that implication, it might
be right. (For comparison Visual Basic does convert from Unicode to ANSI
when writing a VB string to a file, *even when the file is opened for
binary*. A co-worker got hit by that and I tested it. So we have to
prepare for the possibility that MSDN might be right about this in MFC.)

A byte order mark is a good idea if the file contains Unicode text only.
This even worked on Pocket Word in Windows CE around 5 years ago, when
Pocket Word ordinarily assumed that a .txt file was stored in Shift-JIS, but
starting the .txt file with a byte order mark persuaded Pocket Word to
recognize that the file was stored in Unicode. But if the file contains a
combination of Unicode text and binary data, then a byte order mark won't
accomplish much.

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:onoas1tgqo8pd26k3...@4ax.com...

Tom Serface

Jan 13, 2006, 1:35:35 PM

I've played with this a lot lately (worked on my own version of CStdioFile)
and I'm reasonably sure that CFile doesn't care what kind of data it is
writing. Since Unicode and ANSI are really only terms dealing with
characters there is no way that CFile could know that it is writing
characters or binary data. CStdioFile assumes that it is writing characters
and does do Unicode on Unicode builds and ANSI/MBCS on MBCS builds.
However, from what I've seen CStdioFile or CFile never write a BOM. That is
left up to the user to figure out.

Tom

"Norman Diamond" <ndia...@community.nospam> wrote in message
news:uNVqqo9...@TK2MSFTNGP14.phx.gbl...

Joseph M. Newcomer

Jan 14, 2006, 3:08:41 AM

See below...

On Fri, 13 Jan 2006 09:41:53 +0900, "Norman Diamond" <ndia...@community.nospam> wrote:

>If you have a CString (not a CStringA or CStringW), then the CString will be
>8-bit bytes in a non-Unicode app and 16-bit wchar_ts in a Unicode app. Each
>character will consist of one or more 8-bit bytes in a non-Unicode app.
>Each character other than those requiring surrogate pairs will consist of
>exactly one wchar_t in a Unicode app.
>
>MSDN's wording implies that in some cases CFile::Write will convert to ANSI
>when writing in a non-Unicode app.

****
Not that I've ever seen or heard of. It couldn't even make sense to do so, since there is
no possible way that CFile::Write could have the foggiest idea what it is writing.
****

>I haven't tested it after some
>discussion a few months ago. If anyone else knows, they aren't saying. As
>likely as it looks that MSDN is probably wrong in that implication, it might
>be right. (For comparison Visual Basic does convert from Unicode to ANSI
>when writing a VB string to a file, *even when the file is opened for
>binary*. A co-worker got hit by that and I tested it. So we have to
>prepare for the possibility that MSDN might be right about this in MFC.)

****
I would expect VB to be complete crap, and this confirms it.
****

>
>A byte order mark is a good idea if the file contains Unicode text only.
>This even worked on Pocket Word in Windows CE around 5 years ago, when
>Pocket Word ordinarily assumed that a .txt file was stored in Shift-JIS, but
>starting the .txt file with a byte order mark persuaded Pocket Word to
>recognize that the file was stored in Unicode. But if the file contains a
>combination of Unicode text and binary data, then a byte order mark won't
>accomplish much.

****
I've seen some programs just screw up and try to use the BOM as text. It depends on who
the consumers of the text are. Mostly, these are programs that can only read 8-bit
characters, but not all of them. Notepad gets it right, for example (I used Notepad to
get the BOM values). Most Microsoft programs understand the BOM. But if you are reading
your own file, you need to read the BOM and decide if it is correct. By the way, there's
a bug in my code (fix left as an exercise for the reader). It fails if there is only one
byte in the file...
****