I am trying to read a file with some Japanese words.
(Well, it has a mix of Japanese and English words).
// --------------------------------
// _UNICODE is defined
//
FILE* fp = 0;
errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
...
//-- get the file length
...
TCHAR* buf = new TCHAR[ length+1 ];
memset( buf, 0, length+1 );
if( fread( buf, sizeof(TCHAR), length, fp ) != length )
{
...
return;
}
...
// --------------------------------
But doing that does not load the file into 'buf' properly.
Even the non-Japanese characters are not loaded correctly.
What am I doing wrong? (Using Notepad++ I can see that the data is as
expected.)
Thanks
Simon
> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).
I think you should first figure out which encoding the file uses.
The file could be UTF-16 (LE or BE), UTF-8, or something else.
There is a useful, freely available class that can load text in
different formats and convert it to UTF-16 (the default Unicode
format on Windows):
http://www.codeproject.com/KB/files/stdiofileex.aspx
HTH,
Giovanni
Simon wrote:
> Hi,
>
> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).
As Giovanni already pointed out, you need to be aware
of the file's encoding. Besides the Unicode encodings
he mentioned, a Japanese text file might easily be
encoded in Shift-JIS or some other non-Unicode encoding.
> FILE* fp = 0;
> errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
> ...
> //-- get the file length
> ...
>
> TCHAR* buf = new TCHAR[ length+1 ];
> memset( buf, 0, length+1 );
memset(buf, 0, sizeof(TCHAR)*(length+1));
as TCHAR is wider than one byte (wchar_t, two bytes on
Windows) when _UNICODE is defined.
> if( fread( buf, sizeof(TCHAR), length, fp ) != length )
Careful here: fread returns the number of complete elements read,
not the number of bytes. If 'length' is the file size in bytes, this
call requests sizeof(TCHAR)*length bytes, which is twice the file's
size when _UNICODE is defined. Reading the raw bytes with
fread( buf, 1, length, fp ) != length
is safer; the encoding can be dealt with afterwards.
> But doing that does not load the file in 'buf' properly.
> Even the non Japanese characters are not loaded properly.
>
> What am I doing wrong? (Using Notepad++ I can see that the data is as
> expected.)
Well, you are reading the file as a bunch of bytes, but to make real
sense of the content you first have to convert the data from the
file's encoding into Unicode.
Best regards,
Oliver
As pointed out already, you have to know the encoding of the file.
But based on the above my best guess is UTF-16.
Then you have to define both _UNICODE and UNICODE.
memset( buf, 0, length+1 );
should be memset( buf, 0, (length+1)*sizeof(TCHAR) );
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
My bad!
Since you have _UNICODE defined and even the English text does not
show up right, the file is anything but UTF-16.
if you are on a Japanese system
probably UTF-8 or Shift-JIS (cp932)
else
probably UTF-8
So load the file as bytes, then use MultiByteToWideChar.
In the end I looked at the way Notepad++ reads files. As Mihai N.
mentioned, it opens the file in 'rb' mode and then calls
MultiByteToWideChar( ... ).
Because the file is read as bytes, Notepad++ has various functions to
detect the encoding (UTF-8, UTF-16, ASCII and so forth).
Simon
Thanks for the replies,
How do I know if I am on a Japanese system?
And even if I do know (using the locale and so forth), how can I test
whether the file is UTF-8 or Shift-JIS (cp932)?
If it is UTF-8 I can now read it properly (using MultiByteToWideChar),
but how do I convert bytes read as Shift-JIS (cp932) to wide
characters?
>
> So load the file as bytes, then use MultiByteToWideChar.
Many thanks
Simon
> But how do I convert bytes read as Shift-JIS (cp932) to wide
> characters?
You can use MultiByteToWideChar with the proper code page identifier:
http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
Giovanni
To test the file's encoding you can check for a Byte Order Mark (BOM),
the first two or three bytes of the file:
#define UTF8_BOM "\xef\xbb\xbf" // UTF-8 "byte order mark" at the start of the file
#define UTF8_BOM_SIZE 3
#define UTF16_LE_BOM "\xff\xfe" // UTF-16 LE "byte order mark" at the start of the file
#define UTF16_LE_BOM_SIZE 2
#define UTF16_BE_BOM "\xfe\xff" // UTF-16 BE "byte order mark" at the start of the file
#define UTF16_BE_BOM_SIZE 2
Keep in mind the BOM is optional; many UTF-8 files do not have one.
Tom