
_tfopen_s( ... )/fread a Japanese file


Simon

Mar 24, 2010, 11:09:02 AM
Hi,

I am trying to read a file with some Japanese words.
(Well, it has a mix of Japanese and English words).


// --------------------------------
// _UNICODE is defined
//
FILE* fp = 0;
errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
...
//-- get the file length
...

TCHAR* buf = new TCHAR[ length+1 ];
memset( buf, 0, length+1 );

if( fread( buf, sizeof(TCHAR), length, fp ) != length )
{
...
return;
}

...
// --------------------------------

But doing that does not load the file in 'buf' properly.
Even the non Japanese characters are not loaded properly.

What am I doing wrong? (Using Notepad++ I can see that the data is as
expected).

Thanks

Simon

Giovanni Dicanio

Mar 24, 2010, 12:09:26 PM

"Simon" <b...@example.com> wrote in message
news:#FEiyP2y...@TK2MSFTNGP05.phx.gbl...

> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).

I think you should figure out which encoding the file uses.
The file could be Unicode UTF-16 (LE or BE), or UTF-8...

There is a useful, freely available class that lets you load text from
different formats and convert it to Unicode UTF-16 (which is the default
Unicode format on Windows):

http://www.codeproject.com/KB/files/stdiofileex.aspx

HTH,
Giovanni

Oliver Regenfelder

Mar 24, 2010, 7:45:59 PM
Hello,

Simon wrote:
> Hi,
>
> I am trying to read a file with some Japanese words.
> (Well, it has a mix of Japanese and English words).

As Giovanni already pointed out, you need to be aware
of the encoding of the file. Besides the various
Unicode encodings he mentioned, a Japanese text file
might just as easily be encoded using Shift-JIS or some
other non-Unicode encoding.

> FILE* fp = 0;
> errno_t err = _tfopen_s( &fp, _T("name.txt"), _T("rb") );
> ...
> //-- get the file length
> ...
>
> TCHAR* buf = new TCHAR[ length+1 ];
> memset( buf, 0, length+1 );

memset(buf, 0, sizeof(TCHAR)*(length+1));

as TCHAR is two bytes (wchar_t) when _UNICODE is defined, so the
original call clears only half of the buffer.

> if( fread( buf, sizeof(TCHAR), length, fp ) != length )

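Careful here: fread returns the number of complete items read, not
the number of bytes, so comparing against length is only right if
length counts TCHARs. If length is the file size in bytes, you are
asking for twice as many bytes as the file holds and the check will
fail.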

> But doing that does not load the file in 'buf' properly.
> Even the non Japanese characters are not loaded properly.
>
> What am I doing wrong? (Using Notepad++ I can see that the data is as
> expected).

Well, you are reading the file as a bunch of bytes. But you first have
to convert the data you read from the file's encoding into Unicode to
make real sense of the content.
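
A minimal sketch of the "read it as a bunch of bytes" step, for illustration only. It uses std::fopen rather than _tfopen_s so that it stays portable, and ReadAllBytes is a hypothetical helper name that does not appear in the thread. The item size passed to fread is 1, so the item count it returns is also the byte count.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a whole file as raw bytes.
// Returns false if the file cannot be opened or read completely.
bool ReadAllBytes(const char* path, std::vector<unsigned char>& out)
{
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;

    // Determine the size in bytes (assumes an ordinary file that fits in a long).
    bool ok = std::fseek(fp, 0, SEEK_END) == 0;
    const long size = ok ? std::ftell(fp) : -1L;
    ok = ok && size >= 0 && std::fseek(fp, 0, SEEK_SET) == 0;

    if (ok) {
        out.assign(static_cast<std::size_t>(size), 0);
        // Item size is 1, so fread's return value (an item count)
        // equals the number of bytes actually read.
        const std::size_t got =
            out.empty() ? 0 : std::fread(out.data(), 1, out.size(), fp);
        ok = (got == out.size());
    }
    std::fclose(fp);
    return ok;
}
```

The buffer holds undecoded bytes at this point. Deciding which encoding they are in, and converting them to wide characters, is a separate step.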

Best regards,

Oliver

Mihai N.

Mar 25, 2010, 3:57:37 AM

> But doing that does not load the file in 'buf' properly.
> Even the non Japanese characters are not loaded properly.
>
> What am I doing wrong? (Using Notepad++ I can see that the data is as
> expected).

As pointed out already, you have to know the encoding of the file.
But based on the above my best guess is UTF-16.

Then you have to define both _UNICODE and UNICODE.

memset( buf, 0, length+1 );

should be memset( buf, 0, (length+1)*sizeof(TCHAR) );


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Mihai N.

Mar 26, 2010, 12:13:00 AM

>> But doing that does not load the file in 'buf' properly.
>> Even the non Japanese characters are not loaded properly.
>>
>> What am I doing wrong? (Using Notepad++ I can see that the data is as
>> expected).
>
> As pointed out already, you have to know the encoding of the file.
> But based on the above my best guess is UTF-16.
>
> Then you have to define both _UNICODE and UNICODE.
>
> memset( buf, 0, length+1 );
> should be memset( buf, 0, (length+1)*sizeof(TCHAR) );

My bad!
Since you have _UNICODE defined and you don't even see the English
text correctly, the file is anything but UTF-16.

If you are on a Japanese system, it is probably UTF-8 or Shift-JIS (cp932).
Otherwise it is probably UTF-8.

So load the file as bytes, then use MultiByteToWideChar.
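
A hedged sketch of the conversion step, assuming the bytes are already in memory. BytesToWide is a hypothetical helper name. The code is wrapped in #ifdef _WIN32 because MultiByteToWideChar is a Win32 API, and it passes MB_ERR_INVALID_CHARS so that bytes which are not valid in the chosen code page make the call fail instead of being silently replaced.

```cpp
#ifdef _WIN32   // Windows-only: MultiByteToWideChar is a Win32 API
#include <windows.h>
#include <cstddef>
#include <string>

// Hypothetical helper. codePage is e.g. CP_UTF8 or 932 (Shift-JIS).
// Returns an empty string if the input is empty or contains bytes
// that are invalid in the given code page.
std::wstring BytesToWide(const std::string& bytes, UINT codePage)
{
    if (bytes.empty()) return std::wstring();
    const int srcLen = static_cast<int>(bytes.size());

    // First call: ask how many wchar_t units the result needs.
    const int wideLen = MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                                            bytes.data(), srcLen, nullptr, 0);
    if (wideLen <= 0) return std::wstring();

    // Second call: convert into a buffer of exactly that size.
    std::wstring wide(static_cast<std::size_t>(wideLen), L'\0');
    MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                        bytes.data(), srcLen, &wide[0], wideLen);
    return wide;
}
#endif
```

If the file starts with a UTF-8 byte order mark, the converted string begins with U+FEFF, so the caller may want to skip those three bytes before converting.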

Simon

Mar 26, 2010, 3:15:15 AM
Thanks for all the replies.

In the end I looked at the way Notepad++ reads files. As Mihai N.
mentioned, it opens the file in 'rb' mode and then calls
MultiByteToWideChar( ... ).

Because the file is read as bytes, it has various functions to detect
the file format (UTF-8, UTF-16, ASCII and so forth).

Simon

Simon

Mar 26, 2010, 3:18:19 AM
>
> if you on a Japanese system
> probably UTF-8 or Shift-JIS (cp932)
> else
> probably UTF-8

Thanks for the replies,

How do I know if I am on a Japanese system?
And even if I know (using the locale and so forth), how can I test
whether the file is UTF-8 or Shift-JIS (cp932)?

If it is UTF-8 I can now read it properly (using MultiByteToWideChar).

But how can I read Shift-JIS (cp932) and convert it to wide chars
accordingly?

>
> So load the file as bytes, then use MultiByteToWideChar.

Many thanks

Simon

Giovanni Dicanio

Mar 26, 2010, 8:25:16 AM
"Simon" <b...@example.com> wrote in message
news:OnxYESLz...@TK2MSFTNGP02.phx.gbl...

> But how can I convert 'read' Shift-JIS (cp932) and convert to wide char
> accordingly?

You can use MultiByteToWideChar with the proper code page identifier
(932 for Shift-JIS):

http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
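
On the earlier question of how to tell UTF-8 from Shift-JIS when there is no other hint, one common heuristic is to check whether the bytes are well-formed UTF-8 and fall back to cp932 if they are not. The sketch below is only that, a heuristic: LooksLikeUtf8 and GuessCodePage are hypothetical names, pure ASCII text is valid in both encodings, and a short Shift-JIS file can occasionally pass as UTF-8 by coincidence.

```cpp
#include <cstddef>
#include <string>

// Returns true if 'bytes' is well-formed UTF-8 (rejects overlong forms,
// UTF-16 surrogates, and values above U+10FFFF).
bool LooksLikeUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected
        unsigned char lo = 0x80;   // allowed range of the first continuation byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {
            extra = 0;                       // plain ASCII
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            extra = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            extra = 2;
            if (b0 == 0xE0) lo = 0xA0;       // no overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;       // no surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            extra = 3;
            if (b0 == 0xF0) lo = 0x90;       // no overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;       // nothing above U+10FFFF
        } else {
            return false;                    // stray continuation or invalid lead byte
        }

        if (extra > n - i - 1) return false; // sequence truncated at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if (c < lo || c > hi) return false;
            lo = 0x80;                       // later continuation bytes use the full range
            hi = 0xBF;
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: 65001 is CP_UTF8, 932 is Shift-JIS.
unsigned GuessCodePage(const std::string& bytes)
{
    return LooksLikeUtf8(bytes) ? 65001u : 932u;
}
```

The returned value is the code page identifier that would be passed to MultiByteToWideChar. Whether cp932 is the right fallback depends on where the files come from, so that choice is an assumption of this sketch.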

Giovanni

Tom Serface

Mar 26, 2010, 12:27:21 PM
If your file is UTF-8 or UTF-16 and you are reading it into a Unicode
string in memory, it shouldn't matter what kind of system you are on, since
the code page is no longer an issue.

To test the file type you can check the Byte Order Mark (BOM), which is the
first two or three bytes of the file when one is present:

#define UTF8_BOM "\xef\xbb\xbf" // UTF-8 "byte order mark" at start of file
#define UTF8_BOM_SIZE 3
#define UTF16_LE_BOM "\xff\xfe" // Unicode (UTF-16 LE) "byte order mark" at start of file
#define UTF16_BOM_SIZE 2
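
A small sketch of how such a BOM check might look on a buffer that has already been read as bytes. DetectBom and the Bom enum are hypothetical names. The UTF-16 BE mark (FE FF) is added here for completeness and is not among the defines above.

```cpp
#include <cstddef>
#include <cstring>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer for a byte order mark.
// On return, *bomSize holds the number of bytes to skip (0 if none).
Bom DetectBom(const unsigned char* data, std::size_t size, std::size_t* bomSize)
{
    static const unsigned char kUtf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char kUtf16LE[] = { 0xFF, 0xFE };
    static const unsigned char kUtf16BE[] = { 0xFE, 0xFF };

    if (size >= 3 && std::memcmp(data, kUtf8, 3) == 0) {
        *bomSize = 3;
        return Bom::Utf8;
    }
    if (size >= 2 && std::memcmp(data, kUtf16LE, 2) == 0) {
        *bomSize = 2;
        return Bom::Utf16LE;
    }
    if (size >= 2 && std::memcmp(data, kUtf16BE, 2) == 0) {
        *bomSize = 2;
        return Bom::Utf16BE;
    }
    *bomSize = 0;
    return Bom::None;   // no BOM: could still be UTF-8, Shift-JIS, ASCII, ...
}
```

A result of Bom::None does not identify the encoding. UTF-8 files are often written without a BOM and Shift-JIS has none, so a content-based check is still needed in that case. This sketch also does not distinguish UTF-32 LE, whose BOM begins with the same two bytes as UTF-16 LE.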

Tom

"Simon" <b...@example.com> wrote in message
news:OnxYESLz...@TK2MSFTNGP02.phx.gbl...
