How to read Unicode(Big-Endian) text file(s) in Non-MFC

meme

Feb 18, 2008, 6:46:00 AM

Hello,

I'm trying to read Unicode text files. So far I'm able to do the
following, but I'm lost in the "Big-Endian" thingies...

Any help/comments are appreciated. (Sorry for my bad English.)

>>>>>>>>>>>>

void ReadAndDisplay(HWND owner)
{
    if ( BrowseFile(owner) )
    {
        FILE *file = NULL;
        int fByte;

        file = _wfopen(szFile, L"rb" );

        if (file != NULL)
        {
            fByte = fgetc(file);
            //fclose(file);

            if(fByte > 254) //unicode
            {
                readUnicode(file);
            }
            else if(fByte == 239) //utf-8
            {
                readUTF8(file);
            }
            else //ansi
            {
                readAnsi(file);
            }

            SetWindowText(GetParent(hwndEdit), wcscat(szFileTitle, L" - Reader"));
        }
    }
}

void readUTF8(FILE *file)
{
long flen;
flen = _filelength(_fileno(file));

char *data = new char[flen];
wchar_t *dataW = new wchar_t[flen];

fseek(file, 3, SEEK_SET);
fread(data, sizeof(char), flen-3, file);
fclose(file);

data[flen-3] = '\0';

//convert utf-8 to Unicode (utf-16)
MultiByteToWideChar(CP_UTF8 , 0, data, -1, dataW, flen-3);

SetWindowText(hwndEdit, dataW);

delete []data;
delete []dataW;
}

void readUnicode(FILE *file)
{
long flen;
flen = (_filelength(_fileno(file))) / 2;

wchar_t *data = new wchar_t[flen];

fseek(file, 2, SEEK_SET);
fread(data, sizeof(wchar_t), flen-1, file);
fclose(file);

data[flen-1] = '\0';
SetWindowText(hwndEdit, data);

delete []data;
}
<<<<<<<<<<<<<<

Giovanni Dicanio

Feb 18, 2008, 7:19:45 AM

"meme" <me...@myself.com> wrote in message
news:eJlvmRic...@TK2MSFTNGP04.phx.gbl...

> I'm trying to read unicode text files.... so far I'm able to do
>
> following....but lost in "Big-Endian" thingies...

Reading MSDN documentation about fopen, it seems that it can handle Unicode
UTF-16 LE, but not BE.

http://msdn2.microsoft.com/en-us/library/yeby3zcb.aspx

So, I think you should just read the raw WORDs (16 bits, two bytes) from
the file, and swap the byte order in your code.

1. For each WORD in file
2. read that WORD
3. swap low-byte and high-byte, transforming the WORD from BE to LE
4. store this LE word (Unicode UTF-16LE wchar_t) in memory

To swap two bytes in a word, you may use the following code:

<code>

// Converts a word from Big-Endian to Little-Endian (or vice-versa)
inline WORD SwapWordEndiannes(WORD w)
{
    // Swap low and high bytes
    return MAKEWORD( HIBYTE(w), LOBYTE(w) );
}

WORD bigEndianWord = ...;
WORD littleEndianWord = SwapWordEndiannes(bigEndianWord);

</code>
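
The four steps above can also be sketched without the Windows WORD/MAKEWORD macros. This is only a minimal, portable sketch and not code from the thread: DecodeUtf16BE is a hypothetical name, and it assumes the BOM has already been skipped and the file bytes are already in memory. Because each unit is built arithmetically from its high and low byte, the result does not depend on the byte order of the machine running it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): combines each big-endian byte
// pair into one native 16-bit code unit. A trailing odd byte is ignored.
std::u16string DecodeUtf16BE(const std::vector<std::uint8_t>& bytes)
{
    std::u16string text;
    text.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        // In a big-endian file the high byte comes first.
        const unsigned unit =
            (static_cast<unsigned>(bytes[i]) << 8) | bytes[i + 1];
        text.push_back(static_cast<char16_t>(unit));
    }
    return text;
}
```

On Windows, where wchar_t is 16 bits wide, the same loop could fill a std::wstring instead.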

HTH,
Giovanni

meme

Feb 18, 2008, 1:58:54 PM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message
news:OrIwpjic...@TK2MSFTNGP03.phx.gbl...

Hi, thanks for responding.

So I tried the following, but I think I missed or messed up something,
and all I see is some junk characters when it is executed... :(

// Converts a word from Big-Endian to Little-Endian (or vice-versa)
inline WORD SwapWordEndiannes(WORD w)
{
// Swap low and high bytes
return MAKEWORD( HIBYTE(w), LOBYTE(w) );
}

WORD GetBigWord(FILE *FilePtr)
{
register WORD word;

word = (WORD) (fgetc(FilePtr) & 0xff);
word = ((WORD) (fgetc(FilePtr) & 0xff)) | (word << 0x08);

return(word);
}

void readUnicodeBE(FILE *file)

{
long flen;
flen = _filelength(_fileno(file));

wchar_t *data = new wchar_t[flen + 1];

// rewind(file);
WORD bigEndianWord;
WORD littleEndianWord;

int i=0;

fseek(file, 2, SEEK_SET);
while(!feof(file))
{
bigEndianWord = GetBigWord(file);
littleEndianWord = SwapWordEndiannes(bigEndianWord);

data[i] = (wchar_t)littleEndianWord;
i++;
}

fclose(file);

data[i] = '\0';

SetWindowText(hwndEdit, data);

delete []data;
}

Giovanni Dicanio

Feb 18, 2008, 6:12:19 PM

"meme" <me...@myself.com> wrote in message

news:eALO1Cmc...@TK2MSFTNGP04.phx.gbl...

> so I tried ......following.....but I think I missed or messed up something
> and therefore all I see some junk characters when executed ..... :(

You can solve this problem in several ways; there's no single right one.

You might consider this code of mine (it needs more testing and can be
optimized, but it seems to work).
I've put comments in the code, so you can read them.

(I hope that Outlook Express does not scramble the pasted lines...)

You should pay attention to the code of the function ReadFileUtf16BE(), which
reads the content of a UTF-16 BE file and stores it in a Unicode UTF-16
(LE) string (I used std::wstring, but you can use CStringW as well).

The function WriteFileUtf16BE() is used for testing (it writes a simple
UTF-16 BE file).

In your main(), you can use them like this:

<code>

// Write a test file...
WriteFileUtf16BE(_T("test"));

// Read file content
std::wstring fileText;
ReadFileUtf16BE(_T("test"), fileText);
// ...should check return code, if false --> error

// Show it
MessageBoxW( NULL, fileText.c_str(), L"File content:", MB_OK);

</code>

Here are the functions:

<code>

// Swap bytes
inline void SwapBytes(BYTE & b1, BYTE & b2)
{
BYTE temp = b1;
b1 = b2;
b2 = temp;
}

//
// Reads a UTF-16 BE file, and returns a Unicode string with its content.
// Returns 'true' on success, 'false' on error.
//
bool ReadFileUtf16BE(
LPCTSTR filename, // [in] filename
std::wstring & text // [out] file string content
)
{
// Clear output parameter (set to empty string)
text = L"";

// Check filename input parameter
ASSERT( filename != NULL );
if ( filename == NULL )
return false;

//
// Open file
//
FILE * file = _tfopen(filename, _T("rb"));
ASSERT( file != NULL );
if ( file == NULL )
return false;

//
// Check that file is UTF-16 BE
//
BYTE bom[2];
if ( fread( bom, sizeof(bom), 1, file) != 1 )
{
// Could not read 2 bytes (file too short to contain a BOM)
ASSERT(FALSE);

fclose(file);
return false;
}

// UTF-16 BE BOM is FE FF; reject the file if either byte differs
if ( bom[0] != 0xFE || bom[1] != 0xFF )
{
// No UTF-16 BE (BOM does not match)
ASSERT(FALSE);

fclose(file);
return false;
}

//
// Get file size, in bytes
//
fseek(file, 0L, SEEK_END);
long size = ftell(file);

// To correctly compute size, we should exclude BOM (-2 bytes),
// but we need to consider string termination L'\0' (+2 bytes).
// So, we don't change 'size' parameter here.

//
// Read file content into memory string
//

// Alloc memory to read file in
std::vector<BYTE> buffer( size );

// Read all file in memory, excluding BOM (2 bytes)
fseek(file, 2, SEEK_SET);
fread(
&(buffer[0]), // destination buffer
1, // read each byte
size - 2, // exclude BOM
file
);

// Add the end-of-string L'\0'
buffer[size-2] = 0x00;
buffer[size-1] = 0x00;

// Close file
fclose(file);
file = NULL;

//
// Now convert from BE to LE, swapping byte order in WORDs
//
BYTE * pBuffer = &(buffer[0]);
ASSERT(pBuffer != NULL);
for ( long i = 0; i < size; i++ )
{
// Swap low and high bytes (*pBuffer and *(pBuffer+1))
SwapBytes( *pBuffer, *(pBuffer+1) );

// Go to next WORD (2 bytes)
pBuffer += 2;
i += 2;
}

// Copy file content to string
text = std::wstring( (const wchar_t *) &(buffer[0]) );

// All right
return true;
}

//
// Prepares a test file UTF-16 BE to read next
//
void WriteFileUtf16BE(LPCTSTR filename)
{
// Open file to write in binary mode
FILE * file = _tfopen(filename, _T("wb") );
ASSERT( file != NULL );

//
// Prepare file content in memory.
//
// We print:
// - UTF-16 BE BOM
// - (c) symbol
// - é symbol
//
std::vector<BYTE> data;
data.push_back(0xFE); // UTF-16 BE BOM
data.push_back(0xFF);

data.push_back(0x00); // (c)
data.push_back(0xA9);

data.push_back(0x00); // é
data.push_back(0xE9);

// Write file using our memory buffer
fwrite(&(data[0]), 1, data.size(), file );

// Close file
fclose(file);
file = NULL;
}

</code>

HTH,
Giovanni

Ben Voigt [C++ MVP]

Feb 19, 2008, 10:11:24 AM

Giovanni Dicanio wrote:
> "meme" <me...@myself.com> wrote in message
> news:eJlvmRic...@TK2MSFTNGP04.phx.gbl...
>> I'm trying to read unicode text files.... so far I'm able to do
>>
>> following....but lost in "Big-Endian" thingies...
>
> Reading MSDN documentation about fopen, it seems that it can handle
> Unicode UTF-16 LE, but not BE.
>
> http://msdn2.microsoft.com/en-us/library/yeby3zcb.aspx
>
> So, I think you should just read the raw WORDs (16 bits, two bytes)
> from file, and swap the byte order from your code.
>
> 1. For each WORD in file
> 2. read that WORD
> 3. swap low-byte and high-byte, transforming the WORD from BE to LE
> 4. store this LE word (Unicode UTF-16LE wchar_t) in memory
>
> To swap two bytes in a word, you may use the following code:

Why roll your own when there's _swab (prototype in stdlib.h)?

"If n is even, the _swab function copies n bytes from src, swaps each pair
of adjacent bytes, and stores the result at dest. If n is odd, _swab copies
and swaps the first n-1 bytes of src. _swab is typically used to prepare
binary data for transfer to a machine that uses a different byte order."

Giovanni Dicanio

Feb 19, 2008, 10:43:54 AM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:eWmiwlwc...@TK2MSFTNGP04.phx.gbl...

> Why roll your own when there's _swab (prototype in stdlib.h)?

Just because I did not know about _swab :)

Thanks for your information.
Giovanni

Giovanni Dicanio

Feb 19, 2008, 12:14:11 PM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:eWmiwlwc...@TK2MSFTNGP04.phx.gbl...

>> To swap two bytes in a word, you may use the following code:
>
> Why roll your own when there's _swab (prototype in stdlib.h)?

Hmm... is it possible to swap in place with _swab?
Or are two *different* buffers required for source and destination?

In my code, I swapped in place (no duplicate buffer).

Thanks,
Giovanni

Giovanni Dicanio

Feb 19, 2008, 12:27:46 PM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message
news:ehNgwPoc...@TK2MSFTNGP06.phx.gbl...

> //
> // Now convert from BE to LE, swapping byte order in WORDs
> //
> BYTE * pBuffer = &(buffer[0]);
> ASSERT(pBuffer != NULL);
> for ( long i = 0; i < size; i++ )
> {
> // Swap low and high bytes (*pBuffer and *(pBuffer+1))
> SwapBytes( *pBuffer, *(pBuffer+1) );
>
> // Go to next WORD (2 bytes)
> pBuffer += 2;
> i += 2;
> }

I think this code should be modified, because I wrongly increment the 'i'
counter twice per iteration (i++ in the for statement and i += 2 in the body).
The counter should be incremented by 2 only once, in the for statement:

for ( long i = 0; i < size; i += 2 )
{
    ...

    pBuffer += 2;
    // REMOVE: i += 2;
}
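
For comparison, the corrected loop can be sketched with an index only, which also guards against a buffer with an odd number of bytes. This is a minimal sketch under assumptions, not the code from the post: SwapBytePairsInPlace is a hypothetical name, and std::uint8_t stands in for BYTE so that the snippet builds without the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: swaps each adjacent byte pair in place.
// The condition i + 1 < size() leaves an odd trailing byte untouched
// instead of touching one element past the end of the buffer.
void SwapBytePairsInPlace(std::vector<std::uint8_t>& buffer)
{
    for (std::size_t i = 0; i + 1 < buffer.size(); i += 2)
    {
        std::swap(buffer[i], buffer[i + 1]);
    }
}
```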

Giovanni

Ben Voigt [C++ MVP]

Feb 19, 2008, 2:38:16 PM

Giovanni Dicanio wrote:
> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
> news:eWmiwlwc...@TK2MSFTNGP04.phx.gbl...
>
>>> To swap two bytes in a word, you may use the following code:
>>
>> Why roll your own when there's _swab (prototype in stdlib.h)?
>
> Hmm... is it possible with _swab to swap in-place?
> Or two *different* buffers are required for source and destination?

That's a valid question... (this is presumably quoted from the POSIX
standard, the wording is very consistent if you google swab in place posix)
"If copying takes place between objects that overlap, the behavior is
undefined. If nbytes is negative, swab() does nothing."

>
> In my code, I swapped in-place (no duplicate buffer).

However, this isn't true. You definitely *could have*, though.

>
> Thanks,
> Giovanni

Giovanni Dicanio

Feb 19, 2008, 3:44:04 PM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message

news:O$P536ycI...@TK2MSFTNGP06.phx.gbl...

>> In my code, I swapped in-place (no duplicate buffer).
>
> However, this isn't true. You definitely *could have*, though.

In my code I traversed the buffer and swapped byte pairs within that same
buffer.
I did not allocate a second buffer.

Or are you perhaps referring to the fact that I used a swap algorithm with a
temporary variable, and suggesting the XOR swap without a third
variable :) ?

Giovanni

meme

Feb 19, 2008, 4:58:01 PM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message

news:ehNgwPoc...@TK2MSFTNGP06.phx.gbl...

>
> "meme" <me...@myself.com> wrote in message
> news:eALO1Cmc...@TK2MSFTNGP04.phx.gbl...
>
>> so I tried ......following.....but I think I missed or messed up
>> something and therefore all I see some junk characters when executed
>> ..... :(
>
> You can solve this problem in several ways, there's no one single way.
>
> You might consider this code of mine (need more test, and can be
> optimized, but seems to work).
> I've put comments in code, so you can read them.

Hi... Thanks again... :-)

Yes, this seems to be working... finally :-D

However, I made some changes, and I also have a few queries in mind...

> //
> // Check that file is UTF-16 BE
> //
> BYTE bom[2];
> if ( fread( bom, sizeof(bom), 1, file) != 1 )
> {
> // No UTF-16 BE (BOM does not match)
> ASSERT(FALSE);
>
> fclose(file);
> return false;
> }
>
> // UTF-16 BE BOM is FE FF
> if ( bom[0] != 0xFE && bom[1] != 0xFF )
> {
> // No UTF-16 BE (BOM does not match)
> ASSERT(FALSE);
>
> fclose(file);
> return false;
> }

This did not work for me, so I used the following instead:

int fByte[2];

file = _wfopen(szFile, L"rb" );

if (file != NULL)
{
// Read the 1st. two bytes... to see if we have a BOM
fByte[0] = fgetc(file);
fByte[1] = fgetc(file);
//fclose(file);

if((fByte[0] == 255) && (fByte[1] == 254))
{
//FF FE i.e. UTF-16(Unicode Little-Endian)
readUnicode(file, false);
}
else if((fByte[0] == 254) && (fByte[1] == 255))
{
//FE FF i.e. UTF-16(Unicode Big-Endian)
readUnicode(file, true);
}
else if((fByte[0] == 239) && (fByte[1] == 187))
{
//EF BB i.e. UTF-8 with BOM
readUTF8(file);
}
else //ansi
{
readAnsi(file);
}
}

And I changed the following...

> //
> // Now convert from BE to LE, swapping byte order in WORDs
> //
> BYTE * pBuffer = &(buffer[0]);
> ASSERT(pBuffer != NULL);
> for ( long i = 0; i < size; i++ )
> {
> // Swap low and high bytes (*pBuffer and *(pBuffer+1))
> SwapBytes( *pBuffer, *(pBuffer+1) );
>
> // Go to next WORD (2 bytes)
> pBuffer += 2;
> i += 2;
> }

to:

//
// Convert from BE to LE, swapping byte order in WORDs
//
long i = 0;
while( i < size )
{
SwapBytes(data[i], data[i+1]);
i = i + 2;
}

here data is....

BYTE *data = new BYTE[size];

Now the code can read ANSI, Unicode (UTF-16 LE/BE, thanks to you ;-) ) and
UTF-8 (with BOM) files.
And here come "UTF-8 without BOM" files: the code can in fact read them
all right, but it cannot tell them apart from plain ANSI files, because
the BOM check above has nothing to look at. Any thoughts on this?
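
One way to organize the BOM checks above is a single function that looks at a buffer holding the first bytes of the file. This is a minimal sketch, not the code used in this thread: Encoding and DetectBom are hypothetical names. Note that the full UTF-8 BOM is three bytes (EF BB BF), so the sketch checks all three, and that a file without any BOM comes back as Unknown, which is exactly the "ANSI or UTF-8 without BOM" case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Encoding { Unknown, Utf8Bom, Utf16LE, Utf16BE };

// Hypothetical helper: classifies a buffer by its byte-order mark.
// 'bomLength' receives the number of bytes to skip before the text starts.
Encoding DetectBom(const std::uint8_t* data, std::size_t size,
                   std::size_t& bomLength)
{
    bomLength = 0;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        bomLength = 3;
        return Encoding::Utf8Bom;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    {
        bomLength = 2;
        return Encoding::Utf16LE;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    {
        bomLength = 2;
        return Encoding::Utf16BE;
    }
    return Encoding::Unknown; // no BOM: ANSI or UTF-8 without BOM
}
```

The sketch does not try to recognize UTF-32, whose little-endian BOM (FF FE 00 00) starts with the same two bytes as UTF-16 LE.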

meme

Feb 19, 2008, 5:26:33 PM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message

news:uRP7Y0xc...@TK2MSFTNGP03.phx.gbl...

Oops!! But anyway, I had already changed it to a while loop :-D

Ben Voigt [C++ MVP]

Feb 19, 2008, 6:14:08 PM

Giovanni Dicanio wrote:
> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
> news:O$P536ycI...@TK2MSFTNGP06.phx.gbl...
>
>>> In my code, I swapped in-place (no duplicate buffer).
>>
>> However, this isn't true. You definitely *could have*, though.
>
> In my code I traversed the buffer, and swapped byte couples in the
> same buffer.
> I did not allocate a second buffer.

We must be talking about different code. Here is the snippet in the message
I originally replied to, suggesting swab:

<code>
// Converts a word from Big-Endian to Little-Endian (or vice-versa)
inline WORD SwapWordEndiannes(WORD w)
{
// Swap low and high bytes
return MAKEWORD( HIBYTE(w), LOBYTE(w) );
}

WORD bigEndianWord = ...;
WORD littleEndianWord = SwapWordEndiannes(bigEndianWord);
</code>

I don't see any buffer traversal. I don't see any overwriting the original
variable. I see only copying.

Giovanni Dicanio

Feb 19, 2008, 6:28:25 PM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message

news:uKlFgz0c...@TK2MSFTNGP04.phx.gbl...

> We must be talking about different code.

Yes.

> I don't see any buffer traversal. I don't see any overwriting the
> original variable. I see only copying.

So my apologies, Ben; I was talking about this instead:

<code>
...

BYTE * pBuffer = &(buffer[0]);
ASSERT(pBuffer != NULL);

for ( long i = 0; i < size; i += 2 )

{
// Swap low and high bytes (*pBuffer and *(pBuffer+1))
SwapBytes( *pBuffer, *(pBuffer+1) );

// Go to next WORD (2 bytes)
pBuffer += 2;
}

...
</code>

Giovanni

Giovanni Dicanio

Feb 19, 2008, 6:34:57 PM

"meme" <me...@myself.com> wrote in message

news:OrUQ2K0c...@TK2MSFTNGP04.phx.gbl...

> here data is....
>
> BYTE *data = new BYTE[size];

IMHO, std::vector is preferable (my thanks to several people here and on
the MFC newsgroup, like David Wi., Doug, etc., who in the past enlightened
me about the useful, robust classes in the STL, such as std::vector).

> And here comes "UTF-8 Without BOM" files.... in fact the code can read it
> alright but it cannot differentiate it from the plain ANSI file..... the
> above code is useless there.... any thought on this...

I don't know if the IsTextUnicode API

http://msdn2.microsoft.com/en-us/library/ms776445.aspx

may be useful for you...

Giovanni

Ben Voigt [C++ MVP]

Feb 19, 2008, 6:39:34 PM

You know, I think you may have the one and only circumstance where it's ok
to have multiple side effects on the same variable without an intervening
sequence point, and still be assured of the right result (ok, I guess it
would apply to any commutative operation):

SwapBytes(*(pBuffer++), *(pBuffer++));

> }
> ...
> </code>
>
> Giovanni

Ben Voigt [C++ MVP]

Feb 19, 2008, 6:48:36 PM

> And here comes "UTF-8 Without BOM" files.... in fact the code can
> read it alright but it cannot differentiate it from the plain ANSI
> file..... the above code is useless there.... any thought on this...

Yes, read files without magic as UTF-8. Since ASCII files are both valid
UTF-8 and ANSI it won't hurt you there. Only if there's a sequence that
isn't valid UTF-8, start over assuming single byte characters.
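
The "is this a valid UTF-8 sequence" test behind that suggestion can be sketched as follows. This is a minimal sketch under assumptions, not a library API: IsValidUtf8 is a hypothetical name, and the rules it applies (no overlong forms, no surrogates, nothing above U+10FFFF) are the usual well-formedness rules for UTF-8. If it returns false for a file without a BOM, the caller would fall back to the ANSI code page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
// Rejects overlong forms, surrogates (U+D800..U+DFFF) and code points
// above U+10FFFF. Pure ASCII input is always accepted.
bool IsValidUtf8(const std::uint8_t* p, std::size_t size)
{
    std::size_t i = 0;
    while (i < size)
    {
        const std::uint8_t lead = p[i];
        std::size_t extra = 0;      // continuation bytes expected
        std::uint32_t cp = 0;       // code point being assembled
        std::uint32_t minimum = 0;  // smallest value allowed for this length

        if (lead < 0x80)                { extra = 0; cp = lead;        minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minimum = 0x10000; }
        else return false;              // stray continuation or invalid lead byte

        if (size - i - 1 < extra)
            return false;               // sequence is cut off by end of buffer

        for (std::size_t k = 1; k <= extra; ++k)
        {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;           // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }

        if (cp < minimum || cp > 0x10FFFF)
            return false;               // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               // surrogate halves are not valid here
        i += extra + 1;
    }
    return true;
}
```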

Alf P. Steinbach

Feb 19, 2008, 8:21:12 PM

* Ben Voigt [C++ MVP]:

I guess it might be OK for Visual C++ (I don't know).

Standardwise, as a compiler I'm allowed to do

BYTE* p1 = pBuffer; // Note result.
BYTE* p2 = pBuffer; // Note result.
++pBuffer; // Increment.
++pBuffer; // Increment.
SwapBytes( *p1, *p2 ); // Use earlier noted results.

Practically speaking the reason is that the standard doesn't specify
those ++'es as atomic operations to be performed (note result,
increment, go on), but in terms of what values you are guaranteed.
Somewhere "After the result is noted" (§5.2.6/1) the object is modified,
which constitutes a side-effect. And "the order in which side-effects
take place" between sequence points is generally "unspecified" (§5/4).

It sort of doubly illustrates the evils of premature optimization.

Once, for the programmer shoehorning those operations into postfix ++
operator applications, and once, the language specification's evil
premature optimization by leaving the evaluation order unspecified.

Cheers, & hth.,

- Alf

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Ben Voigt [C++ MVP]

Feb 19, 2008, 8:38:27 PM

Aww, well, back to "never refer to any variable affected by a side-effect
anywhere else in the same expression".
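
A version that keeps to that rule can be sketched by giving the swap and the pointer update their own statements. This is a minimal sketch, not code from the thread: SwapPairAndAdvance is a hypothetical name and std::uint8_t stands in for BYTE. As far as the C++03 rules go, modifying pBuffer twice without an intervening sequence point is undefined behavior, so the one-liner is best avoided on any compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: swaps the byte pair at 'p' and returns a pointer to
// the next pair. The pointer is read in one statement and advanced in
// another, so the order of evaluation is unambiguous.
std::uint8_t* SwapPairAndAdvance(std::uint8_t* p)
{
    assert(p != nullptr);
    std::swap(p[0], p[1]);
    return p + 2;
}
```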

Ulrich Eckhardt

Feb 20, 2008, 4:13:27 AM

meme wrote:
> WORD GetBigWord(FILE *FilePtr)
> {
> register WORD word;
>
> word = (WORD) (fgetc(FilePtr) & 0xff);
> word = ((WORD) (fgetc(FilePtr) & 0xff)) | (word << 0x08);
>
> return(word);
> }

Sorry, but I can't help myself saying something about this code:
1. assert(FilePtr);
2. Forget about 'register', the compiler does a much better job allocating
registers to temporaries.
3. This completely fails when the file reaches EOF.
4. I would read two bytes from the stream (checking for errors, of course)
and then combine those two bytes to an integer.
5. Return is not a function, no brackets needed.
6. Initialise variables rather than declaring them and then assigning to
them.

> wchar_t *data = new wchar_t[flen + 1];

Don't do this. In C++, use

std::vector<wchar_t> data(flen+1);

The reason is that you can't forget to manually invoke delete. Getting the
manual resource management right gets pretty difficult with multiple return
paths and exceptions.

> while(!feof(file))
> {
> bigEndianWord = GetBigWord(file);
> littleEndianWord = SwapWordEndiannes(bigEndianWord);
>
> data[i] = (wchar_t)littleEndianWord;
> i++;
> }

This is broken by design. Always, when reading something, first perform the
read operations and then, before using the data, verify that reading
actually succeeded! If the size of the file is odd, you will happily read a
single byte and mix in EOF and interpret that as last character of your
text.

Further:
- Reading large amounts of data in small steps is inefficient.
- In C++, never use C-style casts.
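
Points 3 and 4 above (read first, then check, then use) can be sketched like this. It is a minimal sketch under assumptions, not a drop-in replacement: ReadBigEndian16 is a hypothetical name and std::uint16_t stands in for WORD.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: reads one big-endian 16-bit unit from 'file'.
// Returns false and leaves 'value' untouched on EOF, on a read error,
// or when only a single trailing byte is left.
bool ReadBigEndian16(std::FILE* file, std::uint16_t& value)
{
    assert(file != nullptr);
    unsigned char bytes[2];
    if (std::fread(bytes, 1, 2, file) != 2)
        return false;
    // The first byte is the high byte; the result is already in native order.
    value = static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
    return true;
}
```

Note that once the two bytes are combined this way the value is already in native order, so a further swap (as in readUnicodeBE above, which calls SwapWordEndiannes on the result of GetBigWord) undoes the conversion.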

Uli

--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Michael Wöhrmann, Amtsgericht Hamburg HR B62 932

meme

Feb 20, 2008, 8:26:49 AM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message

news:OLxYQB1c...@TK2MSFTNGP02.phx.gbl...

>
> "meme" <me...@myself.com> wrote in message
> news:OrUQ2K0c...@TK2MSFTNGP04.phx.gbl...
>
>> here data is....
>>
>> BYTE *data = new BYTE[size];
>
> IMHO, I prefer using std::vector (I say thank you to several people also
> here and on the MFC newsgroup, like David Wi., Doug, etc. who in the past
> enlightened me about using useful robust classes in STL like std::vector).

Yes, I think so... but (honestly) the problem is that my knowledge of the STL
is quite limited.
I fear it... I don't understand the MSDN docs about the STL (they're much too
short...)

Well... can I make a request here? Could you (or anyone else) point me to
some good online resource about it? Then perhaps I might write proper C++
rather than C disguised as C++ some day... :)
(BTW, thanks again for the good advice.)

>
>
>> And here comes "UTF-8 Without BOM" files.... in fact the code can read it
>> alright but it cannot differentiate it from the plain ANSI file..... the
>> above code is useless there.... any thought on this...
>
> I don't know if the IsTextUnicode API
>
> http://msdn2.microsoft.com/en-us/library/ms776445.aspx
>

Hmm.... let me see....

meme

Feb 20, 2008, 8:30:26 AM

"Ulrich Eckhardt" <eckh...@satorlaser.com> wrote in message
news:oi7t85-...@satorlaser.homedns.org...

> meme wrote:
>> WORD GetBigWord(FILE *FilePtr)
>> {
>> register WORD word;
>>
>> word = (WORD) (fgetc(FilePtr) & 0xff);
>> word = ((WORD) (fgetc(FilePtr) & 0xff)) | (word << 0x08);
>>
>> return(word);
>> }
>
> Sorry, but I can't help myself saying something about this code:
> 1. assert(FilePtr);
> 2. Forget about 'register', the compiler does a much better job allocating
> registers to temporaries.
> 3. This completely fails when the file reaches EOF.
> 4. I would read two bytes from the stream (checking for errors, of course)
> and then combine those two bytes to an integer.
> 5. Return is not a function, no brackets needed.
> 6. Initialise variables rather than declaring them and then assigning to
> them.

Yes... you are right.

>> wchar_t *data = new wchar_t[flen + 1];
>
> Don't do this. In C++, use
>
> std::vector<wchar_t> data(flen+1);

Please see my reply to "Giovanni Dicanio"

> This is broken by design. Always, when reading something, first perform
> the read operations and then, before using the data, verify that reading
> actually succeeded! If the size of the file is odd, you will happily read
> a single byte and mix in EOF and interpret that as last character of your
> text.
>
> Further:
> - Reading large amounts of data in small steps in inefficient.
> - In C++, never use C-style casts.

Yes.... I have changed it with "Giovanni Dicanio"'s help.

meme

Feb 20, 2008, 11:16:13 AM

"Ulrich Eckhardt" <eckh...@satorlaser.com> wrote in message
news:oi7t85-...@satorlaser.homedns.org...

>> wchar_t *data = new wchar_t[flen + 1];
>
> Don't do this. In C++, use
>
> std::vector<wchar_t> data(flen+1);

**************************************************************
#1
===
void readUTF8(FILE *file)

{
long flen;
flen = _filelength(_fileno(file));

char *data = new char[flen];

wchar_t *dataW = new wchar_t[flen];

fseek(file, 3, SEEK_SET);
fread(data, sizeof(char), flen-3, file);
fclose(file);

file = NULL;

data[flen-3] = '\0';
//convert utf-8 to Unicode (utf-16)
MultiByteToWideChar(CP_UTF8 , 0, data, -1, dataW, flen-3);

SetWindowText(hwndEdit, dataW);

delete []data;
delete []dataW;
}

#2
===
#include <vector>
....
....
void readUTF8(FILE *file)

{
long flen;
flen = _filelength(_fileno(file));

std::vector<char> data(flen);
std::vector<wchar_t> dataW(flen);

fseek(file, 3, SEEK_SET);
fread(&data[0], sizeof(char), flen-3, file);
fclose(file);
file = NULL;

data[flen-3] = '\0';
//convert utf-8 to Unicode (utf-16)

MultiByteToWideChar(CP_UTF8 , 0, (char*)&data[0], -1,(wchar_t*)&dataW[0],
flen-3);

SetWindowText(hwndEdit, (wchar_t*)&dataW[0]);
}
*********************************************************
Q1: Why do you prefer #2? What are the advantages?
(I'm asking because I have no idea about the STL... but I want to learn
:) )

Ben Voigt [C++ MVP]

Feb 20, 2008, 11:26:03 AM

> *********************************************************
> Q1: Why are you prefering #2 ? What are the advantages ?
> (I'm asking because.... I have no idea about STl's.... but I want to
> learn :) )

Let's make a simple change. Test the result of fread and return a boolean
indicating whether the read was successful. With #2, you just do:

if (fread(&data[0], sizeof(char), flen-3, file) < flen - 3) return false;
...
return true;

With #1, you've got to duplicate all the cleanup logic at every exit point.
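
To make the comparison concrete, here is a minimal sketch of the #2 style with early returns. ReadExactly is a hypothetical name, not a function from this thread; the point is only that no exit path needs a delete[].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads exactly 'count' bytes from 'file' into 'out'.
// Every early return is safe because std::vector releases its own memory.
bool ReadExactly(std::FILE* file, std::size_t count, std::vector<char>& out)
{
    out.clear();
    if (file == nullptr)
        return false;

    std::vector<char> buffer(count);
    if (count > 0 && std::fread(&buffer[0], 1, count, file) != count)
        return false;   // 'buffer' is freed automatically on this path

    out.swap(buffer);   // hand the data to the caller without copying
    return true;
}
```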

meme

Feb 20, 2008, 11:45:23 AM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:uJVWa09c...@TK2MSFTNGP04.phx.gbl...

> Let's make a simple change. Test the result of fread and return a boolean
> indicating whether the read was successful. With #2, you just do:
>
> if (fread(&data[0], sizeof(char), flen-3, file) < flen - 3) return false;
> ...
> return true;
>
> With #1, you've got to duplicate all the cleanup logic at every exit
> point.

Ah! Hmm! So you are saying I don't have to care about the clean-up code
anymore if I use vector... in fact there is no clean-up code at all... just
allocate and forget, like in other languages... oh, that's pretty
good news.

Well then, the next question is: are there any good online resources for the
STL besides MSDN?

If you know any, please let me know.

Thank you very much :-D

PS: BTW, is this MS-specific, or can I use it with other compilers like
Borland?

Ben Voigt [C++ MVP]

Feb 20, 2008, 2:09:33 PM

meme wrote:
> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
> news:uJVWa09c...@TK2MSFTNGP04.phx.gbl...
>> Let's make a simple change. Test the result of fread and return a
>> boolean indicating whether the read was successful. With #2, you
>> just do:
>>
>> if (fread(&data[0], sizeof(char), flen-3, file) < flen - 3) return false;
>> ...
>> return true;
>>
>> With #1, you've got to duplicate all the cleanup logic at every exit
>> point.
>
> Ah! hmmm! so you are saying I don't have to care for the clean-up code
> anymore if I use vector....in fact there is no code for clen-up....

Not exactly. The cleanup code is specially designated by being in a
"destructor", and you can trust the compiler to run it no matter how you
leave the function (reach end, return statement, exception, etc.)
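
The three ways of leaving a function can be checked with a small experiment. This is a minimal sketch: CleanupGuard and LeaveSomehow are hypothetical names, and the counter simply stands in for real cleanup such as delete[] or fclose().

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical guard: increments a counter when it goes out of scope,
// standing in for cleanup work such as delete[] or fclose().
class CleanupGuard
{
public:
    explicit CleanupGuard(int& counter) : counter_(counter) {}
    ~CleanupGuard() { ++counter_; }

private:
    CleanupGuard(const CleanupGuard&);            // not copyable
    CleanupGuard& operator=(const CleanupGuard&); // not assignable
    int& counter_;
};

// Leaves the function in three different ways depending on 'mode'.
// The destructor of 'guard' runs in every case.
bool LeaveSomehow(int mode, int& cleanups)
{
    CleanupGuard guard(cleanups);
    if (mode == 0)
        return false;                       // early return
    if (mode == 1)
        throw std::runtime_error("failed"); // exception
    return true;                            // reaching the end
}
```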

> just allocate and forget.... like in other languages .... ohho...
> that's a pretty good news....

When you start writing your own destructors you will have to remember it,
but not for just using the library objects.

>
> well then next question is is there any good online resources for STL
> ? besides MSDN...

Google "std::vector".

Here's one of the hits: http://www.cppreference.com/cppvector/index.html

>
> If you know any... please let me know....
>
> Thank you very much :-D
>
> PS: BTW Is this MS specific... can I use it on other compilers like
> borland...

std::vector is part of the C++ Standard Library, which should be supported
by any C++ compiler (at least in theory... any compiler that doesn't support
it is not really C++, but then there'd be NO C++ compilers at all because
none of them follow the standard perfectly, the closest one is from Comeau
Computing)

Giovanni Dicanio

Feb 21, 2008, 2:55:32 AM

"meme" <me...@myself.com> wrote in message

news:ePDCCR8c...@TK2MSFTNGP03.phx.gbl...

> Well...can I make a request here? Can you (or anyone else) might probably
> point me to some good online resource about them?..... then perhaps.... I
> might write proper C++ rather than C disguised as C++, some day... :).....
> (BTW thanks again for a good advise )

Google is your friend here.
You can find several tutorials and plenty of documentation.
And if you search these newsgroups (or the MFC newsgroup) with Google, you may
also find some STL book suggestions.

However, about std::vector, the *basics* are:

1. You can create a vector containing BYTEs like this:

// Vector of BYTEs
std::vector< BYTE > data;

(As you understand, if you want to store type MyType in a vector, just use
'std::vector< MyType > data;').

To use vector, you have to #include <vector> header (you can #include it in
precompiled header like StdAfx.h, because the <vector> header won't change
:) during your development process).

2. To add items to vector, you can use push_back() method:

data.push_back( aByte );
data.push_back( anotherByte );
...

Vector size dynamically grows.

3. To get element count, you can use size() method:

size_t howManyElements = data.size();

4. Method clear() clears the vector:

data.clear(); // empty vector

5. You can access vector items using operator[] or method at().
The difference is that method at() does a bounds-checking on index. If index
is out of range, a std::out_of_range exception is thrown.
Instead, operator[] is like standard raw C operator[], and does no
bounds-checking.
So, using at() is safer but slower; using operator[] is less safe
but faster.

// For each vector item:
for ( size_t i = 0; i < data.size(); i++ )
... access data[i] or data.at(i)

6. If you want to create a non-empty vector, you can specify start size in
constructor:

std::vector< BYTE > data(1000); // starts with 1000 items

and if you want to change vector size, you can use resize() method.

There is a lot more, like iterators (a very powerful STL concept, which
lets you write code that is for the most part independent of the
underlying container), and other vector methods.
Points 1-6 above are just the very basics, enough to start
using std::vector.
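
Points 2 to 5 can be tried out with a few lines. This is a minimal sketch: SumBytes and IsOutOfRange are hypothetical helper names, and unsigned char stands in for BYTE so that no Windows header is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: sums all elements using operator[]
// (no bounds checking, so the loop condition must be right).
unsigned SumBytes(const std::vector<unsigned char>& data)
{
    unsigned total = 0;
    for (std::size_t i = 0; i < data.size(); ++i)
        total += data[i];
    return total;
}

// Hypothetical helper: returns true if at() rejects 'index'
// by throwing std::out_of_range.
bool IsOutOfRange(const std::vector<unsigned char>& data, std::size_t index)
{
    try
    {
        (void)data.at(index);
        return false;
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
}
```

Calling IsOutOfRange(data, data.size()) returns true for any vector, because the last valid index is size() - 1.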

As a web reference, you may also look here:

http://www.cplusplus.com/reference/stl/vector/

HTH,
Giovanni

meme

Feb 21, 2008, 7:01:19 AM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message

news:uXRZw9Fd...@TK2MSFTNGP02.phx.gbl...

A thousand thanks for bringing up (is this valid English!?) the STL.
I'm presently reading (the link you gave and another very good link "Ben
Voigt" gave) and writing small programs to understand the class templates...
and wow! It's not that hard, and they are working just fine so far. Now I
have to look at iterators, algorithms...

meme

Feb 21, 2008, 7:06:29 AM

"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message

news:ucFM%23P$cIHA...@TK2MSFTNGP04.phx.gbl...

> Google "std::vector".
>
> Here's one of the hits: http://www.cppreference.com/cppvector/index.html
>
>>
>> If you know any... please let me know....
>>
>> Thank you very much :-D
>>
>> PS: BTW Is this MS specific... can I use it on other compilers like
>> borland...
>
> std::vector is part of the C++ Standard Library, which should be supported
> by any C++ compiler (at least in theory... any compiler that doesn't
> support it is not really C++, but then there'd be NO C++ compilers at all
> because none of them follow the standard perfectly, the closest one is
> from Comeau Computing)

Hi... that is very good indeed. And lots of thanks for that very useful
link :D, I downloaded it (the web site). I'm presently reading everything I
find on the STL, going through the sample code and trying to write my own
based on the STL.

Thank you very much for encouraging me to use the STL. :D :D

meme

Feb 21, 2008, 7:09:37 AM

"meme" <me...@myself.com> wrote in message
news:ePDCCR8c...@TK2MSFTNGP03.phx.gbl...

>>> And here comes "UTF-8 Without BOM" files.... in fact the code can read
>>> it
>>> alright but it cannot differentiate it from the plain ANSI file..... the
>>> above code is useless there.... any thought on this...
>>
>> I don't know if the IsTextUnicode API
>>
>> http://msdn2.microsoft.com/en-us/library/ms776445.aspx
>>
>
> Hmm.... let me see....

It's always returning 0 (zero), passing the 2nd arg as NULL... it needs more
testing, I guess! :(

Nathan Mates

Feb 21, 2008, 12:40:32 PM

In article <uXRZw9Fd...@TK2MSFTNGP02.phx.gbl>,

Giovanni Dicanio <giovanni...@invalid.com> wrote:
>5. You can access vector items using operator[] or method at().
>The difference is that method at() does a bounds-checking on index. If index
>is out of range, a std::out_of_range exception is thrown.
>Instead, operator[] is like standard raw C operator[], and does no
>bounds-checking.

In Visual Studio 2005, in debug builds, I believe it does do
bounds-checking. There's a *lot* of really useful sanity checking
that's done in such builds. Some of the checks it's done have
highlighted bugs in code that I couldn't believe *ever* worked.

Nathan Mates
--
<*> Nathan Mates - personal webpage http://www.visi.com/~nathan/
# Programmer at Pandemic Studios -- http://www.pandemicstudios.com/
# NOT speaking for Pandemic Studios. "Care not what the neighbors
# think. What are the facts, and to how many decimal places?" -R.A. Heinlein

Giovanni Dicanio

Feb 21, 2008, 1:31:39 PM

"Nathan Mates" <nat...@visi.com> wrote in message
news:13rrdsg...@corp.supernews.com...

> In Visual Studio 2005, in debug builds, I believe it does do
> bounds-checking.

I would not be surprised if it does.
And I do agree with you that, in debug builds, the more checks are done,
the better! :)

Giovanni