FileRead problem with UTF8 File

Arthur Hefti

Oct 8, 2009, 1:52:07 PM

Hi

I have a file which is in UTF8 format and preceded by a BOM. In a Hex
editor it looks like this (BOM not shown):
Test1
T??st
Test2

Where the two ?? are the correct encoding for an a-umlaut.

I open the file with FileOpen in Linemode.
Read Line 1 -> returns Test1 with len 5
Read line 2 -> returns T<a-umlaut>st with len 5
Read line 3 -> returns "" with len 0
Read line 4 -> returns Test2 with len 5

The problem is that the program either stops reading after line 2 or
gets the wrong number of lines.
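
For reference, the length of 5 reported for line 2 equals the line's UTF-8
byte count, not its character count. The short Python sketch below (Python
rather than PowerScript, and assuming the a-umlaut is the lowercase U+00E4)
shows both counts. Whether this mismatch is what confuses the line-mode
reader is an assumption; nothing in this thread confirms it.

```python
# Illustration only: compare the character count and the UTF-8 byte count
# of the second line of the sample file.
line = "T\u00e4st"              # "T<a-umlaut>st", assuming lowercase U+00E4
encoded = line.encode("utf-8")  # the a-umlaut becomes the two bytes C3 A4

char_count = len(line)          # number of characters in the string
byte_count = len(encoded)       # number of bytes stored in the file

print(char_count, byte_count, encoded.hex(" "))
```

The string has four characters, but it occupies five bytes in the file.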

PB 11.2 / 8739

Regards
Arthur

Scott Morris

Oct 8, 2009, 2:50:21 PM

"Arthur Hefti" <art...@catsoft.ch> wrote in message
news:4ace26c7@forums-1-dub...

> Hi
>
> I have a file which is in UTF8 format and preceded by a BOM. In a Hex
> editor it looks like this (BOM not shown):

Is the BOM correct? Quoting online help:

"A byte-order mark (BOM) is a character code at the beginning of a data
stream that indicates the encoding used in a Unicode file. For UTF-8, the
BOM uses three bytes and is EF BB BF. For UTF-16, the BOM uses two bytes and
is FF FE for little endian and FE FF for big endian."

You can always force the encoding in the FileOpen call, which will cause
FileOpen to fail if PB thinks the encoding of the file (i.e., its BOM) is
not the one that was requested.
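
To check a BOM by hand, the three signatures quoted above can be compared
against the first bytes of the file. This is a minimal Python sketch, not
PowerScript; the names BOMS and detect_bom are made up for the illustration,
and it covers only the three BOMs from the help text.

```python
# Illustration only: map the BOMs quoted from the online help to an
# encoding name. BOMS and detect_bom are hypothetical names.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),   # EF BB BF
    (b"\xff\xfe", "utf-16-le"),   # FF FE
    (b"\xfe\xff", "utf-16-be"),   # FE FF
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no listed BOM matches."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfTest1"))
```

A file without any of these signatures yields (None, 0), in which case the
caller has to pick an encoding some other way.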

Arthur Hefti

Oct 8, 2009, 11:33:23 PM

The BOM is correct. Notepad and XML Spy read the file without problems and
with the right conversion.
The application has to read ANSI files as well as encoded ones.

Arthur

Arthur Hefti

Oct 9, 2009, 2:42:58 AM

I did some more research and found that FileReadEx works as expected....

Ivaylo Ivanov

Oct 9, 2009, 3:34:59 AM

One possible solution with this encoding is to read the entire UTF8 file
contents into a string variable and then process it line by line. I'll
outline the process of reading the file contents into a string variable:
1) Read the file in binary mode using FileReadEx into a BLOB variable (for
example, lblb_file_contents)
2) Check if the file starts with the UTF8 byte-order mark and remove it from
the blob:
byte lbt_1, lbt_2, lbt_3
if len(lblb_file_contents) >= 3 then
    GetByte(lblb_file_contents, 1, lbt_1)
    GetByte(lblb_file_contents, 2, lbt_2)
    GetByte(lblb_file_contents, 3, lbt_3)
    // BOM for UTF8 = EF BB BF
    if lbt_1 = 239 and lbt_2 = 187 and lbt_3 = 191 then
        // BOM is found - remove it from the blob
        lblb_file_contents = BlobMid(lblb_file_contents, 4, len(lblb_file_contents) - 3)
        // Check the truncated contents once again
        if len(lblb_file_contents) = 0 then
            MessageBox("Error", "There's no data in the file!", StopSign!)
            return
        end if
    end if
end if

ls_file_contents = string(lblb_file_contents, EncodingUTF8!)

// Your line processor goes here:
// ...
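
For comparison, the same steps (read all bytes, drop a UTF-8 BOM if present,
decode, then split into lines) can be sketched in Python. This is an
illustration rather than PowerScript; read_utf8_lines is a made-up name, and
the sketch assumes the file really is UTF-8.

```python
import codecs


def read_utf8_lines(path):
    """Read a file as bytes, drop a UTF-8 BOM if present, return its lines."""
    with open(path, "rb") as f:
        data = f.read()
    # codecs.BOM_UTF8 is the byte sequence EF BB BF
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = data.decode("utf-8")
    # splitlines() breaks on CR LF and LF (and other line boundaries)
    # and drops the terminators
    return text.splitlines()
```

Because the whole file is decoded before it is split, a multi-byte character
cannot shift the position of the following line break.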

Regards,
Ivaylo

"Arthur Hefti" <art...@catsoft.ch> wrote in message
news:4ace26c7@forums-1-dub...

Arthur Hefti

Oct 9, 2009, 7:30:23 AM

Thanks. I guess fixing the bug or finding a workaround is not too
difficult.
The major problem is that the application is running at many sites, and
each customer who wants to import UTF-8 files has to update the application.

Arthur

Ivaylo Ivanov

Oct 9, 2009, 7:55:56 AM

It's a pity.
Maybe it's time to think of an intelligent updater. Time-saver for you :-)

"Arthur Hefti" <art...@catsoft.ch> wrote in message

news:4acf1ecf$1@forums-1-dub...

Arthur Hefti

Oct 9, 2009, 9:54:10 AM

That depends on how the application is built and how the customer deploys
it. In our case we have an integration with web applications for e.g.
login, reporting, etc., and the versions have to match. Some customers
prefer to distribute the application with their own packaging procedures
and therefore take apart our InstallShield routines. Programming is only a
minor part of the whole software development process, and it would be
nice if the tools used worked as expected. Of course the issue could
have been found in testing by reading more than one line with UTF-8
encoding.

Arthur