HB_TRANSLATE() returned string size problem


Zoran Sibinovic

Nov 19, 2015, 5:09:17 AM
to Harbour Users
Hi to all,

I encountered an unusual string size problem after calling hb_translate()

SET( _SET_CODEPAGE, "UTF8EX" )
...

------
problematic sets of strings

cText="ШЂ" and  cText="ЧЋ"
...

cText=HB_TRANSLATE(cText,, "SRWIN")  // 1251 cyrillic
...
LEN(cText) returns 1 and 1, but in reality the string still carries 2 characters wherever it is used.

-------
On the other hand, if we reverse the order or separate the 2 characters in the original sequence

cText="ЂШ",  cText="ЋЧ", cText="Ш Ђ" and  cText="Ч Ћ"
...
cText=HB_TRANSLATE(cText,, "SRWIN")  // 1251 cyrillic
...
LEN(cText) returns 2, 2, 3 and 3, which is the correct result.

-----
At this time I haven't found any anomalies with the other characters or sequences.

In production this problem makes a mess when defining the print columns, because I don't know whether the string sizes are real.

Any opinions or advice?

Thanks
Zoran




 

Zoran Sibinovic

Nov 19, 2015, 6:09:48 AM
to Harbour Users
Two possible solutions

1. Since the string is Cyrillic in both cases, UTF8 and SRWIN, I take the

    SET( _SET_CODEPAGE, "UTF8EX" ) 
    ...
    LEN(cText)  result before making the translation
    cText=HB_TRANSLATE(cText,, "SRWIN")  // 1251 cyrillic

    A possible solution, though not elegant or logical, assuming that there are no side effects.

2. SET( _SET_CODEPAGE, "UTF8EX" ) 
    ...
    cText=HB_TRANSLATE(cText,, "SRWIN")  // 1251 cyrillic
    SET( _SET_CODEPAGE, "SRWIN" )
    LEN(cText)
    SET( _SET_CODEPAGE, "UTF8EX" )  


Opinions are welcome
Zoran
 

Przemyslaw Czerpak

Nov 23, 2015, 12:16:26 PM
to harbou...@googlegroups.com
Hi,

This is expected behavior.
LEN() returns the string size in current HVM characters.
So when you set UTF8EX it returns the number of UTF8
characters in the string. After translation to SRWIN
the two characters "ЧЋ" probably form one valid
UTF8 character stored in 2 bytes, etc.
If you want to check the string size in bytes in a
codepage-independent way then you can use HB_BLEN().
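
A minimal sketch of the explanation above, assuming the standard Windows-1251 byte values for these letters (Ш = 0xD8, Ђ = 0x80): the byte pair D8 80 happens to have the shape of a well-formed two-byte UTF-8 sequence, so under UTF8EX it is counted as one character, while the reversed pair 80 D8 is not well-formed. The strings are built with hb_BChar() so the result does not depend on the encoding of the source file. No output values are claimed here beyond what the thread reports.

```harbour
REQUEST HB_CODEPAGE_UTF8EX
REQUEST HB_CODEPAGE_SRWIN

PROCEDURE Main()

   // "ШЂ" as it looks after translation to SRWIN (CP-1251): bytes D8 80
   LOCAL cPair    := hb_BChar( 0xD8 ) + hb_BChar( 0x80 )
   // "ЂШ" after translation: bytes 80 D8 (not a well-formed UTF-8 sequence)
   LOCAL cSwapped := hb_BChar( 0x80 ) + hb_BChar( 0xD8 )

   hb_cdpSelect( "UTF8EX" )
   ? "UTF8EX:", Len( cPair ), Len( cSwapped )     // character counts as UTF-8
   ? "bytes: ", hb_BLen( cPair ), hb_BLen( cSwapped )

   hb_cdpSelect( "SRWIN" )
   ? "SRWIN: ", Len( cPair ), Len( cSwapped )     // one byte per character

   RETURN
```

hb_BLen() should report the same size for both strings whichever codepage is selected; only Len() depends on the active HVM codepage.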

best regards,
Przemek

Zoran Sibinovic

Nov 24, 2015, 7:49:46 AM
to Harbour Users
Thanks druzus for replying,

1. Example : "ШЂ ЧЋ                         " - length 30

with LEN()
before making the translation, string in UTF8EX returns 30,
after the translation in SRWIN, 28

with  HB_BLEN() 
before making the translation, string in UTF8EX returns 34,
after the translation in SRWIN, 30


2. Example : "ШЂ ЧЋ                         " - length 30 

with LEN()
before making the translation, string in UTF8EX returns 30,
after the translation in SRWIN, 30

with  HB_BLEN() 
before making the translation, string in UTF8EX returns 34,
after the translation in SRWIN, 30

In conclusion, in both examples:
LEN() gives the right value before the translation
HB_BLEN() gives the right value after the translation

Zoran

Przemyslaw Czerpak

Nov 24, 2015, 8:04:01 AM
to harbou...@googlegroups.com
Hi,

Your conclusion is completely wrong.
LEN() returns the number of characters in the HVM CP set by
hb_cdpSelect( <cp> ) or Set( _SET_CODEPAGE, <cp> ).
HB_BLEN() returns the size in bytes regardless of the codepage
set in HVM.
Nothing more, nothing less.
Maybe this function will help you understand it:

FUNCTION Len_SRWIN( cStr )
   // select SRWIN and remember the codepage that was active before
   LOCAL cPrevCP := hb_cdpSelect( "SRWIN" ), nLen
   nLen := Len( cStr )
   // restore the previous codepage
   hb_cdpSelect( cPrevCP )
   RETURN nLen

best regards,
Przemek

Zoran Sibinovic

Nov 24, 2015, 3:35:22 PM
to Harbour Users
Hi Przemek

I agree that LEN() and HB_BLEN() return completely different kinds of information; my conclusion was about the values, not the logic.

I use a code snippet similar to the one you wrote, but with hb_cdpSelect() the length is wrong and with SET( _SET_CODEPAGE ...) it is OK.

If you have time, you can try the .prg in the attachment, which contains your function and mine (yours, altered) with examples.
To see the effect of each function, just change the name of the function being used.
The correct result is "5" for the displayed examples.

Anyway, the problem is not code gymnastics but the fact that sometimes
in the app you need the length of a string that was read from a .dbf field, translated, or is part of the app code itself (e.g. message texts, warnings, etc.)

All of them can be written and used under various combinations of codepages and encodings.
To produce the output values on a printed sheet or display, and to save to and read from .dbf files, we use the codepage "gymnastics" you wrote in the function, which I actually use too, but...
I have 7 codepage REQUESTs in my app. I use UTF8EX as the base and then switch to the others depending on whether the output goes to a matrix printer, a laser printer, direct print or a harupdf file, all with an encoding that the client can choose.

Before, when I used SR646 or SR646C instead of UTF8EX, SRWIN, HRWIN or similar complex codepages and encodings, everything worked; character representation and manipulation were simple. Now I am converting the app code to use the complex codepages and encodings above and face a completely different problem.

One problematic example is the result returned by LEN(), one of the most basic functions (I am not sure about some others).
Sometimes it is not a problem, when I know the encoding of the string entering various parts of the app, but mostly I don't, and if I use LEN() I expect a real value.

Now, since LEN() cannot know the codepage of the string it processes, I have to pass some "stringcodepageis" parameter all the way through and then, when I use LEN(),

it is no longer as simple as nLen=LEN(cString), but something like this

DO CASE
      CASE stringcodepageis="SRWIN" ;  nLen=LEN_SRWIN(cString)
      CASE stringcodepageis="HRWIN" ;  nLen=LEN_HRWIN(cString)
      ...
      ...
      ...
ENDCASE

each time a LEN function is used, which is actually what I am doing in my code.

Best regards, 
Zoran 

Zoran Sibinovic

Nov 24, 2015, 3:38:40 PM
to Harbour Users
Sorry, I forgot the code example


hb.prg

Przemyslaw Czerpak

Nov 25, 2015, 7:53:18 AM
to harbou...@googlegroups.com
On Tue, 24 Nov 2015, Zoran Sibinovic wrote:

Hi Zoran,

> I agree that LEN() and HB_BLEN() return completely different type of
> informations, mine was a "value" conclusion not a logical.

HB_BLEN() returns the size in bytes, so it is codepage independent.
For a given string the same length is always returned.
LEN() returns the string length in characters if the given codepage
supports such functionality, i.e. in Harbour "UTF8EX" does but "UTF8"
does not - it's intentional. It means that the same string may have a
different number of characters with different encodings, e.g.

request HB_CODEPAGE_UTF8EX
local s
s := chr( 197 ) + chr( 188 ) + chr( 195 ) + chr( 179 ) + ;
chr( 197 ) + chr( 130 ) + chr( 119 )
hb_cdpSelect( "EN" )
? Len( s ), "[" + s + "]" // 7 [żółw]
hb_cdpSelect( "UTF8" )
? Len( s ), "[" + s + "]" // 7 [żółw]
hb_cdpSelect( "UTF8EX" )
? Len( s ), "[" + s + "]" // 4 [żółw]

As you can see, the same 7-byte string has 7 characters in CP437 (EN)
but only 4 when decoded as UTF-8. In Harbour the UTF8 CP is marked to
operate on byte indexes, so LEN() returns 7 for this CP, while UTF8EX
operates on character indexes, so for this CP LEN() returns 4.
Each time I used exactly the same 7 bytes, but with different CPs they
are decoded in different ways.

> I use a similar code snippet as you wrote, but using hb_cdpSelect() the
> length is wrong and with SET( _SET_CODEPAGE ...) is ok.

So you made some mistake in your code, because hb_cdpSelect(...) is
another wrapper to exactly the same resource which is controlled by
SET( _SET_CODEPAGE, ...). There is absolutely no difference between
the two methods.
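
A small sketch of that equivalence, assuming the SRWIN codepage is linked in: the id selected through one interface should be visible through the other.

```harbour
#include "set.ch"

REQUEST HB_CODEPAGE_SRWIN

PROCEDURE Main()

   // hb_cdpSelect() returns the id of the previously selected codepage
   LOCAL cPrev := hb_cdpSelect( "SRWIN" )

   ? hb_cdpSelect()            // id of the current codepage
   ? Set( _SET_CODEPAGE )      // the same setting read through Set()

   // restore through Set() and read back through hb_cdpSelect()
   Set( _SET_CODEPAGE, cPrev )
   ? hb_cdpSelect()

   RETURN
```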

> I you have time you can try the .prg In the attachment where are, yours and
> mine (yours altered) function code with examples.
> To see the individual effect of the functions, just change the function
> name to be used.
> The right results have to be "5" for the displayed examples
>
> Anyway, the problem is not in code gymnastics, but in the fact that
> sometimes
> in the app you need a length of some string, readed from the .dbf field,
> translated or been a part of the app code itself (ex. messages text,
> warnings, etc.)

Yes, of course. I need such functionality very often.
For the size in bytes I simply use HB_BLen(). If I want to check
the size in some different encoding then I use a function like:

FUNCTION cdpLen( cStr, cCDP )
   LOCAL nLen, cPrevCDP := hb_cdpSelect( cCDP )
   nLen := Len( cStr )
   hb_cdpSelect( cPrevCDP )
   RETURN nLen

So I can test the number of characters a given string contains in
different encodings.
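
A usage sketch for cdpLen() applied to the case that started this thread, assuming the application runs under UTF8EX and translates text to SRWIN for output. The sample string "ШЂ" is built from its UTF-8 bytes (D0 A8 D0 82) so the source file encoding does not matter; cdpLen() is repeated only to keep the example self-contained.

```harbour
REQUEST HB_CODEPAGE_UTF8EX
REQUEST HB_CODEPAGE_SRWIN

PROCEDURE Main()

   LOCAL cText

   hb_cdpSelect( "UTF8EX" )

   // "ШЂ" encoded as UTF-8
   cText := hb_BChar( 0xD0 ) + hb_BChar( 0xA8 ) + ;
            hb_BChar( 0xD0 ) + hb_BChar( 0x82 )

   // translate from UTF8EX to SRWIN (CP-1251)
   cText := hb_Translate( cText, "UTF8EX", "SRWIN" )

   ? cdpLen( cText, "SRWIN" )   // length counted in the string's own encoding
   ? Len( cText )               // length as seen by the active UTF8EX codepage
   ? hb_BLen( cText )           // size in bytes, codepage independent

   RETURN

FUNCTION cdpLen( cStr, cCDP )

   LOCAL nLen, cPrevCDP := hb_cdpSelect( cCDP )

   nLen := Len( cStr )
   hb_cdpSelect( cPrevCDP )

   RETURN nLen
```

For column layout, the value to rely on is the one measured in the encoding the string is actually stored in, which is what cdpLen() is meant to give.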

> All of them can be wrote and used under various combinations od codepages
> and encodings.
> To reach the output values in a printed sheet, display, save/read from and
> in .dbf-s we use a "codepage" gymnastics as you wrote in the function,
> thing that actually I use too, but...
> I have 7 codepage REQUEST in my app and call as basic UTF8EX then
> manipulate with others depending of it is a matrix printer, laser printer,
> direct print or use of a harupdf file and all with an encoding that the
> client can choose itself.
>
> Before I used UTF8EX,SRWIN,HRWIN or similar complex codepages and
> encodings, instead, SR646 or SR646C, all the things works ok, the character
> rappresentation and manipulation was simple, now Im in converting the apps
> code to use the above complex codepages and encodings and faced with a
> completely different problem.
> One of the problematic example is the result that returns LEN(), one of the
> most basic function (not sure of some others).

For me the results are perfect and what I expect.

> Sometime is not a problem when I know the encoding of the string that
> enters in various part of the app, but mostly I dont and, if I use LEN(), I
> expect a real value.
>
> Now, since LEN() cannot know what is the string codepage to process, I have
> to pass some "stringcodepageis" parameter all the way and then, when I use
> LEN(),
>
> Is not more simple like nLen=LEN(cString), but something like this
>
> DO CASE
> CASE stringcodepageis="SRWIN" ; nLen=LEN_SRWIN(cString)
> CASE stringcodepageis="HRWIN" ; nLen=LEN_HRWIN(cString)
> ...
> ...
> ...
> ENDCASE
>
> each time when a LEN function is used, thing that, actully, I am doing in
> my code.

A function like the cdpLen() I presented above is more flexible.
In theory we can attach codepage information to a string item, and I was
thinking about it in the past, but it creates many new anomalies,
e.g. what to do when we have code like:
cSrWinStr + cSr646CStr + cDeISOStr
or:
cSrWinStr > cSvISO
So in fact it creates a new set of problems, much harder to resolve, which
can be a source of data corruption when expressions like the above are used
in RDD context. It would also introduce some speed and size overhead. So I
decided not to implement it in core HVM code and to keep it as simple as
possible. If I find time to finish support for custom item types in
HVM then such functionality can be added as a contrib extension, and its
author can define himself how he plans to resolve such anomalies.
BTW I've seen your message about the "\" problem in 646-based CPs.
Such CPs do not contain the "\" character. It was replaced with "Đ", just
like a few other characters: @, {, [, }, ], ^, ~, |, \
see: https://en.wikipedia.org/wiki/YUSCII
For paths you can use "/" as a replacement.
Anyhow, if you only have ISO-646 data in DBF files, you do not want to
store characters like @, {, [, }, ], ^, ~, |, \ in those files, and these
tables use indexes with binary collation, then you should use your own
custom SR646BIN CP defined in the following way:

/*** cpsr646bin.c ***/
#define HB_CP_ID SR646BIN
#define HB_CP_INFO "Serbian ISO-646 (YUSCII)"
#define HB_CP_UNITB HB_UNITB_646YU
#define HB_CP_ACSORT HB_CDP_ACSORT_NONE
#define HB_CP_UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define HB_CP_LOWER "abcdefghijklmnopqrstuvwxyz"
/* include CP registration code */
#include "hbcdpreg.h"

as DBF codepage.

best regards,
Przemek

Zoran Sibinovic

Nov 25, 2015, 8:12:29 AM
to Harbour Users
Hi Przemek

At the beginning of the app I use

SET( _SET_DBCODEPAGE,"SR646" ) to store the data as ASCII,
SET( _SET_CODEPAGE, "UTF8EX" ) for the display conversion and
for keyboard input through the keyboard layout selected in the Windows language bar.
Everything works fine.

On the other hand, some options need "SRWIN", "HRWIN", "EN" and "ANSI" conversions when generating the output, printing, exporting to Word, Excel, etc.
Your last post clarified a lot of things, and the suggestion of making custom CPs is very interesting. I will try something to see if I can make it work.

Thanks a lot

Best regards
Zoran