On Tue, 24 Nov 2015, Zoran Sibinovic wrote:
Hi Zoran,
> I agree that LEN() and HB_BLEN() return completely different type of
> informations, mine was a "value" conclusion not a logical.
HB_BLEN() returns the size in bytes, so it is codepage independent.
For a given string it always returns the same length.
LEN() returns the string length in characters if the given codepage
supports such functionality, e.g. in Harbour "UTF8EX" does but "UTF8"
does not - it's intentional. It means that the same string may have a
different number of characters in different encodings, e.g.:
   request HB_CODEPAGE_UTF8EX
   local s
   s := chr( 197 ) + chr( 188 ) + chr( 195 ) + chr( 179 ) + ;
        chr( 197 ) + chr( 130 ) + chr( 119 )
   hb_cdpSelect( "EN" )
   ? Len( s ), "[" + s + "]" // 7 [żółw]
   hb_cdpSelect( "UTF8" )
   ? Len( s ), "[" + s + "]" // 7 [żółw]
   hb_cdpSelect( "UTF8EX" )
   ? Len( s ), "[" + s + "]" // 4 [żółw]
As you can see, the same 7-byte string has 7 characters in CP437 (EN)
but only 4 in UTF-8. In Harbour the UTF8 CP is marked to operate on byte
indexes, so LEN() returns 7 for this CP, while UTF8EX operates on
character indexes, so for this CP LEN() returns 4.
Each time I used exactly the same 7 bytes, but with different CPs they
are decoded in different ways.
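A minimal sketch of the byte vs. character distinction described above,
printing HB_BLen() next to Len() for each CP. It assumes a Harbour build
that provides hb_BChar(), which is used here only so that the 7 bytes are
created independently of the active CP:

```harbour
REQUEST HB_CODEPAGE_UTF8EX

PROCEDURE Main()
   /* the same 7 bytes as in the example above, built byte by byte */
   LOCAL s := hb_BChar( 197 ) + hb_BChar( 188 ) + hb_BChar( 195 ) + ;
              hb_BChar( 179 ) + hb_BChar( 197 ) + hb_BChar( 130 ) + ;
              hb_BChar( 119 )
   LOCAL cCP

   FOR EACH cCP IN { "EN", "UTF8", "UTF8EX" }
      hb_cdpSelect( cCP )
      /* hb_BLen() counts bytes, Len() counts whatever the CP indexes by */
      ? cCP, hb_BLen( s ), Len( s )
   NEXT
   RETURN
```

Following the explanation above, hb_BLen() should report 7 in every row,
while Len() should report 7 for EN and UTF8 and 4 for UTF8EX.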
> I use a similar code snippet as you wrote, but using hb_cdpSelect() the
> length is wrong and with SET( _SET_CODEPAGE ...) is ok.
So you made some mistake in your code, because hb_cdpSelect(...) is
just another wrapper to exactly the same resource which is controlled by
SET( _SET_CODEPAGE, ...). There is absolutely no difference between
the two methods.
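A small sketch to check this on your side: select the CP with one call
and read it back with the other. It assumes only that set.ch is included
for the _SET_CODEPAGE constant:

```harbour
#include "set.ch"

REQUEST HB_CODEPAGE_UTF8EX

PROCEDURE Main()
   /* select with hb_cdpSelect(), read back with Set() */
   hb_cdpSelect( "UTF8EX" )
   ? Set( _SET_CODEPAGE )

   /* select with Set(), read back with hb_cdpSelect() */
   Set( _SET_CODEPAGE, "EN" )
   ? hb_cdpSelect()
   RETURN
```

If both calls control the same resource, the first line should show
UTF8EX and the second EN.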
> I you have time you can try the .prg In the attachment where are, yours and
> mine (yours altered) function code with examples.
> To see the individual effect of the functions, just change the function
> name to be used.
> The right results have to be "5" for the displayed examples
>
> Anyway, the problem is not in code gymnastics, but in the fact that
> sometimes
> in the app you need a length of some string, readed from the .dbf field,
> translated or been a part of the app code itself (ex. messages text,
> warnings, etc.)
Yes, of course. I need such functionality very often.
For the size in bytes I simply use HB_BLen(). If I want to check
the size in some different encoding then I use a function like:
   FUNCTION cdpLen( cStr, cCDP )
      LOCAL nLen, cPrevCDP := hb_cdpSelect( cCDP )
      nLen := Len( cStr )
      hb_cdpSelect( cPrevCDP )
   RETURN nLen
So I can test the number of characters a given string contains in
different encodings.
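A usage sketch for such a function, with the same 7 bytes as in the
earlier example. The Main() procedure is only an illustration and
hb_BChar() is assumed to be available in your Harbour build:

```harbour
REQUEST HB_CODEPAGE_UTF8EX

PROCEDURE Main()
   LOCAL s := hb_BChar( 197 ) + hb_BChar( 188 ) + hb_BChar( 195 ) + ;
              hb_BChar( 179 ) + hb_BChar( 197 ) + hb_BChar( 130 ) + ;
              hb_BChar( 119 )

   /* length of the same bytes seen through two different CPs */
   ? cdpLen( s, "EN" ), cdpLen( s, "UTF8EX" )

   /* cdpLen() restores the previous CP, so the active one is unchanged */
   ? hb_cdpSelect()
   RETURN

FUNCTION cdpLen( cStr, cCDP )
   LOCAL nLen, cPrevCDP := hb_cdpSelect( cCDP )
   nLen := Len( cStr )
   hb_cdpSelect( cPrevCDP )
   RETURN nLen
```

The caller passes the encoding explicitly, so no separate LEN_xxx()
function per codepage is needed.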
> All of them can be wrote and used under various combinations od codepages
> and encodings.
> To reach the output values in a printed sheet, display, save/read from and
> in .dbf-s we use a "codepage" gymnastics as you wrote in the function,
> thing that actually I use too, but...
> I have 7 codepage REQUEST in my app and call as basic UTF8EX then
> manipulate with others depending of it is a matrix printer, laser printer,
> direct print or use of a harupdf file and all with an encoding that the
> client can choose itself.
>
> Before I used UTF8EX,SRWIN,HRWIN or similar complex codepages and
> encodings, instead, SR646 or SR646C, all the things works ok, the character
> rappresentation and manipulation was simple, now Im in converting the apps
> code to use the above complex codepages and encodings and faced with a
> completely different problem.
> One of the problematic example is the result that returns LEN(), one of the
> most basic function (not sure of some others).
For me the results are perfect and what I expect.
> Sometime is not a problem when I know the encoding of the string that
> enters in various part of the app, but mostly I dont and, if I use LEN(), I
> expect a real value.
>
> Now, since LEN() cannot know what is the string codepage to process, I have
> to pass some "stringcodepageis" parameter all the way and then, when I use
> LEN(),
>
> Is not more simple like nLen=LEN(cString), but something like this
>
> DO CASE
> CASE stringcodepageis="SRWIN" ; nLen=LEN_SRWIN(cString)
> CASE stringcodepageis="HRWIN" ; nLen=LEN_HRWIN(cString)
> ...
> ...
> ...
> ENDCASE
>
> each time when a LEN function is used, thing that, actully, I am doing in
> my code.
A function like the cdpLen() I presented above is more flexible.
In theory we can attach codepage information to each string item, and I
was thinking about it in the past, but it creates many new anomalies,
e.g. what to do when we have code like:
cSrWinStr + cSr646CStr + cDeISOStr
or:
cSrWinStr > cSvISO
So in fact it creates a new set of problems which are much harder to
resolve and which can be a source of data corruption when expressions
like the above are used in RDD context. It would also introduce some
speed and size overhead. So I decided not to implement it in core HVM
code and to keep it as simple as possible. If I find time to finish
support for custom item types in HVM then such functionality can be
added as a contrib extension, and its author can define himself how he
plans to resolve such anomalies.
BTW I've seen your message about the "\" problem in 646 based CPs.
Such CPs do not contain the "\" character. It was replaced with "Đ",
just like a few other characters: @, {, [, }, ], ^, ~, |, \
see:
https://en.wikipedia.org/wiki/YUSCII
For paths you can use "/" as a replacement.
Anyhow, if you only have ISO-646 data in DBF files, you do not want to
store characters like @, {, [, }, ], ^, ~, |, \ in those files, and
these tables use indexes with binary collation, then you should use your
own custom SR646BIN CP defined in the following way:
/*** cpsr646bin.c ***/
#define HB_CP_ID SR646BIN
#define HB_CP_INFO "Serbian ISO-646 (YUSCII)"
#define HB_CP_UNITB HB_UNITB_646YU
#define HB_CP_ACSORT HB_CDP_ACSORT_NONE
#define HB_CP_UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define HB_CP_LOWER "abcdefghijklmnopqrstuvwxyz"
/* include CP registration code */
#include "hbcdpreg.h"
as the DBF codepage.
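A sketch of how such a CP could then be used from .prg code, assuming
cpsr646bin.c is compiled and linked with the application (e.g. passed to
hbmk2 together with the .prg files) and that hbcdpreg.h registers the
HB_CODEPAGE_SR646BIN symbol in the same way as for the core CPs. The
table name "customers" is only a placeholder:

```harbour
/* pull the custom CP into the executable */
REQUEST HB_CODEPAGE_SR646BIN

PROCEDURE Main()
   /* application CP; the RDD translates between it and the table CP */
   hb_cdpSelect( "UTF8EX" )

   /* "customers" is a hypothetical table holding ISO-646 (YUSCII) data */
   USE customers CODEPAGE "SR646BIN" SHARED

   ? Alias(), RecCount()

   dbCloseArea()
   RETURN

REQUEST HB_CODEPAGE_UTF8EX
```

The CP given in the USE command describes how the data is stored in the
table, while hb_cdpSelect() still controls the encoding used inside the
application.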
best regards,
Przemek