Re: unicode patch

Przemysław Czerpak

Apr 23, 2012, 7:54:16 AM

to Harbour developers

On Sat, 21 Apr 2012, Viktor Szakáts wrote:

Hi Viktor,

> Can't get this message through the mailing list, one copy was
> deleted, two don't appear. So here it is in private:

Probably we should look at it closer.
Some messages do not appear on the list or they appear after
long delay and not all messages are delivered to subscribers
(at least I do not receive all of them).
I'm setting CC to devel list.

BTW, sorry for the late response. I was out of town.

> Possibly the largest patch to Harbour at least in recent 5 years.
> Thank you very much Przemek. (and OTC for sponsoring)

Thank you.

> For those interested in looking into the whole patch (f.e. to
> update 3rd party code), use this command in Harbour SVN
> sandbox root:
> svn diff -r 17403:17404 > uni.dif
>
> One of the next logical questions: How to enable unicode
> fields in tables? Plus some more, but I'm still digesting the
> changes.

This is an additional extension - you can use simple character fields
for UTF8 strings.
Anyhow, I'll add support for setting field flags in DBCREATE()
this week.

> One issue I've found:
> fs_win_get_drive() in filesys.c has a call to hb_wcntombcpy()
> which needs to be updated to one of the new APIs.

I've seen it, but in fact it's needed only for extracting the drive
letter, so it's unimportant.
BTW I think we should eliminate the conversion to char* and the call to
hb_fsNameSplit() in this function.
   {
      TCHAR lpBuffer[ HB_PATH_MAX ];
      int iDrive;

      lpBuffer[ 0 ] = TEXT( '\0' );
      hb_fsSetIOError( GetCurrentDirectory(
               HB_SIZEOFARRAY( lpBuffer ), lpBuffer ) != 0, 0 );
      iDrive = HB_TOUPPER( lpBuffer[ 0 ] );
      if( iDrive >= 'A' && iDrive <= 'Z' &&
          lpBuffer[ 1 ] == HB_OS_DRIVE_DELIM_CHR )
         iDrive -= 'A';
      else
         iDrive = 0;
   }
should be enough.
I also think that we should add a new API function which returns
the full path with the drive letter, if any. It would nicely simplify
upper-level code. We can also create such a function for setting the
current directory, though this operation is not MT safe and MT
programs should not change the current directory.
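
For illustration only, the drive-letter logic can be sketched in portable C. The sketch takes the path as a plain char string instead of calling GetCurrentDirectory(), and drive_index_from_path() is a hypothetical helper name, not an existing Harbour function. Unlike the snippet above, it returns -1 rather than 0 when there is no drive prefix, so a caller could tell "no drive" apart from "A:".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the zero-based drive index (A: -> 0 ... Z: -> 25) for a path
 * that starts with "<letter>:", or -1 when there is no drive prefix.
 * Hypothetical helper for illustration; not part of the Harbour API. */
int drive_index_from_path( const char * path )
{
   int ch;

   assert( path != NULL );

   /* An empty path or a missing ':' delimiter means no drive prefix. */
   if( path[ 0 ] == '\0' || path[ 1 ] != ':' )
      return -1;

   ch = toupper( ( unsigned char ) path[ 0 ] );
   return ( ch >= 'A' && ch <= 'Z' ) ? ch - 'A' : -1;
}
```

Only the first two characters are inspected, so no codepage conversion of the rest of the path is needed.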

best regards,
Przemek

Massimo Belgrano

Apr 23, 2012, 8:29:20 AM

to harbou...@googlegroups.com

Is it possible to have a common approach for Harbour, ADS, and any upcoming RDD that adds Unicode, like OTC Mediator?

Sybase Advantage 10 (used via contrib\rddads.lib) includes three new field types: nChar, nVarChar and nMemo.
These field types can store Unicode characters.

http://blog.advantageevangelist.com/2010/06/ads-10-tip-4-unicode-support.html

For additional info, I suggest searching for "unicode" at http://devzone.advantagedatabase.com

On 23 April 2012 13:54, Przemysław Czerpak <dru...@poczta.onet.pl> wrote:
>
> > One of the next logical questions: How to enable unicode
> > fields in tables? Plus some more, but I'm still digesting the
> > changes.

>
> This is additional extension - you can use simple character fields for UTF8 strings.
> Anyhow I'll add support for setting field flags in DBCREATE() in this week.

--
Massimo Belgrano

vszakats

Apr 23, 2012, 8:55:25 AM

to harbou...@googlegroups.com

Hi Przemek,

On Monday, April 23, 2012 1:54:16 PM UTC+2, druzus wrote:

> On Sat, 21 Apr 2012, Viktor Szakáts wrote:
> Hi Viktor,
> > Can't get this message through the mailing list, one copy was
> > deleted, two don't appear. So here it is in private:
> Probably we should look at it closer.
> Some messages do not appear on the list or they appear after
> long delay and not all messages are delivered to subscribers
> (at least I do not receive all of them).
> I'm setting CC to devel list.

Thanks, they still didn't appear, and I also spotted the problem
of not receiving stuff in my mailbox, plus several other smaller
problems with this service. Maybe the management console
can give some clues about lost/pending mails.

> BTW Sorry for late response. I was out of city.

No problem at all, meanwhile I started to switch my
app to UTF8EX as a hobby project, and after 1 day of
work it ran and worked fine, though as usual 80% of
the work will need to be spent on 20% of weird cases
(like obscure C code and external interfaces/printing).
Harbour parts go smoothly and things work as expected.
UTF8 opens a new world when finally you're not restricted
to 8-bit, sounds obvious, but it's a huge step.

Noticed that sometimes it'd be useful to use the old
raw (non-uni) versions of functions like LEFT(), RIGHT(),
SUBSTR() (HB_BLEFT(), HB_BRIGHT(), HB_BSUBSTR()
seem to be fitting names, following HB_BLEN()) for
occasional binary data, it's work-aroundable but nevertheless.

Pending question is how to control sorting in UTF8EX, it's
not critical yet, but it will be when using this CP in indexed
tables.

I had minor confusion because, in order to make UTF8 box
chars display as expected (at least with GTWIN/GTWVT),
HB_GTI_BOXCP had to be set explicitly to "UTF8". Maybe
it'd be better to somehow make this the default, if technically
possible.

When reaching some more peculiar parts of my apps,
I may still have some experiences/questions to share.
(one candidate is stripping accents to convert a string to a readable
ASCII string, looks like something hard to do from upper level
code.)

Plus, it will be interesting to see how certain external libs
handle UTF8 chars, like hbmzip, libcurl.

> > One of the next logical questions: How to enable unicode
> > fields in tables? Plus some more, but I'm still digesting the
> > changes.
> This is additional extension - you can use simple character fields
> for UTF8 strings.
> Anyhow I'll add support for setting field flags in DBCREATE() in
> this week.

Sounds great, thank you. I've been toying with the idea of simply
pouring UTF8 into the raw string fields. One disadvantage is that
they will effectively change to variable length fields (which BTW
may cause potential data loss when converting existing 8-bit
data, even if you bump field widths at the same time), which may
be a good compromise, but how to handle potentially cutting
in the middle of UTF8 chars that cannot fit into the field size? If this
detail would be handled gracefully by the RDD, it'd be the most ideal I guess.
The other disadvantage is potential loss of indexing performance,
but all in all, these drawbacks may well be preferable to the doubled
size of a UTF16 solution when most of the data is ASCII.

---
/* encoding: utf-8 */
hb_cdpSelect( "UTF8EX" )
dbCreate( "test", {{ "TEST", "C", 3, 0 }} )
USE test
dbAppend() ; FIELD->TEST := "űű"
---
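
The case where a multi-byte character does not fit into the field width can be sketched in C as follows. This is only an illustration of cutting at a character boundary, assuming valid UTF-8 input; utf8_fit_bytes() is a hypothetical name, and nothing here describes what the RDD actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the largest length <= max_bytes at which the UTF-8 string s
 * (len bytes long) can be cut without splitting a multi-byte character.
 * Assumes s holds valid UTF-8. Hypothetical helper, not a Harbour API. */
size_t utf8_fit_bytes( const char * s, size_t len, size_t max_bytes )
{
   size_t cut = len < max_bytes ? len : max_bytes;

   assert( s != NULL );

   if( cut < len )
   {
      /* Step back while the first dropped byte is a continuation
       * byte (bit pattern 10xxxxxx), i.e. the cut is mid-character. */
      while( cut > 0 && ( ( unsigned char ) s[ cut ] & 0xC0 ) == 0x80 )
         --cut;
   }
   return cut;
}
```

For the two-character value "űű" (4 bytes in UTF-8) and a field width of 3, the function returns 2, so one whole character is kept and the remaining byte of the field would have to be padded.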

> > One issue I've found:
> > fs_win_get_drive() in filesys.c has a call to hb_wcntombcpy()
> > which needs to be updated to one of the new APIs.
> I've seen it but in fact it's necessary only for extracting drive
> letter so it's unimportant.
> BTW I think we should eliminate conversion to char* and call to
> hb_fsNameSplit() in this function.
>    {
>       TCHAR lpBuffer[ HB_PATH_MAX ];
>       int iDrive;
>       lpBuffer[ 0 ] = TEXT( '\0' );
>       hb_fsSetIOError( GetCurrentDirectory(
>                HB_SIZEOFARRAY( lpBuffer ), lpBuffer ) != 0, 0 );
>       iDrive = HB_TOUPPER( lpBuffer[ 0 ] );
>       if( iDrive >= 'A' && iDrive <= 'Z' &&
>           lpBuffer[ 1 ] == HB_OS_DRIVE_DELIM_CHR )
>          iDrive -= 'A';
>       else
>          iDrive = 0;
>    }
> should be enough.

Looks fine to me. Indeed, chances of non-ASCII
drive letters are pretty slim.

> I also think that we should add new API function which returns
> full path with drive letter if any. It nicely simplify upper
> level code. We can also create such function for setting
> current directory though this operation is not MT safe and MT
> programs should not change current directory.

There is HB_CWD() implemented for this. Though probably
it would be better to have it moved to C and use lower-level APIs.
The setter part is also there as a TODO + placeholder, and indeed
users should not do this, so it's not critical to implement it.

Viktor

Mindaugas Kavaliauskas

Apr 23, 2012, 9:10:54 AM

to harbou...@googlegroups.com

Hi,

On 2012.04.23 14:54, Przemysław Czerpak wrote:
>> Possibly the largest patch to Harbour at least in recent 5 years.
>> ...
>> svn diff -r 17403:17404 > uni.dif

You are wrong Viktor :)

svn diff -r 9373:9374 > mt.dif
is multi-thread support:
2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
and is 4% larger than uni.dif (1185680 bytes vs. 1138273 bytes) :)

Thanks, Przemek, for such a huge contribution!!! Perhaps it will still
take me a few days (or weeks), to understand the whole new unicode
coding ideas, and how I should change my code (related to file IO,
socket IO, dbf char type column storage and sorting, etc) to work with
UTF8.

Thanks again and regards,
Mindaugas

vszakats

Apr 23, 2012, 9:33:03 AM

to harbou...@googlegroups.com

On Monday, April 23, 2012 3:10:54 PM UTC+2, Mindaugas Kavaliauskas wrote:

> Hi,
>
> On 2012.04.23 14:54, Przemysław Czerpak wrote:
> >> Possibly the largest patch to Harbour at least in recent 5 years.
> >> ...
> >> svn diff -r 17403:17404 > uni.dif
>
> You are wrong Viktor :)
>
> svn diff -r 9373:9374 > mt.dif
> is multi-thread support:
> 2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
> and is 4% larger than uni.dif (1185680 bytes vs. 1138273 bytes) :)

I stand corrected :) I had appended a '?' at the end in my
first answer that didn't appear (or was deleted), then by the
3rd time I typed it in, I was too cheered up by the patch to bother.

However big it is, great job!

And an extra thought:

---
+ added new compiler switch:
-ku - strings in user encoding
Now it informs compiler that strings use custom encoding so some
optimizations which are byte oriented cannot be used.
It's possible that in the future we will change above definition
to sth like: "strings in UTF8 encoding" but now I would like to
keep more general.
---

I'd be very glad to see '-ku:utf8' or similar option (it will be
large job I reckon). For now I assumed I don't need -ku for
my UTF8 sources, but I may be proven wrong as I move
along the conversion process. Probably I'll fix those locally,
as I wouldn't want to lose compile-time string optimization.

Viktor

Przemysław Czerpak

Apr 23, 2012, 11:13:23 AM

to harbou...@googlegroups.com

On Mon, 23 Apr 2012, vszakats wrote:

Hi,

> ---
> + added new compiler switch:
> -ku - strings in user encoding
> Now it informs compiler that strings use custom encoding so some
> optimizations which are byte oriented cannot be used.
> It's possible that in the future we will change above definition
> to sth like: "strings in UTF8 encoding" but now I would like to
> keep more general.
> ---
> I'd be very glad to see '-ku:utf8' or similar option (it will be
> large job I reckon). For now I assumed I don't need -ku for
> my UTF8 sources, but I may be proven wrong as I move
> along the conversion process. Probably I'll fix those locally,
> as I wouldn't want to lose compile-time string optimization.

Now this switch disables optimization for a few function calls with
literal arguments, e.g. LEN( "ąćęłńóśźż" ), which should give 9 for
UTF8EX and 18 for byte-oriented CPs.
These are exactly:
   AT( <cLiteralString1>, <cLiteralString2> ) -> <nPos>
   LEN( <cLiteralString> ) -> <nLen>
   ASC( <cLiteralString> ) -> <nVal>   // if the 1st byte in the string is
                                       // greater than 127
   CHR( <nVal> ) -> <cLiteralString>   // if <nVal> is greater than 127

As you can see it's not too wide an area, and in most cases it's good
to look at such code during conversion to UTF8.
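
Why LEN() of a literal cannot be folded at compile time without knowing the encoding can be shown with a small C sketch. It counts characters in a UTF-8 byte string by skipping continuation bytes; utf8_char_count() is a hypothetical name used only for this illustration, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the characters (code points) in the first len bytes of the
 * UTF-8 string s. Every byte that is not a continuation byte
 * (10xxxxxx) starts a new character. Assumes valid UTF-8 input. */
size_t utf8_char_count( const char * s, size_t len )
{
   size_t i, count = 0;

   assert( s != NULL );

   for( i = 0; i < len; ++i )
   {
      if( ( ( unsigned char ) s[ i ] & 0xC0 ) != 0x80 )
         ++count;
   }
   return count;
}
```

Each of the nine Polish letters in the literal above takes two bytes in UTF-8, so the byte length is 18 while utf8_char_count() reports 9.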

Probably we will have to introduce a compiler switch to control
string encoding at compile time. It should help in a few things:
1. enable optimizations like above
2. automatic translation of constant values in source code
3. interaction with OS API and filename translations used
   in #include ... and similar compiler/PP directives.
We can do it in the compiler, but it means that we have to
integrate CP-oriented Harbour RTL code with the compiler as well.
We can also reach this effect much more easily inside HBMK2 using
the integrated compiler code, because in such a case the compiler
inherits the HVM from HBMK2.
I.e. point 3 above can be implemented even now, only inside
HBMK2. It's enough to parse switches for -ku:<cpname> and call:
   cSaveCP = hb_cdpSelect( <cpname> )
before HB_COMPILER() and then restore the HBMK2 CP with
   hb_cdpSelect( cSaveCP )
Some time ago I suggested adding compile-time optimizations
for some functions with literal parameters which can be executed
to calculate the results, e.g.:
   HB_CRC32( <cLiteralString> ) -> <nVal>
If we add such an optimization, then with the above user codepage
setting in HBMK2, as a side effect we also address the problem of disabled
optimization for the above functions - they will be optimized by our code.
Finally, point 2 with an active HVM can be resolved quite easily by a
custom open function like the one used in HBRUN for included files.
It means that we can reach all the above goals inside HBMK2 with some
minor modifications in pure compiler and PP code.
It's the reason why I didn't want to make any deeper modifications
in compiler/PP code with the unicode patch and added only a very simple
-ku switch.

This is the first thing we may address in the future.
It should not cause any backward compatibility problems - it will
be an extension only.

The second one is constant string encoding for box drawing characters
and the default CP.
Now inside box.ch we have pure CP437 definitions.
Also in RTL code we have a few constant values hardcoded for this CP:
   browse.prg   // constant values: 198, 181, 205
   checkbox.prg // constant values: 251
   dbedit.prg   // constant values: 205, 209, 179
   listbox.prg  // constant values: B_SINGLE, B_DOUBLE, 31
   scrollbr.prg // constant values: 24, 25, 26, 27, 176, 178
   tmenuitm.prg // constant values: MENU_SEPARATOR, 251, 16
   tpopup.prg   // B_SINGLE, SEPARATOR_SINGLE, MENU_SEPARATOR
   browse.prg   // B_DOUBLE_SINGLE
We can ignore ASCII values smaller than 32 because they are not
part of any multibyte encodings.
It's the reason why I left CP437 as the default CP encoding for BOX
characters. Changing the default here strongly interacts with existing
code, so at this stage I prefer that users who want to fully switch
to UTF8 set HB_GTI_BOXCP themselves. Such a choice does not
force modifications in existing code. It may change in the future
if we agree on a final version of the Harbour Unicode API. I would like
to avoid a situation where we force users to update PRG code in the
same area many times.
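
To make the CP437 dependency concrete, here is a rough C sketch that maps the hardcoded values above 127 from the list to their Unicode code points and UTF-8 bytes. Both function names are hypothetical and the table covers only those few values; it is not the mapping code Harbour uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Maps a few CP437 box/shade codes to Unicode code points. Values
 * below 128 are returned unchanged; unknown high values map to
 * U+FFFD. Illustrative subset only. */
uint32_t cp437_box_to_unicode( unsigned char c )
{
   switch( c )
   {
      case 176: return 0x2591;   /* light shade */
      case 178: return 0x2593;   /* dark shade */
      case 179: return 0x2502;   /* single vertical line */
      case 181: return 0x2561;   /* vertical single, left double */
      case 198: return 0x255E;   /* vertical single, right double */
      case 205: return 0x2550;   /* double horizontal line */
      case 209: return 0x2564;   /* down single, horizontal double */
      case 251: return 0x221A;   /* square root (check mark) */
      default:  return c < 0x80 ? c : 0xFFFD;
   }
}

/* Writes the UTF-8 form of a CP437 code into out and returns the
 * number of bytes used (1 or 3 for the values handled above). */
int cp437_box_to_utf8( unsigned char c, char out[ 4 ] )
{
   uint32_t cp = cp437_box_to_unicode( c );

   assert( out != NULL );

   if( cp < 0x80 )
   {
      out[ 0 ] = ( char ) cp;
      return 1;
   }
   /* All remaining results lie in U+0800..U+FFFF: three bytes. */
   out[ 0 ] = ( char ) ( 0xE0 | ( cp >> 12 ) );
   out[ 1 ] = ( char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
   out[ 2 ] = ( char ) ( 0x80 | ( cp & 0x3F ) );
   return 3;
}
```

A single CP437 byte such as 205 becomes the three bytes E2 95 90 in UTF-8, which is why a raw CHR( 205 ) cannot simply be reinterpreted once the HVM CP is UTF8EX.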

The third thing is bound with FOR EACH c IN str / NEXT.
Now it operates on binary data. It's possible to switch
to character indexes but I would like to confirm it.
Such a modification is not backward compatible, so we should
make the decision quite soon.
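
The difference between the two iteration modes can be sketched in C. A byte-wise loop visits every byte, while a character-wise loop needs a decoder such as the one below. utf8_next() is a hypothetical helper for this illustration, assumes well-formed UTF-8 and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from well-formed UTF-8 at s[*pos] and
 * advances *pos past it. No validation is performed. */
uint32_t utf8_next( const char * s, size_t * pos )
{
   unsigned char lead;
   uint32_t cp;
   int extra;

   assert( s != NULL && pos != NULL );

   lead = ( unsigned char ) s[ ( *pos )++ ];

   /* The lead byte tells how many continuation bytes follow. */
   if( lead < 0x80 )      { cp = lead;        extra = 0; }
   else if( lead < 0xE0 ) { cp = lead & 0x1F; extra = 1; }
   else if( lead < 0xF0 ) { cp = lead & 0x0F; extra = 2; }
   else                   { cp = lead & 0x07; extra = 3; }

   /* Each continuation byte contributes its low 6 bits. */
   while( extra-- > 0 )
      cp = ( cp << 6 ) | ( ( unsigned char ) s[ ( *pos )++ ] & 0x3F );

   return cp;
}
```

For the three-byte string "aű", a byte-wise loop runs three times, while calling utf8_next() until the position reaches the end yields two values, 0x61 and 0x171.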

best regards,
Przemek

Przemysław Czerpak

Apr 23, 2012, 12:01:06 PM

to harbou...@googlegroups.com

On Mon, 23 Apr 2012, Massimo Belgrano wrote:

Hi Massimo,

> Possible have a common way for harbour , ads, any upcoming rdd who add
> unicode like otc mediator
> sybase Advantage 10 contrib\rddads.lib includes three new field types;
> nChar, nVarChar and nMemo.
> These field types will be able to store Unicode characters
> http://blog.advantageevangelist.com/2010/06/ads-10-tip-4-unicode-support.html

I added support for these fields to the ADS* RDDs:

2010-10-09 19:07 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/include/hbapirdd.h
+ added new field flag: HB_FF_UNICODE
* harbour/contrib/rddads/ads1.c
+ added support for new ADS 10.0 UNICODE fields: NChar, NVarChar, NMemo
They are supported in all ADS* RDDs.

and also for DBF* RDDs:

2010-10-13 13:21 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/src/rdd/dbf1.c
* harbour/src/rdd/dbffpt/dbffpt1.c
+ added support for UNICODE fields compatible with the one used
by ADS

so this was done a long time ago.

Now I'm talking only about adding an option to control field flags
in dbCreate(). To keep the current DBSTRUCT() table dimensions I plan
to define that the ":" character used in a field type starts the field flags.
It means that:
   { "NAME", "C:U", 20, 0 }
will mean that field NAME has the UNICODE flag and
   { "SIGNATURE", "C:B", 20, 0 }
means that field SIGNATURE has the BINARY flag, etc.
It will allow using some extensions which have existed in native DBF*
RDDs for a long time and also adding some new ones, e.g. we can define
that "Z" means the COMPRESS flag, so:
   { "DATA", "M:Z", 4, 0 }
means a memo field with compressed body.
Of course it will be possible to mix different flags:
   { "DATA", "M:UZ", 4, 0 }
I have to use a separator character for backward compatibility with
existing RDDs which use multiletter field type descriptions, e.g.
ADS* RDDs.
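
As a rough illustration of the proposed "type:flags" syntax, a parser could look like the C sketch below. The FLAG_* constants and parse_field_type() are hypothetical names invented for this example; the real field flags (HB_FF_*) live in hbapirdd.h and may differ in value and handling.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits for illustration only. */
#define FLAG_UNICODE   0x01   /* "U" */
#define FLAG_BINARY    0x02   /* "B" */
#define FLAG_COMPRESS  0x04   /* "Z" */

/* Splits a descriptor such as "C:U" or "M:UZ" into the base type
 * (the text before ':') and a bit mask built from the flag letters
 * after it. Returns 0 on success, -1 on an empty or overlong type
 * or an unknown flag letter. */
int parse_field_type( const char * desc, char * type, size_t type_size,
                      unsigned * flags )
{
   const char * sep;
   size_t n;

   assert( desc != NULL && type != NULL && flags != NULL );

   sep = strchr( desc, ':' );
   n = sep ? ( size_t ) ( sep - desc ) : strlen( desc );

   if( n == 0 || n >= type_size )
      return -1;

   /* Multi-letter base types stay intact because only ':' splits. */
   memcpy( type, desc, n );
   type[ n ] = '\0';
   *flags = 0;

   if( sep )
   {
      for( ++sep; *sep; ++sep )
      {
         switch( *sep )
         {
            case 'U': *flags |= FLAG_UNICODE;  break;
            case 'B': *flags |= FLAG_BINARY;   break;
            case 'Z': *flags |= FLAG_COMPRESS; break;
            default:  return -1;
         }
      }
   }
   return 0;
}
```

Because the separator is explicit, a multiletter type name passes through unchanged as the base type, which is the backward-compatibility point made above.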

best regards,
Przemek

vszakats

Apr 23, 2012, 12:37:17 PM

to harbou...@googlegroups.com

> Now this switch disables optimization for few function calls with
> literal arguments, i.e. LEN( "ąćęłńóśźż" ), what should give 9 for
> UTF8EX and 18 for byte oriented CPs.
> These are exactly:
> AT( <cLiteralString1>, <cLiteralString2> ) -> <nPos>
> LEN( <cLiteralString> ) -> <nLen>
> ASC( <cLiteralString> ) -> <nVal> // if 1-st byte in the string is
> // greater than 127
> CHR( <nVal> ) -> <cLiteralString> // if <nVal> is greater than 127
>
> As you can see it's not too wide area and in most of cases it's good
> to look at such code during conversion to UTF8.

Thanks, it looks far less bad and I probably don't use
any of the above. It's not very easy to tell, it will need further
analysis. Anyhow the point is that -ku can be used without
much cost.

> Probably we will have to introduce to compiler switch to control
> string encoding at compile time. It should help in few things:
> 1. enable optimizations like above
> 2. automatic translation of constant values in source code
> 3. interaction with OS API and filename translations used
> in #include ... and similar compiler/PP directives.
> We can make it in the compiler but it means that we have to
> integrate with compiler also CP oriented Harbour RTL code.
> We can also reach this effect much easier inside HBMK2 using
> integrated compiler code because in such case compiler inherits
> HVM from HBMK2.
> I.e. point 3 above can be implemented even now only inside
> HBMK2. It's enough to parse switches for -ku:<cpname> and call:
> cSaveCP = hb_cdpSelect( <cpname> )
> before HB_COMPILER() and then restore HBMK2 CP with
> hb_cdpSelect( cSaveCP )

It's a good idea.

It has one disadvantage: it's not easy to add a #pragma that
can control encoding on a per-file basis. Unless we go the route
of letting hbmk2 peek into the source and look for certain things in
it, but it won't be ideal for performance (and various other reasons),
and f.e. current multi-file compilation would have to be disabled.

I'll check what it takes to implement this in hbmk2 to give it a
first shot, with the option that we may move this option to the
low-level compiler at a later stage.

> Some time ago I suggested to add compiler time optimizations
> for some functions with literal parameters which can be executed
> to calculate the results, i.e.:
> HB_CRC32( <cLiteralString> ) -> <nVal>
> If we add such optimization then with above user codepage setting
> to HBMK2 then as side effect we also address the problem of disabled
> optimization for above functions - they will be optimized by our code.

I cannot see the precise relation to the above issue, but it
clearly seems a cool feature. And combining this sort of trick with
codepage support, I reckon it may even be possible to add support
for an encoding #pragma in some way. (with a callback?)

In the longer run this leads to a minimal harbour compiler used
solely for the purpose of building Harbour itself, and hbmk2 which
will basically become _the_ compiler visible to the outside world.
This is in sync with my past notion to eventually drop the raw
harbour executable from the distribution (with the option to access
it via hbmk2, which is already implemented).

If communication between hbmk2 and the compiler engine becomes
closely integrated, we may even add automatic "package" (aka "lib")
selection right from the source.

> Finally point 2 with active HVM can be quite easy resolved by custom
> open function like the one used in HBRUN for included files.
> It means that we can reach all above goals inside HBMK2 with some
> minor modifications in pure compiler and PP code.
> It's the reason why I didn't want to make any deeper modifications
> in compiler/PP code with unicode patch and added only very simple -ku
> switch.

Sounds perfect to me, I just wish I could imagine more
precisely the method for hbmk2 <=> compiler engine
communication you have in mind.

> The second one is constant string encoding for box drawing characters
> and default CP.
> Now inside box.ch we have pure CP437 definitions.
> Also in RTL code we have few constant values hardcoded for this CP:
> browse.prg // constant values: 198, 181, 205
> checkbox.prg // constant values: 251
> dbedit.prg // constant values: 205, 209, 179
> listbox.prg // constant values: B_SINGLE, B_DOUBLE, 31
> scrollbr.prg // constant values: 24, 25, 26, 27, 176, 178
> tmenuitm.prg // constant values: MENU_SEPARATOR, 251, 16
> tpopup.prg // B_SINGLE, SEPARATOR_SINGLE, MENU_SEPARATOR
> browse.prg // B_DOUBLE_SINGLE
> We can ignore ASCII values smaller than 32 because they are not
> part of any multibyte encodings.
> It's the reason why I left CP437 as default CP encoding for BOX
> characters. Changing default here strongly interacts with existing
> code so at this stage I prefer that users who want to fully switch
> to UTF8 will set HB_GTI_BOXCP themselves. Such choice does not
> force modifications in existing code. It may change in the future
> if we agree final version of Harbour Unicode API. I would like to
> avoid situation when we are forcing user PRG code updating in the
> same area many times.

Fair enough. This issue steps into the area of how to
move Harbour sources themselves to unicode eventually.
Besides box drawing chars, the only place where Harbour
hosts non-ASCII strings are the language modules. I converted
it for Hungarian, and if this seems alright the same can
be done for the rest of them. (then collations remain, but
that leads too far).

> The third thing is bound with FOR EACH c in str / NEXT.
> Now it operates on binary data. It's possible to switch
> to character indexes but I would like to confirm it.

Very useful information. I've been thinking about it without
making tests and so far concluded that they _should_ work
on CP. If they work on bytes, I'll have to look all over the
source to see where this might cause a problem. [ I did
and I seldom use it, only at two places, both of them
expecting chars, plus a 3rd in a disabled low-level function
where I use an equivalent implemented in C. ]

> Such modification is not backward compatible so we should
> take the decision quite fast.

My vote is for CP sensitive iteration, for consistency.
Maybe with an additional keyword to force raw byte
processing ('FOR EACH c IN s BYTE' or similar).

Viktor

wen....@gmail.com

Apr 23, 2012, 8:24:50 PM

to harbou...@googlegroups.com

>
> Now I'm talking only about adding an option to control field flags
> in dbCreate(). To keep current DBSTRUCT() table dimmensions I plan
> to define that ":" character used in field type starts field flags.
> It means that:
> { "NAME", "C:U", 20, 0 }
> will mean that field NAME has UNICODE flag and
> { "SIGNATURE", "C:B", 20, 0 }
> means that field SIGNATURE has BINARY flag, etc.

Why not use the fifth parameter?

{ "NAME", "C", 20, 0, "U" }
{ "SIGNATURE", "C", 20, 0, "B" }

vszakats

Apr 23, 2012, 8:44:49 PM

to harbou...@googlegroups.com

On Tuesday, April 24, 2012 2:24:50 AM UTC+2, WenSheng wrote:

> > { "NAME", "C:U", 20, 0 }
> > will mean that field NAME has UNICODE flag and
> > { "SIGNATURE", "C:B", 20, 0 }
> > means that field SIGNATURE has BINARY flag, etc.
>
> Why not use the fifth parameter?
>
> { "NAME", "C", 20, 0, "U" }
> { "SIGNATURE", "C", 20, 0, "B" }

You can find the answer to that in detail
in the archives, where we discussed this
many years ago.

Viktor

Mindaugas Kavaliauskas

Apr 24, 2012, 9:41:08 AM

to harbou...@googlegroups.com

Hi,

I'm trying to understand the amount of changes necessary to use HVM
unicode. Questions:

1) This is not a question, more a suggestion for other people. You need to
request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I expected it
to be included just like hb_cdpselect("UTF8") and this took me a few hours
of testing.

2) I have a large amount of code that does data file/packet data parsing,
encodes/decodes various structures, etc. CHR(), ASC(), I2BIN(), L2BIN(),
BIN2I() and other functions are very common. How should all this code be
written in the UTF8EX case? Should I use HB_B{LEN,CODE,CHAR}() instead of
LEN(), ASC(), CHR()? What about HB_B*() versions of LEFT(), RIGHT(),
SUBSTR() functions?

3) AFAIU, the following code is buggy because of LEFT()?
   cFile := ""
   DO WHILE .T.
      cBuf := SPACE(BUF_SIZE)
      IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
         cFile += LEFT(cBuf, nI)
      ELSE; EXIT
      ENDIF
   ENDDO

4) STRTRAN() was not patched. I guess it should.

5) If I have text file in win1257 encoding and I want to read it using
hb_memoread(), what function should be used to convert file content from
known encoding to internal HVM encoding (and vice-versa)?

6) HB_BCHAR() vs. CHR() for argument values 0..127?

7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend on
some codepage selection or compiler switch?

8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it works
the same as ASC(). If some "non-unicode" codepage is selected it returns
character code of current code page, so, it is ASC() again. Am I wrong?

9) I'm a little confused among the meanings of "Unicode", UTF16, UCS-2,
etc. AFAIU, UTF-8 can represent characters having numbers up to 31-bit
length. UTF-8 representation can take 1 to 6 bytes. What about UTF8EX?
What character range is supported?
Ex., I see HB_UCHAR() uses:
( HB_WCHAR ) hb_parni( 1 )
so, character code are from range 0 to 65535?
What is expected maximum return values of HB_BLEN(HB_UCHAR(nChar))?
What character ranges are supported by functions hb_cdp*U16()?
http://en.wikipedia.org/wiki/UTF-16 says, that "UTF-16 is used for text
in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so, it can
represent characters up to U+10FFFF (in some cases 1 character occupies
two 2-byte wide characters, i.e., 4 bytes). How Harbour internals works
in these cases?

10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can see in
the source code of compiler, -ku just disables some optimisations, but
it does not change string encoding in pcode. So, I understand that .prg
source code is expected to have encoding set by hb_cdpselect().

11) Now we have a situation similar to SET_EXACT. hb_cdpselect()
significantly changes program logic if string functions are used. What
are the rules to write portable code (to make it work on different
codepage settings, including other possible user multibyte codepages)?

12) I'm not sure I understand the whole problems about commandline
options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks ugly.
Can we use *W() windows unicode functions to obtain command line and
convert it to current codepage in case user calls hb_progname(), etc? I
guess it should be possibility to obtain current windows codepage
(ANSI/OEM) for non-unicode *A() API. Can we set this codepage as default
value for SET_OSCODEPAGE?

13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows? Windows
uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.

Regards,
Mindaugas

vszakats

Apr 24, 2012, 10:29:08 AM

to harbou...@googlegroups.com

> 2) I have a large amount of code that do data file/packet data parsing,
> encodes/decodes various structures, etc. CHR(), ASC(), I2BIN(), L2BIN,
> BIN2I() and other functions are very common. How all this code should be
> written in UTF8EX case? Should I use HB_B{LEN,CODE,CHAR}() instead of
> LEN(), ASC(), CHR()? What about HB_B*() versions of LEFT(), RIGHT(),
> SUBSTR() functions?

Yes. As for LEFT(), RIGHT(), SUBSTR() I also miss them, so if HB_B*()
is not enough, I switch back locally to "EN" CP. The code doesn't look
very good and I'm not sure of the hb_cdpSelect() performance impact, but
it works.

> 3) AFAIU, the following code is buggy because of LEFT()?
> cFile := ""
> DO WHILE .T.
> cBuf := SPACE(BUF_SIZE)
> IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
> cFile += LEFT(cBuf, nI)
> ELSE; EXIT
> ENDIF
> ENDDO

Also this:

// LOOP
nWritten := FWrite( fhnd, SubStr( cString, nPos + 1 ) )
nPos += nWritten
// ENDLOOP

Same when using hb_socket*() functions. F.e. I tried
to update your UDPDS code, and I couldn't get to the bottom
of it yet.
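
The mismatch behind both loops is that FREAD()/FWRITE() report byte counts, while LEFT()/SUBSTR() under a multibyte HVM CP count characters. A small C sketch of that effect follows; utf8_left_bytes() is a hypothetical helper that returns how many bytes the first n characters occupy, assuming valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes occupied by the first nchars characters
 * of the UTF-8 buffer s (len bytes long). Assumes valid UTF-8. */
size_t utf8_left_bytes( const char * s, size_t len, size_t nchars )
{
   size_t i = 0;

   assert( s != NULL );

   while( i < len && nchars > 0 )
   {
      /* Consume the lead byte, then its continuation bytes. */
      ++i;
      while( i < len && ( ( unsigned char ) s[ i ] & 0xC0 ) == 0x80 )
         ++i;
      --nchars;
   }
   return i;
}
```

Take a 6-byte buffer pre-filled with spaces into which a read stored 4 bytes holding "űű". Asking for the left 4 characters covers all 6 bytes, so two padding spaces would be appended to the result even though only 4 bytes were read.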

> 4) STRTRAN() was not patched. I guess it should.

It's okay as it is.
See explanation in src/rtl/cdpapihb.c

/* none of numeric parameters in STRTRAN() (4-th and 5-th) refers to
 * character position in string so we do not need to create new
 * HB_UTF8STRTRAN() but we can safely use normal STRTRAN() function
 */

FOR EACH is pending. Plus there are lots of standalone
contrib functions written in C where unicode support should
be decided and where applicable, implemented. Until then,
they all work on binary strings.

> 5) If I have text file in win1257 encoding and I want to read it using
> hb_memoread(), what function should be used to convert file content from
> known encoding to internal HVM encoding (and vice-versa)?

HB_TRANSLATE()

> 7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend on
> some codepage selection or compiler switch?

The former. e"" is always a binary string ATM, because the compiler
is not aware of source encoding.

> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it works
> the same as ASC(). If some "non-unicode" codepage is selected it returns
> character code of current code page, so, it is ASC() again. Am I wrong?

As far as I could find out, HB_UCODE()/HB_UCHAR() functions work
on UTF16 characters. HB_UPEEK()/HB_UPOKE() also work on UTF16
chars, but the position is expected as raw byte position. I'm less sure
of HB_ULEN().
I have to admit to being a little confused about these.

> 9) I'm a little confused among the meanings of "Unicode", UTF16, UCS-2,
> etc. AFAIU, UTF-8 can represent characters having numbers up to 31-bit
> length. UTF-8 representation can take 1 to 6 bytes. What about UTF8EX?
> What character range is supported?
> Ex., I see HB_UCHAR() uses:
> ( HB_WCHAR ) hb_parni( 1 )
> so, character code are from range 0 to 65535?

From what I read/tested so far: Yes, 0-0xFFFF.

"UTF8EX" is just a made up name. I'd like it better if "UTF8" could
be used for this purpose.

"UTF16LE" CP cannot be used as HVM CP (it can be enabled,
but many things won't work, which is expected, given how many
things should be changed for it), but it's useful in HB_TRANSLATE().
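
For reference, the 0-0xFFFF limit matters because code points above U+FFFF need two 16-bit units (a surrogate pair) in UTF-16. A small C sketch of the standard encoding rule follows; utf16_encode() is a hypothetical name and says nothing about how Harbour's HB_WCHAR-based functions treat such characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-16 into out. Returns the number of
 * 16-bit units written (1 or 2), or 0 when cp cannot be encoded
 * (inside the surrogate range or above U+10FFFF). */
int utf16_encode( uint32_t cp, uint16_t out[ 2 ] )
{
   assert( out != NULL );

   if( ( cp >= 0xD800 && cp <= 0xDFFF ) || cp > 0x10FFFF )
      return 0;

   if( cp < 0x10000 )
   {
      /* Basic Multilingual Plane: fits into one unit. */
      out[ 0 ] = ( uint16_t ) cp;
      return 1;
   }

   /* Supplementary planes: split the 20-bit offset into two halves. */
   cp -= 0x10000;
   out[ 0 ] = ( uint16_t ) ( 0xD800 | ( cp >> 10 ) );
   out[ 1 ] = ( uint16_t ) ( 0xDC00 | ( cp & 0x3FF ) );
   return 2;
}
```

U+0171 (ű) fits into one unit, while U+1F600 becomes the pair D83D DE00, which a single 16-bit character value cannot hold.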

> 10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can see in
> the source code of compiler, -ku just disables some optimisations, but
> it does not change string encoding in pcode. So, I understand that .prg
> source code is expected to have encoding set by hb_cdpselect().

So far I think it only helps in OS conversions that occur while the
compiler is executing. I could not make up a test to confirm this. The
plan is that it will allow compile-time optimizations for strings in
the passed CP, later maybe more.

> 11) Now we have situation similar to SET_EXACT. hb_cdpselect()
> significantly changes program logic if string functions are used. What
> are the rules to write portable code (to make it work on different
> codepage settings, including other possible user multibyte codepages)?

Quite a wide question. What's most important is to identify
places where you handle strings in HVM CP vs. where you
work on binary strings, and change to HB_BCHAR() and
similar where working with binary ones.

Change all places where you communicate with the outside
world to filter through HB_TRANSLATE(). This can be
needed with certain APIs, too.
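
What such a translation step does at the boundary can be sketched in C with a single-byte codepage as input. The sketch uses ISO-8859-1 as a stand-in because its byte values equal the Unicode code points; a codepage like win1257 would need a 128-entry lookup table instead. latin1_to_utf8() is a hypothetical name and is unrelated to how HB_TRANSLATE() is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Converts len bytes of ISO-8859-1 text to NUL-terminated UTF-8 in
 * out (capacity out_size bytes). Returns the number of bytes written
 * without the terminator, or (size_t) -1 if out is too small. */
size_t latin1_to_utf8( const unsigned char * in, size_t len,
                       char * out, size_t out_size )
{
   size_t i, o = 0;

   assert( in != NULL && out != NULL );

   if( out_size == 0 )
      return ( size_t ) -1;

   for( i = 0; i < len; ++i )
   {
      unsigned char c = in[ i ];
      size_t need = c < 0x80 ? 1 : 2;

      /* Keep one byte in reserve for the terminator. */
      if( o + need >= out_size )
         return ( size_t ) -1;

      if( c < 0x80 )
         out[ o++ ] = ( char ) c;
      else
      {
         /* 0x80..0xFF become two-byte sequences. */
         out[ o++ ] = ( char ) ( 0xC0 | ( c >> 6 ) );
         out[ o++ ] = ( char ) ( 0x80 | ( c & 0x3F ) );
      }
   }
   out[ o ] = '\0';
   return o;
}
```

The 4-byte input "café" (with 0xE9 for the last letter) grows to 5 bytes, ending in C3 A9, so any length or position computed before the conversion is no longer valid after it.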

12) I'm not sure I understand the whole problems about commandline
options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks ugly.

I think _SET_CODEPAGE is unavoidable to receive PARAMETERs
in Main() in the selected CP, but I agree that _SET_OSCODEPAGE is far
from ideal, f.e. in my app it depends on a user setting, which is not
available at the CLIPINIT stage.

Can we use *W() windows unicode functions to obtain command line and
convert it to current codepage in case user calls hb_progname(), etc? I
guess it should be possibility to obtain current windows codepage
(ANSI/OEM) for non-unicode *A() API. Can we set this codepage as default
value for SET_OSCODEPAGE?

*W() can be used, which makes _SET_OSCODEPAGE unnecessary here.

13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows? Windows
uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.

In recent years, support for WIDE API (-DUNICODE) mode has been added
to all Windows API interactions; at the same time all such places
were updated to use the String API to exchange strings between Harbour
HVM and Windows API. Now the String API will automatically convert between
UTF8 or else (HVM CP) and Windows CP (UTF16).

The only remaining place I know where WIDE API is not used (in fact
no API is used, because values come via WinMain()), is the command
line parsing stuff.

Worth noting that hbmzip, libcurl, libharu, expat, pcre and all other
3rd party libs have their special unicode support level and requirements,
and some of these will probably require attention to implement smooth
interfaces. F.e. HBQT doesn't convert CP in certain functions, libharu
seems seriously broken in some (yet to be determined) unicode cases.

Viktor

Bacco

Apr 24, 2012, 10:45:09 AM

to harbou...@googlegroups.com

I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
approach to Unicode, and not changing the default and reliable
functions. Current HB_B functions make updating old software that
relies on binary strings for internal purposes to use unicode
interfaces a very risky task. Besides, the Unicode concept is a
harbour thing, not cl*pper one, so HB_U to the new functions seems
more logical to me.

vszakats

Apr 24, 2012, 11:05:30 AM

to harbou...@googlegroups.com

On Tuesday, April 24, 2012 4:45:09 PM UTC+2, Bacco wrote:

I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
approach to Unicode, and not changing the default and reliable
functions. Current HB_B functions make updating old software that
relies on binary strings for internal purposes to use unicode
interfaces a very risky task. Besides, the Unicode concept is a
harbour thing, not cl*pper one, so HB_U to the new functions seems
more logical to me.

If we did it as you propose, many more app changes would be
required, and many more app changes would mean many more
potential problems (not to mention 3rd party code).

I find the current direction quite fine, because I only had to touch
code which was already an exception in some respect. Such
code is more complicated but much smaller in quantity than the
remaining 98% of my application. It's also a very positive
change that exceptions will now be much easier to identify,
due to the special API that they must use. IOW it's a future-proof change.
[ I'd be happier with a full set of HB_B*() functions, though. ]

Also, pure string operations are just one part of the problem;
what's also important is the automatic CP conversion on component
boundaries, which would be just impossible with your proposal.
It would also move UTF8 into a different "league" than those
other codepages that the HVM has long supported, thus
creating an exception, one that would never disappear.

With the current method, once you fix your code for binary strings,
fix your C code (if any) and make the unicode transition, you
seldom have to deal with this problem for the remaining lifetime
of your app.

BTW, I don't see how this is related to reliability. They work
just as reliably as before (especially considering that the whole
patch is only a few days old), only in any CP you selected, now
including UTF8.

Viktor

Przemysław Czerpak

Apr 24, 2012, 11:08:06 AM

to harbou...@googlegroups.com

On Tue, 24 Apr 2012, Mindaugas Kavaliauskas wrote:

Hi,

> I'm trying to understand the amount of changes necessary to use HVM
> unicode. Questions:
> 1) this not a question, more a suggestion for other people. You need
> to request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I
> expected it is included just like hb_cdpselect("UTF8") and this took
> me a few hours of testing.

The conversion tables are huge, so I didn't make it a default
part of HVM.

> 2) I have a large amount of code that do data file/packet data
> parsing, encodes/decodes various structures, etc. CHR(), ASC(),
> I2BIN(), L2BIN, BIN2I() and other functions are very common. How all
> this code should be written in UTF8EX case? Should I use
> HB_B{LEN,CODE,CHAR}() instead of LEN(), ASC(), CHR()?

HB_B*() functions always operate on bytes. It doesn't matter
what CP you use.

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?

I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so it's not strictly necessary; anyhow, I can add them too if you want.

> 3) AFAIU, the following code is buggy because of LEFT()?
> cFile := ""
> DO WHILE .T.
> cBuf := SPACE(BUF_SIZE)
> IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
> cFile += LEFT(cBuf, nI)
> ELSE; EXIT
> ENDIF
> ENDDO

Yes, it's not portable.
In such a context it is necessary to use HB_BSUBSTR()/HB_BLEFT(), i.e.:
cFile += HB_BSUBSTR(cBuf, 1, nI)
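
[Editor's illustration, not part of the original message.] The reason the byte count returned by FREAD() cannot be fed to a character-indexed LEFT() can be sketched in plain C. This is only a sketch under the assumption of well-formed UTF-8 input; `utf8_seq_len`, `left_chars_bytes` and `left_bytes` are hypothetical helper names and are not Harbour API.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence starting with lead byte c.
   Hypothetical helper; assumes well-formed UTF-8. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;            /* 0xxxxxxx: ASCII          */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx                 */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx                 */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx                 */
    return 1;                          /* stray byte: step over it */
}

/* Byte length of the first n_chars characters of buf (n_bytes long).
   This is what a character-indexed "left" has to compute. */
size_t left_chars_bytes(const char *buf, size_t n_bytes, size_t n_chars)
{
    size_t pos = 0;
    while (n_chars > 0 && pos < n_bytes) {
        size_t step = utf8_seq_len((unsigned char)buf[pos]);
        if (step > n_bytes - pos)      /* truncated tail */
            step = n_bytes - pos;
        pos += step;
        n_chars--;
    }
    return pos;
}

/* A byte-indexed "left" only clamps the count. */
size_t left_bytes(size_t n_bytes, size_t n)
{
    return n < n_bytes ? n : n_bytes;
}
```

Take an 8-byte buffer holding the 6 bytes read from the file (two 2-byte characters plus "ab") followed by 2 padding spaces. `left_bytes(8, 6)` keeps exactly the 6 bytes that were read, while `left_chars_bytes(buf, 8, 6)` walks 6 characters and so covers all 8 bytes, including the padding.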

> 4) STRTRAN() was not patched. I guess it should.

For valid and normalized UTF8 strings it's not necessary.

> 5) If I have text file in win1257 encoding and I want to read it
> using hb_memoread(), what function should be used to convert file
> content from known encoding to internal HVM encoding (and
> vice-versa)?

hb_cdpTranslate( <cText>, <cCpIN>, <cCpOUT> ) -> <cDest>
if <cCpIN> or <cCpOUT> is missing then HVM CP is used.
Limitations:
it operates on Harbour CPs, not unicode ones, which are more general,
so please remember about REQUEST HB_CODEPAGE_<cpIN>, HB_CODEPAGE_<cpOUT>
In the future we should add support for using unicode CP IDs in these
functions.

> 6) HB_BCHAR() vs. CHR() for argument values 0..127?

No difference for UTF8 and DBCS encodings.

> 7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend
> on some codepage selection or compiler switch?

Now it's a binary string, but this may change in the future if we
decide to add explicit support for unicode strings, though maybe
in such a case we should use a different syntax, i.e. u"..."

> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
> works the same as ASC(). If some "non-unicode" codepage is selected
> it returns character code of current code page, so, it is ASC()
> again. Am I wrong?

It's not the same as ASC().
It always takes the first character (not byte) from the given string and
returns its unicode value; this code should illustrate it:

SET( _SET_CODEPAGE, "UTF8EX" )
s := HB_UCHAR( 0x104 ) // Ą
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104

SET( _SET_CODEPAGE, "PLMAZ" )
? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500

So regardless of the CP used, these functions operate on UNICODE values.
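
[Editor's illustration, not part of the original message.] The 0104 and 2500 results above can be traced by hand with a small C sketch. It assumes code points below U+10000 and well-formed input; all three function names are hypothetical, and `sbcs_first_code` carries a single table entry on the assumption that byte 0xC4 maps to U+2500, as it does in the CP437 layout that Mazovia is based on.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+10000 as UTF-8; returns the byte count.
   Sketch only: surrogate values are not rejected. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Unicode value of the first character of a UTF-8 string (BMP only). */
uint16_t utf8_first_code(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)
        return (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
    return (uint16_t)(((s[0] & 0x0F) << 12) |
                      ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
}

/* Unicode value of the first character under a single-byte code page.
   Only one illustrative table entry is present (assumed: 0xC4 -> U+2500);
   a real code page would carry a full 256-entry table. */
uint16_t sbcs_first_code(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    if (s[0] == 0xC4)
        return 0x2500;
    return 0xFFFD; /* entry omitted in this sketch */
}
```

Encoding U+0104 gives the two bytes C4 84. Read back as UTF-8, the first character is U+0104 again; read as single-byte text, the first character is just the byte C4, which the assumed table maps to U+2500.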

> 9) I'm a little confused among the meanings of "Unicode", UTF16,
> UCS-2, etc. AFAIU, UTF-8 can represent characters having numbers up
> to 31-bit length. UTF-8 representation can take 1 to 6 bytes. What
> about UTF8EX? What character range is supported?
> Ex., I see HB_UCHAR() uses:
> ( HB_WCHAR ) hb_parni( 1 )
> so, character code are from range 0 to 65535?
> What is expected maximum return values of HB_BLEN(HB_UCHAR(nChar))?
> What character ranges are supported by functions hb_cdp*U16()?
> http://en.wikipedia.org/wiki/UTF-16 says, that "UTF-16 is used for
> text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so,
> it can represent characters up to U+10FFFF (in some cases 1
> character occupies two 2-byte wide characters, i.e., 4 bytes). How
> Harbour internals works in these cases?

Harbour correctly processes UTF8 strings with up to 31-bit characters.
Anyhow HB_WCHAR is 16 bit in the current implementation, so upper bits
are stripped during translations. I haven't added support for
UTF16 encoding and intentionally used U16 in names to not confuse
users. If we decide it's useful then we can redefine HB_WCHAR
as a 32 bit integer.
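
[Editor's illustration, not part of the original message.] What "upper bits are stripped" means for a 16-bit character type, next to the surrogate-pair form that real UTF-16 would use, can be sketched in C. The function names are hypothetical; this shows the general arithmetic, not Harbour's internal code.

```c
#include <stdint.h>

/* What a 16-bit character type keeps of a code point: the low 16 bits. */
uint16_t truncate_to_16(uint32_t cp)
{
    return (uint16_t)(cp & 0xFFFF);
}

/* UTF-16 surrogate pair for a code point in U+10000..U+10FFFF.
   Returns 1 on success, 0 if cp is outside that range. */
int utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                             /* 20 significant bits  */
    *high = (uint16_t)(0xD800 + (cp >> 10));   /* top 10 bits          */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits       */
    return 1;
}
```

For U+1F600, truncation leaves 0xF600, which is an unrelated code point, whereas the UTF-16 form is the pair D83D DE00. Characters up to U+FFFF are unaffected either way.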

> 10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can
> see in the source code of compiler, -ku just disables some
> optimisations, but it does not change string encoding in pcode. So,
> I understand that .prg source code is expected to have encoding set
> by hb_cdpselect().

Exactly.

> 11) Now we have situation similar to SET_EXACT. hb_cdpselect()
> significantly changes program logic if string functions are used.
> What are the rules to write portable code (to make it work on
> different codepage settings, including other possible user multibyte
> codepages)?

HB_B*() functions for binary operations and HB_U*() functions for
operations on unicode characters. These functions are CP independent.

> 12) I'm not sure I understand the whole problems about commandline
> options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks
> ugly.
> Can we use *W() windows unicode functions to obtain command line and
> convert it to current codepage in case user calls hb_progname(),
> etc? I guess it should be possibility to obtain current windows
> codepage (ANSI/OEM) for non-unicode *A() API. Can we set this
> codepage as default value for SET_OSCODEPAGE?

This is a minor and local problem which can be resolved quite easily
by modifications in cmdarg.c and hbwmain.c

> 13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows?

I do not know.
In Windows _SET_OSCODEPAGE should be set to ANSI CP. It's used
by code which operates on ANSI WIN32 API. In Harbour core code
we eliminated ANSI W32 API so it's rather for 3-rd party code
and communication with some libraries which do not have WCHAR
API.

> Windows uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.

Yes, it does.

best regards,
Przemek

Przemysław Czerpak

Apr 24, 2012, 11:19:44 AM

to harbou...@googlegroups.com

On Tue, 24 Apr 2012, Bacco wrote:

Hi,

> I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
> approach to Unicode, and not changing the default and reliable
> functions. Current HB_B functions make updating old software that
> relies on binary strings for internal purposes to use unicode
> interfaces a very risky task. Besides, the Unicode concept is a
> harbour thing, not cl*pper one, so HB_U to the new functions seems
> more logical to me.

For me the most beautiful thing in the implementation I committed
is the fact that I do not have to agree or disagree with such messages
and discuss it ;-)
It's enough that you use UTF8 instead of UTF8EX to keep
binary indexes as default.
And if you want then you can easily create your own custom UTF8XX
which will use any mixed parts of UTF8 and UTF8EX. That's only
your choice.

best regards,
Przemek

vszakats

Apr 24, 2012, 11:46:45 AM

to harbou...@googlegroups.com

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so it's not strictly necessary anyhow I can add it too if you want.

I'd add a vote for HB_BLEFT() and HB_BRIGHT(). This
would make code conversion easier for code that already
uses LEFT() and RIGHT().

hb_cdpTranslate( <cText>, <cCpIN>, <cCpOUT> ) -> <cDest>

In the future we should add support for using unicode CP IDs in this
functions.

Would be great.

users. If we decide it's usefull then we can redefine HB_WCHAR
as 32 bit integer.

Even though I cannot see the practical benefits,
for some strange reason, this also sounds very cool.

> 13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows?

[ misread this question. ]

Viktor

vszakats

Apr 24, 2012, 12:03:05 PM

to harbou...@googlegroups.com

Hi Przemek,

On Tuesday, April 24, 2012 5:46:45 PM UTC+2, vszakats wrote:

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so it's not strictly necessary anyhow I can add it too if you want.
I'd add a vote for HB_BLEFT() and HB_BRIGHT(). This
would make code conversion easier for code that already
uses LEFT() and RIGHT().

I will add some #defines for this. Seems fine and it avoids
some bloat in RTL.

Thanks for the HB_BSUBSTR() patch, it helps greatly to
move along.

Viktor

Massimo Belgrano

Apr 24, 2012, 12:11:20 PM

to harbou...@googlegroups.com

Can somebody post a little source,
before and after?

--
Massimo Belgrano

Mindaugas Kavaliauskas

Apr 24, 2012, 1:49:41 PM

to harbou...@googlegroups.com

Hi,

thank you, Viktor and Przemek for all explanations!

> FOR EACH is pending.

It's hard for me to vote if FOR EACH should work on characters or bytes.
In general I avoid using this sentence for strings. In my head, FOR EACH
is a kind of optimisation for evaluation using integer index and []
operator. Since string characters are not accessed using cStr[nPos], I
avoid using it on strings. But from general reasoning I would expect it to
work on characters just like SUBSTR(cStr, nPos, 1) (unprefixed by any
HB_B*() or HB_U*()).

>> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
>
> I'll add HB_BSUBSTR().
> LEFT(<str>,<n> ) is the same as SUBSTR(<str>, 1,<n> ) and
> RIGHT(<str>,<n> ) is the same as SUBSTR(<str>, -<n> )
> so it's not strictly necessary anyhow I can add it too if you want.

Viktor:

> I will add some #defines for this. Seem fine and it avoids
> some bloat in RTL.

I find HB_BLEFT(), HB_BRIGHT() useful without PP tricks, or manual
conversion to HB_BSUBSTR(). Code bloat would be minimal in comparison to
all unicode tables. These are very basic functions just like LEFT() and
RIGHT(), and I have quite a lot of code which communicates with some devices,
and binary protocol parsing is done at Harbour level. I hope nobody
votes for removal of LEFT() and RIGHT() and adding PP tricks to
implement it :)

Even more... I already need HB_BAT() and HB_BRAT(). In many cases I find
AT() doing the job OK, but will it work OK if my binary search needle is
some substring (possibly malformed) UTF8 byte representation?
In some cases I need the 3rd parameter, so, I use HB_AT(). For sure I
should change it to HB_BAT()...
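
[Editor's illustration, not part of the original message.] The difference between a byte-level search and a character position can be sketched in C. This only shows what a purely byte-oriented search would report; whether AT() behaves this way under a given CP is exactly the open question here. `byte_at` and `char_index_of_byte` are hypothetical names, and the second one assumes well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* 1-based byte position of needle in hay, or 0 if absent.
   Pure byte comparison: the needle may be a fragment of a character. */
size_t byte_at(const char *needle, size_t nlen,
               const char *hay, size_t hlen)
{
    size_t i;
    if (nlen == 0 || nlen > hlen)
        return 0;
    for (i = 0; i + nlen <= hlen; i++)
        if (memcmp(hay + i, needle, nlen) == 0)
            return i + 1;
    return 0;
}

/* 1-based character index that contains the 1-based byte position
   byte_pos: counts every byte up to byte_pos that is not a UTF-8
   continuation byte (10xxxxxx). */
size_t char_index_of_byte(const char *hay, size_t byte_pos)
{
    size_t i, n = 0;
    for (i = 0; i < byte_pos; i++)
        if (((unsigned char)hay[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

In the 3-byte string C4 84 'b', the letter "b" is at byte position 3 but character position 2, and a byte search for the lone continuation byte 84 succeeds at byte position 2 even though that byte is not a character on its own.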

>> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
>> works the same as ASC(). If some "non-unicode" codepage is selected
>> it returns character code of current code page, so, it is ASC()
>> again. Am I wrong?
>
> It's not the same as ASC().
> It always takes first character (not byte) from given string and
> returns it unicode value, this code should illustrate it:
>
> SET( _SET_CODEPAGE, "UTF8EX" )
> s := HB_UCHAR( 0x104 ) // Ą
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104
>
> SET( _SET_CODEPAGE, "PLMAZ" )
> ? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500
>
> So regardless of used CP this functions operates on UNICODE values.

Things became clear only after I added
? hb_strtohex(s) // C484
and looked at http://en.wikipedia.org/wiki/Mazovia_encoding

Though, I still have the same question about HB_ULEN(). If I set UTF8EX,
the return value of LEN() and HB_ULEN() is the same. If I set a single byte
per char codepage, LEN() also returns the same value as HB_ULEN(). Can I have a
situation with LEN(cStr) != HB_ULEN(cStr)? (Maybe in some other custom
codepage...)

> HB_B*() functions for binary operations and HB_U*() functions for
> operations on unicode characters. These functions are CP independent.

I'm not sure I understand how HB_USUBSTR() is CP independent if it
depends on hb_vmCDP(). Can you give example with !(SUBSTR(cStr, nPos,
nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?

Regards,
Mindaugas

vszakats

Apr 24, 2012, 2:17:56 PM

to harbou...@googlegroups.com

Viktor:
> I will add some #defines for this. Seem fine and it avoids
> some bloat in RTL.
I find HB_BLEFT(), HB_BRIGHT() useful without PP tricks, or manual
conversion to HB_BSUBSTR(). Code bloat would be minimal in comparison to
all unicode tables. This is very basic functions just like LEFT() and
RIGHT(), and I have quite many code which communicates to some devices,
and binary protocol parsing is done at Harbour level. I hope nobody
votes for removal of LEFT() and RIGHT() and adding PP tricks to
implement it :)

Now that I have finished converting most code to use them, they
turned out to be even more useful than I thought; they allowed me to
avoid reverting OS CP to EN almost completely (working on one
remaining complex case), and you're right about the bloat being
insignificant in general.

Even more... I already need HB_BAT() and HB_BRAT(). In many cases I find
AT() doing the job OK, but will it work OK if my binary search needle is
some substring (possibly malformed) UTF8 byte representation?
In some cases I need the 3rd parameter, so, I use HB_AT(). For sure I
should change it to HB_BAT()...

Nothing against these from my side.

Viktor

Przemysław Czerpak

Apr 24, 2012, 4:23:47 PM

to harbou...@googlegroups.com

On Tue, 24 Apr 2012, Mindaugas Kavaliauskas wrote:

Hi,

> Though, I still has the same question about HB_ULEN(). If I set
> UTF8EX, return value of LEN() and HB_ULEN() is the same. If I set
> single byte per char codepage, LEN() also the same value as
> HB_ULEN(). Can I have a situation with LEN(cStr) != HB_ULEN(cStr)?
> (Maybe in some other custom codepage...)
> >HB_B*() functions for binary operations and HB_U*() functions for
> >operations on unicode characters. These functions are CP independent.
> I'm not sure I understand how HB_USUBSTR() is CP independent if it
> depends on hb_vmCDP(). Can you give example with !(SUBSTR(cStr,
> nPos, nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?

In both cases the answer is the same.
It's necessary for multibyte CPs which do not use custom indexes in
standard functions, so we can make people who have preferences like
Bacco's happy.

I'll add HB_[UB]{LEFT,RIGHT}() soon.

best regards,
Przemek

Bacco

Apr 24, 2012, 4:43:55 PM

to harbou...@googlegroups.com

Hi, Przemek

> In both cases the answer is the same.
> It's necessary of multibyte CPs which do not use custom indexes in
> standard functions so we can make people which has preferences like
> Bacco happy.

Just as a side note: I have no problem with the current implementation,
nor with explicit use of HB_U and HB_B functions, and I currently
have no problem at all with encoding concepts. My comment was
entirely based on the relation of visible problems (display/encoding
errors are easily detectable) vs. binary operation errors that common
users are unaware of and may have a hard time locating.

I know this is a huge and important change, and I've been aware
of it since you kindly shared your "todo" list with us, and I think
the overall achievement is very good. I just raised a concern about very
specific details as one additional opinion.

Best regards
Bacco
