Re: unicode patch

Przemysław Czerpak

Apr 23, 2012, 7:54:16 AM
to Harbour developers
On Sat, 21 Apr 2012, Viktor Szakáts wrote:

Hi Viktor,

> Can't get this message through the mailing list, one copy was
> deleted, two don't appear. So here it is in private:

We should probably look at it more closely.
Some messages do not appear on the list, or they appear after
a long delay, and not all messages are delivered to subscribers
(at least I do not receive all of them).
I'm setting CC to the devel list.

BTW, sorry for the late response. I was out of town.

> Possibly the largest patch to Harbour at least in recent 5 years.
> Thank you very much Przemek. (and OTC for sponsoring)

Thank you.

> For those interested in looking into the whole patch (f.e. to
> update 3rd party code), use this command in Harbour SVN
> sandbox root:
> svn diff -r 17403:17404 > uni.dif
>
> One of the next logical questions: How to enable unicode
> fields in tables? Plus some more, but I'm still digesting the
> changes.

This is an additional extension; you can already use simple character
fields for UTF8 strings.
Anyhow, I'll add support for setting field flags in DBCREATE()
this week.

> One issue I've found:
> fs_win_get_drive() in filesys.c has a call to hb_wcntombcpy()
> which needs to be updated to one of the new APIs.

I've seen it, but in fact it's necessary only for extracting the drive
letter, so it's unimportant.
BTW, I think we should eliminate the conversion to char* and the call to
hb_fsNameSplit() in this function.
   {
      TCHAR lpBuffer[ HB_PATH_MAX ];
      int iDrive;
      lpBuffer[ 0 ] = TEXT( '\0' );
      hb_fsSetIOError( GetCurrentDirectory(
                       HB_SIZEOFARRAY( lpBuffer ), lpBuffer ) != 0, 0 );
      iDrive = HB_TOUPPER( lpBuffer[ 0 ] );
      if( iDrive >= 'A' && iDrive <= 'Z' &&
          lpBuffer[ 1 ] == HB_OS_DRIVE_DELIM_CHR )
         iDrive -= 'A';
      else
         iDrive = 0;
   }
should be enough.
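
A minimal portable C sketch of the same drive-letter check, for experimenting outside the Windows API. The function name is hypothetical and not part of Harbour, and unlike the snippet above it returns -1 (not 0) when no drive prefix is present, so that "A:" and "no drive" stay distinguishable:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a Harbour API: returns the zero-based drive
 * index (A: -> 0, B: -> 1, ...) when the path starts with "<letter>:",
 * or -1 when there is no drive prefix. */
int drive_index_from_path( const char * path )
{
   assert( path != NULL );

   if( path[ 0 ] != '\0' && path[ 1 ] == ':' )
   {
      int ch = toupper( ( unsigned char ) path[ 0 ] );
      if( ch >= 'A' && ch <= 'Z' )
         return ch - 'A';
   }
   return -1;
}
```

Only the first two characters are inspected, which is why no conversion of the whole path is needed for this purpose.
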
I also think that we should add a new API function which returns
the full path with the drive letter, if any. It would nicely simplify
upper-level code. We could also create such a function for setting the
current directory, though this operation is not MT safe and MT
programs should not change the current directory.

best regards,
Przemek

Massimo Belgrano

Apr 23, 2012, 8:29:20 AM
to harbou...@googlegroups.com
Is it possible to have a common approach for Harbour, ADS, and any upcoming RDD that adds Unicode, like OTC Mediator?

Sybase Advantage 10 (contrib\rddads.lib) includes three new field types: nChar, nVarChar and nMemo.
These field types will be able to store Unicode characters.

For additional info, I suggest searching for "unicode" at http://devzone.advantagedatabase.com



On 23 April 2012 at 13:54, Przemysław Czerpak <dru...@poczta.onet.pl> wrote:
>
> > One of the next logical questions: How to enable unicode
> > fields in tables? Plus some more, but I'm still digesting the
> > changes.

 
>
> This is an additional extension; you can already use simple character fields for UTF8 strings.
> Anyhow, I'll add support for setting field flags in DBCREATE() this week.


--
Massimo Belgrano

vszakats

Apr 23, 2012, 8:55:25 AM
to harbou...@googlegroups.com
Hi Przemek,


On Monday, April 23, 2012 1:54:16 PM UTC+2, druzus wrote:
On Sat, 21 Apr 2012, Viktor Szakáts wrote:

Hi Viktor,

> Can't get this message through the mailing list, one copy was
> deleted, two don't appear. So here it is in private:

We should probably look at it more closely.
Some messages do not appear on the list, or they appear after
a long delay, and not all messages are delivered to subscribers
(at least I do not receive all of them).
I'm setting CC to the devel list.

Thanks, they still didn't appear, and I also spotted the problem 
of not receiving stuff in mailbox, plus several other smaller 
problems with this service. Maybe the management console 
can give some clues for lost/pending mails.
 

BTW, sorry for the late response. I was out of town.

No problem at all. Meanwhile I started to switch my
app to UTF8EX as a hobby project, and after 1 day of
work it ran and worked fine, though as usual 80% of
the work will need to be spent on 20% of weird cases
(like obscure C code and external interfaces/printing).
Harbour parts go smoothly and things work as expected.
UTF8 opens a new world when you're finally not restricted
to 8-bit; it sounds obvious, but it's a huge step.

Noticed that sometimes it'd be useful to use the old 
raw (non-uni) versions of functions like LEFT(), RIGHT(), 
SUBSTR() (HB_BLEFT(), HB_BRIGHT(), HB_BSUBSTR() 
seems to be fitting names, following HB_BLEN()) for 
occasional binary data, it's work-aroundable but nevertheless.

Pending question is how to control sorting in UTF8EX, it's 
not critical yet, but it will be when using this CP in indexed 
tables.

I had minor confusion because, in order to make UTF8 box 
chars display as expected (at least with GTWIN/GTWVT), 
HB_GTI_BOXCP had to be set explicitly to "UTF8". Maybe 
it'd be better to somehow make this the default, if technically 
possible.

When reaching to some more peculiar parts of my apps, 
I may still have some experiences/questions to share.
(one candidate is stripping accents to convert string to readable 
ASCII string, looks like something hard to do from upper level 
code.)

Plus, it will be interesting to see how certain external libs 
handle UTF8 chars, like hbmzip, libcurl.

> One of the next logical questions: How to enable unicode
> fields in tables? Plus some more, but I'm still digesting the
> changes.

This is an additional extension; you can already use simple character
fields for UTF8 strings.
Anyhow, I'll add support for setting field flags in DBCREATE()
this week.

Sounds great, thank you. I've been toying with the idea of simply
pouring UTF8 into the raw string fields. One disadvantage is that
they effectively become variable-length fields (which BTW
may cause data loss when converting existing 8-bit
data, even if you bump field widths at the same time). That may
be a good compromise, but how should UTF8 chars be handled that
do not fit the field size and would be cut in the middle? If this detail
were handled gracefully by the RDD, it'd be ideal, I guess.
The other disadvantage is a potential loss of indexing performance,
but all in all, these drawbacks may still be preferable to the doubled
size of a UTF16 solution when most of the data is ASCII.

---
/* encoding: utf-8 */
hb_cdpSelect( "UTF8EX" )
dbCreate( "test", {{ "TEST", "C", 3, 0 }} )
USE test
dbAppend() ; FIELD->TEST := "űű"
---
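
In the example above, "ű" takes two bytes in UTF-8, so "űű" needs 4 bytes and does not fit a 3-byte field. A minimal C sketch of one way to cut gracefully: back up to the previous character boundary instead of splitting a multibyte sequence. The function name is hypothetical, and this is not how any Harbour RDD is known to behave:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Harbour API: returns the largest length
 * <= max_bytes at which the UTF-8 string s can be cut without splitting
 * a multibyte sequence. Assumes s is valid UTF-8. */
size_t utf8_fit_bytes( const char * s, size_t max_bytes )
{
   size_t len;

   assert( s != NULL );
   len = strlen( s );
   if( len <= max_bytes )
      return len;

   len = max_bytes;
   /* Continuation bytes look like 10xxxxxx; step back until the byte at
    * the cut position starts a new character. */
   while( len > 0 && ( ( unsigned char ) s[ len ] & 0xC0 ) == 0x80 )
      --len;
   return len;
}
```

With a 3-byte limit, "űű" is cut to 2 bytes (one whole character); the remaining byte of the field would have to be padded.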
 

> One issue I've found:
>    fs_win_get_drive() in filesys.c has a call to hb_wcntombcpy()
>    which needs to be updated to one of the new APIs.

I've seen it but in fact it's necessary only for extracting drive
letter so it's unimportant.
BTW I think we should eliminate conversion to char* and call to
hb_fsNameSplit() in this function.
   {
      TCHAR lpBuffer[ HB_PATH_MAX ];
      int iDrive;
      lpBuffer[ 0 ] = TEXT( '\0' );
      hb_fsSetIOError( GetCurrentDirectory(
                       HB_SIZEOFARRAY( lpBuffer ), lpBuffer ) != 0, 0 );
      iDrive = HB_TOUPPER( lpBuffer[ 0 ] );
      if( iDrive >= 'A' && iDrive <= 'Z' &&
          lpBuffer[ 1 ] == HB_OS_DRIVE_DELIM_CHR )
        iDrive -= 'A';
      else
        iDrive = 0;
   }
should be enough.

Looks fine to me. Indeed, the chances of non-ASCII
drive letters are pretty slim.
 

I also think that we should add a new API function which returns
the full path with the drive letter, if any. It would nicely simplify
upper-level code. We could also create such a function for setting the
current directory, though this operation is not MT safe and MT
programs should not change the current directory.

There is HB_CWD() implemented for this. Though probably 
it would be better to have it moved to C and using lower-level APIs.
The setter part is also there as a TODO + placeholder, and indeed 
users should not do this, so it's not critical to implement it.

Viktor

Mindaugas Kavaliauskas

Apr 23, 2012, 9:10:54 AM
to harbou...@googlegroups.com
Hi,


On 2012.04.23 14:54, Przemysław Czerpak wrote:
>> Possibly the largest patch to Harbour at least in recent 5 years.
>> ...
>> svn diff -r 17403:17404> uni.dif

You are wrong Viktor :)

svn diff -r 9373:9374 > mt.dif
is multi-thread support:
2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
and is 4% larger than uni.dif (1185680 bytes vs. 1138273 bytes) :)

Thanks, Przemek, for such a huge contribution!!! Perhaps it will still
take me a few days (or weeks), to understand the whole new unicode
coding ideas, and how I should change my code (related to file IO,
socket IO, dbf char type column storage and sorting, etc) to work with
UTF8.


Thanks again and regards,
Mindaugas

vszakats

Apr 23, 2012, 9:33:03 AM
to harbou...@googlegroups.com
On Monday, April 23, 2012 3:10:54 PM UTC+2, Mindaugas Kavaliauskas wrote:
Hi,

On 2012.04.23 14:54, Przemysław Czerpak wrote:
>> Possibly the largest patch to Harbour at least in recent 5 years.
>> ...
>>     svn diff -r 17403:17404>  uni.dif

You are wrong Viktor :)

   svn diff -r 9373:9374 > mt.dif
is multi-thread support:
   2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
and is 4% larger than uni.dif (1185680 bytes vs. 1138273 bytes) :)

I stand corrected :) I had appended a '?' at the end in my
first answer, which didn't appear (or was deleted); by the
3rd time I typed it in, I was too cheered up by the patch to bother.

However big it is, great job!

And an extra thought:
---
    + added new compiler switch:
         -ku  - strings in user encoding
      Now it informs compiler that strings use custom encoding so some
      optimizations which are byte oriented cannot be used.
      It's possible that in the future we will change above definition
      to sth like: "strings in UTF8 encoding" but now I would like to
      keep more general.
---

I'd be very glad to see a '-ku:utf8' or similar option (it will be a
large job, I reckon). For now I assumed I don't need -ku for
my UTF8 sources, but I may be proven wrong as I move 
along the conversion process. Probably I'll fix those locally, 
as I wouldn't want to lose compile-time string optimization.

Viktor

Przemysław Czerpak

Apr 23, 2012, 11:13:23 AM
to harbou...@googlegroups.com
On Mon, 23 Apr 2012, vszakats wrote:

Hi,

> ---
> + added new compiler switch:
> -ku - strings in user encoding
> Now it informs compiler that strings use custom encoding so some
> optimizations which are byte oriented cannot be used.
> It's possible that in the future we will change above definition
> to sth like: "strings in UTF8 encoding" but now I would like to
> keep more general.
> ---
> I'd be very glad to see a '-ku:utf8' or similar option (it will be a
> large job, I reckon). For now I assumed I don't need -ku for
> my UTF8 sources, but I may be proven wrong as I move
> along the conversion process. Probably I'll fix those locally,
> as I wouldn't want to lose compile-time string optimization.

Now this switch disables optimization for a few function calls with
literal arguments, e.g. LEN( "ąćęłńóśźż" ), which should give 9 for
UTF8EX and 18 for byte-oriented CPs.
These are exactly:
   AT( <cLiteralString1>, <cLiteralString2> ) -> <nPos>
   LEN( <cLiteralString> ) -> <nLen>
   ASC( <cLiteralString> ) -> <nVal> // if the 1st byte in the string is
                                     // greater than 127
   CHR( <nVal> ) -> <cLiteralString> // if <nVal> is greater than 127

As you can see, it's not too wide an area, and in most cases it's good
to look at such code during conversion to UTF8.
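
The difference between the two LEN() results can be sketched in C, independently of Harbour: a character count skips UTF-8 continuation bytes, while a byte count does not. The function name is an assumption made for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper, not a Harbour API: counts UTF-8 characters by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes valid UTF-8 input. */
size_t utf8_char_count( const char * s )
{
   size_t n = 0;

   assert( s != NULL );
   for( ; *s != '\0'; ++s )
   {
      if( ( ( unsigned char ) *s & 0xC0 ) != 0x80 )
         ++n;
   }
   return n;
}
```

Each of the nine Polish letters in the literal above occupies two bytes in UTF-8, so this function counts 9 characters where a byte count gives 18. A compile-time optimizer that folds LEN() on the raw bytes would therefore bake in the wrong value for a character-oriented CP.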

Probably we will have to introduce a compiler switch to control
string encoding at compile time. It should help in a few things:
   1. enable optimizations like above
   2. automatic translation of constant values in source code
   3. interaction with OS API and filename translations used
      in #include ... and similar compiler/PP directives.
We can do it in the compiler, but it means that we would also have to
integrate the CP-oriented Harbour RTL code with the compiler.
We can also reach this effect much more easily inside HBMK2 using
integrated compiler code, because in such a case the compiler inherits
the HVM from HBMK2.
I.e. point 3 above can be implemented even now, only inside
HBMK2. It's enough to parse switches for -ku:<cpname> and call:
   cSaveCP = hb_cdpSelect( <cpname> )
before HB_COMPILER() and then restore HBMK2 CP with
   hb_cdpSelect( cSaveCP )
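
The switch-parsing step could look roughly like this in C. This is only a sketch: the function name is hypothetical, the match is case-sensitive (an assumption; real option parsing may be more lenient), and it is not taken from HBMK2:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if arg has the form "-ku:<cpname>" with a
 * non-empty name, returns a pointer to <cpname> inside arg; otherwise
 * returns NULL. A plain "-ku" (no codepage given) also yields NULL. */
const char * ku_switch_cpname( const char * arg )
{
   static const char prefix[] = "-ku:";
   size_t plen = sizeof( prefix ) - 1;

   assert( arg != NULL );
   if( strncmp( arg, prefix, plen ) == 0 && arg[ plen ] != '\0' )
      return arg + plen;
   return NULL;
}
```

The returned name would then be what gets passed to the codepage selection before the compiler is invoked, with the previous CP restored afterwards.
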
Some time ago I suggested adding compile-time optimizations
for some functions with literal parameters which can be executed
to calculate the results, e.g.:
   HB_CRC32( <cLiteralString> ) -> <nVal>
If we add such an optimization, then with the above user codepage setting
in HBMK2 we also address, as a side effect, the problem of disabled
optimization for the above functions - they will be optimized by our code.
Finally, point 2 with an active HVM can be quite easily resolved by a custom
open function like the one used in HBRUN for included files.
It means that we can reach all the above goals inside HBMK2 with some
minor modifications in pure compiler and PP code.
It's the reason why I didn't want to make any deeper modifications
in compiler/PP code with the unicode patch and added only a very simple -ku
switch.

This is the first thing we may address in the future.
It should not cause any backward compatibility problems - it will
be an extension only.

The second one is constant string encoding for box drawing characters
and default CP.
Now inside box.ch we have pure CP437 definitions.
Also in RTL code we have a few constant values hardcoded for this CP:
   browse.prg   // constant values: 198, 181, 205
   checkbox.prg // constant values: 251
   dbedit.prg   // constant values: 205, 209, 179
   listbox.prg  // constant values: B_SINGLE, B_DOUBLE, 31
   scrollbr.prg // constant values: 24, 25, 26, 27, 176, 178
   tmenuitm.prg // constant values: MENU_SEPARATOR, 251, 16
   tpopup.prg   // B_SINGLE, SEPARATOR_SINGLE, MENU_SEPARATOR
   browse.prg   // B_DOUBLE_SINGLE
We can ignore ASCII values smaller than 32 because they are not
part of any multibyte encodings.
It's the reason why I left CP437 as the default CP encoding for BOX
characters. Changing the default here strongly interacts with existing
code, so at this stage I prefer that users who want to fully switch
to UTF8 set HB_GTI_BOXCP themselves. Such a choice does not
force modifications in existing code. It may change in the future
if we agree on the final version of the Harbour Unicode API. I would like to
avoid a situation where we force users to update PRG code in the
same area many times.
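
To make the CP437 dependency concrete, here is a small C sketch mapping the hardcoded values above 127 from the list to their Unicode code points. The table is deliberately partial, and the names are assumptions for this example rather than Harbour code:

```c
#include <assert.h>
#include <stddef.h>

/* CP437 byte values from the list above, paired with their Unicode
 * code points. Only the values greater than 127 are covered. */
struct box_map { unsigned char cp437; unsigned short ucs; };

static const struct box_map s_box_map[] = {
   { 176, 0x2591 },  /* light shade */
   { 178, 0x2593 },  /* dark shade */
   { 179, 0x2502 },  /* single vertical line */
   { 181, 0x2561 },  /* vertical single and left double */
   { 198, 0x255E },  /* vertical single and right double */
   { 205, 0x2550 },  /* double horizontal line */
   { 209, 0x2564 },  /* down single and horizontal double */
   { 251, 0x221A }   /* square root sign, used as a check mark */
};

/* Returns the Unicode code point for a CP437 byte: the byte itself for
 * 0..127, the table entry if present, or U+FFFD (replacement character)
 * for values missing from this partial table. */
unsigned short cp437_box_to_unicode( int ch )
{
   size_t i;

   assert( ch >= 0 && ch <= 255 );
   if( ch < 128 )
      return ( unsigned short ) ch;
   for( i = 0; i < sizeof( s_box_map ) / sizeof( s_box_map[ 0 ] ); ++i )
   {
      if( s_box_map[ i ].cp437 == ch )
         return s_box_map[ i ].ucs;
   }
   return 0xFFFD;
}
```

A value such as 205 is a single byte in CP437 but a three-byte sequence once the same glyph (U+2550) is stored as UTF-8, which is why code that builds box strings from CHR() constants is sensitive to the box CP setting.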

The third thing is bound with FOR EACH c in str / NEXT.
Now it operates on binary data. It's possible to switch
to character indexes but I would like to confirm it.
Such modification is not backward compatible so we should
take the decision quite fast.
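
The byte-versus-character distinction for iteration can be sketched in C. This is an illustration under stated assumptions, not Harbour's implementation: it handles only sequences of up to 3 bytes (code points 0..0xFFFF) and the function names are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Decodes one UTF-8 sequence of up to 3 bytes starting at s and stores
 * the code point in *cp. Returns the number of bytes consumed, or 0 at
 * the end of the string. Invalid or longer sequences are consumed as
 * one raw byte. */
size_t utf8_next( const char * s, unsigned * cp )
{
   const unsigned char * p = ( const unsigned char * ) s;

   assert( s != NULL && cp != NULL );
   if( p[ 0 ] == 0 )
      return 0;
   if( p[ 0 ] < 0x80 )
   {
      *cp = p[ 0 ];
      return 1;
   }
   if( ( p[ 0 ] & 0xE0 ) == 0xC0 && ( p[ 1 ] & 0xC0 ) == 0x80 )
   {
      *cp = ( ( p[ 0 ] & 0x1Fu ) << 6 ) | ( p[ 1 ] & 0x3Fu );
      return 2;
   }
   if( ( p[ 0 ] & 0xF0 ) == 0xE0 && ( p[ 1 ] & 0xC0 ) == 0x80 &&
       ( p[ 2 ] & 0xC0 ) == 0x80 )
   {
      *cp = ( ( p[ 0 ] & 0x0Fu ) << 12 ) | ( ( p[ 1 ] & 0x3Fu ) << 6 ) |
            ( p[ 2 ] & 0x3Fu );
      return 3;
   }
   *cp = p[ 0 ];   /* fall back to the raw byte */
   return 1;
}

/* Counts the steps a character-wise loop would take over s. */
size_t utf8_iterations( const char * s )
{
   size_t steps = 0, n;
   unsigned cp;

   while( ( n = utf8_next( s, &cp ) ) != 0 )
   {
      s += n;
      ++steps;
   }
   return steps;
}
```

For the 4-byte string "aűb" a byte-wise loop takes 4 steps, while this character-wise loop takes 3, which is the kind of behavior change the backward-compatibility concern refers to.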

best regards,
Przemek

Przemysław Czerpak

Apr 23, 2012, 12:01:06 PM
to harbou...@googlegroups.com
On Mon, 23 Apr 2012, Massimo Belgrano wrote:

Hi Massimo,

> Is it possible to have a common approach for Harbour, ADS, and any upcoming
> RDD that adds Unicode, like OTC Mediator?
> Sybase Advantage 10 (contrib\rddads.lib) includes three new field types:
> nChar, nVarChar and nMemo.
> These field types will be able to store Unicode characters.
> http://blog.advantageevangelist.com/2010/06/ads-10-tip-4-unicode-support.html

I added support for these fields to ADS* RDDs:

2010-10-09 19:07 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/include/hbapirdd.h
+ added new field flag: HB_FF_UNICODE
* harbour/contrib/rddads/ads1.c
+ added support for new ADS 10.0 UNICODE fields: NChar, NVarChar, NMemo
They are supported in all ADS* RDDs.

and also for DBF* RDDs:

2010-10-13 13:21 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/src/rdd/dbf1.c
* harbour/src/rdd/dbffpt/dbffpt1.c
+ added support for UNICODE fields compatible with the one used
by ADS

so this was done a long time ago.

Now I'm talking only about adding an option to control field flags
in dbCreate(). To keep the current DBSTRUCT() table dimensions, I plan
to define that the ":" character used in a field type starts the field flags.
It means that:
   { "NAME",      "C:U", 20, 0 }
will mean that field NAME has the UNICODE flag and
   { "SIGNATURE", "C:B", 20, 0 }
means that field SIGNATURE has the BINARY flag, etc.
It will allow using some extensions which have existed in native DBF*
RDDs for a long time and also adding some new ones, e.g. we can define
that "Z" means the COMPRESS flag so:
   { "DATA", "M:Z", 4, 0 }
means a memo field with a compressed body.
Of course it will be possible to mix different flags:
   { "DATA", "M:UZ", 4, 0 }
I have to use a separator character for backward compatibility with
existing RDDs which use multi-letter field type descriptions, e.g.
ADS* RDDs.
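
A rough C sketch of how such a "TYPE:FLAGS" description could be split. The flag letters U, B and Z come from the proposal above; the function name and the bit values are invented for this example and are not Harbour's HB_FF_* constants:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag bits chosen for this sketch only; not Harbour's HB_FF_* values. */
#define FLD_UNICODE   0x01   /* "U" */
#define FLD_BINARY    0x02   /* "B" */
#define FLD_COMPRESS  0x04   /* "Z" */

/* Splits a type description such as "M:UZ" at the first ':'.
 * Returns the length of the type part (1 for "M:UZ") and stores the
 * combined flag bits in *flags. Unknown flag letters are ignored. */
size_t parse_field_type( const char * desc, unsigned * flags )
{
   const char * sep;
   const char * p;

   assert( desc != NULL && flags != NULL );
   *flags = 0;
   sep = strchr( desc, ':' );
   if( sep == NULL )
      return strlen( desc );

   for( p = sep + 1; *p != '\0'; ++p )
   {
      switch( *p )
      {
         case 'U': *flags |= FLD_UNICODE;  break;
         case 'B': *flags |= FLD_BINARY;   break;
         case 'Z': *flags |= FLD_COMPRESS; break;
      }
   }
   return ( size_t ) ( sep - desc );
}
```

Because everything before the separator is treated as the type, multi-letter type names pass through unchanged, which is the backward-compatibility point made above.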

best regards,
Przemek

vszakats

Apr 23, 2012, 12:37:17 PM
to harbou...@googlegroups.com
 
Now this switch disables optimization for a few function calls with
literal arguments, e.g. LEN( "ąćęłńóśźż" ), which should give 9 for
UTF8EX and 18 for byte-oriented CPs.
These are exactly:
   AT( <cLiteralString1>, <cLiteralString2> ) -> <nPos>
   LEN( <cLiteralString> ) -> <nLen>
   ASC( <cLiteralString> ) -> <nVal> // if the 1st byte in the string is
                                     // greater than 127
   CHR( <nVal> ) -> <cLiteralString> // if <nVal> is greater than 127

As you can see, it's not too wide an area, and in most cases it's good
to look at such code during conversion to UTF8.

Thanks, it looks far less bad, and I probably don't use
any of the above. It's not very easy to tell; it will need further
analysis. Anyhow, the point is that -ku can be used without
much cost.
 
Probably we will have to introduce a compiler switch to control
string encoding at compile time. It should help in a few things:
   1. enable optimizations like above
   2. automatic translation of constant values in source code
   3. interaction with OS API and filename translations used
      in #include ... and similar compiler/PP directives.
We can do it in the compiler, but it means that we would also have to
integrate the CP-oriented Harbour RTL code with the compiler.
We can also reach this effect much more easily inside HBMK2 using
integrated compiler code, because in such a case the compiler inherits
the HVM from HBMK2.
I.e. point 3 above can be implemented even now, only inside
HBMK2. It's enough to parse switches for -ku:<cpname> and call:
   cSaveCP = hb_cdpSelect( <cpname> )
before HB_COMPILER() and then restore HBMK2 CP with
   hb_cdpSelect( cSaveCP )

It's a good idea.

It has one disadvantage, it's not easy to add a #pragma that 
can control encoding on a per file basis. Unless we go the route 
to let hbmk2 peek into the source and look for certain things in 
it, but it won't be ideal for performance (and various other reasons), 
and f.e. current multi-file compilation would have to be disabled.

I'll check what it takes to implement this in hbmk2 to give it a 
first shot, with the option that we may move this option to the 
low-level compiler at a later stage.
 
Some time ago I suggested adding compile-time optimizations
for some functions with literal parameters which can be executed
to calculate the results, e.g.:
   HB_CRC32( <cLiteralString> ) -> <nVal>
If we add such an optimization, then with the above user codepage setting
in HBMK2 we also address, as a side effect, the problem of disabled
optimization for the above functions - they will be optimized by our code.

I cannot see the precise relation to above issue, but it seems 
clearly a cool feature. And combining this sort of trick with 
codepage support, I reckon it may even be possible to add support 
for an encoding #pragma in some ways. (with a callback?)

In the longer run this leads to a minimal harbour compiler used 
solely for the purpose of building Harbour itself, and hbmk2 which 
will basically become _the_ compiler visible for the outside world.
This is in sync with my past notion to eventually drop the raw 
harbour executable from the distribution (with the option to access 
it via hbmk2, which is already implemented).

If communication between hbmk2 and compiler engine will be 
closely integrated, we may even add automatic "package" (aka "lib") 
selection right from the source.

Finally, point 2 with an active HVM can be quite easily resolved by a custom
open function like the one used in HBRUN for included files.
It means that we can reach all the above goals inside HBMK2 with some
minor modifications in pure compiler and PP code.
It's the reason why I didn't want to make any deeper modifications
in compiler/PP code with the unicode patch and added only a very simple -ku
switch.

Sounds perfect to me, I just wish I could imagine more
precisely the method for hbmk2 <=> compiler engine
communication you have in mind.
 
The second one is constant string encoding for box drawing characters
and default CP.
Now inside box.ch we have pure CP437 definitions.
Also in RTL code we have a few constant values hardcoded for this CP:
   browse.prg   // constant values: 198, 181, 205
   checkbox.prg // constant values: 251
   dbedit.prg   // constant values: 205, 209, 179
   listbox.prg  // constant values: B_SINGLE, B_DOUBLE, 31
   scrollbr.prg // constant values: 24, 25, 26, 27, 176, 178
   tmenuitm.prg // constant values: MENU_SEPARATOR, 251, 16
   tpopup.prg   // B_SINGLE, SEPARATOR_SINGLE, MENU_SEPARATOR
   browse.prg   // B_DOUBLE_SINGLE
We can ignore ASCII values smaller than 32 because they are not
part of any multibyte encodings.
It's the reason why I left CP437 as the default CP encoding for BOX
characters. Changing the default here strongly interacts with existing
code, so at this stage I prefer that users who want to fully switch
to UTF8 set HB_GTI_BOXCP themselves. Such a choice does not
force modifications in existing code. It may change in the future
if we agree on the final version of the Harbour Unicode API. I would like to
avoid a situation where we force users to update PRG code in the
same area many times.

Fair enough. This issue steps into the area of how to
eventually move Harbour sources themselves to unicode.
Besides box drawing chars, the only place where Harbour
hosts non-ASCII strings is the language modules. I converted
them for Hungarian, and if this seems alright the same can
be done for the rest of them. (Then collations remain, but
that leads too far.)
 
The third thing is bound with FOR EACH c in str / NEXT.
Now it operates on binary data. It's possible to switch
to character indexes but I would like to confirm it.

Very useful information. I've been thinking about it without 
making tests and so far concluded that they _should_ work 
on CP. If they work on bytes, I'll have to look all over the 
source to see where this might cause a problem. [ I did 
and I seldom use it, only at two places, both of them 
expecting chars, plus a 3rd in a disabled low-level function 
where I use an equivalent implemented in C. ]

Such modification is not backward compatible so we should
take the decision quite fast. 

My vote is for CP-sensitive iteration, for consistency.

Maybe with additional keyword to force raw byte 
processing ('FOR EACH c IN s BYTE' or similar).

Viktor

wen....@gmail.com

Apr 23, 2012, 8:24:50 PM
to harbou...@googlegroups.com
>
> Now I'm talking only about adding an option to control field flags
> in dbCreate(). To keep current DBSTRUCT() table dimensions I plan
> to define that ":" character used in field type starts field flags.
> It means that:
> { "NAME", "C:U", 20, 0 }
> will mean that field NAME has UNICODE flag and
> { "SIGNATURE", "C:B", 20, 0 }
> means that field SIGNATURE has BINARY flag, etc.


Why not use the fifth parameter?

{ "NAME", "C", 20, 0, "U" }
{ "SIGNATURE", "C", 20, 0, "B" }

vszakats

Apr 23, 2012, 8:44:49 PM
to harbou...@googlegroups.com
On Tuesday, April 24, 2012 2:24:50 AM UTC+2, WenSheng wrote:
>    { "NAME",      "C:U", 20, 0 }
> will mean that field NAME has UNICODE flag and
>    { "SIGNATURE", "C:B", 20, 0 }
> means that field SIGNATURE has BINARY flag, etc.


Why not use the fifth parameter?

{ "NAME",      "C", 20, 0, "U" }
{ "SIGNATURE", "C", 20, 0, "B" }

You can find the answer to that in detail 
in the archives where we've discussed this 
many years ago.

Viktor
 

Mindaugas Kavaliauskas

Apr 24, 2012, 9:41:08 AM
to harbou...@googlegroups.com
Hi,


I'm trying to understand the amount of changes necessary to use HVM
unicode. Questions:

1) This is not a question, more a suggestion for other people. You need to
request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I expected it
to be included just like hb_cdpselect("UTF8"), and this took me a few hours
of testing.

2) I have a large amount of code that does data file/packet data parsing,
encodes/decodes various structures, etc. CHR(), ASC(), I2BIN(), L2BIN(),
BIN2I() and other functions are very common. How should all this code be
written in the UTF8EX case? Should I use HB_B{LEN,CODE,CHAR}() instead of
LEN(), ASC(), CHR()? What about HB_B*() versions of LEFT(), RIGHT(),
SUBSTR() functions?

3) AFAIU, the following code is buggy because of LEFT()?
   cFile := ""
   DO WHILE .T.
      cBuf := SPACE(BUF_SIZE)
      IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
         cFile += LEFT(cBuf, nI)
      ELSE; EXIT
      ENDIF
   ENDDO


4) STRTRAN() was not patched. I guess it should.

5) If I have text file in win1257 encoding and I want to read it using
hb_memoread(), what function should be used to convert file content from
known encoding to internal HVM encoding (and vice-versa)?

6) HB_BCHAR() vs. CHR() for argument values 0..127?

7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or can it depend on
some codepage selection or compiler switch?

8) What is the purpose of HB_UCODE()? In case UTF8EX is selected, it works
the same as ASC(). If some "non-unicode" codepage is selected it returns
the character code of the current code page, so it is ASC() again. Am I wrong?

9) I'm a little confused among the meanings of "Unicode", UTF16, UCS-2,
etc. AFAIU, UTF-8 can represent characters having numbers up to 31-bit
length. UTF-8 representation can take 1 to 6 bytes. What about UTF8EX?
What character range is supported?
E.g., I see HB_UCHAR() uses:
   ( HB_WCHAR ) hb_parni( 1 )
so character codes are in the range 0 to 65535?
What is the expected maximum return value of HB_BLEN(HB_UCHAR(nChar))?
What character ranges are supported by functions hb_cdp*U16()?
http://en.wikipedia.org/wiki/UTF-16 says that "UTF-16 is used for text
in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so it can
represent characters up to U+10FFFF (in some cases 1 character occupies
two 2-byte wide characters, i.e., 4 bytes). How do Harbour internals work
in these cases?

10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can see in
the source code of compiler, -ku just disables some optimisations, but
it does not change string encoding in pcode. So, I understand that .prg
source code is expected to have encoding set by hb_cdpselect().

11) Now we have a situation similar to SET_EXACT. hb_cdpselect()
significantly changes program logic if string functions are used. What
are the rules to write portable code (to make it work on different
codepage settings, including other possible user multibyte codepages)?

12) I'm not sure I understand all the problems with command-line
options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks ugly.
Can we use *W() windows unicode functions to obtain the command line and
convert it to the current codepage in case the user calls hb_progname(), etc? I
guess it should be possible to obtain the current windows codepage
(ANSI/OEM) for the non-unicode *A() API. Can we set this codepage as the default
value for SET_OSCODEPAGE?

13) What is the idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) on Windows? Windows
uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.
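
On question 9: in UTF-16, code points above U+FFFF are stored as a surrogate pair of two 16-bit units, so a single 16-bit wide character can hold only one half of such a character. (Separately, current UTF-8 as defined by RFC 3629 is limited to U+10FFFF and at most 4 bytes per character; the 6-byte, 31-bit form belongs to the older definition.) The C sketch below shows the pair calculation only; it makes no claim about how Harbour internals treat these cases, and the function name is invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Encodes a Unicode code point (0..0x10FFFF, excluding the surrogate
 * range) as UTF-16. Returns the number of 16-bit units written to out:
 * 1 for the Basic Multilingual Plane, 2 (a surrogate pair) above it. */
size_t utf16_encode( unsigned long cp, unsigned short out[ 2 ] )
{
   assert( cp <= 0x10FFFFul );
   assert( cp < 0xD800ul || cp > 0xDFFFul );

   if( cp < 0x10000ul )
   {
      out[ 0 ] = ( unsigned short ) cp;
      return 1;
   }
   cp -= 0x10000ul;
   out[ 0 ] = ( unsigned short ) ( 0xD800u + ( cp >> 10 ) );    /* high */
   out[ 1 ] = ( unsigned short ) ( 0xDC00u + ( cp & 0x3FFu ) ); /* low */
   return 2;
}
```

For instance, U+0171 fits in one unit, while U+1F600 becomes the pair 0xD83D 0xDE00.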


Regards,
Mindaugas

vszakats

Apr 24, 2012, 10:29:08 AM
to harbou...@googlegroups.com

2) I have a large amount of code that does data file/packet data parsing,
encodes/decodes various structures, etc. CHR(), ASC(), I2BIN(), L2BIN(),
BIN2I() and other functions are very common. How should all this code be
written in the UTF8EX case? Should I use HB_B{LEN,CODE,CHAR}() instead of
LEN(), ASC(), CHR()? What about HB_B*() versions of LEFT(), RIGHT(),
SUBSTR() functions?

Yes. As for LEFT(), RIGHT(), SUBSTR() I also miss them, so if HB_B*() 
is not enough, I switch back locally to "EN" CP. The code doesn't look 
very good and I'm not sure of hb_cdpSelect() performance impact, but 
it works.

3) AFAIU, the following code is buggy because of LEFT()?
cFile := ""
DO WHILE .T.
   cBuf := SPACE(BUF_SIZE)
   IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
     cFile += LEFT(cBuf, nI)
   ELSE; EXIT
   ENDIF
ENDDO

Also this:
      // LOOP
      nWritten := FWrite( fhnd, SubStr( cString, nPos + 1 ) )
      nPos += nWritten
      // ENDLOOP

Same when using hb_socket*() functions. F.e. I tried 
to update your UDPDS code, and I couldn't get to the bottom 
of it yet.
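
The underlying mismatch in both loops is that FREAD() and FWRITE() report byte counts, while LEFT() and SUBSTR() would count characters under a character-oriented CP (assuming they do so with UTF8EX). A small C sketch, with a hypothetical function name, shows how far the two units can drift apart:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper, not a Harbour API: returns how many bytes the
 * first nchars UTF-8 characters of s occupy (or the whole string length
 * if it is shorter). Assumes valid UTF-8. */
size_t utf8_prefix_bytes( const char * s, size_t nchars )
{
   size_t bytes = 0;

   assert( s != NULL );
   while( s[ bytes ] != '\0' )
   {
      /* A non-continuation byte starts a new character. */
      if( ( ( unsigned char ) s[ bytes ] & 0xC0 ) != 0x80 )
      {
         if( nchars == 0 )
            break;
         --nchars;
      }
      ++bytes;
   }
   return bytes;
}
```

If 2 bytes were read from a buffer holding "űű", taking the "left 2" in characters covers 4 bytes, i.e. 2 bytes more than were actually read. Byte-oriented variants of the string functions avoid this.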

4) STRTRAN() was not patched. I guess it should.

It's okay as it is.

See explanation in src/rtl/cdpapihb.c
/* none of numeric parameters in STRTRAN() (4-th and 5-th) refers to
 * character position in string so we do not need to create new
 * HB_UTF8STRTRAN() but we can safely use normal STRTRAN() function
 */

FOR EACH is pending. Plus there are lots of standalone 
contrib functions written in C where unicode support should 
be decided and where applicable, implemented. Until then, 
they all work on binary strings.
 

5) If I have text file in win1257 encoding and I want to read it using
hb_memoread(), what function should be used to convert file content from
known encoding to internal HVM encoding (and vice-versa)?

HB_TRANSLATE()
 

7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or can it depend on
some codepage selection or compiler switch?

The former. e"" is always a binary string ATM, because the compiler
is not aware of the source encoding.
 

8) What is the purpose of HB_UCODE()? In case UTF8EX is selected, it works
the same as ASC(). If some "non-unicode" codepage is selected it returns
the character code of the current code page, so it is ASC() again. Am I wrong?

As far as I could find out, HB_UCODE()/HB_UCHAR() functions work 
on UTF16 characters. HB_UPEEK()/HB_UPOKE() also work on UTF16 
chars, but the position is expected as raw byte position. I'm less sure 
of HB_ULEN().

I have to admit to being a little confused about these.
 

9) I'm a little confused among the meanings of "Unicode", UTF16, UCS-2,
etc. AFAIU, UTF-8 can represent characters having numbers up to 31-bit
length. UTF-8 representation can take 1 to 6 bytes. What about UTF8EX?
What character range is supported?
Ex., I see HB_UCHAR() uses:
   ( HB_WCHAR ) hb_parni( 1 )
so character codes are in the range 0 to 65535?

From what I read/tested so far: Yes, 0-0xFFFF.
"UTF8EX" is just a made-up name. I'd prefer it if "UTF8" could
be used for this purpose.
"UTF16LE" CP cannot be used as HVM CP (it can be enabled, 
but many things won't work, which is expected, given how many 
things should be changed for it), but it's useful in HB_TRANSLATE().
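
If the supported range is indeed 0-0xFFFF, standard UTF-8 needs at most 3 bytes per character, which would bound the byte length of a single encoded character. A C sketch of the encoding, written for this note and not taken from Harbour:

```c
#include <assert.h>
#include <stddef.h>

/* Encodes a 16-bit code point as UTF-8 into out (at least 3 bytes) and
 * returns the number of bytes written: 1, 2 or 3. Surrogate values
 * (0xD800..0xDFFF) are not rejected here; they are encoded as-is. */
size_t utf8_encode_u16( unsigned short cp, unsigned char * out )
{
   assert( out != NULL );
   if( cp < 0x80 )
   {
      out[ 0 ] = ( unsigned char ) cp;
      return 1;
   }
   if( cp < 0x800 )
   {
      out[ 0 ] = ( unsigned char ) ( 0xC0 | ( cp >> 6 ) );
      out[ 1 ] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
      return 2;
   }
   out[ 0 ] = ( unsigned char ) ( 0xE0 | ( cp >> 12 ) );
   out[ 1 ] = ( unsigned char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
   out[ 2 ] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
   return 3;
}
```

Values below 0x80 stay single bytes identical to ASCII, which also bears on question 6: for 0..127 a byte-oriented and a character-oriented encoder produce the same output.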
 

10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can see in
the source code of compiler, -ku just disables some optimisations, but
it does not change string encoding in pcode. So, I understand that .prg
source code is expected to have encoding set by hb_cdpselect().

So far I think it only helps in OS conversions that occur while the
compiler is executing. I could not make up a test to confirm this. The
plan is that it will allow compile-time optimizations for strings in
the passed CP, later maybe more.
 

11) Now we have a situation similar to SET_EXACT. hb_cdpselect()
significantly changes program logic if string functions are used. What
are the rules to write portable code (to make it work on different
codepage settings, including other possible user multibyte codepages)?

Quite a wide question. What's most important is to identify
places where you handle strings in HVM CP vs. where you
work on binary strings, and change to HB_BCHAR() and
similar where working with binary ones.

Change all places where you communicate with outside 
world to filter through HB_TRANSLATE(). This can be 
needed with certain APIs, too.
 

12) I'm not sure I understand the whole problems about commandline
options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks ugly.

I think _SET_CODEPAGE is unavoidable to receive PARAMETERs 
in Main() in selected CP, but I agree the _SET_OSCODEPAGE is far 
from ideal, f.e. in my app it depends on user setting, which is not 
available in CLIPINIT stage.
 

Can we use *W() windows unicode functions to obtain command line and
convert it to current codepage in case user calls hb_progname(), etc? I
guess it should be possibility to obtain current windows codepage
(ANSI/OEM) for non-unicode *A() API. Can we set this codepage as default
value for SET_OSCODEPAGE?

*W() can be used, which makes _SET_OSCODEPAGE unnecessary here.
 

13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows? Windows
uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.

In recent years support for WIDE API (-DUNICODE) mode has been added 
to all Windows API interactions; at the same time all such places 
were updated to use the String API to exchange strings between Harbour 
HVM and Windows API. Now the String API will automatically convert 
between UTF8 or else (HVM CP) and Windows CP (UTF16).

The only remaining place I know where WIDE API is not used (in fact 
no API is used, because values come via WinMain()), is the command 
line parsing stuff.

Worth noting that hbmzip, libcurl, libharu, expat, pcre and all other 
3rd party libs have their special unicode support level and requirements, 
and some of these will probably require attention to implement smooth 
interfaces. F.e. HBQT doesn't convert CP in certain functions, libharu 
seems seriously broken in some (yet to be determined) unicode cases.

Viktor

Bacco

Apr 24, 2012, 10:45:09 AM4/24/12
to harbou...@googlegroups.com
I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
approach to Unicode, and not changing the default and reliable
functions. Current HB_B functions make updating old software that
relies on binary strings for internal purposes to use unicode
interfaces a very risky task. Besides, the Unicode concept is a
harbour thing, not cl*pper one, so HB_U to the new functions seems
more logical to me.

vszakats

Apr 24, 2012, 11:05:30 AM4/24/12
to harbou...@googlegroups.com
On Tuesday, April 24, 2012 4:45:09 PM UTC+2, Bacco wrote:
I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
approach to Unicode, and not changing the default and reliable
functions. Current HB_B functions make updating old software that
relies on binary strings for internal purposes to use unicode
interfaces a very risky task. Besides, the Unicode concept is a
harbour thing, not cl*pper one, so HB_U to the new functions seems
more logical to me.

If doing it as you propose, much more app changes would be 
required and much more app changes would mean much more 
potential problems. (not to mention 3rd party code)

I find the current direction quite fine, because I only had to touch 
code which was already an exception in some respect. Such 
code is more complicated but much less in quantity than the 
remaining 98% of my application. It's also a very positive 
change that exceptions will now be much easier to identify, 
due to the special API they must use. IOW it's a future-proof change.

[ I'd be happier with a full set of HB_B*() functions, though. ]

Also, pure string operations are just one part of the problem; 
what's also important is the automatic CP conversion on component 
boundaries, which would be just impossible with your proposal. 
It would also move UTF8 into a different "league" than those 
other codepages that the HVM has long supported, thus 
creating an exception, one that would never disappear.

With the current method, once you fix your code for binary strings, 
fix your C code (if any) and make the unicode transition, you 
seldom have to deal with this problem for the remaining lifetime 
of your app.

BTW, I don't see how this is related to reliability. They work 
just as reliably as before (especially considering that the whole 
patch is only a few days old), only in any CP you selected, now 
including UTF8.

Viktor

Przemysław Czerpak

Apr 24, 2012, 11:08:06 AM4/24/12
to harbou...@googlegroups.com
On Tue, 24 Apr 2012, Mindaugas Kavaliauskas wrote:

Hi,

> I'm trying to understand the amount of changes necessary to use HVM
> unicode. Questions:
> 1) this not a question, more a suggestion for other people. You need
> to request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I
> expected it is included just like hb_cdpselect("UTF8") and this took
> me a few hours of testing.

The conversion tables are huge, so I didn't make it a default
part of HVM.

> 2) I have a large amount of code that do data file/packet data
> parsing, encodes/decodes various structures, etc. CHR(), ASC(),
> I2BIN(), L2BIN, BIN2I() and other functions are very common. How all
> this code should be written in UTF8EX case? Should I use
> HB_B{LEN,CODE,CHAR}() instead of LEN(), ASC(), CHR()?

HB_B*() functions always operate on bytes. It doesn't matter
what CP you use.

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?

I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so they're not strictly necessary; anyhow, I can add them too if you want.

> 3) AFAIU, the following code is buggy because of LEFT()?
> cFile := ""
> DO WHILE .T.
> cBuf := SPACE(BUF_SIZE)
> IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
> cFile += LEFT(cBuf, nI)
> ELSE; EXIT
> ENDIF
> ENDDO

Yes, it's not portable.
In such a context it is necessary to use HB_BSUBSTR()/HB_BLEFT(), i.e.:
cFile += HB_BSUBSTR(cBuf, 1, nI)
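
To make the difference concrete, here is a minimal C sketch (not
Harbour code) of why a byte count returned by a read call cannot be
fed to a character-indexed LEFT() when the buffer holds UTF-8. The
helper names utf8_char_count() and utf8_prefix_bytes() are
hypothetical and only illustrate the idea; they assume well-formed
UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Count UTF-8 characters in the first nBytes of buf: every byte that
   is not a continuation byte (10xxxxxx) starts a new character. */
size_t utf8_char_count( const char * buf, size_t nBytes )
{
   size_t n = 0, i;

   assert( buf != NULL );
   for( i = 0; i < nBytes; ++i )
   {
      if( ( ( unsigned char ) buf[ i ] & 0xC0 ) != 0x80 )
         ++n;
   }
   return n;
}

/* Byte length of the first nChars UTF-8 characters of buf (limited to
   nBytes). A character-indexed LEFT() has to compute something like
   this, so its result can be longer than nChars bytes. */
size_t utf8_prefix_bytes( const char * buf, size_t nBytes, size_t nChars )
{
   size_t i = 0;

   assert( buf != NULL );
   while( i < nBytes && nChars > 0 )
   {
      ++i;  /* lead byte of the current character */
      while( i < nBytes && ( ( unsigned char ) buf[ i ] & 0xC0 ) == 0x80 )
         ++i;  /* skip its continuation bytes */
      --nChars;
   }
   return i;
}
```

With the 4-byte buffer C4 84 62 63 (the characters Ą, b, c), a read
that reports 3 bytes should keep exactly 3 bytes, while "the first 3
characters" spans all 4 bytes. That is the mismatch a byte-oriented
function avoids.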

> 4) STRTRAN() was not patched. I guess it should.

For valid and normalized UTF8 strings it's not necessary.

> 5) If I have text file in win1257 encoding and I want to read it
> using hb_memoread(), what function should be used to convert file
> content from known encoding to internal HVM encoding (and
> vice-versa)?

hb_cdpTranslate( <cText>, <cCpIN>, <cCpOUT> ) -> <cDest>
if <cCpIN> or <cCpOUT> is missing then HVM CP is used.
Limitations:
it operates on Harbour CPs, not unicode ones, which are more general,
so please remember about REQUEST HB_CODEPAGE_<cpIN>, HB_CODEPAGE_<cpOUT>
In the future we should add support for using unicode CP IDs in this
function.

> 6) HB_BCHAR() vs. CHR() for argument values 0..127?

No difference for UTF8 and DBCS encodings.

> 7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend
> on some codepage selection or compiler switch?

Now it's a binary string but this may change in the future if we
decide to add explicit support for unicode strings, though maybe
in such a case we should use a different syntax, i.e. u"..."

> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
> works the same as ASC(). If some "non-unicode" codepage is selected
> it returns character code of current code page, so, it is ASC()
> again. Am I wrong?

It's not the same as ASC().
It always takes the first character (not byte) from the given string and
returns its unicode value; this code should illustrate it:

SET( _SET_CODEPAGE, "UTF8EX" )
s := HB_UCHAR( 0x104 ) // Ą
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104

SET( _SET_CODEPAGE, "PLMAZ" )
? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500

So regardless of the CP used, these functions operate on UNICODE values.
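
For readers who want to see the byte-versus-character distinction
outside Harbour, the following C sketch decodes the first UTF-8
character of a buffer into its Unicode value. It is only an
illustration of the concept, not Harbour's implementation:
utf8_first_codepoint() is a hypothetical helper, it handles 1 to 4
byte sequences, and it returns U+FFFD for input it cannot decode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the first UTF-8 character in s (len bytes) and return its
   Unicode code point. Returns 0 for an empty buffer and 0xFFFD for a
   truncated or malformed sequence. */
uint32_t utf8_first_codepoint( const unsigned char * s, size_t len )
{
   uint32_t cp;
   size_t need, i;

   assert( s != NULL );
   if( len == 0 )
      return 0;

   if( s[ 0 ] < 0x80 )
      return s[ 0 ];                       /* plain ASCII, one byte */
   else if( ( s[ 0 ] & 0xE0 ) == 0xC0 )
   {
      cp = s[ 0 ] & 0x1F; need = 1;        /* 110xxxxx: 2-byte sequence */
   }
   else if( ( s[ 0 ] & 0xF0 ) == 0xE0 )
   {
      cp = s[ 0 ] & 0x0F; need = 2;        /* 1110xxxx: 3-byte sequence */
   }
   else if( ( s[ 0 ] & 0xF8 ) == 0xF0 )
   {
      cp = s[ 0 ] & 0x07; need = 3;        /* 11110xxx: 4-byte sequence */
   }
   else
      return 0xFFFD;                       /* stray continuation byte etc. */

   if( need >= len )
      return 0xFFFD;                       /* sequence is truncated */

   for( i = 1; i <= need; ++i )
   {
      if( ( s[ i ] & 0xC0 ) != 0x80 )
         return 0xFFFD;                    /* not a continuation byte */
      cp = ( cp << 6 ) | ( uint32_t ) ( s[ i ] & 0x3F );
   }
   return cp;
}
```

For the two bytes C4 84 the function yields 0x0104, while the first
byte taken on its own is 0xC4. A single-byte codepage maps that byte
to whatever character it assigns to position 0xC4, which is why the
last line of the example above prints a different value.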

> 9) I'm a little confused among the meanings of "Unicode", UTF16,
> UCS-2, etc. AFAIU, UTF-8 can represent characters having numbers up
> to 31-bit length. UTF-8 representation can take 1 to 6 bytes. What
> about UTF8EX? What character range is supported?
> Ex., I see HB_UCHAR() uses:
> ( HB_WCHAR ) hb_parni( 1 )
> so, character code are from range 0 to 65535?
> What is expected maximum return values of HB_BLEN(HB_UCHAR(nChar))?
> What character ranges are supported by functions hb_cdp*U16()?
> http://en.wikipedia.org/wiki/UTF-16 says, that "UTF-16 is used for
> text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so,
> it can represent characters up to U+10FFFF (in some cases 1
> character occupies two 2-byte wide characters, i.e., 4 bytes). How
> Harbour internals works in these cases?

Harbour correctly processes UTF8 strings with characters of up to
31 bits. Anyhow HB_WCHAR is 16 bit in the current implementation so
upper bits are stripped during translations. I haven't added support
for UTF16 encoding and intentionally used U16 in names to not confuse
users. If we decide it's useful then we can redefine HB_WCHAR
as a 32 bit integer.
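
As a rough illustration of what a 16-bit wide character type implies,
here is a small C sketch. It is not taken from Harbour sources: the
type wide16 is a hypothetical stand-in for a 16-bit HB_WCHAR, and
modelling "upper bits are stripped" as a simple mask is an
assumption. utf8_encoded_size() follows the original 1 to 6 byte
UTF-8 scheme that covers 31 bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit wide character type. */
typedef uint16_t wide16;

/* Number of bytes needed to encode code point cp in UTF-8, using the
   original 1..6 byte scheme (values up to 31 bits). */
int utf8_encoded_size( uint32_t cp )
{
   if( cp < 0x80 )
      return 1;
   if( cp < 0x800 )
      return 2;
   if( cp < 0x10000 )
      return 3;
   if( cp < 0x200000 )
      return 4;
   if( cp < 0x4000000 )
      return 5;
   return 6;
}

/* Assumed model of the truncation: keep only the low 16 bits. */
wide16 to_wide16( uint32_t cp )
{
   return ( wide16 ) ( cp & 0xFFFF );
}
```

Under these assumptions any value that fits in 16 bits needs at most
3 bytes in UTF-8, and a code point above U+FFFF such as U+1F600 would
come out of to_wide16() as 0xF600, i.e. a different character.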

> 10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can
> see in the source code of compiler, -ku just disables some
> optimisations, but it does not change string encoding in pcode. So,
> I understand that .prg source code is expected to have encoding set
> by hb_cdpselect().

Exactly.

> 11) Now we have a situation similar to SET_EXACT. hb_cdpselect()
> significantly changes program logic if string functions are used.
> What are the rules to write portable code (to make it work on
> different codepage settings, including other possible user multibyte
> codepages)?

HB_B*() functions for binary operations and HB_U*() functions for
operations on unicode characters. These functions are CP independent.

> 12) I'm not sure I understand the whole problems about commandline
> options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks
> ugly.
> Can we use *W() windows unicode functions to obtain command line and
> convert it to current codepage in case user calls hb_progname(),
> etc? I guess it should be possibility to obtain current windows
> codepage (ANSI/OEM) for non-unicode *A() API. Can we set this
> codepage as default value for SET_OSCODEPAGE?

This is a minor and local problem which can be resolved quite easily
by modifications in cmdarg.c and hbwmain.c

> 13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows?

I do not know.
In Windows _SET_OSCODEPAGE should be set to the ANSI CP. It's used
by code which operates on the ANSI WIN32 API. In Harbour core code
we eliminated the ANSI W32 API so it's rather for 3rd party code
and communication with some libraries which do not have a WCHAR
API.

> Windows uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.

yes it is.

best regards,
Przemek

Przemysław Czerpak

Apr 24, 2012, 11:19:44 AM4/24/12
to harbou...@googlegroups.com
On Tue, 24 Apr 2012, Bacco wrote:

Hi,

> I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
> approach to Unicode, and not changing the default and reliable
> functions. Current HB_B functions make updating old software that
> relies on binary strings for internal purposes to use unicode
> interfaces a very risky task. Besides, the Unicode concept is a
> harbour thing, not cl*pper one, so HB_U to the new functions seems
> more logical to me.

For me the most beautiful thing in the implementation I committed
is the fact that I do not have to agree or disagree with such messages
and discuss them ;-)
It's enough that you use UTF8 instead of UTF8EX to keep
binary indexes as default.
And if you want then you can easily create your own custom UTF8XX
which will use any mixed parts of UTF8 and UTF8EX. That's only
your choice.

best regards,
Przemek

vszakats

Apr 24, 2012, 11:46:45 AM4/24/12
to harbou...@googlegroups.com

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?

I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so it's not strictly necessary anyhow I can add it too if you want.

I'd add a vote for HB_BLEFT() and HB_BRIGHT(). This 
would make code conversion easier for code that already 
uses LEFT() and RIGHT().
 

hb_cdpTranslate( <cText>, <cCpIN>, <cCpOUT> ) -> <cDest>

In the future we should add support for using unicode CP IDs in this
functions.

Would be great.
 

users. If we decide it's useful then we can redefine HB_WCHAR
as 32 bit integer.

Even though I cannot see the practical benefits, 
for some strange reason, this also sounds very cool.
 

> 13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows?

[ misread this question. ]

Viktor

vszakats

Apr 24, 2012, 12:03:05 PM4/24/12
to harbou...@googlegroups.com
Hi Przemek,


On Tuesday, April 24, 2012 5:46:45 PM UTC+2, vszakats wrote:

> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?

I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so it's not strictly necessary anyhow I can add it too if you want.

I'd add a vote for HB_BLEFT() and HB_BRIGHT(). This 
would make code conversion easier for code that already 
uses LEFT() and RIGHT().

I will add some #defines for this. Seems fine and it avoids 
some bloat in RTL.

Thanks for the HB_BSUBSTR() patch, it helps greatly to 
move along.

Viktor

Massimo Belgrano

Apr 24, 2012, 12:11:20 PM4/24/12
to harbou...@googlegroups.com
Can somebody post a little source,
before and after?
--
Massimo Belgrano

Mindaugas Kavaliauskas

Apr 24, 2012, 1:49:41 PM4/24/12
to harbou...@googlegroups.com
Hi,


thank you, Viktor and Przemek for all explanations!

> FOR EACH is pending.

It's hard for me to vote on whether FOR EACH should work on characters
or bytes. In general I avoid using this statement for strings. In my
head, FOR EACH is a kind of optimisation for evaluation using an
integer index and the [] operator. Since string characters are not
accessed using cStr[nPos], I avoid using it on strings. But from
general reasoning I would expect it to work on characters just like
SUBSTR(cStr, nPos, 1) (unprefixed by any HB_B*() or HB_U*()).


>> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
>
> I'll add HB_BSUBSTR().
> LEFT(<str>,<n> ) is the same as SUBSTR(<str>, 1,<n> ) and
> RIGHT(<str>,<n> ) is the same as SUBSTR(<str>, -<n> )
> so it's not strictly necessary anyhow I can add it too if you want.

Viktor:


> I will add some #defines for this. Seem fine and it avoids
> some bloat in RTL.

I find HB_BLEFT(), HB_BRIGHT() useful without PP tricks, or manual
conversion to HB_BSUBSTR(). Code bloat would be minimal in comparison to
all the unicode tables. These are very basic functions just like LEFT()
and RIGHT(), and I have quite a lot of code which communicates with some
devices, and binary protocol parsing is done at Harbour level. I hope
nobody votes for removal of LEFT() and RIGHT() and adding PP tricks to
implement them :)

Even more... I already need HB_BAT() and HB_BRAT(). In many cases I find
AT() doing the job OK, but will it work OK if my binary search needle is
a (possibly malformed) fragment of a UTF8 byte sequence?
In some cases I need the 3rd parameter, so I use HB_AT(). For sure I
should change it to HB_BAT()...


>> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
>> works the same as ASC(). If some "non-unicode" codepage is selected
>> it returns character code of current code page, so, it is ASC()
>> again. Am I wrong?
>
> It's not the same as ASC().
> It always takes first character (not byte) from given string and
> returns it unicode value, this code should illustrate it:
>
> SET( _SET_CODEPAGE, "UTF8EX" )
> s := HB_UCHAR( 0x104 ) // Ą
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104
>
> SET( _SET_CODEPAGE, "PLMAZ" )
> ? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500
>
> So regardless of used CP this functions operates on UNICODE values.

Things became clear only after I added
? hb_strtohex(s) // C484
and looked at http://en.wikipedia.org/wiki/Mazovia_encoding

Though, I still have the same question about HB_ULEN(). If I set UTF8EX,
the return value of LEN() and HB_ULEN() is the same. If I set a single
byte per char codepage, LEN() also returns the same value as HB_ULEN().
Can I have a situation with LEN(cStr) != HB_ULEN(cStr)? (Maybe in some
other custom codepage...)


> HB_B*() functions for binary operations and HB_U*() functions for
> operations on unicode characters. These functions are CP independent.

I'm not sure I understand how HB_USUBSTR() is CP independent if it
depends on hb_vmCDP(). Can you give example with !(SUBSTR(cStr, nPos,
nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?


Regards,
Mindaugas

vszakats

Apr 24, 2012, 2:17:56 PM4/24/12
to harbou...@googlegroups.com

Viktor:
> I will add some #defines for this. Seem fine and it avoids
> some bloat in RTL.

I find HB_BLEFT(), HB_BRIGHT() useful without PP tricks, or manual
conversion to HB_BSUBSTR(). Code bloat would be minimal in comparison to
all unicode tables. This is very basic functions just like LEFT() and
RIGHT(), and I have quite many code which communicates to some devices,
and binary protocol parsing is done at Harbour level. I hope nobody
votes for removal of LEFT() and RIGHT() and adding PP tricks to
implement it :)

Now that I have finished converting most code to use them, they 
turned out to be even more useful than I thought; they allowed me to 
avoid reverting OS CP to EN almost completely (working on one 
remaining complex case), and you're right about the bloat being 
insignificant in general.

Even more... I already need HB_BAT() and HB_BRAT(). In many cases I find
AT() doing the job OK, but will it work OK if my binary search needle is
some substring (possibly malformed) UTF8 byte representation?
In some cases I need the 3rd parameter, so, I use HB_AT(). For sure I
should change it to HB_BAT()...

Nothing against these from my side.

Viktor

Przemysław Czerpak

Apr 24, 2012, 4:23:47 PM4/24/12
to harbou...@googlegroups.com
On Tue, 24 Apr 2012, Mindaugas Kavaliauskas wrote:

Hi,

> Though, I still has the same question about HB_ULEN(). If I set
> UTF8EX, return value of LEN() and HB_ULEN() is the same. If I set
> single byte per char codepage, LEN() also the same value as
> HB_ULEN(). Can I have a situation with LEN(cStr) != HB_ULEN(cStr)?
> (Maybe in some other custom codepage...)
> >HB_B*() functions for binary operations and HB_U*() functions for
> >operations on unicode characters. These functions are CP independent.
> I'm not sure I understand how HB_USUBSTR() is CP independent if it
> depends on hb_vmCDP(). Can you give example with !(SUBSTR(cStr,
> nPos, nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?

In both cases the answer is the same.
It's necessary for multibyte CPs which do not use custom indexes in
standard functions, so we can make people who have preferences like
Bacco's happy.

I'll add HB_[UB]{LEFT,RIGHT}() soon.

best regards,
Przemek

Bacco

Apr 24, 2012, 4:43:55 PM4/24/12
to harbou...@googlegroups.com
Hi, Przemek

> In both cases the answer is the same.
> It's necessary of multibyte CPs which do not use custom indexes in
> standard functions so we can make people which has preferences like
> Bacco happy.

Just as a side note: I have no problem with current implementation,
neither with explicit use of HB_U and HB_B functions, and I have
currently no problem at all with encoding concepts. My comment was
entirely based on the relation of visible problems (display/encoding
errors are easily detectable) vs binary operation errors that common
users are unaware of and may have a hard time locating.

I know this is a huge and important change, and I've been aware
of it since you kindly shared your "todo" list with us, and I think
the overall achievement is very good. I just raised a concern about
very specific details as one additional opinion.

Best regards
Bacco
