On Sat, 21 Apr 2012, Viktor Szakáts wrote:
Hi Viktor,
> Can't get this message through the mailing list, one copy was
> deleted, two don't appear. So here it is in private:
We should probably look at this more closely.
Some messages do not appear on the list, or appear only after a
long delay, and not all messages are delivered to subscribers
(at least I do not receive all of them).
I'm CCing the devel list.
BTW, sorry for the late response. I was out of town.
> Possibly the largest patch to Harbour at least in recent 5 years.
> Thank you very much Przemek. (and OTC for sponsoring)
Thank you.
> For those interested in looking into the whole patch (f.e. to
> update 3rd party code), use this command in Harbour SVN
> sandbox root:
> svn diff -r 17403:17404 > uni.dif
>
> One of the next logical questions: How to enable unicode
> fields in tables? Plus some more, but I'm still digesting the
> changes.
This is an additional extension; you can use simple character fields
for UTF8 strings.
Anyhow, I'll add support for setting field flags in DBCREATE()
this week.
> One issue I've found:
> fs_win_get_drive() in filesys.c has a call to hb_wcntombcpy()
> which needs to be updated to one of the new APIs.
I've seen it, but in fact it's needed only to extract the drive
letter, so it's unimportant.
BTW, I think we should eliminate the conversion to char* and the call
to hb_fsNameSplit() in this function.
{
   TCHAR lpBuffer[ HB_PATH_MAX ];
   int iDrive;

   /* fetch the current directory directly as TCHAR, no char* conversion */
   lpBuffer[ 0 ] = TEXT( '\0' );
   hb_fsSetIOError( GetCurrentDirectory(
                       HB_SIZEOFARRAY( lpBuffer ), lpBuffer ) != 0, 0 );
   /* "X:..." gives drive index 0 for A, 1 for B, ...; otherwise 0 */
   iDrive = HB_TOUPPER( lpBuffer[ 0 ] );
   if( iDrive >= 'A' && iDrive <= 'Z' &&
       lpBuffer[ 1 ] == HB_OS_DRIVE_DELIM_CHR )
      iDrive -= 'A';
   else
      iDrive = 0;
}
should be enough.
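The drive-letter test can be tried outside Windows with a small portable C sketch. This is an illustration only, not Harbour code: drive_index_from_path() is a hypothetical helper, it works on a plain char string instead of the TCHAR buffer filled by GetCurrentDirectory(), and ':' is assumed to stand in for HB_OS_DRIVE_DELIM_CHR.

```c
#include <ctype.h>

/* Hypothetical helper (not part of Harbour): derive the drive index from
 * an already fetched current-directory string. Returns 0 for "A:", 1 for
 * "B:", and so on. Like the snippet it mirrors, it also returns 0 when
 * the string does not start with a drive letter. */
static int drive_index_from_path( const char * path )
{
   int drive = toupper( ( unsigned char ) path[ 0 ] );

   /* ':' is an assumption standing in for HB_OS_DRIVE_DELIM_CHR */
   if( drive >= 'A' && drive <= 'Z' && path[ 1 ] == ':' )
      return drive - 'A';
   return 0;
}
```

Note that drive A: and "no drive letter" both map to 0, exactly as in the proposed function body, so a caller cannot tell them apart from the return value alone.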
I also think that we should add a new API function which returns the
full path with the drive letter, if any. It would nicely simplify
upper level code. We can also create such a function for setting the
current directory, though this operation is not MT safe and MT
programs should not change the current directory.
best regards,
Przemek
Hi,
On 2012.04.23 14:54, Przemysław Czerpak wrote:
>> Possibly the largest patch to Harbour at least in recent 5 years.
>> ...
>> svn diff -r 17403:17404> uni.dif
You are wrong Viktor :)
svn diff -r 9373:9374 > mt.dif
is multi-thread support:
2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
and is 4% larger than uni.dif (1185680 bytes vs. 1138273 bytes) :)
Now this switch disables optimization for a few function calls with
literal arguments, e.g. LEN( "ąćęłńóśźż" ), which should give 9 for
UTF8EX and 18 for byte oriented CPs.
These are exactly:
AT( <cLiteralString1>, <cLiteralString2> ) -> <nPos>
LEN( <cLiteralString> ) -> <nLen>
ASC( <cLiteralString> ) -> <nVal> // if the 1-st byte in the string is
// greater than 127
CHR( <nVal> ) -> <cLiteralString> // if <nVal> is greater than 127
As you can see it's not too wide an area, and in most cases it's good
to look at such code during conversion to UTF8.
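To see why a compile-time LEN() of a literal cannot be folded safely, here is a minimal C sketch that counts the same literal both ways. It is not Harbour code; utf8_char_count() and s_polish are names made up for this illustration, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
 * Assumes the input is valid UTF-8. */
static size_t utf8_char_count( const char * s )
{
   size_t n = 0;

   for( ; *s; ++s )
      if( ( ( unsigned char ) *s & 0xC0 ) != 0x80 )
         ++n;
   return n;
}

/* "ąćęłńóśźż" written as explicit UTF-8 bytes, so the example does not
 * depend on the encoding of this source file */
static const char s_polish[] =
   "\xC4\x85\xC4\x87\xC4\x99\xC5\x82\xC5\x84\xC3\xB3\xC5\x9B\xC5\xBA\xC5\xBC";
```

Here strlen( s_polish ) is the byte length and utf8_char_count( s_polish ) the character length. The two differ, so the compiler cannot pick one without knowing which CP will be active at run time.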
Probably we will have to introduce a compiler switch to control
string encoding at compile time. It should help with a few things:
1. enable optimizations like above
2. automatic translation of constant values in source code
3. interaction with OS API and filename translations used
in #include ... and similar compiler/PP directives.
We can do it in the compiler, but that means we have to integrate
the CP oriented Harbour RTL code with the compiler.
We can also reach this effect much more easily inside HBMK2 using the
integrated compiler code, because in such a case the compiler inherits
the HVM from HBMK2.
E.g. point 3 above can be implemented even now inside HBMK2 alone.
It's enough to parse switches for -ku:<cpname> and call:
cSaveCP = hb_cdpSelect( <cpname> )
before HB_COMPILER() and then restore HBMK2 CP with
hb_cdpSelect( cSaveCP )
Some time ago I suggested adding compile time optimizations
for some functions with literal parameters which can be executed
to calculate the results, e.g.:
HB_CRC32( <cLiteralString> ) -> <nVal>
If we add such an optimization, then with the above user codepage
setting in HBMK2 we also address, as a side effect, the problem of the
disabled optimization for the above functions: they will be optimized
by our code.
Finally, point 2 with an active HVM can be resolved quite easily by a
custom open function like the one used in HBRUN for included files.
It means that we can reach all the above goals inside HBMK2 with some
minor modifications in pure compiler and PP code.
That's the reason why I didn't want to make any deeper modifications
in compiler/PP code with the unicode patch and added only a very
simple -ku switch.
The second one is constant string encoding for box drawing characters
and default CP.
Now inside box.ch we have pure CP437 definitions.
Also in RTL code we have few constant values hardcoded for this CP:
browse.prg // constant values: 198, 181, 205
checkbox.prg // constant values: 251
dbedit.prg // constant values: 205, 209, 179
listbox.prg // constant values: B_SINGLE, B_DOUBLE, 31
scrollbr.prg // constant values: 24, 25, 26, 27, 176, 178
tmenuitm.prg // constant values: MENU_SEPARATOR, 251, 16
tpopup.prg // B_SINGLE, SEPARATOR_SINGLE, MENU_SEPARATOR
browse.prg // B_DOUBLE_SINGLE
We can ignore ASCII values smaller than 32 because they are not
part of any multibyte encodings.
It's the reason why I left CP437 as the default CP encoding for BOX
characters. Changing the default here strongly interacts with existing
code, so at this stage I prefer that users who want to fully switch
to UTF8 set HB_GTI_BOXCP themselves. Such a choice does not
force modifications in existing code. It may change in the future
if we agree on the final version of the Harbour Unicode API. I would
like to avoid a situation where we force users to update PRG code in
the same area many times.
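As an illustration of why these hardcoded values matter, the following C sketch maps a few of the CP437 values listed above to their Unicode code points and encodes them as UTF-8. It is only a sketch under stated assumptions: cp437_to_unicode() and utf8_encode_bmp() are hypothetical names, the table covers only some of the listed values, and nothing here is taken from Harbour's own tables.

```c
#include <stddef.h>

/* Unicode code points for some of the CP437 values listed above.
 * Returns 0 for values this sketch does not cover. */
static unsigned cp437_to_unicode( unsigned char ch )
{
   switch( ch )
   {
      case 176: return 0x2591;   /* light shade */
      case 178: return 0x2593;   /* dark shade */
      case 179: return 0x2502;   /* single vertical line */
      case 181: return 0x2561;   /* vertical single, left double */
      case 198: return 0x255E;   /* vertical single, right double */
      case 205: return 0x2550;   /* double horizontal line */
      case 209: return 0x2564;   /* down single, horizontal double */
      case 251: return 0x221A;   /* square root, used as a check mark */
   }
   return 0;
}

/* Encode a code point below 0x10000 as UTF-8 into out (3 bytes are
 * enough) and return the number of bytes written. */
static size_t utf8_encode_bmp( unsigned cp, unsigned char * out )
{
   if( cp < 0x80 )
   {
      out[ 0 ] = ( unsigned char ) cp;
      return 1;
   }
   if( cp < 0x800 )
   {
      out[ 0 ] = ( unsigned char ) ( 0xC0 | ( cp >> 6 ) );
      out[ 1 ] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
      return 2;
   }
   out[ 0 ] = ( unsigned char ) ( 0xE0 | ( cp >> 12 ) );
   out[ 1 ] = ( unsigned char ) ( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
   out[ 2 ] = ( unsigned char ) ( 0x80 | ( cp & 0x3F ) );
   return 3;
}
```

Each of these box characters is a single byte in CP437 but three bytes in UTF-8, which is why a constant such as 205 cannot simply be reinterpreted when the box CP changes.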
The third thing is bound with FOR EACH c IN str / NEXT.
Now it operates on binary data. It's possible to switch
to character indexes but I would like to confirm it.
Such a modification is not backward compatible, so we should
take the decision quite fast.
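The difference between the two iteration modes can be sketched in C. This is not how the HVM implements FOR EACH; utf8_next() and utf8_iterations() are hypothetical helpers that assume valid UTF-8 with sequences of at most 3 bytes.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence of up to 3 bytes (code points below 0x10000)
 * starting at s[ *pos ], advance *pos past it and return the code point.
 * Assumes valid input; longer sequences are outside this sketch. */
static unsigned utf8_next( const unsigned char * s, size_t * pos )
{
   unsigned char c = s[ ( *pos )++ ];
   unsigned cp;

   if( c < 0x80 )
      return c;
   if( ( c & 0xE0 ) == 0xC0 )
   {
      cp = ( unsigned ) ( c & 0x1F ) << 6;
      cp |= ( unsigned ) ( s[ ( *pos )++ ] & 0x3F );
      return cp;
   }
   cp = ( unsigned ) ( c & 0x0F ) << 12;
   cp |= ( unsigned ) ( s[ ( *pos )++ ] & 0x3F ) << 6;
   cp |= ( unsigned ) ( s[ ( *pos )++ ] & 0x3F );
   return cp;
}

/* Number of passes a character-wise loop makes over len bytes;
 * a byte-wise loop would make exactly len passes. */
static size_t utf8_iterations( const unsigned char * s, size_t len )
{
   size_t pos = 0, n = 0;

   while( pos < len )
   {
      utf8_next( s, &pos );
      ++n;
   }
   return n;
}
```

For the 4-byte string "A", 0xC4, 0x84, "B" a byte-wise loop runs 4 times and sees two meaningless halves of one character, while a character-wise loop runs 3 times. Existing code that relies on either behaviour breaks if the other one is chosen.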
> { "NAME", "C:U", 20, 0 }
> will mean that field NAME has UNICODE flag and
> { "SIGNATURE", "C:B", 20, 0 }
> means that field SIGNATURE has BINARY flag, etc.
Why not use the fifth parameter?
{ "NAME", "C", 20, 0, "U" }
{ "SIGNATURE", "C", 20, 0, "B" }
I'm trying to understand the amount of changes necessary to use HVM
unicode. Questions:
1) this is not a question, more a suggestion for other people. You need to
request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I expected it
is included just like hb_cdpselect("UTF8") and this took me a few hours
of testing.
2) I have a large amount of code that do data file/packet data parsing,
encodes/decodes various structures, etc. CHR(), ASC(), I2BIN(), L2BIN,
BIN2I() and other functions are very common. How all this code should be
written in UTF8EX case? Should I use HB_B{LEN,CODE,CHAR}() instead of
LEN(), ASC(), CHR()? What about HB_B*() versions of LEFT(), RIGHT(),
SUBSTR() functions?
3) AFAIU, the following code is buggy because of LEFT()?
cFile := ""
DO WHILE .T.
cBuf := SPACE(BUF_SIZE)
IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
cFile += LEFT(cBuf, nI)
ELSE; EXIT
ENDIF
ENDDO
4) STRTRAN() was not patched. I guess it should.
5) If I have text file in win1257 encoding and I want to read it using
hb_memoread(), what function should be used to convert file content from
known encoding to internal HVM encoding (and vice-versa)?
6) HB_BCHAR() vs. CHR() for argument values 0..127?
7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend on
some codepage selection or compiler switch?
8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it works
the same as ASC(). If some "non-unicode" codepage is selected it returns
character code of current code page, so, it is ASC() again. Am I wrong?
9) I'm a little confused among the meanings of "Unicode", UTF16, UCS-2,
etc. AFAIU, UTF-8 can represent characters having numbers up to 31-bit
length. UTF-8 representation can take 1 to 6 bytes. What about UTF8EX?
What character range is supported?
Ex., I see HB_UCHAR() uses:
( HB_WCHAR ) hb_parni( 1 )
so, character code are from range 0 to 65535?
What is expected maximum return values of HB_BLEN(HB_UCHAR(nChar))?
What character ranges are supported by functions hb_cdp*U16()?
http://en.wikipedia.org/wiki/UTF-16 says, that "UTF-16 is used for text
in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so, it can
represent characters up to U+10FFFF (in some cases 1 character occupies
two 2-byte wide characters, i.e., 4 bytes). How Harbour internals works
in these cases?
10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can see in
the source code of compiler, -ku just disables some optimisations, but
it does not change string encoding in pcode. So, I understand that .prg
source code is expected to have encoding set by hb_cdpselect().
11) Now we have situation similar to SET_EXACT. hb_cdpselect()
significantly changes program logic if string functions are used. What
are the rules to write portable code (to make it work on different
codepage settings, including other possible user multibyte codepages)?
12) I'm not sure I understand the whole problems about commandline
options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks ugly.
Can we use *W() windows unicode functions to obtain command line and
convert it to current codepage in case user calls hb_progname(), etc? I
guess it should be possibility to obtain current windows codepage
(ANSI/OEM) for non-unicode *A() API. Can we set this codepage as default
value for SET_OSCODEPAGE?
13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows? Windows
uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.
Regards,
Mindaugas
I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
approach to Unicode, and not changing the default and reliable
functions. Current HB_B functions make updating old software that
relies on binary strings for internal purposes to use unicode
interfaces a very risky task. Besides, the Unicode concept is a
harbour thing, not cl*pper one, so HB_U to the new functions seems
more logical to me.
Hi,
> I'm trying to understand the amount of changes necessary to use HVM
> unicode. Questions:
> 1) this is not a question, more a suggestion for other people. You need
> to request HB_CODEPAGE_UTF8EX to use hb_cdpselect("UTF8EX"). I
> expected it is included just like hb_cdpselect("UTF8") and this took
> me a few hours of testing.
The conversion tables are very large, so I didn't make them a default
part of HVM.
> 2) I have a large amount of code that do data file/packet data
> parsing, encodes/decodes various structures, etc. CHR(), ASC(),
> I2BIN(), L2BIN, BIN2I() and other functions are very common. How all
> this code should be written in UTF8EX case? Should I use
> HB_B{LEN,CODE,CHAR}() instead of LEN(), ASC(), CHR()?
HB_B*() functions always operate on bytes. It doesn't matter
what CP you use.
> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
I'll add HB_BSUBSTR().
LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
so they are not strictly necessary; anyhow, I can add them too if you want.
> 3) AFAIU, the following code is buggy because of LEFT()?
> cFile := ""
> DO WHILE .T.
> cBuf := SPACE(BUF_SIZE)
> IF (nI := FREAD(hF, @cBuf, BUF_SIZE)) > 0
> cFile += LEFT(cBuf, nI)
> ELSE; EXIT
> ENDIF
> ENDDO
Yes, it's not portable.
In such a context it's necessary to use HB_BSUBSTR()/HB_BLEFT(), e.g.:
cFile += HB_BSUBSTR(cBuf, 1, nI)
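The failure mode can be modelled in a few lines of C. This is a sketch, not Harbour's implementation: utf8_left_bytes() is a hypothetical function that reports how many bytes a character-indexed LEFT() would take from a buffer, assuming the buffer holds valid UTF-8.

```c
#include <stddef.h>

/* Number of bytes covered by the first nchars UTF-8 characters of buf
 * (at most len bytes). This models a character-indexed LEFT(); a
 * byte-indexed one would simply cover min( nchars, len ) bytes. */
static size_t utf8_left_bytes( const unsigned char * buf, size_t len,
                               size_t nchars )
{
   size_t pos = 0;

   while( pos < len && nchars > 0 )
   {
      ++pos;                                   /* lead byte */
      while( pos < len && ( buf[ pos ] & 0xC0 ) == 0x80 )
         ++pos;                                /* continuation bytes */
      --nchars;
   }
   return pos;
}
```

Take an 8-byte buffer of spaces into which a read stored the 4 bytes C4 84 C4 86. The read count is 4 bytes, but taking 4 characters covers 6 bytes, so two padding spaces would be appended to the file content. With pure ASCII data both counts agree, which is why such code appears to work until non-ASCII data shows up.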
> 4) STRTRAN() was not patched. I guess it should.
For valid and normalized UTF8 strings it's not necessary.
> 5) If I have text file in win1257 encoding and I want to read it
> using hb_memoread(), what function should be used to convert file
> content from known encoding to internal HVM encoding (and
> vice-versa)?
hb_cdpTranslate( <cText>, <cCpIN>, <cCpOUT> ) -> <cDest>
If <cCpIN> or <cCpOUT> is missing then the HVM CP is used.
Limitations:
it operates on Harbour CPs, not on unicode ones which are more general,
so please remember about REQUEST HB_CODEPAGE_<cpIN>, HB_CODEPAGE_<cpOUT>
In the future we should add support for using unicode CP IDs in this
function.
> 6) HB_BCHAR() vs. CHR() for argument values 0..127?
No difference for UTF8 and DBCS encodings.
> 7) Is e"\xEE\x02" == HB_BCHAR(0xEE) + HB_BCHAR(2)? Or it can depend
> on some codepage selection or compiler switch?
Now it's a binary string, but this may change in the future if we
decide to add explicit support for unicode strings, though maybe
in such a case we should use a different syntax, e.g. u"..."
> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
> works the same as ASC(). If some "non-unicode" codepage is selected
> it returns character code of current code page, so, it is ASC()
> again. Am I wrong?
It's not the same as ASC().
It always takes the first character (not byte) from the given string and
returns its unicode value. This code should illustrate it:
SET( _SET_CODEPAGE, "UTF8EX" )
s := HB_UCHAR( 0x104 ) // Ą
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104
SET( _SET_CODEPAGE, "PLMAZ" )
? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500
So regardless of the CP used, this function operates on UNICODE values.
> 9) I'm a little confused among the meanings of "Unicode", UTF16,
> UCS-2, etc. AFAIU, UTF-8 can represent characters having numbers up
> to 31-bit length. UTF-8 representation can take 1 to 6 bytes. What
> about UTF8EX? What character range is supported?
> Ex., I see HB_UCHAR() uses:
> ( HB_WCHAR ) hb_parni( 1 )
> so, character code are from range 0 to 65535?
> What is expected maximum return values of HB_BLEN(HB_UCHAR(nChar))?
> What character ranges are supported by functions hb_cdp*U16()?
> http://en.wikipedia.org/wiki/UTF-16 says, that "UTF-16 is used for
> text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE", so,
> it can represent characters up to U+10FFFF (in some cases 1
> character occupies two 2-byte wide characters, i.e., 4 bytes). How
> Harbour internals works in these cases?
Harbour correctly processes UTF8 strings with characters of up to 31 bits.
Anyhow, HB_WCHAR is 16 bit in the current implementation, so upper bits
are stripped during translations. I haven't added support for
UTF16 encoding and intentionally used U16 in names so as not to confuse
users. If we decide it's useful then we can redefine HB_WCHAR
as a 32 bit integer.
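Both limits can be illustrated with a short C sketch. It is a model only: utf8_seq_len() follows the original 1 to 6 byte UTF-8 scheme mentioned in the question (current standards restrict UTF-8 to 4 bytes and U+10FFFF), and to_wchar16() is a hypothetical stand-in for storing a code point in a 16-bit wide character, not Harbour's actual conversion code.

```c
#include <stdint.h>

/* Bytes needed for a code point in the original UTF-8 scheme, which
 * covers values of up to 31 bits with 1 to 6 bytes. */
static int utf8_seq_len( uint32_t cp )
{
   if( cp < 0x80 )
      return 1;
   if( cp < 0x800 )
      return 2;
   if( cp < 0x10000 )
      return 3;
   if( cp < 0x200000 )
      return 4;
   if( cp < 0x4000000 )
      return 5;
   return 6;
}

/* Model of storing a code point in a 16-bit wide character:
 * everything above bit 15 is dropped. */
static uint16_t to_wchar16( uint32_t cp )
{
   return ( uint16_t ) ( cp & 0xFFFF );
}
```

Under this model any code point that fits in 16 bits needs at most 3 UTF-8 bytes, while a code point such as 0x1F600 loses its upper bits and comes out as 0xF600.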
> 10) What is the meaning of hbmk2 -ku:<cp> switch? As far as I can
> see in the source code of compiler, -ku just disables some
> optimisations, but it does not change string encoding in pcode. So,
> I understand that .prg source code is expected to have encoding set
> by hb_cdpselect().
Exactly.
> 11) Now we have situation similar to SET_EXACT. hb_cdpselect()
> significantly changes program logic if string functions are used.
> What are the rules to write portable code (to make it work on
> different codepage settings, including other possible user multibyte
> codepages)?
HB_B*() functions for binary operations and HB_U*() functions for
operations on unicode characters. These functions are CP independent.
> 12) I'm not sure I understand the whole problems about commandline
> options, but CLIPINIT() with SET_CODEPAGE, SET_OSCODEPAGE looks
> ugly.
> Can we use *W() windows unicode functions to obtain command line and
> convert it to current codepage in case user calls hb_progname(),
> etc? I guess it should be possibility to obtain current windows
> codepage (ANSI/OEM) for non-unicode *A() API. Can we set this
> codepage as default value for SET_OSCODEPAGE?
This is a minor and local problem which can be resolved quite easily
by modifications in cmdarg.c and hbwmain.c
> 13) What is idea of SET( _SET_OSCODEPAGE, "UTF8EX" ) in windows?
I do not know.
In Windows _SET_OSCODEPAGE should be set to the ANSI CP. It's used
by code which operates on the ANSI WIN32 API. In Harbour core code
we eliminated the ANSI WIN32 API, so it's rather for 3-rd party code
and for communication with libraries which do not have a WCHAR
API.
> Windows uses ANSI or UTF-16/UCS-2 for its API, but not UTF-8.
yes it is.
best regards,
Przemek
Hi,
> I really believe that HB_ULEFT( ) HB_ULEN( ) would be the right
> approach to Unicode, and not changing the default and reliable
> functions. Current HB_B functions make updating old software that
> relies on binary strings for internal purposes to use unicode
> interfaces a very risky task. Besides, the Unicode concept is a
> harbour thing, not cl*pper one, so HB_U to the new functions seems
> more logical to me.
For me the most beautiful thing in the implementation I committed
is the fact that I do not have to agree or disagree with such messages
and discuss it ;-)
It's enough that you use UTF8 instead of UTF8EX to keep
binary indexes as the default.
And if you want, you can easily create your own custom UTF8XX
which will use any mix of parts of UTF8 and UTF8EX. That's only
your choice.
best regards,
Przemek
> > What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
> I'll add HB_BSUBSTR().
> LEFT( <str>, <n> ) is the same as SUBSTR( <str>, 1, <n> ) and
> RIGHT( <str>, <n> ) is the same as SUBSTR( <str>, -<n> )
> so it's not strictly necessary anyhow I can add it too if you want.
I'd add a vote for HB_BLEFT() and HB_BRIGHT(). This would make code
conversion easier for code that already uses LEFT() and RIGHT().
Thank you, Viktor and Przemek, for all the explanations!
> FOR EACH is pending.
It's hard for me to vote on whether FOR EACH should work on characters or
bytes. In general I avoid using this statement for strings. In my head,
FOR EACH is a kind of optimisation for evaluation using an integer index
and the [] operator. Since string characters are not accessed using
cStr[nPos], I avoid using it on strings. But from general reasoning I
would expect it to work on characters just like SUBSTR(cStr, nPos, 1)
(unprefixed by any HB_B*() or HB_U*()).
>> What about HB_B*() versions of LEFT(), RIGHT(), SUBSTR() functions?
>
> I'll add HB_BSUBSTR().
> LEFT(<str>,<n> ) is the same as SUBSTR(<str>, 1,<n> ) and
> RIGHT(<str>,<n> ) is the same as SUBSTR(<str>, -<n> )
> so it's not strictly necessary anyhow I can add it too if you want.
Viktor:
> I will add some #defines for this. Seem fine and it avoids
> some bloat in RTL.
I find HB_BLEFT(), HB_BRIGHT() useful without PP tricks or manual
conversion to HB_BSUBSTR(). Code bloat would be minimal in comparison to
all the unicode tables. These are very basic functions just like LEFT()
and RIGHT(), and I have quite a lot of code which communicates with some
devices, where binary protocol parsing is done at Harbour level. I hope
nobody votes for removal of LEFT() and RIGHT() and adding PP tricks to
implement them :)
Even more... I already need HB_BAT() and HB_BRAT(). In many cases I find
AT() doing the job OK, but will it work OK if my binary search needle is
some substring (possibly malformed) of a UTF8 byte representation?
In some cases I need the 3rd parameter, so I use HB_AT(). For sure I
should change it to HB_BAT()...
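What a byte-oriented search does can be sketched in C. byte_at() is a hypothetical function written for this illustration. It is not the HB_BAT() implementation and it does not cover the start-position parameter mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* 1-based byte position of needle inside hay, or 0 if it is absent.
 * Both arguments are treated as raw bytes, so the needle may be any
 * fragment, including an incomplete UTF-8 sequence. */
static size_t byte_at( const void * needle, size_t nlen,
                       const void * hay, size_t hlen )
{
   const unsigned char * h = ( const unsigned char * ) hay;
   size_t i;

   if( nlen == 0 || nlen > hlen )
      return 0;
   for( i = 0; i + nlen <= hlen; ++i )
      if( memcmp( h + i, needle, nlen ) == 0 )
         return i + 1;
   return 0;
}
```

In the 3-byte string C4 84 42 the letter "B" is found at byte position 3, although it is the second character, and the lone continuation byte 0x84 is found at position 2. A character-indexed search has no well-defined answer for that second needle, which is the case binary protocol code has to care about.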
>> 8) What is purpose of HB_UCODE()? In case UTF8EX is selected, it
>> works the same as ASC(). If some "non-unicode" codepage is selected
>> it returns character code of current code page, so, it is ASC()
>> again. Am I wrong?
>
> It's not the same as ASC().
> It always takes first character (not byte) from given string and
> returns it unicode value, this code should illustrate it:
>
> SET( _SET_CODEPAGE, "UTF8EX" )
> s := HB_UCHAR( 0x104 ) // Ą
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 0104
>
> SET( _SET_CODEPAGE, "PLMAZ" )
> ? HB_NUMTOHEX( HB_UCODE( HB_UTF8TOSTR( s ) ), 4 ) // 0104
> ? HB_NUMTOHEX( HB_UCODE( s ), 4 ) // 2500
>
> So regardless of used CP this functions operates on UNICODE values.
Things became clear only after I added
? hb_strtohex(s) // C484
and looked at http://en.wikipedia.org/wiki/Mazovia_encoding
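The two readings of the same bytes can be reproduced with a small C sketch. It is an illustration, not Harbour code: utf8_decode2() and mazovia_to_unicode() are hypothetical names, and the Mazovia table here contains only the single entry needed for this example (0xC4, the box drawing horizontal line U+2500).

```c
/* Unicode value of a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) */
static unsigned utf8_decode2( unsigned char b0, unsigned char b1 )
{
   return ( ( unsigned ) ( b0 & 0x1F ) << 6 ) | ( unsigned ) ( b1 & 0x3F );
}

/* Unicode value of one Mazovia byte. ASCII maps to itself; of the upper
 * half only 0xC4 is filled in, everything else returns 0 in this sketch. */
static unsigned mazovia_to_unicode( unsigned char b )
{
   if( b < 0x80 )
      return b;
   if( b == 0xC4 )
      return 0x2500;
   return 0;
}
```

Read as UTF-8, the bytes C4 84 form one character with value 0x0104. Read as Mazovia, the first character is the single byte 0xC4, whose Unicode value is 0x2500. That matches the 0104 and 2500 results in the quoted example.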
Though, I still have the same question about HB_ULEN(). If I set UTF8EX,
the return values of LEN() and HB_ULEN() are the same. If I set a single
byte per char codepage, LEN() also returns the same value as HB_ULEN().
Can I have a situation with LEN(cStr) != HB_ULEN(cStr)? (Maybe in some
other custom codepage...)
> HB_B*() functions for binary operations and HB_U*() functions for
> operations on unicode characters. These functions are CP independent.
I'm not sure I understand how HB_USUBSTR() is CP independent if it
depends on hb_vmCDP(). Can you give an example with !(SUBSTR(cStr, nPos,
nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?
Regards,
Mindaugas
Hi,
> Though, I still has the same question about HB_ULEN(). If I set
> UTF8EX, return value of LEN() and HB_ULEN() is the same. If I set
> single byte per char codepage, LEN() also the same value as
> HB_ULEN(). Can I have a situation with LEN(cStr) != HB_ULEN(cStr)?
> (Maybe in some other custom codepage...)
> >HB_B*() functions for binary operations and HB_U*() functions for
> >operations on unicode characters. These functions are CP independent.
> I'm not sure I understand how HB_USUBSTR() is CP independent if it
> depends on hb_vmCDP(). Can you give example with !(SUBSTR(cStr,
> nPos, nLen) == HB_USUBSTR(cStr,nPos,nLen)) ?
In both cases the answer is the same.
It's necessary for multibyte CPs which do not use custom indexes in
standard functions, so we can make people who have preferences like
Bacco's happy.
I'll add HB_[UB]{LEFT,RIGHT}() soon.
best regards,
Przemek
> In both cases the answer is the same.
> It's necessary of multibyte CPs which do not use custom indexes in
> standard functions so we can make people which has preferences like
> Bacco happy.
Just as a side note: I have no problem with the current implementation,
nor with explicit use of HB_U and HB_B functions, and I currently have
no problem at all with the encoding concepts. My comment was based
entirely on the contrast between visible problems (display/encoding
errors are easily detectable) and binary operation errors that common
users are unaware of and may have a hard time locating.
I know this is a huge and important change, and I've been aware
of it since you kindly shared your "todo" list with us, and I think
the overall achievement is very good. I just raised a concern about
very specific details as one additional opinion.
Best regards
Bacco