Harbour Unicode


Viktor Szakáts

Sep 3, 2010, 9:53:55 AM
to harbou...@googlegroups.com
Hi,

I recently switched my apps to use UTF-8 in all sources and
external files (except databases), which is quite a nice step, but due
to limits in Harbour, apps internally can still only use legacy 8-bit codepages.
This means that in some places I had to resort to some extra steps,
because UTF-8 -> CP and CP -> UTF-8 are not symmetric in the general
sense (the UTF-8 -> CP direction can lose characters), plus I have to
perform the conversion from UTF-8 manually wherever it is required.
[ Nevertheless I hope it will be useful, as I can finally use
non-Windows OSes/tools to edit the sources and files, while this was
very difficult to do with the old 852 CP. ]

To take this to the next level, Harbour would need to have native
UNICODE support in core, so that it could handle UNICODE strings
as-is in all RTL functions and HVM operations.

Any app will have to differentiate between UNICODE strings and
raw byte streams (=binary data or strings using legacy CP),
so I was thinking of a system where string markers could be used
to mark up these string types:

u"Hello, this is a UNICODE string, in UTF-8 encoding"
b"This is raw bytestream"

Plain string markers would denote raw bytestreams by default
to keep Clipper compatibility, and this could be
changed with a Harbour compiler option, so that regular
string markers mean UTF-8-encoded UNICODE strings.
This would offer an easy upgrade path for app developers.

HB_ITEM would have to be extended with new UNICODE
string type, current one would continue to mean raw byte
stream.

From this point all internal operations (functions/operators)
can query the string type and act accordingly.

F.e. by default:
ASC( u"ő" ) would return 337, while
ASC( "ő" ) would return 245 (in case the source file was encoded in
8-bit ISO-8859-2 CP).

In the above example the ASC() implementation would check which
string type has been passed and act accordingly.
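
For comparison, here is a minimal sketch of today's workaround, which
relies on the existing HB_UTF8*() helper family rather than on a new
string type (hb_UTF8Chr() and hb_UTF8Len() are assumed to be part of that
family): the plain string holds raw UTF-8 bytes and only the caller
knows it.

   PROCEDURE Main()
      LOCAL cU := hb_UTF8Chr( 337 )     // "ő" stored as two raw UTF-8 bytes
      ? Len( cU ), hb_UTF8Len( cU )     // bytes vs. characters: 2 vs. 1
      ? Asc( cU )                       // 197, the first raw byte (0xC5), not the code point
      RETURN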

We will have to decide what encoding to use for UNICODE
strings internally. IMO the two meaningful choices here are
UTF-8 and UTF-32, where UTF-8 is slower in any operation
where characters are addressed by index, while UTF-32 is
easy to handle but consumes more memory. [ Probably UTF-8
is still better, everything considered. ]

We have to make a per-function decision about how string
parameters are accepted and handled. Some functions may
act differently on UNICODE and bytestream strings, some may
internally need one or the other and make the required
conversion. Another thing to decide is which string type
each function should return.

Current HB_CDPSELECT() will only influence the handling
of 8-bit (legacy) bytestreams.

Probably all current hb_parc[x]() calls will have to be replaced
with a new API where we allow legacy bytestreams to be passed.
Fortunately this is a problem for only a minority of the
functions.

Also, most interfaces with 3rd party libs will have to be extended
to use the str API (just like hbwin and hbodbc do now), wherever the
required string format differs from Harbour's internal format and we're
dealing with strings instead of bytestreams.

Does this look like a path we can start on? Any comments are
welcome.

Viktor

Bacco

Sep 3, 2010, 3:51:33 PM
to harbou...@googlegroups.com
Hi, Viktor

I suggest you create some #pragma or hbmk switch to deal with raw " "
strings that are not in the pure user CDP, as cl*pper compatibility was
always your/Harbour's priority. Users can select a codepage to deal with
these. I assume you already know that I really believe that finally using
UTF-8 is a high priority, but on the other hand I believe that users
shouldn't be required to use UTF-8 editors. As HB is multi-platform, I
also believe that the SVN shouldn't use any characters over 127 in any
code except language packs.

Just a quick opinion. Rest assured I'll stop by later to give more
feedback; this subject is very important to me.


Regards,
Bacco

Viktor Szakáts

Sep 3, 2010, 4:27:54 PM
to harbou...@googlegroups.com
Hi,

On Friday, September 3, 2010, Bacco <carlo...@gmail.com> wrote:
> I suggest you create some #pragma or hbmk switch to deal with raw " "
> that is not pure user CDP, as cl*pper compat was always your/harbour
> priority. Users can select codepage to deal with these. I assume you

Maybe my message was not clear: that's exactly what I suggested,
controlling plain-quote behavior with a Harbour compiler option and
keeping the current Clipper behavior as the default. Additionally, it
could be done via pragmas, too.

(BTW, Clipper compatibility is still the highest of our/my priorities;
you can count on that if in doubt.)

> already know that I really believe that finally using UTF-8 is a high
> priority, but on other side I believe that users shouldn't be required
> to use UTF-8 editors. As HB is multi-platform, I also believe that the
> SVN shouldn't use any characters over 127 in any code except language
> packs.

While I agree, I don't think there is a way to apply this rule
to translated national text. And if ASCII must be exceeded,
it had better be UTF-8 rather than some messy 8-bit CP, which
can easily be much less portable than UTF-8. So, for now,
ASCII is required for all SVN files except .po ones. These must
also have the svn:mime-type property set to UTF-8.

> Just a quick opinion. Rest assured I'll stop here later to give more
> feedback, this subject is very important to me.

Okay.

Viktor

Bacco

Sep 3, 2010, 4:35:42 PM
to harbou...@googlegroups.com
Hi, Viktor,

I think we are again "discussing because we agree" :) Your reply meets
my expectations exactly, and I have no doubts about your commitment
to compatibility, thanks!


Regards,
Bacco

Przemysław Czerpak

Sep 4, 2010, 6:29:28 AM
to harbou...@googlegroups.com

Hi,

Probably we should add automatic CP translation during compilation
to the compiler, plus support for i"National string in some CP encoding".
Automatic CP translation during compilation can be very useful in
some places, e.g. in your case you could keep your source code in
UTF-8, and during compilation the compiler would translate it to whatever
CP you choose and set as the default in the HVM (HU852).

> HB_ITEM would have to be extended with new UNICODE
> string type, current one would continue to mean raw byte
> stream.

The exact implementation inside HB_ITEM will be the result of some
other choices. If we choose UTF-8 as the base encoding for Unicode
strings, then we can use an additional flag on the current HB_IT_STRING,
but if we choose UTF-16 or UTF-32 then we should probably choose a new
type bit to mark such items.

> From this point all internal operations (functions/operators)
> can query the string type and act accordingly.
>
> F.e. by default:
> ASC( u"ő" ) would return 337, while
> ASC( "ő" ) would return 245 (in case the source file was encoded in
> 8-bit ISO-8859-2 CP).
>
> In above example ASC() implementation would check which
> string type has been passed and act accordingly.
>
> We will have to decide what encoding to use for UNICODE
> strings internally. IMO the two meaningful choices here are
> UTF-8 and UTF-32, where UTF-8 is slower in any operations
> where characters are addressed by index and UTF-32 being
> easy to handle but consumed more memory. [ Pbly UTF-8
> is still better if everything considered. ]

The internal implementation will be invisible to non-core code,
so we can choose whatever gives the best performance on
different platforms. We can discuss this when we collect all the
things which have to be updated, e.g. the current Harbour API functions
using strings as parameters. We should also keep in mind the 3rd-party
code using them and not force some dummy modifications
if they are not necessary.
In the HVM we should also define the result type of the concatenation
operation, i.e.:
"a" + u"a" => u"aa"
or
"a" + u"a" => "aa"
or
RTE
The 1st one seems to be the most reasonable. The 3rd one may
create serious problems for people who have to use 3rd-party code
mixing Unicode and pure CP encodings.

> We have to make a per-function decision about how string
> parameters are accepted and handled. Some function may
> act differently on UNICODE and bytestream strings, some may
> internally need one or the other and make the required
> conversion. Another thing to decide is which function to
> return what string type.
>
> Current HB_CDPSELECT() will only influence the handling
> of 8-bit (legacy) bytestreams.

The Unicode tables and translations between CPs are the less important
part of the current CDP subsystem. The most important is support for
functions and operations like ISDIGIT(), ISALPHA(), ISLOWER(),
ISUPPER(), LOWER(), UPPER() and national collation.
Even if you unblock setting UTF-8 as the base HVM CP, only character
translation will work correctly; the above functionality will be
reduced to pure ASCII characters and binary (byte-weight) sorting.
This is the most important part which has to be implemented before
non-English users can start to use UNICODE in their programs.
BTW I'm surprised that you do not need them directly or indirectly,
e.g. a working '!' or 'A' in picture clauses.

The biggest problem is created by national collation, because different
countries use different rules, e.g. for accented characters (see
HB_CDP_ACSORT_* in hbapicdp.h) or multibyte characters (see 'ch'
in the Czech or Slovak CDPs), and even the UPPER and LOWER translations
are not universal, e.g. see UPPER( 'i' ) in Turkish.
It means that using UNICODE does not resolve all national problems
but forces some unified rules which will not respect the real
national settings in many countries.
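
As a concrete illustration of the Turkish exception (a sketch only;
hb_UTF8Chr() is assumed from the HB_UTF8*() family): the default Unicode
case mapping of "i" is plain "I" (U+0049), while Turkish rules expect the
dotted capital "İ" (U+0130), so no single global upper-case table can
serve both.

   PROCEDURE Main()
      LOCAL cDefault := "I"                    // what a "universal" Unicode rule gives for Upper( "i" )
      LOCAL cTurkish := hb_UTF8Chr( 0x0130 )   // "İ", what Turkish users expect
      ? cDefault == cTurkish                   // .F. (the two rule sets disagree)
      RETURN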

IMO, when working on UNICODE support in the HVM, we should start by
defining such a rule, or even several rules if we cannot create a single
rule that makes all Harbour users in different countries happy. It should
shed some more light on the further implementation.

BTW it would be nice to also extend the current translation system and
introduce fallback Unicode tables, so that during translation we can
choose "similar" glyphs if the original ones are not available.

> Probably all current hb_parc[x]() calls will have to be replaced
> with new API where we allow legacy bytestreams to be passed.
> Fortunately this is a problem only with the smaller part of
> functions.

This is the Clipper-compatible _parc() function, commonly used by
3rd-party code. When adding support for Unicode we should also think about
keeping it working and, if necessary, introduce some hidden translations.
Not all code will be updated.

> Also, most interfaces with 3rd party libs will have to be extended
> to use str API (just like hbwin, hbodbc does now), where required
> string format differs from Harbour internal format and we're dealing
> with strings instead of bytestreams.
>
> Does this look like a path we can start on? Any comments are
> welcome.

For me performance is also very important. Unicode is not
the solution for all national problems and many Harbour users
will not use it, so any implementation should not noticeably
reduce HVM performance in code which operates on pure CP strings
only, and this may force some choices in the internal implementation.

best regards,
Przemek

wen....@gmail.com

Sep 4, 2010, 6:49:03 AM
to harbou...@googlegroups.com
>
> For me the performance is also very important. Unicode is not
> the solution for all national problems and many Harbour users
> will not use it so any implementation should not noticeable
> reduce HVM performance in code which operates on pure CP strings
> only and this may force some choices in internal implementations.
>

'UNICODE' is very important in Asia.
We are looking forward to Harbour supporting a complete 'UNICODE'
solution.

Antonio Maniero

Sep 4, 2010, 8:00:03 AM
to harbou...@googlegroups.com
Perfect, Przemek. This is a good start to discussing a very important improvement. I think you are fundamentally right. I am analyzing the implications now and will post my thoughts soon.

[]'s Maniero

Viktor Szakáts

Sep 4, 2010, 10:17:01 AM
to harbou...@googlegroups.com

This may be useful in some situations, but it's not related
to real UNICODE support. In my case it wouldn't help, as I also
use .po files for translated text, where the lookup is also done in
UTF-8.

Yes.

> the 1-st one seems to be most reasonable. The 3-rd one may
> create serious problem for people who has to 3-rd party code
> using Unicode and pure CP encoding.

I agree, such concatenation should not RTE.

>> We have to make a per-function decision about how string
>> parameters are accepted and handled. Some function may
>> act differently on UNICODE and bytestream strings, some may
>> internally need one or the other and make the required
>> conversion. Another thing to decide is which function to
>> return what string type.
>>
>> Current HB_CDPSELECT() will only influence the handling
>> of 8-bit (legacy) bytestreams.
>
> The unicode tables and translations between CPs is less important
> part of current CDP subsystem. The most important is support for
> functions and operations like ISDIGIT(), ISALPHA(), ISLOWER(),
> ISUPPER(), LOWER(), UPPER() and national collation.
> Evan if you unblock setting UTF8 as base HVM CP then only character

I don't plan to unblock UTF-8 as a base CP in current
Harbour. It has many more drawbacks than gains.

> translation will work correctly but above functionality will be
> reduced to pure ASCII characters and binary (byte weight) sorting.
> This is the most important part which has to be implemented before
> non English users can start to use UNICODE in their programs.
> BTW I'm surprised that you do not need them directly or indirectly
> i.e. working '!' or 'A' in picture clauses.

Those strings remained regular (non-UTF-8) ones. So far those parts
use the legacy CP, and my app uses the legacy CP internally, so they
continue to work as before. IOW, even with such a huge investment of
work I could only switch the "surface" to UNICODE.

Unicode sorting is definitely important, and I guess it is the most
important feature needed to make it possible to enable UTF-8 encoding
for RDDs.

ISDIGIT(), UPPER() and friends should work like ASC(): if they
detect that a Unicode string was passed, they should return a Unicode
string (like the current HB_UTF8*() functionality); if a legacy string
is passed, they should work as in Clipper.

> The biggest problem creates national collating because different
> countries use different rules i.e. for accented characters (see
> HB_CDP_ACSORT_* in hbapicdp.h) or multibyte characters (see 'ch'
> in Czech or Slovak CDPs) but even UPPER and LOWER translations
> are not universal i.e. see UPPER( 'i' ) in Turkish.
> It means that using UNICODE does not resolve all national problems
> but forces using some unified rules which will not respect real
> national settings in many countries.

IMO the first step here is to separate this matter from codepage
handling. So the current "codepage" modules would be split into real,
pure codepage modules (f.e. ISO-8859-2), while the collation and other
national stuff (uppercasing, ASCII conversion) could be moved to
language modules. BTW if language modules are converted to
Unicode, we wouldn't need HU852 and HUWIN anymore, just "HU".
So we'd end up with such modules:

HB_CODEPAGE_WIN1252
HB_CODEPAGE_CP852
HB_CODEPAGE_ISO88592
...

HB_LANGUAGE_PL
HB_LANGUAGE_HU
HB_LANGUAGE_PT_BR
...
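
A hypothetical usage sketch of that split (the module and ID names below
follow the proposal above; they are not the current Harbour API, where
codepage and national data are combined):

   REQUEST HB_CODEPAGE_ISO88592      // pure byte <-> Unicode mapping
   REQUEST HB_LANGUAGE_HU            // collation, casing, other national rules

   PROCEDURE Main()
      hb_cdpSelect( "ISO88592" )
      hb_langSelect( "HU" )
      RETURN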

> IMO working on UNICODE support in HVM we should start defining
> such rule or even rules if we cannot create single rule to make
> all Harbour users in different countries happy. It should give
> us some more light on farther implementation.

Okay.

> BTW It would be nice to also extend current translation system and
> introduce fallback unicode tables so during translation we can chose
> "similar" glyphs if the original ones is not available.
>
>> Probably all current hb_parc[x]() calls will have to be replaced
>> with new API where we allow legacy bytestreams to be passed.
>> Fortunately this is a problem only with the smaller part of
>> functions.
>
> This is Clipper compatible _parc() function commonly used by 3-rd
> party code. Adding support for Unicode we should also thing about
> keeping it working and if necessary introduce some hidden translations.
> Not all code will be updated.

Good point, yes.

>> Also, most interfaces with 3rd party libs will have to be extended
>> to use str API (just like hbwin, hbodbc does now), where required
>> string format differs from Harbour internal format and we're dealing
>> with strings instead of bytestreams.
>>
>> Does this look like a path we can start on? Any comments are
>> welcome.
>
> For me the performance is also very important. Unicode is not
> the solution for all national problems and many Harbour users
> will not use it so any implementation should not noticeable
> reduce HVM performance in code which operates on pure CP strings
> only and this may force some choices in internal implementations.

Indeed. That's one reason I proposed this "dual" method.
With the extra type on top of the current one, this seems feasible.

Viktor

Bacco

Sep 4, 2010, 11:53:31 AM
to harbou...@googlegroups.com
Hi All,


I have been thinking about the whole thing for some time, and have
already mentioned at some points that I have great interest in this
subject. I am making another quick comment now, as I have some external
things to do, but I will bring more detailed ideas later.

What I have started to believe will be the best way is something
pluggable, like "RDDs" for strings. Abstracting this layer, Harbour can
be "bundled" with default ASCII and case-insensitive ASCII handling; the
current real codepages would be one kind of shipped plugin, as we only
change their tables, and UTF-8 would be another, because it involves
another kind of processing. One could write and plug in the fundamentals
for their specific language and collation. It can be both easier and
more powerful.

What Harbour should maintain is just pointers to the collation
fundamentals, that is, customized compare, upper and lower. Also, this
abstraction will allow powerful handling not found in many languages,
even capable of concluding that "ss" = "ß" if the user chooses so.

This will involve a rewrite of some internal calls, but we can recycle
a lot of code already written, which I saw we have when I implemented the
UTF-8 at/rat. Please think seriously about it; I'll detail it later.


Regards,
Bacco

Viktor Szakáts

Sep 4, 2010, 12:12:44 PM
to harbou...@googlegroups.com

Yes, this can be done, though I'm not sure we need a dynamically
replaceable scheme here, so it could stay within the current codepage/language
module frame, with the addition that such modules may also contain code.
Though, I'd personally be much happier with a non-code approach like the
current one, as it tends to be much more stable and reliable, and languages
and codepages exist in finite numbers, so there is no real need to make it
fully generic and "open" it up for 3rd parties, like with RDDs. If there is a
CP or language, it's better for it to be added to the core codebase.

Lightly running these thoughts in the background, I thought it'd be even
more generic and flexible to store CP ID information next to each HVM
string. This way it could cover UTF-8 and all legacy CPs, and even do
intelligent mixing of them.

So, f.e. instead of a global HB_CDPSELECT( "CP850" ) you could do:
HB_STRCDP( "Hello", "CP850" ), marking this string as 850.

IOW, there wouldn't be _two_ fixed core string types, just one, where the CP
would be stored as an additional property. This would also allow keeping
legacy CPs in use and mixing them, f.e. on screen or on any other
interface which supports Unicode. So in such a system, a fairly well
ported Unicode app would use the UTF-8 and RAW types of strings. This
also means that expressions like 's1 + s2' can generate an RTE if the HVM
cannot convert them to a common encoding, f.e. when one of them is
Unicode and the other is raw.

Well, maybe this is too much flexibility, but if it simplifies the
basic structure, it may be good.
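
A hypothetical sketch of this per-string codepage idea (HB_STRCDP() and
the mixed-encoding behavior are part of the proposal above, not existing
Harbour functionality; hb_UTF8Chr() is assumed from the HB_UTF8*() family):

   PROCEDURE Main()
      LOCAL cLegacy := HB_STRCDP( Chr( 245 ), "ISO8859-2" )    // "ő" as a single legacy byte
      LOCAL cUtf8   := HB_STRCDP( hb_UTF8Chr( 337 ), "UTF8" )  // "ő" as a tagged UTF-8 string
      ? cLegacy + cUtf8   // the HVM converts both to a common encoding, or RTEs if it cannot
      RETURN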

Viktor

Przemysław Czerpak

Sep 5, 2010, 5:52:05 AM
to harbou...@googlegroups.com
On Sat, 04 Sep 2010, Viktor Szakáts wrote:

Hi,

> > Probably we should add to compiler automatic CP translation during
> > compilation and support for i"National string in some CP endinging".
> > Automatic CP translation during compilation can be very useful in
> > some places, i.e. in your case you can keep your source code in
> > UTF8 but during compilation compiler will translate it to any CP
> > you will chose and set as default in HVM (HU852).
> This may be useful for some situations, but it's not related
> to real UNICODE support. In my case it wouldn't help as I use
> also .po files for translated text, where lookup is also done in
> UTF-8.

Our I18N module already supports such a situation, so it should not be
a problem.
You should set the correct codepages (for the base and translated text) in your
<pI18N> structure using:
HB_I18N_CODEPAGE( <pI18N>, "UTF8", .T. ) // set base text CP
HB_I18N_CODEPAGE( <pI18N>, "UTF8", .F. ) // set translated text CP
before you generate the .hbl files. This information is stored inside the .hbl
files, so it can later be extracted automatically by our I18N subsystem.
Of course, if you do not want to attach the encoding information to the
.hbl files, then you can set the encoding dynamically after you load the
.po/.hbl files, i.e.:
HB_I18N_CODEPAGE( "UTF8", .T. )
HB_I18N_CODEPAGE( "UTF8", .F. )
but in such a case you will have to use a fixed CP hardcoded in your program
for all .hbl files, and the user (translator) cannot use the one he prefers.

Then the only thing you have to do is set the base (lookup) CP
in the i18n module and force translation for the lookup texts:
HB_I18N_CODEPAGE( HB_CDPSELECT(), .T., .T. )
and all will work as you want. The destination texts will be translated
automatically to the current HVM CP when extracted. If you want, you
can speed up this process and translate all the text at once in the whole
i18n module with:
HB_I18N_CODEPAGE( HB_CDPSELECT(), .F., .T. )
so that later they are only extracted.
This is not related to UNICODE support, but I think I should
clarify such things because this information can be important for
Harbour users and can save them a lot of time.

> Unicode sorting is definitely important and I guess the most important
> feature to make it possible to enable UTF-8 encoding for RDDs.

Yes, though you should know that some index structures like CDX
are strongly optimized for byte-weight sorting. We can change the
weight of each byte without any problems or performance/compression
degradation, but when we introduce multibyte or accented sorting, it
begins to interact with the prefix compression used in CDX, so in some
cases it disables it.
As a workaround we can use some hash function which creates a binary
string that can be byte-weight sorted. This method is used by
MS-Windows dBase-compatible languages which support Unicode sorting.
They simply use the LCMapString*() function with LCMAP_SORTKEY to generate
index keys. It works, but it has a few bad side effects:
1. it's an MS-Windows-only solution
2. indexes cannot be updated by a Harbour application compiled for
other systems
3. for correct behavior of all operations the VM should also respect the
same sorting, using the CompareString*() function
4. the index key size is internally doubled to store the hashed value
5. it is a one-way translation and the hashed value cannot be converted
back to the original one, so ordKeyVal() returns an encrypted string
(until it is hacked to create the key by evaluating the index key
expression)
For a portable solution we will have to create our own hash function,
because we cannot use the system one (WINE uses a different algorithm
than the one in Windows, so we cannot even replicate it in our code).
We can also store keys in raw form without any hashing, accepting a
possible performance/size degradation, or create a different index
algorithm.
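
To illustrate side effect 5 in practice (a sketch only: MySortKey() stands
for a hypothetical one-way sort-key hash such as the LCMAP_SORTKEY output,
and customer.dbf with its NAME field is made up for illustration):

   REQUEST DBFCDX

   PROCEDURE Main()
      RddSetDefault( "DBFCDX" )
      USE customer NEW
      INDEX ON MySortKey( customer->NAME ) TAG name
      ? ordKeyVal()       // returns the hashed key, not the original NAME text
      RETURN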

> ISDIGIT(), UPPER() and friends should work like ASC(), if they
> detect a Unicode string passed they should return Unicode string
> (like current HB_UTF8*() functionality), if legacy is passed they
> should work as Clipper.

Fine, but that's not the problem. The problem is that they have to know
what is a digit, what is alpha, what is the corresponding upper/lower
letter, etc., and for such functionality we will have to create a table
with the above information for all supported Unicode characters. Such a
table creates problems, because in different countries some things
are different, e.g. UPPER( e"i" ) should give a different value than
e"I" when Turkish encoding is set. It means that probably we will
have to create more than one Unicode table, or find a way to merge the
information from the current CDP modules with one global Unicode table.
I also do not like the idea of updating such Unicode table(s) with
all supported characters manually, so it would be nice if we could look
for some already existing definition which we can use to generate
such table(s) automatically.

> > The biggest problem creates national collating because different
> > countries use different rules i.e. for accented characters (see
> > HB_CDP_ACSORT_* in hbapicdp.h) or multibyte characters (see 'ch'
> > in Czech or Slovak CDPs) but even UPPER and LOWER translations
> > are not universal i.e. see UPPER( 'i' ) in Turkish.
> > It means that using UNICODE does not resolve all national problems
> > but forces using some unified rules which will not respect real
> > national settings in many countries.
> IMO here the first step is to separate this matter from codepage
> handling. So current "codepage" modules would be split to real pure
> codepage modules (f.e. ISO-8859-2), while the collation and other
> national (uppercasing, ascii conversion) stuff could be moved to
> language modules. BTW if language modules are converted to
> Unicode, we wouldn't need HU852, HUWIN anymore, just "HU".

I do not know if it's possible for all languages.
It's possible that the encoding also forces some translation or even
the sorting order, i.e. in some CPs we may have glyphs for all letters,
while in some others some of them are represented as multibyte
characters which need special collation rules.

I also do not know if the encoding can force some differences in the
texts. It's possible, and it's the only reason why I haven't
replaced all the msg<lng><cp>.c modules with msg<lng>.c in some strict
encoding which would be translated automatically to the given <cp> at
runtime.
Here we need user feedback telling us whether this is possible for the
different languages. We can start with the languages supported by
us now. For msg*.c we have:
be bg ca cs de el eo es eu fr gl he hr hu id
is it ko lt nl pl pt ro ru sk sl sr tr ua zh

and for cp*.c we have:
bg cs de dk el es fi fr hr hu is it lt nl
no pl pt ro ru sk sl sr sv tr ua

Right now I can see a problem with UA866, because it uses Latin "i" instead
of the missing Cyrillic "i", and such a translation cannot be done without
fallback tables. But maybe there are also other problems. Without
help from real users living in different countries it will be hard
to guess what problems such a unification can cause and what we should
implement as workarounds.
Anyhow, such a unification is a really good move.

best regards,
Przemek

Viktor Szakáts

Sep 5, 2010, 5:19:14 PM
to harbou...@googlegroups.com
Hi Przemek,

I didn't know about these. Storing the CP in the language file is
good, but IMO it should be done differently, ideally by
storing the CP in the .po file itself, so that the whole CP handling is
transparent.

At this point the above method would only save me one
manual HB_UTF8TOSTR() call in my custom string i18n
function, while requiring language-file CP setting
functionality to be added to hbmk2 and new settings to be added to make
files, kept in sync with the source files. Storing the CP in .po files
also raises some complications (f.e. their original source is .prg
files, for which we should also know the proper encoding), and
I cannot give any meaningful solutions to these, so for me it
was easier to simply assume UTF-8 all over the place.
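
Roughly what that manual step looks like today (a sketch; MyI18N() is a
made-up wrapper name, hb_i18n_gettext() stands for the i18n lookup, and
hb_UTF8ToStr() is the call mentioned above): the UTF-8 result of the
.po/.hbl lookup is converted by hand to the app's legacy HVM codepage.

   FUNCTION MyI18N( cText )
      // the lookup happens in UTF-8, the result is down-converted manually
      RETURN hb_UTF8ToStr( hb_i18n_gettext( cText ) )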

Nice. To keep ordKeyVal() support, the original key could also be
stored, additionally and optionally (with a size overhead, of course).

>> ISDIGIT(), UPPER() and friends should work like ASC(), if they
>> detect a Unicode string passed they should return Unicode string
>> (like current HB_UTF8*() functionality), if legacy is passed they
>> should work as Clipper.
>
> Fine but it's not a problem The problem is that they have to know
> what it digit, what is alpha, what is the corresponding upper/lower
> letter, etc. and for such functionality we will have to create table
> with above information for all supported unicode characters. Such
> table creates problem because in different countries some things
> are different, i.e. upper( e"i" ) should give different value then
> e"I" when Turkish encoding is set. It means that probably we will
> have to create more then one unicode table or find a way to merge
> information from current CDP modules with one global unicode table.
> I also do not like the idea of updating such unicode table(s) with
> all supported characters manually so it would be nice if we can look
> for some already existed definition which we can use to generate
> such table(s) automatically.

Language-specific upper/lower/digit logic (also accent stripping,
though for this I seem to remember seeing a generic solution
for Unicode) can be stored in our language files, just like it is
now stored in the "codepage" files (which aren't currently 'pure' codepage
files). Agreed, it would be simpler to pull such information from
a reliable source. I'm sure such a thing exists, even with a license
usable for us. Anyone willing to investigate?

[ BTW, at the moment I roll my own language information with
accent-stripping logic. ]

>> > The biggest problem creates national collating because different
>> > countries use different rules i.e. for accented characters (see
>> > HB_CDP_ACSORT_* in hbapicdp.h) or multibyte characters (see 'ch'
>> > in Czech or Slovak CDPs) but even UPPER and LOWER translations
>> > are not universal i.e. see UPPER( 'i' ) in Turkish.
>> > It means that using UNICODE does not resolve all national problems
>> > but forces using some unified rules which will not respect real
>> > national settings in many countries.
>> IMO here the first step is to separate this matter from codepage
>> handling. So current "codepage" modules would be split to real pure
>> codepage modules (f.e. ISO-8859-2), while the collation and other
>> national (uppercasing, ascii conversion) stuff could be moved to
>> language modules. BTW if language modules are converted to
>> Unicode, we wouldn't need HU852, HUWIN anymore, just "HU".
>
> I do not know if it's possible for all languages.
> It's possible that encoding forces also some translation or even
> sorting order, i.e. in some CPs we may have glyphs for all letters
> but in some others some of them are represented as multibyte
> characters which need special collation rules.

Might be. I've yet to see another language or tool which uses a
Harbour-like mixture of codepage + national information. Maybe
I'm missing something (or they do), but I feel we're off the
mainstream here, so there must be a more common way to
solve these problems in a satisfactory way.

> I also do not know if encoding can force some differences in the
> texts. It's also possible and it's the only one reason why I haven't
> replaced all msg<lng><cp>.c modules by msg<lng>.c in some strict
> encoding which will be translated automatically to given <cp> at
> runtime.

I'm no linguistics expert, but I'd make an educated guess that
in the majority of cases UTF-8 could be safely used as the common
encoding. For the few exceptions we may try to find some special
solution, but generally I think the CP is one thing and the language
rules are another. IOW, I wouldn't design the system based on exceptions
which we don't even know for sure exist.

> Here we need user feedback with information if it's possible for
> different languages. We can start with the languages suported by
> us now. For msg*.c we have:
>   be bg ca cs de el eo es eu fr gl he hr hu id
>   is it ko lt nl pl pt ro ru sk sl sr tr ua zh
>
> and for cp*.c we have:
>   bg cs de dk el es fi fr hr hu is it lt nl
>   no pl pt ro ru sk sl sr sv tr ua
>
> Now I can see problem with UA866 because it uses Latin "i" instead
> of missing Cyrillic "i" and such translation cannot be done without
> fallback tables. But maybe there are also other problems. Without
> help from real users living in different countries it will be hard
> to guess what problems such unification can cause and what we should
> implement as workarounds.

I'd add that besides waiting for users to comment on their
own language, maybe there is a reliable source on the internet
to find these things out. Probably it was solved in the past by
other projects.

[ Found this lately: http://www.i18nguy.com/guidelines.html ]

The only extra thing we need to deal with is strict Clipper compatibility.
F.e. some legacy HU collations (and we know about more) are
messed up, and we should keep compatibility with these. But IMO it
can be done f.e. by adding pseudo language drivers, like
"Hungarian (Clipper)" or "Hungarian (Clipper 852)", where "852"
signals that the module is meant to be compatible with the Clipper
HU852 driver, while it may nevertheless also be used with a Unicode CP.

Viktor

Bacco

Sep 5, 2010, 5:49:20 PM
to harbou...@googlegroups.com
Hi All,

Even keeping everything in the core, I still believe that the code should
be modularized, as it's easier to implement and maintain.
The current implementation mixes some concepts a little bit, and
reinventing the wheel is a little dangerous in the long run.
Whatever logic is applied, we have distinct meanings and principles for
collation and encodings. One must choose the proper encoding and
collation, not simply a "language".

Collation needs to be applied in everything that involves comparison,
upper and lower, and by implementing these functions or tables, it's done.
Where it's called in the code is another, separate decision, depending
on the use desired (one indicator for each database? a separate
indicator for console/GUI? and so on).

Encoding is another matter, one related directly to code. It involves
left, right, substr, at, rat, asc and so on. In this case I also believe
that we can have something more consistent that can be decided by an
indicator, and through this indicator the generic "left" function calls the
correct implementation. This way we can extend functions without
touching the core directly.

About the collation tables in the core: even though we have a finite
number of languages, as Viktor said, the various tables can make a
significant difference in the final app size, so I insist that in some way
we should have only the used ones in the final app. Collations are not
restricted to only 256 characters.

SQLite users can benefit from a standard interface, as SQLite can use
the same "generic" call and that call can choose the appropriate
functions. Also, there is no need to reinvent the wheel.

I believe that the problem is to decide what has to be achieved, not
the coding itself. Code seems to be a natural consequence only.


Regards,
Bacco

Viktor Szakáts

Sep 5, 2010, 6:34:49 PM
to harbou...@googlegroups.com
Hi,

> Even mantaining all in the core, I still believe that the code should
> be modularized, as it's easier to implement and mantain.
> The current implementation mixes a little bit some concepts, and
> reinventing the wheel is a little bit dangerous in the long run.
> Whatever logic applied, we have distinct meaning and principles for
> collation and encodings. One must choose the proper encoding and
> collation, not simply "language".
>
> Collation needs to be applied in all that involve comparision, upper
> and lower, and by implementing these functions or tables, it's done.
> When it's called on the code is another separated decision, depending
> on the use desired (one indicator for each database? separated
> indicator for console/gui? and so on).

To me it seems we're saying quite the same thing here.

"codepage" ~= "encoding" (current rtl/cpage)
"language" ~= "collations" (current rtl/lang)

The problem is that right now the collation and some other
language information is contained in 'codepage' modules,
instead of being present in 'language' modules.

So, to clean up the situation, all CP-related stuff must be
stored in the CP module (which is a CP/UTF-8 conversion table in
our case) and all language-dependent information stored
in language modules, encoded in UTF-8, so that it can be
used with all CP modules.

> Encoding is another, that is related directly to code. Involves left,
> right, substr, at, rat, asc and so on. In this case I also believe
> that we can have something more consistent that can be decided by an
> indicator, and by this indicator the generic "left" function calls the
> correct implementation. This way we can extend functions without
> touching the core directly.

I don't think we need to provide replaceable low-level string
handling functions. UNICODE is the only > 8-bit codepage, so we
only have to decide which UNICODE _encoding_ scheme to use internally
and implement the code for that one. I just see no point in
supporting UTF-8, UTF-16, UTF-32, UTF-7, etc. at the same
time in the core. It's unnecessary complexity and overhead. Actually,
we have the code for UTF-8 already. We also need one set of functions
for 8-bit CPs, which we already have as well.

> About the collation tables in the core: Even we having finite
> languages as Viktor said, the various tables can do a significant
> difference in the final app size, so I insist that in some way we can
> have only the used ones in the final app. Collations are not
> restricted to only 256 characters.

We have had this solution implemented since the early days of
Harbour. We can just continue to use it.

[ Why do I feel that everyone is urged to argue even when we are saying
the exact same thing, or when talking about long-accepted
concepts? ]

> SQLite users can benefit of an standard interface, as SQLite can use
> the same "generic" call and that call can choose the appropriate
> functions. Also, there is no need to reinvent the wheel.

This, I don't understand.

Viktor

Bacco

Sep 6, 2010, 3:29:27 AM
to harbou...@googlegroups.com
Hi, Viktor

> To me it seems we're saying quite the same thing here.
>
> "codepage" ~= "encoding" (current rtl/cpage)
> "language" ~= "collations" (current rtl/lang)

A codepage is a very specific form of encoding. I avoid using the term
codepage for UTF-8, as a codepage is nothing more than a chosen
subset of all available characters, intended to avoid encodings greater
than 1 byte/char (just to clarify my words).

> I don't think we need to provide replaceable low-level string
> handling functions. UNICODE is the only one > 8-bit codepage, so we
> only have to decide which UNICODE _encoding_ scheme to use internally

True collations need to forget the codepage thing and cover the
whole character set, so they can be used in any context (be it
real codepages, be it UTF-8).

> We have this solution implemented since the early days of
> Harbour. We can just continue to use it.
>
> [ Why do I feel that everyone is urged to argue even if we say
> the exact same thing or when talking about long accepted
> concepts. ]

I really agree that the current solution is working, but I imagine
something a little more powerful and easier to extend. I am not thinking
of big changes, just some review of the way the current, well-implemented
functions are called, via a central generic function, and the use of
real and generic collation tables available to all encodings, allowing
one to add a new local language by adding only one more table.

>> SQLite users can benefit of an standard interface, as SQLite can use
>> the same "generic" call and that call can choose the appropriate
>> functions. Also, there is no need to reinvent the wheel.
>
> This, I don't understand.

SQLite allows one to plug in, at the C level, the functions used to
compare strings, using our own collation system. With this in mind, even
the SQLite extension can honor the user-selected collation.
http://www.sqlite.org/c3ref/create_collation.html


Regards,
Bacco

Viktor Szakáts

Sep 6, 2010, 9:24:15 AM
to harbou...@googlegroups.com
>> "codepage" ~= "encoding" (current rtl/cpage)
>> "language" ~= "collations" (current rtl/lang)
>
> Codepage is a very specific form of encoding. I avoid using the term
> codepage for UTF-8, as the codepage is nothing more than a chosen
> subset of all characters available, to avoid encoding greater than 1
> byte/char (just to clarify my words).

Codepage is Unicode, encoding is UTF8.

>> I don't think we need to provide replaceable low-level string
>> handling functions. UNICODE is the only one > 8-bit codepage, so we
>> only have to decide which UNICODE _encoding_ scheme to use internally
>
> True collations need to forget the codepage thing and provide the
> whole character set extension, so it can be used in any context (be it
> in real codepages, be it in utf-8)

I don't understand.

>> [ Why do I feel that everyone is urged to argue even if we say
>> the exact same thing or when talking about long accepted
>> concepts. ]
>
> I really agree that current solution is working, but I imagine
> something a little powerful and easier to extend. I am not thinking in
> big changes, just some review on the way current good implemented
> functions are called, by a central generic function, and the use of
> real and generic collation tables available to all encodings, allowing
> one add a new local language adding one more table only.

I conclude I don't know what you want to achieve
in terms of modularity.

> SQLite allows one to point at c level the functions used to compare
> strings using our own collation sistem. With this in mind, even the
> SQLite extension can honor the user selected collation.
> http://www.sqlite.org/c3ref/create_collation.html

Ah, OK. I still don't think collation modules should provide
_code_. IMO it's a bad decision unless absolutely necessary.

Viktor

Mindaugas Kavaliauskas

Sep 6, 2010, 1:13:31 PM
to harbou...@googlegroups.com
Hi,


> HB_ITEM would have to be extended with new UNICODE
> string type, current one would continue to mean raw byte
> stream.

The Unicode subject is very interesting for me (especially since some
customers want to use Russian), but I've tried to understand what my
expectations of this move are. I still do not have a full picture of what
conversions are required between Unicode and bytestreams for
operators, procedure parameters, etc. Here are a few questions, to make
it clearer for myself (and others):

1) u"a" + "a" = ?
2) VALTYPE(u"a") = ?
3) FWRITE(hFile, u"a", 1): how many bytes are written if u"a" is a
national char?
4) FREAD(hFile, @buffer): what type is the buffer, binary string or Unicode?
5) What is the full list of "core" functions affected by Unicode strings?
LEN(), ASC(), CHR() (?), SUBSTR(), AT(), RAT(), LEFT(), RIGHT(), what else?
6) Can I test (or should this be hidden/internal) whether a string is
binary or Unicode?
7) Does anybody know how Unicode works in other languages? Python? Java?
Ruby? Etc.?


> We will have to decide what encoding to use for UNICODE
> strings internally. IMO the two meaningful choices here are
> UTF-8 and UTF-32, where UTF-8 is slower in any operations
> where characters are addressed by index and UTF-32 being
> easy to handle but consumed more memory. [ Pbly UTF-8
> is still better if everything considered. ]

I think two bytes per character fits better; it allows addressing a
character by index. This is not possible in UTF-8, i.e. if you need to
get the 1000th character of a string, you must scan and parse each
character until you find the 1000th. This is a slow operation. I know we
can have characters beyond 16 bits, but this is a seldom-seen situation.

But this is not a very important question at this moment. We can implement
various versions of the VM internals.
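
A small sketch of the indexing-cost point above (hb_UTF8SubStr() and
hb_UTF8Chr() are assumed from the existing HB_UTF8*() family): byte
positions can be addressed directly, while reaching the Nth character of
a UTF-8 string requires scanning from the start.

   PROCEDURE Main()
      LOCAL cText := Replicate( hb_UTF8Chr( 337 ), 2000 )   // 2000 x "ő", 4000 bytes
      ? SubStr( cText, 999, 2 )           // byte offset: direct access, but may split a character
      ? hb_UTF8SubStr( cText, 1000, 1 )   // character offset: needs a linear scan to find it
      RETURN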


> Probably all current hb_parc[x]() calls will have to be replaced
> with new API where we allow legacy bytestreams to be passed.
> Fortunately this is a problem only with the smaller part of
> functions.

I think a better way is to replace hb_parc() with a new function where a
function accepts Unicode strings, and leave hb_parc() for legacy strings.
We will need to review the code anyway: string indexing, buffer
allocations, etc. would be different for Unicode.


> Also, most interfaces with 3rd party libs will have to be extended
> to use str API (just like hbwin, hbodbc does now), where required
> string format differs from Harbour internal format and we're dealing
> with strings instead of bytestreams.

Can all the native libraries use Unicode? Regex?


On 2010.09.05 12:52, Przemysław Czerpak wrote:
>> Unicode sorting is definitely important and I guess the most important
>> > feature to make it possible to enable UTF-8 encoding for RDDs.
> Yes, though you should know that some index structures like CDX
> are strongly optimized for byte weight sorting. We can change the
> weight of each byte without any problems and performance/compression
> degradation but when we introduce multibyte or accented sorting then
> it begins to interact with prefix compression used in CDX so in some
> cases it disable it.
> As workaround we can use some hash function which will crate binary
> string which can be byte weight sorted. This method is used in
> MS-Windows dbase compatible languages which support unicode sorting.
> They simply use LCMapString*() function with LCMAP_SORTKEY to generate
> index keys. It's working but it has few bad side effects:

I've tried Unicode fields in ADS v10. It allows indexing on Unicode
fields as well as on CHAR_FIELD + UNICODE_FIELD. I did not try to hack the
index format, but the index keys look cryptic inside the .cdx.


Regards,
Mindaugas

Viktor Szakáts

Sep 7, 2010, 1:45:43 PM
to harbou...@googlegroups.com
On Mon, Sep 6, 2010 at 7:13 PM, Mindaugas Kavaliauskas
<dbt...@dbtopas.lt> wrote:
> Hi,
>
>
>> HB_ITEM would have to be extended with new UNICODE
>> string type, current one would continue to mean raw byte
>> stream.
>
> Unicode subject is very interesting for me (especially after some customers
> wants to use russian), but I've tried to understand what are my expectation
> of this move. I still do not have a full picture about that conversion are
> required between unicode and bytestream for operators, procedure parameters,
> etc. Here are a few questions, to make myself (and others) more clear about
> it:
>
> 1) u"a" + "a" = ?
> 2) VALTYPE(u"a") = ?
> 3) FWRITE(hFile, u"a", 1), how many bytes is written if u"a" is national
> char?
> 4) FREAD(hFile, @buffer), that type buffer is binary string or unicode?
> 5) What is full list of "core" functions affected by unicode strings? LEN(),
> ASC(), CHR() (?), SUBSTR(), AT(), RAT(), LEFT(), RIGHT(), what else?
> 6) Can I test (or this should be hidden internal) if string is binary or
> unicode?
> 7) Does anybody know how unicode works in other languages? Python? Java?
> Ruby? Etc?

Interesting points to debate.

One more: CHR() should return what type?

>> We will have to decide what encoding to use for UNICODE
>> strings internally. IMO the two meaningful choices here are
>> UTF-8 and UTF-32, where UTF-8 is slower in any operations
>> where characters are addressed by index and UTF-32 being
>> easy to handle but consumed more memory. [ Pbly UTF-8
>> is still better if everything considered. ]
>
> I thing two bytes per character fits better, it allows to address character
> by index. This is not possible in UTF8, i.e. if you need to get a 1000th
> character of the string you must scan and parse each character until you
> find a 1000th. This is slow operation. I know we can have more than 16-bit
> character but this is seldom situation.

With UTF-16, you have the exact same complication if the
goal is to support full Unicode, not just one part of it.

To solve it we should use UTF-32.

>> Probably all current hb_parc[x]() calls will have to be replaced
>> with new API where we allow legacy bytestreams to be passed.
>> Fortunately this is a problem only with the smaller part of
>> functions.
>
> I think a better way is to replace hb_parc() by new function if function
> accept unicode string, and leave hb_parc() for legacy strings. We will need
> to review code anyway: string indexing, buffer allocations, etc. would be
> different for unicode.

That would be good, but in this case some tricky code would have
to be added to the VM, because the temporary string buffer
needs to be stored somewhere and released later by some
automatism. Such a not-very-efficient method is good for
compatibility, but certainly not as ideal as the hb_str*() API.

>> Also, most interfaces with 3rd party libs will have to be extended
>> to use str API (just like hbwin, hbodbc does now), where required
>> string format differs from Harbour internal format and we're dealing
>> with strings instead of bytestreams.
>
> Does all native libraries can use unicode? Regex?

pcre supports Unicode, though that support was introduced
not very long ago.

Generally, this is an area to explore more deeply. All of this
has to be reviewed and fixed gradually. I seem to
remember some 3rd-party libs where we interface
using 8-bit strings while they would require UTF-8 strings;
maybe it was mysql, I can't remember.

Viktor

Bacco

Sep 7, 2010, 2:06:58 PM
to harbou...@googlegroups.com
Also, remember that string functions are commonly used with binary data.

I am waiting a little longer to see the overall opinions, but I believe
that we should have a completely new set of functions to migrate the
current and future language-aware functions, something like
i18n_asc( string, encoding ), i18n_left( string, len, encoding ),
i18n_substr( string, start, count, encoding ),
i18n_at( string, string, encoding ) and so on.

IMHO, this way we can honour legacy code, avoid a performance and
storage penalty on binary strings, keep the coding clear, and benefit
from an easy development path for the new foundation.

Viktor Szakáts

Sep 7, 2010, 3:11:11 PM
to harbou...@googlegroups.com
> I am waiting a little more to see the overall opinions, but I believe
> that we should have a complete new set of commands to migrate the
> current and future language aware functions, something like
> i18n_asc(string,encoding) , i18n_left(string,len,encoding),
> i18n_substr(string,start,count,encoding),
> i18n_at(string,string,encoding) and so on
>
> IMHO, this way we can honour legacy code, avoid performance and
> storage penalty on binary strings, made clear coding, and benefit for
> an easy development path for the new foundation.

In my book, this is barely 'native' support. Such an API can be
created as an add-on even now, without touching the core. We
already have the HB_UTF8*() API, which is just that, with UTF-8-only
support.

"Core" support to me means that I can comfortably create
a fully Unicode app with a well-defined upgrade path, with the
least amount of changes, so that Unicode coding feels the
same as legacy coding felt. All I want is to use Unicode
strings wherever applicable and expect those Unicode
strings to be preserved when passed around internally.

I mean, users cannot be expected to change all AT()
calls to HB_I18N_AT() and keep using that from this point
on. It also creates problems when mixing one's own code with
library code, since the user cannot be sure whether the library has been
updated to use HB_I18N_AT().

I'd expect above functionality to be supported using this code:

hb_translate( ASC( num ), encoding )
hb_translate( LEFT( string, len ), encoding )
hb_translate( SUBSTR( string, start, count ), encoding )
hb_translate( AT( string, string ), encoding )

Also notice that an actual down-conversion to another cp/encoding
is the exception rather than the rule in real applications, as
such a conversion should only be used when interfacing with
_external components_ which do not support Unicode or
have a fixed encoding/cp requirement. Typically: import/export
functions, communicating with old APIs and interfaces, or
storing data used by other systems.
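
A minimal sketch of such a boundary conversion (hb_Translate() with
explicit source/target codepage IDs; the "HU852" target and the wrapper
name are chosen just for illustration): text is kept in UTF-8 internally
and down-converted only when written out for an external consumer.

   PROCEDURE ExportLegacy( cUtf8Text, cFileName )
      // a one-way, possibly lossy step done only at the legacy interface
      hb_MemoWrit( cFileName, hb_Translate( cUtf8Text, "UTF8", "HU852" ) )
      RETURN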

We should think about how to make such native
Unicode support possible and how to make the transition
as seamless as possible for users.

One key point is to identify raw byte-stream vs. unicode
string usage in an application.

Viktor

Massimo Belgrano

Oct 28, 2010, 10:04:23 AM
to Harbour Developers
I want to bring back this interesting thread regarding
the idea of giving Harbour:
- full Unicode support (including Unicode .prg files)
- proper and native multilanguage support (i18n)
- Unicode as a native data type.
In the meantime, Harbour can support Unicode as a byte stream.

