I plan to begin some modifications to add basic support for UNICODE to HVM.
I would like to agree on some important things now.
Let's imagine that we use UTF8 as the internal unistring representation.
Should we change functions like LEN(), SUBSTR(), LEFT(), RIGHT(), PADR(),
PADC(), ... to operate on character indexes or leave them as byte ones?
What should we do with the CHR() and ASC() functions? Keep them operating
on ASCII values or switch to UNICODE?
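To make the byte/character difference concrete, here is a sketch (assuming
a UTF-8 internal representation and a UTF-8 source file, so the "š" literal
below occupies two bytes):

cStr := "š" + "a"         // U+0161 is C5 A1 in UTF-8
? LEN( cStr )             // 2 with character indexes, 3 with byte indexes
? ASC( LEFT( cStr, 1 ) )  // 353 with UNICODE semantics, 0xC5 with bytes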
What is your preferred behavior for INKEY() and unicode values?
If we want to keep compatibility then we need to introduce new
inkey flag to retrieve UNICODE values. We can also define one
inkey value, K_UNICODE, to indicate that there is a unicode value
which can be retrieved by the HB_UNIKEY() function.
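For illustration, at PRG level the K_UNICODE variant could look like this
(a sketch only; K_UNICODE and HB_UNIKEY() are just the names proposed above):

nKey := INKEY( 0 )
IF nKey == K_UNICODE     // no corresponding Clipper keycode
   nUni := HB_UNIKEY()   // retrieve the UNICODE value (proposed helper)
ENDIF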
Please also think about updating upper level core code like the GET
system to work with UNICODE values. The new PRG API should make
such an update easy.
best regards,
Przemek
Hi,
> in ADS it is managed at field level
> http://blog.advantageevangelist.com/2010/06/ads-10-tip-4-unicode-support.html
I added support for these fields in Harbour's ADS RDD over a year ago:
2010-10-09 19:07 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/contrib/rddads/ads1.c
+ added support for new ADS 10.0 UNICODE fields: NChar, NVarChar, NMemo
They are supported in all ADS* RDDs,
and also in the core DBF* RDDs:
2010-10-13 13:21 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/src/rdd/dbf1.c
* harbour/src/rdd/dbffpt/dbffpt1.c
+ added support for UNICODE fields compatible with the ones used
by ADS
> so will Harbour create a new type nstring,
> or is it necessary only for UTF-16 encoding because it doubles the space
> occupation?
It's completely independent. The internal representation is invisible
to applications using the Harbour STR API, so code which uses this API can
work with any HVM encoding.
> > > What is your preferred behavior for INKEY() and unicode values?
> > > If we want to keep compatibility then we need to introduce new
> > > inkey flag to retrieve UNICODE values. We can also define one
> > > inkey value, K_UNICODE, to indicate that there is a unicode value
> > > which can be retrieved by the HB_UNIKEY() function.
> > Probably the former would allow for slicker code on the user's
> > side, so I'd prefer a new inkey flag.
> +1 for inkey flag to retrieve UNICODE values
OK, but please remember that it means we have to introduce a
completely new set of K_* macros, because the current ones create conflicts
with UNICODE values. It means that in some applications the modifications
will be very deep.
Now HB_INKEY_EXTENDED is completely unused.
HB_INKEY_RAW is partially used in GTDOS, GTOS2 and GTSLN, but it's old
dummy code which works in a different way in each of these GTs and is not
compatible with the upper level GT code, so we can safely remove it and
introduce a new flag, i.e. HB_INKEY_EXT. When it's used, completely new
keycode values are returned. These new keycode values will be used
internally by all low level GTs and the core GT code. If INKEY() is called
without the HB_INKEY_EXT flag, they are translated to old Clipper INKEY
values, and UNICODE values which do not have a corresponding character in
the active CP are converted to K_UNICODE.
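I.e. something like this (a sketch; HB_INKEY_EXT as proposed above):

nExt := INKEY( 0, hb_bitOr( INKEY_ALL, HB_INKEY_EXT ) ) // new keycode values
nOld := INKEY( 0, INKEY_ALL ) // old Clipper values; characters without a
                              // representation in the active CP -> K_UNICODE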
Core PRG code should be updated to work correctly with any _SET_EVENTMASK
setting.
Anyhow, this Unicode value has to be converted to a string, so this question:
> > > What should we do with the CHR() and ASC() functions? Keep them operating
> > > on ASCII values or switch to UNICODE?
is very important.
If we leave CHR() as is, then we need to introduce new functions, e.g.:
HB_UNICHR()
HB_UNICODE()
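They could work like this (only a sketch; the names above are proposals, and
the "š" literal is assumed to reach HVM as a UNICODE string):

? ASC( "š" )         // stays byte oriented: 0xC5, the first UTF8 byte
? HB_UNICODE( "š" )  // 353, the U+0161 character value (proposed)
? HB_UNICHR( 353 )   // one-character string containing "š" (proposed)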
best regards,
Przemek
I think Asc(), given its name, should not handle unicode chars.
My 2c.
Maurilio.
--
__________
| | | |__| Maurilio Longo
|_|_|_|____| farmaconsult s.r.l.
On 2011.11.10 12:16, Przemysław Czerpak wrote:
> I plan to begin some modifications to add basic support for UNICODE to HVM.
> I would like to agree on some important things now.
> Let's imagine that we use UTF8 as the internal unistring representation.
> Should we change functions like LEN(), SUBSTR(), LEFT(), RIGHT(), PADR(),
> PADC(), ... to operate on character indexes or leave them as byte ones?
> What should we do with the CHR() and ASC() functions? Keep them operating
> on ASCII values or switch to UNICODE?
It is very important to have the whole picture before making any final
agreements. I'll try to share some thoughts, though I still do not have
the whole final picture of unicode support, so these will be a few
brainstorming-style ideas. It would also be nice to look at the
implementations in other products like Java, PHP and Python before
reinventing the wheel.
PRG level code should not depend on the internal unicode representation in
HVM. We can use different internal representations; UTF-8 is only one of
them. Another is the Windows wide char little-endian unicode
representation. Both representations have drawbacks and advantages. E.g.,
UTF-8 saves memory, but obtaining a character offset is more complex.
Windows wide char usage lets us avoid numerous string conversions in a
Windows application, if the Windows API is used often.
The independence from the internal string representation gives an answer to
the question of how LEN(), SUBSTR(), LEFT(), etc. should work. They should
work on characters, not bytes! Otherwise, we'll have different results for
the same "s caron" (U+0161) character:
ASC(LEFT(s caron, 1)) == 0xC5 // since UTF8 is C5 A1
or
ASC(LEFT(s caron, 1)) == 0x61 // since little-endian is 61 01
Using bytes in SUBSTR(), etc., will make PRG level code more hacky - we
can split a string in the middle of a character. The LEN(FIELD->CHARACTER)
result will depend on the field content even for the current fixed-width
DBF character fields, etc.
Byte operations can be useful for those who work with binary strings,
because you have strict control over the binary data representation you
store in memory, but in this case I'd say you do not need unicode support
at all. Just do some binary transformation to UTF-8 or another encoding
before passing a string parameter to the Cairo API, to wide char in the
case of the Windows API, etc.
I would expect ASC() and CHR() to work on characters. I.e.,
ASC(s caron) is equal to 353, and CHR(353) returns a one-character string
containing s caron.
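In code (a sketch of the semantics I would expect, assuming the literal
reaches HVM as a unicode string):

? ASC( "š" )        // 353, the U+0161 code point, not the first byte
? CHR( 353 )        // a one-character string containing "š"
? LEN( CHR( 353 ) ) // 1, regardless of the internal representation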
The mess begins when I try to think about binary strings. We will
need such strings even if we have unicode strings. Many functions, like
file read/write, socket operations, etc., operate on raw bytes, not on
characters. I think the following conversion should be done in such cases:
cBin := hb_translate(cString,, "to_encoding")
FWRITE(hFile, cBin, LEN(cBin))
and
FREAD(hFile, @cBin, LEN(cBin))
cString := hb_translate(cBin, "from_encoding")
The question is how we will keep binary strings in HVM. Will we use some
flag to indicate whether a string is binary or not? We can store all strings
in unicode representations and not use a binary flag at all. E.g., the
binary string "\x55\xAA" can be encoded and stored as 3 bytes in UTF-8. In
this case, we will have to do a char to byte translation in functions like
FWRITE() by obtaining the integer code of every character and making a
binary string of bytes (code % 256). This complicates a little the functions
which operate on a "memory buffer" like FREAD(). Still, I can understand
that this is not very difficult to solve. However, storing all strings
(including binary) in a unicode representation could be suboptimal in case
the application does a lot of binary data processing.
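For example (a sketch of the storage consequence described above):

hFile := FCREATE( "test.bin" )
cBin := CHR( 0x55 ) + CHR( 0xAA ) // 0xAA becomes C2 AA in UTF-8: 3 bytes stored
? LEN( cBin )                     // still 2 characters
FWRITE( hFile, cBin, 2 )          // FWRITE() maps every char back to a byte
FCLOSE( hFile )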
One more question I cannot solve in my head is the result of functions
like hb_parc() and others. We have a huge number of C level function
calls that operate on strings and do not use the String API. What result
is expected from such functions? It would be nice to have some setting
for non-String-API functions. E.g., we expect the result to be returned in
CP437. The bad thing is that the hb_parc() returned value is never freed.
So, the returned value should not be obtained by some transcoding and
should return the internal HVM representation. This has a serious
consequence: if we want to have the possibility of different internal HVM
string representations, hb_parc() is a completely useless function, since
I may obtain the string in UTF-8 or little-endian widechar format. I fail
to imagine any useful usage of the old API for strings. Maybe it has some
limited application if we say that UTF-8 is the only possible internal
encoding for a unicode HVM.
Regards,
Mindaugas
Some considerations: I believe we need separate types for strings and
streams, so with strings the behaviour should always be 1 byte. There
are plenty of uses of Asc(), Left(), Right() and SubStr() to handle binary
data that would be unusable simultaneously with UTF8. Being two separate
types, it would be easy to use the proper implementation of Left()/Right(),
etc.
Although I believe that internal UTF8 is good, I feel a little worried
about the implementation.
Regards,
Bacco
Hi,
> It is very important to have the whole picture before making any
> final agreements. I'll try to share some thoughts, though I still do
> not have the whole final picture of unicode support, so these will be
> a few brainstorming-style ideas. It would also be nice to look at the
> implementations in other products like Java, PHP and Python before
> reinventing the wheel.
So far, looking at some of them, I haven't found answers or interesting
solutions for the real problems. Just a few arbitrary decisions; either
the problem is not touched at all in the low level code or everything is
redirected to ICU.
> PRG level code should not depend on the internal unicode representation
> in HVM. We can use different internal representations; UTF-8 is only
> one of them. Another is the Windows wide char little-endian unicode
> representation. Both
BTW, Windows uses native endianness, not little endian - at least in the
documentation we have arrays of TCHARs. Of course, for x86 machines
it's the same.
> representations have drawbacks and advantages. E.g., UTF-8 saves
> memory, but obtaining a character offset is more complex. Windows wide
> char usage lets us avoid numerous string conversions in a Windows
> application, if the Windows API is used often.
Yes, it is. The most important advantage of UTF8 used as the internal
encoding is direct casting to char * strings, so C code using the old
string API (hb_parc*()) can work with such strings and can be updated
over a longer term. It also allows us to keep the current 'char *' pointers
in existing HVM structures and functions, so we will not have to update
them when adding UNICODE strings to HVM.
Anyhow, the final representation of UNICODE strings in HVM should be
fully independent from the public API, so I will want to touch on this
subject too, though maybe later.
UTF8 also simplifies string constants in .prg and .c code.
As you said, the most important disadvantage is the much more complex
character access by index in UTF8 strings.
> The independence from the internal string representation gives an answer
> to the question of how LEN(), SUBSTR(), LEFT(), etc. should work. They
> should work on characters, not bytes! Otherwise, we'll have different
> results for the same "s caron" (U+0161) character:
> ASC(LEFT(s caron, 1)) == 0xC5 // since UTF8 is C5 A1
> or
> ASC(LEFT(s caron, 1)) == 0x61 // since little-endian is 61 01
>
> Using bytes in SUBSTR(), etc., will make PRG level code more hacky
> - we can split a string in the middle of a character.
> The LEN(FIELD->CHARACTER) result will depend on the field content even
> for the current fixed-width DBF character fields, etc.
>
> Byte operations can be useful for those who work with binary
> strings, because you have strict control over the binary data
> representation you store in memory, but in this case I'd say you do
> not need unicode support at all. Just do some binary
> transformation to UTF-8 or another encoding before passing a string
> parameter to the Cairo API, to wide char in the case of the Windows API,
> etc.
>
> I would expect ASC() and CHR() to work on characters. I.e.,
> ASC(s caron) is equal to 353, and CHR(353) returns a one-character
> string containing s caron.
So if we want to separate the internal representation from PRG code, we
have to change the string functions to operate on character indexes
instead of bytes and use UNICODE character values instead of ASCII
ones.
Support for binary strings as a separate type is also important, but
it's not a solution for all cases. Sooner or later someone will add a
UNICODE string to a binary string, and we have to decide what the
final result is and which conversions should be done on both strings
before concatenation. We can also forbid such an operation and generate an
RTE, which seems reasonable if we add a set of functions for
conversions between byte and unicode strings, so the user can change the
type of the arguments before the operation. It forces code updating, but
the final code should be much cleaner, without some unexpected
runtime results.
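E.g. (a sketch; hb_translate() is taken from Mindaugas' example and the
other names are the proposals above, not a final API):

cUni := HB_UNICHR( 353 )              // unicode string
cBin := hb_translate( cUni,, "UTF8" ) // explicit unicode -> byte conversion
? cBin + CHR( 0xAA )                  // OK: both operands are binary now
// ? cUni + CHR( 0xAA )               // would generate the RTE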
> The mess begins when I try to think about binary strings. We
> will need such strings even if we have unicode strings. Many
> functions, like file read/write, socket operations, etc., operate on
> raw bytes, not on characters. I think the following conversion
> should be done in such cases:
> cBin := hb_translate(cString,, "to_encoding")
> FWRITE(hFile, cBin, LEN(cBin))
> and
> FREAD(hFile, @cBin, LEN(cBin))
> cString := hb_translate(cBin, "from_encoding")
>
> The question is how we will keep binary strings in HVM. Will we use
> some flag to indicate whether a string is binary or not? We can store
> all strings in unicode representations and not use a binary flag at
> all. E.g., the binary string "\x55\xAA" can be encoded and stored as 3
> bytes in UTF-8. In this case, we will have to do a char to byte
> translation in functions like FWRITE() by obtaining the integer code of
> every character and making a binary string of bytes (code % 256). This
> complicates a little the functions which operate on a "memory buffer"
> like FREAD(). Still, I can understand that this is not very difficult to
> solve. However, storing all strings (including binary) in a unicode
> representation could be suboptimal in case the application does a lot of
> binary data processing.
Or maybe we can also generate an RTE here when a non-binary string is
passed to such a function. In that case the programmer will have to make
the conversion himself to the form he needs.
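I.e. (a sketch following Mindaugas' example above):

cBin := hb_translate( cString,, "to_encoding" ) // explicit conversion first
FWRITE( hFile, cBin, LEN( cBin ) )              // OK: binary string
// FWRITE( hFile, cString, LEN( cString ) )     // would generate an RTE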
> One more question I cannot solve in my head is the result of
> functions like hb_parc() and others. We have a huge number of C level
> function calls that operate on strings and do not use the String API.
> What result is expected from such functions? It would be nice to have
> some setting for non-String-API functions. E.g., we expect the result
> to be returned in CP437. The bad thing is that the hb_parc() returned
> value is never freed. So, the returned value should not be obtained by
> some transcoding and should return the internal HVM representation.
> This has a serious consequence: if we want to have the possibility of
> different internal HVM string representations, hb_parc() is a
> completely useless function, since I may obtain the string in UTF-8 or
> little-endian widechar format. I fail to imagine any useful usage of
> the old API for strings. Maybe it has some limited application if we
> say that UTF-8 is the only possible internal encoding for a unicode HVM.
As I said, the nice side effect of the UTF8 representation is the fact
that it is still a valid char * string, so we can think about it later.
This can be resolved in a few ways:
1) add to the function frame a list of strings allocated by the hb_parc*()
functions for different parameters, which are freed when the function
returns.
2) extend the asString item structure and add support for alternative
string representations which will be freed by hb_itemClean()
or by code which modifies the item string buffer. Such a solution works
also for the hb_itemGetCPtr*() functions.
3) eliminate the old API from the whole core code and use only the new one.
Probably preferred, because the STR API is MT safe and theoretically
allows simultaneous write access to the same item from different
threads, if we decide to create such an HVM version in the future.
best regards,
Przemek
The Stream (or whatever name) can be used for unicode, and can store
the data itself plus one byte to specify the type, so one can start
implementing, e.g., 0x08 as a UTF-8 indicator, leaving room for future
encodings.
Also, streams and characters should never be added by simple
operators; if the user doesn't know exactly what he's doing, he
probably should not be doing it. If one needs to store some unicode
stream as a character, the conversion must be an explicit call to a
function.
Then we need to consider the sources. People should use the
encoding that best fits their needs. If I'm developing for Qt and
MySQL, and don't need Japanese or Chinese, I probably want to edit my
source in 1252, and so on (thinking about an ideal implementation, not the
current one). Other users may mostly need UTF-8. Maybe we will need some
#pragma in the future to specify the source encoding too.
"Bacco" <carlo...@gmail.com> pisze:
> I believe that creating a new type would be the first step, leaving
> current C type for binary strings, allowing 100% backward
> compatibility.
e.g.
Local txt1 as Unicode, txt2 := Unicode(), txt0 := ''
? ValType( txt1 ), txt2:ClassName, ValType( txt0 ) // "S"? "STRING"|"UNICODE"? "C"(haracter)
but what happens with
? txt1 + txt0 // ????????
etc.
Regards,
Marek Horodyski
It should RTE, exactly as I said in my last email.
You can't mix the two without functions. You need to do, for example:
? someconversion( txt1, HB_UTF8, HB_LATIN1 ) + txt0
The HVM shouldn't assume anything about characters above 255,
because we have database encodings, console encodings, GUI encodings and
printer encodings, and one can use completely different encodings for
each of them simultaneously.
IMHO, it's the job of whoever is programming to deal with these. Anything
"automatic" will cause more confusion than it solves. Besides, people
will need to learn to do it right from the start, which I think is
better for everyone than tons of people getting unexpected
results without a clue.
Regards
Bacco