
Pre-Delphi 2008/9 Unicode Do's and Don'ts


Lee Jenkins

Jul 21, 2008, 12:40:11 PM

Has anyone posted information concerning do's and dont's for Unicode support in
upcoming Delphi versions?

In recent threads concerning Delphi/Unicode, I think the topic of being prepared
for Unicode has not been addressed so much, at least as far as I can see.

On one side, we have applications that have already been written whose authors
are rightfully concerned about compatibility.

On the other side, we have applications which are yet to be written and do not
have much threat of being

In the middle, we have applications which are currently being written (raises
hand) and which could benefit from some suggestions on best practices, so that
they have a chance of being ported more easily when D2008/9 is finally released.

--
Warm Regards,

Lee

Nick Hodges (Embarcadero)

Jul 21, 2008, 1:04:22 PM
Lee Jenkins wrote:

> Has anyone posted information concerning do's and dont's for Unicode
> support in upcoming Delphi versions?

I'll be posting some articles on this very soon.

Short list:

Don't assume that the size of a Char is one.

Don't assume that the size of an array of Char is the same as the
Length of the string held in the array of Char.
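
For illustration, a minimal sketch of the kind of code those two rules affect
(CopyToBuffer is a hypothetical example, not CodeGear code; uses SysUtils for
StrPLCopy):

  procedure CopyToBuffer(const S: string);
  var
    Buf: array[0..255] of Char;   // 256 Chars; in bytes that is 256 * SizeOf(Char)
  begin
    // FillChar counts bytes, so use SizeOf(Buf), never Length(Buf) or Length(S).
    FillChar(Buf, SizeOf(Buf), 0);
    // StrPLCopy's MaxLen counts characters, so Length(Buf) - 1 is right here.
    StrPLCopy(Buf, S, Length(Buf) - 1);
  end;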


--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges

John Herbster

Jul 21, 2008, 1:27:46 PM
>> Has anyone posted information concerning do's and dont's for
>> Unicode support in upcoming Delphi versions?

"Nick Hodges (Embarcadero)" <nick....@codegear.com> wrote


> I'll be posting some articles on this very soon.

Nick, Here are a few suggestions and clarifications.

> Short list:
> Don't assume that the size of a Char is one.

Please start with the compiler option to switch the definition of "Char".

> Don't assume that the SizeOf an array of Char is the same as the
> Length of the string held in the array of Char [less one].

Show us how to iterate through a string of characters with indexes.

Show us how to iterate through a string of characters with pointers.

Show us how to load and store a string from and to TStreams.

Show us how to replace a character.

Show us how to make literal constants and assign them to strings.

Show us how to pass strings to and from DLLs.

Regards, JohnH

Nick Hodges (Embarcadero)

Jul 21, 2008, 1:44:42 PM
John Herbster wrote:

>
> Show us how to iterate through a string of characters with indexes.

Exactly as before.

>
> Show us how to iterate through a string of characters with pointers.

Exactly as before -- but don't assume a character is of size 1.

> Show us how to load and store a string from and to TStreams.

Exactly as before, but you can't assume that the size of a string's Char
is 1.

>
> Show us how to replace a character.

Exactly as before.

> Show us how to make literal constants and assign them to strings.

Exactly as before.

> Show us how to pass strings to and from DLLs.

Just as before, but again, don't assume that Char = 1 byte.

Serge Dosyukov (Dragon Soft)

Jul 21, 2008, 2:24:40 PM
1) A few functions expect PAnsiChar/PWideChar instead of
AnsiChar/WideChar (Windows API).
2) When working with the Windows API, be aware of what you are passing around
(Windows messages).
3) Use Length().
4) Strings are still #0 terminated, but instead of a single-byte #0 you
might see a two-byte #0.

In Delphi 7

var
  LC: Char;
  LC2: WideChar;
  LC3: AnsiChar;
begin
  ShowMessage(IntToStr(SizeOf(LC)) + ', ' + IntToStr(SizeOf(LC2)) + ', ' +
    IntToStr(SizeOf(LC3)));
end;

gives "1, 2, 1", whereas in the new version you may get "2, 2, 1".

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:4884c709$1...@newsgroups.borland.com...

John Herbster

Jul 21, 2008, 2:27:25 PM

"Nick Hodges (Embarcadero)" <nick....@codegear.com> wrote
> Exactly as before -- but don't assume a character is of size 1.

Thanks Nick!

>> Show us how to iterate through a string of characters with pointers.

> Exactly as before -- but don't assume a character is of size 1.

May I presume like this?
p := @MyString[1];
Inc(p);
where MyStr: string; and p: PChar;

And how expensive are these operations during CPU execution?

TIA, JohnH

John Herbster

Jul 21, 2008, 2:32:23 PM

"Serge Dosyukov (Dragon Soft)" <pooh996.gmail.com> wrote

> 1) Few functions are expecting PAnsiChar/PWideChar instead of
> AnsiCar/WideChar (windows API)

What are the type names for Unicode strings and chars?
What is the SizeOf() for a Unicode char variable?
--JohnH

Nick Hodges (Embarcadero)

Jul 21, 2008, 2:45:44 PM
John Herbster wrote:

> May I presume like this?
> p := @MyString[1];
> Inc(p);
> where MyStr: string; and p: PChar;

Yes -- just like before.

> And how expensive are these operations during CPU execution?

Minimal -- it's very efficient. It's pointer math, right? ;-)
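
For illustration, a minimal sketch of pointer-based iteration that stays correct
whether SizeOf(Char) is 1 or 2 (a hypothetical routine, not CodeGear code):

  function CountSpaces(const S: string): Integer;
  var
    P: PChar;
  begin
    Result := 0;
    P := PChar(S);        // points at the first Char, or at a #0 for an empty string
    while P^ <> #0 do
    begin
      if P^ = ' ' then
        Inc(Result);
      Inc(P);             // advances by SizeOf(Char), whatever that happens to be
    end;
  end;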

Nick Hodges (Embarcadero)

Jul 21, 2008, 2:47:11 PM
John Herbster wrote:

> What are the type names for Unicode strings and chars?


string aliases to UnicodeString
PChar aliases to PWideChar

> What is the SizeOf() for a Unicode char variable?

SizeOf(Char) is now 2.

Serge Dosyukov (Dragon Soft)

Jul 21, 2008, 2:58:37 PM
http://blogs.codegear.com/nickhodges

1) string, char
2) We still sit on top of the Windows API, so "Wide strings consist of
16-bit Unicode characters". This could be different for 64-bit processors,
but then "WideChar would suddenly grow in size".

http://en.wikipedia.org/wiki/Unicode
http://delphi.about.com/od/beginners/l/aa071800a.htm
http://www.codexterity.com/delphistrings.htm

As you can see from my code sample, you get the WideChar/WideString
representation: in the case of Char it is a WideChar, and in the case of the
string it is a WideString.

Rule of thumb: stay away from assuming a specific size of the string
representation in bytes; count its length in chars instead. Then, if you need
the exact size, multiply it by the size of the char being stored.

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message

news:4884...@newsgroups.borland.com...

John Herbster

Jul 21, 2008, 3:01:27 PM
>> What is the SizeOf() for a Unicode char variable?

"Nick Hodges (Embarcadero)" <nick....@codegear.com> wrote
> SizeOf(Char) is now 2.

Not up to 4 bytes?
How can you encode 100,000 characters in only 2-bytes when 2^(2*8) = 65536?
See
http://en.wikipedia.org/wiki/Unicode

If you mean UTF-8, why not call it UTF-8?

--JohnH

Tim Young [Elevate Software]

Jul 21, 2008, 3:16:47 PM
John,

<< If you mean UTF-8, why not call it UTF-8? >>

It's UTF-16 (Word-sized characters), the same as with Windows 2000 and
later. It covers most of the character sets out there, but requires
surrogate pairs for more extensive character sets.

--
Tim Young
Elevate Software
www.elevatesoft.com


John Herbster

Jul 21, 2008, 3:20:42 PM

"Tim Young [Elevate Software]" <timy...@elevatesoft.com> wrote

> It's UTF-16 (Word-sized characters), the same as with Windows 2000 and

Tim,

Let's try to pin some definitions down.

According to http://en.wikipedia.org/wiki/UTF-16

"UTF-16 (16-bit Unicode Transformation Format) is a variable-length
character encoding for Unicode"

If Windows and the new Delphi really do use UTF-16, how do they
handle the variable-length character encodings?

Rgds, JohnH

Thorsten Engler [NexusDB]

Jul 21, 2008, 3:16:05 PM
John Herbster wrote:

> May I presume like this?
> p := @MyString[1];
> Inc(p);
> where MyStr: string; and p: PChar;
>
> And how expensive are these operations during CPU execution?

The Inc(p) used to add 1, now it adds 2 to the pointer. What difference
in performance compared to AnsiString/PAnsiChar do you expect?

--

Thorsten Engler [NexusDB]

Jul 21, 2008, 3:33:14 PM
John Herbster wrote:

> "UTF-16 (16-bit Unicode Transformation Format) is a variable-length
> character encoding for Unicode"
>
> If Windows and the new Delphi really do use UTF-16, how do they
> handle the variable-length character encodings?

In pretty much the same way that windows and delphi handle MBCS ANSI
codepages currently.

See http://en.wikipedia.org/wiki/Multi-byte_character_set

"UTF-16 was devised to break free of the 65,536-character limit of the
original Unicode (1.x) without breaking compatibility with the 16-bit
encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF,
lead units the range D800-DBFF and trail units the range DC00-DFFF. The
lead and trail units, called in Unicode terminology high surrogates and
low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making
for a maximum of possible 1,114,112 codepoints in Unicode."

--

Ian Boyd

Jul 21, 2008, 4:08:34 PM
>> Show us how to iterate through a string of characters with pointers.
> Exactly as before -- but don't assume a character is of size 1.

p: Pointer;


p := @MyString[1];
Inc(p);

?

Thorsten Engler [NexusDB]

Jul 21, 2008, 4:35:35 PM
Ian Boyd wrote:

> p: Pointer;
> p := @MyString[1];
> Inc(p);

You can't do pointer math with untyped pointers. Never worked before.
Not going to start working suddenly. "Pointer" does not know how big
whatever it points to is.

--

Nick Hodges (Embarcadero)

Jul 21, 2008, 4:34:42 PM
Ian Boyd wrote:

> p: Pointer;
> p := @MyString[1];
> Inc(p);

That will behave differently, since it appears to assume that
SizeOf(Char) = SizeOf(Pointer), which is no longer true.

Remy Lebeau (TeamB)

Jul 21, 2008, 4:58:27 PM

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:4884dcfb$1...@newsgroups.borland.com...

> Not up to 4 bytes?

No. "Char" will now be an alias for WideChar, wheras it was an alias for
AnsiChar in previous versions. Thus SizeOf(Char) will be 2 now.

> How can you encode 100,000 characters in only 2-bytes when
> 2^(2*8) = 65536?

The new UnicodeString type will use UTF-16 (just like WideString does) in
order to match how Windows implements Unicode.

In UTF-16, Unicode code points (logical characters) less than $10000 can be
encoded using their original value as-is in a single WideChar. Unicode
codepoints of $10000 and above have to be encoded as two WideChars
working together (known as a "surrogate pair"). The use of surrogate pairs
allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
than that requires UTF-32 instead, which Tiburon will also support via
separate UCS4String (and UCS4Char) data types, which are 32-bit.
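
For illustration, a minimal sketch of that encoding rule (a hypothetical helper,
not a Tiburon API; it does not validate that CodePoint is outside the reserved
surrogate range):

  // Encodes one Unicode codepoint as UTF-16, returning 1 or 2 code units in W1/W2.
  function EncodeUTF16(CodePoint: LongWord; out W1, W2: WideChar): Integer;
  begin
    if CodePoint < $10000 then
    begin
      W1 := WideChar(CodePoint);                      // a singleton
      W2 := #0;
      Result := 1;
    end
    else
    begin
      Dec(CodePoint, $10000);                         // 20 bits remain
      W1 := WideChar($D800 or (CodePoint shr 10));    // high (lead) surrogate
      W2 := WideChar($DC00 or (CodePoint and $3FF));  // low (trail) surrogate
      Result := 2;
    end;
  end;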


Gambit


Remy Lebeau (TeamB)

Jul 21, 2008, 5:23:58 PM

"Remy Lebeau (TeamB)" <no....@no.spam.com> wrote in message
news:4884f8ca$1...@newsgroups.borland.com...

> The use of surrogate pairs allows UTF-16 to support up to
> 2,097,152 Unicode codepoints.

Correction: UTF-16 supports 1,112,064 Unicode codepoints ($00000000 -
$0010FFFF, minus $0000D800 - $0000DFFF which are reserved).


Gambit


Pieter Zijlstra

Jul 21, 2008, 6:01:27 PM
John Herbster wrote:

> Then for "surrogate pairs" which require two WideChars for their
> representation, it seems to be that "exactly as before" character
> indexing will require sometimes stepping over two WideChars instead
> of one.

It is the same as before, where multiple bytes were needed to display
one character in, for instance, Asian Windows versions. Most of the time
you don't care; you just read/write a number of bytes (with Unicode,
words) and leave it to the Windows API how this is displayed.

--
Pieter

Thorsten Engler [NexusDB]

Jul 21, 2008, 5:55:43 PM
John Herbster wrote:

> Then for "surrogate pairs" which require two WideChars for their
> representation, it seems to be that "exactly as before" character
> indexing will require sometimes stepping over two WideChars instead
> of one.

UTF16 has the huge advantage that the values for singletons and leading
and trailing surrogate pairs do not overlap:

"In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead
units the range D800-DBFF and trail units the range DC00-DFFF".

As a result of this, for code like "split this string into individual
strings at each \" and a lot of other string processing that happens
on a per-character basis, you don't have to worry about the
surrogate pairs, because the trailing unit can never be mistaken for
some other valid character.

> Are the individual WideChars stored big or little endian?
In memory, usually whatever your current hardware platform prefers.

> If little endian in Intel RAM, how are they stored in disk "text"
> files and communicated over wires?

That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

All UTF16 strings that go "over the wire" or onto disk should be
prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the byte order
of the following data.
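
For illustration, a minimal sketch of writing a BOM followed by UTF-16 data to a
TStream on a little-endian (x86) machine (a hypothetical helper using today's
WideString; uses Classes):

  procedure SaveWideToStream(const S: WideString; Stream: TStream);
  const
    BOM: WideChar = #$FEFF;   // stored on x86 as the byte pair FF FE
  begin
    Stream.WriteBuffer(BOM, SizeOf(WideChar));
    if Length(S) > 0 then
      Stream.WriteBuffer(S[1], Length(S) * SizeOf(WideChar));
  end;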

> What about the surrogate pairs? Is the low or high part of the pair
> at the lower address? And ditto for disk files and communications?
The order of the surrogate pairs always remains the same, the leading
one comes before the trailing one.

> Does that mean that UTF-16 characters are limited to 4-bytes?
That's why they are called "surrogate pairs" and not "surrogate
sequences" or something like that. You either have a singleton or a
pair of a leading and a trailing surrogate.


--

John Herbster

Jul 21, 2008, 5:39:30 PM
Remy, Thorsten, et al.,

> "Char" will now be an alias for WideChar, ...
> Thus SizeOf(Char) will now be 2.

Thanks for that info.

>> How can you encode 100,000 characters in only 2-bytes
>> when 2^(2*8) = 65536?

> The new UnicodeString type will use UTF-16 (just like WideString does) in
> order to match how Windows implements Unicode.

> In UTF-16, Unicode code points (logical characters) less than $10000 can be
> encoded using their original value as-is in a single WideChar. Unicode
> codepoints above $10000, inclusive, have to be encoded as two WideChars
> working together (known as a "surrogate pair"). The use of surrogate pairs
> allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
> than that requires UTF-32 instead. Which Tiburon will also support, via a
> separate UCS4String (and UCS4Char) data type, which are 32-bit.

Then for "surrogate pairs" which require two WideChars for their

representation, it seems to be that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.

Are the individual WideChars stored big or little endian?

If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?

What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?

> UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 codepoints in Unicode.
Retrieved from "http://en.wikipedia.org/wiki/Variable-width_encoding"

Does that mean that UTF-16 characters are limited to 4-bytes?

TIA for the education, JohnH

John Herbster

Jul 21, 2008, 6:37:45 PM
"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote

> UTF16 has the huge advantage that the values for singletons and leading
> and trailing surrogate pairs do not overlap:

I see!

>> Are the individual WideChars stored big or little endian?

> ... usually whatever your current hardware platform prefers.

Am I correct that "U+" is just a prefix indicating for Unicode
representation in hexadecimal. Is U+D801DC01 a surrogate pair?
Or is it written U+D801, U+DC01?

>> If little endian in Intel RAM, how are they stored in disk
>> "text" files and communicated over wires?

> That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

> All UTF16 strings that go "over the wire" or onto disk should be
> prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the
> byte order of the following data.

I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.

Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $41, $00, $42, $00
And this one valid, too?
Little-endian: $FF, $FE, $00, $41, $00, $42
And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042.

TIA, JohnH

Thorsten Engler [NexusDB]

Jul 21, 2008, 6:55:09 PM
John Herbster wrote:

> Am I correct that "U+" is just a prefix indicating Unicode
> representation in hexadecimal? Is a surrogate pair written
> U+D801, U+DC01?
Yes.

> I do not understand this. If I have MyAnsiString = 'AB' and assign
> it to MyWideString in RAM on a PS with an Intel CPU, then I presume
> that I have in increasing memory addresses $41, $00, $42, and $00,
> or if you please U+0041 and U+0042.

Yes.

> Now if I sent this to a file, is this byte sequence valid?

> Big-endian: $FE, $FF, $00, $41, $00, $42
Yes.

> And this one valid, too?

> Little-endian: $FF, $FE, $41, $00, $42, $00
Yes.

> And if so, wouldn't the U+ representation in either case be
> U+FEFF, U+0041, U+0042.

Yes, I was mistaken. It is always U+FEFF, which can be FF FE or FE FF
depending on the endianness. I shouldn't have used U+FFFE, except to say
that "The Unicode value U+FFFE is guaranteed never to be assigned as a
Unicode character; this implies that in a Unicode context the 0xFF,
0xFE byte pattern can only be interpreted as the U+FEFF character
expressed in little-endian byte order (since it could not be a U+FFFE
character expressed in big-endian byte order)."

--

John Herbster

Jul 21, 2008, 6:43:05 PM
(Correction)

"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote

> UTF16 has the huge advantage that the values for singletons and leading
> and trailing surrogate pairs do not overlap:

I see!

>> Are the individual WideChars stored big or little endian?

> ... usually whatever your current hardware platform prefers.



Am I correct that "U+" is just a prefix indicating Unicode
representation in hexadecimal? Is a surrogate pair written
U+D801, U+DC01?

>> If little endian in Intel RAM, how are they stored in disk
>> "text" files and communicated over wires?

> That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

> All UTF16 strings that go "over the wire" or onto disk should be
> prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the
> byte order of the following data.

I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.

Now if I sent this to a file, is this byte sequence valid?


Big-endian: $FE, $FF, $00, $41, $00, $42

And this one valid, too?
Little-endian: $FF, $FE, $41, $00, $42, $00

And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042.

TIA, JohnH

Ivan

Jul 21, 2008, 7:10:41 PM
>
> UTF16 has the huge advantage that the values for singletons and leading
> and trailing surrogate pairs do not overlap:

Now the advantage over UTF-8 is finally becoming clear. Thanks so much
Thorsten, very helpful as usual.

Remy Lebeau (TeamB)

Jul 21, 2008, 7:35:35 PM

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:48850208$1...@newsgroups.borland.com...

> Then for "surrogate pairs" which require two WideChars for
> their representation, it seems to be that "exactly as before"
> character indexing will require sometimes stepping over
> two WideChars instead of one.

Potentially. But that requirement has existed since WideString was
introduced. It does not change now that UnicodeString is being added. If
you don't need to act on individual codepoints in your code, then you don't
have to worry about treating surrogates separately. Otherwise, you
generally would have to convert from UTF-16 to UTF-32 before you could work
with codepoints correctly anyway.

> Are the individual WideChars stored big or little endian?

WideString and UnicodeString use Big Endian, as that is the default endian
for Intel platforms.

> If little endian in Intel RAM, how are they stored in disk "text"
> files and communicated over wires?

It is the coder's responsibility to handle endian issues in those cases.
That is nothing new.

> What about the surrogate pairs? Is the low or high part of the pair
> at the lower address?

The High surrogate always appears in front of the Low surrogate in the
string, but each individual surrogate in the pair is affected by the endian
used for the entire string. This is clearly described in RFC 2781.

> And ditto for disk files and communications?

That is also the coder's responsibility to handle.

> Does that mean that UTF-16 characters are limited to 4-bytes?

Unicode itself is limited to 4 bytes per codepoint (encoded using UTF-32
and/or UCS4). There is no codepoint defined above $7FFFFFFF yet.

However, UTF-16 is limited to 3-byte codepoints, since the highest codepoint
it can handle is $10FFFF.


Gambit


Remy Lebeau (TeamB)

Jul 21, 2008, 7:44:58 PM

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:488510ec$1...@newsgroups.borland.com...
(Correction)

> Am I correct that "U+" is just a prefix indicating for
> Unicode representation in hexadecimal.

Yes.

> If I have MyAnsiString = 'AB' and assign it to MyWideString
> in RAM on a PS with an Intel CPU, then I presume that I have
> in increasing memory addresses $41, $00, $42, and $00

Yes. That would be UTF-16 in Little Endian.

> Now if I sent this to a file, is this byte sequence valid?
> Big-endian: $FE, $FF, $00, $41, $00, $42

Yes.

> And this one valid, too?
> Little-endian: $FF, $FE, $41, $00, $42, $00

Yes.

> And if so, wouldn't the U+ representation in either case
> be
> U+FEFF, U+0041, U+0042.

Yes, it would.


Gambit


Thorsten Engler [NexusDB]

Jul 21, 2008, 7:56:10 PM
> WideString and UnicodeString use Big Endian, as that is the default
> endian for Intel platforms.
Eh. Little-endian is default on x86. Lowest byte first.
Which is why U+0041 will be $41 $00 in memory. But it doesn't really
matter much either way because in most cases you are not going to
access unicode strings byte by byte.


--

Chad Z. Hower aka Kudzu

Jul 22, 2008, 12:13:31 AM
http://www.kudzuworld.com/blogs/tech/20080722A.aspx

--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/


"Lee Jenkins" <l...@nospam.net> wrote in message
news:4884bbe9$1...@newsgroups.borland.com...
>
> Has anyone posted information concerning do's and dont's for Unicode
> support in upcoming Delphi versions?
>
> It recent threads concerning Delphi/Unicode, I think the topic of being
> prepared for Unicode has not been addressed so much, at least as far as I
> can see.
>
> On one side, we have applications that have already been written whose
> authors are rightfully concerned about compatibility.
>
> On the other side, we have applications which are yet to be written and do
> not have much threat of being
>
> In the middle, we have applications which are currently being written
> (raises hand) which could benefit from some suggestions on best practices
> to give the applications currently being written to have a chance of being
> ported more easily when D2008/9 is finally released.
>
> --
> Warm Regards,
>
> Lee


Lee Jenkins

Jul 22, 2008, 12:23:27 AM
Chad Z. Hower aka Kudzu wrote:
> http://www.kudzuworld.com/blogs/tech/20080722A.aspx
>

Hmmm. So how would things like the following be done?

begin
  FInStream.Seek(0, soFromBeginning);
  SetLength(lString, FInStream.Size);
  FInStream.Read(lString[1], FInStream.Size);
end;

Thanks,

--
Warm Regards,

Lee

Paul Scott

Jul 22, 2008, 6:31:20 AM
On Mon, 21 Jul 2008 19:45:44 +0100, Nick Hodges (Embarcadero)
<nick....@codegear.com> wrote:

>> May I presume like this?
>> p := @MyString[1];
>> Inc(p);
>> where MyStr: string; and p: PChar;
>
> Yes -- just like before.

Err... Nick,

Doesn't this depend on exactly what you mean by "a character" ?


If CG had gone for UTF-32 encoding, there would indeed have been a
one-to-one correspondence between a position in the string and what most
people would think of as "a character" - ie. a CodePoint.

But since (according to the blogs) Tiburon will be using "string=UTF-16",
then any character (CodePoint) which lies outside the BMP (ie. >64K) has
to be represented by a /pair/ of 16-bit CodeValues.

So, since your blog also says that for Tiburon "Char=WideChar=word", one
external "character" may actually need two internal "characters" to
represent it.

Now while most Delphi programs hopefully won't ever see any of these
million or so extended CodePoints, anywhere that a program allocates
memory on the basis of "ExpectedNumberOfChars x SizeOf(Char)", or, like the
code snippet above, presumes that each index position in a "string"
represents "one character" in some writing system, leaves a ticking
timebomb.

--
Paul Scott
Information Management Systems
Macclesfield, UK.

Dave Nottage [TeamB]

Jul 22, 2008, 6:46:36 AM
Nick Hodges (Embarcadero) wrote:

> Exactly as before -- but don't assume a character is of size 1.
>

> > Show us how to load and store a string from and to TStreams.
>
> Exactly as before but you can't assume that the length of a string
> char is 1.
>
> > Show us how to pass strings to and from DLLs.
>
> Just as before, but again, don't assume that Char = 1 byte.

So.. does that mean we should not assume that Char = 1 byte? <g>

--
Dave Nottage [TeamB]

Xavier

Jul 22, 2008, 6:43:43 AM
Remy Lebeau (TeamB) wrote:
> Unicode itself is limited to 4 bytes per codepoint (encoded using UTF-32
> and/or UCS4). There is no codepoint defined above $7FFFFFFF yet.

There are no code points defined above $10FFFF.

> However, UTF-16 is limited to 3-byte codepoints, since the highest codepoint
> it can handle is $10FFFF.

Planes 3 through 13 ($30000–$DFFFF, 720895 code point slots) are
currently unallocated. That's most of the Unicode space, and much of it
is likely to never be filled.

Xavier

Jul 22, 2008, 7:05:49 AM
Lee Jenkins wrote:
> Chad Z. Hower aka Kudzu wrote:
>> http://www.kudzuworld.com/blogs/tech/20080722A.aspx
>>
>
> Hmmm. So how would things like the following be done?
>
> begin
> FInStream.Seek(0, soFromBeginning);
> SetLength(lString, FInStream.Size);
> FInStream.Read(lstring[1], FInStream.Size);
> end;

Urgh. I hope you understand how bad this is and that it's just a dirty
example. Issue #1 is reading a full stream into a string. RAM be damned.
Issue #2 is reading the Size property once for allocation and once for
the read; files can change size between calls to GetSize[1]. I trust I
don't need to write what happens if you try to read 1MB into a 512KB
string.

But if you must:

FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size * SizeOf(Char));
FInStream.Read(lstring[1], FInStream.Size);

Storing the size on a variable would be safer though not exactly better.

[1] for file streams it's even expensive; TStream.GetSize calls Seek
*3 times* for *each* call. And it is SLOW. When I was first learning
streams I made the mistake of referring to Size and Position often (both
of them call Seek). With their use removed the only bottleneck was the
HD throughput.

Henrick Hellström

Jul 22, 2008, 7:18:26 AM
Xavier wrote:
> FInStream.Seek(0, soFromBeginning);
> SetLength(lString, FInStream.Size * SizeOf(Char));
> FInStream.Read(lstring[1], FInStream.Size);

Better:

SetLength(lString, FInStream.Seek(0, soFromEnd) div SizeOf(Char));
FInStream.Seek(0, soFromBeginning);
FInStream.Read(Pointer(lString)^, Length(lString)*SizeOf(Char));

Firstly, such code should account for the possibility that the stream is
empty. Referencing lString[1] when Length(lString) = 0 is an error.
Referencing zero bytes from nil^ is however perfectly legal.

Secondly, getting your * and div operators right might help. ;)

Thirdly, adjusting the code in accordance with your Size and Position
remarks wasn't that hard. :)

Thorsten Engler [NexusDB]

Jul 22, 2008, 8:04:48 AM
Paul Scott wrote:

> Doesn't this depend on exactly what you mean by "a character" ?

Exactly the same thing that an AnsiString meant by "a character". Where
you could have DBCS and MBCS.

> But since (according to the blogs) Tiburon will be using
> "string=UTF-16", then any character (CodePoint) which lies outside

> the BMP (ie. >64K) has to be represented by a pair of 16-bit
> CodeValues.
UTF-16 is the native type for unicode character data on the Windows
platform. Any Windows API taking or returning unicode strings uses
UTF-16. Internally ALL string data under NT based Windows versions (NT,
2000, and newer) is using UTF-16. All the ANSI APIs still offered are
just wrappers which perform the conversion into UTF-16 and call the
actual native API.

The use of any other encoding as default for unicode strings on Windows
does not make sense.

> So, since your blog also says that for Tiburon "Char=WideChar=word",
> one external "character" may actually need two internal "characters"
> to represent it.

Again, this is no different at all from DBCS and MBCS ANSI encodings.

The huge difference is that in UTF-16 the value ranges for singleton,
leading and trailing units do NOT overlap. That means that in almost
all cases you can ignore the fact that UTF-16 contains surrogate pairs.

In DBCS/MBCS ANSI you have to parse the strings always starting from
the beginning, because you might need to skip ahead if you hit a lead
byte, since one of the trailing bytes might have a value that on its
own would represent a valid singleton (with a totally different meaning
than this trailing byte in combination with its lead byte).

In UTF-16 a trailing unit seen on its own can never ever be mistaken
for a different singleton value.

> Now while most Delphi programs hopefully won't ever see any of these
> million or so extended CodePoints, anywhere that a program allocates
> memory on the basis of "ExpectedNumberOfChars x SizeOf(Char)" or,
> like the code snippet above, might presume that each index position
> in a "string" represents "one character" in some writing system
> leaves a ticking timebomb.

Again, Delphi doesn't have any more or fewer problems with this than any
other application on the Windows platform. UTF16 is the native Unicode
encoding on Windows. Using UTF-16 is much, much less of a problem than
trying to support ANSI for DBCS or MBCS codepages.

--

John Herbster

Jul 22, 2008, 8:31:38 AM
Nick,

Just a few more questions to help with the book you are writing. <g>

"Nick Hodges (Embarcadero)" <nick....@codegear.com> wrote

>> Show us how to iterate through a string of characters with indexes.

> Exactly as before.

What about surrogate pairs? Do you consider such a pair to be two
characters or one?

>> Show us how to load and store a string from and to TStreams.

> Exactly as before but you can't assume that the length of a string
> char is 1.

Does the Length() function return the number of characters or number
of words. Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
give the required number of bytes (not including a terminator) in a
TStream?

>> Show us how to replace a character.

> Exactly as before.

What if we are replacing a pair with a singleton?

>> Show us how to make literal constants and assign them to strings.

> Exactly as before.

Can we use the U+ representations in constant statements?
Where do we find or look up some of the more common U+
character definitions?

Regards, JohnH


Thorsten Engler [NexusDB]

Jul 22, 2008, 8:48:26 AM
John Herbster wrote:

I'm not Nick, but...

> What about surrogate pairs? Do you consider such a pair to be two
> characters or one?

There is no difference between surrogate pairs in UTF-16 and
leading/trailing bytes you can already encounter in ANSI if the current
codepage happens to be a DBCS/MBCS. If anything, surrogate pairs are
much easier to handle, because there is no overlap between singleton,
surrogate leading and trailing units.

> Does the Length() function return the number of characters or number
> of words.

Just the same as with AnsiStrings or dynamic arrays. It returns the
number of elements.

> Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
> give the required number of bytes (not including a terminator) in a
> TStream?

It should, but that code is dangerous as MyUniCodeString[1] would raise
an exception if the string is empty.
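
For illustration, a minimal sketch that sidesteps the empty-string problem by
combining Length() with SizeOf(Char) and the Pointer cast (a hypothetical
helper; uses Classes):

  procedure SaveStringToStream(const S: string; Stream: TStream);
  begin
    // Pointer(S) is nil for an empty string, but with a byte count of 0
    // WriteBuffer never touches the buffer, so this is safe, unlike S[1].
    Stream.WriteBuffer(Pointer(S)^, Length(S) * SizeOf(Char));
  end;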

> What if we are replacing a pair with a singleton?

Absolutely no different from handling DBCS/MBCS in ANSI already.

> Can we use the U+ representations in constant statements?
> Where do we find the or look up some of the more common U+
> character definitions.

You can use the usual # syntax that delphi uses to specify individual
characters by their byte value. Keep in mind that U+ is Hex and # is
decimal.

Lucian Radulescu

Jul 22, 2008, 9:07:47 AM
If I have old code like:

NameRecType = packed record
  Gender  : Byte;
  SurName : String[70];
  FirName : String[20];
  MidNames: String[30];
  DOB     : TDateTime;
end;

and I blockread/blockwrite vars of NameRecType from untyped files, do I
need to change anything, or will it just compile so that I don't have to
worry about it?

Lucian

Thorsten Engler [NexusDB]

Jul 22, 2008, 9:20:52 AM
Lucian Radulescu wrote:

The shortstring type has not changed in any way.

--

Paul Scott

Jul 22, 2008, 9:16:30 AM
On Tue, 22 Jul 2008 13:04:48 +0100, Thorsten Engler [NexusDB]
<thorste...@nexusdb.com> wrote:

> Again, Delphi doesn't have any more or less problems with this then any
> other application on the Windows platform. UTF16 is the native unicode
> encoding on windows. Using UTF-16 is much much less of a problem then
> trying to support ANSI for DBCS or MBCS codepages.

Thorsten,

I don't disagree with anything you said - and it was probably all obvious
to those who have been struggling with multiple code pages in the past.

But for the rest of us with applications in "English", there's more to
"Unicodifying an Application" than just a recompilation - even with a
liberal sprinkling of "SizeOf(Char)"

Nick Hodges (Embarcadero)

Jul 22, 2008, 9:34:58 AM
Thorsten Engler [NexusDB] wrote:

> I'm not Nick, but...

..you are doing a much better job explaining it than I could. ;-)

--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges

Chad Z. Hower aka Kudzu

Jul 22, 2008, 9:36:41 AM
"Lee Jenkins" <l...@nospam.net> wrote in message
news:4885...@newsgroups.borland.com...

> Hmmm. So how would things like the following be done?
>
> begin
> FInStream.Seek(0, soFromBeginning);
> SetLength(lString, FInStream.Size);
> FInStream.Read(lstring[1], FInStream.Size);
> end;

Aside from the comments posted about your general technique, you should not
do this with Unicode anyways. This assumes a read of "raw" UTF-16. With
Unicode anytime you convert from binary to string or the other way you
should always specify (or in some cases use a function that can determine
for you) the source (or destination if writing) encoding. Tiburon, like .NET,
will no doubt have many such functions.

Example 1

Expect that Streams that allow reading/writing using strings will have
optional parameters to specify encoding type of the binary, and if none will
default to ANSI.

Example 2

Expect that there will be conversion routines - i.e. the ability to convert
strings to binary streams/arrays using ANSI, UTF-8, and more.

The key point is that any time you go from string to binary you should never
read the memory directly anymore, but instead use a function to do the
conversion and vice versa.

Maël Hörz

Jul 22, 2008, 9:31:10 AM
> Urgh. I hope you understand how bad this is and that it's just a dirty
> example. Issue #1 is reading a full stream into a string. RAM be damned.
> Issue #2 is reading the Size property once for allocation and once for
> the read; files can change size between calls to GetSize[1].
When reading a file you should lock it anyway to get consistent data.
fmShareDenyWrite will ensure nothing goes wrong and the size stays the same.

BTW, I never had bottlenecks due to the use of the Position or Size
properties; maybe you read chunks which are too small in size?

John Herbster

Jul 22, 2008, 10:12:44 AM
"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote

>> What about surrogate pairs? Do you consider such a pair to be two
>> characters or one?

> There is no difference between surrogate pairs in UTF-16 and
> leading/trailing bytes you can already encounter in ANSI if the current
> codepage happens to be a DBCS/MBCS.

I have never used the DBCS/MBCS code page
-- I think that this may be part of my communication problem.

>> Does the Length() function return the number of characters
>> or number of words.

> Just the same as with AnsiStrings or dynamic arrays.
> It returns the number of elements.

Is there a standard function returning the number of "characters"?



>> Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
>> give the required number of bytes (not including a terminator) in a
>> TStream?

> It should, but .. MyUniCodeString[1] would raise
> an exception if the string is empty.

Of course.



>> What if we are replacing a pair with a singleton?

> Absolutely no different from handling DBCS/MBCS in ANSI already.

Whatever DBCS/MBCS is. <g>



>> Can we use the U+ representations in constant statements?

> You can use the usual # syntax that delphi uses to specify individual
> characters by their byte value. Keep in mind that U+ is Hex and # is
> decimal.

Was that a *no*?

>> Where do we find the or look up some of the more common U+
>> character definitions.

I appreciate your help. Rgds, JohnH

Lee Jenkins

Jul 22, 2008, 10:14:57 AM
Chad Z. Hower aka Kudzu wrote:
> "Lee Jenkins" <l...@nospam.net> wrote in message
> news:4885...@newsgroups.borland.com...
>> Hmmm. So how would things like the following be done?
>>
>> begin
>> FInStream.Seek(0, soFromBeginning);
>> SetLength(lString, FInStream.Size);
>> FInStream.Read(lstring[1], FInStream.Size);
>> end;
>
> Aside from the comments posted about your general technique, you should not
> do this with Unicode anyways. This assumes a read of "raw" UTF-16. With
> Unicode anytime you convert from binary to string or the other way you
> should always specify (or in some cases use a function that can determine
> for you) the source (or destination if writing) encoding. Tiburon like .NET
> will no doubt have many such functions.

That would be nice. Like TStringBuilder...?

> Example 1
>
> Expect that Streams that allow reading/writing using strings will have
> optional parameters to specify encoding type of the binary, and if none will
> default to ANSI.
>
> Example 2
>
> Expect that there will be conversino routines - ie ability to convert
> strings to binary streams/arrays using ANSI, UTF-8, and more.
>
> The key point is that any time you go from string to binary you should never
> read the memory directly anymore, but instead use a function to do the
> conversion and vice versa.
>

What methods are available to do this now? I know I can create a TStringList
for instance and use its load/save from/to stream methods, but are there
specific methods or classes to convert between these now?


--
Warm Regards,

Lee

Lee Jenkins

Jul 22, 2008, 10:16:45 AM
Xavier wrote:
> Lee Jenkins wrote:

>
> But if you must:
>
> FInStream.Seek(0, soFromBeginning);
> SetLength(lString, FInStream.Size * SizeOf(Char));
> FInStream.Read(lstring[1], FInStream.Size);
>
> Storing the size on a variable would be safer though not exactly better.
>
> [1] for file streams it's even expensive; TStream.GetSize calls Seek
> *3 times* for *each* call. And it is SLOW. When I was first learning
> streams I made the mistake of referring to Size and Position often (both
> of them call Seek). With their use removed the only bottleneck was the
> HD throughput.

Great stuff. Thank you,

--
Warm Regards,

Lee

Thorsten Engler [NexusDB]

Jul 22, 2008, 10:37:56 AM
John Herbster wrote:

> Is there a standard function returning the number of "characters"?

Not that I know of. But it's largely pointless. There are hardly any
cases when you really need to concern yourself with surrogate pairs.

If you really want to count the number of codepoints in an UTF-16
string (but what for?) it's pretty straightforward. Start at the
beginning; any singleton counts as 1. Any leading unit must be followed
by a trailing unit or the string is invalid (the same is true in reverse:
any trailing unit must be preceded by a leading unit or the string is
invalid). If you find a leading/trailing pair, count it as 1.
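
For illustration, a minimal sketch of that counting loop (a hypothetical helper,
assuming the new Char = WideChar and a string that is valid UTF-16):

  function CountCodePoints(const S: string): Integer;
  var
    I: Integer;
  begin
    Result := 0;
    I := 1;
    while I <= Length(S) do
    begin
      if (S[I] >= #$D800) and (S[I] <= #$DBFF) then
        Inc(I, 2)   // lead unit: skip its trail unit as well
      else
        Inc(I);     // a singleton (a lone trail unit would mean the string is invalid)
      Inc(Result);
    end;
  end;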

But that doesn't tell you how many glyphs that string would have if
it's displayed. There is no 1:1 relationship between codepoints and
glyphs (the things you recognize as individual elements when
represented visually).

Thing is, in all cases where you are currently not confronted with
double byte character sets or multi byte character sets with ANSI, you
won't be confronted with surrogate pairs either.

And in the overwhelming majority of cases where you ARE confronted with
DBCS/MBCS in ANSI, you are STILL not going to be confronted with
surrogate pairs in UTF-16.

> Whatever DBCS/MBCS is. <g>
That's the thing you have to work with in any but the most trivial ANSI
codepages. Double Byte Character Sets, Multi Byte Character Sets.

Pretty much the same as with surrogate pairs, just that you can have
more than 1 trailing byte, and the possible values of trailing bytes
can overlap with the values of singletons. So if you don't always parse
the strings from the beginning and react properly to the lead bytes
to interpret the following trailing bytes differently, you might
totally misinterpret the trailing bytes as singletons with totally
different meaning.

> >> Can we use the U+ representations in constant statements?
>
> > You can use the usual # syntax that delphi uses to specify
> > individual characters by their byte value. Keep in mind that U+ is
> > Hex and # is decimal.
>

> Was that a no?
That was a "you can use what you've always used" in Delphi.

> >> Where do we find the or look up some of the more common U+
> >> character definitions.

Start > Accessories > System Tools > Character Map. Knock yourself out
(pay close attention to the status bar).


--

Craig Stuntz [TeamB]

Jul 22, 2008, 10:47:11 AM
John Herbster wrote:

> Is there a standard function returning the number of "characters"?

John, the concept of what is or is not a "character" is very much
language-specific. So, when asking such questions, it is good to be
very specific about what problem you are trying to solve.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz
Please read and follow Borland's rules for the user of their
server: http://support.borland.com/entry.jspa?externalID=293

Chad Z. Hower aka Kudzu

Jul 22, 2008, 10:52:46 AM
> John, the concept of what is or is not a "character" is very much
> language-specific. So, when asking such questions, it is good to be
> very specific about what problem you are trying to solve.

For example - Chinese, in which a character is a word. And believe it or not,
there are more complex scenarios in other languages...

Chad Z. Hower aka Kudzu

Jul 22, 2008, 10:57:18 AM
"Lee Jenkins" <l...@nospam.net> wrote in message
news:4885eb5f$1...@newsgroups.borland.com...

>> Unicode anytime you convert from binary to string or the other way you
>> should always specify (or in some cases use a function that can determine
>> for you) the source (or destination if writing) encoding. Tiburon like
>> .NET will no doubt have many such functions.
>
> That would be nice. Like TStringBuilder...?

No. StringBuilder is for working with text in strings - because in .NET
strings cannot be changed. So StringBuilder is a class you can use when you
want to do string manipulations and extract a string when done.

Imagine something like:

MyBytes = ASCIIEncoding.GetBytes(MyString);

That's how .NET does it. Tiburon might do something like that, or maybe
something like:

MyBytes := UnicodeStringToASCIIBytes(MyString);

or maybe it takes a parameter that specifies what encoding to use: ASCII,
ANSI, UTF-8, UTF-32, etc...

> What methods are available to do this now? I know I can create a
> TStringList for instance and use its load/save from/to stream methods, but
> are there specific methods or classes to convert between these now?

Now as in <= Delphi 2007, or Tiburon? Surely Tiburon has methods, but I can't
discuss them except in generic ways, as I did above.

In <= Delphi 2007 there isn't Unicode, so not really... There are some
sources here and there that do some encodings from widestrings etc. Indy did
some too, but not Unicode until Tiburon.

Thorsten Engler [NexusDB]

Jul 22, 2008, 11:10:48 AM
Chad Z. Hower aka Kudzu wrote:

> For example - Chinese which a character is a word. And believe it or
> not, there are more complex scenarios in other languages...

Actually, you are referring to visual representation here. That's a
Glyph; it's rather common in Unicode that multiple codepoints together
contribute to the visual representation of a single glyph (but again, in
cases where you are not confronted with DBCS/MBCS when using ANSI, you
are not going to be confronted with this in Unicode either).

I would like to strongly recommend that everyone heads over to

http://www.unicode.org/versions/Unicode5.1.0/

And read the Introduction:

http://www.unicode.org/versions/Unicode5.0.0/ch01.pdf

and at least fly over the General Structure:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

These 2 chapters together should answer 99% of any question someone
here might have about Unicode.


--

Chad Z. Hower aka Kudzu

Jul 22, 2008, 11:42:44 AM
"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote in message
news:4885...@newsgroups.borland.com...

> Actually, you are refering to visual representation here. That's a
> Glyph, it's rather common in Unicode that multiple codepoints together
> contribute to the visual representation of a single glyph (but again,
> cases where you are not confronted with DBCS/MBCS when using ANSI you
> are not going to be confronted with this in Unicode either).

I know that's the case for Simplified; is it the same for Traditional?

John Herbster

Jul 22, 2008, 12:18:48 PM
"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote

>> Is there a standard function returning the number of "characters"?

> Not that I know of. But it's largely pointless. There are hardly any
> cases when you really need to concern yourself with surrogate pairs.

> If you really want to count the number of codepoints in an UTF-16

> ...


> But that doesn't tell you how many glyphs that string would have if
> it's displayed. There is no 1:1 relationship between codepoints and
> glyphs (the things you recognize as individul elements on when
> represented visually).

Oh -- More definition problems:
What is a "character", if it exists, in this brave new world?
What is a "glyph"?
What is a "codepoint"?



> Thing is, in all cases where you are currently not confronted with
> double byte character sets or multi byte character sets with ANSI,
> you won't be confronted with surrogate pairs either.

> And in the overwhelming majority of cases where you ARE confronted with
> DBCS/MBCS in ANSI, you are STILL not going to be confronted with
> surrogate pairs in UTF-16.

>> Whatever DBCS/MBCS is. <g>

> That's the thing you have to work with in any but the most trivial ANSI
> codepages. Double Byte Character Sets, Multi Byte Character Sets.

Thank goodness for trivial!

> ...

>> >> Can we use the U+ representations in constant statements?

>> > You can use the usual # syntax that delphi uses to specify
>> > individual characters by their byte value. Keep in mind that U+ is
>> > Hex and # is decimal.

>> Was that a no?

> That was a "you can use what you've always used" in Delphi.

I conclude for now, that one does not ever need to use "U+"
representations in consts.



>> >> Where do we find the or look up some of the more common U+
>> >> character definitions.

> Start > Accessories > System Tools > Character Map.

> (pay close attention to the status bar).

Thorsten, Thank you.

I hope that Unicode will help us to communicate better. Will
Unicode do anything for decimal points and thousand separators?
Will it help formatting messages in newsgroup readers?

Regards, JohnH

Thorsten Engler [NexusDB]

Jul 22, 2008, 12:15:40 PM
Chad Z. Hower aka Kudzu wrote:

> I know thats the case for Simplified, is it the same for traditional?

I wasn't specifically talking about Traditional and Simplified Chinese
there but Unicode and complex scripts in general.

There exist about 70,000 codepoints for fully composed Han Ideographs.
But there are also mechanisms for decomposing them into individual
components (up to 16 can make up a single composed Ideograph), as well
as many more Ideographs which can only be described in a decomposed
fashion using Ideographic Description Sequences.

Both Simplified and Traditional Chinese (as well as Japanese, Korean
and, to some extent, Vietnamese) take their visual representations from
these Han Ideographs, with many Ideographs used in most of these languages
(but not always with the same meaning).

As far as Unicode is concerned there are no specific Traditional or
Simplified characters. Most Ideographs used in Simplified Chinese are
also used in Traditional Chinese (there are some Ideographs used in
Simplified which are, well, simplified versions of Traditional
Ideographs, but mostly the simplification is based on using a single
Ideograph to represent the meaning of different Traditional Ideographs).

There are a lot more different Ideographs used in Traditional Chinese,
so it's more often in Traditional than Simplified that no precomposed
codepoint exists and a single Ideograph needs to be represented with an
Ideographic Description Sequence.

Cheers,
Thorsten

--

Thorsten Engler [NexusDB]

Jul 22, 2008, 12:37:00 PM
John Herbster wrote:

> Oh -- More definition problems:
> What is a "character", if it exists, in this brave new world?
> What is a "glyph"?
> What is a "codepoint"?

To quote myself:

I would like to strongly recommend that everyone heads over to

http://www.unicode.org/versions/Unicode5.1.0/

And read the Introduction:

http://www.unicode.org/versions/Unicode5.0.0/ch01.pdf

and at least fly over the General Structure:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

These 2 chapters together should answer 99% of any question someone
here might have about Unicode.

> Thank goodness for trivial!
If you have no intention of supporting anything that you didn't support
so far using just an English ANSI codepage, then Unicode is
absolutely trivial for you. As far as you are concerned, the only difference
is basically going to be that chars are now two bytes instead of one,
and the high byte is basically always going to be 0 for you.

Even if you want to support any of the more complex scripts, you usually
don't have to concern yourself with any of the details of Unicode. The
user enters a string? Store it the way it is. You want to display it?
Hand it to the Windows API the way it is. Simple. You want to compare 2
strings? Pass them to the CompareStringW API.

There are a few windows APIs that you might not have used before that
can be helpful:

NormalizeString:
http://msdn.microsoft.com/en-us/library/ms776395(VS.85).aspx

FoldString:
http://msdn.microsoft.com/en-us/library/cc709430(VS.85).aspx

LCMapString/LCMapStringEx:
http://msdn.microsoft.com/en-us/library/ms776290(VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms776387(VS.85).aspx

The parameters might look a bit scary, but I'm pretty sure that Tiburon
will come with some form of nicer wrappers for such things.

> I conclude for now, that one does not ever need to use "U+"
> representations in consts.

Use #$ instead of U+

--

Jens Mühlenhoff

Jul 22, 2008, 12:25:28 PM
Thorsten Engler [NexusDB] wrote:
>
>>>> Can we use the U+ representations in constant statements?
>>> You can use the usual # syntax that delphi uses to specify
>>> individual characters by their byte value. Keep in mind that U+ is
>>> Hex and # is decimal.
>> Was that a no?
> That was a "you can use what you've always used" in Delphi.
>

To put it another way, #$1A3B (hexadecimal character) is the Delphi
representation of U+1A3B. The "U+" notation would really be superfluous here.
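
For example (illustrative constant names):

  const
    Ellipsis   = #$2026;          // U+2026 HORIZONTAL ELLIPSIS
    NonBMPChar = #$D801#$DC01;    // U+10401, written as its two UTF-16 code units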

--
Regards
Jens

Lee Jenkins

Jul 22, 2008, 1:17:11 PM
Chad Z. Hower aka Kudzu wrote:
>
> In <= Delphi 2007 there isnt Unicode so not really.... There are some
> sources here and there that do some encodings from widestrings etc. Indy did
> some too, but not Unicode until Tiburon.
>
>

Sorry Chad, should have been more specific. I was referring to saving a stream
to a string. I hadn't run across any objects to do that (other than
TStringList, etc). I was just curious.

Thanks,

--
Warm Regards,

Lee

Chad Z. Hower aka Kudzu

Jul 22, 2008, 1:27:37 PM
> Sorry Chad, should have been more specific. I was referring to saving a
> stream to a string. I hadn't run across any objects to do that (other
> than TStringList, etc). I was just curious.

In such existing methods, in the absence of passing new parameters or using new
overloads, I would expect them to do translations to and from ANSI / ASCII
by default. So for English, and possibly most languages with Latin-based
characters, they generally should work without changes.

Remy Lebeau (TeamB)

Jul 22, 2008, 1:26:43 PM

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:4885d31d$1...@newsgroups.borland.com...

> What about surrogate pairs? Do you consider such a pair
> to be two characters or one?

When indexing through a String, they are separate *physical* characters -
they are each stored in a separate WideChar, and can thus be directly
indexed individually. To retrieve a *logical* character (aka a Unicode
codepoint), they have to be interpreted together. This is no
different than using MBCS or DBCS in AnsiString.

> Does the Length() function return the number of characters
> or number of words.

The meaning of Length() has not changed. It has always returned the number
of *physical* entries, not the number of *logical* entities. That is the
same whether it is used with an AnsiString, UnicodeString, dynamic array,
etc. In the case of a UnicodeString, that means the number of physically
allocated WideChar entries, not the number of encoded Unicode codepoints.

> Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1]) give
> the required number of bytes (not including a terminator) in a TStream?

Yes.

>> Show us how to replace a character.
>
> Exactly as before.

> What if we are replacing a pair with a singleton?

Then you would have to pull the string apart and put it back together
manually - the same as you would have had to do with MBCS and DBCS in
AnsiString. For example:

if IsHighSurrogate(TheString[Index]) then
  TheString := Copy(TheString, 1, Index-1) + TheSingleton +
    Copy(TheString, Index+2, MaxInt)
else if IsLowSurrogate(TheString[Index]) then
  TheString := Copy(TheString, 1, Index-2) + TheSingleton +
    Copy(TheString, Index+1, MaxInt)
else
  TheString[Index] := TheSingleton;

> Can we use the U+ representations in constant statements?

I am not sure, but I think UnicodeString constants have to be UTF-16
compliant. Unless you specify them in UCS4/UTF-32 and let the compiler
convert for you at runtime.

> Where do we find or look up some of the more common U+ character
> definitions?

The Unicode standard.


Gambit


Remy Lebeau (TeamB)

unread,
Jul 22, 2008, 1:29:39 PM7/22/08
to

"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:4885ead4$1...@newsgroups.borland.com...

> Whatever DBCS/MBCS is. <g>

MBCS = Multi-Byte Character Set

DBCS = Double-Byte Character Set.


Gambit


Remy Lebeau (TeamB)

unread,
Jul 22, 2008, 1:35:37 PM7/22/08
to

"Xavier" <n...@spam.com> wrote in message
news:4885...@newsgroups.borland.com...

> There are no code points defined above $10FFFF.

Why do UTF-32 and UCS4 exist if all available codepoints fit within the
UTF-16 space?

> Planes 3 through 13 ($30000–$DFFFF, 720895 code point slots)
> are currently unallocated. That's most of the Unicode space, and
> much of it is likely to never be filled.

Older UTF-8 specs I have seen defined rules for handling codepoints up to
$7FFFFFFF (encoded up to 6 bytes in UTF-8). I just looked at the latest RFC
for UTF-8 and it has dropped any mention of codepoints above $10FFFF (thus
encoding up to 4 bytes instead). Interesting.
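
For reference, the RFC 3629 length rules boil down to roughly this (a sketch;
Utf8ByteCount is a made-up helper):

function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3   // note: $D800..$DFFF are surrogates, not valid codepoints
  else if CodePoint <= $10FFFF then
    Result := 4
  else
    Result := 0;  // outside the Unicode range, invalid under RFC 3629
end;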


Gambit


Remy Lebeau (TeamB)

unread,
Jul 22, 2008, 1:40:44 PM7/22/08
to

"Chad Z. Hower aka Kudzu" <cha...@hower.org> wrote in message
news:4885f55a$1...@newsgroups.borland.com...

> Surely Tiburon has methods, but I can't discuss them except in generic ways
> as I did above.

The new AEncoding parameter of LoadFrom...() and SaveTo...() methods, such
as in TStrings, and the new TEncoding class, have already been publicly
blogged about by CodeGear employees.
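
Based on what has been blogged, usage should look roughly like this (a sketch;
the exact overloads may differ in the shipping product - TStringList is in
Classes, TEncoding in SysUtils):

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.LoadFromFile('input.txt', TEncoding.UTF8);    // read the file as UTF-8
    SL.SaveToFile('output.txt', TEncoding.Unicode);  // write it back as UTF-16
  finally
    SL.Free;
  end;
end;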


Gambit


Lee Grissom

unread,
Jul 22, 2008, 7:12:02 PM7/22/08
to
What Font is the VCL defaulting to? Can anyone give advice on what
fonts to use/avoid?
--
Lee


Xavier

unread,
Jul 22, 2008, 8:31:08 PM7/22/08
to
Remy Lebeau (TeamB) wrote:
> "Xavier" <n...@spam.com> wrote in message
> news:4885...@newsgroups.borland.com...
>
>> There are no code points defined above $10FFFF.
>
> Why do UTF-32 and UCS4 exist if all available codepoints fit within the
> UTF-16 space?

Because they allow encoding any code point with a single code unit.

John Herbster

unread,
Jul 22, 2008, 8:38:16 PM7/22/08
to
Codepoints, code points, code units, elements, glyphs, physical,
logical, words, bytes, surrogates and other pairs, leading, trailing,
and occasionally characters. I need a glossary.

Thorsten Engler [NexusDB]

unread,
Jul 22, 2008, 8:40:36 PM7/22/08
to
John Herbster wrote:

Again, might I refer you to www.unicode.org?
They have pretty decent documentation.

--

TJC Support

unread,
Jul 22, 2008, 8:58:28 PM7/22/08
to
"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:48867d6c$1...@newsgroups.borland.com...

>Codepoints, code points, code units, elements, glyphs, physical,
>logical, words, bytes, surrogates and other pairs, leading, trailing,
>and occasionally characters. I need a glossary.

I need a beer.... :^)

Cheers,
Van


Q Correll

unread,
Jul 22, 2008, 9:07:13 PM7/22/08
to
TJC,

| I need a beer.... :^)

Passing a virtual XX to TJC... <clink>

--
Q

07/22/2008 18:06:49

XanaNews Version 1.17.5.7 [Q's Salutation mod]

TJC Support

unread,
Jul 22, 2008, 9:34:22 PM7/22/08
to
"Q Correll" <qcor...@pacNObell.net> wrote in message
news:48868441$1...@newsgroups.borland.com...

> TJC,
>
> | I need a beer.... :^)
>
> Passing a virtual XX to TJC... <clink>

Ah, that's better! And after a 2 day debugging session, too. Thanks!

Van


Maël Hörz

unread,
Jul 23, 2008, 2:53:06 PM7/23/08
to
> Why do UTF-32 and UCS4 exist if all available codepoints fit within
> the UTF-16 space?
Because it might be more efficient (for example, for indexing) and easier to
use, since you don't have to deal with surrogates and basically a DWORD = a
char.
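
To illustrate: counting logical code points in a UTF-16 string means skipping
the second half of each surrogate pair (a sketch using raw ordinal checks
rather than any particular RTL helper):

function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    Inc(Result);
    // A high surrogate ($D800..$DBFF) is followed by a low surrogate;
    // together the two WideChars encode one code point above $FFFF.
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) then
      Inc(I, 2)
    else
      Inc(I);
  end;
end;

With a UCS-4 representation the count is simply the element count, which is
the convenience described above.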

> Older UTF-8 specs I have seen defined rules for handling codepoints
> up to $7FFFFFFF (encoded up to 6 bytes in UTF-8). I just looked at
> the latest RFC for UTF-8 and it has dropped any mention of codepoints
> above $10FFFF (thus encoding up to 4 bytes instead). Interesting.

If I remember correctly, it was about how much can technically be
encoded using UTF-8 versus what part of that possible range will really be
used. And since UTF-32/16/8 should all be able to encode the same Unicode
codepoints, a range was chosen that can be represented in any UTF
encoding.

Remy Lebeau (TeamB)

unread,
Jul 23, 2008, 3:45:50 PM7/23/08
to

"Remy Lebeau (TeamB)" <no....@no.spam.com> wrote in message
news:48861aca$1...@newsgroups.borland.com...

> Older UTF-8 specs I have seen defined rules for handling codepoints
> up to $7FFFFFFF (encoded up to 6 bytes in UTF-8).

I double-checked and I was thinking of the UTF-8 encoding originally defined
in ISO/IEC 10646. UTF-8 was later updated in RFC 3629 to limit the valid
range of codepoints to match the formal definition in the Unicode standard,
which limits codepoints to a max of $10FFFF. So realistically, UTF-8
encoding of either UCS or Unicode will never have more than 4 bytes, but
ISO/IEC 10646 did define encoding rules for extra codepoints that have since
been restricted in Unicode and are no longer valid.


Gambit


Allen Bauer (CodeGear)

unread,
Jul 25, 2008, 5:19:33 PM7/25/08
to
Paul Scott wrote:

> On Tue, 22 Jul 2008 13:04:48 +0100, Thorsten Engler [NexusDB]
> <thorste...@nexusdb.com> wrote:
>
> > Again, Delphi doesn't have any more or less problems with this then
> > any other application on the Windows platform. UTF16 is the native
> > unicode encoding on windows. Using UTF-16 is much much less of a
> > problem then trying to support ANSI for DBCS or MBCS codepages.
>
> Thorsten,
>
> I don't disagree with anything you said - and it was probably all
> obvious to those who have been struggling with multiple code pages
> in the past.
>
> But for the rest of us with applications in "English", there's more
> to "Unicodifying an Application" than just a recompilation - even
> with a liberal sprinkling of "SizeOf(Char)"

If all your application ever expected to get was "English," just
because strings now support a much larger set of codepoints doesn't
suddenly mean that your application will magically begin to get all
kinds of characters outside the normal English characters in the ASCII
range. Only if your application were deployed to some region where
there is the possibility of non-English characters being encountered
will you run into an issue. Rebuilding an application should not alter
its existing use-cases.

So as long as your application continues to get deployed into
environments that never encounter non-English characters, things should
just continue to work.

--
Allen Bauer
CodeGear/Embarcadero
Chief Scientist
http://blogs.codegear.com/abauer

Q Correll

unread,
Jul 25, 2008, 5:50:37 PM7/25/08
to
Allen,

| So as long as your application continues to get deployed into
| environments that never encounter non-English characters, things should
| just continue to work.

Where can I get a pair of those rose-colored glasses? ;-)

--
Q

07/25/2008 14:50:14

Allen Bauer (CodeGear)

unread,
Jul 25, 2008, 6:17:19 PM7/25/08
to
Q Correll wrote:

> Allen,
>
> > So as long as your application continues to get deployed into
> > environments that never encounter non-English characters, things
> > should just continue to work.
>
> Where can I get a pair of those rose-colored glasses? ;-)

My point was that just because you have a potentially more capable
application doesn't mean that it will magically begin to use that
capacity when faced with the same use-cases and scenarios it always has. If
a user always just entered English string data into an application,
having a new version of that application that accepts a wider range of
characters will not mean that this same user will just being to enter
(or even know how to enter) non-English data.

Q Correll

unread,
Jul 25, 2008, 8:22:21 PM7/25/08
to
Allen,

Yes, I understand.

But my "point" was that things rarely seem to work out as simply as we
expect. <g>

--
Q <keeping fingers crossed>

07/25/2008 17:20:32

Thorsten Engler [NexusDB]

unread,
Jul 25, 2008, 10:18:37 PM7/25/08
to
Q Correll wrote:

> Allen,
>
> Yes, I understand.
>
> But my "point" was that things rarely seem to work out as simply as
> we expect. <g>

I have to say I'm with Allen here: if your application has only been
confronted with standard English characters in the past, then just
switching out that application in the same environment for one which
is Unicode-enabled means you are still only going to be confronted with
standard English characters.

--

Q Correll

unread,
Jul 25, 2008, 11:42:47 PM7/25/08
to
Thorsten,

| I have to say I'm with Allen here, if your application has only been
| confronted with standard english characters in the past then just
| switching out that application in the same environment with one which
| is unicode enabled you are still only going to be confronted with
| standard english characters.

I yield to the gurus! <g>

--
Q

07/25/2008 20:42:17

willr

unread,
Jul 26, 2008, 12:04:23 PM7/26/08
to
Q Correll wrote:
> Allen,
>
> | So as long as your application continues to get deployed into
> | environments that never encounter non-English characters, things should
> | just continue to work.
>
> Where can I get a pair of those rose-colored glasses? ;-)
>

I think they are available at most programming classes -- and "extra
rose coloured" are available at "design" classes.


--
Will R
PMC Consulting

Q Correll

unread,
Jul 26, 2008, 12:58:32 PM7/26/08
to
willr,

| I think they are available at most programming classes -- and "extra
rose coloured" are available at "design" classes.

<chuckle>

Regardless of what Allen and Thorsten feel will be the case, I'd still put
money on the phenomenon of "unintended consequences" popping up. <g> It
may not be with the display of any given character set, but my gut says
something unexpected will surface. Eight million lines of code is very
difficult to keep in the box when one starts messing with it. ;-)

--
Q

07/26/2008 09:53:16

David Erbas-White

unread,
Jul 26, 2008, 1:34:24 PM7/26/08
to


Yes, but they take them away when you actually complete your first real
'working' program...

David Erbas-White

Nick Hodges (Embarcadero)

unread,
Jul 26, 2008, 1:32:23 PM7/26/08
to
Thorsten Engler [NexusDB] wrote:

> I never said that porting to code from ANSI to Unicode will be without
> any issues and gotchas.

Or, put another way, chances are you won't even notice that you are
using Unicode strings. ;-)

--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges

Thorsten Engler [NexusDB]

unread,
Jul 26, 2008, 1:35:24 PM7/26/08
to
Q Correll wrote:

> I know, I know. <g> Allen used the noun "things." I was just
> having a bit of "fun" jabbing at his high optimism. I fully realize
> the "things" Allen meant were related to the display of the
> characters. But there may be some, if not many, other things that
> may be affected by the changes. Especially until Tiburon is
> significantly debugged. ;-)

Read Allen's post again. He was specifically talking about what types
of characters your code might get confronted with.

Supporting Unicode is *not* the *cause* of being confronted with new
types of characters.

If, because of some external change, your application should be
confronted with characters it has never seen before, then Unicode will
help you to better cope with them.


I prefer Allen working on Tiburon to being jabbed at for fun ;)

--

Thorsten Engler [NexusDB]

unread,
Jul 26, 2008, 1:12:24 PM7/26/08
to
Q Correll wrote:

> Regardless of what Allen and Thorsten feel will be the case, I'd
> still put money on the phenomenon of "unintended consequences"
> popping up. <g> It may not be with the display of any given
> character set, but my gut says something unexpected will surface.
> Eight million lines of code is very difficult to keep in the box when
> one starts messing with it. ;-)

I never said that porting code from ANSI to Unicode will be without
any issues and gotchas.

But that is NOT what Allen was talking about.

What he and I were saying is that if your current ANSI application is
only confronted with simple, single-byte characters (coming from Win
API calls, or typed in by the user and so on), then converting your
application to Unicode will NOT suddenly expose you to things like
surrogate pairs.

Only if the environment in which your application runs changes could
it find itself confronted with surrogate pairs.

But in any possible case where your Unicode application would be
confronted with surrogate pairs, the same application as an ANSI
application would be confronted with double- or even multi-byte
character sets, which are much more difficult to handle than Unicode
with surrogate pairs.


--

Q Correll

unread,
Jul 26, 2008, 1:29:13 PM7/26/08
to
Thorsten,

| But that is NOT what Allen was talking about.

I know, I know. <g> Allen used the noun "things." I was just having a
bit of "fun" jabbing at his high optimism. I fully realize the "things"
Allen meant were related to the display of the characters. But there may
be some, if not many, other things that may be affected by the changes.
Especially until Tiburon is significantly debugged. ;-)


--
Q

07/26/2008 10:21:25

Ivan

unread,
Jul 26, 2008, 1:38:41 PM7/26/08
to
To me Q Correll's original post sounded like a joke and the amount of back and forth it generated
added to the fun...

Q Correll

unread,
Jul 26, 2008, 4:28:46 PM7/26/08
to
Thorsten,

Sorry I pushed one of your hot buttons. That was NOT my intent.

I think we can let it drop now.

--
Q <Remembering the joke... "...Whew! I don't think I could take a dollar's worth of that!" ;->

07/26/2008 13:25:55

Q Correll

unread,
Jul 26, 2008, 4:29:46 PM7/26/08
to
Ivan,

| To me Q Correll's original post sounded like a joke

Which was my intent.

--
Q

07/26/2008 13:29:04

TJC Support

unread,
Jul 26, 2008, 4:41:15 PM7/26/08
to
"Q Correll" <qcor...@pacNObell.net> wrote in message
news:488b88fe$1...@newsgroups.borland.com...

>
> I think we can let it drop now.

Aw, c'mon Q, remember where you are. You can't get off _that_ easy! :^)

Cheers,
Van


TJC Support

unread,
Jul 26, 2008, 4:54:09 PM7/26/08
to
"Nick Hodges (Embarcadero)" <nick....@codegear.com> wrote in message
news:488b5fa7$1...@newsgroups.borland.com...

>
> Or, put another way, chances are you won't even notice that you are
> using Unicode strings. ;-)

Hi Nick,

I think Q and I both learned a lot of lessons from the school of hard
knocks. And while I don't expect any serious problems to crop up due to
your changes in the way things work, the phrase "there's always something"
comes to mind. :^) I expect a few issues will probably crop up
in two areas. One is my own sloppy programming. I built my primary
application over 10 years ago and have maintained and added to it over the
years. I've learned a lot in the last 10 years, and if I started the
project over from scratch, it would be a much better piece of code, both
from the architecture/design standpoint and the code quality standpoint. So
there are probably going to be some of those places in the code where I made
bad decisions about string & character handling that'll rear their ugly
heads. And two, I use the old TurboPower libraries extensively, and there's
nobody maintaining those now. I did take a quick look at Systools the other
day, and it looks like they were pretty careful about declaring strings as
AnsiString. I was concerned, because a lot of the string handling routines
are written in assembler, but I think they may be okay. But if I have a
_lot_ of problems with those libraries, I'll probably end up having to
abandon them in favor of more up to date stuff that's maintained.

At any rate, I look forward to my first upgrade since D7. The good news for
me is that I don't have a schedule for switching over, so it can take as
long as it takes to work through the code to get it in shape before I switch
to D2009 for production code.

Cheers,
Van Swofford
Tybee Jet Corp.


Q Correll

unread,
Jul 26, 2008, 5:30:44 PM7/26/08
to
TJC, (I apologize that my simplistic name-capture doesn't come up with
"Van." <g> If you could get your newsreader client to post your signature
with the standard "--" line ahead of "Van Swofford" it might work. <g>)

| I think Q and I both learned a lot of lessons from the school of hard
knocks.

Yep. <g>

| 1 is my own sloppy programming. I built my primary application over 10
years ago and have maintained and added to it over the years. I've
learned a lot in the last 10 years, and if I started the project over from
scratch, it would be a much better piece of code, both from the
architecture/design standpoint and the code quality standpoint. So there
are probably going to be some of those places in the code where I made bad
decisions about string & character handling that'll rear up their ugly
heads.

That's also a ditto for me. ("It just grew and grew and grew..." ;-)

I do think, however, I may have more potential problems with my old
components than my own code. I also use Orpheus4, a TP product as per
your concern #2.


--
Q

07/26/2008 14:22:48

Q Correll

unread,
Jul 26, 2008, 5:20:56 PM7/26/08
to
TJC,

| Aw, c'mon Q, remember where you are. You can't get off that easy! :^)

<chuckle> Ah, yaaass indeedy,... there is that.

--
Q

07/26/2008 14:19:27

Ray Porter

unread,
Jul 26, 2008, 7:05:46 PM7/26/08
to

"Thorsten Engler [NexusDB]" <thorste...@nexusdb.com> wrote in message
news:488a897d$1...@newsgroups.borland.com...

>
> I have to say I'm with Allen here, if your application has only been
> confronted with standard english characters in the past then just
> switching out that application in the same environment with one which
> is unicode enabled you are still only going to be confronted with
> standard english characters.
>

One thing I haven't heard made clear yet is the possible impact on database
reads/writes. We use Oracle 10G and will eventually move to SQL Server.
I'll ask our DBA, but I'm fairly certain our database is currently configured
for ANSI (a Latin char set). Will I need to ansify our database reads/writes
(or at least the writes), or will TADODataset and its descendants handle
the situation transparently?

Thanks,
Ray Porter


TJC Support

unread,
Jul 26, 2008, 6:51:19 PM7/26/08
to
"Q Correll" <qcor...@pacNObell.net> wrote in message
news:488b...@newsgroups.borland.com...

> TJC, (I apologize that my simplistic name-capture doesn't come up with
> "Van." <g> If you could get your newsreader client to post your signature
> with the standard "--" line ahead of "Van Swofford" it might work. <g>)

I'm about to switch to Xananews as soon as I get my new machine going,
hopefully in the next week. That _should_ fix it. :^)

> That's also a ditto for me. ("It just grew and grew and grew..." ;-)

Hehehe, yep, I know that one.

> I do think, however, I may have more potential problems with my old
> components than my own code. I also use Orpheus4, a TP product as per
> your concern #2.

Yeah, all my UI stuff is Orpheus4, and all my string stuff is Systools. And
"security" is OnGuard. That's in quotes because OG ain't all that secure,
but it's good enough for my audience. And my DB is BTree Filer. You might
say I'm pretty thoroughly TP'd. :^)

Cheers,
Van


Thorsten Engler [NexusDB]

unread,
Jul 26, 2008, 7:19:53 PM7/26/08
to
Ray Porter wrote:

> One thing I haven't heard made clear yet is possible impact on database
> reads/writes. We use Oracle 10G and will eventually move to SQL Server. I'll
> ask our DBA but I'm fairly certain our database is currently configured for
> ansi (Latin char set). Will I need to ansify our database reads/writes (or
> at least the writes) or will the TADODataset and its descendents handle the
> situation transparently?

All that should just continue to work. You just keep using TField.AsString to
read/write, and the classes will take care that whatever ends up in the database
is in the right format.
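
In other words, code along these lines should not need to change (ADOQuery1 is
just a hypothetical TADOQuery here):

var
  CustomerName: string;
begin
  // Reading and writing through TField.AsString stays the same
  CustomerName := ADOQuery1.FieldByName('NAME').AsString;
  ADOQuery1.Edit;
  ADOQuery1.FieldByName('NAME').AsString := CustomerName;
  ADOQuery1.Post;
end;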


--

David Erbas-White

unread,
Jul 26, 2008, 7:25:31 PM7/26/08
to
Q Correll wrote:
>
> I do think, however, I may have more potential problems with my old
> components than my own code. I also use Orpheus4, a TP product as per
> your concern #2.
>
>


Ditto on the Orpheus and SysTools, and AsyncPro is still one of the best
communications libraries around for serial transfers...

David Erbas-White

Q Correll

unread,
Jul 26, 2008, 7:48:34 PM7/26/08
to

| I'm about to switch to Xananews as soon as I get my new machine going,
hopefully in the next week. That should fix it. :^)

TJC,

Yes, I think it will. ;-)

| You might say I'm pretty thoroughly TP'd. :^)

Yes I would.

I will be working on converting O4. Perhaps we can exchange notes when
the time comes?

--
Q

07/26/2008 16:34:18

XanaNews Version 1.18.1.11 [Leonel's & Q's Mods]

David Erbas-White

unread,
Jul 26, 2008, 8:03:35 PM7/26/08
to
Q Correll wrote:
>
> | I'm about to switch to Xananews as soon as I get my new machine going,
> hopefully in the next week. That should fix it. :^)
>
> TJC,
>
> Yes, I think it will. ;-)
>
> | You might say I'm pretty thoroughly TP'd. :^)
>
> Yes I would.
>
> I will be working on converting O4. Perhaps we can exchange notes when
> the time comes?
>


I'd be interested in hearing what other components you use (if
third-party) to replace them. I'm also doing what I can to get away
from using the SysTools package...

David Erbas-White
