Preparing for Unicode strings in Delphi...

Kristofer Skaug

Nov 4, 2007, 5:38:29 PM
I just read the latest Delphi roadmap (http://dn.codegear.com/article/36620)
with the description of the upcoming "Tiburon" release of Delphi where
<quote> The standard string in the Delphi language will become a Unicode
string </quote>.
A question in this regard, from someone who doesn't particularly (i.e. not at
all!) need Unicode: will it be possible (by a compiler switch or some such)
to switch OFF this Unicode feature so that my standard "string" will still
consist of simple 1-byte ASCII characters? What is the "compatibility plan"
for this? Are there simple things we could/should do right now to prep our
code base, to keep it from turning into a blob of boiling lava under the new
Unicode regime?

--
Kristofer


Jolyon Smith

Nov 4, 2007, 8:11:52 PM
In article <472e...@newsgroups.borland.com>, Kristofer Skaug says...

> <quote> The standard string in the Delphi language will become a Unicode
> string </quote>.

I am also intrigued at this easy-as switch to Unicode... in the past
far more learned commentators than I, with an "inside perspective", have
opined that the change to Unicode would be of such a great impact that
it could only sensibly be considered as part of a bigger, similarly
impactful change, such as (and specifically) a NEW 64-bit VCL.

http://hallvards.blogspot.com/2005/08/danny-thorpe-on-unicode-and-vcl.html

--
JS
TWorld.Create.Free;

Rudy Velthuis [TeamB]

Nov 4, 2007, 7:16:11 PM
Kristofer Skaug wrote:

I don't know. I guess you can still use AnsiStrings, but they'll have
to be declared as such, like in .NET.
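
A minimal sketch of what "declared as such" could look like, assuming (as guessed above) that AnsiString, AnsiChar and PAnsiChar stay available once the default string changes. The program and variable names are made up for illustration; the point is only that byte-oriented data is typed explicitly instead of relying on what "string" happens to mean.

```delphi
program ExplicitAnsiDemo;
{$APPTYPE CONSOLE}

var
  Wire: AnsiString;  // hypothetical byte-oriented data, e.g. a protocol line
  P: PAnsiChar;
begin
  Wire := 'GET / HTTP/1.0';
  P := PAnsiChar(Wire);

  // These hold regardless of what the default "string" type is,
  // because the one-byte types are named explicitly.
  Assert(SizeOf(AnsiChar) = 1);
  Assert(Length(Wire) = 14);
  Assert(P[0] = 'G');

  WriteLn('ok');
end.
```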

--
Rudy Velthuis [TeamB]

"Future historians will be able to study at the Jimmy Carter
Library, the Gerald Ford Library, the Ronald Reagan Library,
and the Bill Clinton Adult Bookstore." -- George Carlin

Rudy Velthuis [TeamB]

Nov 4, 2007, 7:29:07 PM
Jolyon Smith wrote:

While what Danny says is true, I would not be surprised to see a NEW
UnicodeString type, which leaves WideString where it is (i.e. still
implemented as OleStr). This new string type could be reference
counted, have copy on demand semantics, and be a string of WideChars.
Of course it should be possible to assign them to WideStrings and
AnsiStrings and vice versa, of course with the same or similar
restrictions as today between WideString and AnsiString.

Sure, if the VCL uses UnicodeString throughout, some interfaces will
have to change. Not sure if that is a bad thing, though, since you'll
have to see that change for Win64 anyway, so that would actually be a
good preparation.
--
Rudy Velthuis [TeamB]

"I'm desperately trying to figure out why kamikaze pilots wore
helmets." -- Dave Edison.

Jolyon Smith

Nov 4, 2007, 9:25:45 PM
In article <xn0fdbn9w...@newsgroups.borland.com>, Rudy Velthuis
[TeamB] says...

> While what Danny says is true, I would not be surprised to see a NEW
> UnicodeString type, which leaves WideString where it is (i.e. still
> implemented as OleStr).

Who is talking about WideString?!?

You did READ the CodeGear quote??

<quote> The standard string in the Delphi language will become a
Unicode string </quote>.


See the words "standard string will become Unicode", i.e. "String" will
be synonymous with the new Unicode string type.

Of course it makes sense to leave WideString semantics alone, but nobody
is talking about, or even concerned about THAT. And of course the new
Unicode String type should be reference counted and all that lovely
stuff.

BUT NOT ACTUALLY THE POINT


> Sure, if the VCL uses UnicodeString throughout

That ALSO is a completely separate matter, although not entirely.

e.g. TStringList and TStrings raise some interesting questions.

Presumably a Unicode VCL will expose (e.g.) combo-box items as Unicode
strings, but then what does that mean for application code that is
expecting (perhaps even relying on) a TStrings of ANSI strings?

The one thing we've learned about considering a move to Unicode is that
the more you think about it, the more you realise you haven't properly
thought about it, and the harder it gets.

I haven't yet met anyone who could say "yeah, sure, we thought it was
going to be tough, but you know what, it actually turned out to be
reeeal easy"

--
JS
TWorld.Create.Free;

Rudy Velthuis [TeamB]

Nov 4, 2007, 8:35:06 PM
Jolyon Smith wrote:

> In article <xn0fdbn9w...@newsgroups.borland.com>, Rudy Velthuis
> [TeamB] says...
>
> > While what Danny says is true, I would not be surprised to see a NEW
> > UnicodeString type, which leaves WideString where it is (i.e. still
> > implemented as OleStr).
>
> Who is talking about WideString?!?

Danny Thorpe, in the article you quoted. As you can see above, I even
mentioned him right at the beginning of what I wrote. IOW, what I wrote
clearly referred to what he said.

In the context of string becoming Unicode, he says:

<<
WideStrings are currently implemented by Delphi as OLEStr, aka BStrs,
allocated using the SysAllocString Win32 API. These are not reference
counted, and are rather promiscuous in copying themselves for every
reference. Clearly, the Delphi WideString implementation needs to be
changed to a reference counted WideString to save memory and
performance if WideString is to become the primary string data type.
>>

I don't think the WideString implementation should become the primary
string data type. I think a new UnicodeString should, as I wrote.

The CG roadmap says merely:

<<
Delphi Win32 Unicode This means that the IDE, the VCL, and all types of
development should be made fully Unicode-compatible. The standard
string in the Delphi language will become a Unicode string, meaning
that the IDE, the VCL - that is, the entire product - will be
Unicode-based. Developers around the world will be able to develop
applications for use in any language using the Unicode standard.
>>

They are not saying that this Unicode string is not WideString, AFAICS.
Danny Thorpe clearly seemed to think WideString would be the new
standard Unicode string type.
--
Rudy Velthuis [TeamB]

"Most people would sooner die than think; in fact, they do so."
-- Bertrand Russell (1872-1970)

Rudy Velthuis [TeamB]

Nov 4, 2007, 8:37:33 PM
Jolyon Smith wrote:

> <quote> The standard string in the Delphi language will become a
> Unicode string </quote>.
>
> See the words "standard string will become Unicode". i.e. "String"
> will be synonymous for the new Unicode string type.

Who says it is a NEW Unicode string type? It COULD merely mean that
WideString would become the standard string type. I doubt this, though.
--
Rudy Velthuis [TeamB]

"Love: The warm feeling you get towards someone who meets your
neurotic needs."

Remy Lebeau (TeamB)

Nov 5, 2007, 1:29:55 AM

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fdbp2c...@newsgroups.borland.com...

> Who says it is a NEW Unicode string type?

CodeGear did, actually. Have a look at the recent "Forward planning for
Unicode VCL" discussion in the "borland.public.cppbuilder.non-technical"
newsgroup. Specifically, Alisdair Meredith (recently hired C++ product
manager) stated the following:

"... customers today (generally) do not have unicode enabled source code
today, and MUST be able to re-compile
their applications with the product out-of-the-box, or at least we should
not force expensive rewrites to support Unicode before the product can be
used. Hopefully, if you don't care about Unicode, you will not be affected.

"It is clear that there will be at least 2 string types, AnsiString and
a new UnicodeString type. There is also a System::String typedef that today
maps to AnsiString and will probably map to UnicodeString in the future."

> It COULD merely mean that WideString would
> become the standard string type.

It won't. WideString does not have the same semantics as the current
(Ansi)String type, so it would be a poor choice as a replacement. Hence the
new UnicodeString that will have the same semantics as AnsiString does today
so that it can be a direct replacement.


Gambit


Rudy Velthuis [TeamB]

Nov 5, 2007, 6:11:38 AM
Remy Lebeau (TeamB) wrote:

>
> "Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
> news:xn0fdbp2c...@newsgroups.borland.com...
>
> > Who says it is a NEW Unicode string type?
>
> CodeGear did, actually.

Not in the roadmap or in what Hallvard quoted from Danny. IOW, not in
the texts discussed here, until now.

> Have a look at the recent "Forward planning
> for Unicode VCL" discussion in the
> "borland.public.cppbuilder.non-technical" newsgroup. Specifically,
> Alisdair Meredith (recently hired C++ product manager) stated the
> following:

Cool. I hadn't seen that yet.

> "It is clear that there will be at least 2 string types,
> AnsiString and a new UnicodeString type. There is also a
> System::String typedef that today maps to AnsiString and will
> probably map to UnicodeString in the future."
>
> > It COULD merely mean that WideString would
> > become the standard string type.
>
> It won't. WideString does not have the same semantics as the current
> (Ansi)String type, so it would be a poor choice as a replacement.

Well, that is exactly why Danny thought it might be problematic finding
all spots where WideString and its being based on OleStr could be a
problem. Like I also said, a new UnicodeString type would change that.
I would however not get rid of WideString, I would just keep it for
backward compatibility.
--
Rudy Velthuis [TeamB]

"The optimist proclaims that we live in the best of all possible
worlds, and the pessimist fears this is true."
-- James Branch Cabell

Jens Mühlenhoff

Nov 5, 2007, 7:13:13 AM
Kristofer Skaug wrote:
> consist of simple 1-byte ASCII characters? What is the "compatibility plan"
> for this? Are there simple things we could/should do right now to prep our

I'm just guessing here:

One option for Codegear would be to ship two Win32 versions of the VCL
and RTL:

- The old non-Unicode version
- The new Unicode-enabled version

That would be a clean way to support old programs not ready for Unicode.
The old legacy version could then be dropped in some future release.

It seems logical to me: they still carry around Win3.1 controls and
the BDE, so why not a legacy version of the VCL and RTL?

--
Regards
Jens

Rudy Velthuis [TeamB]

Nov 5, 2007, 6:45:39 AM
Jens Mühlenhoff wrote:

> One option for Codegear would be to ship two Win32 versions of the
> VCL and RTL:
>
> - The old non-Unicode version
> - The new Unicode-enabled version

Now I very much doubt they would do that.
--
Rudy Velthuis [TeamB]

"Manuscript: something submitted in haste and returned at
leisure." -- Oliver Herford (1863-1935)

Jens Mühlenhoff

Nov 5, 2007, 9:39:59 AM
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> One option for Codegear would be to ship two Win32 versions of the
>> VCL and RTL:
>>
>> - The old non-Unicode version
>> - The new Unicode-enabled version
>
> Now I very much doubt they would do that.

But what will they do instead? They already said that non-Unicode
applications should still compile just fine. I can't think of a way to
achieve this with a completely reworked VCL and RTL.

--
Regards
Jens

Rudy Velthuis [TeamB]

Nov 5, 2007, 12:33:36 PM
Jens Mühlenhoff wrote:

> > > One option for Codegear would be to ship two Win32 versions of the
> > > VCL and RTL:
> > >
> > > - The old non-Unicode version
> > > - The new Unicode-enabled version
> >
> > Now I very much doubt they would do that.
>
> But what will they do instead? They already said that non-Unicode
> applications should still compile just fine.

Sure. I'm sure you can use AnsiString for that. Today, you can use
WideStrings throughout your entire program and still use the VCL, can't
you? You'll lose something in the conversion from WideString to
AnsiString, of course.

Now, the other way around should also be possible. The loss by
conversion is just the other way around, too.
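
The loss being described can be sketched with a round trip. This is only an illustration: the program and variable names are invented, and whether data is actually lost depends on the system ANSI code page of the machine it runs on, so no particular output is claimed.

```delphi
program LossyConversionDemo;
{$APPTYPE CONSOLE}

var
  Wide, RoundTrip: WideString;
  Narrow: AnsiString;
begin
  Wide := 'Hi ';
  // U+4E2D is a CJK ideograph that Western ANSI code pages cannot encode.
  Wide := Wide + WideChar($4E2D);

  // WideString -> AnsiString goes through the system ANSI code page.
  Narrow := AnsiString(Wide);
  // AnsiString -> WideString widens whatever survived the first step.
  RoundTrip := WideString(Narrow);

  if RoundTrip = Wide then
    WriteLn('Round trip preserved the text on this code page')
  else
    WriteLn('Round trip lost data');
end.
```

Plain ASCII text survives both directions, which is why English-only data is not affected by the narrowing step.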
--
Rudy Velthuis [TeamB]

"He has all the virtues I dislike and none of the vices I admire."
-- Sir Winston Churchill (1874-1965)

Remy Lebeau (TeamB)

Nov 5, 2007, 2:02:37 PM

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fdc485...@newsgroups.borland.com...

> I would however not get rid of WideString, I would just keep
> it for backward compatibility.

WideString still has its uses for ActiveX/COM development (and I'm sure that
is what it was originally designed for anyway), so I doubt CodeGear will
ever eliminate it.


Gambit


Remy Lebeau (TeamB)

Nov 5, 2007, 2:03:47 PM

"Jens Mühlenhoff" <j.mueh...@accurata.com> wrote in message
news:472f...@newsgroups.borland.com...

> One option for Codegear would be to ship two Win32 versions
> of the VCL and RTL:

That is the exact same thing I suggested to them several years ago, and
mentioned again when this topic came up again a few weeks ago.


Gambit


Remy Lebeau (TeamB)

Nov 5, 2007, 2:06:37 PM

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fdc52e...@newsgroups.borland.com...

> Now I very much doubt they would do that.

Why not? They could even update the RTL to use generics for those RTL
functions that need to support AnsiString and UnicodeString equally. That
way, the code does not have to be duplicated for Ansi/Wide flavors like it
has to right now. Then just set a compiler switch to specify what data type
the 'String' type maps to, and let the compiler figure out the rest.


Gambit


Rudy Velthuis [TeamB]

Nov 5, 2007, 1:08:59 PM
Remy Lebeau (TeamB) wrote:

That is what I meant. But Danny, as quoted in the blog post, said that
WideString would have to change, and that that could cause problems if
people assumed OleStr and WideString were the same. I said that it
would be much easier to create a new UnicodeString type and leave
WideString as it is, and apparently that is the way that was chosen as
well.

--
Rudy Velthuis [TeamB]

"The true measure of a man is how he treats someone who can do
him absolutely no good." -- Samuel Johnson (1709-1784)

Loren Pechtel

Nov 5, 2007, 2:42:30 PM

Or simply a compiler switch as to what a string is, just like we can
switch it to make a string a String[255] for compatibility with even
older code.
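
For reference, the existing switch mentioned here is $H (also spelled $LONGSTRINGS). A small sketch of how it already changes the meaning of "string" today; the type names are invented for the example, and nothing here says a comparable switch will exist for Unicode.

```delphi
program StringSwitchDemo;
{$APPTYPE CONSOLE}

{$H-} // same as {$LONGSTRINGS OFF}: "string" means ShortString here
type
  TShortDefault = string;

{$H+} // same as {$LONGSTRINGS ON}: "string" is the long string type
type
  TLongDefault = string;

begin
  // ShortString is a length byte followed by 255 characters.
  Assert(SizeOf(TShortDefault) = 256);
  // A long string variable is just a pointer to reference-counted data.
  Assert(SizeOf(TLongDefault) = SizeOf(Pointer));
  WriteLn('ok');
end.
```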

Jolyon Smith

Nov 5, 2007, 2:50:17 PM
In article <xn0fdc485...@newsgroups.borland.com>, Rudy Velthuis
[TeamB] says...

> Not in the roadmap or in what Hallvard quoted from Danny. IOW, not in
> the texts discussed here, until now.

And heaven forfend that we should bring knowledge gleaned from outside
this NG into these NG's eh? Sheesh.

But the statement that "the standard string type will become a Unicode
string" says all that needs to be said. Everything else is just
clarification of details (and an opportunity for you to nit-pick, as
ever).


> Well, that is exactly why Danny thought it might be problematic finding
> all spots where WideString and its being based on OleStr could be a
> problem.

Using WideString isn't what would be problematic - finding ANY code that
relies on String being an ANSI string and which now finds itself dealing
with Unicode (whether it wants to or needs to or not!) is the problem.

> Like I also said, a new UnicodeString type would change that.
> I would however not get rid of WideString, I would just keep it for
> backward compatibility.

Which is the whole problem with changing String to a unicode string.

Who knows how much code out there relies on String being a single
byte ANSI string (and perhaps doesn't even realise that it relies on
it)?


My fear is that someone in CG thought that it shouldn't be too hard to
simply change the "built-in" String type to a Unicode string, given that
the change from "short" string to "long" string was pretty seamless and
this would be the same basic idea.

But it just ain't that easy.

--
JS
TWorld.Create.Free;

Rudy Velthuis [TeamB]

Nov 5, 2007, 2:07:48 PM
Jolyon Smith wrote:

> In article <xn0fdc485...@newsgroups.borland.com>, Rudy Velthuis
> [TeamB] says...
>
> > Not in the roadmap or in what Hallvard quoted from Danny. IOW, not
> > in the texts discussed here, until now.
>
> And heaven forfend that we should bring knowledge gleaned from
> outside this NG into these NG's eh? Sheesh.

Bullshit. We were discussing these two pieces, and nothing more. I was
only replying to them, and no other "knowledge" was part of the topic,
at that point.

Of course now I know more. But not at that point.
--
Rudy Velthuis [TeamB]

"The only difference between me and a madman is that I'm not mad."
-- Salvador Dali (1904-1989)

Jens Mühlenhoff

Nov 6, 2007, 6:01:58 AM
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> But what will they do instead? They already said that non-Unicode
>> applications should still compile just fine.
>
> Sure. I'm sure you can use AnsiString for that. Today, you can use
> WideStrings throughout your entire program and still use the VCL, can't
> you? You'll lose something in the conversion from WideString to
> AnsiString, of course.
>
> Now, the other way around should also be possible. The loss by
> conversion is just the other way around, too.

Perhaps you are right, I didn't look at it from that perspective.

But that would still imply that every piece of code that uses "string"
has to be rewritten to use AnsiString (if it does P[Ansi]Char casts or
direct Character manipulation) or else it won't compile and/or cause
other problems.

So the problem with this approach is that *some* old applications would
not work out of the box.

--
Regards
Jens

Jens Mühlenhoff

Nov 6, 2007, 7:03:23 AM
Rudy Velthuis [TeamB] wrote:
>
> I guess the compiler could give a warning if you are casting a string
> or UnicodeString to PAnsiChar, but not necessarily if you cast it to
> PChar.
>

So what about this code:

MessageBox(0, PChar('Hello World'), nil, 0);

In D2007 this will be compiled as something like:

MessageBoxA(0, PAnsiChar('Hello World'), nil, 0);

When PChar is really a PUnicodeChar, would the D2008 compiler output this?

MessageBoxW(0, PUnicodeChar('Hello World'), nil, 0);

Or would I get that?

MessageBoxA(0, PUnicodeChar('Hello World'), nil, 0);

And even more interesting, will MessageBoxW accept a "PUnicodeChar"
instead of a "PWideChar"? What exactly is the difference of PUnicodeChar
and PWideChar? (I understand the difference between the proposed
UnicodeString and WideString).

--
Regards
Jens

Rudy Velthuis [TeamB]

Nov 6, 2007, 5:46:57 AM
Jens Mühlenhoff wrote:

> Perhaps you are right, I didn't look at it from that perspective.
>
> But that would still imply that every piece of code that uses
> "string" has to be rewritten to use AnsiString (if it does
> P[Ansi]Char casts or direct Character manipulation) or else it won't
> compile and/or cause other problems.

I'm pretty sure that if the standard string type becomes UnicodeString,
the standard Char type will become UnicodeChar or WideChar (like in
.NET), and PChar will probably also change accordingly.

> So the problem with this approach is that some old applications would
> not work out of the box.

I guess the compiler could give a warning if you are casting a string
or UnicodeString to PAnsiChar, but not necessarily if you cast it to
PChar.

And code assuming that Char or PChar^ is one byte in size will indeed
have to be reviewed.
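
A minimal sketch of the kind of code this review would touch: buffer sizes computed from a string. Writing the byte count as Length times SizeOf(Char) compiles the same today and stays correct if Char later becomes two bytes. The program and variable names are only for illustration.

```delphi
program CharSizeDemo;
{$APPTYPE CONSOLE}

var
  S: string;
  Buf: array of Byte;
  ByteCount: Integer;
begin
  S := 'Hello World';

  // Length(S) counts Char elements, not bytes, so scale by SizeOf(Char)
  // instead of assuming one byte per character.
  ByteCount := Length(S) * SizeOf(Char);

  // Copy the raw character data; S is known to be non-empty here.
  SetLength(Buf, ByteCount);
  Move(S[1], Buf[0], ByteCount);

  WriteLn('Chars: ', Length(S), ', bytes: ', ByteCount);
end.
```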

--
Rudy Velthuis [TeamB]

"God is a comedian playing to an audience too afraid to laugh."
-- Voltaire (1694-1778)

Rudy Velthuis [TeamB]

Nov 6, 2007, 6:21:54 AM
Jens Mühlenhoff wrote:

> And even more interesting, will MessageBoxW accept a "PUnicodeChar"
> instead of a "PWideChar"?

Probably. Unless specifically set as option, pointers are not really
checked for type.

--
Rudy Velthuis [TeamB]

"I have had a perfectly wonderful evening, but this wasn't it."
-- Groucho Marx

Rudy Velthuis [TeamB]

Nov 6, 2007, 6:24:52 AM
Jens Mühlenhoff wrote:

It will be compiled as something like:

MessageBox(0, PChar('Hello World'), nil, 0);

(FWIW, since 'Hello World' is a literal, there is no need to cast at
all:

MessageBox(0, 'Hello World', nil, 0);

)

Now, currently, in Windows.pas, MessageBox is linked to MessageBoxA,
and the second parameter is a PChar (which is currently an alias for
PAnsiChar). I'm sure it will be linked to MessageBoxW as soon as
strings are Unicode, and the second parameter of MessageBox will still
be PChar (which will then be an alias for PWideChar, though).
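
The three ways of calling the API discussed here can be put side by side. This is a sketch, assuming the unit is still named Windows and that it keeps declaring MessageBox, MessageBoxA and MessageBoxW; running it shows three message boxes in a row.

```delphi
program MsgBoxDemo;

uses
  Windows;

var
  S: string;
  A: AnsiString;
  W: WideString;
begin
  // Generic form: MessageBox and PChar are resolved together by Windows.pas,
  // so this line does not care whether string is ANSI or Unicode.
  S := 'Hello World';
  MessageBox(0, PChar(S), nil, MB_OK);

  // Explicit ANSI form: always the A entry point with a PAnsiChar.
  A := 'Hello World';
  MessageBoxA(0, PAnsiChar(A), nil, MB_OK);

  // Explicit wide form: always the W entry point with a PWideChar.
  W := 'Hello World';
  MessageBoxW(0, PWideChar(W), nil, MB_OK);
end.
```

Code written in the first form follows whatever the default becomes; code that must stay ANSI would use the second form explicitly.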

Read a bit about Unicode and .par files in my article about
conversions, and you can see how they do this:

http://rvelthuis.de/articles/articles-convert.html#unicode

--
Rudy Velthuis [TeamB]

"Physics is not a religion. If it were, we'd have a much easier
time raising money." -- Leon Lederman

Jens Mühlenhoff

Nov 6, 2007, 9:22:16 AM
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> And even more interesting, will MessageBoxW accept a "PUnicodeChar"
>> instead of a "PWideChar"?
>
> Probably. Unless specifically set as option, pointers are not really
> checked for type.
>

Yes, I'm aware of that, let me rephrase it:

Will MessageBoxW work as expected with PUnicodeChar?

--
Regards
Jens

Jens Mühlenhoff

Nov 6, 2007, 9:31:07 AM
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
> It will be compiled as something like:
>
> MessageBox(0, PChar('Hello World'), nil, 0);
>
> (FWIW, since 'Hello World' is a literal, there is no need to cast at
> all:
>
> MessageBox(0, 'Hello World', nil, 0);
>
> )

Of course it is a literal, but this was a simplified example. This won't
work:

MessageBox(0, S, nil, 0); // S is a UnicodeString

So for the Windows API it has to be cast somehow, and that's what I'm
currently talking about.

>
> Now, currently, in Windows.pas, MessageBox is linked to MessageBoxA,
> and the second parameter is a PChar (which is currently an alias for
> PAnsiChar). I'm sure it will be linked to MessageBoxW as soon as
> strings are Unicode, and the second parameter of MessageBox will still
> be PChar (which will then be an alias for PWideChar, though).
>

If they link all "dual API functions" to the W instead of the A version in
the future, then code that uses AnsiString won't work anymore unless
every API call in an ANSI application is rewritten to use the A version.

So is it going to be as hard to write ANSI applications with the future
Delphi as it is now to write Unicode applications? ;-)

--
Regards
Jens

Rudy Velthuis [TeamB]

Nov 6, 2007, 11:29:01 AM
Jens Mühlenhoff wrote:

I can only guess, but: Of course. Why not?

--
Rudy Velthuis [TeamB]

"Never test for an error condition you don't know how to handle."
-- Steinbach's Guideline for Systems Programmers.

Rudy Velthuis [TeamB]

Nov 6, 2007, 11:31:30 AM
Jens Mühlenhoff wrote:

> If they link all "dual API functions" to the W instead of the A version
> in the future then code that uses AnsiString won't work anymore

No, indeed. The compiler will probably tell you so.

> unless every API call in an ANSI application is rewritten to use the
> A version.

Indeed. But if you simply cast to PChar, there should be no problem,
since PChar will be compatible with PWideChar then. If your strings are
simply declared as "string", I see no problem.
--
Rudy Velthuis [TeamB]

"Throughout American history, the government has said we're in an
unprecedented crisis and that we must live without civil
liberties until the crisis is over. It's a hoax."
-- Yale Kamisar, 1990.

Remy Lebeau (TeamB)

Nov 6, 2007, 12:41:11 PM

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fddi32...@newsgroups.borland.com...

> I'm pretty sure that if the standard string type becomes
> UnicodeString, the standard Char type will become
> UnicodeChar or WideChar (like in .NET), and PChar
> will probably also change accordingly.

I'm guessing that it will be (P)WideChar, since that is already suitable for
Unicode characters. I don't see a need to introduce new (P)UnicodeChar
types.


Gambit


Rudy Velthuis [TeamB]

Nov 6, 2007, 12:31:13 PM
Remy Lebeau (TeamB) wrote:

>
> "Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
> news:xn0fddi32...@newsgroups.borland.com...
>
> > I'm pretty sure that if the standard string type becomes
> > UnicodeString, the standard Char type will become
> > UnicodeChar or WideChar (like in .NET), and PChar
> > will probably also change accordingly.
>
> I'm guessing that it will be (P)WideChar, since that is already
> suitable for Unicode characters.

Indeed.
--
Rudy Velthuis [TeamB]

"Intellectuals solve problems; geniuses prevent them."
-- Albert Einstein

Kristofer Skaug

Nov 8, 2007, 5:11:22 PM
Jens Mühlenhoff wrote:
> That would be a clean way to support old programs not ready for
> Unicode. The old legacy version could then be dropped in some future
> release.

IMO there's nothing "obsolete" about AnsiStrings.
We produce software for an English-speaking audience exclusively,
and have no roadmap whatsoever to internationalize our apps.
The costs of doing so (rewrites, translation, testing), apart from
the Unicode migration headaches, would outweigh the potential
market gains by a huge factor. Okay, so we're in a niche market
but all of our customers speak/read English well enough to
interface with us and use our products. Those who don't, don't.
But we do not intend to employ a whole team of translators to
provide application/technical support in Hebrew, Japanese,
Chinese, Finnish, Portuguese and Hungarian. It won't happen.

So while I appreciate that we are a relative minority in the
Delphi community, I suggest it's not a good idea to throw the
AnsiString away as an historical relic (sort-of like 16-bit ints).

The AnsiString is a "clean" fundamental data type, efficient,
easy to decode and manipulate, and so ubiquitous that
you couldn't make it go away - it's present in textual
contexts everywhere. You can't say something is
"legacy" just because it doesn't consume or produce Unicode.
At worst, it's a "limited" system.

--
Kristofer


Rudy Velthuis [TeamB]

Nov 9, 2007, 7:06:34 AM
Kristofer Skaug wrote:

> Jens Mühlenhoff wrote:
> > That would be a clean way to support old programs not ready for
> > Unicode. The old legacy version could then be dropped in some future
> > release.
>
> IMO there's nothing "obsolete" about AnsiStrings.
> We produce software for an English-speaking audience exclusively,
> and have no roadmap whatsoever to internationalize our apps.
> The costs of doing so (rewrites, translation, testing), apart from
> the Unicode migration headaches, would outweigh the potential
> market gains by a huge factor.

But if you guys use "string" instead of the specific "AnsiString", you
should not notice a difference, except the slightly larger executables,
perhaps. Unicode has no problems with ASCII or Ansi text. <g>

--
Rudy Velthuis [TeamB]

"We don't like their sound, and guitar music is on the way out."
-- Decca Recording Co. rejecting the Beatles, 1962

Rudy Velthuis [TeamB]

Nov 9, 2007, 10:56:26 AM
Jason Burgon wrote:

> "Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
> news:xn0fdhryn...@newsgroups.borland.com...
>
> > But if you guys use "string" instead of the specific "AnsiString",
> > you should not notice a difference, except the slightly larger
> > executables, perhaps. Unicode has no problems with ASCII or Ansi
> > text. <g>
>
> From my experience, the biggest incompatibility will be not with the
> Unicode strings themselves, but with sets that represent subsets of
> characters, such as sets of uppercase, numeric, alphabetic etc, since
> Delphi does not (yet) support huge sets, nor is there an "in"
> user-definable operator either.

That could be a problem, indeed. There is no TSysCharSet for wide
characters yet. I guess code using "set of Char" should, if more than
ASCII is required, use strings instead.

But OTOH, even now, "set of Char" can't contain more than ASCII and the
#128-#255 range.
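
A sketch of the "use strings instead" idea, assuming nothing beyond what the language has today. The helper names and the Vowels constant are invented for the example; neither helper depends on a 256-element set, so both keep their meaning if Char grows to two bytes.

```delphi
program CharClassDemo;
{$APPTYPE CONSOLE}

const
  // Hypothetical character class kept as a string rather than a set,
  // so it is not limited to 256 ordinal values.
  Vowels = 'aeiouAEIOU';

// Range comparison works for any Char width.
function IsAsciiDigit(C: Char): Boolean;
begin
  Result := (C >= '0') and (C <= '9');
end;

// Membership test by searching the class string.
function IsVowel(C: Char): Boolean;
begin
  Result := Pos(C, Vowels) > 0;
end;

begin
  Assert(IsAsciiDigit('7'));
  Assert(not IsAsciiDigit('x'));
  Assert(IsVowel('e'));
  Assert(not IsVowel('z'));
  WriteLn('ok');
end.
```

A linear search through a string is slower than a set lookup, so this trade-off only makes sense where the class really needs characters beyond #255.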

--
Rudy Velthuis [TeamB]

"They show you how detergents take out bloodstains. I think if
you've got a T-shirt with bloodstains all over it, maybe your
laundry isn't your biggest problem." -- George Carlin

Jason Burgon

Nov 9, 2007, 11:28:49 AM
"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fdhryn...@newsgroups.borland.com...

> But if you guys use "string" instead of the specific "AnsiString", you
> should not notice a difference, except the slightly larger executables,
> perhaps. Unicode has no problems with ASCII or Ansi text. <g>

From my experience, the biggest incompatibility will be not with the Unicode
strings themselves, but with sets that represent subsets of characters, such
as sets of uppercase, numeric, alphabetic etc, since Delphi does not (yet)
support huge sets, nor is there a user-definable "in" operator.

--
Jay

Jason Burgon - author of Graphic Vision
http://homepage.ntlworld.com/gvision


yannis

Nov 12, 2007, 4:38:10 AM
After serious thinking Jens Mühlenhoff wrote:

> Kristofer Skaug wrote:
>>
>> So while I appreciate that we are a relative minority in the
>> Delphi community, I suggest it's not a good idea to throw the
>> AnsiString away as an historical relic (sort-of like 16-bit ints).
>>
>
> I didn't suggest to drop the datatypes AnsiString and AnsiChar, but IMO the
> VCL (as in forms, controls, etc.) should not have any AnsiString properties
> in the future anymore (with some exceptions, I wouldn't make TComponent.Name
> a UnicodeString as identifiers in sourcecode should stay
> [a-z,A-Z,_][a-z,A-Z,0-9,_]* anyway).
>
> This still means that a Unicode VCL would consume AnsiStrings just fine
> (there would be only the /automatic/ conversion overhead every time you have
> to assign between AnsiString and UnicodeString).
>

This would be best handled by the end user. It is far better to have an
option where we can choose whether the String data type is Unicode or ANSI,
the same way the huge strings switch works.

Do not forget that many use the existing component property to hold
data that they assume will be in AnsiString and have code that works on
this assumption. There is no way to make the VCL unicode only and not
break existing code.

regards
Yannis.


Jens Mühlenhoff

Nov 12, 2007, 4:12:46 AM
Kristofer Skaug wrote:
>
> So while I appreciate that we are a relative minority in the
> Delphi community, I suggest it's not a good idea to throw the
> AnsiString away as an historical relic (sort-of like 16-bit ints).
>

I didn't suggest dropping the AnsiString and AnsiChar data types, but IMO
the VCL (as in forms, controls, etc.) should not have any AnsiString
properties in the future (with some exceptions: I wouldn't make
TComponent.Name a UnicodeString, as identifiers in source code should stay
[a-z,A-Z,_][a-z,A-Z,0-9,_]* anyway).

This still means that a Unicode VCL would consume AnsiStrings just fine
(there would be only the /automatic/ conversion overhead every time you
have to assign between AnsiString and UnicodeString).

> contexts everywhere. You can't say something is
> "legacy" just because it doesn't consume or produce Unicode.
> At worst, it's a "limited" system.
>

Again I don't say that ANSI or ASCII code is legacy, but an ANSI VCL is.
It doesn't allow users to properly enter text in a portable way.

I don't know what your application does, but you may allow input and
output of user data in their native language; you won't have to
translate your application, only tweak your data handling.

That of course doesn't work very well, if you got a lot of functions
that expect a character to be fixed-size (8 Bit) and/or ANSI encoded.

I think that a Unicode VCL is a great improvement, even if your user
interface is English-only.

Also I wonder if Microsoft won't drop support for the ANSI versions of
the API someday (in the long term).

--
Regards
Jens

Jens Mühlenhoff

unread,
Nov 12, 2007, 6:02:28 AM11/12/07
to
yannis wrote:
>
> This would be best handled by the end user. It is far better to have an
> options where we can choose if the String data type is unicode or Ansi
> the same way the huge strings work.
>

On the compiler-side yes, I agree. But the VCL (+ RTL) can only be
compiled with ANSI xor Unicode support.

> Do not forget that many use the existing component property to hold data
> that they assume will be in AnsiString and have code that works on this
> assumption. There is no way to make the VCL unicode only and not break
> existing code.
>

That's why my first suggestion was to ship both a Unicode and an ANSI VCL
(+ RTL).

The question is, will there be an "ANSI personality" or not. If there is
only a Unicode version then code assuming that "string" equals
"AnsiString" might fail miserably.

Only code that doesn't index into "string"'s or that uses AnsiString
(through implicit conversion from and to UnicodeString) instead will
work (but may still not work as expected) out of the box.

It's not as easy as setting a compiler switch!

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 12, 2007, 6:04:03 AM11/12/07
to
Jens Mühlenhoff wrote:

> I
> wouldn't make TComponent.Name a UnicodeString as identifiers in
> sourcecode should stay [a-z,A-Z,_][a-z,A-Z,0-9,_]* anyway).

It would be silly to use an AnsiString for it, if all other strings are
UnicodeString, IMO.

--
Rudy Velthuis [TeamB]

"Go on, get out. Last words are for fools who haven't said
enough." -- Karl Marx, dying words to his housekeeper.

yannis

unread,
Nov 12, 2007, 7:10:01 AM11/12/07
to
Jens Muhlenhoff pretended :

I think it is.

If all of the VCL is using the String data type, which already equals
AnsiString, then instructing the compiler to change the string data
type to be equal to UnicodeString would create a Unicode VCL on the
spot.
This of course assumes that the proper RTL functions for Unicode will be
used instead of the ANSI ones all over the VCL as well. A matter of
overloading, I think, but I never tried to do it myself, so probably there
is more to it than just overloading the functions/procedures.

Not a simple task to implement and test, far from it, but a simple
choice for the end user.

I think it is doable; I do not know how hard it is to do.

Regards
Yannis.


Jens Mühlenhoff

unread,
Nov 12, 2007, 8:03:12 AM11/12/07
to
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> I
>> wouldn't make TComponent.Name a UnicodeString as identifiers in
>> sourcecode should stay [a-z,A-Z,_][a-z,A-Z,0-9,_]* anyway).
>
> It would be silly to use an AnsiString for it, if all other strings are
> UnicodeString, IMO.
>

Hmm, yes I think you're right. TComponent.Name should become a
UnicodeString even though only a very small subset is valid (there
already is validation code anyway).

Operations on TComponent.Name are of course faster with UnicodeString, I
didn't think about this yet.

Maybe a Unicode VCL can be done without the AnsiString type, but my
point was that the type will still be around for "legacy" applications.

I'm very curious how CodeGear will solve all the problems discussed :-).
Too bad there are no details yet (other than a new UnicodeString type
and Char not being 8-Bit or even fixed size anymore) or does anyone have
more information yet?

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 12, 2007, 6:45:39 AM11/12/07
to
yannis wrote:

> > It's not as easy as setting a compiler switch!
>
> I think it is.

Not really. You would need two designers, one for the Unicode VCL and
one for the Ansi VCL. Then the code should be aware of Char types. If
Char really becomes equivalent to WideChar (like in .NET), I see a few
problems.
--
Rudy Velthuis [TeamB]

"I am not young enough to know everything."
-- Oscar Wilde (1854-1900)

Rudy Velthuis [TeamB]

unread,
Nov 12, 2007, 7:07:44 AM11/12/07
to
Jens Mühlenhoff wrote:

> I'm very curious how CodeGear will solve all the problems discussed
> :-). Too bad there are no details yet (other than a new
> UnicodeString type and Char not being 8-Bit or even fixed size
> anymore) or does anyone have more information yet?

I'm pretty sure Char will still be fixed size. Char <> code point.

--
Rudy Velthuis [TeamB]

"A mathematician is a device for turning coffee into theorems."
-- Paul Erdos

Jens Mühlenhoff

unread,
Nov 12, 2007, 10:25:00 AM11/12/07
to
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> I'm very curious how CodeGear will solve all the problems discussed
>> :-). Too bad there are no details yet (other than a new
>> UnicodeString type and Char not being 8-Bit or even fixed size
>> anymore) or does anyone have more information yet?
>
> I'm pretty sure Char will still be fixed size. Char <> code point.
>

What I wanted to say is that one should not *assume* a fixed size. Have
you ever seen a datatype that didn't have a fixed size (excluding arrays
of course)? *g*

... and I'm not talking about different versions of a compiler or
different platforms here ;-).

On topic: I guess its size is going to be 16-Bit, since they certainly
will use UTF-16 for maximum Windows API compatibility.

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 12, 2007, 11:28:45 AM11/12/07
to
Jens Mühlenhoff wrote:

> Rudy Velthuis [TeamB] wrote:
> > Jens Mühlenhoff wrote:
> >
> > > I'm very curious how CodeGear will solve all the problems
> > > discussed :-). Too bad there are no details yet (other than a
> > > new UnicodeString type and Char not being 8-Bit or even fixed size
> > > anymore) or does anyone have more information yet?
> >
> > I'm pretty sure Char will still be fixed size. Char <> code point.
>
> What I wanted to say is that one should not assume a fixed size.

I think it is very safe to assume a new fixed size: WideChar. The
problem is if people assume that a Char is one byte.
--
Rudy Velthuis [TeamB]

"Do illiterate people get the full effect of alphabet soup?"
-- John Mendoza

Remy Lebeau (TeamB)

unread,
Nov 12, 2007, 2:12:29 PM11/12/07
to

"Jens Mühlenhoff" <j.mueh...@accurata.com> wrote in message
news:4738190f$1...@newsgroups.borland.com...

> Also I wonder if Microsoft won't drop support for the
> ANSI versions of the API someday (in the long term).

They won't do that until they drop support for Win9x/Me altogether. I don't
see that happening anytime soon, since a lot of people still use those
systems.


Gambit


Franz-Leo Chomse

unread,
Nov 12, 2007, 2:22:41 PM11/12/07
to

They aren't supported any longer. In all active OS versions (the NT
family) the A version of the API is just a wrapper around the W
version, and all new APIs are only available in the Unicode version.

Regards from Germany

Franz-Leo

Jolyon Smith

unread,
Nov 12, 2007, 2:50:36 PM11/12/07
to
In article <4738190f$1...@newsgroups.borland.com>, Jens Mühlenhoff says...

> I didn't suggest to drop the datatypes AnsiString and AnsiChar, but IMO
> the VCL (as in forms, controls, etc.) should not have any AnsiString
> properties in the future anymore (with some exceptions, I wouldn't make
> TComponent.Name a UnicodeString as identifiers in sourcecode should stay
> [a-z,A-Z,_][a-z,A-Z,0-9,_]* anyway).

Why make Component.Name an exception (or anything else for that matter)?


> This still means that a Unicode VCL would consume AnsiStrings just fine
> (there would be only the /automatic/ conversion overhead everytime you
> have to assign between AnsiString and UnicodeString).

Which for a non-Unicode application would be every time string data is
exchanged between the app and its GUI, and "automatic" conversion is
asking for trouble.

Here's an ANSI string : you (the compiler) know nothing more than that -
it's an ANSI string. How do you automatically (and more importantly,
correctly) convert that to Unicode?

You need to know a bit more - character set/code page etc not to mention
is it SBCS or DBCS.

You could assume -for automatic conversion- that the system default
char-set should be applied, or I suppose require a compiler setting, as
long as only one char set was workable for an entire application (worry
about packages?, DLLs? another day).
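
A minimal sketch of why the code page matters, assuming a Windows system with code pages 1252 and 1253 installed. The helper name DecodeByte is hypothetical; MultiByteToWideChar is the Win32 call declared in the Windows unit. The same byte $E4 is "ä" (U+00E4) in code page 1252 and "δ" (U+03B4) in code page 1253, so the two calls should print different code unit values.

```pascal
program CodePageSketch;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: decode one ANSI byte under an explicit code page
// and return the resulting UTF-16 code unit (0 if the call fails).
function DecodeByte(B: AnsiChar; CodePage: UINT): Word;
var
  W: WideChar;
begin
  W := #0;
  if MultiByteToWideChar(CodePage, 0, @B, 1, @W, 1) = 1 then
    Result := Ord(W)
  else
    Result := 0;
end;

begin
  // Same input byte, two different code pages.
  WriteLn(DecodeByte(AnsiChar($E4), 1252));
  WriteLn(DecodeByte(AnsiChar($E4), 1253));
end.
```

An automatic conversion has to pick one of these interpretations without being told which one the data was written in.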

The same or similar problems exist going the other way of course.

> > contexts everywhere. You can't say something is
> > "legacy" just because it doesn't consume or produce Unicode.
> > At worst, it's a "limited" system.
>
> Again I don't say that ANSI or ASCII code is legacy, but a ANSI VCL is.
> It doesn't allow users to properly enter text in a portable way.

Not all users NEED to enter text in a portable way. An ANSI VCL is just
as relevant as an ANSI API.

Playing devil's advocate here to an extent, because I want and need
Unicode, but I want it done properly so that the one tool can continue
to make sense for both types of project.

If Delphi goes 100% irrevocably and consistently Unicode then that is
one LESS reason to use Delphi rather than C# (.net being Unicode already
of course).

But a Delphi that enables fully fledged Unicode application development
AND ANSI application development.... now THAT is a clear advantage in
Delphi's favour.

> Also I wonder if Microsoft won't drop support for the ANSI versions of
> the API someday (in the long term).

But until they do....

Of course, they already _have_ in .net and one of the bug-bears in .net
that I have heard mention of is string performance....

A curious coincidence, no?

:)

--
JS
TWorld.Create.Free;

Jens Mühlenhoff

unread,
Nov 13, 2007, 4:05:21 AM11/13/07
to
Jolyon Smith wrote:
>
> Why make Component.Name an exception (or anything else for that matter)?
>

As I already mentioned in another sub-thread, I didn't think about that
properly at first. It doesn't make sense to make exceptions, right.

>
>> This still means that a Unicode VCL would consume AnsiStrings just fine
>> (there would be only the /automatic/ conversion overhead everytime you
>> have to assign between AnsiString and UnicodeString).
>
> Which for a non-Unicode application would be every time string data is
> exchanged between the app and it's GUI, and "automatic" conversion is
> asking for trouble.
>

True, but at the moment (ANSI VCL) Windows NT does the automatic
conversion inside the A functions, which wrap the W functions. That's the
reason you have to set the ANSI codepage for non-Unicode applications in
the Windows locale settings.

Now if the next VCL was Unicode-only, Codegear would just change the
time where it is done to an earlier point.

I know there are even worse problems involved here, but that's all due
to the fact that we have an ANSI window framework on a Unicode platform
in the first place.

>
> Not all users NEED to enter text in a portable way. An ANSI VCL is just
> as relevant as an ANSI API.
>
> Playing devil's advocate here to an extent, because I want and need
> Unicode, but I want it done properly so that the one tool can continue
> to make sense for both types of project.
>
> If Delphi goes 100% irrevocably and consistently Unicode then that is
> one LESS reason to use Delphi rather than C# (.net being Unicode already
> of course).
>

Then you don't need a Unicode OS in the first place ;-).

BTW: Applications using the new Unicode VCL won't run on Windows 95
anymore, right? Will they work on 9x/ME/NT? Will Codegear go through all
the trouble to support these OSes that Microsoft hasn't supported for
years? I don't think so.

You already have to work around problems if you want to write for Windows 98
today. Just drop a TToolBar onto a form and wonder why the program
suddenly depends on some .dll that Windows 98 doesn't provide ...


> But a Delphi that enables fully fledged Unicode application development
> AND ANSI application development.... now THAT is a clear advantage in
> Delphi's favour.
>

But IMO it's a *lot* harder to maintain for Codegear and *that* is
asking for trouble.

If two versions of the VCL were to be maintained the chance is high they
get out of sync somehow.

If two versions were to be built from the *same* codebase there would be
a lot of conditional compilation going on.


> Of course, they already _have_ in .net and one of the bug-bears in .net
> that I have heard mention of is string performance....
>
> A curious coincidence, no?
>
> :)
>

I think people complaining about string performance in .NET don't
understand the concept of the read-only strings used in .NET (and don't
know about the StringBuilder, etc.).

Of course a UnicodeString will be slower than an AnsiString, but unless
you do /a lot/ of string processing or you still have machines from the
last century, it is still acceptable.

You might also add that a UnicodeString takes more space than an
AnsiString, but we also have plenty of memory today.

Also you could still do background processing not related to UI in
AnsiString. The datatype will still be there for sure.

--
Regards
Jens

yannis

unread,
Nov 13, 2007, 3:54:22 AM11/13/07
to
Rudy Velthuis [TeamB] expressed precisely :

> yannis wrote:
>
>>> It's not as easy as setting a compiler switch!
>>
>> I think it is.
>
> Not really. You would need two designers, one for the Unicode VCL and
> one for the Ansi VCL.

The designer could be Unicode only and the streaming should know whether
it is to save Unicode or ANSI.

> Then the code should be ware of Char types. If
> Char really becomes equivalent to WideChar (like in .NET), I see a few
> problems.

Never thought of it so I'll take your word.


regards
Yannis


Jens Mühlenhoff

unread,
Nov 13, 2007, 4:13:19 AM11/13/07
to
Rudy Velthuis [TeamB] wrote:
>
> I think it is very safe to assume a new fixed size: WideChar. The
> problem is if people assume that a Char is one byte.

What if Codegear suddenly wants to support Linux again and you suddenly
need a 4-Byte Char, because Linux API needs UCS-4? ;-)

I know that's a bad example since the default Char type would then
probably still be an alias for WideChar, but I wanted to point out that
it's unsafe to assume a Length of a generic datatype such as Char,
string, Integer or Real.

If you really need to assume that a datatype is 2 bytes and stores a
UTF-16 code unit, use a WideChar.

Also there's a UCS4Char that can take an entire Unicode codepoint.

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 13, 2007, 5:32:09 AM11/13/07
to
Jens Mühlenhoff wrote:

> Rudy Velthuis [TeamB] wrote:
> >
> > I think it is very safe to assume a new fixed size: WideChar. The
> > problem is if people assume that a Char is one byte.
>
> What if Codegear suddenly wants to support Linux again and you
> suddenly need a 4-Byte Char, because Linux API needs UCS-4? ;-)

Then it would be 4-byte in the Linux version. One should simply always
use (or multiply with) SizeOf(Char), instead of 1, 2 or 4.
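
A minimal sketch of that advice, assuming nothing beyond the System unit. The helper name CharBytes is hypothetical. Because the byte count is derived from SizeOf(Char), the same source should stay correct whether Char is one, two or four bytes wide.

```pascal
program CharSizeSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: number of bytes occupied by the characters of S.
// SizeOf(Char) is 1 where Char = AnsiChar and 2 where Char = WideChar.
function CharBytes(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char);
end;

var
  S: string;
  Buf: array of Byte;
begin
  S := 'Tiburon';
  SetLength(Buf, CharBytes(S));
  // Move counts bytes, not characters, so the length must be scaled.
  if S <> '' then
    Move(S[1], Buf[0], CharBytes(S));
  WriteLn('Chars: ', Length(S), ', bytes: ', Length(Buf));
end.
```

Code that passes Length(S) straight to Move, or that allocates Length(S) bytes, is the kind that silently assumes a one-byte Char.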


>
> I know thats a bad example since the default Char type would then
> probably still be an alias for WideChar, but I wanted to point out
> that it's unsafe to assume a Length of a generic datatype such as
> Char, string, Integer or Real.

Well, I have always been saying that. There are fixed size types, like
Longint, Longword, WideChar, AnsiChar etc., and there are "generic"
types like Integer, Cardinal, string, Char, etc.

Real is a little bit of a problem. One should probably avoid it
altogether.
--
Rudy Velthuis [TeamB]

"Always go to other people's funerals, otherwise they won't come
to yours." -- Yogi Berra.

Jens Mühlenhoff

unread,
Nov 13, 2007, 7:48:45 AM11/13/07
to
Rudy Velthuis [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> Rudy Velthuis [TeamB] wrote:
>>> I think it is very safe to assume a new fixed size: WideChar. The
>>> problem is if people assume that a Char is one byte.
>> What if Codegear suddenly wants to support Linux again and you
>> suddenly need a 4-Byte Char, because Linux API needs UCS-4? ;-)
>
> Then it would be 4-byte in the Linux version. One should simply always
> use (or multiply with) SizeOf(Char), instead of 1, 2 or 4.

Yes that's correct.

>> I know thats a bad example since the default Char type would then
>> probably still be an alias for WideChar, but I wanted to point out
>> that it's unsafe to assume a Length of a generic datatype such as
>> Char, string, Integer or Real.
>
> Well, I have always been saying that. There are fixed size types, like
> Longint, Longword, WideChar, AnsiChar etc., and there are "generic"
> types like Integer, Cardinal, string, Char, etc.
>

Ok then.

> Real is a little bit of a problem. One should probably avoid it
> altogether.

Yes, floating point is a totally different story as you mention it :-\

--
Regards
Jens

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 9:33:38 AM11/13/07
to
Jens Mühlenhoff wrote:

> Jolyon Smith wrote:
> >
> > Why make Component.Name an exception (or anything else for that
> > matter)?
>
> As I already mentioned in another sub-thread, I didn't think about
> that properly at first. It doesn't make sense to make exceptions,
> right.

Component.Name is an exception, just not the way you expressed it.
There are characters you can't have in a component name (e.g., spaces),
but there is no reason to restrict, for example, diacriticals. Any
valid Delphi identifier should do.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz
Borland newsgroup denizen Sergio González has a new CD of
Irish music out, and it's good: http://tinyurl.com/7hgfr

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 9:34:41 AM11/13/07
to
Jens Mühlenhoff wrote:

> Of course a UnicodeString will be slower then an AnsiString, but
> unless you do /a lot/ of string processing or you still have machines
> from the last century, it is still acceptable.

This may not be true, especially as Windows evolves. If every API call
you make means your AnsiString gets transliterated to Unicode and then
back to ANSI, then the AnsiString version may well be slower.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Everything You Need to Know About InterBase Character Sets:
http://blogs.teamb.com/craigstuntz/articles/403.aspx

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 9:32:27 AM11/13/07
to
Jens Mühlenhoff wrote:

> What if Codegear suddenly wants to support Linux again and you
> suddenly need a 4-Byte Char, because Linux API needs UCS-4? ;-)

A character, generally, should be a four-byte type no matter what the
encoding of a string is, since characters are more or less code points
and code units (which is what you have in an encoded string) aren't the
same thing as code points.
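
The difference between code units and code points can be sketched as follows, assuming a well-formed UTF-16 WideString. The helper name CountCodePoints is hypothetical. U+1D11E (musical G clef) lies outside the Basic Multilingual Plane, so UTF-16 stores it as the surrogate pair D834 DD1E: two code units, one code point.

```pascal
program CodePointSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: count Unicode code points in a well-formed UTF-16
// string by skipping the low (trailing) half of each surrogate pair.
function CountCodePoints(const W: WideString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(W) do
    if (Ord(W[I]) < $DC00) or (Ord(W[I]) > $DFFF) then
      Inc(Result);
end;

var
  W: WideString;
begin
  // Build the surrogate pair for U+1D11E by hand.
  SetLength(W, 2);
  W[1] := WideChar($D834);
  W[2] := WideChar($DD1E);
  WriteLn('Code units:  ', Length(W));
  WriteLn('Code points: ', CountCodePoints(W));
end.
```

Length reports code units, which is why indexing S[I] on a UTF-16 string does not always yield a whole character.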

Jens Mühlenhoff

unread,
Nov 13, 2007, 11:36:36 AM11/13/07
to
Craig Stuntz [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> Of course a UnicodeString will be slower then an AnsiString, but
>> unless you do /a lot/ of string processing or you still have machines
>> from the last century, it is still acceptable.
>
> This may not be true, especially as Windows evolves. If every API call
> you make means your AnsiString gets transliterated to Unicode and then
> back to ANSI, then the AnsiString version may well be slower.
>
I was talking about RAW AnsiString processing without any APIs involved.

--
Regards
Jens

Jens Mühlenhoff

unread,
Nov 13, 2007, 11:39:07 AM11/13/07
to
Craig Stuntz [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> Jolyon Smith wrote:
>>> Why make Component.Name an exception (or anything else for that
>>> matter)?
>> As I already mentioned in another sub-thread, I didn't think about
>> that properly at first. It doesn't make sense to make exceptions,
>> right.
>
> Component.Name is an exception, just not the way you expressed it.
> There are characters you can't have in a component name (e.g., spaces),
> but there is no reason to restrict, for example, diacriticals. Any
> valid Delphi identifier should do.
>

I thought the only characters valid for a component name were
[a-z,A-Z,_][a-z,A-Z,_,0-9]?

But anyway the encoding of Component.Name should be Unicode when
everything else is also Unicode encoded. That was my original mistake.

--
Regards
Jens

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 10:43:37 AM11/13/07
to
Jens Mühlenhoff wrote:

> I was talking about RAW AnsiString processing without any APIs
> involved.

I recognized that you were looking at only one piece of the
performance picture, which is why I wrote the reply that I did.

Jens Mühlenhoff

unread,
Nov 13, 2007, 11:42:02 AM11/13/07
to
Craig Stuntz [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> What if Codegear suddenly wants to support Linux again and you
>> suddenly need a 4-Byte Char, because Linux API needs UCS-4? ;-)
>
> A character, generally, should be a four-byte type no matter what the
> encoding of a string is, since characters are more or less code points
> and code units (which is what you have in an encoded string) aren't the
> same thing as code points.
>

Maybe we think too much ANSI-centric, why not name the new types
UTF16CodeUnit (for a single code unit), CodePoint (for a "character) and
UTF16String (for a UTF-16 encoded character string) or something like this.

--
Regards
Jens

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 10:43:07 AM11/13/07
to
Jens Mühlenhoff wrote:

> I thought the only characters valid for a component name were
> [a-z,A-Z,_][a-z,A-Z,_,0-9]?

I don't, off the top of my head, know what the /actual/ limitation is,
but in principle any valid Delphi (and C++) identifier should be OK.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

How to ask questions the smart way:
http://www.catb.org/~esr/faqs/smart-questions.html

Rudy Velthuis [TeamB]

unread,
Nov 13, 2007, 12:39:20 PM11/13/07
to
Craig Stuntz [TeamB] wrote:

> Jens Mühlenhoff wrote:
>
> > I thought the only characters valid for a component name were
> > [a-z,A-Z,_][a-z,A-Z,_,0-9]?
>
> I don't, off the top of my head, know what the /actual/ limitation is,
> but in principle any valid Delphi (and C++) identifier should be OK.

I doubt that Amöbe or HüskerDü are valid C++ identifiers. They are
valid Delphi identifiers, though, these days.

I have the impression Jens assumes that only the characters he mentions
are allowed in Delphi identifiers. That used to be true, after all. The
change only came recently.

--
Rudy Velthuis [TeamB]

"Computer Science is no more about computers than astronomy is
about telescopes" -- Edsger W. Dijkstra.

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 12:43:27 PM11/13/07
to
Rudy Velthuis [TeamB] wrote:

> I doubt that Amöbe or HüskerDü are a valid C++ identifiers.

That would be an issue, I guess, but only if you care about C++
compatibility in your code.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

All the great TeamB service you've come to expect plus (New!)
Irish Tin Whistle tips: http://learningtowhistle.blogspot.com

Loren Pechtel

unread,
Nov 13, 2007, 2:25:18 PM11/13/07
to
On Mon, 12 Nov 2007 12:02:28 +0100, Jens Mühlenhoff
<j.mueh...@accurata.com> wrote:

>yannis wrote:
>>
>> This would be best handled by the end user. It is far better to have an
>> options where we can choose if the String data type is unicode or Ansi
>> the same way the huge strings work.
>>
>
>On the compiler-side yes, I agree. But the VCL (+ RTL) can only be
>compiled with ANSI xor Unicode support.

Why? Just make overloading work a bit better.

Remy Lebeau (TeamB)

unread,
Nov 13, 2007, 4:04:34 PM11/13/07
to

"Loren Pechtel" <lorenp...@hotmail.invalid.com> wrote in message
news:4739f787$1...@newsgroups.borland.com...

> Why? Just make overloading work a bit better.

Even better, implement shared logic using Generics as well. That way, the
overloaded functions can specialize the generic calls without actually
duplicating the source code. There are a lot of places in SysUtils, for
instance, where the Wide...() functions do the same things as their Ansi
counterparts. For example (sorry, I'm not up on the latest syntax for
Generics):

interface

function Trim(const S: AnsiString): AnsiString; overload;
function Trim(const S: WideString): WideString; overload;

implementation

function InternalTrim<T>(const S: T): T; inline;
var
  I, L: Integer;
begin
  L := Length(S);
  I := 1;
  while (I <= L) and (S[I] <= ' ') do Inc(I);
  if I > L then Result := '' else
  begin
    while S[L] <= ' ' do Dec(L);
    Result := Copy(S, I, L - I + 1);
  end;
end;

function Trim(const S: AnsiString): AnsiString;
begin
  Result := InternalTrim<AnsiString>(S);
end;

function Trim(const S: WideString): WideString;
begin
  Result := InternalTrim<WideString>(S);
end;


Gambit


Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 3:34:05 PM11/13/07
to
Remy Lebeau (TeamB) wrote:

> function InternalTrim<T>(const S: T): T; inline;
> var
> I, L: Integer;
> begin
> L := Length(S);
> I := 1;
> while (I <= L) and (S[I] <= ' ') do Inc(I);

For this to work you'd have to have some kind of generic constraint
which indicates that T supports indexing and returns a char type. You
can do this with classes (e.g., with an interface), but I'm not sure
how you could do it with the string type.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

IB 6 versions prior to 6.0.1.6 are pre-release and may corrupt
your DBs! Open Edition users, get 6.0.1.6 from http://mers.com

Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 3:56:54 PM11/13/07
to
Remy Lebeau (TeamB) wrote:

> Why? Like I said, I'm not up on Delphi Generics, but C++ templates
> don't work that way.

Well, Delphi (and C#, etc.) generics and C++ templates are completely
different.

A C++ template is expanded at compile time, and code is generated for
the specific type used in the call. You end up with something akin to
duck typing: If you call T.Foo and then pass a MyClass to the method,
it will work if there's a MyClass.Foo, no matter where Foo comes from.

With a .NET generic, OTOH, there is no MyClass expansion of the
generic generated at compile time. Instead, a single type is generated
which handles all (reference-type) params. When you want to call a
method like Foo, you need to have a constraint to ensure that this will
work with the parameter passed at runtime, since runtime type-checking
will be done. The typing is truly static: If your constraint says that
T must support IFoo, then IFoo.Foo is what gets called, not
ISomethingElse.Foo.
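
A minimal sketch of such a constraint, written in the generics syntax that Delphi for Win32 later shipped (Delphi 2009 and up), so it is an assumption relative to the compilers discussed here. IFoo, TMyClass and TCaller are hypothetical names. The point is only that the body of CallFoo may call Item.Foo because the constraint "T: IFoo" promises it; without the constraint the call would not compile.

```pascal
program ConstraintSketch;
{$APPTYPE CONSOLE}

type
  IFoo = interface
    function Foo: string;
  end;

  TMyClass = class(TInterfacedObject, IFoo)
  public
    function Foo: string;
  end;

  TCaller = class
  public
    // The constraint "T: IFoo" is what lets the body call Item.Foo.
    class function CallFoo<T: IFoo>(const Item: T): string; static;
  end;

function TMyClass.Foo: string;
begin
  Result := 'MyClass.Foo via IFoo';
end;

class function TCaller.CallFoo<T>(const Item: T): string;
begin
  Result := Item.Foo;
end;

var
  F: IFoo;
begin
  F := TMyClass.Create;
  WriteLn(TCaller.CallFoo<IFoo>(F));
end.
```

How the compiler instantiates the generic behind the scenes is a separate question from the constraint itself, and this sketch makes no claim about it.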

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Useful articles about InterBase development:
http://blogs.teamb.com/craigstuntz/category/21.aspx

Remy Lebeau (TeamB)

unread,
Nov 13, 2007, 4:48:42 PM11/13/07
to

"Craig Stuntz [TeamB]" <craig_...@nospam.please [a.k.a. acm.org]> wrote
in message news:473a184d$1...@newsgroups.borland.com...

> For this to work you'd have to have some kind of generic
> constraint which indicates that T supports indexing and
> returns a char type.

Why? Like I said, I'm not up on Delphi Generics, but C++ templates don't
work that way. T is used as-is. As long as the input type supports what
the code needs, then it is a simple substitution at the compiler level
before the code is evaluated. If an invalid type is passed in, you'll get a
compiler errors due to missing operators/members, overloaded functions that
don't support the type, etc. But you don't have to do anything special to
indicate that input supports the features the code needs (there are
templated ways to do that as well, though). Why can't Delphi do the same?


Gambit


Remy Lebeau (TeamB)

unread,
Nov 13, 2007, 5:29:45 PM11/13/07
to

"Craig Stuntz [TeamB]" <craig_...@nospam.please [a.k.a. acm.org]> wrote
in message news:473a1da6$1...@newsgroups.borland.com...

> With a .NET generic, OTOH, there is no MyClass expansion
> of the generic generated at compile time.

What about Win32 Generics?


Gambit


Craig Stuntz [TeamB]

unread,
Nov 13, 2007, 4:37:19 PM11/13/07
to
Remy Lebeau (TeamB) wrote:

> What about Win32 Generics?

Well, they don't exist yet, but I'm led to believe that they were
delayed so that they could work identically to .NET generics. I presume
they will be the same, as there's no real good reason to make them work
differently than the Delphi for .NET which has already shipped.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Want to help make Delphi and InterBase better? Use QC!
http://qc.borland.com -- Vote for important issues

Loren Pechtel

unread,
Nov 13, 2007, 8:54:36 PM11/13/07
to
On 13 Nov 2007 13:34:05 -0700, "Craig Stuntz [TeamB]"
<craig_...@nospam.please [a.k.a. acm.org]> wrote:

>Remy Lebeau (TeamB) wrote:
>
>> function InternalTrim<T>(const S: T): T; inline;
>> var
>> I, L: Integer;
>> begin
>> L := Length(S);
>> I := 1;
>> while (I <= L) and (S[I] <= ' ') do Inc(I);
>
> For this to work you'd have to have some kind of generic constraint
>which indicates that T supports indexing and returns a char type. You
>can do this with classes (e.g., with an interface), but I'm not sure
>how you could do it with the string type.

I think generics would require a few new abstract types.

Signed, Unsigned, Number, AnyPointer, AnyString, AnyFile, AnyObject.

In fact, I would like to see some of them even if we don't have
generics. They would be used in places where we currently use untyped
parameters.

Jens Mühlenhoff

unread,
Nov 14, 2007, 4:50:40 AM11/14/07
to
Rudy Velthuis [TeamB] wrote:
> I doubt that Amöbe or HüskerDü are a valid C++ identifiers. They are
> valid Delphi identifiers, though, these days.
>
> I have the impression Jens assumes that only the characters he mentions
> are allowed in Delphi identifiers. That used to be true, after all. The
> change only came recently.
>

Yes I wasn't aware of that and I find it very dangerous to use
identifiers like that.

--
Regards
Jens

Jens Mühlenhoff

unread,
Nov 14, 2007, 5:15:44 AM11/14/07
to

I just checked and found out that TComponent.Name is checked by the
IsValidIdent function in SysUtils. This function only allows the "old"
way as I presumed (an empty name is allowed and ['A'..'Z', 'a'..'z',
'_'] for the first character and ['A'..'Z', 'a'..'z', '_', '0'..'9'] for
the rest)!
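
The rule described above can be sketched as follows. This is a hypothetical restatement, not the SysUtils source, and the function name IsAsciiIdent is made up for illustration. The empty-name case is left to the caller, so the sketch rejects an empty string.

```pascal
program IdentSketch;
{$APPTYPE CONSOLE}

// Hypothetical restatement of the rule: an ASCII letter or underscore
// first, then ASCII letters, digits or underscores.
function IsAsciiIdent(const S: AnsiString): Boolean;
var
  I: Integer;
begin
  Result := False;
  if S = '' then Exit;
  if not (S[1] in ['A'..'Z', 'a'..'z', '_']) then Exit;
  for I := 2 to Length(S) do
    if not (S[I] in ['A'..'Z', 'a'..'z', '_', '0'..'9']) then Exit;
  Result := True;
end;

begin
  WriteLn(IsAsciiIdent('Button1'));
  WriteLn(IsAsciiIdent('1Button'));
  WriteLn(IsAsciiIdent('T'#$F6'st')); // contains a non-ASCII character
end.
```

Under this rule a name with an umlaut is rejected even though the compiler accepts it as an identifier.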

I guess this really is because of C++ compatibility, IIRC C++ can't
consume such identifiers.

I also did a test on a DLL that exports the function "Töst":

L := LoadLibrary('project2.dll');
Assert(L <> 0);
P := GetProcAddress(L, 'Töst');
Assert(Assigned(P));
P^;

This doesn't work (P gets NIL), I don't know how to (dynamically) import
a function that uses umlauts.

Why was it allowed in the first place?

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 14, 2007, 5:20:52 AM11/14/07
to
Jens M�hlenhoff wrote:

It is not dangerous. But it could be incompatible with other
programming languages.
--
Rudy Velthuis [TeamB]

"I have spoken many a word, therefore, it is fact."
-- Eric the Verbose

Jens Mühlenhoff

unread,
Nov 14, 2007, 6:57:17 AM11/14/07
to
Rudy Velthuis [TeamB] wrote:
>
> It is not dangerous. But it could be incompatible with other
> programming languages.

Maybe I'm just a little bit too old-school then ;-)

--
Regards
Jens

Rudy Velthuis [TeamB]

unread,
Nov 14, 2007, 7:50:24 AM11/14/07
to
Jens M�hlenhoff wrote:

You like old-school bikes? <g>

--
Rudy Velthuis [TeamB]

"For if he like a madman lived, At least he like a wise one died."
-- Cervantes.

Craig Stuntz [TeamB]

Nov 14, 2007, 8:19:40 AM

Jens Mühlenhoff wrote:

> This doesn't work (P gets NIL), I don't know how to (dynamically)
> import a function that uses Umlaute.

Use a PE viewer to ensure that it actually got exported that way.
Worst case, it could be done by index, but first figure out where the
problem is.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Jens Mühlenhoff

Nov 14, 2007, 9:22:02 AM

Rudy Velthuis [TeamB] wrote:
> Jens M�hlenhoff wrote:
>
>> Rudy Velthuis [TeamB] wrote:
>>> It is not dangerous. But it could be incompatible with other
>>> programming languages.
>> Maybe I'm just a little bit too old-school then ;-)
>
> You like old-school bikes? <g>
>

Yes, I do, though I don't see the connection here ;-).

BTW: It seems that you have a UTF-8 encoding problem with my second name ...

--
Regards
Jens

Jens Mühlenhoff

Nov 14, 2007, 9:26:09 AM

Craig Stuntz [TeamB] wrote:
> Jens Mühlenhoff wrote:
>
>> This doesn't work (P gets NIL), I don't know how to (dynamically)
>> import a function that uses Umlaute.
>
> Use a PE viewer to ensure that it actually got exported that way.
> Worst case, it could be done by index, but first figure out where the
> problem is.
>

I did use a PE viewer and it seems it's encoded as UTF-8, so I probably
need to use #$C3#$B6 instead of 'ö', but I will not investigate it
further (unless somebody is really interested in this *g*).
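
To make the byte difference concrete, here is a minimal sketch. It assumes a compiler that provides UTF8Encode and the UTF8String type in the System unit; the program and variable names are made up for illustration:

```pascal
program ExportNameBytes;
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  ExportName: WideString;
  Utf8: UTF8String;
  I: Integer;
begin
  // Build "Töst" from code points so the result does not depend on
  // the encoding of this source file. U+00F6 is "ö".
  ExportName := 'T';
  ExportName := ExportName + WideChar($00F6) + 'st';

  // Convert the UTF-16 string to UTF-8 bytes.
  Utf8 := UTF8Encode(ExportName);

  // Print each byte in hex: 54 C3 B6 73 74
  for I := 1 to Length(Utf8) do
    Write(IntToHex(Ord(Utf8[I]), 2), ' ');
  Writeln;
end.
```

In a single-byte ANSI code page such as Windows-1252 the same name is four bytes (54 F6 73 74), so a lookup by the ANSI spelling would not match a name stored as UTF-8.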

--
Regards
Jens

Craig Stuntz [TeamB]

Nov 14, 2007, 8:32:57 AM

Jens Mühlenhoff wrote:

> I did use a PE viewer and it seems it's encoded as UTF-8, so I
> probably need to use #$C3#$B6 instead of 'ö', but I will not
> investigate it further (unless somebody is really interested in this
> *g*).

I suppose you could try changing the encoding of your source file to
UTF-8, but that's a pretty fragile fix. :)

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Rudy Velthuis [TeamB]

Nov 14, 2007, 8:37:14 AM

Jens M�hlenhoff wrote:

> Rudy Velthuis [TeamB] wrote:
> > Jens M�hlenhoff wrote:
> >
> > > Rudy Velthuis [TeamB] wrote:
> > > > It is not dangerous. But it could be incompatible with other
> > > > programming languages.
> > > Maybe I'm just a little bit too old-school then ;-)
> >
> > You like old-school bikes? <g>
> >
>
> Yes I do though I don't see a connection here ;-)

That's the first thing I think of when I hear "old-school". I guess I
watched American Chopper once too often. <g>

--
Rudy Velthuis [TeamB]

"A printer consists of three main parts: the case, the jammed
paper tray and the blinking red light" -- unknown

Jolyon Smith

Nov 14, 2007, 2:40:12 PM

In article <47396ab0$1...@newsgroups.borland.com>, Jens Mühlenhoff says...
> Rudy Velthuis [TeamB] wrote:
> >
> > I think it is very safe to assume a new fixed size: WideChar. The
> > problem is if people assume that a Char is one byte.
>
> What if CodeGear suddenly wants to support Linux again and you suddenly
> need a 4-byte Char, because the Linux API needs UCS-4? ;-)

Why is 1-byte char being compared to a 2-byte WideChar?

AIUI a Unicode Char, in any encoding except 4byte/char (eg UTF32?), has
a MINIMUM number of bytes per char, not a FIXED number.


> If you really need to assume that a datatype is 2 byte and stores a
> UTF-16 character use a WideChar.

You may need more than one WideChar (a surrogate pair) to "hold" a
single UTF-16 character.


> Also there's a UCS4Char that can hold an entire Unicode code point.

Aha!

:)

--
JS
TWorld.Create.Free;

Craig Stuntz [TeamB]

Nov 14, 2007, 2:19:51 PM

Jolyon Smith wrote:

> AIUI a Unicode Char, in any encoding except 4byte/char (eg UTF32?),
> has a MINIMUM number of bytes per char, not a FIXED number.

That's a bit tricky, since there isn't really such a thing as a
"character" in Unicode, insofar as there's no single type which will
always map to a single glyph on paper. It sounds like you're talking
about code units, but when dealing with, for example, a single
character (insofar as the Char type is typically used) a code point may
be more appropriate. There are indeed cases where there are multiple
code units per code point, but that might not affect the size of a
"char" type.

Code points always fit in 4 bytes (as in UTF-32), and code units vary
by encoding. A code point is somewhat closer to a single character on
paper. Nonetheless, multiple code points may comprise a character on paper.
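
The distinction can be seen with two small WideString values. This is only a sketch with made-up variable names; it uses WideString, WideChar and UCS4Char, which to my knowledge all exist in current Delphi versions:

```pascal
program UnitsVsPoints;
{$APPTYPE CONSOLE}

var
  Clef, Accented: WideString;
begin
  // U+1D11E (musical G clef) lies outside the BMP, so UTF-16 needs the
  // surrogate pair D834 DD1E: one code point, two code units.
  Clef := WideChar($D834);
  Clef := Clef + WideChar($DD1E);

  // "e" followed by U+0301 (combining acute accent): two code points
  // that are normally displayed as one accented letter.
  Accented := 'e';
  Accented := Accented + WideChar($0301);

  Writeln('SizeOf(WideChar) = ', SizeOf(WideChar));   // 2 bytes per code unit
  Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));   // 4 bytes per code point
  Writeln('Length(Clef)     = ', Length(Clef));       // 2 code units, 1 code point
  Writeln('Length(Accented) = ', Length(Accented));   // 2 code points, 1 character on paper
end.
```

Length counts code units in both cases, so it matches neither the code point count of the first string nor the on-paper character count of the second.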

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz

Jolyon Smith

Nov 15, 2007, 2:46:38 PM

In article <473b5867$1...@newsgroups.borland.com>, Craig Stuntz [TeamB]
says...

> Jolyon Smith wrote:
>
> > AIUI a Unicode Char, in any encoding except 4byte/char (eg UTF32?),
> > has a MINIMUM number of bytes per char, not a FIXED number.
>
> That's a bit tricky

Which neatly sums up Unicode in its entirety I reckon.

:)


My point was only that any code that assumes (or relies on) a fixed
relationship between the number of bytes in a string and the number of
characters that string comprises is going to come a cropper if that
string is Unicode.

(with the exception of 4byte encodings, as noted)


--
JS
TWorld.Create.Free;

Jens Mühlenhoff

Nov 16, 2007, 6:37:10 AM

Jolyon Smith wrote:
> In article <473b5867$1...@newsgroups.borland.com>, Craig Stuntz [TeamB]
> says...
>> Jolyon Smith wrote:
>>
>>> AIUI a Unicode Char, in any encoding except 4byte/char (eg UTF32?),
>>> has a MINIMUM number of bytes per char, not a FIXED number.
>> That's a bit tricky
>
> Which neatly sums up Unicode in its entirety I reckon.
>
> :)
>
>
> My point was only that any code that assumes (or relies on) a fixed
> relationship between the number of bytes in a string and the number of
> characters that string comprises is going to come a cropper if that
> string is Unicode.
>

Exactly.

> (with the exception of 4byte encodings, as noted)
>
>

Not true, as I understand it there are characters *longer* than one code
point.

http://en.wikipedia.org/wiki/Combining_character

The UCS-4 encoding only prevents code points from being truncated when
a string is cut between code units. By cutting a string in two with
UCS-4 you won't end up with an invalid sequence, as can happen with the
other encodings.
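
As an illustration of the UTF-16 case, cutting by code unit index can leave half of a surrogate pair behind. A small sketch, where EndsInHighSurrogate is a hypothetical helper written for this example:

```pascal
program CutSurrogate;
{$APPTYPE CONSOLE}

// Hypothetical helper: True if the last code unit of S is a UTF-16
// high (leading) surrogate, i.e. the first half of a pair.
function EndsInHighSurrogate(const S: WideString): Boolean;
begin
  Result := (S <> '') and
    (Ord(S[Length(S)]) >= $D800) and (Ord(S[Length(S)]) <= $DBFF);
end;

var
  S, Head: WideString;
begin
  // "A" followed by U+1D11E encoded as the surrogate pair D834 DD1E.
  S := 'A';
  S := S + WideChar($D834) + WideChar($DD1E);

  // Cutting after two code units splits the pair in the middle.
  Head := Copy(S, 1, 2);

  if EndsInHighSurrogate(Head) then
    Writeln('Head ends in an unpaired high surrogate')
  else
    Writeln('Head ends on a code point boundary');
end.
```

With a 4-byte-per-code-point string the same cut would always land between code points, although it could still separate a base letter from a combining mark.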

That's also the case with UCS-2, btw, but UCS-2 can only represent the
BMP and is obsolete anyway.

--
Regards
Jens

Craig Stuntz [TeamB]

Nov 16, 2007, 9:34:33 AM

Jens Mühlenhoff wrote:

> Not true, as I understand it there are characters longer than one
> code point.

Right. And there are languages where multiple glyphs represent one
"character."

IMHO, it's very misleading to use the term "character" when dealing
with Unicode. It draws one to incorrect conclusions.

--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz
