Tiburón and FastCode

Lee Grissom

Apr 4, 2008, 4:25:49 PM

Hello FastCoders, just a curiosity question, but what kind of concerns
do you have about FastCode routines working with the new VCL coming in
the next release of Delphi?

Cheers,
Lee

Dennis

Apr 5, 2008, 3:35:33 AM

Hi

> Hello FastCoders, just a curiosity question, but what kind of concerns
> do you have about FastCode routines working with the new VCL coming in
> the next release of Delphi?

Nothing yet, but let us discuss what we have to do. Please correct me if I
am wrong.

Functions defined with AnsiString as string type will have no problems

TCharPosFunction = function(Chr : Char; const Str : AnsiString) : Integer;

but functions defined to take or return the type "string" will fail if they
do not support WideString.

TTrimFunction = function (const S: string) : string;

One solution is to test that string=AnsiString prior to calling the
function. Another is to redefine the function prototypes.

TTrimFunction = function (const S: AnsiString) : AnsiString;

IMHO This solution is the correct one.

Why were functions like Trim defined to take and return "string" in the first
place?

Best regards
Dennis Kjaer Christensen

Rudy Velthuis [TeamB]

Apr 5, 2008, 12:49:15 PM

Dennis wrote:

> Functions defined with AnsiString as string type will have no problems
>
> TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> Integer;

That uses Char, not AnsiChar, so it should have a big problem. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Research is what I'm doing when I don't know what I'm doing."
-- Wernher Von Braun (1912-1977)

Q Correll

Apr 5, 2008, 1:12:15 PM

Dennis,

| Why were functions like Trim defined to take and return "string" in
| the first place?

Perhaps because in the ol'days "string" was all there was?

I am still shuddering at the thought of modifying all the apps I have
coded with plain ol' string. <sigh>

--
Q

04/05/2008 10:09:09

XanaNews Version 1.17.5.7 [Q's Salutation mod]

Dennis

Apr 5, 2008, 1:29:23 PM

Hi

> Perhaps because in the ol'days "string" was all there was?

OK

> I am still shuddering at the thought of modifying all the apps I have
> coded with plain ol' string. <sigh>

I am not sure it will be too hard. CodeGear has probably done a lot of work to
make the transition to Unicode easy.

What if you simply make a search and replace on string -> AnsiString - will
everything just work as before?

I have to admit that I know very little about Unicode ;-)

Dennis

Apr 5, 2008, 1:26:26 PM

Hi Rudy

> > TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> > Integer;
>
> That uses Char, not AnsiChar, so it should have a big problem. <g>

OK

Is this good?

TCharPosFunction = function(Chr : AnsiChar; const Str : AnsiString) :
Integer;

Guess we have much work to do ;-)

Rudy Velthuis [TeamB]

Apr 5, 2008, 2:32:14 PM

Dennis wrote:

> Hi Rudy
>
> > > TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> > > Integer;
> >
> > That uses Char, not AnsiChar, so it should have a big problem. <g>
>
> OK
>
> Is this good?
>
> TCharPosFunction = function(Chr : AnsiChar; const Str : AnsiString) :
> Integer;

That should be fine. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Humor is just another defense against the universe."
-- Mel Brooks (1926- )

Rudy Velthuis [TeamB]

Apr 5, 2008, 2:40:01 PM

Dennis wrote:

> Hi
>
> > Perhaps because in the ol'days "string" was all there was?
>
> OK
>
> > I am still shuddering at the thought of modifying all the apps I
> > have coded with plain ol' string. <sigh>
>
> I am not sure it will be too hard. CodeGear has probably done a lot of
> work to make the transition to Unicode easy.
>
> What if you simply make a search and replace on string -> AnsiString
> - will everything just work as before?

Yes, and no. Also Char, PChar etc. must be replaced. Then, since the
library routines like Format or Trim will probably return
UnicodeStrings, what if you replace all occurrences of string with
AnsiString?

var
  S, T: AnsiString;
begin
  S := ' Hello ';
  T := Trim(S); // Trim takes and returns the default string type
end;

This will cause two conversions: S will be converted to UnicodeString,
passed to Trim, and the result of Trim must be converted back.

I'd rather see new routines for Char (<-- WideChar).

--
Rudy Velthuis [TeamB] http://www.teamb.com

"USA Today has come out with a new survey: Apparently three out
of four people make up 75 percent of the population."
-- David Letterman.

Rudy Velthuis [TeamB]

Apr 5, 2008, 2:50:20 PM

Rudy Velthuis [TeamB] wrote:

> Dennis wrote:
>
> > Hi
> >
> > > Perhaps because in the ol'days "string" was all there was?
> >
> > OK
> >
> > > I am still shuddering at the thought of modifying all the apps I
> > > have coded with plain ol' string. <sigh>
> >
> > I am not sure it will be too hard. CodeGear has probably done a lot of
> > work to make the transition to Unicode easy.
> >
> > What if you simply make a search and replace on string -> AnsiString
> > - will everything just work as before?
>
> Yes, and no. Also Char, PChar etc. must be replaced.

Oh, and if you cast a string to PAnsiChar instead of PChar, you might
be casting a UnicodeString to PAnsiChar, which is hardly what you want:

ShellExecuteA(0, PAnsiChar('myfile.txt'), ...);

Since myfile.txt is probably a string, that might cause problems. So
you should do:

ShellExecute(0, PChar('myfile.txt'), ...);

or:

ShellExecuteA(0, PAnsiChar(AnsiString('myfile.txt')), ...);

Note that I used the -A version of ShellExecute for the PAnsiChar cast.
<g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Some cause happiness wherever they go; others, whenever they go."
-- Oscar Wilde (1854-1900)

Q Correll

Apr 5, 2008, 2:45:42 PM

Dennis,

| I am not sure it will be too hard. CodeGear has probably done a lot of
| work to make the transition to Unicode easy.

I'm not either. However, I am still quite nervous about it. <g>

| What if you simply make a search and replace on string -> AnsiString
| - will everything just work as before?

According to Nick Hodges several months ago, that will work just fine.
He suggested at the time to do it "now." However, when I thought about
it a bit I realized that a global replace isn't the easiest thing to do
due to my commenting. And that means I would have to make a decision
for EVERY frickin occurrence of "string" in ALL of my apps' code.

I'm still nervous. <g>

| I have to admit that I know very little about Unicode ;-)

If you know a little, then you know more than my "nothing." <g>

--
Q

04/05/2008 11:41:19

Rudy Velthuis [TeamB]

Apr 5, 2008, 2:51:54 PM

Q Correll wrote:

> Dennis,
>
> > I am not sure it will be too hard. CodeGear has probably done a lot of
> > work to make the transition to Unicode easy.
>
> I'm not either. However, I am still quite nervous about it. <g>
>
> > What if you simply make a search and replace on string -> AnsiString
> > - will everything just work as before?
>
> According to Nick Hodges several months ago, that will work just fine.

Actually, no it won't. See my replies. In general, not converting
strings to anything will probably work much better, but of course
assembler routines should know that Char is now WideChar.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"In any contest between power and patience, bet on patience."
-- W.B. Prescott

Xu, Qian

Apr 5, 2008, 5:28:50 PM

As CodeGear's engineers said, "Unless the user explicitly uses
AnsiString/AnsiChar, string/Char will be mapped to
UnicodeString/WideChar automatically."

If FastCode is designed only for the ANSI world, it would be useless for
Tiburon.

--
Best regards
Xu, Qian (stanleyxu)
http://stanleyxu2005.blogspot.com/

Q Correll

Apr 5, 2008, 6:47:06 PM

Rudy,

| Actually, no it won't.

Why doesn't that surprise me?

| See my replies.

Thanks. I have been reading your comments. And they make sense to me.
And they also make me even more concerned about what Nick typed those
few months ago. <sigh>

--
Q

04/05/2008 15:44:57

Rudy Velthuis [TeamB]

Apr 6, 2008, 7:06:11 AM

Q Correll wrote:

> Rudy,
>
> > Actually, no it won't.
>
> Why doesn't that surprise me?
>
> > See my replies.
>
> Thanks. I have been reading your comments. And they make sense to
> me. And they also make me even more concerned about what Nick typed
> those few months ago. <sigh>

I'm not very concerned. Problems can arise if you mix and convert
between Ansi and Unicode strings, but most people don't do that, and if
they do, they already have safeguards in place.
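
To make the kind of problem concrete, here is a minimal sketch in Python (not Delphi, purely so it can be run as-is). It assumes Windows-1252 as the Ansi code page and replaces unmappable characters with "?"; the helper name round_trip is hypothetical.

```python
def round_trip(text: str, codepage: str = "cp1252") -> str:
    """Convert text to an Ansi code page and back, replacing unmappable characters."""
    return text.encode(codepage, errors="replace").decode(codepage)

# e-acute exists in cp1252, so this string survives the round trip.
print(round_trip("Caf\u00e9"))

# Greek capital omega is not in cp1252, so it comes back as "?".
print(round_trip("\u03a9 = 1/S"))
```

Text that stays within the code page is unaffected; anything outside it is silently lost once it passes through the Ansi form.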

Like I said: it is much safer to use the general string type than to
explicitly declare every string as AnsiString.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I think it would be a good idea."
-- Mahatma Gandhi (1869-1948), when asked what he thought of
Western civilization

Q Correll

Apr 6, 2008, 12:06:36 PM

Rudy,

| Like I said: it is much safer to use the general string type than to
| explicitly declare every string as AnsiString.

I hope that's still true in the eventual Unicode Delphi release.

--
Q

04/06/2008 09:06:01

Rudy Velthuis [TeamB]

Apr 6, 2008, 1:30:08 PM

Q Correll wrote:

> Rudy,
>
> > Like I said: it is much safer to use the general string type than
> > to explicitly declare every string as AnsiString.
>
> I hope that's still true in the eventual Unicode Delphi release.

That's what I meant. If you do that now, it is most likely also safe in
the new Delphi.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I do not have a body, I am a body." -- Unknown

Q Correll

Apr 6, 2008, 2:00:42 PM

Rudy,

| That's what I meant. If you do that now, it is most likely also safe
| in the new Delphi.

I'm still nervous. <crossing fingers>

--
Q

04/06/2008 11:00:24

Lee Grissom

Apr 7, 2008, 8:08:52 PM

It happens that Dennis formulated :

> I have to admit that I know very little about Unicode ;-)

I'm not an expert either.

Here's a good starting place:
http://www.codinghorror.com/blog/archives/001084.html

--
Lee

Nils Haeck

Jun 22, 2008, 7:55:23 AM

I still find it quite a waste to use Unicode strings, and wonder why CG
didn't choose to use Utf8string instead as default. Maybe I'm selfish and
not international enough, but from my point of view (and customer base), I'd
say 80% of the software is latin-based (just requiring 1 byte per
character), with a few additional "weird" characters here and there, that
would perhaps require 3 bytes per character. Much better than the standard 2
bytes per character in proposed Delphi "Unicode" (which is, I think, UTF16
truncated to 2 bytes/char).
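
As a rough illustration of the byte counts being compared here, the following Python sketch (Python rather than Delphi, purely so it can be run as-is) encodes the same text as UTF-8 and as UTF-16. The sample strings and the helper name encoded_sizes are assumptions chosen for illustration.

```python
def encoded_sizes(text: str):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "latin": "Hello, world",
    "accented": "Caf\u00e9",
    "cjk": "\u65e5\u672c\u8a9e",
}

for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Plain Latin text takes half the space in UTF-8, while the CJK sample is smaller in UTF-16, so which encoding is more compact depends on the data.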

Another argument: you could state that 2 bytes/char standard is easier for
random access, but that limits the set to 2^16 possible values, and Unicode
already contains a lot more characters than that, so we'll face the same
problems in some years. Utf8 on the other hand, is extendible (1, 3, 5, 7..
bytes per character).

Yet another argument is that most old code (parsers, etc that look for char
values in 0..127 range) would just work when feeding it an UTF8 string,
since UTF8 is smartly defined to keep these 0..127 characters identical.
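
A small Python sketch of that property (Python rather than Delphi, so it can be run directly; the sample text and the helper name ascii_positions are assumptions): in UTF-8, bytes in the 0..127 range only ever stand for ASCII characters, and every byte of a multi-byte sequence is 128 or above.

```python
def ascii_positions(data: bytes):
    """Return the indexes of all bytes in the 0..127 range."""
    return [i for i, b in enumerate(data) if b < 0x80]

# 'a', e-acute (2 bytes), 'b', a CJK character (3 bytes), 'c'
text = "a\u00e9b\u65e5c"
data = text.encode("utf-8")

# Only the bytes for 'a', 'b' and 'c' fall in the 0..127 range.
print([chr(data[i]) for i in ascii_positions(data)])
```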

Nils

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message
news:xn0fokzk04boyz...@rvelthuis.de...

Rudy Velthuis [TeamB]

Jun 22, 2008, 2:02:30 PM

Nils Haeck wrote:

> I still find it quite a waste to use Unicode strings, and wonder why
> CG didn't choose to use Utf8string instead as default. Maybe I'm
> selfish and not international enough, but from my point of view (and
> customer base), I'd say 80% of the software is latin-based

I guess you're wrong. Or it is latin-based because of restrictions.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I have come to believe that the whole world is an enigma, a
harmless enigma that is made terrible by our own mad attempt to
interpret it as though it had an underlying truth."
-- Umberto Eco

Remy Lebeau (TeamB)

Jun 23, 2008, 1:21:51 PM

"Nils Haeck" <b...@bla.com> wrote in message
news:485e...@newsgroups.borland.com...

> I still find it quite a waste to use Unicode strings, and wonder why
> CG didn't choose to use Utf8string instead as default.

Win32 support. Win32 API functions are UTF-16. Using UTF8String as default
would require a lot of conversions. The Ansi version of the VCL today
already has to do that anyway. By moving to UTF-16 by default, those
conversions are avoided, increasing performance significantly.

> I'd say 80% of the software is latin-based

Perhaps, but 90-odd% of PCs are running a Unicode-based OS, even if
configured for a Latin-based language. But then, you also have to think
about a lot of markets that actually use single-byte multi-character
languages. Those benefit from the Unicode switchover.

> with a few additional "weird" characters here and there, that would
> perhaps require 3 bytes per character.

You could say the same for UTF-8, though. In fact, UTF-8 supports encoded
characters up to 6 bytes per logical character. Even MBCS supported 2-byte
characters in a 1-byte environment.

> Much better than the standard 2 bytes per character in proposed Delphi
> "Unicode"

Other than memory usage, I don't see why it would be better.

> Another argument: you could state that 2 bytes/char standard is
> easier for random access, but that limits the set to 2^16 possible
> values, and Unicode already contains a lot more characters than
> that

You can say the same for MBCS as well. Unicode surrogate pairs address
that issue, just as lead-bytes address it in MBCS.
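
For readers unfamiliar with surrogate pairs, a minimal Python sketch (not Delphi; the helper name utf16_code_units and the sample characters are assumptions): a character outside the 2^16 range is stored as two 16-bit code units in UTF-16. In current UTF-8 (RFC 3629) the same character takes 4 bytes, which is the maximum per code point.

```python
import struct

def utf16_code_units(text: str):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

bmp = "\u65e5"          # a CJK character inside the 16-bit range
astral = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the 16-bit range

print([hex(u) for u in utf16_code_units(bmp)])      # one code unit
print([hex(u) for u in utf16_code_units(astral)])   # a surrogate pair
print(len(astral.encode("utf-8")))                  # bytes needed in UTF-8
```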

> Utf8 on the other hand, is extendible (1, 3, 5, 7.. bytes per character).

Which makes it a lot harder to process, exactly. The loss of random seeking
alone would be too huge a performance hit to be beneficial.

> Yet another argument is that most old code (parsers, etc that look
> for char values in 0..127 range) would just work when feeding it
> an UTF8 string

Only when fed characters that do not go above 127, which is something UTF-8
is designed for.

> since UTF8 is smartly defined to keep these 0..127 characters identical.

But, if fed a UTF-8 encoded Unicode string with characters above 127, those
parsers would fail anyway. At least by moving to Unicode, those parsers
would have a chance of processing higher characters now.

Gambit

Pierre le Riche

Jun 24, 2008, 6:16:31 AM

Hi Remy,

> Win32 support. Win32 API functions are UTF-16. Using UTF8String as default
> would require a lot of conversions.

UTF-8 <-> UTF-16 conversions are really cheap.

> The Ansi version of the VCL today already has to do that anyway. By moving to UTF-16 by default, those conversions are avoided, increasing performance significantly.

UTF-8 is also faster than the current AnsiString situation, because the
conversion is cheap. Of course avoiding a conversion completely is the
fastest, but the Win32 API is generally so slow that the difference is
not noticeable.

> Other than memory usage, I don't see why it would be better.

Memory usage is a very big deal under Win32.

> Which makes it a lot harder to process, exactly. The loss of random
> seeking alone would be too huge a performance hit to be beneficial.

Neither is a fixed-length encoding, so the complexity required to
process them is comparable.

> But, if fed a UTF-8 encoded Unicode string with characters above 127, those
> parsers would fail anyway.

There are many that continue to work fine, and that is the point that
Nils is trying to make and also what I have found in practice.

Switching to UTF-8 is effectively a code page change (CP65001, to be
exact), so any routines that were code page agnostic before continue to
work fine. That is a large chunk of the RTL.

If CodeGear enabled Unicode in the components (but left the string type
at AnsiString/UTF8String) and changed System.DefaultSystemCodePage to
65001, our applications would get the benefit of Unicode with
significantly less work required on our part.

Regards,
Pierre

Nils Haeck

Jun 24, 2008, 11:47:58 AM

>
> I guess you're wrong. Or it is latin-based because of restrictions.
>

With "latin-based" I didn't mean the Latin encoding, but the fact that the
strings stored are in some latin language, which then again is mostly
English.

It's mostly about the storage formats used, not even so much about the
software itself I think. Looking at storage formats.. take XML or HTML,
which are probably the most common text-based formats around (and a basis
for many other formats).

Storing an XML file with UTF8-encoding definitely has advantages, as you can
easily parse it with a random-access parser (all the delimiters are in the
0..127 range), making for some lightning fast implementations. In a new
upcoming version of NativeXml which I'm creating, I already chose to handle
unicode that way; simply convert everything internally to UTF8, and also use
that as the preferred storage encoding. It definitely saves on memory. I saw
a lot of different XML files from around the world, and for instance, almost
always the tags are using latin-based names (thus in general one byte only
per character) and many of the values are numbers. And actually in most
cases the tags make up for the bulk of the file, I would say 70 - 80%.

Memory usage is often swept under the rug, but it is a bottleneck! I know, after
having seen XML files in the multi-gigabyte range. One can say available
memory will probably grow fast, but then again, it seems the amount of data
people want to store grows even faster.

Anyway, I'm not here to defend myself, I only gave a suggestion based on
experience. I love Delphi, will continue to use it, but I will definitely
not convert my software to use 2byte unicode as a default. Since I will have
to type "ansistring" or "utf8string" wherever I just could type "string", it
will be a feature I could do without.

Nils

"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> wrote in message

news:xn0frqqmn0000...@rvelthuis.de...

Remy Lebeau (TeamB)

Jun 24, 2008, 12:30:23 PM

"Pierre le Riche" <pler...@hotmail.com> wrote in message
news:4860...@newsgroups.borland.com...

> UTF-8 <-> UTF-16 conversions are really cheap.

It is still a lot of overhead to perform that on every API call:

UTF-8 parameters -> UTF-16 function call -> UTF-8 results

Compared with what Delphi has to do right now:

Ansi parameters -> UTF-16 function call -> Ansi results

With the switch over, there would be no more conversions:

UTF-16 parameters -> UTF-16 function call -> UTF-16 results

> Memory usage is a very big deal under Win32.

Hardly. Memory is cheap nowadays.

> If CodeGear enabled Unicode in the components

... thus requiring lots of conversions that could be avoided ...

> (but left the string type at AnsiString/UTF8String) and changed
> System.DefaultSystemCodePage to 65001, our applications
> would get the benefit of Unicode with significantly less work
> required on our part.

There is nothing stopping you from just using AnsiString/UTF8String if you
don't want to use the new UnicodeString type.

Gambit

Rudy Velthuis [TeamB]

Jun 24, 2008, 12:09:01 PM

Nils Haeck wrote:

> >
> > I guess you're wrong. Or it is latin-based because of restrictions.
> >
>
> With "latin-based" I didn't mean the Latin encoding, but the fact
> that the strings stored are in some latin language, which then again
> is mostly English.

I understood that, and I disagreed.

I guess that many people still use such strings (and codepages or some
such) because of the fact that current strings are still AnsiStrings.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Ah well, then I suppose I shall have to die beyond my means."
-- Oscar Wilde, dying words

Rudy Velthuis [TeamB]

Jun 24, 2008, 12:09:59 PM

Pierre le Riche wrote:

> Hi Remy,
>
> > Win32 support. Win32 API functions are UTF-16. Using UTF8String
> > as default would require a lot of conversions.
>
> UTF-8 <-> UTF-16 conversions are really cheap.

No, they are not. And UTF-16 doesn't require any conversions, which is
infinitely cheap.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"There is a charm about the forbidden that makes it unspeakably
desirable." -- Mark Twain.

Eric Grange

Jun 24, 2008, 12:54:48 PM

>> UTF-8 <-> UTF-16 conversions are really cheap.
> No, they are not.

Actually they are, no look up table or expensive processing logic
required, unlike for ANSI -> UTF16.
And given that the current ANSI -> UTF16 overhead is negligible compared
to the huge execution time of API calls, UTF8 conversion performance
would be a complete non-issue.

Eric

Eric Grange

Jun 24, 2008, 12:56:26 PM

>With the switch over, there would be no more conversions:
> UTF-16 parameters -> UTF-16 function call -> UTF-16 results

Well that's not entirely correct: there would be a conversion required
for the results between the zero-terminated convention of the API, and
the known-length Pascal string, involving either an allocation+copy, or
a search for the zero terminator (both are of similar complexity to
UTF8/16 conversions).

>There is nothing stopping you from just using AnsiString/UTF8String if
>you don't want to use the new UnicodeString type.

...apart from the native String type being hardwired to UTF16String.

Eric

Rudy Velthuis [TeamB]

Jun 24, 2008, 12:55:08 PM

Eric Grange wrote:

> >>UTF-8 <-> UTF-16 conversions are really cheap.
> > No, they are not.
>
> Actually they are, no look up table or expensive processing logic
> required, unlike for ANSI -> UTF16.

They may be cheaper than conversion from other types of encoding, but
I'd still not call them cheap.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I'm trying to see things from your point of view but I can't get
my head that far up my ass." --- Unknown

Rudy Velthuis [TeamB]

Jun 24, 2008, 12:57:54 PM

Eric Grange wrote:

> > With the switch over, there would be no more conversions:
> > UTF-16 parameters -> UTF-16 function call -> UTF-16 results
>
> Well that's not entirely correct: there would be a conversion
> required for the results between the zero-terminated convention of
> the API, and the known-length Pascal string

The current known-length Pascal string is internally in the
zero-terminated format already. I'm sure the same could be done for
Unicode strings. No conversion required.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"No one can earn a million dollars honestly."
-- William Jennings Bryan (1860-1925)

Remy Lebeau (TeamB)

Jun 24, 2008, 1:07:27 PM

"Eric Grange" <egra...@SPAMglscene.org> wrote in message
news:4861...@newsgroups.borland.com...

> >There is nothing stopping you from just using AnsiString/UTF8String
> >if you don't want to use the new UnicodeString type.
>
> ...apart from the native String type being hardwired to UTF16String.

So, you simply update your code to use 'AnsiString' or 'UTF8String'
explicitly instead of 'String' generically.

Gambit

Caleb Hattingh

Jun 24, 2008, 7:39:00 PM

On Tue, 24 Jun 2008 18:30:23 +0200, Remy Lebeau (TeamB)
<no....@no.spam.com> wrote:

> "Pierre le Riche" <pler...@hotmail.com> wrote in message
> news:4860...@newsgroups.borland.com...
>

>> Memory usage is a very big deal under Win32.
>
> Hardly. Memory is cheap nowadays.
>

If you can stay below the address space limits, e.g. Win32 in

http://msdn.microsoft.com/en-us/library/aa366778.aspx

Xu, Qian

Jun 24, 2008, 9:23:37 PM

Remy Lebeau (TeamB) wrote:
>> Memory usage is a very big deal under Win32.
>
> Hardly. Memory is cheap nowadays.
>

Delphi uses reference counting for string types internally, so I do not
think that UTF16 encoding will be much more memory intensive than UTF8.

If you compare WideString (not ref-counted), AnsiString and UTF8String,
you will see the huge difference.

--
Xu, Qian (stanleyxu)
http://stanleyxu2005.blogspot.com

Nils Haeck

Jun 25, 2008, 12:14:55 AM

> I understood that, and I disagreed.
>
> I guess that many people still use such strings (and codepages or some
> such) because of the fact that current strings are still AnsiStrings.

Don't get me wrong, I sure agree that there should be support for unicode in
each application so that any language of the world can be supported. It is
just about performance here, where performance means both memory usage and
optimal coding.

Then again, my argument holds if you look at most text-based documents in
practice: the percentage of characters that can be represented with just one
byte is huge: numbers, delimiters, GUIDs, etc. Just open the registry and
see for yourself, as an example of such data. What I've seen from my
customers in XML form also follows these trends (and they're a large group,
come from all over the world).

Nils

Pierre le Riche

Jun 25, 2008, 4:11:22 AM

Hi Rudy,

> They may be cheaper than conversion from other types of encoding, but
> I'd still not call them cheap.

My UTF8 -> UTF16 routine, which is not optimized to do more than 1
character per loop iteration, takes between 4 (with Western text) and 7
(with Asian text) clock cycles per character. Converting from UTF-16 to
UTF-8 takes between 4 and 10 clock cycles per character.

I find that I rarely need to do those conversions except when
interacting with the Win32 API. The cost of the conversion as a
percentage of the total cost (conversion cost + API call cost) is
usually very small. Since the Win32 API is so slow, I believe one should
avoid calling the Win32 API in speed critical code anyway.

In summary: I do not find the argument against using UTF-8 on the
grounds that the extra conversions make applications run slower convincing.

Regards,
Pierre

Pierre le Riche

Jun 25, 2008, 4:42:25 AM

Hi Remy,

> It is still a lot of overhead to perform that on every API call:

"A lot" is relative. It certainly takes less CPU time than the current
internal Ansi <-> UTF-16 conversions. You also have to consider that
most of the Win32 API is so expensive that the conversions are generally
not a big percentage of the total cost.

>> Memory usage is a very big deal under Win32.
> Hardly. Memory is cheap nowadays.

I was referring to the limited address space. The current 2-4GB address
space limit is the single biggest reason why I would like to see a
64-bit Delphi in the future.

> There is nothing stopping you from just using AnsiString/UTF8String if you
> don't want to use the new UnicodeString type.

That's the plan. It would be nice if Tiburon could also have good
support for UTF-8 out of the box, i.e. implicit UTF-8 <-> UTF-16
conversions and good performance. At the moment I have to do the
conversions explicitly, which is a hassle.

Regards,
Pierre

Rudy Velthuis [TeamB]

Jun 25, 2008, 7:47:50 AM

Nils Haeck wrote:

> Then again, my argument holds if you look at most text-based
> documents in practice: the percentage of characters that can be
> represented with just one byte is huge: numbers, delimiters, GUIDs,
> etc. Just open the registry and see for yourself, as an example of
> such data.

I don't think the registry is a very good example of a text based
document.

Anyway, no one prevents you from using UTF8 or Ansi. But your code will
have to specify these types explicitly and be sure to use the
corresponding routines.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"The difference between fiction and reality? Fiction has to make
sense." -- Tom Clancy

Rudy Velthuis [TeamB]

Jun 25, 2008, 7:49:25 AM

Pierre le Riche wrote:

> Hi Rudy,
>
> > They may be cheaper than conversion from other types of encoding,
> > but I'd still not call them cheap.
>
> My UTF8 -> UTF16 routine, which is not optimized to do more than 1
> character per loop iteration, takes between 4 (with Western text) and
> 7 (with Asian text) clock cycles per character. Converting from
> UTF-16 to UTF-8 takes between 4 and 10 clock cycles per character.

Fine. The non-conversion takes 0 cycles. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I don't even butter my bread; I consider that cooking."
-- Katherine Cebrian

Eric Grange

Jun 25, 2008, 11:55:27 AM

That still makes the performance argument a strawman/bad excuse.

From what has filtered from the new string type, there will likely be
much worse performance overheads involved from the dynamic
typing/codepaging, and which won't be limited to API calls.

Eric

Eric Grange

Jun 25, 2008, 11:57:49 AM

> The current known-length Pascal string is internally in the
> zero-terminated format already. I'm sure the same could be done for
> Unicode strings. No conversion required.

Going from pascal to zero-terminated is free indeed, but the opposite
isn't: you need to search for that zero to compute the length, the
complexity of this search is comparable to that Pierre gave for his
non-optimized UTF conversion routine.
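
The search in question can be sketched as follows, in Python rather than Delphi so it can be run as-is. The buffer contents and the helper name utf16z_length are assumptions; a real RTL routine would work on pointers, not byte slices.

```python
def utf16z_length(buffer: bytes) -> int:
    """Count UTF-16 code units before the first 16-bit zero terminator."""
    for i in range(0, len(buffer) - 1, 2):
        # Check both bytes of the aligned code unit, so a code unit such as
        # 0x0100 (one zero byte, one non-zero byte) is not mistaken for the end.
        if buffer[i] == 0 and buffer[i + 1] == 0:
            return i // 2
    raise ValueError("no zero terminator found")

# Hypothetical API output: text, a terminator, then unused buffer space.
buffer = "C:\\Temp".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff" * 4
length = utf16z_length(buffer)
print(length, buffer[: length * 2].decode("utf-16-le"))
```

The loop touches every code unit up to the terminator, so its cost grows with the string length, much like a per-character conversion loop.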

Eric

Eric Grange

Jun 25, 2008, 12:00:06 PM

> So, you simply update your code to use 'AnsiString' or 'UTF8String'
> explicitly instead of 'String' generically.

The "simply" here is easier said than done if the whole of VCL + RTL
uses "String"... it's as "simple" as rewriting huge chunks of the VCL
and RTL to get an alternative AnsiString-based version of the base
classes and common string routines.

Eric

Rudy Velthuis [TeamB]

Jun 25, 2008, 1:46:56 PM

Eric Grange wrote:

Indeed. I personally don't think I would do that. I'd use the standard
string type as much as I could.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Did you ever walk in a room and forget why you walked in? I think
that's how dogs spend their lives." -- Sue Murphy.

Rudy Velthuis [TeamB]

Jun 25, 2008, 1:45:55 PM

Eric Grange wrote:

What do you mean with dynamic typing/codepaging?

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Go on, get out. Last words are for fools who haven't said
enough." -- Karl Marx, dying words to his housekeeper.

Rudy Velthuis [TeamB]

Jun 25, 2008, 1:49:24 PM

Eric Grange wrote:

> > The current known-length Pascal string is internally in the
> > zero-terminated format already. I'm sure the same could be done for
> > Unicode strings. No conversion required.
>
> Going from pascal to zero-terminated is free indeed, but the opposite
> isn't: you need to search for that zero to compute the length

Most API routines return the length after the call or even tell you the
length of the buffer needed before the call. OK, if you do:

Str := PChar(Str);

to set the length, then something like StrLen will be used and perhaps
even a reallocation will have to take place. Same with Ansi, AFAIK.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Having the source code is the difference between buying a house
and renting an apartment." -- Behlendorf

dk_sz

Jun 25, 2008, 2:08:16 PM

>> Memory usage is a very big deal under Win32.
>
> Hardly. Memory is cheap nowadays.

Most certainly not. The problem is memory address space.
It is extremely easy to hit for any data intensive applications...

best regards
Thomas Schulz

dk_sz

Jun 25, 2008, 2:13:03 PM

> Only when fed characters that do not go above 127, which is something
> UTF-8 is designed for.

If you encounter a byte value in the ASCII range
in a UTF-8 document, it is ASCII. So as long as you
are parsing e.g. HTML, XML etc., you are fine.

Personally, I also convert all XML unicode documents
into UTF-8 and handle parsing in that. Less memory
and faster as well.
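
Why the byte-wise parsing is safe can be sketched in a few lines of Delphi (CountByte and the sample bytes are made up for illustration): every byte of a multi-byte UTF-8 sequence has its high bit set, so a search for an ASCII delimiter such as '<' can never match in the middle of another character.

```delphi
program Utf8AsciiScan;
{$APPTYPE CONSOLE}

const
  // Hypothetical sample document, spelled out as raw UTF-8 bytes.
  Doc: array[0..11] of Byte = (
    $3C, $70, $3E,         // '<p>'
    $C3, $A9,              // U+00E9, two bytes, both >= $80
    $E6, $97, $A5,         // U+65E5, three bytes, all >= $80
    $3C, $2F, $70, $3E);   // '</p>'

// Counts a given byte value without decoding the UTF-8 at all.
function CountByte(const Buf: array of Byte; Value: Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Buf) to High(Buf) do
    if Buf[I] = Value then
      Inc(Result);
end;

begin
  // Only the two real '<' characters are found; no lead or
  // continuation byte can equal $3C.
  Writeln(CountByte(Doc, Ord('<')));
end.
```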

best regards
Thomas Schulz

Remy Lebeau (TeamB)

Jun 25, 2008, 3:00:22 PM

"Pierre le Riche" <pler...@hotmail.com> wrote in message

news:486204f2$1...@newsgroups.borland.com...

> It would be nice if Tiburon could also have good support
> for UTF-8 out of the box, i.e. implicit UTF-8 <-> UTF-16
> conversions and good performance.

I can't tell you how exactly, because CodeGear has not publicized those
details, but Tiburon will have more support for UTF-8.

Gambit

Eric Grange

Jun 26, 2008, 3:58:18 AM

> [...] Same with Ansi, AFAIK.

Indeed, but the point is that such an operation will have to happen at
some point anyway, and its cost is similar to that of a UTF conversion.
So your "zero cycle" argument doesn't hold: first because it's not zero
cycle, and second because there will be a search & copy.

Interestingly enough, converting from zero-terminated UTF16 to "Pascal
UTF8" might occasionally turn out faster than converting to "Pascal
UTF16", since there will typically be fewer bytes to write to the
destination.

Eric

Rudy Velthuis [TeamB]

Jun 26, 2008, 7:07:13 AM

Eric Grange wrote:

> > [...] Same with Ansi, AFAIK.
>
> Indeed, but the point is given that such an operation will have to
> happen at some point anyway

There is no need for the conversion or even simple copying or counting,
not in Unicode and not in Ansi (IOW, in my current Delphi 2007 code),
if I use the payload of the string as the buffer for the API call, IOW,
if you DON'T use

MyStr := PChar(MyStr);

> and that its cost is similar to an UTF
> conversion, your "zero cycle" argument doesn't hold

Yes, it does.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"A woman is an occasional pleasure but a cigar is always a smoke."
-- Groucho Marx

Eric Grange

Jun 26, 2008, 9:34:33 AM

> There is no need for the conversion or even simple copying or counting,
> not in Unicode and not in Ansi (IOW, in my current Delphi 2007 code),
> if I use the payload of the string as the buffer for the API call, IOW,
> if you DON'T use
>
> MyStr := PChar(MyStr);

The cases where you can use strings as result buffers are those that
return only one string, and they are irrelevant performance-wise, like
GetComputerName and siblings.
In all the other cases, the ones that matter, you'll be using the API's
data structures (à la WIN32_FIND_DATA, etc.), whose content you'll be
converting to regular Pascal strings.

Eric

Nils Haeck

Jun 26, 2008, 9:43:44 AM

> I don't think the registry is a very good example of a text based
> document.

What then? A word-file written in Swahili? I don't know about you, maybe I'm
missing the zillions of Delphi applications that have to interact with these
kinds of files...

> Anyway, no one prevents you from using UTF8 or Ansi. But your code will
> have to specify these types explicitly and be sure to use the
> corresponding routines.

Now we're talking in circles. Of course I know I can do that and no one is
preventing me. But so far I have seen no argument at all that warrants the
dumb step of choosing UTF-16 as the default string type. In other words, it
just means extra work.

Nils

Rudy Velthuis [TeamB]

Jun 26, 2008, 12:28:08 PM

Nils Haeck wrote:

> > I don't think the registry is a very good example of a text based
> > document.
>
> What then? A word-file written in Swahili?

Not many special Unicode characters in the registry, AFAICS. A Japanese
text, or an Arabic one might be a better example. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Everybody's worried about stopping terrorism. Well, there's a
really easy way: stop participating in it." -- Noam Chomsky

Rudy Velthuis [TeamB]

Jun 26, 2008, 12:33:22 PM

Eric Grange wrote:

> The cases where you can use strings as result buffers are those that
> return only one string

Why? What if they return multiple strings in different buffers? I can
still use strings as buffers, and I usually even do that. But anyway,
using UTF-16 throughout makes life a lot easier for those who need
Unicode.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"A camel is a horse designed by a committee" -- Unknown

Rudy Velthuis [TeamB]

Jun 26, 2008, 12:31:20 PM

Nils Haeck wrote:

The fact that you can use Unicode wherever you need it, and that the
VCL, RTL etc. will all be Unicode. No need to (often explicitly)
convert between UTF-8 and UTF-16 when calling an API.

I'd say it simplifies life a lot for those who need Unicode. And I
don't see it making life any harder for the others.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"Comedy is nothing more than tragedy deferred."
-- Pico Iyer, Time

Rudy Velthuis [TeamB]

Jun 26, 2008, 12:34:21 PM

Eric Grange wrote:

> In all other cases that matter,
> you'll be using the API's datastructures (ala WIN32_FIND_DATA, etc.),
> whose content you'll be converting to regular pascal String.

"Converting" is quite a word. It is a simple copy. I don't see how
UTF-8 would make this any easier.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"You got to be careful if you don't know where you're going,
because you might not get there." -- Yogi Berra

Eric Grange

Jun 27, 2008, 3:51:00 AM

> "Converting" is quite a word. It is a simple copy. I don't see how
> UTF-8 would make this any easier.

Well, you should have a look at UTF16-UTF8 conversion routines then, you
would understand why they're cheap: the amount of work they're doing is
comparable to a copy (see Pierre's clock cycles per character figures).

Where UTF8 can make it easier on the CPU (faster) is that fewer bytes
are involved in typical texts: fewer bytes to read, fewer bytes to write.
Given that reads/writes to main memory are relatively slow, copying a
UTF16 string unchanged to a UTF16 string (simple copy) can be slower
than converting from UTF16 to UTF8 as soon as your app is a little
memory intensive (you'll end up reading the same number of bytes, but
you'll be writing fewer).

As soon as you factor in further uses of the string, the size-induced
speedup can quickly become significant: even if you merely compare your
string once, or search for a substring in it once, the reduced byte
count will improve things (fewer bytes to compare, better cache
utilization, etc.).
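
To put rough numbers on the size argument, here is a small Delphi sketch that assumes the RTL's UTF8Encode and uses a made-up Report helper. It only measures byte counts, not speed, and the two sample strings are arbitrary.

```delphi
program Utf8VsUtf16Bytes;
{$APPTYPE CONSOLE}

// Prints the payload size of the same text in both encodings.
procedure Report(const W: WideString);
var
  U: UTF8String;
begin
  U := UTF8Encode(W);
  Writeln('UTF-16: ', Length(W) * SizeOf(WideChar), ' bytes, UTF-8: ',
    Length(U), ' bytes');
end;

var
  Cjk: WideString;
begin
  // 24 ASCII characters: 48 bytes as UTF-16, 24 bytes as UTF-8.
  Report('FastCode string routines');

  // One CJK character (U+65E5): 2 bytes as UTF-16, 3 bytes as UTF-8.
  Cjk := WideChar($65E5);
  Report(Cjk);
end.
```

The saving therefore depends on the text: ASCII-heavy data halves in size, while text made up mostly of CJK characters grows when stored as UTF-8.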

Eric

Rudy Velthuis [TeamB]

Jun 27, 2008, 5:56:55 AM

Eric Grange wrote:

> > "Converting" is quite a word. It is a simple copy. I don't see how
> > UTF-8 would make this any easier.
>
> Well, you should have a look at UTF16-UTF8 conversion routines then,
> you would understand why they're cheap: the amount of work they're
> doing is comparable to a copy (see Pierre's clock cycles per
> character figures).

Well, I don't see why you would want to. Using UTF-16 throughout is a
lot easier.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"In any contest between power and patience, bet on patience."
-- W.B. Prescott

Eric Grange

Jun 27, 2008, 6:06:02 AM

> Well, I don't see why you would want to. Using UTF-16 throughout is a
> lot easier.

We were talking about performance, the point *you* raised in case you
forgot ;)

As for "easy for codegear" that's irrelevant IMO, the point here is
about CodeGear offering added-value rather than saving bucks by going
for the easiest solution *for them* while causing trouble *for us*, see
initial posts.

Eric

Rudy Velthuis [TeamB]

Jun 27, 2008, 6:22:08 AM

Eric Grange wrote:

> > Well, I don't see why you would want to. Using UTF-16 throughout is
> > a lot easier.
>
> We were talking about performance, the point you raised in case you
> forgot ;)
>
> As for "easy for codegear"

Easy for the programmer. Simply stop worrying about conversions and use
string as you always did. No need to worry about conversions and
reallocations each time you go from UTF-16 to UTF-8 and back.

Performance is only a concern if it becomes noticeable.

--
Rudy Velthuis [TeamB] http://www.teamb.com

"No Sane man will dance." -- Cicero (106-43 B.C.)

Q Correll

Jun 27, 2008, 1:22:09 PM

Rudy,

| Performance is only a concern if it becomes noticeable.

One of the tenets of the "good enough" philosophy. ;-)

--
Q <proud hater of "good enough">

06/27/2008 10:19:12

XanaNews Version 1.17.5.7 [Q's Salutation mod]

Rudy Velthuis [TeamB]

Jun 27, 2008, 1:31:51 PM

Q Correll wrote:

> Rudy,
>
> > Performance is only a concern if it becomes noticeable.
>
> One of the tenets of the "good enough" philosophy. ;-)

Well, it is merely meant to avoid premature optimization. <g>

--
Rudy Velthuis [TeamB] http://www.teamb.com

"I've just learned about his illness. Let's hope it's nothing
trivial." -- Irvin S. Cobb

Q Correll

Jun 27, 2008, 7:53:00 PM

Rudy,

| Well, it is merely meant to avoid premature optimization. <g>

--
Q

06/27/2008 16:52:42