Cheers,
Lee
> Hello FastCoders, just a curiosity question, but what kind of concerns
> do you have about FastCode routines working with the new VCL coming in
> the next release of Delphi?
Nothing yet, but let us discuss what we have to do. Please correct me if I
am wrong.
Functions defined with AnsiString as string type will have no problems
TCharPosFunction = function(Chr : Char; const Str : AnsiString) : Integer;
but functions defined to take or return the type "string" will fail if they
do not support WideString.
TTrimFunction = function (const S: string) : string;
One solution is to test that string=AnsiString prior to calling the
function. Another is to redefine the function prototypes.
TTrimFunction = function (const S: AnsiString) : AnsiString;
IMHO this latter solution is the correct one.
Why were functions like Trim defined to take and return "string" in the first
place?
Best regards
Dennis Kjaer Christensen
> Functions defined with AnsiString as string type will have no problems
>
> TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> Integer;
That uses Char, not AnsiChar, so it should have a big problem. <g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Research is what I'm doing when I don't know what I'm doing."
-- Wernher Von Braun (1912-1977)
| Why were functions like Trim defined to take and return "string" in
| the first place?
Perhaps because in the ol'days "string" was all there was?
I am still shuddering at the thought of modifying all the apps I have
coded with plain ol' string. <sigh>
--
Q
04/05/2008 10:09:09
XanaNews Version 1.17.5.7 [Q's Salutation mod]
> Perhaps because in the ol'days "string" was all there was?
OK
> I am still shuddering at the thought of modifying all the apps I have
> coded with plain ol' string. <sigh>
I am not sure it will be too hard. CodeGear has probably done a lot of work to
make the transition to Unicode easy.
What if you simply make a search and replace on string -> AnsiString - will
everything just work as before?
I have to admit that I know very little about Unicode ;-)
> > TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> > Integer;
>
> That uses Char, not AnsiChar, so it should have a big problem. <g>
OK
Is this good?
TCharPosFunction = function(Chr : AnsiChar; const Str : AnsiString) :
Integer;
Guess we have much work to do ;-)
> Hi Rudy
>
> > > TCharPosFunction = function(Chr : Char; const Str : AnsiString) :
> > > Integer;
> >
> > That uses Char, not AnsiChar, so it should have a big problem. <g>
>
> OK
>
> Is this good?
>
> TCharPosFunction = function(Chr : AnsiChar; const Str : AnsiString) :
> Integer;
That should be fine. <g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Humor is just another defense against the universe."
-- Mel Brooks (1926- )
> Hi
>
> > Perhaps because in the ol'days "string" was all there was?
>
> OK
>
> > I am still shuddering at the thought of modifying all the apps I
> > have coded with plain ol' string. <sigh>
>
> I am not sure it will be too hard. CodeGear has probably done a lot of
> work to make the transition to Unicode easy.
>
> What if you simply make a search and replace on string -> AnsiString
> - will everything just work as before?
Yes, and no. Also Char, PChar etc. must be replaced. Then, since the
library routines like Format or Trim will probably return
UnicodeStrings, what if you replace all occurrences of string with
AnsiString?
var
  S, T: AnsiString;
begin
  S := ' Hello ';
  T := Format('%s', [S]);
This will cause two conversions: S will be converted to UnicodeString,
passed to Format, and the result of Format must be converted back.
I'd rather see new routines for the new Char (i.e. WideChar).
--
Rudy Velthuis [TeamB] http://www.teamb.com
"USA Today has come out with a new survey: Apparently three out
of four people make up 75 percent of the population."
-- David Letterman.
> Dennis wrote:
>
> > Hi
> >
> > > Perhaps because in the ol'days "string" was all there was?
> >
> > OK
> >
> > > I am still shuddering at the thought of modifying all the apps I
> > > have coded with plain ol' string. <sigh>
> >
> > I am not sure it will be too hard. CodeGear has probably done a lot of
> > work to make the transition to Unicode easy.
> >
> > What if you simply make a search and replace on string -> AnsiString
> > - will everything just work as before?
>
> Yes, and no. Also Char, PChar etc. must be replaced.
Oh, and if you cast a string to PAnsiChar instead of PChar, you might
be casting a UnicodeString to PAnsiChar, which is hardly what you want:
ShellExecuteA(0, PAnsiChar('myfile.txt'), ...);
Since myfile.txt is probably a string, that might cause problems. So
you should do:
ShellExecute(0, PChar('myfile.txt'), ...);
or:
ShellExecuteA(0, PAnsiChar(AnsiString('myfile.txt')), ...);
Note that I used the -A version of ShellExecute for the PAnsiChar cast.
<g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Some cause happiness wherever they go; others, whenever they go."
-- Oscar Wilde (1854-1900)
| I am not sure it will be too hard. CodeGear has probably done a lot of
| work to make the transition to Unicode easy.
I'm not either. However, I am still quite nervous about it. <g>
| What if you simply make a search and replace on string -> AnsiString
| - will everything just work as before?
According to Nick Hodges several months ago, that will work just fine.
He suggested at the time to do it "now." However, when I thought about
it a bit I realized that a global replace isn't the easiest thing to do
due to my commenting. And that means I would have to make a decision
for EVERY frickin' occurrence of "string" in ALL of my apps' code.
I'm still nervous. <g>
| I have to admit that I know very little about Unicode ;-)
If you know a little, then you know more than my "nothing." <g>
--
Q
04/05/2008 11:41:19
> Dennis,
>
> > I am not sure it will be too hard. CodeGear has probably done a lot of
> > work to make the transition to Unicode easy.
>
> I'm not either. However, I am still quite nervous about it. <g>
>
> > What if you simply make a search and replace on string -> AnsiString
> > - will everything just work as before?
>
> According to Nick Hodges several months ago, that will work just fine.
Actually, no it won't. See my replies. In general, not converting
strings to anything will probably work much better, but of course
assembler routines should know that Char is now WideChar.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"In any contest between power and patience, bet on patience."
-- W.B. Prescott
If FastCode is only designed for the ANSI world, it will be useless for
Tiburon.
--
Best regards
Xu, Qian (stanleyxu)
http://stanleyxu2005.blogspot.com/
| Actually, no it won't.
Why doesn't that surprise me?
| See my replies.
Thanks. I have been reading your comments. And they make sense to me.
And they also make me even more concerned about what Nick typed those
few months ago. <sigh>
--
Q
04/05/2008 15:44:57
> Rudy,
>
> > Actually, no it won't.
>
> Why doesn't that surprise me?
>
> > See my replies.
>
> Thanks. I have been reading your comments. And they make sense to
> me. And they also make me even more concerned about what Nick typed
> those few months ago. <sigh>
I'm not very concerned. Problems can arise if you mix and convert
between Ansi and Unicode strings, but most people don't do that, and if
they do, they already have safeguards in place.
Like I said: it is much safer to use the general string type than to
explicitly declare every string as AnsiString.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I think it would be a good idea."
-- Mahatma Gandhi (1869-1948), when asked what he thought of
Western civilization
| Like I said: it is much safer to use the general string type than to
| explicitly declare every string as AnsiString.
I hope that's still true in the eventual Unicode Delphi release.
--
Q
04/06/2008 09:06:01
> Rudy,
>
> > Like I said: it is much safer to use the general string type than
> > to explicitly declare every string as AnsiString.
>
> I hope that's still true in the eventual Unicode Delphi release.
That's what I meant. If you do that now, it is most likely also safe in
the new Delphi.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I do not have a body, I am a body." -- Unknown
| That's what I meant. If you do that now, it is most likely also safe
| in the new Delphi.
I'm still nervous. <crossing fingers>
--
Q
04/06/2008 11:00:24
I'm not an expert either.
Here's a good starting place:
http://www.codinghorror.com/blog/archives/001084.html
--
Lee
Another argument: you could state that a 2 bytes/char standard is easier for
random access, but that limits the set to 2^16 possible values, and Unicode
already contains a lot more characters than that, so we'll face the same
problems in a few years. UTF-8, on the other hand, is extensible (1 to 4
bytes per character, up to 6 in its original definition).
Yet another argument is that most old code (parsers, etc. that look for char
values in the 0..127 range) would just work when fed a UTF-8 string,
since UTF-8 is smartly defined to keep these 0..127 characters identical.
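That ASCII-compatibility property is easy to verify; a small Python check (the markup strings are made-up examples):

```python
# UTF-8 keeps code points 0..127 byte-for-byte identical to ASCII, and
# encodes everything else using only bytes >= 0x80, so a byte-oriented
# parser looking for ASCII delimiters never mistakes a continuation byte
# for one of them.
markup = '<price currency="EUR">9,99</price>'
assert markup.encode('utf-8') == markup.encode('ascii')

# every byte of a non-ASCII character's UTF-8 sequence is >= 0x80:
assert all(b >= 0x80 for b in 'ö€'.encode('utf-8'))
```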
Nils
"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schreef in bericht
news:xn0fokzk04boyz...@rvelthuis.de...
> I still find it quite a waste to use Unicode strings, and wonder why
> CG didn't choose to use Utf8string instead as default. Maybe I'm
> selfish and not international enough, but from my point of view (and
> customer base), I'd say 80% of the software is latin-based
I guess you're wrong. Or it is latin-based because of restrictions.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I have come to believe that the whole world is an enigma, a
harmless enigma that is made terrible by our own mad attempt to
interpret it as though it had an underlying truth."
-- Umberto Eco
> I still find it quite a waste to use Unicode strings, and wonder why
> CG didn't choose to use Utf8string instead as default.
Win32 support. Win32 API functions are UTF-16. Using UTF8String as default
would require a lot of conversions. The Ansi version of the VCL today
already has to do that anyway. By moving to UTF-16 by default, those
conversions are avoided, increasing performance significantly.
> I'd say 80% of the software is latin-based
Perhaps, but 90-odd% of PCs are running a Unicode-based OS, even if
configured for a Latin-based language. But then, you also have to think
about the many markets that actually use multi-byte character
languages. Those benefit from the Unicode switchover.
> with a few additional "weird" characters here and there, that would
> perhaps require 3 bytes per character.
You could say the same for UTF-8, though. In fact, UTF-8 supports encoded
characters up to 6 bytes per logical character. Even MBCS supported 2-byte
characters in a 1-byte environment.
> Much better than the standard 2 bytes per character in proposed Delphi
> "Unicode"
Other than memory usage, I don't see why it would be better.
> Another argument: you could state that 2 bytes/char standard is
> easier for random access, but that limits the set to 2^16 possible
> values, and Unicode already contains a lot more characters than
> that
You can say the same for MBCS as well. Unicode surrogate pairs address
that issue, just as lead bytes address it in MBCS.
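The surrogate-pair behaviour can be checked quickly in Python (the example code point U+1D11E, the musical G clef, is arbitrary):

```python
# Code points above U+FFFF take a surrogate pair (two 16-bit code units)
# in UTF-16, just as non-ASCII characters take multiple bytes in UTF-8 --
# so UTF-16 is not a fixed-width encoding either.
clef = '\U0001D11E'
assert len(clef.encode('utf-16-le')) == 4   # two code units = one pair
assert len(clef.encode('utf-8')) == 4       # four bytes in UTF-8 as well
```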
> Utf8 on the other hand, is extendible (1, 3, 5, 7.. bytes per character).
Which makes it a lot harder to process, exactly. The loss of random seeking
alone would be too huge a performance hit to be beneficial.
> Yet another argument is that most old code (parsers, etc that look
> for char values in 0..127 range) would just work when feeding it
> an UTF8 string
Only when fed characters that do not go above 127, which is something UTF-8
is designed for.
> since UTF8 is smartly defined to keep these 0..127 characters identical.
But, if fed a UTF-8 encoded Unicode string with characters above 127, those
parsers would fail anyway. At least by moving to Unicode, those parsers
would have a chance of processing higher characters now.
Gambit
> Win32 support. Win32 API functions are UTF-16. Using UTF8String as default
> would require a lot of conversions.
UTF-8 <-> UTF-16 conversions are really cheap.
> The Ansi version of the VCL today already has to do that anyway. By moving to UTF-16 by default, those conversions are avoided, increasing performance significantly.
UTF-8 is also faster than the current AnsiString situation, because the
conversion is cheap. Of course avoiding a conversion completely is the
fastest, but the Win32 API is generally so slow that the difference is
not noticeable.
> Other than memory usage, I don't see why it would be better.
Memory usage is a very big deal under Win32.
> Which makes it a lot harder to process, exactly. The loss of random
> seeking alone would be too huge a performance hit to be beneficial.
Neither is a fixed-length encoding, so the complexity required to
process them is comparable.
> But, if fed a UTF-8 encoded Unicode string with characters above 127, those
> parsers would fail anyway.
There are many that continue to work fine, and that is the point that
Nils is trying to make and also what I have found in practice.
Switching to UTF-8 is effectively a code page change (CP65001, to be
exact), so any routines that were code page agnostic before continue to
work fine. That is a large chunk of the RTL.
If CodeGear enabled Unicode in the components (but left the string type
at AnsiString/UTF8String) and changed System.DefaultSystemCodePage to
65001, our applications would get the benefit of Unicode with
significantly less work required on our part.
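The "code page agnostic" point can be sketched like this in Python (the XML fragment is invented):

```python
# A byte-level routine that only cares about ASCII delimiters works
# unchanged when the payload bytes happen to be UTF-8, because UTF-8
# multi-byte sequences never contain bytes in the ASCII range.
data = '<name>Köln</name>'.encode('utf-8')
start = data.index(b'>') + 1     # find ASCII delimiters at the byte level
end = data.index(b'</')
assert data[start:end].decode('utf-8') == 'Köln'
```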
Regards,
Pierre
With "latin-based" I didn't mean the Latin encoding, but the fact that the
strings stored are in some latin language, which then again is mostly
English.
It's mostly about the storage formats used, not even so much about the
software itself I think. Looking at storage formats.. take XML or HTML,
which are probably the most common text-based formats around (and a basis
for many other formats).
Storing an XML file with UTF8-encoding definitely has advantages, as you can
easily parse it with a random-access parser (all the delimiters are in the
0..127 range), making for some lightning fast implementations. In a new
upcoming version of NativeXml which I'm creating, I already chose to handle
unicode that way; simply convert everything internally to UTF8, and also use
that as the preferred storage encoding. It definitely saves on memory. I saw
a lot of different XML files from around the world, and for instance, almost
always the tags use latin-based names (thus in general only one byte
per character) and many of the values are numbers. And actually in most
cases the tags make up the bulk of the file, I would say 70 - 80%.
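A rough Python check of this storage claim (the XML fragment is invented):

```python
# For mostly-ASCII XML like this, UTF-8 needs exactly half the bytes of
# UTF-16, since every ASCII character is 1 byte in UTF-8 and 2 in UTF-16.
xml = '<item id="42"><price>19.99</price></item>'
utf8 = xml.encode('utf-8')
utf16 = xml.encode('utf-16-le')
assert len(utf8) * 2 == len(utf16)   # every character here is ASCII
```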
Memory usage is often brushed aside, but it is a bottleneck! I know, after
having seen XML files in the multi-gigabyte range. One can say available
memory will probably grow fast, but then again, it seems the amount of data
people want to store grows even faster.
Anyway, I'm not here to defend myself, I only gave a suggestion based on
experience. I love Delphi, will continue to use it, but I will definitely
not convert my software to use 2-byte Unicode as a default. Since I will have
to type "AnsiString" or "UTF8String" wherever I could just type "string", it
will be a feature I could do without.
Nils
"Rudy Velthuis [TeamB]" <newsg...@rvelthuis.de> schreef in bericht
news:xn0frqqmn0000...@rvelthuis.de...
> UTF-8 <-> UTF-16 conversions are really cheap.
It is still a lot of overhead to perform that on every API call:
UTF-8 parameters -> UTF-16 function call -> UTF-8 results
Compared with what Delphi has to do right now:
Ansi parameters -> UTF-16 function call -> Ansi results
With the switch over, there would be no more conversions:
UTF-16 parameters -> UTF-16 function call -> UTF-16 results
> Memory usage is a very big deal under Win32.
Hardly. Memory is cheap nowadays.
> If CodeGear enabled Unicode in the components
... thus requiring lots of conversions that could be avoided ...
> (but left the string type at AnsiString/UTF8String) and changed
> System.DefaultSystemCodePage to 65001, our applications
> would get the benefit of Unicode with significantly less work
> required on our part.
There is nothing stopping you from just using AnsiString/UTF8String if you
don't want to use the new UnicodeString type.
Gambit
> >
> > I guess you're wrong. Or it is latin-based because of restrictions.
> >
>
> With "latin-based" I didn't mean the Latin encoding, but the fact
> that the strings stored are in some latin language, which then again
> is mostly English.
I understood that, and I disagreed.
I guess that many people still use such strings (and codepages or some
such) because current strings are still AnsiStrings.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Ah well, then I suppose I shall have to die beyond my means."
-- Oscar Wilde, dying words
> Hi Remy,
>
> > Win32 support. Win32 API functions are UTF-16. Using UTF8String
> > as default would require a lot of conversions.
>
> UTF-8 <-> UTF-16 conversions are really cheap.
No, they are not. And UTF-16 doesn't require any conversions, which is
infinitely cheap.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"There is a charm about the forbidden that makes it unspeakably
desirable." -- Mark Twain.
Actually they are: no lookup table or expensive processing logic is
required, unlike for ANSI -> UTF16.
And given that the current ANSI -> UTF16 overhead is negligible compared
to the huge execution time of API calls, UTF8 conversion performance
would be a complete non-issue.
Eric
Well, that's not entirely correct: there would be a conversion required
for the results between the zero-terminated convention of the API and
the known-length Pascal string, involving either an allocation+copy or
a search for the zero terminator (both are of similar complexity to
UTF8/16 conversions).
>There is nothing stopping you from just using AnsiString/UTF8String if
>you don't want to use the new UnicodeString type.
...apart from the native String type being hardwired to UTF16String.
Eric
> >>UTF-8 <-> UTF-16 conversions are really cheap.
> > No, they are not.
>
> Actually they are, no look up table or expensive processing logic
> required, unlike for ANSI -> UTF16.
They may be cheaper than conversion from other types of encoding, but
I'd still not call them cheap.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I'm trying to see things from your point of view but I can't get
my head that far up my ass." --- Unknown
> > With the switch over, there would be no more conversions:
> > UTF-16 parameters -> UTF-16 function call -> UTF-16 results
>
> Well that's not entirely correct: there would be a conversion
> required for the results between the zero-terminated convention of
> the API, and the known-length Pascal string
The current known-length Pascal string is internally in the
zero-terminated format already. I'm sure the same could be done for
Unicode strings. No conversion required.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"No one can earn a million dollars honestly."
-- William Jennings Bryan (1860-1925)
> >There is nothing stopping you from just using AnsiString/UTF8String
> >if you don't want to use the new UnicodeString type.
>
> ...apart from the native String type being hardwired to UTF16String.
So, you simply update your code to use 'AnsiString' or 'UTF8String'
explicitly instead of 'String' generically.
Gambit
> "Pierre le Riche" <pler...@hotmail.com> wrote in message
> news:4860...@newsgroups.borland.com...
>
>> Memory usage is a very big deal under Win32.
>
> Hardly. Memory is cheap nowadays.
>
As long as you can stay below the address space limits it is not a big deal.
Delphi on Win32 uses reference counting for string types internally, so I do
not think that UTF16 encoding will be much more memory intensive than UTF8.
If you compare WideString (not ref-counted) with AnsiString and UTF8String,
you will see the huge difference.
--
Xu, Qian (stanleyxu)
http://stanleyxu2005.blogspot.com
Don't get me wrong, I sure agree that there should be support for Unicode in
each application so that any language of the world can be supported. It is
just about performance here, where performance means both memory usage and
optimal coding.
Then again, my argument holds if you look at most text-based documents in
practice: the percentage of characters that can be represented with just one
byte is huge: numbers, delimiters, GUIDs, etc. Just open the registry and
see for yourself, as an example of such data. What I've seen from my
customers in XML form also follows these trends (and they're a large group,
coming from all over the world).
Nils
> They may be cheaper than conversion from other types of encoding, but
> I'd still not call them cheap.
My UTF8 -> UTF16 routine, which is not optimized to do more than 1
character per loop iteration, takes between 4 (with Western text) and 7
(with Asian text) clock cycles per character. Converting from UTF-16 to
UTF-8 takes between 4 and 10 clock cycles per character.
I find that I rarely need to do those conversions except when
interacting with the Win32 API. The cost of the conversion as a
percentage of the total cost (conversion cost + API call cost) is
usually very small. Since the Win32 API is so slow, I believe one should
avoid calling the Win32 API in speed critical code anyway.
In summary: I do not find the argument against using UTF-8 on the
grounds that the extra conversions make applications run slower convincing.
Regards,
Pierre
> It is still a lot of overhead to perform that on every API call:
"A lot" is relative. It certainly takes less CPU time than the current
internal Ansi <-> UTF-16 conversions. You also have to consider that
most of the Win32 API is so expensive that the conversions are generally
not a big percentage of the total cost.
>> Memory usage is a very big deal under Win32.
> Hardly. Memory is cheap nowadays.
I was referring to the limited address space. The current 2-4GB address
space limit is the single biggest reason why I would like to see a
64-bit Delphi in the future.
> There is nothing stopping you from just using AnsiString/UTF8String if you
> don't want to use the new UnicodeString type.
That's the plan. It would be nice if Tiburon could also have good
support for UTF-8 out of the box, i.e. implicit UTF-8 <-> UTF-16
conversions and good performance. At the moment I have to do the
conversions explicitly, which is a hassle.
Regards,
Pierre
> Then again, my argument holds if you look at most text-based
> documents in practice: the percentage of characters that can be
> represented with just one byte is huge: numbers, delimiters, GUIDs,
> etc. Just open the registry and see for yourself, as an example of
> such data.
I don't think the registry is a very good example of a text based
document.
Anyway, no one prevents you from using UTF8 or Ansi. But your code will
have to specify these types explicitly and be sure to use the
corresponding routines.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"The difference between fiction and reality? Fiction has to make
sense." -- Tom Clancy
> Hi Rudy,
>
> > They may be cheaper than conversion from other types of encoding,
> > but I'd still not call them cheap.
>
> My UTF8 -> UTF16 routine, which is not optimized to do more than 1
> character per loop iteration, takes between 4 (with Western text) and
> 7 (with Asian text) clock cycles per character. Converting from
> UTF-16 to UTF-8 takes between 4 and 10 clock cycles per character.
Fine. The non-conversion takes 0 cycles. <g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I don't even butter my bread; I consider that cooking."
-- Katherine Cebrian
From what has filtered out about the new string type, there will likely be
much worse performance overheads involved, from the dynamic
typing/codepaging, and they won't be limited to API calls.
Eric
Going from pascal to zero-terminated is free indeed, but the opposite
isn't: you need to search for that zero to compute the length, the
complexity of this search is comparable to that Pierre gave for his
non-optimized UTF conversion routine.
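The length-recovery cost described here can be sketched in Python with a byte buffer (the buffer contents are made up):

```python
# With a zero-terminated buffer you must scan for the terminator to learn
# the length -- a linear byte-by-byte pass, comparable in cost to one
# encoding conversion over the same data.
buf = b'C:\\myfile.txt\x00<uninitialised tail>'
length = buf.index(b'\x00')          # what StrLen does, byte by byte
assert buf[:length] == b'C:\\myfile.txt'
```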
Eric
The "simply" here is easier said than done if the whole of the VCL + RTL
uses "String"... it's as "simple" as rewriting huge chunks of the VCL
and RTL to get an alternative AnsiString-based version of the base
classes and common string routines.
Eric
Indeed. I personally don't think I would do that. I'd use the standard
string type as much as I could.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Did you ever walk in a room and forget why you walked in? I think
that's how dogs spend their lives." -- Sue Murphy.
What do you mean by dynamic typing/codepaging?
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Go on, get out. Last words are for fools who haven't said
enough." -- Karl Marx, dying words to his housekeeper.
> > The current known-length Pascal string is internally in the
> > zero-terminated format already. I'm sure the same could be done for
> > Unicode strings. No conversion required.
>
> Going from pascal to zero-terminated is free indeed, but the opposite
> isn't: you need to search for that zero to compute the length
Most API routines return the length after the call or even tell you the
length of the buffer needed before the call. OK, if you do:
Str := PChar(Str);
to set the length, then something like StrLen will be used and perhaps
even a reallocation will have to take place. Same with Ansi, AFAIK.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Having the source code is the difference between buying a house
and renting an apartment." -- Behlendorf
Most certainly not. The problem is memory address space.
It is extremely easy to hit for any data intensive applications...
best regards
Thomas Schulz
If you encounter a byte value in the ASCII range in a UTF-8 document,
it is ASCII... So as long as you are parsing e.g. HTML, XML etc. you
are fine.
Personally, I also convert all XML unicode documents
into UTF-8 and handle parsing in that. Less memory
and faster as well.
best regards
Thomas Schulz
> It would be nice if Tiburon could also have good support
> for UTF-8 out of the box, i.e. implicit UTF-8 <-> UTF-16
> conversions and good performance.
I can't tell you how exactly, because CodeGear has not publicized those
details, but Tiburon will have more support for UTF-8.
Gambit
Indeed, but the point is that such an operation will have to
happen at some point anyway, and that its cost is similar to a UTF
conversion, so your "zero cycle" argument doesn't hold: first because it's
not zero cycles, and second because there will be a search & copy.
Interestingly enough, converting from zero-terminated UTF16 to "Pascal
UTF8" might occasionally turn out faster than converting to "Pascal
UTF16", since there will typically be fewer bytes to write in the destination.
Eric
> > [...] Same with Ansi, AFAIK.
>
> Indeed, but the point is given that such an operation will have to
> happen at some point anyway
There is no need for the conversion or even simple copying or counting,
not in Unicode and not in Ansi (IOW, in my current Delphi 2007 code),
if I use the payload of the string as the buffer for the API call, IOW,
if you DON'T use
MyStr := PChar(MyStr);
> and that its cost is similar to an UTF
> conversion, your "zero cycle" argument doesn't hold
Yes, it does.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"A woman is an occasional pleasure but a cigar is always a smoke."
-- Groucho Marx
The cases where you can use strings as result buffers are those that
return only one string, and they are irrelevant performance-wise, like
GetComputerName and siblings.
In all other cases that matter, you'll be using the API's datastructures
(ala WIN32_FIND_DATA, etc.), whose content you'll be converting to
regular pascal String.
Eric
What then? A Word file written in Swahili? I don't know about you; maybe I'm
missing the zillions of Delphi applications having to interact with these
kinds of files...
> Anyway, no one prevents you from using UTF8 or Ansi. But your code will
> have to specify these types explicitly and be sure to use the
> corresponding routines.
Now we're talking in circles. Of course I know I can do that and no-one is
preventing me. But I have seen no argument at all so far to warrant the dumb
step to choose UTF-16 as default string type. In other words, it just means
extra work.
Nils
> > I don't think the registry is a very good example of a text based
> > document.
>
> What then? A word-file written in Swahili?
Not many special Unicode characters in the registry, AFAICS. A Japanese
text, or an Arabic one might be a better example. <g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Everybody's worried about stopping terrorism. Well, there's a
really easy way: stop participating in it." -- Noam Chomsky
> The cases where you can use strings as result buffers are those that
> return only one string
Why? What if they return multiple strings in different buffers? I can
still use strings as buffers, and I usually even do that. But anyway,
using UTF-16 throughout makes life a lot easier for those who need
Unicode.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"A camel is a horse designed by a committee" -- Unknown
The fact that you can use Unicode wherever you need it, and that the
VCL, RTL etc. will all be Unicode. No need to (often explicitly)
convert between UTF-8 and UTF-16 when calling an API.
I'd say it simplifies life for those who need Unicode a lot. And I
don't see it making life for the others any harder.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"Comedy is nothing more than tragedy deferred."
-- Pico Iyer, Time
> In all other cases that matter,
> you'll be using the API's datastructures (ala WIN32_FIND_DATA, etc.),
> whose content you'll be converting to regular pascal String.
"Converting" is quite a word. It is a simple copy. I don't see how
UTF-8 would make this any easier.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"You got to be careful if you don't know where you're going,
because you might not get there." -- Yogi Berra
Well, you should have a look at UTF16-UTF8 conversion routines then, you
would understand why they're cheap: the amount of work they're doing is
comparable to a copy (see Pierre's clock cycles per character figures).
Where UTF8 can make it easier on the CPU (faster) is that fewer bytes
are involved in typical texts: fewer bytes to read, fewer bytes to write.
Given that reads/writes to main memory are relatively slow, copying
(without change) a UTF16 string to a UTF16 string (simple copy) can
be slower than copying from UTF16 to UTF8 as soon as your app is a
little memory intensive (because while you'll end up reading the same
amount of bytes, you'll be writing fewer).
As soon as you factor in further usages of the string, the size-induced
speedup can quickly become significant: if you merely compare your
string once, or search for a substring in it once, the reduced byte
count will improve things (less bytes to compare, better cache
utilization, etc.).
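The byte counts behind this bandwidth argument are easy to check in Python (the sample strings are arbitrary):

```python
# UTF-8 uses fewer bytes than UTF-16 for Western text, but more for most
# Asian text (3 bytes per CJK character vs. 2 in UTF-16).
west = 'Hello, world'
asian = '\u65e5\u672c\u8a9e'   # 'Japanese' in Japanese, 3 characters
assert len(west.encode('utf-8')) == 12
assert len(west.encode('utf-16-le')) == 24
assert len(asian.encode('utf-8')) == 9
assert len(asian.encode('utf-16-le')) == 6
```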
Eric
> > "Converting" is quite a word. It is a simple copy. I don't see how
> > UTF-8 would make this any easier.
>
> Well, you should have a look at UTF16-UTF8 conversion routines then,
> you would understand why they're cheap: the amount of work they're
> doing is comparable to a copy (see Pierre's clock cycles per
> character figures).
Well, I don't see why you would want to. Using UTF-16 throughout is a
lot easier.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"In any contest between power and patience, bet on patience."
-- W.B. Prescott
We were talking about performance, the point *you* raised in case you
forgot ;)
As for "easy for codegear" that's irrelevant IMO, the point here is
about CodeGear offering added-value rather than saving bucks by going
for the easiest solution *for them* while causing trouble *for us*, see
initial posts.
Eric
> > Well, I don't see why you would want to. Using UTF-16 throughout is
> > a lot easier.
>
> We were talking about performance, the point you raised in case you
> forgot ;)
>
> As for "easy for codegear"
Easy for the programmer. Simply stop worrying about conversions and use
string as you always did. No need to worry about conversions and
reallocations each time you go from UTF-16 to UTF-8 and back.
Performance is only a concern if it becomes noticeable.
--
Rudy Velthuis [TeamB] http://www.teamb.com
"No Sane man will dance." -- Cicero (106-43 B.C.)
| Performance is only a concern if it becomes noticeable.
One of the tenets of the "good enough" philosophy. ;-)
--
Q <proud hater of "good enough">
06/27/2008 10:19:12
XanaNews Version 1.17.5.7 [Q's Salutation mod]
> Rudy,
>
> > Performance is only a concern if it becomes noticeable.
>
> One of the tenets of the "good enough" philosophy. ;-)
Well, it is merely meant to avoid premature optimization. <g>
--
Rudy Velthuis [TeamB] http://www.teamb.com
"I've just learned about his illness. Let's hope it's nothing
trivial." -- Irvin S. Cobb
| Well, it is merely meant to avoid premature optimization. <g>
<chuckle>
--
Q
06/27/2008 16:52:42