
How to write Ascii+UTF8 stream in Delphi 2008


dk_sz

May 9, 2008, 10:47:44 PM
OK,

I was using a custom string builder to build a huge
XML string in memory, convert it to UTF-8 in memory,
and then save it to disk. But this is actually somewhat
of a problem, since memory usage peaks no matter what
solution I use. So I have been tinkering with the idea
of writing the file directly to disk.

The only things I am going to save that may not be ASCII are
HTTP URLs (e.g. http://www.example.com?query=utf8str).

So I can write everything else as ASCII, and those few bytes
as UTF-8, into the file stream, and I am all done.


But how do I make this efficient and safe for Delphi 2008 as well?


Suppose I have:

const
  CmsNewLine = #13#10;

procedure msStreamWriteStrRaw_Ansi(const AStream: TStream; const AStr: AnsiString);
begin
  // msxBytesLength_Ansi is my own wrapper that returns the byte length of AStr
  AStream.Write(PAnsiChar(Pointer(AStr))^, msxBytesLength_Ansi(AStr));
end;

* Will this make an implicit UTF16String-to-Ansi conversion underneath in D2008?
  msStreamWriteStrRaw_Ansi(AStream, '<xml>');
  // If so, can I somehow tell the compiler to store '<xml>' as Ansi?

* Will this make an implicit UTF16String-to-Ansi conversion underneath in D2008?
  msStreamWriteStrRaw_Ansi(AStream, CmsNewLine);
  // I guess this will be OK since it was declared as the bytes #13#10?

In general I seek a solution that avoids implicit
conversions (hate those!) and that works on both
Delphi 2007 and Delphi 2008, in all locales. That should
be possible, since I only write ASCII, and where I don't,
I will convert to UTF-8 in memory and write that to the stream instead.


best regards
Thomas Schulz


Jaakko Salmenius

May 9, 2008, 11:10:22 PM
UTF-8 is compatible with ASCII: any ASCII string is also a valid UTF-8
string, byte for byte. When you handle ASCII or UTF-8 strings in Delphi code,
use AnsiString or Utf8String. They are the same.

var
  asciiStr: AnsiString;

> AStream.Write(PAnsiChar(Pointer(AStr))^, msxBytesLength_Ansi(AStr));

AStream.Write(AStr[1], Length(AStr));
is equally fast and much cleaner.

AnsiString is the same in Delphi 2007 and Delphi 2008, so your code works
unchanged on both Delphis. There is no implicit conversion if you always
use AnsiString as the variable type.

One thing: what people call an ASCII string is most often not actually
ASCII but Ansi (i.e. a string that uses a code page). An Ansi string is
compatible with ASCII only if it contains nothing but values 0-0x7F.

The best practice in your case would be to handle everything as WideString
(UnicodeString in Delphi 2008) internally and convert it to UTF-8 only
before writing it to the file.

UTF-8 is very good for file storage, but UTF-16 (WideString and
UnicodeString) is much easier for string handling.

Best regards,
Jaakko Salmenius
http://www.sisulizer.com - Three simple steps to localize

Marc Rohloff [TeamB]

May 10, 2008, 7:43:36 AM

CodeGear hasn't said much about the implementation of strings in
Tiburon. The most important thing is probably to structure your code
so that the necessary assumptions are in a small set of functions
which you can rewrite when more details are available.

I would expect that an AnsiString would not necessarily be storing
UTF-8 and that currently when you convert the XML (which is probably
stored as UTF-16 internally) to an AnsiString that you are *not*
getting UTF-8 but a string encoded in your current codepage.

--
Marc Rohloff [TeamB]
marc -at- marc rohloff -dot- com

dk_sz

May 10, 2008, 8:29:27 AM
Hi,


> (which is probably stored as UTF-16 internally)

Actually I currently use the string data type for everything
throughout the program. (Which can be seen as a bright move when
I change to Delphi 2008, hehe.) It isn't really a problem,
since I have learned to tolerate MBCS quite well :-)

> I would expect that an AnsiString would
> not necessarily be storing UTF-8 and that

* UTF-8 is compatible with ASCII
(you could say ASCII is a subset of UTF-8)

* all/most? Ansi code pages (even those storing MBCS?) keep the ASCII range
(I believe there might be a /few/ tiny exceptions)

So the ASCII range should be safe.

And for the rest, file paths, I can convert to UTF-8 from
Ansi/MBCS (and in Delphi 2008 from UTF16String).
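
A minimal sketch of that plan: the ASCII markup goes to the stream as raw AnsiString bytes, and only the non-ASCII part passes through a UTF-8 conversion. The names WriteRaw and WriteUtf8 and the sample path are made up for illustration. UTF8Encode is the RTL routine, and taking a WideString as input is an assumption (an Ansi/MBCS string passed to it would first be widened using the default code page).

```pascal
program MixedAsciiUtf8;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Writes the bytes of S unchanged; nothing is written for an empty string,
// because S[1] is not a valid reference when S is empty.
procedure WriteRaw(AStream: TStream; const S: AnsiString);
begin
  if Length(S) > 0 then
    AStream.WriteBuffer(S[1], Length(S));
end;

// Converts W to UTF-8 and writes the encoded bytes.
procedure WriteUtf8(AStream: TStream; const W: WideString);
var
  U: UTF8String;
begin
  U := UTF8Encode(W);
  if Length(U) > 0 then
    AStream.WriteBuffer(U[1], Length(U));
end;

var
  Stream: TMemoryStream;
  OpenTag, CloseTag: AnsiString;
  Path: WideString;
begin
  OpenTag := '<path>';
  CloseTag := '</path>'#13#10;

  // Hypothetical sample value with one non-ASCII character (U+00E9),
  // built from its code point so this source file stays pure ASCII.
  Path := 'C:\data\caf';
  Path := Path + WideChar($00E9) + '.xml';

  Stream := TMemoryStream.Create;
  try
    WriteRaw(Stream, OpenTag);
    WriteUtf8(Stream, Path);
    WriteRaw(Stream, CloseTag);
    Writeln('Bytes written: ', Stream.Size);
  finally
    Stream.Free;
  end;
end.
```

Swapping the TMemoryStream for a TFileStream keeps the peak memory down to one small string at a time, which is the point of writing directly to disk.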


best regards
Thomas Schulz


Jaakko Salmenius

May 10, 2008, 10:41:24 PM
> * all/most? ansi (even those storing MBCS?) maintain ASCII range
> (I believe there might be a /few/ tiny exceptions)

All Ansi strings (even Asian) and also UTF-8 strings have the ASCII
characters in 0-0x7F.

If you have only ASCII then all Ansi strings and UTF-8 strings are the same.

Jaakko Salmenius

May 10, 2008, 11:03:45 PM
A good resource on character encodings can be found here:
http://www.cs.tut.fi/~jkorpela/chars/index.html

Jaakko Salmenius

May 10, 2008, 10:56:44 PM
The XML standard allows almost any character encoding, but UTF-8 is
the recommended one. The reason is that it is Unicode and it has no
endianness (byte order) issue. This makes it an ideal data storage and
transfer format.

However, it is not that well suited for string processing. It is always best
to use the native string encoding of the OS/compiler in your code.
WideString (and UnicodeString in the next Delphi) use this encoding. Both
are UTF-16, which in most cases (except some rare Chinese characters) needs
exactly two bytes per character.

The best string processing format would be UTF-32 (UCS-4), where one
character _always_ takes four bytes (even rare Chinese characters).
Unfortunately neither Delphi nor Windows supports this encoding. The main
reason is obvious: UTF-32 wastes 75% of the space when using only ASCII strings.

Some people use UTF-16 as a file format. The reason is that if you have lots
of Asian characters, UTF-16 takes 2 bytes per character and UTF-8 takes 3 bytes.
However, I consider this a marginal benefit. UTF-16 always has two ways
to encode (big and little endian). This brings extra complexity, and it might
be that the platform using the file does not support your endian type. If
your format is binary or your text file is only used on the same operating
system (e.g. Windows), then UTF-16 is also a good choice.
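
The size trade-off is easy to check with a few lines. This is only a sketch: ShowSizes is a hypothetical helper, the two sample strings are arbitrary, and UTF8Encode is the RTL conversion routine.

```pascal
program EncodingSizes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Prints how many bytes W occupies as UTF-16 and as UTF-8.
procedure ShowSizes(const ALabel: AnsiString; const W: WideString);
begin
  Writeln(ALabel, ': UTF-16 = ', Length(W) * SizeOf(WideChar),
    ' bytes, UTF-8 = ', Length(UTF8Encode(W)), ' bytes');
end;

var
  Ascii, Cjk: WideString;
begin
  // Three ASCII characters.
  Ascii := 'xml';

  // Two CJK characters (U+65E5, U+672C), built from their code points
  // so this source file stays pure ASCII.
  Cjk := WideChar($65E5);
  Cjk := Cjk + WideChar($672C);

  ShowSizes('ASCII', Ascii);
  ShowSizes('CJK', Cjk);
end.
```

For these inputs the arithmetic works out to 6 bytes of UTF-16 against 3 bytes of UTF-8 for the ASCII text, and 4 bytes of UTF-16 against 6 bytes of UTF-8 for the two CJK characters. UTF-32 would need 12 and 8 bytes respectively.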

Jaakko Salmenius

May 11, 2008, 9:49:40 AM
My original reply two days ago was cancelled by Marc Rohloff so here is a
modified version.

UTF-8 is compatible with ASCII: any ASCII string is also a valid UTF-8
string, byte for byte. When you handle ASCII or UTF-8 strings in Delphi code,
use AnsiString or Utf8String. They are the same.

> AStream.Write(PAnsiChar(Pointer(AStr))^, msxBytesLength_Ansi(AStr));

AStream.Write(AStr[1], Length(AStr));
is equally fast and much cleaner.

AnsiString will probably be the same in future Delphis, so your code works
unchanged on all Delphis. There is no implicit conversion if you always use
AnsiString as the variable type.

One thing: what people call an ASCII string is most often not actually
ASCII but Ansi (i.e. a string that uses a code page). An Ansi string is
compatible with ASCII only if it contains nothing but values 0-0x7F.

The best practice in your case would be to handle everything as WideString
internally and convert it to UTF-8 only before writing it to the file.

UTF-8 is very good for file storage, but UTF-16 (WideString) is much easier
for string handling.

Best regards,

dk_sz

May 11, 2008, 2:03:04 PM
Hi,

> unchanged on all Delphis. There is no implicit conversion if you always
> use AnsiString as the variable type.

I asked about these two cases:

msStreamWriteStrRaw_Ansi(Astream, '<xml>');
msStreamWriteStrRaw_Ansi(Astream, #13#10);

As you see, they aren't declared as AnsiString anywhere.

> AStream.Write(AStr[1], Length(AStr));
> is equally fast and much cleaner.

I have my own wrapper function for
length as CodeGear uses Length for:
* character + bytes count
* different string types.

Using [1] is not entirely great if you
might want to convert to Chrome later.

However, regarding speed, you may be right.
Using PChar(AStr)^ is much slower than my code,
but I haven't checked the code generated by yours.

Anyways, I have mine wrapped up in:

function msxCastAnsiStrAsPAnsiChar(const S: AnsiString): PAnsiChar;
begin
  Result := PAnsiChar(Pointer(S));
end;

What is cleaner is a matter of taste I think :-)
Most may vote for yours, but I have tried
hard to insulate myself from Delphi, platform
and tool changes if I ever need to port.

> The best practice in your case would be to handle everything as WideString
> internally and convert it to UTF-8 only before writing it to the file.

That is an absolute no-go in my situation. Memory peaks
too high for any in-memory approach with files of the size I
handle. Luckily, as I wrote, most of my stuff is ASCII.
So in my specific case it makes good sense to write the files as
ASCII (and the necessary UTF-8 parts as UTF-8) to the file stream.


> One thing: what people call an ASCII string is most often not actually
> ASCII but Ansi (i.e. a string that uses a code page). An Ansi string is
> compatible with ASCII only if it contains nothing but values 0-0x7F.

I really mean ascii when I say ascii :-)


best regards
Thomas Schulz


Jaakko Salmenius

May 11, 2008, 5:10:30 PM
OK, I got your point. You are writing the file on the fly without first
building it in memory. I did not read your original email well enough.

> * Will this make an implicit UTF16String-to-Ansi conversion underneath in D2008?
>   msStreamWriteStrRaw_Ansi(AStream, '<xml>');
>   // If so, can I somehow tell the compiler to store '<xml>' as Ansi?
>
> * Will this make an implicit UTF16String-to-Ansi conversion underneath in D2008?
>   msStreamWriteStrRaw_Ansi(AStream, CmsNewLine);
>   // I guess this will be OK since it was declared as the bytes #13#10?

I don't know, but if it works as Delphi 7 does now, there will be an implicit
conversion. Suppose we have the following lines in Delphi 7:

var
  str, str1, str2: WideString;
...
str1 := 'This is a sample';
str2 := 'This too';
str := str1 + #13#10 + str2;
>>> Here both str1 and str2 are first converted to AnsiString and the three
>>> strings are concatenated. Finally the whole result is converted to
>>> WideString. All implicit conversions are done using the default code page.
>>> Dangerous.

To solve this:
str := str1 + WideString(#13#10) + str2;

You can check whether there is an implicit conversion by compiling your
application using the Debug DCUs, putting a breakpoint on the string handling
line, and then stepping in...

I am sure there will be a way to mark a string constant as Ansi, or the
compiler will be smart enough not to make an implicit conversion. The Delphi 7
compiler was not smart enough :-)
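
One candidate that already compiles in Delphi 2007 is a typed constant, so the literal is declared with the same type as the parameter it is passed to. The sketch below is only an illustration under that assumption: WriteAnsi is a hypothetical stand-in for msStreamWriteStrRaw_Ansi and CmsXmlOpen is a made-up name. Whether the next Delphi handles typed AnsiString constants the same way is not confirmed here, so the Debug DCU check is still the way to verify it.

```pascal
program AnsiTypedConstants;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // Declared as AnsiString, matching the parameter type of WriteAnsi.
  CmsXmlOpen: AnsiString = '<xml>';
  CmsNewLine: AnsiString = #13#10;

// Hypothetical stand-in for msStreamWriteStrRaw_Ansi from the first post.
procedure WriteAnsi(AStream: TStream; const S: AnsiString);
begin
  if Length(S) > 0 then
    AStream.WriteBuffer(S[1], Length(S));
end;

var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    WriteAnsi(Stream, CmsXmlOpen);
    WriteAnsi(Stream, CmsNewLine);
    // One byte per character is expected here: five for '<xml>', two for CR LF.
    Writeln('Bytes written: ', Stream.Size);
  finally
    Stream.Free;
  end;
end.
```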

Jaakko

dk_sz

May 13, 2008, 9:26:35 AM
Hi Jaakko,


> To solve this
> str := str1 + WideString(#13#10) + str2;
>
> You can check whether there is an implicit conversion by compiling your
> application using the Debug DCUs, putting a breakpoint on the string
> handling line, and then stepping in...

> I am sure there will be a way to mark string constant as Ansi or the


Thanks for the information, I will try experimenting a little :-)


best regards
Thomas Schulz

