What is the difference between AnsiString and UTF8String?

Emmanuel

Jan 8, 2007, 2:06:48 PM

Igor Siticov

Jan 10, 2007, 3:45:38 AM

UTF8String is an AnsiString, but all characters with codes above 127 are
encoded as multi-byte sequences. So if you just display it somewhere, you
will see garbage characters in their place.
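
As a rough illustration of the garbage described above, here is a minimal
sketch. It uses Python rather than Delphi purely for demonstration, and it
assumes the display side interprets the bytes as code page 1252:

```python
# Illustration only: Python stands in for Delphi here.
text = "\u00e4"                               # the letter a-umlaut, code above 127
utf8_bytes = text.encode("utf-8")             # two bytes: C3 A4
shown_as_ansi = utf8_bytes.decode("cp1252")   # bytes read as code page 1252

# Plain ASCII text is unaffected because UTF-8 leaves codes 0..127 unchanged.
ascii_round_trip = "abc".encode("utf-8").decode("cp1252")

print(utf8_bytes)        # b'\xc3\xa4'
print(shown_as_ansi)     # two garbage characters instead of one letter
print(ascii_round_trip)  # abc
```

The single letter comes out as two unrelated characters, while the ASCII
text survives unchanged.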

--
Best regards.
TsiLang Components Suite - Best Globalization Tool 2004
http://www.tsilang.com
"Emmanuel" <emma...@erphk.com> wrote in message
news:45a2...@newsgroups.borland.com...

Jaakko Salmenius

Jan 11, 2007, 4:26:13 AM

What do you mean encoded? As far as I see they are both encoded.

UTF8String contains a UTF-8 encoded Unicode string. Each character is 1, 2,
3, or 4 bytes long.
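
A quick way to see the 1 to 4 byte range is to encode a few sample
characters. This is a sketch in Python (not Delphi), and the sample
characters are my own choice:

```python
# Sample characters chosen for illustration: A, a-umlaut, euro sign, an emoji.
chars = ["A", "\u00e4", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = [len(c.encode("utf-8")) for c in chars]

print(utf8_lengths)  # [1, 2, 3, 4]
```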

AnsiString contains a code page encoded string. Depending on the code page,
each character is either always 1 byte long, or 1 or 2 bytes long. Asian
code pages use multi-byte encodings where each character is either 1 byte or
2 bytes. All Windows code pages have identical first 128 characters. If the
most significant byte of an AnsiString is 1 then the byte is an extended byte
that is either a single character (such as ä or ö in code page 1252) or needs
the following byte to describe the character.
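
The single-byte versus double-byte behaviour can be sketched like this. It
is Python, not Delphi, and `has_high_bit` is a hypothetical helper written
for this example; it tests the top bit of one byte:

```python
def has_high_bit(b: int) -> bool:
    """Return True when the top bit (0x80) of a byte value is set."""
    return (b & 0x80) != 0

a_umlaut_1252 = "\u00e4".encode("cp1252")  # one extended byte: E4
kanji_932 = "\u65e5".encode("cp932")       # one kanji, two bytes: 93 FA
ascii_1252 = "A".encode("cp1252")          # 41
ascii_932 = "A".encode("cp932")            # 41, same as in code page 1252

print(len(a_umlaut_1252), has_high_bit(a_umlaut_1252[0]))  # 1 True
print(len(kanji_932), has_high_bit(kanji_932[0]))          # 2 True
print(ascii_1252 == ascii_932, has_high_bit(ascii_932[0])) # True False
```

In both code pages the extended byte has its top bit set; only the code
page tells you whether that byte stands alone or needs a second byte.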

Also, in an AnsiString you will see garbage if the system code page does not
match the code page used in the string, for example when a Japanese (code
page 932) string is used on English Windows.
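
The mismatch can be reproduced with a short sketch. It is Python rather than
Delphi, and it assumes the English system uses code page 1252:

```python
japanese = "\u65e5\u672c"         # two kanji (the word for "Japan")
raw = japanese.encode("cp932")    # four bytes: 93 FA 96 7B

# What an English (code page 1252) system would make of the same bytes.
garbled = raw.decode("cp1252")

# Decoding with the code page the string was written in restores the text.
restored = raw.decode("cp932")

print(raw)                   # b'\x93\xfa\x96{'
print(len(garbled))          # 4 unrelated characters instead of 2 kanji
print(restored == japanese)  # True
```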

Both UTF8String and AnsiString must be read byte by byte to get the real
meaning.

Best regards,
Jaakko Salmenius
www.sisulizer.com

Igor Siticov

Jan 15, 2007, 12:25:23 PM

Hi,

> What do you mean encoded? As far as I see they are both encoded.
>

IMHO!
Utf8String is encoded, but AnsiString is an "as is" string. AnsiString is a
single-byte string and will be displayed properly only when the proper code
page is used. MBCS strings are multi-byte, and everything you wrote below
applies to them. But AnsiString is a "plain" single-byte string, and this is
why it will look different under different code pages: there is no code page
info in single-byte data, just the char itself (for some Asian code pages
2 bytes are needed to decode one single char, but there is still no code
page info in AnsiString). I think you mixed up byte and bit in your post.
The most significant BIT describes the character "range".

Jaakko Salmenius

Jan 15, 2007, 10:30:31 PM

Multi-byte strings are also AnsiStrings. Most major Asian countries,
including Japan, China, Taiwan and Korea, use multi-byte encodings. Shift
JIS/932 is a multi-byte encoding. Delphi's AnsiString can contain these
strings just like English, Finnish and Russian.

My point is that whenever there is an AnsiString in Delphi you need to know
what is the code page that the string uses. Otherwise you might end up
interpreting the string incorrectly.
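
To make the point concrete, here is a small sketch (Python, not Delphi)
where the very same byte reads as a Finnish letter under code page 1252 and
as a Russian letter under code page 1251. The byte value is my own example:

```python
raw = b"\xe4"  # one byte, with no code page information attached

as_western = raw.decode("cp1252")   # Latin small letter a with diaeresis
as_cyrillic = raw.decode("cp1251")  # Cyrillic small letter de

print(as_western == "\u00e4")   # True
print(as_cyrillic == "\u0434")  # True
print(as_western == as_cyrillic)  # False
```

Nothing in the byte itself says which reading is correct, so the code page
has to be known from somewhere else.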

You were right about byte and bit. I meant bit but wrote byte.

Igor Siticov

Jan 16, 2007, 4:53:24 AM

Hi,

> My point is that whenever there is an AnsiString in Delphi you need to
> know what is the code page that the string uses. Otherwise you might end
> up interpreting the string incorrectly.

Yes, this is absolutely right.

>
> You were right about byte and bit. I meant bit but wrote byte.
>

OK, no problem, it was just confusing a bit. :)