
Unicode and UTF-8 / UTF-16


archana
Jun 29, 2006, 2:32:18 AM
Hi all,

Can someone tell me the difference between Unicode and UTF-8 or UTF-18,
and which one supports a larger character set?

Which one should I use to support the UCS-2 character set?

I want to use UCS-2 characters with StreamReader and StreamWriter.

How are Unicode and UTF characters stored?

Please help me.

Thanks in advance.

Jon Skeet [C# MVP]
Jun 29, 2006, 3:26:13 AM
archana wrote:
> Can someone tell me the difference between Unicode and UTF-8 or UTF-18,
> and which one supports a larger character set?
>
> Which one should I use to support the UCS-2 character set?
>
> I want to use UCS-2 characters with StreamReader and StreamWriter.
>
> How are Unicode and UTF characters stored?

See http://www.pobox.com/~skeet/csharp/unicode.html

I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.
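
As a minimal sketch of that last point (the file name here is just a
placeholder), Encoding.Unicode gives you UTF-16 with StreamWriter and
StreamReader:

    using System;
    using System.IO;
    using System.Text;

    class Utf16FileDemo
    {
        static void Main()
        {
            // Encoding.Unicode is .NET's little-endian UTF-16 encoding.
            using (StreamWriter writer =
                new StreamWriter("sample.txt", false, Encoding.Unicode))
            {
                writer.WriteLine("h\u00E9llo, w\u00F6rld");
            }

            using (StreamReader reader =
                new StreamReader("sample.txt", Encoding.Unicode))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }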

Jon

Göran Andersson
Jun 29, 2006, 3:42:35 AM
Unicode is a character set, just like UCS.

UTF-8 and UTF-16 are UCS Transformation Formats. As Unicode and UCS are
effectively synonymous, UTF-8 and UTF-16 are both used to encode Unicode
strings.

In UTF-16, characters are encoded as 16-bit code units (two bytes each).
UTF-16 and UCS-2 are identical for all characters that UCS-2 handles, so
you can treat UCS-2 data as UTF-16 without any problems.

In UTF-8, the most common (ASCII) characters are encoded as a single
byte; other characters take two to four bytes (three is typical for the
rest of the BMP).
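
A quick way to see those byte lengths for yourself (a small sketch; the
sample characters are arbitrary):

    using System;
    using System.Text;

    class EncodingSizes
    {
        static void Main()
        {
            // 'A' is ASCII, '\u00E9' (é) is Latin-1, '\u20AC' (€) is
            // further into the BMP.
            string[] samples = { "A", "\u00E9", "\u20AC" };
            foreach (string s in samples)
            {
                Console.WriteLine("U+{0:X4}: UTF-8 = {1} byte(s), UTF-16 = {2} bytes",
                    (int)s[0],
                    Encoding.UTF8.GetBytes(s).Length,      // 1, 2, 3
                    Encoding.Unicode.GetBytes(s).Length);  // 2, 2, 2
            }
        }
    }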

As the character type in .NET is a 16-bit Unicode character, it
corresponds to the UCS BMP (Basic Multilingual Plane) that UCS-2 handles.

In conclusion, in .NET the Unicode and UCS BMP character sets are the
same, and UCS-2 and UTF-16 are the same.

There is no encoding in UCS that corresponds to UTF-8. If you export
data to something that only handles UCS encodings, you have to use UTF-16.

Göran Andersson
Jun 29, 2006, 4:39:47 PM

From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode code points up to U+10FFFF, while
UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
Multilingual Plane).

As the Char datatype in .NET is a 16-bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.

Barry Kelly
Jun 29, 2006, 7:25:11 PM
Göran Andersson <gu...@guffa.com> wrote:

> From what I can gather, the only difference is that UTF-16 is capable
> of encoding the full range of Unicode code points up to U+10FFFF, while
> UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
> Multilingual Plane).
>
> As the Char datatype in .NET is a 16-bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.

.NET chars have surrogate pair forms (check out Char.IsHighSurrogate()
and Char.IsLowSurrogate()): two char values combine to form a single
abstract character. Thus, the number of char values (UTF-16 code units)
in a .NET string may be greater than the number of actual, abstract
characters.
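
For instance (a small sketch; U+1D11E, the musical G clef, is just one
example of a character outside the BMP):

    using System;

    class SurrogateDemo
    {
        static void Main()
        {
            // U+1D11E lies above the BMP, so it occupies two chars:
            // a high surrogate followed by a low surrogate.
            string s = "\uD834\uDD1E";

            Console.WriteLine(s.Length);                    // 2 code units, 1 character
            Console.WriteLine(Char.IsHighSurrogate(s[0]));  // True
            Console.WriteLine(Char.IsLowSurrogate(s[1]));   // True
            Console.WriteLine(Char.ConvertToUtf32(s, 0));   // 119070 (0x1D11E)
        }
    }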

-- Barry

--
http://barrkel.blogspot.com/

Mihai N.
Jun 29, 2006, 11:41:37 PM
> As the Char datatype in .NET is a 16-bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.

No. UTF-16 is a superset of UCS-2, and .NET is UTF-16, not UCS-2.

Short example:
You decide initially that 10 digits are enough to encode a certain
character set, so you have:
0 1 2 3 4 5 6 7 8 9

Later on, you discover this is not true, and you need a way to represent
more. But some areas of your encoding are not yet allocated, so you can
reuse them:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range "high surrogates" and the 5-7 range "low
surrogates".

Then you can represent single code units:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
and you have a way to map them: 25 => 10, 26 => 11, ... 47 => 18.

So you end up being able to represent 13 values!

Covered range:
10 + HighSurrogates x LowSurrogates = 10 + 3 x 3 = 19
Number of useful codes (you cannot use the surrogates themselves):
10 + HighSurrogates x LowSurrogates - HighSurrogates - LowSurrogates
= 19 - 3 - 3 = 13 characters that you can now encode.


Now, for Unicode before the surrogate introduction you had:
0000-FFFF
But when it proved that more than FFFF code points were needed, the
mechanism described above was created (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates

So what you can represent with single code units is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the stuff above the BMP with one high and one low surrogate:
D800 DC00, D800 DC01, ..., D800 DFFF
D801 DC00, D801 DC01, ..., D801 DFFF
...
DBFF DC00, DBFF DC01, ..., DBFF DFFF

Covered range:
FFFF + (DBFF - D800 + 1) x (DFFF - DC00 + 1) = FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?

Number of code points available for encoding:
FFFF + 0400 x 0400 - 0400 - 0400 = 10FFFF - 0400 - 0400 = 10F7FF =
1112063 (decimal)
If you read http://www.unicode.org/book/uc20ch1.html you will find that
"more than 1 million characters can be encoded".
Well, the 1112063 value is the "technically possible" value; you should
still exclude reserved areas, private use areas and others.
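
The same surrogate arithmetic in code (a sketch; U+10400 is an arbitrary
supplementary code point, and Char.ConvertFromUtf32 performs the
identical decomposition):

    using System;

    class SurrogateMath
    {
        static void Main()
        {
            int codePoint = 0x10400;      // any code point above the BMP
            int v = codePoint - 0x10000;  // 20-bit offset into the supplementary range

            char high = (char)(0xD800 + (v >> 10));    // top 10 bits -> high surrogate
            char low  = (char)(0xDC00 + (v & 0x3FF));  // bottom 10 bits -> low surrogate

            Console.WriteLine("{0:X4} {1:X4}", (int)high, (int)low);  // D801 DC00

            // The framework's own conversion agrees:
            string s = Char.ConvertFromUtf32(codePoint);
            Console.WriteLine(s[0] == high && s[1] == low);           // True
        }
    }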


Anyway, long story short: UCS-2 = before the UTF-8/surrogates mechanism
was introduced.
When an application is surrogate aware, you can say it is UTF-16.
If it is not surrogate aware, then it is probably UCS-2.

And .NET is UTF-16.
=================================================
To answer the original questions:

> Can someone tell me the difference between Unicode and UTF-8 or UTF-18,
> and which one supports a larger character set?

There is no UTF-18; you mean UTF-16.
Unicode is a "coded character set": basically a mapping between
characters and numbers (A=0x41, B=0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping, and
there is no coverage difference between them.

You can compare it (in a way) to different number bases:
if you say A=0x41, B=0x42 in hex,
or A=65, B=66 in decimal,
or A=0101, B=0102 in octal,
it is the same thing.
So your UTF-8 vs. UTF-16 question is a bit like asking "hex or decimal,
which one can represent more numbers?" Answer: they are the same.


See the official standard here:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a#96f19a02


> Which one should I use to support the UCS-2 character set?
> I want to use UCS-2 characters with StreamReader and StreamWriter.

Use UTF-16. It is a superset of UCS-2 and is the one supported by the
whole .NET API.


> How are Unicode and UTF characters stored?

The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522


--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Mihai N.
Jul 1, 2006, 4:24:30 AM
> Anyway, long story short: UCS-2 = before the UTF-8/surrogates mechanism
> was introduced.

Correction: UCS-2 = before the UTF-16/surrogates mechanism was introduced.