Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
which one supports the larger character set?
Which one should I use to support UCS-2 characters?
I want to use UCS-2 characters with StreamReader and StreamWriter.
How are Unicode and UTF characters stored?
Please help me.
Thanks in advance.
See http://www.pobox.com/~skeet/csharp/unicode.html
I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.
Jon
UTF-8 and UTF-16 are UCS Transformation Formats. As Unicode and UCS are
effectively synonymous, UTF-8 and UTF-16 are used to encode Unicode strings.
In UTF-16, characters in the 16-bit range are encoded as one 16-bit code
unit (two bytes); characters beyond that range take two code units (a
surrogate pair, four bytes).
UTF-16 and UCS-2 are identical for all characters that UCS-2 handles.
You can treat UCS-2 data as UTF-16 without any problems.
In UTF-8, ASCII characters are encoded as 8-bit sequences (one byte).
All other characters take two to four bytes; anything in the 16-bit range
needs at most three.
As the character type in .NET is a 16-bit Unicode character, a single Char is
synonymous with the UCS BMP (Basic Multilingual Plane) that UCS-2 handles.
In conclusion, for a single .NET Char the Unicode and UCS BMP character sets
are the same, and UCS-2 and UTF-16 coincide.
There is no fixed-width UCS encoding that corresponds to UTF-8. If you export
data to something that only handles UCS-2, you have to use UTF-16.
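To make the sizes concrete, here is a small sketch that simply asks an encoder how many bytes each character takes. It uses Python rather than .NET purely for illustration; the sample characters and the helper name `encoded_sizes` are my own choices, not something from the posts above.

```python
# Sample characters, written as escapes so the source file stays ASCII.
samples = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, just above ASCII",
    "\u20ac": "U+20AC, elsewhere in the BMP",
    "\U0001F600": "U+1F600, outside the BMP",
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character.

    'utf-16-le' is used so that no byte order mark is added to the count.
    """
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, note in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{note}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The only character here that needs four bytes in UTF-16 is the one outside the BMP, which is exactly the case UCS-2 cannot represent.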
From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode characters (up to U+10FFFF, using
surrogate pairs), while UCS-2 only handles the 16 bit range specified as
the UCS BMP (Basic Multilingual Plane).
As the Char datatype in .NET is a 16 bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.
> From what I can gather, the only difference is that UTF-16 is capable
> of encoding the full range of Unicode characters (up to U+10FFFF, using
> surrogate pairs), while UCS-2 only handles the 16 bit range specified as
> the UCS BMP (Basic Multilingual Plane).
>
> As the Char datatype in .NET is a 16 bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.
.NET chars have surrogate pair forms (check out Char.IsHighSurrogate()
and Char.IsLowSurrogate()), where two Char values combine to form a single
abstract character. Thus, the number of Char values in a .NET
string may be greater than the number of actual, abstract characters.
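A rough illustration of that difference, sketched in Python rather than .NET: Python strings count code points, so splitting the UTF-16 bytes into 16-bit units shows what a .NET string would hold. The helper names below are made up for this sketch; the range checks mirror what Char.IsHighSurrogate() and Char.IsLowSurrogate() test for.

```python
def utf16_code_units(text):
    """Split text into its 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_high_surrogate(unit):
    """True for code units in D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for code units in DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


text = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF
units = utf16_code_units(text)

print(len(text))                   # number of code points
print(len(units))                  # number of UTF-16 code units
print([f"{u:04X}" for u in units])
```

The string has two code points but three code units, because U+1D11E is stored as the pair D834 DD1E.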
-- Barry
Short example
You decide initially that 10 digits is enough to encode a certain character
set.
So you can have
0 1 2 3 4 5 6 7 8 9
Later on, you discover this is not true, and you need a way to represent
more. But you have some areas that are not allocated yet in your encoding, so
you can reuse that:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range "high surrogate" and the 5-7 "low surrogate"
Then you can represent stuff like this:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
And you have a way to map 25 => 10, 26=>11, ... 47=>18
So you end up being able to represent 13 values!
The covered range is 10 + HighSurrogates * LowSurrogates =
= 10 + 3 * 3 = 19 values (0 to 18)
And the number of useful codes for encoding (you cannot use surrogates):
= 10 + HighSurrogates * LowSurrogates - HighSurrogates - LowSurrogates
= 19 - 3 - 3 = 13 = number of characters that you can now encode
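If it helps, the toy scheme can be written out as a few lines of Python. This is only a sketch of the digit example above; `toy_encode`, `toy_decode` and the tuples are names invented for the illustration.

```python
HIGH = (2, 3, 4)  # digits reserved as "high surrogates"
LOW = (5, 6, 7)   # digits reserved as "low surrogates"


def toy_encode(value):
    """Encode a value as one digit, or as a (high, low) digit pair for 10..18."""
    if 0 <= value <= 9:
        if value in HIGH or value in LOW:
            raise ValueError("surrogate digits are not real characters")
        return (value,)
    if 10 <= value <= 18:
        offset = value - 10
        return (HIGH[offset // 3], LOW[offset % 3])
    raise ValueError("outside the covered range 0..18")


def toy_decode(units):
    """Inverse of toy_encode for a valid one-digit or two-digit sequence."""
    if len(units) == 1:
        return units[0]
    high, low = units
    return 10 + HIGH.index(high) * 3 + LOW.index(low)


# Collect every value in the covered range that can actually be encoded.
encodable = []
for value in range(19):
    try:
        toy_encode(value)
    except ValueError:
        continue
    encodable.append(value)

print(encodable)
print(len(encodable))
```

The covered range has 19 values, but the six reserved digits cannot stand for characters themselves, which leaves 13.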
Now, for Unicode, before surrogates were introduced you had
0000 - FFFF
But when it turned out that more than FFFF code points were needed,
the mechanism described above was created (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates
So what you can represent is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the stuff above the BMP with one high and one low surrogate:
D800 DC00  D800 DC01 .... D800 DFFF
D801 DC00  D801 DC01 .... D801 DFFF
...
DBFF DC00  DBFF DC01 .... DBFF DFFF
Covered range:
FFFF + ( DBFF - D800 + 1 ) x (DFFF - DC00 + 1 ) =
FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?
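The same mapping at full scale, as a sketch in Python. The arithmetic is the standard UTF-16 surrogate calculation; the function names are just labels I picked for the example.

```python
def to_surrogate_pair(code_point):
    """Map a code point above the BMP to its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 10000..10FFFF use a surrogate pair")
    offset = code_point - 0x10000    # 20-bit value, 00000..FFFFF
    high = 0xD800 + (offset >> 10)   # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Inverse mapping: combine a surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


for cp in (0x10000, 0x1F600, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```

The first pair (D800 DC00) lands on 10000 and the last pair (DBFF DFFF) lands on 10FFFF, which is where the upper limit comes from.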
Number of code points available for encoding (counting 0000 as well, so
the 0000 - FFFF range holds 10000 values):
10000 + 0400 x 0400 - 0400 - 0400 = 110000 - 0800 = 10F800 =
1112064 (decimal)
If you read http://www.unicode.org/book/uc20ch1.html you will find
that "more than 1 million characters can be encoded"
Well, the 1112064 value is the "technically possible" value, but you should
exclude reserved areas, private use areas and others.
Anyway, long story short: UCS-2 = before the UTF-16/surrogate mechanism was
introduced.
When an application is surrogate aware, you can say it is UTF-16.
If it is not surrogate aware, then it is probably UCS-2.
And .NET is UTF-16.
=================================================
To answer the original questions:
> can someone tell me difference between unicode and utf 8 or utf 18 and
> which one is supporting more character set.
There is no UTF-18; it is UTF-16.
Unicode is a "coded character set", basically mapping characters to numbers
(A=0x41, B=0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping,
and there is no coverage difference.
You can compare it (in a way) with the various bases of numeration systems.
If you say A=0x41, B=0x42 in hex
or if you say A=65, B=66 in decimal
or if you say A=0101, B=0102 in octal
it is the same thing.
So your UTF-8 vs. UTF-16 question is a bit like asking "hex or decimal, which
one can represent more numbers?" Answer: they are the same.
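A quick way to convince yourself, sketched in Python (the helper name `same_code_points` and the sample text are mine): encode the same text both ways, check that the bytes differ, and check that both decode back to identical code points.

```python
def same_code_points(text):
    """True if UTF-8 and UTF-16 both round-trip text to the same code points."""
    via_utf8 = text.encode("utf-8").decode("utf-8")
    via_utf16 = text.encode("utf-16-le").decode("utf-16-le")
    return via_utf8 == via_utf16 == text


# ASCII, a two-byte UTF-8 character, a three-byte one, one outside the BMP,
# and the very last code point.
text = "A\u00e9\u20ac\U0001F600\U0010FFFF"

print(text.encode("utf-8").hex())
print(text.encode("utf-16-le").hex())
print([f"U+{ord(ch):04X}" for ch in text])
print(same_code_points(text))
```

The byte sequences look nothing alike, yet the list of code points that comes back is the same, just as 0x41 and 65 are the same number.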
See some official standard here:
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a#96f19a02
> whic i should use to support character ucs-2.
> I want to use ucs-2 character in streamreader and streamwriter.
Use UTF-16. It is a superset of UCS-2 and is the one supported by all the .NET
APIs.
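As a rough stand-in for the StreamWriter/StreamReader round trip, here is the same idea in Python, where the encoding is named when the file is opened. This is not .NET code; the file name and sample text are placeholders, and Python's "utf-16" codec is assumed to play the role of a UTF-16 encoding with a byte order mark.

```python
import os
import tempfile

# One character from the UCS-2 range and one beyond it.
text = "UCS-2 range: \u20ac, beyond it: \U0001F600"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write the text as UTF-16; the codec puts a byte order mark at the start.
with open(path, "w", encoding="utf-16") as writer:
    writer.write(text)

# Read it back with the same encoding; the byte order mark is consumed.
with open(path, "r", encoding="utf-16") as reader:
    round_tripped = reader.read()

# Look at the first two raw bytes to see the byte order mark itself.
with open(path, "rb") as raw:
    bom = raw.read(2)

print(round_tripped == text)
print(bom.hex())
```

Data that is pure UCS-2 goes through unchanged, and characters beyond it survive as well, which is the practical reason to pick UTF-16 here.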
> How unicode and utf chacters are stored.
The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email