UCS-2 and UTF-8 conversion

George

Aug 1, 2007, 2:24:02 AM
Hello everyone,


I am writing a pure C/C++ program to convert a UCS-2 string to UTF-8.
I cannot find enough information on Google about the mapping table (or
formula) between UCS-2 and UTF-8.

I want to develop the program using pure bit operations (&, | and
shifting), and I do not want to invoke any OS-specific APIs.

Are there any reference samples or mapping tables (formulas) between UCS-2 and UTF-8?


Thanks in advance,
George

Ulrich Eckhardt

Aug 1, 2007, 3:15:41 AM
George wrote:
> I am writing a pure C/C++ program to convert a UCS-2 string to UTF-8.
> I cannot find enough information on Google about the mapping table (or
> formula) between UCS-2 and UTF-8.

http://www.unicode.org - it's not as simple as a mapping table though.

Are you sure you mean UCS-2 and not UTF-16, btw?

> Are there any reference samples or mapping tables (formulas) between
> UCS-2 and UTF-8?

There are thousands of open source programs out there, e.g. iconv or yudit,
both of which can do this.

Uli

George

Aug 1, 2007, 3:34:06 AM
Thanks Uli,


I think UCS-2 should be the same as UTF-16, right? Any differences?


regards,
George

Ulrich Eckhardt

Aug 1, 2007, 5:15:51 AM
No, they are not the same, and yes, there are differences; see the link.

Uli

George

Aug 1, 2007, 5:48:03 AM
Sorry Uli,


I cannot open unicode.org. Could you please post the relevant content?
:-)


regards,
George

Ulrich Eckhardt

Aug 1, 2007, 6:16:25 AM
George wrote:
> I cannot open unicode.org. Could you please post the relevant content?
> :-)

It's a bit too big for a Usenet posting. Also, I'm too lazy to extract
everything that might be relevant to your case.

Uli

Kim Gräsman

Aug 1, 2007, 7:12:13 AM
Hi George,

> Sorry Uli,
>
> I cannot open unicode.org. Could you please post the relevant content?
> :-)

"
Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is what a Unicode implementation was up to Unicode 1.1, *before*
surrogate code points and UTF-16 were added as concepts to Version 2.0 of
the standard. This term should now be avoided.

When interpreting what people have meant by "UCS-2" in past usage, it is
best thought of as not a data format, but as an indication that an implementation
does not interpret any supplementary characters. In particular, for the purposes
of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit,
and have exactly the same code unit representation.

The effective difference between UCS-2 and UTF-16 lies at a different level,
when one is interpreting a sequence of code units as code points or as characters.
In that case, a UCS-2 implementation would not handle processing like character
properties, code point boundaries, collation, etc. for supplementary characters.
[MD] & [KW]
"
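
To make the FAQ's point concrete, here is a minimal sketch in C of the two interpretations of the same 16-bit data. The function names (count_as_ucs2, count_as_utf16) are made up for illustration and do not come from the FAQ or any library. The code unit arrays are identical in both views; only the counting differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates occupy 0xD800-0xDBFF. */
static int is_high_surrogate(uint16_t u) { return (u & 0xFC00u) == 0xD800u; }

/* Low (trail) surrogates occupy 0xDC00-0xDFFF. */
static int is_low_surrogate(uint16_t u) { return (u & 0xFC00u) == 0xDC00u; }

/* UCS-2 view (hypothetical helper): every 16-bit unit is one character. */
size_t count_as_ucs2(const uint16_t *s, size_t n)
{
    (void)s;
    return n;
}

/* UTF-16 view (hypothetical helper): a high surrogate immediately followed
   by a low surrogate counts as a single supplementary code point. */
size_t count_as_utf16(const uint16_t *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        if (is_high_surrogate(s[i]) && i + 1 < n && is_low_surrogate(s[i + 1]))
            i += 2;   /* well-formed pair: two units, one code point */
        else
            i += 1;   /* BMP unit (or an unpaired surrogate) */
        count++;
    }
    return count;
}
```

For the sequence 0x0041 0xD834 0xDD1E ("A" followed by U+1D11E), the UCS-2 view sees three units while the UTF-16 view sees two code points.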

FWIW,
- Kim


Igor Tandetnik

Aug 1, 2007, 8:10:28 AM
"George" <Geo...@discussions.microsoft.com> wrote in message
news:D12AA1FE-7F5F-416B...@microsoft.com

> I am writing a pure C/C++ program to convert a UCS-2 string to UTF-8.
> I cannot find enough information on Google about the mapping table (or
> formula) between UCS-2 and UTF-8.

Wikipedia seems to have a page on everything:

http://en.wikipedia.org/wiki/UTF-8
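
For a single 16-bit value, the UTF-8 bit layout comes down to three cases, which can be written with nothing but &, | and shifts. The following is only a sketch: the function name ucs2_to_utf8 and its return convention are assumptions made for this example, and rejecting surrogate code units (0xD800-0xDFFF) by returning 0 is one possible policy, not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one 16-bit value as UTF-8.
   out must have room for 3 bytes.
   Returns the number of bytes written, or 0 for a surrogate code unit. */
size_t ucs2_to_utf8(uint16_t u, unsigned char *out)
{
    assert(out != NULL);

    if (u < 0x80) {                     /* 0xxxxxxx */
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    if (u >= 0xD800 && u <= 0xDFFF)     /* surrogates are not characters */
        return 0;

    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}
```

For example, U+20AC (the euro sign) falls in the third case and comes out as the bytes E2 82 AC.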

--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925


Alexander Nickolov

Aug 1, 2007, 2:32:10 PM
That page is actually misleading. It lists code points in the range
000800-00FFFF for the third-row representation. In reality, the
0xD800-0xDFFF range is reserved for surrogates and contains no
characters, so those values should be treated differently: convert
the surrogate pair back to a 32-bit code point value and re-encode it
as per row 4.
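
The step described above (combine the surrogate pair into one code point, then encode it with the four-byte row) can be sketched in C as follows. The function names combine_surrogates and supplementary_to_utf8 are hypothetical, chosen for this example only, and returning 0 on invalid input is an assumed error convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combines a high/low surrogate pair into a code point
   in 0x10000-0x10FFFF. Returns 0 if the units do not form a valid pair. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if ((hi & 0xFC00u) != 0xD800u || (lo & 0xFC00u) != 0xDC00u)
        return 0;
    /* Each surrogate carries 10 payload bits; the result is offset by 0x10000. */
    return 0x10000u + (((uint32_t)(hi & 0x3FFu) << 10) | (uint32_t)(lo & 0x3FFu));
}

/* Hypothetical helper: encodes a supplementary code point as four UTF-8
   bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). out must have room for 4 bytes.
   Returns 4, or 0 if cp is outside 0x10000-0x10FFFF. */
size_t supplementary_to_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x10000u || cp > 0x10FFFFu)
        return 0;
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, the pair 0xD834 0xDD1E combines to U+1D11E, which encodes as the bytes F0 9D 84 9E.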

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnic...@mvps.org
MVP VC FAQ: http://vcfaq.mvps.org
=====================================

"Igor Tandetnik" <itand...@mvps.org> wrote in message
news:Om6IwUD1...@TK2MSFTNGP02.phx.gbl...

Alexander Nickolov

Aug 1, 2007, 2:59:31 PM
Well, since this is Wikipedia, I went ahead and fixed the page...

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnic...@mvps.org
MVP VC FAQ: http://vcfaq.mvps.org
=====================================

"Alexander Nickolov" <agnic...@mvps.org> wrote in message
news:O9t3EpG1...@TK2MSFTNGP05.phx.gbl...

George

Aug 2, 2007, 9:20:02 AM
It is OK, Uli. I do not know why I cannot access unicode.org these days.
Perhaps it is a DNS issue. :-)


regards,
George

George

Aug 2, 2007, 9:20:02 AM
Thanks Kim,


I want to confirm: do you mean that UCS-2 and UTF-16 are the same thing?


regards,
George

George

Aug 2, 2007, 9:52:13 AM
Hi Alexander,


Is there a mapping table (or formula) between UTF-8 and UCS-2? I cannot
find one. If you have one, could you please post it?


regards,
George

George

Aug 2, 2007, 9:54:08 AM
Thanks Igor,


Is there a mapping table (or formula) between UTF-8 and UCS-2? I cannot
find one. If you have one, could you please post it?


regards,
George

David Wilkinson

Aug 2, 2007, 10:03:33 AM
George wrote:
> Thanks Kim,
>
>
> I want to confirm: do you mean that UCS-2 and UTF-16 are the same thing?

George:

No, Kim did not say that. In fact, he said the opposite. UCS-2 is
obsolete, and you should not use it. Recent versions of Windows support
UTF-16, complete with surrogate pairs.

--
David Wilkinson
Visual C++ MVP

Igor Tandetnik

Aug 2, 2007, 10:18:30 AM
George <Geo...@discussions.microsoft.com> wrote:
> Thanks Igor,
>
> Is there a mapping table (or formula) between UTF-8 and UCS-2? I cannot
> find one. If you have one, could you please post it?

It's right there in the article I gave you a link to. It is quite
obvious that you didn't bother following the link. In which case I, too,
can't be bothered to help you any further, sorry.

George

Aug 2, 2007, 11:12:01 PM
Hi David,


What does "surrogate pairs" mean?


regards,
George

Igor Tandetnik

Aug 2, 2007, 11:20:51 PM

George

Aug 3, 2007, 2:16:00 AM
Thanks Igor,


Good link.


regards,
George

George

Aug 3, 2007, 2:18:02 AM
Thanks, Igor!


I will read the link.


regards,
George
