Unicode supplemental planes in Harbour

KevinC

Oct 6, 2020, 7:42:40 AM

to Harbour Developers

Surely Przemek and Viktor and everyone helping them know that:

Unicode is more than a 16-bit encoding.
Its upper limit is 0x10FFFF, not 0xFFFF.
It consists of 17 "planes", each of which contains 2^16 code points.
The first plane, the Basic Multilingual Plane (BMP), has code points 0 to 0xFFFF.
The other 16 planes, the supplementary planes, have code points 0x10000 to 0x10FFFF.
The first supplementary plane, the Supplementary Multilingual Plane (SMP), has code points 0x10000 to 0x1FFFF.
Until recently, only the BMP was normally used.
The recent addition of emoji like 😊 😲 😔 to the SMP has made that plane much more widely used.
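
As a quick illustration of the plane layout described above, here is a minimal sketch in Python (not Harbour). The function name `plane_of` is made up for this example. Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits.

```python
def plane_of(cp):
    """Return the Unicode plane (0..16) that a code point belongs to."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range 0..0x10FFFF")
    # Each plane spans 0x10000 code points, so the plane is the high part.
    return cp >> 16


print(plane_of(0xF60A))    # 0: BMP (a private-use code point)
print(plane_of(0x1F60A))   # 1: SMP (SMILING FACE WITH SMILING EYES)
print(plane_of(0x10FFFF))  # 16: the last plane
```

Note that U+F60A and U+1F60A differ only in the bits above bit 15, which is the part that selects the plane.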

Why then do we see the following?

UTF8 for U+F60A <private-use-F60A> = e"\xEF\x98\x8A"
UTF8 for U+1F60A SMILING FACE WITH SMILING EYES = e"\xF0\x9F\x98\x8A"
HB_UTF8CHR(0x1F60A) --> e"\xEF\x98\x8A"
HB_NUMTOHEX(HB_UTF8ASC(HB_UTF8CHR(0x1F60A))) --> "F60A"
HB_NUMTOHEX(HB_UTF8ASC(e"\xF0\x9F\x98\x8A")) --> "F60A"
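
One possible reading of the results above, offered as an assumption and not as a confirmed diagnosis of Harbour's internals, is that the code point is reduced to its low 16 bits at some stage. The sketch below is in Python (not Harbour) and uses a made-up helper, `utf8_encode`, that follows the standard UTF-8 bit layout. It shows that encoding 0x1F60A in full gives the expected four bytes, while encoding only its low 16 bits gives exactly the three bytes reported for HB_UTF8CHR(0x1F60A).

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (standard bit layout)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (supplementary planes)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


full = utf8_encode(0x1F60A)                # all 21 bits kept
truncated = utf8_encode(0x1F60A & 0xFFFF)  # assumption: only the low 16 bits kept

print(full.hex())       # f09f988a
print(truncated.hex())  # ef988a
```

The same truncation applied after decoding would also turn U+1F60A into 0xF60A, which matches the two HB_UTF8ASC lines, but only the Harbour source can confirm whether that is what actually happens.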

Is it a bug or a feature?

KevinC

Oct 6, 2020, 9:03:37 AM

to Harbour Developers

A few years ago, I wrote a small library in pure Harbour called UnicodeLib. Its conversion functions handle all Unicode planes: http://kevincarmody.com/software/unicodelib.zip

Diego Pego

Jun 27, 2024, 4:48:07 AM

to Harbour Developers

I'd really like to know how I can contribute to getting this into the core!
