A bit more about Unicode Surrogates

Duncan Roe

Jul 29, 2025, 7:04:49 AM

to mlug

Hi Everyone,

Following on from last night's hazy binary arithmetic:

To recap: UTF-16 code units stand for themselves as Unicode code points up to
64K, except for the surrogate range, whose pairs can encode a further 1M code
points (the top 128K of these, planes 15 and 16, is reserved for private use).
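As a quick check of the recap, here is a minimal Python sketch. The helper name utf16_units is made up for illustration, and the sample characters are arbitrary; it only counts how many 16-bit code units Python's own UTF-16 codec produces.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    # utf-16-le writes no byte-order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-le")) // 2

# Code points below 64K (outside the surrogate range) take one code unit.
print(utf16_units("A"))           # U+0041
print(utf16_units("\u20ac"))      # U+20AC, euro sign

# Code points from U+10000 upwards take a surrogate pair, i.e. two units.
print(utf16_units("\U0001F600"))  # U+1F600

# The range reachable through pairs is U+10000..U+10FFFF: exactly 2**20 (1M).
SUPPLEMENTARY_COUNT = 0x10FFFF - 0x10000 + 1
print(SUPPLEMENTARY_COUNT == 2 ** 20)
```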

Well-formed UTF-16 always contains surrogates in pairs, high followed by low.
| first High Surrogate  U+D800  1101 1000 0000 0000
| last High Surrogate   U+DBFF  1101 1011 1111 1111
| first Low Surrogate   U+DC00  1101 1100 0000 0000
| last Low Surrogate    U+DFFF  1101 1111 1111 1111
All High Surrogates start 110110 and all Low Surrogates start 110111, i.e. 6
fixed bits each, leaving 10 bits for data in each and so 20 bits of data in a
pair.
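The bit layout above can be sketched in Python roughly as follows. The function names to_surrogates and from_surrogates are hypothetical, chosen only for this example. One detail not spelled out above: the 20 data bits hold the code point minus 0x10000, since pairs only ever encode U+10000 to U+10FFFF.

```python
HIGH_BASE = 0xD800  # 110110 followed by 10 data bits
LOW_BASE = 0xDC00   # 110111 followed by 10 data bits


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a pair")
    value = cp - 0x10000                 # 20-bit payload
    high = HIGH_BASE | (value >> 10)     # top 10 bits
    low = LOW_BASE | (value & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low one")
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")   # D83D DE00
print(f"back    -> U+{from_surrogates(high, low):X}")
```

The extremes line up with the table: U+10000 maps to D800 DC00 and U+10FFFF maps to DBFF DFFF.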

Cheers ... Duncan.

Kevin Exton

Jul 29, 2025, 10:51:41 PM

to mlu...@googlegroups.com

> Well-formed UTF-16 always contains surrogates in pairs, high followed by low.
> | first High Surrogate U+D800 1101 1000 0000 0000
> | last High Surrogate U+DBFF 1101 1011 1111 1111
> | first Low Surrogate U+DC00 1101 1100 0000 0000
> | last Low Surrogate U+DFFF 1101 1111 1111 1111
> All High Surrogates start 110110 and all Low Surrogates start 110111, i.e 6
> bits, leaving 10 bits for data giving 20 bits of data in a pair.

Are these Unicode code points (U+D800 to U+DFFF) reserved for UTF-16
surrogate pairs?

Best,
Kevin

Duncan Roe

Jul 30, 2025, 4:47:46 AM

to mlu...@googlegroups.com

Hi Kevin,

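Yes indeed, they are so reserved. Also, they cannot themselves be encoded in
UTF-16: a surrogate pair only ever encodes U+10000 to U+10FFFF, so well-formed
UTF-16 has no way to represent a code point in U+D800 to U+DFFF.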
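One way to see this in practice, as a sketch only: Python's built-in UTF-16 codec refuses surrogate code points when encoding and refuses unpaired surrogates when decoding. The helper names below are made up for the example, and the behaviour shown is that of CPython's default "strict" error handling.

```python
def can_encode_utf16(text: str) -> bool:
    """True if text encodes as well-formed UTF-16 under strict error handling."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


def can_decode_utf16(data: bytes) -> bool:
    """True if data (little-endian, no BOM) is well-formed UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# A surrogate code point on its own is rejected by the encoder...
print(can_encode_utf16("\ud800"))                   # lone U+D800
# ...while a supplementary character, which becomes a pair, is fine.
print(can_encode_utf16("\U0001F600"))

# On the decoding side, a proper pair is accepted; lone halves are not.
print(can_decode_utf16(bytes.fromhex("3dd800de")))  # D83D DE00
print(can_decode_utf16(bytes.fromhex("3dd8")))      # high surrogate only
print(can_decode_utf16(bytes.fromhex("00de")))      # low surrogate only
```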

Cheers ... Duncan.
