Welcome to UTF-8.
This is something I consult on all the time. The days when encoding
length, character count, and display width were all the same are long
gone. It's a habit you have to break your mind of (and it doesn't
help that languages like C and C++ call a byte a "char").
A single character can take anywhere from 1 to 4 bytes.
Basically:
U+000000 to U+00007F (basic Latin) = 1 byte - the graceful part of
UTF-8 is that it is directly equivalent to ASCII in that range.
U+000080 to U+0007FF - 2 bytes
U+000800 to U+00FFFF - 3 bytes
U+010000 to U+10FFFF - 4 bytes
U+10FFFF is the highest valid code point, so 4 bytes is the maximum
(the original design allowed sequences up to 6 bytes, but RFC 3629
capped it at 4).
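The ranges above are easy to check for yourself. A quick sketch in Python (the specific sample characters are just illustrative picks, one from each range):

```python
# Encode one sample character from each UTF-8 length class and
# print its code point alongside its encoded byte length.
samples = [
    "A",   # U+0041, basic Latin  -> 1 byte (same as ASCII)
    "é",   # U+00E9               -> 2 bytes
    "€",   # U+20AC               -> 3 bytes
    "😀",  # U+1F600              -> 4 bytes
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} -> {len(encoded)} byte(s)")
```

Note that len() on the string gives the code point count, while len() on the encoded bytes gives the storage size; the two only agree in the ASCII range.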
See:
http://en.wikipedia.org/wiki/UTF-8
Zac Bowling
http://zbowling.com/