Let's say I programmatically append the following text:
Voilà la phrase
and then programmatically set the selection to the range (0,7)?
Thank you.
> What encoding is used by scintilla internally? UTF-16, UTF-32 or UTF-8?
Scintilla uses various encodings including UTF-8, Latin-1 and
Shift-JIS. The encoding can be set with SCI_SETCODEPAGE and
SCI_STYLESETCHARACTERSET.
Neil
The problem in the thread is that people are expecting the same
character to have the same index in text that is encoded differently.
That only holds as long as everything before it is plain ASCII.
In general the best advice is to never translate offsets from one
encoding to another; work in one only. Otherwise life will become
very hard. Scintilla does not offer a wchar_t encoding because this
would be very inefficient for most of its use-cases.
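
To make the index mismatch concrete, here is a small Python sketch (not
Scintilla code). It assumes the accented a is the precomposed code point
U+00E0; offset_in_encoding is a hypothetical helper written only for this
illustration.

```python
# Assumption: the accented a is the precomposed code point U+00E0.
text = "Voil\u00e0 la phrase"

def offset_in_encoding(s: str, char_index: int, encoding: str) -> int:
    """Byte offset at which the character s[char_index] starts once s is encoded."""
    return len(s[:char_index].encode(encoding))

# Character index of the word "la" that follows the accented letter.
char_index = text.index("la")

# The same character lands at a different byte offset in each encoding.
offsets = {enc: offset_in_encoding(text, char_index, enc)
           for enc in ("latin-1", "utf-8", "utf-16-le")}
print(char_index, offsets)
```

The character index is 6 in every case, but the byte offset is 6 in
Latin-1, 7 in UTF-8 and 12 in UTF-16, so an offset computed in one
encoding cannot simply be reused in another.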
Cheers
Lex
The Scintilla documentation is clear: all positions are *bytes*, so if
your accented a is encoded as two bytes, as you say it is, then you
need 0-7 to include the space. The positions are not Unicode code
points, "characters" (whatever that may mean, since combined characters
are more than one code point) or glyphs.
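
As a rough illustration in Python rather than Scintilla itself, and
assuming the buffer is UTF-8 with the accented a stored as U+00E0, the
range (0,7) picks out different text depending on whether it is counted
in bytes or in code points:

```python
text = "Voil\u00e0 la phrase"
data = text.encode("utf-8")  # assumed UTF-8 byte sequence passed to the editor

byte_selection = data[0:7].decode("utf-8")  # range (0,7) counted in bytes
char_selection = text[0:7]                  # range (0,7) counted in code points

print(repr(byte_selection))
print(repr(char_selection))

# A combined character: "a" followed by a combining grave accent (U+0300).
# It looks like one character but is two code points and three UTF-8 bytes.
combined = "a\u0300"
print(len(combined), len(combined.encode("utf-8")))
```

Counted in bytes, the range ends just after the space; counted in code
points, it also takes in the following "l".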
The byte sequence is not encoded by Scintilla, it is encoded by the
application and passed to Scintilla as a sequence of bytes. Scintilla
has to be told by the application what the encoding is so that it can
prevent the caret from being located between bytes of multibyte
encodings and so it can convert the byte sequence to the platform
dependent encoding required for display. It does not convert
character positions to/from byte positions because there is no cheap
way of doing it.
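
A minimal sketch of why the conversion is not cheap, assuming UTF-8: both
directions have to walk the bytes from the start of the text. The two
functions are hypothetical helpers for illustration, not part of the
Scintilla API.

```python
def byte_to_char_index(data: bytes, byte_pos: int) -> int:
    """Number of code points that start before byte_pos (linear scan)."""
    # In UTF-8, continuation bytes look like 10xxxxxx; every other byte
    # starts a new code point.
    return sum(1 for b in data[:byte_pos] if (b & 0xC0) != 0x80)

def char_to_byte_index(data: bytes, char_pos: int) -> int:
    """Byte offset at which code point number char_pos starts (linear scan)."""
    seen = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if seen == char_pos:
                return offset
            seen += 1
    if seen == char_pos:
        return len(data)  # position just past the last code point
    raise IndexError("char_pos is beyond the end of the text")

data = "Voil\u00e0 la phrase".encode("utf-8")
print(byte_to_char_index(data, 7), char_to_byte_index(data, 6))
```

Each call costs time proportional to the length of the text before the
position, which is why doing this on every position lookup would be
expensive for a large document.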
>
>>
>> In general the best advice is to never translate offsets from one
>> encoding to another, work in one only. Otherwise life will become
>> very hard. Scintilla does not offer a wchar_t encoding because this
>> would be very inefficient for most of its use-cases.
>
> So how do you work around the problem with the cases like the one I
> showed?
> And it becomes even worse with Chinese characters, as they sometimes
> use 3 symbols.
>
Yes, that's why I advised "don't mix encodings"; it makes it really hard :)
> Once again it's not a matter of using different encoding. There is
> only one encoding.
Unfortunately there are two in your example, although it might not be
obvious: the text is considered a sequence of bytes by Scintilla and a
sequence of characters/code points by you/your application. When
multi-byte encodings are involved these views do not match.
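
The two views can be put side by side in a short Python sketch, assuming
UTF-8 and using U+4E2D and U+6587 as example Chinese characters (the
sample strings are made up for this illustration):

```python
# UTF-8 width in bytes of an ASCII letter, an accented letter and a
# Chinese character.
samples = {"a": "a", "a-grave": "\u00e0", "zhong": "\u4e2d"}
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

text = "Voil\u00e0 \u4e2d\u6587"
code_point_count = len(text)             # the application's view
byte_count = len(text.encode("utf-8"))   # the byte view Scintilla works with

print(widths, code_point_count, byte_count)
```

Here the widths are 1, 2 and 3 bytes, so the same text is 8 code points
long but 13 bytes long, and the two counts drift further apart with every
non-ASCII character.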
Cheers
Lex