Question about scintilla


Igor Korot

Jan 15, 2012, 9:34:10 PM
to scintilla...@googlegroups.com
Hi,
I have a quick question about scintilla.
What encoding is used by scintilla internally? UTF-16, UTF-32 or UTF-8?

Let's say I programmatically append the following text:

Voilà la phrase

and then programmatically set the selection to the range (0,7).
What text will be selected?

Thank you.

Neil Hodgson

Jan 16, 2012, 6:53:57 PM
to scintilla...@googlegroups.com
Igor Korot:

> What encoding is used by scintilla internally? UTF-16, UTF-32 or UTF-8?

Scintilla uses various encodings including UTF-8, Latin-1 and
Shift-JIS. The encoding can be set with SCI_SETCODEPAGE and
SCI_STYLESETCHARACTERSET.

Neil

Igor Korot

Jan 16, 2012, 9:42:24 PM
to scintilla-interest
Neil,
In my example, what text will be selected by default: "Voilà l" or
"Voilà "?
I'm using Windows 7 US 64 bit without any locales.

Thank you.

Igor Korot

Jan 21, 2012, 1:17:49 AM
to scintilla-interest
As a reference I can use the origin of my question:

http://trac.wxwidgets.org/ticket/1300.

Could someone please give a suggestion of what would be the best
solution here?

Thank you.

Lex Trotman

Jan 21, 2012, 3:11:45 AM
to scintilla...@googlegroups.com
On Sat, Jan 21, 2012 at 5:17 PM, Igor Korot <ikor...@gmail.com> wrote:
> As a reference I can use the origin of my question:
>
> http://trac.wxwidgets.org/ticket/1300.
>
> Could someone please give a suggestion of what would be the best
> solution here?
>
> Thank you.
>

The problem in the thread is that people are expecting the same
character in text that is encoded differently to have the same index.
This will not be the case for anything other than ASCII characters.
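
A minimal Python sketch of this point, using the phrase from the first
post. It measures where the first space falls when the same text is
counted in different units. The helper name and the choice of encodings
are illustrative assumptions, not anything Scintilla provides.

```python
# The phrase from the thread; "\u00e0" is the precomposed character "à".
text = "Voil\u00e0 la phrase"


def index_of_first_space(s, encoding, unit_size):
    """Index of the first space, counted in code units of `encoding`."""
    prefix = s[: s.index(" ")]
    return len(prefix.encode(encoding)) // unit_size


offsets = {
    "code points": text.index(" "),
    "latin-1 bytes": index_of_first_space(text, "latin-1", 1),
    "utf-16 units": index_of_first_space(text, "utf-16-le", 2),
    "utf-8 bytes": index_of_first_space(text, "utf-8", 1),
}

for unit, offset in offsets.items():
    print(f"{unit:>14}: {offset}")
```

For this phrase only the UTF-8 count differs (6 rather than 5), because
"à" takes two bytes there; characters outside Latin-1 or outside the
Basic Multilingual Plane would make the other counts diverge as well.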

In general the best advice is to never translate offsets from one
encoding to another, work in one only. Otherwise life will become
very hard. Scintilla does not offer a wchar_t encoding because this
would be very inefficient for most of its use-cases.

Cheers
Lex

Igor Korot

Jan 21, 2012, 5:37:58 PM
to scintilla-interest
Lex,

On Jan 21, 12:11 am, Lex Trotman <ele...@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 5:17 PM, Igor Korot <ikoro...@gmail.com> wrote:
> > As a reference I can use the origin of my question:
>
> >http://trac.wxwidgets.org/ticket/1300.
>
> > Could someone please give a suggestion of what would be the best
> > solution here?
>
> > Thank you.
>
> The problem in the thread is that people are expecting the same
> character in text that is encoded differently to have the same index.
> This will not be the case for anything other than ASCII characters.

I think you misunderstood the point a little.
In my first post here I used a phrase in French with a special
character. Now if I want to select only the first word I should be
able to do that, right?

However, it looks like I can't do that because that special character
"à" takes 2 symbols internally.
Unfortunately I'm not using a French system and don't have French
locales installed, but it shouldn't matter much. I should still be
able to make a selection from 0 to 6 and expect that my word (Voilà)
will be selected along with the space that follows it.
So you see, the text I'm using is not encoded differently. It is a
perfectly fine Unicode string.

>
> In general the best advice is to never translate offsets from one
> encoding to another, work in one only.  Otherwise life will become
> very hard.  Scintilla does not offer a wchar_t encoding because this
> would be very inefficient for most of its use-cases.

So how do you work around the problem in cases like the one I showed?
And it becomes even worse with Chinese characters, as they sometimes
use 3 symbols.

Once again, it's not a matter of using a different encoding. There is
only one encoding.

>
> Cheers
> Lex

Lex Trotman

Jan 21, 2012, 6:25:11 PM
to scintilla...@googlegroups.com

The Scintilla documentation is clear: all positions are *bytes*, so if
your accented "à" is encoded as two bytes, as you say it is, then you
need 0-7 to include the space. The positions are not Unicode code
points, "characters" (whatever that may mean, since combined characters
are more than one code point) or glyphs.
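
As a rough illustration, the conversion the application has to do can
be sketched in Python. This assumes the buffer holds UTF-8 (that is,
the application has told Scintilla so via SCI_SETCODEPAGE);
`char_range_to_byte_range` is a hypothetical helper, not a Scintilla
call.

```python
def char_range_to_byte_range(text, start, end, encoding="utf-8"):
    """Convert a [start, end) range of code points to byte positions."""
    byte_start = len(text[:start].encode(encoding))
    byte_end = byte_start + len(text[start:end].encode(encoding))
    return byte_start, byte_end


text = "Voil\u00e0 la phrase"      # "\u00e0" is "à"
buffer = text.encode("utf-8")      # what the editor would store (assumed)

# Code points 0..6 are "Voilà" plus the following space.
byte_range = char_range_to_byte_range(text, 0, 6)
selected = buffer[byte_range[0]:byte_range[1]].decode("utf-8")

print(byte_range, repr(selected))
```

Under the UTF-8 assumption the byte range is (0, 7) and covers "Voilà "
with its trailing space. If the same text were stored as Latin-1, where
"à" is a single byte, bytes 0-7 would instead cover "Voilà l".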

The byte sequence is not encoded by Scintilla, it is encoded by the
application and passed to Scintilla as a sequence of bytes. Scintilla
has to be told by the application what the encoding is so that it can
prevent the caret from being located between bytes of multibyte
encodings and so it can convert the byte sequence to the platform
dependent encoding required for display. It does not convert
positions to/from byte positions because there is no cheap way of
doing it.
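
To show why the encoding matters for caret placement, here is a small
Python sketch that moves a byte position back to the start of the
character it falls in. It assumes valid UTF-8 and is only an
illustration of the idea; it is not Scintilla's implementation, and
`snap_to_char_start` is a made-up name.

```python
def snap_to_char_start(buffer: bytes, pos: int) -> int:
    """Move pos back until it is not on a UTF-8 continuation byte."""
    pos = max(0, min(pos, len(buffer)))
    # Continuation bytes have the bit pattern 10xxxxxx.
    while 0 < pos < len(buffer) and (buffer[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


buffer = "Voil\u00e0 la phrase".encode("utf-8")

# Byte 4 is the lead byte of "à", byte 5 its continuation byte,
# and byte 6 is the space.
print([snap_to_char_start(buffer, p) for p in (4, 5, 6)])
```

Position 5 sits in the middle of "à" and is moved back to 4; positions
4 and 6 are already on character boundaries and are left alone.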

>
>>
>> In general the best advice is to never translate offsets from one
>> encoding to another, work in one only.  Otherwise life will become
>> very hard.  Scintilla does not offer a wchar_t encoding because this
>> would be very inefficient for most of its use-cases.
>
> So how do you work around the problem with the cases like the one I
> showed?
> And it becomes even worse with Chinese characters as they use
> sometimes 3
> symbols.
>

Yes, that's why I advised "don't mix encodings"; it makes things really hard :)

> Once again it's not a matter of using different encoding. There is
> only one encoding.

Unfortunately there are two in your example, although it might not be
obvious: the text is considered a sequence of bytes by Scintilla and a
sequence of characters/code points by you/your application. When
multi-byte encodings are involved these views do not match.
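
The two views can be lined up with a short Python sketch that records
the byte position at which each character starts. It assumes UTF-8 and
uses a made-up three-character sample; `byte_offsets` is a hypothetical
helper. Building the table takes a full pass over the text, which is
one way to see why the conversion is not cheap.

```python
def byte_offsets(text, encoding="utf-8"):
    """Byte position where each character starts, then the total length."""
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode(encoding)))
    return offsets


# 'a' (1 byte), 'à' (2 bytes) and a CJK character (3 bytes) in UTF-8.
sample = "a\u00e0\u4e2d"
table = byte_offsets(sample)

print(table)  # character index i starts at byte table[i]
```

Character index 2 (the CJK character) starts at byte 3, and the whole
three-character string occupies 6 bytes.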

Cheers
Lex
