Questions about using SC_CP_UTF8 (65001)

Brian Griffin

May 25, 2016, 12:13:39 AM
to scintilla-interest
I am trying to understand and use SC_CP_UTF8 (65001) in a Tk implementation of Scintilla. The documentation does not make clear how to feed UTF-8 strings to Scintilla, or how a platform implementation should render the text that Scintilla outputs. From the results I'm seeing, it appears that Scintilla does not really understand UTF-8 strings. What am I missing? Is there more detailed documentation on string handling and interpretation in Scintilla?

Thanks,
-Brian

Paul K

May 25, 2016, 12:51:51 AM
to scintilla-interest
Hi Brian,

> From the results I'm seeing, it appears as if Scintilla does not really understand utf8 strings.  What am I missing?  Is there more detailed documentation on string handling and interpretation in Scintilla?

I think you need to be more specific about what you are trying to do and the results you see (and how they deviate from the expected results). I'm using CP_UTF8 in my application on various platforms, and the only issue I've seen is with invalid UTF-8 sequences. That issue is not caused by Scintilla but by how invalid sequences are handled in the Unicode build of wxWidgets. To avoid it, wxWidgets provides alternative Raw methods to send text to, or retrieve text from, Scintilla.

Paul.

Brian Griffin

May 25, 2016, 1:03:30 AM
to scintilla...@googlegroups.com
On May 24, 2016, at 9:51 PM, Paul K <paulc...@gmail.com> wrote:

> Hi Brian,
>
>> From the results I'm seeing, it appears as if Scintilla does not really understand utf8 strings.  What am I missing?  Is there more detailed documentation on string handling and interpretation in Scintilla?
>
> I think you need to be more specific about what you are trying to do and the results you see (and how they deviate from the expected results).

When adding UTF-8 strings containing multi-byte codes (anything outside of simple Latin), the indexing no longer works, and deleting a character deletes individual bytes instead of the whole character. Of course, once one part of a character is removed, the rendering goes sour.

> I'm using CP_UTF8 in my application on various platforms and the only issue I've seen is with the invalid UTF-8 sequences; the issue is not caused by Scintilla, but rather how invalid sequences are handled in the Unicode build of wxwidgets and to avoid the issue wxwidgets provides alternative Raw methods to send or retrieve text to or from Scintilla.

I have implemented a Platform interface between the Scintilla editor core and Tk. In this case, Tk is the rendering engine, as opposed to GTK, Qt, Cocoa, etc. The text is provided to Scintilla via Tcl, which is all UTF-8.

Thanks,
-Brian

Lex Trotman

May 25, 2016, 1:50:06 AM
to scintilla...@googlegroups.com
On 25 May 2016 at 15:03, Brian Griffin <bgriffin...@gmail.com> wrote:
> When adding UTF-8 strings containing multi-byte codes (anything outside of
> simple Latin), the indexing no longer works and deleting characters deletes
> the individual bytes instead of the whole character. Of course, once one part
> of a character is removed, the rendering goes sour.

This is the documented behaviour, see
http://www.scintilla.org/ScintillaDoc.html#TextRetrievalAndModification.
In particular be aware that the term "character" throughout the
documentation means "byte" not code point. Indexing is by bytes, and
UTF-8 uses more than one byte for code points over 127.
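
To make the byte-versus-code-point distinction concrete, here is a minimal Python sketch. It is not Scintilla code: the helper names (`next_boundary`, `char_index_to_byte_offset`, `delete_char`) are made up for illustration, and the input is assumed to be valid UTF-8. It converts a code point index into the byte offset that a byte-indexed buffer expects, and deletes a whole code point instead of a single byte. Scintilla's documentation also lists SCI_POSITIONBEFORE and SCI_POSITIONAFTER, which move by whole characters according to the current code page, so a real client would probably use those rather than scan bytes itself.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, offset: int) -> int:
    """Return the byte offset of the first code point start after `offset`."""
    offset += 1
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


def char_index_to_byte_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset at which code point number `char_index` starts."""
    offset = 0
    for _ in range(char_index):
        if offset >= len(data):
            raise IndexError("character index past end of text")
        offset = next_boundary(data, offset)
    return offset


def delete_char(data: bytes, char_index: int) -> bytes:
    """Remove one whole code point (all of its bytes) from `data`."""
    start = char_index_to_byte_offset(data, char_index)
    if start >= len(data):
        raise IndexError("character index past end of text")
    end = next_boundary(data, start)
    return data[:start] + data[end:]


text = "naïve €5".encode("utf-8")   # 8 code points, 11 bytes

print(char_index_to_byte_offset(text, 3))     # 4: "ï" occupies bytes 2 and 3
print(delete_char(text, 2).decode("utf-8"))   # nave €5

# Deleting a single byte of "ï" leaves an invalid sequence behind.
broken = text[:2] + text[3:]
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8 after deleting one byte:", err.reason)
```

The last case reproduces the symptom described above: a position taken as a code point count but used as a byte index lands inside a multi-byte sequence, and removing one byte from it corrupts the character.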

Brian Griffin

May 25, 2016, 2:04:00 AM
to scintilla...@googlegroups.com

> On May 24, 2016, at 10:50 PM, Lex Trotman <ele...@gmail.com> wrote:
>
> This is the documented behaviour, see
> http://www.scintilla.org/ScintillaDoc.html#TextRetrievalAndModification.
> In particular be aware that the term "character" throughout the
> documentation means "byte" not code point. Indexing is by bytes, and
> UTF-8 uses more than one byte for code points over 127.

That's all well and good, but what does SC_CP_UTF8 mean in that context? UTF-8 is a variable-length multi-byte encoding. If Scintilla deals only in bytes, then how does it support UTF-8?

-Brian

Brian Griffin

May 25, 2016, 2:14:07 AM
to scintilla...@googlegroups.com
Here http://www.scintilla.org/ScintillaDoc.html#SCI_SETCODEPAGE it says:

"The text is converted to the platform's normal Unicode encoding before being drawn by the OS..."

What does this mean? How does it "convert"?

-Brian

Lex Trotman

May 25, 2016, 2:44:30 AM
to scintilla...@googlegroups.com
As you say, UTF-8 is defined in terms of bytes. In UTF-8 mode
Scintilla stores exactly the UTF-8 sequence you give it, i.e. it uses
multiple bytes to encode code points above 127, the same way UTF-8
does.
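
As a rough illustration of what "stores exactly the UTF-8 sequence" implies for sizes, the following Python sketch lists how many bytes each code point of a sample string takes once encoded. The function name `utf8_lengths` and the sample string are assumptions chosen for this example; nothing here calls into Scintilla.

```python
def utf8_lengths(text: str) -> list:
    """For each code point, return (character, code point, UTF-8 byte count)."""
    return [(ch, ord(ch), len(ch.encode("utf-8"))) for ch in text]


sample = "aé€😀"
stored = sample.encode("utf-8")   # the byte sequence a UTF-8 buffer would hold

for ch, code_point, size in utf8_lengths(sample):
    print(f"U+{code_point:04X} {ch!r}: {size} byte(s)")

print(len(sample), "code points,", len(stored), "bytes")   # 4 code points, 10 bytes
```

Only the ASCII letter fits in one byte; every code point above 127 takes two to four, which is why byte positions and code point counts diverge as soon as non-ASCII text is present.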

Lex Trotman

May 25, 2016, 2:49:30 AM
to scintilla...@googlegroups.com
On 25 May 2016 at 16:14, Brian Griffin <bgriffin...@gmail.com> wrote:
> Here http://www.scintilla.org/ScintillaDoc.html#SCI_SETCODEPAGE it says:
>
> "The text is converted to the platform's normal Unicode encoding before being drawn by the OS..."
>
> What does this mean? How does it "convert"?

Some platform graphics libraries use UTF-16 and some use other
representations for non-ASCII code points. Scintilla converts the
multi-byte UTF-8 representation of a code point sequence into the
representation needed by the platform that it is built to run on.
Conceptually it converts the UTF-8 byte sequence into a Unicode code
point sequence and then converts that into the required platform
representation.
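
The two conceptual steps (UTF-8 bytes to code points, then code points to the platform representation) can be sketched in Python as follows. This assumes UTF-16 as the target representation purely as an example; the function names are hypothetical, and the actual conversion in any given Scintilla platform layer may be organized quite differently.

```python
def utf8_to_code_points(data: bytes) -> list:
    """Step 1: decode a UTF-8 byte sequence into Unicode code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf16_units(code_points) -> list:
    """Step 2: encode code points as 16-bit UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            # Code points in the Basic Multilingual Plane map to one unit.
            units.append(cp)
        else:
            # Higher code points are split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))     # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))   # low surrogate
    return units


stored = "é€😀".encode("utf-8")                 # 9 bytes in the document buffer
code_points = utf8_to_code_points(stored)       # [0xE9, 0x20AC, 0x1F600]
units = code_points_to_utf16_units(code_points)

print([f"{u:04X}" for u in units])   # ['00E9', '20AC', 'D83D', 'DE00']
```

For a platform layer whose drawing API already accepts UTF-8, the second step would be unnecessary and the bytes could be passed through as they are.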