Scintilla and Unicode


Randy Kramer

Aug 31, 2009, 5:47:35 PM
to scintilla...@googlegroups.com
--DRAFT--

I am (was) trying to understand more about Unicode and Scintilla.

After some research, I think I've now got a pretty good understanding,
with maybe a question or two left.

I'm sending this to ask those questions and to see if I have a basic
misunderstanding of what I think I've learned--some of it involved some
guesswork.

I understand that Scintilla can handle Unicode. There are several
encodings, including UTF-8, which is a variable-length format that can
use up to six bytes per character.

Scintilla, iiuc, handles UTF-8 by converting it to a 16-bit version,
UTF-16, which uses one or two 16-bit units per character. (It is also a
limited subset of Unicode / UTF-8, limited to a particular
character "plane", the Basic Multilingual Plane (BMP), which consists
of about 65,000 of the most common characters.)

Since Scintilla could already handle Japanese DBCS encodings, it is
able to handle UTF-16 by the same mechanism.

It seems to me this places some limitations on what Scintilla can do
with UTF-16 characters, but I'll have to refresh my memory.

Some are "perfectly logical" restrictions that don't affect the user, or
rather, make Scintilla work properly, e.g., not allowing the caret /
insertion point to be positioned between the two bytes of a 2-byte
character.

Likewise, iiuc, the "position" logic in Scintilla counts those two bytes
as one character. (I'm a little uncertain about this--for example, why
the -1 for invalid positions?)

In other words, if there are, for example, 1000 DBCS-only characters
in a document, do the positions go from 0 to 1000 or from 0 to 2000?

Is there any Unicode punctuation, or anything else, that is lexically
significant to TWiki markup? (Possibilities to think about: em dash,
en dash, curly quotes--and the answer is no.)

Internally, does Scintilla work on UTF-8 (multibyte) or UTF-16? My
understanding / best guess is UTF-16, but words in some places pretty
clearly say UTF-8. I think (guess) Scintilla can handle UTF-8, but
does so by converting it to UTF-16 (and that's why the limitation to
the Basic Multilingual Plane).

My questions now relate to any restrictions on dealing with two-byte
characters in a lexer.

* The logic of Scintilla in some places seems to ignore Unicode on
the basis that keywords and the character set in programming languages
almost always (maybe no known exceptions so far) use only the 128 7-bit
ASCII characters. Thus for certain logic, Unicode does not have to be
considered. My application is a text markup language, and while my
particular bias and preference leads me to intend to use only English
writings (and characters), I don't want to preclude others from using
other character sets. Heck, I might even occasionally throw in a
foreign word for some reason.

* Hmm, current question (after doing all the research previously
described here): will (can) lexing work based on Unicode characters, or
are they ignored?

* The other day, I thought of some typical punctuation that might not
be available in ASCII--I forget what I was thinking of at the time, but
trying to remember, I suspect that curly quotes are not part of the
normal 7-bit ASCII character set, yet they might be something that I or
others paste into a document to be viewed and edited in Scintilla.
Also, I have a very strong liking for the variety of dashes that can be
used (and, for that matter, varying-width spaces); for dashes I'm
thinking immediately of the em-dash and en-dash. I suspect at least
some of these are outside the 7-bit ASCII character set, and I wonder
if that's going to cause problems for me. <So, are curly quotes in the
7-bit ASCII character set? No.>

Randy Kramer

Aug 31, 2009, 6:28:25 PM
to scintilla...@googlegroups.com
Oops, sent that by mistake--unless you're a glutton for punishment,
ignore it--I may send a version of it later, but I seem to be answering
most of my questions myself--maybe at some point I want to get
confirmation.

Sorry about that. (The --DRAFT-- was supposed to remind me to not send
it--I guess I just hit enter or something.)

Randy Kramer

On Monday 31 August 2009 01:47:35 pm you wrote:
> --DRAFT--

Neil Hodgson

Aug 31, 2009, 11:05:16 PM
to scintilla...@googlegroups.com
Randy Kramer:

> Oops, sent that by mistake--unless you're a glutton for punishment,
> ignore it

I skimmed it and thought I should clear some things up.

Scintilla uses UTF-8 internally when in Unicode mode. It converts
this into UTF-16 when calling Windows APIs. On GTK+ it stays as UTF-8.
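
For anyone unfamiliar with that conversion step, here is a minimal
sketch of widening UTF-8 to UTF-16 with the standard Win32 call; this
is only an illustration, not Scintilla's own conversion code:

#include <windows.h>
#include <string>

// Sketch only: turn a UTF-8 buffer into UTF-16 for a Windows API call.
std::wstring Utf8ToUtf16(const char *utf8, int lenBytes) {
    // The first call reports how many UTF-16 code units are needed.
    int lenUnits = ::MultiByteToWideChar(CP_UTF8, 0, utf8, lenBytes, NULL, 0);
    std::wstring wide(lenUnits, L'\0');
    if (lenUnits > 0)
        ::MultiByteToWideChar(CP_UTF8, 0, utf8, lenBytes, &wide[0], lenUnits);
    return wide;
}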

Characters outside the Basic Multilingual Plane should now work (I
just updated the documentation) although, since they are relatively
rare, they are not as well tested.

Neil

Randy Kramer

Sep 1, 2009, 2:04:17 AM
to scintilla...@googlegroups.com
On Monday 31 August 2009 07:05:16 pm Neil Hodgson wrote:
> Randy Kramer:
> > Oops, sent that by mistake--unless you're a glutton for punishment,
> > ignore it
>
> I skimmed it and thought I should clear some things up.

Thanks!

> Scintilla uses UTF-8 internally when in Unicode mode. It converts
> this into UTF-16 when calling Windows APIs. On GTK+ it stays as
> UTF-8.

That sounds good, but I'm sort of surprised. UTF-8 is the
variable-width encoding that uses up to six bytes per character, right?

I'm having trouble getting my head around that--doesn't that cause some
problems for some of the lexers where a line is copied into a
fixed-length lineBuffer? For example, in LexYAML.cxx:

static void ColouriseYAMLDoc(unsigned int startPos, int length, int,
                             WordList *keywordLists[], Accessor &styler) {
    char lineBuffer[1024];

And I thought I was starting to get the idea of what was going on...

Randy Kramer

Mike Lischke

Sep 1, 2009, 8:00:28 AM
to scintilla...@googlegroups.com
>> Scintilla uses UTF-8 internally when in Unicode mode. It converts
>> this into UTF-16 when calling Windows APIs. On GTK+ it stays as
>> UTF-8.
>
> That sounds good, but I'm sort of surprised. UTF-8 is the variable
> width encoding that uses up to six bytes per character, right?
>
> I'm having trouble getting my head around that--doesn't that cause some
> problems for some of the lexers where a line is copied into a
> fixed-length lineBuffer? For example, in LexYAML.cxx:
>
> static void ColouriseYAMLDoc(unsigned int startPos, int length, int,
> WordList *keywordLists[], Accessor &styler) {
> char lineBuffer[1024];
>


Not only that. Depending on the code page that is set, the content of
the text buffer can be almost anything, which places quite a burden on
every part of Scintilla that does text processing. So it might be worth
making the step away from code pages (they are really a product of an
earlier era of computing) and handling everything internally as UTF-16
(or even UTF-32, to allow for the full Unicode range). That would also
simplify and speed up text processing in general, since we would then
have efficient and simple access to each individual character. We could
then also use Unicode character classes for syntax highlighting, code
folding and line breaking: all the information needed for that is part
of these character classes and doesn't need to be hardcoded, e.g. what
is an identifier start or part, what is considered a number in any
language, what is a word boundary, what is punctuation, and so on (see
http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf#G39, page 139).
Another benefit would be that one would not have to take care of every
possible character composition (and there *are* weird compositions).
Normalize the text when loading it into Scintilla
(http://www.unicode.org/reports/tr15/), which gives the control a
smaller set of states to consider and also helps with proper sorting
and searching: normalize the search input too and you can directly
compare any text however it happens to be written; you can even ignore
upper/lower/title casing, accents and the like.
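
As a sketch of how those Unicode character classes could drive such
decisions (using ICU here purely for illustration; Scintilla has no
ICU dependency, and the code point is assumed to be decoded already):

#include <unicode/uchar.h>

// Sketch only: character classification taken from the Unicode
// character database instead of hard-coded per-language tables.
bool IsIdentifierStart(UChar32 cp) {
    return u_isIDStart(cp) || cp == '_';
}

bool IsIdentifierPart(UChar32 cp) {
    return u_isIDPart(cp);
}

bool IsWordSeparator(UChar32 cp) {
    // White space and punctuation come straight from the Unicode
    // properties rather than from an ASCII-only table.
    return u_isUWhiteSpace(cp) || u_ispunct(cp);
}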

However, one frequently mentioned drawback is that UTF-16/32 require
significantly more memory to store text than UTF-8. That is true, but
on today's machines it really doesn't matter any more (and nobody will
load a 100 MB text file into a control that keeps everything in
memory). The additional functionality and the speed increase outweigh
the extra memory consumption by far.

Another point is that you would have to convert between the encoding
in a file and Unicode in Scintilla if you really insist on keeping your
documents in a code-paged form (which is itself rather questionable,
but sometimes requirements are like that). However, that is a small
issue, given that it happens only rarely compared to character access
in the control itself.

Mike
--
Mike Lischke, Senior Software Engineer
Database Group, Developer Tools
Sun Microsystems Inc., www.sun.com

Neil Hodgson

Sep 3, 2009, 1:11:21 AM
to scintilla...@googlegroups.com
Randy Kramer:

> That sounds good, but I'm sort of surprised.  UTF-8 is the variable
> width encoding that uses up to six bytes per character, right?

UTF-8 is now only defined for up to 4 bytes per character. 5 or 6
byte characters are no longer valid.

> I'm having trouble getting my head around that--doesn't that cause some
> problems for some of the lexers where a line is copied into a fixed
> length lineBuffer.  For example in LexYAML.cxx?

There will be artifacts at the buffer limit in YAML files.

Neil

Randy Kramer

Sep 3, 2009, 3:11:38 PM
to scintilla...@googlegroups.com
Neil,

On Wednesday 02 September 2009 09:11:21 pm Neil Hodgson wrote:
> UTF-8 is now only defined for up to 4 bytes per character. 5 or 6
> byte characters are no longer valid.

Thanks! (I don't even understand the "old" Unicode (UTF-8) and "they"
are making changes already. ;-) (I just don't know how anyone can keep
up with this stuff--but, maybe I don't have to in most cases, even
here ;-)

> > doesn't that cause
> > some problems for some of the lexers where a line is copied into a
> > fixed length lineBuffer.  For example in LexYAML.cxx?
>
> There will be artifacts at the buffer limit in YAML files.

I started writing a long response, and then decided to (try to) keep it
short and simple (or shorter than my first start, at least ;-). Still
turned out on the long side, sorry!

I guess I should give just a little background--when you said (or I
read) that SciTE/Scintilla now handled UTF-8, I was surprised because I
thought that would require some major changes, and I didn't see any
evidence of major changes around the lexers. (I'm sure it did require
major changes in some areas, but my primary area of interest (atm) is
for a lexer.)

I had seen the discussion of how lexers handled DBCS characters (by,
in simple terms, ignoring the upper byte), but hadn't noticed anything
similar regarding UTF-8. Since then, I found some relevant comments in
StyleContext.h (reformatted slightly to "flow" better):

// All languages handled so far can treat all characters >= 0x80 as one
// class which just continues the current token or starts an identifier
// if in default.
// DBCS treated specially as the second character can be < 0x80 and
// hence syntactically significant. UTF-8 avoids this as all trail bytes
// are >= 0x80.

IIUC, the way I can look at this (and I could/should have done this
before asking any questions) ;-) is that you've done the magic required
to make UTF-8 work in Scintilla, and I can build a lexer modeled on the
existing ones (that is, treating only the 7-bit ASCII characters
as "lexically significant") and expect the lexer to work with very few
worries.
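
In other words, a word-character test along the following lines (the
exact rule varies from lexer to lexer; this is only a sketch of the
convention, not code from any particular lexer):

#include <ctype.h>

// Any byte >= 0x80 (a UTF-8 lead or trail byte) is lumped into the
// "word" class, so a multi-byte character never splits a token, while
// everything lexically significant stays within 7-bit ASCII.
static inline bool IsAWordChar(int ch) {
    return ch >= 0x80 || isalnum(ch) || ch == '_';
}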

Maybe the only worry is the one you mention, that is, the "artifacts
at the buffer limit in YAML files" (or in any other lexer that breaks
the document into lines and copies them to a lineBuffer for
the "actual" lexing).

Digressing for a second, another question I was going to ask has to do
with that lineBuffer and breaking the text into lines for lexing. I
see that some (maybe most) lexers do that and some (for example,
LexMarkdown.cxx) don't.

My question may be left over from before I saw a lexer that didn't break
the document into separate lines for lexing--I didn't see a good reason
to do that (well, except a line at a time might be easier for my brain
to handle), and it just seemed to be extra overhead.

Probably insignificant overhead. (I am surely guilty of "premature
optimization", if only in my thinking.) Nevertheless, my initial plan
is not to copy a line at a time into a lineBuffer for lexing.

If I'm (significantly) mistaken in any of this, or you think I'm making
a mistake, I'd appreciate your (or anybody's) comments.

Sorry for rehashing this, which is probably very clear to you--it may
even be clear to me now, or in a few days ;-)

regards,
Randy Kramer

Wolfendale, David

Sep 3, 2009, 3:32:53 PM
to scintilla...@googlegroups.com
The auto-completion list in Scintilla looks like a typical list box
control but it does not respond to the mouse scroll wheel.
I am using release 1.77 on Windows XP.
Should this be working, or is it a defect or is it just not implemented?

Dave Wolfendale


Neil Hodgson

Sep 4, 2009, 2:41:50 AM
to scintilla...@googlegroups.com
Wolfendale, David:

> The auto-completion list in Scintilla looks like a typical list box
> control but it does not respond to the mouse scroll wheel.

This is a bug. The Scintilla window handles input events so it can
decide which keystrokes are seen by the list and which by the editor.
Should be changed to let through mouse scroll events. It is unlikely
that I will work on this.

Neil

Neil Hodgson

Sep 4, 2009, 2:51:47 AM
to scintilla...@googlegroups.com
Randy Kramer:

> I guess I should give just a little background--when you said (or I
> read) that SciTE/Scintilla now handled UTF-8,

Scintilla first supported UTF-8 in version 1.25 in May 2000 and
almost all lexers have been written or updated since then.

> Maybe the only worry being what you mention, that is the "artifacts at
> the buffer limit in YAML files" (or any other lexers that break the
> document into lines and copy to a lineBuffer for the "actual" lexing).

Using a fixed-length line buffer is wrong in multiple ways. It was
done for some of the simple lexers (like the properties lexer) at the
beginning, but it is not the right way when there is any possibility of
long lines. Since YAML can be used for data files with arbitrary
amounts of data, there is a good chance it will give bad results even
without multi-byte characters.
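
The usual alternative is to drive the lexing from a StyleContext
rather than copying lines into a buffer. A minimal sketch of that shape
(the headers are the ones lexers normally include; the two style
numbers and the '#' comment rule are invented for illustration):

#include "Platform.h"
#include "PropSet.h"
#include "Accessor.h"
#include "StyleContext.h"

enum { STYLE_SKETCH_DEFAULT = 0, STYLE_SKETCH_COMMENT = 1 };

static void ColouriseSketchDoc(unsigned int startPos, int length,
                               int initStyle, WordList *[],
                               Accessor &styler) {
    StyleContext sc(startPos, length, initStyle, styler);
    for (; sc.More(); sc.Forward()) {
        if (sc.state == STYLE_SKETCH_COMMENT) {
            if (sc.ch == '\r' || sc.ch == '\n')
                sc.SetState(STYLE_SKETCH_DEFAULT);  // comment ends at the line break
        } else if (sc.ch == '#') {
            sc.SetState(STYLE_SKETCH_COMMENT);      // '#' starts a comment
        }
        // No lineBuffer and no 1024-byte limit: long lines and multi-byte
        // UTF-8 sequences are read through the accessor directly.
    }
    sc.Complete();
}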

Neil

Randy Kramer

Sep 4, 2009, 11:13:35 AM
to scintilla...@googlegroups.com
On Thursday 03 September 2009 10:51:47 pm Neil Hodgson wrote:
> Randy Kramer:
> > I guess I should give just a little background--when you said (or I
> > read) that SciTE/Scintilla now handled UTF-8,
>
> Scintilla first supported UTF-8 in version 1.25 in May 2000 and
> almost all lexers have been written or updated since then.

Wow! I really made you rehash old stuff--I guess I'm not thoroughly
reading the documentation, etc. Sorry!

And, I re-read Markus Kuhn's Unicode FAQ and found the following in
another section, and now understand the reference to a max of 4 bytes
(since 21 bits can be encoded in 4 UTF-8 bytes):

"Current plans are that there will never be characters assigned outside
the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit
over one million potential future characters."

(from: http://linux-cjk.net/Howto/cam.ac.uk/unicode.html)
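
Spelled out as code, the relationship between the code-point range and
the number of UTF-8 bytes is just this arithmetic (a sketch, not
anything taken from Scintilla):

// Number of UTF-8 bytes needed for a code point, assuming the caller
// passes a valid value <= 0x10FFFF.
int UTF8BytesForCodePoint(unsigned int codePoint) {
    if (codePoint <= 0x7F)   return 1;  //  7 bits of payload
    if (codePoint <= 0x7FF)  return 2;  // 11 bits
    if (codePoint <= 0xFFFF) return 3;  // 16 bits (the BMP)
    return 4;                           // 21 bits, up to U+10FFFF
}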

> > Maybe the only worry being what you mention, that is the "artifacts
> > at the buffer limit in YAML files" (or any other lexers that break
> > the document into lines and copy to a lineBuffer for the "actual"
> > lexing).
>
> Using a fixed-length line buffer is wrong in multiple ways. It was
> done for some of the simple lexers (like the properties lexer) at the
> beginning, but it is not the right way when there is any possibility
> of long lines. Since YAML can be used for data files with arbitrary
> amounts of data, there is a good chance it will give bad results even
> without multi-byte characters.

Thanks! I won't use a fixed length line buffer.

Randy Kramer


KHMan

Sep 4, 2009, 1:42:52 PM
to scintilla...@googlegroups.com
Randy Kramer wrote:
> On Thursday 03 September 2009 10:51:47 pm Neil Hodgson wrote:
>> Randy Kramer:
>>> I guess I should give just a little background--when you said (or I
>>> read) that SciTE/Scintilla now handled UTF-8,
>> Scintilla first supported UTF-8 in version 1.25 in May 2000 and
>> almost all lexers have been written or updated since then.
>
> Wow! I really made you rehash old stuff--I guess I'm not thoroughly
> reading the documentation, etc. Sorry!
>
> And, I re-read Markus Kuhn's Unicode FAQ and found the following in
> another section, and now understand the reference to a max of 4 bytes
> (since 21 bits can be encoded in 4 UTF-8 bytes):
>
> "Current plans are that there will never be characters assigned outside
> the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit
> over one million potential future characters."
>
> (from: http://linux-cjk.net/Howto/cam.ac.uk/unicode.html)
>[snip]

The newer UTF-8 RFC, which specifies up to 4 bytes, is RFC 3629, while
the older one, which specifies up to 6 bytes, is RFC 2279. I keep a
copy around, along with Markus Kuhn's excellent material, in my utf8
notes folder.

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia

Randy Kramer

Sep 4, 2009, 2:33:22 PM
to scintilla...@googlegroups.com
On Friday 04 September 2009 09:42:52 am KHMan wrote:
> The newer UTF-8 RFC which specifies up to 4 bytes is RFC3629,
> while the older one which specifies up to 6 bytes is RFC2279. I
> keep a copy around, along with Markus Kuhn's excellent material in
> my utf8 notes folder.

Kein-Hong (is that an appropriate way to address you?),

Thanks!

Randy Kramer


KHMan

Sep 4, 2009, 4:14:52 PM
to scintilla...@googlegroups.com

Yes

Randy Kramer

Sep 4, 2009, 4:47:24 PM
to scintilla...@googlegroups.com
On Friday 04 September 2009 12:14:52 pm KHMan wrote:

> Randy Kramer wrote:
> > Kein-Hong (is that an appropriate way to address you?),
>
> Yes

Ok, thanks!

Randy Kramer


Eric Promislow

Sep 14, 2009, 8:18:27 PM
to scintilla...@googlegroups.com
My two cents -- the use of a "ColouriseLine" routine in the YAML lexer is
a design error. I'm not aware of any other lexer for a major language
that does this. The Python lexer shows how to handle a
whitespace-sensitive language.

- Eric

Randy Kramer

Sep 15, 2009, 11:04:57 AM
to scintilla...@googlegroups.com
On Monday 14 September 2009 04:18:27 pm Eric Promislow wrote:
> My two cents -- the use of a "ColouriseLine" routine in the YAML
> lexer is a design error. I'm not aware of any other lexer for a
> major language that does this. The Python lexer shows how to handle
> a whitespace-sensitive language.

Thanks!

I won't be using a ColouriseLine routine--I guess my accidental choice
of YAML as my first model was unfortunate. I do plan to take a look at
the Python lexer.

Randy Kramer

Wolfendale, David

Sep 23, 2009, 2:53:59 PM
to scintilla...@googlegroups.com
Neil,

I would like to fix this myself, but I don't have much experience with
programming at the Windows API level (although I am more familiar with
MFC), so could you please give me a little guidance?
I see where the WM_MOUSEWHEEL message is handled in
ScintillaWin::WndProc() but I am not familiar with how messages are
routed when the Auto-complete list box window is active.
Wouldn't the Auto-complete window get the WM_MOUSEWHEEL message first?
Why does the Scintilla window have to let the message through and if so,
how do I detect that the auto-complete window is active so I know when
to let the message through?

Dave.

Neil Hodgson

Sep 24, 2009, 9:51:46 AM
to scintilla...@googlegroups.com
David:

> I see where the WM_MOUSEWHEEL message is handled in
> ScintillaWin::WndProc() but I am not familiar with how messages are
> routed when the Auto-complete list box window is active.

See ScintillaBase::KeyCommand in src/ScintillaBase.cxx.

> Wouldn't the Auto-complete window get the WM_MOUSEWHEEL message first?

It is sent to the focus window, just like keystrokes.

> Why does the Scintilla window have to let the message through

"so it can decide which keystrokes are seen by the list and which by the editor"

> and if so,


> how do I detect that the auto-complete window is active so I know when
> to let the message through?

ac.Active()
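
A rough sketch of the shape the forwarding could take (how the list's
window handle is obtained is deliberately left as a parameter here; the
real code would take it from the AutoComplete/ListBox members that
ac.Active() belongs to):

#include <windows.h>

// Sketch only: while the autocompletion list is showing, hand the wheel
// message to the list window so it scrolls instead of the editor text.
bool ForwardWheelToList(bool autoCompleteActive, HWND hwndList,
                        WPARAM wParam, LPARAM lParam) {
    if (!autoCompleteActive)
        return false;                   // editor scrolls as before
    ::SendMessage(hwndList, WM_MOUSEWHEEL, wParam, lParam);
    return true;                        // consumed by the list
}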

Neil

Wolfendale, David

Sep 24, 2009, 9:25:41 PM
to scintilla...@googlegroups.com
Thanks,
I managed to get the auto-complete lists to scroll with the mouse wheel.
Do you want to incorporate this fix into your next release?

Dave.


Neil Hodgson

Sep 26, 2009, 10:19:26 AM
to scintilla...@googlegroups.com
Wolfendale, David:

> I managed to get the auto-complete lists to scroll with the mouse wheel.
> Do you want to incorporate this fix into your next release?

Depends on whether there are any problems and if anyone wants to
preserve the current behaviour. Add it to the feature request tracker
and we'll see what people think.
http://sourceforge.net/tracker/?group_id=2439&atid=352439

Neil

method

Oct 30, 2009, 10:35:49 PM
to scintilla-interest
Our users are a bit surprised that the scroll wheel moves the editor
underneath -- this fix would be very much appreciated. The link,
however, didn't work, so this is my way of voting "yes".

A short note on preserving the current behavior: I've come to
understand that backwards compatibility is very important around here,
but being able to scroll the text that the autocomplete box completes
away, or even out of view, is quite inconsistent with current practice.

Markus Nißl

Aug 24, 2011, 5:53:25 AM
to scintilla...@googlegroups.com
Is there any news regarding this issue?

I know that mouse wheel scrolling in the autocompletion list is currently not supported (as of Scintilla 2.28), so what happened to the fix provided by David Wolfendale?

Neil Hodgson

Aug 25, 2011, 5:03:10 AM
to scintilla...@googlegroups.com
Markus Nißl:

> I know that mouse wheel scrolling in autocompletion list is currently not
> supported (as of Scintilla 2.28), so what happened with the fix provided by
> David Wolfendale?

I haven't seen a patch.

Neil
