Unicode line end implementation

Neil Hodgson

May 26, 2012, 8:55:16 PM
to scintilla...@googlegroups.com
   Recognition of Unicode line ends has now been implemented. This has not been committed to the mainline yet but downloads are available and this patch contains the changes:

   The Unicode standard section 5.8 Newline Guidelines covers the use of various line end characters, including CR, LF, CRLF, NEL, VT, FF, LS, and PS.
   
   Recognition of the Unicode line ends NEL, LS, and PS when the document is in Unicode (UTF-8) mode is added by this modification. Recognition of FF (Form Feed) as a line end is added for all encodings.

   Since NEL, LS, and PS take multiple bytes (2 for NEL and 3 for LS and PS), the line end recognition code has become more complex and slower. Previously a test machine would discover line ends at a rate of around 200 Megabytes per second but with this change the rate is around 120 Megabytes per second.

   When switching between UTF-8 mode and another encoding, there may now be fewer lines, so folding information and other line state may not be valid, and so is discarded. Most applications are likely to set the encoding once at load time, so I don't see this as an important issue.
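A minimal sketch of the recognition described above, not the code from the patch: in UTF-8, NEL (U+0085) is the byte pair C2 85, LS (U+2028) is E2 80 A8 and PS (U+2029) is E2 80 A9. The names `LineEndLength` and `CountLines` are hypothetical, and Form Feed is left out to keep the sketch short.

```cpp
#include <cstddef>
#include <string_view>

// Length in bytes of the line end starting at text[pos], or 0 if there is none.
// utf8Mode enables the multi-byte Unicode line ends NEL, LS and PS.
// Hypothetical helper for illustration; not taken from the Scintilla sources.
inline std::size_t LineEndLength(std::string_view text, std::size_t pos, bool utf8Mode) {
    // Reads past the end of the text yield 0, which matches no line end byte.
    const auto byteAt = [&](std::size_t i) -> unsigned char {
        return i < text.size() ? static_cast<unsigned char>(text[i]) : 0;
    };
    if (pos >= text.size())
        return 0;
    const unsigned char ch = byteAt(pos);
    if (ch == '\r')
        return byteAt(pos + 1) == '\n' ? 2 : 1;   // CRLF or lone CR
    if (ch == '\n')
        return 1;                                 // LF
    if (utf8Mode) {
        if (ch == 0xC2 && byteAt(pos + 1) == 0x85)
            return 2;                             // NEL, U+0085
        if (ch == 0xE2 && byteAt(pos + 1) == 0x80 &&
            (byteAt(pos + 2) == 0xA8 || byteAt(pos + 2) == 0xA9))
            return 3;                             // LS U+2028 or PS U+2029
    }
    return 0;
}

// Number of lines in text: one more than the number of line ends found.
inline std::size_t CountLines(std::string_view text, bool utf8Mode) {
    std::size_t lines = 1;
    for (std::size_t pos = 0; pos < text.size();) {
        const std::size_t len = LineEndLength(text, pos, utf8Mode);
        if (len > 0) {
            lines++;
            pos += len;
        } else {
            pos++;
        }
    }
    return lines;
}
```

Calling `CountLines` on the same bytes with `utf8Mode` set to true and then false gives different line counts whenever the text contains NEL, LS or PS, which is why per-line state cannot survive a change of encoding.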

   There are some issues to resolve, primarily what control applications need over this feature. It is currently turned on when the document encoding is set to UTF-8. It is possible that some applications would like to turn off Unicode line ends so that they can easily switch between encodings without the number of lines changing. The performance cost may be a problem for some applications. Treating Form Feed as a line end may not be wanted.

   Unfortunately, some code is duplicated between the commonly used line discovery in BasicInsertString and the ResetLineEnds method used when the encoding is changed. This is because the document bytes may be non-contiguous for ResetLineEnds, so substance.ValueAt is called there, but calling it from BasicInsertString would slow that method down. I haven't found a fast way to deduplicate this code yet.

   Downloads available from
http://www.scintilla.org/scite.zip  Source
http://www.scintilla.org/wscite.zip Windows executable

   Neil

mr.maX

May 27, 2012, 6:00:44 AM
to scintilla...@googlegroups.com
You haven't said anything about lexers. Most of them (if not all) are written to support only the "standard" CR, LF, CRLF line endings and will produce wrong results when encountering one of the new line end characters.

-- 
Regards,
Marko Njezic - mr.maX @ MAX Interactive corp.
MAX's HTML Beauty++ 2004: http://www.htmlbeauty.com/

Philippe Lhoste

May 28, 2012, 4:42:32 AM
to scintilla...@googlegroups.com
On 27/05/2012 12:00, mr.maX wrote:
> You haven't said anything about lexers. Most of them (if not all) are written to support
> only the "standard" CR, LF, CRLF line endings and will produce wrong results when
> encountering one of the new line end characters.

I believe that for most languages this is not a problem, because these new line end chars
won't be used in their source code. They might not even be legal, except, perhaps, in
literal strings or, less likely, in comments. Both are probably treated as blobs of data,
skipping everything except end of state and perhaps some reserved keywords (e.g. JavaDoc tags).

The new kinds of line ends can be relevant for some lexers like XML / HTML or some kinds of
markup language (e.g. Markdown), where natural language data is more present.

--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --

Neil Hodgson

May 28, 2012, 4:53:58 AM
to scintilla...@googlegroups.com
mr.maX:

You haven't said anything about lexers. Most of them (if not all) are written to support only the "standard" CR, LF, CRLF line endings and will produce wrong results when encountering one of the new line end characters.

   The main issue with lexers is that they may not store line state, and folders may not store fold state, against the correct line numbers when needed, since they won't trigger their line end logic on Unicode line ends. StyleContext already has an atLineEnd member which should be used in preference to explicit checks against known line end characters. It needs to be extended to know about other line ends. Folders use LexAccessor, and an AtLineEnd method can be added to LexAccessor.

   The next issue is that lexing covers whole lines, so it can now start in what was previously the middle of a line. For example, the JavaScript text (where \u2028 stands for the LS character, U+2028)
\u2028var f;
   is currently highlighted up to the space as an identifier. With Unicode line ends, it may be initially styled in one call as it is currently, but insert an 'x' after the f and "var" will highlight as a keyword. Changing appearance without apparent cause is bad, even though the Unicode line end version actually styles correctly some of the time and the current version never does for that example. Stability could be forced by lexing only ranges that don't start with Unicode line ends, but files that use Unicode line ends are likely to use them for all line ends. Here the solution is for the lexer to break the identifier at the line start.

   Many languages are only defined over ASCII or only informally defined over Unicode but JavaScript has defined behaviour for Unicode line ends in section 7.3 of the ECMAScript standard http://www.ecma-international.org/publications/standards/Ecma-262.htm

   There are some other issues like continued lines but that looks fine since allowing continuations to go over Unicode line ends would be an optional addition.

   While making some fixes to line-end recognition may make many Unicode files work well, it doesn't help much with implementing languages with defined meanings for Unicode characters. For example, identifiers will often not be allowed to contain symbol characters but this is currently difficult to support in a Scintilla lexer. This may be the right time to extend StyleContext to present the document as a sequence of characters instead of bytes and also add methods to CharacterSet to help classify the characters.

   Attached is a patch (to add to the previous one) that improves LexAccessor and StyleContext to recognize Unicode line ends. The C++ folder was changed to use the new call and the C++ lexer was made to break identifiers at line starts. There was also a change to Editor to cause full relexing when switching to/from UTF-8.

   Neil

lule.patch

Neil Hodgson

Jan 12, 2013, 11:51:44 PM
to scintilla...@googlegroups.com
A branch of current Scintilla implementing Unicode line ends and substyles (rainbow identifiers) can be found at
https://bitbucket.org/nyamatongwe/unicodelineends

I intend to commit these to the main repository soon after 3.2.4 is released. These features will initially be 'provisional'. That is, the API they present may change before it becomes permanent. The features may even be removed if there is a major problem. Applications that wish to avoid using provisional APIs will be able to define a preprocessor symbol that will hide the API definitions.

This is new code so there may be bugs. Each API is documented in ScintillaDoc but the documentation is, as always, brief.

The branch contains 16 commits with the first 11 for Unicode line ends and the last 5 for substyles. The two features touch closely related pieces of code so are not truly independent. Some of this is deliberate as they both add methods to ILexer and I didn't want to support more variants of ILexer than necessary. Each commit should build and run. Committing in increments should make it easier to check for correctness. When these are committed to the mainline, there may be some reordering and merging of commits.

The most likely change to cause trouble is that StyleContext now decodes all the bytes in a UTF-8 encoded character as one character instead of as multiple bytes. This should make it easier for lexers to treat particular non-ASCII characters as syntactically significant.
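As an illustration of what "one character instead of multiple bytes" means, here is a minimal UTF-8 decoding sketch. It is not the StyleContext code: `DecodedChar` and `DecodeUTF8` are hypothetical names, and the sketch does not reject overlong forms or surrogates. An invalid or truncated sequence is returned as its single lead byte, and a position past the end returns 0 with width 1.

```cpp
#include <cstddef>
#include <string_view>

// Result of decoding one character (hypothetical type for illustration).
struct DecodedChar {
    unsigned int codePoint; // decoded value, or the raw byte for invalid input
    std::size_t width;      // number of bytes consumed
};

// Decode the UTF-8 sequence starting at text[pos] into a single character.
inline DecodedChar DecodeUTF8(std::string_view text, std::size_t pos) {
    // Reads past the end of the text yield 0, which is never a trail byte.
    const auto byteAt = [&](std::size_t i) -> unsigned int {
        return i < text.size() ? static_cast<unsigned char>(text[i]) : 0u;
    };
    const auto isTrail = [&](std::size_t i) { return (byteAt(i) & 0xC0u) == 0x80u; };
    const auto trail = [&](std::size_t i) { return byteAt(i) & 0x3Fu; };

    const unsigned int b0 = byteAt(pos);
    if (b0 < 0x80u)
        return {b0, 1};                                   // ASCII
    if ((b0 & 0xE0u) == 0xC0u && isTrail(pos + 1))
        return {((b0 & 0x1Fu) << 6) | trail(pos + 1), 2}; // 2-byte sequence
    if ((b0 & 0xF0u) == 0xE0u && isTrail(pos + 1) && isTrail(pos + 2))
        return {((b0 & 0x0Fu) << 12) | (trail(pos + 1) << 6) | trail(pos + 2), 3};
    if ((b0 & 0xF8u) == 0xF0u && isTrail(pos + 1) && isTrail(pos + 2) && isTrail(pos + 3))
        return {((b0 & 0x07u) << 18) | (trail(pos + 1) << 12) |
                (trail(pos + 2) << 6) | trail(pos + 3), 4};
    return {b0, 1};                                       // invalid: pass the byte through
}
```

A lexer written against per-byte values would see E2, 80, A8 as three separate "characters", whereas a decoder like this reports the single value 0x2028 with a width of 3, so code that compares against byte values above 0x7F is the most likely to need changes.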

A SciTE patch to allow experimentation with substyles is attached. A set of properties for this is:

unicode.line.ends=1
substyles.cpp.11=2
substylewords.11.1.$(file.patterns.cpp)=CharacterSet LexAccessor SString WordList
substylewords.11.2.$(file.patterns.cpp)=std map string vector
style.cpp.11.1=fore:#AA00EE
style.cpp.11.2=fore:#EE00AA
style.cpp.75.1=$(style.cpp.75),fore:#663388
style.cpp.75.2=$(style.cpp.75),fore:#883366
substyles.cpp.17=1
style.cpp.17.1=$(style.cpp.17),fore:#00AAEE
substylewords.17.1.$(file.patterns.cpp)=random

Neil
SciTEULE.patch

Mike Lischke

Jan 13, 2013, 4:50:50 AM
to scintilla...@googlegroups.com
Hey Neil,

>
> This is new code so there may be bugs. Each API is documented in ScintillaDoc but the documentation is, as always, brief.


I read the new documentation but tbh I have trouble getting the idea of substyles. What is the basic idea behind them? So far a style is just a number that is used to look up the style's text properties. How do substyles fit there? I have the vague idea substyles could be useful for the MySQL lexer because there I have those version comments, that is, code within multiline comments. So what I did was to allocate "shadow styles" for each normal style, which differ only in background color.

Mike
--
www.soft-gems.net

Neil Hodgson

Jan 13, 2013, 6:42:18 AM
to scintilla...@googlegroups.com
Mike Lischke:

I read the new documentation but tbh I have trouble to get the idea of substyles.

   This was originally proposed under the name "rainbow identifiers" but that didn't cover all the uses.

What is the base idea behind that? So far a style is just a number that is used to look up the style's text properties. How do substyles fit there?

   It is dividing one style, such as SCE_C_IDENTIFIER, into multiple styles so that people who want to have 10 different sets of identifiers in different styles can do so. Each substyle is allocated a style number which is then assigned style attributes just like a standard style. 

I have the vague idea substyles could be useful for the MySQL lexer because there I have those version comments, that is, code within multiline comments. So what I did was to allocate "shadow styles" for each normal style which have only a different background color.

   Shadow styles seem to be similar to the inactive styles in the C++ lexer which shows which pieces of code are inactive due to the preprocessor. The only part of this that is related to substyles is that the SCI_DISTANCETOSECONDARYSTYLES method can be used to find out if there is a secondary set of styles and the amount to add to a primary style to calculate the corresponding secondary style.
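The "amount to add" can be sketched as plain arithmetic. This is an illustration only: `HasSecondaryStyles` and `SecondaryStyle` are hypothetical helpers, `distance` stands for whatever value the application obtained from SCI_DISTANCETOSECONDARYSTYLES, and treating 0 as "no secondary styles" is an assumption here, not something stated in this thread.

```cpp
#include <cassert>

// True when the lexer reports a secondary set of styles.
// Assumption: a distance of 0 means there is no secondary set.
inline bool HasSecondaryStyles(int distance) {
    return distance > 0;
}

// Secondary style corresponding to a primary style, given the distance
// reported by the lexer. Hypothetical helper for illustration.
inline int SecondaryStyle(int primaryStyle, int distance) {
    assert(distance >= 0);
    return primaryStyle + distance;
}
```

For the cpp lexer, the properties in the earlier message pair style.cpp.11 with style.cpp.75, which suggests a distance of 64, so `SecondaryStyle(11, 64)` would give 75.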

   Neil

Neil Hodgson

Jan 18, 2013, 11:57:39 PM
to scintilla...@googlegroups.com
Me:

> A branch of current Scintilla implementing Unicode line ends and substyles (rainbow identifiers) can be found at

This code has, with minor alterations, now been committed to the main Scintilla repository.

The features are provisional so may change after the next release. Applications that can only use stable APIs can turn off access to provisional messages by defining SCI_DISABLE_PROVISIONAL. Provisional features are marked in Scintilla.iface by being in the Provisional category near the end of the file. In the documentation, such features are shown with a golden background. The Qt APIs do not currently hide Provisional APIs - it may be simpler to have a command line option to the source code generators and either generate the derived APIs or not instead of trying to use the preprocessor.

There is no support in SciTE for these features. If you want to experiment, use the patch and settings from the previous message.

Neil

Neil Hodgson

Jan 31, 2013, 2:31:36 AM
to scintilla...@googlegroups.com

Pasha:
  • The doc says that each byte in Scintilla document is followed by one byte of styling information. This style byte consists of style number in lower bits, and few flags in higher bits. --- How do the substyles fit in this scheme? Are they taking up the indices of regular styles (that is, can we have more different visual appearances with substyles than without them)? Or was the style byte extended into a style short-word? 
Substyles are taking up indices of regular styles. The cpp lexer uses 8 bits for styles and no flags. It only defines 24*2=48 named styles so there are 256-8-48=200 style numbers currently available for substyles. Divide by 2 for secondary versions of each style and there can be 100 substyles. However, the current implementation in the cpp lexer is limited to only 64 substyles.
  • Once the substyle acquires an index, the lexer can use this substyle as it would a regular style. For example, this new substyle index can be fed into the ColourTo() method. --- If so, what is the actual difference between a style and a substyle? Are substyles somehow related to the parent style? Do they inherit the properties of their parent style? Which attributes of a substyle can differ from its parent -- just one, colour, or any and all?
The visual attributes of substyles are independent of their parent style.
  • Once a style has been "divided" into substyles (btw, is it more like dividing or multiplying?), can the original style number still be used? If so, what relationship does it retain with its substyles?
It is dividing the text that would have appeared in the parent style into sets with each set appearing differently. The original style number can still be used. The substyle numbers are related to the parent style through the API.
  • You say that a substyle gets a style number, which allows us to use it as a regular style --- Then, can we in turn divide the substyle into another set of substyles? 
It may be possible but the lexer would have to be written to allow this. The current implementation only allows a fixed set of parent styles.
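The style-number budget from the first answer above can be written out as compile-time arithmetic. The constant names are made up for illustration, and the comment on the 8 is a guess at what that figure covers; the numbers themselves are the ones quoted in the reply.

```cpp
// Style-number budget for the cpp lexer, using the figures from the reply.
// All names here are hypothetical.
constexpr int totalStyles = 256;     // 8 bits of style, no flag bits
constexpr int reservedStyles = 8;    // the "8" in the reply (presumably predefined styles)
constexpr int namedStyles = 24 * 2;  // 24 named styles, each with a secondary version
constexpr int availableStyles = totalStyles - reservedStyles - namedStyles;
constexpr int substylePairs = availableStyles / 2; // each substyle needs a secondary too
constexpr int implementationLimit = 64;            // limit stated for the cpp lexer

static_assert(availableStyles == 200);
static_assert(substylePairs == 100);
static_assert(implementationLimit < substylePairs);
```

So the 64-substyle limit in the cpp lexer, not the count of free style numbers, is the binding constraint.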

   Neil
