Document options for no styling data and large documents


Neil Hodgson

Jan 27, 2018, 2:24:06 AM
to Scintilla mailing list
There have been requests for Scintilla to drop styling data and to hold documents larger than 2 GigaBytes. Other requests include UTF-16 encoding of document text or using a sparse data structure for styles to save memory.

Some code has been written before for these features where the options are decided at compile-time. This lacks flexibility as an application must choose up-front and can’t adapt to different files - maybe you turned off the styling buffer to fit large XML files but still want styling for the code that processes them.

Changing these options dynamically at any time or even automatically while retaining any derived data would be very flexible but would require some tedious coding and may require substantial temporary memory. Making these choices at document creation time seems a reasonable compromise.

*** SCI_CREATELOADER ***
implemented in 6433_docopt.patch

The SCI_CREATELOADER API can be augmented with a second argument "int documentOption" that specifies a set of bit flags turning on new optional behaviours. The initial implementation would accept SC_DOCUMENTOPTION_DEFAULT (0) and so be the same as existing calls. Downstream libraries and applications can adapt to this change before any options are implemented. If this change is a problem for some projects because they don’t like changing the signature of current APIs then a distinct SCI_CREATELOADERWITHOPTIONS API could be added.

The documentOption argument as well as the bytes argument from SCI_CREATELOADER could also be added to the existing SCI_CREATEDOCUMENT API. Additional APIs such as a SCI_GETDOCUMENTOPTION could be added. If someone wants to do the work, a SCI_SETDOCUMENTOPTION could also be added although it’s a bit of a tax on adding further options if it is to preserve current data.

*** SC_DOCUMENTOPTION_NOSTYLES ***
implemented in 6434_nostyles.patch

This option stops allocating memory to store styling information and every byte in the document has style 0. This may reduce Scintilla’s memory consumption by as much as 40% depending on average line width and how heavily other features are used. This does not stop running lexers so it will often be coupled with setting the lexer to SCLEX_NULL. It is, however, possible for a lexer to produce visual effects using indicators instead of style bytes. Indicators use memory differently, with a 32-bit or 64-bit index and a 32-bit value for each styling run, which may be efficient when only parts of a document are intensively styled.
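As a rough illustration of the trade-off described above, the following sketch compares the two storage costs using only the figures given here: one style byte per document byte for the styling buffer, versus one index (32-bit or 64-bit) plus one 32-bit value per indicator run. The struct and function names are invented for this example and are not part of Scintilla, and the estimate deliberately ignores allocator overhead and every other per-document structure.

```cpp
#include <cstdint>

// Hypothetical helper, not a Scintilla type: byte counts for the two
// ways of recording visual information discussed in the text.
struct StyleMemoryEstimate {
    std::uint64_t styleBytes;      // styling buffer: one byte per document byte
    std::uint64_t indicatorBytes;  // indicators: (index + 32-bit value) per run
};

inline StyleMemoryEstimate EstimateStyleMemory(std::uint64_t documentBytes,
                                               std::uint64_t indicatorRuns,
                                               bool use64BitIndex) {
    // The index is 4 bytes for 32-bit positions and 8 bytes for 64-bit positions.
    const std::uint64_t indexBytes = use64BitIndex ? 8 : 4;
    const std::uint64_t valueBytes = 4;
    return { documentBytes, indicatorRuns * (indexBytes + valueBytes) };
}
```

With these assumptions a sparsely decorated document costs far less through indicators than through style bytes, while a document with a run boundary every few bytes costs more.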

This patch appears fairly simple and safe but it does change behaviour.

Alternative approaches could try to optimize storage while producing the same behaviour as the current code and so allow current lexers to still work.

*** Preparation: changing some types, using interfaces, templatizing data structures ***
implemented in 6435_splitvec64.patch, 6436_ilinevector.patch, 6437_templatize.patch

The change to allow 64-bit indices is complex so several patches have been split off.

*** SC_DOCUMENTOPTION_LARGE ***
implemented in 6438_largedoc.patch

This change set implements SC_DOCUMENTOPTION_LARGE by choosing either 32-bit or 64-bit implementations of data structures.

It does not do this consistently - some elements are always 64-bit even when they should match other 32-bit elements. ContractionState is the most important example as it is attached to the view instead of the document. This is one area where there may be more work performed before the code is finished.
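The general shape of "choose a 32-bit or 64-bit implementation when the document is created" can be sketched as an interface with a templated implementation behind it. This is only a minimal illustration under stated assumptions: ILineStarts, LineStarts and MakeLineStarts are hypothetical names made up here, and std::vector stands in for Scintilla's own containers.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical interface: callers always use ptrdiff_t positions,
// whatever width is used for storage.
class ILineStarts {
public:
    virtual ~ILineStarts() = default;
    virtual void Append(std::ptrdiff_t position) = 0;
    virtual std::ptrdiff_t At(std::size_t line) const = 0;
    virtual std::size_t BytesPerEntry() const noexcept = 0;
};

// Implementation parameterized by the stored position type.
template <typename POS>
class LineStarts final : public ILineStarts {
    std::vector<POS> starts;
public:
    void Append(std::ptrdiff_t position) override {
        // Narrowing is only safe because the 32-bit variant is chosen
        // solely for documents known to stay under 2GB.
        starts.push_back(static_cast<POS>(position));
    }
    std::ptrdiff_t At(std::size_t line) const override {
        return static_cast<std::ptrdiff_t>(starts.at(line));
    }
    std::size_t BytesPerEntry() const noexcept override {
        return sizeof(POS);
    }
};

// The width is decided once, at creation time, from the document option.
inline std::unique_ptr<ILineStarts> MakeLineStarts(bool largeDocument) {
    if (largeDocument)
        return std::make_unique<LineStarts<std::int64_t>>();
    return std::make_unique<LineStarts<std::int32_t>>();
}
```

The cost of this arrangement is a virtual call on each access, and the benefit is that default documents keep the smaller per-entry footprint.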

While I have played around with documents larger than 2GB and with SC_DOCUMENTOPTION_NOSTYLES on and off, this has only been on Windows and little testing has been performed.

An example of how the flags could be used is to set file size thresholds above which the options are progressively turned on:
    sptr_t docOptions = SC_DOCUMENTOPTION_DEFAULT;
    if (fileSize > 64*1024)
        docOptions |= SC_DOCUMENTOPTION_NOSTYLES;
    if (fileSize > INT32_MAX / 2)
        docOptions |= SC_DOCUMENTOPTION_LARGE;
    pdocLoad = reinterpret_cast<ILoader *>(
        wEditor.CallReturnPointer(SCI_CREATELOADER, static_cast<uptr_t>(fileSize) + 1000,
            docOptions));

Your application could also take into account file types or memory pressure or other considerations.

Implementation patches are attached. Apply in numerical order.

Neil
6438_largedoc.patch
6437_templatize.patch
6436_ilinevector.patch
6435_splitvec64.patch
6434_nostyles.patch
6433_docopt.patch

Neil Hodgson

Jan 31, 2018, 1:10:49 AM
to scintilla...@googlegroups.com
The initial stages of this have been committed.

A documentOption argument was added to SCI_CREATELOADER.

For consistency, SCI_CREATEDOCUMENT now has bytes and documentOption parameters. This should not affect binary compatibility as zeroes should have been passed in these parameters but it could affect language bindings.

https://sourceforge.net/p/scintilla/code/ci/92c8f0f1b3e64900cbb868a56936898693b9cfcc/

SplitVector uses ptrdiff_t instead of int for sizes which allows it to contain more than 2 billion elements on 64-bit systems. However, this increase in capacity is not exposed through the Scintilla API at this point.

https://sourceforge.net/p/scintilla/code/ci/3e3bfe29a819c1f7a1761096ec54e9b6ee446a68/

These are equivalent to the earlier 6433 and 6435 patches.
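To illustrate the limit that the int to ptrdiff_t change in SplitVector lifts: application code that still keeps sizes or positions in int may want a guard before narrowing. The sketch below is a generic check written for this note. FitsInInt is a hypothetical helper, not a Scintilla function.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: true when a ptrdiff_t size or position can be
// stored in an int without losing information.
constexpr bool FitsInInt(std::ptrdiff_t value) noexcept {
    return value >= std::numeric_limits<int>::min() &&
           value <= std::numeric_limits<int>::max();
}
```

On platforms where ptrdiff_t and int have the same width the check is always true, which matches the note above that the extra capacity only exists on 64-bit systems.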

Neil

Neil Hodgson

Jan 31, 2018, 10:05:10 PM
to scintilla...@googlegroups.com
Committed the templatizing of the Partitioning and RunStyles classes which is most of the 6437_templatize.patch but omits the more complex parts.

https://sourceforge.net/p/scintilla/code/ci/1bd57324aa36e3fce1ed8a2371001b062322884b/
https://sourceforge.net/p/scintilla/code/ci/89d992f380a1ce28a3ba6934230388ffaf1ea611/

It’s likely that I’ll make a new release in about a week but will be avoiding the more difficult parts of the published patches. The only feature that may go into the next release from these apart from those already committed is the 6434_nostyles.patch. Then I’ll rebase the remaining patches against the release so that others can test them.

Neil

Neil Hodgson

Feb 2, 2018, 2:56:20 AM
to scintilla...@googlegroups.com
Committed the no styles option. Renamed the option to SC_DOCUMENTOPTION_STYLES_NONE as there may be several options for each aspect (styles, text, lines, …) of a document. There are other minor changes to code and documentation from the 6434_nostyles.patch.

https://sourceforge.net/p/scintilla/code/ci/431b814a54a62d81c8069655bbbebec7bda782e3/

Neil

Neil Hodgson

Feb 20, 2018, 4:51:59 PM
to Scintilla mailing list
With the earlier changes released with 4.0.3, here is a rebased set of changes that add an option for large document support. The option name has added “_TEXT” to be SC_DOCUMENTOPTION_TEXT_LARGE as I thought it likely that there could be several options affecting the text aspect of the document.

These changes only make the most significant allocations responsive to the large document choice and so make some other elements larger unconditionally. In particular, the ContractionState class which contains data on folding may be much larger. I may do additional work in this area.

The changes will decrease performance in some cases. While this hasn’t appeared to be a problem in my tests, your usage may be different. Performance problems should be reported so they can be worked on before this is committed.

These patch sets can be applied to the current state of the repository (revision 6456) and they should be applied in numerical order.

Neil
6459_largedoc.patch
6457_ilinevector.patch
6458_templatize.patch

Neil Hodgson

Mar 10, 2018, 12:56:22 AM
to scintilla...@googlegroups.com
Some more progress in this area.

A patch which uses Sci::Position/Sci::Line/int more accurately and consistently was committed as:
https://sourceforge.net/p/scintilla/code/ci/f2650eaa75e690489aeb80546a852469ea84a98d/
This reduced the size of the patch set.

The ContractionState class, which is responsible for much of folding and line wrapping, was changed to allow a choice between 32-bit and 64-bit line numbers and this has been added onto the patch set. The choice follows the document size choice but, because there are normally far fewer lines than characters, a future possibility could allow 32-bit line numbers and 64-bit text positions.

The DecorationList and Decoration classes responsible for indicators were changed similarly to allow 32-bit and 64-bit positions and base this on the ‘IsLarge’ status of Document.

The patch set now contains 6 patches which are attached to this mail and which should be applied in numerical order.

Neil
6486_DecorationList.patch
6481_LineVector_split.patch
6482_LineVector_template.patch
6483_documentlarge.patch
6484_ContractionState_split.patch
6485_ContractionState_template.patch

Neil Hodgson

Mar 27, 2018, 11:29:50 PM
to scintilla...@googlegroups.com
Four more pieces from this sequence are now committed.

(1) Use an interface for ContractionState so that there can be different implementations of that interface.
https://sourceforge.net/p/scintilla/code/ci/ffa2a06d39876eb662ca8f5bf62fad422c613e1c/

(2) Return a FillResult struct from RunStyles::FillRange instead of modifying arguments as that is clumsy when converting types.
https://sourceforge.net/p/scintilla/code/ci/43515e7709c68d089a707f9bdfddc8c927524444/

(3) Split decorations into interface and implementation.
https://sourceforge.net/p/scintilla/code/ci/693e737f3155ecddd7b6520a6da31e5212893520/

(4) Update qt/ScintillaEdit to match (3)
https://sourceforge.net/p/scintilla/code/ci/bb434e6d3f9b80144c58d04a8320ad705dd45015/

These changes could affect platform layers but that won’t be common. Cocoa used ContractionState directly so some instances of “cs.” had to change to “pcs->”. qt/ScintillaEdit used Document::decorations directly which also had to change from “decorations.” to “decorations->”.
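Item (2) above, returning a result struct rather than writing back through arguments, can be sketched as follows. This is a simplified illustration and not the committed RunStyles code: the field names, ClampFill and Widen are assumptions made for this example, and the "fill" here only clamps a range to the document length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Result of a fill, parameterized by the position type in use.
// Field names are illustrative and may differ from the real struct.
template <typename DISTANCE>
struct FillResult {
    bool changed;
    DISTANCE position;
    DISTANCE fillLength;
};

// Toy stand-in for a fill operation: clamps the requested range to
// [0, length) and reports the range that was actually affected.
template <typename DISTANCE>
FillResult<DISTANCE> ClampFill(DISTANCE position, DISTANCE fillLength, DISTANCE length) {
    const DISTANCE start = std::max<DISTANCE>(position, 0);
    const DISTANCE end = std::min<DISTANCE>(position + fillLength, length);
    if (end <= start)
        return { false, start, 0 };
    return { true, start, end - start };
}

// Converting a 32-bit result to the wider type is a single expression,
// with no temporaries needed for in/out arguments.
inline FillResult<std::ptrdiff_t> Widen(const FillResult<std::int32_t> &fr) {
    return { fr.changed, fr.position, fr.fillLength };
}
```

With reference or pointer out-parameters, a caller holding ptrdiff_t values would have to copy them into 32-bit temporaries before the call and copy them back afterwards, which is the clumsiness the struct avoids.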

The remaining patches are attached to this mail.

The last patch is a version of the line indexing feature mentioned recently but rebased to after the other changes. However, this hasn’t yet parameterized line indexes by 32- or 64-bit position type.

Neil
6659_decoration_template.patch
6660_linevector_split.patch
6661_linevector_template.patch
6662_DocumentLarge.patch
6663_string_view.patch
6664_lineindex.patch

Neil Hodgson

Apr 12, 2018, 1:52:30 AM
to scintilla...@googlegroups.com
One more step committed to use an interface for the line vector class.
https://sourceforge.net/p/scintilla/code/ci/59913262eb19754b9a48a5f8e6d750e5255ffbf9/

This change has the potential to slow Scintilla down as calls to virtual methods are used and the extra indirection makes it more difficult for the compiler to optimize. There was no consistent performance loss with my tests on the three main current compilers on Windows 10.

However, there may be noticeable slowdowns on other workloads so performance-sensitive applications should check that they are not seeing problems. It would be difficult, but possible, to add a compilation option to choose between the previous approach and the new code.

Running the speed tests (test/performanceTests.py) with all 3 compilers showed some differences between compilers, particularly on the case-sensitive and case-insensitive UTF-8 search tests, with MinGW-64 g++ 7.3.0 being the slowest, Microsoft Visual C++ 2017.6.6 the fastest, and clang++ 6.0.0 in the middle. The largest difference was that MSVC was about 30% faster than g++ on case-sensitive searches. This may just be different optimizer switches but it could be worth examining in case there are improvements that could be made.

Comparing g++ 7.3.1 in a Linux VM on the same machine showed it was better (20% slower than MSVC) but I don’t really trust timing tests in VMs.

Neil

Neil Hodgson

Apr 17, 2018, 7:47:39 PM
to scintilla...@googlegroups.com
The option to support documents larger than 2GB has been committed.

This option is provisional and experimental. That means that the APIs and implementation may change and that bugs with the option are not release blockers and are lower priority than bugs with default behaviour.

Each lexer may have problems with documents larger than 2GB or 4GB and it’s difficult to discover this automatically. Applications should test the lexers they want to use with large documents. Since documents this large lex slowly, it may be worthwhile turning off lexing with SCLEX_NULL or turning on idle styling with SCI_SETIDLESTYLING.

Marking lexers that are known to work or fail would help other projects. For now, a standard comment could be used like // 4GB documents: works.

Performance has seen some minor shifts on some tests. g++ is a little faster, clang a little slower, and MSVC about the same. These may just be code layout caching effects.

There are some corresponding changes to SciTE to allow testing although much of SciTE truncates results from Scintilla to 32 bits so is not really suitable for large documents as yet. New undocumented properties are implemented to test both SC_DOCUMENTOPTION_TEXT_LARGE and SC_DOCUMENTOPTION_STYLES_NONE.

file.size.no.styles specifies a file size over which styles are turned off with the SC_DOCUMENTOPTION_STYLES_NONE option. When this option is turned on, the lexer is automatically set to SCLEX_NULL by SciTE.

file.size.large specifies a file size over which large document support is turned on with the SC_DOCUMENTOPTION_TEXT_LARGE option.

Each of these settings is based on the file size when loaded: they do not change if the document changes size after loading.
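The threshold logic behind these two properties can be sketched as a small function evaluated once at load time. This is an assumption-laden illustration and not SciTE's code: SizeThresholds and ChooseDocumentOptions are hypothetical names, and the k... constants are placeholder bit values standing in for the real SC_DOCUMENTOPTION_* definitions, which should be taken from Scintilla.h.

```cpp
#include <cstdint>

// Placeholder bit values for this sketch only; real code should use the
// SC_DOCUMENTOPTION_* constants from Scintilla.h instead.
constexpr int kOptDefault = 0;        // stands in for SC_DOCUMENTOPTION_DEFAULT
constexpr int kOptStylesNone = 0x1;   // stands in for SC_DOCUMENTOPTION_STYLES_NONE
constexpr int kOptTextLarge = 0x100;  // stands in for SC_DOCUMENTOPTION_TEXT_LARGE

// Hypothetical holder for the two property values.
struct SizeThresholds {
    std::int64_t noStyles;  // plays the role of file.size.no.styles
    std::int64_t large;     // plays the role of file.size.large
};

// Decide the options from the size of the file as it is loaded.
// The result is not revisited if the document later grows or shrinks.
inline int ChooseDocumentOptions(std::int64_t fileSize, const SizeThresholds &t) {
    int options = kOptDefault;
    if (fileSize > t.noStyles)
        options |= kOptStylesNone;
    if (fileSize > t.large)
        options |= kOptTextLarge;
    return options;
}
```

An application following SciTE's lead would also set the lexer to SCLEX_NULL whenever the styles-off bit is chosen.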

Change sets:
https://sourceforge.net/p/scintilla/code/ci/7247d1c9c27fe2b1c26883d30ab1a3dbe8ceb073/
https://sourceforge.net/p/scintilla/code/ci/6df3a85efb809a8b45c4a066cfa15f80731956c0/
https://sourceforge.net/p/scintilla/code/ci/86c008249ce539ad3c7829dc75cae241cad3af03/
https://sourceforge.net/p/scintilla/code/ci/9729ff36c5b19972be682b20af0a431a036174dd/
https://sourceforge.net/p/scintilla/code/ci/eed960ed3828391b2cdd009d12963575d7782baa/

SciTE:
https://sourceforge.net/p/scintilla/scite/ci/05d26ed8369bdb67a7753cb1eb4a5852d78c7fad/

https://www.scintilla.org/scite.zip Source
https://www.scintilla.org/wscite.zip Windows 64-bit executable

Neil
