The limit is roughly the memory limit for applications on your system, which can be
constrained by available physical memory or may tap into virtual memory (on disk, slow).
Now, Scintilla consumes at least twice the size of the file (memory for characters, at
least one byte of style per character, some line data too).
You should try profiling the memory usage of SciTE in this case.
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --
> There is a huge waste in the wrappers (both .NET and
> Cocoa) as they both convert the entire file several times,
Very large files should be loaded in blocks.
> See CellBuffer::BasicInsertString, in CellBuffer.cxx around line 486.
> When setting the text it makes first a full copy in the field called
> "substance" (whatever that is for),
The substance field is where the text is stored.
> then it allocates the style
> buffer. After that the text is split into lines and added to the
> buffer.
It was already added to substance before line discovery.
> This results in ca. 350 MB mem usage for a 125 MB file on OS X
> (in our app).
Scintilla uses a growth strategy to minimize resizes, which are
expensive. Large files are normally read-mostly with relatively small
changes, or read-only. If you can estimate the likely amount of text
that will be added during a session, you are much better off calling
SCI_ALLOCATE with that estimate, since this reduces both actual
memory use and memory fragmentation.
SciTE uses SCI_ALLOCATE and uses 294 MB for a 128 MB C++ file of
4,000,000 lines without folding.
There is per-line memory use which depends on which features are
active. Line start positions are always needed, at 4 bytes per
allocated line, and folding information is also generally on, at 4
bytes per allocated line. Just like the substance and style buffers,
per-line data is allocated using a growth strategy, so it will occupy
more than lines * 4 bytes. Using line state, annotations, etc. will
also take 4+ bytes per allocated line.
Neil
>
>> See CellBuffer::BasicInsertString, in CellBuffer.cxx around line 486.
>> When setting the text it makes first a full copy in the field called
>> "substance" (whatever that is for),
>
> The substance field is where the text is stored.
And this field is therefore always up to date with the content of Scintilla? That would be a perfect fit for our own processing (mostly semantic, like statement borders, error detection and such). Is there any way to get a pointer to the data so we don't need to keep a duplicate in our backend? That would really help to avoid frequent copy operations.
>
>> then it allocates the style
>> buffer. After that the text is split into lines and added to the
>> buffer.
>
> It was already added to substance before line discovery.
So the text is actually kept in two places?
>
>> This results in ca. 350 MB mem usage for a 125 MB file on OS X
>> (in our app).
>
> Scintilla uses a growth strategy to minimize resizes which are
> expensive. Large files are normally read-mostly with relatively small
> changes or read-only. If you can estimate the likely amount of text
> that will be added during a session, you are much better off calling
> SCI_ALLOCATE with your estimated size since this will improve both
> actual memory use and memory fragmentation.
Thanks for that hint. I'll check if that improves our situation.
Mike
--
Mike Lischke, Senior Software Engineer
MySQL Developer Tools
Oracle Corporation, www.oracle.com
> And this field is therefore always up to date with the content of Scintilla?
'substance' *is* the content of Scintilla.
> That would be a perfect fit for our own processing (mostly semantic, like
> statement borders, error detection and such. Is there any way to get a
> pointer to the data so we don't need to keep a duplicate in our backend?
It is a split (or gapped) buffer so there is not normally a single
pointer to the data. You can temporarily squeeze out the gap and
retrieve a pointer with SCI_GETCHARACTERPOINTER but that will only be
valid until a modification is made. It may also be expensive to move
the second segment to be next to the first.
> So the text is actually kept in two places?
No.
Neil
I assume this would still be cheaper than, or equal to, the cost of retrieving the content into a local buffer via SCI_GETTEXT. So this sounds very promising to me.
>
>> So the text is actually kept in two places?
I understand now (I hadn't looked further down the route via LineVector::InsertLine). So Scintilla needs twice as much memory (for text and style info) as the size of the actual text, plus some management data (per line, etc.). That should open the possibility of loading files far larger than 300 MB (especially in 64-bit apps). I suppose there is no intrinsic limit on the possible line count (except for the usual integer size).
Thanks for your time, Neil.
> I assume this would still be cheaper than, or equal to, the cost of retrieving the
> content into a local buffer via SCI_GETTEXT. So this sounds very promising to me.
That depends on the operation being performed. Scintilla lexers use
a buffered accessor class, since a lexing call is likely to read only
a small segment of the document.
> That should open the possibility to load files far larger than 300 MB (especially in 64 bit
> apps). I suppose there is no intrinsic limitation of the possible line count (except for the
> usual integer size).
Yes, although I have been thinking of separating out the document
indexing type, so you could have a 64-bit version of Scintilla that
uses only 32-bit indexes, allowing 2 GB or 4 GB in each document and
saving space in other structures such as the line starts.
Neil