The limit is roughly the memory limit for applications on your system, which can be
constrained by available physical memory or may tap into virtual memory (on disk, slow).
Now, Scintilla consumes at least twice the size of the file (memory for characters, at
least one byte of style per character, some line data too).
You should try profiling the memory usage of SciTE in this case.
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --
> There is a huge waste in the wrappers (both .NET and
> Cocoa) as they both convert the entire file several times,
Very large files should be loaded in blocks.
> See CellBuffer::BasicInsertString, in CellBuffer.cxx around line 486.
> When setting the text it makes first a full copy in the field called
> "substance" (whatever that is for),
The substance field is where the text is stored.
> then it allocates the style
> buffer. After that the text is split into lines and added to the
> buffer.
It was already added to substance before line discovery.
> This results in ca. 350 MB mem usage for a 125 MB file on OS X
> (in our app).
Scintilla uses a growth strategy to minimize resizes, which are
expensive. Large files are normally read-mostly with relatively small
changes, or read-only. If you can estimate the likely amount of text
that will be added during a session, you are much better off calling
SCI_ALLOCATE with that estimate, since this reduces both actual
memory use and memory fragmentation.
SciTE uses SCI_ALLOCATE and uses 294 MB for a 128 MB C++ file of
4,000,000 lines without folding.
There is per-line memory use which depends on which features are
active. Line start positions are always needed, at 4 bytes per
allocated line, and folding information is also generally on, at 4
bytes per allocated line. Just like the substance and style buffers,
per-line data is allocated using a growth strategy, so it will occupy
more than lines * 4 bytes. Using line state, annotations, etc. will
also take 4+ bytes per allocated line.
Neil
>
>> See CellBuffer::BasicInsertString, in CellBuffer.cxx around line 486.
>> When setting the text it makes first a full copy in the field called
>> "substance" (whatever that is for),
>
> The substance field is where the text is stored.
And this field is therefore always up to date with the content of Scintilla? That would be a perfect fit for our own processing (mostly semantic, like statement borders, error detection and such). Is there any way to get a pointer to the data so we don't need to keep a duplicate in our backend? That would really help to avoid frequent copy operations.
>
>> then it allocates the style
>> buffer. After that the text is split into lines and added to the
>> buffer.
>
> It was already added to substance before line discovery.
So the text is actually kept in two places?
>
>> This results in ca. 350 MB mem usage for a 125 MB file on OS X
>> (in our app).
>
> Scintilla uses a growth strategy to minimize resizes which are
> expensive. Large files are normally read-mostly with relatively small
> changes or read-only. If you can estimate the likely amount of text
> that will be added during a session, you are much better off calling
> SCI_ALLOCATE with your estimated size since this will improve both
> actual memory use and memory fragmentation.
Thanks for that hint. I'll check if that improves our situation.
Mike
--
Mike Lischke, Senior Software Engineer
MySQL Developer Tools
Oracle Corporation, www.oracle.com
> And this field is therefore always up to date with the content of Scintilla?
'substance' *is* the content of Scintilla.
> That would be a perfect fit for our own processing (mostly semantic, like
> statement borders, error detection and such. Is there any way to get a
> pointer to the data so we don't need to keep a duplicate in our backend?
It is a split (or gapped) buffer so there is not normally a single
pointer to the data. You can temporarily squeeze out the gap and
retrieve a pointer with SCI_GETCHARACTERPOINTER but that will only be
valid until a modification is made. It may also be expensive to move
the second segment to be next to the first.
> So the text is actually kept in two places?
No.
Neil
I assume this would still be cheaper than, or equal to, the cost of retrieving the content into a local buffer via SCI_GETTEXT. So this sounds very promising to me.
>
>> So the text is actually kept in two places?
I understand now (I hadn't looked further down the route via LineVector::InsertLine). So Scintilla needs twice as much memory (for text and style info) as the size of the actual text, plus some management data (per line, etc.). That should open the possibility of loading files far larger than 300 MB (especially in 64-bit apps). I suppose there is no intrinsic limit on the possible line count (except for the usual integer size).
Thanks for your time, Neil.
> I assume this would still be cheaper than, or equal to, the cost of retrieving the
> content into a local buffer via SCI_GETTEXT. So this sounds very promising to me.
That depends on the operation being performed. Scintilla lexers use
a buffered accessor class, since a lexing call is likely to read only
a small segment of the document.
> That should open the possibility to load files far larger than 300 MB (especially in 64 bit
> apps). I suppose there is no intrinsic limitation of the possible line count (except for the
> usual integer size).
Yes, although I have been thinking of separating out the document
indexing type, so you could have a 64-bit version of Scintilla that
uses only 32-bit indexes, allowing 2 GB or 4 GB in each document and
saving space in other structures such as the line starts.
Neil