Allowing direct access to ranges in document

100 views
Skip to first unread message

Neil Hodgson

unread,
May 14, 2012, 7:46:31 PM5/14/12
to scintilla...@googlegroups.com
The SCI_GETCHARACTERPOINTER call has proved useful to applications that want rapid direct access to Scintilla's document text. For example, SciTE uses it for background saves. However, it can be slow because it makes the text contiguous by copying data after the gap to the gap start and this may be copying a large amount of data.

One possibility would be to expose a simplified view of the internal buffer structure. There could be calls for discovering the number of segments (currently 2), the length of each segment and a pointer to the start of each segment. However, this still means that the application has to handle the segments explicitly and it can be much more work to write code that jumps between segments.

Another approach would be to offer a call to retrieve a pointer to a range of text from the buffer. If needed, this would move data so that the returned pointer was to a contiguous block. The benefit is that the maximum size moved would be the range size, not the whole buffer. The application could request ranges that it can handle easily: perhaps sets of lines or fixed size blocks.

The implementation would be quite simple: if the range includes the gap, then the gap is moved to the start or end of the range. A useful property of this technique is that any pointers returned for ranges that do not overlap the range requiring the move remain valid. Thus a loop that partitions the document into fixed size blocks by calling this method for each block would produce a list of good pointers. An example where this could be used is for multi-threaded saving to a compressed file format where each block is compressed independently.

Other operations that could use this operation include code analysis and searches where they can work on ranges of lines, potentially farming out these ranges to multiple threads.

http://www.scintilla.org/ScintillaDoc.html#SCI_GETCHARACTERPOINTER

Neil

Jason Haslam

unread,
May 15, 2012, 12:00:13 PM5/15/12
to scintilla...@googlegroups.com
Yes! This would be very useful. So in this case the data will not be 0-terminated, right? Otherwise you would have to move the gap to the end of the requested range every time. Probably okay since the client already knows the size of the range.

Jason

Neil Hodgson

unread,
May 15, 2012, 8:58:39 PM5/15/12
to scintilla...@googlegroups.com
Jason Haslam:

> So in this case the data will not be 0-terminated, right? Otherwise you would have to move the gap to the end of the requested range every time. Probably okay since the client already knows the size of the range.

Yes, there's no room to place NUL terminators in the general case. I suppose for line-oriented requests, you could temporarily replace line-end characters with NULs but I'd want to see some strong justification before adding that.

Neil

mr.maX

unread,
May 16, 2012, 3:46:11 AM5/16/12
to scintilla...@googlegroups.com
On Tuesday, May 15, 2012 1:46:31 AM UTC+2, Neil Hodgson wrote:
  The SCI_GETCHARACTERPOINTER call has proved useful to applications that want rapid direct access to Scintilla's document text. For example, SciTE uses it for background saves. However, it can be slow because it makes the text contiguous by copying data after the gap to the gap start and this may be copying a large amount of data.

   One possibility would be to expose a simplified view of the internal buffer structure. There could be calls for discovering the number of segments (currently 2), the length of each segment and a pointer to the start of each segment. However, this still means that the application has to handle the segments explicitly and it can be much more work to write code that jumps between segments.

I have been using this approach in my local heavily modified version of Scintilla and can only suggest you to do the same. The gap should not move when accessing underlying data as it defeats its purpose. Depending on use case scenario, container applications can process parts separately while storing previous state during transition between parts or divide parts into segments and use a temporary segment buffer when segment spans over separate parts.

--
Regards,
Marko Njezic - mr.maX @ MAX Interactive corp.
MAX's HTML Beauty++ 2004: http://www.htmlbeauty.com/

Neil Hodgson

unread,
May 17, 2012, 6:26:48 AM5/17/12
to scintilla...@googlegroups.com
   Attached is a prototype implementation of direct range access for anyone that wants to experiment with the idea. Called as SCI_GETRANGEPOINTER(position, rangeLength).

   Neil
rangepointer.patch

Lex Trotman

unread,
May 17, 2012, 7:33:26 AM5/17/12
to scintilla...@googlegroups.com
Hi Neil,

This prompts me to request an addition or variation.

For symbol parsing of the buffer it is currently necessary to read the
entire buffer at once. An interface (such as this one) that uses
position to read successive slices of the buffer cannot cope with
changes to the buffer between calls, as the position of the next
character to read will change if there is an insert or delete prior to
it. So using positions can lead to characters being missed or double
parsed.

Instead is it possible to have a stream like interface that provides
access to successive slices through the buffer. This means Scintilla
retaining some state, eg a pointer to the next byte and adjusting it
if the byte moves, due to gap movement (if it points to part that is
moved), deletion or a new buffer.

eg

SCI_OPENSTREAM() returns small int identifying the stream
SCI_GETSTREAMPOINTER(stream, rangeLength) which returns a direct range
pointer (as above) to the current stream position and advances the
stream position
SCI_CLOSESTREAM(stream) makes the int available for re-use

Only a limited number of streams need to be accommodated, minimum one
per buffer, so that the overhead of checking on gap movement is kept
small.

In this way the parsing does not need to block the UI as it can allow
changes between parsing successive slices.

Cheers
Lex

>
>    Neil
>
> --
> You received this message because you are subscribed to the Google Groups
> "scintilla-interest" group.
> To post to this group, send email to scintilla...@googlegroups.com.
> To unsubscribe from this group, send email to
> scintilla-inter...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/scintilla-interest?hl=en.

Neil Hodgson

unread,
May 17, 2012, 9:51:34 AM5/17/12
to scintilla...@googlegroups.com
Lex Trotman:

> Instead is it possible to have a stream like interface that provides
> access to successive slices through the buffer. This means Scintilla
> retaining some state, eg a pointer to the next byte and adjusting it
> if the byte moves, due to gap movement (if it points to part that is
> moved), deletion or a new buffer.

Your description seems to imply a snapshot although you may mean something less. If you have read line 1 through the stream and the user deletes the first two lines are you OK with the next read returning the line that was initially 3, meaning that your parser has only seen lines 1 and 3?

Snapshots are possible but there will be a cost either in storage or complexity. They are far more work if they have to work from multiple threads with low costs.

If missing line 2 in the above example is acceptable then this can be implemented completely in the application by copying slices into application buffers and moving the read position when modification notifications are received.

Neil

Ferdinand Prantl

unread,
May 17, 2012, 12:11:50 PM5/17/12
to scintilla...@googlegroups.com
On Tue, May 15, 2012 at 1:46 AM, Neil Hodgson <nyama...@me.com> wrote:
  One possibility would be to expose a simplified view of the internal buffer structure. There could be calls for discovering the number of segments (currently 2), the length of each segment and a pointer to the start of each segment. 

Interesting. It could be used to speed up alternative regex implementations compiled into Scintilla. Currently I duplicate the text block to search and run the regex on the buffer copy.

char *buffer = new char[endPos - startPos + 1];
doc->GetCharRange(buffer, startPos, endPos - startPos);
buffer[endPos - startPos] = 0;
regex_match(buffer, ...);

if the text block to search were consecutive I wouldn't need the copy. (It is quite likely. assuming that the searcing starts usually from the cursor - on the gap boundary - forwards or backwards.) Then in addition to SCI_GETRANGEPOINTER, some other calls would be needed to decide if the pointer can be obtained for the range without moving te gap. Either inspecting the block structure as you write or just a simple call to check if the requested range is "OK".

Principially, it would be performance improvement. On the other hand, the current approach is lightnigly fast on usual files, questioning the effort...

   --- Ferda

Ferdinand Prantl

unread,
May 17, 2012, 12:25:16 PM5/17/12
to scintilla...@googlegroups.com
On Thu, May 17, 2012 at 1:33 PM, Lex Trotman <ele...@gmail.com> wrote:
Instead is it possible to have a stream like interface that provides
access to successive slices through the buffer.

SCI_OPENSTREAM()  returns small int identifying the stream
SCI_GETSTREAMPOINTER(stream, rangeLength) which returns a direct range
pointer

If you renounced the request for a specific slice length (rangeLength) and left the immediate slice size on Scintilla such API would have optimal performance and also looked like what Neil's mentioned - a low-level access to internal buffers. Well, it would be no char* for classical string-processing methods but I could imagine a fast std::istream implementation based on it. (Filling an internal buffer of basic_streambuf on demand - multiple gaps would not be problem; the current position would be kept in the streambuf and not in Scintilla.) Then some functionality accepting streams could process the editor content without any slowdown.

   --- Ferda

Ferdinand Prantl

unread,
May 17, 2012, 12:33:43 PM5/17/12
to scintilla...@googlegroups.com
On Thu, May 17, 2012 at 6:11 PM, Ferdinand Prantl <pra...@gmail.com> wrote:
Interesting. It could be used to speed up alternative regex implementations compiled into Scintilla. Currently I duplicate the text block to search and run the regex on the buffer copy.

Hm, reading what I wrote, actually inside Scintilla I could try to get the consecutive block anytime; as ong as the search is performed inside Scintilla and not in SciTE. So I wouldn't need the SCI_ messages; I needed just the idea :-)
 
   --- Ferda

Neil Hodgson

unread,
May 17, 2012, 7:08:09 PM5/17/12
to scintilla...@googlegroups.com
Ferdinand Prantl:

> if the text block to search were consecutive I wouldn't need the copy. (It is quite likely. assuming that the searcing starts usually from the cursor - on the gap boundary - forwards or backwards.)

The gap is commonly only at the cursor position immediately after a modification. Consider the case where you search (from the gap) and the first match isn't the one you want: subsequent searches will require the copy.

Neil

Lex Trotman

unread,
May 17, 2012, 9:10:48 PM5/17/12
to scintilla...@googlegroups.com
On 17 May 2012 23:51, Neil Hodgson <nyama...@me.com> wrote:
> Lex Trotman:
>
>> Instead is it possible to have a stream like interface that provides
>> access to successive slices through the buffer.  This means Scintilla
>> retaining some state, eg a pointer to the next byte and adjusting it
>> if the byte moves, due to gap movement (if it points to part that is
>> moved), deletion or a new buffer.
>
>   Your description seems to imply a snapshot although you may mean something less. If you have read line 1 through the stream and the user deletes the first two lines are you OK with the next read returning the line that was initially 3, meaning that your parser has only seen lines 1 and 3?

Hi Neil,

Parsing editor buffers can always be inconsistent even if you take a
snapshot, the parsers have to handle that.

>
>   Snapshots are possible but there will be a cost either in storage or complexity. They are far more work if they have to work from multiple threads with low costs.

It makes no difference if the snapshot is kept by the application or
scintilla, I don't expect Scintilla taking the snapshot would be
significantly faster than the application using SCI_GETTEXT on the
whole buffer and both would take the same space. I am looking to
limit the size of the copy to rangeLength at a time.

>
>   If missing line 2 in the above example is acceptable then this can be implemented completely in the application by copying slices into application buffers and moving the read position when modification notifications are received.

True, but Scintilla would only have to do the checks on movement of
the gap, not on every character inserted or deleted.

Neil Hodgson

unread,
May 18, 2012, 8:54:50 AM5/18/12
to scintilla...@googlegroups.com
Lex Trotman:

> It makes no difference if the snapshot is kept by the application or
> scintilla, I don't expect Scintilla taking the snapshot would be
> significantly faster than the application using SCI_GETTEXT on the
> whole buffer and both would take the same space.

A snapshot may not need to duplicate the document contents - it just has to produce the same effect as if it had duplicated the contents.

As a simple dumb implementation, the duplication could be delayed until the document is first modified.

A more complex implementation could keep track of which parts of the document have remained the same and allocate memory to hold any deletions. A document which contains 100 bytes when the snapshot is taken and then has 20 bytes added at position 80 and 10 bytes deleted at position 20 could have a piece table that looks like:

start end allocation offset
0 20 Original 0
20 30 Deletions 0
30 80 Original 20
80 100 Original 90

All the contents at the time of the snapshot can be retrieved. This kind of technique is often used to snapshot file systems or databases so that consistent backups can be produced.

Neil

Ferdinand Prantl

unread,
May 19, 2012, 5:05:39 AM5/19/12
to scintilla...@googlegroups.com
Yes, you're right; caret changes don't move the gap. I tested it a little and found the optimization trigger pretty often. Searching through the document right after opening the file is the perfect case. After editing, searches starting in the text before gap need a copy. But once the repeating forward search passes the gap the rest of the document can be accessed without copying it.

Although I felt the search with and without copy fast enough, the optimization which I used needed just a minimum code, so I left it there. Basically, I provided an access to the pointer if the gap was out of range, otherwise an automatically deleted copy, all in a std::string-like class:

class Sci_SafeCharRange {
char *buffer;    int length;    bool copy;
public:
Sci_SafeCharRange(Document *doc, int position, int length) : length(length) {
buffer = const_cast<char*>(doc->SafeRangePointer(position, length));
copy = !buffer;
if (copy) {
buffer = new char[length];
doc->GetCharRange(buffer, position, length);
}
}

~Sci_SafeCharRange() {    if (copy) delete[] buffer;    }

char const *c_str() const {    return buffer;    }    // no trailing zero - use size()
int size() const {    return length;    }
};

T * SplitVector::SafeRangePointer(int position, int rangeLength) {
if (position < part1Length) {
if ((position + rangeLength) > part1Length)
return NULL;
return body + position;
}
return body + position + gapLength;
}    // propagated to CellBuffer and Document

You can consider using a wrapper like this on places where the buffer copy is used just to be sure for a general case.

   --- Ferda

Neil Hodgson

unread,
May 19, 2012, 8:37:01 PM5/19/12
to scintilla...@googlegroups.com
Ferdinand Prantl:

> Although I felt the search with and without copy fast enough, the optimization which I used needed just a minimum code, so I left it there. Basically, I provided an access to the pointer if the gap was out of range, otherwise an automatically deleted copy, all in a std::string-like class:

I'd expect this to be slower than allowing Scintilla to move the gap to one end. It has to copy the whole range whereas moving the gap will, on average, copy half the range and even less if the gap is moved to the closest end. It also requires a memory allocation.

Neil

Neil Hodgson

unread,
May 22, 2012, 8:07:11 PM5/22/12
to scintilla...@googlegroups.com
Direct range access will be reasonably efficient even for naive 'replace all' operations. Each replacement moves the gap to the replace position and a subsequent range access request to search from that position will not require the gap to be moved. There may be a large move at startup but the whole replace all procedure won't degenerate into a O(n^2) operation.

Unless new problems are raised, this feature will be committed soon.

Neil

mr.maX

unread,
May 24, 2012, 7:14:05 PM5/24/12
to scintilla...@googlegroups.com
On Thursday, May 17, 2012 12:26:48 PM UTC+2, Neil Hodgson wrote:
   Attached is a prototype implementation of direct range access for anyone that wants to experiment with the idea. Called as SCI_GETRANGEPOINTER(position, rangeLength).

In addition to SCI_GETRANGEPOINTER, you should add a method for retrieving gap position (i.e. SCI_GETGAPPOSITION). Knowing the gap position would also allow applications to access two separate parts directly and process them in whole without moving the gap (if the further operations don't make modifications that would move the gap, of course).

Use case sample in pseudo code:

int gap = SCI_GETGAPPOSITION();
char* part1 = SCI_GETRANGEPOINTER(0, gap);
char* part2 = SCI_GETRANGEPOINTER(gap, SCI_GETLENGTH() - gap);

Neil Hodgson

unread,
May 25, 2012, 2:16:08 AM5/25/12
to scintilla...@googlegroups.com
mr.maX:

> In addition to SCI_GETRANGEPOINTER, you should add a method for retrieving gap position (i.e. SCI_GETGAPPOSITION).

Exposing internal details could make it more difficult to change the implementation in the future. This looks safe for the purpose of calling SCI_GETRANGEPOINTER for the current two segments. It should be taken as a hint as to the best place to break requests. That allows a changed implementation to return any position in the document even if it does not have contiguous segments and needs to allocate memory for (some) calls of SCI_GETRANGEPOINTER.

Neil

Neil Hodgson

unread,
May 26, 2012, 1:39:01 AM5/26/12
to scintilla...@googlegroups.com
The GetRangePointer and GetGapPosition methods are now committed.

Neil

Reply all
Reply to author
Forward
0 new messages