Let say I am writing a word processor. It would be a bad design to
consider that a document is a std::vector of instances of a class
MyChar ? Where a MyChar would look like:
struct MyChar
{
char TheChar;
bool Bold;
bool Italic;
unsigned char Color[3];
};
This design is hardcoded (what if I need another attribute later on)
and it would be very difficult to look for a specify array of char. It
would also eat a lot of memory...
Is this something describe in the literature ? If yes, where can I
find documentation ?
Thanks,
FlyWeight pattern.
Cheers & hth.,
- Alf
I cannot believe :
1. I got an answer in a couple of minutes,
2. You managed to decipher the description of my issue,
3. There is actually a pattern very well detailed, and with a very
intuitive name
Thanks Alf !
Inserting and deleting in a vector<> for an editor is already a
problem.
>
> This design is hardcoded (what if I need another attribute later on)
> and it would be very difficult to look for a specify array of char. It
> would also eat a lot of memory...
> Is this something describe in the literature ? If yes, where can I
> find documentation ?
There are many ways to tackle this problem and they depend on your
needs.
You could have a look into some open source projects which do what you
want and see how they did it.
Googling for 'how to design a text editor' also gives interesting
results.
Please come back when you have a C++ language question.
--
Michael
> Let say I am writing a word processor. It would be a bad design to
> consider that a document is a std::vector of instances of a class
> MyChar ? Where a MyChar would look like:
>
> struct MyChar
> {
> char TheChar;
> bool Bold;
> bool Italic;
> unsigned char Color[3];
> };
Yes, that would be bad design. A document is much more than simply a
vector of MyChar objects.
> Is this something describe in the literature ? If yes, where can I
> find documentation ?
"Design Patterns" by Gamma et al, uses a word processor example in the
entire second chapter. I suggest you read that.
FlyWeight has already been mentioned, but it's really only part of
the answer.
IMO, you should take a look at the general design of CSS and/or ODF.
They're obviously not identical, but they have some characteristics
on common, and give an idea of how this general type of problem has
been solved (with a reasonable degree of success, I might add)
already.
The general idea can be described fairly simply though: a couple of
extra levels of indirection, at least one of them explicitly visible
to the user. Specifically, you want to allow the user to put together
a "style sheet" that describes characteristics of some text (font,
color, spacing, etc.) then you want to let them apply that style to
lots of different text. The user can then format most of his/her
document(s) based on a logical style rather than always separately
editing the individual attributes. For example, you could start by
stealing a few "styles" from HTML, and having H1, H2, H3 and H4.
Much like a browser, give each of these a default style -- but let
the user edit styles, so if the user wants H1 to be 32 point Bodoni
bold, and H2 to be 27.5 point Castellar MT black, s/he can do that.
Then when they change their mind and decide those should both be
Arial bold, in 30 and 20 points respectively, they can easily do that
-- and when they change the style, every instance of H1 or H2
throughout their document will automatically change together.
Likewise, of course, you want to let the user define new styles, so
the styles can and will reflect the logical structure of the document
they're creating, in the way they think about things. The style one
user calls "chemical formula" might happen to have identical
attributes to another user's "algorithm", but they're still separate
styles that happen (at least at some particular moment) to look the
same.
--
Later,
Jerry.
No, it's the preferred and fastestest for an editor. The idiomatic choice for
those who have made some editors from scratch (not just using an edit
control/widget). Look up "cursor gap" in Wikipedia or wherever.
For a word processor it's a different matter.
I thought it was the scratch buffer (like in ed or vi).
> The idiomatic
> choice for those who have made some editors from scratch (not just using
> an edit control/widget).Look up "cursor gap" in Wikipedia or wherever.
I had a look on wikipedia ... as clear as the bottom of my socks. But it
looks interesting. Definitely will get a look at "The Craft of Text
Editing" (if I can find a working link).
--
Michael
[ ... ]
> > The idiomatic
> > choice for those who have made some editors from scratch (not just using
> > an edit control/widget).Look up "cursor gap" in Wikipedia or wherever.
>
> I had a look on wikipedia ... as clear as the bottom of my socks.
The idea is pretty simple, really. When you're editing a text file,
you separate the file into two pieces: everything that's before the
cursor, and everything that's after the cursor.
Those two pieces are place into memory with the part that's before
the cursor at the very beginning of the buffer, and the part that's
after the cursor at the very end of the buffer. In between the two,
you have a "gap" of unused memory space. When/if the user enters new
text, you add the characters into that memory space (for overwrite
mode, you write at the end of the gap instead of the beginning).
When the user moves the cursor, you copy text from one side of the
gap to the other. Since the user usually only moves the cursor in
small increments (word, line, possibly page) it's pretty easy to copy
the text around quickly enough to keep up (with a little care, you
could keep up even on 8-bit processors). With this design, the single
most difficult situation to handle quickly is a goto line N, which
involves not only scanning through the data to find the Nth line, but
also copying data around to put the gap at the right place. Unless
files get really huge, though, it's still not much of a problem.
--
Later,
Jerry.
That I understood but what do you gain except an amortized copy ?
I guess it is memory cost effective.
> Those two pieces are place into memory with the part that's before
> the cursor at the very beginning of the buffer, and the part that's
> after the cursor at the very end of the buffer. In between the two,
> you have a "gap" of unused memory space. When/if the user enters new
> text, you add the characters into that memory space (for overwrite
> mode, you write at the end of the gap instead of the beginning).
That has the sake of simplicity but if you maintain a list of
modification (like rope or another list structure with modifications
stored in a scratch buffer) until next write to file then insertion
and deletion seems cheap to me.
I have never tried to write a text editor but this approach appeals me
more than gap buffer.
> When the user moves the cursor, you copy text from one side of the
> gap to the other. Since the user usually only moves the cursor in
> small increments (word, line, possibly page)
I don't know what you call small increment but, on a daily basis, I
move between marks, end/beginning of line, next/previous char/word/
symbol ... I move much more than I write; I expect the gap is created
on the first modification.
Don't mistake me. I don't discuss the usefulness of the technique but
I try to understand the forces that apply to such a design.
--
Michael
It's a *simple* structure that yields O(1) insertion and deletion at the cursor
position and O(1) random access read and write, where the former is the same as
with a linked list, and the latter is much better than a linked list.
The main cost is that, without reallocation and copying of all the data, the
maximum size of the buffer is determined up front, and it uses that much memory
regardless of the amount of data (for an editor, text), so no, it isn't
especially memory cost effective. :-)
Of course, in C++ any complexity for a more complicated structure can be hidden
under a simpler programmatic interface. So I guess that my comment reflected
that since C++ displaced C for such tasks, implementing editors from scratch
hasn't been that much in vogue. And the only such editor I made that I still
have the source code for, on magnetic tape, an editor for the HP3000 from 198x!,
wasn't written in C but in Pascal, and it didn't use cursor gap but a more
sophisticated list of blocks, sort of like a C++ std::deque implementation.
Cheers,
- Alf
> > Let say I am writing a word processor. It would be a bad
> > design to consider that a document is a std::vector of
> > instances of a class MyChar ? Where a MyChar would look
> > like:
> > struct MyChar
> > {
> > char TheChar;
> > bool Bold;
> > bool Italic;
> > unsigned char Color[3];
> > };
> Inserting and deleting in a vector<> for an editor is already
> a problem.
Maybe. The classical solution has always been to keep the free
space at the position of the cursor, rather than at the end, but
I do remember some measurements, made a long time ago, which
showed that copying even a MB of characters wasn't that
expensive.
> > This design is hardcoded (what if I need another attribute later on)
> > and it would be very difficult to look for a specify array of char. It
> > would also eat a lot of memory...
Maybe. Alf's already mentionned the flyweight pattern; more
classically, character attributes tend to affect blocks of
characters, and not really be needed for all editing tasks, so
the usual solution is to maintain them separately, over ranges,
rather than single positions. Say in a structure something like
std::map< position, attributes >. (If you need to find the
attributes for a specific character, lower_bound with the
character's position should do the trick. On the other hand,
insertions and deletions can be tricky, since unless you can do
so without modifying the position key, you'll have to update all
of the map entries following the affected position. The systems
I'm aware of that use this strategy use two dimensional
positionning---line and column---so that normally, very few
entries have to be updated.)
You can also use a discriminated union with either character
data or a pointer to the (new) attributes in the same array. To
find the attributes for a character, scan backwards until you
hit an entry which contains or points to attributes. (If you
don't mind some tricky programming, there are bits free in
something like:
union Character
{
uint32_t codePoint ;
Attributes const* attributes ;
} ;
which could be used. Unicode only requires 21 bits, and on most
systems, an Attributes* must be aligned---on a lot of systems,
like Windows or Linux, you're even guaranteed that a certain
number of the upper bits won't all be zero (and on all of the 32
bit Unix systems I know, it would be simple to write an
allocator/garbage collector for Attributes which would ensure
that the top bit of the address is always set).
If you want to keep the data with the character, then the
solution should still involve a pointer, so that the size and
structure of Attributes can easily evolve:
struct Character
{
uint32_t codePoint ;
Attributes const* attributes ;
} ;
(This is probably the easiest solution to implement, so should
probably be chosen until it is clear that it causes problems.)
Unless you're keeping the attributes in a separately keyed
structure, you'll want garbage collection. Managing their
memory by hand is too painful, and replacing the raw pointer
with a smart pointer here will kill you when it comes to
performance (and can't be done at all with the union
solution).
--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
> >>> struct MyChar
> >>> {
> >>> char TheChar;
> >>> bool Bold;
> >>> bool Italic;
> >>> unsigned char Color[3];
> >>> };
> >> Inserting and deleting in a vector<> for an editor is already a
> >> problem.
> > No, it's the preferred and fastestest for an editor.
There is no "preferred and fastest" solution.
> I thought it was the scratch buffer (like in ed or vi).
It depends on the editor idiom. Ed and vi are line oriented,
and keep their data in a two dimensional structure, something
like an std::vector< std::string > or and std::vector<
std::vector< char > >. Most text processors are not line
oriented; line breaks are decided when formatting. The editors
in such programs are probably more like emacs, internally, using
some sort of gap buffer. The choice of the preferred idiom
depends on the set of commands the editor understands.
Note that most of the statements you will find concerning
performance in editors (including those I make:-)) are based on
measurements made a long, long time ago. The one time I
suggested using a gap buffer to a collegue, he actually measured
the difference compared to using the equivalent of std::vector<
char >, inserting characters at the beginning of a very large
block of text, and found that the std::vector< char > solution
was largely sufficient with regards to performance. (And that
was around 20 years ago. The "common knowledge" concerning how
to organize a buffer in an editor is a lot older than that,
however.)
> > The idiomatic choice for those who have made some editors
> > from scratch (not just using an edit control/widget).Look up
> > "cursor gap" in Wikipedia or wherever.
> I had a look on wikipedia ... as clear as the bottom of my
> socks. But it looks interesting. Definitely will get a look at
> "The Craft of Text Editing" (if I can find a working link).
You might be interested in the editor in Kernighan and Plauger's
_Software Tools in Pascal_. I think it's pretty close to what
is used in ed (and vi, and probably vim as well). (The book
itself is, despite its age, required reading for anyone wanting
to write programs professionally. And don't let the Pascal in
the title throw you---the authors have managed to write C code
in Pascal, and I've used the book as a textbook for a course in
C---today, of course, I'd teach C++ and use Stroustrup's
_Programming Principles and Practice Using C++_, which is
definitely the best book I've seen for teaching programming.)
> > In article <4a808b83$0$426$426a7...@news.free.fr>,
> > michael.dou...@free.fr says...
> > [ ... ]
> > > > The idiomatic choice for those who have made some
> > > > editors from scratch (not just using an edit
> > > > control/widget).Look up "cursor gap" in Wikipedia or
> > > > wherever.
> > > I had a look on wikipedia ... as clear as the bottom of my socks.
> > The idea is pretty simple, really. When you're editing a
> > text file, you separate the file into two pieces: everything
> > that's before the cursor, and everything that's after the
> > cursor.
> That I understood but what do you gain except an amortized
> copy ? I guess it is memory cost effective.
You don't get a large copy on inserting near the beginning of
the buffer. It was an important issue 30 or 40 years ago, when
using the equivalent of std::vector< char >, an insertion near
the beginning of a large body of text visibly slowed things
down. Measurements made by a collegue roughly 20 years ago
showed that it wasn't an issue then; unless machines have gotten
slower since then, it isn't an issue today.
> > Those two pieces are place into memory with the part that's
> > before the cursor at the very beginning of the buffer, and
> > the part that's after the cursor at the very end of the
> > buffer. In between the two, you have a "gap" of unused
> > memory space. When/if the user enters new text, you add the
> > characters into that memory space (for overwrite mode, you
> > write at the end of the gap instead of the beginning).
> That has the sake of simplicity but if you maintain a list of
> modification (like rope or another list structure with
> modifications stored in a scratch buffer) until next write to
> file then insertion and deletion seems cheap to me.
Boehm designed Rope expressedly for this type of use. It also
handles spilling to disk gracefully, since you merge and spill
blocks, keeping the merged block header with the reference to
the disk image. (When spilling a gap buffer, it's usual to
maintain two separate spill zones, at each end, and maintain one
of them in reverse order.)
I'm not sure that spilling is an important issue today; if the
text doesn't fit in memory, it seems reasonable to require the
user to split it up into several files. Back when a process
could only have 64 KB of memory, however, efficient spilling was
an important design issue.
> I have never tried to write a text editor but this approach
> appeals me more than gap buffer.
> > When the user moves the cursor, you copy text from one side
> > of the gap to the other. Since the user usually only moves
> > the cursor in small increments (word, line, possibly page)
> I don't know what you call small increment but, on a daily
> basis, I move between marks, end/beginning of line,
> next/previous char/word/ symbol ... I move much more than I
> write; I expect the gap is created on the first modification.
The gap is always there. Of course, when the cursor is one end
of the file, the amount of text on the other side of the gap is
0. (When you start up the editor, most editors put the cursor
at the beginning of the file, which means all of the text is in
the end buffer, and the gap is at the start.)
> Don't mistake me. I don't discuss the usefulness of the
> technique but I try to understand the forces that apply to
> such a design.
Today, nothing. Even in the older times, I think that the line
oriented approach used by vi was probably preferable (but then,
I've never cared much for emacs---and this really is an emacs
vs. vi argument). Today, just use std::vector< char > or
std::vector< std::vector< char > > until the profiler says
otherwise, wrapping it in a class which presents a minimal
interface to the rest of the code, to facilitate changing it if
that moment ever comes.
> >> [ ... ]
> >>>> The idiomatic choice for those who have made some editors
> >>>> from scratch (not just using an edit control/widget).Look
> >>>> up "cursor gap" in Wikipedia or wherever.
> >>> I had a look on wikipedia ... as clear as the bottom of my
> >>> socks.
> >> The idea is pretty simple, really. When you're editing a
> >> text file, you separate the file into two pieces:
> >> everything that's before the cursor, and everything that's
> >> after the cursor.
> > That I understood but what do you gain except an amortized
> > copy ? I guess it is memory cost effective.
> It's a *simple* structure that yields O(1) insertion and
> deletion at the cursor position and O(1) random access read
> and write, where the former is the same as with a linked list,
> and the latter is much better than a linked list.
But nobody uses a linked list. And it's O(n) for random cursor
positionning, or going to a specific line. It is, in fact, O(n)
when you go to the end of the file, which is a frequent
operation (since in many cases, that's where you're inserting
text). And I don't know how it works in today's environment,
when you have several windows open on the file, each with its
own cursor. (I think, but I'm very far from sure---maybe
someone who has the sources to emacs handy could check---that
emacs moves the cursor in the file to correspond to its position
in the active window. So that if you have one window open with
the local class definitions, at the top of the file, and another
where you're coding the implementations, at the bottom, each
time you move the cursor between the windows, you copy most of
the text.)
For a program editor, I'd definitely use something line
oriented, along the lines of vi, if only because one of the most
common operations is going to a specific line, with the line
number coming from an error message. The fact that emacs does
this almost instantly is, in fact, proof that on today's
machines, it really doesn't matter---just use
std::vector< char >, and be done with it.
> The main cost is that, without reallocation and copying of all
> the data, the maximum size of the buffer is determined up
> front, and it uses that much memory regardless of the amount
> of data (for an editor, text), so no, it isn't especially
> memory cost effective. :-)
The real cost is when you start using OS's which penalize
programs for excessive memory use. On the earliest version of
Unix I used (no virtual memory), when the OS had to swap out a
process, it choose the biggest, because that recovered the most
memory. (Back then, vi was the biggest program on the machine.
And everytime someone started a compile, vi stopped reacting for
three to five seconds.) Under Linux, you have to worry about
the famous OOM killer, which logically might behave
similarly---at the worst, the fact that you allocate a lot of
memory without touching it immediately does create a situation
where the OOM killer will be released. (Under Windows or
Solaris, it just works, thank you, and there's no OOM killer to
be worried about.)
> Of course, in C++ any complexity for a more complicated
> structure can be hidden under a simpler programmatic
> interface. So I guess that my comment reflected that since C++
> displaced C for such tasks, implementing editors from scratch
> hasn't been that much in vogue. And the only such editor I
> made that I still have the source code for, on magnetic tape,
> an editor for the HP3000 from 198x!, wasn't written in C but
> in Pascal, and it didn't use cursor gap but a more
> sophisticated list of blocks, sort of like a C++ std::deque
> implementation.
Sounds like vi:-).
> > struct MyChar
> > {
> > char TheChar;
> > bool Bold;
> > bool Italic;
> > unsigned char Color[3];
> > };
> > This design is hardcoded (what if I need another attribute
> > later on) and it would be very difficult to look for a
> > specify array of char. It would also eat a lot of memory...
> > Is this something describe in the literature ? If yes, where
> > can I find documentation ?
> FlyWeight has already been mentioned, but it's really only
> part of the answer.
> IMO, you should take a look at the general design of CSS
> and/or ODF. They're obviously not identical, but they have
> some characteristics on common, and give an idea of how this
> general type of problem has been solved (with a reasonable
> degree of success, I might add) already.
Those are both external formats, using pure text (for
portability and ease of transfer), which implies a lot of
additional constraints.
> The general idea can be described fairly simply though: a
> couple of extra levels of indirection, at least one of them
> explicitly visible to the user. Specifically, you want to
> allow the user to put together a "style sheet" that describes
> characteristics of some text (font, color, spacing, etc.) then
> you want to let them apply that style to lots of different
> text. The user can then format most of his/her document(s)
> based on a logical style rather than always separately editing
> the individual attributes. For example, you could start by
> stealing a few "styles" from HTML, and having H1, H2, H3 and
> H4.
That's actually a very good idea. I think it concerns user
interface more than internal representation, but it probably
doesn't make sense to invest too much effort on internal
representation until you've designed the user interface---as you
say, supporting this will probably involve some additional
levels of indirection.
Not sure about that.
> And it's O(n) for random cursor
> positionning, or going to a specific line.
Well, you got that right. :-)
Making that operation O(1) generally goes at the cost of O(n) insertion (e.g.
inserting a line). Old Basic implementations provided a possibility to work
around that (although I don't think they actually exploited the possibility they
created) by having a line number explicitly specified for, that is within, each
line. Then if you exhausted the "gaps" that you'd inserted in the line numbering
you had to "renumber" the lines.
Compromise is O(log n) for random cursor positioning.
> It is, in fact, O(n)
> when you go to the end of the file, which is a frequent
> operation (since in many cases, that's where you're inserting
> text).
Above is incorrect. Cursor gap is O(1) for moving to other end. With any
reasonable implementation, since going to the ends are frequent operations.
> And I don't know how it works in today's environment,
> when you have several windows open on the file, each with its
> own cursor. (I think, but I'm very far from sure---maybe
> someone who has the sources to emacs handy could check---that
> emacs moves the cursor in the file to correspond to its position
> in the active window. So that if you have one window open with
> the local class definitions, at the top of the file, and another
> where you're coding the implementations, at the bottom, each
> time you move the cursor between the windows, you copy most of
> the text.)
I've never seen or made a cursor gap structure that supported multiple cursors.
Possibly it can be done for small number of cursors. I'm not sure.
> For a program editor, I'd definitely use something line
> oriented, along the lines of vi, if only because one of the most
> common operations is going to a specific line, with the line
> number coming from an error message. The fact that emacs does
> this almost instantly is, in fact, proof that on today's
> machines, it really doesn't matter---just use
> std::vector< char >, and be done with it.
I think you mean std::vector< std::wstring >. And that's OK, at least with a COW
string implementation or C++0x move semantics, or a with any half-decent DIY
immutable string class. But as I recall, the makers of the main free C# IDE
started out with that but had to switch to a more sophisticated scheme (they
wrote a book about it, available free online).
No, it didn't have modes, except overwrite/insert.
It was also one of my first small-language implementations, because I "had" to
invent a language to store configuration of the terminal function keys. The HP
terminals had very nice function keys, with small displays at the bottom of the
screen showing the current assignment of any function key. Oh it was nice.
Cheers,
- Alf
EMACS = Eight Megabytes Are Continually Swapping ... :)
[ ... ]
> > Those two pieces are place into memory with the part that's
> > before the cursor at the very beginning of the buffer, and the
> > part that's after the cursor at the very end of the buffer. In
> > between the two, you have a "gap" of unused memory space. When/if
> > the user enters new text, you add the characters into that memory
> > space (for overwrite mode, you write at the end of the gap
> > instead of the beginning).
>
> That has the sake of simplicity but if you maintain a list of
> modification (like rope or another list structure with modifications
> stored in a scratch buffer) until next write to file then insertion
> and deletion seems cheap to me.
>
> I have never tried to write a text editor but this approach appeals
> me more than gap buffer.
I can't quite understand why. Offhand, I can only think of one
operation (pasting a large chunk of text) that it _might_ make faster
(if you could just link in the pasted block instead of copying it).
In general, it sounds to me like a lose-lose situation -- slower
operation, and more complex implementation.
> > When the user moves the cursor, you copy text from one side of the
> > gap to the other. Since the user usually only moves the cursor in
> > small increments (word, line, possibly page)
>
> I don't know what you call small increment but, on a daily basis, I
> move between marks, end/beginning of line, next/previous char/word/
> symbol ... I move much more than I write;
The only of those that's of any concern at all is moving between
marks. To move between marks, you copy all the text between the marks
from one side of the gap to the other. With marks more than a few
tens of megabytes apart, it's going to be difficult to implement that
in the 100 ms (or so) to be perceived as "instant". A couple hundred
ms isn't too terrible, but much beyond that can start to bother the
user.
The rest are simply too small to care about. Figure a line is <80
characters, usually being moved around in the cache. Such a movement
is usually going to be in the sub-microsecond range.
> I expect the gap is created
> on the first modification.
The gap is always at the cursor, from the time the file is loaded.
In the end editing/word processing is interactive -- a (very) soft
real-time process. In most cases, what matters is fast _enough_
response; faster doesn't hurt, but rarely matters much either.
--
Later,
Jerry.
> > But nobody uses a linked list.
> Not sure about that.
OK. I've never seen it:-).
> > It is, in fact, O(n)
> > when you go to the end of the file, which is a frequent
> > operation (since in many cases, that's where you're inserting
> > text).
> Above is incorrect. Cursor gap is O(1) for moving to other
> end. With any reasonable implementation, since going to the
> ends are frequent operations.
It's certainly O(n) with the classical implementation of cursor
gap. Say, the one in emacs (or at least, the one in emacs 15-20
years ago, which was the last time I looked at the source).
[...]
> I've never seen or made a cursor gap structure that supported
> multiple cursors. Possibly it can be done for small number of
> cursors. I'm not sure.
In a modern editor, you do have several cursors. More
precisely, a modern editor will separate the model (the buffer)
and the view (the window), with the possibility of having
several views in the same model. Logically, the cursor is part
of the view, although it can be complicated, since it is also
used by the model in most commands. (I'd still keep it in the
view, and only pass it down to the model when a command which
needed it, like insert, was issued. Maybe that's how emacs and
others using a cursor gap buffer work today: they only move the
cursor in the buffer when needed---when a view passes them a
command with the cursor in it.)
> > For a program editor, I'd definitely use something line
> > oriented, along the lines of vi, if only because one of the
> > most common operations is going to a specific line, with the
> > line number coming from an error message. The fact that
> > emacs does this almost instantly is, in fact, proof that on
> > today's machines, it really doesn't matter---just use
> > std::vector< char >, and be done with it.
> I think you mean std::vector< std::wstring >.
Maybe. The original vi definitely used something more like
vector< string >:-).
Actually, for a program editor, I'd probably go with line
oriented and UTF-8 internally. So std::vector< std::string >,
or more likely std::vector< Line >, where Line is a structure
which contains an std::vector< char >.
> And that's OK, at least with a COW string implementation or
> C++0x move semantics, or a with any half-decent DIY immutable
> string class.
Or you could use a vector< Line* >:-).
You might as well say that quicksort is O(n^3) average time because some novice
has managed to implement something he calls quicksort with cubic average time.
> [...]
>> I've never seen or made a cursor gap structure that supported
>> multiple cursors. Possibly it can be done for small number of
>> cursors. I'm not sure.
>
> In a modern editor, you do have several cursors.
Yeah, but except for collaborative editors like the Mac one, only one at a time.
> More
> precisely, a modern editor will separate the model (the buffer)
> and the view (the window), with the possibility of having
> several views in the same model. Logically, the cursor is part
> of the view, although it can be complicated, since it is also
> used by the model in most commands. (I'd still keep it in the
> view, and only pass it down to the model when a command which
> needed it, like insert, was issued. Maybe that's how emacs and
> others using a cursor gap buffer work today: they only move the
> cursor in the buffer when needed---when a view passes them a
> command with the cursor in it.)
The cursors and information about which one's active is to me obviously part of
the model used to present the GUI (note: it will be backed by other models).
It puts the knowledge where it's needed to avoid complication and unnecessary
communication between parts of the program.
It also allows reinstating the cursor positions when you reload the file, if the
file has associated information (e.g. in an editor "project"), and although I
think of the previous paragraph as most important, someone not very familiar
with MVC may just consider this simple rule-of-thumb: what do you want to save.
I suspect that the tendency to think of views as irrelevant passive consumers of
data from a model, not introducing data requirements themselves, stem from some
general feeling of inferiority and imperfectness, that the old Smalltalk MVC was
godlike perfect, an unachievable goal that one should nevertheless try to aim
at. Trying to emulate something one doesn't understand often leads to such
bizarre ideas. Conventions adopted as sort of laws, with no real understanding.
The important point to understand here is that MVC or just a model-view is
intended to simplify. If it turns out to complicate, then something's wrong. :-)
[snip]
Cheers,
- Alf
How do you come up with slower operation ?
In the cursor gap case, I have a contiguous representation (except the
gap) requiring a copy at each move.
In the scratch buffer case, I have a tree structure with no cost of
insertion or deletion except maintaining the tree and perhaps some
merge logic to avoid too much fragmentation.
> > > When the user moves the cursor, you copy text from one side of the
> > > gap to the other. Since the user usually only moves the cursor in
> > > small increments (word, line, possibly page)
>
> > I don't know what you call small increment but, on a daily basis, I
> > move between marks, end/beginning of line, next/previous char/word/
> > symbol ... I move much more than I write;
>
> The only of those that's of any concern at all is moving between
> marks. To move between marks, you copy all the text between the marks
> from one side of the gap to the other. With marks more than a few
> tens of megabytes apart, it's going to be difficult to implement that
> in the 100 ms (or so) to be perceived as "instant". A couple hundred
> ms isn't too terrible, but much beyond that can start to bother the
> user.
I don't speak about user experience. The sheer raw power of computer
makes everything possible (once the text is in memory). I agree that I
will always type more slowly than any processor and in the case you
mention, the editor can always display the text before copying.
> The rest are simply too small to care about. Figure a line is <80
> characters, usually being moved around in the cache. Such a movement
> is usually going to be in the sub-microsecond range.
>
> > I expect the gap is created
> > on the first modification.
>
> The gap is always at the cursor, from the time the file is loaded.
My point here is that a lot of editing is reading/moving hence a lot
of useless copy occur; while this would not be normally a concern, it
bothers me somewhat - perhaps a belief I have to get rid of. All the
most since moving/writing are usually well defined phase of editing
(like in vi).
> In the end editing/word processing is interactive -- a (very) soft
> real-time process. In most cases, what matters is fast _enough_
> response; faster doesn't hurt, but rarely matters much either.
But implementation is done only once and the added complexity is
really small. IMO it wouldn't even be noticed on a budget.
My question is regarding the design, not the user experience.
--
Michael
> [ ... ]
> > That has the sake of simplicity but if you maintain a list
> > of modification (like rope or another list structure with
> > modifications stored in a scratch buffer) until next write
> > to file then insertion and deletion seems cheap to me.
> > I have never tried to write a text editor but this approach
> > appeals me more than gap buffer.
> I can't quite understand why.
There's a quality rope class available on the internet. I don't
know of a readily available gap buffer class.
> Offhand, I can only think of one operation (pasting a large
> chunk of text) that it _might_ make faster (if you could just
> link in the pasted block instead of copying it). In general,
> it sounds to me like a lose-lose situation -- slower
> operation, and more complex implementation.
Implementing a gap buffer isn't that trivial, either.
> The only of those that's of any concern at all is moving
> between marks. To move between marks, you copy all the text
> between the marks from one side of the gap to the other. With
> marks more than a few tens of megabytes apart, it's going to
> be difficult to implement that in the 100 ms (or so) to be
> perceived as "instant". A couple hundred ms isn't too
> terrible, but much beyond that can start to bother the user.
Just out of curiousity, I ran a couple of tests on my machine,
with a buffer containing 10 million characters, over a total
buffer size of 12,5 million characters. I get the following
times:
Move the cursor from the position 10 to 10 before the end, and
back, alternately, in a gap buffer:
11.2 ms per move with char
28,0 ms per move with wchar_t
Insert or delete at the position 10, alternatively ("flat"
buffer:
6.9 ms per operation with char
16.8 ms per operation with wchar_t
Those aren't what I'd consider prohibitive times. And the
simplest implementation is clearly the flat buffer, which is in
practice probably faster than the gap buffer for the most
frequent operations, and acceptably fast for the least frequent.
(Admittedly, this is a machine which was installed only a few
weeks ago: A PC running under Ubuntu 9.04 (Linux 2.6), with
3.2GB real memory and four Intel Xeon processor cores at
2.00GHz.)
[...]
> The gap is always at the cursor, from the time the file is
> loaded.
And what do you do if there are several cursors? (A frequent
case, at least in the code I work on.)
> In the end editing/word processing is interactive -- a (very)
> soft real-time process. In most cases, what matters is fast
> _enough_ response; faster doesn't hurt, but rarely matters
> much either.
Agreed. And since just using an std::vector in the most obvious
manner seems fast enough, and is surely the simplest solution...
This is a bad implementation.
The operation involves copying some 20 bytes.
You should not get into millisecond range for that.
If you can't, drop me a line. I have a copy of it.
/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Well, it /is/ the first pattern in the GoF book, and the example they
use is characters in a word processor ...
> 3. There is actually a pattern very well detailed, and with a very
> intuitive name
>
> Thanks Alf !
There's also a flyweight utility in Boost, but I don't know how
relevant it is.
I think you didn't read correctly the test: he moved from index 10 to
index size-10 back and forth. Which means a copy of 10Mb - 10b per
move.
--
Michael
My thanks. The author made a version available on his site:
http://www.finseth.com/craft/craft.pdf
For some reason I couldn't access his site the first time. It is now
in my reading list.
--
Michael
I did.
> he moved from index 10 to
> index size-10 back and forth.
Yes.
> Which means a copy of 10Mb - 10b per move.
No.
[ ... ]
> > I can't quite understand why. Offhand, I can only think of one
> > operation (pasting a large chunk of text) that it _might_ make faster
> > (if you could just link in the pasted block instead of copying it).
> > In general, it sounds to me like a lose-lose situation -- slower
> > operation, and more complex implementation.
>
> How do you come up with slower operation ?
Consider, just for example, displaying the current "screen" of
information. With a split buffer, you have one possible
discontinuity: at the cursor. When the cursor is visible, you grab
one contiguous block from the beginning of the screen to the
beginning of the gap, and another contiguous block from the end of
the gap to the end of the screen. Pretty trivial really.
With a list structure with modifications stored in a scratch buffer,
you need to grab the base data from the (noncontiguous) list, and
then grab the changes from the scratch buffer, apply the changes from
the scratch buffer to the base data, so you have a contiguous buffer
holding the current version of the text for display.
From a purely algorithmic viewpoint, that's clearly more work. From a
viewpoint of locality of reference, it appears likely that you'd
refer to more different blocks of memory as well.
> In the cursor gap case, I have a contiguous representation (except the
> gap) requiring a copy at each move.
Yes, but the amount of copying is trivial, and most of the time
involved can't really be avoided anyway. Just for example, assume you
move right one word. You have (nearly) no choice but to read bytes
until you get to a byte that's whitespace. Doing a copy just means
that as you read each byte, you also put it into a write buffer.
You can only avoid reading data if you have some pointer directly to
the next place to go -- e.g. if you start with a vector of linked
list of lines, then it's simple and straightforward to move by lines
without reading the data in each line. Absent that, you need to read
the data to find where you're going -- and once you've read the data
into cache, writing is usually almost free, since it can be done
asynchronously (i.e. you put the data into a write queue and move
on).
We're left with only a few rather cases where the copying matters: 1)
move to beginning or end of file, 2) move to a mark.
As James Kanze already pointed out, even those are pretty easy to
handle fast enough, even for fairly outrageous file sizes (~10 MiB).
> In the scratch buffer case, I have a tree structure with no cost of
> insertion or deletion except maintaining the tree and perhaps some
> merge logic to avoid too much fragmentation.
But, 1) unless you have a pointer from one word/line/whatever to the
next, you still need to read through the data to find that next word,
line, or whatever, and 2) while you're doing it, you have to apply
the changes from the scratch buffer. The only part you're avoiding is
the writing -- which, as already noted, is the cheapest part of the
whole operation in most cases.
[ ... ]
> My point here is that a lot of editing is reading/moving hence a
> lot of useless copy occur; while this would not be normally a
> concern, it
> bothers me somewhat - perhaps a belief I have to get rid of. All the
> most since moving/writing are usually well defined phase of editing
> (like in vi).
I think the problem is that you're more or less ignoring a few basic
facts, such as:
1) Locality of reference matters -- even a few references to main
memory are expensive compared to lots of references to cache.
2) In small chunks, reading is much more expensive than writing.
> > In the end editing/word processing is interactive -- a (very) soft
> > real-time process. In most cases, what matters is fast _enough_
> > response; faster doesn't hurt, but rarely matters much either.
>
> But implementation is done only once and the added complexity is
> really small. IMO it wouldn't even be noticed on a budget.
>
> My question is regarding the design, not the user experience.
I think for something like a word processor, attempting to design
without taking the user experience into account is a poor idea.
--
Later,
Jerry.
As I recall, in _Software Tools_, they say they tried it, but never
got it working quite right, and eventually discarded the idea. I
suspect with a higher-level interface (e.g. std::list) you could make
it work a bit more easily, but I'm still not at all sure anybody has
done so (or that it's a particularly good idea).
[ ... ]
> In a modern editor, you do have several cursors. More
> precisely, a modern editor will separate the model (the buffer)
> and the view (the window), with the possibility of having
> several views in the same model. Logically, the cursor is part
> of the view, although it can be complicated, since it is also
> used by the model in most commands. (I'd still keep it in the
> view, and only pass it down to the model when a command which
> needed it, like insert, was issued. Maybe that's how emacs and
> others using a cursor gap buffer work today: they only move the
> cursor in the buffer when needed---when a view passes them a
> command with the cursor in it.)
I doubt that part of emacs gets updated a lot, so it wouldn't
surprise me if its implementation today is similar to 15-20 years
ago.
If I was doing it, I'd probably have each window keep track of its
cursor position, and the buffer manager would keep the gap at the
position of the current view's cursor.
--
Later,
Jerry.
[ ... ]
> > Offhand, I can only think of one operation (pasting a large
> > chunk of text) that it _might_ make faster (if you could just
> > link in the pasted block instead of copying it). In general,
> > it sounds to me like a lose-lose situation -- slower
> > operation, and more complex implementation.
>
> Implementing a gap buffer isn't that trivial, either.
It's hardly what I'd call tremendously difficult. Way back when, I
wrote a split-buffer editor in Turbo Pascal 2.0. As I recall, it was
reasonably complete and usable in something like two weeks (and that,
of course, with quite a minimal standard library by modern
standards).
> > The only of those that's of any concern at all is moving
> > between marks. To move between marks, you copy all the text
> > between the marks from one side of the gap to the other. With
> > marks more than a few tens of megabytes apart, it's going to
> > be difficult to implement that in the 100 ms (or so) to be
> > perceived as "instant". A couple hundred ms isn't too
> > terrible, but much beyond that can start to bother the user.
>
> Just out of curiousity, I ran a couple of tests on my machine,
> with a buffer containing 10 million characters, over a total
> buffer size of 12,5 million characters. I get the following
> times:
>
> Move the cursor from the position 10 to 10 before the end, and
> back, alternately, in a gap buffer:
> 11.2 ms per move with char
> 28,0 ms per move with wchar_t
> Insert or delete at the position 10, alternatively ("flat"
> buffer:
> 6.9 ms per operation with char
> 16.8 ms per operation with wchar_t
> Those aren't what I'd consider prohibitive times. And the
> simplest implementation is clearly the flat buffer, which is in
> practice probably faster than the gap buffer for the most
> frequent operations, and acceptably fast for the least frequent.
You could be right, but I'm a bit less convinced. About the only
operation for which it's faster is moving the cursor, but it's not
going to be a lot faster even for that. To move by increments of
something like a word or a line, you still have to read through the
data to find the next whitespace or newline, so you haven't saved
much even at best.
The only places I can see it saving a substantial amount of time
would be moving to an explicit position (e.g. a mark, or beginning or
end of file).
> (Admittedly, this is a machine which was installed only a few
> weeks ago: A PC running under Ubuntu 9.04 (Linux 2.6), with
> 3.2GB real memory and four Intel Xeon processor cores at
> 2.00GHz.)
For a job like this, CPU speed won't usually make a lot of
difference; the bottleneck will almost certainly be main-memory
bandwidth.
> Agreed. And since just using an std::vector in the most obvious
> manner seems fast enough, and is surely the simplest solution...
Regardless of which implementation I was going to use, I'd create a
small buffer manager class to give a reasonable interface -- and I'd
expect that buffer manager to take a matter of hours to implement for
either a plain vector or a split buffer. In fact, most of the
implementation for either one is so obvious that the limitation for a
lot of it is likely to be typing speed...
--
Later,
Jerry.
Could you explain? The discussions of gap buffers that I've seen use a
simple array with a gap in it, which would indeed have the
characteristics James and Micheal note. You're evidently talking about a
somewhat more sophisticated implementation (here, and above where you
mentioned gap buffers that deal efficiently with moving to beginning and
end of a document). Could you explain the implementation you're thinking
of? One improvement that comes to mind is using a circular buffer,
although I think this still requires copying half the size of the buffer
in the worst case (when moving the gap from the middle of the buffer to
one of the ends).
Yes. :-)
The reason that circular buffer is almost required and is implicitly assumed is
that moving from one end to the other is a frequent operation.
Anyway, the data structure supports that operation, so from the point of view of
discussing data structures that operation is O(1), regardless of implementations
that manage to do it inefficiently.
Cheers,
- Alf
PS: This sort of connects with Andrei's recently proposed "ranges", because
there's much that suddenly becomes possible when there's maximum one active
cursor/pointer/iterator. E.g. consider the problem of deleting a node in a
singly linked list when all you have is a pointer directly to that node. With
suitable assumptions that's an O(1) operation for the general case. <g>
Yes. :-)
Ok, I don't understand the point of all this cat and mouse? It
seems to be impeding discussion more than helping it. Fine, you
are considering a circular gap buffer. So just pretend that all
the "end of buffer" points were being made about "middle of
buffer" instead. Now we are back to the simple point that move
is O(n) for such moves.
Now what? Perhaps we have in mind another "better" "gap buffer"
that optimizes for the "middle of buffer" moves too because they
are a common case. At what point are these souped-up "gap buffer"
implementations no longer a "gap buffer" in any reasonable sense?
Eventually we are talking about things that probably have other
other names such as unrolled linked list, b-tree, rope, skip
list, etc.
KHD
That may be because the play you see is only in your mind, Keith.
> It
> seems to be impeding discussion more than helping it. Fine, you
> are considering a circular gap buffer. So just pretend that all
> the "end of buffer" points were being made about "middle of
> buffer" instead. Now we are back to the simple point that move
> is O(n) for such moves.
That point has not been raised, so you're not back to it.
Moving to the ends have been discussed because those are common operations in
editors, and as it happens also in other uses of this data structure.
Regarding these operations in editors, most editors have shortcut keys for them,
e.g. [Ctrl Home] and [Ctrl End], while few if any have shortcut keys to go to
the middle of the text.
> Now what? Perhaps we have in mind another "better" "gap buffer"
> that optimizes for the "middle of buffer" moves too because they
> are a common case.
No improvement of the basic cursor gap buffer has been discussed.
The only issue that has apparently entered the discussion has been the incorrect
idea that that a flawed, (relatively speaking) inefficient implementation says
something about the efficiency the data structure allows for certain operations.
> At what point are these souped-up "gap buffer"
> implementations no longer a "gap buffer" in any reasonable sense?
You'd have to provide detailed descriptions of your proposed data structures,
and for some of your creations your question might be practically decidable,
while for other of your creations your question might be practically undecidable.
However, if you plan on actually describing what you have in mind, please do so
over in e.g. [comp.programming].
> Eventually we are talking about things that probably have other
> other names such as unrolled linked list, b-tree, rope, skip
> list, etc.
The "we", presumably referring to an assumed participation of others in your
proposed off-topic discussion, sounds a little odd. :-)
Ok, I guess the confusion shown by at least three other readers
(James, Michael, IV) was just a figment of my imagination and had
nothing to do with your answers.
> > It
> > seems to be impeding discussion more than helping it. Fine, you
> > are considering a circular gap buffer. So just pretend that all
> > the "end of buffer" points were being made about "middle of
> > buffer" instead. Now we are back to the simple point that move
> > is O(n) for such moves.
>
> That point has not been raised, so you're not back to it.
I think if some others had known that you had in mind a circular
buffer, they would have raised their algorithmic points differently.
That's what I'm trying to point out to you. In other words, simply
mentioning "circular buffer" in any of these posts
http://groups.google.com/group/comp.lang.c++/msg/22a77eed7c1c1dba
http://groups.google.com/group/comp.lang.c++/msg/90118a99ea35238c
http://groups.google.com/group/comp.lang.c++/msg/db656bf160333225
(especially the last two) would have prevented and/or ended the
confusion. And the last one is very much, in my opinion, a example
of cat and mouse games that add nothing to the discussion and
usually add further to the confusion.
> Moving to the ends have been discussed because those are common operations in
> editors, and as it happens also in other uses of this data structure.
>
> Regarding these operations in editors, most editors have shortcut keys for them,
> e.g. [Ctrl Home] and [Ctrl End], while few if any have shortcut keys to go to
> the middle of the text.
Maybe. Though, I think some others were also just trying to raise
points about the structure in general (for example because a move
from "mark to mark" was brought up) not just about the convenient
move to begin/end.
> > Now what? Perhaps we have in mind another "better" "gap buffer"
> > that optimizes for the "middle of buffer" moves too because they
> > are a common case.
>
> No improvement of the basic cursor gap buffer has been discussed.
Ok, so a circular buffer is the "basic cursor gap buffer"? Then
judging from the confusion it seems some others mistakenly thought
two linear buffers was the "basic cursor gap buffer".
> The only issue that has apparently entered the discussion has been the incorrect
> idea that that a flawed, (relatively speaking) inefficient implementation says
> something about the efficiency the data structure allows for certain operations.
>
> > At what point are these souped-up "gap buffer"
> > implementations no longer a "gap buffer" in any reasonable sense?
>
> You'd have to provide detailed descriptions of your proposed data structures,
> and for some of your creations your question might be practically decidable,
> while for other of your creations your question might be practically undecidable.
>
> However, if you plan on actually describing what you have in mind, please do so
> over in e.g. [comp.programming].
So discussing general algorithms is on-topic when you are doing it
(now for 8 posts) and off-topic if I want to discuss them? Or why
didn't you move your responses re "cursor gap" algorithms to another
forum as you suggest I do?
> > Eventually we are talking about things that probably have other
> > other names such as unrolled linked list, b-tree, rope, skip
> > list, etc.
>
> The "we", presumably referring to an assumed participation of others in your
> proposed off-topic discussion, sounds a little odd. :-)
Given that you have already been participating in a non-C++
specific cursor gap discussion for several posts, how is your
suggestion and comment not hypocritical?
KHD
That depends on how you represent a position in the text.
With the gap buffer, it would be an offset; iterating on the text is
increasing the offset (minus the gap handling) With scratch buffer, it
would be a buffer and an offset in the buffer; iterating on the text
is end-buffer/next-buffer.
Example:
Step 1: Initial state
Buffer="This is a text"
Scratch=""
Text=[ptr=buffer+0,size=14]
Step2: Inserting " nice " before text
Buffer="This is a text"
Scratch=" nice"
Text=[buffer+0,9],[scratch+0,5],[buffer+9,5]
It is true that then going at a specif index in the text is in o(B)
where (B is the number of element in the list) but I expect M would be
small and if movements are small, it is negligible.
> From a purely algorithmic viewpoint, that's clearly more work. From a
> viewpoint of locality of reference, it appears likely that you'd
> refer to more different blocks of memory as well.
True. But mitigated by the fact that with the gap buffer, there is a
copy that requires moving things in main memory and I would expect it
to be at each move since the cache will likely be flushed, given the
interaction time.
> > In the cursor gap case, I have a contiguous representation (except the
> > gap) requiring a copy at each move.
>
> Yes, but the amount of copying is trivial, and most of the time
> involved can't really be avoided anyway. Just for example, assume you
> move right one word. You have (nearly) no choice but to read bytes
> until you get to a byte that's whitespace. Doing a copy just means
> that as you read each byte, you also put it into a write buffer.
I agree that individual move are trivial, even give moves.
> You can only avoid reading data if you have some pointer directly to
> the next place to go -- e.g. if you start with a vector of linked
> list of lines, then it's simple and straightforward to move by lines
> without reading the data in each line.
IMO, line representation is another issue. With gap buffer, the
position of char are moving while they don't with tree representation.
> Absent that, you need to read
> the data to find where you're going -- and once you've read the data
> into cache, writing is usually almost free, since it can be done
> asynchronously (i.e. you put the data into a write queue and move
> on).
It is also true of many buffers. Isn't it ?
> We're left with only a few rather cases where the copying matters: 1)
> move to beginning or end of file, 2) move to a mark.
>
> As James Kanze already pointed out, even those are pretty easy to
> handle fast enough, even for fairly outrageous file sizes (~10 MiB).
I agree with that.
> > In the scratch buffer case, I have a tree structure with no cost of
> > insertion or deletion except maintaining the tree and perhaps some
> > merge logic to avoid too much fragmentation.
>
> But, 1) unless you have a pointer from one word/line/whatever to the
> next, you still need to read through the data to find that next word,
> line, or whatever, and 2) while you're doing it, you have to apply
> the changes from the scratch buffer. The only part you're avoiding is
> the writing -- which, as already noted, is the cheapest part of the
> whole operation in most cases.
The same comment as above apply since it is the same argument.
I mentioned that the contiguous buffer would be reformed upon writing;
in most cases, I expect saves happens fairly often, in which case you
are in the same situation as with the gap buffer..
>
> [ ... ]
>
> > My point here is that a lot of editing is reading/moving hence a
> > lot of useless copy occur; while this would not be normally a
> > concern, it
> > bothers me somewhat - perhaps a belief I have to get rid of. All the
> > most since moving/writing are usually well defined phase of editing
> > (like in vi).
>
> I think the problem is that you're more or less ignoring a few basic
> facts, such as:
> 1) Locality of reference matters -- even a few references to main
> memory are expensive compared to lots of references to cache.
But to have something practical (I usually write quite a lot of char
when I write), I guess the gap size would exceed a cache entry and you
loose this advantage around the gap where most of the operation occur
and which, in definitive, is what is displayed to the screen.
> 2) In small chunks, reading is much more expensive than writing.
I would not expect that in in multiple-core environment: write is
expensive because of synchronisation of memory and R/W requires two
bus access which is nowadays the bottleneck.
> > > In the end editing/word processing is interactive -- a (very) soft
> > > real-time process. In most cases, what matters is fast _enough_
> > > response; faster doesn't hurt, but rarely matters much either.
>
> > But implementation is done only once and the added complexity is
> > really small. IMO it wouldn't even be noticed on a budget.
>
> > My question is regarding the design, not the user experience.
>
> I think for something like a word processor, attempting to design
> without taking the user experience into account is a poor idea.
I think you will agree that, given the power of computers, even with a
very poorly designed or over-consumptive editor, the user experience
will be the same so I don't think this is an issue here.
--
Michael
> >> Above is incorrect. Cursor gap is O(1) for moving to other
> >> end. With any reasonable implementation, since going to the
> >> ends are frequent operations.
> > It's certainly O(n) with the classical implementation of
> > cursor gap. Say, the one in emacs (or at least, the one in
> > emacs 15-20 years ago, which was the last time I looked at
> > the source).
> You might as well say that quicksort is O(n^3) average time
> because some novice has managed to implement something he
> calls quicksort with cubic average time.
I wouldn't call the implementors of emacs novices. I'm talking
of actual implementations that I've seen, in real code.
> > [...]
> >> I've never seen or made a cursor gap structure that
> >> supported multiple cursors. Possibly it can be done for
> >> small number of cursors. I'm not sure.
> > In a modern editor, you do have several cursors.
> Yeah, but except for collaborative editors like the Mac one,
> only one at a time.
Emacs supports opening windows on different terminals, at least
under X. (That's probably the biggest single feature of emacs I
miss in vim.) I don't know how the implement it, however; I
suspect that they distinguish between the cursor in the model,
and the one in the view, and only update the one in the model
when necessary.
> > More precisely, a modern editor will separate the model (the
> > buffer) and the view (the window), with the possibility of
> > having several views in the same model. Logically, the
> > cursor is part of the view, although it can be complicated,
> > since it is also used by the model in most commands. (I'd
> > still keep it in the view, and only pass it down to the
> > model when a command which needed it, like insert, was
> > issued. Maybe that's how emacs and others using a cursor
> > gap buffer work today: they only move the cursor in the
> > buffer when needed---when a view passes them a command with
> > the cursor in it.)
> The cursors and information about which one's active is to me
> obviously part of the model used to present the GUI (note: it
> will be backed by other models).
Different views have different cursors, and several can be
visible (with the cursor materialized) at the same time. (But
maybe this is just an artifact of the way the non-active windows
are updated, when they are updated.)
> I doubt that part of emacs gets updated a lot, so it wouldn't
> surprise me if its implementation today is similar to 15-20
> years ago.
Me neither. If it ain't broken, don't fix it.
> If I was doing it, I'd probably have each window keep track of
> its cursor position, and the buffer manager would keep the gap
> at the position of the current view's cursor.
That's roughly Alf's suggestion, and it sounds likely to me,
except... emacs supports windows open on different X terminals,
and there's one active view per terminal. (I think---I don't have
the possibility of trying it at present.)
> [ ... ]
> > In the cursor gap case, I have a contiguous representation
> > (except the gap) requiring a copy at each move.
> Yes, but the amount of copying is trivial, and most of the
> time involved can't really be avoided anyway. Just for
> example, assume you move right one word. You have (nearly) no
> choice but to read bytes until you get to a byte that's
> whitespace. Doing a copy just means that as you read each
> byte, you also put it into a write buffer.
> You can only avoid reading data if you have some pointer
> directly to the next place to go -- e.g. if you start with a
> vector of linked list of lines, then it's simple and
> straightforward to move by lines without reading the data in
> each line. Absent that, you need to read the data to find
> where you're going -- and once you've read the data into
> cache, writing is usually almost free, since it can be done
> asynchronously (i.e. you put the data into a write queue and
> move on).
> We're left with only a few rather cases where the copying
> matters: 1) move to beginning or end of file, 2) move to a
> mark.
Don't forget changing the active window or pane. If the cursor
is maintained as part of the buffer, then each time the active
view changes, it has to be updated.
> As James Kanze already pointed out, even those are pretty easy
> to handle fast enough, even for fairly outrageous file sizes
> (~10 MiB).
And as I pointed out, just naïvely using a flat buffer is even
faster:-). All of these strategies were developed back at a
time when machines were a lot, lot slower.
The point stands regardless of what you'd like to call the implementors of emacs.
I've seen implementations of quicksort that didn't even work.
By your logic that means quicksort doesn't work. It's an invalid inference.
Whether or not you would call the programmer of such implementation a novice.
Cheers,
- Alf
> > I think the problem is that you're more or less ignoring a
> > few basic facts, such as:
> > 1) Locality of reference matters -- even a few references to
> > main memory are expensive compared to lots of references to
> > cache.
> But to have something practical (I usually write quite a lot
> of char when I write), I guess the gap size would exceed a
> cache entry and you loose this advantage around the gap where
> most of the operation occur and which, in definitive, is what
> is displayed to the screen.
The cache will generally hold several pages, which don't have to
be contiguous. With the gap buffer, you have to deal with two
contiguous blocks, no more (and the cache will certainly hold
two entries, and the page size is likely to be more than you can
display at any one time, anyway). With the rope strategy, you
may have to hit many more than two separate blocks to refresh
the screen.
With both strategies, I'd almost certainly maintain a cached
image of the screen, only updating what needs to be updated in
it. Which means that in both cases, there probably won't be a
need to access more than one or two chunks, regardless of their
size.
> > 2) In small chunks, reading is much more expensive than
> > writing.
> I would not expect that in in multiple-core environment: write
> is expensive because of synchronisation of memory and R/W
> requires two bus access which is nowadays the bottleneck.
The processor doesn't wait for the write to finish; it pushes a
write request into the write pipeline, and goes on. This is
also true for a read, except that it must wait as soon as it
needs the value which was read.
When Jerry says "small chunks", he means very, very small
chunks. Probably 2 to 4 words at the most. Beyond that, the
processor is filling the pipeline faster than it can be emptied,
and you get a stall. (Note that very low level locality helps
here, as well. If you write a byte into a word, and there is
already a write to that word in the pipeline, the processor just
updates the request in the pipeline.)
> [ ... ]
> > Those aren't what I'd consider prohibitive times. And the
> > simplest implementation is clearly the flat buffer, which is in
> > practice probably faster than the gap buffer for the most
> > frequent operations, and acceptably fast for the least frequent.
> You could be right, but I'm a bit less convinced. About the
> only operation for which it's faster is moving the cursor, but
> it's not going to be a lot faster even for that. To move by
> increments of something like a word or a line, you still have
> to read through the data to find the next whitespace or
> newline, so you haven't saved much even at best.
> The only places I can see it saving a substantial amount of
> time would be moving to an explicit position (e.g. a mark, or
> beginning or end of file).
I don't know if the savings would be substantial, but anytime
your searching, and the cursor is in the middle of the search
space, a flat buffer will be faster; simply advancing to the
next character is significantly faster. If you're searching for
a single character, I doubt that it would make much difference,
but for a regular expression search, where you don't know in
advance the size of the match, you'll need some sort of iterator
which takes the gap into account, and that's more complicated
(and slower) than an iterator into flat memory (which can be
just a pointer). And depending on how regular expressions are
implemented, that might be a bottleneck operation; a regular
expression search can easily be O(n*m) in terms of incrementing
the iterator (where n is the size of the buffer, and m the
average number of characters matched before failure).
[ ... ]
> That's roughly Alf's suggestion, and it sounds likely to me,
> except... emacs supports windows open on different X terminals,
> and there's one active view per terminal. (I think---I don't have
> the possibility of trying it at present.)
That could be handled any of a few different ways. One would be for
the buffer to allow multiple gaps, one for each window that's
currently active (you wouldn't normally expect more than a few
anyway...). Another would be to consider only one of the windows
truly active at any given time, and when/if a user on another
terminal did something, you'd make that the active window.
Even assuming it does support multiple windows having the focus on
the same document concurrently, it's a sufficiently unusual situation
that optimizing for it is almost certain to have very low priority...
--
Later,
Jerry.
[ ... ]
> Don't forget changing the active window or pane. If the cursor
> is maintained as part of the buffer, then each time the active
> view changes, it has to be updated.
True -- but when something like moving a window to the foreground
gets involved, the times just went up so much that almost anything
related to buffer handling is (at least usually) going to lose much
significance.
> > As James Kanze already pointed out, even those are pretty easy
> > to handle fast enough, even for fairly outrageous file sizes
> > (~10 MiB).
>
> And as I pointed out, just naïvely using a flat buffer is even
> faster:-). All of these strategies were developed back at a
> time when machines were a lot, lot slower.
Yes and no. You weren't really comparing apples to apples there. For
a flat buffer, moving the cursor is O(K) and inserting text is
roughly O(N) (where N=distance from cursor to end of text).
A split buffer reverses those: moving the cursor is roughly O(N)
(where N=distance of the move) and inserting text is O(K).
What you were comparing was, in essence, the speed of an insertion
with a flat buffer to the speed of a large cursor move with a split
buffer.
If you treat the situation as _purely_ a real-time problem (i.e. the
upper bound on the time for each operation is _all_ that matters)
then it seems likely that the flat buffer will generally be the
better choice.
At least for me, and I suspect for many others as well, that's not
exactly the expectation: while I expect essentially instantaneous
response time, I also expect that when I'm simply typing in text,
that it should take virtually no CPU time, so it has virtually no
impact on other things running in the background.
The other question is the percentage of cursor movement to actual
editing. That varies rather widely, and depends (among other things)
on the kind of editing at hand. For something like word processing,
there tends to be a fairly high percentage of actual editing, and
even when proofreading, making minor changes is still pretty common.
For a programming editor, there's an awful lot of the time that the
editor really acts as just a browser -- I might easily end up opening
and reading through half a dozen files before I make a really minor
change (at least in terms of the editing involved) in one part of one
of those files.
I wouldn't be surprised if the optimum for each was rather different.
--
Later,
Jerry.
[ ... ]
> I don't know if the savings would be substantial, but anytime
> your searching, and the cursor is in the middle of the search
> space, a flat buffer will be faster; simply advancing to the
> next character is significantly faster. If you're searching for
> a single character, I doubt that it would make much difference,
> but for a regular expression search, where you don't know in
> advance the size of the match, you'll need some sort of iterator
> which takes the gap into account, and that's more complicated
> (and slower) than an iterator into flat memory (which can be
> just a pointer).
At least in most cases, it's fairly trivial to avoid that problem
entirely. At least with most versions I've used, an RE can only match
up to a single line. In that case, you save the current cursor
position, move the cursor to the beginning (or end) of the current
line, and then do two separate searches, one of the section of the
buffer before the split, and the other of the section of the buffer
after the split.
In theory, this could be a problem if the data didn't contain any
newlines -- but then again, the worst case move is the same as a
worst case move for inserting a character with a flat buffer...
In a typical case, however, the move involved is on the order of tens
of bytes or so -- so small that it's completely lost in the noise of
compiling the RE.
--
Later,
Jerry.
[ ... ]
> That depends on how you represent a position in the text.
> With the gap buffer, it would be an offset; iterating on the text is
> increasing the offset (minus the gap handling) With scratch buffer, it
> would be a buffer and an offset in the buffer; iterating on the text
> is end-buffer/next-buffer.
>
> Example:
>
> Step 1: Initial state
> Buffer="This is a text"
> Scratch=""
> Text=[ptr=buffer+0,size=14]
>
> Step2: Inserting " nice " before text
> Buffer="This is a text"
> Scratch=" nice"
> Text=[buffer+0,9],[scratch+0,5],[buffer+9,5]
>
> It is true that then going at a specif index in the text is in o(B)
> where (B is the number of element in the list) but I expect M would be
> small and if movements are small, it is negligible.
Hmm...you haven't specified what "M" is supposed to represent here,
so I'm not sure what you're trying to say. In the end, however, I
think we're left with a fairly simple situation: what you're calling
"negligible" here will usually be larger than what you're objecting
to in the case of a split buffer.
> > From a purely algorithmic viewpoint, that's clearly more work. From a
> > viewpoint of locality of reference, it appears likely that you'd
> > refer to more different blocks of memory as well.
>
> True. But mitigated by the fact that with the gap buffer, there is a
> copy that requires moving things in main memory and I would expect it
> to be at each move since the cache will likely be flushed, given the
> interaction time.
First of all, I doubt that, at least with a modern processor. Most
processors now have at least a megabyte of cache, and five to ten
megabytes is probably closer to typical. Unless you have something in
the background that's really churning through a _lot_ of data, the
two cache lines used by a split buffer are fairly likely to survive a
task switch.
Second, even if we assume that all the data involved has to be read
from main memory, we're still stuck with the fact that a split buffer
is going to deal with two contiguous areas, where chained buffers
plus change buffers is going to deal with a larger number of disjoint
areas of memory, and (ultimately) is going to simply deal with more
total data -- virtually all the data being read with a split buffer
is the actual data itself, whereas the chained buffers plus change
buffer involves things like buffers containing not only the raw data,
but also pointers to other buffers.
[ ... ]
> IMO, line representation is another issue. With gap buffer, the
> position of char are moving while they don't with tree representation.
I have a hard time believing that this means much, at least under
most typical circumstances. If you really think you're going to see a
'goto line N' command often enough to bother, it's pretty easy to
optimize for it reasonably well in a split buffer. Simply put, you
create an index, but instead of storing line positions, you store
line sizes. To navigate to a particular line, you add up the sizes of
lines up to that one, and if that's greater than the current cursor
position, you add the gap size. This is still linear, but on the
number of lines rather than the number of characters.
With the tree representation, it really does change -- every time you
consolidate trees. From what you've said elsethread (and I'd tend to
agree) you're going to have to do that fairly frequently to avoid
other problems -- which means you're going to be updating your line
position data fairly frequently as well.
> > Absent that, you need to read
> > the data to find where you're going -- and once you've read the data
> > into cache, writing is usually almost free, since it can be done
> > asynchronously (i.e. you put the data into a write queue and move
> > on).
>
> It is also true of many buffers. Isn't it ?
Yes and no. Yes, writes have short latency and reads have long
latency in both cases. The difference is that a multiple buffer setup
does more reading to avoid doing writes -- "penny wise and pound
foolish" as the old saying goes.
> The same comment as above apply since it is the same argument.
> I mentioned that the contiguous buffer would be reformed upon writing;
> in most cases, I expect saves happens fairly often, in which case you
> are in the same situation as with the gap buffer..
You're in the same situation -- but only by doing some arbitrary
amount of extra work that you're currently ignoring. To an extent,
that's fair -- the extra work can be done asynchronously. To another
extent, it's completely unfair. To make your design competitive, you
just about _have_ to do it asynchronously. That, in turn, means you
have to write the whole thing to be thread safe.
[ ... ]
> > 1) Locality of reference matters -- even a few references to main
> > memory are expensive compared to lots of references to cache.
>
> But to have something practical (I usually write quite a lot of char
> when I write), I guess the gap size would exceed a cache entry and you
> loose this advantage around the gap where most of the operation occur
> and which, in definitive, is what is displayed to the screen.
Not so. When you're inserting text, a gap buffer is dealing with
exactly one cache line, covering the space immediately after the
cursor. You only start to use a different cache line when the user
has inserted text that completely fills the current cache line.
In the end, you're always stuck with a few simple facts: a split
buffer spends most of its time dealing with one contiguous chunk of
memory. A multiple buffer/change buffer setup uses a bare minimum of
two separate areas of memory, _and_ it almost inevitably deals with
more raw data representing the same amount of real (user-visible)
data. That means it has to read more data to do its job.
> > 2) In small chunks, reading is much more expensive than writing.
>
> I would not expect that in in multiple-core environment: write is
> expensive because of synchronisation of memory and R/W requires two
> bus access which is nowadays the bottleneck.
Write is expensive when and if you need synchronization. As noted
above, that's probably important for the design you're advocating,
but (at most) is much less so for the split-buffer design.
[ ... ]
> I think you will agree that, given the power of computers, even
> with a very poorly designed or over-consumptive editor, the user
> experience will be the same so I don't think this is an issue here.
I agree that fast enough hardware can overcome a poor design, but I
don't think that's a good reason to create a poor design.
--
Later,
Jerry.
Oups M is B (I first wrote M for modification and replaced by B for
Buffer). I mean going at a specific position is linear with the number
of chunks.
> so I'm not sure what you're trying to say. In the end, however, I
> think we're left with a fairly simple situation: what you're calling
> "negligible" here will usually be larger than what you're objecting
> to in the case of a split buffer.
I can hear that.
> > > From a purely algorithmic viewpoint, that's clearly more work. From a
> > > viewpoint of locality of reference, it appears likely that you'd
> > > refer to more different blocks of memory as well.
>
> > True. But mitigated by the fact that with the gap buffer, there is a
> > copy that requires moving things in main memory and I would expect it
> > to be at each move since the cache will likely be flushed, given the
> > interaction time.
>
> First of all, I doubt that, at least with a modern processor. Most
> processors now have at least a megabyte of cache, and five to ten
> megabytes is probably closer to typical. Unless you have something in
> the background that's really churning through a _lot_ of data, the
> two cache lines used by a split buffer are fairly likely to survive a
> task switch.
>
> Second, even if we assume that all the data involved has to be read
> from main memory, we're still stuck with the fact that a split buffer
> is going to deal with two contiguous areas, where chained buffers
> plus change buffers is going to deal with a larger number of disjoint
> areas of memory, and (ultimately) is going to simply deal with more
> total data -- virtually all the data being read with a split buffer
> is the actual data itself, whereas the chained buffers plus change
> buffer involves things like buffers containing not only the raw data,
> but also pointers to other buffers.
Well. Actually, there is also only two buffers with the scratch
solution: one buffer holds the original data while the other holds all
the changes.
What might be a hit, regarding the cache, is the table holding the
pointers and size of each chunks and the indirection.
> [ ... ]
> > IMO, line representation is another issue. With gap buffer, the
> > position of char are moving while they don't with tree representation.
>
> I have a hard time believing that this means much, at least under
> most typical circumstances.
Agreed.
[snip]
> With the tree representation, it really does change -- every time you
> consolidate trees. From what you've said elsethread (and I'd tend to
> agree) you're going to have to do that fairly frequently to avoid
> other problems -- which means you're going to be updating your line
> position data fairly frequently as well.
True.
[snip]
> > > I think for something like a word processor, attempting to design
> > > without taking the user experience into account is a poor idea.
> >
> > I think you will agree that, given the power of computers, even
> > with a very poorly designed or over-consumptive editor, the user
> > experience will be the same so I don't think this is an issue here.
>
> I agree that fast enough hardware can overcome a poor design, but I
> don't think that's a good reason to create a poor design.
My point exactly.
I start to like the gap solution, I appreciate how it balances the
time scale, usage and architecture.
--
Michael
> For a programming editor, there's an awful lot of the time that the
> editor really acts as just a browser -- I might easily end up opening
> and reading through half a dozen files before I make a really minor
> change (at least in terms of the editing involved) in one part of one
> of those files.
>
> I wouldn't be surprised if the optimum for each was rather different.
I would split the buffer at the cursor only when insertion or
suppression starts...
--
__Pascal Bourguignon__
> [ ... ]
> > I don't know if the savings would be substantial, but anytime
> > your searching, and the cursor is in the middle of the search
> > space, a flat buffer will be faster; simply advancing to the
> > next character is significantly faster. If you're searching for
> > a single character, I doubt that it would make much difference,
> > but for a regular expression search, where you don't know in
> > advance the size of the match, you'll need some sort of iterator
> > which takes the gap into account, and that's more complicated
> > (and slower) than an iterator into flat memory (which can be
> > just a pointer).
> At least in most cases, it's fairly trivial to avoid that
> problem entirely. At least with most versions I've used, an RE
> can only match up to a single line.
That's generally true in line oriented editors, like vi(m). But
then, a line oriented representation is more appropriate than a
gap buffer for them anyway. It's not true in emacs, however,
and generally, in a non line oriented editor, the new line is
just another character, including in a regular expression.
Of course, if you mean that it is rare in practice for a regular
expression to actually contain something which matches a new
line, you may be right. (You're certainly right in my case, but
my use of regular expressions in editors is probably
conditionned by vi, to the extent that the idea of creating an
expression whose match includes a newline doesn't occur to me.)
But in order to take advantage of this fact, you have to analyse
the regular expression to figure out whether it can or cannot
match a new line (not too difficult, but extra work none the
less), and use two different matcher, depending on the case.
> In that case, you save the current cursor position, move the
> cursor to the beginning (or end) of the current line, and then
> do two separate searches, one of the section of the buffer
> before the split, and the other of the section of the buffer
> after the split.
> In theory, this could be a problem if the data didn't contain
> any newlines -- but then again, the worst case move is the
> same as a worst case move for inserting a character with a
> flat buffer...
> In a typical case, however, the move involved is on the order
> of tens of bytes or so -- so small that it's completely lost
> in the noise of compiling the RE.
It's not so much the move itself, but all of the necessary
additional logic to manage it. Frankly, given the performance
of modern processors, I'd simply go with a smart iterator; I
don't think the extra check each iteration, as to whether you're
at the gap or not would kill you.
Note too that editors normally use extended regular expressions
which allow capture of sub-strings. To my knowledge, those
can't be implemented as a pure DFA, which means that the actual
processing of each character probably overwhelms the additional
checking. Off hand, and this is just my basic intuition, and
not a result of any measurements, I would expect such a regular
expression to take about ten times more time scanning a large
range as it takes to copy it. (My own regular expression class
will take less time to scan than it takes to copy, because it
uses a DFA. But it doesn't support capture.)
[ ... ]
> Of course, if you mean that it is rare in practice for a regular
> expression to actually contain something which matches a new
> line, you may be right. (You're certainly right in my case, but
> my use of regular expressions in editors is probably
> conditionned by vi, to the extent that the idea of creating an
> expression whose match includes a newline doesn't occur to me.)
> But in order to take advantage of this fact, you have to analyse
> the regular expression to figure out whether it can or cannot
> match a new line (not too difficult, but extra work none the
> less), and use two different matcher, depending on the case.
Yes, I'd say incrementing a pointer/subscript across the gap is only
worth doing if you can do so consistently. If you're going to deal
with incrementing across the gap at least part of the time, it
probably makes more sense to just do it all the time.
[ ... ]
> It's not so much the move itself, but all of the necessary
> additional logic to manage it.
If you supported REs that included newlines, I'd agree -- I wouldn't
advocate this technique in such a case. I doubt the speed gain would
be worth the extra complexity.
[ ... ]
> Note too that editors normally use extended regular expressions
> which allow capture of sub-strings. To my knowledge, those
> can't be implemented as a pure DFA, which means that the actual
> processing of each character probably overwhelms the additional
> checking. Off hand, and this is just my basic intuition, and
> not a result of any measurements, I would expect such a regular
> expression to take about ten times more time scanning a large
> range as it takes to copy it. (My own regular expression class
> will take less time to scan than it takes to copy, because it
> uses a DFA. But it doesn't support capture.)
Offhand, it seems like that _would_ probably justify the complexity
of scanning the RE to see if a DFA can be used or not, and using it
if possible. Then again, unless I had a really good reason to do
otherwise, I'd probably just use the regex class in TR1.
--
Later,
Jerry.
[ ... ]
> Yes, I'd say incrementing a pointer/subscript across the gap is
> only worth doing if you can do so consistently.
Oops -- that should have said "... avoiding incrementing ..."
--
Later,
Jerry.
> [ ... ]
> > It's not so much the move itself, but all of the necessary
> > additional logic to manage it.
> If you supported REs that included newlines, I'd agree -- I
> wouldn't advocate this technique in such a case. I doubt the
> speed gain would be worth the extra complexity.
Since we're talking about editors here:-): all regular
expression implementations I'm familiar with support newlines in
the expression. The data structure of some editors (ed, vi,
vim), and some other programs like those in the grep family, is
such, however, that they will only apply the regular expression
to the text one line at a time, to text which doesn't contain a
new line. (It's possible to create a regular expression in vim
which will only match if there is a newline in the text. Such a
regular expression will never match anything, however.)
> [ ... ]
> > Note too that editors normally use extended regular
> > expressions which allow capture of sub-strings. To my
> > knowledge, those can't be implemented as a pure DFA, which
> > means that the actual processing of each character probably
> > overwhelms the additional checking. Off hand, and this is
> > just my basic intuition, and not a result of any
> > measurements, I would expect such a regular expression to
> > take about ten times more time scanning a large range as it
> > takes to copy it. (My own regular expression class will
> > take less time to scan than it takes to copy, because it
> > uses a DFA. But it doesn't support capture.)
> Offhand, it seems like that _would_ probably justify the
> complexity of scanning the RE to see if a DFA can be used or
> not, and using it if possible. Then again, unless I had a
> really good reason to do otherwise, I'd probably just use the
> regex class in TR1.
Exactly. Typically, editors are not scanning millions of lines
of text; the amount of text being scanned is relatively small,
and the time to compile the regular expression is usually a
significant part of the time necessary for the command using it.
In such cases, building a complete DFA is counter productive
(because it is a fairly expensive operation); using lazy
construction of the DFA might be acceptable, but it does
introduce the added complexity of having to handle out of memory
in the middle of the search. But IMHO, tr1::regex should work
just fine, unless you need some special features.
(FWIW: I wouldn't use my own RegularExpression class in an
editor. It's not designed for that. On the other hand, it has
some nice features that boost::regex doesn't: you can define
your accept code, and or existing regular expressions, along the
lines of:
RegularExpression decimal ( "[1-9][0-9]*" , 10 ) ;
RegularExpression octal ( "0[0-7]*" , 8 ) ;
RegularExpression hexadecimal( "0[xX][a-fA-F0-9]+", 16 ) ;
RegularExpression number( decimal | octal | hexadecimal ) ;
and then:
StringMatch< std::string::const_iterator >
match( number.match( s.begin(), s.end() ) ) ;
if ( match ) {
int i
= convert( s.begin(), match.endOfMatch(), match.acceptCode
() ) ;
// ...
}
where the last argument to convert is the base. And the class
has a function "dumpAsCpp", which outputs the tables as C++, so
you can compile them into a StaticRegularExpression, which has
pretty much the same interface as RegularExpression for the
non-mutating functions, but is statically initialized. And
finally, the class(es) exist in two variants, supporting either
single byte encodings or UTF-8---although complex regular
expressions in UTF-8 can rapidly become very, very large---I use
the single byte variant a lot even when working with UTF-8.)