Regex Support


Ben Hanson

Mar 11, 2011, 11:05:24 AM
to scintilla...@googlegroups.com

Hi there,

I was wondering if there is any interest in improving the regular expression support in Scintilla. Currently the syntax supported is very basic indeed. I would recommend using tr1::regex (or boost::regex if you want to support older compilers).

Is there any interest in this as well as search backwards ("Up" option) using regex and recursive patterns?

Thanks,

Ben

Simon Steele

Mar 11, 2011, 11:55:21 AM
to scintilla...@googlegroups.com
Scintilla already has support for using better regular expression engines in the host application. There are two mechanisms you can use off the top of my head:

1. Build a new engine into Scintilla. This is done by compiling in your own implementation of RegexSearchBase and instantiating it in response to Scintilla::CreateRegexSearch. This is what I use in Programmer's Notepad to embed Boost's Xpressive engine. See:


2. Use the character buffer support to pass the full document to an external regex engine.

Option 1 has the benefit of not needing to create the single buffer, as you can use iterators or equivalent. Option 2 allows the use of more engines, since several, like PCRE, only work well with continuous buffers.

Hope this helps,

Simon.
-- 
Programmer's Notepad - http://www.pnotepad.org/

Ben Hanson

Mar 11, 2011, 12:19:25 PM
to scintilla...@googlegroups.com

Thanks for your reply.

After trying out pnotepad I see that "Search up" does not work correctly with regular expressions. The way to fix that is to tokenise the regex and reverse each joined block before passing it on to Xpressive. I can help with that if you like.

As I understand it, Xpressive does not support recursive regexes in dynamic mode. I have added recursive pattern matching to lexertl recently (http://www.benhanson.net/lexertl.html) and was wondering if you were interested in incorporating another mode for this?

I also wrote Notepad RE (http://www.codeproject.com/KB/recipes/notepadre.aspx), but I just don't have the time to make it as fancy as pnotepad, hence my questions! :-)

Cheers,

Ben

Ben Hanson

Mar 11, 2011, 12:25:22 PM
to scintilla...@googlegroups.com
Actually, searching backwards for w[a-z]+ works; however, searching backwards for [a-z]+ doesn't.
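
One plausible reason for this difference can be sketched as follows. This is an assumption for illustration, not a diagnosis of Programmer's Notepad's code. If "search up" is implemented by trying an anchored forward match at each start position, walking from the end of the text towards the beginning, then a pattern that starts with a literal behaves well, but a pattern that starts with a repeated class stops at the first position that matches and returns a truncated result. The helper name `naive_search_up` is hypothetical, and std::regex stands in for whatever engine is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: search "upwards" by trying an anchored forward match
// at each start position, moving from the end of the text to the beginning.
// Returns the matched text, or an empty string if nothing matches.
std::string naive_search_up(const std::string& text, const std::string& pattern) {
    // An empty pattern would be indistinguishable from "no match" here.
    assert(!pattern.empty());
    const std::regex re(pattern);
    std::smatch m;
    for (std::size_t pos = text.size(); pos-- > 0;) {
        const auto from = text.begin() + static_cast<std::ptrdiff_t>(pos);
        // match_continuous only accepts a match that begins exactly at `from`.
        if (std::regex_search(from, text.end(), m, re,
                              std::regex_constants::match_continuous)) {
            return m.str();
        }
    }
    return std::string();
}
```

On the text "hello world", `naive_search_up(text, "w[a-z]+")` returns "world", whereas `naive_search_up(text, "[a-z]+")` returns only "d", because the very last character already satisfies the pattern. Matching a reversed pattern against the text read backwards, as suggested earlier in the thread, avoids that truncation.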

Philippe Lhoste

Mar 11, 2011, 12:45:58 PM
to scintilla...@googlegroups.com
On 11/03/2011 17:05, Ben Hanson wrote:
> I was wondering if there is any interest in improving the regular expression support in
> Scintilla. Currently the syntax supported is very basic indeed. I would recommend using
> tr1::regex (or boost::regex if you want to support older compilers).

Short answer: no.
Neil deliberately chose a simple RE engine, to keep the code and binary small (and no need
for upgrade on each change of the chosen library...).
As said, plugging in a better engine is possible.

> Is there any interest in this as well as search backwards ("Up" option) using regex and
> recursive patterns?

That's funny, I realize I nearly never use backward search, even for simple strings. I
used to use it in the past, but I rarely do today.

--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --

Simon Steele

Mar 11, 2011, 5:56:01 PM
to scintilla...@googlegroups.com
Hi Ben,

Thanks, backwards regex is not something I've put a lot of effort into - like Philippe it's not something I generally seem to do! That being said, any help/contribution is always welcome!

Could you give me some examples of where recursive regexes are useful? I wonder also whether Boost::Regex supports them? 

If so and the main regex library now supports named captures (my main reason for choosing Xpressive) then it might be worth switching PN from Xpressive to Regex.

Simon.

Neil Hodgson

Mar 11, 2011, 6:54:41 PM
to scintilla...@googlegroups.com
Ben Hanson:

> I was wondering if there is any interest in improving the regular expression
> support in Scintilla. Currently the syntax supported is very basic indeed.

I believe including regular expression support inside Scintilla was
a mistake. Applications that use Scintilla and expose regular
expression functionality should aim to incorporate another regex
library. A Perl environment is likely to want to use Perl's built-in
regular expressions rather than another library which may be largely
compatible but differs in details. A Lua environment may wish to allow
a choice between Lua patterns and a more standard library.

Any additional regex functionality in Scintilla should minimize the
costs. There should not be any need to download an external library,
since this makes it harder to build Scintilla. Copying a library like
boost::regex into the Scintilla source tree makes the code base larger
and could require Scintilla releases whenever upstream fixes a
significant or security bug.

If each of the 3 platforms included a compatible regex library then
it may be beneficial to use this from Scintilla but the current
situation is that they don't. tr1::regex support appears to be
unfinished in libstdc++ which seems to be moving towards C++0x
std::regex instead. OS X is probably worse, a search just produced the
old BSD libc regex(3) man page.
http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1

As Simon mentions, some regular expression libraries only work well
with continuous buffers and Scintilla uses a split buffer. If
Scintilla were to use such a library, the cost of joining the two
buffers whenever a search was to be performed could cause some
operations to be unreasonably slow.

Neil

Ben Hanson

Mar 12, 2011, 4:24:27 AM
to scintilla...@googlegroups.com
On Friday, March 11, 2011 10:56:01 PM UTC, Simon Steele wrote:
> Hi Ben,
>
> Thanks, backwards regex is not something I've put a lot of effort into - like Philippe it's not something I generally seem to do! That being said, any help/contribution is always welcome!

I think it's worth talking about further off-line. Thanks for your interest!

> Could you give me some examples of where recursive regexes are useful? I wonder also whether Boost::Regex supports them?

Examples are something I desperately need. For the testing of my own implementation I have used the classic test cases of nested C comments and checking a string for an even number of 0s and 1s. These are obviously toy examples. When I'm searching files, it's usually source code so any kind of nested construct is a potential search target. I see the ultimate aim as being able to specify mini-grammars (Lua pattern matching has gone that route as far as I can see).

As far as I am aware boost::regex doesn't support recursive regexes (its 'recursive mode' refers to recursion in its implementation, rather than in its regexes). I've looked at other regex engines like PCRE and the .NET implementation, but they seem bizarre and confusing to me. Also, PCRE only supports UTF-8 for Unicode. The approach I went for was that at the end of a pattern a push or pop can be flagged. The idea is that a push takes you to a new lexer state and a pop pops a lexer state off the (software) stack. A 'lexer state' refers to a particular DFA - i.e. there is a vector of DFAs so that matching can be constrained per 'lexer state'.
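
The push/pop idea can be illustrated with a small hand-written sketch for the nested C comment case. It does not use lexertl and is not its API: the enum `LexerState` and the function `match_nested_comment` are hypothetical names, and two hard-coded string comparisons stand in for the per-state DFAs. Matching "/*" is flagged as a push and matching "*/" inside a comment is flagged as a pop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lexer states; in a real lexer each state would own a DFA.
enum class LexerState { Initial, Comment };

// Returns the index one past the end of the (possibly nested) comment that
// starts at `start`, or std::string::npos if there is no balanced comment.
std::size_t match_nested_comment(const std::string& text, std::size_t start) {
    assert(start <= text.size());
    std::vector<LexerState> stack{LexerState::Initial};
    std::size_t i = start;
    while (i < text.size()) {
        const bool open = text.compare(i, 2, "/*") == 0;
        const bool close = text.compare(i, 2, "*/") == 0;
        if (open) {
            stack.push_back(LexerState::Comment);  // push: enter a new state
            i += 2;
        } else if (close && stack.back() == LexerState::Comment) {
            stack.pop_back();                      // pop: return to outer state
            i += 2;
            if (stack.back() == LexerState::Initial) {
                return i;                          // outermost comment closed
            }
        } else if (stack.back() == LexerState::Comment) {
            ++i;                                   // ordinary comment character
        } else {
            return std::string::npos;              // `start` is not a comment
        }
    }
    return std::string::npos;                      // ran out of input, unbalanced
}
```

For the text "/* a /* b */ c */ x" the function returns 17, the offset just after the outer "*/", while the unbalanced "/* a /* b */" yields std::string::npos.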

> If so and the main regex library now supports named captures (my main reason for choosing Xpressive) then it might be worth switching PN from Xpressive to Regex.

There is also the issue of proper Unicode support (boost::regex supports ICU for this, but I've not tried it to see how it works). People actually ask for Unicode support (http://www.codeproject.com/KB/recipes/notepadre.aspx?msg=3743896#xx3743896xx).

In the end, I think that searching needs to move to the 'next level', just as Roberto Ierusalimschy, author of the Lua pattern matcher, has done.

Regards,

Ben

Ben Hanson

Mar 12, 2011, 4:37:43 AM
to scintilla...@googlegroups.com
On Friday, March 11, 2011 11:54:41 PM UTC, Neil Hodgson wrote:
> Ben Hanson:
>
>> I was wondering if there is any interest in improving the regular expression
>> support in Scintilla. Currently the syntax supported is very basic indeed.
>
> I believe including regular expression support inside Scintilla was
> a mistake. Applications that use Scintilla and expose regular
> expression functionality should aim to incorporate another regex
> library. A Perl environment is likely to want to use Perl's built-in
> regular expressions rather than another library which may be largely
> compatible but differs in details. A Lua environment may wish to allow
> a choice between Lua patterns and a more standard library.

I sympathise with this point of view. The problem (as I see it) is that the default engine gets used because it is available (Notepad++ for example) and then no more attention is paid to it. I would be very interested to see an editor support Lua pattern matching! :-)

> Any additional regex functionality in Scintilla should minimize the
> costs. There should not be any need to download an external library,
> since this makes it harder to build Scintilla. Copying a library like
> boost::regex into the Scintilla source tree makes the code base larger
> and could require Scintilla releases whenever upstream fixes a
> significant or security bug.

The fact that other regex engines can be plugged in makes replacing the default one a lot less interesting, I agree.

> If each of the 3 platforms included a compatible regex library then
> it may be beneficial to use this from Scintilla but the current
> situation is that they don't. tr1::regex support appears to be
> unfinished in libstdc++ which seems to be moving towards C++0x
> std::regex instead. OS X is probably worse, a search just produced the
> old BSD libc regex(3) man page.
> http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1

C++0x is what interests me the most. I have recently ditched support for VC++ 6 in lexertl, and I'm not looking back...

> As Simon mentions, some regular expression libraries only work well
> with continuous buffers and Scintilla uses a split buffer. If
> Scintilla were to use such a library, the cost of joining the two
> buffers whenever a search was to be performed could cause some
> operations to be unreasonably slow.

The easiest (and modern) solution to dealing with split buffers is to code an iterator that is aware of Scintilla's internal structure and use that with whatever regex engine you like. Of course that implies a regex engine that copes with iterators correctly.
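
A minimal sketch of such an iterator, under stated assumptions: `SplitBuffer` is a simplified stand-in that stores the text as two strings and is not Scintilla's real data structure, `SplitIterator` and `search_split` are hypothetical names, and std::regex is used only because it accepts any bidirectional iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Simplified stand-in for a split (gap) buffer: two separate pieces of text.
struct SplitBuffer {
    std::string first;   // text before the gap
    std::string second;  // text after the gap
    std::size_t size() const { return first.size() + second.size(); }
    const char& at(std::size_t i) const {
        assert(i < size());
        return i < first.size() ? first[i] : second[i - first.size()];
    }
};

// Bidirectional iterator over the logical text, hiding the gap.
class SplitIterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    SplitIterator() = default;
    SplitIterator(const SplitBuffer* buf, std::size_t pos) : buf_(buf), pos_(pos) {}

    reference operator*() const { return buf_->at(pos_); }
    SplitIterator& operator++() { ++pos_; return *this; }
    SplitIterator operator++(int) { SplitIterator t = *this; ++pos_; return t; }
    SplitIterator& operator--() { --pos_; return *this; }
    SplitIterator operator--(int) { SplitIterator t = *this; --pos_; return t; }
    bool operator==(const SplitIterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const SplitIterator& o) const { return pos_ != o.pos_; }
    std::size_t position() const { return pos_; }

private:
    const SplitBuffer* buf_ = nullptr;
    std::size_t pos_ = 0;
};

// Searches the logical text without joining the two pieces. On success,
// stores the match's logical start offset and text, and returns true.
bool search_split(const SplitBuffer& buf, const std::string& pattern,
                  std::size_t& start, std::string& matched) {
    const std::regex re(pattern);
    std::match_results<SplitIterator> m;
    const SplitIterator begin(&buf, 0);
    const SplitIterator end(&buf, buf.size());
    if (!std::regex_search(begin, end, m, re)) {
        return false;
    }
    start = m[0].first.position();
    matched = m.str();
    return true;
}
```

With `SplitBuffer buf{"int ma", "in(void)"}`, searching for "ma[a-z]+" finds "main" at logical offset 4 even though the match straddles the gap, and no joined copy of the text is created.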

Regards,

Ben

mitchell

Mar 13, 2011, 10:31:31 PM
to scintilla...@googlegroups.com
Hi,

On Sat, 12 Mar 2011, Neil Hodgson wrote:
> Ben Hanson:
>
>> I was wondering if there is any interest in improving the regular expression
>> support in Scintilla. Currently the syntax supported is very basic indeed.
>
> I believe including regular expression support inside Scintilla was
> a mistake. Applications that use Scintilla and expose regular
> expression functionality should aim to incorporate another regex
> library. A Perl environment is likely to want to use Perl's built-in
> regular expressions rather than another library which may be largely
> compatible but differs in details. A Lua environment may wish to allow
> a choice between Lua patterns and a more standard library.

I beg to differ. I love having the basic regexp support for simple
scripting purposes. It is so much easier to have that than to use an
external regex lib, something that would bloat my application.

Thank you for including it.

mitchell

>
> Any additional regex functionality in Scintilla should minimize the
> costs. There should not be any need to download an external library,
> since this makes it harder to build Scintilla. Copying a library like
> boost::regex into the Scintilla source tree makes the code base larger
> and could require Scintilla releases whenever upstream fixes a
> significant or security bug.
>
> If each of the 3 platforms included a compatible regex library then
> it may be beneficial to use this from Scintilla but the current
> situation is that they don't. tr1::regex support appears to be
> unfinished in libstdc++ which seems to be moving towards C++0x
> std::regex instead. OS X is probably worse, a search just produced the
> old BSD libc regex(3) man page.
> http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
>
> As Simon mentions, some regular expression libraries only work well
> with continuous buffers and Scintilla uses a split buffer. If
> Scintilla were to use such a library, the cost of joining the two
> buffers whenever a search was to be performed could cause some
> operations to be unreasonably slow.
>
> Neil
>


mitchell

Mar 13, 2011, 10:36:35 PM
to scintilla...@googlegroups.com
Hi,

On Sat, 12 Mar 2011, Ben Hanson wrote:

> On Friday, March 11, 2011 11:54:41 PM UTC, Neil Hodgson wrote:
> Ben Hanson:
>
> > I was wondering if there is any interest in improving the regular expression
> > support in Scintilla. Currently the syntax supported is very basic indeed.
>
>    I believe including regular expression support inside Scintilla was
> a mistake. Applications that use Scintilla and expose regular
> expression functionality should aim to incorporate another regex
> library. A Perl environment is likely to want to use Perl's built-in
> regular expressions rather than another library which may be largely
> compatible but differs in details. A Lua environment may wish to allow
> a choice between Lua patterns and a more standard library.
>
> I sympathise with this point of view. The problem (as I see it) is that the default engine gets used because it is available
> (Notepad++ for example) and then no more attention is paid to it. I would be very interested to see an editor support Lua pattern
> matching! :-)

Textadept[1] uses Lua patterns[2] instead of regex.

[1]: http://caladbolg.net/textadept
[2]: http://caladbolg.net/luadoc/textadept/manual/6_AdeptEditing.html#find_and_replace

mitchell

>
>    Any additional regex functionality in Scintilla should minimize the
> costs. There should not be any need to download an external library,
> since this makes it harder to build Scintilla. Copying a library like
> boost::regex into the Scintilla source tree makes the code base larger
> and could require Scintilla releases whenever upstream fixes a
> significant or security bug.
>
> The fact that other regex engines can be plugged in makes replacing the default one a lot less interesting, I agree.
>
>    If each of the 3 platforms included a compatible regex library then
> it may be beneficial to use this from Scintilla but the current
> situation is that they don't. tr1::regex support appears to be
> unfinished in libstdc++ which seems to be moving towards C++0x
> std::regex instead. OS X is probably worse, a search just produced the
> old BSD libc regex(3) man page.
> http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
>
> C++0x is what interests me the most. I have recently ditched support for VC++ 6 in lexertl, and I'm not looking back...
>
>    As Simon mentions, some regular expression libraries only work well
> with continuous buffers and Scintilla uses a split buffer. If
> Scintilla were to use such a library, the cost of joining the two
> buffers whenever a search was to be performed could cause some
> operations to be unreasonably slow.
>
> The easiest (and modern) solution to dealing with split buffers is to code an iterator that is aware of Scintilla's internal
> structure and use that with whatever regex engine you like. Of course that implies a regex engine that copes with iterators
> correctly.
>
> Regards,
>
> Ben
>

Philippe Lhoste

Mar 14, 2011, 5:16:02 AM
to scintilla...@googlegroups.com
On 14/03/2011 03:31, mitchell wrote:
> Neil:

>> I believe including regular expression support inside Scintilla was
>> a mistake.
>
> I beg to differ. I love having the basic regexp support for simple scripting purposes. It
> is so much easier to have that than to use an external regex lib, something that would
> bloat my application.

I agree. In SciTE, the current support of regexes covers 99% of my needs for text editing.
And it is getting better with small, useful contributions that don't bloat it.

For heavy-duty text processing, I believe that an external, specialized program will be
more efficient (buffer management with a gap can be weak on a lot of automated small
changes through a whole big file, but that isn't its purpose either), more flexible and
more powerful.
From sed to awk to a script in your favorite language...

Now, I understand that Neil might be tired of hearing the same complaints. In hindsight,
perhaps it would have been a better idea to just ship a good API to tightly integrate a
regex engine supporting iterators (because of the gap), and to give the current library as
an optional example (so we can get it quickly in SciTE or other small projects using
Scintilla) while letting other projects use their favorite engine. (Actually, it might
even be in this state already, but well, it is compiled and integrated into Scintilla by
default.)

BTW, I recently discovered that Google released two regex libraries, one lightweight, with
limitations but fast, and another with more features. PCRE is powerful, but with version 7,
it starts looking like a parsing library, something I would rather defer to a
full PEG...
