Problems with searching using regular expressions


Yury Dubinsky

Feb 28, 2024, 11:00:15 PM
to scite-interest
Hello, Neil,
I found a few problems with regular expression searches in SciTE while trying to quickly bookmark and remove all the blank lines in a file. (There's nothing special about the file, but I've attached it for convenience.) The expressions in question are ^$ and ^\s*$ (empty and blank lines), plus ^.*$ , $ and ^ as simple demos.
Here are the facts found.
1. Applies to Scite starting from 4.05.
  • "Find Next" does not work with ^$ , ^.*$ , $ , ^ after the first found occurrence.
  • With ^\s*$ "Find Next" works for blank lines, but stops at the empty line (^$). See attached screenshot 1, where Find Next stopped on line 3.
  • For all the expressions mentioned, SciTE can "Bookmark All"; however, there is another problem, described in item 3.
2. Applies to Scite older than 4.05. "Find Next" works well with all of the expressions above except ^. Here, "Find Next" works on empty strings (^$), but does not work after non-empty strings.
3. Affects any version of Scite, problem with "Bookmark All" using one of the expressions above. Scite does not bookmark all the required lines. See attached screenshots 2 and 3. (My intention was to bookmark all empty lines using ^$ in the attached file.) The behaviour is strange. For example, when we scroll to the bottom of an attached file and call "bookmark all" with ^$, the top lines will not be marked. The number of lines with bookmarks depends on the scroll position when calling “Bookmark All”. (It is possible to bookmark all the lines. We need to scroll through the file and use the command again and again. But this requires a lot of attention.)
I don't think this issue is a priority. There are usually many other editors around. But I use SciTE on remote servers where the software installation policy is very strict, so there I only have SciTE.
I've never looked into Scite's implementation of regular expressions. Before looking, I would like to check with you.
Thanks,
 Yury
Screenshot3.png
copy-output-count1.txt
Screenshot1.png
Screenshot2.png

Neil Hodgson

Feb 29, 2024, 4:36:01 PM
to scite-interest
Hi Yury,
> • " Find Next" does not work with ^$ , ^.*$ , $ , ^ after the first found occurrence.

This isn't about regular expression matching; it's about handling empty matches, where SciTE's behaviour is poor.

When there is a simple caret (empty selection) at the start of a non-empty line, then searching for ^. should find and select the first character. So SciTE asks the regular expression code to search from the end of the selection (in this case the caret) to the end of file and returns start-of-line extending for 1 character.

If you try the same thing for ^$ when at the start of an empty line, the regular expression code returns start-of-line extending for 0 characters which is also a valid response. It can't treat an empty match as wrong since that is precisely what you are seeking.
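
A minimal sketch of the two cases above, written in Python with the standard `re` module rather than Scintilla's engine. The helper `naive_find_next` and the sample text are made up for illustration. It searches from the end of the selection, as described, and shows that `^.` moves on while `^$` returns the same empty range again:

```python
import re

# Hypothetical sample text: lines "one", "", "two", "", "", "end".
TEXT = "one\n\ntwo\n\n\nend"

def naive_find_next(pattern, text, sel_start, sel_end):
    """Search from the end of the selection to the end of the text.

    Returns the (start, end) of the match, or None. sel_start is unused
    because this naive version only looks at where the selection ends.
    """
    m = pattern.search(text, sel_end)
    return (m.start(), m.end()) if m else None

first_char = re.compile(r"^.", re.MULTILINE)
empty_line = re.compile(r"^$", re.MULTILINE)

# ^. selects one character, so the next search starts past it.
char_hit1 = naive_find_next(first_char, TEXT, 0, 0)
char_hit2 = naive_find_next(first_char, TEXT, *char_hit1)

# ^$ selects nothing, so the next search starts where it already is.
empty_hit1 = naive_find_next(empty_line, TEXT, 0, 0)
empty_hit2 = naive_find_next(empty_line, TEXT, *empty_hit1)

print(char_hit1, char_hit2)    # two different ranges
print(empty_hit1, empty_hit2)  # the same empty range twice
```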

Code sometimes tries to break out of this issue by advancing the start of search by 1 which can lead to more problems. I think earlier versions of SciTE tried something like this and it caused worse problems.
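
The thread does not say which problems the "advance by 1" approach caused. As one possible illustration (an assumption, not a description of old SciTE code), this Python sketch with a made-up `bumped_find_next` helper shows that always bumping the start fixes `^$` but truncates or loses a non-empty match that begins at the caret:

```python
import re

def bumped_find_next(pattern, text, caret):
    """Hypothetical workaround: always start searching one past the caret."""
    if caret >= len(text):
        return None
    m = pattern.search(text, caret + 1)
    return (m.start(), m.end()) if m else None

TEXT = "one\n\ntwo\n\n\nend"
empty_line = re.compile(r"^$", re.MULTILINE)

# Empty matches now make progress: from the empty line at 4 to the one at 9.
step = bumped_find_next(empty_line, TEXT, 4)

# But with the caret at the start of "aaa bbb", the first "a" is skipped ...
truncated = bumped_find_next(re.compile(r"a+"), "aaa bbb", 0)
# ... and an anchored pattern that should match at the caret finds nothing.
lost = bumped_find_next(re.compile(r"^a+", re.MULTILINE), "aaa bbb", 0)

print(step, truncated, lost)
```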

Neil

Neil Hodgson

Mar 3, 2024, 4:56:06 PM
to scite-interest
Some regular expression libraries have a flag for not matching the empty string at start which could be a direction here.


  PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is set. If there are alternatives in the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For example, if the pattern
  a?b?
is applied to a string not beginning with "a" or "b", it matches an empty string at the start of the subject. With PCRE_NOTEMPTY set, this match is not valid, so PCRE searches further into the string for occurrences of "a" or "b".

  PCRE_NOTEMPTY_ATSTART
This is like PCRE_NOTEMPTY, except that an empty string match that is not at the start of the subject is permitted. If the pattern is anchored, such a match can occur only if the pattern contains \K.
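
Python's standard `re` module has no equivalent flag, so the following is only a rough approximation of the `PCRE_NOTEMPTY` example above: it steps through the matches and returns the first non-empty one. Unlike PCRE, it does not ask the engine to try other alternatives at the same position. The helper name `first_non_empty` is made up for illustration.

```python
import re

def first_non_empty(pattern, text, pos=0):
    """Return (start, end) of the first non-empty match at or after pos, else None."""
    for m in pattern.finditer(text, pos):
        if m.end() > m.start():
            return (m.start(), m.end())
    return None

pattern = re.compile(r"a?b?")
subject = "xxab"  # does not begin with "a" or "b"

# A plain search stops at the empty match at the start of the subject.
plain = pattern.search(subject)
# Skipping empty matches reaches the later occurrence "ab".
skipped = first_non_empty(pattern, subject)

print((plain.start(), plain.end()), skipped)
```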

   Neil

Neil Hodgson

Mar 5, 2024, 2:05:20 AM
to scite-interest
Yury:
> 3. Affects any version of Scite, problem with "Bookmark All" using one of the expressions above. Scite does not bookmark all the required lines. See attached screenshots 2 and 3. (My intention was to bookmark all empty lines using ^$ in the attached file.) The behaviour is strange. For example, when we scroll to the bottom of an attached file and call "bookmark all" with ^$, the top lines will not be marked. The number of lines with bookmarks depends on the scroll position when calling “Bookmark All”. (It is possible to bookmark all the lines. We need to scroll through the file and use the command again and again. But this requires a lot of attention.)

This is caused by the lack of progress with an empty result at the end of a batch of lines. It loops around infinitely retrying the end of the batch until a 250ms timer goes off and it gives up. Might be fixable with some better handling of the end condition but it's messy as an empty result is valid for this search and you don't want a match missing at the end of the batch.
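
To make the "lack of progress" concrete, here is a minimal Python sketch of a mark-all loop that always moves forward, even after an empty result. It is not SciTE's code: it ignores batching and the timer, and the name `bookmark_all` and the sample text are assumptions for illustration.

```python
import bisect
import re

def bookmark_all(pattern, text):
    """Return the sorted 0-based numbers of all lines where a match starts."""
    # Offsets at which each line begins, used to map a position to a line.
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    marked = set()
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        marked.add(bisect.bisect_right(line_starts, m.start()) - 1)
        if m.end() > m.start():
            pos = m.end()        # non-empty match: continue after it
        else:
            pos = m.end() + 1    # empty match: step past it to avoid retrying forever
    return sorted(marked)

# Hypothetical sample text: lines "one", "", "two", "", "", "end".
TEXT = "one\n\ntwo\n\n\nend"
empty_lines = bookmark_all(re.compile(r"^$", re.MULTILINE), TEXT)
all_lines = bookmark_all(re.compile(r"^.*$", re.MULTILINE), TEXT)
print(empty_lines, all_lines)
```

The empty match is still recorded before the position is advanced, so a match at the very end of a range is not dropped.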

Neil

Yury Dubinsky

Mar 12, 2024, 9:52:08 PM
to scite-interest
Hi, Neil,
Thanks for the explanation!  I am trying to identify high-level tasks to improve the current behaviour.
>>>about handling empty matches
Once again: current SciTE doesn't find empty matches using ^ , $ and ^$ . I ran some additional tests on many editors and utilities and compared them to SciTE before and after 4.05. All of the editors, plus old SciTE (< 4.05), could correctly find ^ , $ and ^$ . (This includes Geany, which also uses Scintilla's regular expression engine.)
(Searching for ^$ is important for identifying empty lines. Searching for ^ and $ allows us to quickly prefix and suffix all lines in the text. We could also do this using backslash expressions, but regular expressions are simpler.)
Here we have another confusing issue: replacing empty matches using ^ , $ and ^$ works fine! I tested this in 5.4.1. Thus, we can replace ^ , $ and ^$ , but we cannot find them. The old SciTE behaviour seems more consistent with this. I would consider restoring the old behaviour, but you mentioned that "earlier versions of SciTE tried something like this and it caused worse problems." Do you remember what these problems were?

>>>a flag for not matching the empty string at start
Do you suggest adding similar flags to the engine (RESearch.cxx) and adding flags to Scite to control the behaviour?

>>> problem with "Bookmark All"
I'll try to suggest a solution for better handling. 

Thanks
Yury

Neil Hodgson

Mar 13, 2024, 6:16:57 PM
to scite-interest
Yury:

> The old Scite's behaviour seems more consistent with this. I would consider restoring the old behaviour, but you mentioned  that "earlier versions of SciTE tried something like this and it caused worse problems."  Could you remember these problems?

Most likely this change that fixed infinite loops or hangs.
Hangs cause data loss so are more important than incorrect searching.

> >>>a flag for not matching the empty string at start
> Do you suggest adding similar flags to the engine (RESearch.cxx) and adding flags to Scite to control the behaviour?

SciTE can be changed to retry after an empty match that is the same as the current selection. Implementing part of this in Scintilla may help other applications.
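
The retry rule can be sketched as follows. This is a Python illustration using the standard `re` module, not the code that later went into SciTE. The name `find_next` and the sample text are assumptions.

```python
import re

def find_next(pattern, text, sel_start, sel_end):
    """Search from the end of the selection, retrying once if the result
    is an empty match identical to the current (empty) selection."""
    m = pattern.search(text, sel_end)
    same_empty = (
        m is not None
        and sel_start == sel_end
        and m.start() == m.end() == sel_end
    )
    if same_empty:
        if sel_end >= len(text):
            return None          # nothing left to search
        m = pattern.search(text, sel_end + 1)
    return (m.start(), m.end()) if m else None

# Hypothetical sample text: lines "one", "", "two", "", "", "end".
TEXT = "one\n\ntwo\n\n\nend"
empty_line = re.compile(r"^$", re.MULTILINE)

# Repeated "Find Next" starting from a caret at position 0.
hits = []
sel = (0, 0)
while True:
    sel = find_next(empty_line, TEXT, *sel)
    if sel is None:
        break
    hits.append(sel)
print(hits)
```

A non-empty match at the caret is still accepted as is, because the retry only happens when the result is empty and equal to the selection.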

Neil

Neil Hodgson

Sep 20, 2024, 5:52:12 AM
to scite-i...@googlegroups.com
The behaviour of regular expressions that match empty ranges is now better.

Changes will be in the next release and can be tried with
https://www.scintilla.org/wscite.zip Windows executable (64-bit)

Available from the repository with the relevant changes being
https://sourceforge.net/p/scintilla/scite/ci/2b1b1f5b0fa89d7fe6e03a45e285093960be0f25/
https://sourceforge.net/p/scintilla/scite/ci/5fbf0b20d3f3887cd065c44e494b48d1a5d75295/

Neil

Yury Dubinsky

Sep 22, 2024, 10:17:39 PM
to scite-i...@googlegroups.com
Hi, Neil,
Your solution is perfect! Thanks! This "empty match" behavior is better than in many other editors.
(Unfortunately I couldn't help as I've been very busy since March. I always planned to start but always got distracted.)

Thanks, 
Yury


Yury Dubinsky

Dec 2, 2024, 9:18:55 PM
to scite-interest
Hi, Neil,
I would like to ask you about your plan for implementing a part of the logic in Scintilla. Are you still considering it? What part of the logic can be moved in your opinion?
 
Thanks,
  Yury

Neil Hodgson

Dec 4, 2024, 10:22:49 PM
to scite-i...@googlegroups.com
Yury Dubinsky:

> I would like to ask you about your plan for implementing a part of the logic in Scintilla. Are you still considering it? What part of the logic can be moved in your opinion?

A `don't match empty at start` flag could be implemented in Scintilla. However, the application still needs to know when to apply this so I don't think that it simplifies the application much. It's also not a significant performance win unless it could be embedded deeper in the regex engine.
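
A rough Python sketch of that division of labour, under stated assumptions: `search_from` stands in for a hypothetical engine-side search with a `not_empty_at_start` option (no such Scintilla API is implied), and `app_find_next` is the application code that still has to decide when to set it. The emulation simply retries one position later, which is weaker than a flag built into the regex engine.

```python
import re

def search_from(pattern, text, start, not_empty_at_start=False):
    """Hypothetical engine-side search: optionally reject an empty match
    located exactly at the start position."""
    m = pattern.search(text, start)
    if not_empty_at_start and m and m.start() == m.end() == start:
        if start >= len(text):
            return None
        m = pattern.search(text, start + 1)
    return (m.start(), m.end()) if m else None

def app_find_next(pattern, text, sel_start, sel_end):
    """Application side: only the caller knows the selection, so it picks the flag."""
    caret_only = sel_start == sel_end
    return search_from(pattern, text, sel_end, not_empty_at_start=caret_only)

# Hypothetical sample text: lines "one", "", "two", "", "", "end".
TEXT = "one\n\ntwo\n\n\nend"
empty_line = re.compile(r"^$", re.MULTILINE)

without_flag = search_from(empty_line, TEXT, 4)        # stays on the empty line at 4
with_flag = app_find_next(empty_line, TEXT, 4, 4)      # moves to the empty line at 9
after_selection = app_find_next(empty_line, TEXT, 3, 4)  # real selection: flag not set
print(without_flag, with_flag, after_selection)
```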

Neil