Regexp "bug"


Carlo Hogeveen

Sep 5, 2025, 8:44:32 AM
to Semware @ GoogleGroups

I have the feeling that I ran into this bug before, and that again I have no problem with it.
But formally it is a bug, so here goes.

Searching for regular expression "?" finds an empty string, instead of correctly not finding anything and returning an error.

For contrast, regular expressions "*" and "\" correctly find nothing and return an error.

Again, for me this bug does not cause a problem anywhere.
I will just document that "?", "^?" and "{}?" will find an empty string.

Carlo


proc Main()
  integer old_MsgLevel = Set(MsgLevel, _ALL_MESSAGES_)
  if Find('?', 'gx')
    Warn('Found "', GetFoundText(), '". (', DosIOResult(), ')')
  else
    Warn('Not found. (', DosIOResult(), ')')
  endif
  Set(MsgLevel, old_MsgLevel)
  PurgeMacro(CurrMacroFilename())
end Main




Knud van Eeden

Sep 5, 2025, 9:30:46 AM
to TSE Pro Support, Knud van Eeden


I checked this for Perl (famous for its regular expression handling and in general the norm in regular expression land)
and Python, using ChatGPT-generated .pl and .py programs: both reject the pattern "?" with an error when compiling the regular expression.
This was also confirmed in the analysis by the AI.

Perl:

c:\temp Fri 05-09-25 15:15:04>perl c:\temp\ddd.pl "?" c:\temp\ddd.txt
syntax error at c:\temp\ddd.pl line 20, near "qr/$pattern/) "
syntax error at c:\temp\ddd.pl line 23, near "}"
Execution of c:\temp\ddd.pl aborted due to compilation errors.

c:\temp Fri 05-09-25 15:15:27>perl c:\temp\ddd.pl "?" c:\temp\ddd.txt
Quantifier follows nothing in regex; marked by <-- HERE in m/? <-- HERE / at c:\temp\ddd.pl line 17, <$fh> line 1.

Python:

c:\temp Fri 05-09-25 15:17:34>python ddd.py "?" ddd.txt
Invalid regex '?': nothing to repeat at position 0

c:\temp Fri 05-09-25 15:21:02>
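
For reference, the Python result above can be reproduced with a few lines. This is only a minimal sketch, not the ChatGPT-generated ddd.py (which is not shown in this thread); try_compile is a hypothetical helper name, and it says nothing about how TSE itself behaves.

```python
import re


def try_compile(pattern: str) -> str:
    """Return 'ok' if Python's re module accepts the pattern, else its error text."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # str(exc) already includes the position of the problem.
        return f"Invalid regex {pattern!r}: {exc}"
    return "ok"


print(try_compile("?"))    # Invalid regex '?': nothing to repeat at position 0
print(try_compile(r"\?"))  # ok (an escaped, literal question mark)
```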


with friendly greetings
Knud van Eeden

FYI: AI: ChatGPT reply:

Great write-up, Carlo — and nice minimal repro.

Short answer: yes, I’d classify this as a regex parser bug (or at least a standards-deviation) in TSE’s regex engine: a bare quantifier (?) is being accepted and compiled as if it quantifies the empty (epsilon) atom, which then matches the empty string everywhere. Most engines reject this as a syntax error (“nothing to repeat”).

Why it’s a bug (by conventional regex rules)

In virtually all mainstream engines (PCRE, Perl, .NET, Java, Python; POSIX ERE leaves a leading quantifier undefined):

  • ?, *, and + are quantifiers and must follow a valid atom (character, group, class, etc.).

  • A pattern that starts with a quantifier (e.g., ?, *) is invalid and should fail to compile: “nothing to repeat”.

What you’re seeing implies TSE’s parser does something like:

  • If there’s no preceding atom, treat it as quantifying epsilon (the empty match).

  • epsilon? is still epsilon, so it “finds” an empty string at the current position.

That explains your observations:

  • ? → empty match

  • ^? → ^ matches start-of-line, then ? quantifies epsilon → empty match at BOL

  • {}? → if {} is parsed as an empty atom (also wrong), ? quantifies it → empty match

By contrast, you noted * and \ “correctly find nothing and return an error.” That’s actually the expected behavior for * (dangling quantifier) and \ (dangling escape).
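
To illustrate what "matches the empty string everywhere" means, here is a small Python sketch. It uses Python's re module, not TSE, and lets the empty pattern stand in for epsilon; empty_match_spans is a hypothetical helper name.

```python
import re


def empty_match_spans(text: str) -> list:
    """Return the (start, end) span of every match of the empty pattern."""
    # The empty pattern plays the role of epsilon: it succeeds at every
    # position, including the position just past the last character.
    return [m.span() for m in re.finditer("", text)]


print(empty_match_spans("abc"))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

An engine that silently turns a bare ? into "optional epsilon" would be expected to behave like this: every position is a hit, and every hit is zero characters long.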

Practical impact

Like you said: usually harmless, but it can cause surprising behavior, particularly:

  • Infinite/zero-progress find loops if a script repeatedly calls Find without advancing past zero-length matches.

  • Unexpected “Found” dialogs on an empty selection.

Workarounds you can adopt today

  1. For a literal question mark, use an escaped pattern:

    Find('\?', 'gx')   // 'x' = regular expression mode, as in Carlo's macro
    

    This should match only literal ?.

  2. Pre-validate patterns in macros before calling Find:

    • Reject patterns that begin with a dangling quantifier or contain {}
      with no counts.

    • Minimal guard (SAL sketch, untested):

    integer proc IsSafeRegex(string pat)
      if pat == ''
        return (FALSE)
      endif
      // reject a leading (dangling) quantifier
      if SubStr(pat, 1, 1) == '?' or SubStr(pat, 1, 1) == '*' or SubStr(pat, 1, 1) == '+'
        return (FALSE)
      endif
      // crude check for '{}', refine as needed
      if Pos('{}', pat) > 0
        return (FALSE)
      endif
      return (TRUE)
    end
    
  3. When looping over matches, explicitly skip zero-length matches:

    if Find(pat, 'gx')
      if Length(GetFoundText()) == 0
        // advance the cursor by 1 to avoid re-finding the same empty match
        Right()
      endif
      // ... process hit ...
    endif
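
The guard from item 2 and the zero-length skip from item 3 can be sketched together in Python. This only illustrates the idea with Python's re module; it is not TSE code, the names is_safe_pattern and find_spans are hypothetical, and the '{}' check simply mirrors the crude guard above.

```python
import re


def is_safe_pattern(pat: str) -> bool:
    """Crude pre-check: reject empty patterns, leading quantifiers and '{}'."""
    if not pat or pat[0] in "?*+":
        return False
    return "{}" not in pat


def find_spans(pat: str, text: str) -> list:
    """Collect (start, end) spans of all matches, always making progress."""
    if not is_safe_pattern(pat):
        raise ValueError(f"rejected pattern: {pat!r}")
    regex = re.compile(pat)
    spans = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # A zero-length match does not move the search position by itself,
        # so step one character forward to avoid finding it again.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return spans


print(find_spans("a*", "baab"))  # [(0, 0), (1, 3), (3, 3), (4, 4)]
```

Without the extra step after a zero-length match, the loop would keep returning the first empty match at position 0 forever, which is the zero-progress loop mentioned under "Practical impact".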
    

Recommendation (for engine maintainers)

During regex compilation, if the parser encounters a quantifier with no preceding atom, it should raise a compile error (consistent with other engines). That would make ?, *, +, {}?, ^? invalid unless there’s a proper atom before them. If there’s an intentional extension to treat them as epsilon-quantifiers, it should be documented, and the finder should ensure progress is made on zero-length matches to avoid loops.
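
As a rough illustration of that compile-time check, here is a toy Python scanner. It is a sketch under simplifying assumptions, not TSE's parser and not any real engine's: it only knows the quantifiers ?, * and +, the ^ anchor and backslash escapes, and it ignores groups and character classes entirely. The names check_quantifiers and describe are hypothetical.

```python
QUANTIFIERS = "?*+"


def check_quantifiers(pattern: str) -> None:
    """Raise ValueError if a quantifier has no preceding atom to repeat."""
    have_atom = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError(f"dangling escape at position {i}")
            have_atom = True  # an escaped character is an ordinary atom
            i += 2
            continue
        if ch in QUANTIFIERS:
            if not have_atom:
                raise ValueError(f"nothing to repeat at position {i}")
            have_atom = False  # simplification: no stacked quantifiers
        elif ch == "^":
            have_atom = False  # an anchor is not a repeatable atom
        else:
            have_atom = True
        i += 1


def describe(pattern: str) -> str:
    """Return 'ok' or the error message for a pattern."""
    try:
        check_quantifiers(pattern)
    except ValueError as exc:
        return str(exc)
    return "ok"


for pat in ["?", "^?", "a?", "\\?", "\\"]:
    print(repr(pat), "->", describe(pat))
```

Note that this toy also rejects stacked quantifiers such as a??, which some engines accept as a lazy quantifier; a real parser would need to handle that, and grouping, according to its own syntax.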

Tiny SAL harness to document the behavior

If you want to keep this in your “quirks” doc, here’s a concise test macro that shows what TSE returns for a few edge patterns:

proc Test(string pat)
  // 'x' selects regular expression mode, as in Carlo's macro
  if Find(pat, 'gx')
    Warn('Pattern "', pat, '" FOUND as "', GetFoundText(), '" (', DosIOResult(), ')')
  else
    Warn('Pattern "', pat, '" NOT found (', DosIOResult(), ')')
  endif
end

proc Main()
  integer oldMsg = Set(MsgLevel, _ALL_MESSAGES_)
  Test('?')
  Test('^?')
  Test('{}?')
  Test('\?')   // literal question mark (should match only '?')
  Test('*')    // should error
  Test('\')    // should error
  Set(MsgLevel, oldMsg)
  PurgeMacro(CurrMacroFilename())
end

This will give you a reproducible log of TSE’s current behavior.


Verdict: your instinct is right — formally a bug (accepting a dangling ? and producing an empty match). Since it’s benign for your workflows, documenting the quirk plus adding a guard/advance strategy in macros is perfectly reasonable until the engine is changed.

with friendly greetings

Knud van Eeden


--

---
You received this message because you are subscribed to the Google Groups "SemWare TSE Pro text editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semware+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/semware/006601dc1e62%24cf6b8270%246e428750%24%40ecarlo.nl.

Carlo Hogeveen

Sep 6, 2025, 1:58:40 PM
to sem...@googlegroups.com

Hi Knud,

You are doing it again.
An AI cannot react to my TSE bug report, because it is not trained on TSE regular expressions.
Apparently you cannot either.
So you go off-topic and ask your AI about Perl regular expressions, which work differently.

I asked my AI which programming editors it has the most knowledge about, and to list them top-down.
It listed in order: Visual Studio Code (VS Code), Sublime Text, Vim, Emacs (and Spacemacs), and Notepad++.
With honorable mentions for Atom, Webstorm, BBEdit and Nova & Espresso.
I hope this helps,

Carlo




S.E. Mitchell

Sep 6, 2025, 2:31:16 PM
to sem...@googlegroups.com
I have updated the code for all 3 cases listed, and now issue an
appropriate error message along with setting the error, so
DosIOResult() will return a non-zero value.
--

Carlo Hogeveen

Sep 6, 2025, 2:36:27 PM
to sem...@googlegroups.com

> I have updated the code for all 3 cases listed, and now issue an
> appropriate error message along with setting the error, so
> DosIOResult() will return a non-zero value.

Muchas gracias, señor Mitchell!

Carlo




Knud van Eeden

Sep 7, 2025, 9:13:59 AM
to TSE Pro Support, Knud van Eeden

Regular expressions describe the simplest class of languages in Noam Chomsky's four-level hierarchy. They are based on the work of Stephen Kleene, and were first implemented by Ken Thompson (co-creator of Unix) in the QED text editor.

See also:

Regular expressions are regular expressions: the engines follow about the same rules, with Perl usually taking the leading role and just adding some extra possibilities.

In particular the '?' operator (the 'zero or one' operator, a kind of 'if then else') should be handled the SAME in all computer languages and text editors.

So that angle of approach is completely fine: having a look at how other major computer languages handle the same question about '?'.

And yes, AI ChatGPT knows TSE SAL for sure.

TSE regular expressions are no exception: they follow these rules quite strictly too.

Anyhow Semware already put up a change in the code for '?', so the question can be considered answered now.

So yes it will be done again where applicable, resistance is futile.

