Great write-up, Carlo — and nice minimal repro.
Short answer: yes, I’d classify this as a regex parser bug (or at least a deviation from common practice) in TSE’s regex engine: a bare quantifier (?) is being accepted and compiled as if it quantifies the empty (epsilon) atom, which then matches the empty string everywhere. Most engines reject this as a syntax error (“nothing to repeat”).
In most mainstream engines (PCRE, Perl, .NET, Java, Python):
- ?, *, and + are quantifiers and must follow a valid atom (character, group, class, etc.).
- A pattern that starts with a quantifier (e.g., ? or *) is invalid and should fail to compile, typically with an error along the lines of “nothing to repeat”.
POSIX ERE is the odd one out: it leaves a leading quantifier undefined, so implementations differ there.
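To make the comparison concrete, here is a small check against Python's re module, one of the engines listed above. It is only a sketch of how a mainstream engine reacts; the helper name compile_error is mine, and the exact error wording differs between engines.

```python
import re


def compile_error(pattern):
    """Return the error message if `pattern` fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Dangling quantifiers, a dangling escape, and an escaped literal '?'.
for pat in ["?", "*", "+", "^?", "\\", r"\?"]:
    print(repr(pat), "->", compile_error(pat))
```

In Python, the first five patterns are rejected at compile time and only the escaped \? compiles.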
What you’re seeing implies TSE’s parser does something like:
- If there’s no preceding atom, treat the quantifier as applying to epsilon (the empty match).
- epsilon? is still epsilon, so it “finds” an empty string at the current position.
That explains your observations:
- ? → empty match
- ^? → ^ matches start-of-line, then ? quantifies epsilon → empty match at the beginning of the line
- {}? → if {} is parsed as an empty atom (also wrong), ? quantifies it → empty match
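The "quantified epsilon" reading can be imitated in Python by writing the empty atom explicitly. This is only a model of what TSE appears to do, not a look at its actual parser; match_spans is a helper name I made up for the illustration.

```python
import re


def match_spans(pattern, text):
    """Return the (start, end) span of every match of `pattern` in `text`."""
    return [m.span() for m in re.finditer(pattern, text)]


text = "a?b"
print(match_spans("", text))     # the empty pattern, i.e. epsilon
print(match_spans("()?", text))  # an explicitly empty group, made optional
print(match_spans(r"\?", text))  # an escaped, literal question mark
```

The first two patterns produce only zero-length matches, starting at position 0, while the escaped form matches just the one literal ? in the text.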
By contrast, you noted that * and \ “correctly find nothing and return an error.” That’s actually the expected behavior for * (dangling quantifier) and \ (dangling escape).
Like you said: usually harmless, but it can cause surprising behavior, particularly:
- Infinite/zero-progress find loops if a script repeatedly calls Find without advancing past zero-length matches.
- Unexpected “Found” dialogs on an empty selection.
For a literal question mark, use an escaped pattern:
    Find('\?', 'x')   // 'x' = regular-expression search (adjust to your TSE flags)
This should match only a literal ?.
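Pre-validate patterns in macros before calling Find: reject patterns that begin with a dangling quantifier or contain an empty {} (as far as I know, braces in TSE regexes delimit a group rather than a repeat count, so {} is an empty group).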
Minimal guard (SAL sketch, untested):
    // Returns TRUE if pat looks safe to pass to Find(), FALSE otherwise.
    integer proc IsSafeRegex(string pat)
        string first[1] = SubStr(pat, 1, 1)

        // empty pattern or leading quantifier
        if pat == '' or first == '?' or first == '*' or first == '+'
            return (FALSE)
        endif
        // crude check for an empty '{}', refine as needed
        if Pos('{}', pat) > 0
            return (FALSE)
        endif
        return (TRUE)
    end
When looping over matches, explicitly skip zero-length matches:
    if Find(pat, 'x')
        if Length(GetFoundText()) == 0
            // advance the cursor by one to avoid re-finding the same empty match
            Right()
        endif
        // ... process hit ...
    endif
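For anyone who wants to experiment with the advance-on-empty-match idea outside the editor, here is the same loop sketched in Python. The function name find_all and the cursor handling are my own choices for the illustration; TSE's Find obviously works on the buffer and cursor instead of a string and an index.

```python
import re


def find_all(pattern, text):
    """Collect (start, end) spans by searching repeatedly from a cursor."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() == m.start():
            pos = m.end() + 1  # zero-length match: step forward by one
        else:
            pos = m.end()      # normal match: continue right after it
    return spans


print(find_all("a*", "baa"))
print(find_all("", "ab"))
```

Without the extra step on zero-length matches, the cursor would never move and the loop would report the same empty match forever.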
During regex compilation, if the parser encounters a quantifier with no preceding atom, it should raise a compile error (consistent with other engines). That would make ?, *, +, {}?, and ^? invalid unless there’s a proper atom before them. If there’s an intentional extension to treat them as epsilon-quantifiers, it should be documented, and the finder should ensure progress is made on zero-length matches to avoid loops.
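A rough picture of the "no preceding atom" rule, written in Python so it can be run as-is. This is a toy scanner under my own assumptions, not TSE's grammar: it only knows ?, * and + as quantifiers, treats ^, | and { as non-atoms, counts an empty {} as no atom, and skips character classes naively. The names check_pattern and is_valid are hypothetical.

```python
QUANTIFIERS = "?*+"
NON_ATOMS = "^|{"  # assumed non-atoms: anchor, alternation, group opener


def check_pattern(pattern):
    """Raise ValueError if a quantifier in `pattern` has nothing to repeat."""
    have_atom = False  # is there something a quantifier could apply to?
    last_open = -1     # index of the most recent unescaped '{'
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 == len(pattern):
                raise ValueError(f"dangling escape at position {i}")
            have_atom = True  # an escaped character is an atom
            i += 2
            continue
        if ch == "[":
            # Simplification: skip to the next ']' and treat the class as an atom.
            end = pattern.find("]", i + 1)
            have_atom = True
            i = len(pattern) if end == -1 else end + 1
            continue
        if ch in QUANTIFIERS:
            if not have_atom:
                raise ValueError(f"nothing to repeat at position {i}")
        elif ch in NON_ATOMS:
            have_atom = False
            if ch == "{":
                last_open = i
        elif ch == "}":
            have_atom = last_open != i - 1  # an empty group is not an atom
        else:
            have_atom = True  # ordinary character
        i += 1


def is_valid(pattern):
    """Convenience wrapper: True if check_pattern accepts `pattern`."""
    try:
        check_pattern(pattern)
    except ValueError:
        return False
    return True


for pat in ["?", "^?", "{}?", "*", "\\", r"\?", "{a}?", "[^?]+"]:
    print(repr(pat), "->", "ok" if is_valid(pat) else "rejected")
```

The same check could double as a stricter pre-validation guard in a macro, since it also catches a quantifier that follows ^ or an empty group rather than only one at the very start of the pattern.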
If you want to keep this in your “quirks” doc, here’s a concise test macro that shows what TSE returns for a few edge patterns:
    // Defined before Main() so it is known at the point of the calls.
    proc Test(string pat)
        // 'g' = search from the start of the file, 'x' = regular expression
        if Find(pat, 'gx')
            Warn('Pattern "', pat, '" FOUND as "', GetFoundText(), '" (', DosIOResult(), ')')
        else
            Warn('Pattern "', pat, '" NOT found (', DosIOResult(), ')')
        endif
    end

    proc Main()
        integer oldMsg = Set(MsgLevel, _ALL_MESSAGES_)

        Test('?')
        Test('^?')
        Test('{}?')
        Test('\?')   // literal question mark (should match only '?')
        Test('*')    // should error
        Test('\')    // should error
        Set(MsgLevel, oldMsg)
        PurgeMacro(CurrMacroFilename())
    end
This will give you a reproducible log of TSE’s current behavior.
Verdict: your instinct is right. Formally this is a bug (accepting a dangling ? and producing an empty match). Since it’s benign for your workflows, documenting the quirk plus adding a guard/advance strategy in macros is perfectly reasonable until the engine is changed.
with friendly greetings
Knud van Eeden
--
---
You received this message because you are subscribed to the Google Groups "SemWare TSE Pro text editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semware+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/semware/006601dc1e62%24cf6b8270%246e428750%24%40ecarlo.nl.