On Fri, 15 Dec 2023 at 04:24, 'Ze'ev Atlas' via PCRE2 discussion list
<pcre...@googlegroups.com> wrote:
>
> Hi Felipe
> In that case, could you please look at the original code in PCRE1 and see whether it is feasible to isolate the quotemeta function itself? Perhaps it does not rely on too much C++ mumbo jumbo and we could convert it into C.
> I may or may not have access to the PCRE1 code, but it may be interesting and cost-effective to do it.
> Are there other "missing" functions?
The logic behind perl's quotemeta and QRegularExpression::escape (and
PCRE1's QuoteMeta, I assume) is extremely straightforward: iterate over
all code points, and if they're outside of the [a-zA-Z0-9_] set,
escape them with a backslash.
I implemented it from scratch in 2012 [1]; it's 20 lines that have
never changed since.
Complications:
1) NUL must be escaped with "\\0" (backslash + 0) and not "\\\0"
(backslash + NUL), because pcre_compile uses a NUL-terminated string.
2) Handling of Unicode, specifically sequences of multiple code units
(in UTF-8/16). Only the first code unit needs a backslash in front of
it; the others need to be copied as-is. How to determine whether a code
unit is the first one is left to you, as there are no proper facilities
in C for this. (Of course it's trivial to do, I'm just ranting :-)).
Also, you get to pick what happens in case of illegal sequences. In Qt
I do garbage in/garbage out.
3) Memory management in C. In C++ we use `string` (or equivalent),
store the quoted string into it, return it, and call it a day. In C
you need the usual dance -- have the caller pass you an output pointer +
size, be sure not to overflow it, return how much you would actually
need to write... the usual drill. Luckily, there's an upper bound the
user can always rely upon (and you can document it): at most, you'll
write twice the size of the input string.
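
To make the three points above concrete, here is a minimal sketch in C
for the UTF-8 case. It is not the Qt or PCRE1 code: the function name
`quotemeta_utf8`, its signature and the snprintf-style return value are
all assumptions made for illustration. Invalid UTF-8 is handled garbage
in/garbage out, and the output is NUL-terminated, so a buffer of
2 * len + 1 bytes is always enough.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE or PCRE2.
 * Quotes `len` bytes of UTF-8 from `in` into `out` (capacity `cap` bytes).
 * Returns the number of bytes the full result needs, excluding the
 * terminating NUL. If the return value is >= cap, the output was truncated
 * and must not be used as a pattern. Passing out = NULL, cap = 0 only
 * computes the required size. */
static size_t quotemeta_utf8(const char *in, size_t len, char *out, size_t cap)
{
    size_t need = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];

        /* Code units in [a-zA-Z0-9_] are copied without a backslash. */
        int is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                      (c >= '0' && c <= '9') || c == '_';

        /* 10xxxxxx is a UTF-8 continuation byte: never the first code unit
         * of a sequence, so it is copied as-is (point 2). */
        int is_continuation = (c & 0xC0) == 0x80;

        if (!is_word && !is_continuation) {
            if (cap > 0 && need < cap - 1)
                out[need] = '\\';
            need++;
        }

        /* NUL becomes backslash + '0', never backslash + NUL (point 1). */
        if (c == 0)
            c = '0';

        if (cap > 0 && need < cap - 1)
            out[need] = (char)c;
        need++;
    }

    /* Always terminate when there is room for at least one byte (point 3). */
    if (cap > 0)
        out[need < cap ? need : cap - 1] = '\0';

    return need;
}
```

With this sketch, quoting the 3-byte input a.b into a large enough
buffer yields the 4 bytes a\.b, and calling it with a NULL output and
zero capacity just returns 4. A UTF-16 version would follow the same
shape, treating a low surrogate the way this one treats a continuation
byte.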
[1]
https://codereview.qt-project.org/c/qt/qtbase/+/12319/21/src/corelib/tools/qregularexpression.cpp#1437
Hope this helps,
--
Giuseppe D'Angelo