Unicode regex matches 's' and 'k'


Sven Berg Ryen

Apr 30, 2021, 8:16:20 AM
to BBEdit Talk
Hi!

I was going to run a regular expression on a large document.
What I wanted to extract was lines matching [\x{007f}-\x{ffff}], that is, characters above the 7-bit ASCII range (loosely called high or extended ASCII).

When I search for that pattern in the document, however, it also oddly matches the characters "s" and "k", which according to the Character inspector are Unicode code points U+0073 and U+006B respectively.

Am I doing something wrong here? It seems to me this could be a bug.

I'm at BBEdit 13.5.6.

Best regards,
Sven

Patrick Woolsey

Apr 30, 2021, 9:57:01 AM
to bbe...@googlegroups.com
Good morning and as a gentle reminder :-) per the footer:


This is the BBEdit Talk public discussion group. If you have a feature request
or need technical support, please email "sup...@barebones.com" rather than
posting here.


Thanks & regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <https://www.barebones.com/>

jj

Apr 30, 2021, 4:18:17 PM
to BBEdit Talk

Hi Sven,

Is it possible that you did a case-insensitive search (the "Case sensitive" check box was unchecked in the Find window)?
In that case it is not a bug but simply Unicode case conversion: your character class includes two characters that are case variants of the plain ASCII letters "k" and "s", so a case-insensitive search matches those letters too:

Unicode Character “K” (U+212A, KELVIN SIGN) lowercases to k

Unicode Character “ſ” (U+017F, LATIN SMALL LETTER LONG S) is a form of s (its uppercase is S)

If you try your regex with "Case sensitive" checked, 's' and 'k' are no longer found, because no case conversion takes place.
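
To see which characters in the range are responsible, here is a minimal sketch in Python (not BBEdit's own regex engine, so only an illustration of the Unicode case mapping). The function name ascii_case_partners is made up for this example; it scans U+007F to U+FFFF and lists every code point whose case-folded form is a single plain ASCII letter:

```python
import string


def ascii_case_partners(lo=0x007F, hi=0xFFFF):
    """Map each ASCII letter to the code points in [lo, hi] that case-fold to it.

    Hypothetical helper for illustration; it uses Python's str.casefold(),
    which follows Unicode full case folding.
    """
    partners = {}
    for cp in range(lo, hi + 1):
        folded = chr(cp).casefold()
        # Keep only folds that are exactly one plain ASCII letter.
        if len(folded) == 1 and folded in string.ascii_lowercase:
            partners.setdefault(folded, []).append(f"U+{cp:04X}")
    return partners


print(ascii_case_partners())
# → {'s': ['U+017F'], 'k': ['U+212A']}
```

Only those two code points turn up, which fits the fact that "s" and "k" are the only ASCII letters you saw matching.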

Case conversion + Unicode + Locales can be tricky.

Regards,

Jean Jourdain