Unicode regex matches 's' and 'k'

Sven Berg Ryen

Apr 30, 2021, 8:16:20 AM

to BBEdit Talk

Hi!

I was going to run a regular expression on a large document.

What I wanted to extract was lines matching [\x{007f}-\x{ffff}], that is, lines containing characters beyond the printable ASCII range (loosely called high or extended ASCII).

When I search for that pattern in the document, however, it also, oddly, matches the characters "s" and "k", which according to the Character inspector have the Unicode code points U+0073 and U+006B, respectively.

Am I doing something wrong here? It seems to me this could be a bug.

I'm at BBEdit 13.5.6.

Best regards,

Sven

Patrick Woolsey

Apr 30, 2021, 9:57:01 AM

to bbe...@googlegroups.com

Good morning and as a gentle reminder :-) per the footer:

This is the BBEdit Talk public discussion group. If you have a feature request
or need technical support, please email "sup...@barebones.com" rather than
posting here.

Thanks & regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <https://www.barebones.com/>

jj

Apr 30, 2021, 4:18:17 PM

to BBEdit Talk

Hi Sven,

Is it possible that you did a case insensitive search (the "Case sensitive" check box was unchecked in the Find window)?

In that case it is not a bug but simply Unicode case conversion: with case-insensitive matching, your range also matches the lowercase ASCII counterparts of these two Unicode characters, both of which lie inside \x{007f}-\x{ffff}:

Unicode Character “K” (U+212A) is based on k

https://www.compart.com/en/unicode/U+212a

Unicode Character “ſ” (U+017F) is based on s

https://www.compart.com/en/unicode/U+017f

If you try your regex with "Case sensitive" checked, 's' and 'k' are not found because no case conversion is applied.

Case conversion + Unicode + Locales can be tricky.
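
To see where the two characters come from, here is a small sketch in Python. It is only an illustration: Python's `re` module is a different engine from the one in BBEdit, the helper name `ascii_fold_sources` is made up for this example, and the range is written in Python's escape syntax. It scans the range from the question for characters whose case fold is a single ASCII character, then runs the pattern case-sensitively and case-insensitively.

```python
import re

# The range from the original question, in Python's escape syntax.
PATTERN = r"[\x7f-\uffff]"


def ascii_fold_sources(lo=0x7F, hi=0xFFFF):
    """Map each character in [lo, hi] whose case fold is one ASCII character to that fold.

    Hypothetical helper for illustration; not part of BBEdit or of any library.
    """
    found = {}
    for code in range(lo, hi + 1):
        ch = chr(code)
        folded = ch.casefold()
        # Keep only characters that fold to a *different*, single ASCII character.
        if len(folded) == 1 and ord(folded) < 0x80 and folded != ch:
            found[ch] = folded
    return found


for ch, folded in ascii_fold_sources().items():
    print(f"U+{ord(ch):04X} folds to {folded!r}")

# Case-sensitive: plain ASCII letters are outside the range, so nothing is found.
print(re.findall(PATTERN, "sky"))  # → []

# Case-insensitive: the result depends on the engine's case-folding rules,
# so no particular output is claimed here.
print(re.findall(PATTERN, "sky", re.IGNORECASE))
```

The scan should report exactly U+017F (folds to 's') and U+212A (folds to 'k'), the two characters linked above. Other engines may apply slightly different folding rules, so the case-insensitive result is worth checking in the tool you actually use.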

Regards,

Jean Jourdain
