Preventing a newline from being recognized by [\s]+ *without* using [[:blank:]]

Jim Witte

Sep 24, 2023, 2:02:20 PM

to BBEdit Talk

I'm trying to create a pattern that will find two Chinese characters separated by one or more spaces and convert the spaces to a single ideographic space (\x{3000}), using the following pattern:

Find: ([\x{2f00}-\x{ffff}]){1}[\s^$]+([\x{2f00}-\x{ffff}])

Replace: \1\x{3000}\2

But this also recognizes newlines as spaces. I figured out how to do it using [[:blank:]] with

Find: ([\x{2f00}-\x{ffff}]){1}[[:blank:]]+([\x{2f00}-\x{ffff}])

But is there another way? Something like [\s^$] ? "[\s^\n]" doesn't work.

Jim

Patrick Woolsey

Sep 24, 2023, 2:56:21 PM

to bbe...@googlegroups.com

Since per the discussion of character classes in Chapter 8, the special class \s intrinsically includes linefeeds:

====

* Other Special Character Classes *

BBEdit uses several other sequences for matching different types or categories of characters.

Special Character Matches

\s any whitespace character (space, tab, carriage return, line feed, form feed)

====

I suggest you instead define a character class which contains only the whitespace characters that you explicitly wish to match, e.g. [\t ], since you needn't worry about carriage returns and I expect you aren't likely to encounter form feeds. :-)
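
For anyone who wants to try the idea outside BBEdit, here is a minimal sketch in Python. It assumes Python's built-in re module, which is not the PCRE engine BBEdit uses, although these particular constructs should behave the same way. The names CJK, LOOSE, EXPLICIT, DOUBLE_NEG and collapse are made up for illustration. The [^\S\r\n] variant is not from this thread; it is a common double-negation idiom ("whitespace, but not CR or LF") included as one possible answer to the "is there another way?" question.

```python
import re

# Character range copied from the original post (U+2F00 through U+FFFF).
CJK = r"[\u2f00-\uffff]"

# \s+ also matches newlines, which is the behavior the original post wants to avoid.
LOOSE = re.compile(rf"({CJK})\s+({CJK})")

# Explicit class: only tab and space count as separators.
EXPLICIT = re.compile(rf"({CJK})[\t ]+({CJK})")

# Double negation: "not (non-whitespace, CR, or LF)", i.e. whitespace minus line breaks.
DOUBLE_NEG = re.compile(rf"({CJK})[^\S\r\n]+({CJK})")

# Group 1, one ideographic space (U+3000), group 2.
REPLACEMENT = "\\1\u3000\\2"


def collapse(pattern: re.Pattern, text: str) -> str:
    """Replace the separator run between two matched characters with U+3000."""
    return pattern.sub(REPLACEMENT, text)


spaced = "中  文"   # two ordinary spaces between the characters
broken = "中\n文"   # a line break between the characters

for name, pattern in [("LOOSE", LOOSE), ("EXPLICIT", EXPLICIT), ("DOUBLE_NEG", DOUBLE_NEG)]:
    print(name, repr(collapse(pattern, spaced)), repr(collapse(pattern, broken)))
```

Only LOOSE rewrites the string containing the line break; the other two leave it alone and still collapse the run of spaces.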

Regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <https://www.barebones.com/>

jj

Sep 25, 2023, 3:19:16 AM

to BBEdit Talk

Hi Jim,

As per the PCRE2 documentation, you could use \h instead of \s :

https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC4

CHARACTER TYPES

. any character except newline; in dotall mode, any character whatsoever

\C one code unit, even in UTF mode (best avoided)

\d a decimal digit

\D a character that is not a decimal digit

\h a horizontal white space character

\H a character that is not a horizontal white space character

\N a character that is not a newline

\p{xx} a character with the xx property

\P{xx} a character without the xx property

\R a newline sequence

\s a white space character

\S a character that is not a white space character

\v a vertical white space character

\V a character that is not a vertical white space character

\w a "word" character

\W a "non-word" character

\X a Unicode extended grapheme cluster
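
A rough way to see what \h covers, for readers testing outside BBEdit: Python's built-in re module does not recognize \h, so the sketch below spells out the horizontal white space code points listed in the PCRE2 pattern documentation as an explicit class. The names HORIZONTAL_WS, CJK, H_PATTERN and collapse_horizontal are hypothetical, and this only approximates what BBEdit's engine does with \h.

```python
import re

# Character range copied from the original post (U+2F00 through U+FFFF).
CJK = r"[\u2f00-\uffff]"

# Code points PCRE2 documents for \h: tab, space, no-break space, Ogham space mark,
# Mongolian vowel separator, U+2000..U+200A, narrow no-break space,
# medium mathematical space, ideographic space.
HORIZONTAL_WS = r"\t \u00a0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000"

# Stand-in for the PCRE pattern ([\x{2f00}-\x{ffff}])\h+([\x{2f00}-\x{ffff}]).
H_PATTERN = re.compile(rf"({CJK})[{HORIZONTAL_WS}]+({CJK})")


def collapse_horizontal(text: str) -> str:
    """Replace a run of horizontal white space between two matched characters with U+3000."""
    return H_PATTERN.sub("\\1\u3000\\2", text)


print(repr(collapse_horizontal("中\u00a0\u2003文")))  # no-break space + em space
print(repr(collapse_horizontal("中\n文")))            # line break is left untouched
```

Unlike [\t ], this class also catches no-break spaces and the various Unicode spaces, which may or may not be what you want in mixed CJK text.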

HTH

Jean Jourdain
