Preventing a newline from being recognized by [\s]+ *without* using [[:blank:]]


Jim Witte

Sep 24, 2023, 2:02:20 PM
to BBEdit Talk
I'm trying to create a pattern that will find two Chinese characters separated by one or more spaces and convert the spaces to a single ideographic space (\x{3000}), using the following pattern:

Find: ([\x{2f00}-\x{ffff}]){1}[\s^$]+([\x{2f00}-\x{ffff}])
Replace: \1\x{3000}\2

But this also recognizes newlines as spaces.  I figured out how to do it using [[:blank:]] with

Find: ([\x{2f00}-\x{ffff}]){1}[[:blank:]]+([\x{2f00}-\x{ffff}])

But is there another way?  Something like [\s^$] ?  "[\s^\n]" doesn't work.

Jim

Patrick Woolsey

Sep 24, 2023, 2:56:21 PM
to bbe...@googlegroups.com
Since per the discussion of character classes in Chapter 8, the special class \s intrinsically includes linefeeds:

====

* Other Special Character Classes *

BBEdit uses several other sequences for matching different types or categories of characters.

Special Character Matches

\s any whitespace character (space, tab, carriage return, line feed, form feed)

====

I suggest you instead define a character class which contains only the whitespace characters that you explicitly wish to match, e.g. [\t ], since you needn't worry about carriage returns and I expect you aren't likely to encounter form feeds. :-)
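
For anyone who wants to try the difference outside BBEdit, here is a minimal sketch in Python. It assumes Python's `re` module treats `\s` and bracketed classes the same way BBEdit's grep does for these cases, and it writes the range with `\uXXXX` escapes because `re` does not accept `\x{XXXX}`. The names `CJK`, `with_s`, `with_class` and `collapse` are made up for illustration.

```python
import re

# CJK range from the original post, using Python's \uXXXX escapes
# in place of BBEdit's \x{XXXX}.
CJK = r"[\u2f00-\uffff]"

# \s includes line feeds, so this pattern also joins across a line break.
with_s = re.compile(rf"({CJK})\s+({CJK})")

# An explicit class of space and tab never crosses a line break.
with_class = re.compile(rf"({CJK})[ \t]+({CJK})")


def collapse(pattern, text):
    """Replace the matched gap with one ideographic space (U+3000)."""
    return pattern.sub("\\1\u3000\\2", text)


# Inside [...], ^ (when not first) and $ are literal characters,
# which is why [\s^$] still matches a newline.
caret_dollar_are_literal = re.fullmatch(r"[\s^$]+", "^$\n") is not None

print(repr(collapse(with_s, "中\n文")))       # line break is swallowed
print(repr(collapse(with_class, "中\n文")))   # line break is kept
print(repr(collapse(with_class, "中 \t 文")))  # spaces and tab collapse
```

The explicit class is the same idea as [[:blank:]], just spelled out, so it is easy to add or drop individual characters.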

Regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <https://www.barebones.com/>

jj

Sep 25, 2023, 3:19:16 AM
to BBEdit Talk
Hi Jim,

As per the PCRE2 documentation, you could use \h instead of \s:


CHARACTER TYPES

. any character except newline; in dotall mode, any character whatsoever
\C one code unit, even in UTF mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal white space character
\H a character that is not a horizontal white space character
\N a character that is not a newline
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\R a newline sequence
\s a white space character
\S a character that is not a white space character
\v a vertical white space character
\V a character that is not a vertical white space character
\w a "word" character
\W a "non-word" character
\X a Unicode extended grapheme cluster
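
In BBEdit, \h can simply be dropped into the Find pattern in place of \s. Python's `re` module has no `\h`, so the sketch below spells out the code points that the PCRE2 documentation lists as horizontal white space. It is only an approximation for experimenting outside BBEdit; `CJK`, `H_SPACE` and `to_ideographic_space` are hypothetical names.

```python
import re

# CJK range from the original post, with \uXXXX instead of \x{XXXX}.
CJK = r"[\u2f00-\uffff]"

# Code points the PCRE2 documentation lists as horizontal white space (\h):
# tab, space, no-break space, Ogham space mark, Mongolian vowel separator,
# U+2000 through U+200A, narrow no-break space, medium mathematical space,
# and the ideographic space itself.
H_SPACE = r"[\t \u00a0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]"

pattern = re.compile(rf"({CJK}){H_SPACE}+({CJK})")


def to_ideographic_space(text):
    """Collapse horizontal white space between two CJK characters."""
    return pattern.sub("\\1\u3000\\2", text)


print(repr(to_ideographic_space("中\u00a0 \t文")))  # mixed spaces collapse
print(repr(to_ideographic_space("中\n文")))         # line break is untouched
```

Unlike a plain space-and-tab class, this also catches no-break spaces and the other Unicode space characters, which may or may not be what you want.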


HTH

Jean Jourdain
