Punctuation mark is not included in correct capture group


Otto Munters

Apr 8, 2025, 5:09:20 AM
to BBEdit Talk
Error in a regex for BBEdit:
What is wrong with this regex?
(?=^.{42,}$)(.{,23}\b[,.:;"!?]?\b)(.*)

Problem: the comma ends up in capture group 2; it should stay with capture group 1.

Example, whole sentence to be split in two parts:
They get rid of things, very simple clothing.

regex with error returns:
1st line: They get rid of things
2nd line: , very simple clothing.

should be:
1st line: They get rid of things,
2nd line: very simple clothing.

punctuation mark is not included in correct capture group

I tried different patterns, like:
(?=^.{42,}$)(.{0,23}\b[\w,.:;"!?]*\b)(.*)     also not working right

(?=^.{42,}$)(.{1,23}\b(?:[,.:;"!?]?)\b)(.*)   also not working right

(?=^.{42,}$)(?<Group1>.{1,23}\b[,.:;"!?]?)\b(?<Group2>.*)   also not working right


Thanks a lot for your help!
Otto

Neil Faiman

Apr 8, 2025, 9:30:11 AM
to BBEdit Talk Mailing List
Your first capture group matches
  • a string of no more than 23 characters
  • that is followed by a word break
  • optionally followed by a punctuation mark
  • that is followed by a word break

You expect this to match
  • “They get rid of things” (22 characters)
  • Which is followed by a word break (between the letter “s” and the non-letter “,”)
  • And the following punctuation mark (comma)
  • Which is followed by a word break. And there is the error: neither the comma nor the following space is a word character, so the comma is not followed by a word break.

Instead, it matches
  • “They get rid of things” (22 characters)
  • Which is followed by a word break (between the letter “s” and the non-letter “,”)
  • Not the comma, which is OK, because the punctuation mark is optional
  • Followed by a word break — still the same word break between “s” and “,”.

So the first capture group matches up to, but not including, the comma.
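The behavior described above can be reproduced outside BBEdit with a small Python sketch. This assumes that Python's `re` module handles `\b` and `{,23}` the same way BBEdit's grep does for this input, which is consistent with the result reported in the original post but is not verified against BBEdit itself. The variable names are only for illustration.

```python
import re

sample = "They get rid of things, very simple clothing."

# Offsets where \b matches in the first 24 characters of the sample.
# 22 (between "s" and ",") is a word break; 23 (between "," and " ") is not.
boundaries = [m.start() for m in re.finditer(r"\b", sample[:24])]
print(boundaries)

# The pattern from the original post, unchanged.
original = re.compile(r'(?=^.{42,}$)(.{,23}\b[,.:;"!?]?\b)(.*)')
m = original.match(sample)

print(repr(m.group(1)))  # 'They get rid of things'
print(repr(m.group(2)))  # ', very simple clothing.'
```

The optional punctuation class first tries to take the comma, the second `\b` then fails at offset 23, and the engine backtracks to the empty match, leaving the comma for group 2.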

Regards,
Neil Faiman

Otto Munters

Apr 8, 2025, 11:53:14 AM
to BBEdit Talk
I removed the first word break (\b); the grep pattern is now: (?=^.{42,}$)(.{0,23}[,.:;"!?]*\b)(.*)
Still the same error. I tried many variations, and it still doesn't work. Pff.


flet...@cumuli.com

Apr 8, 2025, 12:53:43 PM
to bbe...@googlegroups.com
This pattern works on the sample text. It replaces the final \b with an optional space.

(?=^.{42,}$)(.{,23}\b[,.:;"!?]? ?)(.*)\n
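As a rough check, here is the same pattern run through Python's `re` module. This is a sketch under two assumptions that are not in the post above: that Python's regex behavior matches BBEdit's for this input, and that the replacement would be something like `\1\n\2\n` (no replacement string was given, so that part is hypothetical).

```python
import re

# Pattern as posted; MULTILINE lets ^ and $ anchor at each line of a longer text.
pattern = re.compile(r'(?=^.{42,}$)(.{,23}\b[,.:;"!?]? ?)(.*)\n', re.MULTILINE)

text = "They get rid of things, very simple clothing.\n"

# Hypothetical replacement: group 1, a line break, group 2, a line break.
result = pattern.sub(r"\1\n\2\n", text)
print(repr(result))
```

With this pattern the space after the comma is captured in group 1, so the first output line ends with a trailing space.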


[fletcher]


-- 
This is the BBEdit Talk public discussion group. If you have a feature request or believe that the application isn't working correctly, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Mastodon: <https://mastodon.social/@bbedit>
--- 
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/bbedit/7b137e9d-1a03-4978-aa60-33558e6195aen%40googlegroups.com.

GP

Apr 8, 2025, 2:17:47 PM
to BBEdit Talk
If you're trying to find long lines of subtitle text and then find a grammatically appropriate place to break each line within the first 23 characters (or 24 if one of the punctuation characters you've specified follows), then this should do the job:

(?=^.{42,}$)(.{,23}\b[,.:;"!?]?) \b(.*)

I'm not an expert in written English grammar, but I'm pretty sure that within a single line of text, whole words are always separated by a space character. The first word of a pair may be followed by a punctuation character, but if so, the punctuation character has to be followed by a space character. This is part of why your regular expression patterns weren't working for you.

The space between the two words where you're splitting the line doesn't need to be captured when reformatting one line into two. It isn't proper formatting to carry that space character over to the second part of the line. Keeping it with the first part isn't necessary either, and if kept it could complicate later text manipulations.

The one thing the grep pattern doesn't handle is the double space after punctuation marks, carried over from the typewriter days. If the text you're dealing with has that, change " \b" to " {1,2}\b" (without the " characters).
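For anyone who wants to try both variants outside BBEdit, here is a minimal Python sketch. It assumes Python's `re` module behaves like BBEdit's grep for these patterns, and it assumes a replacement of `\1\n\2`, which is my guess at the intended replacement rather than something stated above. The names `SPLIT`, `SPLIT_DOUBLE`, and `split_long_lines` are made up for the example.

```python
import re

# Single-space variant: the space at the break point is matched but not captured.
SPLIT = re.compile(r'(?=^.{42,}$)(.{,23}\b[,.:;"!?]?) \b(.*)', re.MULTILINE)

# Variant that also accepts two spaces after the punctuation mark.
SPLIT_DOUBLE = re.compile(
    r'(?=^.{42,}$)(.{,23}\b[,.:;"!?]?) {1,2}\b(.*)', re.MULTILINE
)

def split_long_lines(text, pattern=SPLIT):
    """Break each line of 42+ characters into two lines at the matched space."""
    return pattern.sub(r"\1\n\2", text)

print(split_long_lines("They get rid of things, very simple clothing."))
print(split_long_lines("They get rid of things,  very simple clothing.", SPLIT_DOUBLE))
print(split_long_lines("Very simple clothing."))  # shorter than 42, left alone
```

Because the space sits between the two capture groups, neither output line carries it, and lines shorter than 42 characters never match the lookahead.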

Otto Munters

Apr 8, 2025, 3:21:55 PM
to BBEdit Talk
Thank you all. Got it working!
Otto
